Description
In Python 3.10+, an implementation of `zipimport.invalidate_caches()` was introduced.

An Apache Spark developer recently identified this implementation of `zipimport.invalidate_caches()` as the source of a performance regression in `importlib.invalidate_caches()`. They observed that importing only two zipped packages (py4j and pyspark) slows down `importlib.invalidate_caches()` by up to 3500%. See the new discussion thread on the original PR where `zipimport.invalidate_caches()` was introduced for more context.
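A minimal way to reproduce the shape of the regression locally (the archive and package names are illustrative, and absolute timings will vary): put a zip archive on `sys.path`, import from it so a `zipimporter` is registered, then time repeated calls to `importlib.invalidate_caches()`.

```python
import importlib
import sys
import timeit
import zipfile

# Build a small zip archive containing a package and put it on sys.path.
# "demo_pkg" is an illustrative name, not a real package.
archive = "demo_pkg.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "x = 1\n")
sys.path.insert(0, archive)

import demo_pkg  # registers a zipimporter in sys.path_importer_cache

# On 3.10+, every call below re-reads the zip directory eagerly, so the
# cost grows with the number and size of zip archives on sys.path.
print(timeit.timeit(importlib.invalidate_caches, number=1_000))
```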
The reason for this regression is a design flaw in the API. Currently, `zipimport.invalidate_caches()` repopulates the cache of zip files at the point of invalidation. This violates the semantics of cache invalidation, which should simply clear the cache; repopulation should occur on the next access of files.
There are three relevant events to consider:
1. The cache is accessed while valid.
2. `invalidate_caches()` is called.
3. The cache is accessed after being invalidated.
Events (1) and (2) should be fast, while event (3) can be slow since we're repopulating a cache. In the original PR, we made (1) and (3) fast but (2) slow. To fix this, we can do the following (sketched in code after the list):
- Add a boolean flag `cache_is_valid` that is set to false when `invalidate_caches()` is called.
- In `_get_files()`, if `cache_is_valid` is true, use the cache. If `cache_is_valid` is false, call `_read_directory()`.
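A minimal sketch of the proposed control flow, using a stand-alone class rather than CPython's actual `zipimporter` (the `_read_directory` helper below is a `zipfile`-based stand-in for zipimport's internal directory read, and `_cache_is_valid` mirrors the proposed flag):

```python
import zipfile

def _read_directory(archive):
    # Stand-in for zipimport's internal directory read: building the
    # table of contents is the expensive operation we want to defer.
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: info for info in zf.infolist()}

class LazyZipCache:
    """Illustrative only; mirrors the proposed flow, not CPython's code."""

    def __init__(self, archive):
        self.archive = archive
        self._files = _read_directory(archive)  # initial eager read
        self._cache_is_valid = True             # the proposed flag

    def invalidate_caches(self):
        # Event (2): just mark the cache stale -- O(1), no directory read.
        self._cache_is_valid = False

    def _get_files(self):
        # Event (1): valid cache -> fast path, return it directly.
        # Event (3): stale cache -> repopulate lazily, then revalidate.
        if not self._cache_is_valid:
            self._files = _read_directory(self.archive)
            self._cache_is_valid = True
        return self._files
```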
This approach preserves the behaviour introduced in Python 3.10+ and keeps the common path of reading from the cache performant, while shifting the cost of reading the directory out of cache invalidation.
We can go further: since zip archives rarely change in practice, we could add a new flag giving users the option of disabling implicit invalidation of zipimported libraries when `importlib.invalidate_caches()` is called.
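A hypothetical sketch of such an opt-out, building on the class above (the flag and function names are assumptions for illustration; no such API exists in zipimport today):

```python
# Hypothetical module-level switch; not an existing zipimport API.
_implicit_invalidation_enabled = True

def set_implicit_invalidation(enabled):
    """Hypothetical opt-out for users whose zip archives never change."""
    global _implicit_invalidation_enabled
    _implicit_invalidation_enabled = enabled

class OptOutZipCache(LazyZipCache):
    def invalidate_caches(self):
        if not _implicit_invalidation_enabled:
            return  # user opted out: keep the current cache as-is
        super().invalidate_caches()
```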