Prevent cache load race condition #20205
Conversation
Commit: …tently reproduced by adding `__import__("time").sleep(0.2)` to `State.load_tree`
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

Any suggestions for good benchmark targets? Simply …
```diff
  manager.done_sccs.add(prev_scc.id)
- process_fresh_modules(graph, sorted(prev_scc.mod_ids), manager)
+ if not process_fresh_modules(graph, sorted(prev_scc.mod_ids), manager):
+     process_stale_scc(graph, prev_scc, manager)
```
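To illustrate the fallback this diff introduces, here is a self-contained toy model; the names and stub bodies are illustrative only, and the real logic (with `manager` and richer types) lives in `mypy/build.py`:

```python
from dataclasses import dataclass

@dataclass
class State:
    cached: object | None  # None models a cache data file that vanished mid-build

    def load_tree(self) -> object | None:
        return self.cached

def process_fresh_modules(graph: dict, mod_ids: list) -> bool:
    """Load cached trees for an SCC; report whether every load succeeded."""
    return all(graph[m].load_tree() is not None for m in mod_ids)

def process_stale_scc(graph: dict, mod_ids: list) -> None:
    print("cache unusable - reprocessing SCC from source:", mod_ids)

graph = {"a": State("tree-a"), "b": State(None)}  # "b" lost its cache file
if not process_fresh_modules(graph, sorted(graph)):
    # Same fallback as the diff: type-check the whole SCC from source.
    process_stale_scc(graph, sorted(graph))
```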
It is a bad idea to do this while GC is (potentially) frozen. In general, I think we should not complicate the logic in `build.py` for rare/niche use cases; it is already complicated and tricky enough. I think we should simply add a try/except with an error message suggesting a different `--cache-dir` per invocation when running mypy in parallel.
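A minimal sketch of that suggestion; the wrapper name and message text are hypothetical, not mypy's actual API:

```python
# Hypothetical sketch of the suggested try/except - not mypy's actual API.
def load_cache_or_explain(load_tree, module_id: str):
    """Wrap a cache load and turn failures into an actionable error."""
    try:
        return load_tree(module_id)
    except Exception as exc:
        raise RuntimeError(
            f"Failed to load cache data for {module_id!r}; if you run several "
            "mypy processes concurrently, give each one its own --cache-dir"
        ) from exc
```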
> It is a bad idea to do this while GC is (potentially) frozen.
This is already an exceptional case: we can only reach this branch if multiple processes are running concurrently. Note that this also fixes an (unreported) crash that is almost guaranteed to happen if the existing `data is None` early-return branch is taken in `State.load_tree` (and I can see how that could reasonably happen even without concurrency - a network error on a mapped drive, for example, or just a disk fault).

If I understand correctly, the worst possible outcome here is OOM (which is a very good way of saying "please stop running multiple instances concurrently") or increased memory usage, so what do we risk by allowing this to happen?
I'm not super happy about a try/except because the exception may not actually be raised even when the data is stale - the data may be compatible by interface but not semantically - resulting in inconsistent state somewhere and unpredictable heisenbugs no one can reproduce.
Or let's add a checkbox to the issue template asking whether the OP is running some IDE integration without an explicit `--cache-dir` configuration...
ilevkivskyi left a comment
Just to be clear, this is not going to happen; as I said, build.py is already complex enough.
Fixes #14521. Fixes #19359. Fixes #18473.
We discussed a top-level try/except as a potential solution, but perhaps it's better to just not crash here?
When several concurrent instances of mypy are running, the current implementation allows a race condition. If the cache was generated with version X and two processes (A and B) of mypy versions X and Y respectively run at the same time, the following is possible: A checks a module's meta file and finds the cache fresh, B (version Y) then rewrites the cache files for that module, and A goes on to read a data file that no longer matches the meta it validated.
This can be avoided by running the mtime check after reading the data file. If the file is still fresh, A is safe to assume the data is usable; otherwise it should fall back to reprocessing the whole SCC as stale.
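A sketch of that read-then-recheck order, assuming the meta records the data file's mtime and the data file holds JSON; `load_tree_safely` is an illustrative name, not the actual `State.load_tree`:

```python
import json
import os

def load_tree_safely(data_path: str, expected_mtime: float):
    """Read the data file first, then verify nobody rewrote it meanwhile."""
    with open(data_path, "rb") as f:
        raw = f.read()
    if os.path.getmtime(data_path) != expected_mtime:
        # A concurrent mypy process replaced the file between the original
        # freshness check and this read; the bytes we hold may not match the
        # meta we validated, so the caller reprocesses the SCC as stale.
        return None
    return json.loads(raw)
```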
This change should have very low performance impact: we already perform that mtime check, I just move it to a later phase. This will result in deserializing more data files than necessary when multiple instances are running in parallel, but this is not our normal mode of operation anyway.
I am also considering an alternative: writing a random bytestring to the meta and data files and requiring that they match. This would give the same level of protection and would not fail on systems where mtime is unreliable, but it would introduce more complexity to the serialization/deserialization layer.
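A sketch of that alternative, assuming JSON meta/data files; the `snapshot_token` field name is illustrative:

```python
import json
import secrets

def write_cache_pair(meta_path: str, data_path: str, meta: dict, data: dict) -> None:
    # Stamp both files with the same random token so a reader can detect a
    # meta/data pair assembled by two different writer processes.
    token = secrets.token_hex(16)
    meta["snapshot_token"] = token
    data["snapshot_token"] = token
    with open(meta_path, "w") as f:
        json.dump(meta, f)
    with open(data_path, "w") as f:
        json.dump(data, f)

def cache_pair_consistent(meta: dict, data: dict) -> bool:
    return meta.get("snapshot_token") == data.get("snapshot_token")
```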
The race condition can be reliably reproduced on master by adding `__import__("time").sleep(0.2)` to the beginning of `State.load_tree` and then launching two mypy processes as follows, assuming `/your/path/to/mypy` has current HEAD checked out with the sleep added.
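The exact commands are not preserved in this excerpt; a sketch of the same idea, with the target path as a placeholder:

```python
import subprocess

# Start two mypy runs over the same target (and therefore the same default
# cache directory) at the same time; the sleep() added to State.load_tree
# widens the window in which one process reads a data file that the other
# is rewriting. "some_target/" is a placeholder.
procs = [subprocess.Popen(["python", "-m", "mypy", "some_target/"]) for _ in range(2)]
print([p.wait() for p in procs])
```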