
Conversation

@sterliakov sterliakov commented Nov 7, 2025

Fixes #14521. Fixes #19359. Fixes #18473.

We discussed a top-level try/except as a potential solution, but perhaps it's better to just not crash here?

When several instances of mypy run concurrently, the current implementation allows a race condition. If the cache was generated with version X, and two processes (A and B) running mypy versions X and Y respectively execute at the same time, the following is possible:

  • A reads the fresh meta and believes the data is also fresh
  • B reads the stale meta, processes this SCC, and writes new meta and data
  • A reads the data (now overwritten by B) and fails to parse it correctly

This can be avoided by re-running the mtime check after reading the data file. If the data file is still fresh, A can safely assume the data is usable; otherwise it should fall back to reprocessing the whole SCC as stale.
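
A minimal sketch of the idea, using hypothetical names rather than mypy's actual internals: remember the data mtime recorded in the meta, read the data file, and only trust the result if the mtime still matches afterwards.

import json
import os

def load_data_if_still_fresh(data_path: str, mtime_from_meta: float) -> dict | None:
    # Hypothetical helper, for illustration only.
    with open(data_path, "rb") as f:
        raw = f.read()
    # Re-check freshness *after* reading: if another process rewrote the file
    # in the meantime, its mtime no longer matches what the meta promised.
    if os.path.getmtime(data_path) != mtime_from_meta:
        return None  # caller falls back to reprocessing the SCC as stale
    return json.loads(raw)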

This change should have very low performance impact: we already perform that mtime check, I just move it to a later phase. It will result in deserializing more data files than necessary when multiple instances run in parallel, but that is not our normal mode of operation anyway.

I am also considering an alternative: writing a random bytestring to both the meta and data files and requiring that they match. This would give the same level of protection and would not fail us on systems where mtime is unreliable, but it would introduce more complexity to the serialization/deserialization layer.
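
Roughly, that alternative would look like this (a hypothetical sketch, not code from this PR): the writer stores the same random token in both files, and the reader only trusts the data if the two tokens agree.

import json
import secrets

def write_cache(meta_path: str, data_path: str, meta: dict, data: dict) -> None:
    # Hypothetical sketch: tie the meta and data files together with a shared nonce.
    nonce = secrets.token_hex(16)
    with open(data_path, "w") as f:
        json.dump({**data, "nonce": nonce}, f)
    with open(meta_path, "w") as f:
        json.dump({**meta, "nonce": nonce}, f)

def read_cache(meta_path: str, data_path: str) -> dict | None:
    with open(meta_path) as f:
        meta = json.load(f)
    with open(data_path) as f:
        data = json.load(f)
    # A mismatch means the data was written by a different run than the meta.
    if meta.get("nonce") != data.get("nonce"):
        return None
    return data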

The race condition can be reliably reproduced on master by adding __import__("time").sleep(0.2) to the beginning of State.load_tree and then launching two mypy processes as follows:

cd "$(mktemp -d)"
uv venv .venv && . .venv/bin/activate && uv pip install -e "/your/path/to/local/mypy"
uv venv .old && . .old/bin/activate && uv pip install "mypy == 1.16.1"
echo "import random" >a.py
rm -rf .mypy_cache; .venv/bin/mypy a.py; echo >>a.py; .venv/bin/mypy a.py -v --show-traceback & .old/bin/mypy a.py

Here /your/path/to/local/mypy is a checkout of the current HEAD with the sleep added.

github-actions bot commented Nov 7, 2025

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

sterliakov commented Nov 8, 2025

Any suggestions for good benchmark targets? Simply running mypy -c 'import torch' shows no difference (master: $3.12 \pm 0.07$, PR: $3.11 \pm 0.06$) over 10 cached runs ($\textrm{mean} \pm \textrm{sd}$).
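
For context, such numbers can be collected by timing repeated cached runs; the snippet below is an illustrative way to do that, not necessarily the exact script used.

import statistics
import subprocess
import time

def bench(cmd: list[str], runs: int = 10) -> tuple[float, float]:
    # Warm the cache once, then time the cached runs.
    subprocess.run(cmd, capture_output=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

print(bench(["mypy", "-c", "import torch"]))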

@sterliakov sterliakov marked this pull request as ready for review November 8, 2025 00:50
     manager.done_sccs.add(prev_scc.id)
-    process_fresh_modules(graph, sorted(prev_scc.mod_ids), manager)
+    if not process_fresh_modules(graph, sorted(prev_scc.mod_ids), manager):
+        process_stale_scc(graph, prev_scc, manager)

@ilevkivskyi ilevkivskyi commented Nov 8, 2025

It is a bad idea to do this while GC is (potentially) frozen. In general, I think we should not complicate the logic in build.py for some rare/niche use cases; it is already complicated and tricky enough. I think we should simply add a try/except with an error message suggesting a different --cache-dir per invocation when running mypy in parallel.
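
Roughly, that suggestion might look like the following hypothetical sketch (the wrapper name, the exception list, and the call site are placeholders, not mypy's actual API):

from typing import Any, Callable

from mypy.errors import CompileError

def load_with_fallback(load: Callable[[], Any], module_id: str) -> Any:
    # Hypothetical wrapper around cache deserialization: instead of crashing on a
    # corrupted or concurrently rewritten cache entry, fail with an actionable message.
    try:
        return load()
    except (ValueError, KeyError, AssertionError) as exc:
        raise CompileError(
            [
                f"Cannot load cache data for {module_id}: {exc}. If several mypy "
                "processes run in parallel, give each invocation its own --cache-dir."
            ]
        ) from exc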

sterliakov (Collaborator, Author) replied:

> It is a bad idea to do this while GC is (potentially) frozen.

This is already an exceptional case: we can only reach this branch if multiple processes are running concurrently. Note that this also fixes an (unreported) crash that is almost guaranteed to happen when the existing "data is None" early-return branch in State.load_tree is taken (and I can see how that could reasonably happen even without concurrency - a network error on a mapped drive, for example, or just a disk fault).

If I understand correctly, the worst possible outcome in this case is OOM (which is a very good way to say "please stop running multiple instances concurrently") or increased memory usage, so what do we risk by allowing this to happen?

sterliakov (Collaborator, Author) added:

I'm not super happy about try/except because the exception may not actually be raised even when the data is stale - it may be compatible by interface but not semantically - resulting in an inconsistent state somewhere and unpredictable heisenbugs that no one can reproduce.

Or let's add a checkbox to the issue template asking if the OP is running some IDE integration without explicit --cache-dir configuration...

@ilevkivskyi ilevkivskyi left a comment

Just to be clear, this is not going to happen; as I said, build.py is already complex enough.
