[EXPERIMENT] Checkpoint reproducibility and entropy sources in PyTorch training #73
## Key Result

With the following controls enabled:

- `torch.manual_seed()`
- `torch.use_deterministic_algorithms(True)`
- deterministic checkpoint serialization

two identical training runs produce identical SHA-256 checkpoint hashes.

# Checkpoint Reproducibility Experiment

Checks whether identical PyTorch training runs produce identical checkpoints.
This came up in Discord — the question was whether you could verify a training
run by hashing the checkpoint. Before that's useful you need to know if training
is actually deterministic.

Short answer: yes, but only if you seed the RNG and use a deterministic save
format. These experiments figure out which things break it and which don't.

---

## Files

```
run_experiment.py        broad pass - toggles one thing at a time, prints hashes
test_entropy_sources.py  one variable per test, run with pytest
utils.py                 set_seed() and hash_file()
WHY.md                   what questions I was trying to answer
RESULTS.md               what the results mean
SOURCES.md               sources
```
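The helpers in `utils.py` aren't shown in this diff. A minimal sketch of what `set_seed()` and `hash_file()` could look like — the bodies here are assumptions about the approach, not the repo's actual code:

```python
import hashlib
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed every RNG family the experiments touch."""
    torch.manual_seed(seed)  # the only load-bearing seed for a pure torch loop
    np.random.seed(seed)     # inert today; kept for future numpy augmentation
    random.seed(seed)


def hash_file(path: str) -> str:
    """SHA-256 of a file's raw bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```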
---

## Run it

```bash
cd experiments/checkpoint_reproducibility

python run_experiment.py

pytest test_entropy_sources.py -v
```

Expected: `run_experiment.py` shows one genuinely non-deterministic result
(unseeded torch RNG). Pytest reports 14 passed, 0 failed.

---

## Windows note

`num_workers > 0` is skipped automatically on Windows. PyTorch uses `spawn` for
DataLoader workers, which re-imports the script in each process. Making it work
needs more setup than the experiment is worth. See RESULTS.md for expected Linux
behaviour.
# RESULTS.md

Actual output. PyTorch 2.x, Windows, CPU only.

---

## run_experiment.py

```
baseline: 0ec6ccf669f7

torch.save
run1: 859fee7dca98  run2: 859fee7dca98  same: True   matches_baseline: False

torch rng not seeded
run1: 444629cb73c6  run2: e9b671c41080  same: False  matches_baseline: False

numpy rng not seeded + shuffle
run1: 5d7a61399d8c  run2: 5d7a61399d8c  same: True   matches_baseline: False

python rng not seeded + shuffle
run1: 5d7a61399d8c  run2: 5d7a61399d8c  same: True   matches_baseline: False

dropout same seed
run1: 7bab2b432f0e  run2: 7bab2b432f0e  same: True   matches_baseline: False

shuffle same seed
run1: 5d7a61399d8c  run2: 5d7a61399d8c  same: True   matches_baseline: False
```

## pytest

```
14 passed, 0 failed
```
---

## What each result means

**torch.save**
Same hash both runs (`859fee` both times). PyTorch 2.x hardcodes the ZIP entry
date to `2011-01-01`, so the byte-level entropy from timestamps isn't present
here. `save_deterministic()` is still the right approach — it doesn't depend on
that implementation detail staying stable.
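That observation can be checked directly. A small sketch (not part of the experiment's code) that hashes the bytes `torch.save` emits:

```python
import hashlib
import io

import torch


def save_hash(obj) -> str:
    """SHA-256 of whatever bytes torch.save writes for obj."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return hashlib.sha256(buf.getvalue()).hexdigest()


t = torch.arange(4.0)
h1, h2 = save_hash(t), save_hash(t)
# On PyTorch 2.x these typically match because the ZIP entry date is
# hardcoded; treat that as an implementation detail, not a guarantee.
print(h1 == h2)
```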
**torch rng not seeded**
Different hashes (`444629` vs `e9b671`). This is the only genuinely
non-deterministic result. Weight init without a seed draws from the OS random
state and changes every run. Fixing this requires `torch.manual_seed()`.
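A minimal illustration of the fix — re-seeding the global torch generator before each init makes the draws repeat:

```python
import torch

# Seeded: identical initial weights across runs.
torch.manual_seed(0)
w1 = torch.nn.Linear(4, 4).weight.detach().clone()
torch.manual_seed(0)
w2 = torch.nn.Linear(4, 4).weight.detach().clone()
print(torch.equal(w1, w2))  # True

# Unseeded: each init advances the global RNG, so fresh draws differ.
a = torch.nn.Linear(4, 4).weight.detach().clone()
b = torch.nn.Linear(4, 4).weight.detach().clone()
print(torch.equal(a, b))
```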
**numpy / python rng not seeded + shuffle**
Both produced `5d7a61` — same as shuffle-with-fixed-seed. All three shuffle
variants matched. DataLoader shuffle only draws from the torch generator, not
from numpy or python. Unseeding those two doesn't affect anything here.

**dropout same seed**
`7bab2b` both runs — deterministic. Dropout masks are drawn from the torch RNG,
so with a fixed seed they're reproducible.
**shuffle same seed**
`5d7a61` both runs — deterministic but different from the no-shuffle baseline.
DataLoader shuffle is seeded through `torch.Generator`, so it's reproducible
given the same seed. Different from baseline because the gradient order changed.
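A sketch of seeding shuffle through an explicit generator (the names here are illustrative, not the experiment's code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))


def epoch_order(seed: int):
    """Sample order for one epoch under a fixed shuffle seed."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)
    return [x.item() for (batch,) in loader for x in batch]


print(epoch_order(0) == epoch_order(0))  # True: same seed, same order
```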
**test_torch_save_byte_stability**
Expected to fail (assert not match), but `torch.save` gave identical hashes
on this version. The test documents this outcome — see above.

**test_fixed_seed_isolates_save_format_entropy**
In-memory weights match (seed is fixed). File bytes also match on PyTorch 2.x.
The test prints the result either way — it's documentation, not a pass/fail gate.

**test_fixed_seed_plus_deterministic_save**
Passes. Same seed + `save_deterministic()` = identical SHA-256 end to end.
---

## Open questions

- Cross-machine: different hardware may produce different floats even with
  identical seeds. Not tested.
- CUDA nondeterminism: needs a GPU to actually observe. The test is informational.
- Multiprocessing: skipped on Windows. Expected to work on Linux with an explicit
  DataLoader generator seed.
# Bibliography

Sources that were actually useful for this experiment.

---

**PyTorch — Reproducibility docs**
https://pytorch.org/docs/stable/notes/randomness.html
The main reference. Covers `use_deterministic_algorithms`, DataLoader worker
seeding, and the fork vs spawn difference that caused the Windows issue.

**PyTorch source — `torch/serialization.py`**
https://github.com/pytorch/pytorch/blob/main/torch/serialization.py
Where the hardcoded ZIP date (`2011-01-01`) lives. Explains why `torch.save`
gave identical hashes in the experiment.

**Hugging Face — Safetensors**
https://github.com/huggingface/safetensors
The format `save_deterministic()` approximates. No pickle, no ZIP, raw tensor
bytes in a fixed layout. Worth reading if this experiment gets extended.

**Reproducible Builds project**
https://reproducible-builds.org/
Background on why byte-level reproducibility is harder than it looks. The
timestamp problem in ZIP archives is a well-known issue in compiled software —
same class of problem, different domain.

**Bouthillier et al. (2019) — Unreproducible Research is Reproducible. ICML.**
https://proceedings.mlr.press/v97/bouthillier19a.html
Argues that undocumented variation is the real problem in ML reproducibility, not
variation itself. Relevant to the results where training is repeatable but
different from baseline — those aren't wrong, they're just undocumented states.

**Pineau et al. (2021) — Improving Reproducibility in ML Research. JMLR.**
https://jmlr.org/papers/v22/20-1364.html
The NeurIPS reproducibility checklist. Useful reference for what the
OpenVerifiableLLM training manifest should eventually include.
# WHY.md

## The original question

In a Discord discussion someone asked whether you could verify a training run just
by hashing the checkpoint. The idea being: re-run training on the same data and,
if the hashes match, nothing was hidden.
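Concretely, verification in that scheme reduces to one hash comparison. A sketch using only the standard library (function names are illustrative):

```python
import hashlib


def checkpoint_hash(path: str) -> str:
    """SHA-256 over a checkpoint file's raw bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def runs_match(path_a: str, path_b: str) -> bool:
    """Two runs 'match' iff their checkpoints hash identically."""
    return checkpoint_hash(path_a) == checkpoint_hash(path_b)
```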
That only works if training is actually deterministic. I wasn't sure it was, so I
ran some experiments to find out.

---

## What I checked and why

**Does the same seed give the same weights?**
First thing to confirm. If weight init isn't reproducible nothing else will be.
It is — that's `test_same_seed_same_weights`, and it's the floor everything else
builds on.
**Does "same weights" mean "same file"?**
Not automatically. `torch.save()` uses Python's zipfile internally, and ZIP
entries can embed timestamps. If the timestamp changes between saves, the SHA-256
hash changes even though the model is identical. The fix is to write raw tensor
bytes directly with no container — which is what `save_deterministic()` does, and
what safetensors was designed for.
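`save_deterministic()` itself isn't shown in this diff. One possible shape for it — raw tensor bytes in sorted-key order, no container at all. This sketch is an assumption about the approach, and it deliberately omits the shape/dtype metadata a loadable format such as safetensors also records:

```python
import torch


def save_deterministic(state_dict: dict, path: str) -> None:
    """Write each tensor's raw bytes in sorted-key order.

    No pickle and no ZIP container means no timestamp or archive
    metadata can leak into the file's SHA-256.
    """
    with open(path, "wb") as f:
        for key in sorted(state_dict):
            tensor = state_dict[key].detach().cpu().contiguous()
            f.write(key.encode("utf-8") + b"\x00")
            f.write(tensor.numpy().tobytes())
```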
What I didn't expect: on this machine `torch.save()` produced the same hash both
times. Looking into it, PyTorch 2.x hardcodes the ZIP entry date to `2011-01-01`.
So the timestamp problem is already fixed in recent versions — but it's an
implementation detail, not a guarantee, so `save_deterministic()` is still the
right approach.

**Which RNG families actually matter?**
I assumed I needed to seed everything — torch, numpy, python. Turned out numpy
and python are inert for a pure PyTorch training loop. The broad survey confirmed
it: unseeding numpy or python (even with shuffle active) produced the same hash as
the seeded version. The torch seed is the only one that's load-bearing right now.

The numpy and python calls in `set_seed()` are still worth keeping — if the
pipeline ever adds numpy-based augmentation they'll matter — but right now they're
not doing anything.

**Are independent sources actually independent?**
`test_fixed_seed_isolates_save_format_entropy` checks this. Fix the seed so
weights are deterministic, then see if `torch.save()` still produces different
bytes. If they diverge, the two sources are independent. On PyTorch 2.x both
match, which confirms independence from the other direction.

---

## Things that went wrong

The original test for non-contiguous tensors asserted that calling `.numpy()` on
`weight.T` would raise a `RuntimeError`. That used to be true in older NumPy but
isn't anymore — it handles strided arrays silently. Caught on first run, rewrote
it to check byte layout consistency instead.
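The rewritten check can be sketched like this (illustrative, not the repo's actual test):

```python
import torch

w = torch.arange(6.0).reshape(2, 3)
t = w.T  # a strided view, not contiguous

print(t.is_contiguous())  # False
arr = t.numpy()           # handled silently on modern NumPy: strided views work

# Byte layout consistency: tobytes() serializes in C order whether or not
# the underlying storage is contiguous, so both spellings agree.
print(arr.tobytes() == t.contiguous().numpy().tobytes())  # True
```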
The first version of `run_experiment.py` created the dataset at module level with
no `if __name__ == "__main__"` guard. On Windows, DataLoader with `num_workers > 0`
spawns workers that re-import the script, which deadlocked immediately. Fixed by
moving dataset creation into `make_dataset()`.
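The resulting structure looks roughly like this — `make_dataset` matches the description above, but the bodies are assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_dataset() -> TensorDataset:
    # Built inside a function so spawn-based workers can re-import this
    # module without re-running dataset creation at import time.
    return TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))


def main() -> None:
    loader = DataLoader(make_dataset(), batch_size=8, num_workers=2)
    for _batch in loader:
        pass


if __name__ == "__main__":
    main()  # guarded: spawned workers import the module but never call main()
```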
---

## What this doesn't cover

- **Cross-machine**: different CPU architectures can produce slightly different
  floating-point results even with the same seed. Not tested here.
- **CUDA**: the nondeterministic mode test is informational only — the test
  machine is CPU-only. On CUDA some operations are nondeterministic by default.
- **Distributed training**: gradient averaging across workers introduces ordering
  issues that are hard to control. Out of scope.