Empty file added _ckpt_Serialization
Empty file.
56 changes: 56 additions & 0 deletions experiments/checkpoint_reproducibility/README.md
@@ -0,0 +1,56 @@
## Key Result

With the following controls enabled:

- `torch.manual_seed()`
- `torch.use_deterministic_algorithms(True)`
- deterministic checkpoint serialization

two identical training runs produce identical SHA-256 checkpoint hashes.

# Checkpoint Reproducibility Experiment

Checks whether identical PyTorch training runs produce identical checkpoints.
This came up in Discord — the question was whether you could verify a training
run by hashing the checkpoint. Before that's useful you need to know if training
is actually deterministic.

Short answer: yes, but only if you seed the RNG and use a deterministic save
format. These experiments figure out which things break it and which don't.
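The seeding requirement can be shown without PyTorch at all: any artifact derived from RNG draws only hashes identically across runs when the generator is seeded. A minimal stdlib sketch (the helpers here are illustrative stand-ins, not the actual `utils.py` code):

```python
import hashlib
import random

def fake_training_run(seed=None) -> bytes:
    """Stand-in for a training run: the output bytes depend on RNG draws."""
    rng = random.Random(seed)  # seed=None draws from OS entropy, like unseeded torch
    weights = [rng.random() for _ in range(8)]
    return repr(weights).encode()

def sha256(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

# Seeded: both runs hash identically.
a = sha256(fake_training_run(seed=0))
b = sha256(fake_training_run(seed=0))

# Unseeded: hashes differ (with overwhelming probability).
c = sha256(fake_training_run())
d = sha256(fake_training_run())

print(a == b)  # True
print(c == d)  # False
```

Same shape as the torch result below: the only genuinely non-deterministic case is the unseeded generator.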

---

## Files

```
run_experiment.py broad pass - toggles one thing at a time, prints hashes
test_entropy_sources.py one variable per test, run with pytest
utils.py set_seed() and hash_file()
WHY.md what questions I was trying to answer
RESULTS.md what the results mean
SOURCES.md sources
```

---

## Run it

```bash
cd experiments/checkpoint_reproducibility

python run_experiment.py

pytest test_entropy_sources.py -v
```

Expected: `run_experiment.py` shows one genuinely non-deterministic result
(the unseeded torch RNG). Pytest reports 14 passed, 0 failed.
Comment on lines +46 to +47
⚠️ Potential issue | 🟡 Minor

Refresh expected pytest result text.

The current test file includes xfail/xpass cases, so the expected output should not be documented as only 14 passed, 0 failed.



---

## Windows note

`num_workers > 0` is skipped automatically on Windows. PyTorch uses `spawn` for
DataLoader workers, which re-imports the script in each process. Making it work
needs more setup than the experiment is worth. See RESULTS.md for the expected
Linux behaviour.
Comment on lines +51 to +56
⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis indicates this Markdown file should end with a single newline.


85 changes: 85 additions & 0 deletions experiments/checkpoint_reproducibility/RESULTS.md
@@ -0,0 +1,85 @@
# RESULTS.md

Actual output. PyTorch 2.x, Windows, CPU only.

---

## run_experiment.py

```
baseline: 0ec6ccf669f7

torch.save
run1: 859fee7dca98 run2: 859fee7dca98 same: True matches_baseline: False

torch rng not seeded
run1: 444629cb73c6 run2: e9b671c41080 same: False matches_baseline: False

numpy rng not seeded + shuffle
run1: 5d7a61399d8c run2: 5d7a61399d8c same: True matches_baseline: False

python rng not seeded + shuffle
run1: 5d7a61399d8c run2: 5d7a61399d8c same: True matches_baseline: False

dropout same seed
run1: 7bab2b432f0e run2: 7bab2b432f0e same: True matches_baseline: False

shuffle same seed
run1: 5d7a61399d8c run2: 5d7a61399d8c same: True matches_baseline: False
```

## pytest

```
14 passed, 0 failed
```
Comment on lines +31 to +35
⚠️ Potential issue | 🟡 Minor

Update the pytest summary to match the current suite outcomes.

The suite currently includes xfail/xpass scenarios; documenting only 14 passed, 0 failed is misleading for reproducibility interpretation.



---

## What each result means

**torch.save**
Same hash both runs (`859fee` both times). PyTorch 2.x hardcodes the ZIP entry
date to `2011-01-01`, so the byte-level entropy from timestamps isn't present
here. `save_deterministic()` is still the right approach — it doesn't depend on
that implementation detail staying stable.

**torch rng not seeded**
Different hashes (`444629` vs `e9b671`). This is the only genuinely
non-deterministic result. Weight init without a seed draws from the OS random
state and changes every run. Fixing this requires `torch.manual_seed()`.

**numpy / python rng not seeded + shuffle**
Both produced `5d7a61` — same as shuffle-with-fixed-seed. All three shuffle
variants matched. DataLoader shuffle only draws from the torch generator, not
from numpy or python. Unseeding those two doesn't affect anything here.

**dropout same seed**
`7bab2b` both runs — deterministic. Dropout masks are drawn from the torch RNG,
so with a fixed seed they're reproducible.

**shuffle same seed**
`5d7a61` both runs — deterministic but different from the no-shuffle baseline.
DataLoader shuffle is seeded through `torch.Generator`, so it's reproducible
given the same seed. Different from baseline because the gradient order changed.

**test_torch_save_byte_stability**
Expected to fail (assert not match), but `torch.save` gave identical hashes
on this version. The test documents this outcome — see above.

**test_fixed_seed_isolates_save_format_entropy**
In-memory weights match (seed is fixed). File bytes also match on PyTorch 2.x.
The test prints the result either way — it's documentation, not a pass/fail gate.

**test_fixed_seed_plus_deterministic_save**
Passes. Same seed + `save_deterministic()` = identical SHA-256 end to end.
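As an illustration of what a format like `save_deterministic()` can look like (a hypothetical layout, not the experiment's actual implementation): sort parameter names, length-prefix each field, write raw bytes with no container metadata, and identical tensors always produce identical files:

```python
import hashlib
import struct

def save_deterministic_sketch(state: dict) -> bytes:
    """Serialize name -> raw tensor bytes in a fixed, sorted layout.

    Illustrative stand-in: a real version would feed in
    tensor.detach().numpy().tobytes() per parameter.
    """
    out = bytearray()
    for name in sorted(state):  # fixed ordering, independent of insertion order
        encoded = name.encode("utf-8")
        blob = state[name]
        out += struct.pack("<I", len(encoded)) + encoded   # name length + name
        out += struct.pack("<Q", len(blob)) + blob         # data length + data
    return bytes(out)

state = {"fc.weight": b"\x00\x01\x02\x03", "fc.bias": b"\xff\xfe"}
h1 = hashlib.sha256(save_deterministic_sketch(state)).hexdigest()
h2 = hashlib.sha256(save_deterministic_sketch(dict(reversed(list(state.items()))))).hexdigest()
print(h1 == h2)  # True: insertion order doesn't matter, bytes are stable
```

No ZIP, no pickle, no timestamps: the only entropy source left is the tensor data itself, which is the property the end-to-end test relies on.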
Comment on lines +66 to +75
⚠️ Potential issue | 🟡 Minor

Use current test function names for traceability.

This section references test names that do not match the current test_q* function names in test_entropy_sources.py, which makes cross-checking harder.



---

## Open questions

- Cross-machine: different hardware may produce different floats even with
identical seeds. Not tested.
- CUDA nondeterminism: needs a GPU to actually observe. The test is informational.
- Multiprocessing: skipped on Windows. Expected to work on Linux with an explicit
DataLoader generator seed.
Comment on lines +79 to +85
⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis reports this file should end with a single newline character.


37 changes: 37 additions & 0 deletions experiments/checkpoint_reproducibility/SOURCES.md
@@ -0,0 +1,37 @@
# Bibliography

Sources that were actually useful for this experiment.

---

**PyTorch — Reproducibility docs**
https://pytorch.org/docs/stable/notes/randomness.html
The main reference. Covers `use_deterministic_algorithms`, DataLoader worker
seeding, and the fork vs spawn difference that caused the Windows issue.

**PyTorch source — `torch/serialization.py`**
https://github.com/pytorch/pytorch/blob/main/torch/serialization.py
Where the hardcoded ZIP date (`2011-01-01`) lives. Explains why `torch.save`
gave identical hashes in the experiment.

**Hugging Face — Safetensors**
https://github.com/huggingface/safetensors
The format `save_deterministic()` approximates. No pickle, no ZIP, raw tensor
bytes in a fixed layout. Worth reading if this experiment gets extended.

**Reproducible Builds project**
https://reproducible-builds.org/
Background on why byte-level reproducibility is harder than it looks. The
timestamp problem in ZIP archives is a well-known issue in compiled software —
same class of problem, different domain.

**Bouthillier et al. (2019) — Unreproducible Research is Reproducible. ICML.**
https://proceedings.mlr.press/v97/bouthillier19a.html
Argues that undocumented variation is the real problem in ML reproducibility, not
variation itself. Relevant to the results where training is repeatable but
different from baseline — those aren't wrong, they're just undocumented states.

**Pineau et al. (2021) — Improving Reproducibility in ML Research. JMLR.**
https://jmlr.org/papers/v22/20-1364.html
The NeurIPS reproducibility checklist. Useful reference for what the
OpenVerifiableLLM training manifest should eventually include.
73 changes: 73 additions & 0 deletions experiments/checkpoint_reproducibility/WHY.md
@@ -0,0 +1,73 @@
# WHY.md

## The original question

In a Discord discussion someone asked whether you could verify a training run just
by hashing the checkpoint. The idea: re-run training on the same data, and if the
hashes match, nothing was hidden.

That only works if training is actually deterministic. I wasn't sure it was, so I
ran some experiments to find out.

---

## What I checked and why

**Does the same seed give the same weights?**
First thing to confirm. If weight init isn't reproducible, nothing else will be.
It is — that's `test_same_seed_same_weights`, and it's the floor everything else
builds on.
Comment on lines +18 to +19
⚠️ Potential issue | 🟡 Minor

Update references to actual test names in the current suite.

Examples like test_same_seed_same_weights and test_fixed_seed_isolates_save_format_entropy do not match current test_q* names, which makes the narrative harder to verify against code.

Also applies to: 45-46



**Does "same weights" mean "same file"?**
Not automatically. `torch.save()` uses Python's zipfile internally, and ZIP
entries can embed timestamps. If the timestamp changes between saves, the SHA-256
hash changes even though the model is identical. The fix is to write raw tensor
bytes directly with no container — which is what `save_deterministic()` does, and
what safetensors was designed for.

What I didn't expect: on this machine `torch.save()` produced the same hash both
times. Looking into it, PyTorch 2.x hardcodes the ZIP entry date to `2011-01-01`.
So the timestamp problem is already fixed in recent versions — but it's an
implementation detail, not a guarantee, so `save_deterministic()` is still the
right approach.
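The timestamp mechanism is easy to demonstrate with the stdlib alone. Pinning every entry's `date_time` makes the archive bytes stable, which is the same trick PyTorch applies with its hardcoded date (a sketch, not PyTorch's actual serialization code):

```python
import hashlib
import io
import zipfile

def zip_bytes(payload: bytes, date_time) -> bytes:
    """Write one entry into an in-memory ZIP with an explicit timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        info = zipfile.ZipInfo("data.bin", date_time=date_time)
        zf.writestr(info, payload)
    return buf.getvalue()

payload = b"identical model weights"
pinned = (2011, 1, 1, 0, 0, 0)  # same convention PyTorch 2.x uses

h1 = hashlib.sha256(zip_bytes(payload, pinned)).hexdigest()
h2 = hashlib.sha256(zip_bytes(payload, pinned)).hexdigest()
h3 = hashlib.sha256(zip_bytes(payload, (2024, 6, 1, 12, 0, 0))).hexdigest()

print(h1 == h2)  # True: pinned timestamps -> identical archive bytes
print(h1 == h3)  # False: different timestamp, different bytes, same payload
```

Same model, different hash: exactly the failure mode a wall-clock timestamp would introduce, and exactly what pinning removes.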

**Which RNG families actually matter?**
I assumed I needed to seed everything — torch, numpy, python. Turned out numpy
and python are inert for a pure PyTorch training loop. The broad survey confirmed
it: unseeding numpy or python (even with shuffle active) produced the same hash as
the seeded version. The torch seed is the only one that's load-bearing right now.

The numpy and python calls in `set_seed()` are still worth keeping — if the
pipeline ever adds numpy-based augmentation they'll matter — but right now they're
not doing anything.

**Are independent sources actually independent?**
`test_fixed_seed_isolates_save_format_entropy` checks this. Fix the seed so
weights are deterministic, then see if `torch.save()` still produces different
bytes. If they diverge, the two sources are independent. On PyTorch 2.x both
match, which confirms independence from the other direction.

---

## Things that went wrong

The original test for non-contiguous tensors asserted that calling `.numpy()` on
`weight.T` would raise a `RuntimeError`. That used to be true in older NumPy but
isn't anymore — it handles strided arrays silently. Caught on first run, rewrote
it to check byte layout consistency instead.

The first version of `run_experiment.py` created the dataset at module level with
no `if __name__ == "__main__"` guard. On Windows, DataLoader with `num_workers > 0`
spawns workers that re-import the script, which deadlocked immediately. Fixed by
moving dataset creation into `make_dataset()`.
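The fix is the standard spawn-safe layout: module import has no side effects, and anything that builds state lives behind a main guard, so worker re-imports are harmless. A generic sketch (not the experiment's exact code):

```python
# Spawn-safe module layout: importing this file does no work,
# so DataLoader workers that re-import it under `spawn` are harmless.

def make_dataset():
    # Dataset construction is deferred into a function instead of
    # running at module level, where a spawned worker's re-import
    # would rebuild it (or deadlock on it).
    return list(range(100))

def main():
    data = make_dataset()
    print(len(data))

if __name__ == "__main__":
    main()  # only the parent process runs this; re-imports skip it
```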

---

## What this doesn't cover

- **Cross-machine**: different CPU architectures can produce slightly different
floating-point results even with the same seed. Not tested here.
- **CUDA**: the nondeterministic mode test is informational only — the test
machine is CPU-only. On CUDA some operations are nondeterministic by default.
- **Distributed training**: gradient averaging across workers introduces ordering
issues that are hard to control. Out of scope.
Comment on lines +68 to +73
⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis reports this file should end with a single newline character.

