Empty file added _ckpt_Serialization
Empty file.
56 changes: 56 additions & 0 deletions experiments/checkpoint_reproducibility/README.md
@@ -0,0 +1,56 @@
## Key Result

With the following controls enabled:

- `torch.manual_seed()`
- `torch.use_deterministic_algorithms(True)`
- deterministic checkpoint serialization

two identical training runs produce identical SHA-256 checkpoint hashes.

# Checkpoint Reproducibility Experiment

Checks whether identical PyTorch training runs produce identical checkpoints.
This came up in Discord — the question was whether you could verify a training
run by hashing the checkpoint. Before that's useful you need to know if training
is actually deterministic.

Short answer: yes, but only if you seed the RNG and use a deterministic save
format. These experiments figure out which things break it and which don't.
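The seeding requirement can be shown without PyTorch at all: any artifact derived from RNG draws only hashes identically across runs when the generator is seeded. A minimal stdlib sketch (the helpers here are illustrative stand-ins, not the actual `utils.py` code):

```python
import hashlib
import random

def fake_training_run(seed=None) -> bytes:
    """Stand-in for a training run: the output bytes depend on RNG draws."""
    rng = random.Random(seed)  # seed=None draws from OS entropy, like unseeded torch
    weights = [rng.random() for _ in range(8)]
    return repr(weights).encode()

def sha256(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

# Seeded: both runs hash identically.
a = sha256(fake_training_run(seed=0))
b = sha256(fake_training_run(seed=0))

# Unseeded: hashes differ (with overwhelming probability).
c = sha256(fake_training_run())
d = sha256(fake_training_run())

print(a == b)  # True
print(c == d)  # False
```

Same shape as the torch result below: the only genuinely non-deterministic case is the unseeded generator.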

---

## Files

```
run_experiment.py broad pass - toggles one thing at a time, prints hashes
test_entropy_sources.py one variable per test, run with pytest
utils.py set_seed() and hash_file()
WHY.md what questions I was trying to answer
RESULTS.md what the results mean
SOURCES.md sources
```

---

## Run it

```bash
cd experiments/checkpoint_reproducibility

python run_experiment.py

pytest test_entropy_sources.py -v
```

Expected: `run_experiment.py` shows one genuinely non-deterministic result
(the unseeded torch RNG). Pytest reports 14 passed, 0 failed.
Comment on lines +46 to +47
⚠️ Potential issue | 🟡 Minor

Refresh expected pytest result text.

The current test file includes xfail/xpass cases, so the expected output should not be documented as only 14 passed, 0 failed.



---

## Windows note

`num_workers > 0` is skipped automatically on Windows. PyTorch uses `spawn` for
DataLoader workers, which re-imports the script in each process. Making it work
needs more setup than the experiment is worth. See RESULTS.md for the expected
Linux behaviour.
Comment on lines +51 to +56
⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis indicates this Markdown file should end with a single newline.


85 changes: 85 additions & 0 deletions experiments/checkpoint_reproducibility/RESULTS.md
@@ -0,0 +1,85 @@
# RESULTS.md

Actual output. PyTorch 2.x, Windows, CPU only.

---

## run_experiment.py

```
baseline: 0ec6ccf669f7

torch.save
run1: 859fee7dca98 run2: 859fee7dca98 same: True matches_baseline: False

torch rng not seeded
run1: 444629cb73c6 run2: e9b671c41080 same: False matches_baseline: False

numpy rng not seeded + shuffle
run1: 5d7a61399d8c run2: 5d7a61399d8c same: True matches_baseline: False

python rng not seeded + shuffle
run1: 5d7a61399d8c run2: 5d7a61399d8c same: True matches_baseline: False

dropout same seed
run1: 7bab2b432f0e run2: 7bab2b432f0e same: True matches_baseline: False

shuffle same seed
run1: 5d7a61399d8c run2: 5d7a61399d8c same: True matches_baseline: False
```

## pytest

```
14 passed, 0 failed
```
Comment on lines +31 to +35
⚠️ Potential issue | 🟡 Minor

Update the pytest summary to match the current suite outcomes.

The suite currently includes xfail/xpass scenarios; documenting only 14 passed, 0 failed is misleading for reproducibility interpretation.



---

## What each result means

**torch.save**
Same hash both runs (`859fee` both times). PyTorch 2.x hardcodes the ZIP entry
date to `2011-01-01`, so the byte-level entropy from timestamps isn't present
here. `save_deterministic()` is still the right approach — it doesn't depend on
that implementation detail staying stable.

**torch rng not seeded**
Different hashes (`444629` vs `e9b671`). This is the only genuinely
non-deterministic result. Weight init without a seed draws from the OS random
state and changes every run. Fixing this requires `torch.manual_seed()`.

**numpy / python rng not seeded + shuffle**
Both produced `5d7a61` — same as shuffle-with-fixed-seed. All three shuffle
variants matched. DataLoader shuffle only draws from the torch generator, not
from numpy or python. Unseeding those two doesn't affect anything here.

**dropout same seed**
`7bab2b` both runs — deterministic. Dropout masks are drawn from the torch RNG,
so with a fixed seed they're reproducible.

**shuffle same seed**
`5d7a61` both runs — deterministic but different from the no-shuffle baseline.
DataLoader shuffle is seeded through `torch.Generator`, so it's reproducible
given the same seed. Different from baseline because the gradient order changed.

**test_torch_save_byte_stability**
Expected to fail (assert not match), but `torch.save` gave identical hashes
on this version. The test documents this outcome — see above.

**test_fixed_seed_isolates_save_format_entropy**
In-memory weights match (seed is fixed). File bytes also match on PyTorch 2.x.
The test prints the result either way — it's documentation, not a pass/fail gate.

**test_fixed_seed_plus_deterministic_save**
Passes. Same seed + `save_deterministic()` = identical SHA-256 end to end.
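As an illustration of what a format like `save_deterministic()` can look like (a hypothetical layout, not the experiment's actual implementation): sort parameter names, length-prefix each field, write raw bytes with no container metadata, and identical tensors always produce identical files:

```python
import hashlib
import struct

def save_deterministic_sketch(state: dict) -> bytes:
    """Serialize name -> raw tensor bytes in a fixed, sorted layout.

    Illustrative stand-in: a real version would feed in
    tensor.detach().numpy().tobytes() per parameter.
    """
    out = bytearray()
    for name in sorted(state):  # fixed ordering, independent of insertion order
        encoded = name.encode("utf-8")
        blob = state[name]
        out += struct.pack("<I", len(encoded)) + encoded   # name length + name
        out += struct.pack("<Q", len(blob)) + blob         # data length + data
    return bytes(out)

state = {"fc.weight": b"\x00\x01\x02\x03", "fc.bias": b"\xff\xfe"}
h1 = hashlib.sha256(save_deterministic_sketch(state)).hexdigest()
h2 = hashlib.sha256(save_deterministic_sketch(dict(reversed(list(state.items()))))).hexdigest()
print(h1 == h2)  # True: insertion order doesn't matter, bytes are stable
```

No ZIP, no pickle, no timestamps: the only entropy source left is the tensor data itself, which is the property the end-to-end test relies on.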
Comment on lines +66 to +75
⚠️ Potential issue | 🟡 Minor

Use current test function names for traceability.

This section references test names that do not match the current test_q* function names in test_entropy_sources.py, which makes cross-checking harder.



---

## Open questions

- Cross-machine: different hardware may produce different floats even with
identical seeds. Not tested.
- CUDA nondeterminism: needs a GPU to actually observe. The test is informational.
- Multiprocessing: skipped on Windows. Expected to work on Linux with an explicit
DataLoader generator seed.
Comment on lines +79 to +85
⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis reports this file should end with a single newline character.


37 changes: 37 additions & 0 deletions experiments/checkpoint_reproducibility/SOURCES.md
@@ -0,0 +1,37 @@
# Bibliography

Sources that were actually useful for this experiment.

---

**PyTorch — Reproducibility docs**
https://pytorch.org/docs/stable/notes/randomness.html
The main reference. Covers `use_deterministic_algorithms`, DataLoader worker
seeding, and the fork vs spawn difference that caused the Windows issue.

**PyTorch source — `torch/serialization.py`**
https://github.com/pytorch/pytorch/blob/main/torch/serialization.py
Where the hardcoded ZIP date (`2011-01-01`) lives. Explains why `torch.save`
gave identical hashes in the experiment.

**Hugging Face — Safetensors**
https://github.com/huggingface/safetensors
The format `save_deterministic()` approximates. No pickle, no ZIP, raw tensor
bytes in a fixed layout. Worth reading if this experiment gets extended.

**Reproducible Builds project**
https://reproducible-builds.org/
Background on why byte-level reproducibility is harder than it looks. The
timestamp problem in ZIP archives is a well-known issue in compiled software —
same class of problem, different domain.

**Bouthillier et al. (2019) — Unreproducible Research is Reproducible. ICML.**
https://proceedings.mlr.press/v97/bouthillier19a.html
Argues that undocumented variation is the real problem in ML reproducibility, not
variation itself. Relevant to the results where training is repeatable but
different from baseline — those aren't wrong, they're just undocumented states.

**Pineau et al. (2021) — Improving Reproducibility in ML Research. JMLR.**
https://jmlr.org/papers/v22/20-1364.html
The NeurIPS reproducibility checklist. Useful reference for what the
OpenVerifiableLLM training manifest should eventually include.
73 changes: 73 additions & 0 deletions experiments/checkpoint_reproducibility/WHY.md
@@ -0,0 +1,73 @@
# WHY.md

## The original question

In a Discord discussion someone asked whether you could verify a training run just
by hashing the checkpoint. The idea: re-run training on the same data, and if the
hashes match, nothing was hidden.

That only works if training is actually deterministic. I wasn't sure it was, so I
ran some experiments to find out.

---

## What I checked and why

**Does the same seed give the same weights?**
First thing to confirm. If weight init isn't reproducible, nothing else will be.
It is — that's `test_same_seed_same_weights`, and it's the floor everything else
builds on.
Comment on lines +18 to +19
⚠️ Potential issue | 🟡 Minor

Update references to actual test names in the current suite.

Examples like test_same_seed_same_weights and test_fixed_seed_isolates_save_format_entropy do not match current test_q* names, which makes the narrative harder to verify against code.

Also applies to: 45-46



**Does "same weights" mean "same file"?**
Not automatically. `torch.save()` uses Python's zipfile internally, and ZIP
entries can embed timestamps. If the timestamp changes between saves, the SHA-256
hash changes even though the model is identical. The fix is to write raw tensor
bytes directly with no container — which is what `save_deterministic()` does, and
what safetensors was designed for.

What I didn't expect: on this machine `torch.save()` produced the same hash both
times. Looking into it, PyTorch 2.x hardcodes the ZIP entry date to `2011-01-01`.
So the timestamp problem is already fixed in recent versions — but it's an
implementation detail, not a guarantee, so `save_deterministic()` is still the
right approach.
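The timestamp mechanism is easy to demonstrate with the stdlib alone. Pinning every entry's `date_time` makes the archive bytes stable, which is the same trick PyTorch applies with its hardcoded date (a sketch, not PyTorch's actual serialization code):

```python
import hashlib
import io
import zipfile

def zip_bytes(payload: bytes, date_time) -> bytes:
    """Write one entry into an in-memory ZIP with an explicit timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        info = zipfile.ZipInfo("data.bin", date_time=date_time)
        zf.writestr(info, payload)
    return buf.getvalue()

payload = b"identical model weights"
pinned = (2011, 1, 1, 0, 0, 0)  # same convention PyTorch 2.x uses

h1 = hashlib.sha256(zip_bytes(payload, pinned)).hexdigest()
h2 = hashlib.sha256(zip_bytes(payload, pinned)).hexdigest()
h3 = hashlib.sha256(zip_bytes(payload, (2024, 6, 1, 12, 0, 0))).hexdigest()

print(h1 == h2)  # True: pinned timestamps -> identical archive bytes
print(h1 == h3)  # False: different timestamp, different bytes, same payload
```

Same model, different hash: exactly the failure mode a wall-clock timestamp would introduce, and exactly what pinning removes.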

**Which RNG families actually matter?**
I assumed I needed to seed everything — torch, numpy, python. Turned out numpy
and python are inert for a pure PyTorch training loop. The broad survey confirmed
it: unseeding numpy or python (even with shuffle active) produced the same hash as
the seeded version. The torch seed is the only one that's load-bearing right now.

The numpy and python calls in `set_seed()` are still worth keeping — if the
pipeline ever adds numpy-based augmentation they'll matter — but right now they're
not doing anything.

**Are independent sources actually independent?**
`test_fixed_seed_isolates_save_format_entropy` checks this. Fix the seed so
weights are deterministic, then see if `torch.save()` still produces different
bytes. If they diverge, the two sources are independent. On PyTorch 2.x both
match, which confirms independence from the other direction.

---

## Things that went wrong

The original test for non-contiguous tensors asserted that calling `.numpy()` on
`weight.T` would raise a `RuntimeError`. That used to be true in older NumPy but
isn't anymore — it handles strided arrays silently. Caught on first run, rewrote
it to check byte layout consistency instead.

The first version of `run_experiment.py` created the dataset at module level with
no `if __name__ == "__main__"` guard. On Windows, DataLoader with `num_workers > 0`
spawns workers that re-import the script, which deadlocked immediately. Fixed by
moving dataset creation into `make_dataset()`.
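The fix is the standard spawn-safe layout: module import has no side effects, and anything that builds state lives behind a main guard, so worker re-imports are harmless. A generic sketch (not the experiment's exact code):

```python
# Spawn-safe module layout: importing this file does no work,
# so DataLoader workers that re-import it under `spawn` are harmless.

def make_dataset():
    # Dataset construction is deferred into a function instead of
    # running at module level, where a spawned worker's re-import
    # would rebuild it (or deadlock on it).
    return list(range(100))

def main():
    data = make_dataset()
    print(len(data))

if __name__ == "__main__":
    main()  # only the parent process runs this; re-imports skip it
```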

---

## What this doesn't cover

- **Cross-machine**: different CPU architectures can produce slightly different
floating-point results even with the same seed. Not tested here.
- **CUDA**: the nondeterministic mode test is informational only — the test
machine is CPU-only. On CUDA some operations are nondeterministic by default.
- **Distributed training**: gradient averaging across workers introduces ordering
issues that are hard to control. Out of scope.
Comment on lines +68 to +73
⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis reports this file should end with a single newline character.

