.gitignore (34 changes: 33 additions & 1 deletion)
@@ -8,4 +8,36 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/

# Personal scratch and notes (not part of the submission)
.claude/
PARAMETER_GOLF_BATTLE_PLAN_*.md
chats.md
program.md
errors.json
best_sweep_config.json
test_oracle.bin
final_model.int6.ptz
final_model.pt

# Local working copies that duplicate the canonical files in records/
/train_gpt.py
/build_ngram_oracle.py
/run_h100.sh
/run_local_test.sh
/run_3seeds.sh
/sweep.py
/ctw_prototype.py
/ablation_colab.ipynb
/kaggle_run.ipynb

# Reference / scratch copies of other people's PRs and earlier forks
reference_pr*_train_gpt.py
train_gpt_pr*_*.py

# Tooling clones
autoresearch_ref/

# Older incomplete draft record submission, never finished
records/track_10min_16mb/2026-03-24_VR_GA_LeakyReLU_LegalTTT/

Large diffs are not rendered by default.

@@ -0,0 +1,54 @@
# Non-record: Frozen N-gram Oracle + HedgeMixer + SGD TTT

## Summary

A hybrid system that bundles a frozen multi-order n-gram oracle (built offline from FineWeb training tokens, int8 log-probabilities with zstd-22, 3.42 MB compressed on a 10M-token slice) into a single artifact alongside the neural model. The oracle plugs into the existing Hedge mixer at TTT/eval time as additional experts. The submission also includes the SGD TTT switch (PR #967, reported -0.041 BPB) and `LeakyReLU(0.75)²` (PR #977, reported -0.008 BPB) as ancillary changes.

This is a methodology submission, not a record claim. I didn't have 8×H100 access during the cohort. The pipeline runs end-to-end on Kaggle T4×2 NCCL DDP (8L/384d, 13.4M params, 172 training steps, 3,786 TTT chunks, 6.85 MB final artifact, exit 0). The README extrapolates wall-clock to H100×8 from those measurements (around 13 to 17 minutes for the full pipeline at 11L/512d).

## Key contributions

- A frozen multi-order n-gram oracle, packaged as part of the artifact. Standalone offline builder (`build_ngram_oracle.py`, 250 lines, NumPy only): exact unigram, exact bigram, FNV-1a-hashed orders 3 through 8 with bucket counts from 4096 down to 256. Built only from training tokens. Bundled inside the 16 MB cap, and designed to close the compliance gap that got PR #924 flagged.

- HedgeMixer extension. The existing 5-expert mixer (neural + online uni/bi/tri + decay cache) is extended to `5 + |oracle orders|` experts via a single multiplicative-weights update. With no oracle loaded, behavior matches the base.

- Single-artifact format with a 16-byte versioned header (4-byte magic, 1 version byte, 3 reserved, neural and oracle blob lengths). `oracle_len = 0` degrades cleanly to base behavior. Reload uses an in-memory `FrozenNgramOracle.from_bytes` classmethod, no per-rank temp files.

- SGD TTT as a configurable alternative to AdamW (PR #967). `LeakyReLU(0.75)²` configurable per PR #977. Both env-var-gated, both small reviewable diffs.

- Bug fixes. Bucketed `dist.all_reduce` in TTT replaces about 100 per-parameter NCCL launches with one. `index_put_(..., accumulate=True)` replaces a non-deterministic `bi_counts[prev, targets] += 1.0` in HedgeMixer table updates (sketched after this list). Inline `loss * weight.mean()` complementary scaling removed (mathematically not equivalent to per-token reweighting).
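
To make the second fix concrete, here is a minimal sketch of the deterministic bigram-count update, assuming `bi_counts` is a dense `[vocab, vocab]` float tensor and `prev`/`targets` are 1-D index tensors; names follow the bullet above, and the surrounding HedgeMixer code is not shown.

```python
import torch

vocab = 8
bi_counts = torch.zeros(vocab, vocab)
prev = torch.tensor([3, 3, 5])       # note the duplicate (3, 1) pair
targets = torch.tensor([1, 1, 2])

# Buggy form: with duplicate (prev, target) pairs the read-modify-write races on GPU,
# and repeated pairs are under-counted (1.0 instead of 2.0 here).
# bi_counts[prev, targets] += 1.0

# Deterministic form: accumulate=True sums every contribution for repeated indices.
bi_counts.index_put_((prev, targets),
                     torch.ones(prev.numel(), dtype=bi_counts.dtype),
                     accumulate=True)
assert bi_counts[3, 1] == 2.0 and bi_counts[5, 2] == 1.0
```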

## Negative results

- Byte-level CTW (`ctw_prototype.py`, depth 8, 262K hash buckets/depth, KT estimator; sketched after this list). 2M training bytes + 500K eval bytes from one FineWeb shard:
- Eval BPB: 6.33 (target < 1.2)
- Compressed: 21.31 MB (target < 5 MB)
- Throughput: 16,761 bytes/sec (target > 100K)
- Verdict: dominated by token-level n-grams at this vocab size. Token-level CTW is the natural follow-up.
- Inline complementary loss scaling. Multiplying scalar-mean CE by `weight.mean()` is not equivalent to per-token reweighting. Removed. Standalone function kept for future inside-graph integration.
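
For reference, the per-node KT estimator the prototype used is tiny; this is the textbook form with the Dirichlet(1/2) prior, which reduces to (n_s + 1/2)/(N + 1) for a binary alphabet. It is shown here only to document what was tried.

```python
def kt_prob(count_sym: int, count_total: int, alphabet_size: int = 2) -> float:
    """Krichevsky-Trofimov sequential estimate of P(next symbol = s) at one context node.

    count_sym:   occurrences of symbol s in this context so far
    count_total: total symbols seen in this context so far
    The byte-level prototype uses alphabet_size=256; the failure mode was the
    context-state explosion at depth 8, not the estimator itself.
    """
    return (count_sym + 0.5) / (count_total + alphabet_size / 2.0)
```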

## Limitations

- No 8×H100 validation, no 3-seed mean, no competition-scale BPB number from me.
- Oracle build verified on a 100M-token shard (32 s, 4.66 MB) and a 10M-token slice (2.5 s, 3.42 MB). Full 80-shard scan time and final compressed size are extrapolated.
- Complementary training loss is implemented but currently disabled. Inside-graph integration is required for correct per-token weighting; the inline version was wrong (now removed).
- All BPB numbers cited from other PRs (#803: 0.4416, #834: 0.1663, #924: 0.0280, #967: -0.041, #977: -0.008) are from those PRs' authors, not reproduced here.

## Test plan

- [x] Local end-to-end run on RTX 4060 (4L/256d toy, 50 steps, 7.6 MB artifact, exit 0)
- [x] Kaggle T4×2 NCCL DDP run end-to-end (8L/384d, 172 steps, 3,786 TTT chunks, 6.85 MB artifact, exit 0)
- [x] FNV-1a NumPy/Torch equivalence test passes (1000 samples, ctx_len=5, buckets=4096)
- [x] Magic-prefix artifact roundtrip verified (`Header + blobs total: 7,181,197 == 7,181,197: True`)
- [x] All `train_gpt.py` and `build_ngram_oracle.py` files syntax-checked
- [ ] Not done: 8×H100 3-seed validation at competition spec
- [ ] Not done: Full 80-shard oracle build
- [ ] Not done: α-sweep for complementary loss after inside-graph integration
- [ ] Not done: Per-order Hedge weight logging

## Why submit this

I joined late and didn't get 8×H100 access. Rather than fabricate numbers or skip submitting, I'm offering this as a methodology contribution: a clean, modular, reviewable design for hybrid frozen-oracle + neural systems. The README walks through the design, the negative results, the explicit limitations, and the concrete plan for what I'd do with H100 access.

- [README.md](./README.md): technical writeup (~1,300 words) covering the three components, the validated DDP run, the H100 extrapolation, compliance, limitations, related April 2026 references, and reproduction instructions.
- [JOURNEY.md](./JOURNEY.md): process journal documenting the 5-week research arc, the 5 GitHub sweeps, the 6 specialist agents consulted, the strategic pivots, the dead ends with measured numbers, and the day-of-deadline polish loop.
@@ -0,0 +1,68 @@
# Frozen N-gram Oracle + HedgeMixer + SGD TTT (non-record submission)

**Author:** Dhruv Puri ([@dhruvpuri](https://github.com/dhruvpuri)), 2026-04-30

A hybrid n-gram + neural language model for OpenAI's Parameter Golf 2026 (16 MB artifact, 10 minutes training on 8×H100, scored by bits-per-byte on FineWeb val).

This is a methodology submission, not a record claim. I didn't have 8×H100 access, so the full-scale numbers (11L/512d) aren't here. The pipeline is validated end-to-end on Kaggle T4×2 with NCCL DDP. See [JOURNEY.md](./JOURNEY.md) for the research arc, the agent-assisted review loop, and reproduction instructions.

## What's in the patch

Three pieces that share an artifact, plus three smaller fixes.

**1. A frozen n-gram oracle.** [`build_ngram_oracle.py`](./build_ngram_oracle.py), 250 lines, NumPy only. Scans FineWeb training tokens once offline. Builds orders 1 through 8 as int8 log-probabilities with Laplace smoothing. Orders 1 and 2 are exact. Orders 3 through 8 use FNV-1a hashed contexts with bucket counts going from 4096 down to 256. zstd-22 compressed. A NumPy/Torch FNV-1a equivalence test runs before every build. If the offline NumPy hash and the online Torch hash disagree on any sample, the build aborts.
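
A minimal NumPy sketch of the two core transforms described above: FNV-1a context hashing and counts-to-int8 log-probabilities with Laplace smoothing. The 64-bit FNV constants, the word-at-a-time fold, and the `scale` factor are illustrative assumptions, not necessarily what `build_ngram_oracle.py` does.

```python
import numpy as np

FNV_OFFSET = np.uint64(0xCBF29CE484222325)  # 64-bit FNV-1a offset basis
FNV_PRIME = np.uint64(0x100000001B3)        # 64-bit FNV-1a prime

def fnv1a_buckets(contexts: np.ndarray, n_buckets: int) -> np.ndarray:
    """Hash each context row (token ids, shape [N, order-1]) to a bucket id."""
    h = np.full(contexts.shape[0], FNV_OFFSET, dtype=np.uint64)
    for col in range(contexts.shape[1]):
        h = (h ^ contexts[:, col].astype(np.uint64)) * FNV_PRIME  # wraps mod 2**64
    return (h % np.uint64(n_buckets)).astype(np.int64)

def counts_to_int8_logprobs(counts: np.ndarray, alpha: float = 1.0, scale: float = 8.0) -> np.ndarray:
    """Laplace-smooth a [buckets, vocab] count table and quantize log-probs to int8."""
    smoothed = counts + alpha
    logp = np.log(smoothed / smoothed.sum(axis=-1, keepdims=True))  # <= 0 everywhere
    return np.clip(np.round(logp * scale), -127, 0).astype(np.int8)
```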

**2. HedgeMixer with oracle experts.** The base stack already had a 5-expert online ensemble (neural, online uni/bi/tri, decay cache). I added one expert per loaded oracle order, taking the count to 13. Mixing happens in log-space, with a multiplicative-weights update on per-token NLL. A warm prior `log_w[0] = 2.0` keeps short eval streams from being dominated by Hedge convergence noise. With no oracle loaded the mixer reduces to the original 5-expert form.
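
A sketch of the mixing and update rule. `expert_logprobs` stacks each expert's next-token log-probabilities and `eta` is the Hedge learning rate; both names and the exact normalization are assumptions, so treat this as the shape of the idea rather than the in-tree code.

```python
import torch

def hedge_step(log_w: torch.Tensor,            # [K] unnormalized log expert weights
               expert_logprobs: torch.Tensor,  # [K, vocab] per-expert log P(next token)
               target: int,
               eta: float = 0.1):
    """One log-space mixture prediction plus a multiplicative-weights update."""
    # Mixture prediction: log sum_k softmax(log_w)_k * p_k(.)
    log_mix = torch.logsumexp(torch.log_softmax(log_w, dim=0)[:, None] + expert_logprobs, dim=0)
    mixture_nll = -log_mix[target]
    # Hedge: w_k *= exp(-eta * per-token NLL of expert k), done directly in log-space.
    per_expert_nll = -expert_logprobs[:, target]
    return log_w - eta * per_expert_nll, mixture_nll

log_w = torch.zeros(13)  # 5 base experts + 8 oracle orders
log_w[0] = 2.0           # warm prior toward the neural expert
```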

**3. A magic-prefixed versioned artifact format.** 16-byte header (4-byte magic `0x50474152`, 1 version byte, 3 reserved, neural and oracle blob lengths), then the neural blob (int6 per-row + zstd-22) and the oracle blob. One file under the 16 MB cap. The version byte means future schema changes fail loudly instead of silently mis-slicing. Reload uses an in-memory `FrozenNgramOracle.from_bytes` classmethod, no per-rank temp files.
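
A sketch of the header with `struct`; the field order follows the description above, while the big-endian byte order and 4-byte length fields are assumptions.

```python
import struct

MAGIC = 0x50474152            # b"PGAR" when packed big-endian
HEADER_FMT = ">I B 3x I I"    # magic, version, 3 reserved bytes, neural_len, oracle_len
assert struct.calcsize(HEADER_FMT) == 16

def bundle(neural_blob: bytes, oracle_blob: bytes, version: int = 1) -> bytes:
    header = struct.pack(HEADER_FMT, MAGIC, version, len(neural_blob), len(oracle_blob))
    return header + neural_blob + oracle_blob

def unbundle(buf: bytes):
    magic, version, neural_len, oracle_len = struct.unpack_from(HEADER_FMT, buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic: not a bundled artifact")
    neural = buf[16:16 + neural_len]
    oracle = buf[16 + neural_len:16 + neural_len + oracle_len]  # empty when oracle_len == 0
    return version, neural, oracle
```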

Plus, in `train_gpt.py`:

- A `TTT_OPTIMIZER=sgd` switch (lr=0.002, momentum=0.9), matching [PR #967](https://github.com/openai/parameter-golf/pull/967)'s reported -0.041 BPB.
- `LEAKY_RELU_SLOPE` configurable. Setting 0.75 matches [PR #977](https://github.com/openai/parameter-golf/pull/977)'s -0.008 BPB.
- Bucketed `dist.all_reduce` in TTT, replacing about 100 per-parameter NCCL launches per micro-step with one (sketched just below).
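
A sketch of the bucketed path, assuming all TTT-updated gradients share one dtype so they can be flattened into a single buffer; the actual bucketing in the patch may differ.

```python
import torch
import torch.distributed as dist

def bucketed_grad_allreduce(params, world_size: int):
    """Average gradients across ranks with a single NCCL launch."""
    grads = [p.grad for p in params if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])  # one contiguous buffer
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)       # one launch instead of ~100
    flat.div_(world_size)
    offset = 0
    for g in grads:                                   # copy averaged values back in place
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```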

About 280 new lines, 9 changed lines, all gated by environment variables. With `NGRAM_ORACLE_PATH=""` and `TTT_OPTIMIZER=adamw`, runtime behavior matches the base.

## What I actually ran (Kaggle T4×2 NCCL DDP)

| Stage | Result |
|---|---|
| Build | Oracle 3.42 MB / 10M tokens / 2.5 s |
| Train | 8L/384d, 13.4M params, 172 steps in 180s, `world_size:2 grad_accum_steps:4` |
| Quantize + bundle | int6 + zstd-22, artifact 6.85 MB / 16 MB (neural 3.43 MB + oracle 3.42 MB + 16 B header) |
| Reload | `oracle:loaded from artifact orders=[1, 2, 3, 4, 5, 6, 7, 8]`, both ranks via `from_bytes` |
| TTT | SGD, 3,786 chunks, oracle in HedgeMixer experts, 7,078 s wall on T4×2 |
| Magic prefix check | `Header + blobs total: 7,181,197 == 7,181,197: True` |
| Exit code | 0 |

The 2.54 BPB from this run is a sanity check, not a competition number. 8L/384d trained for 180 seconds isn't going to land near 1.05 to 1.10. What it does prove is that the whole pipeline runs cleanly under DDP: HedgeMixer table updates, in-memory oracle reload, and the bucketed all-reduce path all work on more than one rank, which is what single-GPU testing can't show.

## H100×8 extrapolation

| Stage | Kaggle T4×2 (measured) | H100×8 (estimated, 11L/512d) |
|---|---|---|
| Training step | 1.05 s | 80 to 100 ms |
| TTT chunk | 1.87 s | 80 to 100 ms |
| Total wall | ~2 h | 13 to 17 min |

T4 to H100 single-card bf16 is roughly 15x, and scaling DDP from 2 to 8 GPUs gives about 3.3x in practice. That combined ~50x, halved for the roughly 2x larger competition model, nets a per-step gain of about 25x. At that rate the full pipeline fits inside the 10-minute train + 10-minute eval budget with room to spare.

## Negative results

| What I tried | Result | Why it didn't ship |
|---|---|---|
| Byte-level CTW (`ctw_prototype.py`) | Eval BPB 6.33 vs target <1.2; 21.3 MB compressed vs target <5 MB; 16.7K bytes/sec vs target >100K | 256-symbol alphabet at depth 8 has too many states. Killed in 2 days, redirected to FNV-hashed token-level oracle. |
| Inline complementary loss (`loss * weight.mean()`) | Mathematically not equivalent to per-token reweighting (see the sketch below the table) | Removed. Standalone `complementary_training_loss` is kept for reference; needs inside-graph integration to be correct. |
| `bi_counts[prev, targets] += 1.0` in HedgeMixer | Non-deterministic on duplicate indices, silent correctness bug | Replaced with `index_put_(..., accumulate=True)`. |
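
The middle row is easy to check numerically: the inline shortcut differs from per-token weighting by the covariance between losses and weights. The values below are made up, and the normalization inside `complementary_training_loss` may differ; the point is only that the two quantities are not equal.

```python
import torch

per_token_loss = torch.tensor([2.0, 0.5, 1.5])   # CE per token
weight = torch.tensor([0.1, 1.0, 0.4])           # complementary weights

inline = per_token_loss.mean() * weight.mean()   # what the removed inline code computed
weighted = (per_token_loss * weight).mean()      # per-token reweighting (one convention)
print(inline.item(), weighted.item())            # 0.6667 vs 0.4333 -- equal only when
                                                 # weights and losses are uncorrelated
```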

## Compliance

The frozen-oracle pattern was rejected once already in this cohort ([PR #924 ruling](https://github.com/openai/parameter-golf/issues/1017)). I designed this submission to hold up against [Issue #1017](https://github.com/openai/parameter-golf/issues/1017): training tokens only (`build_ngram_oracle.py` never reads `fineweb_val_*.bin`), deterministic build (fixed FNV-1a, fixed Laplace constant, no RNG), no eval-time data dependence (the oracle is read-only during train/TTT/eval), and bundled inside the 16 MB cap at 3.42 MB.

## Limitations

- No 8×H100 validation, no 3-seed mean, no competition-scale BPB number.
- Oracle build verified on a 10M-token slice (Kaggle) and a 100M-token shard (local). Full 80-shard build is extrapolated, not measured.
- `complementary_training_loss` is implemented but not wired into training. Per-token reweighting needs logits access from inside the compiled graph; the function is kept for that future integration.
- HedgeMixer's `bi_counts` is dense `vocab × vocab`, asserted for `vocab_size <= 2048`. SP4096 vocab would need a hashed bigram table.