.gitignore (34 changes: 33 additions & 1 deletion)
@@ -8,4 +8,36 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/

# Personal scratch and notes (not part of the submission)
.claude/
PARAMETER_GOLF_BATTLE_PLAN_*.md
chats.md
program.md
errors.json
best_sweep_config.json
test_oracle.bin
final_model.int6.ptz
final_model.pt

# Local working copies that duplicate the canonical files in records/
/train_gpt.py
/build_ngram_oracle.py
/run_h100.sh
/run_local_test.sh
/run_3seeds.sh
/sweep.py
/ctw_prototype.py
/ablation_colab.ipynb
/kaggle_run.ipynb

# Reference / scratch copies of other people's PRs and earlier forks
reference_pr*_train_gpt.py
train_gpt_pr*_*.py

# Tooling clones
autoresearch_ref/

# Older incomplete draft record submission, never finished
records/track_10min_16mb/2026-03-24_VR_GA_LeakyReLU_LegalTTT/

Large diffs are not rendered by default.

@@ -0,0 +1,54 @@
# Non-record: Frozen N-gram Oracle + HedgeMixer + SGD TTT

## Summary

A hybrid system that bundles a frozen multi-order n-gram oracle (built offline from FineWeb training tokens, int8 log-probabilities with zstd-22, 3.42 MB compressed on a 10M-token slice) into a single artifact alongside the neural model. The oracle plugs into the existing Hedge mixer at TTT/eval time as additional experts. The submission also includes the SGD TTT switch (PR #967, reported -0.041 BPB) and `LeakyReLU(0.75)²` (PR #977, reported -0.008 BPB) as ancillary changes.

This is a methodology submission, not a record claim. I didn't have 8×H100 access during the cohort. The pipeline runs end-to-end on Kaggle T4×2 NCCL DDP (8L/384d, 13.4M params, 172 training steps, 3,786 TTT chunks, 6.85 MB final artifact, exit 0). The README extrapolates wall-clock to H100×8 from those measurements (around 13 to 17 minutes for the full pipeline at 11L/512d).

## Key contributions

- A frozen multi-order n-gram oracle, packaged as part of the artifact. Standalone offline builder (`build_ngram_oracle.py`, 250 lines, NumPy only): exact unigram, exact bigram, FNV-1a-hashed orders 3 through 8 with bucket counts from 4096 down to 256. Built only from training tokens. Bundled inside the 16 MB cap, and designed to close the compliance gap that got PR #924 flagged.

- HedgeMixer extension. The existing 5-expert mixer (neural + online uni/bi/tri + decay cache) is extended to `5 + |oracle orders|` experts via a single multiplicative-weights update. With no oracle loaded, behavior matches the base.

- Single-artifact format with a 16-byte versioned header (4-byte magic, 1 version byte, 3 reserved, neural and oracle blob lengths). `oracle_len = 0` degrades cleanly to base behavior. Reload uses an in-memory `FrozenNgramOracle.from_bytes` classmethod, no per-rank temp files.

- SGD TTT as a configurable alternative to AdamW (PR #967). `LeakyReLU(0.75)²` configurable per PR #977. Both env-var-gated, both small reviewable diffs.

- Bug fixes. Bucketed `dist.all_reduce` in TTT replaces about 100 per-parameter NCCL launches with one. `index_put_(..., accumulate=True)` replaces a non-deterministic `bi_counts[prev, targets] += 1.0` in HedgeMixer table updates (sketched after this list). Inline `loss * weight.mean()` complementary scaling removed (mathematically not equivalent to per-token reweighting).
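
To make the second fix concrete, here is a minimal sketch of the deterministic bigram-count update, assuming `bi_counts` is a dense `[vocab, vocab]` float tensor and `prev`/`targets` are 1-D index tensors; names follow the bullet above, and the surrounding HedgeMixer code is not shown.

```python
import torch

vocab = 8
bi_counts = torch.zeros(vocab, vocab)
prev = torch.tensor([3, 3, 5])       # note the duplicate (3, 1) pair
targets = torch.tensor([1, 1, 2])

# Buggy form: with duplicate (prev, target) pairs the read-modify-write races on GPU,
# and repeated pairs are under-counted (1.0 instead of 2.0 here).
# bi_counts[prev, targets] += 1.0

# Deterministic form: accumulate=True sums every contribution for repeated indices.
bi_counts.index_put_((prev, targets),
                     torch.ones(prev.numel(), dtype=bi_counts.dtype),
                     accumulate=True)
assert bi_counts[3, 1] == 2.0 and bi_counts[5, 2] == 1.0
```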

## Negative results

- Byte-level CTW (`ctw_prototype.py`, depth 8, 262K hash buckets/depth, KT estimator; sketched after this list). 2M training bytes + 500K eval bytes from one FineWeb shard:
- Eval BPB: 6.33 (target < 1.2)
- Compressed: 21.31 MB (target < 5 MB)
- Throughput: 16,761 bytes/sec (target > 100K)
- Verdict: dominated by token-level n-grams at this vocab size. Token-level CTW is the natural follow-up.
- Inline complementary loss scaling. Multiplying scalar-mean CE by `weight.mean()` is not equivalent to per-token reweighting. Removed. Standalone function kept for future inside-graph integration.
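
For reference, the per-node KT estimator the prototype used is tiny; this is the textbook form with the Dirichlet(1/2) prior, which reduces to (n_s + 1/2)/(N + 1) for a binary alphabet. It is shown here only to document what was tried.

```python
def kt_prob(count_sym: int, count_total: int, alphabet_size: int = 2) -> float:
    """Krichevsky-Trofimov sequential estimate of P(next symbol = s) at one context node.

    count_sym:   occurrences of symbol s in this context so far
    count_total: total symbols seen in this context so far
    The byte-level prototype uses alphabet_size=256; the failure mode was the
    context-state explosion at depth 8, not the estimator itself.
    """
    return (count_sym + 0.5) / (count_total + alphabet_size / 2.0)
```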

## Limitations

- No 8×H100 validation, no 3-seed mean, no competition-scale BPB number from me.
- Oracle build verified on a 100M-token shard (32 s, 4.66 MB) and a 10M-token slice (2.5 s, 3.42 MB). Full 80-shard scan time and final compressed size are extrapolated.
- Complementary training loss is implemented but currently disabled. Inside-graph integration is required for correct per-token weighting; the inline version was wrong (now removed).
- All BPB numbers cited from other PRs (#803: 0.4416, #834: 0.1663, #924: 0.0280, #967: -0.041, #977: -0.008) are from those PRs' authors, not reproduced here.

## Test plan

- [x] Local end-to-end run on RTX 4060 (4L/256d toy, 50 steps, 7.6 MB artifact, exit 0)
- [x] Kaggle T4×2 NCCL DDP run end-to-end (8L/384d, 172 steps, 3,786 TTT chunks, 6.85 MB artifact, exit 0)
- [x] FNV-1a NumPy/Torch equivalence test passes (1000 samples, ctx_len=5, buckets=4096)
- [x] Magic-prefix artifact roundtrip verified (`Header + blobs total: 7,181,197 == 7,181,197: True`)
- [x] All `train_gpt.py` and `build_ngram_oracle.py` files syntax-checked
- [ ] Not done: 8×H100 3-seed validation at competition spec
- [ ] Not done: Full 80-shard oracle build
- [ ] Not done: α-sweep for complementary loss after inside-graph integration
- [ ] Not done: Per-order Hedge weight logging

## Why submit this

I joined late and didn't get 8×H100 access. Rather than fabricate numbers or skip submitting, I'm offering this as a methodology contribution: a clean, modular, reviewable design for hybrid frozen-oracle + neural systems. The README walks through the design, the negative results, the explicit limitations, and the concrete plan for what I'd do with H100 access.

- [README.md](./README.md): technical writeup (~1,300 words) covering the three components, the validated DDP run, the H100 extrapolation, compliance, limitations, related April 2026 references, and reproduction instructions.
- [JOURNEY.md](./JOURNEY.md): process journal documenting the 5-week research arc, the 5 GitHub sweeps, the 6 specialist agents consulted, the strategic pivots, the dead ends with measured numbers, and the day-of-deadline polish loop.
@@ -0,0 +1,68 @@
# Frozen N-gram Oracle + HedgeMixer + SGD TTT (non-record submission)

**Author:** Dhruv Puri ([@dhruvpuri](https://github.com/dhruvpuri)), 2026-04-30

A hybrid n-gram + neural language model for OpenAI's Parameter Golf 2026 (16 MB artifact, 10 minutes training on 8×H100, scored by bits-per-byte on FineWeb val).

This is a methodology submission, not a record claim. I didn't have 8×H100 access, so the full-scale numbers (11L/512d) aren't here. The pipeline is validated end-to-end on Kaggle T4×2 with NCCL DDP. See [JOURNEY.md](./JOURNEY.md) for the research arc, the agent-assisted review loop, and reproduction instructions.

## What's in the patch

Three pieces that share an artifact, plus three smaller fixes.

**1. A frozen n-gram oracle.** [`build_ngram_oracle.py`](./build_ngram_oracle.py), 250 lines, NumPy only. Scans FineWeb training tokens once offline. Builds orders 1 through 8 as int8 log-probabilities with Laplace smoothing. Orders 1 and 2 are exact. Orders 3 through 8 use FNV-1a hashed contexts with bucket counts going from 4096 down to 256. zstd-22 compressed. A NumPy/Torch FNV-1a equivalence test runs before every build. If the offline NumPy hash and the online Torch hash disagree on any sample, the build aborts.
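
A minimal NumPy sketch of the two core transforms described above: FNV-1a context hashing and counts-to-int8 log-probabilities with Laplace smoothing. The 64-bit FNV constants, the word-at-a-time fold, and the `scale` factor are illustrative assumptions, not necessarily what `build_ngram_oracle.py` does.

```python
import numpy as np

FNV_OFFSET = np.uint64(0xCBF29CE484222325)  # 64-bit FNV-1a offset basis
FNV_PRIME = np.uint64(0x100000001B3)        # 64-bit FNV-1a prime

def fnv1a_buckets(contexts: np.ndarray, n_buckets: int) -> np.ndarray:
    """Hash each context row (token ids, shape [N, order-1]) to a bucket id."""
    h = np.full(contexts.shape[0], FNV_OFFSET, dtype=np.uint64)
    for col in range(contexts.shape[1]):
        h = (h ^ contexts[:, col].astype(np.uint64)) * FNV_PRIME  # wraps mod 2**64
    return (h % np.uint64(n_buckets)).astype(np.int64)

def counts_to_int8_logprobs(counts: np.ndarray, alpha: float = 1.0, scale: float = 8.0) -> np.ndarray:
    """Laplace-smooth a [buckets, vocab] count table and quantize log-probs to int8."""
    smoothed = counts + alpha
    logp = np.log(smoothed / smoothed.sum(axis=-1, keepdims=True))  # <= 0 everywhere
    return np.clip(np.round(logp * scale), -127, 0).astype(np.int8)
```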

**2. HedgeMixer with oracle experts.** The base stack already had a 5-expert online ensemble (neural, online uni/bi/tri, decay cache). I added one expert per loaded oracle order, taking the count to 13. Mixing happens in log-space, with a multiplicative-weights update on per-token NLL. A warm prior `log_w[0] = 2.0` keeps short eval streams from being dominated by Hedge convergence noise. With no oracle loaded the mixer reduces to the original 5-expert form.
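
A sketch of the mixing and update rule. `expert_logprobs` stacks each expert's next-token log-probabilities and `eta` is the Hedge learning rate; both names and the exact normalization are assumptions, so treat this as the shape of the idea rather than the in-tree code.

```python
import torch

def hedge_step(log_w: torch.Tensor,            # [K] unnormalized log expert weights
               expert_logprobs: torch.Tensor,  # [K, vocab] per-expert log P(next token)
               target: int,
               eta: float = 0.1):
    """One log-space mixture prediction plus a multiplicative-weights update."""
    # Mixture prediction: log sum_k softmax(log_w)_k * p_k(.)
    log_mix = torch.logsumexp(torch.log_softmax(log_w, dim=0)[:, None] + expert_logprobs, dim=0)
    mixture_nll = -log_mix[target]
    # Hedge: w_k *= exp(-eta * per-token NLL of expert k), done directly in log-space.
    per_expert_nll = -expert_logprobs[:, target]
    return log_w - eta * per_expert_nll, mixture_nll

log_w = torch.zeros(13)  # 5 base experts + 8 oracle orders
log_w[0] = 2.0           # warm prior toward the neural expert
```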

**3. A magic-prefixed versioned artifact format.** 16-byte header (4-byte magic `0x50474152`, 1 version byte, 3 reserved, neural and oracle blob lengths), then the neural blob (int6 per-row + zstd-22) and the oracle blob. One file under the 16 MB cap. The version byte means future schema changes fail loudly instead of silently mis-slicing. Reload uses an in-memory `FrozenNgramOracle.from_bytes` classmethod, no per-rank temp files.
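
A sketch of the header with `struct`; the field order follows the description above, while the big-endian byte order and 4-byte length fields are assumptions.

```python
import struct

MAGIC = 0x50474152            # b"PGAR" when packed big-endian
HEADER_FMT = ">I B 3x I I"    # magic, version, 3 reserved bytes, neural_len, oracle_len
assert struct.calcsize(HEADER_FMT) == 16

def bundle(neural_blob: bytes, oracle_blob: bytes, version: int = 1) -> bytes:
    header = struct.pack(HEADER_FMT, MAGIC, version, len(neural_blob), len(oracle_blob))
    return header + neural_blob + oracle_blob

def unbundle(buf: bytes):
    magic, version, neural_len, oracle_len = struct.unpack_from(HEADER_FMT, buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic: not a bundled artifact")
    neural = buf[16:16 + neural_len]
    oracle = buf[16 + neural_len:16 + neural_len + oracle_len]  # empty when oracle_len == 0
    return version, neural, oracle
```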

Plus, in `train_gpt.py`:

- A `TTT_OPTIMIZER=sgd` switch (lr=0.002, momentum=0.9), matching [PR #967](https://github.com/openai/parameter-golf/pull/967)'s reported -0.041 BPB.
- `LEAKY_RELU_SLOPE` configurable. Setting 0.75 matches [PR #977](https://github.com/openai/parameter-golf/pull/977)'s -0.008 BPB.
- Bucketed `dist.all_reduce` in TTT, replacing about 100 per-parameter NCCL launches per micro-step with one (sketched just below).
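
A sketch of the bucketed path, assuming all TTT-updated gradients share one dtype so they can be flattened into a single buffer; the actual bucketing in the patch may differ.

```python
import torch
import torch.distributed as dist

def bucketed_grad_allreduce(params, world_size: int):
    """Average gradients across ranks with a single NCCL launch."""
    grads = [p.grad for p in params if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])  # one contiguous buffer
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)       # one launch instead of ~100
    flat.div_(world_size)
    offset = 0
    for g in grads:                                   # copy averaged values back in place
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```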

About 280 new lines, 9 changed lines, all gated by environment variables. With `NGRAM_ORACLE_PATH=""` and `TTT_OPTIMIZER=adamw`, runtime behavior matches the base.

## What I actually ran (Kaggle T4×2 NCCL DDP)

| Stage | Result |
|---|---|
| Build | Oracle 3.42 MB / 10M tokens / 2.5 s |
| Train | 8L/384d, 13.4M params, 172 steps in 180s, `world_size:2 grad_accum_steps:4` |
| Quantize + bundle | int6 + zstd-22, artifact 6.85 MB / 16 MB (neural 3.43 MB + oracle 3.42 MB + 16 B header) |
| Reload | `oracle:loaded from artifact orders=[1, 2, 3, 4, 5, 6, 7, 8]`, both ranks via `from_bytes` |
| TTT | SGD, 3,786 chunks, oracle in HedgeMixer experts, 7,078 s wall on T4×2 |
| Magic prefix check | `Header + blobs total: 7,181,197 == 7,181,197: True` |
| Exit code | 0 |

The 2.54 BPB from this run is a sanity check, not a competition number. 8L/384d trained for 180 seconds isn't going to land near 1.05 to 1.10. What it does prove is that the whole pipeline runs cleanly under DDP: HedgeMixer table updates, in-memory oracle reload, and the bucketed all-reduce path all work on more than one rank, which is what single-GPU testing can't show.

## H100×8 extrapolation

| Stage | Kaggle T4×2 (measured) | H100×8 (estimated, 11L/512d) |
|---|---|---|
| Training step | 1.05 s | 80 to 100 ms |
| TTT chunk | 1.87 s | 80 to 100 ms |
| Total wall | ~2 h | 13 to 17 min |

T4 to H100 single-card bf16 is roughly 15x, and scaling DDP from 2 to 8 GPUs gives about 3.3x in practice. That combined ~50x, halved for the roughly 2x larger competition model, nets a per-step gain of about 25x. At that rate the full pipeline fits inside the 10-minute train + 10-minute eval budget with room to spare.

## Negative results

| What I tried | Result | Why it didn't ship |
|---|---|---|
| Byte-level CTW (`ctw_prototype.py`) | Eval BPB 6.33 vs target <1.2; 21.3 MB compressed vs target <5 MB; 16.7K bytes/sec vs target >100K | 256-symbol alphabet at depth 8 has too many states. Killed in 2 days, redirected to FNV-hashed token-level oracle. |
| Inline complementary loss (`loss * weight.mean()`) | Mathematically not equivalent to per-token reweighting (see the sketch below the table) | Removed. Standalone `complementary_training_loss` is kept for reference; needs inside-graph integration to be correct. |
| `bi_counts[prev, targets] += 1.0` in HedgeMixer | Non-deterministic on duplicate indices, silent correctness bug | Replaced with `index_put_(..., accumulate=True)`. |
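
The middle row is easy to check numerically: the inline shortcut differs from per-token weighting by the covariance between losses and weights. The values below are made up, and the normalization inside `complementary_training_loss` may differ; the point is only that the two quantities are not equal.

```python
import torch

per_token_loss = torch.tensor([2.0, 0.5, 1.5])   # CE per token
weight = torch.tensor([0.1, 1.0, 0.4])           # complementary weights

inline = per_token_loss.mean() * weight.mean()   # what the removed inline code computed
weighted = (per_token_loss * weight).mean()      # per-token reweighting (one convention)
print(inline.item(), weighted.item())            # 0.6667 vs 0.4333 -- equal only when
                                                 # weights and losses are uncorrelated
```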

## Compliance

The frozen-oracle pattern was rejected once already in this cohort ([PR #924 ruling](https://github.com/openai/parameter-golf/issues/1017)). I designed this submission to hold up against [Issue #1017](https://github.com/openai/parameter-golf/issues/1017): training tokens only (`build_ngram_oracle.py` never reads `fineweb_val_*.bin`), deterministic build (fixed FNV-1a, fixed Laplace constant, no RNG), no eval-time data dependence (the oracle is read-only during train/TTT/eval), and bundled inside the 16 MB cap at 3.42 MB.

## Limitations

- No 8×H100 validation, no 3-seed mean, no competition-scale BPB number.
- Oracle build verified on a 10M-token slice (Kaggle) and a 100M-token shard (local). Full 80-shard build is extrapolated, not measured.
- `complementary_training_loss` is implemented but not wired into training. Per-token reweighting needs logits access from inside the compiled graph; the function is kept for that future integration.
- HedgeMixer's `bi_counts` is dense `vocab × vocab`, asserted for `vocab_size <= 2048`. SP4096 vocab would need a hashed bigram table.