
Non-record: Frozen N-gram Oracle + HedgeMixer + SGD TTT — methodology, validated on Kaggle T4×2 DDP #1982

dhruvpuri wants to merge 1 commit into openai:main from dhruvpuri:submission/frozen-oracle-hedgemixer-sgd-ttt

dhruvpuri commented Apr 30, 2026

Non-record: Frozen N-gram Oracle + HedgeMixer + SGD TTT

Summary

This submission describes a hybrid system that bundles a frozen multi-order n-gram oracle (built offline from FineWeb training tokens; int8 log-probabilities compressed with zstd level 22; 3.42 MB on a 10M-token slice) into a single artifact alongside the neural model. The oracle plugs into the existing Hedge mixer at TTT/eval time as additional experts. The submission also includes the SGD TTT switch (PR #967, reported -0.041 BPB) and LeakyReLU(0.75)² (PR #977, reported -0.008 BPB) as ancillary changes.

This is a methodology submission, not a record claim. I didn't have 8×H100 access during the cohort. The pipeline runs end-to-end on Kaggle T4×2 NCCL DDP (8L/384d, 13.4M params, 172 training steps, 3,786 TTT chunks, 6.85 MB final artifact, exit 0). The README extrapolates wall-clock to H100×8 from those measurements (around 13 to 17 minutes for the full pipeline at 11L/512d).

Key contributions

  • A frozen multi-order n-gram oracle, packaged as part of the artifact. Standalone offline builder (build_ngram_oracle.py, 250 lines, NumPy only): exact unigram, exact bigram, FNV-1a-hashed orders 3 through 8 with bucket counts from 4096 down to 256 (a hashing and quantization sketch follows this list). Built only from training tokens. Bundled inside the 16 MB cap, and designed to close the compliance gap that got PR Order-16 Frozen N-gram Oracle + Score-First TTT (0.02801 BPB) #924 flagged.

  • HedgeMixer extension. The existing 5-expert mixer (neural + online uni/bi/tri + decay cache) is extended to 5 + |oracle orders| experts via a single multiplicative-weights update (an update sketch follows this list). With no oracle loaded, behavior matches the base.

  • Single-artifact format with a 16-byte versioned header: 4-byte magic, 1 version byte, 3 reserved bytes, then the neural and oracle blob lengths (a pack/unpack sketch follows this list). oracle_len = 0 degrades cleanly to base behavior. Reload uses an in-memory FrozenNgramOracle.from_bytes classmethod, no per-rank temp files.

  • SGD TTT as a configurable alternative to AdamW (PR Record: 1.0450 BPB — SGD TTT + HedgeMixer with Per-Layer LR Groups #967). LeakyReLU(0.75)² configurable per PR LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean) #977. Both env-var-gated, both small reviewable diffs.

  • Bug fixes. Bucketed dist.all_reduce in TTT replaces about 100 per-parameter NCCL launches with one. index_put_(..., accumulate=True) replaces a non-deterministic bi_counts[prev, targets] += 1.0 in HedgeMixer table updates. Inline loss * weight.mean() complementary scaling removed (mathematically not equivalent to per-token reweighting). Sketches of both fixes follow this list.
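
To make the oracle concrete, here is a minimal sketch of the two pieces described above: FNV-1a bucketing of a token context and int8 quantization of smoothed log-probabilities. The function names, the 4-byte little-endian token encoding, and the quantization scale are illustrative assumptions, not the actual build_ngram_oracle.py API.

```python
import numpy as np

FNV_OFFSET = 0xCBF29CE484222325   # 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3         # 64-bit FNV-1a prime
MASK64 = (1 << 64) - 1

def fnv1a_bucket(context_tokens, n_buckets):
    """Hash an n-gram context (a tuple of token ids) into a bucket index."""
    h = FNV_OFFSET
    for tok in context_tokens:
        for byte in int(tok).to_bytes(4, "little"):   # assumed token encoding
            h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h % n_buckets

def quantize_logprobs(counts, scale=16.0):
    """Add-one smooth one bucket's counts, clamp log-probs to [-8, 0], store as int8."""
    probs = (counts + 1.0) / (counts.sum() + counts.size)
    logp = np.clip(np.log(probs), -8.0, 0.0)
    return np.round(logp * scale).astype(np.int8)     # values in [-128, 0], zstd-friendly
```

Higher orders use progressively fewer buckets (4096 down to 256), which is what keeps the int8 tables small enough for zstd-22 to compress the whole oracle to a few megabytes.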
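
For the mixer extension, the sketch below shows a log-loss Hedge (multiplicative-weights) step over per-expert next-token distributions. The shapes, the eta parameter, and the warm prior on the neural expert's log-weight are assumptions taken from the description here, not the repo's exact update.

```python
import torch

def hedge_step(expert_probs, target, log_w, eta=1.0):
    """One multiplicative-weights update over expert predictions.

    expert_probs: (n_experts, vocab) distributions from the neural model, the
                  online uni/bi/tri tables, the decay cache, and one row per
                  frozen oracle order.
    target:       the token id that was actually observed.
    log_w:        (n_experts,) running log-weights; e.g. log_w[0] = 2.0 as a
                  warm prior so the neural expert starts dominant.
    """
    w = torch.softmax(log_w, dim=0)
    mixed = w @ expert_probs                       # mixture prediction for this step
    # Hedge with log-loss: w_i <- w_i * p_i(target)**eta, applied in log space.
    log_w = log_w + eta * expert_probs[:, target].clamp_min(1e-12).log()
    return mixed, log_w
```

With no oracle loaded, the oracle rows are simply absent and the update reduces to the existing 5-expert behavior.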
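
The 16-byte header is fully determined by the description above; a possible pack/unpack sketch is below, assuming little-endian 4-byte lengths and an illustrative magic value (the real constants live in the submission code).

```python
import struct

MAGIC = b"NGOR"                      # illustrative; the actual 4-byte magic may differ
HEADER = struct.Struct("<4sB3xII")   # magic, version, 3 reserved bytes, neural_len, oracle_len

def pack_artifact(neural_blob: bytes, oracle_blob: bytes = b"", version: int = 1) -> bytes:
    header = HEADER.pack(MAGIC, version, len(neural_blob), len(oracle_blob))
    return header + neural_blob + oracle_blob

def unpack_artifact(buf: bytes):
    magic, version, n_len, o_len = HEADER.unpack_from(buf, 0)
    assert magic == MAGIC, "not a bundled artifact"
    off = HEADER.size                # 16 bytes
    neural = buf[off:off + n_len]
    oracle = buf[off + n_len:off + n_len + o_len]
    return version, neural, oracle   # oracle == b"" when oracle_len = 0
```

An empty oracle slice is the clean-degradation path; otherwise the bytes are handed straight to FrozenNgramOracle.from_bytes, so reload never touches per-rank temp files.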
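
Both bug-fix patterns are small enough to show. The snippet below is a self-contained illustration rather than the repo's diff: the first part demonstrates why accumulate=True matters when (prev, target) pairs repeat within a chunk, and the second is one way to bucket TTT gradient all-reduces into a single NCCL launch.

```python
import torch
import torch.distributed as dist

# 1) Deterministic scatter-add: duplicate (prev, target) pairs are all counted.
bi_counts = torch.zeros(8, 8)
prev, targets = torch.tensor([1, 1, 3]), torch.tensor([2, 2, 4])
bi_counts.index_put_((prev, targets), torch.ones(3), accumulate=True)
assert bi_counts[1, 2] == 2.0   # `bi_counts[prev, targets] += 1.0` would leave 1.0 here

# 2) One collective instead of ~100: flatten grads, all-reduce once, scatter back.
def bucketed_grad_all_reduce(params):
    grads = [p.grad for p in params if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat)                          # single NCCL launch (SUM)
    flat /= dist.get_world_size()                  # average across ranks
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```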

Negative results

  • Byte-level CTW (ctw_prototype.py: depth 8, 262K hash buckets per depth, KT estimator; an estimator sketch follows this list). Trained on 2M bytes and evaluated on 500K bytes from one FineWeb shard:
    • Eval BPB: 6.33 (target < 1.2)
    • Compressed: 21.31 MB (target < 5 MB)
    • Throughput: 16,761 bytes/sec (target > 100K)
    • Verdict: dominated by token-level n-grams at this vocab size. Token-level CTW is the natural follow-up.
  • Inline complementary loss scaling. Multiplying the scalar-mean CE by weight.mean() is not equivalent to per-token reweighting (a numeric check follows this list). Removed. The standalone function is kept for future inside-graph integration.
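
On the CTW prototype, the per-node estimator it names is the standard Krichevsky-Trofimov rule. The sketch below is the 256-symbol form, (n_s + 1/2) / (n + 128); the prototype's actual variant (bitwise vs. byte-wise KT) and its hashing details may differ.

```python
import numpy as np

def kt_byte_prob(counts: np.ndarray, symbol: int) -> float:
    """KT estimate for a 256-symbol alphabet at one context node."""
    return (counts[symbol] + 0.5) / (counts.sum() + 128.0)

# One hashed depth-d context node: predict the next byte, then update its count.
counts = np.zeros(256, dtype=np.int64)
p = kt_byte_prob(counts, 65)   # 0.5 / 128 = 1/256 before any observations
counts[65] += 1
```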
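
The loss-scaling point is quick to verify numerically: the mean of the products equals the product of the means only when the per-token losses and weights are uncorrelated. A throwaway check with made-up numbers:

```python
import torch

loss = torch.tensor([1.0, 3.0])      # per-token CE losses (illustrative)
w    = torch.tensor([0.9, 0.1])      # per-token weights (illustrative)

inline  = loss.mean() * w.mean()     # 2.0 * 0.5 = 1.0  (the removed shortcut)
per_tok = (loss * w).mean()          # (0.9 + 0.3) / 2 = 0.6  (true per-token reweighting)
```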

Limitations

Test plan

  • Local end-to-end run on RTX 4060 (4L/256d toy, 50 steps, 7.6 MB artifact, exit 0)
  • Kaggle T4×2 NCCL DDP run end-to-end (8L/384d, 172 steps, 3,786 TTT chunks, 6.85 MB artifact, exit 0)
  • FNV-1a NumPy/Torch equivalence test passes (1000 samples, ctx_len=5, buckets=4096)
  • Magic-prefix artifact roundtrip verified (Header + blobs total: 7,181,197 == 7,181,197: True)
  • All train_gpt.py and build_ngram_oracle.py files syntax-checked
  • Not done: 8×H100 3-seed validation at competition spec
  • Not done: Full 80-shard oracle build
  • Not done: α-sweep for complementary loss after inside-graph integration
  • Not done: Per-order Hedge weight logging

Why submit this

I joined late and didn't get 8×H100 access. Rather than fabricate numbers or skip submitting, I'm offering this as a methodology contribution: a clean, modular, reviewable design for hybrid frozen-oracle + neural systems. The README walks through the design, the negative results, the explicit limitations, and the concrete plan for what I'd do with H100 access.

  • README.md: technical writeup (~1,300 words) covering the three components, the validated DDP run, the H100 extrapolation, compliance, limitations, related April 2026 references, and reproduction instructions.
  • JOURNEY.md: process journal documenting the 5-week research arc, the 5 GitHub sweeps, the 6 specialist agents consulted, the strategic pivots, the dead ends with measured numbers, and the day-of-deadline polish loop.

Submission directory: records/track_non_record_16mb/2026-04-30_FrozenOracle_ComplementaryTraining_HedgeMixer/

Commit message

Three new components, env-var-gated, ~280 lines of new code:
- Frozen multi-order n-gram oracle (orders 1-8, FNV-1a hashed, int8
  log-probs + zstd-22, 3.42 MB compressed)
- HedgeMixer extended with oracle experts (5 -> 13 experts, warm prior
  log_w[0]=2.0)
- Magic-prefixed versioned binary artifact format bundling neural+oracle
  in a single 6.85 MB file (designed for Issue openai#1017 compliance)

Plus: SGD TTT switch (PR openai#967), configurable LeakyReLU slope (PR openai#977),
bucketed dist.all_reduce in TTT (~100 NCCL launches -> 1).

Bug fixes: non-deterministic index_put in HedgeMixer table updates,
TOCTOU race in oracle reload, mathematically incorrect inline
complementary loss scaling.

Validated end-to-end on Kaggle T4x2 NCCL DDP (8L/384d, 13.4M params,
172 train steps, 3786 TTT chunks, 6.85 MB artifact, exit 0).
8xH100 validation not run (no cohort access). README extrapolates
H100x8 wall-clock to ~13-17 min for the full pipeline at 11L/512d.

Negative results published: byte-level CTW (BPB 6.33 / 21.3 MB / 16K
bytes/s, all 3 thresholds missed); inline complementary loss bug;
mktemp TOCTOU; per-parameter dist.all_reduce churn.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>