Non-record: Frozen N-gram Oracle + HedgeMixer + SGD TTT — methodology, validated on Kaggle T4×2 DDP #1982
Open
dhruvpuri wants to merge 1 commit into openai:main from
Conversation
Submission directory: `records/track_non_record_16mb/2026-04-30_FrozenOracle_ComplementaryTraining_HedgeMixer/`

Three new components, env-var-gated, ~280 lines of new code:

- Frozen multi-order n-gram oracle (orders 1-8, FNV-1a hashed, int8 log-probs + zstd-22, 3.42 MB compressed)
- HedgeMixer extended with oracle experts (5 -> 13 experts, warm prior log_w[0]=2.0)
- Magic-prefixed versioned binary artifact format bundling neural+oracle in a single 6.85 MB file (designed for Issue openai#1017 compliance)

Plus: SGD TTT switch (PR openai#967), configurable LeakyReLU slope (PR openai#977), bucketed dist.all_reduce in TTT (~100 NCCL launches -> 1).

Bug fixes: non-deterministic index_put in HedgeMixer table updates, TOCTOU race in oracle reload, mathematically incorrect inline complementary loss scaling.

Validated end-to-end on Kaggle T4x2 NCCL DDP (8L/384d, 13.4M params, 172 train steps, 3786 TTT chunks, 6.85 MB artifact, exit 0). 8xH100 validation not run (no cohort access). README extrapolates H100x8 wall-clock to ~13-17 min for the full pipeline at 11L/512d.

Negative results published: byte-level CTW (BPB 6.33 / 21.3 MB / 16K tok/s, all 3 thresholds missed); inline complementary loss bug; mktemp TOCTOU; per-parameter dist.all_reduce churn.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Non-record: Frozen N-gram Oracle + HedgeMixer + SGD TTT
Summary
A hybrid system that bundles a frozen multi-order n-gram oracle (built offline from FineWeb training tokens, int8 log-probabilities with zstd-22, 3.42 MB compressed on a 10M-token slice) into a single artifact alongside the neural model. The oracle plugs into the existing Hedge mixer at TTT/eval time as additional experts. The submission also includes the SGD TTT switch (PR #967, reported -0.041 BPB) and `LeakyReLU(0.75)²` (PR #977, reported -0.008 BPB) as ancillary changes.

This is a methodology submission, not a record claim. I didn't have 8×H100 access during the cohort. The pipeline runs end-to-end on Kaggle T4×2 NCCL DDP (8L/384d, 13.4M params, 172 training steps, 3,786 TTT chunks, 6.85 MB final artifact, exit 0). The README extrapolates wall-clock to H100×8 from those measurements (around 13 to 17 minutes for the full pipeline at 11L/512d).
Key contributions
- A frozen multi-order n-gram oracle, packaged as part of the artifact. Standalone offline builder (`build_ngram_oracle.py`, 250 lines, NumPy only): exact unigram, exact bigram, FNV-1a-hashed orders 3 through 8 with bucket counts from 4096 down to 256 (hashing and quantization sketched after this list). Built only from training tokens. Bundled inside the 16 MB cap, designed to address the compliance gap that flagged PR #924 ("Order-16 Frozen N-gram Oracle + Score-First TTT (0.02801 BPB)").
- HedgeMixer extension. The existing 5-expert mixer (neural + online uni/bi/tri + decay cache) is extended to `5 + |oracle orders|` experts via a single multiplicative-weights update (sketched below). With no oracle loaded, behavior matches the base.
- Single-artifact format with a 16-byte versioned header: 4-byte magic, 1 version byte, 3 reserved, then the neural and oracle blob lengths (sketched below). `oracle_len = 0` degrades cleanly to base behavior. Reload uses an in-memory `FrozenNgramOracle.from_bytes` classmethod, no per-rank temp files.
- SGD TTT as a configurable alternative to AdamW (PR #967, "Record: 1.0450 BPB — SGD TTT + HedgeMixer with Per-Layer LR Groups").
- `LeakyReLU(0.75)²`, configurable per PR #977 ("LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean)"). Both env-var-gated, both small reviewable diffs.
- Bug fixes (both sketched after this list). Bucketed `dist.all_reduce` in TTT replaces about 100 per-parameter NCCL launches with one. `index_put_(..., accumulate=True)` replaces a non-deterministic `bi_counts[prev, targets] += 1.0` in HedgeMixer table updates. Inline `loss * weight.mean()` complementary scaling removed (mathematically not equivalent to per-token reweighting).
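A minimal sketch of the oracle's hashed-order path, assuming byte-wise FNV-1a over token ids and a fixed [-16, 0] int8 quantization grid. Only the FNV-1a constants and the orders/bucket-counts framing come from the PR; the function names and quantization range are illustrative:

```python
import numpy as np

FNV_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # standard 64-bit FNV-1a prime

def fnv1a_bucket(context_tokens, n_buckets):
    """Hash an n-gram context (tuple of token ids) into a table bucket."""
    h = FNV_OFFSET
    for tok in context_tokens:
        for byte in int(tok).to_bytes(4, "little"):
            h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h % n_buckets

def quantize_logprobs(logp, lo=-16.0):
    """Quantize float log-probs to int8 on a fixed [lo, 0] grid (0 -> 0, lo -> 127)."""
    return np.round(np.clip(logp, lo, 0.0) / lo * 127.0).astype(np.int8)

def dequantize_logprobs(q, lo=-16.0):
    """Invert quantize_logprobs up to the grid resolution."""
    return q.astype(np.float32) / 127.0 * lo
```

Per the bullet above, each hashed order (3 through 8) gets its own table, with bucket counts running from 4096 down to 256.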
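A sketch of the multiplicative-weights update behind the HedgeMixer extension. The warm prior `log_w[0] = 2.0` on the neural expert matches the commit message; the learning rate `eta`, the class shape, and the exact update form are assumptions:

```python
import torch

class HedgeSketch:
    """Hedge / multiplicative-weights mixture over expert next-token distributions."""

    def __init__(self, n_experts, eta=1.0, neural_prior=2.0):
        self.log_w = torch.zeros(n_experts)
        self.log_w[0] = neural_prior  # warm prior favoring the neural expert
        self.eta = eta

    def mix(self, expert_probs):
        """expert_probs: (n_experts, vocab) rows of next-token probabilities."""
        w = torch.softmax(self.log_w, dim=0)
        return (w[:, None] * expert_probs).sum(dim=0)

    def update(self, expert_probs, target):
        """Reward each expert by its log-likelihood of the observed token."""
        self.log_w += self.eta * torch.log(expert_probs[:, target] + 1e-12)
```

With oracle experts appended, `n_experts` goes from 5 to `5 + |oracle orders|` (13 in the commit message); with no oracle loaded, the extra rows never exist and base behavior is unchanged.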
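A sketch of the single-artifact format, assuming little-endian uint32 blob lengths (4-byte magic + 1 version byte + 3 reserved + 4 + 4 = 16 bytes, matching the header size above). The magic value `NGOR` and both function names are hypothetical:

```python
import struct

MAGIC = b"NGOR"                      # hypothetical 4-byte magic
HEADER = struct.Struct("<4sB3xII")   # magic, version, 3 reserved pad bytes, two lengths

def pack_artifact(neural_blob: bytes, oracle_blob: bytes, version: int = 1) -> bytes:
    header = HEADER.pack(MAGIC, version, len(neural_blob), len(oracle_blob))
    return header + neural_blob + oracle_blob

def unpack_artifact(data: bytes):
    magic, version, neural_len, oracle_len = HEADER.unpack_from(data, 0)
    assert magic == MAGIC, "not a bundled artifact"
    off = HEADER.size
    neural = data[off:off + neural_len]
    oracle = data[off + neural_len:off + neural_len + oracle_len]
    # oracle_len == 0 yields empty bytes here, i.e. clean degradation to base behavior
    return version, neural, oracle
```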
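A minimal reproduction of the determinism bug fixed in the HedgeMixer table update: advanced-indexing `+=` collapses duplicate index pairs (and is nondeterministic on CUDA), while `index_put_` with `accumulate=True` sums them. Shapes and names here are illustrative:

```python
import torch

bi_counts = torch.zeros(256, 256)
prev = torch.tensor([3, 3, 7])      # note the duplicated (3, 5) pair
targets = torch.tensor([5, 5, 9])

# Buggy: bi_counts[prev, targets] += 1.0
# Duplicate pairs are applied only once (and nondeterministically on GPU),
# so bi_counts[3, 5] would end up 1.0 instead of 2.0.

bi_counts.index_put_((prev, targets),
                     torch.ones_like(prev, dtype=bi_counts.dtype),
                     accumulate=True)
assert bi_counts[3, 5] == 2.0  # duplicates are summed deterministically
```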
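And a sketch of the bucketed TTT all-reduce: concatenate every gradient into one flat buffer, issue a single NCCL call, and scatter the averaged values back. The function name and copy-back loop are assumptions about the implementation:

```python
import torch
import torch.distributed as dist

def bucketed_allreduce_grads(params, world_size):
    """Replace ~100 per-parameter all_reduce launches with a single one."""
    grads = [p.grad for p in params if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])  # one contiguous bucket
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)       # one NCCL launch
    flat.div_(world_size)                             # SUM -> mean
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))   # scatter averaged grads back
        offset += n
```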
Negative results

- Byte-level CTW prototype (`ctw_prototype.py`, depth 8, 262K hash buckets/depth, KT estimator), built on 2M training bytes + 500K eval bytes from one FineWeb shard: BPB 6.33, a 21.3 MB artifact, ~16K tok/s; all three thresholds missed.
- Inline complementary loss: scaling the mean loss by `weight.mean()` is not equivalent to per-token reweighting (see the check below). Removed. Standalone function kept for future inside-graph integration.
- `mktemp` TOCTOU race in the original oracle reload path, replaced by the in-memory `FrozenNgramOracle.from_bytes` reload.
- Per-parameter `dist.all_reduce` churn in TTT (~100 NCCL launches per step), replaced by one bucketed all-reduce.
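A two-line numeric check of the inequivalence above, with made-up loss and weight values:

```python
import torch

loss = torch.tensor([1.0, 4.0])
weight = torch.tensor([1.0, 0.0])

inline = loss.mean() * weight.mean()  # 2.5 * 0.5 = 1.25
correct = (loss * weight).mean()      # (1.0 + 0.0) / 2 = 0.5
assert not torch.isclose(inline, correct)
```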
Limitations

- No 8×H100 validation: every measurement above comes from Kaggle T4×2, and the H100×8 wall-clock numbers in the README are extrapolations, not measurements.

Test plan
- End-to-end Kaggle T4×2 NCCL DDP run: 8L/384d, 13.4M params, 172 train steps, 3,786 TTT chunks, 6.85 MB artifact, exit 0.
- Artifact round-trip: header + blob lengths match the bytes written (Header + blobs total: 7,181,197 == 7,181,197: True); see the check below.
- `train_gpt.py` and `build_ngram_oracle.py` syntax-checked.
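A sketch of that round-trip check, assuming the header layout sketched under Key contributions; the artifact path is illustrative:

```python
import os
import struct

path = "bundled_artifact.bin"  # hypothetical artifact path
with open(path, "rb") as f:
    magic, version, neural_len, oracle_len = struct.unpack("<4sB3xII", f.read(16))

total = 16 + neural_len + oracle_len
print(f"Header + blobs total: {total} == {os.path.getsize(path)}:",
      total == os.path.getsize(path))  # expected: True (7,181,197 bytes)
```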
Why submit this

I joined late and didn't get 8×H100 access. Rather than fabricate numbers or skip submitting, I'm offering this as a methodology contribution: a clean, modular, reviewable design for hybrid frozen-oracle + neural systems. The README walks through the design, the negative results, the explicit limitations, and the concrete plan for what I'd do with H100 access.