
Record: GatedDeltaNet FLA + Score-First TTT + Brotli — val_bpb 1.00980 (3-seed mean) #1711

Closed

aamodbhatt wants to merge 1 commit into openai:main from aamodbhatt:gdn-fla-score-first-ttt

Conversation

@aamodbhatt
Contributor

Record Summary

Final submitted score: val_bpb 1.00980 (std 0.0015)
Reference pre-TTT roundtrip: 1.01902 (std 0.0017)

Hardware: 8×H100 80GB SXM | Artifact: ~15.6 MB | Train: 600s wallclock | TTT eval: ~276s

What Changed

3-Seed Results

Seed   Post-TTT val_bpb   Pre-TTT val_bpb   TTT delta   Artifact bytes
1337   1.00803            1.01720           -0.00917    15,595,190
42     1.01069            1.02054           -0.00986    15,602,610
2025   1.01067            1.01933           -0.00866    15,608,600
Mean   1.00980            1.01902           -0.00923
Std    0.0015             0.0017

Submission Checklist

  • One folder added under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py, train_gdn_7k.py, 3 seed logs present
  • Training ≤ 600s wallclock
  • All artifacts < 16,000,000 bytes (max: 15,608,600)
  • TTT eval < 600s (~276s)
  • No tokenizer/dataset edits
  • Score-first TTT compliance (Issue #1017, "A Field Guide to Valid Submissions"): every token scored before any weight update
  • No SLOT, no ETLB, no n-gram cache, no pre-quant TTT
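The score-first ordering named in the checklist can be illustrated with a toy sketch. This uses a hypothetical scalar model (predict each value as `theta`, squared-error score), not the submission's actual TTT loop or its SGD/freeze settings:

```python
def score_first_ttt(theta, chunks, lr=0.1):
    """Toy score-first test-time training on a scalar model.

    Legality constraint: every token in a chunk is scored with the weights
    as they were BEFORE any gradient step from that chunk is applied.
    """
    total = 0.0
    for chunk in chunks:
        # 1) Score the whole chunk with the current (pre-update) theta.
        total += sum((x - theta) ** 2 for x in chunk)
        # 2) Only afterwards adapt theta; the update benefits future chunks only.
        grad = sum(2 * (theta - x) for x in chunk) / len(chunk)
        theta -= lr * grad
    return total, theta
```

An illegal variant would step `theta` before (or interleaved with) scoring the same chunk, letting the model adapt to tokens it is about to score.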

Metric Verification

  • Post-TTT score from final_int6_ttt_exact in each seed log
  • Pre-TTT roundtrip from final_int6_roundtrip_exact in each seed log

Credits

…0 (3-seed mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + legal score-first
TTT (SGD 3ep freeze=2) + brotli-11 compression. 3-seed mean: 1.00980 BPB
(std 0.0015). All artifacts under 16 MB.

Seeds: 1337 (1.00803), 42 (1.01069), 2025 (1.01067)
TTT gain: ~-0.009 BPB per seed

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 18, 2026
PR openai#1711's records README explicitly enables score-first TTT, but the extracted
wrapper still defaulted TTT off. This worker-side patch keeps the reproduction
surface aligned with the submitted command instead of silently falling back to
the no-TTT path.

Constraint: Round31 is meant to validate the public claimed surface, not code-default drift
Rejected: Keep the default-off wrapper | would reproduce the wrong surface
Confidence: high
Scope-risk: narrow
Directive: Treat W88 results before this commit as code-default/no-TTT evidence, not faithful PR openai#1711 reproduction
Tested: python3 -m py_compile train_gpt.py train_gdn_7k.py architectures.py configs.py
Not-tested: remote end-to-end score after relaunch
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
Aweb's record-attempt submission, building on PR openai#1711 (1.00980 BPB) by
adding EMA-Teacher Distillation (Tarvainen & Valpola NeurIPS 2017,
'Mean teachers are better role models') as the novel contribution.

Loss: L = (1-α)·CE(target) + α·KL(student || teacher.detach())

Teacher is a separate copy of the student model, periodically (every K=16 steps)
synchronized from the EMA-smoothed state already maintained by the frontier code.
Alpha ramps linearly 0 → 0.3 over the middle 40% of training (steps 30%-70%).
Temperature scaling per Hinton soft-target convention (KL × T²).
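The loss and schedule described above can be sketched in plain Python. This is a minimal illustration, not the submitted code: the `softmax`/`distill_loss`/`alpha_schedule` helpers and their signatures are assumptions; only the formula L = (1-α)·CE + α·KL×T² and the 30%-70% ramp come from the commit message:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    m = max(x / T for x in logits)
    exps = [math.exp(x / T - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, target, alpha, T=2.0):
    # L = (1-alpha)*CE(target) + alpha*KL(student || teacher) * T^2
    # Teacher term is treated as constant (detached) during training.
    p_s = softmax(student_logits)
    ce = -math.log(p_s[target])
    q_s = softmax(student_logits, T)
    q_t = softmax(teacher_logits, T)
    kl = sum(s * math.log(s / t) for s, t in zip(q_s, q_t))
    return (1 - alpha) * ce + alpha * kl * T * T

def alpha_schedule(step, total, peak=0.3):
    # Linear ramp 0 -> peak over the middle 40% of training (steps 30%-70%).
    frac = step / total
    if frac <= 0.3:
        return 0.0
    if frac >= 0.7:
        return peak
    return peak * (frac - 0.3) / 0.4
```

Note that when student and teacher distributions coincide the KL term is exactly zero, matching the "KL of identical distributions = 0" smoke-test case below.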

Verified novel via gh search (mean teacher / EMA teacher / distillation / KL
soft targets) — zero matching open PRs in the competition.

Verified legal under Issue openai#1017 conditions 1-4:
  - Causal (teacher uses same forward as student)
  - Full distribution (KL on full softmax over vocab)
  - Score-before-update (distillation is training-time only; eval unchanged)
  - Single L→R pass (no rescoring)

CPU smoke test (8 cases, FLA-independent) passes:
  CE-only path correct, EMT path differs from CE, gradient routes to student
  not teacher, temperature scaling active, alpha schedule correct, mini-training
  loss decreases, KL of identical distributions = 0.

Credits: PR openai#1711 (aamodbhatt) GDN+brotli base; PR openai#1687 (resouer) GDN K_KVShare;
PR openai#461 (Christopher-Lee-McClendon) Score-First Legal TTT; FLA library
(sustcsonglin); Tarvainen & Valpola (NeurIPS 2017) Mean Teacher framework.
@aamodbhatt
Contributor Author

Closing — byte-counting bug in inherited build_sentencepiece_luts double-credits the leading-space byte, inflating the denominator and deflating reported BPB. Will re-evaluate with the canonical LUT from PR #1019 before resubmitting.

@aamodbhatt aamodbhatt closed this Apr 18, 2026
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
…ing-space byte

CRITICAL correctness fix. Our inherited train_gdn_7k.py from PR openai#1711's base
contained a byte-counting bug that double-credits the leading-space byte:

  LUT set: base_bytes[i] = len(piece[1:].encode('utf-8')) + 1
  Eval adds: tb += (has_leading_space[tgt] & ~is_boundary[prev])

This counts the space twice for every leading-space token (~80% of tokens
in SP1024). Denominator inflated by ~23%, reported BPB deflated by the same
factor. PR openai#1711 self-closed for this same bug (author statement: 'byte-
counting bug in inherited build_sentencepiece_luts double-credits the
leading-space byte...').

Fix: adopt canonical LUT from PR openai#549 / PR openai#1019 verbatim. Base bytes hold
UTF-8 length WITHOUT the leading-space marker; eval adds +1 conditionally.
Also adds byte-fallback handling (sp.is_byte) and is_unused check that were
missing.
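The buggy vs. canonical accounting can be reproduced in a few lines. This is a simplified sketch: `SPM_SPACE`, the helper names, and the boundary handling are illustrative (real SentencePiece LUT building also handles byte-fallback and unused pieces, as the fix notes):

```python
SPM_SPACE = "\u2581"  # SentencePiece leading-space marker "▁"

def lut_bytes_buggy(piece):
    # Inherited bug: strips the first char AND unconditionally credits a
    # space byte in the LUT, even though eval adds the space again.
    return len(piece[1:].encode("utf-8")) + 1

def lut_bytes_fixed(piece):
    # Canonical: LUT holds UTF-8 length WITHOUT the leading-space marker;
    # the space byte is credited only at eval time, conditionally.
    body = piece[1:] if piece.startswith(SPM_SPACE) else piece
    return len(body.encode("utf-8"))

def eval_bytes(lut, pieces):
    # Eval side: +1 for a leading-space token whose predecessor is not
    # already a boundary (simplified: only sequence start is a boundary).
    total = 0
    prev_boundary = True
    for p in pieces:
        total += lut(p)
        if p.startswith(SPM_SPACE) and not prev_boundary:
            total += 1
        prev_boundary = False
    return total
```

For the two-token sequence `["▁the", "▁cat"]` decoding to "the cat" (7 bytes), the fixed LUT yields 7 while the buggy LUT yields 9: the denominator is inflated, so reported BPB is deflated.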

After this fix, our reported BPBs are on the same scale as the merged
leaderboard. v6-v11 results are stale and need re-measurement.