
Record: GatedDeltaNet FLA + Score-First TTT + Brotli — val_bpb 1.00980 (3-seed mean) #1711

Closed

aamodbhatt wants to merge 1 commit into openai:main from aamodbhatt:gdn-fla-score-first-ttt

Conversation

@aamodbhatt
Contributor

Record Summary

Final submitted score: val_bpb 1.00980 (std 0.0015)
Reference pre-TTT roundtrip: 1.01902 (std 0.0017)

Hardware: 8×H100 80GB SXM | Artifact: ~15.6 MB | Train: 600s wallclock | TTT eval: ~276s

What Changed

3-Seed Results

Seed   Post-TTT val_bpb   Pre-TTT val_bpb   TTT delta   Artifact bytes
1337   1.00803            1.01720           -0.00917    15,595,190
42     1.01069            1.02054           -0.00986    15,602,610
2025   1.01067            1.01933           -0.00866    15,608,600
Mean   1.00980            1.01902           -0.00923
Std    0.0015             0.0017

Submission Checklist

  • One folder added under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py, train_gdn_7k.py, 3 seed logs present
  • Training ≤ 600s wallclock
  • All artifacts < 16,000,000 bytes (max: 15,608,600)
  • TTT eval < 600s (~276s)
  • No tokenizer/dataset edits
  • Score-first TTT compliance (Issue #1017, "A Field Guide to Valid Submissions"): every token scored before any weight update
  • No SLOT, no ETLB, no n-gram cache, no pre-quant TTT
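The score-first ordering named in the checklist can be illustrated with a toy sketch. This uses a hypothetical scalar model (predict each value as `theta`, squared-error score), not the submission's actual TTT loop or its SGD/freeze settings:

```python
def score_first_ttt(theta, chunks, lr=0.1):
    """Toy score-first test-time training on a scalar model.

    Legality constraint: every token in a chunk is scored with the weights
    as they were BEFORE any gradient step from that chunk is applied.
    """
    total = 0.0
    for chunk in chunks:
        # 1) Score the whole chunk with the current (pre-update) theta.
        total += sum((x - theta) ** 2 for x in chunk)
        # 2) Only afterwards adapt theta; the update benefits future chunks only.
        grad = sum(2 * (theta - x) for x in chunk) / len(chunk)
        theta -= lr * grad
    return total, theta
```

An illegal variant would step `theta` before (or interleaved with) scoring the same chunk, letting the model adapt to tokens it is about to score.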

Metric Verification

  • Post-TTT score from final_int6_ttt_exact in each seed log
  • Pre-TTT roundtrip from final_int6_roundtrip_exact in each seed log

Credits

…0 (3-seed mean)

GatedDeltaNet linear attention (FLA) K_KVShare_Wider + legal score-first
TTT (SGD 3ep freeze=2) + brotli-11 compression. 3-seed mean: 1.00980 BPB
(std 0.0015). All artifacts under 16 MB.

Seeds: 1337 (1.00803), 42 (1.01069), 2025 (1.01067)
TTT gain: ~-0.009 BPB per seed

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 18, 2026
PR openai#1711's records README explicitly enables score-first TTT, but the extracted
wrapper still defaulted TTT off. This worker-side patch keeps the reproduction
surface aligned with the submitted command instead of silently falling back to
the no-TTT path.

Constraint: Round31 is meant to validate the public claimed surface, not code-default drift
Rejected: Keep the default-off wrapper | would reproduce the wrong surface
Confidence: high
Scope-risk: narrow
Directive: Treat W88 results before this commit as code-default/no-TTT evidence, not faithful PR openai#1711 reproduction
Tested: python3 -m py_compile train_gpt.py train_gdn_7k.py architectures.py configs.py
Not-tested: remote end-to-end score after relaunch
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
Aweb's record-attempt submission, building on PR openai#1711 (1.00980 BPB) by
adding EMA-Teacher Distillation (Tarvainen & Valpola NeurIPS 2017,
'Mean teachers are better role models') as the novel contribution.

Loss: L = (1-α)·CE(target) + α·KL(student || teacher.detach())

Teacher is a separate copy of the student model, periodically (every K=16 steps)
synchronized from the EMA-smoothed state already maintained by the frontier code.
Alpha ramps linearly 0 → 0.3 over the middle 40% of training (steps 30%-70%).
Temperature scaling per Hinton soft-target convention (KL × T²).
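The loss and schedule described above can be sketched in plain Python. This is a minimal illustration, not the submitted code: the `softmax`/`distill_loss`/`alpha_schedule` helpers and their signatures are assumptions; only the formula L = (1-α)·CE + α·KL×T² and the 30%-70% ramp come from the commit message:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    m = max(x / T for x in logits)
    exps = [math.exp(x / T - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, target, alpha, T=2.0):
    # L = (1-alpha)*CE(target) + alpha*KL(student || teacher) * T^2
    # Teacher term is treated as constant (detached) during training.
    p_s = softmax(student_logits)
    ce = -math.log(p_s[target])
    q_s = softmax(student_logits, T)
    q_t = softmax(teacher_logits, T)
    kl = sum(s * math.log(s / t) for s, t in zip(q_s, q_t))
    return (1 - alpha) * ce + alpha * kl * T * T

def alpha_schedule(step, total, peak=0.3):
    # Linear ramp 0 -> peak over the middle 40% of training (steps 30%-70%).
    frac = step / total
    if frac <= 0.3:
        return 0.0
    if frac >= 0.7:
        return peak
    return peak * (frac - 0.3) / 0.4
```

Note that when student and teacher distributions coincide the KL term is exactly zero, matching the "KL of identical distributions = 0" smoke-test case below.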

Verified novel via gh search (mean teacher / EMA teacher / distillation / KL
soft targets) — zero matching open PRs in the competition.

Verified legal under Issue openai#1017 conditions 1-4:
  - Causal (teacher uses same forward as student)
  - Full distribution (KL on full softmax over vocab)
  - Score-before-update (distillation is training-time only; eval unchanged)
  - Single L→R pass (no rescoring)

CPU smoke test (8 cases, FLA-independent) passes:
  CE-only path correct, EMT path differs from CE, gradient routes to student
  not teacher, temperature scaling active, alpha schedule correct, mini-training
  loss decreases, KL of identical distributions = 0.

Credits: PR openai#1711 (aamodbhatt) GDN+brotli base; PR openai#1687 (resouer) GDN K_KVShare;
PR openai#461 (Christopher-Lee-McClendon) Score-First Legal TTT; FLA library
(sustcsonglin); Tarvainen & Valpola (NeurIPS 2017) Mean Teacher framework.
@aamodbhatt
Contributor Author

Closing — byte-counting bug in inherited build_sentencepiece_luts double-credits the leading-space byte, inflating the denominator and deflating reported BPB. Will re-evaluate with the canonical LUT from PR #1019 before resubmitting.

@aamodbhatt aamodbhatt closed this Apr 18, 2026
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
…ing-space byte

CRITICAL correctness fix. Our inherited train_gdn_7k.py from PR openai#1711's base
contained a byte-counting bug that double-credits the leading-space byte:

  LUT set: base_bytes[i] = len(piece[1:].encode('utf-8')) + 1
  Eval adds: tb += (has_leading_space[tgt] & ~is_boundary[prev])

This counts the space twice for every leading-space token (~80% of tokens
in SP1024). Denominator inflated by ~23%, reported BPB deflated by the same
factor. PR openai#1711 self-closed for this same bug (author statement: 'byte-
counting bug in inherited build_sentencepiece_luts double-credits the
leading-space byte...').

Fix: adopt canonical LUT from PR openai#549 / PR openai#1019 verbatim. Base bytes hold
UTF-8 length WITHOUT the leading-space marker; eval adds +1 conditionally.
Also adds byte-fallback handling (sp.is_byte) and is_unused check that were
missing.
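The buggy vs. canonical accounting can be reproduced in a few lines. This is a simplified sketch: `SPM_SPACE`, the helper names, and the boundary handling are illustrative (real SentencePiece LUT building also handles byte-fallback and unused pieces, as the fix notes):

```python
SPM_SPACE = "\u2581"  # SentencePiece leading-space marker "▁"

def lut_bytes_buggy(piece):
    # Inherited bug: strips the first char AND unconditionally credits a
    # space byte in the LUT, even though eval adds the space again.
    return len(piece[1:].encode("utf-8")) + 1

def lut_bytes_fixed(piece):
    # Canonical: LUT holds UTF-8 length WITHOUT the leading-space marker;
    # the space byte is credited only at eval time, conditionally.
    body = piece[1:] if piece.startswith(SPM_SPACE) else piece
    return len(body.encode("utf-8"))

def eval_bytes(lut, pieces):
    # Eval side: +1 for a leading-space token whose predecessor is not
    # already a boundary (simplified: only sequence start is a boundary).
    total = 0
    prev_boundary = True
    for p in pieces:
        total += lut(p)
        if p.startswith(SPM_SPACE) and not prev_boundary:
            total += 1
        prev_boundary = False
    return total
```

For the two-token sequence `["▁the", "▁cat"]` decoding to "the cat" (7 bytes), the fixed LUT yields 7 while the buggy LUT yields 9: the denominator is inflated, so reported BPB is deflated.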

After this fix, our reported BPBs are on the same scale as the merged
leaderboard. v6-v11 results are stale and need re-measurement.