Record: GatedDeltaNet (FLA) + Legal Score-First TTT — val_bpb 1.00995 (3-seed mean) #1698

Open

arsenis-cmd wants to merge 1 commit into openai:main from arsenis-cmd:submission/gdn-fla-ttt-1.00995

Conversation

@arsenis-cmd

Summary

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed   Steps   Pre-TTT BPB   Post-TTT BPB   TTT Gain    Artifact (bytes)
42     2,364   1.02142       1.01130        -0.01012    16,600,916
314    2,398   1.01872       1.00896        -0.00976    16,548,775
999    2,370   1.01967       1.00959        -0.01008    16,474,250
Mean   2,377   1.01994       1.00995        -0.00999    —

Compliance

  • Train under 600s (wallclock limited)
  • All artifacts under 16 MiB (max: 16,600,916 bytes = 15.83 MiB)
  • Eval under 600s (~320s total: ~120s roundtrip + ~200s TTT)
  • Score-first TTT (no SLOT, no n-gram cache, no ETLB)
  • 3-seed validation (42, 314, 999)

Key Architecture

  • Attention: GatedDeltaNet (FLA) — O(n) linear recurrence, no quadratic attention
  • Config: K_KVShare_Wider — 10 layers, 544d, 8 heads, KV sharing stride=2
  • Quantization: Int6 matrices + Int8 embeddings + zstd-22
  • Training: Muon optimizer + EMA(0.997) + SWA(50) + Late QAT
  • TTT: SGD(lr=0.005, momentum=0.9), 3 epochs, 32K-token chunks, freeze first 2 blocks
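
For readers unfamiliar with the recipe, here is a minimal sketch of the score-first, per-chunk TTT loop this configures. The model/chunk/byte-count plumbing below is hypothetical (not the submission's actual train_gpt.py code); only the ordering — score each chunk once under inference_mode before any SGD touches it — and the hyperparameters listed above come from this PR.

# Hedged sketch of score-first per-chunk TTT. `model.blocks`, `chunks`, and
# `byte_counts` are illustrative stand-ins, not the submission's actual objects.
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, byte_counts, lr=0.005, momentum=0.9,
                    epochs=3, freeze_blocks=2, grad_clip=1.0):
    # Freeze the first `freeze_blocks` blocks; adapt only the remaining ones.
    adapt_params = []
    for i, block in enumerate(model.blocks):
        for p in block.parameters():
            p.requires_grad_(i >= freeze_blocks)
            if i >= freeze_blocks:
                adapt_params.append(p)
    opt = torch.optim.SGD(adapt_params, lr=lr, momentum=momentum)

    total_bits, total_bytes = 0.0, 0
    for k, (inp, tgt) in enumerate(chunks):
        # 1) Score the chunk exactly once, before any update that has seen it.
        with torch.inference_mode():
            logits = model(inp)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  tgt.view(-1), reduction="sum")
        total_bits += nll.item() / math.log(2)
        total_bytes += byte_counts[k]

        # 2) Only afterwards adapt on the already-scored chunk (3 SGD epochs);
        #    this can influence future chunks' scores but never this chunk's.
        if k < len(chunks) - 1:
            vocab = logits.size(-1)
            for _ in range(epochs):
                opt.zero_grad(set_to_none=True)
                loss = F.cross_entropy(model(inp).view(-1, vocab), tgt.view(-1))
                loss.backward()
                torch.nn.utils.clip_grad_norm_(adapt_params, grad_clip)
                opt.step()
    return total_bits / total_bytes  # val_bpb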

Attribution

Run Command

ARCH_MODE=K TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
TTT_CHUNK_TOKENS=32768 TTT_FREEZE_BLOCKS=2 TTT_MOMENTUM=0.9 \
TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

… (3-seed mean)

GatedDeltaNet linear attention (FLA) + legal score-first TTT on PR openai#1687
K_KVShare_Wider architecture. 3-seed mean: 1.00995 BPB (std 0.0012).

Seeds: 42 (1.01130), 314 (1.00896), 999 (1.00959)
TTT gain: ~-0.010 BPB per seed
All artifacts under 16 MiB.

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 17, 2026
@sergeevii123

From the rules: the cap is decimal 16 MB, i.e. 16,000,000 total bytes, not 16 MiB (16,777,216 bytes).
Looks like these artifacts exceed the size limit.

@dexhunter
Contributor

Hey @arsenis-cmd, congrats on the strong result. I want to raise one interpretive question on Condition 4 of Issue #1017 ("single left-to-right pass") for organizer clarity, since the answer affects how much of the gain is novelty vs. recipe extension.

In train_gdn_7k.py, lines 1741-1771 run for _ep in range(args.ttt_epochs) for each non-final chunk, with the default ttt_epochs=3 (line 1363; also visible in all three seed logs as ttt_epochs=3). So each chunk's tokens are scored once under inference_mode() (lines 1707-1731), then used for three epochs of SGD before moving on to the next chunk.

Score-before-update is intact (each token is scored exactly once, before any gradient step that uses it), and Conditions 1, 2, and 3 look fine. The interpretive question is whether "single left-to-right pass" in Condition 4 refers strictly to the scoring pass (in which case 3 epochs of post-scoring adaptation is allowed) or to the entire eval procedure (in which case repeating SGD on previously-scored tokens 3× per chunk extends the merged TTT precedent of 1 epoch from PR #549/#461).

Could the organizers weigh in? If multi-epoch adaptation is permitted under Track B's "per-document LoRA under score-before-update" framing, that would be useful precedent for everyone. If not, a 1-epoch ablation would tell us how much of the margin to the runner-up is from GDN architecture vs. the extended TTT recipe.

Not flagging as a blocker — just want a clean ruling before others build on it.

@arsenis-cmd
Author

@sergeevii123 Good catch — you're right. I was using 16 MiB (16,777,216 bytes) as the threshold when the rules specify decimal 16 MB (16,000,000 bytes). All three seeds are 474–601 KB over the correct cap.

I've already identified and coded the fix: reducing the quantization clip_range from 31 (int6) to 24 in mixed_quantize(). Fewer unique byte values in the int8 containers → zstd-22 compresses ~9% more efficiently, which brings all artifacts comfortably under 16,000,000 bytes. The quality impact is minimal since per-row scaling means most quantized values are already well within [-24, +24] — estimated +0.002–0.005 BPB at most.
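
For concreteness, a hedged sketch of that change follows. mixed_quantize() is the submission's function name, but its internals are not shown in this PR, so the per-row scaling and packaging below are assumptions that only illustrate why a tighter clip range helps zstd:

# Hedged sketch: per-row symmetric quantization with a tighter clip range.
# The real mixed_quantize() internals are not shown in this PR; only the
# clip_range change (31 -> 24) and the zstd-22 container are from the thread.
import numpy as np
import zstandard as zstd

def quantize_rows(w: np.ndarray, clip_range: int = 24) -> tuple[np.ndarray, np.ndarray]:
    scale = np.abs(w).max(axis=1, keepdims=True) / clip_range
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -clip_range, clip_range).astype(np.int8)
    return q, scale.astype(np.float16)

def pack_artifact(matrices: list) -> bytes:
    # Fewer distinct byte values (49 symbols instead of 63) lowers the entropy
    # of the int8 stream, so zstd at level 22 compresses it noticeably better.
    raw = b"".join(quantize_rows(w)[0].tobytes() for w in matrices)
    return zstd.ZstdCompressor(level=22).compress(raw)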

Unfortunately I terminated my GPU instance after the original runs completed, so the unquantized checkpoints are gone. I've submitted a compute credit request to OpenAI to re-run the 3 seeds with the corrected quantization. Hopefully they'll grant it so I can validate the fix and update this PR.

@arsenis-cmd
Author

@dexhunter Thanks for the careful analysis — this is a fair and well-framed question.

You're correct that ttt_epochs=3 means each chunk's tokens undergo 3 epochs of SGD after scoring. The key compliance point is Condition 4's "single left-to-right pass" and whether it refers to the scoring pass or the entire eval procedure.

My reading of the score-first TTT framework (established by @Christopher-Lee-McClendon in PR #461) is that the constraint is about scoring: each token must be scored exactly once, before any gradient update that could have used that token's signal. This is what "score-before-update" guarantees — no information leakage from future tokens into the score of past tokens.

The multi-epoch SGD that follows is adaptation on already-scored tokens. The scores are frozen and final at that point. The 3 epochs of post-scoring SGD only affect future chunk predictions, not any scored token's BPB contribution. In that sense, it's analogous to a larger learning rate or more aggressive optimizer — a training hyperparameter for the adaptation step, not a scoring multiplier.

That said, I agree this deserves an explicit ruling from organizers, especially since the merged TTT precedent (PR #549/#461) used 1 epoch. If multi-epoch is deemed non-compliant under Track B, I'm happy to provide a 1-epoch ablation — I expect the gap to narrow by roughly 0.002–0.003 BPB (the TTT gain is -0.01 total across 3 epochs, with diminishing returns per epoch), which would still hold the record.

Would appreciate @cocohearts @valerio-oai @0hq @yuzhougu-oai weighing in.

OlesStankevych added a commit to OlesStankevych/parameter-golf that referenced this pull request Apr 18, 2026
Built on PR openai#1698 (GatedDeltaNet + Legal TTT). Adds:
- FreqGPTQ: frequency-weighted Hessian calibration for GPTQ
- PassthroughQuant: int8 for control tensors (saves ~40KB)
- Sandwich quantization: int8 for final block
- Adaptive embedding precision: int8 top-100 / intN rest
- Configurable Int5/6 GPTQ with synced QAT
- LZMA wrapper saves ~73KB

Pending GPU validation for BPB results.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
yahya010 added a commit to yahya010/parameter-golf that referenced this pull request Apr 19, 2026
…ed mean)

3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:
  - seed 42:  1.01205
  - seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
  - seed 999: 1.01056
Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB.

Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD
design but disabled in the scored run (ttt_macro_phases=0) — on seed 42
it was a wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012 TTT gain),
so not worth the extra eval time.

All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
@dexhunter
Contributor

Hi @arsenis-cmd — a second concern on this PR beyond the artifact-size flag.

The build_sentencepiece_luts helper in train_gdn_7k.py appears to inherit the byte-accounting defect tracked in #1719 (GDN-family BPB bug):

# train_gdn_7k.py — piece LUT
if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1   # pre-credits +1

combined with the eval accumulator, which also adds +1 for ▁-prefixed tokens via has_leading_space_lut & ~is_boundary_token_lut. One byte is counted twice.

The canonical pattern from merged PR #1019 drops the pre-credit:

base_bytes[i] = len(piece.encode("utf-8"))   # after stripping ▁

Net effect is ~+17.7% byte inflation, deflating reported val_bpb by a factor of ~1/1.177. Correcting the LUT places the claimed 1.00995 at approximately 1.189 canonical — behind merged SOTA (PR #1493 at 1.0810).

Quick self-check: val_loss / val_bpb should be ≈2.58 for sp8192 (≈3.73 bytes/token × ln 2). The ~3.04 ratio in the GDN-family logs is the fingerprint of the bug.
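
A hedged sketch of that arithmetic, assuming the ≈3.73 bytes/token figure for sp8192 quoted above (it only reproduces the ratio check and the rescaling, not any eval code):

# Sanity-check arithmetic for the byte-accounting bug (sketch only; 3.73
# bytes/token is the sp8192 figure quoted above, not re-measured here).
import math

BYTES_PER_TOKEN = 3.73   # canonical sp8192 average
INFLATION = 1.177        # ~+17.7% byte inflation from the double-counted byte

# bpb = loss_nats / (ln 2 * bytes), so loss / bpb = ln 2 * bytes_per_token.
expected_ratio = math.log(2) * BYTES_PER_TOKEN   # ~2.58 with correct accounting
buggy_ratio = expected_ratio * INFLATION         # ~3.04, the log fingerprint

# Rescaling the claimed score back onto canonical byte counts:
canonical_bpb = 1.00995 * INFLATION              # ~1.189

print(f"loss/bpb expected ~{expected_ratio:.2f}, buggy ~{buggy_ratio:.2f}")
print(f"claimed 1.00995 -> canonical ~{canonical_bpb:.3f}")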


Separately on my earlier Condition 4 question: after re-reading merged PR #549 (train_gpt.py:95,1187), ttt_epochs=3 is the default there with chunk-scoped epochs running AFTER inference_mode scoring. The multi-epoch pattern matches merged precedent when score-before-update holds per chunk, so I no longer view Condition 4 as the blocker here — apologies for the noise.

The actionable path is the LUT fix + the 16 MB cap. Worth re-running with both corrected; the real canonical number should land well above the current bar, but clean data will make the GDN contribution visible.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31