Record: GatedDeltaNet (FLA) + Legal Score-First TTT — val_bpb 1.00995 (3-seed mean) #1698
arsenis-cmd wants to merge 1 commit into openai:main
Conversation
… (3-seed mean)

GatedDeltaNet linear attention (FLA) + legal score-first TTT on PR openai#1687's K_KVShare_Wider architecture.

- 3-seed mean: 1.00995 BPB (std 0.0012)
- Seeds: 42 (1.01130), 314 (1.00896), 999 (1.00959)
- TTT gain: ~-0.010 BPB per seed
- All artifacts under 16 MiB

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
…ct concern; PR openai#1687 CLOSED BPB bug; PR openai#1693 casefold 1.05733; SOTA Day 8; Session 16 https://claude.ai/code/session_01LVvBLAM46dRKg53renpkq4
Hey @arsenis-cmd, congrats on the strong result. I want to raise one interpretive question on Condition 4 of Issue #1017 ("single left-to-right pass") for organizer clarity, since the answer affects how much of the gain is novelty vs. recipe extension. Score-before-update is intact (each token is scored exactly once, before any gradient step that uses it), and Conditions 1, 2, and 3 look fine. The interpretive question is whether "single left-to-right pass" in Condition 4 refers strictly to the scoring pass (in which case 3 epochs of post-scoring adaptation are allowed) or to the entire eval procedure (in which case repeating SGD on previously-scored tokens 3× per chunk extends the merged TTT precedent of 1 epoch from PR #549/#461). Could the organizers weigh in? If multi-epoch adaptation is permitted under Track B's "per-document LoRA under score-before-update" framing, that would be useful precedent for everyone. If not, a 1-epoch ablation would tell us how much of the margin over the runner-up comes from the GDN architecture vs. the extended TTT recipe. Not flagging this as a blocker — just looking for a clean ruling before others build on it.
@sergeevii123 Good catch — you're right. I was using 16 MiB (16,777,216 bytes) as the threshold when the rules specify decimal 16 MB (16,000,000 bytes). All three seeds are 474–601 KB over the correct cap. I've already identified and coded a fix on the quantization side. Unfortunately I terminated my GPU instance after the original runs completed, so the unquantized checkpoints are gone. I've submitted a compute credit request to OpenAI to re-run the 3 seeds with the corrected quantization. Hopefully they'll grant it so I can validate the fix and update this PR.
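For anyone re-checking their own artifacts against the right threshold, here is a minimal sketch of the cap check. The script, function name, and artifact path are hypothetical; the 16,000,000-byte figure is the decimal cap quoted in the comment above.

```python
import os

RULE_CAP = 16_000_000       # decimal 16 MB, as the rules specify
MIB_CAP = 16 * 1024 ** 2    # 16,777,216 bytes: the MiB threshold mistakenly used above

def check_artifact(path: str) -> None:
    """Report an artifact's size against the decimal rule cap."""
    size = os.path.getsize(path)
    status = "OK" if size <= RULE_CAP else f"over decimal cap by {size - RULE_CAP:,} bytes"
    note = " (would pass under a MiB cap)" if RULE_CAP < size <= MIB_CAP else ""
    print(f"{path}: {size:,} bytes -- {status}{note}")

# Usage (hypothetical artifact path):
# check_artifact("artifacts/gdn_seed42_quantized.bin")
```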
@dexhunter Thanks for the careful analysis — this is a fair and well-framed question. You're correct that the adaptation step revisits previously-scored tokens across 3 epochs. My reading of the score-first TTT framework (established by @Christopher-Lee-McClendon in PR #461) is that the constraint is about scoring: each token must be scored exactly once, before any gradient update that could have used that token's signal. This is what "score-before-update" guarantees — no information leakage from future tokens into the score of past tokens. The multi-epoch SGD that follows is adaptation on already-scored tokens; the scores are frozen and final at that point. The 3 epochs of post-scoring SGD only affect future chunk predictions, not any scored token's BPB contribution. In that sense it's analogous to a larger learning rate or a more aggressive optimizer — a training hyperparameter for the adaptation step, not a scoring multiplier. That said, I agree this deserves an explicit ruling from the organizers, especially since the merged TTT precedent (PR #549/#461) used 1 epoch. If multi-epoch is deemed non-compliant under Track B, I'm happy to provide a 1-epoch ablation — I expect the gap to narrow by roughly 0.002–0.003 BPB (the TTT gain is -0.01 total across 3 epochs, with diminishing returns per epoch), which would still hold the record. Would appreciate @cocohearts @valerio-oai @0hq @yuzhougu-oai weighing in.
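To make the protocol under discussion concrete, here is a minimal sketch of a score-before-update eval loop as described in this thread. Everything in it is illustrative rather than this PR's code: the `evaluate_with_ttt` name, the `doc_chunks` iterable, plain SGD over all parameters (an actual Track B run may restrict adaptation to per-document LoRA params), and `ADAPT_EPOCHS = 3` mirroring the recipe being debated.

```python
import math
import torch
import torch.nn.functional as F

ADAPT_EPOCHS = 3  # the multi-epoch adaptation in question; merged precedent used 1

def evaluate_with_ttt(model, doc_chunks, lr=1e-3):
    """Score-before-update TTT: each chunk is scored exactly once, with weights
    that have only ever seen earlier chunks; adaptation then runs on the chunk
    already scored, so it can only influence future chunks' scores."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_bytes = 0.0, 0

    for inputs, targets, n_bytes in doc_chunks:   # single left-to-right pass over chunks
        # 1) Score first: this chunk's BPB contribution is frozen here.
        model.eval()
        with torch.no_grad():
            logits = model(inputs)                # [B, T, V]
            nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                  reduction="sum")
        total_nll += nll.item()
        total_bytes += n_bytes

        # 2) Adapt afterwards on the already-scored tokens (multi-epoch SGD).
        model.train()
        for _ in range(ADAPT_EPOCHS):
            opt.zero_grad()
            loss = F.cross_entropy(model(inputs).flatten(0, 1), targets.flatten())
            loss.backward()
            opt.step()

    # Convert summed nats to bits per byte.
    return total_nll / (math.log(2) * total_bytes)
```

Under this framing, the open question is only whether the inner `ADAPT_EPOCHS` loop counts as part of the "single left-to-right pass" or as a post-scoring training hyperparameter.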
Built on PR openai#1698 (GatedDeltaNet + Legal TTT). Adds:

- FreqGPTQ: frequency-weighted Hessian calibration for GPTQ
- PassthroughQuant: int8 for control tensors (saves ~40KB)
- Sandwich quantization: int8 for final block
- Adaptive embedding precision: int8 top-100 / intN rest
- Configurable Int5/6 GPTQ with synced QAT
- LZMA wrapper saves ~73KB

Pending GPU validation for BPB results.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
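As a rough illustration of the "adaptive embedding precision" item above: assuming it means quantizing the embedding rows of the 100 most frequent tokens at int8 and the remaining rows at a lower bit width, a sketch could look like the following. The function names, the `token_freqs` input, and the per-row symmetric scheme are assumptions, not this PR's actual code; real size savings would also require bit-packing the low-bit codes rather than storing them in int8.

```python
import torch

def quantize_rows(rows: torch.Tensor, n_bits: int):
    """Symmetric per-row quantization: integer codes plus per-row scales."""
    qmax = 2 ** (n_bits - 1) - 1
    scales = rows.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(rows / scales), -qmax - 1, qmax).to(torch.int8)
    return codes, scales

def adaptive_embedding_quant(embedding: torch.Tensor, token_freqs: torch.Tensor,
                             top_k: int = 100, low_bits: int = 5):
    """int8 for the top_k most frequent tokens' rows, low_bits for the rest."""
    top_ids = torch.topk(token_freqs, k=top_k).indices
    mask = torch.zeros(embedding.shape[0], dtype=torch.bool)
    mask[top_ids] = True
    hi_codes, hi_scales = quantize_rows(embedding[mask], n_bits=8)
    lo_codes, lo_scales = quantize_rows(embedding[~mask], n_bits=low_bits)
    return {"mask": mask, "hi": (hi_codes, hi_scales), "lo": (lo_codes, lo_scales)}
```

The design intuition is that frequent tokens dominate the loss, so their embedding rows get the extra precision while the long tail absorbs the heavier compression.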
…ed mean)

3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:

- seed 42: 1.01205
- seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
- seed 999: 1.01056

Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB. Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD design but disabled in the scored run (ttt_macro_phases=0) — on seed 42 it was a wash vs. vanilla per-chunk SGD (-0.00999 vs. -0.01012 TTT gain), so not worth the extra eval time. All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
Hi @arsenis-cmd — a second concern on this PR beyond the artifact-size flag. The piece LUT in `train_gdn_7k.py` pre-credits the leading-space byte:

```python
# train_gdn_7k.py — piece LUT
if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1  # pre-credits +1
```

Combined with the eval accumulator, which also adds a byte for `has_space` tokens, the leading-space byte is counted twice. The canonical pattern from merged PR #1019 drops the pre-credit:

```python
base_bytes[i] = len(piece.encode("utf-8"))  # after stripping ▁
```

Net effect is ~+17.7% byte inflation, deflating reported val_bpb by ~1/1.177. Correcting the LUT places the claimed 1.00995 at approximately 1.189 canonical, i.e. worse than merged SOTA (PR #1493 at 1.0810). Quick self-check: …

Separately, on my earlier Condition 4 question: after re-reading merged PR #549 (…).

The actionable path is the LUT fix + the 16 MB cap. Worth re-running with both corrected; the real canonical number should land well above the current bar, but clean data will make the GDN contribution visible.
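For readers following the arithmetic, here is a short sanity check of the rescaling implied above. The reported BPB and the ~17.7% inflation figure are taken from the comment; the snippet is illustrative only, not competition code.

```python
# Bits are unchanged; only the byte denominator was inflated, so the canonical
# BPB is the reported BPB multiplied back by the inflation factor.
reported_bpb = 1.00995
inflation = 1.177          # ~+17.7% byte inflation from the double-counted space byte

canonical_bpb = reported_bpb * inflation
print(f"canonical ≈ {canonical_bpb:.3f}")   # ≈ 1.189, vs merged SOTA 1.0810
```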
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) — reversible case-factoring with byte sidecar; stronger legality than casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738 builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810, Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
Summary
K_KVShare_Wider architecture by @resouer

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
Compliance
Key Architecture
Attribution
Run Command