Record: GatedDeltaNet (FLA) + Legal Score-First TTT — val_bpb 1.00995 (3-seed mean) #1698
arsenis-cmd wants to merge 1 commit into openai:main
Conversation
… (3-seed mean)

GatedDeltaNet linear attention (FLA) + legal score-first TTT on PR openai#1687's K_KVShare_Wider architecture.

- 3-seed mean: 1.00995 BPB (std 0.0012)
- Seeds: 42 (1.01130), 314 (1.00896), 999 (1.00959)
- TTT gain: ~-0.010 BPB per seed
- All artifacts under 16 MiB

Based on PR openai#1687 by @resouer, TTT adapted from PR openai#461.
…ct concern; PR openai#1687 CLOSED BPB bug; PR openai#1693 casefold 1.05733; SOTA Day 8; Session 16 https://claude.ai/code/session_01LVvBLAM46dRKg53renpkq4
Hey @arsenis-cmd, congrats on the strong result. I want to raise one interpretive question on Condition 4 of Issue #1017 ("single left-to-right pass") for organizer clarity, since the answer affects how much of the gain is novelty vs. recipe extension. Score-before-update is intact (each token is scored exactly once, before any gradient step that uses it), and Conditions 1, 2, and 3 look fine. The interpretive question is whether "single left-to-right pass" in Condition 4 refers strictly to the scoring pass (in which case 3 epochs of post-scoring adaptation are allowed) or to the entire eval procedure (in which case repeating SGD on previously-scored tokens 3× per chunk extends the merged TTT precedent of 1 epoch from PR #549/#461). Could the organizers weigh in? If multi-epoch adaptation is permitted under Track B's "per-document LoRA under score-before-update" framing, that would be useful precedent for everyone. If not, a 1-epoch ablation would tell us how much of the margin over the runner-up comes from the GDN architecture vs. the extended TTT recipe. Not flagging this as a blocker — just looking for a clean ruling before others build on it.
@sergeevii123 Good catch — you're right. I was using 16 MiB (16,777,216 bytes) as the threshold when the rules specify decimal 16 MB (16,000,000 bytes). All three seeds are 474–601 KB over the correct cap. I've already identified and coded a fix on the quantization side. Unfortunately I terminated my GPU instance after the original runs completed, so the unquantized checkpoints are gone. I've submitted a compute credit request to OpenAI to re-run the 3 seeds with the corrected quantization. Hopefully they'll grant it so I can validate the fix and update this PR.
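For anyone re-checking their own artifacts against the right threshold, here is a minimal sketch of the cap check. The script, function name, and artifact path are hypothetical; the 16,000,000-byte figure is the decimal cap quoted in the comment above.

```python
import os

RULE_CAP = 16_000_000       # decimal 16 MB, as the rules specify
MIB_CAP = 16 * 1024 ** 2    # 16,777,216 bytes: the MiB threshold mistakenly used above

def check_artifact(path: str) -> None:
    """Report an artifact's size against the decimal rule cap."""
    size = os.path.getsize(path)
    status = "OK" if size <= RULE_CAP else f"over decimal cap by {size - RULE_CAP:,} bytes"
    note = " (would pass under a MiB cap)" if RULE_CAP < size <= MIB_CAP else ""
    print(f"{path}: {size:,} bytes -- {status}{note}")

# Usage (hypothetical artifact path):
# check_artifact("artifacts/gdn_seed42_quantized.bin")
```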
@dexhunter Thanks for the careful analysis — this is a fair and well-framed question. You're correct that the adaptation step revisits previously-scored tokens across 3 epochs. My reading of the score-first TTT framework (established by @Christopher-Lee-McClendon in PR #461) is that the constraint is about scoring: each token must be scored exactly once, before any gradient update that could have used that token's signal. This is what "score-before-update" guarantees — no information leakage from future tokens into the score of past tokens. The multi-epoch SGD that follows is adaptation on already-scored tokens; the scores are frozen and final at that point. The 3 epochs of post-scoring SGD only affect future chunk predictions, not any scored token's BPB contribution. In that sense it's analogous to a larger learning rate or a more aggressive optimizer — a training hyperparameter for the adaptation step, not a scoring multiplier. That said, I agree this deserves an explicit ruling from the organizers, especially since the merged TTT precedent (PR #549/#461) used 1 epoch. If multi-epoch is deemed non-compliant under Track B, I'm happy to provide a 1-epoch ablation — I expect the gap to narrow by roughly 0.002–0.003 BPB (the TTT gain is -0.01 total across 3 epochs, with diminishing returns per epoch), which would still hold the record. Would appreciate @cocohearts @valerio-oai @0hq @yuzhougu-oai weighing in.
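To make the protocol under discussion concrete, here is a minimal sketch of a score-before-update eval loop as described in this thread. Everything in it is illustrative rather than this PR's code: the `evaluate_with_ttt` name, the `doc_chunks` iterable, plain SGD over all parameters (an actual Track B run may restrict adaptation to per-document LoRA params), and `ADAPT_EPOCHS = 3` mirroring the recipe being debated.

```python
import math
import torch
import torch.nn.functional as F

ADAPT_EPOCHS = 3  # the multi-epoch adaptation in question; merged precedent used 1

def evaluate_with_ttt(model, doc_chunks, lr=1e-3):
    """Score-before-update TTT: each chunk is scored exactly once, with weights
    that have only ever seen earlier chunks; adaptation then runs on the chunk
    already scored, so it can only influence future chunks' scores."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_bytes = 0.0, 0

    for inputs, targets, n_bytes in doc_chunks:   # single left-to-right pass over chunks
        # 1) Score first: this chunk's BPB contribution is frozen here.
        model.eval()
        with torch.no_grad():
            logits = model(inputs)                # [B, T, V]
            nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                  reduction="sum")
        total_nll += nll.item()
        total_bytes += n_bytes

        # 2) Adapt afterwards on the already-scored tokens (multi-epoch SGD).
        model.train()
        for _ in range(ADAPT_EPOCHS):
            opt.zero_grad()
            loss = F.cross_entropy(model(inputs).flatten(0, 1), targets.flatten())
            loss.backward()
            opt.step()

    # Convert summed nats to bits per byte.
    return total_nll / (math.log(2) * total_bytes)
```

Under this framing, the open question is only whether the inner `ADAPT_EPOCHS` loop counts as part of the "single left-to-right pass" or as a post-scoring training hyperparameter.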
Built on PR openai#1698 (GatedDeltaNet + Legal TTT). Adds:

- FreqGPTQ: frequency-weighted Hessian calibration for GPTQ
- PassthroughQuant: int8 for control tensors (saves ~40KB)
- Sandwich quantization: int8 for final block
- Adaptive embedding precision: int8 top-100 / intN rest
- Configurable Int5/6 GPTQ with synced QAT
- LZMA wrapper saves ~73KB

Pending GPU validation for BPB results.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
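As a rough illustration of the "adaptive embedding precision" item above: assuming it means quantizing the embedding rows of the 100 most frequent tokens at int8 and the remaining rows at a lower bit width, a sketch could look like the following. The function names, the `token_freqs` input, and the per-row symmetric scheme are assumptions, not this PR's actual code; real size savings would also require bit-packing the low-bit codes rather than storing them in int8.

```python
import torch

def quantize_rows(rows: torch.Tensor, n_bits: int):
    """Symmetric per-row quantization: integer codes plus per-row scales."""
    qmax = 2 ** (n_bits - 1) - 1
    scales = rows.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(rows / scales), -qmax - 1, qmax).to(torch.int8)
    return codes, scales

def adaptive_embedding_quant(embedding: torch.Tensor, token_freqs: torch.Tensor,
                             top_k: int = 100, low_bits: int = 5):
    """int8 for the top_k most frequent tokens' rows, low_bits for the rest."""
    top_ids = torch.topk(token_freqs, k=top_k).indices
    mask = torch.zeros(embedding.shape[0], dtype=torch.bool)
    mask[top_ids] = True
    hi_codes, hi_scales = quantize_rows(embedding[mask], n_bits=8)
    lo_codes, lo_scales = quantize_rows(embedding[~mask], n_bits=low_bits)
    return {"mask": mask, "hi": (hi_codes, hi_scales), "lo": (lo_codes, lo_scales)}
```

The design intuition is that frequent tokens dominate the loss, so their embedding rows get the extra precision while the long tail absorbs the heavier compression.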
…ed mean)

3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:

- seed 42: 1.01205
- seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
- seed 999: 1.01056

Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB. Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD design but disabled in the scored run (ttt_macro_phases=0) — on seed 42 it was a wash vs. vanilla per-chunk SGD (-0.00999 vs. -0.01012 TTT gain), so not worth the extra eval time. All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
Hi @arsenis-cmd — a second concern on this PR beyond the artifact-size flag. The piece LUT in `train_gdn_7k.py` pre-credits the leading-space byte:

```python
# train_gdn_7k.py — piece LUT
if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1  # pre-credits +1
```

Combined with the eval accumulator, which also adds a byte for `has_space` tokens, the leading-space byte is counted twice. The canonical pattern from merged PR #1019 drops the pre-credit:

```python
base_bytes[i] = len(piece.encode("utf-8"))  # after stripping ▁
```

Net effect is ~+17.7% byte inflation, deflating reported val_bpb by ~1/1.177. Correcting the LUT places the claimed 1.00995 at approximately 1.189 canonical, i.e. worse than merged SOTA (PR #1493 at 1.0810). Quick self-check: …

Separately, on my earlier Condition 4 question: after re-reading merged PR #549 (…).

The actionable path is the LUT fix + the 16 MB cap. Worth re-running with both corrected; the real canonical number should land well above the current bar, but clean data will make the GDN contribution visible.
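For readers following the arithmetic, here is a short sanity check of the rescaling implied above. The reported BPB and the ~17.7% inflation figure are taken from the comment; the snippet is illustrative only, not competition code.

```python
# Bits are unchanged; only the byte denominator was inflated, so the canonical
# BPB is the reported BPB multiplied back by the inflation factor.
reported_bpb = 1.00995
inflation = 1.177          # ~+17.7% byte inflation from the double-counted space byte

canonical_bpb = reported_bpb * inflation
print(f"canonical ≈ {canonical_bpb:.3f}")   # ≈ 1.189, vs merged SOTA 1.0810
```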
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) — reversible case-factoring with byte sidecar; stronger legality than casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738 builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810, Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
Summary
K_KVShare_Wider architecture by @resouer

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
Compliance
Key Architecture
Attribution
Run Command