Record: GatedDeltaNet + Legal TTT + Brotli-11 — val_bpb 1.01080 (3-seed mean, VALID artifacts) #1734

yahya010 wants to merge 1 commit into openai:main
Conversation
…ed mean) 3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:

- seed 42: 1.01205
- seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
- seed 999: 1.01056

Beats merged SOTA (1.0810, PR openai#1493) by 0.07020 BPB. Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD design but disabled in the scored run (ttt_macro_phases=0) — on seed 42 it was a wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012 TTT gain), so not worth the extra eval time. All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
Self-disclosure: this PR has a BPB-computation bug inherited from PR #1698.

I ran an empirical canonical byte-count check on the eval-time LUT + accumulator combination in `build_sentencepiece_luts` and the eval loops:

```python
# build_sentencepiece_luts (line ~216): +1 baked into LUT for leading-space tokens
base_bytes[i] = len(piece[1:].encode("utf-8")) + 1

# eval_val_sliding / eval_val_ttt_gdn (lines ~375, ~549): +1 added AGAIN at eval
tb = base_bytes_lut[tgt].to(torch.float64)
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
```
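For contrast, a minimal sketch of canonical single counting — an assumed fix, not code from this PR (`pieces`, `base_bytes_lut`, `tgt` as in the snippet above):

```python
import torch

# Assumed canonical fix: bake the leading-space byte into the LUT exactly once
# at build time ('\u2581' is the SentencePiece leading-space marker)...
base_bytes = [
    len(piece[1:].encode("utf-8")) + 1 if piece.startswith("\u2581")
    else len(piece.encode("utf-8"))
    for piece in pieces
]

# ...and consume it as-is at eval time, with no second +1:
tb = base_bytes_lut[tgt].to(torch.float64)  # no leading-space correction here
```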
Empirical check on the first ~1M val tokens (SP8192, seed-0 val shard): the double-counting accumulator over-counts target bytes by ~17.46% relative to canonical, which deflates the reported BPB by the same factor — canonical val_bpb ≈ 1.187 vs the claimed 1.01080.

That means the reported number is understated. The brotli-11 compression swap in this PR is still a legitimate artifact-size fix for PR #1698 — but the underlying claimed BPB is not comparable to records from the non-GDN line (merged SOTA PR #1493, PR #1700, etc.), which count each leading-space byte once.

Closing this PR. My other open submission, #1727 (MP-SGD phases=4 + QK 5.25, val_bpb 1.07217), is based on PR #1700's codebase, which uses the correct LUT (no eval-time double count). Apologies for the noise.
Closing based on the byte-LUT double-count finding in the comment above. The brotli-11 artifact-size fix is valid, but the canonical value (~1.187) is ~17.46% higher than the claimed 1.01080. Not submitting a BPB-fixed rerun because the canonical number would not beat the merged-SOTA threshold. #1727 (val_bpb 1.07217 on PR #1700's non-buggy base) remains open.
Hi yahya010 — I built a static LUT audit tool inspired by your PR #1734 closure note. Before I file it as a non-record PR, I wanted to share the finding with you directly.

Tool: detects three LUT deviation patterns:
Empirical finding on your ratio: I initially expected the gap to be a scoring-strategy artifact, but tracing more carefully, running your exact LUT against the same val stream gives 1.1770 — within ~0.2% of your quoted 1.1746. So your number appears consistent with your own LUT, and the 0.75% gap to my tool's default is a genuine LUT-shape difference rather than anyone's measurement error. Full analysis in …

Result of applying the tool to top-10 open PRs (2026-04-23):

Planned submission: a non-record PR to openai/parameter-golf with the tool, methodology doc, and current audit snapshot. Framed as a tooling contribution for future submitters to self-check, with your #1734 closure cited as the origin throughout.

Anything you'd push back on before I file? Also happy to credit you as co-author of the original discovery if you'd prefer that framing.

— abi2024
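A minimal sketch of the ratio check discussed in this thread — an assumed form of the audit's core computation, not abi2024's actual tool (`tokens` is the val stream as token ids; the LUTs map token id → byte count):

```python
import numpy as np

def bpb_ratio(tokens: np.ndarray, submission_lut: np.ndarray,
              canonical_lut: np.ndarray) -> float:
    """Factor by which canonical BPB exceeds the submission's reported BPB.

    BPB = total_bits / total_bytes, so with the bits numerator held fixed
    the ratio reduces to (submission bytes) / (canonical bytes).
    """
    return submission_lut[tokens].sum() / canonical_lut[tokens].sum()

# A double-counted LUT over the SP8192 val stream would land near the
# 1.1746 / 1.1770 figures quoted above.
```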
Summary
3-seed mean val_bpb = 1.01080 (std 0.00115), seeds 42/314/999, beating merged SOTA (1.0810, @bigbag PR #1493) by 0.07020 BPB / ~0.04867 nats — clears the 0.005-nat / ~0.0072 BPB threshold by a 10x margin.
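For anyone checking the unit conversion (a quick sanity check, not project code — BPB × ln 2 = nats/byte):

```python
import math

delta_bpb = 1.0810 - 1.01080           # 0.07020 BPB improvement over merged SOTA
delta_nats = delta_bpb * math.log(2)   # ≈ 0.04866 nats/byte
threshold_bpb = 0.005 / math.log(2)    # 0.005 nats ≈ 0.00721 BPB
print(delta_bpb / threshold_bpb)       # margin ≈ 9.7x
```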
This is the first VALID sub-1.02 BPB submission. Base is @arsenis-cmd's PR #1698 (GatedDeltaNet + Legal Score-First TTT, 1.00995 BPB), which is currently ineligible for merge because all three of its artifacts exceed the 16,000,000-byte decimal cap (16.47-16.60 MB). @sergeevii123 flagged this on the PR; @arsenis-cmd acknowledged and proposed reducing `clip_range` 31→24 as the fix — that does shrink artifacts, but it also adds ~+0.014 BPB of quantization penalty because more weights get clipped (illustrated in the sketch below).
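To illustrate where that penalty comes from, a hedged sketch under an assumed symmetric int6 quantizer — not the PR's actual GPTQ code:

```python
import torch

def quantize_int6(w: torch.Tensor, scale: torch.Tensor, clip_range: int) -> torch.Tensor:
    # int6 codes span [-32, 31]: clip_range=31 uses the full grid, while
    # clip_range=24 saturates more outlier weights -- the assumed source
    # of the ~+0.014 BPB penalty.
    return torch.round(w / scale).clamp_(-clip_range - 1, clip_range)
```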
This PR takes a different fix: keep `clip_range=31` (no extra quant penalty) and replace zstandard-22 with brotli-11 for artifact compression. Brotli is ~6% smaller on this int6-GPTQ byte distribution (it is already the compressor used in merged SOTA PR #1493), bringing all 3 artifacts comfortably under 16,000,000 bytes with zero quality loss.

Results (8×H100 80GB SXM, torch 2.9.1+cu128)
Seed 314 alone (1.00978) is lower than PR #1698's entire 3-seed mean (1.00995), confirming the brotli swap is quality-neutral.
Direct comparison with PR #1698 (same code + TTT recipe; different compressor)
Compliance (Issue #1017 Track A)
- Scoring happens under `torch.inference_mode()` BEFORE any SGD update (identical to PR #1698). A sketch of this score-first loop follows the list.
- The `ttt_epochs=3` multi-epoch SGD on already-scored tokens matches PR #1698 exactly, and the same pattern is used in PR #1700 and merged SOTA PR #1493. The interpretation of Condition 4 vs multi-epoch post-score training is pending organizer clarification (raised by @dexhunter on PR #1698) — if ruled against, both this PR and #1698 would need `ttt_epochs=1`.
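A minimal sketch of that score-first ordering — loop shape and names (`val_chunks`, `cross_entropy_bits`, `sgd_optimizer`) are assumptions, not the PR's actual eval code:

```python
import torch

total_bits = 0.0
for chunk in val_chunks:                      # hypothetical chunked val stream
    with torch.inference_mode():              # score BEFORE any weight update
        logits = model(chunk.inputs)
        total_bits += cross_entropy_bits(logits, chunk.targets)  # assumed helper
    for _ in range(3):                        # ttt_epochs=3 on already-scored tokens
        loss = torch.nn.functional.cross_entropy(
            model(chunk.inputs).flatten(0, 1), chunk.targets.flatten())
        loss.backward()
        sgd_optimizer.step()
        sgd_optimizer.zero_grad()
```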
What's new vs PR #1698

- `zstandard.ZstdCompressor(level=22)` → `brotli.compress(quality=11)` — the single-line change (plus the matching decompressor) that makes the recipe valid. Brotli is already used in merged SOTA PR #1493; this PR just brings it to the GDN codebase. A sketch of the swap follows this list.
- Macro-phase SGD TTT hook from PR #1700's Multi-Phase Global SGD design (env knobs `TTT_MACRO_PHASES`, `TTT_MACRO_EPOCHS`, `TTT_MACRO_LR_MULT`), but disabled in the scored run (`TTT_MACRO_PHASES=0`). On seed 42 I measured it head-to-head: macro-phases=4 gave TTT gain -0.00999; macro-phases=0 gave -0.01012 — within seed noise, so I did not use it and flagged this honestly in the README. Left in place for future tuning.
- No other changes — architecture (K_KVShare_Wider, 10-layer GDN, 544d, 8H, KV-share stride=2), training (Muon + Adam, EMA 0.997, SWA, Late QAT, 7000-step ceiling), and TTT (score-first SGD lr=0.005, 3 epochs/chunk, freeze first 2 blocks) are identical to PR #1698.
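The swap itself, as a hedged sketch — the wrapper names are illustrative; only the two compressor calls are the actual change described above:

```python
import brotli  # pip install Brotli

def compress_artifact(raw: bytes) -> bytes:
    # Before (PR #1698): zstandard.ZstdCompressor(level=22).compress(raw)
    return brotli.compress(raw, quality=11)   # ~6% smaller on this byte stream

def decompress_artifact(blob: bytes) -> bytes:
    return brotli.decompress(blob)            # matching decompressor change
```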
Reproduction
Attribution
GatedDeltaNet kernels come from the FLA library (`fla-core==0.4.2`).

Test plan
🤖 Generated with Claude Code