
Record: GatedDeltaNet + Legal TTT + Brotli-11 — val_bpb 1.01080 (3-seed mean, VALID artifacts) #1734

Closed
yahya010 wants to merge 1 commit into openai:main from yahya010:submit/yahya010-gdn-brotli-valid

Conversation

@yahya010

Summary

3-seed mean val_bpb = 1.01080 (std 0.00115), seeds 42/314/999, beating merged SOTA (1.0810, @bigbag PR #1493) by 0.07020 BPB / ~0.04867 nats — clears the 0.005-nat / ~0.0072 BPB threshold by a 10x margin.
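For the unit conversion behind those numbers: the 0.005-nat threshold is 0.005 / ln 2 ≈ 0.0072 BPB, and the 0.07020 BPB gap is 0.07020 × ln 2 ≈ 0.0487 nats, i.e. roughly 9.7× the threshold.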

This is the first VALID sub-1.02 BPB submission. Base is @arsenis-cmd's PR #1698 (GatedDeltaNet + Legal Score-First TTT, 1.00995 BPB), which is currently ineligible for merge because all three of its artifacts exceed the 16,000,000-byte decimal cap (16.47-16.60 MB). @sergeevii123 flagged this on the PR; @arsenis-cmd acknowledged and proposed reducing clip_range 31→24 as the fix — that does shrink artifacts but it also adds ~+0.014 BPB quantization penalty because more weights get clipped.

This PR takes a different fix: keep clip_range=31 (no extra quant penalty) and replace zstandard-22 with brotli-11 for artifact compression. Brotli is ~6% smaller on this int6-GPTQ byte distribution (already the compressor used in merged SOTA PR #1493), bringing all 3 artifacts comfortably under 16,000,000 bytes with zero quality loss.
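For illustration, a minimal sketch of the swap (the artifact path is an assumption; only the two compressor calls and the roundtrip check matter):

```python
import brotli
import zstandard

# Illustrative path: stand-in for the int6-GPTQ quantized weight blob
# actually written by train_gdn_7k.py.
with open("artifacts/seed42_int6_gptq.bin", "rb") as f:
    raw = f.read()

# PR #1698's compressor: zstd at its maximum level.
zstd_bytes = zstandard.ZstdCompressor(level=22).compress(raw)

# This PR's compressor: brotli at its maximum quality.
br_bytes = brotli.compress(raw, quality=11)

print(f"zstd-22:   {len(zstd_bytes):,} B")
print(f"brotli-11: {len(br_bytes):,} B")

# Zero quality loss: decompression must reproduce the exact input bytes.
assert brotli.decompress(br_bytes) == raw
```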

Results (8×H100 80GB SXM, torch 2.9.1+cu128)

| Seed | EMA BPB | Pre-TTT BPB | Post-TTT BPB | TTT Gain | Artifact |
|------|---------|-------------|--------------|----------|----------|
| 42   | 1.00257 | 1.02189 | 1.01205 | -0.00984 | 15,543,829 B |
| 314  | 1.00033 | 1.01903 | 1.00978 | -0.00925 | 15,527,172 B |
| 999  | 1.00146 | 1.01986 | 1.01056 | -0.00930 | 15,524,066 B |
| Mean | 1.00146 | 1.02026 | 1.01080 (std 0.00115) | -0.00946 | 15,531,689 B |

Seed 314 alone (1.00978) is lower than PR #1698's entire 3-seed mean (1.00995), confirming the brotli swap is quality-neutral.

Direct comparison with PR #1698 (same code + TTT recipe; different compressor)

| | PR #1698 (zstd-22) | This PR (brotli-11) |
|---|---|---|
| Seed 42 artifact  | 16,600,916 B (INVALID, +601KB over cap) | 15,543,829 B (VALID) |
| Seed 314 artifact | 16,548,775 B (INVALID, +549KB over cap) | 15,527,172 B (VALID) |
| Seed 999 artifact | 16,474,250 B (INVALID, +474KB over cap) | 15,524,066 B (VALID) |
| 3-seed val_bpb    | 1.00995 | 1.01080 (+0.00085, within seed noise) |

Compliance (Issue #1017 Track A)

  • Condition 1 (Causality): standard causal attention (GDN has no softmax, but is still strictly left-to-right)
  • Condition 2 (Normalized): standard softmax over full vocab at the LM head
  • Condition 3 (Score-before-update): each 32K-token chunk is fully scored under torch.inference_mode() BEFORE any SGD update (identical to PR #1698; sketched below)
  • Condition 4 (Single pass): each token scored exactly once; no rescoring across passes
  • No SLOT, no n-gram cache, no ETLB, no pre-quant TTT
  • All 3 artifacts < 16,000,000 bytes (max 15,543,829 B)
  • All 3 trainings ≤ 600s wallclock-capped
  • All 3 evals < 600s

The ttt_epochs=3 multi-epoch SGD on already-scored tokens matches PR #1698 exactly, and the same pattern is used in PR #1700 and merged SOTA PR #1493. The interpretation of Condition 4 vs multi-epoch post-score training is pending organizer clarification (raised by @dexhunter on PR #1698) — if ruled against, both this PR and #1698 would need ttt_epochs=1.
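To make Conditions 3 and 4 concrete, a minimal sketch of the per-chunk loop under stated assumptions: model.score and model.adapt_loss are placeholder names standing in for the real scoring and training-loss paths in train_gdn_7k.py, and chunks yields 32,768-token chunks:

```python
import torch

def score_first_ttt(model, optimizer, chunks, ttt_epochs=3):
    """Sketch of the score-first TTT loop; score()/adapt_loss() are placeholders."""
    loss_sum, token_count = 0.0, 0
    for chunk in chunks:
        # Condition 3: the chunk is fully scored BEFORE any weight update.
        with torch.inference_mode():
            chunk_loss_sum, n = model.score(chunk)   # placeholder scoring API
        loss_sum += chunk_loss_sum
        token_count += n
        # Multi-epoch SGD on the already-scored chunk (the Condition 4 question):
        # these passes adapt the weights for FUTURE chunks only; the chunk
        # above is never rescored.
        for _ in range(ttt_epochs):
            optimizer.zero_grad(set_to_none=True)
            model.adapt_loss(chunk).backward()       # placeholder training-loss API
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
    return loss_sum / token_count  # per-token loss; byte normalization handled elsewhere
```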

What's new vs PR #1698

  1. Compression: zstandard.ZstdCompressor(level=22) → brotli.compress(quality=11) — the single-line change (plus matching decompressor) that makes the recipe valid. Brotli is already used in merged SOTA PR #1493; this PR just brings it to the GDN codebase.
  2. Macro-phase SGD TTT hook (from PR #1700's Multi-Phase Global SGD design): infrastructure added (TTT_MACRO_PHASES, TTT_MACRO_EPOCHS, TTT_MACRO_LR_MULT), but disabled in the scored run (TTT_MACRO_PHASES=0). On seed 42 I measured it head-to-head: macro-phases=4 gave TTT gain -0.00999; macro-phases=0 gave -0.01012 — within seed noise, so I did not use it and flagged this honestly in the README. Left in place for future tuning.

No other changes — architecture (K_KVShare_Wider, 10-layer GDN, 544d, 8H, KV-share stride=2), training (Muon + Adam, EMA 0.997, SWA, Late QAT, 7000-step ceiling), and TTT (score-first SGD lr=0.005, 3 epochs/chunk, freeze first 2 blocks) are identical to PR #1698.

Reproduction

```bash
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install numpy sentencepiece zstandard brotli triton==3.5.1
pip install flash-linear-attention==0.4.2 fla-core==0.4.2 transformers==5.5.4 tokenizers==0.22.2 safetensors==0.7.0

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
  python3 data/cached_challenge_fineweb.py --variant sp8192

for seed in 42 314 999; do
  SEED=$seed \
  ARCH_MODE=K VOCAB_SIZE=8192 \
  DATA_PATH=./data/datasets/fineweb10B_sp8192 \
  TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
  MAX_WALLCLOCK_SECONDS=600 \
  INT6_CLIP_RANGE=31 \
  COMPRESSOR=brotli \
  TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  TTT_CHUNK_TOKENS=32768 TTT_FREEZE_BLOCKS=2 TTT_MOMENTUM=0.9 \
  TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
  TTT_MACRO_PHASES=0 \
  torchrun --standalone --nproc_per_node=8 train_gdn_7k.py
done
```
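Afterwards, a quick sanity check on the decimal cap (the artifact naming below is an assumption; substitute the run's actual output files):

```python
import os

CAP = 16_000_000  # decimal byte cap, not 16 MiB

for seed in (42, 314, 999):
    path = f"artifacts/seed{seed}.br"  # assumed naming; adjust to the real outputs
    size = os.path.getsize(path)
    print(f"seed {seed}: {size:,} B ({'VALID' if size < CAP else 'OVER CAP'})")
```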

Attribution

Test plan

  • 3-seed training on 8×H100 SXM
  • All artifacts < 16,000,000 bytes (max 15,543,829 B)
  • Brotli roundtrip validated — quantized BPB matches on-disk decompressed BPB
  • Head-to-head ablation of macro-phase (on seed 42) documented; disabled in scored run

🤖 Generated with Claude Code

…ed mean)

3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:
  - seed 42:  1.01205
  - seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
  - seed 999: 1.01056
Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB.

Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD
design but disabled in the scored run (ttt_macro_phases=0) — on seed 42
it was a wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012 TTT gain),
so not worth the extra eval time.

All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
@yahya010
Author

Self-disclosure: this PR has a BPB-computation bug inherited from PR #1698

I ran an empirical canonical byte-count check on the eval-time LUT + accumulator combination in train_gdn_7k.py and confirmed the same issue that has been flagged elsewhere in the GDN-inheritance family:

```python
# build_sentencepiece_luts (line ~216): +1 baked into LUT for leading-space tokens
base_bytes[i] = len(piece[1:].encode("utf-8")) + 1

# eval_val_sliding / eval_val_ttt_gdn (lines ~375, ~549): +1 added AGAIN at eval
tb = base_bytes_lut[tgt].to(torch.float64)
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
```

Empirical check on the first ~1M val tokens (SP8192, seed-0 val shard):

| | bytes |
|---|---|
| Canonical (sp.decode_ids(...).encode('utf-8')) | 3,673,745 |
| Buggy code (this PR, inherited from #1698)     | 4,315,102 |
| Ratio (buggy / canonical)                      | 1.1746 |
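For reference, a sketch of the canonical side of that check (assumes the SP8192 tokenizer from the repro steps; token_ids is the flat list of val-shard token ids):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_8192_bpe.model")

def canonical_bytes(token_ids):
    """UTF-8 byte length of the decoded text; the denominator val_bpb should use."""
    return len(sp.decode_ids(token_ids).encode("utf-8"))

# The buggy path overcounts because the leading-space +1 is baked into
# base_bytes at LUT-build time AND re-added per token at eval time.
```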

That means the reported val_bpb = 1.01080 corresponds to a canonical val_bpb ≈ 1.1873, which fails the merged-SOTA threshold (1.0738) by a wide margin.

The brotli-11 compression swap in this PR is still a legitimate artifact-size fix for PR #1698 — but the underlying claimed BPB is not comparable to records from the non-GDN line (merged SOTA PR #1493, PR #1700, etc.) which use the +1-only-at-eval convention.

Closing this PR. My other open submission, #1727 (MP-SGD phases=4 + QK 5.25, val_bpb 1.07217), is based on PR #1700's codebase which uses the correct LUT (+1 only at eval) and is not affected by this bug.

Apologies for the noise.

@yahya010
Author

Closing based on the byte-LUT double-count finding in the comment above. The brotli-11 artifact-size fix is valid, but the claimed 1.01080 is inflated by ~17.46% vs canonical (~1.187 canonical). Not submitting a BPB-fixed rerun because canonical would not beat the merged-SOTA threshold. #1727 (val_bpb 1.07217 on PR #1700's non-buggy base) remains open.

@yahya010 closed this Apr 19, 2026
@abi2024

abi2024 commented Apr 24, 2026

Hi yahya010,

I built a static LUT audit tool inspired by your PR #1734 closure note. Before I file it as a non-record PR, I wanted to share the finding with you directly.

Tool: canonical_rescore.py — statically inspects build_sentencepiece_luts in any submission's train_gpt.py and classifies it as CORRECT / BUGGY / OBFUSCATED / UNKNOWN.

Detects three LUT deviation patterns (a simplified sketch follows the list):

  1. baking +1 into leading-space tokens [the pattern you originally reported]
  2. byte-token sizing using len(piece.encode("utf-8")) instead of literal 1
  3. boundary predicate missing sp.is_unused

No GPU, no model reproduction required.
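A simplified sketch of the classification idea (illustrative only, not the full tool; the regexes shown cover just patterns 1 and 3):

```python
import re

def classify_lut(source: str) -> str:
    """Crude static classification of build_sentencepiece_luts in a submission file."""
    m = re.search(r"def build_sentencepiece_luts\b.*?(?=\ndef |\Z)", source, re.S)
    if m is None:
        return "UNKNOWN"        # function missing or renamed
    body = m.group(0)
    if re.search(r"\bexec\(|\beval\(|\bgetattr\(", body):
        return "OBFUSCATED"     # dynamic code: cannot characterize statically
    # Pattern 1: +1 baked into the LUT at build time (belongs only at eval time).
    if re.search(r"encode\(.utf-8.\)\)\s*\+\s*1", body):
        return "BUGGY"
    # Pattern 3: boundary predicate must consult sp.is_unused.
    if "is_unused" not in body:
        return "BUGGY"
    return "CORRECT"

# Usage: classify_lut(open("train_gpt.py").read())
```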

Empirical finding on your ratio:
My tool computes 1.1671 on the sliding-window scored subset that PR #1727's eval_val_sliding uses.
Your closure quoted 1.1746.

I initially expected the gap to be a scoring-strategy artifact, but tracing more carefully showed that your train_gdn_7k.py LUT has two shape differences from the #1727 canonical implementation (the +1 you disclosed, plus an is_unused difference in the boundary predicate).

Running your exact LUT against the same val stream gives 1.1770, within ~0.2% of your quoted 1.1746 — so your number appears consistent with your own LUT, and the 0.75% gap to my tool's default is a genuine LUT-shape difference rather than anyone's measurement error.

Full analysis in audit/methodology.md §4 at https://github.com/abi2024/agent-pgolf.

Result of applying the tool to top-10 open PRs (2026-04-23):
6 CORRECT, 4 OBFUSCATED, 0 BUGGY. The bug family does not currently appear in plain-text code at the top of the leaderboard — only the unverifiable obfuscated entries cannot be characterized.

Planned submission: A non-record PR to openai/parameter-golf with the tool, methodology doc, and current audit snapshot. Framed as a tooling contribution for future submitters to self-check, with your #1734 closure cited as the origin throughout.

Anything you'd push back on before I file? Also happy to credit you as co-author of the original discovery if you'd prefer that framing.

— abi2024

