
Record: GatedDeltaNet + Legal TTT + Brotli-11 — val_bpb 1.01080 (3-seed mean, VALID artifacts) #1734

Closed
yahya010 wants to merge 1 commit into openai:main from yahya010:submit/yahya010-gdn-brotli-valid

Conversation

@yahya010

Summary

3-seed mean val_bpb = 1.01080 (std 0.00115), seeds 42/314/999, beating merged SOTA (1.0810, @bigbag PR #1493) by 0.07020 BPB / ~0.04867 nats — clears the 0.005-nat / ~0.0072 BPB threshold by a 10x margin.
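For the unit conversion behind those numbers: the 0.005-nat threshold is 0.005 / ln 2 ≈ 0.0072 BPB, and the 0.07020 BPB gap is 0.07020 × ln 2 ≈ 0.0487 nats, i.e. roughly 9.7× the threshold.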

This is the first VALID sub-1.02 BPB submission. Base is @arsenis-cmd's PR #1698 (GatedDeltaNet + Legal Score-First TTT, 1.00995 BPB), which is currently ineligible for merge because all three of its artifacts exceed the 16,000,000-byte decimal cap (16.47-16.60 MB). @sergeevii123 flagged this on the PR; @arsenis-cmd acknowledged and proposed reducing clip_range 31→24 as the fix — that does shrink artifacts but it also adds ~+0.014 BPB quantization penalty because more weights get clipped.

This PR takes a different fix: keep clip_range=31 (no extra quant penalty) and replace zstandard-22 with brotli-11 for artifact compression. Brotli is ~6% smaller on this int6-GPTQ byte distribution (already the compressor used in merged SOTA PR #1493), bringing all 3 artifacts comfortably under 16,000,000 bytes with zero quality loss.
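For illustration, a minimal sketch of the swap (the artifact path is an assumption; only the two compressor calls and the roundtrip check matter):

```python
import brotli
import zstandard

# Illustrative path: stand-in for the int6-GPTQ quantized weight blob
# actually written by train_gdn_7k.py.
with open("artifacts/seed42_int6_gptq.bin", "rb") as f:
    raw = f.read()

# PR #1698's compressor: zstd at its maximum level.
zstd_bytes = zstandard.ZstdCompressor(level=22).compress(raw)

# This PR's compressor: brotli at its maximum quality.
br_bytes = brotli.compress(raw, quality=11)

print(f"zstd-22:   {len(zstd_bytes):,} B")
print(f"brotli-11: {len(br_bytes):,} B")

# Zero quality loss: decompression must reproduce the exact input bytes.
assert brotli.decompress(br_bytes) == raw
```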

Results (8×H100 80GB SXM, torch 2.9.1+cu128)

| Seed | EMA BPB | Pre-TTT BPB | Post-TTT BPB | TTT Gain | Artifact |
|------|---------|-------------|--------------|----------|----------|
| 42   | 1.00257 | 1.02189 | 1.01205 | -0.00984 | 15,543,829 B |
| 314  | 1.00033 | 1.01903 | 1.00978 | -0.00925 | 15,527,172 B |
| 999  | 1.00146 | 1.01986 | 1.01056 | -0.00930 | 15,524,066 B |
| Mean | 1.00146 | 1.02026 | 1.01080 (std 0.00115) | -0.00946 | 15,531,689 B |

Seed 314 alone (1.00978) is lower than PR #1698's entire 3-seed mean (1.00995), confirming the brotli swap is quality-neutral.

Direct comparison with PR #1698 (same code + TTT recipe; different compressor)

| | PR #1698 (zstd-22) | This PR (brotli-11) |
|---|---|---|
| Seed 42 artifact  | 16,600,916 B (INVALID, +601KB over cap) | 15,543,829 B (VALID) |
| Seed 314 artifact | 16,548,775 B (INVALID, +549KB over cap) | 15,527,172 B (VALID) |
| Seed 999 artifact | 16,474,250 B (INVALID, +474KB over cap) | 15,524,066 B (VALID) |
| 3-seed val_bpb    | 1.00995 | 1.01080 (+0.00085, within seed noise) |

Compliance (Issue #1017 Track A)

  • Condition 1 (Causality): standard causal attention (GDN has no softmax, but is still strictly left-to-right)
  • Condition 2 (Normalized): standard softmax over full vocab at the LM head
  • Condition 3 (Score-before-update): each 32K-token chunk is fully scored under torch.inference_mode() BEFORE any SGD update (identical to PR #1698; sketched below)
  • Condition 4 (Single pass): each token scored exactly once; no rescoring across passes
  • No SLOT, no n-gram cache, no ETLB, no pre-quant TTT
  • All 3 artifacts < 16,000,000 bytes (max 15,543,829 B)
  • All 3 trainings ≤ 600s wallclock-capped
  • All 3 evals < 600s

The ttt_epochs=3 multi-epoch SGD on already-scored tokens matches PR #1698 exactly, and the same pattern is used in PR #1700 and merged SOTA PR #1493. The interpretation of Condition 4 vs multi-epoch post-score training is pending organizer clarification (raised by @dexhunter on PR #1698) — if ruled against, both this PR and #1698 would need ttt_epochs=1.
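To make Conditions 3 and 4 concrete, a minimal sketch of the per-chunk loop under stated assumptions: model.score and model.adapt_loss are placeholder names standing in for the real scoring and training-loss paths in train_gdn_7k.py, and chunks yields 32,768-token chunks:

```python
import torch

def score_first_ttt(model, optimizer, chunks, ttt_epochs=3):
    """Sketch of the score-first TTT loop; score()/adapt_loss() are placeholders."""
    loss_sum, token_count = 0.0, 0
    for chunk in chunks:
        # Condition 3: the chunk is fully scored BEFORE any weight update.
        with torch.inference_mode():
            chunk_loss_sum, n = model.score(chunk)   # placeholder scoring API
        loss_sum += chunk_loss_sum
        token_count += n
        # Multi-epoch SGD on the already-scored chunk (the Condition 4 question):
        # these passes adapt the weights for FUTURE chunks only; the chunk
        # above is never rescored.
        for _ in range(ttt_epochs):
            optimizer.zero_grad(set_to_none=True)
            model.adapt_loss(chunk).backward()       # placeholder training-loss API
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
    return loss_sum / token_count  # per-token loss; byte normalization handled elsewhere
```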

What's new vs PR #1698

  1. Compression: zstandard.ZstdCompressor(level=22) → brotli.compress(quality=11) — the single-line change (plus matching decompressor) that makes the recipe valid. Brotli is already used in merged SOTA PR #1493; this PR just brings it to the GDN codebase.
  2. Macro-phase SGD TTT hook (from PR #1700's Multi-Phase Global SGD design): infrastructure added (TTT_MACRO_PHASES, TTT_MACRO_EPOCHS, TTT_MACRO_LR_MULT), but disabled in the scored run (TTT_MACRO_PHASES=0). On seed 42 I measured it head-to-head: macro-phases=4 gave TTT gain -0.00999; macro-phases=0 gave -0.01012 — within seed noise, so I did not use it and flagged this honestly in the README. Left in place for future tuning.

No other changes — architecture (K_KVShare_Wider, 10-layer GDN, 544d, 8H, KV-share stride=2), training (Muon + Adam, EMA 0.997, SWA, Late QAT, 7000-step ceiling), and TTT (score-first SGD lr=0.005, 3 epochs/chunk, freeze first 2 blocks) are identical to PR #1698.

Reproduction

```bash
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install numpy sentencepiece zstandard brotli triton==3.5.1
pip install flash-linear-attention==0.4.2 fla-core==0.4.2 transformers==5.5.4 tokenizers==0.22.2 safetensors==0.7.0

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
  python3 data/cached_challenge_fineweb.py --variant sp8192

for seed in 42 314 999; do
  SEED=$seed \
  ARCH_MODE=K VOCAB_SIZE=8192 \
  DATA_PATH=./data/datasets/fineweb10B_sp8192 \
  TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
  MAX_WALLCLOCK_SECONDS=600 \
  INT6_CLIP_RANGE=31 \
  COMPRESSOR=brotli \
  TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  TTT_CHUNK_TOKENS=32768 TTT_FREEZE_BLOCKS=2 TTT_MOMENTUM=0.9 \
  TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
  TTT_MACRO_PHASES=0 \
  torchrun --standalone --nproc_per_node=8 train_gdn_7k.py
done
```
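Afterwards, a quick sanity check on the decimal cap (the artifact naming below is an assumption; substitute the run's actual output files):

```python
import os

CAP = 16_000_000  # decimal byte cap, not 16 MiB

for seed in (42, 314, 999):
    path = f"artifacts/seed{seed}.br"  # assumed naming; adjust to the real outputs
    size = os.path.getsize(path)
    print(f"seed {seed}: {size:,} B ({'VALID' if size < CAP else 'OVER CAP'})")
```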

Attribution

Test plan

  • 3-seed training on 8×H100 SXM
  • All artifacts < 16,000,000 bytes (max 15,543,829 B)
  • Brotli roundtrip validated — quantized BPB matches on-disk decompressed BPB
  • Head-to-head ablation of macro-phase (on seed 42) documented; disabled in scored run

🤖 Generated with Claude Code

…ed mean)

3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:
  - seed 42:  1.01205
  - seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
  - seed 999: 1.01056
Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB.

Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD
design but disabled in the scored run (ttt_macro_phases=0) — on seed 42
it was a wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012 TTT gain),
so not worth the extra eval time.

All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
@yahya010
Author

Self-disclosure: this PR has a BPB-computation bug inherited from PR #1698

I ran an empirical canonical byte-count check on the eval-time LUT + accumulator combination in train_gdn_7k.py and confirmed the same issue that has been flagged elsewhere in the GDN-inheritance family:

```python
# build_sentencepiece_luts (line ~216): +1 baked into LUT for leading-space tokens
base_bytes[i] = len(piece[1:].encode("utf-8")) + 1

# eval_val_sliding / eval_val_ttt_gdn (lines ~375, ~549): +1 added AGAIN at eval
tb = base_bytes_lut[tgt].to(torch.float64)
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)
```

Empirical check on the first ~1M val tokens (SP8192, seed-0 val shard):

| | bytes |
|---|---|
| Canonical (sp.decode_ids(...).encode('utf-8')) | 3,673,745 |
| Buggy code (this PR, inherited from #1698)     | 4,315,102 |
| Ratio (buggy / canonical)                      | 1.1746 |
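For reference, a sketch of the canonical side of that check (assumes the SP8192 tokenizer from the repro steps; token_ids is the flat list of val-shard token ids):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_8192_bpe.model")

def canonical_bytes(token_ids):
    """UTF-8 byte length of the decoded text; the denominator val_bpb should use."""
    return len(sp.decode_ids(token_ids).encode("utf-8"))

# The buggy path overcounts because the leading-space +1 is baked into
# base_bytes at LUT-build time AND re-added per token at eval time.
```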

That means the reported val_bpb = 1.01080 corresponds to a canonical val_bpb ≈ 1.1873, which fails the merged-SOTA threshold (1.0738) by a wide margin.

The brotli-11 compression swap in this PR is still a legitimate artifact-size fix for PR #1698 — but the underlying claimed BPB is not comparable to records from the non-GDN line (merged SOTA PR #1493, PR #1700, etc.) which use the +1-only-at-eval convention.

Closing this PR. My other open submission, #1727 (MP-SGD phases=4 + QK 5.25, val_bpb 1.07217), is based on PR #1700's codebase which uses the correct LUT (+1 only at eval) and is not affected by this bug.

Apologies for the noise.

@yahya010
Author

Closing based on the byte-LUT double-count finding in the comment above. The brotli-11 artifact-size fix is valid, but the claimed 1.01080 is inflated by ~17.46% vs canonical (~1.187 canonical). Not submitting a BPB-fixed rerun because canonical would not beat the merged-SOTA threshold. #1727 (val_bpb 1.07217 on PR #1700's non-buggy base) remains open.

@yahya010 closed this Apr 19, 2026
@abi2024

abi2024 commented Apr 24, 2026

Hi yahya010,

I built a static LUT audit tool inspired by your PR #1734 closure note. Before I file it as a non-record PR, I wanted to share the finding with you directly.

Tool: canonical_rescore.py — statically inspects build_sentencepiece_luts in any submission's train_gpt.py and classifies it as CORRECT / BUGGY / OBFUSCATED / UNKNOWN.

Detects three LUT deviation patterns (a simplified sketch follows the list):

  1. baking +1 into leading-space tokens [the pattern you originally reported]
  2. byte-token sizing using len(piece.encode("utf-8")) instead of literal 1
  3. boundary predicate missing sp.is_unused

No GPU, no model reproduction required.
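A simplified sketch of the classification idea (illustrative only, not the full tool; the regexes shown cover just patterns 1 and 3):

```python
import re

def classify_lut(source: str) -> str:
    """Crude static classification of build_sentencepiece_luts in a submission file."""
    m = re.search(r"def build_sentencepiece_luts\b.*?(?=\ndef |\Z)", source, re.S)
    if m is None:
        return "UNKNOWN"        # function missing or renamed
    body = m.group(0)
    if re.search(r"\bexec\(|\beval\(|\bgetattr\(", body):
        return "OBFUSCATED"     # dynamic code: cannot characterize statically
    # Pattern 1: +1 baked into the LUT at build time (belongs only at eval time).
    if re.search(r"encode\(.utf-8.\)\)\s*\+\s*1", body):
        return "BUGGY"
    # Pattern 3: boundary predicate must consult sp.is_unused.
    if "is_unused" not in body:
        return "BUGGY"
    return "CORRECT"

# Usage: classify_lut(open("train_gpt.py").read())
```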

Empirical finding on your ratio:
My tool computes 1.1671 on the sliding-window scored subset that PR #1727's eval_val_sliding uses.
Your closure quoted 1.1746.

I initially expected the gap to be a scoring-strategy artifact, but tracing more carefully showed that your train_gdn_7k.py LUT has two shape differences from the #1727 canonical implementation (the +1 you disclosed, plus an is_unused difference in the boundary predicate).

Running your exact LUT against the same val stream gives 1.1770, within ~0.2% of your quoted 1.1746 — so your number appears consistent with your own LUT, and the 0.75% gap to my tool's default is a genuine LUT-shape difference rather than anyone's measurement error.

Full analysis in audit/methodology.md §4 at https://github.com/abi2024/agent-pgolf.

Result of applying the tool to top-10 open PRs (2026-04-23):
6 CORRECT, 4 OBFUSCATED, 0 BUGGY. The bug family does not currently appear in plain-text code at the top of the leaderboard — only the unverifiable obfuscated entries cannot be characterized.

Planned submission: A non-record PR to openai/parameter-golf with the tool, methodology doc, and current audit snapshot. Framed as a tooling contribution for future submitters to self-check, with your #1734 closure cited as the origin throughout.

Anything you'd push back on before I file? Also happy to credit you as co-author of the original discovery if you'd prefer that framing.

— abi2024

