Record: SP8192 ParResid 3LayerLoop QK5.25 LegalTTT — 1.08083 BPB by anmarhindi · Pull Request #1776 · openai/parameter-golf

anmarhindi · 2026-04-22T14:59:39Z

Summary

val_bpb (TTT) = 1.08083 (3-seed mean, std 0.00062) | ~15.97 MB | 8xH100 SXM

Seed	Pre-Quant	Sliding	TTT	Artifact (bytes)
1337	1.08609	1.08083	1.08032	15,971,929
42	1.08733	1.08214	1.08152	15,973,790
7	1.08635	1.08135	1.08064	15,971,863
Mean	1.08659	1.08144	1.08083	15,972,527
Std	0.00064	0.00066	0.00062
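
The Mean and Std rows for the TTT column can be re-derived from the per-seed values. The sketch below assumes the Std row is a sample standard deviation (n-1 denominator), which the PR does not state. Because it starts from the rounded table entries, other columns may differ from the table in the last digit.

```python
from statistics import mean, stdev

# Per-seed values copied from the table above.
ttt_bpb = {1337: 1.08032, 42: 1.08152, 7: 1.08064}
artifact_bytes = {1337: 15_971_929, 42: 15_973_790, 7: 15_971_863}

ttt_mean = mean(ttt_bpb.values())
ttt_std = stdev(ttt_bpb.values())  # sample std (n-1); an assumption about the table
artifact_mean = mean(artifact_bytes.values())

print(f"TTT mean  = {ttt_mean:.5f}")
print(f"TTT std   = {ttt_std:.5f}")
print(f"Artifact  = {round(artifact_mean):,} bytes (mean)")
```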

SP8192 + GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0), zero pruning
3-Layer Depth Recurrence: loops layers 3,4,5 twice, activates at frac=0.35
Parallel Residuals (layers 7+): GPT-J style, attention and MLP share the same
pre-residual input
QK-Gain 5.25: learnable per-head query scaling
Legal Score-First TTT: SGD (lr=0.005, momentum=0.9), 3 epochs per 32K-token chunk,
freeze first 9 blocks, cross-rank dist.all_reduce on gradients, cosine LR decay
Tuned HPs: MUON_WD=0.095, MATRIX_LR=0.022, EMA=0.9965, WARMDOWN_FRAC=0.72
FA3/SDPA backend switch: USE_FA3=1 uses flash_attn_3_func on Hopper; USE_FA3=0
falls back to F.scaled_dot_product_attention(..., enable_gqa=True) for non-Hopper GPUs
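
As a rough illustration of the depth-recurrence and parallel-residual items above, here is a minimal pure-Python sketch. It is not the PR's code. It assumes that "twice" means two passes in total through layers 3, 4, 5, and that frac=0.35 refers to training progress. The 11-layer count, `layer_schedule`, and `block_forward` are hypothetical, and normalization layers are omitted.

```python
from typing import Callable, List


def layer_schedule(num_layers: int, loop_layers: List[int],
                   passes: int, progress: float,
                   activate_frac: float = 0.35) -> List[int]:
    """Return the order in which layer indices run for one forward pass."""
    if progress < activate_frac:
        # Assumption: before the activation point the stack runs once, in order.
        return list(range(num_layers))
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))              # layers before the loop
    for _ in range(passes):                 # repeat the looped span
        order.extend(loop_layers)
    order.extend(range(last + 1, num_layers))  # layers after the loop
    return order


def block_forward(x: float,
                  attn: Callable[[float], float],
                  mlp: Callable[[float], float],
                  parallel: bool) -> float:
    """One residual block on a scalar stand-in for the hidden state."""
    if parallel:
        # GPT-J style: attention and MLP both read the same pre-residual input.
        return x + attn(x) + mlp(x)
    # Sequential style: the MLP reads the output of the attention residual.
    h = x + attn(x)
    return h + mlp(h)


# Hypothetical 11-layer model, late enough in training for the loop to be on.
print(layer_schedule(11, [3, 4, 5], passes=2, progress=0.5))
print(block_forward(1.0, lambda v: 2 * v, lambda v: v + 1, parallel=True))
print(block_forward(1.0, lambda v: 2 * v, lambda v: v + 1, parallel=False))
```

The two `block_forward` calls differ only in what the MLP sees as input, which is the whole distinction between the two residual layouts.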

Causality: strict sliding-window causal eval
Normalized distribution: standard softmax over full vocab, no n-gram cache, no logit
biasing
Score before update: each 32K-token chunk fully scored under torch.no_grad() BEFORE any
SGD update
Single pass: each token scored exactly once, no rescoring / multi-pass selection
No SLOT, no pre-quant TTT, no ETLB, no n-gram cache or tilt
Artifact ≤ 16 MB on all 3 seeds (max 15,973,790 B)
Training ≤ 600 s on all 3 seeds (~588 s actual)
Eval (sliding + TTT) ≤ 600 s on all 3 seeds (~545 s actual)
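
The score-before-update and single-pass rules above amount to a fixed ordering per chunk: score first, adapt afterwards. The sketch below shows only that control flow with toy callables. `score_first_ttt`, `toy_score`, and `toy_update` are hypothetical names, and applying the cosine decay across chunks is an assumption about how the schedule is used. In the PR the scoring step runs under torch.no_grad().

```python
import math
from typing import Callable, List, Sequence, Tuple


def score_first_ttt(tokens: Sequence[int], chunk_size: int,
                    score_fn: Callable[[Sequence[int]], float],
                    update_fn: Callable[[Sequence[int], float], None],
                    epochs: int = 3, base_lr: float = 0.005) -> Tuple[float, int]:
    """Score each chunk once with the current weights, then adapt on it."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    total_loss, scored = 0.0, 0
    for idx, chunk in enumerate(chunks):
        # 1) Score the full chunk BEFORE any update that has seen it.
        total_loss += score_fn(chunk)
        scored += len(chunk)
        # 2) Only now train on the chunk that was just scored.
        #    Assumed schedule: cosine decay of the LR over the chunk index.
        lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * idx / len(chunks)))
        for _ in range(epochs):
            update_fn(chunk, lr)
    return total_loss, scored


# Toy stand-ins that only record the order of operations.
events: List[str] = []

def toy_score(chunk: Sequence[int]) -> float:
    events.append("score")
    return float(len(chunk))  # placeholder loss: one unit per token

def toy_update(chunk: Sequence[int], lr: float) -> None:
    events.append("update")

total, n_scored = score_first_ttt(list(range(10)), 4, toy_score, toy_update)
print(n_scored, events[:4])
```

Every token contributes to `total_loss` exactly once, and no update on a chunk happens before that chunk's score has been recorded.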

Record: SP8192 ParResid 3LayerLoop QK5.25 LegalTTT — 1.08083 BPB

8fe421d