Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344
Open
Omrigotlieb wants to merge 1 commit into openai:main from
Conversation
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 5, 2026
almutwakel added a commit to almutwakel/parameter-golf that referenced this pull request on Apr 8, 2026
… BPB Systems-only optimization: replace standard Newton-Schulz with Gram Newton-Schulz in the Muon optimizer. Mathematically equivalent (same fixed point), +3.4% throughput. Based on PR openai#1344 (Omrigotlieb).
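The equivalence is straightforward: each Newton-Schulz step multiplies X on the right by a polynomial p(G) of the Gram matrix G = XᵀX, and since p(G) commutes with G, the whole iteration can run on the small n×n Gram matrix, applying the accumulated polynomial product to X once at the end. A minimal sketch of the idea (function names and the single-matrix XᵀX layout are illustrative, not the commit's actual kernel):

```python
import torch

NS_COEFFS = (3.4445, -4.7750, 2.0315)  # stock Muon quintic tuple

def ns_standard(X, steps=5, eps=1e-7):
    # Stock iteration: X <- X @ p(G) with G = X.T @ X and
    # p(G) = a*I + b*G + c*G@G; every step does m x n work.
    a, b, c = NS_COEFFS
    X = X / (X.norm() + eps)
    for _ in range(steps):
        G = X.T @ X
        X = a * X + X @ (b * G + c * G @ G)
    return X

def ns_gram(X, steps=5, eps=1e-7):
    # Gram variant: since X_{k+1} = X_k @ p(G_k) and p(G) commutes with G,
    # G_{k+1} = G_k @ p(G_k)^2. Iterate on the n x n Gram matrix only,
    # accumulate P = p(G_0) @ ... @ p(G_{k-1}), then form X @ P once.
    # Same fixed point as ns_standard; cheaper when m >> n.
    a, b, c = NS_COEFFS
    X = X / (X.norm() + eps)
    n = X.shape[1]
    I = torch.eye(n, dtype=X.dtype, device=X.device)
    G = X.T @ X
    P = I.clone()
    for _ in range(steps):
        poly = a * I + b * G + c * G @ G
        P = P @ poly
        G = G @ poly @ poly
    return X @ P
```

In low precision the two paths won't be bit-identical (different multiplication orders), but they converge to the same orthogonal factor, which is what the "mathematically equivalent, +3.4% throughput" claim rests on.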
mradassaad added a commit to mradassaad/parameter-golf that referenced this pull request on Apr 9, 2026
Stateful eval was previously flagged as harmful on the grounds that INT6 quant errors accumulate in SSM recurrent state. Measurement shows the quant delta is actually flat at ~8.2 mBPB across windows of 100-1892 tokens — no accumulation. The real cause of the pure-stateful BF16 regression was attention context loss at window boundaries. Stateful-overlap eval with overlap=1024 closes the gap to sliding-window eval to within 0.3 mBPB while running in ~32s vs 500s, freeing 468s of eval budget for SLOT/TTT.

Also corrects merged SOTA to 1.1147 BPB (PR openai#1019), flags PRs openai#1329/openai#1344/openai#1430 as unmerged/invalid, and revises the SLOT estimate from 50-150 to 15-30 mBPB based on capacity-regularization reasoning.
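To make the bookkeeping concrete, here is a minimal sketch of what a stateful-overlap evaluator could look like. The `forward_eval` API (per-token NLLs plus a recurrent-state snapshot taken partway into the chunk) is hypothetical, as is the loop structure; the point is that the overlap tokens restore attention context at each boundary while the state checkpoints are spaced so that, along the checkpoint chain, no token updates the SSM state twice:

```python
import math

def eval_stateful_overlap(model, tokens, window=1892, overlap=1024):
    """Sketch only. model.forward_eval(chunk, state, snapshot_at) is a
    hypothetical API returning (per-token NLLs in nats, the recurrent-state
    snapshot after the first `snapshot_at` tokens of `chunk`)."""
    stride = window - overlap            # fresh tokens scored per step
    state, ctx = None, 0                 # `state` is "after tokens[:ctx]"
    nll_sum, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        chunk = tokens[ctx:end]          # overlap context + fresh tokens
        next_ctx = max(0, end - overlap) # where the next window's state sits
        nlls, state = model.forward_eval(chunk, state=state,
                                         snapshot_at=next_ctx - ctx)
        nll_sum += sum(nlls[start - ctx:])  # score only the fresh tokens
        n_scored += end - start
        ctx = next_ctx
    # bits per token; a BPB figure would divide by byte count instead
    return nll_sum / n_scored / math.log(2)
```

The invariant is that `state` always corresponds to the prefix `tokens[:ctx]`, so each checkpoint-to-checkpoint segment passes through the recurrent state exactly once even though the attention window re-reads the overlap region.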
AjAnubolu added a commit to AjAnubolu/parameter-golf that referenced this pull request on Apr 14, 2026
Layers 3, 4, 5 share MLP weights; attention weights stay unique per layer. Weight decay bumped to 0.09 (from 0.04) to regularize the shared MLP. Based on PRs openai#1334/openai#1344, which report 1.089-1.092 BPB with this setup.

Why this works now when our prior attempt failed:
- Prior: shared ALL layer weights -> quant error amplified 900x
- Now: share ONLY the MLP, keep attention unique -> per-layer discrimination
- Higher WD regularizes against per-layer overfitting
- Full-Hessian GPTQ correctly accumulates Hessians across sharers

Saves ~6.3 MB of parameters. The reinvest budget is the whole point: wider MLP, larger BigramHash, more unique layers, or higher-precision quantization for critical layers.

GPTQ integration: the forward pass accumulates Hessians under a shared key, quantizes the shared weight once using the combined Hessian, and dedupes in _rebank_state_dict when constructing the export bank.
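A minimal PyTorch sketch of the share-only-MLP wiring, assuming a standard pre-norm block (module layout and dimensions illustrative, not this commit's code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block whose MLP can be shared across layers."""
    def __init__(self, dim, n_heads, shared_mlp=None):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        # Attention stays unique per layer (per-layer discrimination) ...
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # ... while the MLP is either this block's own or a shared module.
        self.mlp = shared_mlp if shared_mlp is not None else nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

dim, n_heads, n_layers = 256, 4, 8
shared = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                       nn.Linear(4 * dim, dim))
blocks = nn.ModuleList(
    Block(dim, n_heads, shared_mlp=shared if i in (3, 4, 5) else None)
    for i in range(n_layers))
# The shared weights appear under one state_dict key per sharing block but
# point at the same tensors; an exporter must dedupe them (the commit's
# _rebank_state_dict) for the ~6.3 MB saving to be realized on disk.
```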
This was referenced Apr 26, 2026
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request on Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); reviewer caught it and the author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet); reviewer @sharpobject caught it.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on the top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of the fixed tuple.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack), since the current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
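A sketch of the two pieces this commit wires in, assuming the structure it describes. The names zeropower_via_newtonschulz5, _PE_COEFFS, _POLAR_EXPRESS_NS, POLAR_EXPRESS_NS, and MIN_LR come from the commit; the first and last Polar Express tuples are the values quoted later in this thread, the middle three are placeholders (not the real minimax schedule), and the warmdown shape is illustrative:

```python
import os
import torch

_FIXED = (3.4445, -4.7750, 2.0315)   # stock Muon tuple, applied every iter
# Per-iteration tuples: endpoints as quoted in this thread; the middle three
# are placeholders; substitute the minimax-tuned Polar Express schedule.
_PE_COEFFS = ((8.16, -22.5, 15.9), _FIXED, _FIXED, _FIXED, (2.35, -1.71, 0.42))
# Read once at import time so torch.compile treats the flag as a constant.
_POLAR_EXPRESS_NS = os.environ.get("POLAR_EXPRESS_NS", "0") == "1"

def zeropower_via_newtonschulz5(G, steps=5, eps=1e-7):
    """Newton-Schulz orthogonalization, optionally with per-iter coefficients."""
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for i in range(steps):
        coeffs = (_PE_COEFFS[min(i, len(_PE_COEFFS) - 1)]
                  if _POLAR_EXPRESS_NS else _FIXED)
        a, b, c = coeffs
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

MIN_LR = float(os.environ.get("MIN_LR", "0.0"))   # 0.10 opts into the floor

def warmdown_lr(step, total_steps, max_lr, warmdown_frac=0.4):
    """Illustrative schedule: linear warmdown that floors at MIN_LR * max_lr
    instead of decaying to zero."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return max_lr
    frac = (total_steps - step) / (total_steps - start)
    return max_lr * max(MIN_LR, frac)
```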
cocohearts pushed a commit that referenced this pull request on Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token. -0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96 params/layer vs dense GatedAttn 4096), preserving the attn_gate_w name so the int8-per-row quant path still routes it (size-range check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training forward; the eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0 (was 4000) together reclaim ~15s of training budget for additional depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max 15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every individual seed beats its PR #1736 counterpart (deltas -1.20 to -2.27 mBPB). Changes are fully orthogonal to PR #1779's frozen recurrent α/β and PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's _loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
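Of the four wins, the sparse head-output gate is the least self-explanatory. A sketch of the pattern, with the class name, shapes, and parameter count illustrative (the commit reports 96 params/layer but doesn't give the exact factorization; only the attn_gate_w name is from the source):

```python
import torch
import torch.nn as nn

class SparseHeadGate(nn.Module):
    """Per-head output gate, modded-nanogpt style: a handful of learned
    scalars per layer instead of a dense [dim, dim] gating matrix."""
    def __init__(self, n_heads: int):
        super().__init__()
        # Keeping the attn_gate_w name means a name-based quantization
        # router (like the int8-per-row path described above) still matches,
        # provided its size-range check admits tiny tensors.
        self.attn_gate_w = nn.Parameter(torch.zeros(n_heads))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: [batch, n_heads, seq, head_dim]; sigmoid(0) = 0.5 at init, so
        # every head starts half-open and learns to open or shut on its own.
        return y * torch.sigmoid(self.attn_gate_w).view(1, -1, 1, 1)

# Usage: gate the per-head attention outputs before the output projection.
y = torch.randn(2, 6, 128, 64)          # [batch, heads, seq, head_dim]
gated = SparseHeadGate(n_heads=6)(y)
```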
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request on Apr 30, 2026
This was referenced May 1, 2026
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on May 3, 2026
…_blend_lr

Iter 117 v1 NaN'd at step 60 (recon 1.2e-4 → 6.8e-2, attn_cv 0.88 → NaN). Root cause: the 0.7% entmax-1.5 contribution at init, combined with no entropy cushion, let CV concentration cascade once routing started concentrating. Three principled fixes in this commit:

1. Annealed blend (H87 v2): blend = (1 - anneal) + sigmoid(blend_logit) * anneal, with anneal ramping 0 → 1 over warmup_delay_frac=0.3. At init (anneal=0) the blend is forced to 1.0 = pure softmax = iter 100b exact (strict-gen verified numerically: max abs diff 0.00 at anneal=0). Avoids cold-start instability. New buffer `_entmax_blend_anneal` + setter, written by the training-loop annealer alongside the variance/entropy schedules.

2. Polar-Express Newton-Schulz Muon coefficients (PR openai#1344 → openai#1787 in records). Per-iter (a, b, c) tuples replace the stock fixed (3.4445, -4.7750, 2.0315). Iter 1: aggressive (8.16, -22.5, 15.9); iter 5: gentle (2.35, -1.71, 0.42). At 10 iters: 0.05 mean rel err vs stock 0.20 — 4× better orthogonalization quality at the same matmul count. IMPORTANT — DEQ STABILITY CONSTRAINT: at backend_steps=5 (records' default) the aggressive iter-1 coefficient overshoots and breaks DEQ reverse reconstruction (smoke recon 1.2e-4 → 6.8e-2). Records are non-DEQ, so they tolerate this. We MUST run at backend_steps=10 — confirmed smoke PASSES. Default lifted 5 → 10. Net throughput cost ~5-10% step_avg; net quality gain 4× orthogonalization. Likely net-positive.

3. Slow LR for blend_logit (entmax_blend_lr=0.002, 10× smaller than scalar_lr=0.02). Mirrors the parcae_lr precedent — params controlling sensitive system dynamics get a slow LR to bound drift rate. Once anneal ramps after warmup_delay, gradient flows but blend_logit drifts 10× slower, so routing has time to adapt to gradually-introduced sparsity. New optimizer group carved from scalar_params via a "_entmax_blend_logit" name filter; coverage assertion updated.

Smoke test PASSED: loss 7.00 → 4.55 at 300 steps with all three fixes active (entmax off by default — needs --use-entmax-routing=1 to enable). CLAUDE.md §5 muon_backend_steps row added documenting the 5 → 10 lift and the DEQ-specific constraint vs records' default.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
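The first and third fixes are compact enough to sketch directly. The names `_entmax_blend_logit`, entmax_blend_lr, scalar_lr, and warmup_delay_frac follow the commit; the function names and the "scalar param" criterion are illustrative:

```python
import torch

def blend_weight(anneal: float, blend_logit: torch.Tensor) -> torch.Tensor:
    """Annealed blend: (1 - anneal) + sigmoid(blend_logit) * anneal.
    At anneal=0 this is exactly 1.0 (pure softmax, matching the pre-entmax
    model, with zero gradient to blend_logit); as anneal ramps to 1 the
    learned logit takes over, gradually introducing entmax-1.5 sparsity."""
    return (1.0 - anneal) + torch.sigmoid(blend_logit) * anneal

def anneal_at(step: int, total_steps: int, warmup_delay_frac: float = 0.3):
    """Linear 0 -> 1 ramp over the first warmup_delay_frac of training."""
    return min(1.0, step / (warmup_delay_frac * total_steps))

def make_scalar_groups(model, scalar_lr=0.02, entmax_blend_lr=0.002):
    """Carve blend_logit out of the scalar params by name filter, giving it
    a 10x slower LR to bound its drift rate."""
    blend, scalars = [], []
    for name, p in model.named_parameters():
        if "_entmax_blend_logit" in name:
            blend.append(p)
        elif p.ndim <= 1:                 # illustrative "scalar" criterion
            scalars.append(p)
    return [{"params": scalars, "lr": scalar_lr},
            {"params": blend, "lr": entmax_blend_lr}]
```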
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 5, 2026
- Parallel residuals: credit PR openai#1204 (first) and PR openai#1529 (layer 8)
- Depth recurrence: clarify 2-pass origin (openai#1344); 3-pass is the final config
- Loop curriculum frac=0.35: correct PR to openai#1420 (not openai#1344)
- SmearGate: note original PR openai#162, SP8192 reintroduction in openai#1667
- Attention output gate: PR openai#1667 first; PR openai#1787 narrowed it
- LeakyReLU²: replaced relu², not GELU
- EMA decay: remove specific wrong value, defer to PR openai#287
- GPTQ first: correct to PR openai#535 (not openai#1019)
- Global SGD phase: 2000 docs (openai#1610) refined to 2500 (openai#1626)
- Training closing line: "halfway point" → "35% mark"

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 5, 2026
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 5, 2026
- Depth recurrence: openai#1344 → openai#1204 (first appearance on leaderboard)
- GPTQ: openai#535 → openai#374 (GPTQ-lite first introduced in openai#374)
- int7 embeddings: openai#1586 → openai#1626 (openai#1586 not in leaderboard)
- LQER: openai#1797 → openai#1851 (technique evolution credits openai#1851)
- XSA table: openai#287 → openai#265 (openai#265 is "first XSA")
- SmearGate table: openai#1667 → openai#1851 (openai#1851 fixed the BOS bug; openai#1667 just reused SmearGate)
- LeakyReLU² table: openai#493 → openai#549 (openai#549 is "first" per leaderboard)
- AWQ-lite table: openai#1908 → openai#1945 (openai#1908 not in leaderboard)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Summary
Results
Innovations (on clarkkev PR #1218 SP4096 base)
Run Command
Test plan
🤖 Generated with Claude Code