
Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)#1285

Merged
cocohearts merged 1 commit into openai:main from dexhunter:muoneqr-recurrence-wd090-allint6 on Apr 9, 2026

Conversation

@dexhunter
Contributor

Summary

  • val_bpb = 1.0912 (3-seed mean, std 0.0009) | 2.5106 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
  • WD-quantization synergy: higher weight decay (0.090) compresses 5% better, allowing ALL 66 layers at int6
  • All seeds under 16MB with 32K+ margins
  • No SLOT, no TTT, no eval-time adaptation, fully legal

Key Innovation: WD-Quantization Synergy

Higher WD (0.090 vs 0.085) → smaller weights → 5% better brotli compression → enough headroom for ALL 66 layers at int6 precision. The quantization quality gain exceeds the WD BPB cost:

Config     WD     N_INT6  Artifact  val_bpb (s42)
PR #1260   0.085  60      15,981K   1.09217
PR #1279   0.085  61      15,997K   1.09170
This       0.090  66      15,967K   1.09057
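The quantize-then-compress loop behind those artifact sizes can be sketched as a budget check. This is a hypothetical illustration, not the PR's pipeline: `quantize_int6` below is a plain symmetric round-to-nearest quantizer (the PR uses GPTQ), the weights are random stand-ins, and stdlib zlib substitutes for brotli so the sketch is self-contained.

```python
import zlib
import numpy as np

BUDGET = 16 * 1024 * 1024  # 16 MB artifact limit

def quantize_int6(w):
    # Symmetric per-tensor quantization to 6-bit codes in [-31, 31],
    # stored one code per int8 byte for simplicity.
    scale = max(float(np.abs(w).max()), 1e-12) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

rng = np.random.default_rng(42)
# Random stand-ins for 66 layers of weights; higher weight decay shrinks
# the real weights, which is what buys the extra compression headroom.
layers = [rng.normal(0.0, 0.02, (256, 256)) for _ in range(66)]

blobs = []
for w in layers:
    q, scale = quantize_int6(w)
    blobs.append(np.float32(scale).tobytes() + q.tobytes())

artifact = zlib.compress(b"".join(blobs), level=9)
print(f"compressed: {len(artifact):,} bytes, margin: {BUDGET - len(artifact):,}")
```

The real submission packs 6-bit codes more tightly and uses brotli quality 11; the shape of the check (quantize every layer, compress, verify the margin under 16 MB) is the same.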

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed  Steps  ms/step  Sliding BPB  val_loss (nats)  Artifact
42    5,540  106.5    1.0906       2.50910          15,967,483
0     5,536  106.6    1.0908       2.50973          15,962,242
1337  5,538  106.6    1.0923       2.51309          15,959,253
Mean  5,538  106.6    1.0912       2.51064          15,962,993

Changes from PR #1218

                  PR #1218  This
val_bpb           1.09785   1.09124 (-0.00661)
Weight decay      0.085     0.090
Optimizer         Muon      MuonEq-R
Depth recurrence  None      Layers 4,5 repeated
Quantization      Mixed     All int6 (66/66)
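For reference, "layers 4,5 repeated" means the depth schedule runs those two blocks a second time with shared weights, buying extra depth compute at zero parameter (and therefore zero artifact) cost. A minimal sketch with a toy residual MLP standing in for a full transformer block (all names and shapes here are hypothetical):

```python
import numpy as np

def block(x, W1, W2):
    # Toy residual ReLU-MLP block standing in for a transformer layer.
    h = np.maximum(x @ W1, 0.0)
    return x + h @ W2

rng = np.random.default_rng(0)
d = 16
# Parameters for 8 physical blocks.
params = [(rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d)))
          for _ in range(8)]

# Depth recurrence: positions 4 and 5 appear twice in the schedule,
# reusing the same weights, so 8 blocks of parameters give 10 layers
# of compute.
schedule = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]

x = rng.normal(0.0, 1.0, (2, d))
for i in schedule:
    x = block(x, *params[i])
```

Only the `schedule` changes relative to a plain stack, which is why the technique fits in a tight line budget.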

Credits

Test plan

  • 3-seed verification (42, 0, 1337) — all pass
  • All under 16MB (min margin: 32,517)
  • 4-seed tested (seed 7 also fits at 15,970,676)
  • No TTT, no SLOT

….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 4, 2026
…b 1.0900 (3-seed mean)

3-layer depth recurrence (layers 3,4,5) with WD-LR synergy:
higher WD (0.095) compresses for all-int6 headroom, higher MLR (0.022)
recovers quality. All 66 layers at int6 precision.

3-seed mean: 1.0900 BPB / 2.5077 nats (seeds 42, 0, 7)
All seeds under 16MB with 36K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Improves PR openai#1285 (1.0912) by 0.0013 BPB. Beats PR openai#1218 by 0.0079.
Built on PR openai#1218 by @clarkkev.
chandra447 added a commit to chandra447/parameter-golf that referenced this pull request Apr 4, 2026
Hypothesis: Polar Express 4-step minimax NS on top of full PR openai#1334 stack
Expected delta: ~-0.001 to -0.002 BPB from 1.0897 baseline

Key changes vs PR openai#1334:
- Polar Express Newton-Schulz (4-step minimax coefficients, arXiv:2505.16932)
- MATRIX_LR=0.022 (validated for WD=0.090)
- MUON_WD=0.090 (PR openai#1285/1334 optimal for 2-layer recurrence)
- NoPE explicitly disabled (nope_every_n=0) after critique
- Trackio experiment tracking added

Stack: SP4096 vocab + MLP 4x + WD=0.090 + MuonEq-R + QK-Gain 5.0 +
       Depth recurrence L4-5 (step 3000) + Parallel residuals L7+ + Brotli
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Apr 4, 2026
…el spectral test-time training

Base: clarkkev openai#1218 (1.0974 BPB, 4096 vocab, brotli, 34M params)
Added: depth recurrence L4,5 (from openai#1285), MuonEq-R, WD=0.09
Novel: Spectral TTT — adapt singular values at eval time (8192 params)
Target: ~1.085 BPB
AnubhavBharadwaaj added a commit to AnubhavBharadwaaj/parameter-golf that referenced this pull request Apr 6, 2026
@cocohearts cocohearts merged commit ee38f46 into openai:main Apr 9, 2026
SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request Apr 9, 2026
…0-allint6

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 12, 2026
…1.01710

Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09).
Six PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493).
New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9).
Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
tns15june pushed a commit to tns15june/parameter-golf that referenced this pull request Apr 19, 2026
Frontier records (PR openai#1285 MuonEq-R + WD=0.090, PR openai#1218 WD=0.085) use
AdamW-style decoupled weight decay on the Muon optimizer. Add the knob
with default 0.0 (backward-compatible). Applied as
p.data.mul_(1 - lr * wd) before the Muon matrix update.

MuonEq-R (row-normalized) variant is not ported — it would need more
line budget than we have on this branch. WD alone accounts for the
majority of that record's improvement per the commit notes.

dev/run_frontier.sh sets MUON_WEIGHT_DECAY=0.09 by default.

Also inlined restore_low_dim_params_to_fp32 at its single call site
to free lines for this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
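The knob described in that commit can be sketched as follows. The Newton-Schulz coefficients below are the ones used in the public Muon reference implementation; `muon_step` is a hypothetical condensation for illustration, not the branch's actual code.

```python
import numpy as np

def newton_schulz(g, steps=5):
    # Odd-polynomial Newton-Schulz iteration that approximately
    # orthogonalizes g -- the matrix update at the heart of Muon.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x

def muon_step(p, grad, lr=0.02, wd=0.0):
    # Decoupled (AdamW-style) weight decay: shrink the parameter first,
    # then apply the orthogonalized update. wd=0.0 is the
    # backward-compatible default; the frontier runs set wd=0.09.
    p *= (1.0 - lr * wd)
    p -= lr * newton_schulz(grad)
    return p
```

With identical gradients, the wd=0.09 step differs from the wd=0.0 step by exactly `lr * wd * p`, i.e. the decay is fully decoupled from the update direction, matching the `p.data.mul_(1 - lr * wd)` formulation in the commit message.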
