
Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)#1285

Merged
cocohearts merged 1 commit into openai:main from dexhunter:muoneqr-recurrence-wd090-allint6 on Apr 9, 2026

Conversation

@dexhunter
Contributor

Summary

  • val_bpb = 1.0912 (3-seed mean, std 0.0009) | 2.5106 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
  • WD-quantization synergy: higher weight decay (0.090) compresses 5% better, allowing ALL 66 layers at int6
  • All seeds under 16MB with 32K+ margins
  • No SLOT, no TTT, no eval-time adaptation, fully legal

Key Innovation: WD-Quantization Synergy

Higher WD (0.090 vs 0.085) → smaller weights → 5% better brotli compression → enough headroom for ALL 66 layers at int6 precision. The quantization quality gain exceeds the WD BPB cost:

Config     WD     N_INT6  Artifact  val_bpb (s42)
PR #1260   0.085  60      15,981K   1.09217
PR #1279   0.085  61      15,997K   1.09170
This       0.090  66      15,967K   1.09057
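The quantize-then-compress loop behind those artifact sizes can be sketched as a budget check. This is a hypothetical illustration, not the PR's pipeline: `quantize_int6` below is a plain symmetric round-to-nearest quantizer (the PR uses GPTQ), the weights are random stand-ins, and stdlib zlib substitutes for brotli so the sketch is self-contained.

```python
import zlib
import numpy as np

BUDGET = 16 * 1024 * 1024  # 16 MB artifact limit

def quantize_int6(w):
    # Symmetric per-tensor quantization to 6-bit codes in [-31, 31],
    # stored one code per int8 byte for simplicity.
    scale = max(float(np.abs(w).max()), 1e-12) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

rng = np.random.default_rng(42)
# Random stand-ins for 66 layers of weights; higher weight decay shrinks
# the real weights, which is what buys the extra compression headroom.
layers = [rng.normal(0.0, 0.02, (256, 256)) for _ in range(66)]

blobs = []
for w in layers:
    q, scale = quantize_int6(w)
    blobs.append(np.float32(scale).tobytes() + q.tobytes())

artifact = zlib.compress(b"".join(blobs), level=9)
print(f"compressed: {len(artifact):,} bytes, margin: {BUDGET - len(artifact):,}")
```

The real submission packs 6-bit codes more tightly and uses brotli quality 11; the shape of the check (quantize every layer, compress, verify the margin under 16 MB) is the same.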

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed  Steps  ms/step  Sliding BPB  val_loss (nats)  Artifact
42    5,540  106.5    1.0906       2.50910          15,967,483
0     5,536  106.6    1.0908       2.50973          15,962,242
1337  5,538  106.6    1.0923       2.51309          15,959,253
Mean  5,538  106.6    1.0912       2.51064          15,962,993

Changes from PR #1218

                  PR #1218  This
val_bpb           1.09785   1.09124 (-0.00661)
Weight decay      0.085     0.090
Optimizer         Muon      MuonEq-R
Depth recurrence  None      Layers 4,5 repeated
Quantization      Mixed     All int6 (66/66)
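For reference, "layers 4,5 repeated" means the depth schedule runs those two blocks a second time with shared weights, buying extra depth compute at zero parameter (and therefore zero artifact) cost. A minimal sketch with a toy residual MLP standing in for a full transformer block (all names and shapes here are hypothetical):

```python
import numpy as np

def block(x, W1, W2):
    # Toy residual ReLU-MLP block standing in for a transformer layer.
    h = np.maximum(x @ W1, 0.0)
    return x + h @ W2

rng = np.random.default_rng(0)
d = 16
# Parameters for 8 physical blocks.
params = [(rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d)))
          for _ in range(8)]

# Depth recurrence: positions 4 and 5 appear twice in the schedule,
# reusing the same weights, so 8 blocks of parameters give 10 layers
# of compute.
schedule = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]

x = rng.normal(0.0, 1.0, (2, d))
for i in schedule:
    x = block(x, *params[i])
```

Only the `schedule` changes relative to a plain stack, which is why the technique fits in a tight line budget.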

Credits

Test plan

  • 3-seed verification (42, 0, 1337) — all pass
  • All under 16MB (min margin: 32,517)
  • 4-seed tested (seed 7 also fits at 15,970,676)
  • No TTT, no SLOT

….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 4, 2026
…b 1.0900 (3-seed mean)

3-layer depth recurrence (layers 3,4,5) with WD-LR synergy:
higher WD (0.095) compresses for all-int6 headroom, higher MLR (0.022)
recovers quality. All 66 layers at int6 precision.

3-seed mean: 1.0900 BPB / 2.5077 nats (seeds 42, 0, 7)
All seeds under 16MB with 36K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Improves PR openai#1285 (1.0912) by 0.0013 BPB. Beats PR openai#1218 by 0.0079.
Built on PR openai#1218 by @clarkkev.
chandra447 added a commit to chandra447/parameter-golf that referenced this pull request Apr 4, 2026
Hypothesis: Polar Express 4-step minimax NS on top of full PR openai#1334 stack
Expected delta: ~-0.001 to -0.002 BPB from 1.0897 baseline

Key changes vs PR openai#1334:
- Polar Express Newton-Schulz (4-step minimax coefficients, arXiv:2505.16932)
- MATRIX_LR=0.022 (validated for WD=0.090)
- MUON_WD=0.090 (PR openai#1285/1334 optimal for 2-layer recurrence)
- NoPE explicitly disabled (nope_every_n=0) after critique
- Trackio experiment tracking added

Stack: SP4096 vocab + MLP 4x + WD=0.090 + MuonEq-R + QK-Gain 5.0 +
       Depth recurrence L4-5 (step 3000) + Parallel residuals L7+ + Brotli
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Apr 4, 2026
…el spectral test-time training

Base: clarkkev openai#1218 (1.0974 BPB, 4096 vocab, brotli, 34M params)
Added: depth recurrence L4,5 (from openai#1285), MuonEq-R, WD=0.09
Novel: Spectral TTT — adapt singular values at eval time (8192 params)
Target: ~1.085 BPB
AnubhavBharadwaaj added a commit to AnubhavBharadwaaj/parameter-golf that referenced this pull request Apr 6, 2026
@cocohearts cocohearts merged commit ee38f46 into openai:main Apr 9, 2026
SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request Apr 9, 2026
…0-allint6

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 12, 2026
…1.01710

Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09).
Six PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493).
New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9).
Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
tns15june pushed a commit to tns15june/parameter-golf that referenced this pull request Apr 19, 2026
Frontier records (PR openai#1285 MuonEq-R + WD=0.090, PR openai#1218 WD=0.085) use
AdamW-style decoupled weight decay on the Muon optimizer. Add the knob
with default 0.0 (backward-compatible). Applied as
p.data.mul_(1 - lr * wd) before the Muon matrix update.

MuonEq-R (row-normalized) variant is not ported — it would need more
line budget than we have on this branch. WD alone accounts for the
majority of that record's improvement per the commit notes.

dev/run_frontier.sh sets MUON_WEIGHT_DECAY=0.09 by default.

Also inlined restore_low_dim_params_to_fp32 at its single call site
to free lines for this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
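The knob described in that commit can be sketched as follows. The Newton-Schulz coefficients below are the ones used in the public Muon reference implementation; `muon_step` is a hypothetical condensation for illustration, not the branch's actual code.

```python
import numpy as np

def newton_schulz(g, steps=5):
    # Odd-polynomial Newton-Schulz iteration that approximately
    # orthogonalizes g -- the matrix update at the heart of Muon.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        m = x @ x.T
        x = a * x + (b * m + c * (m @ m)) @ x
    return x

def muon_step(p, grad, lr=0.02, wd=0.0):
    # Decoupled (AdamW-style) weight decay: shrink the parameter first,
    # then apply the orthogonalized update. wd=0.0 is the
    # backward-compatible default; the frontier runs set wd=0.09.
    p *= (1.0 - lr * wd)
    p -= lr * newton_schulz(grad)
    return p
```

With identical gradients, the wd=0.09 step differs from the wd=0.0 step by exactly `lr * wd * p`, i.e. the decay is fully decoupled from the update direction, matching the `p.data.mul_(1 - lr * wd)` formulation in the commit message.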
