Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean)#1279
dexhunter wants to merge 1 commit into openai:main
Conversation
Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer) with a smaller mini runner (21,396 bytes) that creates enough headroom.

- 3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7)
- All seeds under 16 MB (max: 15,996,591 bytes)
- No TTT, no SLOT, no eval-time adaptation
- Techniques: MuonEq-R optimizer, depth recurrence (layers 4, 5 shared MLP), 61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression

Built on PR openai#1218 by @clarkkev.
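The PR reports the same result in two units: 1.0924 BPB and 2.5133 nats. Under the usual conversion bpb = loss_nats / (ln 2 · bytes_per_token), those two figures together imply roughly 3.32 bytes per token; that ratio is inferred from the reported numbers, not stated in the PR. A quick sanity check:

```python
import math

loss_nats = 2.5133   # reported 3-seed mean val loss, nats per token
bpb = 1.0924         # reported bits per byte

# bpb = loss_nats / (ln 2 * bytes_per_token)  =>  solve for bytes_per_token
bytes_per_token = loss_nats / (math.log(2) * bpb)
print(round(bytes_per_token, 3))  # ~3.319
```

If the two reported numbers diverged under this conversion, it would suggest they came from different runs or different token/byte accounting.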
New architecture: instead of N independent transformer blocks, use K shared blocks cycled to N virtual layers, with per-layer FiLM conditioning (learned scale vectors for attn/mlp/residual per virtual layer). This saves massive parameters: 3 shared blocks for 9 virtual layers use ~6.5M params vs 17.1M, freeing artifact budget.

This is genuinely novel for parameter-golf: no submission has tried feature-wise linear modulation for depth conditioning. The closest is PR openai#1279's LoRA adapters, but FiLM is much cheaper (1,024 params per virtual layer vs ~8K for LoRA rank-4).

Experiments running: standard 9L vs FiLM 3→9 vs FiLM 3→18 vs FiLM 1→9.

Also includes best_full_run.log: Kitchen Sink seq2048 at 600s reached 1.2698 BPB (1,338 steps, 15.6 MB artifact).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
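The cycling-plus-FiLM idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the PR's implementation: the blocks are stand-in linear maps rather than real attention/MLP sublayers, the width is arbitrary, and a single FiLM scale vector per virtual layer stands in for the separate attn/mlp/residual scales the PR describes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64          # toy model width (hypothetical)
K, N = 3, 9     # 3 shared blocks cycled to 9 virtual layers, as in the PR

# K shared "blocks": toy linear maps standing in for attn+MLP sublayers
shared = [rng.normal(0, 0.02, (d, d)) for _ in range(K)]

# Per-virtual-layer FiLM scales: one learned vector per virtual layer that
# modulates the residual update (the PR uses separate attn/mlp/residual
# scales; collapsed to one vector here for brevity)
film = np.ones((N, d))

def forward(x):
    for layer in range(N):
        block = shared[layer % K]          # cycle the K shared blocks
        x = x + film[layer] * (x @ block)  # FiLM-modulated residual update
    return x

x = rng.normal(size=(2, d))
y = forward(x)
print(y.shape)
```

The key point is the parameter accounting: the heavy weights exist K times, while depth conditioning costs only N small scale vectors per virtual layer.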
Community Review

Compliance: NEEDS AUTHOR ACTION

What I found: the CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step. A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:
Recommendation: could you run the import step locally and fix the reported SyntaxError? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet, because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: f-string: expecting '}' (line 574).
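The reported failure is a SyntaxError raised at import time, which means it can be caught before submission without executing the artifact at all. A minimal pre-flight check using the stdlib `py_compile` module (the file name and broken f-string below are fabricated for illustration):

```python
import os
import py_compile
import tempfile

# Fabricated source reproducing the class of error the audit reported:
# a string literal closes inside an f-string replacement field, so the
# '}' is never found and compilation fails.
bad_src = 'x = f"{1 + "  # unbalanced brace inside an f-string\n'

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(bad_src)
    path = f.name

try:
    # doraise=True turns compile failures into PyCompileError instead
    # of printing to stderr, so the check can gate a submission script
    py_compile.compile(path, doraise=True)
    ok = True
except py_compile.PyCompileError as e:
    ok = False
    print("import would fail:", e.exc_type_name)
finally:
    os.remove(path)
```

Running this kind of check in CI on the generated runner would surface the `f-string: expecting '}'` error before the compliance sweep does.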
Summary
Key Innovation: N_INT6=61
PR #1260 used N_INT6=60. By regenerating a smaller self-extracting mini runner (21,396 bytes vs the 87K standalone), we freed enough artifact budget to fit one additional int6 layer. N_INT6=61 improves BPB by ~0.001 per seed with zero architecture change: a pure quantization precision upgrade.
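The "61 int6 + 5 int5 Hessian-ranked" split can be sketched as a simple precision-assignment step: rank layers by a sensitivity score (the PR says Hessian-ranked; GPTQ-style pipelines typically use a per-layer Hessian-derived statistic) and give the top N_INT6 layers the extra bit. Everything below is a hypothetical illustration with random stand-in scores, not the PR's actual ranking code.

```python
import numpy as np

rng = np.random.default_rng(42)
n_layers = 66                       # 61 int6 + 5 int5, per the PR
N_INT6 = 61                         # layers that get the higher precision

# Stand-in for per-layer Hessian-derived sensitivity scores; the real
# pipeline would compute these from calibration data during GPTQ
sensitivity = rng.random(n_layers)

order = np.argsort(sensitivity)[::-1]   # most sensitive layers first
bits = np.full(n_layers, 5)             # default: int5
bits[order[:N_INT6]] = 6                # top-61 layers upgraded to int6
print(int((bits == 6).sum()), int((bits == 5).sum()))  # 61 5
```

Framed this way, the record's headline change is just `N_INT6 = 60 → 61`: one layer crossing the precision threshold once the smaller runner freed the bytes to pay for it.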
Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)
Changes from PR #1218
Credits
Test plan