
Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 — val_bpb 1.0900 (3-seed mean)#1331

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-3layer-recurrence-wd095-mlr022

Conversation

@dexhunter
Contributor

Summary

  • val_bpb = 1.0900 (3-seed mean, std 0.0005) | 2.5077 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
  • 3-layer depth recurrence (layers 3,4,5) with WD-LR synergy: WD=0.095 compresses for headroom, MLR=0.022 recovers quality
  • All seeds under 16MB with 36K+ margins
  • No SLOT, no TTT, no eval-time adaptation

Key Innovation: 3-Layer Recurrence + WD-LR Synergy

Extends the 2-layer recurrence of PR #1285 to 3 layers. The extra virtual layer needs more artifact budget, which is compensated by:

  • Higher WD (0.095 vs 0.090) → better compression → headroom for 3-layer recurrence
  • Higher MLR (0.022 vs 0.020) → recovers quality lost from WD increase
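The recurrence mechanism itself can be sketched as follows; this is a plain-Python stand-in for illustration, not the actual nanochat model code, and `RECUR_LAYERS` here is an illustrative name:

```python
# Illustrative sketch of depth recurrence: selected layers are run a
# second time, reusing their weights, so the network gains "virtual"
# depth with zero extra parameters. Not the repository's implementation.

RECUR_LAYERS = (3, 4, 5)  # layers to re-run (this PR recurs layers 3, 4, 5)

def forward(layers, x):
    # Standard pass through all layers.
    for layer in layers:
        x = layer(x)
    # Recurrence: re-apply the selected layers once more, same weights.
    for i in RECUR_LAYERS:
        x = layers[i](x)
    return x
```

Because the re-run layers share weights with the first pass, the artifact grows only through optimizer/precision effects, which is why the higher WD is used to buy back headroom.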

Results

| Seed | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 42 | 1.0898 | 2.50733 | 15,961,029 |
| 0 | 1.0895 | 2.50672 | 15,955,962 |
| 7 | 1.0905 | 2.50901 | 15,964,018 |
| **Mean** | **1.0900** | **2.50769** | **15,960,336** |

Changes from PR #1285 (1.0912)

| | PR #1285 | This PR |
|---|---|---|
| val_bpb | 1.09124 | 1.08995 (-0.00129) |
| Recurrence | 2-layer (4, 5) | 3-layer (3, 4, 5) |
| WD | 0.090 | 0.095 |
| Matrix LR | 0.020 | 0.022 |

Credits

…b 1.0900 (3-seed mean)

3-layer depth recurrence (layers 3,4,5) with WD-LR synergy:
higher WD (0.095) compresses for all-int6 headroom, higher MLR (0.022)
recovers quality. All 66 layers at int6 precision.

3-seed mean: 1.0900 BPB / 2.5077 nats (seeds 42, 0, 7)
All seeds under 16MB with 36K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Improves PR openai#1285 (1.0912) by 0.0013 BPB. Beats PR openai#1218 by 0.0079.
Built on PR openai#1218 by @clarkkev.
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
Two complementary additions to AwebUltimate base:

1. Depth Recurrence (PR openai#1517/openai#1331/openai#1471 pattern, must credit):
   Re-run selected encoder layers once more after encoder pass, with NO
   skip-stack push (preserves U-Net 5-in/5-out symmetry). Curriculum:
   activated at RECUR_START_STEP (default 2000). Eval always uses recurrence.
   Env vars: RECUR_LAYERS (e.g. '3,4'), RECUR_START_STEP.

2. Lookahead Optimizer (Zhang/Lucas/Hinton/Ba, NeurIPS 2019) — Aweb signature:
   Maintains slow weights for all trainable params. Every k inner steps:
   slow := (1-α)*slow + α*fast; fast := slow. ~5% wall-clock overhead.
   Novel for nanochat speedrun (verified via gh search).
   Env vars: LOOKAHEAD_ENABLED, LOOKAHEAD_K, LOOKAHEAD_ALPHA.

Backwards-compat: RECUR_LAYERS='' (default) + LOOKAHEAD_ENABLED=0 reproduces
proven 1.1190 baseline byte-identically.

CPU smoke test (10 cases) PASSES: env wiring, model construction, recur
parsing, forward/forward_hidden recur application, skip-stack symmetry,
mini-training loss decrease, lookahead update math.
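The Lookahead synchronization in the commit above can be sketched on plain Python floats; this is an illustration of the update rule, not the repository's code, and per-step `k` bookkeeping is omitted (`alpha` corresponds to LOOKAHEAD_ALPHA):

```python
# Sketch of one Lookahead synchronization (Zhang, Lucas, Hinton, Ba,
# NeurIPS 2019): every k inner optimizer steps, the slow weights are
# interpolated toward the fast weights, then the fast weights are reset
# to the slow ones. Plain-float illustration only.

def lookahead_sync(slow, fast, alpha=0.5):
    # slow := (1 - alpha) * slow + alpha * fast
    new_slow = [(1 - alpha) * s + alpha * f for s, f in zip(slow, fast)]
    # fast := slow (fresh copy so later fast updates don't alias slow)
    return new_slow, list(new_slow)
```

With `alpha=0.5`, slow weights `[0.0]` and fast weights `[1.0]` synchronize to `[0.5]` for both, matching the `slow := (1-α)*slow + α*fast; fast := slow` rule above.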
G3sparky added a commit to G3sparky/parameter-golf that referenced this pull request Apr 18, 2026
QK_GAIN_INIT=5.5 extends the monotonic improvement trend past 5.25.
3-seed mean 1.0809 (std 0.0004) on 8xH100 SXM.

Base: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + Legal TTT
(PRs openai#1394, openai#1331, openai#1437, openai#1412, openai#549, openai#1445)
