
Record: MuonEq-R + 3-Layer Recurrence + WD=0.095 + MLR=0.022 + All-Int6 — val_bpb 1.0900 (3-seed mean)#1331

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-3layer-recurrence-wd095-mlr022

Conversation

@dexhunter
Contributor

Summary

  • val_bpb = 1.0900 (3-seed mean, std 0.0005) | 2.5077 nats | ~15.96 MB | 8xH100 SXM, 590s | No TTT
  • 3-layer depth recurrence (layers 3,4,5) with WD-LR synergy: WD=0.095 compresses for headroom, MLR=0.022 recovers quality
  • All seeds under 16MB with 36K+ margins
  • No SLOT, no TTT, no eval-time adaptation

Key Innovation: 3-Layer Recurrence + WD-LR Synergy

Extends the 2-layer recurrence of PR #1285 to 3 layers. The extra virtual layer needs more artifact budget, which is compensated by:

  • Higher WD (0.095 vs 0.090) → better compression → headroom for 3-layer recurrence
  • Higher MLR (0.022 vs 0.020) → recovers quality lost from WD increase
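The recurrence mechanism itself can be sketched as follows; this is a plain-Python stand-in for illustration, not the actual nanochat model code, and `RECUR_LAYERS` here is an illustrative name:

```python
# Illustrative sketch of depth recurrence: selected layers are run a
# second time, reusing their weights, so the network gains "virtual"
# depth with zero extra parameters. Not the repository's implementation.

RECUR_LAYERS = (3, 4, 5)  # layers to re-run (this PR recurs layers 3, 4, 5)

def forward(layers, x):
    # Standard pass through all layers.
    for layer in layers:
        x = layer(x)
    # Recurrence: re-apply the selected layers once more, same weights.
    for i in RECUR_LAYERS:
        x = layers[i](x)
    return x
```

Because the re-run layers share weights with the first pass, the artifact grows only through optimizer/precision effects, which is why the higher WD is used to buy back headroom.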

Results

| Seed | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 42 | 1.0898 | 2.50733 | 15,961,029 |
| 0 | 1.0895 | 2.50672 | 15,955,962 |
| 7 | 1.0905 | 2.50901 | 15,964,018 |
| **Mean** | **1.0900** | **2.50769** | **15,960,336** |

Changes from PR #1285 (1.0912)

| | PR #1285 | This PR |
|---|---|---|
| val_bpb | 1.09124 | 1.08995 (-0.00129) |
| Recurrence | 2-layer (4, 5) | 3-layer (3, 4, 5) |
| WD | 0.090 | 0.095 |
| Matrix LR | 0.020 | 0.022 |

Credits

…b 1.0900 (3-seed mean)

3-layer depth recurrence (layers 3,4,5) with WD-LR synergy:
higher WD (0.095) compresses for all-int6 headroom, higher MLR (0.022)
recovers quality. All 66 layers at int6 precision.

3-seed mean: 1.0900 BPB / 2.5077 nats (seeds 42, 0, 7)
All seeds under 16MB with 36K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Improves PR openai#1285 (1.0912) by 0.0013 BPB. Beats PR openai#1218 by 0.0079.
Built on PR openai#1218 by @clarkkev.
manfromnowhere143 added a commit to manfromnowhere143/parameter-golf that referenced this pull request Apr 18, 2026
Two complementary additions to AwebUltimate base:

1. Depth Recurrence (PR openai#1517/openai#1331/openai#1471 pattern, must credit):
   Re-run selected encoder layers once more after encoder pass, with NO
   skip-stack push (preserves U-Net 5-in/5-out symmetry). Curriculum:
   activated at RECUR_START_STEP (default 2000). Eval always uses recurrence.
   Env vars: RECUR_LAYERS (e.g. '3,4'), RECUR_START_STEP.

2. Lookahead Optimizer (Zhang/Lucas/Hinton/Ba, NeurIPS 2019) — Aweb signature:
   Maintains slow weights for all trainable params. Every k inner steps:
   slow := (1-α)*slow + α*fast; fast := slow. ~5% wall-clock overhead.
   Novel for nanochat speedrun (verified via gh search).
   Env vars: LOOKAHEAD_ENABLED, LOOKAHEAD_K, LOOKAHEAD_ALPHA.

Backwards-compat: RECUR_LAYERS='' (default) + LOOKAHEAD_ENABLED=0 reproduces
proven 1.1190 baseline byte-identically.

CPU smoke test (10 cases) PASSES: env wiring, model construction, recur
parsing, forward/forward_hidden recur application, skip-stack symmetry,
mini-training loss decrease, lookahead update math.
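The Lookahead synchronization in the commit above can be sketched on plain Python floats; this is an illustration of the update rule, not the repository's code, and per-step `k` bookkeeping is omitted (`alpha` corresponds to LOOKAHEAD_ALPHA):

```python
# Sketch of one Lookahead synchronization (Zhang, Lucas, Hinton, Ba,
# NeurIPS 2019): every k inner optimizer steps, the slow weights are
# interpolated toward the fast weights, then the fast weights are reset
# to the slow ones. Plain-float illustration only.

def lookahead_sync(slow, fast, alpha=0.5):
    # slow := (1 - alpha) * slow + alpha * fast
    new_slow = [(1 - alpha) * s + alpha * f for s, f in zip(slow, fast)]
    # fast := slow (fresh copy so later fast updates don't alias slow)
    return new_slow, list(new_slow)
```

With `alpha=0.5`, slow weights `[0.0]` and fast weights `[1.0]` synchronize to `[0.5]` for both, matching the `slow := (1-α)*slow + α*fast; fast := slow` rule above.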
G3sparky added a commit to G3sparky/parameter-golf that referenced this pull request Apr 18, 2026
QK_GAIN_INIT=5.5 extends the monotonic improvement trend past 5.25.
3-seed mean 1.0809 (std 0.0004) on 8xH100 SXM.

Base: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + Legal TTT
(PRs openai#1394, openai#1331, openai#1437, openai#1412, openai#549, openai#1445)
