Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB) #1607

Open
inin-zou wants to merge 150 commits into openai:main from inin-zou:submission/nemotron-h-mamba3-depth-recurrence

Conversation


inin-zou commented Apr 14, 2026

Summary

  • First Mamba depth recurrence in the competition (checks off "State-space models" from Requests for PRs)
  • Nemotron-H-inspired hybrid: 7 Mamba-3 SISO + 1 Attention (8 physical layers → 12 virtual via hinge-point recurrence)
  • Novel hinge-point multi-recurrence: layers 3-4 repeated 2x at the U-Net hinge; outperforms spread recurrence
  • val_bpb: 1.4765 post-quant (1000 steps, 1xH100, GPTQ int6 + LZMA, 8.2MB artifact; packaging sketch below)
  • Systematic ablations: 6 recurrence configs, 3 quantization strategies, and 3 architectural variants
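
The int6 checkpoint is LZMA-compressed to reach the 8.2MB artifact. A minimal packaging sketch, assuming a state dict of int-encoded tensors plus per-row scales; the torch.save + lzma pipeline here is an illustration, not necessarily the exact export path in this PR:

```python
import io
import lzma

import torch

def package_artifact(state_dict: dict, path: str) -> int:
    """Serialize the quantized weights, then LZMA-compress the blob."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)                  # int6 codes + fp16 scales
    blob = lzma.compress(buf.getvalue(), preset=9)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)                             # artifact size in bytes
```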

Key Findings

| Finding | Detail |
|---------|--------|
| Mamba depth recurrence works | -0.0092 BPB vs no recurrence (first SSM depth-recurrence result in the competition) |
| Focused > spread recurrence | Hinge ×2 (1.2824) beats 4-layer ×1 (1.2864) at the same virtual depth |
| Ternary Mamba not viable at 26M | +0.397 BPB worse (literature suggests a ~1.3B-param minimum) |
| Q-Mamba DSQ not needed | Standard full-Hessian GPTQ already handles SSM outliers (0.082 vs 0.148 quant loss; sketch below) |
| RoPE removal hurts at small scale | +0.072 BPB worse (unlike Jamba 1.3B, where it is neutral) |
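
A minimal sketch of the full-Hessian GPTQ rounding referenced above, assuming `W` is an [out, in] weight matrix and `H` accumulates x·xᵀ over calibration activations; the damping value, per-row scales, and plain left-to-right column order are simplifications of the full algorithm:

```python
import torch

def gptq_quantize(W: torch.Tensor, H: torch.Tensor, bits: int = 6,
                  damp: float = 0.01):
    """Round columns one at a time, pushing each column's rounding error
    into the not-yet-quantized columns via the inverse Hessian."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n, dtype=H.dtype,
                                                   device=H.device)
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    W, Q = W.clone(), torch.empty_like(W)
    for i in range(n):
        q = torch.clamp(torch.round(W[:, i:i + 1] / scale), -qmax, qmax)
        Q[:, i:i + 1] = q
        err = (W[:, i:i + 1] - q * scale) / Hinv[i, i]
        W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]   # error propagation
    return Q, scale                                   # dequant: Q * scale
```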

Architecture

```
Physical: [Mamba3_0, Mamba3_1, Mamba3_2, Mamba3_3, Attn_4, Mamba3_5, Mamba3_6, Mamba3_7]
Virtual:  [M0, M1, M2, M3, A4, M3, A4, M3, A4, M5, M6, M7]  (12 layers, 0 extra params)
```
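
A minimal sketch of how this schedule can be realized with shared weights (class and argument names are illustrative; the actual layer classes live in train_nemotron_hybrid.py):

```python
import torch.nn as nn

class HingeRecurrentStack(nn.Module):
    """Replay the (M3, A4) hinge pair with shared weights: 8 physical
    layers become 12 virtual layers at zero extra parameters."""

    def __init__(self, layers, hinge=(3, 4), passes=3):
        super().__init__()
        self.layers = nn.ModuleList(layers)            # 8 physical layers
        pre = list(range(hinge[0]))                    # [0, 1, 2]
        post = list(range(hinge[1] + 1, len(layers)))  # [5, 6, 7]
        # passes=3 is the base traversal plus the 2x repetition above:
        # [0, 1, 2, 3, 4, 3, 4, 3, 4, 5, 6, 7]
        self.schedule = pre + list(hinge) * passes + post

    def forward(self, x):
        for i in self.schedule:   # same weights each time a layer repeats
            x = self.layers[i](x)
        return x
```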

Credits

Built on the PR #1355 (best SSM) pipeline. Inspired by NVIDIA Nemotron-H (arXiv:2504.03624), Mamba-3 (ICLR 2026), and PR #1204 (the depth-recurrence concept).

Test plan

  • Verify the script runs: `torchrun --standalone --nproc_per_node=1 train_nemotron_hybrid.py` with the env vars from the README
  • Check the artifact is under 16MB (currently 8.2MB)
  • Pending: 8xH100 10-min run (awaiting OpenAI compute grant)

0hq and others added 30 commits March 18, 2026 09:33
MLX Timing Mismatch with Main Script
Fix MLX multi-batch validation memory growth
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding, which lacks STE fake-quant. Reduces the quant penalty from +0.048 to +0.0015 BPB (see the sketch after this list).
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost
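
A per-row sketch of the mixed scheme from item 2, assuming symmetric grids ("31 levels"/"127 levels" read here as the max quantized magnitude); the helper name is illustrative:

```python
import torch

def quant_dequant_per_row(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-row quantization: bits=6 for block weights,
    bits=8 for the token embedding."""
    qmax = 2 ** (bits - 1) - 1             # 31 for int6, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                       # export stores q + scale instead
```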

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129
across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats
(t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aluating the graph after each sub-batch step
Use eager mx.eval() to fix running train script on 16GB Mac devices
Keep tok_emb.weight in fp16 during int8 export (kills the quant gap),
shrink the MLP hidden size to 992 to fit under 16MB, and bump
warmdown to 3600 and matrix LR to 0.06.

tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* SOTA attempt

* Improve score on SXM

---------

Co-authored-by: spokane-way <spokane@way>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training (sketch below)
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)

5-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1652 (std=0.0017)
  Mean rt_bpb:    1.1985
  t-statistic:    78.93 (p << 0.001)
  All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
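
The QAT bullet above relies on a straight-through estimator; a minimal sketch (the class name and per-row scaling are assumptions):

```python
import torch

class Int6FakeQuantSTE(torch.autograd.Function):
    """Forward: snap weights to the symmetric int6 grid so training sees
    the quantization error. Backward: pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        qmax = 31
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out            # straight-through: identity gradient
```

In use, a block would call `w_q = Int6FakeQuantSTE.apply(self.weight)` in its forward pass, so the exported int6 weights match what training optimized.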
The window_starts filter dropped windows shorter than stride,
silently skipping up to (stride-1) tokens at the end of the
validation set. Now includes all windows with >= 1 scoreable
token, and clamps the score start for short final windows.
Co-authored-by: spokane-way <spokane@way>
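
A sketch of the fixed scorer, assuming byte-level tokens in a 1-D LongTensor and a model returning next-token logits (all names are illustrative); the `start` computation is the fix, so a short final window still contributes its >= 1 scoreable tokens:

```python
import math

import torch

@torch.inference_mode()
def sliding_window_bpb(model, tokens, window=1024, stride=64):
    total_nats, scored = 0.0, 0
    next_unscored = 1                 # first target index not yet scored
    for s in range(0, len(tokens) - 1, stride):
        tgt = tokens[s + 1 : s + window + 1]
        inp = tokens[s : s + len(tgt)]        # align inputs with targets
        start = next_unscored - (s + 1)       # score only new targets
        if start >= len(tgt):
            break                             # everything already scored
        logp = torch.log_softmax(model(inp.unsqueeze(0))[0].float(), -1)
        total_nats -= logp[start:].gather(1, tgt[start:, None]).sum().item()
        scored += len(tgt) - start
        next_unscored = s + 1 + len(tgt)
    return total_nats / scored / math.log(2)  # nats -> bits per byte
```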
msisovic and others added 28 commits April 1, 2026 00:20
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
…al_bpb 1.0897 (3-seed mean)

Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting under 16MB with 7-11KB of margin.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
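
A minimal sketch of the score-first discipline described above (the chunking, loss API, and optimizer settings are assumptions; the point is only the ordering: score each chunk under inference_mode(), then update):

```python
import torch

def score_first_ttt(model, val_chunks, lr=0.005, epochs=3):
    """Score each chunk before any parameter update sees it, so the
    reported loss never benefits from adaptation on that same chunk."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total = 0.0
    for chunk in val_chunks:
        with torch.inference_mode():          # score first ...
            total += model(chunk).loss.item()
        for _ in range(epochs):               # ... then adapt on it
            opt.zero_grad(set_to_none=True)
            model(chunk).loss.backward()
            opt.step()
    return total / len(val_chunks)
```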
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…60-gptq-brotli-1.1105

Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)
…mult4-wd085

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR openai#1179, -0.0143 vs merged SOTA
…0-allint6

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
…-slot-v4

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…mb-sdclip-loop45x2

Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip  — val_bpb 1.08563 (5 seed mean)
…-ttt-1.08279

Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)
…rallel-ttt

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
…duals-hessian-sdclip

Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean)
…oard-readme

Update README leaderboard for April records
…(1.4765 BPB)

First Mamba depth recurrence in Parameter Golf.
7 Mamba-3 + 1 Attention hybrid with hinge-point multi-recurrence
(12 virtual layers from 8 physical, zero extra params).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
inin-zou force-pushed the submission/nemotron-h-mamba3-depth-recurrence branch from cd8ae31 to 1504011 on April 23, 2026 at 17:44
inin-zou added a commit to inin-zou/parameter-golf that referenced this pull request Apr 29, 2026
…rgets

- Phase 1 complete, Phase 2 in progress
- 1000-step benchmark: target ≤1.29 before 8xH100 commit
- PR openai#1607 submitted
- No OpenAI compute credits received
- ~$4 Modal credits remaining
- Deadline tomorrow (Apr 30)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>