Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB) #1607

Open
inin-zou wants to merge 150 commits into openai:main from inin-zou:submission/nemotron-h-mamba3-depth-recurrence

Conversation


inin-zou commented Apr 14, 2026

Summary

  • First Mamba depth recurrence in the competition (checks off "State-space models" from Requests for PRs)
  • Nemotron-H-inspired hybrid: 7 Mamba-3 SISO + 1 Attention (8 physical layers → 12 virtual via hinge-point recurrence)
  • Novel hinge-point multi-recurrence: layers 3-4 repeated 2x at the U-Net hinge; outperforms spread recurrence
  • val_bpb: 1.4765 post-quant (1000 steps, 1xH100, GPTQ int6 + LZMA, 8.2MB artifact; packaging sketch below)
  • Systematic ablations: 6 recurrence configs, 3 quantization strategies, and 3 architectural variants
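
The int6 checkpoint is LZMA-compressed to reach the 8.2MB artifact. A minimal packaging sketch, assuming a state dict of int-encoded tensors plus per-row scales; the torch.save + lzma pipeline here is an illustration, not necessarily the exact export path in this PR:

```python
import io
import lzma

import torch

def package_artifact(state_dict: dict, path: str) -> int:
    """Serialize the quantized weights, then LZMA-compress the blob."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)                  # int6 codes + fp16 scales
    blob = lzma.compress(buf.getvalue(), preset=9)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)                             # artifact size in bytes
```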

Key Findings

| Finding | Detail |
|---------|--------|
| Mamba depth recurrence works | -0.0092 BPB vs no recurrence (first SSM depth-recurrence result in the competition) |
| Focused > spread recurrence | Hinge ×2 (1.2824) beats 4-layer ×1 (1.2864) at the same virtual depth |
| Ternary Mamba not viable at 26M | +0.397 BPB worse (literature suggests a ~1.3B-param minimum) |
| Q-Mamba DSQ not needed | Standard full-Hessian GPTQ already handles SSM outliers (0.082 vs 0.148 quant loss; sketch below) |
| RoPE removal hurts at small scale | +0.072 BPB worse (unlike Jamba 1.3B, where it is neutral) |
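
A minimal sketch of the full-Hessian GPTQ rounding referenced above, assuming `W` is an [out, in] weight matrix and `H` accumulates x·xᵀ over calibration activations; the damping value, per-row scales, and plain left-to-right column order are simplifications of the full algorithm:

```python
import torch

def gptq_quantize(W: torch.Tensor, H: torch.Tensor, bits: int = 6,
                  damp: float = 0.01):
    """Round columns one at a time, pushing each column's rounding error
    into the not-yet-quantized columns via the inverse Hessian."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    n = W.shape[1]
    H = H + damp * H.diagonal().mean() * torch.eye(n, dtype=H.dtype,
                                                   device=H.device)
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    W, Q = W.clone(), torch.empty_like(W)
    for i in range(n):
        q = torch.clamp(torch.round(W[:, i:i + 1] / scale), -qmax, qmax)
        Q[:, i:i + 1] = q
        err = (W[:, i:i + 1] - q * scale) / Hinv[i, i]
        W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]   # error propagation
    return Q, scale                                   # dequant: Q * scale
```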

Architecture

```
Physical: [Mamba3_0, Mamba3_1, Mamba3_2, Mamba3_3, Attn_4, Mamba3_5, Mamba3_6, Mamba3_7]
Virtual:  [M0, M1, M2, M3, A4, M3, A4, M3, A4, M5, M6, M7]  (12 layers, 0 extra params)
```
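
A minimal sketch of how this schedule can be realized with shared weights (class and argument names are illustrative; the actual layer classes live in train_nemotron_hybrid.py):

```python
import torch.nn as nn

class HingeRecurrentStack(nn.Module):
    """Replay the (M3, A4) hinge pair with shared weights: 8 physical
    layers become 12 virtual layers at zero extra parameters."""

    def __init__(self, layers, hinge=(3, 4), passes=3):
        super().__init__()
        self.layers = nn.ModuleList(layers)            # 8 physical layers
        pre = list(range(hinge[0]))                    # [0, 1, 2]
        post = list(range(hinge[1] + 1, len(layers)))  # [5, 6, 7]
        # passes=3 is the base traversal plus the 2x repetition above:
        # [0, 1, 2, 3, 4, 3, 4, 3, 4, 5, 6, 7]
        self.schedule = pre + list(hinge) * passes + post

    def forward(self, x):
        for i in self.schedule:   # same weights each time a layer repeats
            x = self.layers[i](x)
        return x
```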

Credits

Built on the PR #1355 (best SSM) pipeline. Inspired by NVIDIA Nemotron-H (arXiv:2504.03624), Mamba-3 (ICLR 2026), and PR #1204 (the depth-recurrence concept).

Test plan

  • Verify the script runs: `torchrun --standalone --nproc_per_node=1 train_nemotron_hybrid.py` with the env vars from the README
  • Check the artifact is under 16MB (currently 8.2MB)
  • Pending: 8xH100 10-min run (awaiting OpenAI compute grant)

0hq and others added 30 commits March 18, 2026 09:33
MLX Timing Mismatch with Main Script
Fix MLX multi-batch validation memory growth
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding, which lacks STE fake-quant. Reduces the quant penalty from +0.048 to +0.0015 BPB (see the sketch after this list).
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost
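
A per-row sketch of the mixed scheme from item 2, assuming symmetric grids ("31 levels"/"127 levels" read here as the max quantized magnitude); the helper name is illustrative:

```python
import torch

def quant_dequant_per_row(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-row quantization: bits=6 for block weights,
    bits=8 for the token embedding."""
    qmax = 2 ** (bits - 1) - 1             # 31 for int6, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                       # export stores q + scale instead
```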

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129
across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats
(t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aluating the graph after each sub-batch step
Use eager mx.eval() to fix running train script on 16GB Mac devices
Keep tok_emb.weight in fp16 during int8 export (kills the quant gap),
shrink the MLP hidden size to 992 to fit under 16MB, and bump
warmdown to 3600 and matrix LR to 0.06.

tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* SOTA attempt

* Improve score on SXM

---------

Co-authored-by: spokane-way <spokane@way>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training (sketch below)
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)

5-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1652 (std=0.0017)
  Mean rt_bpb:    1.1985
  t-statistic:    78.93 (p << 0.001)
  All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
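
The QAT bullet above relies on a straight-through estimator; a minimal sketch (the class name and per-row scaling are assumptions):

```python
import torch

class Int6FakeQuantSTE(torch.autograd.Function):
    """Forward: snap weights to the symmetric int6 grid so training sees
    the quantization error. Backward: pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        qmax = 31
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out            # straight-through: identity gradient
```

In use, a block would call `w_q = Int6FakeQuantSTE.apply(self.weight)` in its forward pass, so the exported int6 weights match what training optimized.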
The window_starts filter dropped windows shorter than stride,
silently skipping up to (stride-1) tokens at the end of the
validation set. Now includes all windows with >= 1 scoreable
token, and clamps the score start for short final windows.
Co-authored-by: spokane-way <spokane@way>
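
A sketch of the fixed scorer, assuming byte-level tokens in a 1-D LongTensor and a model returning next-token logits (all names are illustrative); the `start` computation is the fix, so a short final window still contributes its >= 1 scoreable tokens:

```python
import math

import torch

@torch.inference_mode()
def sliding_window_bpb(model, tokens, window=1024, stride=64):
    total_nats, scored = 0.0, 0
    next_unscored = 1                 # first target index not yet scored
    for s in range(0, len(tokens) - 1, stride):
        tgt = tokens[s + 1 : s + window + 1]
        inp = tokens[s : s + len(tgt)]        # align inputs with targets
        start = next_unscored - (s + 1)       # score only new targets
        if start >= len(tgt):
            break                             # everything already scored
        logp = torch.log_softmax(model(inp.unsqueeze(0))[0].float(), -1)
        total_nats -= logp[start:].gather(1, tgt[start:, None]).sum().item()
        scored += len(tgt) - start
        next_unscored = s + 1 + len(tgt)
    return total_nats / scored / math.log(2)  # nats -> bits per byte
```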
msisovic and others added 28 commits April 1, 2026 00:20
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
…al_bpb 1.0897 (3-seed mean)

Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting under 16MB with 7-11KB of margin.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
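
A minimal sketch of the score-first discipline described above (the chunking, loss API, and optimizer settings are assumptions; the point is only the ordering: score each chunk under inference_mode(), then update):

```python
import torch

def score_first_ttt(model, val_chunks, lr=0.005, epochs=3):
    """Score each chunk before any parameter update sees it, so the
    reported loss never benefits from adaptation on that same chunk."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total = 0.0
    for chunk in val_chunks:
        with torch.inference_mode():          # score first ...
            total += model(chunk).loss.item()
        for _ in range(epochs):               # ... then adapt on it
            opt.zero_grad(set_to_none=True)
            model(chunk).loss.backward()
            opt.step()
    return total / len(val_chunks)
```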
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…60-gptq-brotli-1.1105

Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)
…mult4-wd085

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR openai#1179, -0.0143 vs merged SOTA
…0-allint6

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
…-slot-v4

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…mb-sdclip-loop45x2

Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip  — val_bpb 1.08563 (5 seed mean)
…-ttt-1.08279

Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)
…rallel-ttt

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
…duals-hessian-sdclip

Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean)
…oard-readme

Update README leaderboard for April records
…(1.4765 BPB)

First Mamba depth recurrence in Parameter Golf.
7 Mamba-3 + 1 Attention hybrid with hinge-point multi-recurrence
(12 virtual layers from 8 physical, zero extra params).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
inin-zou force-pushed the submission/nemotron-h-mamba3-depth-recurrence branch from cd8ae31 to 1504011 on April 23, 2026 at 17:44
inin-zou added a commit to inin-zou/parameter-golf that referenced this pull request Apr 29, 2026
…rgets

- Phase 1 complete, Phase 2 in progress
- 1000-step benchmark: target ≤1.29 before 8xH100 commit
- PR openai#1607 submitted
- No OpenAI compute credits received
- ~$4 Modal credits remaining
- Deadline tomorrow (Apr 30)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>