Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB)#1607
Open
inin-zou wants to merge 150 commits intoopenai:mainfrom
Open
Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB)#1607inin-zou wants to merge 150 commits intoopenai:mainfrom
inin-zou wants to merge 150 commits intoopenai:mainfrom
Conversation
MLX Timing Mismatch with Main Script
Update README typo
Fix MLX multi-batch validation memory growth
Update README.md
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval **val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB) Four orthogonal improvements over the naive baseline: 1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization 2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB. 3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes 4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost ### Run command ```bash RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \ torchrun --standalone --nproc_per_node=8 train_gpt.py ``` ### Key metrics | Metric | Value | |--------|-------| | Steps (10 min cap) | 12,395 | | int6/int8 sliding val_bpb | **1.1630** | | Quantization penalty | +0.0015 BPB | | Artifact size | 15,353,490 bytes |
… 1.2129) 10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129 across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats (t=34.12, p<<0.001). Key changes: - 10 layers (vs 9 baseline) - Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03 - FP16 tied embedding export (reduces quant gap) - Int6 quantization for middle layers 2-7 (fits under 16MB) Mean artifact size: 15.36MB (under 16MB cap). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aluating the graph after each sub-batch step
Use eager mx.eval() to fix running train script on 16GB Mac devices
keep tok_emb.weight in fp16 during int8 export (kills the quant gap), shrink MLP hidden to 992 to fit under 16MB, bump warmdown to 3600 and matrix LR to 0.06. tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* SOTA attempt * Improve score on SXM --------- Co-authored-by: spokane-way <spokane@way>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB). Key changes: - 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params) - QAT: STE fake-quantize simulates int6 during training - Int6 quantization on all block weights (layers 0-8) - Sliding window eval (stride=64) for ~0.033 BPB free gain - FP16 tied embedding + lower LRs (carried over) 5-seed results on 8xH100 SXM: Mean slide_bpb: 1.1652 (std=0.0017) Mean rt_bpb: 1.1985 t-statistic: 78.93 (p << 0.001) All artifacts under 16MB (mean: 15.64MB) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The window_starts filter dropped windows shorter than stride, silently skipping up to (stride-1) tokens at the end of the validation set. Now includes all windows with >= 1 scoreable token, and clamps the score start for short final windows.
Co-authored-by: spokane-way <spokane@way>
…pb 1.09785 (3-seed mean)
….0912 (3-seed mean) WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for ALL 66 layers at int6 precision. The extra quantization quality more than recovers the WD BPP cost. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337) All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPP.
…al_bpb 1.0897 (3-seed mean) Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation. SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0. 3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
…(3-seed mean) On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all fitting 16MB with 7-11K margin. Per-seed (post-TTT): - seed 0 : 1.08210 (val_loss 2.79517) - seed 42 : 1.08315 (val_loss 2.79788) - seed 1234: 1.08314 (val_loss 2.79785) - mean : 1.08279 (2.79697 nats per token) Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token, clearing the 0.005 nats record threshold by 0.00231 nats per seed. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT matches PR openai#549 precedent: every chunk scored under inference_mode() before any parameter update.
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean) 3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999 All artifacts under 16MB, training under 600s, eval under 600s Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…60-gptq-brotli-1.1105 Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)
…mult4-wd085 Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR openai#1179, -0.0143 vs merged SOTA
…0-allint6 Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
…-slot-v4 Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…mb-sdclip-loop45x2 Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean)
…-ttt-1.08279 Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)
…rallel-ttt Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
…duals-hessian-sdclip Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean)
…oard-readme Update README leaderboard for April records
…(1.4765 BPB) First Mamba depth recurrence in Parameter Golf. 7 Mamba-3 + 1 Attention hybrid with hinge-point multi-recurrence (12 virtual layers from 8 physical, zero extra params). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cd8ae31 to
1504011
Compare
inin-zou
added a commit
to inin-zou/parameter-golf
that referenced
this pull request
Apr 29, 2026
…rgets - Phase 1 complete, Phase 2 in progress - 1000-step benchmark: target ≤1.29 before 8xH100 commit - PR openai#1607 submitted - No OpenAI compute credits received - ~$4 Modal credits remaining - Deadline tomorrow (Apr 30) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Key Findings
Architecture
Credits
Built on PR #1355 (best SSM) pipeline. Inspired by NVIDIA Nemotron-H (arXiv 2504.03624), Mamba-3 (ICLR 2026), and PR #1204 (depth recurrence concept).
Test plan
torchrun --standalone --nproc_per_node=1 train_nemotron_hybrid.pywith env vars from README