
Non-record: Mamba-3 Hybrid + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean)#1890

Open

mradassaad wants to merge 4 commits into openai:main from mradassaad:mamba3-multiepoch-ttt-2026-04-22

Conversation

@mradassaad (Contributor) commented Apr 28, 2026

Non-record: Mamba-3 Hybrid SSM + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean)

val_bpb: 1.1456 (3-seed mean, std 0.0011) | 15.93 MB total (3-seed mean) | 8×H100

A follow-up SSM submission building on PR #1644 (1.1473 bpb). Same 7-layer Mamba-3/Attention hybrid; the −1.7 mBPB improvement comes from three quant/TTT-phase changes that don't touch the architecture.

| Seed | BF16   | Post-quant+TTT | Total submission |
|------|--------|----------------|------------------|
| 1337 | 1.1389 | 1.1441         | 15,930,191 B     |
| 42   | 1.1462 | 1.1460         | 15,961,203 B     |
| 2025 | 1.1495 | 1.1468         | 15,975,083 B     |
| Mean | 1.1449 | 1.1456         | 15,955,492 B     |
| Std  | 0.0045 | 0.0011         | 18,852 B         |

Submitted artifact corresponds to seed 1337 (1.1441, 15,930,191 B).

What changed vs PR #1644

  1. TTT_EPOCHS=2: PR #1644 (Non-record: Mamba-3 Hybrid SSM + SP8192 + Legal TTT — 1.1473 bpb) used a single TTT epoch and saw a +8.3 mBPB BF16 → post-quant regression. With ep=2, the regression flips to approximately neutral (mean +0.7 mBPB across 3 seeds): the second epoch gives the model enough adaptation budget to recover from the quant noise injected by INT6. Cost: 132s vs 76s for the TTT phase, both within the 600s eval budget. (A sketch of the loop follows this list.)

  2. Mixed-precision SSM dynamics protection: the dd_A and dd_dt rows of each Mamba-3 in_proj.weight (32 of 2232 rows per SSM block) are quantized at INT8 instead of INT6. Q-Mamba (ICLR 2025) showed that uniform 6-bit PTQ collapses Mamba perplexity from 5.5 to >21 because A/Ā errors compound through the recurrence. Promoting just these dynamics-specific rows to INT8 costs ~0.01 MiB at this scale and recovers ~0.8 mBPB of quality. Implemented as per-row bit widths threaded through both the GPTQ path and the percentile-search path (see the quantizer sketch after this list). New env var QUANT_BITS_SSM_DYNAMICS=8 (default in Hyperparameters).

  3. Scale-floor quant bug fix: an earlier mixed-precision commit accidentally hardcoded scale.clamp_min(1.0/127) (the INT8 floor) for ALL rows, including INT6 rows that should floor at 1/31. Consequence: INT6 q-values spread more uniformly across [-31, 31], inflating LZMA entropy and starving selective ±1 pruning. Fixed to use a per-row 1/qmax floor (also covered in the quantizer sketch below). Net effect: ~1.4 MiB of spurious size inflation on prior runs disappears.
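
To make item 1 concrete, here is a minimal sketch of a two-epoch TTT loop; `model`, `eval_chunks`, and the AdamW settings are illustrative stand-ins, not the PR's actual TTT code:

```python
import torch
import torch.nn.functional as F

def test_time_train(model, eval_chunks, epochs=2, lr=1e-4):
    # Hypothetical multi-epoch TTT phase (TTT_EPOCHS=2 in this PR).
    # eval_chunks holds (B, T) token tensors from the legal TTT stream;
    # the real submission's optimizer and schedule may differ.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):                 # the second pass is what flips the
        for tokens in eval_chunks:          # post-quant regression to ~neutral
            logits = model(tokens[:, :-1])  # assumes model returns raw logits
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
            )
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
```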

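And a minimal sketch of the per-row mixed-precision scheme from items 2 and 3, assuming symmetric per-row quantization; which 32 rows are the dynamics rows, and the absmax scale (standing in for the PR's GPTQ / percentile-search paths), are simplifications:

```python
import torch

def quantize_per_row(weight, row_bits):
    # row_bits: (rows,) tensor of bit widths: 8 for dd_A/dd_dt rows,
    # 6 elsewhere. qmax is 127 or 31 per row accordingly.
    qmax = (2 ** (row_bits - 1) - 1).to(weight.dtype)
    scale = weight.abs().amax(dim=1) / qmax
    # Item 3's fix: floor each row's scale at 1/qmax for its OWN bit
    # width, not the hardcoded INT8 floor 1/127 the bug applied everywhere.
    scale = torch.maximum(scale, 1.0 / qmax)
    q = torch.clamp(torch.round(weight / scale[:, None]),
                    -qmax[:, None], qmax[:, None])
    return q.to(torch.int8), scale

# Example at this PR's shapes: a 2232-row in_proj with 32 rows at INT8.
# Which row indices hold dd_dt/dd_A is illustrative, not the real layout.
w = torch.randn(2232, 512)
bits = torch.full((2232,), 6)
bits[-32:] = 8
q, scale = quantize_per_row(w, bits)
```
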
Architecture (unchanged from PR #1644)

7-layer Mamba-3 SISO hybrid: 5 SSM blocks + 2 FlashAttention layers at positions 2 and 5, dim=512, d_state=64, expand=2, headdim=64, chunk_size=64, mlp_mult=3, 25.16M params. SP8192 BPE tokenizer trained from scratch on FineWeb. See PR #1644 for the full architectural rationale and Triton kernel analysis (no kernel-level changes here).

Reproduction

SEED=1337 VOCAB_SIZE=8192 NUM_LAYERS=7 NUM_ATTN_LAYERS=2 \
  TRAIN_SEQ_LEN=4096 WARMDOWN_ITERS=2600 WARMDOWN_SHAPE=linear \
  MUON_EQ_R=1 LATE_QAT_THRESHOLD=0.15 \
  USE_GPTQ=1 QUANT_BITS=6 QUANT_BITS_EMBED=8 GPTQ_NUM_SEQS=32 \
  EVAL_OVERLAP=1024 USE_LZMA=1 EVAL_TEMP=0.9 TTT_EPOCHS=2 \
  WEIGHT_DECAY=0.04 MUON_MOMENTUM=0.99 MATRIX_LR=0.025 \
  torchrun --nproc_per_node=8 train_mamba3_hybrid.py

QUANT_BITS_SSM_DYNAMICS=8 is the default in Hyperparameters and does not need to be set explicitly. Repeat with SEED=42 and SEED=2025 for the 3-seed mean.
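
A hypothetical excerpt of how these defaults might sit in the Hyperparameters class; the field names and defaults are inferred from the env vars and are not the PR's actual code:

```python
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    # Illustrative excerpt only; names/defaults inferred from the env vars.
    quant_bits: int = 6               # QUANT_BITS
    quant_bits_embed: int = 8         # QUANT_BITS_EMBED
    quant_bits_ssm_dynamics: int = 8  # QUANT_BITS_SSM_DYNAMICS (new in this PR)
```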

Data

Same as PR #1644: SP8192 BPE tokenizer trained from scratch on FineWeb-10B because the kevclark/parameter-golf SP8192 tokenizer was not consistent with this submission's tokenizer config. Tokenized shards and tokenizer artifacts available on a private HF dataset on request.
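
Training an SP8192 BPE tokenizer from scratch with the sentencepiece Python API looks roughly like this; the input path and character_coverage are assumptions, not necessarily the settings used here:

```python
import sentencepiece as spm

# Train a BPE tokenizer on a raw-text dump of FineWeb-10B.
# "fineweb10b.txt" is an illustrative path.
spm.SentencePieceTrainer.train(
    input="fineweb10b.txt",
    model_prefix="sp8192_bpe",
    vocab_size=8192,
    model_type="bpe",
    character_coverage=1.0,
)

# Load and sanity-check the trained model.
sp = spm.SentencePieceProcessor(model_file="sp8192_bpe.model")
print(sp.encode("hello world"))
```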

What I tested and removed

This is a non-record submission and represents the cleaned production path from a much larger experimental sprint. The training script in this PR is the lean submission version. Many techniques that did not survive empirical validation at 25M / 10min / 16MB / SP8192 are not represented in this PR, including: a 1-attn layout (works at SP4096, fails at SP8192 by +7.5 mBPB BF16); low-rank in_proj factorization (fails because random factored init destroys upstream's structured init for the dd_A/dd_dt rows); depth recurrence at SP8192 (fails by +13.9 mBPB BF16 at expand=1.5); MLP INT5 quantization (+8 mBPB quality loss); and several others. Two patterns emerged worth flagging:

  • LZMA compression penalty for SSM weights: across three runs I measured SSM-heavy hybrids compressing by ~33% under LZMA vs ~40% for attention-heavier hybrids — roughly a 3× higher compressed-bytes-per-raw-byte cost for swapping an attention block for an SSM block (measurement sketch after this list). The candidate mechanism (untested) is that Mamba-3's in_proj rows have heterogeneous distributions (z, xv, B, C, dd_dt, dd_A, trap, angles) and so quantize to higher-entropy byte streams than attention's uniform QKV. I did not run the experiment that would isolate this from other SSM-vs-attention differences.

  • SP4096 architectural sweeps don't transfer to SP8192: replacing 2-attn with 1-attn at SP8192 7L costs +7.5 mBPB BF16, even though the same swap at SP4096 8L was a clean −9.8 mBPB win. Depth recurrence at expand=1.5 has a similar sign flip across vocabularies. I don't have a tested explanation; one suspect I considered but didn't isolate is that Muon's Newton-Schulz orthogonalization may interact with the heterogeneous magnitude structure of SSM in_proj rows differently than with attention's uniform QKV. Mainly worth flagging as a methodology warning: don't extrapolate small-vocab sweep results to larger-vocab submissions.
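
The compression-penalty measurement in the first bullet reduces to comparing compressed-bytes-per-raw-byte of the packed quant streams; a minimal sketch, assuming the packed INT6 byte stream of each block is available (the preset is an assumption, not the submission pipeline's exact LZMA settings):

```python
import lzma

def compressed_ratio(raw: bytes, preset: int = 9) -> float:
    # Compressed bytes per raw byte of a packed quantized weight stream;
    # lower means LZMA squeezes the block harder.
    return len(lzma.compress(raw, preset=preset)) / len(raw)

# Hypothetical comparison: pull the packed byte streams of one SSM block
# and one attention block from the same checkpoint, then compare
#   compressed_ratio(ssm_block_bytes)  vs  compressed_ratio(attn_block_bytes)
```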

Non-record: Mamba-3 Hybrid + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean)

Follow-up to PR openai#1644. Same SP8192 7L (5 SSM + 2 attn) architecture; the
−1.7 mBPB improvement comes from quant/TTT-phase changes only:

1. TTT_EPOCHS=2: PR openai#1644's single-epoch TTT had a +8.3 mBPB BF16 → post-quant
   regression. Two epochs flip that to approximately neutral (~+0.7 mBPB) by
   giving the model a second pass to recover quant noise.

2. Mixed-precision SSM dynamics: dd_A and dd_dt rows of mamba3.in_proj.weight
   (32 of 2232 per SSM block) quantized at INT8 instead of INT6, addressing
   the recurrence-amplification of A/Ā quant errors documented in Q-Mamba
   (ICLR 2025). Per-row bit widths threaded through GPTQ + percentile-search
   paths. ~0.01 MiB cost, ~0.8 mBPB quality recovery.

3. Scale-floor bug fix: the original mixed-precision commit clamped scale
   floor at 1/127 for all rows; INT6 rows should use 1/31. Per-row 1/qmax
   restores correct LZMA compressibility (~1.4 MiB of spurious size on the
   prior runs disappears).

3-seed mean: post-quant+TTT 1.1456 ± 0.0011, 15.93 MB total ± 19 KB. All
three seeds individually beat PR openai#1644 and fit ≤16 MB.
Removed a note about findings documented in a separate writeup.
