Non-record: Mamba-3 Hybrid + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean) #1890
Open
mradassaad wants to merge 4 commits into openai:main from
val_bpb: 1.1456 (3-seed mean, std 0.0011) | 15.93 MB total (3-seed mean) | 8×H100
A follow-up SSM submission building on PR #1644 (1.1473 bpb). Same 7-layer Mamba-3/Attention hybrid; the −1.7 mBPB improvement comes from three quant/TTT-phase changes that don't touch the architecture.
Submitted artifact corresponds to seed 1337 (1.1441 bpb, 15,930,191 bytes). All three seeds individually beat PR #1644 and fit ≤16 MB, with a total-size spread of about ±19 KB across seeds.
What changed vs PR #1644
1. TTT_EPOCHS=2: PR #1644 used a single TTT epoch and saw a +8.3 mBPB BF16 → post-quant regression. With two epochs, the regression flips to approximately neutral (mean +0.7 mBPB across 3 seeds). The second epoch gives the model enough adaptation budget to recover the quant noise injected by INT6. Cost: 132s vs 76s for the TTT phase, both within the 600s eval budget.
2. Mixed-precision SSM dynamics protection: the dd_A and dd_dt rows of each Mamba-3 in_proj.weight (32 of 2232 rows per SSM block) are quantized at INT8 instead of INT6. Q-Mamba (ICLR 2025) showed that uniform 6-bit PTQ collapses Mamba perplexity from 5.5 to >21 because A/Ā errors compound through the recurrence. Promoting just these dynamics-specific rows to INT8 costs ~0.01 MiB at this scale and recovers ~0.8 mBPB of quality. Implemented as per-row bit widths threaded through both the GPTQ path and the percentile-search path. New env var QUANT_BITS_SSM_DYNAMICS=8 (default in Hyperparameters).
3. Scale-floor quant bug fix: an earlier mixed-precision commit accidentally hardcoded scale.clamp_min(1.0/127) (the INT8 floor) for ALL rows, including INT6 rows that should floor at 1/31. Consequence: INT6 q-values spread across [-31, 31] more uniformly, inflating LZMA entropy and starving selective ±1 pruning. Fixed to use a per-row 1/qmax floor. Net effect: ~1.4 MiB of spurious size inflation on prior runs disappears. A minimal sketch of the per-row scheme behind changes 2 and 3 follows this list.
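At their core, changes 2 and 3 amount to a symmetric round-to-nearest quantizer whose bit width and scale floor are chosen per row. The sketch below only illustrates that idea under assumed names (quantize_per_row, the row indices in the usage lines); the submission's actual GPTQ and percentile-search implementations are more involved.

```python
import torch

def quantize_per_row(w: torch.Tensor, row_bits: torch.Tensor):
    """Symmetric round-to-nearest quantization with a per-row bit width.

    w        : [rows, cols] weight matrix (e.g. a Mamba-3 in_proj.weight)
    row_bits : [rows] integer tensor, 6 for most rows, 8 for the dd_A / dd_dt rows
    Returns (q, scale) with w ≈ q * scale[:, None].
    """
    qmax = (2 ** (row_bits.to(w.dtype) - 1)) - 1          # 31 for INT6, 127 for INT8
    scale = w.abs().amax(dim=1) / qmax                    # per-row symmetric scale
    scale = torch.maximum(scale, 1.0 / qmax)              # per-row floor: 1/31 or 1/127,
                                                          # not a hardcoded 1/127 for all rows
    q = torch.clamp(torch.round(w / scale[:, None]),
                    -qmax[:, None], qmax[:, None]).to(torch.int8)
    return q, scale

# Usage sketch. The dd_A / dd_dt row positions here are assumed, not the submission's real
# in_proj packing; only the 32-of-2232 count comes from the PR description.
w = torch.randn(2232, 512)
row_bits = torch.full((2232,), 6)
row_bits[-32:] = 8                                        # protect the SSM dynamics rows
q, scale = quantize_per_row(w, row_bits)
```

Flooring each row at its own 1/qmax is what keeps the INT6 q-values in the narrow range that LZMA and the selective ±1 pruning rely on.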
Architecture (unchanged from PR #1644)
7-layer Mamba-3 SISO hybrid: 5 SSM blocks + 2 FlashAttention layers at positions 2 and 5, dim=512, d_state=64, expand=2, headdim=64, chunk_size=64, mlp_mult=3, 25.16M params. SP8192 BPE tokenizer trained from scratch on FineWeb. See PR #1644 for the full architectural rationale and Triton kernel analysis (no kernel-level changes here).
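For reference, the stated layout can be written down as a small config sketch; only the index base for the attention positions (0-based) is my assumption, and the block classes themselves are not shown.

```python
# Dimensions as stated in the PR; the attention-position index base is assumed.
MODEL_CFG = dict(dim=512, d_state=64, expand=2, headdim=64, chunk_size=64, mlp_mult=3)
N_LAYERS, ATTN_POSITIONS = 7, {2, 5}

block_kinds = ["attn" if i in ATTN_POSITIONS else "ssm" for i in range(N_LAYERS)]
assert block_kinds.count("ssm") == 5 and block_kinds.count("attn") == 2
print(block_kinds)  # ['ssm', 'ssm', 'attn', 'ssm', 'ssm', 'attn', 'ssm']
```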
Reproduction
QUANT_BITS_SSM_DYNAMICS=8 is the default in Hyperparameters and does not need to be set explicitly. Repeat with SEED=42 and SEED=2025 for the 3-seed mean.
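For clarity, this is roughly the env-var override pattern those instructions imply; the field names, the SEED default, and the parsing style are assumptions, not the submission's actual Hyperparameters class.

```python
import os
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    # Defaults mirror the submitted run; export SEED=42 or SEED=2025 to reproduce the
    # other two seeds. Field names and parsing are illustrative only.
    seed: int = int(os.environ.get("SEED", "1337"))
    ttt_epochs: int = int(os.environ.get("TTT_EPOCHS", "2"))
    quant_bits_ssm_dynamics: int = int(os.environ.get("QUANT_BITS_SSM_DYNAMICS", "8"))

args = Hyperparameters()
```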
Data
Same as PR #1644: SP8192 BPE tokenizer trained from scratch on FineWeb-10B because the kevclark/parameter-golf SP8192 tokenizer was not consistent with this submission's tokenizer config. Tokenized shards and tokenizer artifacts available on a private HF dataset on request.
What I tested and removed
This is a non-record submission and represents the cleaned production path from a much larger experimental sprint. The training script in this PR is the lean submission version. Many techniques that did not survive empirical validation at 25M / 10min / 16MB / SP8192 are not represented in this PR, including: 1-attention ratio (works at SP4096, fails at SP8192 by +7.5 mBPB BF16), low-rank in_proj factorization (fails because random factored init destroys upstream's structured init for the dd_A / dd_dt rows), depth recurrence at SP8192 (fails by +13.9 mBPB BF16 at expand=1.5), MLP INT5 quantization (+8 mBPB quality cost), and several others. Two patterns emerged that are worth flagging:
- LZMA compression penalty for SSM weights: across three runs I measured SSM-heavy hybrids compressing ~33% under LZMA vs ~40% for attention-heavier hybrids, roughly a 3× higher compressed-bytes-per-raw-byte cost for swapping an attention block for an SSM block. The candidate mechanism (untested) is that Mamba-3's in_proj rows have heterogeneous distributions (z, xv, B, C, dd_dt, dd_A, trap, angles) and so quantize to higher-entropy byte streams than attention's uniform QKV. I did not run the experiment that would isolate this from other SSM-vs-attention differences; a measurement sketch follows this list.
- SP4096 architectural sweeps don't transfer to SP8192: replacing 2-attn with 1-attn at SP8192 7L costs +7.5 mBPB BF16, even though the same swap at SP4096 8L was a clean −9.8 mBPB win. Depth recurrence at expand=1.5 shows a similar sign flip across vocabularies. I don't have a tested explanation; one suspect I considered but didn't isolate is that Muon's Newton-Schulz orthogonalization may interact with the heterogeneous magnitude structure of SSM in_proj rows differently than with attention's uniform QKV. Mainly worth flagging as a methodology warning: don't extrapolate small-vocab sweep results to larger-vocab submissions.
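The LZMA observation in the first pattern is easy to measure in isolation on any checkpoint. Below is a minimal sketch of that measurement; the state_dict keys in the commented lines are illustrative, not this PR's layout.

```python
import lzma

import torch

def lzma_ratio(q: torch.Tensor, preset: int = 9) -> float:
    """Compressed bytes per raw byte of an already-quantized int8 tensor under LZMA."""
    raw = q.to(torch.int8).contiguous().numpy().tobytes()
    return len(lzma.compress(raw, preset=preset)) / len(raw)

# Standalone sanity demo on synthetic near-uniform INT6 values:
demo = torch.randint(-31, 32, (2232, 512), dtype=torch.int8)
print(f"synthetic INT6 demo: {lzma_ratio(demo):.2f} compressed bytes per raw byte")

# On a real checkpoint, compare an SSM in_proj byte stream against an attention QKV one
# produced by the same quantizer (keys below are illustrative):
# ssm_ratio  = lzma_ratio(quantized["blocks.0.mamba3.in_proj.weight"])
# attn_ratio = lzma_ratio(quantized["blocks.2.attn.qkv.weight"])
```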