Non-record: Mamba-3 Hybrid + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean) #1890
Open
mradassaad wants to merge 4 commits into openai:main from
val_bpb: 1.1456 (3-seed mean, std 0.0011) | 15.93 MB total (3-seed mean) | 8×H100
A follow-up SSM submission building on PR #1644 (1.1473 bpb). Same 7-layer Mamba-3/Attention hybrid; the −1.7 mBPB improvement comes from three quant/TTT-phase changes that don't touch the architecture.
Submitted artifact corresponds to seed 1337 (1.1441 bpb, 15,930,191 bytes). All three seeds individually beat PR #1644 and fit ≤16 MB, with a total-size spread of about ±19 KB across seeds.
What changed vs PR #1644
1. TTT_EPOCHS=2: PR #1644 used a single TTT epoch and saw a +8.3 mBPB BF16 → post-quant regression. With two epochs, the regression flips to approximately neutral (mean +0.7 mBPB across 3 seeds). The second epoch gives the model enough adaptation budget to recover the quant noise injected by INT6. Cost: 132s vs 76s for the TTT phase, both within the 600s eval budget.
2. Mixed-precision SSM dynamics protection: the dd_A and dd_dt rows of each Mamba-3 in_proj.weight (32 of 2232 rows per SSM block) are quantized at INT8 instead of INT6. Q-Mamba (ICLR 2025) showed that uniform 6-bit PTQ collapses Mamba perplexity from 5.5 to >21 because A/Ā errors compound through the recurrence. Promoting just these dynamics-specific rows to INT8 costs ~0.01 MiB at this scale and recovers ~0.8 mBPB of quality. Implemented as per-row bit widths threaded through both the GPTQ path and the percentile-search path. New env var QUANT_BITS_SSM_DYNAMICS=8 (default in Hyperparameters).
3. Scale-floor quant bug fix: an earlier mixed-precision commit accidentally hardcoded scale.clamp_min(1.0/127) (the INT8 floor) for ALL rows, including INT6 rows that should floor at 1/31. Consequence: INT6 q-values spread across [-31, 31] more uniformly, inflating LZMA entropy and starving selective ±1 pruning. Fixed to use a per-row 1/qmax floor. Net effect: ~1.4 MiB of spurious size inflation on prior runs disappears. A minimal sketch of the per-row scheme behind changes 2 and 3 follows this list.
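At their core, changes 2 and 3 amount to a symmetric round-to-nearest quantizer whose bit width and scale floor are chosen per row. The sketch below only illustrates that idea under assumed names (quantize_per_row, the row indices in the usage lines); the submission's actual GPTQ and percentile-search implementations are more involved.

```python
import torch

def quantize_per_row(w: torch.Tensor, row_bits: torch.Tensor):
    """Symmetric round-to-nearest quantization with a per-row bit width.

    w        : [rows, cols] weight matrix (e.g. a Mamba-3 in_proj.weight)
    row_bits : [rows] integer tensor, 6 for most rows, 8 for the dd_A / dd_dt rows
    Returns (q, scale) with w ≈ q * scale[:, None].
    """
    qmax = (2 ** (row_bits.to(w.dtype) - 1)) - 1          # 31 for INT6, 127 for INT8
    scale = w.abs().amax(dim=1) / qmax                    # per-row symmetric scale
    scale = torch.maximum(scale, 1.0 / qmax)              # per-row floor: 1/31 or 1/127,
                                                          # not a hardcoded 1/127 for all rows
    q = torch.clamp(torch.round(w / scale[:, None]),
                    -qmax[:, None], qmax[:, None]).to(torch.int8)
    return q, scale

# Usage sketch. The dd_A / dd_dt row positions here are assumed, not the submission's real
# in_proj packing; only the 32-of-2232 count comes from the PR description.
w = torch.randn(2232, 512)
row_bits = torch.full((2232,), 6)
row_bits[-32:] = 8                                        # protect the SSM dynamics rows
q, scale = quantize_per_row(w, row_bits)
```

Flooring each row at its own 1/qmax is what keeps the INT6 q-values in the narrow range that LZMA and the selective ±1 pruning rely on.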
Architecture (unchanged from PR #1644)
7-layer Mamba-3 SISO hybrid: 5 SSM blocks + 2 FlashAttention layers at positions 2 and 5, dim=512, d_state=64, expand=2, headdim=64, chunk_size=64, mlp_mult=3, 25.16M params. SP8192 BPE tokenizer trained from scratch on FineWeb. See PR #1644 for the full architectural rationale and Triton kernel analysis (no kernel-level changes here).
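For reference, the stated layout can be written down as a small config sketch; only the index base for the attention positions (0-based) is my assumption, and the block classes themselves are not shown.

```python
# Dimensions as stated in the PR; the attention-position index base is assumed.
MODEL_CFG = dict(dim=512, d_state=64, expand=2, headdim=64, chunk_size=64, mlp_mult=3)
N_LAYERS, ATTN_POSITIONS = 7, {2, 5}

block_kinds = ["attn" if i in ATTN_POSITIONS else "ssm" for i in range(N_LAYERS)]
assert block_kinds.count("ssm") == 5 and block_kinds.count("attn") == 2
print(block_kinds)  # ['ssm', 'ssm', 'attn', 'ssm', 'ssm', 'attn', 'ssm']
```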
Reproduction
QUANT_BITS_SSM_DYNAMICS=8 is the default in Hyperparameters and does not need to be set explicitly. Repeat with SEED=42 and SEED=2025 for the 3-seed mean.
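For clarity, this is roughly the env-var override pattern those instructions imply; the field names, the SEED default, and the parsing style are assumptions, not the submission's actual Hyperparameters class.

```python
import os
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    # Defaults mirror the submitted run; export SEED=42 or SEED=2025 to reproduce the
    # other two seeds. Field names and parsing are illustrative only.
    seed: int = int(os.environ.get("SEED", "1337"))
    ttt_epochs: int = int(os.environ.get("TTT_EPOCHS", "2"))
    quant_bits_ssm_dynamics: int = int(os.environ.get("QUANT_BITS_SSM_DYNAMICS", "8"))

args = Hyperparameters()
```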
Data
Same as PR #1644: SP8192 BPE tokenizer trained from scratch on FineWeb-10B because the kevclark/parameter-golf SP8192 tokenizer was not consistent with this submission's tokenizer config. Tokenized shards and tokenizer artifacts available on a private HF dataset on request.
What I tested and removed
This is a non-record submission and represents the cleaned production path from a much larger experimental sprint. The training script in this PR is the lean submission version. Many techniques that did not survive empirical validation at 25M / 10min / 16MB / SP8192 are not represented in this PR, including: 1-attention ratio (works at SP4096, fails at SP8192 by +7.5 mBPB BF16), low-rank in_proj factorization (fails because random factored init destroys upstream's structured init for the dd_A / dd_dt rows), depth recurrence at SP8192 (fails by +13.9 mBPB BF16 at expand=1.5), MLP INT5 quantization (+8 mBPB quality cost), and several others. Two patterns emerged that are worth flagging:
- LZMA compression penalty for SSM weights: across three runs I measured SSM-heavy hybrids compressing ~33% under LZMA vs ~40% for attention-heavier hybrids, roughly a 3× higher compressed-bytes-per-raw-byte cost for swapping an attention block for an SSM block. The candidate mechanism (untested) is that Mamba-3's in_proj rows have heterogeneous distributions (z, xv, B, C, dd_dt, dd_A, trap, angles) and so quantize to higher-entropy byte streams than attention's uniform QKV. I did not run the experiment that would isolate this from other SSM-vs-attention differences; a measurement sketch follows this list.
- SP4096 architectural sweeps don't transfer to SP8192: replacing 2-attn with 1-attn at SP8192 7L costs +7.5 mBPB BF16, even though the same swap at SP4096 8L was a clean −9.8 mBPB win. Depth recurrence at expand=1.5 shows a similar sign flip across vocabularies. I don't have a tested explanation; one suspect I considered but didn't isolate is that Muon's Newton-Schulz orthogonalization may interact with the heterogeneous magnitude structure of SSM in_proj rows differently than with attention's uniform QKV. Mainly worth flagging as a methodology warning: don't extrapolate small-vocab sweep results to larger-vocab submissions.
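The LZMA observation in the first pattern is easy to measure in isolation on any checkpoint. Below is a minimal sketch of that measurement; the state_dict keys in the commented lines are illustrative, not this PR's layout.

```python
import lzma

import torch

def lzma_ratio(q: torch.Tensor, preset: int = 9) -> float:
    """Compressed bytes per raw byte of an already-quantized int8 tensor under LZMA."""
    raw = q.to(torch.int8).contiguous().numpy().tobytes()
    return len(lzma.compress(raw, preset=preset)) / len(raw)

# Standalone sanity demo on synthetic near-uniform INT6 values:
demo = torch.randint(-31, 32, (2232, 512), dtype=torch.int8)
print(f"synthetic INT6 demo: {lzma_ratio(demo):.2f} compressed bytes per raw byte")

# On a real checkpoint, compare an SSM in_proj byte stream against an attention QKV one
# produced by the same quantizer (keys below are illustrative):
# ssm_ratio  = lzma_ratio(quantized["blocks.0.mamba3.in_proj.weight"])
# attn_ratio = lzma_ratio(quantized["blocks.2.attn.qkv.weight"])
```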