@@ -0,0 +1,53 @@
# Non-record: Mamba-3 Hybrid SSM + Multi-Epoch TTT + Dynamics-Protected Quant — 1.1456 bpb (3-seed mean)

**val_bpb: 1.1456** (3-seed mean, std 0.0011) | **15.96 MB total** (3-seed mean; the submitted seed-1337 artifact is 15.93 MB) | 8×H100

This is a follow-up SSM submission building on PR #1644 (1.1473 bpb). It uses the same 7-layer Mamba-3/Attention hybrid; the −1.7 mBPB improvement comes from three changes to the quantization and TTT phases that leave the architecture untouched.

| Seed | BF16 | Post-quant+TTT | Total submission |
|------|------|----------------|------------------|
| 1337 | 1.1389 | **1.1441** | 15,930,191 B |
| 42 | 1.1462 | **1.1460** | 15,961,203 B |
| 2025 | 1.1495 | **1.1468** | 15,975,083 B |
| **Mean** | **1.1449** | **1.1456** | **15,955,492 B** |
| **Std** | **0.0045** | **0.0011** | **18,852 B** |

Submitted artifact corresponds to seed 1337 (1.1441, 15,930,191 B).
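
The summary row can be re-derived from the per-seed entries. The snippet below is a minimal sketch that only uses the rounded values printed in the table (the variable names are illustrative, not taken from the training script), so results may differ in the last digit from figures computed on unrounded logs.

```python
from statistics import mean

# Per-seed values copied from the table above:
# seed -> (BF16 bpb, post-quant+TTT bpb, total submission bytes)
RESULTS = {
    1337: (1.1389, 1.1441, 15_930_191),
    42: (1.1462, 1.1460, 15_961_203),
    2025: (1.1495, 1.1468, 15_975_083),
}

bf16_mean = mean(r[0] for r in RESULTS.values())
post_mean = mean(r[1] for r in RESULTS.values())
bytes_mean = mean(r[2] for r in RESULTS.values())

# BF16 -> post-quant+TTT change per seed, in mBPB (1 mBPB = 0.001 bpb).
# Positive means the quantized + TTT model is worse than BF16.
deltas_mbpb = {
    seed: (post - bf16) * 1000 for seed, (bf16, post, _) in RESULTS.items()
}

print(f"BF16 mean:       {bf16_mean:.4f}")
print(f"Post-quant mean: {post_mean:.4f}")
print(f"Mean size:       {round(bytes_mean):,} B")
for seed, delta in deltas_mbpb.items():
    print(f"seed {seed}: {delta:+.1f} mBPB")
```

From the rounded entries, the per-seed changes are +5.2, −0.2 and −2.7 mBPB, so the near-zero mean is an average over one seed that regresses and two that improve slightly.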

## What changed vs PR #1644

1. **`TTT_EPOCHS=2`**: PR #1644 used a single TTT epoch and saw a +8.3 mBPB regression from BF16 to post-quant. With two epochs, the regression becomes approximately neutral (mean +0.7 mBPB across 3 seeds). The second epoch gives the model enough adaptation budget to recover from the quantization noise introduced by INT6. Cost: the TTT phase takes 132s instead of 76s, both within the 600s eval budget.

2. **Mixed-precision SSM dynamics protection**: the `dd_A` and `dd_dt` rows of each Mamba-3 `in_proj.weight` (32 of 2232 rows per SSM block) are quantized at INT8 instead of INT6. Q-Mamba (ICLR 2025) showed that uniform 6-bit PTQ collapses Mamba perplexity from 5.5 to >21 because A/Ā errors compound through the recurrence. Promoting only these dynamics rows to INT8 costs ~0.01 MiB at this scale and recovers ~0.8 mBPB of quality. It is implemented as per-row bit widths threaded through both the GPTQ path and the percentile-search path, controlled by a new env var `QUANT_BITS_SSM_DYNAMICS=8` (the default in `Hyperparameters`).

3. **Scale-floor quant bug fix**: an earlier mixed-precision commit accidentally hardcoded `scale.clamp_min(1.0/127)` (the INT8 floor) for all rows, including INT6 rows that should floor at `1/31`. As a consequence, INT6 q-values were spread more uniformly across [-31, 31], which inflated LZMA entropy and starved selective ±1 pruning. The floor is now the per-row `1/qmax`. Net effect: ~1.4 MiB of spurious size inflation seen on prior runs disappears.
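
To make items 2 and 3 concrete, here is a minimal pure-Python sketch of per-row symmetric quantization with per-row bit widths and a per-row scale floor. It is not the submission's code: it uses plain absmax rounding rather than the GPTQ or percentile-search paths, and `quantize_matrix`, `dynamics_rows` and `per_row_floor` are hypothetical names introduced for illustration only.

```python
def qmax(bits):
    """Largest magnitude on a symmetric signed grid: 31 for INT6, 127 for INT8."""
    return (1 << (bits - 1)) - 1


def quantize_row(row, bits, scale_floor):
    """Absmax-quantize one row, never letting the scale drop below scale_floor."""
    m = qmax(bits)
    absmax = max(abs(w) for w in row)
    scale = max(absmax / m, scale_floor)
    q = [max(-m, min(m, round(w / scale))) for w in row]
    return q, scale


def quantize_matrix(rows, dynamics_rows, base_bits=6, dynamics_bits=8,
                    per_row_floor=True):
    """Quantize each row at base_bits, except rows listed in dynamics_rows.

    per_row_floor=True  uses the floor 1/qmax for that row's own bit width.
    per_row_floor=False mimics the bug described above: the INT8 floor
    1/127 is applied to every row regardless of its bit width.
    """
    out = []
    for i, row in enumerate(rows):
        bits = dynamics_bits if i in dynamics_rows else base_bits
        floor = 1.0 / qmax(bits) if per_row_floor else 1.0 / 127
        q, scale = quantize_row(row, bits, floor)
        out.append((bits, scale, q))
    return out


# Toy 2-row matrix; row 1 stands in for a protected dd_A / dd_dt row.
rows = [[0.2, -0.1, 0.05, 0.0], [0.2, -0.1, 0.05, 0.0]]
fixed = quantize_matrix(rows, dynamics_rows={1})
buggy = quantize_matrix(rows, dynamics_rows={1}, per_row_floor=False)

print("INT6 row, per-row floor:", fixed[0][2])
print("INT6 row, 1/127 floor:  ", buggy[0][2])
print("INT8 row (protected):   ", fixed[1][2])
```

On this toy input the INT6 row quantizes to small integers with the `1/31` floor and to noticeably larger ones with the `1/127` floor, which is the wider spread of q-values that item 3 blames for the size inflation.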

## Architecture (unchanged from PR #1644)

7-layer Mamba-3 SISO hybrid: 5 SSM blocks + 2 FlashAttention layers at positions 2 and 5, dim=512, d_state=64, expand=2, headdim=64, chunk_size=64, mlp_mult=3, 25.16M params. SP8192 BPE tokenizer trained from scratch on FineWeb. See PR #1644 for the full architectural rationale and Triton kernel analysis (no kernel-level changes here).

## Reproduction

```bash
SEED=1337 VOCAB_SIZE=8192 NUM_LAYERS=7 NUM_ATTN_LAYERS=2 \
TRAIN_SEQ_LEN=4096 WARMDOWN_ITERS=2600 WARMDOWN_SHAPE=linear \
MUON_EQ_R=1 LATE_QAT_THRESHOLD=0.15 \
USE_GPTQ=1 QUANT_BITS=6 QUANT_BITS_EMBED=8 GPTQ_NUM_SEQS=32 \
EVAL_OVERLAP=1024 USE_LZMA=1 EVAL_TEMP=0.9 TTT_EPOCHS=2 \
WEIGHT_DECAY=0.04 MUON_MOMENTUM=0.99 MATRIX_LR=0.025 \
torchrun --nproc_per_node=8 train_mamba3_hybrid.py
```

`QUANT_BITS_SSM_DYNAMICS=8` is the default in `Hyperparameters` and does not need to be set explicitly. Repeat with `SEED=42` and `SEED=2025` for the 3-seed mean.
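
For the 3-seed sweep, a small Python launcher along the following lines could be used. This is a sketch rather than part of the submission: the environment variables and launch command are copied from the block above, while `env_for`, `shell_line` and `run_all` are hypothetical helper names. It defaults to a dry run that only returns the command lines.

```python
import os
import shlex
import subprocess

# Environment variables copied from the reproduction command above.
BASE_ENV = {
    "VOCAB_SIZE": "8192", "NUM_LAYERS": "7", "NUM_ATTN_LAYERS": "2",
    "TRAIN_SEQ_LEN": "4096", "WARMDOWN_ITERS": "2600",
    "WARMDOWN_SHAPE": "linear", "MUON_EQ_R": "1",
    "LATE_QAT_THRESHOLD": "0.15", "USE_GPTQ": "1", "QUANT_BITS": "6",
    "QUANT_BITS_EMBED": "8", "GPTQ_NUM_SEQS": "32", "EVAL_OVERLAP": "1024",
    "USE_LZMA": "1", "EVAL_TEMP": "0.9", "TTT_EPOCHS": "2",
    "WEIGHT_DECAY": "0.04", "MUON_MOMENTUM": "0.99", "MATRIX_LR": "0.025",
}
SEEDS = (1337, 42, 2025)
LAUNCH = ["torchrun", "--nproc_per_node=8", "train_mamba3_hybrid.py"]


def env_for(seed):
    """Return the run-specific environment overrides for one seed."""
    env = dict(BASE_ENV)
    env["SEED"] = str(seed)
    return env


def shell_line(seed):
    """Render the equivalent one-line shell command, for logging."""
    assignments = " ".join(
        f"{key}={shlex.quote(value)}" for key, value in env_for(seed).items()
    )
    return f"{assignments} {shlex.join(LAUNCH)}"


def run_all(dry_run=True):
    """Run (or, by default, only list) the three seeded training runs."""
    lines = []
    for seed in SEEDS:
        lines.append(shell_line(seed))
        if not dry_run:
            # Overrides are layered on top of the caller's environment.
            subprocess.run(LAUNCH, env={**os.environ, **env_for(seed)}, check=True)
    return lines


for line in run_all():
    print(line)
```

Calling `run_all(dry_run=False)` would launch the runs sequentially and stop at the first failure because of `check=True`.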

## Data

Same as PR #1644: SP8192 BPE tokenizer trained from scratch on FineWeb-10B because the `kevclark/parameter-golf` SP8192 tokenizer was not consistent with this submission's tokenizer config. Tokenized shards and tokenizer artifacts available on a private HF dataset on request.

## What I tested and removed

This is a non-record submission and represents the cleaned production path from a much larger experimental sprint. The training script in this PR is the lean submission version. Many techniques that did not survive empirical validation at 25M / 10min / 16MB / SP8192 are not represented in this PR, including:

- a 1-attention-layer ratio (works at SP4096, fails at SP8192 by +7.5 mBPB BF16)
- low-rank `in_proj` factorization (fails because random factored init destroys upstream's structured init for the `dd_A`/`dd_dt` rows)
- depth recurrence at SP8192 (fails by +13.9 mBPB BF16 at expand=1.5)
- MLP INT5 quantization (+8 mBPB quality cost)
- several others

Two patterns emerged worth flagging:

- **LZMA compression penalty for SSM weights**: across three runs I measured SSM-heavy hybrids compressing ~33% under LZMA vs ~40% for attention-heavier hybrids — roughly a 3× higher compressed-bytes-per-raw-byte cost for swapping an attention block for an SSM block. The candidate mechanism (untested) is that Mamba-3's `in_proj` rows have heterogeneous distributions (z, xv, B, C, dd_dt, dd_A, trap, angles) and so quantize to higher-entropy byte streams than attention's uniform QKV. I did not run the experiment that would isolate this from other SSM-vs-attention differences.

- **SP4096 architectural sweeps don't transfer to SP8192**: replacing 2-attn with 1-attn at SP8192 7L costs +7.5 mBPB BF16, even though the same swap at SP4096 8L was a clean −9.8 mBPB win. Depth recurrence at expand=1.5 has a similar sign flip across vocabularies. I don't have a tested explanation; one suspect I considered but didn't isolate is that Muon's Newton-Schulz orthogonalization may interact with the heterogeneous magnitude structure of SSM `in_proj` rows differently than with attention's uniform QKV. Mainly worth flagging as a methodology warning: don't extrapolate small-vocab sweep results to larger-vocab submissions.
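
The link between how widely q-values are spread and how well they compress under LZMA can be checked with the standard library alone. The sketch below uses synthetic, clamped Gaussian integers as a stand-in for quantized weights, so it says nothing about the actual 33% vs 40% measurements or about SSM vs attention weights; `synthetic_q_values` and `lzma_ratio` are hypothetical helpers. The same `lzma_ratio` function could be pointed at real per-tensor q-value streams to run the isolating experiment that was left undone.

```python
import lzma
import random


def synthetic_q_values(sigma, n=20_000, qmax=31, seed=0):
    """Rounded Gaussian samples clamped to [-qmax, qmax] (synthetic, not real weights)."""
    rng = random.Random(seed)
    return [max(-qmax, min(qmax, round(rng.gauss(0.0, sigma)))) for _ in range(n)]


def lzma_ratio(q_values):
    """Compressed bytes per raw byte when each q-value is stored as one signed byte."""
    raw = bytes(q & 0xFF for q in q_values)  # two's-complement byte per value
    return len(lzma.compress(raw, preset=9)) / len(raw)


narrow = lzma_ratio(synthetic_q_values(sigma=2.0))   # q-values concentrated near 0
wide = lzma_ratio(synthetic_q_values(sigma=12.0))    # q-values spread over the grid

print(f"narrow distribution: {narrow:.3f} compressed bytes per raw byte")
print(f"wide distribution:   {wide:.3f} compressed bytes per raw byte")
```

The narrow stream compresses to fewer bytes per raw byte than the wide one, which is the direction of effect described for both the scale-floor bug and the heterogeneous `in_proj` rows.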
@@ -0,0 +1,6 @@
torch>=2.9.1
triton>=3.5.0
mamba-ssm>=2.3.1
sentencepiece
einops
numpy
@@ -0,0 +1,11 @@
{
  "author": "mradassaad",
  "github_id": "mradassaad",
  "name": "Mamba-3 Hybrid SSM + SP8192 + Multi-Epoch TTT + Dynamics-Protected Quant",
  "blurb": "Same 7L Mamba-3 SISO hybrid as PR #1644 with three additions: (1) TTT_EPOCHS=2 (multi-epoch chunk TTT recovers most quant damage), (2) mixed-precision quant protecting dd_A + dd_dt rows of in_proj at INT8, (3) scale-floor bug fix in mixed-precision pipeline. 3-seed mean post-quant+TTT 1.1456 bpb (std 0.0011).",
  "date": "2026-04-22",
  "val_loss": 2.95477626,
  "val_bpb": 1.14408575,
  "bytes_total": 15930191,
  "bytes_code": 116783
}