
Non-record: BESE + Mamba-3 SSD Hybrid (1.3571 BPB, 7.6 MB artifact) #1665

Open

mrbese wants to merge 2 commits into openai:main from mrbese:bese-mamba3-hybrid

Conversation

@mrbese mrbese commented Apr 16, 2026

Summary

  • BESE 288-vocab byte-level tokenizer + Mamba-3 SSD / Attention hybrid (6 Mamba + 2 Attention blocks)
  • Addresses the "State-space models" bounty from the challenge README
  • To our knowledge, the first submission combining a custom byte-level tokenizer with Mamba-3 SSD
  • val_bpb: 1.3571 (INT6 + LZMA + sliding window eval with n-gram tilt)
  • Artifact: 7.6 MB (48% of the 16 MB limit); BESE's tiny 288-vocab embedding saves ~3-4 MB vs SP8192 (rough arithmetic in the sketch after this list)
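A rough back-of-the-envelope check of the embedding-savings claim, assuming a tied embedding matrix with dim=512 (from the Architecture section) stored at ~6 bits per parameter before LZMA; the exact artifact numbers depend on compression and on whether the output head is tied, so treat this as an estimate only:

```python
# Rough estimate of embedding storage for a (vocab, dim) matrix at INT6.
# Assumes a tied embedding/output matrix and ignores LZMA, so the real
# artifact savings will differ somewhat from these numbers.
DIM = 512            # model width from the Architecture section
BITS_PER_PARAM = 6   # INT6 quantization

def embedding_megabytes(vocab_size: int, dim: int = DIM) -> float:
    return vocab_size * dim * BITS_PER_PARAM / 8 / 1e6

bese = embedding_megabytes(288)      # ~0.11 MB
sp8192 = embedding_megabytes(8192)   # ~3.15 MB
print(f"BESE-288: {bese:.2f} MB, SP8192: {sp8192:.2f} MB, "
      f"savings: {sp8192 - bese:.2f} MB")   # savings ≈ 3.0 MB
```

An untied output head would roughly double that gap, which is consistent with the ~3-4 MB range quoted above.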

Architecture

8 layers: 6 Mamba-3 SSD blocks + 2 Attention blocks at positions [2, 5]. dim=512, d_state=128, ngroups=1, expand=2. No depth recurrence (hurts SSMs per PR #1355). Pure PyTorch SSD implementation.
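A minimal sketch of that layer layout, reconstructed from the description above (the field names are illustrative, not the submission's actual config schema in train_gpt.py / mamba3_ssd.py):

```python
# Layer plan for the 8-block hybrid: attention at indices 2 and 5,
# Mamba-3 SSD everywhere else. Purely illustrative field names.
DIM, D_STATE, NGROUPS, EXPAND = 512, 128, 1, 2
N_LAYERS, ATTN_POSITIONS = 8, {2, 5}

layer_plan = [
    {"kind": "attention", "dim": DIM} if i in ATTN_POSITIONS
    else {"kind": "mamba3_ssd", "dim": DIM, "d_state": D_STATE,
          "ngroups": NGROUPS, "expand": EXPAND}
    for i in range(N_LAYERS)
]
print([layer["kind"] for layer in layer_plan])
# ['mamba3_ssd', 'mamba3_ssd', 'attention', 'mamba3_ssd',
#  'mamba3_ssd', 'attention', 'mamba3_ssd', 'mamba3_ssd']
```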

Key findings

  • ngroups=1 (shared B/C across heads) is optimal, matching the reference Mamba-2 implementation and PR #1644 (Non-record: Mamba-3 Hybrid SSM + SP8192 + Legal TTT, 1.1473 bpb); see the projection-shape sketch after this list
  • Depth recurrence hurts SSMs by 69 mBPB, so it is disabled entirely
  • 2 attention layers at positions [2, 5] provide crucial global mixing between SSM segments
  • Artifact efficiency: competitive BPB at half the size budget, demonstrating BESE's embedding savings
  • Three config ablations included (d_state=64 vs 128, dim=512 vs 576)
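As referenced in the first bullet, here is a shape-only illustration of what ngroups controls in an SSD layer. It is a simplification that omits the gate, dt, and conv paths, and the variable names are mine, not the ones in mamba3_ssd.py:

```python
import torch

# With ngroups=1, all heads read the same B and C (as in reference Mamba-2);
# ngroups=nheads would give every head its own pair at the cost of a wider
# input projection. Shapes only; this is not the submission's SSD code.
batch, seqlen, dim = 2, 16, 512
d_state, expand, headdim = 128, 2, 64
d_inner = expand * dim            # 1024
nheads = d_inner // headdim       # 16
ngroups = 1                       # one shared B/C group for all heads

x = torch.randn(batch, seqlen, dim)
in_proj = torch.nn.Linear(dim, d_inner + 2 * ngroups * d_state, bias=False)
xs, B, C = torch.split(in_proj(x),
                       [d_inner, ngroups * d_state, ngroups * d_state], dim=-1)
B = B.view(batch, seqlen, ngroups, d_state)  # broadcast across all heads
C = C.view(batch, seqlen, ngroups, d_state)
print(f"{nheads} heads share {ngroups} B/C group(s):", B.shape, C.shape)
```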

Ongoing work

Compute credits are pending for: Triton kernel integration (2-3x faster training steps), TTT, QAT for wider models, and 3-seed statistical runs. Target: 1.17-1.20 BPB.

Files

  • README.md — Full writeup with architecture details, results, ablations, and reproduction steps
  • submission.json — Metadata
  • train_gpt.py + mamba3_ssd.py + tokenizer files — self-contained, runnable from the records folder
  • 3 train logs (one per config: d_state=128, d_state=64, dim=576)

Omer Bese added 2 commits April 16, 2026 00:56
First combination of a custom byte-level tokenizer (BESE, 288 vocab) with
Mamba-3 SSD hybrid architecture (6 Mamba + 2 Attention). Addresses the
"State-space models" bounty from the challenge README.

val_bpb: 1.3571 (INT6 + LZMA + sliding window + n-gram tilt)
Artifact: 7,614,888 bytes (48% of 16 MB limit)
…folder reproduction, code-structure note

- Add full inline BPB correctness proof (per-token byte accounting + invariant + per-case argument + transitive merge accounting + runnable self-test); toy sketch below
- Add explicit single-seed disclaimer for the 1.3571 headline; clarify that train_log_run2/run3 are architecture ablations, not seed replicates
- Add 'Quick path' reproduction that runs train_gpt.py directly from the records folder (per the rule that submissions must run within the records folder)
- Keep full-pipeline reproduction instructions for the data-prep step
- Add 'Code Structure' section addressing the FAQ rule about train_gpt.py
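A toy illustration of the byte-accounting invariant that commit refers to, under my own simplified assumptions (the README's actual proof and self-test may be structured differently): each token is charged the number of raw bytes it decodes to, so per-token byte counts must sum to the raw byte length, and BPB is total NLL in nats divided by ln(2) times total bytes.

```python
import math

# Toy version of per-token byte accounting for a byte-level tokenizer with
# merges (illustrative only). Invariant: token byte counts sum to the raw
# byte length, so bpb = total_nats / (ln 2 * total_bytes).
def bits_per_byte(token_nlls_nats, token_byte_lengths):
    assert len(token_nlls_nats) == len(token_byte_lengths)
    total_nats = sum(token_nlls_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

raw = b"abcd"                          # 4 raw bytes
byte_lengths = [1, 3]                  # e.g. token "a" plus a merged token "bcd"
assert sum(byte_lengths) == len(raw)   # merge-accounting invariant
print(bits_per_byte([2.0, 5.5], byte_lengths))  # ≈ 2.705 bits per byte
```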