
Non-record: BESE + Mamba-3 SSD Hybrid (1.3571 BPB, 7.6 MB artifact) #1665

Open

mrbese wants to merge 2 commits into openai:main from mrbese:bese-mamba3-hybrid

Conversation

@mrbese mrbese commented Apr 16, 2026

Summary

  • BESE 288-vocab byte-level tokenizer + Mamba-3 SSD / Attention hybrid (6 Mamba + 2 Attention blocks)
  • Addresses the "State-space models" bounty from the challenge README
  • To our knowledge, the first submission combining a custom byte-level tokenizer with Mamba-3 SSD
  • val_bpb: 1.3571 (INT6 + LZMA + sliding window eval with n-gram tilt)
  • Artifact: 7.6 MB (48% of the 16 MB limit); BESE's tiny 288-vocab embedding saves ~3-4 MB vs SP8192 (rough arithmetic in the sketch after this list)
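A rough back-of-the-envelope check of the embedding-savings claim, assuming a tied embedding matrix with dim=512 (from the Architecture section) stored at ~6 bits per parameter before LZMA; the exact artifact numbers depend on compression and on whether the output head is tied, so treat this as an estimate only:

```python
# Rough estimate of embedding storage for a (vocab, dim) matrix at INT6.
# Assumes a tied embedding/output matrix and ignores LZMA, so the real
# artifact savings will differ somewhat from these numbers.
DIM = 512            # model width from the Architecture section
BITS_PER_PARAM = 6   # INT6 quantization

def embedding_megabytes(vocab_size: int, dim: int = DIM) -> float:
    return vocab_size * dim * BITS_PER_PARAM / 8 / 1e6

bese = embedding_megabytes(288)      # ~0.11 MB
sp8192 = embedding_megabytes(8192)   # ~3.15 MB
print(f"BESE-288: {bese:.2f} MB, SP8192: {sp8192:.2f} MB, "
      f"savings: {sp8192 - bese:.2f} MB")   # savings ≈ 3.0 MB
```

An untied output head would roughly double that gap, which is consistent with the ~3-4 MB range quoted above.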

Architecture

8 layers: 6 Mamba-3 SSD blocks + 2 Attention blocks at positions [2, 5]. dim=512, d_state=128, ngroups=1, expand=2. No depth recurrence (hurts SSMs per PR #1355). Pure PyTorch SSD implementation.
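A minimal sketch of that layer layout, reconstructed from the description above (the field names are illustrative, not the submission's actual config schema in train_gpt.py / mamba3_ssd.py):

```python
# Layer plan for the 8-block hybrid: attention at indices 2 and 5,
# Mamba-3 SSD everywhere else. Purely illustrative field names.
DIM, D_STATE, NGROUPS, EXPAND = 512, 128, 1, 2
N_LAYERS, ATTN_POSITIONS = 8, {2, 5}

layer_plan = [
    {"kind": "attention", "dim": DIM} if i in ATTN_POSITIONS
    else {"kind": "mamba3_ssd", "dim": DIM, "d_state": D_STATE,
          "ngroups": NGROUPS, "expand": EXPAND}
    for i in range(N_LAYERS)
]
print([layer["kind"] for layer in layer_plan])
# ['mamba3_ssd', 'mamba3_ssd', 'attention', 'mamba3_ssd',
#  'mamba3_ssd', 'attention', 'mamba3_ssd', 'mamba3_ssd']
```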

Key findings

  • ngroups=1 (shared B/C across heads) is optimal, matching the reference Mamba-2 implementation and PR #1644 (Non-record: Mamba-3 Hybrid SSM + SP8192 + Legal TTT, 1.1473 bpb); see the projection-shape sketch after this list
  • Depth recurrence hurts SSMs by 69 mBPB, so it is disabled entirely
  • 2 attention layers at positions [2, 5] provide crucial global mixing between SSM segments
  • Artifact efficiency: competitive BPB at half the size budget, demonstrating BESE's embedding savings
  • Three config ablations included (d_state=64 vs 128, dim=512 vs 576)
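As referenced in the first bullet, here is a shape-only illustration of what ngroups controls in an SSD layer. It is a simplification that omits the gate, dt, and conv paths, and the variable names are mine, not the ones in mamba3_ssd.py:

```python
import torch

# With ngroups=1, all heads read the same B and C (as in reference Mamba-2);
# ngroups=nheads would give every head its own pair at the cost of a wider
# input projection. Shapes only; this is not the submission's SSD code.
batch, seqlen, dim = 2, 16, 512
d_state, expand, headdim = 128, 2, 64
d_inner = expand * dim            # 1024
nheads = d_inner // headdim       # 16
ngroups = 1                       # one shared B/C group for all heads

x = torch.randn(batch, seqlen, dim)
in_proj = torch.nn.Linear(dim, d_inner + 2 * ngroups * d_state, bias=False)
xs, B, C = torch.split(in_proj(x),
                       [d_inner, ngroups * d_state, ngroups * d_state], dim=-1)
B = B.view(batch, seqlen, ngroups, d_state)  # broadcast across all heads
C = C.view(batch, seqlen, ngroups, d_state)
print(f"{nheads} heads share {ngroups} B/C group(s):", B.shape, C.shape)
```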

Ongoing work

Compute credits are pending for: Triton kernel integration (2-3x faster training steps), TTT, QAT for wider models, and 3-seed statistical runs. Target: 1.17-1.20 BPB.

Files

  • README.md — Full writeup with architecture details, results, ablations, and reproduction steps
  • submission.json — Metadata
  • train_gpt.py + mamba3_ssd.py + tokenizer files — self-contained, runnable from the records folder
  • 3 train logs (one per config: d_state=128, d_state=64, dim=576)

Omer Bese added 2 commits April 16, 2026 00:56
First combination of a custom byte-level tokenizer (BESE, 288 vocab) with
Mamba-3 SSD hybrid architecture (6 Mamba + 2 Attention). Addresses the
"State-space models" bounty from the challenge README.

val_bpb: 1.3571 (INT6 + LZMA + sliding window + n-gram tilt)
Artifact: 7,614,888 bytes (48% of 16 MB limit)
…folder reproduction, code-structure note

- Add full inline BPB correctness proof (per-token byte accounting + invariant + per-case argument + transitive merge accounting + runnable self-test); toy sketch below
- Add explicit single-seed disclaimer for the 1.3571 headline; clarify that train_log_run2/run3 are architecture ablations, not seed replicates
- Add 'Quick path' reproduction that runs train_gpt.py directly from the records folder (per the rule that submissions must run within the records folder)
- Keep full-pipeline reproduction instructions for the data-prep step
- Add 'Code Structure' section addressing the FAQ rule about train_gpt.py
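A toy illustration of the byte-accounting invariant that commit refers to, under my own simplified assumptions (the README's actual proof and self-test may be structured differently): each token is charged the number of raw bytes it decodes to, so per-token byte counts must sum to the raw byte length, and BPB is total NLL in nats divided by ln(2) times total bytes.

```python
import math

# Toy version of per-token byte accounting for a byte-level tokenizer with
# merges (illustrative only). Invariant: token byte counts sum to the raw
# byte length, so bpb = total_nats / (ln 2 * total_bytes).
def bits_per_byte(token_nlls_nats, token_byte_lengths):
    assert len(token_nlls_nats) == len(token_byte_lengths)
    total_nats = sum(token_nlls_nats)
    total_bytes = sum(token_byte_lengths)
    return total_nats / (math.log(2) * total_bytes)

raw = b"abcd"                          # 4 raw bytes
byte_lengths = [1, 3]                  # e.g. token "a" plus a merged token "bcd"
assert sum(byte_lengths) == len(raw)   # merge-accounting invariant
print(bits_per_byte([2.0, 5.5], byte_lengths))  # ≈ 2.705 bits per byte
```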