
Non-record: First SSM entry — kill-Mamba-2 + Ternary + n=7 (1.30040)#1994

Open
potatonyliu wants to merge 4 commits into openai:main from potatonyliu:submit/ssm-kill-mamba2-h200-1hr

Conversation

@potatonyliu

Summary

First SSM-based entry in either track. From the Requests for PRs list: "State-space models, E2E TTT, super long context for
evaluation or training" — this is the SSM half of that ask.

  • Architecture. kill-Mamba-2 (LTI selectivity: dt/B/C replaced with learned per-head/per-state constants, conv1d + gated SSD scan retained) running in parallel with attention at every block, with NUM_UNIQUE_LAYERS=7 × NUM_LOOPS=3 weight-shared depth recurrence (effective depth 21). Block layout: RMSNorm → (attn || SSM) → +residual; RMSNorm → SwiGLU MLP → +residual. SwiGLU MLP_MULT=8, GQA-4, RoPE, tied embeddings, MODEL_DIM=512, sp1024 vocab, 61.7M params.
  • Quant. BitNet-b1.58 ternary body weights ({−γ, 0, +γ} via absmean STE) on attn/Mamba/MLP projections, exported at 2 bits/param packed (4 vals/byte); a sketch of the recipe follows this list. 1D and small-tensor SSM dynamics buffers (A_log, B_proj, C_proj, dt_bias, D_skip, conv1d weights) stay fp32 via CONTROL_TENSOR_NAME_PATTERNS. Final compression: int8 + 2-bit packed ternary → brotli q=11.
  • Schedule. EMA-of-weights at β=0.999 (shadow swapped before final eval; a sketch follows the Result line below), Muon (MATRIX_LR=0.045, 15 NS steps) + AdamW, 4,380 steps × 524,288 batch ≈ 2.30B tokens.
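
For reviewers unfamiliar with the BitNet-b1.58 recipe, here is a minimal sketch of the absmean ternarization and the 2-bit packing described in the Quant bullet. Function names mirror pack_ternary/unpack_ternary only for readability; the bodies are illustrative assumptions, not the exact code in train_gpt.py:

import torch

def ternarize_absmean(w: torch.Tensor):
    # BitNet-b1.58: scale by the mean absolute value, round to {-1, 0, +1},
    # and keep gamma as the per-tensor dequant scale.
    gamma = w.abs().mean().clamp(min=1e-8)
    q = (w / gamma).round().clamp(-1, 1)               # values in {-1, 0, +1}
    w_q = w + (q * gamma - w).detach()                 # straight-through estimator
    return w_q, q.to(torch.int8), gamma

def pack_ternary(q: torch.Tensor) -> torch.Tensor:
    # 2 bits per value, 4 values per byte: map {-1, 0, +1} -> {0, 1, 2}.
    flat = (q.flatten() + 1).to(torch.uint8)
    flat = torch.cat([flat, flat.new_zeros((-flat.numel()) % 4)])
    b = flat.view(-1, 4)
    return b[:, 0] | (b[:, 1] << 2) | (b[:, 2] << 4) | (b[:, 3] << 6)

def unpack_ternary(packed: torch.Tensor, numel: int) -> torch.Tensor:
    vals = torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1)
    return vals.flatten()[:numel].to(torch.int8) - 1   # back to {-1, 0, +1}

The packed byte tensors (together with the fp32 control tensors) are what the int8 + brotli q=11 export stage then compresses.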

Result. val_bpb 1.30040 / 12.08 MB / single seed (1337). Trained on 4×H200 SXM in 1 hour wallclock — non-record on both hardware (4×H200, not 8×H100) and time (1h, not 10min).
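
The EMA-of-weights from the Schedule bullet follows the standard shadow-parameter pattern: a shadow copy tracks the weights during training (presumably updated every optimizer step) and is swapped into the model once before the final eval and quantized export. A minimal sketch of that pattern, with hypothetical names rather than the class actually used in train_gpt.py:

import torch

class WeightEMA:
    # Keeps a shadow copy of the floating-point state; swap it in for eval/export.
    def __init__(self, model: torch.nn.Module, beta: float = 0.999):
        self.beta = beta
        self.shadow = {k: v.detach().clone()
                       for k, v in model.state_dict().items()
                       if v.dtype.is_floating_point}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].mul_(self.beta).add_(v, alpha=1 - self.beta)

    @torch.no_grad()
    def swap_into(self, model: torch.nn.Module):
        # Called once, right before the final val_bpb measurement and export.
        model.load_state_dict(self.shadow, strict=False)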

Positioning

This submission lands +0.076 BPB above the naive records-track baseline (1.2244) despite roughly 9× the compute budget. It earns its place as the first SSM entry, not as a frontier
number. The standard-stack hooks (ternary export, EMA, depth recurrence, brotli, control-tensor protection) are wired up correctly; the gap to records is dominated by training
duration (1h vs the records-track frontier's hours of accumulated optimization) and missing ports (parallel-residuals, sliding-window eval, GPTQ) rather than the architecture itself.

Comparison                                     val_bpb   Δ vs this PR
track_10min_16mb naive baseline                1.2244    +0.076 (this PR is worse)
Mid-tier records-track frontier (PR #1204)     1.1063    +0.194
track_non_record_16mb 106M Binary 2.15h        1.1239    +0.177
track_non_record_16mb Quasi-10B 4h             1.2074    +0.093

Compliance checklist

  • Single-file train_gpt.py (106,722 bytes); BitLinear + pack_ternary/unpack_ternary inlined, no local helper modules.
  • Artifact total 12,076,468 bytes = 106,722 code + 11,969,746 compressed payload, under the 16 MB decimal cap with 3.92 MB headroom.
  • No tokenizer or dataset modifications (sp1024 standard).
  • No network calls; no validation-data access during training; eval uses standard root-harness full-window val + int8/brotli quant roundtrip (no TTT, no sliding-window, no GPTQ).
  • PR is purely additive — only adds the new folder under records/track_non_record_16mb/.

Reproduction

From the repo root, on a CUDA box with nproc_per_node dividing 8 (the script asserts 8 % world_size == 0 for clean grad accumulation):

source records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/env.sh
torchrun --standalone --nproc_per_node=4 \
  records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/train_gpt.py
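
The divisibility check exists so the global batch splits cleanly across ranks; sketched intent below, where the constant 8 is the reference GPU count and the variable names are assumptions, not the identifiers in train_gpt.py:

import os

world_size = int(os.environ.get("WORLD_SIZE", "1"))
assert 8 % world_size == 0, "nproc_per_node must divide 8"
# Fewer ranks than the 8-GPU reference presumably means proportionally more
# gradient-accumulation micro-steps per optimizer step, keeping the
# 524,288-token batch unchanged (e.g. a 2x multiplier on the 4xH200 run).
accum_multiplier = 8 // world_size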

Caveats

  • Single seed (1337). Non-record allows it; included because the −0.65 BPB delta over the prior local SSM mark sits far outside any plausible noise floor for this family. No claim
    of statistical significance.
  • Partial training log. The pod was stopped before the full run.log could sync from /workspace; train_seed1337.log is the captured monitor stream covering every headline
    number (pre/post-quant val_bpb, payload bytes, step times, peak memory, EMA shadow swap, final_int8_zlib_roundtrip_exact). The code_bytes:104676 and Total submission size:12074422 lines in the log reflect the original three-file run (with modules/bitlinear.py and modules/trigram_side_memory.py shipped separately); the shipped artifact is now
    single-file (BitLinear inlined, dead trigram_side_memory.py removed), so the post-cleanup numbers (106,722 / 12,076,468) cited above replace them. Trained checkpoint is unchanged —
    only code-side accounting moved.

potatonyliu and others added 4 commits April 30, 2026 12:10
… 1hr)

First SSM-based entry in either track. kill-Mamba-2 (LTI selectivity, B/C
constants) in parallel with attention at every block; NUM_UNIQUE_LAYERS=7
with NUM_LOOPS=3 weight sharing; BitNet-b1.58 ternary body with 2-bit-packed
export; EMA-of-weights at β=0.999; brotli on top. Trained 4,380 steps ×
524,288 batch ≈ 2.3B tokens in 3,600s on 4×H200 SXM 141GB.

val_bpb 1.30040 / 12.07 MB / seed=1337
- env.sh was the cumulative experiment-lineage env file (215 lines, multiple
  reassignments of ITERATIONS / MATRIX_LR / TRAIN_BATCH_TOKENS / WARMDOWN_ITERS
  / MAX_WALLCLOCK_SECONDS / VAL_TOKENS / NUM_UNIQUE_LAYERS / PARALLEL_LAYER_POSITIONS,
  with stale comments from the early MPS-smoke regime). Replaced with a
  53-line canonical environment that only contains the effective values
  shaping the run, organized by purpose. Behaviour is unchanged: sourcing
  produces the same final var set as last-write-wins resolution of the
  previous file.
- README "Files" entry for env.sh updated; "Command" section rephrased so
  env.sh is described as equivalent to the inline command, not as "every
  variable used by the run".
train_gpt.py imports `modules.bitlinear` (BitLinear, pack_ternary,
unpack_ternary — load-bearing under TERNARY_BODY=1) at module top-level
and `modules.trigram_side_memory` under inert TRIGRAM_SIDE_MEMORY=0
guards. These local modules were missing from the submission folder and
do not exist in upstream openai/parameter-golf, so a reviewer running
the original repo-root command would have crashed at the line-705
import. This commit bundles both modules (3,674 + 38,494 bytes) and changes the
run convention so `modules/` is a sibling of `train_gpt.py`:

  cd records/track_non_record_16mb/<this-folder>/
  source ./env.sh
  torchrun --standalone --nproc_per_node=4 train_gpt.py

env.sh now sets DATA_PATH and TOKENIZER_PATH to ../../../data/... so
they reach the repo-root data tree from inside the submission folder.
README "Command" and "Files" sections updated to match.
…ting

The previous d4a2208 fix (bundle local modules/) made the submission
runnable but introduced a byte-accounting hole: the harness reported
code_bytes=104,676 (just train_gpt.py) while the artifact actually
shipped 146,844 bytes of code (train_gpt.py + 3,674-byte
modules/bitlinear.py + 38,494-byte modules/trigram_side_memory.py). The
upstream rule is "All counted code should live in the train_gpt.py
script."

This commit makes the submission single-file and re-aligns the numbers:

  - Inline BitLinear + pack_ternary + unpack_ternary into train_gpt.py
    near line 705 (the original BitLinear import site). All four import
    sites that previously read `from modules.bitlinear import ...`
    (lines 502, 575, 705, 2178) now resolve in-module.

  - Delete modules/trigram_side_memory.py entirely. It was 38,494 bytes
    of dead code under default TRIGRAM_SIDE_MEMORY=0; all three lazy
    imports inside train_gpt.py were behind conditional guards that
    don't fire under this submission's config. The three import sites
    (lines 1482, 1519, 2000) now raise NotImplementedError so a future
    reviewer who flips the flag gets a clear error instead of a silent
    ModuleNotFoundError.

  - env.sh switches back to the repo-root run convention (no more
    DATA_PATH=../../../...). DATA_PATH/TOKENIZER_PATH defaults already
    resolve from repo root, matching every other records-folder
    submission.

  - README "Command", "Key metrics", "Files" sections rewritten:
      code_bytes:   104,676 -> 106,722  (single-file train_gpt.py)
      payload:      11,969,746          (unchanged; same trained checkpoint)
      total:        12,074,422 -> 12,076,468  (~12.08 MB)
      headroom:     ~3.92 MB under 16 MB cap

  - README "Comparison" + submission.json comparison_baseline now
    explicitly call out the naive records-track baseline (1.2244) and
    note +0.076 BPB worse despite ~9x compute. Previous wording called
    1.1063 "the records baseline" which was wrong (it's a mid-tier
    PR openai#1204 entry).

  - submission.json adds payload_bytes alongside code_bytes/bytes_total
    so the accounting is self-explanatory; notes field acknowledges the
    train_seed1337.log carries the pre-cleanup numbers from the
    original three-file run while the shipped artifact is single-file.

Trained model checkpoint unchanged — only code-side accounting moved.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
