
Non-record: First SSM entry — kill-Mamba-2 + Ternary + n=7 (1.30040)#1994

Open
potatonyliu wants to merge 4 commits into openai:main from potatonyliu:submit/ssm-kill-mamba2-h200-1hr

Conversation

@potatonyliu

Summary

First SSM-based entry in either track. From the Requests for PRs list: "State-space models, E2E TTT, super long context for
evaluation or training" — this is the SSM half of that ask.

  • Architecture. kill-Mamba-2 (LTI selectivity: dt/B/C replaced with learned per-head/per-state constants, conv1d + gated SSD scan retained) running in parallel with attention at every block, with NUM_UNIQUE_LAYERS=7 × NUM_LOOPS=3 weight-shared depth recurrence (effective depth 21). Block layout: RMSNorm → (attn || SSM) → +residual; RMSNorm → SwiGLU MLP → +residual. SwiGLU MLP_MULT=8, GQA-4, RoPE, tied embeddings, MODEL_DIM=512, sp1024 vocab, 61.7M params.
  • Quant. BitNet-b1.58 ternary body weights ({−γ, 0, +γ} via absmean STE) on attn/Mamba/MLP projections, exported at 2 bits/param packed (4 vals/byte); a sketch of the recipe follows this list. 1D and small-tensor SSM dynamics buffers (A_log, B_proj, C_proj, dt_bias, D_skip, conv1d weights) stay fp32 via CONTROL_TENSOR_NAME_PATTERNS. Final compression: int8 + 2-bit packed ternary → brotli q=11.
  • Schedule. EMA-of-weights at β=0.999 (shadow swapped before final eval; a sketch follows the Result line below), Muon (MATRIX_LR=0.045, 15 NS steps) + AdamW, 4,380 steps × 524,288 batch ≈ 2.30B tokens.
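
For reviewers unfamiliar with the BitNet-b1.58 recipe, here is a minimal sketch of the absmean ternarization and the 2-bit packing described in the Quant bullet. Function names mirror pack_ternary/unpack_ternary only for readability; the bodies are illustrative assumptions, not the exact code in train_gpt.py:

import torch

def ternarize_absmean(w: torch.Tensor):
    # BitNet-b1.58: scale by the mean absolute value, round to {-1, 0, +1},
    # and keep gamma as the per-tensor dequant scale.
    gamma = w.abs().mean().clamp(min=1e-8)
    q = (w / gamma).round().clamp(-1, 1)               # values in {-1, 0, +1}
    w_q = w + (q * gamma - w).detach()                 # straight-through estimator
    return w_q, q.to(torch.int8), gamma

def pack_ternary(q: torch.Tensor) -> torch.Tensor:
    # 2 bits per value, 4 values per byte: map {-1, 0, +1} -> {0, 1, 2}.
    flat = (q.flatten() + 1).to(torch.uint8)
    flat = torch.cat([flat, flat.new_zeros((-flat.numel()) % 4)])
    b = flat.view(-1, 4)
    return b[:, 0] | (b[:, 1] << 2) | (b[:, 2] << 4) | (b[:, 3] << 6)

def unpack_ternary(packed: torch.Tensor, numel: int) -> torch.Tensor:
    vals = torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1)
    return vals.flatten()[:numel].to(torch.int8) - 1   # back to {-1, 0, +1}

The packed byte tensors (together with the fp32 control tensors) are what the int8 + brotli q=11 export stage then compresses.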

Result. val_bpb 1.30040 / 12.08 MB / single seed (1337). Trained on 4×H200 SXM in 1 hour wallclock — non-record on both hardware (4×H200, not 8×H100) and time (1h, not 10min).
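
The EMA-of-weights from the Schedule bullet follows the standard shadow-parameter pattern: a shadow copy tracks the weights during training (presumably updated every optimizer step) and is swapped into the model once before the final eval and quantized export. A minimal sketch of that pattern, with hypothetical names rather than the class actually used in train_gpt.py:

import torch

class WeightEMA:
    # Keeps a shadow copy of the floating-point state; swap it in for eval/export.
    def __init__(self, model: torch.nn.Module, beta: float = 0.999):
        self.beta = beta
        self.shadow = {k: v.detach().clone()
                       for k, v in model.state_dict().items()
                       if v.dtype.is_floating_point}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for k, v in model.state_dict().items():
            if k in self.shadow:
                self.shadow[k].mul_(self.beta).add_(v, alpha=1 - self.beta)

    @torch.no_grad()
    def swap_into(self, model: torch.nn.Module):
        # Called once, right before the final val_bpb measurement and export.
        model.load_state_dict(self.shadow, strict=False)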

Positioning

This submission lands +0.076 BPB above the naive records-track baseline (1.2244) despite roughly 9× the compute budget. It earns its place as the first SSM entry, not as a frontier
number. The standard-stack hooks (ternary export, EMA, depth recurrence, brotli, control-tensor protection) are wired up correctly; the gap to records is dominated by training
duration (1h vs the records-track frontier's hours of accumulated optimization) and missing ports (parallel-residuals, sliding-window eval, GPTQ) rather than the architecture itself.

Comparison                                     val_bpb   Δ vs this PR
track_10min_16mb naive baseline                1.2244    +0.076 (this PR is worse)
Mid-tier records-track frontier (PR #1204)     1.1063    +0.194
track_non_record_16mb 106M Binary 2.15h        1.1239    +0.177
track_non_record_16mb Quasi-10B 4h             1.2074    +0.093

Compliance checklist

  • Single-file train_gpt.py (106,722 bytes); BitLinear + pack_ternary/unpack_ternary inlined, no local helper modules.
  • Artifact total 12,076,468 bytes = 106,722 code + 11,969,746 compressed payload, under the 16 MB decimal cap with 3.92 MB headroom.
  • No tokenizer or dataset modifications (sp1024 standard).
  • No network calls; no validation-data access during training; eval uses standard root-harness full-window val + int8/brotli quant roundtrip (no TTT, no sliding-window, no GPTQ).
  • PR is purely additive — only adds the new folder under records/track_non_record_16mb/.

Reproduction

From the repo root, on a CUDA box with nproc_per_node dividing 8 (the script asserts 8 % world_size == 0 for clean grad accumulation):

source records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/env.sh
torchrun --standalone --nproc_per_node=4 \
  records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/train_gpt.py
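
The divisibility check exists so the global batch splits cleanly across ranks; sketched intent below, where the constant 8 is the reference GPU count and the variable names are assumptions, not the identifiers in train_gpt.py:

import os

world_size = int(os.environ.get("WORLD_SIZE", "1"))
assert 8 % world_size == 0, "nproc_per_node must divide 8"
# Fewer ranks than the 8-GPU reference presumably means proportionally more
# gradient-accumulation micro-steps per optimizer step, keeping the
# 524,288-token batch unchanged (e.g. a 2x multiplier on the 4xH200 run).
accum_multiplier = 8 // world_size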

Caveats

  • Single seed (1337). Non-record allows it; included because the −0.65 BPB delta over the prior local SSM mark sits far outside any plausible noise floor for this family. No claim
    of statistical significance.
  • Partial training log. The pod was stopped before the full run.log could sync from /workspace; train_seed1337.log is the captured monitor stream covering every headline
    number (pre/post-quant val_bpb, payload bytes, step times, peak memory, EMA shadow swap, final_int8_zlib_roundtrip_exact). The code_bytes:104676 and Total submission size:12074422 lines in the log reflect the original three-file run (with modules/bitlinear.py and modules/trigram_side_memory.py shipped separately); the shipped artifact is now
    single-file (BitLinear inlined, dead trigram_side_memory.py removed), so the post-cleanup numbers (106,722 / 12,076,468) cited above replace them. Trained checkpoint is unchanged —
    only code-side accounting moved.

potatonyliu and others added 4 commits April 30, 2026 12:10
… 1hr)

First SSM-based entry in either track. kill-Mamba-2 (LTI selectivity, B/C
constants) in parallel with attention at every block; NUM_UNIQUE_LAYERS=7
with NUM_LOOPS=3 weight sharing; BitNet-b1.58 ternary body with 2-bit-packed
export; EMA-of-weights at β=0.999; brotli on top. Trained 4,380 steps ×
524,288 batch ≈ 2.3B tokens in 3,600s on 4×H200 SXM 141GB.

val_bpb 1.30040 / 12.07 MB / seed=1337
- env.sh was the cumulative experiment-lineage env file (215 lines, multiple
  reassignments of ITERATIONS / MATRIX_LR / TRAIN_BATCH_TOKENS / WARMDOWN_ITERS
  / MAX_WALLCLOCK_SECONDS / VAL_TOKENS / NUM_UNIQUE_LAYERS / PARALLEL_LAYER_POSITIONS,
  with stale comments from the early MPS-smoke regime). Replaced with a
  53-line canonical environment that only contains the effective values
  shaping the run, organized by purpose. Behaviour is unchanged: sourcing
  produces the same final var set as last-write-wins resolution of the
  previous file.
- README "Files" entry for env.sh updated; "Command" section rephrased so
  env.sh is described as equivalent to the inline command, not as "every
  variable used by the run".
train_gpt.py imports `modules.bitlinear` (BitLinear, pack_ternary,
unpack_ternary — load-bearing under TERNARY_BODY=1) at module top-level
and `modules.trigram_side_memory` under inert TRIGRAM_SIDE_MEMORY=0
guards. These local modules were missing from the submission folder and
do not exist in upstream openai/parameter-golf, so a reviewer running
the original repo-root command would have crashed at the line-705
import. This commit bundles both modules (3,674 + 38,494 bytes) and changes the
run convention so `modules/` is a sibling of `train_gpt.py`:

  cd records/track_non_record_16mb/<this-folder>/
  source ./env.sh
  torchrun --standalone --nproc_per_node=4 train_gpt.py

env.sh now sets DATA_PATH and TOKENIZER_PATH to ../../../data/... so
they reach the repo-root data tree from inside the submission folder.
README "Command" and "Files" sections updated to match.
…ting

The previous d4a2208 fix (bundle local modules/) made the submission
runnable but introduced a byte-accounting hole: the harness reported
code_bytes=104,676 (just train_gpt.py) while the artifact actually
shipped 146,844 bytes of code (train_gpt.py + 3,674-byte
modules/bitlinear.py + 38,494-byte modules/trigram_side_memory.py). The
upstream rule is "All counted code should live in the train_gpt.py
script."

This commit makes the submission single-file and re-aligns the numbers:

  - Inline BitLinear + pack_ternary + unpack_ternary into train_gpt.py
    near line 705 (the original BitLinear import site). All four import
    sites that previously read `from modules.bitlinear import ...`
    (lines 502, 575, 705, 2178) now resolve in-module.

  - Delete modules/trigram_side_memory.py entirely. It was 38,494 bytes
    of dead code under default TRIGRAM_SIDE_MEMORY=0; all three lazy
    imports inside train_gpt.py were behind conditional guards that
    don't fire under this submission's config. The three import sites
    (lines 1482, 1519, 2000) now raise NotImplementedError so a future
    reviewer who flips the flag gets a clear error instead of a silent
    ModuleNotFoundError.

  - env.sh switches back to the repo-root run convention (no more
    DATA_PATH=../../../...). DATA_PATH/TOKENIZER_PATH defaults already
    resolve from repo root, matching every other records-folder
    submission.

  - README "Command", "Key metrics", "Files" sections rewritten:
      code_bytes:   104,676 -> 106,722  (single-file train_gpt.py)
      payload:      11,969,746          (unchanged; same trained checkpoint)
      total:        12,074,422 -> 12,076,468  (~12.08 MB)
      headroom:     ~3.92 MB under 16 MB cap

  - README "Comparison" + submission.json comparison_baseline now
    explicitly call out the naive records-track baseline (1.2244) and
    note +0.076 BPB worse despite ~9x compute. Previous wording called
    1.1063 "the records baseline" which was wrong (it's a mid-tier
    PR openai#1204 entry).

  - submission.json adds payload_bytes alongside code_bytes/bytes_total
    so the accounting is self-explanatory; notes field acknowledges the
    train_seed1337.log carries the pre-cleanup numbers from the
    original three-file run while the shipped artifact is single-file.

Trained model checkpoint unchanged — only code-side accounting moved.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
