Non-record: First SSM entry — kill-Mamba-2 + Ternary + n=7 (1.30040) #1994
Open

potatonyliu wants to merge 4 commits into openai:main
Conversation
… 1hr) First SSM-based entry in either track. kill-Mamba-2 (LTI selectivity, B/C constants) runs in parallel with attention at every block; NUM_UNIQUE_LAYERS=7 with NUM_LOOPS=3 weight sharing; BitNet-b1.58 ternary body with 2-bit-packed export; EMA-of-weights at β=0.999; brotli on top. Trained 4,380 steps × 524,288-token batches ≈ 2.3B tokens in 3,600s on 4×H200 SXM 141GB. val_bpb 1.30040 / 12.07 MB / seed=1337
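Of those pieces, the EMA is the most self-contained. A minimal sketch of an EMA-of-weights shadow at β=0.999 with an eval-time swap (illustrative only; the submission's actual implementation may differ):

```python
import copy
import torch

class WeightEMA:
    """Shadow copy of model weights: ema = beta * ema + (1 - beta) * w."""

    def __init__(self, model: torch.nn.Module, beta: float = 0.999):
        self.beta = beta
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for k, v in model.state_dict().items():
            if v.is_floating_point():
                self.shadow[k].mul_(self.beta).add_(v, alpha=1.0 - self.beta)
            else:
                self.shadow[k].copy_(v)  # non-float buffers (counters) tracked verbatim

    @torch.no_grad()
    def swap_in(self, model: torch.nn.Module) -> None:
        """Load the shadow weights for eval/export (the 'EMA shadow swap')."""
        self._backup = copy.deepcopy(model.state_dict())
        model.load_state_dict(self.shadow)

    @torch.no_grad()
    def swap_out(self, model: torch.nn.Module) -> None:
        """Restore the live training weights after eval."""
        model.load_state_dict(self._backup)
```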
- env.sh was the cumulative experiment-lineage env file (215 lines; multiple reassignments of ITERATIONS / MATRIX_LR / TRAIN_BATCH_TOKENS / WARMDOWN_ITERS / MAX_WALLCLOCK_SECONDS / VAL_TOKENS / NUM_UNIQUE_LAYERS / PARALLEL_LAYER_POSITIONS, plus stale comments from the early MPS-smoke regime). Replaced with a 53-line canonical environment that contains only the effective values shaping the run, organized by purpose. Behaviour is unchanged: sourcing produces the same final variable set as last-write-wins resolution of the previous file.
- README "Files" entry for env.sh updated; "Command" section rephrased so env.sh is described as equivalent to the inline command, not as "every variable used by the run".
train_gpt.py imports `modules.bitlinear` (BitLinear, pack_ternary, unpack_ternary — load-bearing under TERNARY_BODY=1) at module top level, and `modules.trigram_side_memory` under inert TRIGRAM_SIDE_MEMORY=0 guards. These local modules were missing from the submission folder and do not exist in upstream openai/parameter-golf, so a reviewer running the original repo-root command would have crashed at the line-705 import. This commit includes both modules (3,674 + 38,494 bytes) and changes the run convention so that `modules/` is a sibling of `train_gpt.py`:

```bash
cd records/track_non_record_16mb/<this-folder>/
source ./env.sh
torchrun --standalone --nproc_per_node=4 train_gpt.py
```

env.sh now sets DATA_PATH and TOKENIZER_PATH to ../../../data/... so they reach the repo-root data tree from inside the submission folder. README "Command" and "Files" sections updated to match.
…ting The previous d4a2208 fix (bundle local modules/) made the submission runnable but introduced a byte-accounting hole: the harness reported code_bytes=104,676 (just train_gpt.py) while the artifact actually shipped 146,844 bytes of code (train_gpt.py + 3,674-byte modules/bitlinear.py + 38,494-byte modules/trigram_side_memory.py). The upstream rule is "All counted code should live in the train_gpt.py script." This commit makes the submission single-file and re-aligns the numbers:

- Inline BitLinear + pack_ternary + unpack_ternary into train_gpt.py near line 705 (the original BitLinear import site). All four import sites that previously read `from modules.bitlinear import ...` (lines 502, 575, 705, 2178) now resolve in-module.
- Delete modules/trigram_side_memory.py entirely. It was 38,494 bytes of dead code under the default TRIGRAM_SIDE_MEMORY=0; all three lazy imports inside train_gpt.py sat behind conditional guards that don't fire under this submission's config. The three import sites (lines 1482, 1519, 2000) now raise NotImplementedError, so a future reviewer who flips the flag gets a clear error instead of a silent ModuleNotFoundError.
- env.sh switches back to the repo-root run convention (no more DATA_PATH=../../../...). DATA_PATH/TOKENIZER_PATH defaults already resolve from repo root, matching every other records-folder submission.
- README "Command", "Key metrics", and "Files" sections rewritten:
  - code_bytes: 104,676 -> 106,722 (single-file train_gpt.py)
  - payload: 11,969,746 (unchanged; same trained checkpoint)
  - total: 12,074,422 -> 12,076,468 (~12.08 MB)
  - headroom: ~3.92 MB under the 16 MB cap
- README "Comparison" + submission.json comparison_baseline now explicitly call out the naive records-track baseline (1.2244) and note this entry is +0.076 BPB worse despite ~9× the compute. Previous wording called 1.1063 "the records baseline", which was wrong (it is a mid-tier PR openai#1204 entry).
- submission.json adds payload_bytes alongside code_bytes/bytes_total so the accounting is self-explanatory; the notes field acknowledges that train_seed1337.log carries the pre-cleanup numbers from the original three-file run while the shipped artifact is single-file.

Trained model checkpoint unchanged — only code-side accounting moved.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
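For orientation, here is what absmean ternarization plus the 2-bit export packing (4 values/byte) can look like. The names follow the inlined `pack_ternary`/`unpack_ternary`, but the bodies are a sketch under assumed conventions, not the submission's code:

```python
import torch

def absmean_ternary(w: torch.Tensor):
    """BitNet-b1.58-style quantization to {-gamma, 0, +gamma} with gamma = mean(|w|)."""
    gamma = w.abs().mean().clamp(min=1e-8)
    q = (w / gamma).round().clamp_(-1, 1)       # values in {-1, 0, +1}
    # Training would route this through an STE: w + (q * gamma - w).detach()
    return q, gamma

def pack_ternary(q: torch.Tensor) -> torch.Tensor:
    """Pack {-1, 0, +1} values at 2 bits each, 4 values per byte."""
    u = (q.flatten() + 1).to(torch.uint8)       # {-1,0,1} -> {0,1,2}
    pad = (-u.numel()) % 4                      # round length up to a multiple of 4
    u = torch.cat([u, u.new_zeros(pad)]).view(-1, 4)
    return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)

def unpack_ternary(packed: torch.Tensor, numel: int) -> torch.Tensor:
    """Invert pack_ternary, recovering the first `numel` ternary values."""
    fields = torch.stack([(packed >> s) & 3 for s in (0, 2, 4, 6)], dim=1)
    return fields.flatten()[:numel].to(torch.int8) - 1   # {0,1,2} -> {-1,0,1}

# Round-trip check: 2 bits/param is a 4x size cut vs int8 before brotli sees it.
q, gamma = absmean_ternary(torch.randn(513))
assert torch.equal(unpack_ternary(pack_ternary(q), q.numel()).float(), q.flatten())
```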
Summary
First SSM-based entry in either track. From the Requests for PRs list: "State-space models, E2E TTT, super long context for
evaluation or training" — this is the SSM half of that ask.
- **Architecture.** kill-Mamba-2 (`dt`/`B`/`C` replaced with learned per-head/per-state constants; conv1d + gated SSD scan retained) running in parallel with attention at every block, with `NUM_UNIQUE_LAYERS=7` × `NUM_LOOPS=3` weight-shared depth recurrence (effective depth 21). Block layout: RMSNorm → (attn || SSM) → +residual; RMSNorm → SwiGLU MLP → +residual (sketched below). SwiGLU `MLP_MULT=8`, GQA-4, RoPE, tied embeddings, `MODEL_DIM=512`, sp1024 vocab, 61.7M params.
- **Ternary.** BitNet-b1.58 ({−γ, 0, +γ} via absmean STE) on attn/Mamba/MLP projections, exported at 2 bits/param packed (4 values/byte). 1D and small-tensor SSM dynamics buffers (`A_log`, `B_proj`, `C_proj`, `dt_bias`, `D_skip`, conv1d weights) stay fp32 via `CONTROL_TENSOR_NAME_PATTERNS`. Final compression: int8 + 2-bit packed ternary → brotli q=11.
- **Training.** Matrix optimizer (`MATRIX_LR=0.045`, 15 NS steps) + AdamW, 4,380 steps × 524,288-token batches ≈ 2.30B tokens.
- **Result.** val_bpb 1.30040 / 12.08 MB / single seed (1337). Trained on 4×H200 SXM, 1 hour wallclock — non-record on both hardware (4×H200, not 8×H100) and time (1h, not 10min).
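A minimal sketch of that block wiring, with the attention and SSM mixers stubbed out (`nn.Identity` here; GQA-4+RoPE and the kill-Mamba-2 scan in the actual train_gpt.py), to make the parallel residual and the n=7 × 3-loop sharing concrete:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_UNIQUE_LAYERS, NUM_LOOPS, MODEL_DIM = 7, 3, 512   # effective depth 7 * 3 = 21

class ParallelBlock(nn.Module):
    """RMSNorm -> (attn || SSM) -> +residual; RMSNorm -> SwiGLU MLP -> +residual."""

    def __init__(self, dim: int, attn: nn.Module, ssm: nn.Module, mlp_mult: int = 8):
        super().__init__()
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)  # torch >= 2.4
        self.attn, self.ssm = attn, ssm                 # mixer stand-ins in this sketch
        self.gate = nn.Linear(dim, mlp_mult * dim, bias=False)     # SwiGLU, MLP_MULT=8
        self.up = nn.Linear(dim, mlp_mult * dim, bias=False)
        self.down = nn.Linear(mlp_mult * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h) + self.ssm(h)              # attention and SSM share one residual
        h = self.norm2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))

# 7 unique blocks replayed 3 times: 21 layers of compute from 7 layers of params.
blocks = nn.ModuleList(
    ParallelBlock(MODEL_DIM, attn=nn.Identity(), ssm=nn.Identity())
    for _ in range(NUM_UNIQUE_LAYERS)
)

def body(x: torch.Tensor) -> torch.Tensor:
    for _ in range(NUM_LOOPS):
        for block in blocks:
            x = block(x)
    return x

assert body(torch.randn(2, 16, MODEL_DIM)).shape == (2, 16, MODEL_DIM)
```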
Positioning
This submission lands +0.076 BPB above the naive records-track baseline (1.2244) despite roughly 9× the compute budget. It earns its place as the first SSM entry, not as a frontier number. The standard-stack hooks (ternary export, EMA, depth recurrence, brotli, control-tensor protection) are wired up correctly; the gap to records is dominated by training duration (1h here vs the records-track frontier's hours of accumulated optimization) and by missing ports (parallel residuals, sliding-window eval, GPTQ) rather than by the architecture itself.
| Track | Entry |
| --- | --- |
| track_10min_16mb | naive baseline |
| track_non_record_16mb | 106M Binary 2.15h |
| track_non_record_16mb | Quasi-10B 4h |

Compliance checklist
- `train_gpt.py` (106,722 bytes); `BitLinear` + `pack_ternary`/`unpack_ternary` inlined, no local helper modules.
- Submission folder under `records/track_non_record_16mb/`.

Reproduction
From the repo root, on a CUDA box with `nproc_per_node` dividing 8 (the script asserts `8 % world_size == 0` for clean grad accumulation):

```bash
source records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/env.sh
torchrun --standalone --nproc_per_node=4 \
  records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/train_gpt.py
```
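The divisibility assert exists so the global batch is identical regardless of GPU count. A plausible sketch of what it buys (an assumption about train_gpt.py's internals, not a quote):

```python
import os

# Hypothetical reading: the run is tuned for 8-way data parallelism, so fewer
# GPUs take more grad-accumulation micro-steps and the global token batch
# (TRAIN_BATCH_TOKENS=524288) stays fixed.
world_size = int(os.environ.get("WORLD_SIZE", "4"))   # torchrun sets WORLD_SIZE
assert 8 % world_size == 0, "world size must divide 8 for clean grad accumulation"
grad_accum_steps = 8 // world_size                    # 2 micro-steps per step on 4 GPUs
```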
Caveats

- Single seed (1337), so no claim of statistical significance.
- No `run.log` could sync from `/workspace`; `train_seed1337.log` is the captured monitor stream covering every headline number (pre/post-quant val_bpb, payload bytes, step times, peak memory, EMA shadow swap, `final_int8_zlib_roundtrip_exact`). The `code_bytes: 104676` and `Total submission size: 12074422` lines in that log reflect the original three-file run (with `modules/bitlinear.py` and `modules/trigram_side_memory.py` shipped separately); the shipped artifact is now single-file (BitLinear inlined, dead `trigram_side_memory.py` removed), so the post-cleanup numbers (106,722 / 12,076,468) cited above replace them. Trained checkpoint is unchanged — only code-side accounting moved.
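The quoted numbers reconcile arithmetically; a quick cross-check using only constants from the text above (the 16 MB cap is assumed decimal, which is what makes ~3.92 MB of headroom come out):

```python
# Pre-cleanup: three files shipped, but the harness counted only train_gpt.py.
assert 104_676 + 3_674 + 38_494 == 146_844            # code bytes actually shipped

# Post-cleanup: single-file accounting, same trained checkpoint.
code_bytes, payload_bytes = 106_722, 11_969_746
total = code_bytes + payload_bytes
assert total == 12_076_468                            # ~12.08 MB
# Headroom, assuming the cap is decimal 16 MB = 16,000,000 bytes.
assert 16_000_000 - total == 3_923_532                # ~3.92 MB under the cap
```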