+++ b/records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/README.md
@@ -0,0 +1,84 @@
First SSM-based entry in either track. Trained on 4×H200 SXM for one hour rather than 8×H100 for ten minutes — non-record on both hardware and time. Eval is the standard root-harness full-window val + int8/brotli quant; no TTT, no sliding-window, no GPTQ.

**val_bpb 1.30040 / 12.08 MB / seed=1337**

## Architecture

**Topology.** 7 distinct transformer-style blocks, each shared across 3 sequential applications via depth recurrence (`NUM_UNIQUE_LAYERS=7 NUM_LOOPS=3`). Effective compute depth is 21; total stored body parameters are equivalent to 7 layers. No U-Net skip; the loop is plain weight reuse. `MODEL_DIM=512`, tied input/output embeddings, sp1024 vocab.
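
A minimal sketch of the weight-reuse loop, assuming the whole 7-block stack is re-applied `NUM_LOOPS` times (the writeup does not pin down the exact iteration order; `DepthRecurrentBody` and `block_factory` are illustrative names, not the script's API):

```python
import torch.nn as nn

class DepthRecurrentBody(nn.Module):
    """Illustrative only: 7 stored blocks, 3 loops -> 21 block applications."""
    def __init__(self, block_factory, num_unique_layers=7, num_loops=3):
        super().__init__()
        self.blocks = nn.ModuleList(block_factory() for _ in range(num_unique_layers))
        self.num_loops = num_loops

    def forward(self, x):
        x0 = x  # post-embedding stream, kept for each block's resid_mix (see below)
        for _ in range(self.num_loops):   # plain weight reuse: the same 7 blocks every pass
            for block in self.blocks:     # effective compute depth = 7 * 3 = 21
                x = block(x, x0)
        return x
```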

**Block contents.** Each block is `(attn || SSM) + MLP`, not the standard `attn → MLP` chain (a minimal sketch follows this list):

1. RMSNorm → run attention and the SSM in parallel on the same normalized input → sum their outputs (with independent per-channel learned scales `attn_scale`, `s4d_scale`) into the residual stream.
2. RMSNorm → SwiGLU MLP (`MLP_MULT=8`, hidden = 8·dim) → add to the residual stream with its own per-channel scale (`mlp_scale`).
3. A learned 2-vector `resid_mix` per block interpolates the incoming residual between the live stream `x` and the original post-embedding `x0` before normalization. Cheap, lets each block decide how much it cares about deep context vs the original embedding.
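
A minimal sketch of the block described above, assuming the two `resid_mix` weights are free (unnormalized) and the mix is applied once at block entry; the attention and SSM mixers are passed in as modules, the scale initializations are assumptions, and `nn.RMSNorm` needs PyTorch ≥ 2.4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    """Sketch of one block: (attn || SSM) summed into the residual, then SwiGLU MLP."""
    def __init__(self, attn: nn.Module, ssm: nn.Module, dim: int = 512, mlp_mult: int = 8):
        super().__init__()
        self.attn, self.ssm = attn, ssm                    # both mixers see the same normalized input
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.attn_scale = nn.Parameter(torch.ones(dim))    # per-channel output scales
        self.s4d_scale = nn.Parameter(torch.ones(dim))
        self.mlp_scale = nn.Parameter(torch.ones(dim))
        self.resid_mix = nn.Parameter(torch.tensor([1.0, 0.0]))   # [live stream, x0]
        self.w_gate = nn.Linear(dim, mlp_mult * dim, bias=False)  # SwiGLU, hidden = 8*dim
        self.w_up = nn.Linear(dim, mlp_mult * dim, bias=False)
        self.w_down = nn.Linear(mlp_mult * dim, dim, bias=False)

    def forward(self, x, x0):
        # (3) interpolate the incoming residual between the live stream and x0
        x = self.resid_mix[0] * x + self.resid_mix[1] * x0
        # (1) attention and SSM in parallel on the same normalized input
        h = self.norm1(x)
        x = x + self.attn_scale * self.attn(h) + self.s4d_scale * self.ssm(h)
        # (2) SwiGLU MLP with its own per-channel scale
        h = self.norm2(x)
        return x + self.mlp_scale * self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
```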

**Attention branch.** Standard causal multi-head attention with grouped-query (`NUM_HEADS=8`, `NUM_KV_HEADS=4`) and RoPE positional encoding.
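
A minimal sketch of the grouped-query pattern only (RoPE and the qkv/out projections are omitted; `gqa_attention` is an illustrative helper, not the script's):

```python
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """q: (batch, 8, seq, head_dim); k, v: (batch, 4, seq, head_dim).
    Each KV head is shared by NUM_HEADS / NUM_KV_HEADS = 2 query heads."""
    repeat = q.size(1) // k.size(1)            # 8 // 4 = 2
    k = k.repeat_interleave(repeat, dim=1)
    v = v.repeat_interleave(repeat, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```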

**SSM branch — kill-Mamba-2.** Standard Mamba-2 has an `in_proj` that produces `x` plus three input-dependent quantities `dt`, `B`, `C` ("selectivity" — the ability to modulate the recurrence per-token), runs a depthwise causal `conv1d` (kernel=4) on `x`, then an SSD chunkwise selective scan (`d_state=64`, `expand=2`, `chunk_size=64`, `headdim=64`, 16 SSD heads), then `out_proj`. **kill-Mamba-2 replaces `dt`, `B`, `C` with learned per-head/per-state constants** (`_B_const`, `_C_const`) and a per-head `dt_bias`, making the recurrence linear time-invariant (LTI) instead of input-dependent. Same conv1d, same gating, same `A_log`, same `D_skip`, same in/out projections — only the dynamics become LTI. The intuition: at sub-records training scale, the input-dependent projections are under-trained and add noise; the LTI variant keeps the structural advantages of Mamba-2 (conv1d local recall, gated SSD scan) without the gradient surface area of selectivity.
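
A minimal sketch of the dynamics change only, written as a naive per-step scan rather than the chunkwise SSD kernel; conv1d, gating, `D_skip` and the in/out projections are unchanged and omitted here, and the exact discretization is an assumption (parameter names follow the writeup, not necessarily the script):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KillMamba2Dynamics(nn.Module):
    """LTI recurrence: dt, B, C are learned constants instead of per-token projections."""
    def __init__(self, nheads: int = 16, d_state: int = 64):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(nheads))              # A = -exp(A_log), per head
        self.dt_bias = nn.Parameter(torch.zeros(nheads))            # per-head step size
        self._B_const = nn.Parameter(torch.randn(nheads, d_state) / d_state**0.5)
        self._C_const = nn.Parameter(torch.randn(nheads, d_state) / d_state**0.5)

    def forward(self, x):
        # x: (batch, seqlen, nheads, headdim), already through in_proj + conv1d
        b, t, h, p = x.shape
        dt = F.softplus(self.dt_bias)                               # constant per head
        decay = torch.exp(dt * -torch.exp(self.A_log))              # same decay at every step: LTI
        state = x.new_zeros(b, h, p, self._B_const.size(-1))
        ys = []
        for step in range(t):
            xt = x[:, step]                                         # (b, h, p)
            # h_t = exp(dt*A) * h_{t-1} + dt * outer(x_t, B); B no longer depends on the token
            state = decay.view(1, h, 1, 1) * state \
                  + dt.view(1, h, 1, 1) * torch.einsum("bhp,hn->bhpn", xt, self._B_const)
            ys.append(torch.einsum("bhpn,hn->bhp", state, self._C_const))
        return torch.stack(ys, dim=1)                               # (batch, seqlen, nheads, headdim)
```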

**Why parallel attention || SSM.** The two mixers have complementary recall: attention does exact content-addressable lookup over the context, conv1d-equipped SSM does structured local recall. Running them on the same normalized input and summing outputs is consistently better at this scale than alternating attention-only and SSM-only blocks (cross-class hybrid finding from earlier experiments).

**No BigramHash.** BigramHash recall (`BIGRAM_VOCAB_SIZE=0` here) helps S4D-Lin family blocks but interacts negatively with the conv1d already present in Mamba-2 — it ends up adding noise to the recall niche conv1d already occupies.

**Quantization (`TERNARY_BODY=1`).** Body weights of the attention `qkv`/`out` projections, Mamba-2 `in_proj`/`out_proj`, and SwiGLU MLP gates are constrained to `{−γ, 0, +γ}` ternary via BitNet-b1.58 absmean straight-through estimation: at every forward pass each weight matrix is quantized to ternary (with a per-row scale γ = mean absolute value) before the matmul; gradients pass through unchanged. Roughly 1.58 bits per parameter of effective resolution. At quant export, ternary weights are stored as 2 bits per parameter packed (4 vals per byte) in a custom format that bypasses int8 entirely — lossless ternary→ternary round-trip. 1D and small (≤65,536-element) tensors — the SSM dynamics buffers `A_log`, `B_proj`, `C_proj`, `dt_bias`, `D_skip`, `conv1d` weights, all RMSNorm scales, all per-channel scales — stay fp32 throughout via `CONTROL_TENSOR_NAME_PATTERNS`.
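
A minimal sketch of the absmean ternary STE and the 2-bit export packing; the submission inlines its own `BitLinear` and `pack_ternary` / `unpack_ternary`, which may differ in detail from these illustrative versions:

```python
import torch
import torch.nn.functional as F

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """BitNet-b1.58-style quantization: per-row scale gamma = mean |w|, values in {-g, 0, +g}."""
    gamma = w.abs().mean(dim=-1, keepdim=True).clamp_min(eps)
    q = (w / gamma).round().clamp_(-1, 1)          # ternary codes {-1, 0, +1}
    return q * gamma, q

class BitLinearSketch(torch.nn.Linear):
    """Forward uses the ternarized weight; the straight-through estimator lets
    gradients reach the latent fp32 weight unchanged."""
    def forward(self, x):
        w_q, _ = absmean_ternary(self.weight)
        w_ste = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w_ste, self.bias)

def pack_ternary_sketch(q: torch.Tensor) -> torch.Tensor:
    """Export-side packing: ternary codes at 2 bits each, 4 values per byte."""
    u = q.flatten().to(torch.int64) + 1            # map {-1, 0, +1} -> {0, 1, 2}
    pad = (-u.numel()) % 4
    u = torch.cat([u, u.new_zeros(pad)]).view(-1, 4)
    return (u << torch.tensor([0, 2, 4, 6])).sum(dim=1).to(torch.uint8)
```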

**EMA-of-weights (`EMA_BETA=0.999`).** A shadow copy of the model weights is updated each step as `shadow = β·shadow + (1−β)·model`, then swapped into the model immediately before final eval. β=0.999 gives an effective averaging window of ~1000 steps — about the last 23% of this run's 4,380 steps. EMA is reliable when `(1−β)·steps ≫ 1`; for shorter runs (<3,000 steps) β=0.99 is the right choice.
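
A minimal sketch of the update rule and the pre-eval swap (function names are illustrative):

```python
import copy
import torch

@torch.no_grad()
def ema_update(shadow, model, beta=0.999):
    """shadow = beta*shadow + (1-beta)*model, run once per optimizer step."""
    for s, p in zip(shadow.parameters(), model.parameters()):
        s.mul_(beta).add_(p, alpha=1.0 - beta)

# shadow = copy.deepcopy(model) at init; immediately before the final eval the
# shadow parameters are copied back into the model. Effective window ~ 1/(1-beta)
# = 1000 steps, i.e. about the last 23% of this run's 4,380 steps.
```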

**Optimizer.** Muon (Newton-Schulz orthogonalization, `MUON_BACKEND_STEPS=15`) for 2D weight matrices; AdamW for low-dim parameters and embeddings. `MATRIX_LR=0.045` is the modded-nanogpt baseline value. The optimizer split is by tensor pattern, same as the records-track convention.
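
A minimal sketch of the parameter split and of Newton-Schulz orthogonalization; the split heuristic below (2D, non-embedding tensors go to Muon) is an assumption matching the description above, and the cubic iteration shown only illustrates the idea, whereas the actual Muon backend runs a tuned higher-order variant for `MUON_BACKEND_STEPS=15` iterations:

```python
import torch

def split_param_groups(model):
    """2D weight matrices -> Muon; embeddings and low-dim tensors -> AdamW (heuristic sketch)."""
    muon, adamw = [], []
    for name, p in model.named_parameters():
        (muon if p.ndim == 2 and "embed" not in name else adamw).append(p)
    return muon, adamw

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 15) -> torch.Tensor:
    """Push g's singular values toward 1 (textbook cubic iteration, for illustration)."""
    x = g / (g.norm() + 1e-7)                      # scale so singular values <= 1
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x          # X <- 1.5 X - 0.5 X X^T X
    return x.T if transposed else x
```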

**Compression.** int8 quantization of fp32 buffers + 2-bit packed ternary for body weights → brotli q=11 → final artifact. brotli/zlib ratio on this bytestream is ~0.985; brotli is the standard records-track choice.
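
A minimal sketch of the final packaging step, assuming the int8 / 2-bit-packed payload has already been serialized to bytes (`compress_payload` is an illustrative helper, not the script's function):

```python
import brotli

def compress_payload(payload_bytes: bytes) -> bytes:
    """brotli at quality 11 (max); ~1.5% smaller than zlib on this bytestream per the writeup."""
    return brotli.compress(payload_bytes, quality=11)
```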

## Configuration summary

- Track: `non-record`, 16 MB cap
- `NUM_UNIQUE_LAYERS=7 NUM_LOOPS=3` (effective depth 21)
- `MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4`, tied embeddings, sp1024
- `PARALLEL_LAYER_POSITIONS=0,1,2,3,4,5,6 PARALLEL_SSM_TYPE=mamba2_kill MAMBA2_KILL_SELECTIVITY=1`
- `BIGRAM_VOCAB_SIZE=0` (off)
- `TERNARY_BODY=1` (BitNet-b1.58, exported 2-bit packed)
- `EMA_BETA=0.999` (shadow swapped at last step)
- Schedule: `WARMDOWN_ITERS=1800 LR_WARMUP_STEPS=30 MATRIX_LR=0.045 TIED_EMBED_INIT_STD=0.05 MUON_BACKEND_STEPS=15 TRAIN_BATCH_TOKENS=524288`

## Command

`train_gpt.py` is fully self-contained — `BitLinear` plus `pack_ternary` / `unpack_ternary` are inlined directly into the script (no local helper modules). Run from the **repo root**, same convention as every other records-folder submission:

```bash
source records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/env.sh
torchrun --standalone --nproc_per_node=4 \
records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/train_gpt.py
```

`env.sh` sets `CONTROL_TENSOR_NAME_PATTERNS` (load-bearing — keeps SSM dynamics buffers fp32 under ternary quantization), the topology / quant / EMA / optimizer knobs, and the eval cadence. `DATA_PATH` and `TOKENIZER_PATH` are not exported because their script-side defaults (`./data/datasets/fineweb10B_sp1024` and `./data/tokenizers/fineweb_1024_bpe.model`) already resolve correctly from the repo root. `MAX_WALLCLOCK_SECONDS=3600` is the binding cap; `ITERATIONS=20000` is just an upper bound.

## Key metrics

- Pre-quant val_bpb: `1.2983`
- Post-quant val_bpb: `1.30040229`
- Quant tax: `0.0021`
- Wallclock: `3,600s` (cap fired)
- Step time: `821.94 ms`
- Steps: `4,380`
- Tokens trained: `4,380 × 524,288 ≈ 2.30B`
- Code size: `106,722 bytes` (`train_gpt.py`; single-file, no local helper modules)
- Compressed model payload (int8 + 2-bit-packed ternary + brotli q=11): `11,969,746 bytes`
- **Artifact total: `12,076,468 bytes` = code + payload (≈12.08 MB; 16 MB cap honored with ~3.92 MB headroom)**
- Model parameters: `61,657,752`
- Hardware: 4×H200 SXM (141GB HBM3e per GPU), `--nproc 4`, grad_accum=2 (peak GPU memory at run: 114,125 MiB allocated)

The train log (`train_seed1337.log`) shows `code_bytes:104676` and `Total submission size int8+zlib:12074422` because it was generated by the original three-file run (`train_gpt.py` + `modules/bitlinear.py` + `modules/trigram_side_memory.py`). The shipped artifact is now single-file (`bitlinear.py` inlined into `train_gpt.py`; `trigram_side_memory.py` deleted as gated dead code under default `TRIGRAM_SIDE_MEMORY=0`), so the post-cleanup byte counts above replace the log values. The trained model checkpoint is unchanged — only the code-side accounting moved.

## Comparison

- vs `track_10min_16mb` naive baseline (1.2244, 9L 512d sp1024 GQA-4 tied-emb, 8×H100 10 min): **+0.076 BPB worse** despite ~9× the compute budget. This submission is below the naive records-track baseline; it earns its place on the non-record list as the **first SSM entry** rather than as a frontier number.
- vs `track_10min_16mb` mid-tier records-track frontier `2026-03-31_ParallelResiduals_MiniDepthRecurrence` (1.1063): +0.194 BPB. Most of this gap is duration + missing standard-stack ports (parallel-residuals, sliding-window eval, GPTQ), not architecture.
- vs `track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L...` (1.1239, 8×H100, 2.15h, 8192 BPE): +0.177 BPB
- vs `track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3` (1.2074, 8×H100, 4h): +0.093 BPB

## Files

- `train_gpt.py` — single-file submission script (106,722 bytes). `BitLinear` plus `pack_ternary` / `unpack_ternary` are inlined; the (gated) `trigram_side_memory` import sites raise `NotImplementedError` so the dead branches are visible-and-unreachable rather than smuggling in a hidden dependency.
- `env.sh` — canonical environment; source from the repo root.
- `train_seed1337.log` — training log (partial: pod was stopped before the full `run.log` synced from `/workspace`; the lines preserved cover the headline numbers — pre/post-quant val_bpb, payload bytes, step times, peak memory, EMA shadow swap). The `code_bytes` / `Total submission size` lines reflect the original three-file run; see "Key metrics" above for the post-cleanup numbers.
- `result.json`, `submission.json` — leaderboard metadata.
- `requirements.txt` — `brotli` and `sentencepiece` are required at quant-export.
+++ b/records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/env.sh
@@ -0,0 +1,54 @@
# Canonical environment for this submission.
#
# Run from the REPO ROOT (where data/ lives):
#
# source records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/env.sh
# torchrun --standalone --nproc_per_node=4 \
# records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/train_gpt.py
#
# train_gpt.py is fully self-contained — it does NOT import any local helper
# modules. DATA_PATH and TOKENIZER_PATH default to ./data/datasets/fineweb10B_sp1024
# and ./data/tokenizers/fineweb_1024_bpe.model, both of which resolve from the
# repo root, matching the convention of every other records-folder submission.

# --- Identity / data ---
export RUN_ID="kill_mamba2_n7_ternary_ema_h200_1hr"
export SEED=1337
export VOCAB_SIZE=1024
export TRAIN_SEQ_LEN=1024

# --- Topology: 7 unique blocks × 3 weight-shared loops; parallel(attn || kill-Mamba-2) at every position ---
export NUM_UNIQUE_LAYERS=7
export NUM_LOOPS=3
export ATTN_LAYER_POSITIONS=
export MAMBA2_LAYER_POSITIONS=
export PARALLEL_LAYER_POSITIONS=0,1,2,3,4,5,6
export PARALLEL_SSM_TYPE=mamba2_kill
export MAMBA2_KILL_SELECTIVITY=1
export MLP_TYPE=swiglu
export MLP_MULT=8
export BIGRAM_VOCAB_SIZE=0 # off — interacts negatively with conv1d already in Mamba-2

# --- Quantization: BitNet-b1.58 ternary body, exported 2-bit packed ---
export TERNARY_BODY=1
# Tensors matching these patterns stay fp32 throughout (1D / small, ≤65,536 elem).
# Load-bearing for SSM dynamics buffers (A_log, B_proj, C_proj, dt_bias, D_skip, conv1d, etc.).
export CONTROL_TENSOR_NAME_PATTERNS="attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,A_log,A_im,B_proj,C_proj,dt_log,D_skip,dt_bias,delta_bias,conv1d"

# --- EMA-of-weights, β=0.999 (effective averaging window ~1000 steps; shadow swapped before final eval) ---
export EMA_BETA=0.999

# --- Schedule / optimizer ---
export TRAIN_BATCH_TOKENS=524288
export ITERATIONS=20000 # upper bound; the wallclock cap below is the binding limit
export MAX_WALLCLOCK_SECONDS=3600 # 1 hour
export WARMDOWN_ITERS=1800
export LR_WARMUP_STEPS=30
export WARMUP_STEPS=0 # batch-size warmup (separate from LR warmup); off
export MATRIX_LR=0.045
export TIED_EMBED_INIT_STD=0.05
export MUON_BACKEND_STEPS=15

# --- Eval ---
export VAL_TOKENS=0 # 0 = full validation set, writeup-quality
export VAL_LOSS_EVERY=0 # no mid-training val (eval runs once at the end after EMA swap)
+++ b/records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/requirements.txt
@@ -0,0 +1,22 @@
# RunPod / CUDA runtime requirements for parameter-golf-ssm.
#
# The RunPod "Parameter Golf" template comes with PyTorch + CUDA already
# installed, so torch is intentionally NOT pinned here — we use whatever
# CUDA build the image ships and only layer on the small library set the
# training/quantization path needs.
#
# If you ever provision a non-PG image and need to install torch yourself:
# pip install torch --index-url https://download.pytorch.org/whl/cu121
# (cu121 / cu124 both work; train_gpt.py only relies on stock PyTorch APIs.)
#
# Mirrors records/track_10min_16mb/2026-03-31_ParallelResiduals_MiniDepthRecurrence/requirements.txt
# plus the SSM-side memory packing path (brotli) and a couple of helpers.

numpy
tqdm
huggingface-hub
setuptools
typing-extensions==4.15.0
datasets
sentencepiece
brotli
+++ b/records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/result.json
@@ -0,0 +1,27 @@
{
"id": "0124_path_a_h200_1hr",
"parent": "0121_path_a_h100_5k",
"created_at": "2026-04-30T03:38:11Z",
"metrics": {
"val_bpb_pre_quant": 1.2983,
"val_bpb_post_quant": 1.30040229,
"val_loss_pre_quant": 2.1921,
"val_loss_post_quant": 2.19567479,
"quant_tax": 0.002102,
"step_avg_ms": 821.94,
"num_steps": 4380.0,
"artifact_bytes": 12074422.0,
"artifact_mb": 12.074,
"code_bytes": 104676.0,
"compression_ratio": 7.7
},
"flags": {
"crashed": false,
"size_violation": false,
"has_nan": false,
"exit_code": 0,
"device": "cuda_h200_x4"
},
"status": "keep",
"description": "Best project result. n=7 long-train SSM at 4×H200 1hr."
}
+++ b/records/track_non_record_16mb/2026-04-30_KillMamba2_TriParallel_n7_Ternary_EMA_4xH200_1hr/submission.json
@@ -0,0 +1,51 @@
{
"author": "potatonyliu",
"github_id": "potatonyliu",
"name": "kill-Mamba-2 SSM + Ternary + n=7 Depth Recurrence + EMA (1-hour 4×H200)",
"blurb": "First SSM-based entry in either track. kill-Mamba-2 (LTI selectivity, B/C constants) in parallel with attention at every block, NUM_UNIQUE_LAYERS=7 with NUM_LOOPS=3 weight sharing, BitNet-b1.58 ternary body with 2-bit-packed export, EMA-of-weights at β=0.999, brotli on top. Trained 4,380 steps × 524,288 batch ≈ 2.3B tokens in 3,600s on 4×H200 SXM (one hour, non-record). Lands at 1.30040 BPB / 12.08 MB; +0.076 BPB above the naive records-track baseline (1.2244) despite ~9× the compute, so this earns its place as the first SSM entry rather than as a frontier number. The gap to the records-track frontier (~0.19 BPB vs 1.1063) is dominated by training duration + missing standard-stack ports (parallel-residuals, sliding-window eval, GPTQ) rather than architecture.",
"date": "2026-04-30",
"track": "non_record_16mb",
"val_loss": 2.19567479,
"val_bpb": 1.30040229,
"seeds": [1337],
"seed_results": {
"1337": {
"val_loss_pre_quant": 2.1921,
"val_bpb_pre_quant": 1.2983,
"val_loss_post_quant": 2.19567479,
"val_bpb_post_quant": 1.30040229,
"quant_tax": 0.002102,
"artifact_bytes": 12076468,
"code_bytes": 106722,
"payload_bytes": 11969746,
"steps": 4380,
"step_avg_ms": 821.94,
"device": "cuda_h200_x4",
"wallclock_seconds": 3600,
"model_params": 61657752
}
},
"comparison_baseline": {
"track_10min_16mb_naive_baseline": {
"val_bpb": 1.2244,
"path": "records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md",
"delta_bpb": 0.0760,
"note": "9L 512d sp1024 GQA-4 tied-emb on 8×H100 10min — this submission is +0.0760 BPB worse despite ~9× the compute"
},
"track_10min_16mb_mid_tier_frontier": {
"val_bpb": 1.1063,
"path": "records/track_10min_16mb/2026-03-31_ParallelResiduals_MiniDepthRecurrence/README.md",
"delta_bpb": 0.1941,
"note": "Mid-tier records-track entry (PR #1204), not the baseline; gap is mostly duration + missing standard-stack ports"
},
"track_non_record_16mb_peers": {
"binary_106M_2.15h": { "val_bpb": 1.1239, "path": "records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md" },
"transformer_4h": { "val_bpb": 1.2074, "path": "records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md" }
}
},
"bytes_total": 12076468,
"code_bytes": 106722,
"payload_bytes": 11969746,
"hardware": "4×H200 SXM 141GB",
"notes": "Non-record because of hardware (4×H200 not 8×H100) and time (1h not 10min). First SSM entry. Eval is the standard root-harness full-window val + int8/brotli quant roundtrip; not sliding-window. Single-seed (1337); the −0.65 BPB delta over the prior local SSM winner is far outside any reasonable noise floor for this architecture family. Score (1.3004) is +0.076 BPB above the naive records-track baseline (1.2244); positioning as 'requested-PR SSM entry' rather than as a competitive number. The train log records pre-cleanup code_bytes (104,676) / total (12,074,422) from the original three-file run; the shipped single-file artifact has code_bytes=106,722 / total=12,076,468 (BitLinear + pack_ternary / unpack_ternary inlined into train_gpt.py; trigram_side_memory.py deleted as gated dead code). Trained model checkpoint is unchanged — only code-side accounting moved."
}