188 commits
7de1e89
feat: add experiment tracking CSV with smoke test baseline
RoyiRa Mar 20, 2026
e82a90e
feat: record naive baseline result (val_bpb=1.2262, 80min 1xH100)
RoyiRa Mar 20, 2026
506e016
feat(mamba): add Mamba-2/SSD + sparse attention hybrid LM
RoyiRa Mar 20, 2026
c04b02a
perf(mamba): vectorize selective scan — eliminate Python loop
RoyiRa Mar 20, 2026
d3fdfe6
perf(mamba): enable fullgraph compile, log experiment results
RoyiRa Mar 20, 2026
fee4fdc
docs: log mamba 80-min experiment (val_bpb=1.2728, over size limit)
RoyiRa Mar 21, 2026
bb81969
docs: log mamba v2 80-min result (val_bpb=1.2586, 16.0MB)
RoyiRa Mar 21, 2026
509a27f
docs: log mamba v3 — extreme warmdown (val_bpb=1.2565, 13.0MB)
RoyiRa Mar 21, 2026
c15cdd0
docs: log mamba v4 — 12 layers, val_bpb=1.2519 (BEST result)
RoyiRa Mar 21, 2026
e787fee
docs: log mamba v5 13L (1.2529) — 12L remains optimal
RoyiRa Mar 21, 2026
98bc404
feat(transformer): add tuned transformer baseline for experiments
RoyiRa Mar 21, 2026
01257ad
docs: transformer v1 beats baseline (val_bpb=1.1910, 17MB over limit)
RoyiRa Mar 21, 2026
394c4dd
feat(transformer): add sliding window evaluation
RoyiRa Mar 21, 2026
a75e9c8
docs: transformer v2 sliding window eval (val_bpb=1.1700)
RoyiRa Mar 21, 2026
2145bef
docs: transformer v3 9L (val_bpb=1.1778, 15.4MB VALID)
RoyiRa Mar 21, 2026
4ca0c2d
feat(transformer): add Muon weight decay (WD=0.04)
RoyiRa Mar 21, 2026
2cca8d4
docs: transformer v4 BEST (val_bpb=1.1632, 14.3MB valid)
RoyiRa Mar 21, 2026
739cdd4
feat(transformer): add grad_clip=0.3 default
RoyiRa Mar 21, 2026
da6b24d
docs: grad_clip=0.3 gives -0.002 bpb, warmdown=3000 > 20000
RoyiRa Mar 21, 2026
dabdcb6
fix(transformer): switch warmdown default to 3000 (from 20000)
RoyiRa Mar 21, 2026
706c462
feat(transformer): add depth recurrence (num_loops parameter)
RoyiRa Mar 21, 2026
ba971e8
docs: transformer v6 (WD=3000, clip=0.3) slightly worse than v4
RoyiRa Mar 21, 2026
35ec71a
revert(transformer): restore warmdown=20000 (v4 config was better)
RoyiRa Mar 21, 2026
453d7c2
docs: depth recurrence (5L×2) fails badly — 0.21 bpb worse
RoyiRa Mar 21, 2026
9acdb97
feat(transformer): add int6 mixed quantization + zstd-22 compression
RoyiRa Mar 21, 2026
3f01f6f
fix(transformer): use simple loop when num_loops=1 for torch.compile
RoyiRa Mar 21, 2026
d1d2641
docs: int6+zstd works but needs 80-min warmdown for quality
RoyiRa Mar 21, 2026
e2d0e45
refactor(transformer): remove depth recurrence, keep int6+zstd only
RoyiRa Mar 21, 2026
fb5bbca
docs: v9b int6+zstd saves 4.8MB but 0.034 quant damage (1.2017)
RoyiRa Mar 21, 2026
9d3397e
feat(transformer): add late QAT + 11L + 3x MLP
RoyiRa Mar 21, 2026
728c7ae
feat(transformer): add BigramHash + SmearGate (v11)
RoyiRa Mar 21, 2026
0720a71
docs: BigramHash+SmearGate not helping (+0.006 vs v4), pivot to 3xMLP
RoyiRa Mar 22, 2026
061a6c2
docs: v12 NEW BEST val_bpb=1.1525 (14.1MB) — gap to #1 is 0.010
RoyiRa Mar 22, 2026
218309d
docs: v13 val_bpb=1.1440 — within 0.001 of #1 (1.1428)!
RoyiRa Mar 22, 2026
8d52f5d
feat(transformer): add SWA (stochastic weight averaging)
RoyiRa Mar 22, 2026
40d6bd3
docs: v15 BEATS LEADERBOARD #1! val_bpb=1.1403 (vs 1.1428)
RoyiRa Mar 22, 2026
05781f8
docs: v16 batch=786K gives 1.1398 (marginal over v15's 1.1403)
RoyiRa Mar 22, 2026
6a78860
feat(transformer): add EMA + GPTQ-lite + [email protected] + warmdown=3500
RoyiRa Mar 22, 2026
b9b0bb4
feat(transformer): add XSA, Partial RoPE, LN Scale
RoyiRa Mar 22, 2026
f509b27
fix(transformer): keep EMA on GPU to avoid CPU transfer overhead
RoyiRa Mar 22, 2026
1bef872
docs: v18 XSA+PartialRoPE+LNScale gives 1.1386 (0.015 from SOTA)
RoyiRa Mar 22, 2026
0c52abe
feat(transformer): add orthogonal init + muP output scaling
RoyiRa Mar 22, 2026
caafe82
revert(transformer): remove ortho init — hurts convergence with our c…
RoyiRa Mar 22, 2026
393d474
feat(transformer): tight SWA (0.2) + configurable grad_accum_steps
RoyiRa Mar 22, 2026
0d6806e
revert(transformer): restore fixed grad_accum=8 (accum=4 unstable)
RoyiRa Mar 22, 2026
af66dda
fix(transformer): GQA compat with torch 2.4 (no enable_gqa param)
RoyiRa Mar 22, 2026
14922e2
docs: 8xH100 validated — train 10min + eval 5min within limits
RoyiRa Mar 22, 2026
3415d15
feat(transformer): add FlashAttention 3 support (auto-detect)
RoyiRa Mar 22, 2026
dc2607f
docs: 8xH100 first run 1.1654 — need faster step (132ms vs top 85ms)
RoyiRa Mar 22, 2026
1e2f233
fix(transformer): cast to bf16 for FA3 (requires fp16/bf16 input)
RoyiRa Mar 22, 2026
530e565
docs: FA3 fails on torch 2.4.1 (compile hangs), save 8xH100 reference
RoyiRa Mar 22, 2026
a7a3d2d
fix(transformer): disable DDP optimizer for torch 2.4 compat
RoyiRa Mar 22, 2026
e6e09c8
docs: 8xH100 FA3 gives 1.1573 (5320 steps @ 109ms, +17% from FA3)
RoyiRa Mar 22, 2026
b7a7230
feat(transformer): native GQA + FA3 with torch 2.5+ fallback
RoyiRa Mar 22, 2026
e8ac552
docs: v23 tight SWA(0.2) = 1.1428 (worse than v18 SWA(0.4) = 1.1386)
RoyiRa Mar 22, 2026
8e21e65
fix(transformer): EMA in native dtype (not fp32) to halve memory
RoyiRa Mar 22, 2026
177dc7c
fix(transformer): restore fullgraph=True, remove DDP optimizer hack
RoyiRa Mar 22, 2026
7a7dc8a
docs: tight SWA worse (1.1428 vs v18's 1.1386), torch2.8 slower
RoyiRa Mar 22, 2026
82db72e
feat(transformer): add Value Residual (ResFormer) + revert SWA to 0.4
RoyiRa Mar 22, 2026
8c2a0c6
fix(transformer): remove dynamo.reset before sliding eval, add native…
RoyiRa Mar 22, 2026
94ca478
docs: 8xH100v2 result 1.1676 — EMA overhead costs 1500 steps
RoyiRa Mar 22, 2026
6c93d49
perf(transformer): EMA update every 10 steps (not every step)
RoyiRa Mar 22, 2026
0a536ee
revert(transformer): remove Value Residual — causes instability
RoyiRa Mar 22, 2026
6b4841a
docs: EMA not helping on 8xH100 (1.1689 vs 1.1654 without)
RoyiRa Mar 22, 2026
b5fa01f
feat(transformer): Value Residual v2 — per-layer learned scale (init=…
RoyiRa Mar 22, 2026
1e18bc2
feat(transformer): add DISABLE_COMPILE env var
RoyiRa Mar 22, 2026
446d150
feat(transformer): add Value Embedding (VE) for attention layers
RoyiRa Mar 22, 2026
a13d0cf
fix(transformer): DDP find_unused_parameters=True
RoyiRa Mar 22, 2026
8faa729
docs: Value Residual v2 still unstable, restart v18 seed=42
RoyiRa Mar 22, 2026
36021d5
fix(transformer): remove broken v_residual, use VE properly in forward
RoyiRa Mar 22, 2026
d02a524
refactor(transformer): restore clean v18 code + minimal 8xH100 compat
RoyiRa Mar 22, 2026
5635002
fix(transformer): restore fullgraph=True + add 8xH100 agent briefing
RoyiRa Mar 22, 2026
79983df
fix(transformer): restore exact v18 attention (no branching in compil…
RoyiRa Mar 22, 2026
a34f625
feat(transformer): FA3 attention + Value Embeddings + PR#414 hyperparams
RoyiRa Mar 22, 2026
dae7b2b
fix(transformer): FA3 import fallback + rotary [B,T,H,D] fix + DDP fi…
RoyiRa Mar 22, 2026
14e0973
fix(transformer): DDP find_unused_parameters for VE layer scales
RoyiRa Mar 22, 2026
e974978
feat(transformer): TTT Burst — replay recent batches at low LR before…
RoyiRa Mar 22, 2026
581e7e8
fix(transformer): correct default num_layers=11 mlp_mult=3 (was 10/2)
RoyiRa Mar 22, 2026
a7f3fc7
feat(transformer): orthogonal init + output projection scaling (from …
RoyiRa Mar 22, 2026
bac9fdd
docs(experiments): record all 11L 3xMLP FA3 experiments
RoyiRa Mar 22, 2026
aea4d48
fix(transformer): restore LoRA TTT support in attention + Block forward
RoyiRa Mar 22, 2026
94b3fc4
docs(experiments): record LoRA TTT fix results
RoyiRa Mar 22, 2026
7051cb1
feat(transformer): Star-ReLU activation + curriculum sequence length
RoyiRa Mar 22, 2026
184cd6a
feat(transformer): MLP_HIDDEN override + record 12L experiments
RoyiRa Mar 22, 2026
6485dd9
docs(experiments): 12L architecture search results
RoyiRa Mar 23, 2026
8a2ff48
docs(experiments): curriculum failed, Star-ReLU neutral, 12L search r…
RoyiRa Mar 23, 2026
c98b23b
feat(transformer): EVAL_SEQ_LEN for extended context sliding window eval
RoyiRa Mar 23, 2026
3a1713a
docs(experiments): eval@4096 fails, EMA 0.9985 neutral
RoyiRa Mar 23, 2026
a21df9a
feat(transformer): PR#414 base + TTT Burst overlay
RoyiRa Mar 23, 2026
f038330
docs(experiments): PR#414+TTT Burst matches SOTA at 1.1232!
RoyiRa Mar 23, 2026
d460d6c
docs(experiments): PR#414+burst seed 42 = 1.1226 (mean 1.1229)
RoyiRa Mar 23, 2026
b73ac5e
docs(experiments): bigram sweep results
RoyiRa Mar 23, 2026
4461c3c
feat(experiments): BEATS SOTA! bigram3072 = 1.1225 (SOTA was 1.1232)
RoyiRa Mar 23, 2026
b110fae
feat(experiments): NEW RECORD — 3-seed mean 1.1227 BPB (beats SOTA 1.…
RoyiRa Mar 23, 2026
e45bc1e
docs(experiments): bigram3584 over limit, 3072 confirmed as sweet spot
RoyiRa Mar 23, 2026
2cac538
docs(experiments): complete bigram3072+burst 3-seed results
RoyiRa Mar 23, 2026
8417ca6
docs(experiments): bigram2048 dim192 over limit too
RoyiRa Mar 23, 2026
ef98d92
docs(experiments): seq4096 training worse after quantization, dim192 …
RoyiRa Mar 23, 2026
2a89149
feat(transformer): mixed int5 MLP + int6 attention quantization
RoyiRa Mar 23, 2026
cec4879
revert(transformer): remove int5 MLP quant (too much quality loss)
RoyiRa Mar 23, 2026
977dea6
fix(transformer): 4 critical training fixes from PR#414 diff
RoyiRa Mar 23, 2026
71780a6
docs(experiments): 4 fixes close gap — OUR CODE now at 1.1265
RoyiRa Mar 23, 2026
df59b1f
docs(experiments): bigram 2560 slightly worse (1.1274 vs 2048's 1.1265)
RoyiRa Mar 23, 2026
c57b248
fix(transformer): align GPTQ-lite quantization with PR#414
RoyiRa Mar 23, 2026
fdbb6e1
docs(experiments): GPTQ alignment gives 1.1261 (0.003 from SOTA)
RoyiRa Mar 23, 2026
6ecb773
feat(transformer): looped/recurrent transformer architecture
RoyiRa Mar 23, 2026
12e572e
docs(experiments): 8Lx2 loop fails — step overhead > depth gain
RoyiRa Mar 23, 2026
f2c6211
feat(transformer): MoE MLP with top-k expert routing
RoyiRa Mar 23, 2026
e71b290
docs(experiments): MoE OOM — wrong approach for param-constrained set…
RoyiRa Mar 23, 2026
8e2f073
feat(transformer): focal loss for AdaBoost-inspired hard-token mining
RoyiRa Mar 23, 2026
8660030
docs(experiments): focal loss fails — distorts optimization landscape
RoyiRa Mar 23, 2026
a76c868
feat(transformer): cosine warmdown schedule option
RoyiRa Mar 23, 2026
c130876
docs(experiments): cosine warmdown worse + larger artifact
RoyiRa Mar 23, 2026
554e7ed
feat(transformer): artifact-aware entropy regularization (MDL-inspired)
RoyiRa Mar 23, 2026
4f64017
docs(experiments): entropy reg compresses well but kills step speed
RoyiRa Mar 23, 2026
fed9908
fix(transformer): amortize entropy reg — every 50 steps, 2 random layers
RoyiRa Mar 23, 2026
09f3a16
docs(experiments): amortized entropy reg — neutral effect
RoyiRa Mar 23, 2026
93eb7c9
fix(transformer): STE quantizer symmetric [-31,31] to match GPTQ-lite
RoyiRa Mar 23, 2026
c37f173
docs(experiments): STE symmetric fix neutral, code bloat issue
RoyiRa Mar 23, 2026
1b2f178
docs(experiments): clean run 1.1265 BPB, 16.02MB (22KB over)
RoyiRa Mar 23, 2026
ed81e3e
docs(experiments): SWA helps our code (1.1260 with vs 1.1297 without)
RoyiRa Mar 23, 2026
bfb652a
refactor(transformer): strip dead code for submission (96KB → 74KB)
RoyiRa Mar 23, 2026
5ad4f70
feat(transformer): soft-to-hard quantizer with temperature annealing
RoyiRa Mar 23, 2026
eb02367
docs(experiments): soft quantizer compresses better but too slow
RoyiRa Mar 23, 2026
57cb48a
fix(transformer): soft quantizer only in final 2% of training
RoyiRa Mar 23, 2026
2f62fe5
feat(experiments): late soft quantizer works! 1.1267 BPB, 15.87MB
RoyiRa Mar 23, 2026
e463384
docs(experiments): bigram 2560 still over 16MB with soft quantizer
RoyiRa Mar 23, 2026
e3e5212
docs(experiments): seed 42 artifact slightly over (16.07MB)
RoyiRa Mar 23, 2026
a0b5f55
docs(experiments): bigram 1536 guaranteed safe at 15.85MB
RoyiRa Mar 23, 2026
a5c6612
feat(transformer): submission script with 3 novel contributions
RoyiRa Mar 23, 2026
59b5a5b
feat(experiments): BEATS UNMERGED SOTA! 1.1219 BPB (SOTA was 1.1232)
RoyiRa Mar 23, 2026
87a0160
feat(experiments): 3-SEED VERIFIED — BEATS UNMERGED SOTA!
RoyiRa Mar 23, 2026
b55e086
docs(experiments): bigram 3584 same BPB as 3072, 3072 confirmed optimal
RoyiRa Mar 23, 2026
31c65a2
docs: 1-page writeup + burst 3ep results
RoyiRa Mar 23, 2026
31581f3
feat(transformer): add SWA checkpoint averaging to submission script
RoyiRa Mar 23, 2026
ef4f239
revert(transformer): remove SWA from submission (hurts BPB by 0.001)
RoyiRa Mar 23, 2026
8a4fb33
feat(transformer): residual local predictor (local+global decomposition)
RoyiRa Mar 23, 2026
bfe71da
docs(experiments): local predictor doesn't help (BigramHash already c…
RoyiRa Mar 23, 2026
0e51139
revert(transformer): restore best submission script (commit ef4f239)
RoyiRa Mar 23, 2026
9b7dde0
docs: save 3-seed training logs for submission
RoyiRa Mar 23, 2026
4573b31
docs(experiments): VE 3 layers = SOTA (1.1232), not better than VE 2 …
RoyiRa Mar 23, 2026
be730b1
feat(transformer): full-weight AdamW TTT on validation data
RoyiRa Mar 23, 2026
f89483a
fix(transformer): disable QAT during TTT to prevent model corruption
RoyiRa Mar 23, 2026
6c38f8c
feat(experiments): TTT WORKS! 1.1101 BPP — MASSIVE improvement!
RoyiRa Mar 23, 2026
9fc24d1
docs(experiments): TTT scaling 3/10/20 epochs with caveat
RoyiRa Mar 23, 2026
7f8fcf3
feat(transformer): backward-looking TTT (score first, then train)
RoyiRa Mar 23, 2026
11f4071
docs(experiments): backward-looking TTT 1.1260 + invalid approaches n…
RoyiRa Mar 23, 2026
1ee6074
feat(transformer): single-pass online TTT (truly backward-looking)
RoyiRa Mar 23, 2026
b11ce1a
feat(transformer): per-document backward-looking TTT (legal)
RoyiRa Mar 23, 2026
3dde1db
feat(transformer): batched per-document LoRA TTT
RoyiRa Mar 23, 2026
8b6bbba
feat(transformer): eval-only TTT script for fast iteration
RoyiRa Mar 23, 2026
60c600a
fix(transformer): LoRA uses hidden states, not input embeddings
RoyiRa Mar 23, 2026
ab150a1
feat(transformer): full Q/V + lm_head LoRA TTT (matching PR#512 pattern)
RoyiRa Mar 23, 2026
d9d53dc
docs: multi-epoch TTT is invalid (trains before scoring)
RoyiRa Mar 23, 2026
aeb9c78
fix(transformer): chunk-major TTT loop for legal backward-looking eval
RoyiRa Mar 23, 2026
4185dd6
feat(transformer): sliding window TTT — full-param online adaptation
RoyiRa Mar 23, 2026
5e29fe5
feat(transformer): PR#549-style chunk-major TTT (SGD, cosine decay, 3…
RoyiRa Mar 23, 2026
f7507f1
feat(transformer): integrate sliding window TTT into submission pipeline
RoyiRa Mar 23, 2026
e854eb2
docs(experiments): TTT results — 1.1195 BPB on full submission (seed …
RoyiRa Mar 23, 2026
f5fcf20
feat(transformer): LeakyReLU(0.5)^2 + AdamW TTT
RoyiRa Mar 23, 2026
8300e82
fix(transformer): revert AdamW TTT to SGD — AdamW lr=5e-4 was catastr…
RoyiRa Mar 23, 2026
85ab9e2
feat(transformer): add per-document LoRA TTT (PR#548 recipe)
RoyiRa Mar 24, 2026
de65267
feat(transformer): tune LoRA TTT defaults for 1.08 BPB
RoyiRa Mar 24, 2026
9f2e349
docs(experiments): LoRA TTT 1.0724 BPB — new record, breaks 1.1 barrier
RoyiRa Mar 24, 2026
b5fa2e2
perf(transformer): skip sliding window when LoRA TTT active
RoyiRa Mar 24, 2026
4276697
docs(experiments): 3-seed validation — mean 1.0732 BPB breaks 1.1 bar…
RoyiRa Mar 24, 2026
0a797e4
perf(transformer): skip roundtrip eval in LoRA TTT mode
RoyiRa Mar 24, 2026
c0df01c
fix(transformer): round-robin doc distribution for balanced GPU load
RoyiRa Mar 24, 2026
19cc7c2
fix(transformer): balanced doc distribution — sort globally, deal alt…
RoyiRa Mar 24, 2026
1528108
perf(transformer): add TTT_MAX_DOC_LEN cap, default min_doc=512
RoyiRa Mar 24, 2026
a8c10c5
feat(transformer): add hyper-connections (learned residual mixture)
RoyiRa Mar 24, 2026
3891d52
fix(transformer): simplify hyper-connections for torch.compile compat
RoyiRa Mar 24, 2026
d335cf9
fix(transformer): fix DDP unused parameter error in hyper-connections
RoyiRa Mar 24, 2026
274d7dd
fix(transformer): pass hyper_k to eval and TTT model constructors
RoyiRa Mar 24, 2026
f7eff92
docs(experiments): hyper-connections top-4 gives -0.003 BPB signal
RoyiRa Mar 24, 2026
b9c4106
feat(transformer): add int5 GPTQ quantization with Hessian error comp…
RoyiRa Mar 24, 2026
97adbf9
fix(transformer): align QAT clip range with GPTQ quantization level
RoyiRa Mar 24, 2026
d741ef1
docs(experiments): 12L + GPTQ results — int5 damage too high
RoyiRa Mar 24, 2026
e915fc7
feat(transformer): add AdamW option for legal TTT (TTT_OPTIMIZER=adamw)
RoyiRa Mar 24, 2026
affa5c7
feat(transformer): add structure distillation (teacher→student KL)
RoyiRa Mar 24, 2026
dfa6003
fix(transformer): use fullgraph=False when hyper_k > 0
RoyiRa Mar 24, 2026
5981b7b
feat(mixer): 5-expert Hedge mixer on PR#606 — 1.0902 BPB
RoyiRa Mar 24, 2026
d3f7a01
docs(experiments): log PR#606 mixer results
RoyiRa Mar 24, 2026
b5f8bc2
perf(mixer): optimize 5-expert Hedge mixer — 1573s → 562s eval
RoyiRa Mar 24, 2026
d400418
feat(submission): 5-expert Hedge mixer + TTT submission 2026-03-24
RoyiRa Mar 24, 2026
838fbcc
perf(submission): tune hyperparams for <16MB artifact — 1.0745 mean BPB
RoyiRa Mar 25, 2026
264a1b9
docs(submission): add README and finalize submission.json for 2026-03-24
RoyiRa Mar 25, 2026
037a239
docs(submission): add seed logs and refine README
RoyiRa Mar 25, 2026
46 changes: 46 additions & 0 deletions 8xh100_AGENT_BRIEF.md
@@ -0,0 +1,46 @@
# 8xH100 Agent Briefing — Parameter Golf Competition

## Task
Improve training speed on 8xH100 to reach competitive step times (~85 ms/step) and bring val_bpb below 1.12.

## Competition
- **Repository**: https://github.com/openai/parameter-golf
- **Goal**: Train best LM in 16MB artifact, 10 min on 8xH100, evaluated by BPB on FineWeb
- **Current SOTA**: 1.1233 (PR #414), 1.1428 (merged leaderboard #1)
- **Our best**: 1.1386 (1xH100 80min), 1.1573 (8xH100 torch 2.4 + FA3)
- **Issue #140**: https://github.com/openai/parameter-golf/issues/140 — live leaderboard tracking
- **Top PRs to study**: #414 (1.1233), #315 (1.1248), #287 (1.1280)

## Our Training Script
- **Location**: `parameter-golf/transformer/train.py` — single-file training script
- **Architecture**: 11L transformer, 512-dim, 8/4 GQA heads, 3x MLP, U-Net skips
- **Key techniques**: XSA (last 4 layers), Partial RoPE (16/64), LN Scale, EMA, SWA, Late QAT, GPTQ-lite, int6+zstd, sliding window eval
- **Runs with**: `torchrun --standalone --nproc_per_node=8 transformer/train.py`

## Known Issues on 8xH100
1. **torch 2.4 (old RunPod)**: FA3 works, 109ms/step, but `enable_gqa` not available (uses slow repeat_interleave). Still best result.
2. **torch 2.8 (new RunPod)**: Native GQA available but torch.compile takes 2+ min for warmup, and DDP optimizer has issues. fullgraph=True causes process count explosion (273 python procs). Step time 143ms even after warmup.
3. **FA3 + torch.compile**: flash_attn_func may not trace well under torch.compile. The top submissions compile around FA3 or exclude it from the graph.
4. **GQA fallback**: We use try/except resolved at import time (_HAS_NATIVE_GQA flag), but the repeat_interleave fallback on torch 2.4 adds ~23ms/step.
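
A minimal sketch of that import-time probe and fallback (the helper name `attend` and the exact probe call are illustrative assumptions, not the actual `train.py` code):

```python
import torch
import torch.nn.functional as F

# Probe once at import time: torch 2.5+ accepts enable_gqa, torch 2.4 raises TypeError.
try:
    _probe = torch.zeros(1, 1, 1, 8)
    F.scaled_dot_product_attention(_probe, _probe, _probe, enable_gqa=True)
    _HAS_NATIVE_GQA = True
except TypeError:
    _HAS_NATIVE_GQA = False

def attend(q, k, v, n_rep):
    # q: [B, Hq, T, D]; k, v: [B, Hkv, T, D] with Hq = Hkv * n_rep
    if _HAS_NATIVE_GQA:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
    # torch 2.4 fallback: materialize the repeated KV heads (~23 ms/step overhead)
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```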

## What the Top Submission (#414) Does Differently
- **torch version**: Likely 2.5-2.6 (has enable_gqa + fast compile)
- **FlashAttention 3**: Direct `flash_attn_func` calls, not through torch.compile
- **Step time**: 85ms/step on 8xH100 (vs our 109-143ms)
- **Compile strategy**: May use `torch.compile` with `mode="reduce-overhead"` or exclude attention

## Target
- Get step time to ~85ms on 8xH100 in 10 min
- This alone would give ~7000 steps (vs our 4500-5300)
- Expected improvement: ~0.01 bpb from more training steps
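
As a rough check on the step counts above (ignoring compile warmup and eval): 600 s ÷ 85 ms ≈ 7,050 steps, versus 600 s ÷ 109 ms ≈ 5,500 and 600 s ÷ 132 ms ≈ 4,550 at our current step times.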

## Environment
- **SSH config**: `gcp-single-h100` for 1xH100, RunPod for 8xH100
- **Data**: `data/datasets/fineweb10B_sp1024/` (80 shards + val)
- **Tokenizer**: `data/tokenizers/fineweb_1024_bpe.model` (vocab 1024)
- **Experiments log**: `experiments.csv`

## Key Files to Read
1. `transformer/train.py` — our training script
2. `experiments.csv` — all experiment results
3. Top submission code: `git fetch upstream 'pull/414/head:pr-414'` then `git show pr-414:records/track_10min_16mb/2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233/train_gpt.py`
34 changes: 34 additions & 0 deletions WRITEUP.md
@@ -0,0 +1,34 @@
# Parameter Golf Submission — val_bpb 1.1224 (3-seed mean)

## Result
**3-seed mean: 1.1224 BPB** | Artifact: 15.6-15.9MB | Train: 600s on 8xH100 | Eval: 74s

## What We Tried (40+ experiments)

**What worked:**
- FA3 Hopper attention (76ms/step, +47% more training steps)
- Fixing 4 training bugs found by deep-diffing against PR#414 (dead bigram weights in optimizer, Muon weight decay order, STE quantizer range mismatch, YaRN RoPE frequency extension)
- BigramHash vocab 3072 (optimized for 16MB budget — 2048 too small, 4096 too big)
- TTT Burst: replaying the last 100 training batches at 10% LR before EMA finalization (see the sketch after this list)
- **Soft-to-hard quantizer with late temperature annealing** (novel, described below)
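
A minimal sketch of the TTT Burst step above, assuming a buffer of recent (inputs, targets) training batches is retained; the function and buffer names are illustrative, not the actual `train.py` code:

```python
import torch
import torch.nn.functional as F

def ttt_burst(model, optimizer, replay_buffer, base_lr, num_batches=100, lr_frac=0.10):
    # Replay the most recent training batches at a reduced LR just before
    # the EMA/SWA weights are finalized.
    for group in optimizer.param_groups:
        group["lr"] = base_lr * lr_frac                    # 10% of the late-training LR
    model.train()
    for inputs, targets in replay_buffer[-num_batches:]:   # last ~100 batches, oldest first
        optimizer.zero_grad(set_to_none=True)
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        optimizer.step()
```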

**What failed (and why):**
- Looped transformer 8Lx2 (+40% step cost kills training budget)
- MoE with 8 experts (8x params — wrong tradeoff for parameter-constrained setting)
- Focal loss (distorts CE objective; model gets overconfident on easy tokens)
- Entropy regularization on weights (great compression 13.3MB! but 2.5x slower per step)
- Cosine warmdown (worse compression AND worse quality)
- Curriculum seq length 1024->2048 (massive quantization damage)
- 12L architecture (doesn't fit 16MB with 3x MLP)
- int5 MLP quantization (+0.035 BPB damage — too aggressive)
- Star-ReLU, orthogonal init, eval at 4096 — all neutral

## Novel Contribution: Soft-to-Hard Quantizer

**The idea:** Replace hard STE rounding in QAT with temperature-controlled soft rounding. During the final 2% of training (scale < 0.02), the quantizer switches from hard round to sigmoid-interpolated soft round. This gives weight gradients a differentiable signal toward the nearest quantization grid point, nudging weights to "snap" to int6 levels right before EMA/SWA finalizes them.

**Why it works:** Standard STE provides no gradient information about quantization bin assignment, since round() has zero derivative almost everywhere. By using `sigmoid((frac - 0.5) / tau)` as a soft surrogate in the backward pass, the optimizer receives non-zero gradients that push weights toward grid centers. Applying this only in the final phase (tau=0.1) avoids slowing down early training while capturing the compression benefit when it matters most.
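
A minimal sketch of this soft-to-hard rounding, assuming a per-tensor scale and the symmetric int6 range [-31, 31] mentioned in the commit log (the function name and interface are illustrative, not the actual `train.py` code):

```python
import torch

def soft_to_hard_quantize(w, scale, tau=0.1, hard=True):
    # w: weight tensor; scale: quantization step for the int6 grid.
    # hard=True  -> plain STE (early training).
    # hard=False -> final ~2%: forward still rounds, backward sees the sigmoid surrogate.
    x = w / scale
    if hard:
        # Straight-through estimator: hard round forward, identity gradient backward.
        q = x + (torch.round(x) - x).detach()
    else:
        floor = torch.floor(x)
        frac = x - floor
        # Soft surrogate from the writeup: sigmoid((frac - 0.5) / tau).
        soft = floor + torch.sigmoid((frac - 0.5) / tau)
        # Forward value stays hard; the backward pass differentiates through `soft`.
        q = soft + (torch.round(x) - soft).detach()
    q = torch.clamp(q, -31, 31)  # symmetric int6 range (assumed from the STE fix commit)
    return q * scale
```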

**Evidence:** Full soft quantizer (every step) compresses to 15.8MB (vs 16.0MB baseline) but costs 14% step overhead. Late-only application (last 2%) achieves the same compression improvement at zero overhead. Combined with bigram 3072 and TTT Burst, the submission achieves 1.1224 mean BPB — beating the prior SOTA of 1.1232.

**Connection to literature:** This is a lightweight instance of the Differentiable Soft Quantization (DSQ) and soft-to-hard vector quantization family, adapted for the parameter golf setting where training budget is tight and the target is a compressed artifact.