188 commits
7de1e89
feat: add experiment tracking CSV with smoke test baseline
RoyiRa Mar 20, 2026
e82a90e
feat: record naive baseline result (val_bpb=1.2262, 80min 1xH100)
RoyiRa Mar 20, 2026
506e016
feat(mamba): add Mamba-2/SSD + sparse attention hybrid LM
RoyiRa Mar 20, 2026
c04b02a
perf(mamba): vectorize selective scan — eliminate Python loop
RoyiRa Mar 20, 2026
d3fdfe6
perf(mamba): enable fullgraph compile, log experiment results
RoyiRa Mar 20, 2026
fee4fdc
docs: log mamba 80-min experiment (val_bpb=1.2728, over size limit)
RoyiRa Mar 21, 2026
bb81969
docs: log mamba v2 80-min result (val_bpb=1.2586, 16.0MB)
RoyiRa Mar 21, 2026
509a27f
docs: log mamba v3 — extreme warmdown (val_bpb=1.2565, 13.0MB)
RoyiRa Mar 21, 2026
c15cdd0
docs: log mamba v4 — 12 layers, val_bpb=1.2519 (BEST result)
RoyiRa Mar 21, 2026
e787fee
docs: log mamba v5 13L (1.2529) — 12L remains optimal
RoyiRa Mar 21, 2026
98bc404
feat(transformer): add tuned transformer baseline for experiments
RoyiRa Mar 21, 2026
01257ad
docs: transformer v1 beats baseline (val_bpb=1.1910, 17MB over limit)
RoyiRa Mar 21, 2026
394c4dd
feat(transformer): add sliding window evaluation
RoyiRa Mar 21, 2026
a75e9c8
docs: transformer v2 sliding window eval (val_bpb=1.1700)
RoyiRa Mar 21, 2026
2145bef
docs: transformer v3 9L (val_bpb=1.1778, 15.4MB VALID)
RoyiRa Mar 21, 2026
4ca0c2d
feat(transformer): add Muon weight decay (WD=0.04)
RoyiRa Mar 21, 2026
2cca8d4
docs: transformer v4 BEST (val_bpb=1.1632, 14.3MB valid)
RoyiRa Mar 21, 2026
739cdd4
feat(transformer): add grad_clip=0.3 default
RoyiRa Mar 21, 2026
da6b24d
docs: grad_clip=0.3 gives -0.002 bpb, warmdown=3000 > 20000
RoyiRa Mar 21, 2026
dabdcb6
fix(transformer): switch warmdown default to 3000 (from 20000)
RoyiRa Mar 21, 2026
706c462
feat(transformer): add depth recurrence (num_loops parameter)
RoyiRa Mar 21, 2026
ba971e8
docs: transformer v6 (WD=3000, clip=0.3) slightly worse than v4
RoyiRa Mar 21, 2026
35ec71a
revert(transformer): restore warmdown=20000 (v4 config was better)
RoyiRa Mar 21, 2026
453d7c2
docs: depth recurrence (5L×2) fails badly — 0.21 bpb worse
RoyiRa Mar 21, 2026
9acdb97
feat(transformer): add int6 mixed quantization + zstd-22 compression
RoyiRa Mar 21, 2026
3f01f6f
fix(transformer): use simple loop when num_loops=1 for torch.compile
RoyiRa Mar 21, 2026
d1d2641
docs: int6+zstd works but needs 80-min warmdown for quality
RoyiRa Mar 21, 2026
e2d0e45
refactor(transformer): remove depth recurrence, keep int6+zstd only
RoyiRa Mar 21, 2026
fb5bbca
docs: v9b int6+zstd saves 4.8MB but 0.034 quant damage (1.2017)
RoyiRa Mar 21, 2026
9d3397e
feat(transformer): add late QAT + 11L + 3x MLP
RoyiRa Mar 21, 2026
728c7ae
feat(transformer): add BigramHash + SmearGate (v11)
RoyiRa Mar 21, 2026
0720a71
docs: BigramHash+SmearGate not helping (+0.006 vs v4), pivot to 3xMLP
RoyiRa Mar 22, 2026
061a6c2
docs: v12 NEW BEST val_bpb=1.1525 (14.1MB) — gap to #1 is 0.010
RoyiRa Mar 22, 2026
218309d
docs: v13 val_bpb=1.1440 — within 0.001 of #1 (1.1428)!
RoyiRa Mar 22, 2026
8d52f5d
feat(transformer): add SWA (stochastic weight averaging)
RoyiRa Mar 22, 2026
40d6bd3
docs: v15 BEATS LEADERBOARD #1! val_bpb=1.1403 (vs 1.1428)
RoyiRa Mar 22, 2026
05781f8
docs: v16 batch=786K gives 1.1398 (marginal over v15's 1.1403)
RoyiRa Mar 22, 2026
6a78860
feat(transformer): add EMA + GPTQ-lite + [email protected] + warmdown=3500
RoyiRa Mar 22, 2026
b9b0bb4
feat(transformer): add XSA, Partial RoPE, LN Scale
RoyiRa Mar 22, 2026
f509b27
fix(transformer): keep EMA on GPU to avoid CPU transfer overhead
RoyiRa Mar 22, 2026
1bef872
docs: v18 XSA+PartialRoPE+LNScale gives 1.1386 (0.015 from SOTA)
RoyiRa Mar 22, 2026
0c52abe
feat(transformer): add orthogonal init + muP output scaling
RoyiRa Mar 22, 2026
caafe82
revert(transformer): remove ortho init — hurts convergence with our c…
RoyiRa Mar 22, 2026
393d474
feat(transformer): tight SWA (0.2) + configurable grad_accum_steps
RoyiRa Mar 22, 2026
0d6806e
revert(transformer): restore fixed grad_accum=8 (accum=4 unstable)
RoyiRa Mar 22, 2026
af66dda
fix(transformer): GQA compat with torch 2.4 (no enable_gqa param)
RoyiRa Mar 22, 2026
14922e2
docs: 8xH100 validated — train 10min + eval 5min within limits
RoyiRa Mar 22, 2026
3415d15
feat(transformer): add FlashAttention 3 support (auto-detect)
RoyiRa Mar 22, 2026
dc2607f
docs: 8xH100 first run 1.1654 — need faster step (132ms vs top 85ms)
RoyiRa Mar 22, 2026
1e2f233
fix(transformer): cast to bf16 for FA3 (requires fp16/bf16 input)
RoyiRa Mar 22, 2026
530e565
docs: FA3 fails on torch 2.4.1 (compile hangs), save 8xH100 reference
RoyiRa Mar 22, 2026
a7a3d2d
fix(transformer): disable DDP optimizer for torch 2.4 compat
RoyiRa Mar 22, 2026
e6e09c8
docs: 8xH100 FA3 gives 1.1573 (5320 steps @ 109ms, +17% from FA3)
RoyiRa Mar 22, 2026
b7a7230
feat(transformer): native GQA + FA3 with torch 2.5+ fallback
RoyiRa Mar 22, 2026
e8ac552
docs: v23 tight SWA(0.2) = 1.1428 (worse than v18 SWA(0.4) = 1.1386)
RoyiRa Mar 22, 2026
8e21e65
fix(transformer): EMA in native dtype (not fp32) to halve memory
RoyiRa Mar 22, 2026
177dc7c
fix(transformer): restore fullgraph=True, remove DDP optimizer hack
RoyiRa Mar 22, 2026
7a7dc8a
docs: tight SWA worse (1.1428 vs v18's 1.1386), torch2.8 slower
RoyiRa Mar 22, 2026
82db72e
feat(transformer): add Value Residual (ResFormer) + revert SWA to 0.4
RoyiRa Mar 22, 2026
8c2a0c6
fix(transformer): remove dynamo.reset before sliding eval, add native…
RoyiRa Mar 22, 2026
94ca478
docs: 8xH100v2 result 1.1676 — EMA overhead costs 1500 steps
RoyiRa Mar 22, 2026
6c93d49
perf(transformer): EMA update every 10 steps (not every step)
RoyiRa Mar 22, 2026
0a536ee
revert(transformer): remove Value Residual — causes instability
RoyiRa Mar 22, 2026
6b4841a
docs: EMA not helping on 8xH100 (1.1689 vs 1.1654 without)
RoyiRa Mar 22, 2026
b5fa01f
feat(transformer): Value Residual v2 — per-layer learned scale (init=…
RoyiRa Mar 22, 2026
1e18bc2
feat(transformer): add DISABLE_COMPILE env var
RoyiRa Mar 22, 2026
446d150
feat(transformer): add Value Embedding (VE) for attention layers
RoyiRa Mar 22, 2026
a13d0cf
fix(transformer): DDP find_unused_parameters=True
RoyiRa Mar 22, 2026
8faa729
docs: Value Residual v2 still unstable, restart v18 seed=42
RoyiRa Mar 22, 2026
36021d5
fix(transformer): remove broken v_residual, use VE properly in forward
RoyiRa Mar 22, 2026
d02a524
refactor(transformer): restore clean v18 code + minimal 8xH100 compat
RoyiRa Mar 22, 2026
5635002
fix(transformer): restore fullgraph=True + add 8xH100 agent briefing
RoyiRa Mar 22, 2026
79983df
fix(transformer): restore exact v18 attention (no branching in compil…
RoyiRa Mar 22, 2026
a34f625
feat(transformer): FA3 attention + Value Embeddings + PR#414 hyperparams
RoyiRa Mar 22, 2026
dae7b2b
fix(transformer): FA3 import fallback + rotary [B,T,H,D] fix + DDP fi…
RoyiRa Mar 22, 2026
14e0973
fix(transformer): DDP find_unused_parameters for VE layer scales
RoyiRa Mar 22, 2026
e974978
feat(transformer): TTT Burst — replay recent batches at low LR before…
RoyiRa Mar 22, 2026
581e7e8
fix(transformer): correct default num_layers=11 mlp_mult=3 (was 10/2)
RoyiRa Mar 22, 2026
a7f3fc7
feat(transformer): orthogonal init + output projection scaling (from …
RoyiRa Mar 22, 2026
bac9fdd
docs(experiments): record all 11L 3xMLP FA3 experiments
RoyiRa Mar 22, 2026
aea4d48
fix(transformer): restore LoRA TTT support in attention + Block forward
RoyiRa Mar 22, 2026
94b3fc4
docs(experiments): record LoRA TTT fix results
RoyiRa Mar 22, 2026
7051cb1
feat(transformer): Star-ReLU activation + curriculum sequence length
RoyiRa Mar 22, 2026
184cd6a
feat(transformer): MLP_HIDDEN override + record 12L experiments
RoyiRa Mar 22, 2026
6485dd9
docs(experiments): 12L architecture search results
RoyiRa Mar 23, 2026
8a2ff48
docs(experiments): curriculum failed, Star-ReLU neutral, 12L search r…
RoyiRa Mar 23, 2026
c98b23b
feat(transformer): EVAL_SEQ_LEN for extended context sliding window eval
RoyiRa Mar 23, 2026
3a1713a
docs(experiments): eval@4096 fails, EMA 0.9985 neutral
RoyiRa Mar 23, 2026
a21df9a
feat(transformer): PR#414 base + TTT Burst overlay
RoyiRa Mar 23, 2026
f038330
docs(experiments): PR#414+TTT Burst matches SOTA at 1.1232!
RoyiRa Mar 23, 2026
d460d6c
docs(experiments): PR#414+burst seed 42 = 1.1226 (mean 1.1229)
RoyiRa Mar 23, 2026
b73ac5e
docs(experiments): bigram sweep results
RoyiRa Mar 23, 2026
4461c3c
feat(experiments): BEATS SOTA! bigram3072 = 1.1225 (SOTA was 1.1232)
RoyiRa Mar 23, 2026
b110fae
feat(experiments): NEW RECORD — 3-seed mean 1.1227 BPB (beats SOTA 1.…
RoyiRa Mar 23, 2026
e45bc1e
docs(experiments): bigram3584 over limit, 3072 confirmed as sweet spot
RoyiRa Mar 23, 2026
2cac538
docs(experiments): complete bigram3072+burst 3-seed results
RoyiRa Mar 23, 2026
8417ca6
docs(experiments): bigram2048 dim192 over limit too
RoyiRa Mar 23, 2026
ef98d92
docs(experiments): seq4096 training worse after quantization, dim192 …
RoyiRa Mar 23, 2026
2a89149
feat(transformer): mixed int5 MLP + int6 attention quantization
RoyiRa Mar 23, 2026
cec4879
revert(transformer): remove int5 MLP quant (too much quality loss)
RoyiRa Mar 23, 2026
977dea6
fix(transformer): 4 critical training fixes from PR#414 diff
RoyiRa Mar 23, 2026
71780a6
docs(experiments): 4 fixes close gap — OUR CODE now at 1.1265
RoyiRa Mar 23, 2026
df59b1f
docs(experiments): bigram 2560 slightly worse (1.1274 vs 2048's 1.1265)
RoyiRa Mar 23, 2026
c57b248
fix(transformer): align GPTQ-lite quantization with PR#414
RoyiRa Mar 23, 2026
fdbb6e1
docs(experiments): GPTQ alignment gives 1.1261 (0.003 from SOTA)
RoyiRa Mar 23, 2026
6ecb773
feat(transformer): looped/recurrent transformer architecture
RoyiRa Mar 23, 2026
12e572e
docs(experiments): 8Lx2 loop fails — step overhead > depth gain
RoyiRa Mar 23, 2026
f2c6211
feat(transformer): MoE MLP with top-k expert routing
RoyiRa Mar 23, 2026
e71b290
docs(experiments): MoE OOM — wrong approach for param-constrained set…
RoyiRa Mar 23, 2026
8e2f073
feat(transformer): focal loss for AdaBoost-inspired hard-token mining
RoyiRa Mar 23, 2026
8660030
docs(experiments): focal loss fails — distorts optimization landscape
RoyiRa Mar 23, 2026
a76c868
feat(transformer): cosine warmdown schedule option
RoyiRa Mar 23, 2026
c130876
docs(experiments): cosine warmdown worse + larger artifact
RoyiRa Mar 23, 2026
554e7ed
feat(transformer): artifact-aware entropy regularization (MDL-inspired)
RoyiRa Mar 23, 2026
4f64017
docs(experiments): entropy reg compresses well but kills step speed
RoyiRa Mar 23, 2026
fed9908
fix(transformer): amortize entropy reg — every 50 steps, 2 random layers
RoyiRa Mar 23, 2026
09f3a16
docs(experiments): amortized entropy reg — neutral effect
RoyiRa Mar 23, 2026
93eb7c9
fix(transformer): STE quantizer symmetric [-31,31] to match GPTQ-lite
RoyiRa Mar 23, 2026
c37f173
docs(experiments): STE symmetric fix neutral, code bloat issue
RoyiRa Mar 23, 2026
1b2f178
docs(experiments): clean run 1.1265 BPB, 16.02MB (22KB over)
RoyiRa Mar 23, 2026
ed81e3e
docs(experiments): SWA helps our code (1.1260 with vs 1.1297 without)
RoyiRa Mar 23, 2026
bfb652a
refactor(transformer): strip dead code for submission (96KB → 74KB)
RoyiRa Mar 23, 2026
5ad4f70
feat(transformer): soft-to-hard quantizer with temperature annealing
RoyiRa Mar 23, 2026
eb02367
docs(experiments): soft quantizer compresses better but too slow
RoyiRa Mar 23, 2026
57cb48a
fix(transformer): soft quantizer only in final 2% of training
RoyiRa Mar 23, 2026
2f62fe5
feat(experiments): late soft quantizer works! 1.1267 BPB, 15.87MB
RoyiRa Mar 23, 2026
e463384
docs(experiments): bigram 2560 still over 16MB with soft quantizer
RoyiRa Mar 23, 2026
e3e5212
docs(experiments): seed 42 artifact slightly over (16.07MB)
RoyiRa Mar 23, 2026
a0b5f55
docs(experiments): bigram 1536 guaranteed safe at 15.85MB
RoyiRa Mar 23, 2026
a5c6612
feat(transformer): submission script with 3 novel contributions
RoyiRa Mar 23, 2026
59b5a5b
feat(experiments): BEATS UNMERGED SOTA! 1.1219 BPB (SOTA was 1.1232)
RoyiRa Mar 23, 2026
87a0160
feat(experiments): 3-SEED VERIFIED — BEATS UNMERGED SOTA!
RoyiRa Mar 23, 2026
b55e086
docs(experiments): bigram 3584 same BPB as 3072, 3072 confirmed optimal
RoyiRa Mar 23, 2026
31c65a2
docs: 1-page writeup + burst 3ep results
RoyiRa Mar 23, 2026
31581f3
feat(transformer): add SWA checkpoint averaging to submission script
RoyiRa Mar 23, 2026
ef4f239
revert(transformer): remove SWA from submission (hurts BPB by 0.001)
RoyiRa Mar 23, 2026
8a4fb33
feat(transformer): residual local predictor (local+global decomposition)
RoyiRa Mar 23, 2026
bfe71da
docs(experiments): local predictor doesn't help (BigramHash already c…
RoyiRa Mar 23, 2026
0e51139
revert(transformer): restore best submission script (commit ef4f239)
RoyiRa Mar 23, 2026
9b7dde0
docs: save 3-seed training logs for submission
RoyiRa Mar 23, 2026
4573b31
docs(experiments): VE 3 layers = SOTA (1.1232), not better than VE 2 …
RoyiRa Mar 23, 2026
be730b1
feat(transformer): full-weight AdamW TTT on validation data
RoyiRa Mar 23, 2026
f89483a
fix(transformer): disable QAT during TTT to prevent model corruption
RoyiRa Mar 23, 2026
6c38f8c
feat(experiments): TTT WORKS! 1.1101 BPP — MASSIVE improvement!
RoyiRa Mar 23, 2026
9fc24d1
docs(experiments): TTT scaling 3/10/20 epochs with caveat
RoyiRa Mar 23, 2026
7f8fcf3
feat(transformer): backward-looking TTT (score first, then train)
RoyiRa Mar 23, 2026
11f4071
docs(experiments): backward-looking TTT 1.1260 + invalid approaches n…
RoyiRa Mar 23, 2026
1ee6074
feat(transformer): single-pass online TTT (truly backward-looking)
RoyiRa Mar 23, 2026
b11ce1a
feat(transformer): per-document backward-looking TTT (legal)
RoyiRa Mar 23, 2026
3dde1db
feat(transformer): batched per-document LoRA TTT
RoyiRa Mar 23, 2026
8b6bbba
feat(transformer): eval-only TTT script for fast iteration
RoyiRa Mar 23, 2026
60c600a
fix(transformer): LoRA uses hidden states, not input embeddings
RoyiRa Mar 23, 2026
ab150a1
feat(transformer): full Q/V + lm_head LoRA TTT (matching PR#512 pattern)
RoyiRa Mar 23, 2026
d9d53dc
docs: multi-epoch TTT is invalid (trains before scoring)
RoyiRa Mar 23, 2026
aeb9c78
fix(transformer): chunk-major TTT loop for legal backward-looking eval
RoyiRa Mar 23, 2026
4185dd6
feat(transformer): sliding window TTT — full-param online adaptation
RoyiRa Mar 23, 2026
5e29fe5
feat(transformer): PR#549-style chunk-major TTT (SGD, cosine decay, 3…
RoyiRa Mar 23, 2026
f7507f1
feat(transformer): integrate sliding window TTT into submission pipeline
RoyiRa Mar 23, 2026
e854eb2
docs(experiments): TTT results — 1.1195 BPB on full submission (seed …
RoyiRa Mar 23, 2026
f5fcf20
feat(transformer): LeakyReLU(0.5)^2 + AdamW TTT
RoyiRa Mar 23, 2026
8300e82
fix(transformer): revert AdamW TTT to SGD — AdamW lr=5e-4 was catastr…
RoyiRa Mar 23, 2026
85ab9e2
feat(transformer): add per-document LoRA TTT (PR#548 recipe)
RoyiRa Mar 24, 2026
de65267
feat(transformer): tune LoRA TTT defaults for 1.08 BPB
RoyiRa Mar 24, 2026
9f2e349
docs(experiments): LoRA TTT 1.0724 BPB — new record, breaks 1.1 barrier
RoyiRa Mar 24, 2026
b5fa2e2
perf(transformer): skip sliding window when LoRA TTT active
RoyiRa Mar 24, 2026
4276697
docs(experiments): 3-seed validation — mean 1.0732 BPB breaks 1.1 bar…
RoyiRa Mar 24, 2026
0a797e4
perf(transformer): skip roundtrip eval in LoRA TTT mode
RoyiRa Mar 24, 2026
c0df01c
fix(transformer): round-robin doc distribution for balanced GPU load
RoyiRa Mar 24, 2026
19cc7c2
fix(transformer): balanced doc distribution — sort globally, deal alt…
RoyiRa Mar 24, 2026
1528108
perf(transformer): add TTT_MAX_DOC_LEN cap, default min_doc=512
RoyiRa Mar 24, 2026
a8c10c5
feat(transformer): add hyper-connections (learned residual mixture)
RoyiRa Mar 24, 2026
3891d52
fix(transformer): simplify hyper-connections for torch.compile compat
RoyiRa Mar 24, 2026
d335cf9
fix(transformer): fix DDP unused parameter error in hyper-connections
RoyiRa Mar 24, 2026
274d7dd
fix(transformer): pass hyper_k to eval and TTT model constructors
RoyiRa Mar 24, 2026
f7eff92
docs(experiments): hyper-connections top-4 gives -0.003 BPB signal
RoyiRa Mar 24, 2026
b9c4106
feat(transformer): add int5 GPTQ quantization with Hessian error comp…
RoyiRa Mar 24, 2026
97adbf9
fix(transformer): align QAT clip range with GPTQ quantization level
RoyiRa Mar 24, 2026
d741ef1
docs(experiments): 12L + GPTQ results — int5 damage too high
RoyiRa Mar 24, 2026
e915fc7
feat(transformer): add AdamW option for legal TTT (TTT_OPTIMIZER=adamw)
RoyiRa Mar 24, 2026
affa5c7
feat(transformer): add structure distillation (teacher→student KL)
RoyiRa Mar 24, 2026
dfa6003
fix(transformer): use fullgraph=False when hyper_k > 0
RoyiRa Mar 24, 2026
5981b7b
feat(mixer): 5-expert Hedge mixer on PR#606 — 1.0902 BPB
RoyiRa Mar 24, 2026
d3f7a01
docs(experiments): log PR#606 mixer results
RoyiRa Mar 24, 2026
b5f8bc2
perf(mixer): optimize 5-expert Hedge mixer — 1573s → 562s eval
RoyiRa Mar 24, 2026
d400418
feat(submission): 5-expert Hedge mixer + TTT submission 2026-03-24
RoyiRa Mar 24, 2026
838fbcc
perf(submission): tune hyperparams for <16MB artifact — 1.0745 mean BPB
RoyiRa Mar 25, 2026
264a1b9
docs(submission): add README and finalize submission.json for 2026-03-24
RoyiRa Mar 25, 2026
037a239
docs(submission): add seed logs and refine README
RoyiRa Mar 25, 2026
46 changes: 46 additions & 0 deletions 8xh100_AGENT_BRIEF.md
@@ -0,0 +1,46 @@
# 8xH100 Agent Briefing — Parameter Golf Competition

## Task
Improve training speed on 8xH100 to reach competitive step times (~85 ms/step) and bring val_bpb below 1.12.

## Competition
- **Repository**: https://github.com/openai/parameter-golf
- **Goal**: Train best LM in 16MB artifact, 10 min on 8xH100, evaluated by BPB on FineWeb
- **Current SOTA**: 1.1233 (PR #414), 1.1428 (merged leaderboard #1)
- **Our best**: 1.1386 (1xH100 80min), 1.1573 (8xH100 torch 2.4 + FA3)
- **Issue #140**: https://github.com/openai/parameter-golf/issues/140 — live leaderboard tracking
- **Top PRs to study**: #414 (1.1233), #315 (1.1248), #287 (1.1280)

## Our Training Script
- **Location**: `parameter-golf/transformer/train.py` — single-file training script
- **Architecture**: 11L transformer, 512-dim, 8/4 GQA heads, 3x MLP, U-Net skips
- **Key techniques**: XSA (last 4 layers), Partial RoPE (16/64), LN Scale, EMA, SWA, Late QAT, GPTQ-lite, int6+zstd, sliding window eval
- **Runs with**: `torchrun --standalone --nproc_per_node=8 transformer/train.py`

## Known Issues on 8xH100
1. **torch 2.4 (old RunPod)**: FA3 works, 109ms/step, but `enable_gqa` not available (uses slow repeat_interleave). Still best result.
2. **torch 2.8 (new RunPod)**: Native GQA available but torch.compile takes 2+ min for warmup, and DDP optimizer has issues. fullgraph=True causes process count explosion (273 python procs). Step time 143ms even after warmup.
3. **FA3 + torch.compile**: flash_attn_func may not trace well under torch.compile. The top submissions compile around FA3 or exclude it from the graph.
4. **GQA fallback**: We use try/except resolved at import time (_HAS_NATIVE_GQA flag), but the repeat_interleave fallback on torch 2.4 adds ~23ms/step.
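
A minimal sketch of that import-time probe and fallback (the helper name `attend` and the exact probe call are illustrative assumptions, not the actual `train.py` code):

```python
import torch
import torch.nn.functional as F

# Probe once at import time: torch 2.5+ accepts enable_gqa, torch 2.4 raises TypeError.
try:
    _probe = torch.zeros(1, 1, 1, 8)
    F.scaled_dot_product_attention(_probe, _probe, _probe, enable_gqa=True)
    _HAS_NATIVE_GQA = True
except TypeError:
    _HAS_NATIVE_GQA = False

def attend(q, k, v, n_rep):
    # q: [B, Hq, T, D]; k, v: [B, Hkv, T, D] with Hq = Hkv * n_rep
    if _HAS_NATIVE_GQA:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
    # torch 2.4 fallback: materialize the repeated KV heads (~23 ms/step overhead)
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```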

## What the Top Submission (#414) Does Differently
- **torch version**: Likely 2.5-2.6 (has enable_gqa + fast compile)
- **FlashAttention 3**: Direct `flash_attn_func` calls, not through torch.compile
- **Step time**: 85ms/step on 8xH100 (vs our 109-143ms)
- **Compile strategy**: May use `torch.compile` with `mode="reduce-overhead"` or exclude attention

## Target
- Get step time to ~85ms on 8xH100 in 10 min
- This alone would give ~7000 steps (vs our 4500-5300)
- Expected improvement: ~0.01 bpb from more training steps
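
As a rough check on the step counts above (ignoring compile warmup and eval): 600 s ÷ 85 ms ≈ 7,050 steps, versus 600 s ÷ 109 ms ≈ 5,500 and 600 s ÷ 132 ms ≈ 4,550 at our current step times.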

## Environment
- **SSH config**: `gcp-single-h100` for 1xH100, RunPod for 8xH100
- **Data**: `data/datasets/fineweb10B_sp1024/` (80 shards + val)
- **Tokenizer**: `data/tokenizers/fineweb_1024_bpe.model` (vocab 1024)
- **Experiments log**: `experiments.csv`

## Key Files to Read
1. `transformer/train.py` — our training script
2. `experiments.csv` — all experiment results
3. Top submission code: `git fetch upstream 'pull/414/head:pr-414'` then `git show pr-414:records/track_10min_16mb/2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233/train_gpt.py`
34 changes: 34 additions & 0 deletions WRITEUP.md
@@ -0,0 +1,34 @@
# Parameter Golf Submission — val_bpb 1.1224 (3-seed mean)

## Result
**3-seed mean: 1.1224 BPB** | Artifact: 15.6-15.9MB | Train: 600s on 8xH100 | Eval: 74s

## What We Tried (40+ experiments)

**What worked:**
- FA3 Hopper attention (76ms/step, +47% more training steps)
- Fixing 4 training bugs found by deep-diffing against PR#414 (dead bigram weights in optimizer, Muon weight decay order, STE quantizer range mismatch, YaRN RoPE frequency extension)
- BigramHash vocab 3072 (optimized for 16MB budget — 2048 too small, 4096 too big)
- TTT Burst: replaying the last 100 training batches at 10% LR before EMA finalization (see the sketch after this list)
- **Soft-to-hard quantizer with late temperature annealing** (novel, described below)
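
A minimal sketch of the TTT Burst step above, assuming a buffer of recent (inputs, targets) training batches is retained; the function and buffer names are illustrative, not the actual `train.py` code:

```python
import torch
import torch.nn.functional as F

def ttt_burst(model, optimizer, replay_buffer, base_lr, num_batches=100, lr_frac=0.10):
    # Replay the most recent training batches at a reduced LR just before
    # the EMA/SWA weights are finalized.
    for group in optimizer.param_groups:
        group["lr"] = base_lr * lr_frac                    # 10% of the late-training LR
    model.train()
    for inputs, targets in replay_buffer[-num_batches:]:   # last ~100 batches, oldest first
        optimizer.zero_grad(set_to_none=True)
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        optimizer.step()
```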

**What failed (and why):**
- Looped transformer 8Lx2 (+40% step cost kills training budget)
- MoE with 8 experts (8x params — wrong tradeoff for parameter-constrained setting)
- Focal loss (distorts CE objective; model gets overconfident on easy tokens)
- Entropy regularization on weights (great compression 13.3MB! but 2.5x slower per step)
- Cosine warmdown (worse compression AND worse quality)
- Curriculum seq length 1024->2048 (massive quantization damage)
- 12L architecture (doesn't fit 16MB with 3x MLP)
- int5 MLP quantization (+0.035 BPB damage — too aggressive)
- Star-ReLU, orthogonal init, eval at 4096 — all neutral

## Novel Contribution: Soft-to-Hard Quantizer

**The idea:** Replace hard STE rounding in QAT with temperature-controlled soft rounding. During the final 2% of training (scale < 0.02), the quantizer switches from hard round to sigmoid-interpolated soft round. This gives weight gradients a differentiable signal toward the nearest quantization grid point, nudging weights to "snap" to int6 levels right before EMA/SWA finalizes them.

**Why it works:** Standard STE provides no gradient information about quantization bin assignment, since round() has zero derivative almost everywhere. By using `sigmoid((frac - 0.5) / tau)` as a soft surrogate in the backward pass, the optimizer receives non-zero gradients that push weights toward grid centers. Applying this only in the final phase (tau=0.1) avoids slowing down early training while capturing the compression benefit when it matters most.
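
A minimal sketch of this soft-to-hard rounding, assuming a per-tensor scale and the symmetric int6 range [-31, 31] mentioned in the commit log (the function name and interface are illustrative, not the actual `train.py` code):

```python
import torch

def soft_to_hard_quantize(w, scale, tau=0.1, hard=True):
    # w: weight tensor; scale: quantization step for the int6 grid.
    # hard=True  -> plain STE (early training).
    # hard=False -> final ~2%: forward still rounds, backward sees the sigmoid surrogate.
    x = w / scale
    if hard:
        # Straight-through estimator: hard round forward, identity gradient backward.
        q = x + (torch.round(x) - x).detach()
    else:
        floor = torch.floor(x)
        frac = x - floor
        # Soft surrogate from the writeup: sigmoid((frac - 0.5) / tau).
        soft = floor + torch.sigmoid((frac - 0.5) / tau)
        # Forward value stays hard; the backward pass differentiates through `soft`.
        q = soft + (torch.round(x) - soft).detach()
    q = torch.clamp(q, -31, 31)  # symmetric int6 range (assumed from the STE fix commit)
    return q * scale
```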

**Evidence:** Full soft quantizer (every step) compresses to 15.8MB (vs 16.0MB baseline) but costs 14% step overhead. Late-only application (last 2%) achieves the same compression improvement at zero overhead. Combined with bigram 3072 and TTT Burst, the submission achieves 1.1224 mean BPB — beating the prior SOTA of 1.1232.

**Connection to literature:** This is a lightweight instance of the Differentiable Soft Quantization (DSQ) and soft-to-hard vector quantization family, adapted for the parameter golf setting where training budget is tight and the target is a compressed artifact.