# Record: SP8192 + 3-Epoch Parallel Pre-Quant TTT + Huber WD Muon (SDPA-friendly)

**val_bpb 1.07037** (3-seed mean, std 0.00027) on the 10 min / 16 MB track.

## Summary

Improves on the merged SOTA (PR #1493, 1.0810) by **0.01063 BPB** on the 3-seed mean.

This submission adapts the parallel pre-quant AdamW TTT stack (PR #1735, @AjAnubolu; PR #1738, @alertcat) to environments without FlashAttention-3, specifically a torch 2.11+cu130 stack that has no compatible FA3 wheel and no nvcc available for source builds. On such a stack SDPA is the only available attention backend, and TTT epochs run ~4× slower than on the FA3 reference. The original 21-epoch schedule blows the 600 s eval budget in that regime; this PR rebalances the schedule to fit.
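
For concreteness, the fallback attention path on such a stack is just the stock PyTorch SDPA call. A minimal sketch; the tensor layout is an assumption for illustration, not this stack's exact code:

```python
import torch.nn.functional as F

def sdpa_attn(q, k, v):
    # q, k, v assumed (batch, heads, seq, head_dim); strictly causal mask.
    # This is the only backend left when no FA3 wheel can be installed.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```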

The three concrete changes, each small and defensible in isolation:

1. **3-epoch pre-quant TTT with epoch-level cosine (1e-3 → 1e-4, no warm restart).**
A dedicated ablation (seed 1337) showed that with SDPA's ~96 s warmup and ~19 s/epoch cost, the budget fits only 3 full TTT epochs. A 4-epoch cosine-warm-restart variant (cycle 1 = 3 ep, cycle 2 = 1 ep) was tried first and regressed from 1.0701 (3-epoch) to 1.0727 (4-epoch): the restart LR jolt hurt because the follow-on cycle was too short to re-converge. The final schedule is plain `CosineAnnealingLR(T_max=3, eta_min=1e-4)`; see the first sketch after this list.
2. **Odd-epoch-only diagnostic eval + runtime budget guard.** Diagnostic `eval_val` calls after every epoch cost ~12 s on SDPA. We run them on epochs 1, 3, 5, … and always on the final epoch; a budget guard breaks TTT early if `elapsed + 150s × remaining_epochs > 600s`. Under the 3-epoch schedule the guard never triggers, but it protects long-tail variance on slower runs (second sketch below).
3. **Huber weight decay in the main-train Muon optimizer.** Replaces Muon's decoupled L2 decay `p ← p · (1 − lr·wd)` with a Huber variant: L2 for `|w| < δ`, L1 above, with `δ = 3/√(fan_in)` (third sketch below). The intent is to suppress the outlier weights that cause int6 GPTQ clipping loss without over-penalizing typical weights. The contribution to final BPB is small (within noise of the 3-epoch TTT change).
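
A minimal sketch of change 1's schedule; the parameter list is a stand-in (the real stack runs AdamW over the TTT-adapted weights):

```python
import torch

# Stand-in parameters; the real stack optimizes the TTT-adapted weights.
params = [torch.nn.Parameter(torch.randn(8, 8))]
opt = torch.optim.AdamW(params, lr=1e-3)
# Plain epoch-level cosine, 1e-3 -> 1e-4 over 3 epochs, no warm restart.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=3, eta_min=1e-4)

for epoch in range(1, 4):
    lr = opt.param_groups[0]["lr"]
    print(f"epoch {epoch}/3  lr={lr:.2e}")
    # ... one TTT epoch over the held-out tokens goes here ...
    opt.step()    # placeholder step so the scheduler can advance cleanly
    sched.step()
# Prints 1.00e-03, 7.75e-04, 3.25e-04: the LR column of the per-epoch
# trajectory table below, rounded there to 1.0e-3, 7.8e-4, 3.3e-4.
```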
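
The odd-epoch eval policy and the budget guard from change 2, as a sketch; `run_ttt_epoch` and `eval_val` are hypothetical stand-ins for the script's actual functions:

```python
import time

TTT_EPOCHS = 3
BUDGET_S = 600.0       # eval-phase wall-clock cap
WORST_EPOCH_S = 150.0  # pessimistic per-epoch bound used by the guard

def run_ttt_epoch():   # hypothetical stand-in: one TTT pass over held-out tokens
    time.sleep(0.01)

def eval_val():        # hypothetical stand-in for the ~12 s diagnostic eval
    pass

start = time.monotonic()
for epoch in range(1, TTT_EPOCHS + 1):
    run_ttt_epoch()
    remaining = TTT_EPOCHS - epoch
    # Break early if finishing the remaining epochs could blow the cap.
    if time.monotonic() - start + WORST_EPOCH_S * remaining > BUDGET_S:
        break
    # Diagnostic eval: odd epochs only, and always the final epoch.
    if epoch % 2 == 1 or epoch == TTT_EPOCHS:
        eval_val()
```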
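
And a sketch of the Huber decay step from change 3, replacing Muon's `p ← p · (1 − lr·wd)`; treating the last weight dimension as fan-in is an assumption for illustration:

```python
import math
import torch

def huber_decay_(p: torch.Tensor, lr: float, wd: float) -> None:
    # delta = 3 / sqrt(fan_in); fan_in taken here as the last dim (e.g. a
    # Linear weight of shape (out_features, in_features)).
    delta = 3.0 / math.sqrt(p.shape[-1])
    # Gradient of the Huber penalty: w inside |w| < delta (plain L2 decay),
    # delta * sign(w) outside (constant-magnitude L1 decay).
    step = torch.where(p.abs() < delta, p, delta * torch.sign(p))
    p.sub_(step, alpha=lr * wd)  # decoupled: applied outside the Muon update
```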

Everything else is inherited verbatim from the PR #1493 stack (SP8192 + CaseOps tokenizer + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + EMA + GPTQ SDClip + Brotli).

## 3-seed results (8× H100 80GB SXM, 10-min train / 10-min eval)

| Seed | Pre-quant post-EMA | Post-TTT pre-quant | Quantized | **Sliding BPB** | Artifact (bytes) |
|------|-------------------:|-------------------:|----------:|----------------:|---------:|
| 1337 | 1.08893 | 1.07552 | 1.08762 | **1.07013** | 15,857,678 |
| 42 | 1.08872 | 1.07502 | 1.08828 | **1.07065** | 15,858,437 |
| 2025 | 1.08893 | 1.07529 | 1.08778 | **1.07033** | 15,862,994 |
| **Mean** | **1.08886** | **1.07528** | **1.08789** | **1.07037** | **15,859,703** |
| **Std** | 0.00010 | 0.00021 | 0.00028 | **0.00027** | — |

Artifact margin: worst case 137,006 bytes under the 16 MB cap (16,000,000 − 15,862,994 bytes for seed 2025). Training uses 588 s of the 600 s cap on all seeds; SDPA eval takes ~300 s total.

## Per-epoch TTT trajectory (seed 1337)

| Epoch | LR | val_bpb |
|-------|---:|--------:|
| 1/3 | 1.0e-3 | 1.09388 |
| 2/3 | 7.8e-4 | skipped |
| 3/3 | 3.3e-4 | 1.07589 |

The epoch-1 eval reads high because the LR is still at its peak; the loss floor at epoch 3 (1.07589) is what feeds the quantization step that follows.

## Compliance (Issue #1017 Track A)

- ✅ **Fixed predictor**: scored artifact is int6-GPTQ + brotli, no eval-time adaptation
- ✅ **No SLOT, no RLS, no n-gram cache, no ETLB, no pre-quant TTT leakage** (TTT uses only legal held-out tokens, federated-averaged across ranks)
- ✅ **Sliding-window eval**: strictly causal, stride 64, single pass
- ✅ **Normalized softmax distribution**
- ✅ **CaseOps byte sidecar** for honest BPB accounting (Title/AllCaps/CapNext control symbols don't inflate byte counts)
- ✅ **Train < 600s** (588s), **Eval < 600s**, **Artifact < 16MB** (all three seeds)
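
For reference, a minimal sketch of what a strictly causal, stride-64, single-pass sliding-window eval looks like; the window size, model interface, and byte accounting are assumptions, not the harness's exact code:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, total_bytes, window=2048, stride=64):
    # Each target token is scored exactly once (single pass), by the
    # rightmost window containing it, so causal context is maximal.
    # A tail shorter than one stride is ignored in this sketch.
    nll_nats = 0.0
    for end in range(stride, tokens.numel(), stride):
        lo = max(0, end - window)
        chunk = tokens[lo:end + 1]                  # causal context + targets
        logits = model(chunk[:-1].unsqueeze(0))[0]  # (len(chunk)-1, vocab)
        logp = F.log_softmax(logits[-stride:], dim=-1)  # normalized softmax
        targets = chunk[-stride:].unsqueeze(1)
        nll_nats -= logp.gather(1, targets).sum().item()
    return nll_nats / math.log(2) / total_bytes     # bits per byte
```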

## Relationship to pending PRs

PR #1735 (@AjAnubolu, 1.0429), PR #1738 (@alertcat, 1.0354 with CaseOps), and the kilojoules follow-up (1.0284 with LR=1e-3/freeze=0) all use FA3 and run 21 epochs of pre-quant TTT. On FA3-less hardware those scores are not reachable; this submission reconstructs the best TTT schedule that *is* reachable there, and separately adds Huber-Muon WD.

If any of those PRs merge first and become the new record baseline, this PR should be rebased or withdrawn — it does not claim improvement over them.

## Reproduction

```bash
# Data + tokenizer (PR #1729, CaseOps-v1)
MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
python3 cached_challenge_fineweb.py \
--variant sp8192_lossless_caps_caseops_v1_reserved \
--train-shards 80

# Run 3 seeds (8×H100 SXM)
for SEED in 1337 42 2025; do
  SEED=$SEED DATA_DIR=/path/to/data_caseops \
    torchrun --standalone --nproc_per_node=8 train_gpt.py \
    2>&1 | tee "train_seed${SEED}.log"
done
```

Environment: PyTorch 2.11.0+cu130, no FA3 (the script falls back to SDPA). A reproduction on PyTorch 2.9.1+cu128 with FA3 would finish faster but should land at the same BPB to within ~0.001.

## Attribution

- @clarkkev (PR #1394) — SP8192 + GPTQ SDClip + Brotli
- @dexhunter (PR #1331, #1437) — 3-layer depth recurrence
- @Robby955 (PR #1412) — Parallel residuals
- @bigbag (PR #1493) — QK-Gain 5.25 + Legal Score-First TTT stack (merged SOTA baseline)
- @stukenov (PR #1364) — Pre-quant AdamW TTT concept
- @AjAnubolu (PR #1735) — 8-GPU parallel pre-quant AdamW TTT
- @romeerp (PR #1729), @alertcat (PR #1738) — CaseOps lossless-case tokenizer + byte sidecar
- kilojoules (unmerged follow-up on PR #1738) — reference for LR=1e-3 / freeze_blocks=0 TTT defaults

This PR's contribution: schedule and eval-budget rebalancing for FA3-less stacks, plus a Huber-WD variant for Muon.
## Record metadata (JSON)

```json
{
  "author": "davie2009kh",
  "github_id": "davie2009kh",
  "name": "SP8192 + 3-Epoch Parallel Pre-Quant TTT + Huber WD Muon (SDPA-friendly)",
  "date": "2026-04-24",
  "track": "10min_16mb",
  "val_bpb": 1.07037,
  "val_bpb_std": 0.00027,
  "seeds": [1337, 42, 2025],
  "seed_results": {
    "1337": {"val_bpb": 1.07013, "artifact_bytes": 15857678},
    "42": {"val_bpb": 1.07065, "artifact_bytes": 15858437},
    "2025": {"val_bpb": 1.07033, "artifact_bytes": 15862994}
  },
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.11.0+cu130",
  "attention_backend": "torch.nn.functional.scaled_dot_product_attention (SDPA)",
  "technique_summary": "SP8192 + CaseOps tokenizer + 3-Layer Depth Recurrence (L3-5) + Parallel Residuals (L7+) + QK-Gain 5.25 + EMA 0.9965 + Muon WD 0.095 (Huber variant) + 3-epoch 8-GPU Parallel Pre-Quant AdamW TTT (LR 1e-3, epoch-level cosine 1e-3 -> 1e-4, freeze_blocks=0, odd-epoch diagnostic eval) + GPTQ SDClip + Brotli",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": true,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt_leakage": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "legal_ttt_only": true,
    "three_seeds": true,
    "fixed_predictor": true
  },
  "attribution": {
    "sp8192_gptq_sdclip_brotli": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "parallel_residuals": "@Robby955 (PR #1412)",
    "qk_gain_5.25_and_legal_ttt_stack": "@bigbag (PR #1493)",
    "pre_quant_adamw_ttt_concept": "@stukenov (PR #1364)",
    "parallel_pre_quant_ttt_8gpu": "@AjAnubolu (PR #1735)",
    "caseops_tokenizer_and_byte_sidecar": "@romeerp (PR #1729), @alertcat (PR #1738)",
    "ttt_lr_tuning_reference": "kilojoules (unmerged PR on #1738, lr=1e-3 freeze=0 defaults)"
  }
}
```