openai · liujshi · Apr 22, 2026
diff --git a/...min_16mb/2026-04-22_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_Vgated/README.md b/...min_16mb/2026-04-22_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_Vgated/README.md
@@ -0,0 +1,56 @@
+# Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + V-Gated
+
+**val_bpb = 1.0796** (3-seed mean, std 0.00025) | **~15.99 MB** | 8xH100 SXM
+
+## 3-Seed Results
+
+| Seed | Sliding BPB | **TTT BPB** | Artifact (bytes) |
+|------|-------------|-------------|------------------|
+| 42   | 1.08112468  | **1.07985553** | 15985814 |
+| 314  | 1.08057225  | **1.07927887** | 15983675 |
+| 999  | 1.08110726  | **1.07973035** | 15986648 |
+| **Mean** | **1.0809** | **1.0796** | **15985379** |
+| **Std** | **0.00025** | **0.00025** | |
+
+## Key Techniques
+
+1. Based on **SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT** (PR #1493 by @bigbag).
+2. Added a learnable **final norm scale** and a **Smear gate** to make representations smoother and slightly more compression-friendly, reducing the compressed artifact size by about **40 KB** and freeing space for additional parameters.
+3. Added a **per-head V-Gate**, where the V projection jointly determines both **what information is fed into attention** and **how much each head contributes to the output**. This significantly improved model performance.
+4. Improved the quantized compression pipeline with **per-matrix automatic layout selection**, giving a small further reduction in final artifact size of about **10 KB**.
+5. Ran an extensive hyperparameter search; the selected settings include `MUON_BACKEND_STEPS=4` and `TTT_LR=0.01`, which made training more stable and improved reproducibility.
+
+## Architecture
+
+11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4], decoder [5,3,4,5,6,7,8,9,10] (loops layers 3-5, activated at step ~2016). Parallel residuals from layer 7: attention and MLP operate on the same pre-residual input. Skip gates (sigmoid-gated U-Net connections). Final norm scale, Smear gate, and V-Gate.
+
+## Reproduction
+
+```bash
+pip install brotli sentencepiece
+pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
+
+SEED=314 QK_GAIN_INIT=5.25 TTT_ENABLED=1 TTT_LR=0.01 TTT_EPOCHS=3 \
+MLP_MULT=4 MUON_BACKEND_STEPS=4 torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Credits
+
+- **@bigbag** — SP8192 + 3-Layer Recurrence + Parallel Residuals + QK_GAIN_INIT=5.25 (PR #1493)
+- **@clarkkev** — SP8192 + GPTQ Embeddings + SDClip + MuonEq-R + depth recurrence (PR #1394)
+- **@dexhunter** — 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
+- **@abaybektursun** — Score-first TTT framework (PR #549, merged precedent)
+- **@Robby955** — Parallel residuals on SP8192 (PR #1412)
+- **@msisovic** — Parallel residuals concept (PR #1204)
+- **@X-Abhishek-X** — Hyperparameter tuning: WD=0.095, MLR=0.022, EMA=0.9965 (PR #1445, #1471)
+- **@kellerjordan** -- SmearGate concept (originally from modded-nanogpt)
+
+## Included Files
+
+- `README.md` (this file)
+- `submission.json`
+- `train_gpt.py`
+- `train_seed42.log`
+- `train_seed314.log`
+- `train_seed999.log`
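
Note on the depth-recurrence schedule in the README's Architecture section: the encoder and decoder index lists can be reproduced by repeating layers 3-5 and splitting the resulting order at its midpoint. The sketch below is not taken from `train_gpt.py`; the helper `build_schedule`, its parameter names, the choice of three passes over the looped layers, and the midpoint split are all assumptions inferred from the listed indices.

```python
def build_schedule(num_layers=11, loop_start=3, loop_end=5, passes=3):
    """Return (encoder, decoder) lists of physical layer indices.

    Hypothetical helper: layers loop_start..loop_end are visited `passes`
    times in a row, and the flattened order is split at its midpoint.
    """
    loop = list(range(loop_start, loop_end + 1))
    order = (
        list(range(loop_start))                    # layers before the loop
        + loop * passes                            # looped layers, repeated
        + list(range(loop_end + 1, num_layers))    # layers after the loop
    )
    split = len(order) // 2                        # assumed encoder/decoder boundary
    return order[:split], order[split:]


encoder, decoder = build_schedule()
print(encoder)  # [0, 1, 2, 3, 4, 5, 3, 4]
print(decoder)  # [5, 3, 4, 5, 6, 7, 8, 9, 10]
```

With `passes=1` the same function returns the plain 11-layer order, which is one way the non-recurrent configuration could be expressed in this sketch.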
diff --git a/...k_10min_16mb/2026-04-22_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_Vgated/submission.json b/...k_10min_16mb/2026-04-22_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_Vgated/submission.json
@@ -0,0 +1,30 @@
+{
+  "author": "liujshi",
+  "github_id": "liujshi",
+  "name": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal Score-First TTT + V Gate",
+  "date": "2026-04-22",
+  "track": "10min_16mb",
+  "val_bpb": 1.0796,
+  "val_bpb_std": 0.00025,
+  "seeds": [42, 314, 999],
+  "seed_results": {
+    "42": {"val_bpb": 1.07985553, "artifact_bytes": 15985814},
+    "314": {"val_bpb": 1.07927887, "artifact_bytes": 15983675},
+    "999": {"val_bpb": 1.07973035, "artifact_bytes": 15986648}
+  },
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "technique_summary": "SP8192 + 3-Layer Depth Recurrence (L3-5) + Parallel Residuals (L7+) + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Score-First TTT (SGD 3ep) + GPTQ SDClip + Brotli + V Gate + Smear",
+  "compliance": {
+    "train_under_600s": true,
+    "artifact_under_16mb": true,
+    "eval_under_600s": true,
+    "no_slot": true,
+    "no_pre_quant_ttt": true,
+    "no_etlb": true,
+    "no_ngram_cache": true,
+    "score_first_ttt": true,
+    "three_seeds": true
+  },
+  "based_on": "PR #1493 by @bigbag"
+}
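
Note on the summary statistics: the reported `val_bpb`, `val_bpb_std`, and mean artifact size can be recomputed from the per-seed values in `submission.json`. The sketch below is a sanity check under two assumptions: that the reported standard deviation is the population form (it reproduces 0.00025, while the sample form gives roughly 0.00030), and that the 16 MB limit means 16,000,000 bytes. The variable names are illustrative only.

```python
import statistics

# Per-seed values copied from submission.json.
seed_results = {
    42: {"val_bpb": 1.07985553, "artifact_bytes": 15985814},
    314: {"val_bpb": 1.07927887, "artifact_bytes": 15983675},
    999: {"val_bpb": 1.07973035, "artifact_bytes": 15986648},
}

bpb = [r["val_bpb"] for r in seed_results.values()]
sizes = [r["artifact_bytes"] for r in seed_results.values()]

mean_bpb = statistics.fmean(bpb)
std_bpb = statistics.pstdev(bpb)         # population std (assumed convention)
sample_std_bpb = statistics.stdev(bpb)   # sample std, shown for comparison
mean_bytes = sum(sizes) // len(sizes)

LIMIT_BYTES = 16_000_000                 # assumption: decimal megabytes
all_under_limit = all(s < LIMIT_BYTES for s in sizes)

print(f"mean val_bpb   = {mean_bpb:.4f}")
print(f"population std = {std_bpb:.5f}")
print(f"sample std     = {sample_std_bpb:.5f}")
print(f"mean artifact  = {mean_bytes} bytes, under limit: {all_under_limit}")
```

The artifacts also stay under the limit if it is read as 16 MiB (16,777,216 bytes), so the compliance flag does not depend on that assumption.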