openai · wisebreadloaf · Apr 21, 2026
diff --git a/...2026-04-20_SP8192_Progressive3LayerRecurrence_ParResid_QK525_LegalTTT/README.md b/...2026-04-20_SP8192_Progressive3LayerRecurrence_ParResid_QK525_LegalTTT/README.md
@@ -0,0 +1,113 @@
+# Record: SP8192 + Progressive 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT
+
+**val_bpb = 1.0806** (3-seed mean, std 0.0011) | **~15.99 MB** | **8xH100 SXM**
+
+## 3-Seed Results
+
+| Seed | Sliding BPB | **TTT BPB** | Artifact (bytes) |
+|------|-------------|-------------|----------|
+| 42   | 1.0818      | **1.0805**  | 15,992,388 |
+| 999  | 1.0831      | **1.0818**  | 15,996,018 |
+| 1337 | 1.0810      | **1.0796**  | 15,989,841 |
+| **Mean** | **1.0820** | **1.0806** | **15,992,749** |
+| **Std** | **0.0009** | **0.0011** | |
+
+Current merged record in repo: **1.0810 BPB**. Delta: **-0.0004 BPB**.
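The summary row and the delta can be rechecked from the per-seed numbers. The sketch below is only an arithmetic check on the rounded table values; using the sample standard deviation (n - 1) is an assumption, chosen because it matches the reported TTT std.

```python
from statistics import mean, stdev

# Per-seed values copied from the results table above.
ttt_scores = {42: 1.0805, 999: 1.0818, 1337: 1.0796}
artifact_bytes = {42: 15_992_388, 999: 15_996_018, 1337: 15_989_841}
MERGED_RECORD = 1.0810   # current merged record quoted above
SIZE_LIMIT = 16_000_000  # byte cap quoted in the Quantization section

ttt_mean = mean(ttt_scores.values())
ttt_std = stdev(ttt_scores.values())  # sample std (n - 1); an assumption
delta = round(ttt_mean, 4) - MERGED_RECORD
artifact_mean = round(mean(artifact_bytes.values()))
all_under_cap = all(b < SIZE_LIMIT for b in artifact_bytes.values())

print(f"mean={ttt_mean:.4f} std={ttt_std:.4f} delta={delta:+.4f}")
print(f"mean artifact={artifact_mean:,} bytes, all under cap: {all_under_cap}")
```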
+
+## Key Change
+
+This submission keeps the merged `2026-04-09` SP8192 record stack intact and changes only the recurrence schedule.
+
+Baseline recurrence schedule:
+
+- full 3-layer recurrence activates at `frac=0.35`
+
+This submission uses a progressive schedule:
+
+- phase 1 at `frac=0.35`: one extra recurrence pass
+- phase 2 at `frac=0.55`: full 3-layer recurrence
+
+The hypothesis is that the full recurrence stack trains better when the extra virtual depth is introduced in two stages rather than through a single hard switch.
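A minimal sketch of the two-phase schedule, assuming `frac` is the fraction of training completed and that the thresholds are inclusive. The function name `recurrence_extra_passes` is hypothetical and is not taken from `train_gpt.py`. The pass counts follow the layer lists in the Architecture section, where layers `3..5` appear twice in phase 1 and three times in phase 2.

```python
PHASE1_AT = 0.35  # ENABLE_LOOPING_AT in the reproduction command
PHASE2_AT = 0.55  # LOOP_PHASE2_AT in the reproduction command


def recurrence_extra_passes(frac: float) -> int:
    """Return how many extra passes over layers 3..5 are active at `frac`.

    Hypothetical helper: 0 before phase 1, 1 during phase 1, 2 from phase 2 on.
    """
    if frac >= PHASE2_AT:
        return 2  # full recurrence
    if frac >= PHASE1_AT:
        return 1  # one extra recurrence pass
    return 0      # no looping yet


for frac in (0.10, 0.35, 0.50, 0.55, 0.90):
    print(frac, recurrence_extra_passes(frac))
```

The baseline schedule corresponds to jumping from 0 straight to 2 at `frac=0.35`.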
+
+## Key Techniques
+
+1. **SP8192 + GPTQ SDClip** — int6 matrices (`k=12.85`), int8 embeddings (`k=20.0`), Brotli-compressed GPTQ state
+2. **Progressive 3-Layer Depth Recurrence** — layers `3..5`, with partial recurrence at `0.35` and full recurrence at `0.55`
+3. **Parallel Residuals** (`layers 7+`) — attention and MLP read from the same pre-residual input
+4. **QK-Gain 5.25** — learnable per-head query scaling
+5. **Legal Score-First TTT** — SGD (`lr=0.005`, momentum `0.9`), 3 epochs per 32K-token chunk, score-before-update ordering
+6. **Tuned Hyperparameters** — WD `0.095`, matrix LR `0.022`, EMA `0.9965`, warmdown `0.72`
+7. **Packed LZMA wrapper** — keeps code overhead low enough for the full artifact to fit under 16 MB
+
+## Architecture
+
+11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16 dims), layerwise LN scale, tied embeddings, logit softcap `30.0`, skip gates enabled.
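Two of the scalar nonlinearities named above can be written out for a single value. This is a sketch under assumptions: it reads the logit softcap as `cap * tanh(logit / cap)` and `LeakyReLU(0.5)^2` as a leaky ReLU with negative slope 0.5 followed by squaring. Neither form is confirmed by this README, and the function names are illustrative.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into (-cap, cap); assumed tanh form."""
    return cap * math.tanh(logit / cap)


def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring; assumed reading of LeakyReLU(0.5)^2."""
    y = x if x >= 0 else negative_slope * x
    return y * y


print(softcap(5.0), softcap(500.0))
print(leaky_relu_sq(3.0), leaky_relu_sq(-2.0))
```

Small logits pass through almost unchanged, while large ones saturate near the cap.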
+
+Progressive recurrence schedules the looped `3..5` segment in two phases:
+
+- phase 1 encoder: `[0,1,2,3,4,5,3]`
+- phase 1 decoder: `[4,5,6,7,8,9,10]`
+- phase 2 encoder: `[0,1,2,3,4,5,3,4]`
+- phase 2 decoder: `[5,3,4,5,6,7,8,9,10]`
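The four lists above can be reproduced by repeating the `3..5` segment and splitting the flattened order at its midpoint. The sketch below is an illustration only: the helper `layer_order` and the midpoint split rule are assumptions that happen to match the listed phases, not code taken from `train_gpt.py`.

```python
from typing import List, Tuple


def layer_order(extra_passes: int, n_layers: int = 11,
                loop: Tuple[int, int] = (3, 5)) -> Tuple[List[int], List[int]]:
    """Build (encoder, decoder) layer index lists for a looped segment.

    `extra_passes` is the number of additional traversals of layers
    loop[0]..loop[1] beyond the normal single pass.
    """
    lo, hi = loop
    segment = list(range(lo, hi + 1))
    flat = (list(range(lo))                      # layers before the loop
            + segment * (1 + extra_passes)       # looped segment, repeated
            + list(range(hi + 1, n_layers)))     # layers after the loop
    split = len(flat) // 2                       # assumed encoder/decoder split
    return flat[:split], flat[split:]


print(layer_order(1))  # phase 1
print(layer_order(2))  # phase 2
```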
+
+Parallel residuals start at layer 7.
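The difference between a sequential and a parallel residual block can be shown with scalar stand-ins for the sublayers. This is a toy sketch: `attn` and `mlp` are arbitrary callables, normalization is omitted, and the helper names are hypothetical.

```python
from typing import Callable

Sublayer = Callable[[float], float]
PARALLEL_RESIDUAL_START = 7  # value used in the reproduction command


def sequential_block(x: float, attn: Sublayer, mlp: Sublayer) -> float:
    """Standard block: the MLP reads the output of the attention residual."""
    h = x + attn(x)
    return h + mlp(h)


def parallel_block(x: float, attn: Sublayer, mlp: Sublayer) -> float:
    """Parallel block: attention and MLP both read the same input x."""
    return x + attn(x) + mlp(x)


def block(layer_idx: int, x: float, attn: Sublayer, mlp: Sublayer) -> float:
    """Pick the block form by layer index (layers 7+ are parallel)."""
    if layer_idx >= PARALLEL_RESIDUAL_START:
        return parallel_block(x, attn, mlp)
    return sequential_block(x, attn, mlp)


toy_attn = lambda v: 2.0 * v
toy_mlp = lambda v: v + 1.0
print(block(3, 1.0, toy_attn, toy_mlp), block(8, 1.0, toy_attn, toy_mlp))
```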
+
+## Training
+
+MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps), AdamW for embeddings/scalars, max wallclock `600s`, GPTQ reserve `12s`. Training stops at the wallclock cap around `4699-4711` steps depending on seed.
+
+## Quantization
+
+Full-Hessian GPTQ with SDClip, int6 matrices and int8 token embeddings. Byte-shuffle plus Brotli-11 compression. All three seeds fit under `16,000,000` bytes with the packed code wrapper.
+
+## TTT
+
+Score-first, chunk-based SGD adaptation at eval time:
+
+- chunk size `32K` tokens
+- score each chunk under `torch.no_grad()` before any update
+- train for 3 epochs on already-scored chunk tokens
+- cosine LR decay across chunks
+- gradient clipping at `1.0`
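The ordering constraint in the list above can be sketched without any framework. In this sketch `score_fn` and `train_fn` are hypothetical callbacks standing in for the model's scoring pass and one SGD epoch, and the half-cosine form of the decay is an assumption. In a real run the scoring call would sit under `torch.no_grad()`, and gradient clipping would happen inside the training step.

```python
import math
from typing import Callable, List, Sequence, TypeVar

Chunk = TypeVar("Chunk")


def chunk_lr(base_lr: float, chunk_idx: int, n_chunks: int) -> float:
    """Cosine decay of the learning rate across chunks (assumed form)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * chunk_idx / n_chunks))


def score_first_ttt(chunks: Sequence[Chunk],
                    score_fn: Callable[[Chunk], float],
                    train_fn: Callable[[Chunk, float], None],
                    base_lr: float = 0.005,
                    epochs: int = 3) -> List[float]:
    """Score each chunk exactly once, then adapt on the tokens just scored."""
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score_fn(chunk))          # score before any update
        lr = chunk_lr(base_lr, i, len(chunks))
        for _ in range(epochs):
            train_fn(chunk, lr)                 # train only on scored tokens
    return losses


# Toy usage: record the call order instead of running a model.
events = []
losses = score_first_ttt(
    ["chunk0", "chunk1"],
    score_fn=lambda c: events.append(("score", c)) or 0.0,
    train_fn=lambda c, lr: events.append(("train", c, lr)),
)
print([e[0] for e in events])
```

Because every chunk is scored before the model sees it as training data, the reported loss never benefits from updates on the tokens being scored.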
+
+## Compliance
+
+Per Issue #1017 Track B:
+
+- strictly causal sliding-window scoring
+- normalized softmax over the full vocabulary
+- score-before-update TTT ordering
+- single-pass token scoring
+- no SLOT
+- no ETLB
+- no n-gram cache
+- all artifacts under 16 MB on all 3 seeds
+- training under 600s on all 3 seeds
+
+## Reproduction
+
+```bash
+pip install brotli sentencepiece
+pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
+
+SEED=42 QK_GAIN_INIT=5.25 PARALLEL_RESIDUAL_START=7 \
+ENABLE_LOOPING_AT=0.35 LOOP_PHASE2_AT=0.55 \
+TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 COMPRESSOR=brotli \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Credits
+
+- **@clarkkev** — SP8192 + GPTQ embeddings + SDClip + MuonEq-R base stack
+- **@dexhunter** — 3-layer depth recurrence and legal TTT framework on SP8192
+- **@abaybektursun** — score-first TTT framework
+- **@Robby955** — parallel residuals on SP8192
+- **@msisovic** — parallel residual concept
+- **@X-Abhishek-X** — hyperparameter tuning for the strong SP8192 family
+
+## Included Files
+
+- `README.md`
+- `submission.json`
+- `train_gpt.py`
diff --git a/...6mb/2026-04-20_SP8192_Progressive3LayerRecurrence_ParResid_QK525_LegalTTT/submission.json b/...6mb/2026-04-20_SP8192_Progressive3LayerRecurrence_ParResid_QK525_LegalTTT/submission.json
@@ -0,0 +1,37 @@
+{
+  "author": "wisebreadloaf",
+  "github_id": "wisebreadloaf",
+  "name": "SP8192 + Progressive 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT",
+  "date": "2026-04-20",
+  "track": "10min_16mb",
+  "val_bpb": 1.08061,
+  "val_bpb_std": 0.00112,
+  "seeds": [42, 999, 1337],
+  "seed_results": {
+    "42": {"val_bpb": 1.08051, "artifact_bytes": 15992388},
+    "999": {"val_bpb": 1.08177, "artifact_bytes": 15996018},
+    "1337": {"val_bpb": 1.07955, "artifact_bytes": 15989841}
+  },
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "technique_summary": "SP8192 + progressive 3-layer recurrence (phase1@0.35, phase2@0.55) + parallel residuals + QK-Gain 5.25 + EMA 0.9965 + legal TTT (SGD 3ep) + GPTQ SDClip + Brotli + packed LZMA code wrapper",
+  "compliance": {
+    "train_under_600s": true,
+    "artifact_under_16mb": true,
+    "eval_under_600s": true,
+    "no_slot": true,
+    "no_pre_quant_ttt": true,
+    "no_etlb": true,
+    "no_ngram_cache": true,
+    "score_first_ttt": true,
+    "three_seeds": true
+  },
+  "attribution": {
+    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
+    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
+    "parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
+    "legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
+    "hyperparameter_tuning": "@X-Abhishek-X (PR #1445, #1471)",
+    "v15_contribution": "progressive recurrence schedule on the merged SP8192 record family"
+  }
+}