# SP8192 PR #1874 + Optimized Hyperparameters — val_bpb 1.06844 (3-seed mean)

## Results

| Seed | Pre-quant BPB | Post-quant BPB | **Post-TTT BPB** | Artifact (bytes) | Train time | Eval time |
|------|---------------|----------------|------------------|-------------------|------------|-----------|
| 1337 | 1.06960 | 1.07925 | **1.06798** | 15,950,405 | 599.64s | 409.6s |
| 42 | 1.06984 | 1.07948 | **1.06824** | 15,952,215 | 599.65s | 421.0s |
| 2025 | 1.07060 | 1.08028 | **1.06909** | 15,948,755 | 599.56s | 381.2s |
| **Mean** | **1.07001** | **1.07967** | **1.06844** | **15,950,458** | **599.62s** | **403.9s** |
| **Std** | 0.00053 | 0.00053 | **0.00058** | 1,730 | 0.05s | 20.2s |
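The BPB numbers in the table follow directly from the `val_loss` values recorded in the JSON below: bits per byte is the mean cross-entropy in nats per token, divided by ln 2 and by the bytes-per-token ratio of the tokenizer on the validation set. A minimal sketch of that conversion, where the ~3.727 UTF-8 bytes per SP8192 token is *inferred* from the reported seed-1337 numbers rather than taken from the training code:

```python
import math

def nats_per_token_to_bpb(val_loss: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte (BPB)."""
    return val_loss / (math.log(2) * bytes_per_token)

# Ratio implied by the seed-1337 row (assumption: constant across seeds,
# which the other rows are consistent with).
BYTES_PER_TOKEN = 2.75871500 / (1.06798 * math.log(2))   # ~3.7266

print(round(nats_per_token_to_bpb(2.75937977, BYTES_PER_TOKEN), 5))  # seed 42 -> 1.06824
```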

## Configuration

- **Base code:** PR #1874 (AjAnubolu) verbatim — no code modifications
- **Environment variables:** `MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0`
- **Hardware:** 8xH100 80GB SXM (RunPod on-demand)
- **Data template:** `c5dbhtfrrt` (SP8192, 128 train + 1 val shards)

## Techniques (all from PR #1874, activated via env vars)

1. **LQER Asymmetric Rank-4** — SVD-based low-rank quantization error reduction on top-K=3 highest-error GPTQ residuals
2. **SmearGate + Attention Output Gate (width 24)** — per-layer smoothing + full-dim attention gating
3. **Polar Express Newton-Schulz** — 5 per-iteration minimax-tuned coefficient tuples for Muon optimizer
4. **MIN_LR=0.10** — warmdown LR floor at 10% of max (prevents LR collapse to zero)
5. **QK_GAIN_INIT=5.25** — per-head query-key attention scaling
6. **GATE_ATTN_WIDTH=24** — doubled attention gate capacity
7. **GPTQ_RESERVE_SECONDS=0.5** — maximizes training steps (default 4.0 wastes ~28 steps)
8. **VAL_LOSS_EVERY=0** — eliminates mid-training eval overhead (~14s saved = ~112 extra steps)
9. **Phased Score-First TTT** — 3-phase AdamW LoRA-TTT (rank 128), score-first ordering
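Technique 1 (LQER) corrects quantization error with a low-rank term: the residual between the full-precision weight and its quantized version is SVD-truncated to rank 4 and stored as two small factors, so at load time `W_q + A @ B` is closer to `W` than `W_q` alone. A minimal NumPy sketch, not PR #1874's implementation (the top-K=3 layer selection by GPTQ error, the asymmetric scaling, and the int packing are omitted; the uniform rounding below is a crude stand-in for GPTQ):

```python
import numpy as np

def lqer_correction(W: np.ndarray, W_q: np.ndarray, rank: int = 4):
    """Rank-k SVD factors of the quantization residual W - W_q (LQER-style)."""
    U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded into A
    B = Vt[:rank, :]             # (rank, in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
W_q = np.round(W * 4) / 4        # crude uniform quantizer stand-in for GPTQ
A, B = lqer_correction(W, W_q, rank=4)
err_before = np.linalg.norm(W - W_q)
err_after = np.linalg.norm(W - (W_q + A @ B))
assert err_after < err_before    # the rank-4 term shrinks the residual
```

The rank-4 factors cost `(64 + 64) * 4` extra floats here versus `64 * 64` for the dense residual, which is the artifact-size trade-off the technique exploits.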
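Technique 3 (Polar Express Newton-Schulz) concerns the orthogonalization step inside Muon: each gradient matrix is pushed toward its nearest semi-orthogonal matrix by a quintic polynomial iteration. The sketch below uses the widely known fixed coefficient tuple rather than the five per-iteration minimax-tuned tuples from PR #1344, which I cannot reproduce here; the iteration structure is the same:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration driving G toward its polar factor,
    as used for Muon updates. Fixed (a, b, c) here; Polar Express instead
    uses a different minimax-tuned tuple on each of the 5 iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.diag([3.0, 2.0, 1.0, 0.5])        # toy matrix with known singular values
s = np.linalg.svd(newton_schulz(G), compute_uv=False)
assert np.all((s > 0.5) & (s < 1.3))     # all singular values pushed toward 1
```

The point of tuning the coefficients per iteration is faster flattening of the singular-value spectrum within a fixed 5-step budget, which matters when the iteration runs once per optimizer step.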
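Technique 4 (MIN_LR=0.10) changes only the tail of the schedule: during warmdown the learning rate decays toward 10% of the peak instead of zero. A hypothetical sketch of that shape, assuming a constant-then-linear-warmdown schedule (the exact schedule in `train_gpt.py` may differ):

```python
def warmdown_lr(step: int, total_steps: int, warmdown_steps: int,
                max_lr: float = 1.0, min_lr_frac: float = 0.10) -> float:
    """Linear warmdown that floors at min_lr_frac * max_lr instead of zero."""
    if step < total_steps - warmdown_steps:
        return max_lr
    frac = (total_steps - step) / warmdown_steps   # 1 -> 0 over the warmdown
    return max_lr * (min_lr_frac + (1 - min_lr_frac) * frac)

assert warmdown_lr(0, 1000, 200) == 1.0            # flat phase at peak LR
assert abs(warmdown_lr(1000, 1000, 200) - 0.10) < 1e-12  # floor, not zero
```

Keeping the final steps at a nonzero LR means the last ~hundred steps still make measurable progress rather than decaying into a no-op.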

## Rule Compliance

- Score-first phased TTT (no re-scoring)
- No pre-quant TTT on validation data
- No n-gram cache or PPM
- No CaseOps, no casefold — standard SP8192 UTF-8 byte counting
- Artifact < 16,000,000 bytes (max 15,952,215 B)
- Train time < 600s (max 599.65s), eval time < 600s (max 421.0s)

## How to reproduce

```bash
SEED=1337 MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 \
GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Attribution

Built entirely on PR #1874 (AjAnubolu), which itself builds on PR #1790 (miaoyuxun), PR #1344 (Polar Express), PR #1787 (nprime06), PR #1797 (dexhunter).
## Submission record (JSON)
{
"author": "bigbag",
"val_bpb": 1.06844,
"val_bpb_std": 0.00058,
"bytes_total_max": 15952215,
"seed_results": [
{
"seed": 1337,
"val_bpb": 1.06798,
"val_loss": 2.75871500,
"bytes_total": 15950405,
"train_time_ms": 599643,
"eval_time_ms": 409616
},
{
"seed": 42,
"val_bpb": 1.06824,
"val_loss": 2.75937977,
"bytes_total": 15952215,
"train_time_ms": 599651,
"eval_time_ms": 420960
},
{
"seed": 2025,
"val_bpb": 1.06909,
"val_loss": 2.76156888,
"bytes_total": 15948755,
"train_time_ms": 599564,
"eval_time_ms": 381181
}
],
"score_first_ttt": true,
"no_pre_quant_ttt": true,
"no_ngram_cache": true,
"env_vars": "MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0",
"techniques": [
"PR #1874 verbatim code (SP8192 non-CaseOps)",
"LQER asymmetric rank-4 quantization correction",
"SmearGate + Attention Output Gate (width 24)",
"Polar Express Newton-Schulz coefficients",
"MIN_LR=0.10 warmdown floor",
"QK_GAIN_INIT=5.25 attention scaling",
"GATE_ATTN_WIDTH=24 full-dim gating",
"GPTQ_RESERVE_SECONDS=0.5 (maximize training steps)",
"VAL_LOSS_EVERY=0 (no mid-train eval overhead)",
"Phased TTT (3-phase score-first AdamW, LoRA rank 128)",
"11L x 512D x 8H/4KV, parallel residuals, depth recurrence",
"GPTQ int6/int7 + Brotli-11"
]
}