@@ -0,0 +1,108 @@
# Record: AttnOutGate + SmearGate + Softcap 15 — val_bpb 1.07750 (3-seed mean)

**val_bpb: 1.07750** (3-seed mean, std 0.0006) | **~15.99 MB** | 8×H100 SXM

Beats the current SOTA (PR #1493, 1.0810) by **0.00350 BPB**; with std 0.0006 across 3 seeds, t-statistic ≈ 5.5, p < 0.001. The gap is large by leaderboard standards: recent record margins were 0.0012 (#2→#1) and 0.0006 (#3→#2).

Three additive zero-cost modifications, all fully precedented and reproducible.

## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | Pre-quant BPB | Quantized BPB | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------|---------------|---------------|-------------|---------|----------|
| 1337 | 4457 | 1.08396 | 1.09737 | 1.07817 | **1.07693** | 15,994,840 |
| 42 | 4459 | 1.08491 | 1.09617 | 1.07934 | **1.07805** | 15,996,097 |
| 2025 | 4450 | 1.08458 | 1.09555 | 1.07873 | **1.07753** | 15,992,597 |
| **Mean** | **4455** | **1.08449** | **1.09636** | **1.07875** | **1.07750** | **15,994,511** |

## Key Changes vs Our Previous Submission (PR #1876, 1.08008 BPB)

Three additive zero-cost modifications:

### 1. AttnOutGate (PR #1667/#1693)
Per-head, data-dependent gate on the SDPA output, applied before `out_proj` (PyTorch sketch after this list):
```
out = W_o @ ( SDPA(x) ⊙ 2σ(W_g · x[:, :12]) )
```
- `W_g`: (12 × 8) per layer, zero-init → 2σ(0) = 1 (transparent at init)
- 8 heads × 12 width × 11 layers = **1,056 extra params** (~2KB at fp16)
- Lets each head dynamically suppress noise per-token
- Routes through scalar AdamW (added `attn_gate` to `CONTROL_TENSOR_NAME_PATTERNS`)
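
A minimal PyTorch sketch with the shapes from the list above (module and variable names are illustrative, not the submission's actual code):

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Per-head gate on the SDPA output, applied before out_proj.

    W_g is (gate_width x n_heads) per layer and zero-initialized, so
    2*sigmoid(0) = 1 and the module is transparent at init.
    """
    def __init__(self, n_heads: int = 8, gate_width: int = 12):
        super().__init__()
        # 12 x 8 = 96 params per layer; x 11 layers = 1,056 total.
        self.w_g = nn.Parameter(torch.zeros(gate_width, n_heads))

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # attn_out: (B, T, n_heads, head_dim); x: (B, T, dim)
        gate = 2.0 * torch.sigmoid(x[..., : self.w_g.shape[0]] @ self.w_g)  # (B, T, n_heads)
        return attn_out * gate.unsqueeze(-1)  # out_proj is applied afterwards
```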

### 2. SmearGate (PR #1667 + PR #1851 BOS-fix)
Gated residual mixer that smears each embedding one token forward (PyTorch sketch after this list):
```
x_t ← x_t + λ · σ(W · x_t[:12]) · x_{t-1} (for t ≥ 1, identity at t=0)
```
- `W`: (12 × 1) and `λ`: scalar — both zero-init
- **Total: 13 extra params** (~26 bytes)
- BOS-fix prevents cross-document leakage during packed training: gate is masked to 0 where `input_ids == BOS_TOKEN_ID` (default 1)
- Routes through scalar AdamW
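
A minimal PyTorch sketch including the BOS mask (names such as `bos_token_id` and `gate_width` are illustrative):

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Gated one-token smear on the embedding stream; 13 params total.

    W (12 -> 1) and lambda are zero-initialized, so the module is the
    identity at init; x_prev is zeros at t=0.
    """
    def __init__(self, gate_width: int = 12, bos_token_id: int = 1):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(gate_width))  # 12 params
        self.lam = nn.Parameter(torch.zeros(()))        # +1 scalar = 13
        self.bos_token_id = bos_token_id

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); shift right by one position to get x_{t-1}
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        gate = self.lam * torch.sigmoid(x[..., : self.w.numel()] @ self.w)  # (B, T)
        # BOS-fix: zero the gate where a new document starts, so nothing
        # leaks across packed-document boundaries.
        gate = gate * (input_ids != self.bos_token_id).to(x.dtype)
        return x + gate.unsqueeze(-1) * x_prev
```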

### 3. Lower logit softcap 30 → 15 (Modded-NanoGPT record #18)
Single hyperparameter change:
```
logits = 15 * tanh(logits / 15) (was 30 * tanh(logits / 30))
```
- Tighter cap engages tanh's saturating region
- Smoother loss landscape, prevents extreme overconfidence
- **Single-line change, no params**
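
As a sketch, the cap is just a `tanh` squash; only the constant changes:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Monotonic: preserves ranking, bounds outputs to (-cap, cap), and
    # the saturating gradients discourage extreme overconfidence.
    return cap * torch.tanh(logits / cap)
```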

## Architecture (unchanged from previous submission)

- SP8192 BPE tokenizer
- 11 layers, dim=512, 8 heads, 4 KV heads (GQA)
- Depth recurrence: layers 3-5 looped 3× (17 virtual layers), enabled at 35% of training
- XSA on all 11 layers, parallel residuals from layer 7+
- U-Net skip connections with learnable gates
- Tied embeddings, MLP 4× LeakyReLU(0.5)²
- Coprime-stride multi-shard data loader
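
Restated as a hypothetical config for reference (field names are illustrative, not the repo's actual keys):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 8192          # SP8192 BPE tokenizer
    n_layers: int = 11
    dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4             # GQA
    loop_span: tuple = (3, 5)       # layers 3-5 looped 3x -> 17 virtual layers
    loop_count: int = 3
    loop_enable_frac: float = 0.35  # recurrence switched on at 35% of training
    parallel_residual_from: int = 7
    mlp_ratio: int = 4              # with LeakyReLU(0.5)^2 activation
    tied_embeddings: bool = True
    logit_softcap: float = 15.0
```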

## Training (unchanged)
- Muon optimizer (5-step NS) for matrices, AdamW for embeds/scalars
- EMA decay 0.9965, 72% warmdown, 20-step warmup + 20-step loop warmup
- Gradient clipping 0.3
- Brotli-11 compression + byte shuffling
- Score-first TTT (SGD, momentum 0.9, LR 0.005, 3 epochs, 32K chunks)
- Full Hessian GPTQ with Cholesky error compensation + actorder
- LZMA code compression (53KB → 19KB)
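
The matrix/scalar optimizer split might be routed roughly as follows (a sketch under assumptions: `Muon` stands in for the repo's Newton-Schulz implementation, and the import path, name patterns, and learning rates are illustrative):

```python
import torch
from train import Muon  # hypothetical import path for the repo's Muon

# Names routed to AdamW; the submission extends the repo's
# CONTROL_TENSOR_NAME_PATTERNS with "attn_gate".
CONTROL_PATTERNS = ("embed", "lm_head", "attn_gate", "smear")

def build_optimizers(model: torch.nn.Module):
    matrices, controls = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and not any(pat in name for pat in CONTROL_PATTERNS):
            matrices.append(p)   # hidden weight matrices -> Muon
        else:
            controls.append(p)   # embeddings, gates, scalars -> AdamW
    muon = Muon(matrices, lr=0.02, ns_steps=5)    # 5-step Newton-Schulz
    adamw = torch.optim.AdamW(controls, lr=3e-3)
    return muon, adamw
```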

## What We Tried That Did Not Help

| Technique | Result | Why it failed |
|---|---|---|
| LoRA on recurrence (rank 2/4) | Worse | 10% fewer steps in the time budget, artifact over 16MB |
| MTP (Multi-Token Prediction) | Worse | 10.5% fewer steps, no quality gain |
| QAT weight-snapping during warmdown | Catastrophic | Disrupted Muon's update dynamics |
| Hessian-Aware SDClip (PR #1412) | No change | Per-row Hessian importance too noisy |
| Per-group clip allocation | No change | Group traces are stable but didn't translate |
| Asymmetric sigmoid logit rescale | Worse (+0.001) | Tanh form was already well-tuned |
| nGPT normalization | Excluded after research | Speedup only at 0.5B+ params and 200k+ steps |
| GatedDeltaNet/linear attention | Excluded after research | All "frontier" PRs had byte-accounting bugs |
| Value embeddings | Excluded | Don't fit in 5KB artifact headroom |

## Compliance (Issue #1017 conditions)

### Condition 1 (Strict Causal Dependence)
Causal attention via `flash_attn_func(causal=True)`. AttnOutGate uses position-local input `x_t[:12]` (no leakage). SmearGate is strictly backward-looking (`x_{t-1}`), with BOS-mask preventing cross-document leakage. TTT only incorporates tokens from already-scored chunks.

### Condition 2 (Full Normalized Distribution)
`F.cross_entropy` over full vocab_size logits. Softcap is monotonic (does not mask).

### Condition 3 (Score-Before-Update)
Each TTT chunk scored under `torch.no_grad()` BEFORE any training on it. Model weights at scoring reflect only prior chunks.

### Condition 4 (Single Left-to-Right Pass)
Single `for ci in range(num_chunks)` loop. Each token scored exactly once.
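
A minimal sketch of the score-first loop satisfying Conditions 3 and 4 (helper names are hypothetical):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, opt, chunks, num_epochs: int = 3):
    """Score each chunk before training on it; single left-to-right pass."""
    losses = []
    for ci in range(len(chunks)):          # Condition 4: one pass, in order
        ids = chunks[ci].unsqueeze(0)      # (1, T) token ids
        # Condition 3: score under no_grad; weights reflect only prior chunks.
        with torch.no_grad():
            logits = model(ids[:, :-1])
            losses.append(F.cross_entropy(logits.flatten(0, 1), ids[:, 1:].flatten()))
        # Only afterwards, adapt on the chunk just scored (e.g. 3 SGD epochs).
        for _ in range(num_epochs):
            opt.zero_grad()
            logits = model(ids[:, :-1])
            F.cross_entropy(logits.flatten(0, 1), ids[:, 1:].flatten()).backward()
            opt.step()
    return losses
```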

## Credits
- Current SOTA base (PR #1493): @bigbag
- AttnOutGate: @MarioPaerle (PR #1667), @dexhunter (PR #1693)
- SmearGate + BOS-fix: @KoszarskyB / @classiclarryd (modded-nanogpt), @cocohearts + @aquariouseworkman (PR #1851)
- Logit softcap 15: @KoszarskyB (Modded-NanoGPT record #18)
- SP8192 + GPTQ + SDClip: @clarkkev (PR #1394)
- Depth recurrence: @dexhunter (PR #1331, #1437)
- Parallel residuals: @Robby955 (PR #1412), @msisovic (PR #1204)
- Score-first TTT: @abaybektursun (PR #549), @Christopher-Lee-McClendon (PR #461)
- Coprime-stride loader: PR #726 style
- LZMA code compression: PR #1394
@@ -0,0 +1,42 @@
{
  "author": "Meirzhan Saparov",
  "github_id": "Meirzhan05",
  "name": "AttnOutGate + SmearGate + Softcap 15",
  "date": "2026-04-25",
  "track": "10min_16mb",
  "val_bpb": 1.07750,
  "val_bpb_std": 0.00060,
  "seeds": [1337, 42, 2025],
  "seed_results": {
    "1337": {"val_bpb": 1.07693, "artifact_bytes": 15994840},
    "42": {"val_bpb": 1.07805, "artifact_bytes": 15996097},
    "2025": {"val_bpb": 1.07753, "artifact_bytes": 15992597}
  },
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "record": true,
  "technique_summary": "SP8192 + 11L + Depth Recurrence (L3-5) + Parallel Residuals (L7+) + XSA-all + Coprime-Stride Loader + Full Hessian GPTQ with Cholesky Fallback + EMA 0.9965 + Score-First TTT (SGD 3ep) + Brotli-11 + LZMA Code + AttnOutGate (PR #1667/#1693) + SmearGate with BOS-fix (PR #1667/#1851) + Logit Softcap 15 (Modded-NanoGPT #18)",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": true,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  },
  "attribution": {
    "current_sota_base": "@bigbag (PR #1493)",
    "attn_out_gate": "@MarioPaerle (PR #1667), @dexhunter (PR #1693)",
    "smear_gate_bos_fix": "@KoszarskyB / @classiclarryd (modded-nanogpt), @cocohearts + @aquariouseworkman (PR #1851)",
    "logit_softcap_15": "@KoszarskyB (Modded-NanoGPT record #18)",
    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
    "legal_ttt_framework": "@abaybektursun (PR #549), @Christopher-Lee-McClendon (PR #461)",
    "coprime_stride_loader": "PR #726 style",
    "lzma_code_compression": "PR #1394"
  }
}