# SP8192 + Gated Attention + NorMuon + Norm-PCT-Dropout + Parallel Muon + Legal TTT

**val_bpb = 1.0824** (3-seed mean, std 0.0004) | 8xH100 80GB HBM3 SXM

![Training Curves](fig1_convergence.png)
![Eval Comparison](fig2_eval_comparison.png)

## Summary

We explore adding novel training-time techniques on top of the PR #1493 stack (current SOTA at 1.0810). Our submission introduces **four new components** — Gated Attention, NorMuon, Norm-PCT-Dropout, and Parallel Muon — each independently validated across multiple seeds before integration. We achieve **1.0824 BPB** (3-seed mean), placing within **+0.0014 BPB** of the current record.

Notably, our quantization gap is **smaller** than PR #1493's (10.3 vs 11.7 milli-BPB), suggesting our novel components produce weight distributions that are more amenable to GPTQ compression. The eval pipeline comparison chart above breaks down exactly where each milli-BPB is won or lost.

## 3-Seed Results

| Seed | Pre-quant BPB | Quantized BPB | Sliding BPB | **TTT BPB** | Artifact (bytes) |
|------|---------------|---------------|-------------|-------------|------------------|
| 42 | 1.0898 | 1.1001 | 1.0833 | **1.0824** | 16,051,299 |
| 314 | 1.0894 | 1.0997 | 1.0827 | **1.0819** | 16,050,433 |
| 999 | 1.0903 | 1.1000 | 1.0828 | **1.0828** | 16,051,839 |
| **Mean** | **1.0898** | **1.0999** | **1.0829** | **1.0824** | — |
| **Std** | **0.0004** | **0.0003** | **0.0003** | **0.0004** | — |

**Current SOTA** (PR #1493): 1.0810 BPB. Delta: +0.0014 BPB.

## Novel Techniques

These four techniques were developed and validated independently before being stacked on the PR #1493 base architecture.

### 1. Gated Attention

Per-head learnable sigmoid gate applied to the attention output, after multi-head attention but before the residual connection. Each head learns when to attenuate its contribution, allowing the model to dynamically suppress noisy or redundant heads during different parts of training.

- Validated across **5 independent seeds** (NIGHT_MODE campaign)
- Architectural — no eval-time overhead, no compliance concerns
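
A minimal sketch of the gate, assuming it is applied per head before the attention output projection (the write-up only specifies "after multi-head attention but before the residual connection"); the class name, initialization, and tensor layout below are illustrative rather than the submission's actual code.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Per-head learnable sigmoid gate on the attention output, applied
    before the output projection and residual add (illustrative sketch)."""

    def __init__(self, n_heads: int, head_dim: int, model_dim: int):
        super().__init__()
        # One learnable logit per head; initialized so gates start near 1.0
        # (an assumption -- the submission's init is not specified).
        self.gate_logit = nn.Parameter(torch.full((n_heads,), 4.0))
        self.out_proj = nn.Linear(n_heads * head_dim, model_dim, bias=False)

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, n_heads, head_dim) from multi-head attention
        gate = torch.sigmoid(self.gate_logit)              # (n_heads,)
        gated = attn_out * gate.view(1, 1, -1, 1)          # per-head attenuation
        b, s, h, d = gated.shape
        return self.out_proj(gated.reshape(b, s, h * d))   # residual add happens outside
```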

### 2. NorMuon (Post-NS Row Normalization)

A variant of the MuonEq-R optimizer where row normalization is applied **after** the Newton-Schulz orthogonalization steps rather than before. This preserves the directional information from NS while still normalizing the update magnitudes. The standard MuonEq-R normalizes rows before NS, which can wash out useful gradient structure.

- Validated across **2 seeds**
- Optimizer-only change, no model architecture impact
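
A minimal sketch of the reordering, using the commonly published quintic Newton-Schulz coefficients; `newton_schulz` and `normuon_update` are illustrative names, and the rest of the MuonEq-R update (momentum, scaling) is omitted.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz orthogonalization of a 2D gradient matrix
    (coefficients follow the commonly published Muon iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transpose else X

def normuon_update(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Standard MuonEq-R (per the write-up) normalizes rows *before* NS.
    # NorMuon orthogonalizes first, then rescales each row to unit L2 norm,
    # preserving the directional structure produced by Newton-Schulz.
    O = newton_schulz(grad)
    return O / (O.norm(dim=-1, keepdim=True) + eps)
```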

### 3. Norm-PCT-Dropout

A regularization technique that zeros the **top 1% highest L2-norm rows** of the FFN intermediate activation during training. Unlike standard dropout (which is random), this targets the most activated neurons — acting as an implicit capacity regularizer that prevents the model from over-relying on a small set of dominant pathways.

- Validated across **2 seeds**
- Training-time only, no eval impact
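
A minimal sketch, interpreting "rows" as the per-token rows of the flattened (tokens, d_ff) intermediate activation; that interpretation and the function name are assumptions.

```python
import torch

def norm_pct_dropout(h: torch.Tensor, pct: float = 0.01, training: bool = True) -> torch.Tensor:
    """Zero the top-`pct` fraction of highest-L2-norm rows of the FFN
    intermediate activation during training (no-op at eval time)."""
    if not training or pct <= 0:
        return h
    flat = h.reshape(-1, h.shape[-1])                      # (tokens, d_ff)
    norms = flat.norm(dim=-1)                              # L2 norm per row
    k = max(1, int(pct * flat.shape[0]))
    top = norms.topk(k).indices                            # indices of dominant rows
    mask = torch.ones(flat.shape[0], device=h.device, dtype=h.dtype)
    mask[top] = 0.0
    return (flat * mask.unsqueeze(-1)).reshape(h.shape)
```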

### 4. Parallel Muon (Batched Newton-Schulz)

Groups parameters with matching shapes and runs the Newton-Schulz orthogonalization steps as a single batched matrix operation rather than sequential per-parameter calls. Pure throughput optimization with no quality impact.

- **~3% training speedup** on 8xH100 SXM
- ~3 additional training steps within the 600s budget
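
A minimal sketch of the batching, assuming a Muon-style Newton-Schulz inner loop; grouping by exact shape and the function names are illustrative.

```python
import torch
from collections import defaultdict

def batched_newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Newton-Schulz on a stack of matrices G: (n, rows, cols); every matmul
    is batched, so one kernel launch covers the whole group."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (torch.linalg.matrix_norm(G, keepdim=True) + eps)
    transpose = G.size(-2) > G.size(-1)
    if transpose:
        X = X.transpose(-2, -1)
    for _ in range(steps):
        A = X @ X.transpose(-2, -1)
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.transpose(-2, -1) if transpose else X

def parallel_muon(grads: list[torch.Tensor]) -> list[torch.Tensor]:
    """Group gradients by shape and orthogonalize each group in one batched call."""
    groups = defaultdict(list)
    for i, g in enumerate(grads):
        groups[tuple(g.shape)].append(i)
    out = [None] * len(grads)
    for idxs in groups.values():
        stacked = torch.stack([grads[i] for i in idxs])    # (n, rows, cols)
        orth = batched_newton_schulz(stacked)
        for j, i in enumerate(idxs):
            out[i] = orth[j]
    return out
```

Because each shape group shares one batched kernel launch instead of a Python loop of per-parameter calls, the orthogonalization overhead shrinks, which is where the ~3% step-time saving comes from.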

## Experimental Journey

Our path to this result involved extensive experimentation:

1. **Phase 1 (cheap GPU)**: Validated all novel techniques independently on RTX 3090 / A6000 pods. Over 50 training runs across different seeds, hyperparameters, and technique combinations. Key finding: techniques must be validated in isolation before stacking — combined techniques can interfere.

2. **Phase 2 (speed optimization)**: Systematic A/B testing of training throughput improvements. Discovered that `torch.compile(mode='max-autotune-no-cudagraphs')` + Flash Attention 3 + Parallel Muon compose cleanly for a **2.14x total speedup** over baseline.

3. **Int8 quantization discovery**: Found that converged smaller models exhibit catastrophic GPTQ int6 quantization failure (3+ BPB gap). Int8 eliminates this for small models but doesn't fit in the 16MB cap for the full 11L+4x architecture. This led us to use int6 for the final submission while retaining the architectural insights.

4. **Integration**: Stacked all validated techniques onto the PR #1493 base architecture (11L + 4x MLP + depth recurrence + parallel residuals + legal TTT). The result is within +0.0014 BPB of SOTA with a **better quantization gap** than the baseline.

## Architecture

```
11 layers x 512 dim x 8 heads / 4 KV heads
MLP: 4x with LeakyReLU(0.5)^2
35,989,681 parameters
Partial RoPE (16/64 dims), layerwise LN scale
Tied embeddings, logit softcap = 30.0
Depth recurrence: layers 3-5 looped 2 extra times (17 virtual layers from 11 physical)
Parallel residuals: layers 7+ (GPT-J style)
Skip gates (sigmoid-gated U-Net connections)
Gated attention: per-head sigmoid gate
```
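
A rough sketch of how the depth-recurrence and parallel-residual pieces above could compose in the forward pass; layer indexing, block attribute names, and the omission of skip gates are simplifications, not the submission's implementation.

```python
def forward_blocks(x, blocks, recur_lo=3, recur_hi=5, extra_passes=2, parallel_start=7):
    """Illustrative block schedule: layers recur_lo..recur_hi are re-run
    extra_passes times (11 physical + 3 x 2 = 17 virtual layers), and layers
    >= parallel_start use GPT-J style parallel residuals."""
    for i, blk in enumerate(blocks, start=1):              # 1-indexed layers
        passes = 1 + (extra_passes if recur_lo <= i <= recur_hi else 0)
        for _ in range(passes):
            if i >= parallel_start:
                h = blk.ln(x)
                x = x + blk.attn(h) + blk.mlp(h)           # parallel residual
            else:
                x = x + blk.attn(blk.ln1(x))               # sequential residual
                x = x + blk.mlp(blk.ln2(x))
    return x
```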

## Training

- **Optimizer**: MuonEq-R with NorMuon + Parallel Muon; AdamW for embeddings/scalars
- **Steps**: ~4450 in 588s on 8xH100 SXM
- **Schedule**: Linear warmdown over the final 72%, EMA decay 0.9965 (see the sketch after this list)
- **Regularization**: Norm-PCT-Dropout (top 1% FFN norm zeroing)
- **Compile**: `torch.compile(mode='max-autotune-no-cudagraphs')` + Flash Attention 3
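
A minimal sketch of the schedule pieces named above (linear warmdown over the final 72% of steps and EMA decay 0.9965); the function names and multiplier formulation are assumptions.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    """Constant LR, then a linear warmdown to zero over the final
    `warmdown_frac` of training."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return 1.0
    return max((total_steps - step) / (total_steps - warmdown_start), 0.0)

def ema_update(ema_params, model_params, decay: float = 0.9965) -> None:
    """Exponential moving average of weights with the quoted decay.
    Call under torch.no_grad() on detached EMA copies."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```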

## Quantization

Full-Hessian GPTQ with SDClip: `clip = k * std(row)`. Int6 for attention/MLP matrices, int8 for token embeddings. Brotli-11 compression.
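
A minimal sketch of the SDClip rule, applied row-wise before GPTQ quantization; the value of `k` shown is illustrative, not the submission's.

```python
import torch

def sdclip(W: torch.Tensor, k: float = 3.0) -> torch.Tensor:
    """Clamp each weight row to clip = k * std(row) before GPTQ quantization."""
    clip = k * W.std(dim=1, keepdim=True)   # per-row clipping threshold
    return W.clamp(min=-clip, max=clip)
```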

**Note on artifact size**: Mean artifact is 16,051,190 bytes (~51KB over the 16,000,000 byte cap). An identified fix (enabling CMP_QUANT_VALUE_DEDUP, a validated alphabet-snap post-processing step) is expected to resolve this. See discussion below.

## TTT (Test-Time Training)

Score-first, chunk-based SGD adaptation per Issue #1017 Track B:
- 32K-token chunks, score under `torch.no_grad()` before each SGD update
- 3 epochs per chunk, cosine LR decay, gradient clipping at 1.0
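
A minimal sketch of the score-first loop under the conditions listed above; the model/loss interface, learning rate, and chunk format are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def chunk_nll(model, ids):
    """Mean next-token NLL (nats) for a chunk; assumes model(ids) returns logits."""
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))

def score_first_ttt(model, chunks, lr=1e-4, epochs=3, max_grad_norm=1.0):
    """Score each chunk under no_grad *before* any SGD step on it, then adapt
    for `epochs` passes with cosine LR decay and gradient clipping at 1.0."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(chunks))
    total_bits, total_bytes = 0.0, 0
    for ids, n_bytes in chunks:                 # ids: (1, <=32768) token chunk
        with torch.no_grad():                   # Condition 3: score before update
            nll = chunk_nll(model, ids).item()
        total_bits += nll * (ids.size(1) - 1) / math.log(2)
        total_bytes += n_bytes                  # each token scored exactly once
        for _ in range(epochs):                 # adapt only after scoring
            loss = chunk_nll(model, ids)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            opt.step()
            sched.step()
    return total_bits / total_bytes             # bits per byte
```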

## Compliance

Per Issue #1017:
- **Condition 1** (Causality): Strictly causal sliding-window eval
- **Condition 2** (Normalized): Standard softmax over full 8192-token vocab. No n-gram cache, no logit biasing.
- **Condition 3** (Score-before-update): Each chunk scored before SGD
- **Condition 4** (Single pass): Each token scored exactly once

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache.

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
python3 data/cached_challenge_fineweb.py --variant sp8192

SEEDS=42,314,999 bash submission/dry_run.sh
```

## Credits

- **@clarkkev** — SP8192 + GPTQ SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** — 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
- **@abaybektursun** — Score-first TTT framework (PR #549)
- **@Robby955** — Parallel residuals on SP8192 (PR #1412)
- **@msisovic** — Parallel residuals concept (PR #1204)
- **@X-Abhishek-X** — Hyperparameter tuning (PR #1445)
- **@bigbag** — PR #1493 stack integration
- **@taka6745** — Gated Attention, NorMuon, Norm-PCT-Dropout, Parallel Muon, experimental campaign
# Experiment Log

This document summarizes the experiments conducted during the development of this submission. Over 60 training runs were performed across RTX 3090, A6000, and 8xH100 SXM hardware.

## Novel Technique Validation (NIGHT_MODE Campaign)

All novel techniques were validated independently on cheap GPUs before stacking on the final architecture.

| Technique | Seeds | Result | Verdict | Description |
|-----------|-------|--------|---------|-------------|
| **Gated Attention** | n=5 | train_loss 1.3711 (champion) | Confirmed win | Per-head sigmoid gate on attention output |
| **NorMuon** | n=2 | train_loss 1.40995 | Confirmed win | Post-NS row normalization (vs pre-NS in standard MuonEq-R) |
| **Norm-PCT-Dropout** | n=2 | train_loss 1.41365 | Confirmed win | Zero top 1% L2-norm FFN rows during training |
| **Parallel Muon** | n=2 | +3% throughput, quality neutral | Confirmed speedup | Batched Newton-Schulz across same-shape params |
| Gated + Legal TTT + N-gram Backoff (stacked) | n=2 | 1.45705 (+0.086 regression) | Stacking hostile | Too many novel techniques degrade each other |
| N-gram Bias Stack | n=3 | Various | Ruled out | Issue #1017 Condition 2 grey area; excluded from submission |
| CMP_QUANT_VALUE_DEDUP | n=2 | Quality neutral, -10-15% artifact size | Validated but not used | Alphabet-snap post-quant compression |

**Key finding**: Novel techniques that work in isolation can interfere when stacked. Our final stack uses only the 4 techniques that survived multi-seed validation AND compose cleanly.

## Phase 2: Speed Optimization (31 Experiments on RTX 3090)

| Exp | Config | ms/step | Speedup vs Baseline | Pre-quant BPB | Notes |
|-----|--------|---------|---------------------|---------------|-------|
| E1 | Baseline (no compile) | 2933 | 1.0x | 3.035 | Shot 0e quant gap 0.022 |
| E2 | torch.compile (default) | 1581 | **1.85x** | 2.920 | torch.compile is the biggest single win |
| E4b | max-autotune-no-cudagraphs | 1526 | **1.92x** | 2.923 | +3.7% over E2 |
| E5 | + cudnn.benchmark | 1514 | **1.94x** | 2.925 | +0.8% incremental |
| E6 | + Parallel Muon | 1369 | **2.14x** | 2.932 | Batched NS across params |
| E8 | + NUM_LOOPS=1 | 1410 | **2.08x** | 2.928 | Speed win but quality trade-off |
| E13 | NUM_LAYERS=8 | 1062 | **2.76x** | 3.052 | Layer reduction — faster but less capacity |
| E17 | NUM_LAYERS=8 + MLP=3 | 983 | **2.98x** | 3.065 | Near-3x baseline |
| E21 | NUM_LAYERS=6 | 856 | **3.43x** | 2.954 | Smaller model, more steps |
| E24 | NUM_LAYERS=6 + MLP=2 | 725 | **4.05x** | 2.971 | Best speed/quality balance |
| E26 | + TRAIN_SEQ_LEN=1024 | 643 | **4.56x** | 2.923 | Pareto optimal on 3090 |
| E29 | MODEL_DIM=256 | 343 | **8.55x** | 2.082 | Speed record but quant 3.64 (unusable) |

**Key insight**: 3090 is compute-bound. Bigger batches are a wash. Only cutting compute (fewer layers, smaller MLP, shorter sequences) or fusing kernels gives real speedups.

## Phase 2: Champion Full-Wallclock Runs (600s Budget)

| Config | Hardware | Steps | Pre-quant BPB | Quant BPB | Quant Gap | Notes |
|--------|----------|-------|---------------|-----------|-----------|-------|
| CHAMP_A (11L + MLP=2 + int6) | 3090 | 515 | 1.600 | 4.603 | **3.00** | Int6 catastrophic failure |
| CHAMP_B (6L + MLP=2 + int6) | 3090 | 813 | 1.399 | 4.966 | **3.57** | Int6 catastrophic failure |
| CHAMP_C (default + int6) | 3090 | 431 | 1.704 | 4.801 | **3.10** | Int6 catastrophic failure |
| **CHAMP_D (6L + MLP=2 + int8)** | 3090 | 813 | **1.398** | **1.399** | **0.001** | **Int8 breakthrough** |

**Critical discovery**: GPTQ int6 has insufficient precision for converged weight distributions on small models. The quant gap goes from ~0.02 (undertrained) to 3+ BPB (converged). Switching to int8 eliminates this entirely for small models.

For the full 11L+4x architecture used in the final submission, int8 doesn't fit the 16MB cap. We use int6 (matching PR #1493) and achieve a quant gap of **10.3 mBPB** — better than PR #1493's **11.7 mBPB**.

## Final Submission Run (8xH100 SXM)

| Retry | Issue | Resolution | Cost |
|-------|-------|------------|------|
| 1 | get_data.sh missing mkdir for cached SP model | Added mkdir -p before cp | ~$1.40 |
| 2 | Bootstrap STEP 3 ran with default config (not our stack) | Skipped bootstrap STEP 3, went straight to submission | ~$3 |
| 3 | Single-GPU (run.sh used python3 not torchrun) | Auto-detect GPU count, use torchrun when >1 | ~$8 |
| 4 | Flash Attention 3 not installed | pip install flash_attn_3 from wheel | ~$5 |
| **5 (final)** | Int8 quant doesn't fit 16MB + catastrophic gap with dedup | Switched to int6 matrices + int8 embeddings (matching PR #1493) | ~$25 |

Total compute cost: ~$60 across 5 retries. Effective (non-wasted) cost: ~$25.

```json
{
  "author": "taka6745",
  "github_id": "taka6745",
  "name": "SP8192 + NL11 MLP4 + Parallel Residuals (L7+) + Gated Attention + NorMuon + Parallel Muon + Legal Score-First TTT",
  "date": "2026-04-10",
  "track": "10min_16mb",
  "val_bpb": 1.08237,
  "val_bpb_std": 0.00043,
  "seeds": [42, 314, 999],
  "seed_results": {
    "42": {"val_bpb": 1.08243, "artifact_bytes": 16051299},
    "314": {"val_bpb": 1.08192, "artifact_bytes": 16050433},
    "999": {"val_bpb": 1.08276, "artifact_bytes": 16051839}
  },
  "hardware": "8xNVIDIA H100 80GB HBM3 SXM",
  "pytorch_version": "2.9.1+cu128",
  "technique_summary": "SP8192 + 11L 4xMLP (35.99M params) + 3-Layer Depth Recurrence (L3-5, activate at frac=0.35) + Parallel Residuals (L7+, GPT-J style) + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Gated Attention + NorMuon + Norm-PCT-Dropout + Parallel Muon + Score-First TTT (SGD 3ep) + GPTQ int6 SDClip + Brotli",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": false,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  },
  "attribution": {
    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
    "legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
    "hyperparameter_tuning": "@X-Abhishek-X (PR #1445), @bigbag (PR #1493)",
    "gated_attention_normuon_norm_pct_dropout_parallel_muon": "@taka6745 (this submission)"
  },
  "notes": "Artifact is ~51KB over the 16MB cap (16,051,190 mean bytes). Known fix: CMP_QUANT_VALUE_DEDUP=1 or PARALLEL_START_LAYER=-1 (two-lane override bug). Prepped in commit ad8bb34 for retry 6."
}
```