# Polar NS + MIN_LR + GatedAttn + Alpha LoRA — 1.07006 BPB

**val_bpb: 1.07005686** (mean over seeds 1337, 42, 314)

## Results

| Seed | BPB | Train time | Eval time | Artifact |
|------|-----|------------|-----------|----------|
| 1337 | 1.07026727 | 599.6s | 480.7s | 15,977,086 B |
| 42 | 1.06964040 | 599.6s | 474.4s | 15,975,968 B |
| 314 | 1.07026291 | 599.6s | 475.8s | 15,975,620 B |
| **Mean** | **1.07005686** | | | |

All runs: train ≤600s, eval ≤600s, artifact ≤16MB.

## What this submission adds on top of PR #1768

This submission stacks three independently validated techniques from other authors
onto our PR #1768 stack:

### (1) Polar Express NS coefficients (ported from PR #1344)

Replaces Muon's fixed Newton-Schulz coefficients `(3.4445, -4.775, 2.0315)`, applied
identically for all 5 iterations of each Muon step, with 5 per-iteration minimax-optimal tuples:

```python
_PE_COEFFS = [
    (8.156554524902461, -22.48329292557795, 15.878769915207462),
    (4.042929935166739, -2.808917465908714, 0.5000178451051316),
    (3.8916678022926607, -2.772484153217685, 0.5060648178503393),
    (3.285753657755655, -2.3681294933425376, 0.46449024233003106),
    (2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]
```

`backend_steps` stays at 5, but the per-iteration minimax coefficients produce a
higher-quality polar-factor approximation per Muon step.
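
For context, here is a minimal sketch of how these tuples slot into the Newton-Schulz loop, following the shape of Muon's reference `zeropower_via_newtonschulz5`; the function name and normalization details below are illustrative, not the exact code in `train_gpt.py`:

```python
import torch

def polar_express_ns(G: torch.Tensor) -> torch.Tensor:
    # Newton-Schulz polar-factor approximation, but with a different
    # minimax-optimal (a, b, c) tuple on each of the 5 iterations
    # instead of one fixed tuple reused every time.
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # scale so singular values are <= 1
    transposed = X.size(-2) > X.size(-1)
    if transposed:                       # iterate on the wide orientation
        X = X.mT
    for a, b, c in _PE_COEFFS:           # one tuple per iteration
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X                # quintic update: aX + bAX + cA^2X
    if transposed:
        X = X.mT
    return X
```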

### (2) MIN_LR=0.10 warmdown floor (from PR #1787)

Floors the LR warmdown at 10% of max instead of 0 — the final ~25% of training
keeps delivering meaningful gradient updates instead of winding down to near-zero.
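
A minimal sketch of the floored schedule, assuming a linear warmdown over the final ~25% of steps (the exact schedule shape in `train_gpt.py` may differ):

```python
def lr_multiplier(step: int, num_steps: int,
                  warmdown_frac: float = 0.25, min_lr: float = 0.10) -> float:
    # Constant at 1.0 for most of training, then a linear warmdown over the
    # last warmdown_frac of steps, floored at min_lr instead of decaying to 0.
    warmdown_steps = int(warmdown_frac * num_steps)
    if step < num_steps - warmdown_steps:
        return 1.0
    frac_remaining = (num_steps - step) / warmdown_steps  # 1 -> 0 over warmdown
    return min_lr + (1.0 - min_lr) * frac_remaining
```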

### (3) Tight budget polish (from PR #1787)

- `GPTQ_RESERVE_SECONDS=0.5` (was 4.0)
- `VAL_LOSS_EVERY=0` (was 4000; disables periodic mid-training validation)

Together these reclaim ~15s of the 600s training budget for additional depth-3
training steps, visible in the higher step counts vs prior submissions.
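
A sketch of how these knobs plausibly interact with the wall-clock budget; the loop structure and stub functions below are assumptions for illustration, not the actual `train_gpt.py` code:

```python
import time

TRAIN_BUDGET_SECONDS = 600.0
GPTQ_RESERVE_SECONDS = 0.5   # was 4.0: wall-clock tail reserved for GPTQ
VAL_LOSS_EVERY = 0           # was 4000: 0 disables mid-training validation

def train_one_step() -> None: ...   # stand-in for one training step
def run_validation() -> None: ...   # stand-in for a mid-training eval
def run_gptq() -> None: ...         # stand-in for final quantization

t0 = time.time()
step = 0
# Train until only the GPTQ reserve remains in the 600s budget.
while time.time() - t0 < TRAIN_BUDGET_SECONDS - GPTQ_RESERVE_SECONDS:
    train_one_step()
    step += 1
    if VAL_LOSS_EVERY and step % VAL_LOSS_EVERY == 0:
        run_validation()             # never fires when VAL_LOSS_EVERY == 0
run_gptq()                           # quantize in the reserved tail
```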

## Stack summary

All techniques and their origins:

| Component | Origin |
|-----------|--------|
| SP8192 + triple depth recurrence + parallel residuals | @bigbag PR #1493, @EthanYangTW PR #1523 |
| VarLen attention + Fused Triton MLP + doc-independent LoRA TTT | @samacqua PR #1530 |
| Phased TTT | @romeerp PR #1610 |
| Multi-Phase Global SGD + Trimmed GPTQ + MATRIX_LR=0.026 | @dexhunter |
| Gated Attention | @dexhunter PR #1736 |
| Alpha/rank LoRA scaling + Warm-start A + WD=1.0 + alpha=144 | **this author, PR #1767** |
| Gate mirror in LoRA-TTT forward path + per-row int8 gate quant | **this author, PR #1768** |
| Polar Express NS coefficients | Ported from PR #1344 |
| MIN_LR=0.10 + GPTQ_RESERVE=0.5 + VAL_LOSS_EVERY=0 | Ported from @nprime06 PR #1787 |

## 3-seed trajectory

| Seed | Baseline repro from PR #1767 (mean 1.07326) | PR #1767 | PR #1768 | **This PR** |
|------|---:|---:|---:|---:|
| 1337 | 1.07423 | 1.07189 | 1.07146 | **1.07027** |
| 42 | 1.07341 | 1.07248 | 1.07014 | **1.06964** |
| 314 | 1.07214 | 1.07189 | 1.07082 | **1.07026** |
| Mean | 1.07326 | 1.07209 | 1.07081 | **1.07006** |

Every seed improves monotonically across successive submissions.

## Legality (Issue #1017)

- **Condition 1 (Causal)**: single left-to-right pass.
- **Condition 2 (Full normalized distribution)**: standard softmax over 8192 SP tokens.
- **Condition 3 (Score-before-update)**: each chunk scored in `forward_ttt_train` before the optimizer step on it.
- **Condition 4 (Single pass)**: one left-to-right pass, no rescoring.
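
A minimal sketch of the score-before-update contract these conditions describe; the loop shape and names are illustrative, not the exact `forward_ttt_train` code:

```python
def eval_with_legal_ttt(model, chunks, ttt_optimizer):
    # Single left-to-right pass: each chunk is scored with the current
    # weights BEFORE the optimizer updates on it, and nothing is rescored.
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss = model(chunk)                        # score first (mean loss per token assumed)
        total_loss += loss.item() * chunk.numel()  # logged pre-update
        total_tokens += chunk.numel()
        loss.backward()                            # then adapt on that same chunk
        ttt_optimizer.step()
        ttt_optimizer.zero_grad()
    return total_loss / total_tokens               # metric uses only pre-update scores
```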

## Reproduction

```bash
export DATA_DIR=/path/to/parameter-golf/data
torchrun --standalone --nproc_per_node=8 train_gpt.py # seed 1337
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=314 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are hardcoded as defaults in `train_gpt.py`:
`TTT_LORA_RANK=128`, `TTT_LORA_ALPHA=144`, `TTT_WARM_START_A=1`, `TTT_WEIGHT_DECAY=1.0`,
`GATED_ATTN_ENABLED=1`, `GATED_ATTN_INIT_STD=0.005`, `POLAR_EXPRESS_NS=1`, `MIN_LR=0.10`,
`GPTQ_RESERVE_SECONDS=0.5`, `VAL_LOSS_EVERY=0`, `PHASED_TTT_ENABLED=1`, `PHASED_TTT_NUM_PHASES=3`.
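
For reference, a minimal sketch of the alpha/rank LoRA scaling these defaults imply; the module shape is assumed for illustration, and the actual TTT adapter in `train_gpt.py` differs in detail:

```python
import torch

class LoRALinear(torch.nn.Module):
    # Standard alpha/rank-scaled LoRA: y = Wx + (alpha / rank) * B(Ax).
    # With rank=128 and alpha=144 the update is scaled by 144/128 = 1.125.
    def __init__(self, base: torch.nn.Linear, rank: int = 128, alpha: float = 144.0):
        super().__init__()
        self.base = base
        self.scale = alpha / rank
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```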

---

## requirements.txt

```
torch>=2.9
flash-attn>=3.0
triton>=3.5
sentencepiece
python-minifier
brotli
numpy
```

---

## Submission metadata

```json
{
  "authors": [
    {
      "name": "Renqian Luo",
      "github_id": "renqianluo"
    }
  ],
  "description": "Polar Express Newton-Schulz coefficients (ported from @orangekame3 PR #1344) stacked with MIN_LR=0.10 warmdown floor, tight GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0, on top of our PR #1768 stack (GatedAttn + gate mirror in TTT path + per-row int8 gate quant + alpha=144 LoRA + warm-start A + WD=1.0). 3-seed mean 1.07006 BPB.",
  "val_bpb": 1.07005686,
  "seed_results": {
    "1337": 1.07026727,
    "42": 1.06964040,
    "314": 1.07026291
  },
  "eval_time_seconds": {
    "1337": 480.7,
    "42": 474.4,
    "314": 475.8
  },
  "train_time_seconds": {
    "1337": 599.6,
    "42": 599.6,
    "314": 599.6
  },
  "artifact_size_bytes": {
    "1337": 15977096,
    "42": 15975978,
    "314": 15975630
  },
  "methods": [
    "Polar Express NS (ported from PR #1344): 5 per-iteration minimax-optimal (a,b,c) coefficients for Muon's Newton-Schulz iteration, replacing the single fixed (3.4445,-4.775,2.0315) tuple applied 5 times. Higher-quality polar factor per step at same backend_steps=5.",
    "MIN_LR=0.10 warmdown floor (from PR #1787 @nprime06): floors LR warmdown at 10% of max instead of 0; final ~25% of training keeps delivering useful gradients.",
    "GPTQ_RESERVE_SECONDS=0.5 (vs 4.0) + VAL_LOSS_EVERY=0 (from PR #1787 @nprime06): reclaim ~15s of 600s budget for depth-3 training steps.",
    "PR #1768 stack (this author): per-head Gated Attention with gate mirrored in _block_with_lora and _parallel_block_with_lora (without the mirror, TTT silently skips the gate and collapses), per-row int8 quantization of attn_gate_w to stay under 16MB.",
    "PR #1767 stack (this author): alpha/rank LoRA scaling, warm-start LoRA A across batches, TTT WD=1.0, alpha=144 on rank 128.",
    "Base (phased TTT + VarLen + Fused MLP + multi-phase global SGD + SD-clip GPTQ): unchanged."
  ],
  "attribution": {
    "polar_express_ns_coefficients": "Ported from PR #1344 (@orangekame3 et al)",
    "min_lr_warmdown_floor__tight_gptq_reserve__disabled_val": "Ported from PR #1787 (@nprime06)",
    "gate_mirror_ttt_path__per_row_int8_gate_quant": "Renqian Luo (PR #1768)",
    "alpha_scaled_lora__warm_start_A__higher_wd__raised_alpha": "Renqian Luo (PR #1767)",
    "gated_attention": "@dexhunter (PR #1736)",
    "varlen_attention_fused_mlp_doc_ttt": "@samacqua (PR #1530)",
    "phased_ttt_concept": "@romeerp (PR #1610)",
    "multi_phase_global_sgd_trimmed_gptq": "@dexhunter",
    "triple_recurrence_parallel_residuals": "@bigbag (PR #1493), @EthanYangTW (PR #1523)",
    "legal_ttt_framework": "@abaybektursun (PR #549)"
  },
  "legal_ttt": true,
  "compliance": {
    "train_under_600s": true,
    "eval_under_600s": true,
    "artifact_under_16mb": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  }
}
```