# Polar NS + MIN_LR + GatedAttn + Alpha LoRA — 1.07006 BPB

**val_bpb: 1.07005686** (mean over seeds 1337, 42, 314)

## Results

| Seed | BPB | Train time | Eval time | Artifact |
|------|-----|------------|-----------|----------|
| 1337 | 1.07026727 | 599.6s | 480.7s | 15,977,086 B |
| 42 | 1.06964040 | 599.6s | 474.4s | 15,975,968 B |
| 314 | 1.07026291 | 599.6s | 475.8s | 15,975,620 B |
| **Mean** | **1.07005686** | | | |

All runs: train ≤600s, eval ≤600s, artifact ≤16MB.

## What this submission adds on top of PR #1768

This submission stacks three independently validated techniques from other authors
onto our PR #1768 stack:

### (1) Polar Express NS coefficients (ported from PR #1344)

Replaces Muon's fixed Newton-Schulz coefficients `(3.4445, -4.775, 2.0315)`, applied
identically for all 5 iterations of each Muon step, with 5 per-iteration minimax-optimal tuples:

```python
_PE_COEFFS = [
    (8.156554524902461, -22.48329292557795, 15.878769915207462),
    (4.042929935166739, -2.808917465908714, 0.5000178451051316),
    (3.8916678022926607, -2.772484153217685, 0.5060648178503393),
    (3.285753657755655, -2.3681294933425376, 0.46449024233003106),
    (2.3465413258596377, -1.7097828382687081, 0.42323551169305323),
]
```

`backend_steps` stays at 5, but the per-iteration minimax coefficients produce a
higher-quality polar-factor approximation per Muon step.
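
For context, here is a minimal sketch of how these tuples slot into the Newton-Schulz loop, following the shape of Muon's reference `zeropower_via_newtonschulz5`; the function name and normalization details below are illustrative, not the exact code in `train_gpt.py`:

```python
import torch

def polar_express_ns(G: torch.Tensor) -> torch.Tensor:
    # Newton-Schulz polar-factor approximation, but with a different
    # minimax-optimal (a, b, c) tuple on each of the 5 iterations
    # instead of one fixed tuple reused every time.
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # scale so singular values are <= 1
    transposed = X.size(-2) > X.size(-1)
    if transposed:                       # iterate on the wide orientation
        X = X.mT
    for a, b, c in _PE_COEFFS:           # one tuple per iteration
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X                # quintic update: aX + bAX + cA^2X
    if transposed:
        X = X.mT
    return X
```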

### (2) MIN_LR=0.10 warmdown floor (from PR #1787)

Floors the LR warmdown at 10% of max instead of 0 — the final ~25% of training
keeps delivering meaningful gradient updates instead of winding down to near-zero.
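
A minimal sketch of the floored schedule, assuming a linear warmdown over the final ~25% of steps (the exact schedule shape in `train_gpt.py` may differ):

```python
def lr_multiplier(step: int, num_steps: int,
                  warmdown_frac: float = 0.25, min_lr: float = 0.10) -> float:
    # Constant at 1.0 for most of training, then a linear warmdown over the
    # last warmdown_frac of steps, floored at min_lr instead of decaying to 0.
    warmdown_steps = int(warmdown_frac * num_steps)
    if step < num_steps - warmdown_steps:
        return 1.0
    frac_remaining = (num_steps - step) / warmdown_steps  # 1 -> 0 over warmdown
    return min_lr + (1.0 - min_lr) * frac_remaining
```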

### (3) Tight budget polish (from PR #1787)

- `GPTQ_RESERVE_SECONDS=0.5` (was 4.0)
- `VAL_LOSS_EVERY=0` (was 4000; disables periodic mid-training validation)

Together these reclaim ~15s of the 600s training budget for additional depth-3
training steps, visible in the higher step counts vs prior submissions.
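
A sketch of how these knobs plausibly interact with the wall-clock budget; the loop structure and stub functions below are assumptions for illustration, not the actual `train_gpt.py` code:

```python
import time

TRAIN_BUDGET_SECONDS = 600.0
GPTQ_RESERVE_SECONDS = 0.5   # was 4.0: wall-clock tail reserved for GPTQ
VAL_LOSS_EVERY = 0           # was 4000: 0 disables mid-training validation

def train_one_step() -> None: ...   # stand-in for one training step
def run_validation() -> None: ...   # stand-in for a mid-training eval
def run_gptq() -> None: ...         # stand-in for final quantization

t0 = time.time()
step = 0
# Train until only the GPTQ reserve remains in the 600s budget.
while time.time() - t0 < TRAIN_BUDGET_SECONDS - GPTQ_RESERVE_SECONDS:
    train_one_step()
    step += 1
    if VAL_LOSS_EVERY and step % VAL_LOSS_EVERY == 0:
        run_validation()             # never fires when VAL_LOSS_EVERY == 0
run_gptq()                           # quantize in the reserved tail
```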

## Stack summary

All techniques and their origins:

| Component | Origin |
|-----------|--------|
| SP8192 + triple depth recurrence + parallel residuals | @bigbag PR #1493, @EthanYangTW PR #1523 |
| VarLen attention + Fused Triton MLP + doc-independent LoRA TTT | @samacqua PR #1530 |
| Phased TTT | @romeerp PR #1610 |
| Multi-Phase Global SGD + Trimmed GPTQ + MATRIX_LR=0.026 | @dexhunter |
| Gated Attention | @dexhunter PR #1736 |
| Alpha/rank LoRA scaling + Warm-start A + WD=1.0 + alpha=144 | **this author, PR #1767** |
| Gate mirror in LoRA-TTT forward path + per-row int8 gate quant | **this author, PR #1768** |
| Polar Express NS coefficients | Ported from PR #1344 |
| MIN_LR=0.10 + GPTQ_RESERVE=0.5 + VAL_LOSS_EVERY=0 | Ported from @nprime06 PR #1787 |

## 3-seed trajectory

| Seed | Baseline repro from PR #1767 (mean 1.07326) | PR #1767 | PR #1768 | **This PR** |
|------|---:|---:|---:|---:|
| 1337 | 1.07423 | 1.07189 | 1.07146 | **1.07027** |
| 42 | 1.07341 | 1.07248 | 1.07014 | **1.06964** |
| 314 | 1.07214 | 1.07189 | 1.07082 | **1.07026** |
| Mean | 1.07326 | 1.07209 | 1.07081 | **1.07006** |

Every seed improves monotonically across successive submissions.

## Legality (Issue #1017)

- **Condition 1 (Causal)**: single left-to-right pass.
- **Condition 2 (Full normalized distribution)**: standard softmax over 8192 SP tokens.
- **Condition 3 (Score-before-update)**: each chunk scored in `forward_ttt_train` before the optimizer step on it.
- **Condition 4 (Single pass)**: one left-to-right pass, no rescoring.
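
A minimal sketch of the score-before-update contract these conditions describe; the loop shape and names are illustrative, not the exact `forward_ttt_train` code:

```python
def eval_with_legal_ttt(model, chunks, ttt_optimizer):
    # Single left-to-right pass: each chunk is scored with the current
    # weights BEFORE the optimizer updates on it, and nothing is rescored.
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss = model(chunk)                        # score first (mean loss per token assumed)
        total_loss += loss.item() * chunk.numel()  # logged pre-update
        total_tokens += chunk.numel()
        loss.backward()                            # then adapt on that same chunk
        ttt_optimizer.step()
        ttt_optimizer.zero_grad()
    return total_loss / total_tokens               # metric uses only pre-update scores
```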

## Reproduction

```bash
export DATA_DIR=/path/to/parameter-golf/data
torchrun --standalone --nproc_per_node=8 train_gpt.py # seed 1337
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
SEED=314 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are hardcoded as defaults in `train_gpt.py`:
`TTT_LORA_RANK=128`, `TTT_LORA_ALPHA=144`, `TTT_WARM_START_A=1`, `TTT_WEIGHT_DECAY=1.0`,
`GATED_ATTN_ENABLED=1`, `GATED_ATTN_INIT_STD=0.005`, `POLAR_EXPRESS_NS=1`, `MIN_LR=0.10`,
`GPTQ_RESERVE_SECONDS=0.5`, `VAL_LOSS_EVERY=0`, `PHASED_TTT_ENABLED=1`, `PHASED_TTT_NUM_PHASES=3`.
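
For reference, a minimal sketch of the alpha/rank LoRA scaling these defaults imply; the module shape is assumed for illustration, and the actual TTT adapter in `train_gpt.py` differs in detail:

```python
import torch

class LoRALinear(torch.nn.Module):
    # Standard alpha/rank-scaled LoRA: y = Wx + (alpha / rank) * B(Ax).
    # With rank=128 and alpha=144 the update is scaled by 144/128 = 1.125.
    def __init__(self, base: torch.nn.Linear, rank: int = 128, alpha: float = 144.0):
        super().__init__()
        self.base = base
        self.scale = alpha / rank
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```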

---

## requirements.txt

```
torch>=2.9
flash-attn>=3.0
triton>=3.5
sentencepiece
python-minifier
brotli
numpy
```

---

## Submission metadata

```json
{
  "authors": [
    {
      "name": "Renqian Luo",
      "github_id": "renqianluo"
    }
  ],
  "description": "Polar Express Newton-Schulz coefficients (ported from @orangekame3 PR #1344) stacked with MIN_LR=0.10 warmdown floor, tight GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0, on top of our PR #1768 stack (GatedAttn + gate mirror in TTT path + per-row int8 gate quant + alpha=144 LoRA + warm-start A + WD=1.0). 3-seed mean 1.07006 BPB.",
  "val_bpb": 1.07005686,
  "seed_results": {
    "1337": 1.07026727,
    "42": 1.06964040,
    "314": 1.07026291
  },
  "eval_time_seconds": {
    "1337": 480.7,
    "42": 474.4,
    "314": 475.8
  },
  "train_time_seconds": {
    "1337": 599.6,
    "42": 599.6,
    "314": 599.6
  },
  "artifact_size_bytes": {
    "1337": 15977096,
    "42": 15975978,
    "314": 15975630
  },
  "methods": [
    "Polar Express NS (ported from PR #1344): 5 per-iteration minimax-optimal (a,b,c) coefficients for Muon's Newton-Schulz iteration, replacing the single fixed (3.4445,-4.775,2.0315) tuple applied 5 times. Higher-quality polar factor per step at same backend_steps=5.",
    "MIN_LR=0.10 warmdown floor (from PR #1787 @nprime06): floors LR warmdown at 10% of max instead of 0; final ~25% of training keeps delivering useful gradients.",
    "GPTQ_RESERVE_SECONDS=0.5 (vs 4.0) + VAL_LOSS_EVERY=0 (from PR #1787 @nprime06): reclaim ~15s of 600s budget for depth-3 training steps.",
    "PR #1768 stack (this author): per-head Gated Attention with gate mirrored in _block_with_lora and _parallel_block_with_lora (without the mirror, TTT silently skips the gate and collapses), per-row int8 quantization of attn_gate_w to stay under 16MB.",
    "PR #1767 stack (this author): alpha/rank LoRA scaling, warm-start LoRA A across batches, TTT WD=1.0, alpha=144 on rank 128.",
    "Base (phased TTT + VarLen + Fused MLP + multi-phase global SGD + SD-clip GPTQ): unchanged."
  ],
  "attribution": {
    "polar_express_ns_coefficients": "Ported from PR #1344 (@orangekame3 et al)",
    "min_lr_warmdown_floor__tight_gptq_reserve__disabled_val": "Ported from PR #1787 (@nprime06)",
    "gate_mirror_ttt_path__per_row_int8_gate_quant": "Renqian Luo (PR #1768)",
    "alpha_scaled_lora__warm_start_A__higher_wd__raised_alpha": "Renqian Luo (PR #1767)",
    "gated_attention": "@dexhunter (PR #1736)",
    "varlen_attention_fused_mlp_doc_ttt": "@samacqua (PR #1530)",
    "phased_ttt_concept": "@romeerp (PR #1610)",
    "multi_phase_global_sgd_trimmed_gptq": "@dexhunter",
    "triple_recurrence_parallel_residuals": "@bigbag (PR #1493), @EthanYangTW (PR #1523)",
    "legal_ttt_framework": "@abaybektursun (PR #549)"
  },
  "legal_ttt": true,
  "compliance": {
    "train_under_600s": true,
    "eval_under_600s": true,
    "artifact_under_16mb": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  }
}
```