@@ -0,0 +1,80 @@
# Non-record: Polar Express NS Coefficient Ablation on SP8192 3-Layer Recurrence Stack

**Ablation study: Polar Express per-iteration Newton-Schulz coefficients vs fixed coefficients on PR #1809's architecture.**

**Result: Polar Express made things slightly worse (+0.00024 BPB on the TTT evaluation).** The fixed NS coefficients `(3.4445, -4.775, 2.0315)` used in #1809 outperform PE's per-iteration optimal coefficients for this architecture.

## Results (seed=42, 8×H100 SXM)

| Variant | val_bpb (TTT) | val_bpb (sliding) | val_bpb (quant, no TTT) | Pre-quant post-EMA | Steps | Train time | Artifact bytes |
|---------|---------------|-------------------|--------------------------|---------------------|-------|------------|----------------|
| **#1809 baseline** (fixed NS) | **1.08130** | 1.08262 | 1.09922 | 1.08805 | 4542 | 588s | 15,989,814 |
| **#1809 + PE5** (per-iter NS) | 1.08154 | 1.08303 | 1.09974 | 1.08825 | 4547 | 588s | 15,974,228 |
| **Δ (PE5 − baseline)** | **+0.00024** | +0.00041 | +0.00052 | +0.00020 | +5 | ~0s | −15,586 |

The PE variant is consistently worse across all evaluation modes (TTT, sliding window, quantized-only, pre-quant).

## Background

### What is Polar Express?

Polar Express (PE) replaces the fixed Newton-Schulz (NS) polynomial coefficients in the Muon optimizer with per-iteration optimal coefficients computed from the spectral radius of the current iterate. This was introduced by @orangekame3 in PR #1344 and applied in PR #1787 by @nprime06.

The fixed coefficients `(3.4445, -4.775, 2.0315)` are a single-point approximation to the optimal orthogonalization polynomial, applied unchanged at every iteration. PE instead computes spectrally optimal coefficients at each NS iteration, which should in principle yield a closer approximation to the orthogonal projection.
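
For reference, a minimal sketch of a fixed-coefficient quintic NS step of the kind used in Muon. This is illustrative only: real implementations differ in details such as bfloat16 casts, the normalization epsilon, and transposing tall matrices before the loop.

```python
import torch

def newton_schulz_fixed(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz orthogonalization with the fixed coefficients cited above.
    # The same polynomial p(X) = a*X + b*(X X^T) X + c*(X X^T)^2 X is applied every step.
    # Assumes G is square or wide; Muon transposes tall matrices first.
    a, b, c = 3.4445, -4.775, 2.0315
    X = G / (G.norm() + 1e-7)  # scale so all singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X
```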

### Why test on #1809?

PR #1809 by @bigbag holds a top leaderboard position (claimed val_bpb 1.08079) using an SP8192 tokenizer with a 3-layer depth-recurrence stack, INT5 mixed-precision QAT, zstd compression, and aggressive test-time training. It uses 5 NS steps in Muon. We tested whether PE could improve these NS steps.

### What we changed

In the PE5 variant, we replaced the fixed coefficients with per-iteration optimal coefficients computed via:
```python
# X is the current Newton-Schulz iterate; optimal_coeffs is PE's coefficient solver.
spectral_radius = (X @ X.T).norm()         # Frobenius norm of X X^T as a cheap spectral-radius proxy
a, b, c = optimal_coeffs(spectral_radius)  # recomputed at every NS iteration
```
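
Conceptually, the PE5 loop differs from the fixed-coefficient loop only in recomputing `(a, b, c)` inside the iteration. A hedged sketch of where that slots in, using `optimal_coeffs` as the assumed PE helper from the snippet above rather than the exact submission code:

```python
import torch

def newton_schulz_pe(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Sketch only: same quintic iteration as the fixed-coefficient version,
    # but the coefficients are chosen per step from a Frobenius-norm proxy
    # of the iterate's spectral radius. optimal_coeffs is the assumed PE helper.
    X = G / (G.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        spectral_radius = A.norm()                 # Frobenius proxy, as above
        a, b, c = optimal_coeffs(spectral_radius)  # per-iteration PE coefficients
        X = a * X + (b * A + c * (A @ A)) @ X
    return X
```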

Everything else (architecture, hyperparameters, data, seed, hardware) was held constant.

## Key Finding

**Polar Express does NOT improve #1809's architecture.** The degradation is small (+0.00024 BPB with TTT) but consistent across all evaluation modes. Possible explanations:

1. **The fixed coefficients are already well-tuned** for the spectral distribution encountered during #1809's training.
2. **PE's spectral radius estimation via Frobenius norm** may introduce noise that slightly degrades convergence (see the illustration after this list).
3. **5 NS steps already provide sufficient orthogonalization** — the marginal improvement from optimal coefficients doesn't overcome the estimation overhead.
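
On point 2, the Frobenius norm is only a loose upper bound on the spectral norm for matrices that are far from rank-1, so a proxy built on it can sit well above the true spectral radius. A small illustration, not taken from the submission code; whether this matters in practice depends on how the PE coefficient solver uses the proxy:

```python
import torch

torch.manual_seed(0)
X = torch.randn(512, 512) / 512 ** 0.5              # near-isotropic 512x512 block
spectral = torch.linalg.matrix_norm(X, ord=2)        # true largest singular value, ~2
frobenius = torch.linalg.matrix_norm(X, ord="fro")   # Frobenius norm, ~sqrt(512) ~= 22.6
print(f"spectral: {spectral:.2f}  frobenius: {frobenius:.2f}  ratio: {frobenius / spectral:.1f}x")
```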

This is a negative result but a useful data point: PE is not a universal improvement and its benefit may depend on the specific architecture and training regime.

## Methodology

- Both runs used identical code (decoded from #1809's submission artifact) except for the NS coefficient computation.
- Same seed (42), same hardware (8×H100 80GB SXM), same data (FineWeb SP8192).
- Training was wallclock-capped at 600s; both runs completed ~4545 steps in ~588s.
- Single seed: this is an ablation study, not a SOTA claim. The delta (+0.00024 BPB, roughly 0.00017 nats per byte) is well below the statistical significance threshold (0.005 nats) required for SOTA claims.

## Attribution

- **@bigbag** — PR #1809 (base architecture and training recipe)
- **@orangekame3** — Polar Express concept (PR #1344)
- **@nprime06** — PE integration in parameter-golf (PR #1787)

## Compliance

| Requirement | Status |
|-------------|--------|
| Artifact ≤ 16,000,000 bytes | ✅ 15,974,228 bytes (PE5) / 15,989,814 bytes (baseline) |
| Training ≤ 600s wallclock | ✅ ~588s both runs |
| 8×H100 SXM hardware | ✅ |
| No validation data during training | ✅ |
| Self-contained artifact | ✅ |

**Note:** This is a non-record submission. The baseline reproduction (1.08130) does not match #1809's claimed 1.08079, likely due to non-determinism across different H100 pod configurations. The ablation comparison is valid because both runs used the same pod setup and seed.

## Files

- `train_gpt.py` — Decoded #1809 source (base variant with fixed NS coefficients)
- `train_seed42.log` — Full training log for PE5 ablation run
- `baseline_seed42.log` — Full training log for #1809 reproduction (fixed NS)
- `submission.json` — Submission metadata
- `README.md` — This file
@@ -0,0 +1,157 @@
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
W0426 04:36:17.459000 1018 torch/distributed/run.py:851]
W0426 04:36:17.459000 1018 torch/distributed/run.py:851] *****************************************
W0426 04:36:17.459000 1018 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0426 04:36:17.459000 1018 torch/distributed/run.py:851] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/98b7f9f3-4f33-46a5-9770-695cda460eec.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 98b7f9f3-4f33-46a5-9770-695cda460eec
scalar_lr: 0.02
seed: 42
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 40540160
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0090 val_bpb: 3.4877
1/20000 train_loss: 9.0104 train_time: 0.0m tok/s: 8276452
2/20000 train_loss: 12.3634 train_time: 0.0m tok/s: 8144023
3/20000 train_loss: 11.0159 train_time: 0.0m tok/s: 8039151
4/20000 train_loss: 9.4599 train_time: 0.0m tok/s: 7995376
5/20000 train_loss: 8.3313 train_time: 0.0m tok/s: 7962786
500/20000 train_loss: 3.3821 train_time: 0.8m tok/s: 7755795
1000/20000 train_loss: 3.2857 train_time: 1.7m tok/s: 7738638
1500/20000 train_loss: 3.1828 train_time: 2.5m tok/s: 7739981
2000/20000 train_loss: 3.0711 train_time: 3.4m tok/s: 7742129
layer_loop:enabled step:2027 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.1227 train_time: 4.6m tok/s: 7087068
3000/20000 train_loss: 2.9016 train_time: 5.9m tok/s: 6638943
3500/20000 train_loss: 2.9427 train_time: 7.2m tok/s: 6389207
4000/20000 train_loss: 2.8231 train_time: 8.4m tok/s: 6221648
4000/20000 val_loss: 2.8738 val_bpb: 1.1126
4500/20000 train_loss: 2.8440 train_time: 9.7m tok/s: 6082115
4542/20000 val_loss: 2.8137 val_bpb: 1.0893
stopping_early: wallclock_cap train_time: 588154ms step: 4542/20000
peak memory allocated: 39045 MiB reserved: 39120 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.81054112 val_bpb:1.08804805 eval_time:7374ms
Serialized model: 135431033 bytes
Code size: 16594 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.8s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15973220 bytes
Total submission size quantized+brotli: 15989814 bytes
quantized val_loss:2.83940790 val_bpb:1.09922328 eval_time:25128ms
quantized_sliding_window val_loss:2.79652141 val_bpb:1.08262058 eval_time:125936ms
ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
quantized_ttt val_loss:2.79310299 val_bpb:1.08129721 eval_time:383996ms
total 147884
drwxr-xr-x 2 root root 148 Apr 26 04:59 .
drwx------ 1 root root 4096 Apr 26 04:50 ..
-rw-r--r-- 1 root root 9402 Apr 26 04:59 98b7f9f3-4f33-46a5-9770-695cda460eec.txt
-rw-r--r-- 1 root root 15973220 Apr 26 04:59 final_model.int6.ptz
-rw-r--r-- 1 root root 135431033 Apr 26 04:59 final_model.pt
-rw-r--r-- 1 root root 5764 Apr 26 04:59 pgolf_stdout.txt
@@ -0,0 +1,34 @@
{
"author": "Christopher-Lee-McClendon",
"github_id": "Christopher-Lee-McClendon",
"name": "Polar Express NS Coefficient Ablation on SP8192 3-Layer Recurrence (PR #1809)",
"blurb": "Ablation testing Polar Express per-iteration Newton-Schulz coefficients on PR #1809's architecture. PE5 yields val_bpb 1.08154 vs baseline 1.08130 — a regression of +0.00024 BPB. Fixed coefficients (3.4445, -4.775, 2.0315) outperform PE for this setup.",
"date": "2026-04-26",
"track": "non_record_16mb",
"base_pr": 1809,
"variants": {
"baseline_fixed_ns": {
"val_bpb": 1.08130,
"val_bpb_sliding": 1.08262,
"val_bpb_quantized": 1.09922,
"val_loss_ttt": 2.79310,
"artifact_bytes": 15989814,
"steps": 4542,
"train_time_seconds": 588
},
"polar_express_5step": {
"val_bpb": 1.08154,
"val_bpb_sliding": 1.08303,
"val_bpb_quantized": 1.09974,
"val_loss_ttt": 2.79372,
"artifact_bytes": 15974228,
"steps": 4547,
"train_time_seconds": 588
}
},
"val_bpb": 1.08154,
"artifact_bytes": 15974228,
"seeds": [42],
"hardware": "8xH100 80GB SXM",
"finding": "negative — Polar Express NS coefficients do not improve PR #1809 architecture"
}