Non-record: Polar Express NS Coefficient Ablation on #1809 (val_bpb 1.08154) by Christopher-Lee-McClendon · Pull Request #1831 · openai/parameter-golf

Christopher-Lee-McClendon · 2026-04-26T11:07:10Z

Non-record: Polar Express NS Coefficient Ablation on SP8192 3-Layer Recurrence Stack

Ablation study: Polar Express (PE) per-iteration Newton-Schulz coefficients vs fixed coefficients on PR #1809's architecture.

Result: PE made things slightly worse (+0.00024 BPB). Fixed NS coefficients (3.4445, -4.775, 2.0315) outperform PE for this architecture.
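For readers new to the notation, here is a minimal sketch of what a coefficient triple does. One quintic Newton-Schulz step, X ← aX + (bA + cA²)X with A = XXᵀ, maps each singular value σ of X to aσ + bσ³ + cσ⁵, so a coefficient schedule can be traced on plain scalars. This is an illustration only, not the code from #1809: the function and variable names are hypothetical. The Polar Express coefficients are not listed in this thread, so the per-iteration schedule below is a stand-in that mixes the fixed triple with the textbook cubic Newton-Schulz step (1.5, -0.5, 0).

```python
# Fixed quintic triple quoted in this PR.
FIXED = (3.4445, -4.775, 2.0315)

# Textbook cubic Newton-Schulz step, used ONLY as a stand-in below.
# It is NOT the Polar Express coefficient set.
CUBIC = (1.5, -0.5, 0.0)


def ns_scalar_step(sigma, coeffs):
    """Effect of one Newton-Schulz step on a single singular value."""
    a, b, c = coeffs
    return a * sigma + b * sigma**3 + c * sigma**5


def run_schedule(sigma, schedule):
    """Apply one coefficient triple per iteration, in order."""
    for coeffs in schedule:
        sigma = ns_scalar_step(sigma, coeffs)
    return sigma


# Same triple at every iteration (the "fixed NS" variant, 5 steps).
fixed5 = [FIXED] * 5

# A different triple per iteration (hypothetical stand-in schedule).
stand_in5 = [FIXED, FIXED, CUBIC, CUBIC, CUBIC]

for s0 in (0.05, 0.5, 1.0):
    print(s0, run_schedule(s0, fixed5), run_schedule(s0, stand_in5))
```

The scalar view assumes the input matrix has been normalized so that its singular values start at or below 1, as is typical in Muon-style optimizers. Under that assumption, a schedule's step count and coefficients together determine how close the final singular values land to 1.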

Results (seed=42, 8×H100 SXM)

Variant	val_bpb (TTT)	val_bpb (sliding)	Artifact bytes
#1809 baseline (fixed NS)	1.08130	1.08262	15,989,814
#1809 + PE5 (per-iter NS)	1.08154	1.08303	15,974,228
Δ	+0.00024	+0.00041	−15,586
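The Δ row can be reproduced directly from the two rows above it. The short sketch below does that and also expresses the TTT delta relative to the baseline; the dictionary layout and key names are just for illustration.

```python
# Values copied from the results table (seed=42).
baseline = {"ttt": 1.08130, "sliding": 1.08262, "artifact_bytes": 15_989_814}
pe5 = {"ttt": 1.08154, "sliding": 1.08303, "artifact_bytes": 15_974_228}

# PE5 minus baseline; positive BPB deltas mean PE5 is worse.
deltas = {key: pe5[key] - baseline[key] for key in baseline}

# Relative size of the TTT regression, as a fraction of the baseline BPB.
relative_ttt = deltas["ttt"] / baseline["ttt"]

print(f"TTT delta:      {deltas['ttt']:+.5f} BPB")
print(f"Sliding delta:  {deltas['sliding']:+.5f} BPB")
print(f"Artifact delta: {deltas['artifact_bytes']:+,d} bytes")
print(f"Relative TTT change: {relative_ttt:.4%}")
```

In relative terms the TTT regression is about 0.02% of the baseline BPB, which is worth keeping in mind given that both numbers come from a single seed.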

Key Finding

PE is not a universal improvement. On #1809's architecture with 5 NS steps, the fixed coefficients already provide sufficient orthogonalization. The degradation is consistent across all eval modes (TTT, sliding window, quantized-only).

Methodology

Identical code except NS coefficient computation
Same seed (42), hardware (8×H100 SXM), data (FineWeb SP8192)
Both runs ~4545 steps in ~588s wallclock
Single-seed ablation: the delta is below the statistical significance threshold

Attribution

@bigbag — PR Record: SP8192 + Gram-NS + Polar Express + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0800 (3-seed mean) #1809 (base architecture)
@orangekame3 — Polar Express concept (PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344)
@nprime06 — PE integration (PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787)

Compliance

✅ Artifact ≤ 16,000,000 bytes (15,974,228)
✅ Training ≤ 600s (588s)
✅ 8×H100 SXM
✅ Self-contained, no val data during training
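The two numeric limits in this checklist can be checked mechanically. The sketch below is a hypothetical helper, not part of the submission; the limits and the PE5 figures are the ones quoted above, and the 588s wallclock is approximate.

```python
ARTIFACT_LIMIT_BYTES = 16_000_000  # artifact size cap from the checklist
TRAIN_LIMIT_SECONDS = 600          # training wallclock cap from the checklist


def compliance_margins(artifact_bytes, train_seconds):
    """Report pass/fail and remaining headroom for both numeric limits."""
    return {
        "artifact_ok": artifact_bytes <= ARTIFACT_LIMIT_BYTES,
        "artifact_headroom_bytes": ARTIFACT_LIMIT_BYTES - artifact_bytes,
        "training_ok": train_seconds <= TRAIN_LIMIT_SECONDS,
        "training_headroom_seconds": TRAIN_LIMIT_SECONDS - train_seconds,
    }


# PE5 run from this PR.
pe5 = compliance_margins(artifact_bytes=15_974_228, train_seconds=588)
print(pe5)
```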

Full details in records/track_non_record_16mb/2026-04-26_SP8192_PolarExpress_Ablation/README.md

…itecture

Testing Polar Express per-iteration Newton-Schulz coefficients on PR openai#1809's SP8192 3-layer recurrence stack.

Result: PE makes things slightly worse (+0.00024 BPB). Fixed coefficients (3.4445, -4.775, 2.0315) outperform PE.

Variants tested (seed=42, 8xH100 SXM):
- Baseline (openai#1809 fixed NS): val_bpb 1.08130, artifact 15,989,814 bytes
- PE5 (per-iter NS): val_bpb 1.08154, artifact 15,974,228 bytes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PranavViswanath · 2026-04-26T16:36:24Z

Thanks for the ablation! Quick note on two things:

Attribution: PR #1809 is mine (@PranavViswanath), not @bigbag's. #1809 forks #1493 (which is @bigbag's).

Methodology: This ablation tests PE at 5 NS steps, but #1809 uses PE at 4 NS steps; that is the key change. The Polar Express coefficients aren't meant to improve orthogonalization quality at the same step count. They provide sufficient orthogonalization in fewer steps, which saves ~20% of optimizer wall time per step and in turn yields ~150 additional training steps within the 600s budget. That's where the BPB gain comes from.
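The throughput argument can be put into a back-of-envelope model. The sketch below assumes that per-step time splits into an optimizer part and everything else, that the ~20% saving applies only to the optimizer part, and that the ~4545-step figure from this PR is a fair 5-step baseline. None of these assumptions is confirmed in the thread, and the function names are hypothetical.

```python
def extra_steps(baseline_steps, optimizer_fraction, optimizer_saving):
    """Steps gained in a fixed time budget when optimizer time shrinks.

    optimizer_fraction: share of per-step wall time spent in the optimizer.
    optimizer_saving:   fractional reduction of that optimizer time (0.2 = 20%).
    """
    new_step_cost = 1.0 - optimizer_fraction * optimizer_saving
    return baseline_steps / new_step_cost - baseline_steps


def implied_optimizer_fraction(baseline_steps, gained_steps, optimizer_saving):
    """Invert extra_steps: which optimizer share would explain the gain?"""
    speedup_cost = baseline_steps / (baseline_steps + gained_steps)
    return (1.0 - speedup_cost) / optimizer_saving


# 5 -> 4 NS steps removes one fifth of the NS iterations.
saving = 1.0 - 4 / 5

fraction = implied_optimizer_fraction(4545, 150, saving)
print(f"implied optimizer share of step time: {fraction:.3f}")
print(f"round trip, extra steps: {extra_steps(4545, fraction, saving):.1f}")
```

Under these assumptions, ~150 extra steps on a ~4545-step baseline corresponds to the optimizer taking roughly 16% of each step. Profiling would be needed to confirm whether that matches the real run.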

A more direct ablation would be: #1809 as-is (PE + 4 steps) vs fixed coefficients + 4 steps vs fixed coefficients + 5 steps. That would decompose whether the gain comes from PE specifically or from the step reduction it enables.

Appreciate the rigor, happy to chat more.

chris-lee-mc · 2026-04-26T20:28:16Z

Thanks for the heads-up! I wanted to get the draft in; I will fix these things and see if I have budget for another ablation here. Thanks for the suggestion. I liked the novel optimizer approach, so I wanted to investigate it more. Nice submission!

Christopher-Lee-McClendon · 2026-04-27T13:51:08Z

Update after PR #1851:

This PR should be interpreted narrowly. It tested PE5 vs fixed5 on a #1809-style stack and found PE5 slightly worse on seed 42 (1.08154 vs 1.08130 BPB, both at ~4545 steps / ~588s).

It does not test the newer #1851 BOS-safe SmearGate/LQER stack, and it does not test the main optimizer-throughput hypothesis of Gram-NS + 4 NS steps.

A follow-up, if profiling shows optimizer headroom, should compare:

#1851 exact (Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT, indirect 3 seed mean) vs. #1851 + Gram-NS + 4-step PolarNS
with post-TTT BPB as the primary metric
and optimizer_pct / steps_completed logged
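One way the two logged quantities could be collected is sketched below. Only the names `optimizer_pct` and `steps_completed` come from the list above; the `StepProfiler` class and the definition of `optimizer_pct` as optimizer time over total step time are assumptions for illustration.

```python
class StepProfiler:
    """Accumulates per-step timings and reports the optimizer's share."""

    def __init__(self):
        self.total_seconds = 0.0
        self.optimizer_seconds = 0.0
        self.steps_completed = 0

    def record(self, step_seconds, optimizer_seconds):
        """Add one finished training step.

        step_seconds:      wall time of the whole step.
        optimizer_seconds: portion of that time spent in the optimizer.
        """
        self.total_seconds += step_seconds
        self.optimizer_seconds += optimizer_seconds
        self.steps_completed += 1

    def summary(self):
        """Return the two quantities to log, with optimizer_pct in percent."""
        if self.total_seconds > 0:
            pct = 100.0 * self.optimizer_seconds / self.total_seconds
        else:
            pct = 0.0
        return {"optimizer_pct": pct, "steps_completed": self.steps_completed}


# Usage with durations measured elsewhere (e.g. via time.perf_counter()).
profiler = StepProfiler()
profiler.record(step_seconds=0.10, optimizer_seconds=0.02)
profiler.record(step_seconds=0.10, optimizer_seconds=0.02)
print(profiler.summary())
```

On GPU the durations passed to `record` are only meaningful if the device is synchronized before each timestamp, since CUDA kernels launch asynchronously.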

This preserves #1831 as a clean negative result while acknowledging the newer SOTA context.

Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/non-record-pe-ablation
