Non-record: Polar Express NS Coefficient Ablation on #1809 (val_bpb 1.08154) #1831
…itecture

Testing Polar Express per-iteration Newton-Schulz coefficients on PR openai#1809's SP8192 3-layer recurrence stack.

Result: PE makes things slightly worse (+0.00024 BPB). Fixed coefficients (3.4445, -4.775, 2.0315) outperform PE.

Variants tested (seed=42, 8xH100 SXM):
- Baseline (openai#1809 fixed NS): val_bpb 1.08130, artifact 15,989,814 bytes
- PE5 (per-iter NS): val_bpb 1.08154, artifact 15,974,228 bytes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thanks for the ablation! Quick note on two things:

Attribution: PR #1809 is mine (@PranavViswanath), not @bigbag's. #1809 forks #1493, which is @bigbag's.

Methodology: This ablation tests PE at 5 NS steps, but #1809 uses PE at 4 NS steps; that is the key change. The Polar Express coefficients aren't meant to improve orthogonalization quality at the same step count. They provide sufficient orthogonalization in fewer steps, which saves ~20% optimizer wall time per step and in turn yields ~150 additional training steps within the 600s budget. That's where the BPB gain comes from.

A more direct ablation would be: #1809 as-is (PE + 4 steps) vs fixed coefficients + 4 steps vs fixed coefficients + 5 steps. That would decompose whether the gain comes from PE specifically or from the step reduction it enables.

Appreciate the rigor, happy to chat more.
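As a rough sanity check on the throughput argument above, here is a minimal sketch of the step-budget arithmetic. It assumes a simple model that is not stated anywhere in this thread: the wall-clock budget is fixed, and cutting optimizer time by a fraction `optimizer_saving` shrinks each step by `optimizer_saving * f`, where `f` is the optimizer's share of step time. The function name is hypothetical, and the base step count of ~4545 is taken from the 5-step runs reported for this PR.

```python
def implied_optimizer_share(base_steps, extra_steps, optimizer_saving):
    """Optimizer share of per-step wall time implied by a step gain.

    Assumed model (not from the PR): with a fixed wall-clock budget,
    new_step_time = old_step_time * (1 - optimizer_saving * f),
    so base_steps / (base_steps + extra_steps) = 1 - optimizer_saving * f.
    """
    time_ratio = base_steps / (base_steps + extra_steps)  # new / old step time
    return (1.0 - time_ratio) / optimizer_saving


# Numbers quoted in this thread: ~4545 steps at 5 NS steps,
# ~150 extra steps, ~20% optimizer wall-time saving.
f = implied_optimizer_share(base_steps=4545, extra_steps=150, optimizer_saving=0.20)
print(f"implied optimizer share of step time: {f:.1%}")
```

Under these assumptions the quoted figures are mutually consistent if the optimizer accounts for roughly 16% of each step; profiling would be needed to confirm the real share.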
Thanks for the heads-up! I wanted to get the draft in; I will fix these things and see if I have budget for another ablation here. Thanks for the suggestion. I liked the novel optimizer approach, so I wanted to investigate it more. Nice submission!
Update after PR #1851: This PR should be interpreted narrowly. It tested PE5 vs fixed5 on a #1809-style stack and found PE5 slightly worse on seed 42 (1.08154 vs 1.08130 BPB, both at ~4545 steps / ~588s). It does not test the newer #1851 BOS-safe SmearGate/LQER stack, and it does not test the main optimizer-throughput hypothesis of Gram-NS + 4 NS steps. A follow-up, if profiling shows optimizer headroom, should compare:
This preserves #1831 as a clean negative result while acknowledging the newer SOTA context.
Non-record: Polar Express NS Coefficient Ablation on SP8192 3-Layer Recurrence Stack
Ablation study: Polar Express (PE) per-iteration Newton-Schulz coefficients vs fixed coefficients on PR #1809's architecture.
Result: PE made things slightly worse (+0.00024 BPB). Fixed NS coefficients (3.4445, -4.775, 2.0315) outperform PE for this architecture.
Results (seed=42, 8×H100 SXM)
Key Finding
PE is not a universal improvement. On #1809's architecture with 5 NS steps, the fixed coefficients already provide sufficient orthogonalization. The degradation is consistent across all eval modes (TTT, sliding window, quantized-only).
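To make "sufficient orthogonalization" concrete, here is a minimal sketch of how the fixed coefficients act on a single singular value. It assumes the triple (3.4445, -4.775, 2.0315) is the (a, b, c) of the quintic Newton-Schulz update X ← aX + b(XXᵀ)X + c(XXᵀ)²X used in Muon-style optimizers, applied to an input normalized so its singular values are at most 1. Under that assumption each singular value s follows s ← a·s + b·s³ + c·s⁵. The function names are hypothetical, and the Polar Express per-iteration values are not reproduced here; a per-iteration schedule would simply be a list with a different triple at each position.

```python
# Fixed quintic Newton-Schulz coefficients quoted in this PR.
FIXED = (3.4445, -4.775, 2.0315)


def ns_scalar_step(s, coeffs):
    """One Newton-Schulz step seen by a single singular value s."""
    a, b, c = coeffs
    s2 = s * s
    return s * (a + s2 * (b + c * s2))  # a*s + b*s^3 + c*s^5


def ns_scalar(s, schedule):
    """Apply one (a, b, c) triple per iteration, in order."""
    for coeffs in schedule:
        s = ns_scalar_step(s, coeffs)
    return s


fixed5 = [FIXED] * 5  # same triple at every step ("fixed5" in this PR)
for s0 in (0.01, 0.1, 0.5, 1.0):
    print(f"s0={s0:<5} -> {ns_scalar(s0, fixed5):.4f}")
```

The fixed triple does not map 1 to 1 (a single step sends s = 1 to about 0.701), so singular values end up in a band around 1 rather than converging exactly. Whether that band is tight enough after 4 or 5 steps is the question this ablation touches on.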
Methodology
Attribution
Compliance
Full details in records/track_non_record_16mb/2026-04-26_SP8192_PolarExpress_Ablation/README.md