
Non-record: Polar Express NS Coefficient Ablation on #1809 (val_bpb 1.08154) #1831

Open

Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/non-record-pe-ablation

Conversation

@Christopher-Lee-McClendon

Non-record: Polar Express NS Coefficient Ablation on SP8192 3-Layer Recurrence Stack

Ablation study: Polar Express (PE) per-iteration Newton-Schulz coefficients vs fixed coefficients on PR #1809's architecture.

Result: PE made things slightly worse (+0.00024 BPB). Fixed NS coefficients (3.4445, -4.775, 2.0315) outperform PE for this architecture.
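For context, both variants run the same quintic Newton-Schulz orthogonalization inside the optimizer step; they differ only in whether the (a, b, c) coefficients are one fixed tuple or a per-iteration schedule. A minimal sketch of that shared iteration, assuming a Muon-style loop (the function name and normalization are illustrative, and only the fixed baseline tuple from this PR is shown, not the actual Polar Express schedule):

```python
import torch

# Fixed quintic Newton-Schulz coefficients used by the baseline (from this PR).
FIXED_ABC = (3.4445, -4.775, 2.0315)

def newton_schulz(G: torch.Tensor, steps: int = 5, schedule=None) -> torch.Tensor:
    # Approximately orthogonalize G via a quintic Newton-Schulz iteration.
    # `schedule` is a list of per-step (a, b, c) tuples (the Polar Express
    # variant); when None, the single fixed tuple is reused every step.
    coeffs = schedule if schedule is not None else [FIXED_ABC] * steps
    X = G / (G.norm() + 1e-7)  # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for a, b, c in coeffs[:steps]:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # quintic polynomial in the singular values
    return X.T if transposed else X
```

The ablation swaps only the source of `coeffs`; everything else in the two runs is identical.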

Results (seed=42, 8×H100 SXM)

Variant                      val_bpb (TTT)   val_bpb (sliding)   Artifact bytes
#1809 baseline (fixed NS)    1.08130         1.08262             15,989,814
#1809 + PE5 (per-iter NS)    1.08154         1.08303             15,974,228
Δ (PE5 − baseline)           +0.00024        +0.00041            −15,586

Key Finding

PE is not a universal improvement. On #1809's architecture with 5 NS steps, the fixed coefficients already provide sufficient orthogonalization. The degradation is consistent across all eval modes (TTT, sliding window, quantized-only).

Methodology

  • Identical code except NS coefficient computation
  • Same seed (42), hardware (8×H100 SXM), data (FineWeb SP8192)
  • Both runs: ~4545 steps in ~588s wallclock
  • Single-seed ablation; the delta is below the statistical-significance threshold

Attribution

Compliance

  • ✅ Artifact ≤ 16,000,000 bytes (15,974,228)
  • ✅ Training ≤ 600s (588s)
  • ✅ 8×H100 SXM
  • ✅ Self-contained, no val data during training

Full details in records/track_non_record_16mb/2026-04-26_SP8192_PolarExpress_Ablation/README.md

…itecture

Testing Polar Express per-iteration Newton-Schulz coefficients on PR openai#1809's
SP8192 3-layer recurrence stack. Result: PE makes things slightly worse
(+0.00024 BPB). Fixed coefficients (3.4445, -4.775, 2.0315) outperform PE.

Variants tested (seed=42, 8xH100 SXM):
- Baseline (openai#1809 fixed NS): val_bpb 1.08130, artifact 15,989,814 bytes
- PE5 (per-iter NS):               val_bpb 1.08154, artifact 15,974,228 bytes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PranavViswanath

Thanks for the ablation! Quick note on two things:

Attribution: PR #1809 is mine (@PranavViswanath), not @bigbag's. #1809 forks #1493 (which is @bigbag's).

Methodology: This ablation tests PE at 5 NS steps, but #1809 uses PE at 4 NS steps; that step reduction is the key change. The Polar Express coefficients aren't meant to improve orthogonalization quality at the same step count. They provide sufficient orthogonalization in fewer steps, which saves ~20% of optimizer wall time per step and buys ~150 additional training steps within the 600s budget. That's where the BPB gain comes from.
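To make the throughput arithmetic concrete, a rough back-of-the-envelope sketch. The 600s budget, the ~4545-step / ~588s baseline, and the ~20% optimizer savings come from this thread; the optimizer's share of step time is an assumption, picked so the output lands near the quoted ~150 extra steps:

```python
# Rough step-budget math for the PE-at-4-steps argument above.
BUDGET_S = 600.0
BASE_STEPS, BASE_WALL_S = 4545, 588.0   # from this PR's runs
step_time = BASE_WALL_S / BASE_STEPS    # ~0.129 s/step

OPT_FRACTION = 0.16   # ASSUMPTION: optimizer's share of step time (not measured here)
PE_SAVINGS = 0.20     # ~20% optimizer wall time saved by dropping 5 NS steps to 4

pe_step_time = step_time * (1 - OPT_FRACTION * PE_SAVINGS)
extra_steps = BUDGET_S / pe_step_time - BUDGET_S / step_time
print(f"~{extra_steps:.0f} extra steps in the {BUDGET_S:.0f}s budget")  # ~153
```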

A more direct ablation would be: #1809 as-is (PE + 4 steps) vs fixed coefficients + 4 steps vs fixed coefficients + 5 steps. That would decompose whether the gain comes from PE specifically or from the step reduction it enables.
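That three-way grid, sketched as a hypothetical config (keys and field names are illustrative, not the repo's actual settings):

```python
# Hypothetical variant grid for the suggested decomposition ablation.
VARIANTS = {
    "pe4":    dict(ns_steps=4, coeff_schedule="polar_express"),  # PR 1809 as-is
    "fixed4": dict(ns_steps=4, coeff_schedule="fixed"),  # coefficient effect at equal steps
    "fixed5": dict(ns_steps=5, coeff_schedule="fixed"),  # step-count effect at equal coefficients
}
```

Comparing pe4 vs fixed4 isolates the coefficients; fixed4 vs fixed5 isolates the step reduction.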

Appreciate the rigor; happy to chat more.

@chris-lee-mc

Thanks for the heads-up! I wanted to get the draft in, so I'll fix these things and see whether I have budget for another ablation here; thanks for the suggestion. I liked the novel optimizer approach and wanted to investigate it further. Nice submission!

@Christopher-Lee-McClendon (Author)

Update after PR #1851:

This PR should be interpreted narrowly. It tested PE5 vs fixed5 on a #1809-style stack and found PE5 slightly worse on seed 42 (1.08154 vs 1.08130 BPB, both at ~4545 steps / ~588s).

It does not test the newer #1851 BOS-safe SmearGate/LQER stack, and it does not test the main optimizer-throughput hypothesis of Gram-NS + 4 NS steps.

A follow-up, if profiling shows optimizer headroom, should compare PE + 4 steps vs fixed coefficients + 4 steps vs fixed coefficients + 5 steps, per the decomposition suggested above.

This preserves #1831 as a clean negative result while acknowledging the newer SOTA context.
