Skip to content

Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335#1787

Open
nprime06 wants to merge 3 commits intoopenai:mainfrom
nprime06:submission/polar-sparse-minlr-fusedce
Open

Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335#1787
nprime06 wants to merge 3 commits intoopenai:mainfrom
nprime06:submission/polar-sparse-minlr-fusedce

Conversation

@nprime06
Copy link
Copy Markdown

@nprime06 nprime06 commented Apr 23, 2026

Summary

Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval)

Seed Steps Post-EMA (pre-quant) Quantized Post-TTT Artifact (bytes)
42 4961 1.06764 1.07699 1.06400 15,940,380
0 4957 1.06667 1.07603 1.06308 15,939,508
1234 4964 1.06665 1.07595 1.06297 15,939,918
Mean 4961 1.06699 1.07632 1.06335 15,939,935
Std 0.00057 0.00058 0.00054 436

Head-to-head vs PR #1736 (matched seeds)

Seed This PR PR #1736 Δ (mBPP)
42 1.06400 1.06610 −2.10
0 1.06308 1.06473 −1.65
1234 1.06297 1.06563 −2.66
Mean 1.06335 1.06549 −2.14

What this adds over PR #1736

Training-time wins (all ablation-validated on seed 0 before stacking):

  • Polar Express Newton-Schulz coefficients (ported from PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344): replaces Muon's fixed (3.4445, -4.775, 2.0315) × 5 with 5 per-iteration minimax tuples inside zeropower_via_newtonschulz5 at unchanged MUON_BACKEND_STEPS=5.
  • MIN_LR=0.10 warmdown floor: floors LR at 10% of max instead of 0, so the final ~25% of training delivers useful gradient updates instead of frozen no-ops.
  • Sparse attention head-output gate (modded-nanogpt pattern): replaces dense GatedAttn (8, 512) = 4096 params/layer with narrow-input (8, gate_window=12) = 96 params/layer, preserving the attn_gate_w name so the int8-per-row quant path still routes it (size-range check widened to 32..8192). Saves ~44 KB.
  • Fused softcapped CE (Triton, training-only): single streaming kernel computes (softcap·tanh, LSE, per-row loss) in one pass on the training forward. Eval path (forward_logits) keeps eager softcap·tanh + F.cross_entropy numerics unchanged.

TTT improvements (from PR #1767, eval-only — zero training/artifact impact):

  • TTT_LORA_ALPHA=144: rank-scaled LoRA output (was implicit=96), decouples magnitude from rank.
  • TTT_WARM_START_A=1: keep LoRA A warm across per-doc resets, only zero B.
  • TTT_WEIGHT_DECAY=1.0: up from 0.5.

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0 (was 4000) together reclaim ~15s of training budget → ~20 more depth-3 steps.

Methodology — training + TTT validated independently

Training-time and TTT improvements are orthogonal:

  1. Training (Polar Express NS, MIN_LR, sparse gate, fused CE): trained 3 seeds to completion, producing quantized artifacts. Full training logs in train_seed{42,0,1234}.log.
  2. TTT (PR Record: Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean) #1767): applied to the same quantized artifacts via TTT_EVAL_ONLY mode — no retraining, no re-quantization. TTT-only logs in ttt_pr1767/.

The shipped train_gpt.py defaults to PR #1767 TTT parameters, so a full end-to-end torchrun produces the reported 1.06335 mean directly.

Rule compliance

Test plan

See records/.../README.md for full write-up including the BOS-fix patch note, lineage, and credits.

🤖 Generated with Claude Code

… Attn Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR openai#1736 (1.06549), -0.00043 vs PR openai#1779 (1.06421).

Stacks 4 orthogonal wins on top of PR openai#1736, all ablation-validated on
seed 0 against stock openai#1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR openai#1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR openai#1736 counterpart (deltas -1.20 to
-2.27 mBPP).

Changes are fully orthogonal to PR openai#1779's frozen recurrent α/β and
PR openai#1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR openai#1736 d7263a3 and PR openai#1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@nprime06 nprime06 changed the title feat(submission): PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE — val_bpb 1.06378 Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE — val_bpb 1.06378 Apr 23, 2026
@msisovic
Copy link
Copy Markdown
Contributor

Hey, interesting submission, I did find improvements with increasing the LR floor and the softcap+CE kernel from modded-nanogpt as well.

You didn't include a command to reproduce your runs though, that would be great for the rest of the folks. Also, I notice you say you reduced GPTQ_RESERVE_SECONDS to 0.5, even though the logs show that GPTQ took 3+ secs:

GPTQ:collected 67 Hessians in 3.5s

Reviewer requested the reproduction script be included.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@nprime06
Copy link
Copy Markdown
Author

nprime06 commented Apr 23, 2026

You didn't include a command to reproduce your runs though, that would be great for the rest of the folks.

Will edit the PR, thanks!

Also, I notice you say you reduced GPTQ_RESERVE_SECONDS to 0.5, even though the logs show that GPTQ took 3+ secs:

GPTQ:collected 67 Hessians in 3.5s

The GPTQ step doesn't involve any gradient updates or optimizer steps to the model, it's just gathering Hessians for compression. I argue it's legal to place these steps outside of the 600s training budget.

In practice this knob just stops training slightly early. I set it to 0.5s to keep a buffer to make sure training doesn't exceed 600s (for instance in PR#1736 readme we see training took 596.14s, slightly over 600-4 seconds)

Thanks for the review!!

Ports PR openai#1767's TTT-only improvements on top of our training-time wins:
- TTT_LORA_ALPHA=144 (rank-scaled LoRA output, was implicit=96)
- TTT_WARM_START_A=1 (keep A warm across doc resets, was re-init)
- TTT_WEIGHT_DECAY=1.0 (was 0.5)

These are eval-time-only changes: zero training or artifact impact.
Validated via TTT_EVAL_ONLY mode on the same 3 quantized artifacts
from the original training runs (no retraining, no re-quantization).

3-seed post-TTT results (PR openai#1767 TTT on draft-7 artifacts):
  seed 42:   1.06400 (was 1.06444, -0.44 mBPP)
  seed 0:    1.06308 (was 1.06353, -0.45 mBPP)
  seed 1234: 1.06297 (was 1.06336, -0.39 mBPP)
  mean:      1.06335 (was 1.06378, -0.43 mBPP)

train_gpt.py defaults updated to PR openai#1767 values so a fresh
end-to-end torchrun produces the reported 1.06335 directly.
TTT-only logs included in ttt_pr1767/ subdirectory.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@nprime06 nprime06 changed the title Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE — val_bpb 1.06378 feat(submission): PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 Apr 23, 2026
@nprime06
Copy link
Copy Markdown
Author

nprime06 commented Apr 23, 2026

Update: added PR #1767 TTT improvements → 3-seed mean 1.06335 (was 1.06378)

We've updated this submission with three eval-time-only TTT changes from PR #1767:

Parameter Before After
TTT_LORA_ALPHA 96 (implicit = rank) 144 (scale = 1.5×)
TTT_WARM_START_A 0 (re-init A every doc) 1 (keep A warm, zero only B)
TTT_WEIGHT_DECAY 0.5 1.0

These are eval-time-only changes — zero impact on training or artifact size. We validated them by running TTT-only eval (TTT_EVAL_ONLY=1 mode) on the exact same 3 quantized artifacts from our original training runs, without retraining or re-quantizing.

Results (same artifacts, improved TTT):

Seed Before (stock TTT) After (PR #1767 TTT) Δ (mBPP)
42 1.06444 1.06400 −0.44
0 1.06353 1.06308 −0.45
1234 1.06336 1.06297 −0.39
Mean 1.06378 1.06335 −0.43

What changed in this commit (b667ea2):

  • train_gpt.py: updated BatchedLinearLoRA defaults to PR Record: Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean) #1767 values (alpha=144, warm_start_A=1) and ttt_weight_decay default to 1.0. Added TTT_EVAL_ONLY env gate for running TTT eval on saved artifacts without retraining. A fresh torchrun with the shipped code now produces 1.06335 directly.
  • submission.json: updated headline numbers to 1.06335.
  • README.md: updated tables, added methodology note explaining independent validation.
  • ttt_pr1767/: added TTT-only eval logs (3 seeds) for reproducibility.

Credit to @renqianluo (PR #1767) for the TTT improvements.

@nprime06 nprime06 changed the title feat(submission): PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 Apr 23, 2026
@leon2k2k2k
Copy link
Copy Markdown

This is super nice work, and that credit should go to @renqianluo not me.

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 24, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 24, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 24, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 24, 2026
…ai#1787 Polar Express NS new base; PR openai#1795 PPM 1.01252; Issue openai#1604 deadline passed; Session 20

- Merged SOTA 1.0810 confirmed Day 15 (README not updated despite Scylla record commit)
- Scylla 0.9485 committed to track_10min_16mb/ on Apr 23 (PR openai#1184) but byte accounting
  disputed by PR openai#1271 (corrected ~1.1289 bpb); treat merged SOTA as 1.0810
- PR openai#771 CLOSED/REJECTED confirmed; PR openai#727 CLOSED (illegal); PR openai#758 open but dead;
  PR openai#731 still awaiting seeds 1337+2024
- Issue openai#1604 (CaseOps ruling): NO @valerio-oai response in 11 days; self-deadline Apr 24
  passed; proceed with clean legal stack immediately
- NEW: PR openai#1787 (nprime06, 1.06335) — new community-consensus clean base with Polar Express
  Newton-Schulz (arXiv:2505.16932, ICLR 2026) + MIN_LR=0.10 warmdown floor
- NEW: PR openai#1795 (OE-GOD, 1.01252) — byte-level PPM order-4 adaptive mixture; gate legality
  concern fixed; await organizer ruling before implementing
- NEW: PR openai#1797 (dexhunter, 1.06157) — PR openai#1787 + SmearGate + LQER Asym; new dexhunter best
- NEW: PR openai#1802 (aamodbhatt, 1.0771) — Polar Express NS + Multi-Phase Global TTT
- TECHNIQUE: Polar Express NS (arXiv:2505.16932) and Gram NS (Dao-AILab) added to table
- TECHNIQUE: MIN_LR=0.10 warmdown floor added to best-stack approach
- Updated competition strategy: stop waiting for CaseOps, implement clean stack with
  Polar Express NS + MIN_LR immediately (6 days to deadline)

https://claude.ai/code/session_01JZ3FiS937NwLHt3Fv9WHPD
@msisovic
Copy link
Copy Markdown
Contributor

@nprime06 On the GPTQ reservation question, it was previously discussed in the challenge that training data is only allowed to be accessed during the training phase (mentioned in the README as "you aren't allowed to access any training data during evaluation", even though you could argue that this isn't eval yet technically), an this lead to approaches like AR self-generated data for GPTQ, whose point is to save on that tradeoff where we eat into our train time to collect the Hessians.

Not trying to bash your submission, in fact I plan to rebase my current approach on top of it, just something I thought it's good for you to know since I had the context from previous discussions (can link the comment from the organizers when I find the time to search through the old PRs) and since it shouldn't affect your score too much.

aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request Apr 27, 2026
…symmetric + Phased TTT

val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM

Key Change: SmearGate BOS Document Boundary Fix
Builds on PR openai#1797 stack (PR openai#1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in PR openai#1797 audit.

The bug: SmearGate 1-token causal lookback does not mask BOS positions, so the final token of document N smears into BOS of document N+1.

Credits
@nprime06 -- PR openai#1787 base stack
@romeerp -- CaseOps transform (PR openai#1729)
@dexhunter -- SmearGate + LQER (PR openai#1797)
@cocohearts -- Identifying SmearGate BOS bug
@abaybektursun -- Score-first TTT (PR openai#549)
@clarkkev -- GPTQ SDClip + SP8192 (PR openai#1394)
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
…penai#1855

Changes train_gpt.py defaults for two of openai#1855's 9 greedy-validated hparams:
- BETA2 0.95 -> 0.99 (smoother optim variance estimate, generic win)
- SPARSE_ATTN_GATE_SCALE 1.0 -> 0.5 (softer gating early; only affects openai#1787's
  sparse attn-output gate path, no coupling with our 047 family)

Both still env-var-overridable for ablation. WARMDOWN_FRAC=0.85 deferred
because it interacts with loop-activation timing.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 28, 2026
…S-mask fix

Apply BOS mask at both SmearGate forward paths (_forward_hidden and
forward_ttt) per @msisovic's catch in PR openai#1797 review. Cross-doc smear
leakage at packed document boundaries (last token of doc N smearing into
BOS of doc N+1) is now blocked.

Rebanked 3-seed result with the BOS mask applied:
  - val_bpb: 1.06412 (std 0.00172)
  - val_loss: 2.32869 nats/token (std 0.00373)
  - per-seed: s314=1.06307, s42=1.06319, s1234=1.06610
  - all seeds within 600s train + 600s eval budgets

Original headline 1.06157 was favorably biased by the cross-doc smear
leak by +0.00255 BPB. Corrected score still clears merged SOTA
(PR openai#1493 at 1.0810) by 0.0169 BPB.

Closes the BOS-fix rebank request from @cocohearts' audit comment.
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens), reviewer caught and author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over
  token alphabet), reviewer @sharpobject caught.
- openai#1855: techniques mostly legit but apt-get install lrzip violates Issue
  openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal
  training-time techniques citing prior validated PRs. If it merges,
  our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples
   replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max
   instead of decaying to 0. Already wired in our v1+; just env-var
   opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) -
   SVD on top-K=3 highest-error GPTQ residuals, packed as int4
   per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at
  import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use
  per-iteration coefficients instead of fixed.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977 lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-
seed artifact slack: ~4,888 bytes under cap. Tight but workable.

AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline:       1.08038
- + LR=0.010 (Stage 2):   1.08021
- + Polar Express NS:     1.0787-1.0797
- + MIN_LR=0.10:          1.0777-1.0794
- + ConfTTT (PR openai#1879):   1.0772-1.0793
- + LQER (v4 work):       1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E:  1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-
current-stack) since current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open.
Whichever merges first becomes new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants