Record: SP8192 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean) #1874
Conversation
…_bpb 1.06766, 3-seed mean) Stacks three orthogonal techniques on PR openai#1790's verbatim non-CaseOps base:
- Polar Express Newton-Schulz coefficients (PR openai#1344)
- MIN_LR=0.10 warmdown floor (PR openai#1787)
- LQER asymmetric rank-4 quantization correction (PR openai#1797)

3-seed mean 1.06766 (std 0.00076), -0.01334 BPB vs merged SOTA (PR openai#1493 = 1.0810) at p << 0.001. All seeds fit budget (15.95 MB, 596s train, 394-457s eval).
…olar Express NS + MIN_LR + LQER) Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); the reviewer caught it and the author admitted it.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet); reviewer @sharpobject caught it.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on the top-K=3 highest-error GPTQ residuals, packed as int4, per-group-64, asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2); a minimal sketch of the wiring follows this comment:
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of the fixed tuple.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack) since the current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
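A minimal sketch of the opt-in wiring described in that comment, assuming the names it gives (`_POLAR_EXPRESS_NS`, `_PE_COEFFS`, `POLAR_EXPRESS_NS`, `zeropower_via_newtonschulz5`). The coefficient values below are placeholders, not the minimax-tuned tuples from PR openai#1344, and the real v3 code may structure this differently:

```python
import os

# Read once at import time so torch.compile treats the flag and the table as constants.
_POLAR_EXPRESS_NS = os.environ.get("POLAR_EXPRESS_NS", "0") == "1"

_FIXED_COEFFS = (3.4445, -4.775, 2.0315)   # stock Muon tuple, reused for all 5 steps
_PE_COEFFS = (                             # one (a, b, c) per Newton-Schulz iteration
    # Placeholder values only -- substitute the minimax-tuned tuples from PR openai#1344.
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
)

def _ns_coeffs(step: int) -> tuple[float, float, float]:
    """Coefficients used by zeropower_via_newtonschulz5 at iteration `step`."""
    return _PE_COEFFS[step] if _POLAR_EXPRESS_NS else _FIXED_COEFFS
```

With `POLAR_EXPRESS_NS` unset or 0, the dispatch always returns the stock tuple, which is how the default stays byte-for-byte identical to SOTA.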
LQER (PR openai#1797 / PR openai#1874 / PR openai#1530 lineage) ported into v4 verbatim from PR openai#1874's diff. The biggest single remaining lever in our stack: PR openai#1797 measured -0.009 BPB recovery from the int6 quant tax at ~30 KB artifact cost. Default-OFF: LQER_ENABLED=0 returns v4 to v3 byte-for-byte.

Patch surface:
- 6 new env vars on Hyperparameters (LQER_ENABLED, LQER_RANK=4, LQER_TOP_K=3, LQER_FACTOR_BITS=4, LQER_ASYM_ENABLED=1, LQER_ASYM_GROUP=64).
- _lqer_pack (sym INT4 per-row) and _lqer_pack_asym (INT2 for the A scalar, INT4 per-group-64 for B) helper functions.
- gptq_mixed_quantize: after each weight's GPTQ pass, capture the residual E = W - W_quant and stash it with its Frobenius norm. After the main loop, if LQER_ENABLED=1, sort by norm, pick the top-K, run torch.linalg.svd, take the rank-r factors, and pack via asym (default) or the sym fallback.
- dequantize_mixed: if the metadata contains 'lqer_asym' or 'lqer', dequantize the factors and add A @ B to the dequantized weight.

A minimal sketch of this flow is appended after this comment.

Verified:
- AST-clean on Python 3.13 (macOS) and 3.12 (Linux/Vultr).
- CPU pack/dequant round-trip on a 512x2048 residual: confirms the shape arithmetic and that the asymmetric INT2/INT4 reconstruction tracks the symmetric INT4 reconstruction within 0.5%.
- Sizes: v4 raw 57,420, lzma 15,776 (+648 vs v3, +2,528 vs SOTA).
- Byte-cost projection: ~10.5 KB raw factors per 512x2048 weight, ~4-6 KB after brotli compression of redundant int8 patterns. Top-K=3 ~ 12-18 KB total. Worst-seed artifact projection ~ 9 KB OVER the 16M cap; mitigated by an LQER_TOP_K=2 fallback (~6 KB savings) or a pre-flight serialize check.

The proposal is deliberately directive (per user request). It tells Claude:
1. Run the sanity check (Step 1, $0.70) - LQER_ENABLED=0 must reproduce v3.
2. Run the size pre-flight (Step 2, $3) - a 30-step training run to verify the artifact stays under 16M with LQER enabled. Drop to TOP_K=2 if over.
3. Run a single-seed full retrain (Step 3, $15) stacking LQER + Polar Express + MIN_LR + LR=0.010 + ConfTTT. Compare against the Phase-2 best.
4. If <= 1.0780, run the 3-seed validation (Step 4, $45) and submit.

Race awareness: PR openai#1797 and PR openai#1874 are both OPEN with LQER as a core component. Either merging tightens our threshold significantly. LQER is on the critical path either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
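A minimal sketch of the flow just described, under stated assumptions: `residuals` is whatever `gptq_mixed_quantize` stashed per weight, and the real v4 code packs the factors to int2/int4 via `_lqer_pack_asym` rather than keeping them in float as shown here:

```python
import torch

def lqer_factors(residuals: dict[str, torch.Tensor], top_k: int = 3, rank: int = 4):
    """Pick the top-K residuals by Frobenius norm and return rank-r SVD factors per layer."""
    ranked = sorted(residuals.items(), key=lambda kv: kv[1].float().norm().item(), reverse=True)
    out = {}
    for name, E in ranked[:top_k]:
        U, S, Vh = torch.linalg.svd(E.float(), full_matrices=False)
        A = U[:, :rank] * S[:rank]        # (out_features, rank)
        B = Vh[:rank, :]                  # (rank, in_features)
        out[name] = (A, B)                # real code: pack A/B to int2/int4 per group here
    return out

def lqer_dequantize(W_dequant: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Mirror of dequantize_mixed's correction step: add the low-rank residual back."""
    return W_dequant + A @ B
```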
…he README:
- models/nm_default.int6.ptz — Newton-Muon enabled, full 600s, seed=42 (val_bpb 1.10705)
- models/nm_smoke.int6.ptz — Newton-Muon enabled, 180s smoke run
- models/baseline_pr1874_seed42.int6.ptz — PR openai#1874 baseline, NM disabled, seed=42 (val_bpb 1.06928)

This lets a reviewer inspect the actual trained quantized weights without having to retrain. Total ~46 MB of binary artifacts. README updated with a verified CPU-only inspection snippet (brotli decompress -> _byte_unshuffle -> torch.load) demonstrating the artifacts are well-formed int6 GPTQ state dicts; a hedged approximation of that snippet follows this comment.

Author: Saicharan Ramineni <[email protected]>
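A hedged approximation of the inspection flow named above (brotli decompress -> `_byte_unshuffle` -> `torch.load`). `_byte_unshuffle` is the repo's own helper; its import path and signature (assumed here to be bytes -> bytes, importable from `train_gpt`) should be checked against the producer script:

```python
import io

import brotli
import torch

# Assumption: _byte_unshuffle lives in the training script and maps bytes -> bytes.
from train_gpt import _byte_unshuffle

with open("models/baseline_pr1874_seed42.int6.ptz", "rb") as f:
    raw = brotli.decompress(f.read())          # undo the outer brotli layer
payload = _byte_unshuffle(raw)                 # undo the byte shuffle applied before compression
state_dict = torch.load(io.BytesIO(payload), map_location="cpu")

for name, value in state_dict.items():         # quick structural sanity check
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```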
Potential SmearGate boundary leak in #1874 / inherited #1790 path

I noticed this PR appears to inherit the pre-BOS-fix SmearGate pattern from the #1790 stack. In the PR diff, both of these lines appear in their pre-fix form:

g = sl * torch.sigmoid(self.smear_gate(x[:, 1:, :self.smear_width]))
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1]], dim=1)

If production train/eval ever passes packed multi-document rows, then for a packed stream the BOS embedding for the new document becomes x_bos + g * x_prev, where x_prev is the final position of the previous document, and the logit at that BOS position scores information leaked across the document boundary. Could you confirm whether #1874's production train/eval path ever presents packed multi-doc rows to this SmearGate code? If yes, it needs the same BOS mask used in the BOS-fix submissions:

bos = input_ids[:, 1:].eq(BOS_ID)
prev = x[:, :-1].masked_fill(bos[..., None], 0)
x = torch.cat([x[:, :1], x[:, 1:] + g * prev], dim=1)

This is independent of CaseOps; it is a packed-document boundary issue.
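For completeness, a self-contained sketch of a smear-gate forward with the BOS mask folded in. The gate width, the meaning of `sl` (taken here as a learned scalar), and the output width of `smear_gate` (taken as 1) are assumptions; only the masking logic is the point:

```python
import torch
import torch.nn as nn

BOS_ID = 0  # assumption: substitute the tokenizer's actual BOS id


class MaskedSmearGate(nn.Module):
    """Illustrative smear-gate mix-in of the previous position, zeroed at BOS boundaries."""

    def __init__(self, smear_width: int):
        super().__init__()
        self.smear_width = smear_width
        self.smear_gate = nn.Linear(smear_width, 1)   # assumed: scalar gate per position
        self.sl = nn.Parameter(torch.zeros(1))        # assumed: learned scalar scale

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); input_ids: (batch, seq)
        g = self.sl * torch.sigmoid(self.smear_gate(x[:, 1:, :self.smear_width]))
        # Zero the previous-position contribution wherever the *current* token is BOS,
        # so a packed document's first position never sees the previous document.
        bos = input_ids[:, 1:].eq(BOS_ID)
        prev = x[:, :-1].masked_fill(bos[..., None], 0)
        return torch.cat([x[:, :1], x[:, 1:] + g * prev], dim=1)
```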
V19c (seed 42) result: 1.06179 BPB (LOSS by +0.001 vs PR openai#1908 frontier 1.06081).

V19c data attribution:
- pre-quant 1.06906 vs PR openai#1908 1.06384 = +0.0052 hurt -> primary cause: MATRIX_LR=0.028 (vs default 0.026) penalty on seed 42
- TTT recovery -0.01489 vs PR openai#1908 -0.01269 = +0.0022 helped -> AsymLogit + PHASED_TTT_PREFIX=3500 actually working

V20 strategy: remove LR penalty + keep TTT helpers + add LoRA capacity:
- DROP MATRIX_LR=0.028 -> default 0.026 (recovers +0.005 BPB on pre-quant)
- KEEP ASYM_LOGIT_RESCALE=1 (eval-only, verified -0.001 to -0.002)
- KEEP TTT_WEIGHT_DECAY=2.0 (stability fix)
- KEEP PHASED_TTT_PREFIX_DOCS=3500 (verified: more LoRA training data)
- ADD TTT_LORA_RANK=144 (vs 96 default, +50% LoRA capacity). PR openai#1909 (GodlyDonuts) verified rank=192 gives a small benefit on PR openai#1874; a conservative 144 balances benefit vs the eval-time budget (V19c was 527s, 73s buffer).

Predicted (seed 42):
- pre-quant: ~1.063 (no train hparam changes from PR openai#1908)
- quantized: ~1.072 (matches PR openai#1908 quant tax)
- post-TTT: ~1.057 (TTT recovery -0.013 base + -0.002 AsymLogit/PHASED + -0.001 RANK = -0.016)

Win threshold: < 1.06021 (PR openai#1908 - 0.0006 community floor). Probability of a true win: ~50%. Cost: ~$22 single-seed scout on 8xH100 SXM.
Record: PR #1790 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean)
Summary
val_bpb = 1.06766 (3-seed mean, std 0.00076) | val_loss = 2.34122 nats/token | ~15.95 MB | 8×H100 SXM, 600s train / 600s eval
Builds directly on PR #1790's verbatim non-CaseOps stack and layers in three orthogonal, previously validated techniques from open PRs:
- Polar Express Newton-Schulz coefficients (PR #1344): five minimax-tuned per-iteration tuples replace the fixed (3.4445, −4.775, 2.0315) × 5, same MUON_BACKEND_STEPS=5.
- MIN_LR=0.10 warmdown floor (PR #1787): the LR floors at 10% of max instead of decaying to 0 during warmdown.
- LQER asymmetric rank-4 quantization correction (PR #1797): rank-4 SVD of the top-K=3 GPTQ residuals, packed as asymmetric int4 with per-group-64 scales.

Net effect: −0.00225 BPB vs PR #1790 (1.06991 → 1.06766) on the same SP8192 base, all training-time legal under Issue #1017 conditions. No CaseOps, no casefold, no preprocessing — denominator is original UTF-8 bytes throughout. Improvement over current merged SOTA (PR #1493 = 1.0810) is −0.0133 BPB at p ≪ 0.001.
3-Seed Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval budgets)
All three seeds clear size, train-time, and eval-time budgets. 3-seed std (0.00076) is well inside the 0.005 significance floor.
Head-to-head vs PR #1790 (matched seed)
Single-seed comparison on seed 1337 (the only seed PR #1790 reports individually that overlaps with our set). The improvement direction is consistent with the −2.25 mBPB gap on the 3-seed means.
Key Techniques
1. Polar Express Newton-Schulz coefficients (from PR #1344)
Muon's zeropower_via_newtonschulz5 orthogonalizes a gradient matrix via 5 iterations of a Newton-Schulz polynomial. Stock Muon uses a single fixed coefficient tuple (a, b, c) = (3.4445, −4.775, 2.0315) for all 5 iterations. Polar Express replaces this with 5 distinct per-iteration tuples, each minimax-tuned for the residual error distribution at that step. Same compute budget, sharper orthogonal approximation.

Env var: POLAR_EXPRESS_NS=1.
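As a concrete illustration (not the PR's exact code), a minimal zeropower_via_newtonschulz5 in the usual Muon style with an opt-in path that consumes one (a, b, c) tuple per iteration; the per-step tuples are placeholders for the PR #1344 values:

```python
import torch

_FIXED = [(3.4445, -4.775, 2.0315)] * 5   # stock Muon: the same tuple every iteration
_PE = [(3.5, -4.8, 2.0)] * 5              # placeholders for the minimax-tuned tuples

def zeropower_via_newtonschulz5(G: torch.Tensor, polar_express: bool = False) -> torch.Tensor:
    """Approximate the orthogonal polar factor of G with 5 Newton-Schulz iterations."""
    coeffs = _PE if polar_express else _FIXED
    X = G.bfloat16()
    if G.size(-2) > G.size(-1):
        X = X.mT                                              # work on the wide orientation
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)       # keep top singular value <= 1
    for a, b, c in coeffs:                                    # per-iteration (a, b, c)
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X                                     # X <- aX + b(XX^T)X + c(XX^T)^2 X
    if G.size(-2) > G.size(-1):
        X = X.mT
    return X
```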
2. MIN_LR=0.10 warmdown floor (from PR #1787)
PR #1790's LR schedule decays linearly to 0 during the warmdown phase. The final ~25% of training under warmdown_frac=0.75 runs at near-zero LR, delivering no useful gradient updates. Flooring lr_mul at 0.10 keeps the tail of training productive.

Env var: MIN_LR=0.10.
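A minimal sketch of the floored warmdown multiplier, assuming warmdown_frac is the fraction of training spent in warmdown and that the base schedule is linear (the PR's actual schedule code may differ):

```python
def lr_mul(step: int, total_steps: int, warmdown_frac: float = 0.75, min_lr: float = 0.10) -> float:
    """Multiplier on the peak LR: 1.0 before warmdown, then linear decay floored at min_lr."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    frac_remaining = (total_steps - step) / max(total_steps - warmdown_start, 1)
    return max(frac_remaining, min_lr)   # with min_lr=0.0 this is the original decay-to-zero schedule
```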
3. LQER asymmetric rank-4 quantization correction (from PR #1797)
After GPTQ quantizes a weight W, the residual E = W − dequant(quantize(W)) is non-zero. LQER computes a rank-4 SVD of E for the top-K=3 highest-error layers and packs the factors as asymmetric int4 with per-group (g=64) scales. Selection by ||E||_F; in practice top-3 captures the bulk of the recoverable error mass and going wider is neutral (tested at top-12 in a single-seed ablation).

Env vars: LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_GROUP_SIZE=64.
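For concreteness, a hedged sketch of the per-group (g=64) asymmetric int4 scale/zero-point arithmetic; the PR's actual packing helpers additionally handle nibble packing and mixed factor bit-widths, so treat this only as an illustration:

```python
import torch

def quant_asym_int4_grouped(x: torch.Tensor, group: int = 64):
    """Asymmetric int4 quantization of x in groups of `group` along a flattened view.

    Assumes x.numel() is divisible by `group`. Returns codes in [0, 15] plus the
    per-group scale and minimum needed for dequantization.
    """
    xg = x.float().reshape(-1, group)
    lo = xg.min(dim=1, keepdim=True).values
    hi = xg.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / 15.0                  # 16 quantization levels
    q = ((xg - lo) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, lo

def dequant_asym_int4_grouped(q, scale, lo, shape):
    """Invert quant_asym_int4_grouped back to the original tensor shape."""
    return (q.float() * scale + lo).reshape(shape)
```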
Hyperparameters
All other hyperparameters inherited from PR #1790 base unchanged.
Rule Compliance (Issue #1017 Track B)
Attribution
Test plan
- Run SEED=1337 ... torchrun --standalone --nproc_per_node=8 train_gpt.py (full env above).
- Confirm the quantized_ttt_phased val_bpb matches the logged 1.06699 (±0.00076) within seed noise.