
Record: SP8192 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean)#1874

Open
AjAnubolu wants to merge 1 commit into openai:main from AjAnubolu:submission/2026-04-25-pr1790-polar-minlr-lqer

Conversation

@AjAnubolu

Record: PR #1790 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean)

Summary

val_bpb = 1.06766 (3-seed mean, std 0.00076) | val_loss = 2.34122 nats/token | ~15.95 MB | 8×H100 SXM, 600s train / 600s eval

Builds directly on PR #1790's verbatim non-CaseOps stack and layers in three orthogonal, previously validated techniques from open PRs:

  1. Polar Express Newton-Schulz coefficients (PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344) — 5 per-iteration minimax-tuned tuples replace Muon's fixed (3.4445, −4.775, 2.0315) × 5, same MUON_BACKEND_STEPS=5.
  2. MIN_LR=0.10 warmdown floor (PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787) — floors LR at 10% of max so the warmdown tail delivers useful gradient updates instead of frozen no-ops.
  3. LQER asymmetric rank-4 quantization correction (PR Record: PR #1787 base + Smear Gate + LQER Asym — val_bpb 1.06157 #1797) — SVD on the top-K=3 highest-error GPTQ residuals, packed as int4 with per-group asymmetric scales.

−0.00225 BPB vs PR #1790 (1.06991 → 1.06766) on the same SP8192 base; all three additions are training-time techniques, legal under Issue #1017's conditions. No CaseOps, no casefold, no preprocessing: the denominator is original UTF-8 bytes throughout. The improvement over the current merged SOTA (PR #1493 = 1.0810) is −0.0133 BPB at p ≪ 0.001.

3-Seed Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-quant (post-EMA) | Quantized | Post-TTT | Artifact (bytes) | train_time | eval_time |
|------|-------|----------------------|-----------|----------|------------------|------------|-----------|
| 1337 | 4954  | 1.06842              | 1.07813   | 1.06699  | 15,953,831       | 596.15s    | 456.6s    |
| 42   | 4954  | 1.06903              | 1.07856   | 1.06751  | 15,950,901       | 596.12s    | 455.2s    |
| 2025 | 4953  | 1.06994              | 1.07955   | 1.06849  | 15,948,627       | 596.13s    | 394.4s    |
| Mean | 4954  | 1.06913              | 1.07875   | 1.06766  | 15,951,120       | 596.13s    | 435.4s    |
| Std  |       | 0.00076              | 0.00072   | 0.00076  | 2,634            | 0.02s      | 35.5s     |

All three seeds clear size, train-time, and eval-time budgets. 3-seed std (0.00076) is well inside the 0.005 significance floor.

Head-to-head vs PR #1790 (matched seed)

| Seed | This PR | PR #1790 | Δ (mBPB) |
|------|---------|----------|----------|
| 1337 | 1.06699 | 1.06986  | −2.87    |

Single-seed comparison on seed 1337 (the only seed PR #1790 reports individually that overlaps with our set). The direction of improvement is consistent with the −2.25 mBPB gap on the 3-seed means.

Key Techniques

1. Polar Express Newton-Schulz coefficients (from PR #1344)

Muon's zeropower_via_newtonschulz5 orthogonalizes a gradient matrix via 5 iterations of a Newton-Schulz polynomial. Stock Muon uses a single fixed coefficient tuple (a, b, c) = (3.4445, −4.775, 2.0315) for all 5 iterations. Polar Express replaces this with 5 distinct per-iteration tuples, each minimax-tuned for the residual error distribution at that step. Same compute budget, sharper orthogonal approximation.

# Stock Muon
COEFS = [(3.4445, -4.775, 2.0315)] * 5

# Polar Express (PR #1344)
COEFS = [
    (a1, b1, c1),  # tuned for far-from-orthogonal input
    (a2, b2, c2),  # tuned for residual after step 1
    ...
    (a5, b5, c5),  # final polish, near-orthogonal input
]

Env var: POLAR_EXPRESS_NS=1.
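
For concreteness, a minimal sketch of how per-iteration tuples slot into the Newton-Schulz loop. This is not the PR's code: the function name mirrors Muon's zeropower_via_newtonschulz5, and the coefficient list below reuses the stock tuple as a placeholder because the actual minimax-tuned values live in PR #1344's diff.

import torch

# Placeholder values: the real per-iteration minimax-tuned tuples come from PR #1344.
PE_COEFFS = [(3.4445, -4.775, 2.0315)] * 5

@torch.no_grad()
def zeropower_via_newtonschulz_pe(G: torch.Tensor, coefs=PE_COEFFS) -> torch.Tensor:
    """Newton-Schulz orthogonalization with one (a, b, c) tuple per iteration."""
    assert G.ndim == 2
    X = G.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)          # bring the spectral norm under 1 so the iteration converges
    for a, b, c in coefs:              # per-iteration coefficients instead of one fixed tuple
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

Because len(coefs) equals MUON_BACKEND_STEPS=5, the compute cost is unchanged; only the polynomial applied at each iteration differs.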

2. MIN_LR=0.10 warmdown floor (from PR #1787)

PR #1790's LR schedule decays linearly to 0 during the warmdown phase. The final ~25% of training under warmdown_frac=0.75 runs at near-zero LR, delivering no useful gradient updates. Flooring lr_mul at 0.10 keeps the tail of training productive.

# PR #1790 (decay to zero)
lr_mul(step) = max((1 - frac) / warmdown_frac, 0.0)

# This PR (10% floor)
lr_mul(step) = max((1 - frac) / warmdown_frac, 0.10)

Env var: MIN_LR=0.10.
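
A minimal sketch of the floored multiplier, assuming a constant phase followed by a linear warmdown over the last warmdown_frac of training (the function name and signature are illustrative, not PR #1790's exact schedule code):

def lr_mul(step: int, num_steps: int, warmdown_frac: float = 0.75, min_lr: float = 0.10) -> float:
    """LR multiplier: 1.0 during the constant phase, then linear warmdown floored at min_lr."""
    frac = step / num_steps                        # fraction of training completed
    if frac < 1.0 - warmdown_frac:
        return 1.0                                 # constant-LR phase
    # linear decay from 1.0 toward 0 across the warmdown window, clipped at the floor
    return max((1.0 - frac) / warmdown_frac, min_lr)

With min_lr=0.0 this reduces to the decay-to-zero schedule above; at 0.10 the final steps still move the weights at a tenth of the peak LR.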

3. LQER asymmetric rank-4 quantization correction (from PR #1797)

After GPTQ quantizes a weight W, the residual E = W − dequant(quantize(W)) is non-zero. LQER computes a rank-4 SVD of E for the top-K=3 highest-error layers and packs the factors as asymmetric int4 with per-group (g=64) scales.

U, S, Vh = torch.linalg.svd(E, full_matrices=False)
A = U[:, :4] * S[:4]                # (out_dim × 4)
B = Vh[:4, :]                        # (4 × in_dim)
qA, sA = symmetric_int4_per_row(A)
qB, sB = asymmetric_int4_per_group(B, g=64)
# At inference: W_used = dequant(W_q) + dequant(qA) @ dequant(qB)

Selection by ||E||_F; in practice top-3 captures the bulk of the recoverable error mass and going wider is neutral (tested at top-12 in single-seed ablation).

Env vars: LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_GROUP_SIZE=64.
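
The helpers named in the snippet above (symmetric_int4_per_row, asymmetric_int4_per_group) are not spelled out in this description; below is a sketch of what the per-group asymmetric packing could look like, with illustrative names and no claim to match the PR's exact bit layout.

import torch

def asymmetric_int4_per_group(B: torch.Tensor, g: int = 64):
    """Each contiguous group of g values gets its own scale and offset; codes are 0..15."""
    rank, in_dim = B.shape
    assert in_dim % g == 0, "pad in_dim to a multiple of the group size"
    groups = B.reshape(rank, in_dim // g, g)
    lo = groups.amin(dim=-1, keepdim=True)
    hi = groups.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / 15.0       # 16 quantization levels
    q = torch.round((groups - lo) / scale).clamp(0, 15).to(torch.uint8)
    return q, scale, lo

def dequant_asym_int4(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor) -> torch.Tensor:
    rank = q.size(0)
    return (q.float() * scale + lo).reshape(rank, -1)

# Usage sketch (hypothetical container names): pick the K residuals with the largest ||E||_F
# captured during the GPTQ pass, then add the reconstructed rank-4 correction at inference:
#   top3 = sorted(residuals.items(), key=lambda kv: kv[1].norm().item(), reverse=True)[:3]
#   W_used = dequant(W_q) + dequant(qA) @ dequant_asym_int4(qB, sB, zB)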

Hyperparameters

SEED=<1337|42|2025> \
QK_GAIN_INIT=5.25 \
SMEAR_GATE=1 \
GATE_ATTN_OUT=1 \
GATE_ATTN_WIDTH=24 \
GPTQ_RESERVE_SECONDS=4 \
GPTQ_CALIBRATION_BATCHES=16 \
POLAR_EXPRESS_NS=1 \
MIN_LR=0.10 \
LQER_ENABLED=1 \
LQER_RANK=4 \
LQER_TOP_K=3 \
LQER_GROUP_SIZE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

All other hyperparameters inherited from PR #1790 base unchanged.

Rule Compliance (Issue #1017 Track B)

  1. Strict causal dependence: LoRA state built from prefix tokens only. Phased global SGD at boundaries operates on already-scored prefix docs only (inherited from PR Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean) #1790).
  2. Full normalized distribution: standard softmax over SP8192 vocab. No n-gram cache, no logit biasing.
  3. Score before update: per-chunk forward pass and loss accumulation BEFORE any LoRA gradient step (see the sketch after this list). Last chunk of each phase explicitly skipped.
  4. Single left-to-right pass: each token scored exactly once, no rescoring.
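
A schematic of the score-before-update ordering in item 3. Names such as model, chunks, lora_opt, and n_bytes are illustrative, and the real phased-TTT loop is in the submitted train_gpt.py.

import math
import torch.nn.functional as F

total_nats, total_bytes = 0.0, 0
for i, (inputs, targets, n_bytes) in enumerate(chunks):    # left to right; each token scored once
    logits = model(inputs)                                  # forward pass with the CURRENT weights
    nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="sum")
    total_nats += nll.item()                                # score is banked BEFORE any update
    total_bytes += n_bytes
    if i < len(chunks) - 1:                                 # last chunk of the phase: no update
        (nll / targets.numel()).backward()                  # gradients reach LoRA params only
        lora_opt.step()
        lora_opt.zero_grad(set_to_none=True)

val_bpb = total_nats / math.log(2) / total_bytes            # denominator: original UTF-8 bytes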

Attribution

Test plan

  • Organizer reviews submission folder contents (train_gpt.py, 3 seed logs, submission.json, README.md).
  • Organizer reproduces at least one seed: SEED=1337 ... torchrun --standalone --nproc_per_node=8 train_gpt.py (full env above).
  • Reproduced quantized_ttt_phased val_bpb matches the logged 1.06699 (±0.00076) within seed noise.
  • Artifact size, train_time, total_eval_time within budgets on re-run.

…_bpb 1.06766, 3-seed mean)

Stacks three orthogonal techniques on PR openai#1790's verbatim non-CaseOps base:
- Polar Express Newton-Schulz coefficients (PR openai#1344)
- MIN_LR=0.10 warmdown floor (PR openai#1787)
- LQER asymmetric rank-4 quantization correction (PR openai#1797)

3-seed mean 1.06766 (std 0.00076), -0.01334 BPB vs merged SOTA (PR openai#1493 = 1.0810)
at p << 0.001. All seeds fit budget (15.95 MB, 596s train, 394-457s eval).
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens), reviewer caught and author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over
  token alphabet), reviewer @sharpobject caught.
- openai#1855: techniques mostly legit but apt-get install lrzip violates Issue
  openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal
  training-time techniques citing prior validated PRs. If it merges,
  our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples
   replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max
   instead of decaying to 0. Already wired in our v1+; just env-var
   opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) -
   SVD on top-K=3 highest-error GPTQ residuals, packed as int4
   per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at
  import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use
  per-iteration coefficients instead of fixed.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977 lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-
seed artifact slack: ~4,888 bytes under cap. Tight but workable.

AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline:       1.08038
- + LR=0.010 (Stage 2):   1.08021
- + Polar Express NS:     1.0787-1.0797
- + MIN_LR=0.10:          1.0777-1.0794
- + ConfTTT (PR openai#1879):   1.0772-1.0793
- + LQER (v4 work):       1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E:  1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-
current-stack) since current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open.
Whichever merges first becomes new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
LQER (PR openai#1797 / PR openai#1874 / PR openai#1530 lineage) ported into v4 verbatim from
PR openai#1874's diff. The biggest single remaining lever in our stack:
PR openai#1797 measured -0.009 BPB recovery from int6 quant tax at ~30 KB
artifact cost. Default-OFF: LQER_ENABLED=0 returns v4 to v3 byte-for-byte.

Patch surface:
- 6 new env vars on Hyperparameters (LQER_ENABLED, LQER_RANK=4,
  LQER_TOP_K=3, LQER_FACTOR_BITS=4, LQER_ASYM_ENABLED=1, LQER_ASYM_GROUP=64).
- _lqer_pack (sym INT4 per-row) and _lqer_pack_asym (INT2 for A scalar,
  INT4 per-group-64 for B) helper functions.
- gptq_mixed_quantize: after each weight's GPTQ pass, capture residual
  E = W - W_quant and stash with Frobenius norm. After the main loop, if
  LQER_ENABLED=1, sort by norm, pick top-K, run torch.linalg.svd, take
  rank-r factors, pack via asym (default) or sym fallback.
- dequantize_mixed: if metadata contains 'lqer_asym' or 'lqer', dequant
  the factors and add A @ B to the dequantized weight.

Verified:
- AST-clean on Python 3.13 (macOS) and 3.12 (Linux/Vultr).
- CPU pack/dequant round-trip on a 512x2048 residual: confirms shape
  arithmetic and that asymmetric INT2/INT4 reconstruction tracks the
  symmetric INT4 reconstruction within 0.5%.
- Sizes: v4 raw 57,420 lzma 15,776 (+648 vs v3, +2,528 vs SOTA).
- Byte-cost projection: ~10.5 KB raw factors per 512x2048 weight, ~4-6
  KB after brotli compression of redundant int8 patterns. Top-K=3 ~ 12-18
  KB total. Worst-seed artifact projection ~ 9 KB OVER 16M cap; mitigated
  by LQER_TOP_K=2 fallback (~6 KB savings) or pre-flight serialize check.

Proposal is deliberately directive (per user request). Tells Claude:
1. Run sanity check (Step 1, $0.70) - LQER_ENABLED=0 must reproduce v3.
2. Run size pre-flight (Step 2, $3) - 30-step training to verify
   artifact stays under 16M with LQER enabled. Drop to TOP_K=2 if over.
3. Run single-seed full retrain (Step 3, $15) stacking LQER + Polar
   Express + MIN_LR + LR=0.010 + ConfTTT. Compare against Phase-2 best.
4. If <= 1.0780, run 3-seed validation (Step 4, $45) and submit.

Race awareness: PR openai#1797 and PR openai#1874 are both OPEN with LQER as a
core component. Either merging tightens our threshold significantly.
LQER is on the critical path either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@AjAnubolu changed the title from "Record: PR #1790 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean)" to "Record: SP8192 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean)" on Apr 28, 2026
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…he README

- models/nm_default.int6.ptz             — Newton-Muon enabled, full 600s, seed=42 (val_bpb 1.10705)
- models/nm_smoke.int6.ptz               — Newton-Muon enabled, 180s smoke run
- models/baseline_pr1874_seed42.int6.ptz — PR openai#1874 baseline, NM disabled, seed=42 (val_bpb 1.06928)

This lets a reviewer inspect the actual trained quantized weights without
having to retrain. Total ~46 MB of binary artifacts. README updated with a
verified CPU-only inspection snippet (brotli decompress -> _byte_unshuffle ->
torch.load) demonstrating the artifacts are well-formed int6 GPTQ state dicts.

Author: Saicharan Ramineni <[email protected]>
@andrewbaggio1

Potential SmearGate boundary leak in #1874 / inherited #1790 path

I noticed this PR appears to inherit the pre-BOS-fix SmearGate pattern from the #1790 stack. In the PR diff, both forward_logits and forward contain the same two lines:

g = sl * torch.sigmoid(self.smear_gate(x[:, 1:, :self.smear_width]))
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1]], dim=1)

If production train/eval ever passes packed multi-document rows, then for a packed stream

[..., a, BOS, b1, b2, ...]

the BOS embedding for the new document becomes

E[BOS] + gate(E[BOS]) * E[a]

and the logit at that BOS position scores b1, so document B can depend on document A's final token before attention/cu_seqlens can reset anything.

Could you confirm whether #1874's production train/eval path ever presents packed multi-doc rows to this SmearGate code? If yes, this needs the same BOS mask used in the BOS-fix submissions:

bos = input_ids[:, 1:].eq(BOS_ID)
prev = x[:, :-1].masked_fill(bos[..., None], 0)
x = torch.cat([x[:, :1], x[:, 1:] + g * prev], dim=1)

This is independent of CaseOps; it is a packed-document boundary issue.
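
For illustration, a self-contained sketch of the masked variant under the same assumptions as the snippet above; the class name, scalar gate output, sl initialization, and bos_id are placeholders, not the PR's actual module.

import torch
import torch.nn as nn

class MaskedSmearGate(nn.Module):
    """Smear gate that zeroes the previous-token contribution at document starts."""
    def __init__(self, dim: int, smear_width: int = 24, bos_id: int = 0):
        super().__init__()
        self.smear_width = smear_width
        self.bos_id = bos_id
        self.smear_gate = nn.Linear(smear_width, 1)   # one gate value per position (assumed width)
        self.sl = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) embeddings; input_ids: (B, T) token ids at the same positions
        g = self.sl * torch.sigmoid(self.smear_gate(x[:, 1:, :self.smear_width]))
        bos = input_ids[:, 1:].eq(self.bos_id)              # positions that begin a new document
        prev = x[:, :-1].masked_fill(bos[..., None], 0)     # never smear across a doc boundary
        return torch.cat([x[:, :1], x[:, 1:] + g * prev], dim=1)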

alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
V19c (seed 42) result: 1.06179 BPB (LOSS by +0.001 vs PR openai#1908 frontier 1.06081).

V19c data attribution:
  pre-quant 1.06906 vs PR openai#1908 1.06384 = +0.0052 hurt
    -> primary cause: MATRIX_LR=0.028 (vs default 0.026) penalty on seed 42
  TTT recovery -0.01489 vs PR openai#1908 -0.01269 = +0.0022 helped
    -> AsymLogit + PHASED_TTT_PREFIX=3500 actually working

V20 strategy: remove LR penalty + keep TTT helpers + add LORA capacity:
  - DROP MATRIX_LR=0.028 -> default 0.026 (recovers +0.005 BPB on pre-quant)
  - KEEP ASYM_LOGIT_RESCALE=1 (eval-only, verified -0.001 to -0.002)
  - KEEP TTT_WEIGHT_DECAY=2.0 (stability fix)
  - KEEP PHASED_TTT_PREFIX_DOCS=3500 (verified more LoRA training data)
  - ADD TTT_LORA_RANK=144 (vs 96 default, +50% LoRA capacity)
    PR openai#1909 GodlyDonuts verified rank=192 gives small benefit on PR openai#1874
    Conservative 144 to balance benefit vs eval-time budget (V19c was 527s, 73s buffer)

Predicted (seed 42):
  pre-quant: ~1.063 (no train hparam changes from PR openai#1908)
  quantized: ~1.072 (matches PR openai#1908 quant tax)
  post-TTT:  ~1.057 (TTT recovery -0.013 base + -0.002 AsymLogit/PHASED + -0.001 RANK = -0.016)

Win threshold: < 1.06021 (PR openai#1908 - 0.0006 community floor)
Probability of true win: ~50%

Cost: ~$22 single-seed scout on 8xH100 SXM