Record: SP8192 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean) #1874
Conversation
…_bpb 1.06766, 3-seed mean) Stacks three orthogonal techniques on PR openai#1790's verbatim non-CaseOps base:
- Polar Express Newton-Schulz coefficients (PR openai#1344)
- MIN_LR=0.10 warmdown floor (PR openai#1787)
- LQER asymmetric rank-4 quantization correction (PR openai#1797)

3-seed mean 1.06766 (std 0.00076), -0.01334 BPB vs merged SOTA (PR openai#1493 = 1.0810) at p << 0.001. All seeds fit budget (15.95 MB, 596s train, 394-457s eval).
…olar Express NS + MIN_LR + LQER) Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); the reviewer caught it and the author admitted it.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet); reviewer @sharpobject caught it.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on the top-K=3 highest-error GPTQ residuals, packed as int4, per-group-64, asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2); a minimal sketch of the wiring follows this comment:
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of the fixed tuple.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack) since the current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
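A minimal sketch of the opt-in wiring described in that comment, assuming the names it gives (`_POLAR_EXPRESS_NS`, `_PE_COEFFS`, `POLAR_EXPRESS_NS`, `zeropower_via_newtonschulz5`). The coefficient values below are placeholders, not the minimax-tuned tuples from PR openai#1344, and the real v3 code may structure this differently:

```python
import os

# Read once at import time so torch.compile treats the flag and the table as constants.
_POLAR_EXPRESS_NS = os.environ.get("POLAR_EXPRESS_NS", "0") == "1"

_FIXED_COEFFS = (3.4445, -4.775, 2.0315)   # stock Muon tuple, reused for all 5 steps
_PE_COEFFS = (                             # one (a, b, c) per Newton-Schulz iteration
    # Placeholder values only -- substitute the minimax-tuned tuples from PR openai#1344.
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
    (3.5, -4.8, 2.0),
)

def _ns_coeffs(step: int) -> tuple[float, float, float]:
    """Coefficients used by zeropower_via_newtonschulz5 at iteration `step`."""
    return _PE_COEFFS[step] if _POLAR_EXPRESS_NS else _FIXED_COEFFS
```

With `POLAR_EXPRESS_NS` unset or 0, the dispatch always returns the stock tuple, which is how the default stays byte-for-byte identical to SOTA.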
LQER (PR openai#1797 / PR openai#1874 / PR openai#1530 lineage) ported into v4 verbatim from PR openai#1874's diff. The biggest single remaining lever in our stack: PR openai#1797 measured -0.009 BPB recovery from the int6 quant tax at ~30 KB artifact cost. Default-OFF: LQER_ENABLED=0 returns v4 to v3 byte-for-byte.

Patch surface:
- 6 new env vars on Hyperparameters (LQER_ENABLED, LQER_RANK=4, LQER_TOP_K=3, LQER_FACTOR_BITS=4, LQER_ASYM_ENABLED=1, LQER_ASYM_GROUP=64).
- _lqer_pack (sym INT4 per-row) and _lqer_pack_asym (INT2 for the A scalar, INT4 per-group-64 for B) helper functions.
- gptq_mixed_quantize: after each weight's GPTQ pass, capture the residual E = W - W_quant and stash it with its Frobenius norm. After the main loop, if LQER_ENABLED=1, sort by norm, pick the top-K, run torch.linalg.svd, take the rank-r factors, and pack via asym (default) or the sym fallback.
- dequantize_mixed: if the metadata contains 'lqer_asym' or 'lqer', dequantize the factors and add A @ B to the dequantized weight.

A minimal sketch of this flow is appended after this comment.

Verified:
- AST-clean on Python 3.13 (macOS) and 3.12 (Linux/Vultr).
- CPU pack/dequant round-trip on a 512x2048 residual: confirms the shape arithmetic and that the asymmetric INT2/INT4 reconstruction tracks the symmetric INT4 reconstruction within 0.5%.
- Sizes: v4 raw 57,420, lzma 15,776 (+648 vs v3, +2,528 vs SOTA).
- Byte-cost projection: ~10.5 KB raw factors per 512x2048 weight, ~4-6 KB after brotli compression of redundant int8 patterns. Top-K=3 ~ 12-18 KB total. Worst-seed artifact projection ~ 9 KB OVER the 16M cap; mitigated by an LQER_TOP_K=2 fallback (~6 KB savings) or a pre-flight serialize check.

The proposal is deliberately directive (per user request). It tells Claude:
1. Run the sanity check (Step 1, $0.70) - LQER_ENABLED=0 must reproduce v3.
2. Run the size pre-flight (Step 2, $3) - a 30-step training run to verify the artifact stays under 16M with LQER enabled. Drop to TOP_K=2 if over.
3. Run a single-seed full retrain (Step 3, $15) stacking LQER + Polar Express + MIN_LR + LR=0.010 + ConfTTT. Compare against the Phase-2 best.
4. If <= 1.0780, run the 3-seed validation (Step 4, $45) and submit.

Race awareness: PR openai#1797 and PR openai#1874 are both OPEN with LQER as a core component. Either merging tightens our threshold significantly. LQER is on the critical path either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
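A minimal sketch of the flow just described, under stated assumptions: `residuals` is whatever `gptq_mixed_quantize` stashed per weight, and the real v4 code packs the factors to int2/int4 via `_lqer_pack_asym` rather than keeping them in float as shown here:

```python
import torch

def lqer_factors(residuals: dict[str, torch.Tensor], top_k: int = 3, rank: int = 4):
    """Pick the top-K residuals by Frobenius norm and return rank-r SVD factors per layer."""
    ranked = sorted(residuals.items(), key=lambda kv: kv[1].float().norm().item(), reverse=True)
    out = {}
    for name, E in ranked[:top_k]:
        U, S, Vh = torch.linalg.svd(E.float(), full_matrices=False)
        A = U[:, :rank] * S[:rank]        # (out_features, rank)
        B = Vh[:rank, :]                  # (rank, in_features)
        out[name] = (A, B)                # real code: pack A/B to int2/int4 per group here
    return out

def lqer_dequantize(W_dequant: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Mirror of dequantize_mixed's correction step: add the low-rank residual back."""
    return W_dequant + A @ B
```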
…he README:
- models/nm_default.int6.ptz — Newton-Muon enabled, full 600s, seed=42 (val_bpb 1.10705)
- models/nm_smoke.int6.ptz — Newton-Muon enabled, 180s smoke run
- models/baseline_pr1874_seed42.int6.ptz — PR openai#1874 baseline, NM disabled, seed=42 (val_bpb 1.06928)

This lets a reviewer inspect the actual trained quantized weights without having to retrain. Total ~46 MB of binary artifacts. README updated with a verified CPU-only inspection snippet (brotli decompress -> _byte_unshuffle -> torch.load) demonstrating the artifacts are well-formed int6 GPTQ state dicts; a hedged approximation of that snippet follows this comment.

Author: Saicharan Ramineni <[email protected]>
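A hedged approximation of the inspection flow named above (brotli decompress -> `_byte_unshuffle` -> `torch.load`). `_byte_unshuffle` is the repo's own helper; its import path and signature (assumed here to be bytes -> bytes, importable from `train_gpt`) should be checked against the producer script:

```python
import io

import brotli
import torch

# Assumption: _byte_unshuffle lives in the training script and maps bytes -> bytes.
from train_gpt import _byte_unshuffle

with open("models/baseline_pr1874_seed42.int6.ptz", "rb") as f:
    raw = brotli.decompress(f.read())          # undo the outer brotli layer
payload = _byte_unshuffle(raw)                 # undo the byte shuffle applied before compression
state_dict = torch.load(io.BytesIO(payload), map_location="cpu")

for name, value in state_dict.items():         # quick structural sanity check
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```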
Potential SmearGate boundary leak in #1874 / inherited #1790 path

I noticed this PR appears to inherit the pre-BOS-fix SmearGate pattern from the #1790 stack. In the PR diff, both of these lines appear in their pre-fix form:

g = sl * torch.sigmoid(self.smear_gate(x[:, 1:, :self.smear_width]))
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1]], dim=1)

If production train/eval ever passes packed multi-document rows, then for a packed stream the BOS embedding for the new document becomes x_bos + g * x_prev, where x_prev is the final position of the previous document, and the logit at that BOS position scores information leaked across the document boundary. Could you confirm whether #1874's production train/eval path ever presents packed multi-doc rows to this SmearGate code? If yes, it needs the same BOS mask used in the BOS-fix submissions:

bos = input_ids[:, 1:].eq(BOS_ID)
prev = x[:, :-1].masked_fill(bos[..., None], 0)
x = torch.cat([x[:, :1], x[:, 1:] + g * prev], dim=1)

This is independent of CaseOps; it is a packed-document boundary issue.
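For completeness, a self-contained sketch of a smear-gate forward with the BOS mask folded in. The gate width, the meaning of `sl` (taken here as a learned scalar), and the output width of `smear_gate` (taken as 1) are assumptions; only the masking logic is the point:

```python
import torch
import torch.nn as nn

BOS_ID = 0  # assumption: substitute the tokenizer's actual BOS id


class MaskedSmearGate(nn.Module):
    """Illustrative smear-gate mix-in of the previous position, zeroed at BOS boundaries."""

    def __init__(self, smear_width: int):
        super().__init__()
        self.smear_width = smear_width
        self.smear_gate = nn.Linear(smear_width, 1)   # assumed: scalar gate per position
        self.sl = nn.Parameter(torch.zeros(1))        # assumed: learned scalar scale

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); input_ids: (batch, seq)
        g = self.sl * torch.sigmoid(self.smear_gate(x[:, 1:, :self.smear_width]))
        # Zero the previous-position contribution wherever the *current* token is BOS,
        # so a packed document's first position never sees the previous document.
        bos = input_ids[:, 1:].eq(BOS_ID)
        prev = x[:, :-1].masked_fill(bos[..., None], 0)
        return torch.cat([x[:, :1], x[:, 1:] + g * prev], dim=1)
```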
V19c (seed 42) result: 1.06179 BPB (LOSS by +0.001 vs PR openai#1908 frontier 1.06081).

V19c data attribution:
- pre-quant 1.06906 vs PR openai#1908 1.06384 = +0.0052 hurt -> primary cause: MATRIX_LR=0.028 (vs default 0.026) penalty on seed 42
- TTT recovery -0.01489 vs PR openai#1908 -0.01269 = +0.0022 helped -> AsymLogit + PHASED_TTT_PREFIX=3500 actually working

V20 strategy: remove LR penalty + keep TTT helpers + add LoRA capacity:
- DROP MATRIX_LR=0.028 -> default 0.026 (recovers +0.005 BPB on pre-quant)
- KEEP ASYM_LOGIT_RESCALE=1 (eval-only, verified -0.001 to -0.002)
- KEEP TTT_WEIGHT_DECAY=2.0 (stability fix)
- KEEP PHASED_TTT_PREFIX_DOCS=3500 (verified: more LoRA training data)
- ADD TTT_LORA_RANK=144 (vs 96 default, +50% LoRA capacity). PR openai#1909 (GodlyDonuts) verified rank=192 gives a small benefit on PR openai#1874; a conservative 144 balances benefit vs the eval-time budget (V19c was 527s, 73s buffer).

Predicted (seed 42):
- pre-quant: ~1.063 (no train hparam changes from PR openai#1908)
- quantized: ~1.072 (matches PR openai#1908 quant tax)
- post-TTT: ~1.057 (TTT recovery -0.013 base + -0.002 AsymLogit/PHASED + -0.001 RANK = -0.016)

Win threshold: < 1.06021 (PR openai#1908 - 0.0006 community floor). Probability of a true win: ~50%. Cost: ~$22 single-seed scout on 8xH100 SXM.
Record: PR #1790 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean)
Summary
val_bpb = 1.06766 (3-seed mean, std 0.00076) | val_loss = 2.34122 nats/token | ~15.95 MB | 8×H100 SXM, 600s train / 600s eval
Builds directly on PR #1790's verbatim non-CaseOps stack and layers in three orthogonal, previously validated techniques from open PRs:
- Polar Express Newton-Schulz coefficients (PR #1344): five minimax-tuned per-iteration tuples replace the fixed (3.4445, −4.775, 2.0315) × 5, same MUON_BACKEND_STEPS=5.
- MIN_LR=0.10 warmdown floor (PR #1787): the LR floors at 10% of max instead of decaying to 0 during warmdown.
- LQER asymmetric rank-4 quantization correction (PR #1797): rank-4 SVD of the top-K=3 GPTQ residuals, packed as asymmetric int4 with per-group-64 scales.

Net effect: −0.00225 BPB vs PR #1790 (1.06991 → 1.06766) on the same SP8192 base, all training-time legal under Issue #1017 conditions. No CaseOps, no casefold, no preprocessing — denominator is original UTF-8 bytes throughout. Improvement over current merged SOTA (PR #1493 = 1.0810) is −0.0133 BPB at p ≪ 0.001.
3-Seed Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval budgets)
All three seeds clear size, train-time, and eval-time budgets. 3-seed std (0.00076) is well inside the 0.005 significance floor.
Head-to-head vs PR #1790 (matched seed)
Single-seed comparison on seed 1337 (the only seed PR #1790 reports individually that overlaps with our set). The improvement direction is consistent with the −2.25 mBPB gap on the 3-seed means.
Key Techniques
1. Polar Express Newton-Schulz coefficients (from PR #1344)
Muon's zeropower_via_newtonschulz5 orthogonalizes a gradient matrix via 5 iterations of a Newton-Schulz polynomial. Stock Muon uses a single fixed coefficient tuple (a, b, c) = (3.4445, −4.775, 2.0315) for all 5 iterations. Polar Express replaces this with 5 distinct per-iteration tuples, each minimax-tuned for the residual error distribution at that step. Same compute budget, sharper orthogonal approximation.

Env var: POLAR_EXPRESS_NS=1.
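As a concrete illustration (not the PR's exact code), a minimal zeropower_via_newtonschulz5 in the usual Muon style with an opt-in path that consumes one (a, b, c) tuple per iteration; the per-step tuples are placeholders for the PR #1344 values:

```python
import torch

_FIXED = [(3.4445, -4.775, 2.0315)] * 5   # stock Muon: the same tuple every iteration
_PE = [(3.5, -4.8, 2.0)] * 5              # placeholders for the minimax-tuned tuples

def zeropower_via_newtonschulz5(G: torch.Tensor, polar_express: bool = False) -> torch.Tensor:
    """Approximate the orthogonal polar factor of G with 5 Newton-Schulz iterations."""
    coeffs = _PE if polar_express else _FIXED
    X = G.bfloat16()
    if G.size(-2) > G.size(-1):
        X = X.mT                                              # work on the wide orientation
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)       # keep top singular value <= 1
    for a, b, c in coeffs:                                    # per-iteration (a, b, c)
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X                                     # X <- aX + b(XX^T)X + c(XX^T)^2 X
    if G.size(-2) > G.size(-1):
        X = X.mT
    return X
```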
2. MIN_LR=0.10 warmdown floor (from PR #1787)
PR #1790's LR schedule decays linearly to 0 during the warmdown phase. The final ~25% of training under warmdown_frac=0.75 runs at near-zero LR, delivering no useful gradient updates. Flooring lr_mul at 0.10 keeps the tail of training productive.

Env var: MIN_LR=0.10.
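A minimal sketch of the floored warmdown multiplier, assuming warmdown_frac is the fraction of training spent in warmdown and that the base schedule is linear (the PR's actual schedule code may differ):

```python
def lr_mul(step: int, total_steps: int, warmdown_frac: float = 0.75, min_lr: float = 0.10) -> float:
    """Multiplier on the peak LR: 1.0 before warmdown, then linear decay floored at min_lr."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    frac_remaining = (total_steps - step) / max(total_steps - warmdown_start, 1)
    return max(frac_remaining, min_lr)   # with min_lr=0.0 this is the original decay-to-zero schedule
```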
3. LQER asymmetric rank-4 quantization correction (from PR #1797)
After GPTQ quantizes a weight W, the residual E = W − dequant(quantize(W)) is non-zero. LQER computes a rank-4 SVD of E for the top-K=3 highest-error layers and packs the factors as asymmetric int4 with per-group (g=64) scales. Selection by ||E||_F; in practice top-3 captures the bulk of the recoverable error mass and going wider is neutral (tested at top-12 in a single-seed ablation).

Env vars: LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_GROUP_SIZE=64.
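For concreteness, a hedged sketch of the per-group (g=64) asymmetric int4 scale/zero-point arithmetic; the PR's actual packing helpers additionally handle nibble packing and mixed factor bit-widths, so treat this only as an illustration:

```python
import torch

def quant_asym_int4_grouped(x: torch.Tensor, group: int = 64):
    """Asymmetric int4 quantization of x in groups of `group` along a flattened view.

    Assumes x.numel() is divisible by `group`. Returns codes in [0, 15] plus the
    per-group scale and minimum needed for dequantization.
    """
    xg = x.float().reshape(-1, group)
    lo = xg.min(dim=1, keepdim=True).values
    hi = xg.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / 15.0                  # 16 quantization levels
    q = ((xg - lo) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, lo

def dequant_asym_int4_grouped(q, scale, lo, shape):
    """Invert quant_asym_int4_grouped back to the original tensor shape."""
    return (q.float() * scale + lo).reshape(shape)
```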
Hyperparameters
All other hyperparameters inherited from PR #1790 base unchanged.
Rule Compliance (Issue #1017 Track B)
Attribution
Test plan
- Run SEED=1337 ... torchrun --standalone --nproc_per_node=8 train_gpt.py (full env above).
- Confirm the quantized_ttt_phased val_bpb matches the logged 1.06699 (±0.00076) within seed noise.