Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344
Open
Omrigotlieb wants to merge 1 commit into openai:main from
Conversation
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 5, 2026
almutwakel added a commit to almutwakel/parameter-golf that referenced this pull request on Apr 8, 2026
… BPB Systems-only optimization: replace standard Newton-Schulz with Gram Newton-Schulz in the Muon optimizer. Mathematically equivalent (same fixed point), +3.4% throughput. Based on PR openai#1344 (Omrigotlieb).
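The equivalence is straightforward: each Newton-Schulz step multiplies X on the right by a polynomial p(G) of the Gram matrix G = XᵀX, and since p(G) commutes with G, the whole iteration can run on the small n×n Gram matrix, applying the accumulated polynomial product to X once at the end. A minimal sketch of the idea (function names and the single-matrix XᵀX layout are illustrative, not the commit's actual kernel):

```python
import torch

NS_COEFFS = (3.4445, -4.7750, 2.0315)  # stock Muon quintic tuple

def ns_standard(X, steps=5, eps=1e-7):
    # Stock iteration: X <- X @ p(G) with G = X.T @ X and
    # p(G) = a*I + b*G + c*G@G; every step does m x n work.
    a, b, c = NS_COEFFS
    X = X / (X.norm() + eps)
    for _ in range(steps):
        G = X.T @ X
        X = a * X + X @ (b * G + c * G @ G)
    return X

def ns_gram(X, steps=5, eps=1e-7):
    # Gram variant: since X_{k+1} = X_k @ p(G_k) and p(G) commutes with G,
    # G_{k+1} = G_k @ p(G_k)^2. Iterate on the n x n Gram matrix only,
    # accumulate P = p(G_0) @ ... @ p(G_{k-1}), then form X @ P once.
    # Same fixed point as ns_standard; cheaper when m >> n.
    a, b, c = NS_COEFFS
    X = X / (X.norm() + eps)
    n = X.shape[1]
    I = torch.eye(n, dtype=X.dtype, device=X.device)
    G = X.T @ X
    P = I.clone()
    for _ in range(steps):
        poly = a * I + b * G + c * G @ G
        P = P @ poly
        G = G @ poly @ poly
    return X @ P
```

In low precision the two paths won't be bit-identical (different multiplication orders), but they converge to the same orthogonal factor, which is what the "mathematically equivalent, +3.4% throughput" claim rests on.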
mradassaad added a commit to mradassaad/parameter-golf that referenced this pull request on Apr 9, 2026
Stateful eval was previously flagged as harmful on the grounds that INT6 quant errors accumulate in SSM recurrent state. Measurement shows the quant delta is actually flat at ~8.2 mBPB across windows of 100-1892 tokens — no accumulation. The real cause of the pure-stateful BF16 regression was attention context loss at window boundaries. Stateful-overlap eval with overlap=1024 closes the gap to sliding-window eval to within 0.3 mBPB while running in ~32s vs 500s, freeing 468s of eval budget for SLOT/TTT.

Also corrects merged SOTA to 1.1147 BPB (PR openai#1019), flags PRs openai#1329/openai#1344/openai#1430 as unmerged/invalid, and revises the SLOT estimate from 50-150 to 15-30 mBPB based on capacity-regularization reasoning.
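To make the bookkeeping concrete, here is a minimal sketch of what a stateful-overlap evaluator could look like. The `forward_eval` API (per-token NLLs plus a recurrent-state snapshot taken partway into the chunk) is hypothetical, as is the loop structure; the point is that the overlap tokens restore attention context at each boundary while the state checkpoints are spaced so that, along the checkpoint chain, no token updates the SSM state twice:

```python
import math

def eval_stateful_overlap(model, tokens, window=1892, overlap=1024):
    """Sketch only. model.forward_eval(chunk, state, snapshot_at) is a
    hypothetical API returning (per-token NLLs in nats, the recurrent-state
    snapshot after the first `snapshot_at` tokens of `chunk`)."""
    stride = window - overlap            # fresh tokens scored per step
    state, ctx = None, 0                 # `state` is "after tokens[:ctx]"
    nll_sum, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        chunk = tokens[ctx:end]          # overlap context + fresh tokens
        next_ctx = max(0, end - overlap) # where the next window's state sits
        nlls, state = model.forward_eval(chunk, state=state,
                                         snapshot_at=next_ctx - ctx)
        nll_sum += sum(nlls[start - ctx:])  # score only the fresh tokens
        n_scored += end - start
        ctx = next_ctx
    # bits per token; a BPB figure would divide by byte count instead
    return nll_sum / n_scored / math.log(2)
```

The invariant is that `state` always corresponds to the prefix `tokens[:ctx]`, so each checkpoint-to-checkpoint segment passes through the recurrent state exactly once even though the attention window re-reads the overlap region.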
AjAnubolu added a commit to AjAnubolu/parameter-golf that referenced this pull request on Apr 14, 2026
Layers 3, 4, 5 share MLP weights; attention weights stay unique per layer. Weight decay bumped to 0.09 (from 0.04) to regularize the shared MLP. Based on PRs openai#1334/openai#1344, which report 1.089-1.092 BPB with this setup.

Why this works now when our prior attempt failed:
- Prior: shared ALL layer weights -> quant error amplified 900x
- Now: share ONLY the MLP, keep attention unique -> per-layer discrimination
- Higher WD regularizes against per-layer overfitting
- Full-Hessian GPTQ correctly accumulates Hessians across sharers

Saves ~6.3 MB of parameters. The reinvest budget is the whole point: wider MLP, larger BigramHash, more unique layers, or higher-precision quantization for critical layers.

GPTQ integration: the forward pass accumulates Hessians under a shared key, quantizes the shared weight once using the combined Hessian, and dedupes in _rebank_state_dict when constructing the export bank.
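A minimal PyTorch sketch of the share-only-MLP wiring, assuming a standard pre-norm block (module layout and dimensions illustrative, not this commit's code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block whose MLP can be shared across layers."""
    def __init__(self, dim, n_heads, shared_mlp=None):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        # Attention stays unique per layer (per-layer discrimination) ...
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # ... while the MLP is either this block's own or a shared module.
        self.mlp = shared_mlp if shared_mlp is not None else nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

dim, n_heads, n_layers = 256, 4, 8
shared = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                       nn.Linear(4 * dim, dim))
blocks = nn.ModuleList(
    Block(dim, n_heads, shared_mlp=shared if i in (3, 4, 5) else None)
    for i in range(n_layers))
# The shared weights appear under one state_dict key per sharing block but
# point at the same tensors; an exporter must dedupe them (the commit's
# _rebank_state_dict) for the ~6.3 MB saving to be realized on disk.
```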
This was referenced Apr 26, 2026
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request on Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); reviewer caught it and the author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet); reviewer @sharpobject caught it.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on the top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of the fixed tuple.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack), since the current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
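A sketch of the two pieces this commit wires in, assuming the structure it describes. The names zeropower_via_newtonschulz5, _PE_COEFFS, _POLAR_EXPRESS_NS, POLAR_EXPRESS_NS, and MIN_LR come from the commit; the first and last Polar Express tuples are the values quoted later in this thread, the middle three are placeholders (not the real minimax schedule), and the warmdown shape is illustrative:

```python
import os
import torch

_FIXED = (3.4445, -4.7750, 2.0315)   # stock Muon tuple, applied every iter
# Per-iteration tuples: endpoints as quoted in this thread; the middle three
# are placeholders; substitute the minimax-tuned Polar Express schedule.
_PE_COEFFS = ((8.16, -22.5, 15.9), _FIXED, _FIXED, _FIXED, (2.35, -1.71, 0.42))
# Read once at import time so torch.compile treats the flag as a constant.
_POLAR_EXPRESS_NS = os.environ.get("POLAR_EXPRESS_NS", "0") == "1"

def zeropower_via_newtonschulz5(G, steps=5, eps=1e-7):
    """Newton-Schulz orthogonalization, optionally with per-iter coefficients."""
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for i in range(steps):
        coeffs = (_PE_COEFFS[min(i, len(_PE_COEFFS) - 1)]
                  if _POLAR_EXPRESS_NS else _FIXED)
        a, b, c = coeffs
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

MIN_LR = float(os.environ.get("MIN_LR", "0.0"))   # 0.10 opts into the floor

def warmdown_lr(step, total_steps, max_lr, warmdown_frac=0.4):
    """Illustrative schedule: linear warmdown that floors at MIN_LR * max_lr
    instead of decaying to zero."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return max_lr
    frac = (total_steps - step) / (total_steps - start)
    return max_lr * max(MIN_LR, frac)
```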
cocohearts pushed a commit that referenced this pull request on Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token. -0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96 params/layer vs dense GatedAttn 4096), preserving the attn_gate_w name so the int8-per-row quant path still routes it (size-range check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training forward; the eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0 (was 4000) together reclaim ~15s of training budget for additional depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max 15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every individual seed beats its PR #1736 counterpart (deltas -1.20 to -2.27 mBPB). Changes are fully orthogonal to PR #1779's frozen recurrent α/β and PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's _loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
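Of the four wins, the sparse head-output gate is the least self-explanatory. A sketch of the pattern, with the class name, shapes, and parameter count illustrative (the commit reports 96 params/layer but doesn't give the exact factorization; only the attn_gate_w name is from the source):

```python
import torch
import torch.nn as nn

class SparseHeadGate(nn.Module):
    """Per-head output gate, modded-nanogpt style: a handful of learned
    scalars per layer instead of a dense [dim, dim] gating matrix."""
    def __init__(self, n_heads: int):
        super().__init__()
        # Keeping the attn_gate_w name means a name-based quantization
        # router (like the int8-per-row path described above) still matches,
        # provided its size-range check admits tiny tensors.
        self.attn_gate_w = nn.Parameter(torch.zeros(n_heads))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: [batch, n_heads, seq, head_dim]; sigmoid(0) = 0.5 at init, so
        # every head starts half-open and learns to open or shut on its own.
        return y * torch.sigmoid(self.attn_gate_w).view(1, -1, 1, 1)

# Usage: gate the per-head attention outputs before the output projection.
y = torch.randn(2, 6, 128, 64)          # [batch, heads, seq, head_dim]
gated = SparseHeadGate(n_heads=6)(y)
```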
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request on Apr 30, 2026
This was referenced May 1, 2026
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on May 3, 2026
…_blend_lr

Iter 117 v1 NaN'd at step 60 (recon 1.2e-4 → 6.8e-2, attn_cv 0.88 → NaN). Root cause: the 0.7% entmax-1.5 contribution at init, combined with no entropy cushion, let CV concentration cascade once routing started concentrating. Three principled fixes in this commit:

1. Annealed blend (H87 v2): blend = (1 - anneal) + sigmoid(blend_logit) * anneal, with anneal ramping 0 → 1 over warmup_delay_frac=0.3. At init (anneal=0) the blend is forced to 1.0 = pure softmax = iter 100b exact (strict-gen verified numerically: max abs diff 0.00 at anneal=0). Avoids cold-start instability. New buffer `_entmax_blend_anneal` + setter, written by the training-loop annealer alongside the variance/entropy schedules.

2. Polar-Express Newton-Schulz Muon coefficients (PR openai#1344 → openai#1787 in records). Per-iter (a, b, c) tuples replace the stock fixed (3.4445, -4.7750, 2.0315). Iter 1: aggressive (8.16, -22.5, 15.9); iter 5: gentle (2.35, -1.71, 0.42). At 10 iters: 0.05 mean rel err vs stock 0.20 — 4× better orthogonalization quality at the same matmul count. IMPORTANT — DEQ STABILITY CONSTRAINT: at backend_steps=5 (records' default) the aggressive iter-1 coefficient overshoots and breaks DEQ reverse reconstruction (smoke recon 1.2e-4 → 6.8e-2). Records are non-DEQ, so they tolerate this. We MUST run at backend_steps=10 — confirmed smoke PASSES. Default lifted 5 → 10. Net throughput cost ~5-10% step_avg; net quality gain 4× orthogonalization. Likely net-positive.

3. Slow LR for blend_logit (entmax_blend_lr=0.002, 10× smaller than scalar_lr=0.02). Mirrors the parcae_lr precedent — params controlling sensitive system dynamics get a slow LR to bound drift rate. Once anneal ramps after warmup_delay, gradient flows but blend_logit drifts 10× slower, so routing has time to adapt to gradually-introduced sparsity. New optimizer group carved from scalar_params via a "_entmax_blend_logit" name filter; coverage assertion updated.

Smoke test PASSED: loss 7.00 → 4.55 at 300 steps with all three fixes active (entmax off by default — needs --use-entmax-routing=1 to enable). CLAUDE.md §5 muon_backend_steps row added documenting the 5 → 10 lift and the DEQ-specific constraint vs records' default.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
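The first and third fixes are compact enough to sketch directly. The names `_entmax_blend_logit`, entmax_blend_lr, scalar_lr, and warmup_delay_frac follow the commit; the function names and the "scalar param" criterion are illustrative:

```python
import torch

def blend_weight(anneal: float, blend_logit: torch.Tensor) -> torch.Tensor:
    """Annealed blend: (1 - anneal) + sigmoid(blend_logit) * anneal.
    At anneal=0 this is exactly 1.0 (pure softmax, matching the pre-entmax
    model, with zero gradient to blend_logit); as anneal ramps to 1 the
    learned logit takes over, gradually introducing entmax-1.5 sparsity."""
    return (1.0 - anneal) + torch.sigmoid(blend_logit) * anneal

def anneal_at(step: int, total_steps: int, warmup_delay_frac: float = 0.3):
    """Linear 0 -> 1 ramp over the first warmup_delay_frac of training."""
    return min(1.0, step / (warmup_delay_frac * total_steps))

def make_scalar_groups(model, scalar_lr=0.02, entmax_blend_lr=0.002):
    """Carve blend_logit out of the scalar params by name filter, giving it
    a 10x slower LR to bound its drift rate."""
    blend, scalars = [], []
    for name, p in model.named_parameters():
        if "_entmax_blend_logit" in name:
            blend.append(p)
        elif p.ndim <= 1:                 # illustrative "scalar" criterion
            scalars.append(p)
    return [{"params": scalars, "lr": scalar_lr},
            {"params": blend, "lr": entmax_blend_lr}]
```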
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 5, 2026
- Parallel residuals: credit PR openai#1204 (first) and PR openai#1529 (layer 8)
- Depth recurrence: clarify 2-pass origin (openai#1344); 3-pass is the final config
- Loop curriculum frac=0.35: correct PR to openai#1420 (not openai#1344)
- SmearGate: note original PR openai#162, SP8192 reintroduction in openai#1667
- Attention output gate: PR openai#1667 first; PR openai#1787 narrowed it
- LeakyReLU²: replaced relu², not GELU
- EMA decay: remove specific wrong value, defer to PR openai#287
- GPTQ first: correct to PR openai#535 (not openai#1019)
- Global SGD phase: 2000 docs (openai#1610) refined to 2500 (openai#1626)
- Training closing line: "halfway point" → "35% mark"

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 5, 2026
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 5, 2026
- Depth recurrence: openai#1344 → openai#1204 (first appearance on leaderboard)
- GPTQ: openai#535 → openai#374 (GPTQ-lite first introduced in openai#374)
- int7 embeddings: openai#1586 → openai#1626 (openai#1586 not in leaderboard)
- LQER: openai#1797 → openai#1851 (technique evolution credits openai#1851)
- XSA table: openai#287 → openai#265 (openai#265 is "first XSA")
- SmearGate table: openai#1667 → openai#1851 (openai#1851 fixed the BOS bug; openai#1667 just reused SmearGate)
- LeakyReLU² table: openai#493 → openai#549 (openai#549 is "first" per leaderboard)
- AWQ-lite table: openai#1908 → openai#1945 (openai#1908 not in leaderboard)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Summary
Results
Innovations (on clarkkev PR #1218 SP4096 base)
Run Command
Test plan
🤖 Generated with Claude Code