Record: Bank QAT + seq4096 + SWA w=256 + QK-Gain 2.5 + PKO — val_bpb 1.1117 (3-seed mean)#1512
Open
Itssshikhar wants to merge 77 commits into openai:main
Conversation
Novel: Efficient Partial Exclusive Self Attention on last 3 layers. GQA-aware reshape avoids tensor duplication (<2ms overhead). Beats prior SOTA (1.1318) by 0.0011 BPB. 15.9MB artifact.
…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).
3-seed results:
- Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
- Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB
- Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
- Mean: 1.1194 (std 0.0006)
All artifacts under 16 MB. All evals under 10 min.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…U-Net + NeoMuon + 4xrelu²MLP + Smear + Fact Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8 + Bit-packing LZMA + Stride-16 Eval - 2h (openai#641)
* Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary U-Net (15L 768d 8192BPE relu² 4xMLP FP8 SmearGate, 50k steps)
* Updated README.md for Non-record submission.
Co-authored-by: Ciprian-Florin Ifrim <[email protected]>
… relu² 4xMLP FP8) (openai#640) Co-authored-by: Ciprian-Florin Ifrim <[email protected]>
…What Works, What Doesn't, and Why (openai#363)
* Non-record: depth recurrence + quantization error amplification finding. 4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP, BigramHash + XSA + LoRA + Late STE QAT + int8+zstd.
  Key finding: quantization error amplifies ~900x through recurrence cycles, making int6 incompatible with weight-sharing architectures. Int8 for shared blocks reduces the gap from 1.14 to 0.37 bpb.
  3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)
* docs: comprehensive depth recurrence research writeup. Complete 4-day experimental report on looped transformers in Parameter Golf:
  - Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
  - Noisy QAT: novel technique collapsing quant error from 0.37 to 0.002 bpb
  - 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
  - 12 negative results with specific numbers
  - Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
  - Updated training script with all experimental features
* Update README.md (me when I can't write)
* fix: remove extra files, update writeup per reviewer feedback
  - Remove pr325_train_gpt.py from PR (dev file, not submission)
  - Restore original README.md
  - Update records/ writeup with v2 content
  - Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
  - Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)
Co-authored-by: Evangeline Kamin <[email protected]>
Replaces standard Newton-Schulz in Parallel Muon with the Gram reformulation that iterates on the smaller n×n Gram matrix instead of the full n×m matrix. Saves ~40-50% FLOPs on MLP banks (512×1536) by avoiding repeated large matmuls in the inner loop. Drop-in replacement: same interface, same mathematical result, fewer FLOPs per Muon step → more training steps in 10 minutes. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
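A minimal sketch of the Gram reformulation described in this commit, assuming the iterate is tracked as a polynomial factor P with X_k = P @ X_0 so every inner-loop matmul is n×n; the names are illustrative, and the committed version differs in details (per-step coefficients, restarts, bank batching).

```python
import torch

@torch.no_grad()
def gram_newton_schulz(X: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Orthogonalize X (n x m, n <= m) while iterating only on n x n matrices."""
    a, b, c = (3.4445, -4.7750, 2.0315)   # standard Muon quintic coefficients
    n, m = X.shape
    assert n <= m, "transpose tall matrices so the Gram side is the small one"
    X = X / (X.norm() + 1e-7)             # same pre-scaling as plain NS
    G = X @ X.mT                          # n x n Gram matrix, built once from X
    I = torch.eye(n, dtype=X.dtype, device=X.device)
    P = I.clone()                         # running factor: X_k = P @ X_0
    for step in range(steps):
        Q = a * I + b * G + c * (G @ G)   # X_{k+1} = Q @ X_k, all n x n work
        P = Q @ P
        if step < steps - 1:              # G_{k+1} = Q G_k Q^T; skip on last step
            G = Q @ G @ Q.mT
    return P @ X                          # single n x m matmul at the end
```

For a 512×1536 bank (m = 3n), each plain NS step costs roughly 2n²m + n³ ≈ 7n³ FLOPs versus ~4n³ here plus one final n²m matmul, consistent with the ~40-50% saving claimed above.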
- Use Polar Express per-step coefficients instead of fixed (a,b,c)
- Initialize Q as Z + a*I on first iteration (not from identity)
- Skip R update on last iteration and before restarts
- Fall back to standard NS for square matrices (no Gram overhead)
- Use torch.baddbmm for fused multiply-add operations
- Add FlashAttention fallback chain (FA3 -> FA2 -> PyTorch SDPA)
- Update documentation with corrections, 4090 run instructions
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Fix FA3 import: flash_attn>=2.7 bundles FA3 as flash_attn.flash_attn_interface, not as a top-level module. Added submodule import path to fallback chain.
- Fix tensor contiguity: PyTorch 2.8 enforces contiguous tensors for dist.all_gather_into_tensor. Added .contiguous() after .mT transpose in zeropower_via_newtonschulz5.
- Document full 8xH100 run: 6196 steps in 600s, 96.86ms/step, post-TTT val_bpb=1.1228 (seed 1337), 16.3MB submission size.
- Include training log from successful run.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
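A hedged sketch of the fallback chain this commit describes; the import paths mirror the commit text and should be treated as assumptions about the installed flash_attn build rather than a pinned API.

```python
import torch.nn.functional as F

def pick_attention_backend():
    # Per the commit text, flash_attn >= 2.7 is said to expose the FA3 entry
    # point via the flash_attn.flash_attn_interface submodule.
    try:
        from flash_attn.flash_attn_interface import flash_attn_func
        return "fa3", flash_attn_func
    except ImportError:
        pass
    try:
        from flash_attn import flash_attn_func  # FA2 top-level entry point
        return "fa2", flash_attn_func
    except ImportError:
        # Last resort: PyTorch SDPA (note the different call signature).
        return "sdpa", F.scaled_dot_product_attention
```

The contiguity fix is orthogonal: any `.mT` view fed to `dist.all_gather_into_tensor` needs an explicit `.contiguous()` first under PyTorch 2.8.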
Previous pure PyTorch Gram NS was SLOWER than original (96.86ms vs 83.3ms) because without symmetric GEMM kernels, the Q/R tracking overhead exceeds the algorithmic FLOP savings.
Now uses the actual GramNewtonSchulz class from the gram-newton-schulz package (pip install gram-newton-schulz) which provides:
- CuTeDSL symmetric GEMM kernels (quack) that halve compute on H100+
- @torch.compile(fullgraph=True, mode="reduce-overhead") graph capture
- fp16 iteration (more mantissa precision than bf16)
- Automatic fallback to PyTorch ops on non-H100 GPUs
Falls back to the pure PyTorch implementation if the package is not installed. Added NS/attention backend logging for verification. Updated documentation with kernel installation instructions and analysis of why the pure PyTorch fallback was insufficient.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…n baseline) GramNewtonSchulz returns non-contiguous tensors for tall matrices, crashing distributed all-gather. Added .contiguous() on the fast path. Full 8xH100 kernel run shows 99.02ms step_avg vs ~83.3ms baseline — the kernel overhead exceeds FLOP savings at these matrix sizes (512x2048). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previous attempts were slower because they changed too many things: different coefficients, fp16 dtype, CUDA graph overhead, library abstraction layers, .contiguous() copies. This version changes ONLY the iteration math for rectangular matrices:
- Same (a,b,c) = (3.4445, -4.7750, 2.0315) as original
- Same bf16 dtype, same torch.compile behavior
- Square matrices (qo_bank): identical to original, zero overhead
- Rectangular matrices (mlp/kv banks): Gram NS with restart at step 2
- No external dependencies, no library imports
- ~23% fewer FLOPs on rectangular banks (3 large matmuls vs 10)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sliding window attention (window=256) on layers 0-7, full attention on layers 8-10. Built on top of the current openai#1 submission codebase.
Results: 75.34ms step_avg (13% faster than openai#1's 86.6ms), 7965 steps (+1000 over openai#1), pre-quant val_bpb 1.1332 (0.002 better than openai#1). Sliding eval val_bpb 1.1186 with GPTQ-lite (0.003 from openai#1's 1.1159).
Key finding: the eval model MUST use the same sliding window config as training. Switching to full attention at eval time causes catastrophic failure (2.4 BPB) because the Q/K weights never learned distant attention patterns and int6 quantization noise amplifies the untrained scores. Confirmed via 4-way diagnostic (eager vs compiled x SWA vs full) — the bug is attention mismatch, not torch.compile.
Includes: Modal run scripts, NS benchmarks, debug roundtrip scripts, original openai#1 code for baseline comparison, full experiment documentation.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
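A minimal sketch of the layer split described above, assuming flash_attn's `window_size=(left, right)` convention where -1 means unrestricted; the constant names are illustrative.

```python
# Sliding window (w=256) on layers 0-7, full causal attention on layers 8-10.
N_LAYERS = 11
SWA_WINDOW = 256
FULL_ATTN_START = 8

def window_for_layer(layer_idx: int) -> tuple[int, int]:
    # flash_attn convention: (left, right) lookback/lookahead; -1 = unlimited.
    if layer_idx < FULL_ATTN_START:
        return (SWA_WINDOW, 0)   # causal sliding window: 256 tokens back
    return (-1, 0)               # full causal attention

# inside the attention forward, with flash_attn_func assumed imported:
# out = flash_attn_func(q, k, v, causal=True,
#                       window_size=window_for_layer(self.layer_idx))
```

Per the key finding above, the same window table has to be applied at eval time; evaluating with full attention on layers 0-7 is what produced the 2.4 BPB failure.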
v13a (strip VE): submittable at 15.84MB, sliding 1.0955 — matches v12 exactly. VE strip is free (0.000 BPB cost, actually slightly better post-EMA).
v13c (TTT on bankless): catastrophic divergence +0.645 BPB (1.1067→1.7519), 571s eval time. TTT is dead for our stack across all architectures.
train_v13.py includes TTT bug fixes: inference_mode→no_grad for Phase 1 scoring, RoPE cache invalidation before Phase 2 training.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…g dead Critical read of v13a + v13c results:
- v13a sliding 1.0955 is submittable; both v12 and v13a converge to the same 1.09553 ceiling under SEED=1337 — likely a real stack ceiling, needs SEED=42 to confirm.
- TARGET_MB units bug retroactively explains why v9-v12 all "fit" the prune target while busting the decimal-byte cap. Fix: target_bytes = int(target_mb * 1_000_000) instead of *1024*1024.
- TTT v13_plan declared "dead" prematurely. Phase 2 forward runs with looping_active=True (train_v13.py line 1296), giving recurred layers 4-5 double gradient per backward — predictable runaway at LR=0.005. Three retries to try (LR/10, epochs=1, looping_active=False during phase 2, exclude recurred layers) before declaring TTT impossible.
Run plan in priority tiers:
- Tier 1: submit v13a, fix TARGET_MB bug, run v13d (parallel residuals)
- Tier 2: SEED=42 reproducibility, strip bigram
- Tier 3: TTT retry sweep with sane HPs
- Tier 4: 3-layer recurrence, HP sweep matching PR openai#1493
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Full experiment sweep on v12-based architecture:
- v13a (strip VE): 1.0955 sliding — submittable baseline
- v13b (strip bigram): 1.1075 — pruning kills quant (+0.012)
- v13c (TTT): catastrophic divergence (+0.645)
- v13d (parallel L7+): 1.0969 — slightly worse
- v13e (3-layer recur): 1.0958 — neutral
- v13f (PR#1493 hparams): 1.0985 — worse quant gap
- v13g (strip both): 1.0959 — neutral, 302KB freed
- v13h (12 layers): 1.1331 — best pre-quant (1.0886) but 6.8% pruning destroys it
- v13i (dim=528): crash — flash_attn head_dim%8 constraint
Key findings: pruning is superlinearly destructive, the quant gap dominates final quality, and the architecture is at its capacity ceiling within the 16MB cap.
Next: reproduce PR openai#1493 baseline (1.0810) and build on top.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Document experimental findings from attempts to beat PR openai#1493's 1.0810 BPB:
- QAT (5 variants): all failed due to EMA contamination at decay=0.9965
- PKO: works for sliding but catastrophically breaks TTT (+0.02 BPB)
- Mixed-precision: budget too tight (6.4KB margin vs 16.5KB minimum)
Includes training scripts, eval scripts, logs, and EXPERIMENTS.md summary.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Complete archive of parameter-golf work: TTT sweeps, v7-v13 experiments, profiling scripts, hessian/requant sweeps, 3-seed runs, and baseline logs. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds pr1493_priority_results.md, populated by a 10-min cron that watches the orchestrator logs for the 5 queued experiments and records BPB/size metrics, errors, and learnings as each run completes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
docshuffle finished. Quantized TTT bpb 1.08279 vs comparator 1.08079 = +0.00200 BPB worse. Doc loader cost ~10% of training steps via per-batch index overhead (tok/s drifted 7.6M -> 6.0M). Submission size 16,033,898 bytes also over the 16M limit by 33,898 bytes. Verdict: drop. wd experiment now running. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Weight-decay schedule beat the 1.08079 comparator by 0.00050 BPB. Above the 0.0002 noise threshold but well short of the ~0.0018 needed for leaderboard acceptance on its own. tok/s held ~7.7M -> 6.7M (no extra slowdown vs baseline). Submission 16,031,886 B, still over 16M limit by 31,886 B; code-size minification needed before submit. iha experiment now running. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
iha trained fine (pre-quant 1.08820, stop step 4527/20000) but crashed during GPTQ quantization with KeyError 'blocks.0.attn.c_q.weight'. Root cause: collect_hessians hooks nn.Linear forwards, but IHA replaces self.c_q(x) with F.linear(x, _mixed_weight(...)) so the Linear never fires and no Hessian gets recorded. Needs harness fix before the idea can be evaluated. Also flagged: orchestrator set -e doesn't catch torchrun failures because the run is piped through tee+tail. Fortunately mtp continued. And: all submissions so far are 30 KB over the 16 MB limit — code minification required. mtp now running. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
MTP auxiliary loss at weight=0.10 hurt across the board:
- pre-quant 1.11283 vs baseline 1.0875-1.0880
- q_sw 1.11018 vs baseline 1.083
- q_ttt 1.09023 vs comparator 1.08079 (Delta = +0.00944)
At this 4438-step training budget, the gradient capacity spent on t+2 prediction is just stolen from t+1 fitting. Worth retrying at lower weight after confirmed wins land. Submission 16,035,001 B (over 16M). evalloop3 now running.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The prior session's "wd_paired 1.08009" was a no-op (FS turbulence rolled train_pr1493.py back to the pre-stacking version mid-session; the log had zero hits on tagged=22). This commit records the first real wd_paired run on commit 74dc702.
Result (seed 42, real paired-head Muon firing, tagged=22 confirmed on all 8 ranks): q_ttt 1.07974, q_sw 1.08209, q 1.09891, pre 1.08610. Stack is -0.00129 BPB vs raw PR1493, -0.00055 vs wd alone.
Also:
- safe_launch.sh: pre-launch guard that asserts HEAD/md5/git-blob/symbol-count/working-tree-clean before exec'ing torchrun.
- requirements.txt: add brotli (was implicit via Modal image only; attempt 1 of this run crashed in serialize() because the local env lacked it).
- pr1493_priority_results.md: append wd_paired row (openai#6).
- pr1493_wd_paired_session.md: full session writeup (verification methodology, two attempts, learnings, plan-doc vs reality).
- logs/pr1493_wd_paired_s42.{txt,stdout}: real run log + stdout.
- logs/pr1493_wd_paired_s42.{txt,stdout}.attempt1_brotli_crash: preserved evidence of the brotli ModuleNotFoundError crash.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Stronger WD schedule (WD_SCHED_LOW_FACTOR=0.50, WD_SCHED_HIGH_FACTOR=1.75) on top of wd_paired. Verified on commit 49d1068 with paired-head Muon firing (tagged=22) and wd_sched_low_factor: 0.5 / wd_sched_high_factor: 1.75 in the hyperparameter dump on all 8 ranks.
Result (seed 42):
pre   = 1.08573 (-0.00037 vs wd_paired)
q     = 1.09874 (-0.00017 vs wd_paired)
q_sw  = 1.08194 (-0.00015 vs wd_paired)
q_ttt = 1.07971 (-0.00003 vs wd_paired -- below noise floor)
Stronger WD adds a small, real pre-quant gain that is completely absorbed by GPTQ + TTT. Net q_ttt change is essentially zero. Default WD + paired-head Muon (q_ttt 1.07974) remains the right baseline for further stacking.
Also records the FS rollback event at 2026-04-29 10:45:24 UTC: .git/ was reborn from a snapshot taken before commit 49d1068, and logs/pr1493_wd_paired_s42.txt was overwritten by the prior session's bogus version. Recovery: git fetch + reset --hard origin/shikhar restored the real wd_paired log from the blob; the wd_strong_paired logs were untracked at the time and survived untouched.
safe_launch.sh: changed to a symbol-presence check with auto-recovery from local backup or origin/shikhar (replaces the strict HEAD/md5/blob pin).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Stacked IHA on top of wd_paired. Killed at the pre-quantization
post-EMA gate per agreed criterion (pre >= 1.08610 -> kill, since
wd_paired baseline is 1.08610).
IHA harness fix is verified working ("iha:folded active head mixes
into linear weights for 11 layers before GPTQ" appears in the log,
GPTQ Hessian collection succeeded with all 67 hessians, no KeyError
this time). The fix in commit 74dc702 fold_iha_mixes path correctly
folds q_mix/k_mix into the Q/K linear weights before GPTQ collection
runs.
But the recipe doesn't help. IHA's per-step forward overhead cost 68
training steps (4528 vs wd_paired's 4596), and the trained model is
worse pre-quant than every alternative (wd alone 1.08650, wd_paired
1.08610, wd_strong_paired 1.08573).
Killed via SIGTERM to torchrun parent during GPTQ. All 8 ranks
cleaned up, GPU mem freed.
Verdict: drop. Move to code-shrink scout + 3-seed wd_paired sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Implements the user's "highest-confidence remaining technical fix":
collect_hessians now dist.all_reduce(SUM) per-rank Hessians and divides
by n_calibration_batches * world_size, gated by GPTQ_ALL_REDUCE
(default on). Exposes GPTQ_DAMP and GPTQ_BLOCK_SIZE as env vars,
plumbed through Hyperparameters -> gptq_quantize_weight -> call site (a sketch of the all-reduce path follows this message).
requant_eval.py: standalone torchrun script that loads a saved FP
checkpoint, runs GPTQ + brotli, and evaluates q/q_sw/q_ttt without
re-training. ~3 min/cell with TTT, ~2 min with TTT_ENABLED=0.
safe_launch.sh: now host-portable (script-relative paths instead of
hard-coded /workspace/parameter-golf), REQUIRED_SYMBOLS extended with
the new all-reduce marker so a rolled-back source file fails loudly.
Findings (all numbers are q_ttt unless noted; full matrix in
pr1493_gptq_allreduce_session.md):
- 16 shards no-AR: 1.08060
- 16 shards AR: 1.07977 (-0.00084 vs no-AR -- AR works)
- 128 shards no-AR: 1.07976 (reproduces HF metadata 1.07976130)
- 128 shards AR: 1.07975 (-0.0000076 vs no-AR -- saturated)
Damp sweep at 128/AR: clean U-curve, 0.01 optimal (worst 0.05 is
+8e-5 BPB on q_sw). Block sweep at 128/AR: {64,128,256} all tied
within 3e-6 BPB on q_sw.
Net: AR is a real fix that rescues low-shard configs and gives
deterministic cross-rank quant, but at full data it does not move
the needle. GPTQ-knob tuning is exhausted at the default. Path to
target ~1.07910 needs methodology change (QK_GAIN/EMA retune,
fresh wd_paired training) plus the still-unsolved 30 KB code shrink.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
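A hedged sketch of the all-reduce fix as described above, assuming `hessians` maps weight names to per-rank accumulated Hessians; the harness variable names are assumptions.

```python
import os
import torch.distributed as dist

def average_hessians(hessians: dict, n_calibration_batches: int) -> dict:
    """Average per-rank GPTQ Hessians across ranks, gated by GPTQ_ALL_REDUCE."""
    world_size = dist.get_world_size()
    if os.environ.get("GPTQ_ALL_REDUCE", "1") == "1":
        # Sorted iteration keeps the collective order identical on every rank,
        # avoiding deadlock if dict insertion order ever drifts across ranks.
        for name in sorted(hessians):
            dist.all_reduce(hessians[name], op=dist.ReduceOp.SUM)
        denom = n_calibration_batches * world_size
    else:
        denom = n_calibration_batches   # original per-rank semantics, for A/B
    for name in hessians:
        hessians[name] /= denom
    return hessians
```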
…lone openai#1851
Part 1: BOS-fixed SmearGate + per-head attn output gate ported onto the PR1493 wd_strong_paired baseline (15+/-6 lines in train_pr1493.py). 5 new env vars: SMEARGATE_{ENABLED,BOS_ID,INIT}, ATTN_GATE_{ENABLED,INIT}.
SmearGate is causal previous-token mixing with the BOS document-boundary mask from PR openai#1851: at positions where input_ids == bos_id, the smear contribution is forced to zero so the final token of doc N cannot leak into BOS of doc N+1 (see the sketch after this message). Verified by a focused unit test. Per-head attn_gate added inside CausalSelfAttention, applied to the flash_attn output before XSA. smeargate.smear_gate is a top-level GPT parameter so it gets explicitly appended to Optimizers.scalar_params (not picked up by the blocks-only loop). CONTROL_TENSOR_NAME_PATTERNS extended; 100% optimizer coverage verified.
Real-run results (single seed s42, 8xH100):
variant                       pre      q        q_sw     q_ttt    d_qttt
baseline (wd_strong_paired)   1.08573  1.09874  1.08194  1.07971  --
smear+attn_gate1d (sigmoid)   1.08663  1.09887  1.08220  1.08052  +0.00081
smearonly (gate off)          1.08601  1.09834  1.08170  1.07998  +0.00027
smear_gate2d (additive)       killed mid-train (~step 4000, val 1.1051)
The 1D per-head sigmoid gate (8 params/layer) is undercapacity vs upstream PR openai#1667's 96 params/layer, and is +0.00090 worse pre-quant -- a real regression in the trained model. SmearGate alone improves q (-0.00040) and q_sw (-0.00024) but disrupts our SGD TTT lift (0.0017 vs 0.0022 baseline); net q_ttt is within seed noise. The artifact stays >16 MB (added code costs ~7 KB; still busts the cap like baseline). Conclusion: the port is mechanically correct, it just doesn't help on the PR1493 base without the rest of the top stack (LQER, phased TTT, CaseOps).
Part 2: Critical leaderboard analysis. PR openai#1855 and PR openai#1851 are both verified-merged by maintainer cocohearts and listed on the README. PR openai#1855 has an OPEN val_docs=10_000 vs canonical 50_000 dispute (jfc43, 2026-04-30, unresolved) that affects the entire CaseOps chain (PRs 1736/1769/1787/1851/1855/1868). If the ruling lands against, all six fall and the PR1493 family returns to the top -- so building on PR1493 is a hedged investment. Real pre/q/q_ttt comparison vs openai#1855 seed 42 log: their pre=1.06396 vs ours 1.08573 (+0.022 BPB gap at the trained-model level), bigger than the total 0.020 gap. The leaderboard wedge is dominated by training-level wins (CaseOps + SparseAttnGate + 9-knob hparam stack), not LQER/phased-TTT.
Part 3: Pivot decision. Clone openai#1851's train_gpt.py (152 KB, 3,574 lines) as the new base rather than porting their 2,500+ lines into our 553-line file. openai#1851 picked over openai#1855 because: same q_ttt within noise (1.06128 vs 1.06108), no lrzip system dep, fewer disputes. Layer only our small PR1493 differentiators (paired-head Muon NS, wd_schedule, gptq_all_reduce). CaseOps shards already published at romeerp/parameter-golf-caseops-v1 (80 train + val + val_bytes sidecar + tokenizer); saves 1-2 hr CPU retokenization. Background download in progress at session end.
Plan for next session: reproduce openai#1851 unmodified at s42 (target q_ttt 1.06128 +/- 0.0005); if reproduced, layer paired-head Muon then wd_schedule one-at-a-time; if not reproduced, stop and debug.
Files added:
- pr1493_smeargate_to_top_stack_session.md: full session writeup
- _top_ref/: cached openai#1851 reference files (train_gpt.py, lossless_caps.py, prepare_caseops_data.py, README.md)
- run_smear_*.sh: smear experiment runners
- run_chain_smear_experiments.sh: chain runner
- run_mom97.sh: drafted but superseded
- logs/smear_*.txt + .stdout: full run logs
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
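A minimal sketch of the BOS-masked smear described in Part 1 above, assuming x is a (B, T, D) hidden-state tensor and smear_gate is a learned gate parameter; shapes and init are assumptions, not the PR's exact layout.

```python
import torch
import torch.nn.functional as F

def smear(x: torch.Tensor, input_ids: torch.Tensor,
          smear_gate: torch.Tensor, bos_id: int) -> torch.Tensor:
    # Shift the sequence right by one: token t receives token t-1's state,
    # position 0 receives zeros (no predecessor).
    prev = F.pad(x, (0, 0, 1, 0))[:, :-1]
    # BOS document-boundary mask: zero the smear contribution wherever the
    # current token is BOS, so doc N's last token cannot leak into doc N+1.
    keep = (input_ids != bos_id).unsqueeze(-1).to(x.dtype)
    return x + torch.sigmoid(smear_gate) * keep * prev
```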
Layers WD_SCHEDULE_ENABLED + low/high factors onto _top_ref/train_gpt.py (PR openai#1851 SmearGate BOS Fix base). Off by default; strict no-op when WD_SCHEDULE_ENABLED=0.
Skips the paired-head Muon NS port: PR openai#1851 uses parameter banks (qo_bank/kv_bank/mlp_*_bank stacked along dim 0) instead of per-layer c_q/c_k weights, so the _head_pair_ns tagging approach from train_pr1493.py does not apply without redesigning the per-bank NS path.
Surgical diff (5 hunks):
- 5 env-driven hyperparameters (WD_SCHEDULE_ENABLED, hold/ramp fracs, low/high factors)
- snapshot base_wd per group in Optimizers.__init__ after self.optimizers
- wd_mul(frac) helper next to lr_mul(frac), same hold/ramp shape as train_pr1493 (sketched below)
- step_fn signature gains wd_scale=1.0; applies group["weight_decay"] = base_wd * wd_scale
- caller passes wd_mul(frac)
Run with WD_SCHEDULE_ENABLED=1 WD_SCHED_LOW_FACTOR=0.5 WD_SCHED_HIGH_FACTOR=1.75 plus the standard PR1851 env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
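A hedged sketch of the wd_mul(frac) helper, assuming a hold-then-linear-ramp shape; the low/high defaults come from the default factors quoted below (0.65/1.5), while the hold/ramp fraction defaults and env var names for them are assumptions.

```python
import os

WD_LOW  = float(os.environ.get("WD_SCHED_LOW_FACTOR", "0.65"))
WD_HIGH = float(os.environ.get("WD_SCHED_HIGH_FACTOR", "1.5"))
WD_HOLD = float(os.environ.get("WD_SCHED_HOLD_FRAC", "0.5"))   # assumed default
WD_RAMP = float(os.environ.get("WD_SCHED_RAMP_FRAC", "0.4"))   # assumed default

def wd_mul(frac: float) -> float:
    """Multiplier on each group's snapshotted base_wd at training fraction frac:
    hold WD_LOW, then ramp linearly to WD_HIGH over WD_RAMP, then hold."""
    if frac <= WD_HOLD:
        return WD_LOW
    t = min((frac - WD_HOLD) / max(WD_RAMP, 1e-9), 1.0)
    return WD_LOW + t * (WD_HIGH - WD_LOW)

# inside step_fn(..., wd_scale=1.0), applied per optimizer group:
#   group["weight_decay"] = base_wd[group_id] * wd_scale
# and the caller passes wd_scale=wd_mul(step / num_steps).
```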
Single-seed s42 result for top_wd_strong (WD_SCHEDULE_ENABLED=1, low=0.5, high=1.75 layered onto the PR openai#1851 base): q_ttt = 1.06111. Compared to PR openai#1851's published s42 numbers (1.06128 original / 1.06083 re-run gptq8s), the delta is within half of the published 3-seed std (0.00068) — a no-op at single-seed resolution.
Stage decomposition shows the WD schedule slightly worsened pre (+0.00033 vs PR1855's pre 1.06396) and widened the LQER quant gap (+0.00116 vs PR1855), with phased-LoRA TTT recovering most of the q-stage damage. Sign-flipped from PR1493, where the same WD config gave -0.00037 pre.
Includes a critical inventory of every PR1493-stack technique cross-referenced against PR openai#1851's stack, ranking portability by pragmatic value:
1. GPTQ Hessian all-reduce: HIGH confidence, ~10-line port, expected -0.0005 to -0.0009 BPB. PR openai#1851's collect_hessians (lines 2037-2141) does NOT all-reduce across ranks — the same bug PR1493 had. With PR openai#1851's default gptq_calibration_batches=16, AR is in the regime where it helps (saturates at 128).
2. wd_schedule with default factors (low=0.65, high=1.5): env-var only, a defensive test of whether the WD-schedule mechanism carries at all.
3. Paired-head Muon NS port to the bank architecture: ~80-120 lines of careful porting around qo_bank/kv_bank reshape semantics. Bank-NS already does per-layer NS for free, so the marginal gain is expected to be smaller than PR1493's -0.00055.
Honest ceiling: even with all three layered, expected q_ttt ~1.05970 — clears PR openai#1855 by ~0.00140 BPB but does NOT clear the 0.0024-BPB acceptance bar (0.00140 < 0.0024). The best-case submission is a non-record entry at this stack without something architecture-level we don't have ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
PR openai#1851's collect_hessians (lines 2037-2150 of _top_ref/train_gpt.py) computes each rank's Hessian on its own data shard subset (ShuffledSequenceLoader splits files by rank) and divides only by n_calibration_batches — without all-reduce, only rank 0's Hessian is effectively used since only rank 0 writes the quantized blob. 7/8 of the calibration compute is wasted.
Fix: dist.all_reduce(SUM) each Hessian (sorted iteration to avoid deadlock if key order ever drifts), divide by n_calibration_batches * world_size. Smoking-gun log line "gptq:all-rank Hessian averaging across N ranks (denom=...)" when on, "gptq:per-rank Hessian (no all-reduce, denom=...)" when off. Gated by the GPTQ_ALL_REDUCE env var (default 1, the bugfix behavior). The off path preserves the original upstream semantics for a clean A/B if needed.
PR1493 evidence at gptq_calibration_batches=16 (PR openai#1851's default):
16-shard no-AR: q_ttt = 1.08060
16-shard AR   : q_ttt = 1.07977 (delta -0.00083)
At 128 calibration batches the AR delta saturates to noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
PR1493's paired-head Muon worked by tagging per-layer block.attn.c_q/c_k.weight
matrices with _head_pair_ns and reshaping the gradient (model_dim, model_dim)
to (num_pairs, pair_dim, model_dim) before NS. PR1851 has no per-layer
attention weights — qo_bank stacks Q/O along dim 0 as
(2*L, model_dim, model_dim) and kv_bank stacks K/V as
(2*L, kv_dim, model_dim). The bank is sharded across ranks via reduce-scatter
along dim 0, so a single rank may have a shard mixing Q (paired) and O
(unpaired) slices.
This commit adds per-slot dispatch inside Muon.step:
- Optimizers.__init__ tags qo_bank and kv_bank with _head_pair_bank_spec
{pair_count: L, num_pairs: H/2 or KV_H/2, pair_dim: head_dim*2} when
PAIRED_HEAD_MUON_ENABLED=1. Q-half (indices 0..L-1) gets paired NS;
O/V-half (indices L..2L-1) gets regular NS. Padding indices (>= 2L)
are skipped.
- Muon._build computes per-rank, per-bank paired_local / unpaired_local
slot indices in the shard. If paired_local is empty (e.g. ranks holding
only O/V slices) the bank falls back to the original batched NS path,
identical to upstream — strict no-op vs upstream when feature is off.
- New _ns_paired_dispatch helper (sketched after this message) does the reshape:
paired_view (n_paired, model_dim, model_dim)
-> reshape (n_paired * num_pairs, pair_dim, model_dim)
-> zeropower_via_newtonschulz5
-> reshape back
unpaired slices go through the standard zeropower_via_newtonschulz5.
Verified slot assignment for ws=8, L=11 (B=22, padded=24, shard_B=3):
rank 0,1,2: paired_local=[0,1,2] unpaired_local=[] (all Q/K)
rank 3: paired_local=[0,1] unpaired_local=[2] (Q[9,10] + O[11])
rank 4,5,6: paired_local=[] unpaired_local=[0,1,2] (all O/V)
rank 7: paired_local=[] unpaired_local=[0] (O[21] + padding)
Gated by PAIRED_HEAD_MUON_ENABLED (default 0). Off-path is a strict no-op:
no spec is attached, no key in meta, dispatch returns plain batched NS.
Marginal value expected to be smaller than PR1493's -0.00055 BPB because
PR1851's bank-NS already does per-layer NS implicitly via the leading
batch-dim trick on zeropower_via_newtonschulz5; paired-head is a refinement
on top, not the full per-layer structure being added for the first time.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
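A hedged sketch of the per-slot dispatch described above; `ns` stands in for zeropower_via_newtonschulz5 (assumed to accept a leading batch dimension, as the "leading batch-dim trick" implies), and the spec dict mirrors the _head_pair_bank_spec fields named in the commit.

```python
import torch

def ns_paired_dispatch(shard: torch.Tensor, paired_local: list[int],
                       unpaired_local: list[int], spec: dict, ns) -> torch.Tensor:
    """Apply paired-head NS to Q/K slots and plain batched NS to O/V slots
    within one (shard_B, model_dim, model_dim) bank shard."""
    out = shard.clone()
    if paired_local:
        paired = shard[paired_local]                      # (n_paired, d, d)
        n_paired, model_dim, _ = paired.shape
        # Split each slot's rows into head-pair groups before orthogonalizing:
        # model_dim = num_pairs * pair_dim, pair_dim = 2 * head_dim.
        v = paired.reshape(n_paired * spec["num_pairs"],
                           spec["pair_dim"], model_dim)
        out[paired_local] = ns(v).reshape(n_paired, model_dim, model_dim)
    if unpaired_local:
        out[unpaired_local] = ns(shard[unpaired_local])   # plain batched NS
    return out
```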
Run 1 (top_ar_s42): PR1851 + GPTQ_ALL_REDUCE=1, no WD scheduling. Single-
variable change vs PR1851 unmodified.
Partial results (q_ttt eval still in progress at commit time):
pre = 1.06623
q = 1.07548
q_gap (q-pre) = 0.00925
artifact = 15,956,401 B (under cap by 43.6 KB)
Within-pod A/B vs Run 0 (top_wd_strong_s42):
        Run 0 (no AR, wd_strong)   Run 1 (AR, no WD)   delta
pre     1.06429                    1.06623             +0.00194
q       1.07403                    1.07548             +0.00145
q_gap   0.00974                    0.00925             -0.00049
The Run 0 vs Run 1 contrast cleanly isolates two effects because they hit
different phases of training:
- AR fix runs only post-train (during collect_hessians) and cannot affect
pre-quant val_bpb. The +0.00194 pre-quant gap is therefore fully due to
WD scheduling.
- WD scheduling runs only during training and cannot affect Hessian
collection or LQER quantization directly. The -0.00049 narrowing of the
quant gap is therefore fully due to the AR fix.
Headline reversal: yesterday's top_wd_strong_session.md concluded wd_strong
was a no-op on PR1851. That conclusion compared Run 0's pre (1.06429) to
PR openai#1855's published pre (1.06396) and read a +0.00033 regression. That was
the wrong baseline — PR openai#1855 is a different stack (lrzip + 9 extra knobs)
and a different pod, so the cross-stack comparison was below noise floor.
The within-pod A/B (Run 0 vs Run 1, same pod, same seed, only one variable
changing) gives a clean signal: wd_strong improves PR1851 pre-quant by
+0.00194 BPB, ~3x the published seed-to-seed std of 0.00068.
Both signals are real:
- AR fix: -0.00049 BPB on quant gap (smaller than PR1493's -0.00083 because
PR1851's LQER asymmetric is more robust to Hessian sparsity than PR1493's
GPTQ-int6, but the direction and order-of-magnitude are right)
- wd_strong: -0.00194 BPB on pre-quant (real, above noise, opposite sign
to my earlier wrong cross-stack read)
Updated expected ceiling for AR + wd_strong + paired-head stack: q_ttt
~1.06020, beats PR openai#1855 mean by ~0.0008 BPB but still does not clear the
0.0024 BPB acceptance bar. Best plausible submission is a non-record entry,
not a record.
Run 2 (queued, auto-launches when Run 1 GPUs free): AR + WD default factors
(low=0.65, high=1.5) to test whether default factors carry the same pre-quant
value as strong factors. Run 3 follows: full AR + WD + paired-head stack,
factor choice TBD on Run 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Three runs in the AR-then-WD-then-paired-head sequence:
Run 1 (AR only):              q_ttt = 1.06266 (+0.00138 vs baseline)
Run 2 (AR + wd_default):      q_ttt = 1.06129 (+0.00001 vs baseline)
Run 3 (AR + wd_strong + PH):  q_ttt = 1.06136 (+0.00008 vs baseline)
Best (Run 0, wd_strong only): q_ttt = 1.06111 (-0.00017 vs baseline)
Three findings, ordered by impact:
1. AR + WD scheduling don't fully stack. AR narrows the LQER quant gap by ~0.0005 BPB when WD is off (Run 1 q_gap 0.00925 vs Run 0 q_gap 0.00974); when WD is on the gap reverts to ~0.00965 (Runs 2 and 3). WD widens the quant gap (+0.00040 BPB) and AR narrows it (-0.00049 BPB) — they roughly cancel. The net AR contribution at q_ttt with WD on collapses from ~-0.0014 BPB (standalone) to ~+0.0001 BPB (stacked).
2. Paired-head Muon NS is a no-op on PR1851's bank architecture. Striking mid-train signal (Run 3 at step 4000: val_bpb 1.0969 vs Run 0's 1.1015, -0.0046 BPB) but the EMA + WD-spike + LQER pipeline converged the trajectory back. Pre-quant lands at 1.06467 (+0.00038 worse than Run 0's 1.06429). Mechanism: PR1851's bank-NS already does per-layer NS via the leading-batch-dim trick on zeropower_via_newtonschulz5. Adding explicit head-pair structure on top is a refinement the EMA+LQER pipeline can't preserve. ~110 fewer training steps due to per-slot dispatch overhead. The "engine of the PR1493 win" was not what we thought it was.
3. wd_strong alone (Run 0) is the only configuration that beat the PR1851 baseline, and by less than 1/4 of the published 3-seed std. Single-seed, below the noise floor. None of the four configs clear the 0.005-nat / 0.0024-BPB acceptance bar.
Audit of remaining PR1493-stack candidates: nothing is left that can plausibly bridge a 0.0024 BPB gap. The techniques that worked on PR1493 (wd, paired-head, AR) are now all accounted for; the techniques that didn't work on PR1493 won't work on PR1851 either (mtp/iha/qat had implementation bugs; pko/doc_shuffle are pipeline-incompatible). Non-PR1493 paths (lrzip compressor port from PR openai#1855, architectural changes, tokenizer work) exist but either have unresolved disputes (lrzip rule-3) or aren't even partly built on this branch.
Recommendation: 3-seed validation of Run 0 + submit as a non-record entry with full analysis of the AR-WD interaction and the bank-NS paired-head findings. Negative results are explicitly accepted per the README.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…1855.py The previous session's choice of PR openai#1851 over PR openai#1855 was a mistake we inherited. PR openai#1855 is currently openai#1 on the upstream leaderboard at 1.06108 (3-seed mean), 0.00037 BPB ahead of PR openai#1851's 1.06145. PR openai#1855 also ships the per-group lrzip+brotli compressor (COMPRESSOR=pergroup, ~280 KB smaller artifact than brotli) that PR openai#1851 lacks. Without that compressor, even the 9-hparam stack on the PR openai#1851 base busts the 16 MB cap (Run 4 artifact = 16,140,607 B, +140 KB over).
train_top_1855.py = PR openai#1855's train_gpt.py + the same surgical patches we applied to train_top.py: wd_schedule (5 hparams + base_wd snapshot + wd_mul + step_fn injection + caller) and GPTQ_ALL_REDUCE=1 in collect_hessians. 41 line additions, 3 line modifications, syntax OK.
Run 4 evidence (PR openai#1851 + 9 hparams + wd_strong + AR, single seed s42):
pre      = 1.06331 (vs Run 0's 1.06429 — best pre of session)
q        = 1.07239 (q_gap 0.00908 — tightest gap of session)
artifact = 16,140,607 B (busts cap with brotli; pergroup needed)
lrzip 0.651 installed via add-apt-repository universe.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Run 4 results, single seed s42:
pre      = 1.06331 (best pre of session, beats Run 0's 1.06429 by 0.00098)
q        = 1.07239 (q_gap 0.00908 — tightest gap of session)
q_ttt    = 1.05950 (best q_ttt of session, beats PR openai#1855's published s42 1.05989 by 0.00039)
artifact = 16,140,607 B (BUSTS the 16 MB cap by 140,607 B with brotli; PR openai#1855's pergroup compressor saves ~280 KB, which is needed for this hparam stack to fit)
Three findings:
1. The 9 hparams transfer cleanly through to final EMA model quality. Contrast with paired-head Muon NS (Run 3): it also gave a striking mid-train signal (-0.0046 at step 4000) but that gain converged out by pre-quant time (+0.00038 vs Run 0). Run 4's mid-train gain (-0.0059) carried through to pre-quant (-0.00098). Mechanism: the 9 hparams change *what's actually being trained* (tighter clipping preserves outliers, longer warmdown reshapes convergence, tuned TTT-LoRA reshapes recovery), not just the optimizer's update direction.
2. Tightest quant gap of the session (0.00908). Tighter MLP/EMBED clipping (11.5/14.0) preserves outliers that LQER's asymmetric int4 rank-4 correction can exploit, on top of AR's narrowing.
3. The artifact busts the cap with brotli alone — this confirms PR openai#1855's claim that their pergroup compressor saves ~280 KB on this stack. With brotli, even PR openai#1855 itself would land ~16,180,000 B. They needed pergroup; we need pergroup.
This run made the case to pivot to the PR openai#1855 base for Run 5. The earlier session's choice of PR openai#1851 (yesterday's "no lrzip dispute" reasoning) is overturned by Run 4's evidence: PR openai#1855 is 0.00037 BPB ahead at 3-seed mean, ships the pergroup compressor we need to fit the cap, and the 9 hparams we manually applied transfer cleanly.
Run 5 (queued, auto-launch when Run 4 GPUs free) = PR openai#1855's full env stack + our wd_strong + AR + COMPRESSOR=pergroup. Expected q_ttt ~1.0590-1.0595 single-seed; 3-seed mean ~1.0593 ± 0.001.
Honest acceptance-bar math:
SOTA = 1.06108 (PR openai#1855 3-seed mean)
Bar  = SOTA - 0.005 nats ≈ 1.0588
Run 4 single    = 1.05950, +0.00070 short of bar
Run 5 predicted = 1.0590-1.0595, still 0.0002-0.0007 short
Even best-case Run 5 likely just misses the record bar by ~half a sigma. The best plausible outcome is a non-record submission with documented findings.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Adds COMPRESSOR=pergroup support to train_top.py (Run 4's graph) so the
9-hparam stack can produce a valid-size artifact under the 16 MB cap. The
PR1851-derived train_top.py previously only shipped brotli/lzma, which
busted the cap by 140,607 B in Run 4 even after the rest of the stack
landed.
The port is surgical (routing sketched after this message): serialize/deserialize gain a `compressor == "pergroup"`
branch that calls the new `_serialize_pergroup` / `_deserialize_pergroup`
helpers and the deserialize side detects the `_PACK_MAGIC` (b"PGRP") prefix
to route blobs that were saved with pergroup. The byte-shuffle wrapper that
train_top.py applies to brotli/lzma blobs is intentionally bypassed by
pergroup, matching how train_top_1855.py's pergroup path already structures
its own per-group compression.
The helpers themselves (_GROUP_ORDER, _SIMSORT_KEYS, _PACK_MAGIC,
_similarity_sort_l1, _lrzip_compress / _lrzip_decompress, _pack_streams /
_unpack_streams, _serialize_pergroup, _deserialize_pergroup) are copied
verbatim from train_top_1855.py with a small numpy-import cleanup (we use
the module-level np instead of re-importing inside each helper).
Verified: a synthetic 138-tensor roundtrip exercising the full
{tok_emb, qkv banks, mlp banks} + LQER remainder shape matches expected
naming conventions yields 138/138 bit-exact tensor recovery and a blob
with the expected b"PGRP" prefix.
Required system dep: lrzip 0.651 (apt-get install lrzip; universe).
Required Python deps: brotli, python-minifier (for code wrapper).
This is the code change only; the actual Run 4 + pergroup retrain is
separate.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
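A hedged sketch of the routing only: `_serialize_pergroup` / `_deserialize_pergroup` are the verbatim-ported helpers named above (defined elsewhere in train_top.py, not reproduced here), and `_serialize_shuffled` / `_deserialize_shuffled` are hypothetical stand-ins for the existing byte-shuffled brotli/lzma path.

```python
_PACK_MAGIC = b"PGRP"

def serialize(state: dict, compressor: str = "brotli") -> bytes:
    if compressor == "pergroup":
        # pergroup manages its own per-group lrzip+brotli streams and
        # intentionally bypasses the byte-shuffle wrapper used for brotli/lzma.
        return _serialize_pergroup(state)
    return _serialize_shuffled(state, compressor)   # existing brotli/lzma path

def deserialize(blob: bytes) -> dict:
    if blob[:4] == _PACK_MAGIC:          # b"PGRP" prefix routes blobs that
        return _deserialize_pergroup(blob)  # were saved with pergroup
    return _deserialize_shuffled(blob)
```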
…_ttt Run 6 is the pergroup-recovery run from top_run4_pergroup_recovery_runbook.md: keep Run 4's training graph (train_top.py, PR openai#1851 base) and Run 4's hparam stack (9 PR openai#1855 overrides + wd_strong + GPTQ AR), and replace the cap-busting brotli serialization with PR openai#1855's pergroup compressor that we ported in commit 0209a50.
Result, single seed s42:
pre   = 1.06335 (Run 4 was 1.06331; +0.00005)
q     = 1.07246 (Run 4 was 1.07239; +0.00008)
q_ttt = 1.05957 (Run 4 was 1.05950; +0.00006)
total = 15,901,624 B (Run 4 was 16,140,607 B brotli, INVALID +140,607 B)
        (UNDER the 16,000,000 B cap by 98,376 B — VALID)
Pergroup saves 240,863 B on the model blob and 238,983 B on the total vs brotli on this exact stack. That matches PR openai#1855 README's published "~280 KB savings" claim within tolerance — different runs have different quantized weight distributions, so brotli/pergroup deltas aren't exactly transportable, but the order of magnitude lines up.
Quality drift between Run 4 and Run 6 is <=0.00008 BPB across pre/q/q_ttt, which is below typical pod-to-pod nondeterminism (Run 4 vs PR openai#1855's published s42 differed by 0.00039 even on the "same" stack). The compressor swap is functionally a no-op on quality.
Comparison summary:
Run 6 vs Run 4 (best, but invalid):          +0.00006 BPB worse, but VALID
Run 6 vs Run 5 (PR openai#1855 base recovery):      -0.00053 BPB BETTER, same compressor
Run 6 vs PR openai#1855 published s42:              -0.00033 BPB better, +4365 B
Run 6 vs PR openai#1855 3-seed mean SOTA:           -0.00152 BPB better (~1.7 sigma)
Run 6 vs acceptance bar (~1.0588):           +0.00077 BPB SHORT
So Run 6 is the strongest single-seed valid-size submission of the session. Not yet a record (single-seed, ~half a sigma short of the acceptance bar) but a strong non-record submission with a documented win:
- Validates the ported pergroup compressor end-to-end (synthetic 138-tensor roundtrip preflight + live deserialize during phased TTT eval).
- Confirms the runbook's hypothesis that "preserve Run 4 graph + only swap compressor" beats "preserve compressor + retrain on PR openai#1855 base + apply our patches" (the Run 5 path).
- Reproduces Run 4's quality bit-equivalent within pod noise.
Pod prep this session:
- apt-get install -y lrzip (lrzip 0.651, required by pergroup)
- pip install brotli python-minifier
- snapshot_download romeerp/parameter-golf-caseops-v1 (16 GB) for the canonical sp8192-caseops shards + the canonical fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model SP model. Layout matches train_top.py's _default_caseops_data path exactly.
Files in this commit:
- top_run6_pergroup_recovery_session.md (full Run 6 report)
- upload_run6_to_hf.py (pushes artifacts to HF)
- logs/top_pr1855_hparams_s42_pergroup.stdout (torchrun stdout/stderr)
- logs/top_pr1855_hparams_s42_pergroup.txt (per-rank training log)
Artifacts pushed to HuggingFace (shikhar007/parameter-golf-gram-ns):
- models/top_pr1855_hparams_s42_pergroup.pt (135.4 MB FP ckpt)
- models/top_pr1855_hparams_s42_pergroup.int6.ptz (15.9 MB pergroup blob)
- logs/top_pr1855_hparams_s42_pergroup.txt
- logs/top_pr1855_hparams_s42_pergroup.stdout
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
val_bpb: 1.1117 (3-seed mean, std 0.0008) | ~16.1 MB | 8×H100 SXM, 600s | No TTT
Built on the PR #1019 stack. Key additions: seq4096 training, bank-weight QAT (fake int6 on all 26M F.linear params), QK-Gain 2.5, and partial key offset (PKO) from the NanoGPT speedrun.
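A minimal sketch of what "fake int6" QAT over the bank weights could look like, assuming per-tensor symmetric scales and a straight-through estimator; the record's actual scale granularity and clipping are not specified here.

```python
import torch
import torch.nn.functional as F

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize to symmetric int6 in the forward pass; gradients pass
    straight through to the fp weights (STE)."""
    qmax = 31.0                                    # symmetric int6: [-31, 31]
    scale = w.detach().abs().amax().clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (wq - w).detach()                   # forward wq, grad of w

# hypothetical usage inside a bank forward:
#   y = F.linear(x, fake_quant_int6(self.mlp_bank[i]))
```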
Results (8×H100 80GB SXM, 600s, no TTT)
Delta vs SOTA PR #1019 (1.1147): -0.0030 BPB.
Key Techniques
Negative Results
Reproduction
Test Plan