[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759 #1695

Open

X-Abhishek-X wants to merge 5 commits into openai:main from
Conversation
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 17, 2026
The original round26 worker setup reconstructed PR openai#1695 from diff-only additions, which dropped unchanged lines and produced an invalid train_gpt.py. This commit replaces that broken surface with the exact PR-head file content so the reproduction lane tests the real Stage 3 + SpinQuant + MP-SGD-TTT family.

Constraint: Must preserve the exact public PR surface rather than hand-editing or simplifying it.
Rejected: Keep the diff-reconstructed file and debug it manually | that would test a synthetic surface, not the actual PR
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: PR-surface extraction must use full file content from the PR head, not diff-only + lines, whenever the file is modified rather than newly added
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU execution on Heimdall
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 17, 2026
…#1695 surface

The public openai#1695 run command relies on several env-var overrides that materially change the surface: SpinQuant on, phased TTT on, matrix LR 0.026, warmdown 0.75, embed_bits 7, embed_clip 20, chunk size 48, and the higher LoRA layer alpha. This branch bakes those settings into defaults so the reproduction lane can test the claimed surface rather than the inert default one.

Constraint: Must preserve the public PR surface and only move claimed run-command settings into code defaults.
Rejected: Reproduce with env vars only | the current evaluator path does not forward arbitrary env vars to remote jobs
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future public frontier PR must pass a claimed-surface/default-surface comparison before it is treated as a serious candidate family
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
First lever layered on the new openai#1736 baseline: Hadamard rotation of weight matrices before GPTQ quantization, hotstarted off spec 008's pre_gptq.pt FP checkpoint. No retraining. Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a openai#1529-adjacent base; expected to compose cleanly with openai#1736 since the quant stage is orthogonal to CaseOps / attention gates / phased TTT.

Rotation is a post-training transform with three classes (residual-stream, per-layer attn, per-layer MLP); the FP forward pass is invariant by construction, only quantization error drops. Cost ~$6 (hotstart off the spec 008 checkpoint), vs ~$30 for a full retrain. The same hotstart checkpoint is reused by future quant experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
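The "FP forward pass is invariant by construction" claim can be checked in a few lines. A minimal numpy sketch, assuming the usual randomized-Hadamard construction (Sylvester matrix scaled to orthonormal, composed with a random per-channel sign flip); the real pipeline operates on torch checkpoints, so all names here are illustrative:

```python
import numpy as np

def random_hadamard(n, seed=0):
    """Randomized Hadamard rotation: Sylvester H_n (n a power of 2),
    scaled to orthonormal, times a random sign per channel."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    return (H / np.sqrt(n)) * signs  # columns remain orthonormal

n = 8
R = random_hadamard(n)
assert np.allclose(R @ R.T, np.eye(n))  # orthogonal by construction

# Rotating weights as W @ R leaves the float forward pass unchanged,
# provided the input is rotated too: (x @ R) @ (W @ R).T == x @ W.T
rng = np.random.default_rng(1)
x = rng.standard_normal((4, n))
W = rng.standard_normal((16, n))
W_rot = W @ R
assert np.allclose((x @ R) @ W_rot.T, x @ W.T)
```

The random sign diagonal is what makes the rotation "random": the fixed Sylvester matrix alone would rotate every checkpoint into the same basis.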
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step doesn't apply; RMSNorm is rotation-equivariant directly.
Bad: openai#1736 has five OTHER per-channel multipliers on residual flow (attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These are the real fold targets, not RMSNorm. resid_mix is pre-norm and cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and a resid_mix freeze-to-mean compromise)
- port_1695 (conditional on the openai#1695 diff being meaningfully different)

All three run back-to-back on one pod, hotstarted off spec 008's final_model.pt: ~30 min total GPU, ~$17-22 budget, one eval. research/ideas/spinquant-integration-notes.md captures the full design analysis (per-multiplier fold feasibility, three-option tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
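The "RMSNorm is rotation-equivariant directly" observation rests on rotations preserving the per-row L2 norm, while per-channel multipliers do not commute with a rotation. A small numpy sketch of both facts; the gamma-free RMSNorm matches the description above, and `random_orthogonal` is a stand-in for the actual Hadamard R:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # gamma-free RMSNorm, as described for the referenced train_gpt.py
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def random_orthogonal(n, seed=0):
    # any orthogonal matrix works for the equivariance argument
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((n, n)))
    return q

n = 16
R = random_orthogonal(n)
x = np.random.default_rng(1).standard_normal((4, n))

# Rotation preserves each row's L2 norm, so gamma-free RMSNorm commutes with R:
assert np.allclose(rmsnorm(x @ R), rmsnorm(x) @ R)

# A per-channel multiplier (like attn_scale or mlp_scale) does NOT commute,
# which is why those multipliers, not RMSNorm, are the real fold targets:
d = np.random.default_rng(2).standard_normal(n)
assert not np.allclose((x * d) @ R, (x @ R) * d)
```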
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Removed the conditional gate on port_1695. All three run back-to-back: internal_only, full, port_1695. Cheap enough (~$22 total), and having three data points is worth the extra ~$5 even if openai#1695 turns out to match Option A.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from the static-weight-rotation + folds design I had in mind for 'full' mode. They do ONLINE activation rotation: 4 global Hadamard rotations inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv, attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in the rotated basis; rotated Hessians keep the quant-side accounting honest. Rotations OFF during training, ON after deserialize for eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input and never touch the residual stream. All per-channel multipliers (attn_scale, mlp_scale, resid_mix, skip_weights) operate in an unchanged basis.

No float invariance — the model IS different post-rotation. The bet is that rotated-basis GPTQ delivers lower quant error and that the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in favor of a future 'port_1695' spec that ports their online scheme. internal_only mode from spec 009 remains useful as an independent data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
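A toy illustration of the online-rotation bet. In exact arithmetic the per-linear-input rotation cancels, so all of the behavioral difference comes from quantizing in the rotated basis. This sketch uses a crude per-tensor symmetric quantizer as a stand-in for GPTQ (the real pipeline uses rotated Hessians), and `R_mlp_in`, the dimensions, and the planted outlier are invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def hadamard(n):
    # Sylvester Hadamard matrix scaled to orthonormal (n a power of 2)
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_sym(W, bits=6):
    # crude symmetric per-tensor quantizer, stand-in for GPTQ
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

n_embd, n_hidden = 16, 64
R_mlp_in = hadamard(n_embd)            # one of the 4 global rotations
W_fc = rng.standard_normal((n_hidden, n_embd))
W_fc[0, 0] = 25.0                      # plant an outlier channel

x = rng.standard_normal((4, n_embd))

# Online rotation site: x @ R right before the linear, weight stored as W @ R.
# In exact arithmetic the rotation cancels:
assert np.allclose((x @ R_mlp_in) @ (W_fc @ R_mlp_in).T, x @ W_fc.T)

# Quantization happens in the rotated basis, where the outlier is spread out
# across channels, shrinking the quantization step:
err_plain = np.abs(quantize_sym(W_fc) - W_fc).mean()
err_rot = np.abs(quantize_sym(W_fc @ R_mlp_in) - W_fc @ R_mlp_in).mean()
print(err_plain, err_rot)  # rotated-basis error is typically lower here
```

In bf16 the cancellation is only approximate, which is consistent with the per-token roundoff perturbation discussed later in this thread.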
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
User flagged that port_1695 should be the next spec (higher-impact, natural follow-up to 009) rather than tapered WD. Reshuffled:
- 010-port-1695-online-rotation.md (NEW) — port openai#1695's online Hadamard rotation with rotated-basis GPTQ. Hotstart off spec 008 pre_gptq.pt. Expected delta -0.003 to -0.005 bpb vs the spec 009 baseline. ~$10, 8xH100.
- 011-tapered-wd.md (renumbered from 010) — Muon WD taper from openai#1729. Full retrain, ~$20. Independent of specs 009/010, can run in parallel.

Spec 010 inherits the design analysis from research/ideas/spinquant-integration-notes.md (addendum section). Depends on the spec 009 baseline measurement for an apples-to-apples delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
…n sprint

Session-narrative entry covering today's work:
- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810) to unmerged openai#1736 (1.0655), rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within +0.00016 at pre-quant; the post-TTT gate number was not captured due to a watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes -> unified sweep -> +baseline mode -> cut to 2 modes after discovering real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not static weight rotation. Sidesteps both LeakyReLU and resid_mix. Reframes 'full' mode -> port_1695 mode as the next quant-side spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design only) drafted. Only spec 009 is truly runnable right now.

Closes with a state-of-play table, modal plan, lessons learned, and open questions for the next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Implements the port_1695 SpinQuant variant from PR openai#1695 on the openai#1736 stack. All changes are env-var-gated (SPINQUANT_ENABLED=0 default) so spec 008 and spec 009's baseline/internal_only modes are unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention, MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True after loading rotated weights.
- MLP.forward: disable the fused kernel when SpinQuant is active.
- LoRA (TTT path) uses the unrotated n; the base path uses the rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot.T = (x @ R) @ (W @ R).T = x @ R @ R.T @ W.T = x @ W.T. The pre-quant forward is bit-identical to the unrotated model; GPTQ sees a rotated basis where outliers are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py. Not tested on GPU — flash_attn_3 is not available on the dev box. Syntax clean. The first pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
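The serialize-time math above can be sanity-checked numerically: rotating weights as W @ R and Hessians as R.T @ H @ R keeps GPTQ's error accounting unchanged, and the post-baking SGD invariant used by MP-SGD-TTT holds for any current value of the rotated weights. A numpy sketch under the assumption that the Hessian is the usual input-dim Gram matrix of calibration activations; shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# any orthogonal R; the PR derives its R from a seeded Hadamard construction
R, _ = np.linalg.qr(rng.standard_normal((n, n)))

X = rng.standard_normal((32, n))        # calibration activations
H = X.T @ X                             # GPTQ Hessian (input-dim Gram matrix)
W = rng.standard_normal((4, n))

# serialize(): rotate weights and Hessian together
W_rot = W @ R
H_rot = R.T @ H @ R
# rotated activations produce exactly the rotated Hessian
assert np.allclose(H_rot, (X @ R).T @ (X @ R))

# GPTQ's quadratic error tr(dW H dW^T) is invariant in the rotated basis:
dW = 0.01 * rng.standard_normal(W.shape)
dW_rot = dW @ R
assert np.allclose(np.trace(dW_rot @ H_rot @ dW_rot.T),
                   np.trace(dW @ H @ dW.T))

# MP-SGD after baking: an SGD step on W_rot stays consistent because
# (x @ R) @ W_rot^T == x @ (W_rot @ R^T)^T for whatever W_rot has become.
g = rng.standard_normal(W_rot.shape)    # stand-in gradient
W_rot_next = W_rot - 0.001 * g
x = rng.standard_normal((4, n))
assert np.allclose((x @ R) @ W_rot_next.T, x @ (W_rot_next @ R.T).T)
```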
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Captures the central empirical finding from spec 010's per-batch analysis: Hadamard rotation (as in openai#1695's SpinQuant V1) has a regime-dependent effect on prediction quality.
- Docs > 1000 tokens: rotation improves val_bpb by -0.007 bpb.
- Docs < 300 tokens: rotation hurts val_bpb by +0.015 bpb.
- Crossover ~500 tokens; the effect exists at both pre-TTT and post-TTT.
- The float forward is the identity under orthogonal rotation; only the quantized model differs.

Working hypothesis: long contexts aggregate quant error toward the mean, which rotation lowers. Short contexts depend on per-token variance, which rotation raises (small mixing perturbation + bf16 roundoff per token, un-averaged).

Exploitation is the open problem — the 16MB cap blocks the most obvious routes (ship two models, toggle per doc). Feasible paths:
- Site-selective ablation (spec 010b, active, $25)
- Layer-selective ablation (possible 010c)
- Seed sweep (cheap, $5/seed)
- Rotation-aware retraining (spec 011+ territory)

Worth keeping as a live idea because it's an unusual result: a null aggregate with a strong structural decomposition that we can actually measure and potentially exploit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Add records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/ with:
- README.md with results table and technique description
- submission.json with compliance block, per-seed results, track field
- train_gpt.py
- train_seed42.log, train_seed1337.log, train_seed2024.log

All files were previously at the repo root (incorrect format). The proper folder structure is required by the competition submission guidelines.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
All submission files now live exclusively in: records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/ Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
This file should not be modified by record submissions. Our submission lives exclusively in records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/ Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
## Results

val_bpb: 1.07590 (3-seed mean, std 0.00019) | ~15.0 MB | 8×H100 80GB SXM | 600s | Legal TTT | Beats PR #1529 (1.08100) by 0.00510

## Key Changes vs PR #1445

### SpinQuant V1 — Banked Architecture Port

SpinQuant pre-rotates weight matrices with a random Hadamard matrix `R` before INT6 GPTQ. This spreads weight outliers uniformly, reducing quantization error without changing float model predictions. Porting to Stage 3's banked layout (`qo_bank`, `kv_bank`, `mlp_up_bank`, `mlp_down_bank`) required per-slot rotation at bake time. `R` is stored as a non-parameter buffer — no optimizer ever touches it. Quantization penalty (pre → post-quant): consistently +0.012–0.013 BPB across all seeds.
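The per-slot rotation at bake time can be sketched as a batched matmul over the bank dimension. A minimal numpy illustration, assuming a bank tensor shaped (slots, out, in) as the names above suggest; the tensor contents and QR-based `R` are stand-ins for the real checkpoint and Hadamard buffer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_slots, d_out, d_in = 3, 32, 16
R, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))  # stand-in for the Hadamard R

# hypothetical banked weight tensor, e.g. mlp_up_bank: (slots, out, in)
mlp_up_bank = rng.standard_normal((n_slots, d_out, d_in))

# per-slot rotation at bake time: every slot's weight becomes W @ R;
# numpy matmul broadcasts R across the leading bank dimension
mlp_up_bank_rot = mlp_up_bank @ R

x = rng.standard_normal((4, d_in))
for s in range(n_slots):
    # float predictions per slot are unchanged, provided the input is rotated
    assert np.allclose((x @ R) @ mlp_up_bank_rot[s].T, x @ mlp_up_bank[s].T)
```

Because `R` enters only as a fixed buffer, the same single matrix serves every slot; no per-slot state needs to be stored or optimized.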
### SpinQuant × MP-SGD-TTT Composition

MP-SGD-TTT (PR #1626) runs SGD on the base model weights between LoRA TTT phases. After SpinQuant baking, the weights live in rotated space (`W @ R`). SGD updates `W_rot` directly; since `R` is a fixed buffer, the invariant `F.linear(x @ R, W_rot') == F.linear(x, W_rot' @ R.T)` holds across all steps. No special interaction terms are required. TTT config: `prefix_docs=2000`, `num_phases=3`, `lr=0.001`, `momentum=0.9`.

### Training Configuration
- `ITERATIONS=20000` (wallclock truncates at ~4500 steps, ~98 ms/step)
- `MATRIX_LR=0.026`, `WARMDOWN_FRAC=0.75`
- `MLP_CLIP_SIGMAS=12.0`, `ATTN_CLIP_SIGMAS=13.0`, `EMBED_CLIP_SIGMAS=20.0`
- `EMBED_BITS=7`, `TTT_CHUNK_SIZE=48`
- `TTT_LORA_LAYER_LR_ALPHA=0.5`, `LORA_PLUS_RATIO=1.0`

### Quantization

INT6 GPTQ with `percdamp=0.01`, 64 calibration batches. `.ptz` artifact: 15,698,706 bytes.

### Reproduction

### Credits