
[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759 #1695

Open

X-Abhishek-X wants to merge 5 commits into openai:main from X-Abhishek-X:spinquant-mpsgd-ttt-stage3

Conversation


X-Abhishek-X commented Apr 17, 2026

val_bpb: 1.07590 (3-seed mean, std 0.00019) | ~15.0 MB | 8×H100 80GB SXM | 600s | Legal TTT | Beats PR #1529 (1.08100) by 0.00510


## Results

| Seed | Pre-quant BPB | Post-quant BPB | TTT BPB | Artifact Size |
|------|---------------|----------------|---------|---------------|
| 42   | 1.07276 | 1.08544 | 1.07591 | 15,698,706 B |
| 1337 | 1.07306 | 1.08544 | 1.07609 | 15,698,706 B |
| 2024 | 1.07273 | 1.08531 | 1.07570 | 15,698,706 B |
| **Mean** | | | **1.07590** | |
| **Std** | | | **0.00019** | |

## Key Changes vs PR #1445

| Component | PR #1445 | This PR | Source |
|-----------|----------|---------|--------|
| Weight rotation | None | Hadamard SpinQuant V1 | Meta AI 2024 |
| TTT algorithm | LoRA only | MP-SGD-TTT (phased) | PR #1626 |
| Quantization error | Baseline | Reduced (outliers suppressed) | SpinQuant |

## SpinQuant V1 — Banked Architecture Port

SpinQuant pre-rotates weight matrices with a random Hadamard matrix R before INT6 GPTQ. This spreads weight outliers uniformly, reducing quantization error without changing float model predictions.
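
For illustration, a minimal sketch of the rotation bake (helper names and construction details here are assumptions, not the PR's actual code):

```python
# Hedged sketch of a SpinQuant-style Hadamard pre-rotation; names and
# the exact construction are assumptions, not the PR's actual helpers.
import torch

def random_hadamard(n: int, seed: int) -> torch.Tensor:
    """Sylvester-construction Hadamard with random column sign flips.
    Assumes n is a power of two; the result R is orthogonal (R @ R.T == I)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    g = torch.Generator().manual_seed(seed)
    signs = (torch.randint(0, 2, (n,), generator=g) * 2 - 1).float()
    return (H * signs) / (n ** 0.5)

@torch.no_grad()
def bake_rotation(W: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    # Float predictions are unchanged because (x @ R) @ (W @ R).T == x @ W.T;
    # only the basis GPTQ quantizes in changes, spreading outliers evenly.
    return W @ R
```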

Porting to Stage 3's banked layout (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank) required per-slot rotation at bake time. R is stored as a non-parameter buffer — no optimizer ever touches it.
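
A sketch of how the per-slot bake and the non-parameter buffer could fit together (bank names follow the PR; the shapes and module layout are assumptions):

```python
# Hedged sketch: R as a non-parameter buffer plus per-slot rotation at
# bake time. Shapes and the module structure are assumptions.
import torch

class RotatedBank(torch.nn.Module):
    def __init__(self, bank: torch.Tensor, R: torch.Tensor):
        super().__init__()
        self.bank = torch.nn.Parameter(bank)  # (num_slots, out_features, in_features)
        self.register_buffer("R", R)          # fixed buffer: no optimizer state

    @torch.no_grad()
    def bake(self):
        # Rotate every slot independently: W_slot <- W_slot @ R
        self.bank.copy_(self.bank @ self.R)
```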

Quantization penalty (pre → post-quant): consistently +0.012–0.013 BPB across all seeds.


## SpinQuant × MP-SGD-TTT Composition

MP-SGD-TTT (PR #1626) runs SGD on base model weights between LoRA TTT phases. After SpinQuant baking, weights live in rotated space (W @ R). SGD updates W_rot directly; since R is a fixed buffer, the invariant F.linear(x @ R, W_rot') == F.linear(x, W_rot' @ R.T) holds across all steps. No special interaction terms required.
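
A quick numerical check of that invariant (a standalone sketch, not PR code):

```python
# Verifies the composition invariant from the text: with a fixed
# orthogonal R, rotating activations equals folding R into the weight.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
x = torch.randn(8, d)
W_rot = torch.randn(d, d)                  # weight already living in rotated space
R, _ = torch.linalg.qr(torch.randn(d, d))  # any orthogonal R stands in for Hadamard

lhs = F.linear(x @ R, W_rot)               # rotate the activations
rhs = F.linear(x, W_rot @ R.T)             # fold the rotation into the weight
assert torch.allclose(lhs, rhs, atol=1e-5)
```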

TTT config: prefix_docs=2000, num_phases=3, lr=0.001, momentum=0.9.
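
Structurally, the phased loop might look like the sketch below; the helpers and data plumbing are placeholders inferred from the config, not the PR's implementation:

```python
# Hedged sketch of the MP-SGD-TTT phase structure: momentum SGD on the
# base weights interleaved with LoRA-only TTT phases. `lora_phase` and
# the chunk iterable are hypothetical stand-ins.
import torch

def phased_ttt(model, chunks, lora_phase, num_phases=3, lr=0.001, momentum=0.9):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for _ in range(num_phases):
        lora_phase(model, chunks)      # LoRA adaptation step (caller-supplied)
        for x, y in chunks:            # SGD on base weights between phases
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            opt.zero_grad()
```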


## Training Configuration

  • ITERATIONS=20000 (the 600 s wall clock truncates training at ~4500 steps, ~98 ms/step)
  • MATRIX_LR=0.026, WARMDOWN_FRAC=0.75
  • MLP_CLIP_SIGMAS=12.0, ATTN_CLIP_SIGMAS=13.0, EMBED_CLIP_SIGMAS=20.0
  • EMBED_BITS=7, TTT_CHUNK_SIZE=48
  • TTT_LORA_LAYER_LR_ALPHA=0.5, LORA_PLUS_RATIO=1.0

## Quantization

  • INT6 GPTQ, percdamp=0.01, 64 calibration batches
  • SpinQuant rotation applied before GPTQ, baked into weights
  • Artifact: brotli-compressed .ptz, 15,698,706 bytes
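
For reference, a brotli-compressed .ptz artifact of this kind could be written roughly as follows (a sketch only; the PR's actual serialize() path is not shown here):

```python
# Hedged sketch of producing a brotli-compressed .ptz artifact.
# The real serializer differs; this only illustrates the format.
import io
import brotli  # pip install brotli
import torch

def save_ptz(state_dict, path):
    buf = io.BytesIO()
    torch.save(state_dict, buf)          # serialize to an in-memory buffer
    data = brotli.compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(data)
```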

## Reproduction

```bash
for SEED in 42 1337 2024; do
  SEED=${SEED} \
  SPINQUANT_ENABLED=1 SPINQUANT_SEED=20260416 \
  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
  GLOBAL_TTT_LR=0.001 GLOBAL_TTT_MOMENTUM=0.9 GLOBAL_TTT_CHUNK_TOKENS=32768 \
  GLOBAL_TTT_BATCH_SEQS=32 GLOBAL_TTT_GRAD_CLIP=1.0 \
  ITERATIONS=20000 MATRIX_LR=0.026 WARMDOWN_FRAC=0.75 \
  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 EMBED_CLIP_SIGMAS=20.0 \
  EMBED_BITS=7 TTT_CHUNK_SIZE=48 TTT_LORA_LAYER_LR_ALPHA=0.5 \
  LORA_PLUS_RATIO=1.0 PARALLEL_LAMBDA_ASYM=0 VAL_LOSS_EVERY=20000 \
  VOCAB_SIZE=8192 DATA_DIR=/workspace/data/ \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```

## Credits

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
The original round26 worker setup reconstructed PR openai#1695 from diff-only additions, which dropped unchanged lines and produced an invalid train_gpt.py. This commit replaces that broken surface with the exact PR-head file content so the reproduction lane tests the real Stage 3 + SpinQuant + MP-SGD-TTT family.

Constraint: Must preserve the exact public PR surface rather than hand-editing or simplifying it.
Rejected: Keep the diff-reconstructed file and debug it manually | that would test a synthetic surface, not the actual PR
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: PR-surface extraction must use full file content from the PR head, not diff-only "+" lines, whenever the file is modified rather than newly added
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU execution on Heimdall
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
…#1695 surface

The public openai#1695 run command relies on several env-var overrides that materially change the surface: SpinQuant on, phased TTT on, matrix LR 0.026, warmdown 0.75, embed_bits 7, embed_clip 20, chunk size 48, and the higher LoRA layer alpha. This branch bakes those settings into defaults so the reproduction lane can test the claimed surface rather than the inert default one.

Constraint: Must preserve the public PR surface and only move claimed run-command settings into code defaults.
Rejected: Reproduce with env vars only | the current evaluator path does not forward arbitrary env vars to remote jobs
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future public frontier PR must pass claimed-surface/default-surface comparison before it is treated as a serious candidate family
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First lever layered on the new openai#1736 baseline. Hadamard rotation of
weight matrices before GPTQ quantization, hotstarted off spec 008's
pre_gptq.pt FP checkpoint. No retraining.

Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on an
openai#1529-adjacent base; expected to compose cleanly with openai#1736 since
the quant stage is orthogonal to CaseOps / attention gates / phased
TTT. Rotation is a post-training transform with three classes
(residual-stream, per-layer attn, per-layer MLP); FP forward pass is
invariant by construction, only quantization error drops.

Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full
retrain. Same hotstart checkpoint reused by future quant
experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step
doesn't apply. RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow
(attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These
are the real fold targets, not RMSNorm. resid_mix is pre-norm and
cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and
  resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695 diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's
final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval.

research/ideas/spinquant-integration-notes.md captures the full
design analysis (per-multiplier fold feasibility, three-option
tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Removed the conditional gate on port_1695. All three run back-to-back:
internal_only, full, port_1695. Cheap enough (~$22 total) and
having three data points is worth the extra ~$5 even if openai#1695 turns
out to match Option A.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
User flagged that port_1695 should be the next spec (higher-impact,
natural follow-up to 009) rather than tapered WD. Reshuffled:

- 010-port-1695-online-rotation.md (NEW) — port openai#1695's online
  Hadamard rotation with rotated-basis GPTQ. Hotstart off spec 008
  pre_gptq.pt. Expected Delta -0.003 to -0.005 bpb vs spec 009
  baseline. ~$10, 8xH100.

- 011-tapered-wd.md (renumbered from 010) — Muon WD taper from openai#1729.
  Full retrain, ~$20. Independent of specs 009/010, can run in
  parallel.

Spec 010 inherits the design analysis from research/ideas/
spinquant-integration-notes.md (addendum section). Depends on spec
009 baseline measurement for apples-to-apples Delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…n sprint

Session-narrative entry covering today's work:

- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810)
  to unmerged openai#1736 (1.0655), rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within
  +0.00016 at pre-quant; post-TTT gate number not captured due to
  watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes ->
  unified sweep -> +baseline mode -> cut to 2 modes after discovering
  real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix
  doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not
  static weight rotation. Sidesteps both LeakyReLU and resid_mix.
  Reframes 'full' mode -> port_1695 mode as the next quant-side
  spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design
  only) drafted. Only spec 009 is truly runnable right now.

Closes with state-of-play table, modal plan, lessons-learned, and
open questions for next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the
openai#1736 stack. All changes env-var-gated (SPINQUANT_ENABLED=0 default)
so spec 008 and spec 009's baseline/internal_only modes are
unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_
  rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention,
  MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection
  and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True
  after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant active.
- LoRA (TTT path) uses unrotated n, base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets
  h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's
  machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot.T = (x @ R) @
(W @ R).T = x @ R @ R.T @ W.T = x @ W.T. Pre-quant forward is
bit-identical to unrotated; GPTQ sees rotated basis where outliers
are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution
runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py.

Not tested on GPU — flash_attn_3 not available on the dev box.
Syntax clean. First pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
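
For readers following along, one of the four hook sites described in the commit above might look roughly like this (an illustrative sketch, not the commit's actual code):

```python
# Hedged sketch of a single online-rotation hook site (pre-QKV).
# Rotations stay inactive during training and are enabled after deserialize.
import torch.nn.functional as F

def qkv_projection(x, W_qkv, R_attn_in, sq_active: bool):
    if sq_active:
        x = x @ R_attn_in        # online activation rotation: x -> x @ R
    return F.linear(x, W_qkv)    # W_qkv was rotated (W @ R) at serialize time
```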
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Captures the central empirical finding from spec 010's per-batch
analysis: Hadamard rotation (as in openai#1695's SpinQuant V1) has a
regime-dependent effect on prediction quality.

- Docs > 1000 tokens: rotation improves val_bpb by -0.007 bpb.
- Docs < 300 tokens: rotation hurts val_bpb by +0.015 bpb.
- Crossover ~500 tokens; effect exists at both pre-TTT and post-TTT.
- Float forward is identity under orthogonal rotation; only the
  quantized model differs.

Working hypothesis: long contexts aggregate quant error toward the
mean, which rotation lowers. Short contexts depend on per-token
variance, which rotation raises (small mixing perturbation + bf16
roundoff per token, un-averaged).

Exploitation is the open problem — 16MB cap blocks most obvious
routes (ship two models, toggle per doc). Feasible paths:

- Site-selective ablation (spec 010b, active, $25)
- Layer-selective ablation (possible 010c)
- Seed sweep (cheap, $5/seed)
- Rotation-aware retraining (spec 011+ territory)

Worth keeping as a live idea because it's an unusual result: a
null-aggregate with a strong structural decomposition that we
can actually measure and potentially exploit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
X-Abhishek-X and others added 3 commits April 20, 2026 16:50
Add records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/ with:
- README.md with results table and technique description
- submission.json with compliance block, per-seed results, track field
- train_gpt.py
- train_seed42.log, train_seed1337.log, train_seed2024.log

All files were previously at repo root (incorrect format). Proper folder
structure required by competition submission guidelines.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
All submission files now live exclusively in:
records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
This file should not be modified by record submissions.
Our submission lives exclusively in records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
