
[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759 #1695

Open

X-Abhishek-X wants to merge 5 commits into openai:main from X-Abhishek-X:spinquant-mpsgd-ttt-stage3

Conversation


X-Abhishek-X commented Apr 17, 2026

val_bpb: 1.07590 (3-seed mean, std 0.00019) | ~15.0 MB | 8×H100 80GB SXM | 600s | Legal TTT | Beats PR #1529 (1.08100) by 0.00510


## Results

| Seed | Pre-quant BPB | Post-quant BPB | TTT BPB | Artifact Size |
|------|---------------|----------------|---------|---------------|
| 42   | 1.07276 | 1.08544 | 1.07591 | 15,698,706 B |
| 1337 | 1.07306 | 1.08544 | 1.07609 | 15,698,706 B |
| 2024 | 1.07273 | 1.08531 | 1.07570 | 15,698,706 B |
| **Mean** | | | **1.07590** | |
| **Std** | | | **0.00019** | |

## Key Changes vs PR #1445

| Component | PR #1445 | This PR | Source |
|-----------|----------|---------|--------|
| Weight rotation | None | Hadamard SpinQuant V1 | Meta AI 2024 |
| TTT algorithm | LoRA only | MP-SGD-TTT (phased) | PR #1626 |
| Quantization error | Baseline | Reduced (outliers suppressed) | SpinQuant |

## SpinQuant V1 — Banked Architecture Port

SpinQuant pre-rotates weight matrices with a random Hadamard matrix R before INT6 GPTQ. This spreads weight outliers uniformly, reducing quantization error without changing float model predictions.
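
For illustration, a minimal sketch of the rotation bake (helper names and construction details here are assumptions, not the PR's actual code):

```python
# Hedged sketch of a SpinQuant-style Hadamard pre-rotation; names and
# the exact construction are assumptions, not the PR's actual helpers.
import torch

def random_hadamard(n: int, seed: int) -> torch.Tensor:
    """Sylvester-construction Hadamard with random column sign flips.
    Assumes n is a power of two; the result R is orthogonal (R @ R.T == I)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    g = torch.Generator().manual_seed(seed)
    signs = (torch.randint(0, 2, (n,), generator=g) * 2 - 1).float()
    return (H * signs) / (n ** 0.5)

@torch.no_grad()
def bake_rotation(W: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    # Float predictions are unchanged because (x @ R) @ (W @ R).T == x @ W.T;
    # only the basis GPTQ quantizes in changes, spreading outliers evenly.
    return W @ R
```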

Porting to Stage 3's banked layout (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank) required per-slot rotation at bake time. R is stored as a non-parameter buffer — no optimizer ever touches it.
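
A sketch of how the per-slot bake and the non-parameter buffer could fit together (bank names follow the PR; the shapes and module layout are assumptions):

```python
# Hedged sketch: R as a non-parameter buffer plus per-slot rotation at
# bake time. Shapes and the module structure are assumptions.
import torch

class RotatedBank(torch.nn.Module):
    def __init__(self, bank: torch.Tensor, R: torch.Tensor):
        super().__init__()
        self.bank = torch.nn.Parameter(bank)  # (num_slots, out_features, in_features)
        self.register_buffer("R", R)          # fixed buffer: no optimizer state

    @torch.no_grad()
    def bake(self):
        # Rotate every slot independently: W_slot <- W_slot @ R
        self.bank.copy_(self.bank @ self.R)
```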

Quantization penalty (pre → post-quant): consistently +0.012–0.013 BPB across all seeds.


## SpinQuant × MP-SGD-TTT Composition

MP-SGD-TTT (PR #1626) runs SGD on base model weights between LoRA TTT phases. After SpinQuant baking, weights live in rotated space (W @ R). SGD updates W_rot directly; since R is a fixed buffer, the invariant F.linear(x @ R, W_rot') == F.linear(x, W_rot' @ R.T) holds across all steps. No special interaction terms required.
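
A quick numerical check of that invariant (a standalone sketch, not PR code):

```python
# Verifies the composition invariant from the text: with a fixed
# orthogonal R, rotating activations equals folding R into the weight.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
x = torch.randn(8, d)
W_rot = torch.randn(d, d)                  # weight already living in rotated space
R, _ = torch.linalg.qr(torch.randn(d, d))  # any orthogonal R stands in for Hadamard

lhs = F.linear(x @ R, W_rot)               # rotate the activations
rhs = F.linear(x, W_rot @ R.T)             # fold the rotation into the weight
assert torch.allclose(lhs, rhs, atol=1e-5)
```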

TTT config: prefix_docs=2000, num_phases=3, lr=0.001, momentum=0.9.
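
Structurally, the phased loop might look like the sketch below; the helpers and data plumbing are placeholders inferred from the config, not the PR's implementation:

```python
# Hedged sketch of the MP-SGD-TTT phase structure: momentum SGD on the
# base weights interleaved with LoRA-only TTT phases. `lora_phase` and
# the chunk iterable are hypothetical stand-ins.
import torch

def phased_ttt(model, chunks, lora_phase, num_phases=3, lr=0.001, momentum=0.9):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for _ in range(num_phases):
        lora_phase(model, chunks)      # LoRA adaptation step (caller-supplied)
        for x, y in chunks:            # SGD on base weights between phases
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            opt.zero_grad()
```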


## Training Configuration

  • ITERATIONS=20000 (the 600 s wall clock truncates training at ~4500 steps, ~98 ms/step)
  • MATRIX_LR=0.026, WARMDOWN_FRAC=0.75
  • MLP_CLIP_SIGMAS=12.0, ATTN_CLIP_SIGMAS=13.0, EMBED_CLIP_SIGMAS=20.0
  • EMBED_BITS=7, TTT_CHUNK_SIZE=48
  • TTT_LORA_LAYER_LR_ALPHA=0.5, LORA_PLUS_RATIO=1.0

## Quantization

  • INT6 GPTQ, percdamp=0.01, 64 calibration batches
  • SpinQuant rotation applied before GPTQ, baked into weights
  • Artifact: brotli-compressed .ptz, 15,698,706 bytes
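
For reference, a brotli-compressed .ptz artifact of this kind could be written roughly as follows (a sketch only; the PR's actual serialize() path is not shown here):

```python
# Hedged sketch of producing a brotli-compressed .ptz artifact.
# The real serializer differs; this only illustrates the format.
import io
import brotli  # pip install brotli
import torch

def save_ptz(state_dict, path):
    buf = io.BytesIO()
    torch.save(state_dict, buf)          # serialize to an in-memory buffer
    data = brotli.compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(data)
```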

## Reproduction

```bash
for SEED in 42 1337 2024; do
  SEED=${SEED} \
  SPINQUANT_ENABLED=1 SPINQUANT_SEED=20260416 \
  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
  GLOBAL_TTT_LR=0.001 GLOBAL_TTT_MOMENTUM=0.9 GLOBAL_TTT_CHUNK_TOKENS=32768 \
  GLOBAL_TTT_BATCH_SEQS=32 GLOBAL_TTT_GRAD_CLIP=1.0 \
  ITERATIONS=20000 MATRIX_LR=0.026 WARMDOWN_FRAC=0.75 \
  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 EMBED_CLIP_SIGMAS=20.0 \
  EMBED_BITS=7 TTT_CHUNK_SIZE=48 TTT_LORA_LAYER_LR_ALPHA=0.5 \
  LORA_PLUS_RATIO=1.0 PARALLEL_LAMBDA_ASYM=0 VAL_LOSS_EVERY=20000 \
  VOCAB_SIZE=8192 DATA_DIR=/workspace/data/ \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```

## Credits

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
The original round26 worker setup reconstructed PR openai#1695 from diff-only additions, which dropped unchanged lines and produced an invalid train_gpt.py. This commit replaces that broken surface with the exact PR-head file content so the reproduction lane tests the real Stage 3 + SpinQuant + MP-SGD-TTT family.

Constraint: Must preserve the exact public PR surface rather than hand-editing or simplifying it.
Rejected: Keep the diff-reconstructed file and debug it manually | that would test a synthetic surface, not the actual PR
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: PR-surface extraction must use full file content from the PR head, not diff-only "+" lines, whenever the file is modified rather than newly added
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU execution on Heimdall
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
…#1695 surface

The public openai#1695 run command relies on several env-var overrides that materially change the surface: SpinQuant on, phased TTT on, matrix LR 0.026, warmdown 0.75, embed_bits 7, embed_clip 20, chunk size 48, and the higher LoRA layer alpha. This branch bakes those settings into defaults so the reproduction lane can test the claimed surface rather than the inert default one.

Constraint: Must preserve the public PR surface and only move claimed run-command settings into code defaults.
Rejected: Reproduce with env vars only | the current evaluator path does not forward arbitrary env vars to remote jobs
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future public frontier PR must pass claimed-surface/default-surface comparison before it is treated as a serious candidate family
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First lever layered on the new openai#1736 baseline. Hadamard rotation of
weight matrices before GPTQ quantization, hotstarted off spec 008's
pre_gptq.pt FP checkpoint. No retraining.

Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on an
openai#1529-adjacent base; expected to compose cleanly with openai#1736 since
the quant stage is orthogonal to CaseOps / attention gates / phased
TTT. Rotation is a post-training transform with three classes
(residual-stream, per-layer attn, per-layer MLP); FP forward pass is
invariant by construction, only quantization error drops.

Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full
retrain. Same hotstart checkpoint reused by future quant
experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step
doesn't apply. RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow
(attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These
are the real fold targets, not RMSNorm. resid_mix is pre-norm and
cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and
  resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695 diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's
final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval.

research/ideas/spinquant-integration-notes.md captures the full
design analysis (per-multiplier fold feasibility, three-option
tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Removed the conditional gate on port_1695. All three run back-to-back:
internal_only, full, port_1695. Cheap enough (~$22 total) and
having three data points is worth the extra ~$5 even if openai#1695 turns
out to match Option A.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
User flagged that port_1695 should be the next spec (higher-impact,
natural follow-up to 009) rather than tapered WD. Reshuffled:

- 010-port-1695-online-rotation.md (NEW) — port openai#1695's online
  Hadamard rotation with rotated-basis GPTQ. Hotstart off spec 008
  pre_gptq.pt. Expected Delta -0.003 to -0.005 bpb vs spec 009
  baseline. ~$10, 8xH100.

- 011-tapered-wd.md (renumbered from 010) — Muon WD taper from openai#1729.
  Full retrain, ~$20. Independent of specs 009/010, can run in
  parallel.

Spec 010 inherits the design analysis from research/ideas/
spinquant-integration-notes.md (addendum section). Depends on spec
009 baseline measurement for apples-to-apples Delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…n sprint

Session-narrative entry covering today's work:

- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810)
  to unmerged openai#1736 (1.0655), rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within
  +0.00016 at pre-quant; post-TTT gate number not captured due to
  watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes ->
  unified sweep -> +baseline mode -> cut to 2 modes after discovering
  real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix
  doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not
  static weight rotation. Sidesteps both LeakyReLU and resid_mix.
  Reframes 'full' mode -> port_1695 mode as the next quant-side
  spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design
  only) drafted. Only spec 009 is truly runnable right now.

Closes with state-of-play table, modal plan, lessons-learned, and
open questions for next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the
openai#1736 stack. All changes env-var-gated (SPINQUANT_ENABLED=0 default)
so spec 008 and spec 009's baseline/internal_only modes are
unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_
  rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention,
  MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection
  and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True
  after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant active.
- LoRA (TTT path) uses unrotated n, base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets
  h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's
  machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot.T = (x @ R) @
(W @ R).T = x @ R @ R.T @ W.T = x @ W.T. Pre-quant forward is
bit-identical to unrotated; GPTQ sees rotated basis where outliers
are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution
runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py.

Not tested on GPU — flash_attn_3 not available on the dev box.
Syntax clean. First pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
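
For readers following along, one of the four hook sites described in the commit above might look roughly like this (an illustrative sketch, not the commit's actual code):

```python
# Hedged sketch of a single online-rotation hook site (pre-QKV).
# Rotations stay inactive during training and are enabled after deserialize.
import torch.nn.functional as F

def qkv_projection(x, W_qkv, R_attn_in, sq_active: bool):
    if sq_active:
        x = x @ R_attn_in        # online activation rotation: x -> x @ R
    return F.linear(x, W_qkv)    # W_qkv was rotated (W @ R) at serialize time
```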
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Captures the central empirical finding from spec 010's per-batch
analysis: Hadamard rotation (as in openai#1695's SpinQuant V1) has a
regime-dependent effect on prediction quality.

- Docs > 1000 tokens: rotation improves val_bpb by -0.007 bpb.
- Docs < 300 tokens: rotation hurts val_bpb by +0.015 bpb.
- Crossover ~500 tokens; effect exists at both pre-TTT and post-TTT.
- Float forward is identity under orthogonal rotation; only the
  quantized model differs.

Working hypothesis: long contexts aggregate quant error toward the
mean, which rotation lowers. Short contexts depend on per-token
variance, which rotation raises (small mixing perturbation + bf16
roundoff per token, un-averaged).

Exploitation is the open problem — 16MB cap blocks most obvious
routes (ship two models, toggle per doc). Feasible paths:

- Site-selective ablation (spec 010b, active, $25)
- Layer-selective ablation (possible 010c)
- Seed sweep (cheap, $5/seed)
- Rotation-aware retraining (spec 011+ territory)

Worth keeping as a live idea because it's an unusual result: a
null-aggregate with a strong structural decomposition that we
can actually measure and potentially exploit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
X-Abhishek-X and others added 3 commits April 20, 2026 16:50
Add records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/ with:
- README.md with results table and technique description
- submission.json with compliance block, per-seed results, track field
- train_gpt.py
- train_seed42.log, train_seed1337.log, train_seed2024.log

All files were previously at repo root (incorrect format). Proper folder
structure required by competition submission guidelines.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
All submission files now live exclusively in:
records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
This file should not be modified by record submissions.
Our submission lives exclusively in records/track_10min_16mb/2026-04-17_Stage3_SpinQuant_MPSGDTTT_1.0759/

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
