Record attempt: SP8192 + 3-Epoch Parallel Pre-Quant TTT + Huber WD Muon (SDPA-friendly) — val_bpb 1.07037 (3-seed mean) #1807
Open
davie2009kh wants to merge 1 commit into openai:main
Record: SP8192 + 3-Epoch Parallel Pre-Quant TTT + Huber WD Muon (SDPA-friendly)
val_bpb 1.07037 (3-seed mean, std 0.00027) on the 10 min / 16 MB track.
Summary
Improves on the merged SOTA (PR #1493, val_bpb 1.0810) by −0.01063 BPB on the 3-seed mean.
This submission adapts the parallel pre-quant AdamW TTT stack (PR #1735, @AjAnubolu + PR #1738, @alertcat) for environments without FlashAttention-3, specifically a torch 2.11+cu130 stack that has no compatible FA3 wheel and no nvcc available for source builds. On such a stack, SDPA is the only available attention backend and TTT epochs take ~4× longer than on the FA3 reference. The original 21-epoch schedule overruns the 600 s eval budget in that regime; this PR rebalances the schedule to fit.
The three concrete changes, each small and defensible in isolation:
1. 3-epoch TTT schedule. A dedicated ablation (seed 1337) showed that with SDPA's ~96 s warmup + ~19 s/epoch costs, the budget only fits 3 full TTT epochs. A 4-epoch cosine-warm-restart variant (cycle 1 = 3 ep, cycle 2 = 1 ep) was tried first and regressed from 1.0701 (3-epoch) to 1.0727 (4-epoch): the restart LR jolt hurt when the follow-on cycle was too short to re-converge. The final schedule is plain CosineAnnealingLR(T_max=3, eta_min=1e-4).

2. Sparser eval cadence plus a budget guard. eval_val calls after every epoch cost ~12 s on SDPA, so we run them on epochs 1, 3, 5, … and always on the final epoch. A budget guard breaks TTT early if elapsed + 150 s × remaining_epochs > 600 s. Under the 3-epoch schedule the guard never triggers, but it protects long-tail variance on slower runs.

3. Huber weight decay for Muon. The standard decoupled decay p ← p · (1 − lr·wd) is replaced with a Huber variant: L2 for |w| < δ, L1 above, with δ = 3/√(fan_in). The intent is to suppress outlier weights that cause int6 GPTQ clipping loss without over-penalizing typical weights. Its contribution to final BPB is small (within noise of the 3-epoch TTT change).

Everything else is inherited verbatim from the PR #1493 stack (SP8192 + CaseOps tokenizer + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + EMA + GPTQ SDClip + Brotli).
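For concreteness, a minimal scalar sketch of the Huber-shaped decay step in the third change (function name and scalar formulation are mine; the real optimizer applies this elementwise to tensors inside the Muon update):

```python
import math

def huber_wd_step(w: float, lr: float, wd: float, fan_in: int) -> float:
    """Apply one Huber-shaped decoupled weight-decay step to a single weight.

    Below delta the decay is the usual L2-style p <- p * (1 - lr*wd); above
    delta it is L1-style, shrinking by a constant lr*wd*delta toward zero,
    with delta = 3 / sqrt(fan_in) as stated in the PR description.
    """
    delta = 3.0 / math.sqrt(fan_in)
    if abs(w) < delta:
        return w * (1.0 - lr * wd)                 # L2 region: proportional decay
    return w - math.copysign(lr * wd * delta, w)   # L1 region: constant shrink
```

Note that the two branches agree at |w| = δ (both shrink by lr·wd·δ), so the decay is continuous across the threshold.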
3-seed results (8× H100 80GB SXM, 10-min train / 10-min eval)
Artifact margin: worst case 137,006 bytes under the 16 MB cap. Training uses 588 s of the 600 s budget across all seeds; SDPA eval uses ~300 s in total.
Per-epoch TTT trajectory (seed 1337)
The epoch-1 eval intentionally overshoots because the initial LR is at peak — the loss floor at epoch 3 (1.07589) is what matters for the quantization step that follows.
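The eval cadence and budget guard from the change list above can be sketched as follows (function and argument names are hypothetical; the real script interleaves this with the actual TTT training step and eval_val call):

```python
import time

def run_ttt(epochs, train_epoch, eval_val,
            budget_s=600.0, worst_case_epoch_s=150.0):
    """Run TTT epochs with odd-epoch/final evals and an early-exit budget guard.

    The guard breaks out if elapsed + 150 s x remaining_epochs > 600 s; under
    the 3-epoch schedule it never fires, but it bounds slow-run variance.
    """
    start = time.monotonic()
    for epoch in range(1, epochs + 1):
        elapsed = time.monotonic() - start
        remaining = epochs - epoch + 1
        if elapsed + worst_case_epoch_s * remaining > budget_s:
            break                      # budget guard
        train_epoch(epoch)
        if epoch % 2 == 1 or epoch == epochs:
            eval_val(epoch)            # epochs 1, 3, 5, ... and always the last
```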
Compliance (Issue #1017 Track A)
Relationship to pending PRs
PR #1735 (@AjAnubolu, 1.0429), PR #1738 (@alertcat, 1.0354 with CaseOps), and the kilojoules follow-up (1.0284 with LR=1e-3/freeze=0) all use FA3 and run 21 epochs of pre-quant TTT. On FA3-less hardware those scores are not reachable; this submission reconstructs the best TTT schedule that is reachable there, and separately adds Huber-Muon WD.
If any of those PRs merge first and become the new record baseline, this PR should be rebased or withdrawn — it does not claim improvement over them.
Reproduction
Environment: pytorch 2.11.0+cu130, no FA3 (script falls back to SDPA). A reproduction on pytorch 2.9.1+cu128 with FA3 would finish faster but should land at the same BPB to within ~0.001.
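A minimal sketch of the FA3-detection-and-fallback this environment note implies (the module name flash_attn_interface is an assumption, not taken from the script; adjust to however the script actually imports FA3):

```python
import importlib.util

def pick_attention_backend() -> str:
    """Return "fa3" when a FlashAttention-3 wheel is importable, else "sdpa".

    On torch 2.11+cu130 no compatible FA3 wheel exists and nvcc is absent,
    so this resolves to the SDPA backend (~4x slower TTT epochs).
    """
    if importlib.util.find_spec("flash_attn_interface") is not None:
        return "fa3"
    return "sdpa"
```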
Attribution
This PR's contribution: schedule + eval-budget rebalancing for FA3-less stacks, and Huber-WD variant for Muon.