Record: SP8192 + No Gates + Multi-Phase Global SGD TTT — val_bpb 1.07285 (3-seed mean)#1775
Open
dentity007 wants to merge 3 commits into openai:main from
- Base: pr1667 MarioPaerle SP8192 architecture (SmearGate + AttnOutGate both disabled)
- TTT: pr1626 dexhunter Multi-Phase Global SGD TTT, 3 phases, 2000 prefix docs
- 3-seed mean val_bpb 1.07285, std 0.00051, across seeds 1337, 42, 0
- All seeds under the 600s train budget, 600s eval budget, and 16 MB artifact cap
- Issue openai#1017 Track B compliant (all four conditions individually addressed in README)
- No tokenizer changes (vanilla SP8192), no Casefold/CaseOps, no SLOT
- Hardware: 8xH100 SXM, Kansas City US-MO-1, per-GPU GEMM 657 TFLOPS
Summary
Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, FA3)
All three seeds clear the 600s train budget, the 600s eval budget, and the 16,000,000-byte (decimal) artifact cap. The 3-seed std of 0.00051 is well inside the 0.005-nat significance floor.
What this submission is
A combinatorial submission that isolates one specific hypothesis at full 8xH100 production scale: whether Multi-Phase Global SGD TTT (PR #1626) outperforms single-phase score-first TTT on the PR #1667 base when both of MarioPaerle's gates (SmearGate and AttnOutGate) are disabled. On the same pod I also ran single-phase score-first TTT with the same no-gates configuration and got 1.07612, so on this base the MP-SGD path beats single-phase by 0.00327 BPB.
No novel architecture. No tokenizer changes. No contested legality path.
Issue #1017 Track B compliance (summarized; full writeup in README)
`train_val_ttt_global_sgd_distributed` is invoked at phase boundaries on already-scored docs only, and the last chunk of each training slice is explicitly skipped. The MP-SGD code path is unchanged from PR #1626, which the community has accepted as Issue #1017 Track B compliant.
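As a rough illustration of that scheduling constraint (the function name, the chunk indexing, and the even split into phases below are all hypothetical; the actual logic lives in PR #1626's `train_val_ttt_global_sgd_distributed`):

```python
def ttt_update_points(num_chunks: int, num_phases: int) -> list[list[int]]:
    """Hypothetical sketch: partition the scored eval chunks into phases;
    at each phase boundary, a global SGD update may train only on chunks
    already scored in that phase, minus the last chunk of the slice."""
    per_phase = num_chunks // num_phases
    # Contiguous, non-overlapping phases of already-scored chunk indices.
    phases = [list(range(i * per_phase, (i + 1) * per_phase))
              for i in range(num_phases)]
    # Skip the final chunk of each training slice, as the submission requires.
    return [phase[:-1] for phase in phases]

# 9 chunks, 3 phases: each boundary update sees 2 of its 3 scored chunks.
print(ttt_update_points(9, 3))  # [[0, 1], [3, 4], [6, 7]]
```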
Reproduction
Env per seed (README has the complete setup):
Hardware: 8xH100 80GB HBM3 SXM, per-GPU GEMM 0.21ms / 657 TFLOPS, NVLink NV18 all-pairs, Intel Xeon 8470. Image: runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404 with flash_attn_interface from https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/.
Test plan
Attribution
Notes
The `train_gpt.py` in this folder contains two dev-only shims that are inert on H100: an FA3 → FA2 → SDPA backend auto-detect (activates only on compute capability 12.x, Blackwell), and a Triton block-size override for the fused linear-leaky-relu-square kernel (also cc 12.x only). Both no-op on Hopper H100; they are in the file so the same code runs on a Blackwell dev box without forking.
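The backend auto-detect can be sketched as a pure decision function (the function name and boolean-flag interface are hypothetical; the real shim would read `torch.cuda.get_device_capability()` and attempt the `flash_attn_interface` / `flash_attn` imports directly):

```python
def pick_attention_backend(cc_major: int, have_fa3: bool, have_fa2: bool) -> str:
    """Hypothetical sketch of the FA3 -> FA2 -> SDPA fallback chain.
    Only compute capability 12.x (Blackwell) triggers the fallback;
    on Hopper (cc 9.x, e.g. H100) FA3 is assumed and the shim no-ops."""
    if cc_major != 12:
        return "fa3"   # Hopper production path: shim is inert
    if have_fa3:
        return "fa3"   # FA3 wheel importable on the dev box
    if have_fa2:
        return "fa2"   # fall back to flash_attn (FA2)
    return "sdpa"      # last resort: torch SDPA

print(pick_attention_backend(9, False, False))  # fa3
print(pick_attention_backend(12, False, True))  # fa2
```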