
Record: SP8192 + No Gates + Multi-Phase Global SGD TTT — val_bpb 1.07285 (3-seed mean) #1775

Open

dentity007 wants to merge 3 commits into openai:main from NathanMaine:submission/nathanmaine-h2-mpsgd-no-gates-1.0729

Conversation

@dentity007

Summary

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, FA3)

| Seed | Steps | Train time | Post-TTT val_bpb | Eval time | Artifact (bytes) |
|------|-------|------------|------------------|-----------|------------------|
| 1337 | 4827  | 587.52s    | 1.07333739       | 429.1s    | 15,935,536       |
| 42   | 4839  | 587.16s    | 1.07287895       | 338.7s    | 15,935,501       |
| 0    | 4832  | 587.16s    | 1.07232205       | 385.1s    | 15,943,766       |
| Mean | 4833  | 587.28s    | 1.07285          | 384.3s    | 15,938,268       |
| Std  |       |            | 0.00051          |           |                  |

All three seeds clear the 600s train budget, the 600s eval budget, and the 16,000,000-byte (decimal) artifact cap. The 3-seed std of 0.00051 is well inside the 0.005-nat significance floor.
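The reported mean and std can be rechecked directly from the per-seed val_bpb values in the table; the std matches the sample standard deviation (n − 1 denominator):

```python
import statistics

# Per-seed post-TTT val_bpb from the results table
vals = [1.07333739, 1.07287895, 1.07232205]

mean = statistics.mean(vals)   # rounds to 1.07285
std = statistics.stdev(vals)   # sample std (ddof=1), rounds to 0.00051

print(round(mean, 5), round(std, 5))
```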

What this submission is

A combinatorial submission that isolates one specific hypothesis at full 8xH100 production scale: whether Multi-Phase Global SGD TTT (PR #1626) outperforms single-phase score-first TTT on the PR #1667 base when both of MarioPaerle's gates (SmearGate and AttnOutGate) are disabled. On the same pod I also ran single-phase score-first TTT with the same no-gates configuration and got 1.07612, so on this base the MP-SGD path beats single-phase by 0.0028 BPB.
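As a rough illustration of the multi-phase scheduling idea (the function name, event format, and boundary rule below are invented for this sketch, not taken from PR #1626), the key property is that each global SGD call at a phase boundary only ever sees documents whose scoring is already complete:

```python
def schedule(num_docs, num_phases):
    """Toy event stream: score docs left to right; at each phase boundary,
    trigger global SGD over only the already-scored prefix of docs."""
    boundary = max(1, num_docs // num_phases)
    events = []
    for d in range(num_docs):
        events.append(("score", d))
        if (d + 1) % boundary == 0 and d + 1 < num_docs:
            # global SGD sees docs 0..d, all of which were scored above
            events.append(("global_sgd", list(range(d + 1))))
    return events
```

With 6 docs and 3 phases this yields two global SGD events, each operating strictly on the scored prefix, which is the causal structure Condition 1 requires.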

No novel architecture. No tokenizer changes. No contested legality path.

Issue #1017 Track B compliance (summarized; full writeup in README)

  1. Condition 1 (Strict causal dependence): LoRA state is built only from prefix tokens. Global base-model SGD at phase boundaries operates only on tokens from documents whose scoring is already complete.
  2. Condition 2 (Full normalized distribution): Standard softmax over the full sentencepiece vocabulary of 8192. No bucket normalization or x_t-contingent completion.
  3. Condition 3 (Score-before-update): Per-chunk: forward and accumulation into loss_sum complete before any LoRA gradient step on that chunk. Global: train_val_ttt_global_sgd_distributed invoked at phase boundaries on already-scored docs only. Last chunk of each training slice explicitly skipped.
  4. Condition 4 (Single left-to-right pass): Each batch claimed exactly once via atomic file-lock counter. No rescoring.
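The per-chunk ordering in Condition 3 can be sketched with a toy scalar "model" (all names here are invented for illustration; the real code operates on LoRA adapters over transformer chunks):

```python
def run_ttt(chunks, w=0.0, lr=0.1):
    """Score each chunk with the *current* weights, then update on that chunk.
    The last chunk of the slice is scored but never trained on."""
    loss_sum = 0.0
    for i, x in enumerate(chunks):
        loss = (w - x) ** 2          # scoring: forward pass with pre-update w
        loss_sum += loss             # accumulate into loss_sum BEFORE any step
        if i < len(chunks) - 1:      # last chunk explicitly skipped for updates
            w -= lr * 2 * (w - x)    # gradient step only after scoring
    return loss_sum, w
```

Because the loss is accumulated before the step, no chunk's score ever benefits from an update computed on that same chunk.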

The MP-SGD code path is unchanged from PR #1626, which has been accepted by the community as Issue #1017 Track B compliant.
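Condition 4's single-pass guarantee rests on the atomic batch-claim counter; a minimal Unix-only sketch of that idea (the actual locking mechanism in the submission may differ):

```python
import fcntl

def claim_next_batch(counter_path):
    """Atomically claim the next unscored batch index via an advisory
    file lock, so each batch is handed out exactly once across workers."""
    with open(counter_path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)    # exclusive lock: one claimant at a time
        f.seek(0)
        text = f.read().strip()
        idx = int(text) if text else 0   # next unclaimed batch index
        f.seek(0)
        f.truncate()
        f.write(str(idx + 1))            # advance the counter before unlocking
        fcntl.flock(f, fcntl.LOCK_UN)
    return idx
```

Successive calls return 0, 1, 2, …, and no index is ever returned twice, which rules out rescoring.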

Reproduction

Env per seed (README has the complete setup):

SEED=<1337|42|0> \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_NUM_PHASES=3 \
PHASED_TTT_PREFIX_DOCS=2000 \
GLOBAL_TTT_LR=0.001 \
GLOBAL_TTT_EPOCHS=1 \
SMEAR_GATE=0 \
GATE_ATTN_OUT=0 \
DATA_DIR=<path to FineWeb10B sp8192 data> \
ARTIFACT_DIR=<output dir> \
RUN_ID=<run id> \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Hardware: 8xH100 80GB HBM3 SXM, per-GPU GEMM 0.21ms / 657 TFLOPS, NVLink NV18 all-pairs, Intel Xeon 8470. Image: runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404 with flash_attn_interface from https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/.

Test plan

  • Organizer confirms each of the 3 seed logs reports val_bpb matching the table
  • Organizer reproduces at least one seed (seed 1337 is the simplest) and gets val_bpb within 0.0007 (observed 3-seed spread)
  • Artifact size, train_time, total_eval_time all within budgets on rerun

Attribution

Notes

The train_gpt.py in this folder contains two dev-only shims that are inert on H100: a FA3 to FA2 to SDPA backend auto-detect (activates only on compute capability 12.x, Blackwell), and a Triton block-size override for the fused linear-leaky-relu-square kernel (also cc 12.x only). Both no-op on Hopper H100. They are in the file so the same code runs on a Blackwell dev box without forking.
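The shim logic amounts to a compute-capability dispatch; a hypothetical sketch (function name, `installed` parameter, and backend labels are invented here, not the actual train_gpt.py code):

```python
def pick_attention_backend(cc_major, installed=("fa3",)):
    """Inert on H100 (Hopper, cc 9); active only on a Blackwell dev box
    (cc 12), where FA3 wheels may be unavailable and we fall back."""
    if cc_major == 12:
        # dev-only path: try FA3, then FA2, then PyTorch SDPA
        for backend in ("fa3", "fa2", "sdpa"):
            if backend in installed:
                return backend
        return "sdpa"
    return "fa3"  # production path on H100: flash_attn_interface (FA3)
```

On cc 9.x the function short-circuits to FA3 regardless of `installed`, which is the "no-op on Hopper" behavior described above.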

dentity007 and others added 3 commits March 30, 2026 19:12
- Base: pr1667 MarioPaerle SP8192 architecture (SmearGate + AttnOutGate both disabled)
- TTT: pr1626 dexhunter Multi-Phase Global SGD TTT, 3 phases, 2000 prefix docs
- 3-seed mean val_bpb 1.07285 std 0.00051 across seeds 1337, 42, 0
- All seeds under 600s train budget, 600s eval budget, 16 MB artifact cap
- Issue openai#1017 Track B compliant (all four conditions individually addressed in README)
- No tokenizer changes (vanilla SP8192), no Casefold/CaseOps, no SLOT
- Hardware: 8xH100 SXM, Kansas City US-MO-1, per-GPU GEMM 657 TFLOPS
