
Record: SP8192 + No Gates + Multi-Phase Global SGD TTT — val_bpb 1.07285 (3-seed mean) #1775

Open

dentity007 wants to merge 3 commits into openai:main from NathanMaine:submission/nathanmaine-h2-mpsgd-no-gates-1.0729

Conversation

@dentity007

Summary

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, FA3)

| Seed | Steps | Train time | Post-TTT val_bpb | Eval time | Artifact (bytes) |
|------|-------|------------|------------------|-----------|------------------|
| 1337 | 4827  | 587.52s    | 1.07333739       | 429.1s    | 15,935,536       |
| 42   | 4839  | 587.16s    | 1.07287895       | 338.7s    | 15,935,501       |
| 0    | 4832  | 587.16s    | 1.07232205       | 385.1s    | 15,943,766       |
| Mean | 4833  | 587.28s    | 1.07285          | 384.3s    | 15,938,268       |
| Std  |       |            | 0.00051          |           |                  |

All three seeds clear the 600s train budget, the 600s eval budget, and the 16,000,000-byte (decimal) artifact cap. The 3-seed std of 0.00051 is well inside the 0.005-nat significance floor.
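The reported mean and std can be rechecked directly from the per-seed val_bpb values in the table; the std matches the sample standard deviation (n − 1 denominator):

```python
import statistics

# Per-seed post-TTT val_bpb from the results table
vals = [1.07333739, 1.07287895, 1.07232205]

mean = statistics.mean(vals)   # rounds to 1.07285
std = statistics.stdev(vals)   # sample std (ddof=1), rounds to 0.00051

print(round(mean, 5), round(std, 5))
```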

What this submission is

A combinatorial submission that isolates one specific hypothesis at full 8xH100 production scale: whether Multi-Phase Global SGD TTT (PR #1626) outperforms single-phase score-first TTT on the PR #1667 base when both of MarioPaerle's gates (SmearGate and AttnOutGate) are disabled. On the same pod I also ran single-phase score-first TTT with the same no-gates configuration and got 1.07612, so on this base the MP-SGD path beats single-phase by 0.0028 BPB.
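As a rough illustration of the multi-phase scheduling idea (the function name, event format, and boundary rule below are invented for this sketch, not taken from PR #1626), the key property is that each global SGD call at a phase boundary only ever sees documents whose scoring is already complete:

```python
def schedule(num_docs, num_phases):
    """Toy event stream: score docs left to right; at each phase boundary,
    trigger global SGD over only the already-scored prefix of docs."""
    boundary = max(1, num_docs // num_phases)
    events = []
    for d in range(num_docs):
        events.append(("score", d))
        if (d + 1) % boundary == 0 and d + 1 < num_docs:
            # global SGD sees docs 0..d, all of which were scored above
            events.append(("global_sgd", list(range(d + 1))))
    return events
```

With 6 docs and 3 phases this yields two global SGD events, each operating strictly on the scored prefix, which is the causal structure Condition 1 requires.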

No novel architecture. No tokenizer changes. No contested legality path.

Issue #1017 Track B compliance (summarized; full writeup in README)

  1. Condition 1 (Strict causal dependence): LoRA state is built only from prefix tokens. Global base-model SGD at phase boundaries operates only on tokens from documents whose scoring is already complete.
  2. Condition 2 (Full normalized distribution): Standard softmax over the full sentencepiece vocabulary of 8192. No bucket normalization or x_t-contingent completion.
  3. Condition 3 (Score-before-update): Per-chunk: forward and accumulation into loss_sum complete before any LoRA gradient step on that chunk. Global: train_val_ttt_global_sgd_distributed invoked at phase boundaries on already-scored docs only. Last chunk of each training slice explicitly skipped.
  4. Condition 4 (Single left-to-right pass): Each batch claimed exactly once via atomic file-lock counter. No rescoring.
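The per-chunk ordering in Condition 3 can be sketched with a toy scalar "model" (all names here are invented for illustration; the real code operates on LoRA adapters over transformer chunks):

```python
def run_ttt(chunks, w=0.0, lr=0.1):
    """Score each chunk with the *current* weights, then update on that chunk.
    The last chunk of the slice is scored but never trained on."""
    loss_sum = 0.0
    for i, x in enumerate(chunks):
        loss = (w - x) ** 2          # scoring: forward pass with pre-update w
        loss_sum += loss             # accumulate into loss_sum BEFORE any step
        if i < len(chunks) - 1:      # last chunk explicitly skipped for updates
            w -= lr * 2 * (w - x)    # gradient step only after scoring
    return loss_sum, w
```

Because the loss is accumulated before the step, no chunk's score ever benefits from an update computed on that same chunk.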

The MP-SGD code path is unchanged from PR #1626, which has been accepted by the community as Issue #1017 Track B compliant.
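Condition 4's single-pass guarantee rests on the atomic batch-claim counter; a minimal Unix-only sketch of that idea (the actual locking mechanism in the submission may differ):

```python
import fcntl

def claim_next_batch(counter_path):
    """Atomically claim the next unscored batch index via an advisory
    file lock, so each batch is handed out exactly once across workers."""
    with open(counter_path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)    # exclusive lock: one claimant at a time
        f.seek(0)
        text = f.read().strip()
        idx = int(text) if text else 0   # next unclaimed batch index
        f.seek(0)
        f.truncate()
        f.write(str(idx + 1))            # advance the counter before unlocking
        fcntl.flock(f, fcntl.LOCK_UN)
    return idx
```

Successive calls return 0, 1, 2, …, and no index is ever returned twice, which rules out rescoring.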

Reproduction

Env per seed (README has the complete setup):

SEED=<1337|42|0> \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_NUM_PHASES=3 \
PHASED_TTT_PREFIX_DOCS=2000 \
GLOBAL_TTT_LR=0.001 \
GLOBAL_TTT_EPOCHS=1 \
SMEAR_GATE=0 \
GATE_ATTN_OUT=0 \
DATA_DIR=<path to FineWeb10B sp8192 data> \
ARTIFACT_DIR=<output dir> \
RUN_ID=<run id> \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Hardware: 8xH100 80GB HBM3 SXM, per-GPU GEMM 0.21ms / 657 TFLOPS, NVLink NV18 all-pairs, Intel Xeon 8470. Image: runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404 with flash_attn_interface from https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/.

Test plan

  • Organizer confirms each of the 3 seed logs reports val_bpb matching the table
  • Organizer reproduces at least one seed (seed 1337 is the simplest) and gets val_bpb within 0.0007 (observed 3-seed spread)
  • Artifact size, train_time, total_eval_time all within budgets on rerun

Attribution

Notes

The train_gpt.py in this folder contains two dev-only shims that are inert on H100: a FA3 to FA2 to SDPA backend auto-detect (activates only on compute capability 12.x, Blackwell), and a Triton block-size override for the fused linear-leaky-relu-square kernel (also cc 12.x only). Both no-op on Hopper H100. They are in the file so the same code runs on a Blackwell dev box without forking.
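The shim logic amounts to a compute-capability dispatch; a hypothetical sketch (function name, `installed` parameter, and backend labels are invented here, not the actual train_gpt.py code):

```python
def pick_attention_backend(cc_major, installed=("fa3",)):
    """Inert on H100 (Hopper, cc 9); active only on a Blackwell dev box
    (cc 12), where FA3 wheels may be unavailable and we fall back."""
    if cc_major == 12:
        # dev-only path: try FA3, then FA2, then PyTorch SDPA
        for backend in ("fa3", "fa2", "sdpa"):
            if backend in installed:
                return backend
        return "sdpa"
    return "fa3"  # production path on H100: flash_attn_interface (FA3)
```

On cc 9.x the function short-circuits to FA3 regardless of `installed`, which is the "no-op on Hopper" behavior described above.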

dentity007 and others added 3 commits March 30, 2026 19:12
- Base: pr1667 MarioPaerle SP8192 architecture (SmearGate + AttnOutGate both disabled)
- TTT: pr1626 dexhunter Multi-Phase Global SGD TTT, 3 phases, 2000 prefix docs
- 3-seed mean val_bpb 1.07285 std 0.00051 across seeds 1337, 42, 0
- All seeds under 600s train budget, 600s eval budget, 16 MB artifact cap
- Issue openai#1017 Track B compliant (all four conditions individually addressed in README)
- No tokenizer changes (vanilla SP8192), no Casefold/CaseOps, no SLOT
- Hardware: 8xH100 SXM, Kansas City US-MO-1, per-GPU GEMM 657 TFLOPS
