
Record: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed) #1127

Open
dentity007 wants to merge 2 commits into openai:main from dentity007:submission/dentity007-dim480-1.1311

Conversation

@dentity007

11L XSA4 + EMA + LoRA TTT + Partial RoPE + GPTQ-lite (dim480)

val_bpb: 1.13112 (3-seed mean, std 0.00051) | ~15.5 MB | 8×H100 SXM (Reykjavík, Iceland)

The architecture from PR #462, compressed to fit the 16 MB budget with MODEL_DIM=480.

3-seed validation

| Seed | Val BPB | Steps | ms/step | Size (bytes) |
|------|---------|-------|---------|--------------|
| 1337 | 1.13041826 | 7,847 | 76.47 | 15,489,698 |
| 42 | 1.13161931 | 7,958 | 75.40 | 15,462,345 |
| 7 | 1.13133583 | 7,935 | 75.62 | 15,436,056 |
| Mean | 1.13112 (std 0.00051) | | | |

Key components

  • 11L U-Net, MODEL_DIM=480, NUM_KV_HEADS=4, MLP_HIDDEN=1536 (3×)
  • EMA decay=0.9985
  • Partial RoPE (16/64 dims)
  • Late QAT int6 STE (threshold 0.15)
  • Single-pass LoRA TTT (rank=8, lr=0.01, 1 epoch)
  • XSA on deepest 4 layers
  • BigramHash (8192 buckets, dim=128) + SmearGate
  • int6 + zstd-22 compression (3.83× ratio; see the sketch below)
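
For reference, a minimal sketch of how an int6 + zstd-22 artifact ratio can be measured. This is illustrative only: the per-tensor symmetric scheme, function names, and the unpacked one-byte-per-value storage are assumptions, not the submission's code.

```python
import numpy as np
import zstandard as zstd  # assumes the `zstandard` package is installed

def quantize_int6(weights: np.ndarray):
    # Symmetric 6-bit range is [-32, 31]; a single scale per tensor in this sketch.
    scale = max(float(np.abs(weights).max()), 1e-8) / 31.0
    q = np.clip(np.round(weights / scale), -32, 31).astype(np.int8)
    return q, scale

def artifact_ratio(weights: np.ndarray):
    q, _ = quantize_int6(weights)
    # Stored one value per byte here; a real packer would use 6 bits per value.
    blob = zstd.ZstdCompressor(level=22).compress(q.tobytes())
    return len(blob), weights.nbytes / len(blob)

if __name__ == "__main__":
    w = np.random.randn(480, 1536).astype(np.float32)  # one MLP-sized matrix at dim 480
    size, ratio = artifact_ratio(w)
    print(f"compressed bytes: {size:,}  ratio vs fp32: {ratio:.2f}x")
```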

Compliance

  • Training: ≤600s on 8×H100 SXM
  • Artifact: ~15.5MB (under 16,000,000 bytes)
  • 3-seed verified

Note on reproduction

The current runpod/parameter-golf:latest image (PyTorch 2.9.1+cu128) requires a manual FlashAttention-3 install:

pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

@MatoTeziTanka

Community Review — Record: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed)

BPB: 1.13112 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 8096c1b3c87a, file records/track_10min_16mb/2026-03-30_11L_XSA4_EMA_LoRATTT_PartialRoPE_dim480_1.1311/train_gpt.py):

At line 936 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is on the function signature itself — the argument tensor passed in.
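For concreteness, a minimal sketch of the score-first-per-chunk ordering described above. All names (score_first_ttt, adapter_optimizer, chunks) are hypothetical and this is not the repository's code; the point is only the ordering: each chunk is scored under torch.no_grad() before the adapter is updated on it, so no token contributes to the reported metric after the model has trained on it.

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, adapter_optimizer, chunks):
    """chunks: iterable of (inputs, targets) LongTensor pairs. Hypothetical names."""
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) score this chunk with the adapter state from BEFORE it has seen the chunk
        with torch.no_grad():
            logits = model(inputs)
            nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        # 2) only afterwards update the adapter on the same chunk
        loss = F.cross_entropy(model(inputs).flatten(0, 1), targets.flatten())
        adapter_optimizer.zero_grad()
        loss.backward()
        adapter_optimizer.step()
    # bits per token from the pre-update scores only (bpb would further divide by bytes per token)
    return total_nll / (total_tokens * math.log(2))
```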

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=64073 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

…ka review pattern

Proactive self-flag before the Agora compliance review reaches this PR.
Same illegal pattern as PR openai#1193 and PR openai#406: ttt_adapt() runs on val_tokens
for 1 epoch with no score-first discipline before the final eval.

Changes:
- train_gpt.py: TTT_ENABLED default changed from "1" to "0". Added comment
  explaining the fix and cross-referencing the flagged sibling PRs.
- submission.json: val_bpb set to null, val_bpb_retracted preserved for
  record. Status set to "retracted".
- README.md: Update notice at top explaining the retraction, original
  summary struck through.

Unlike PR openai#406, which had clean DIAGNOSTIC post_swa numbers in the train logs, this submission has no pre-TTT diagnostic numbers preserved, so no clean substitute BPB is available.
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Proactive compliance documentation while awaiting maintainer ruling on
hash-based eval-time n-gram caches per Issue openai#402, Issue openai#677, and PR openai#886.

No code changes. Just README documenting:
- The open dispute (valerio-oai leaning legal, abaybektursun openai#886 disputing
  via hash collision density, Robert-Sneiderman openai#900 defending Dirichlet
  formula validity)
- What this submission does (backward-looking causal n-gram cache with
  Dirichlet-Multinomial smoothing)
- What it does NOT do (no training on val_tokens, no backward passes,
  model frozen during eval)
- Explicit statement that I asked on Issue openai#402 on April 2 and will
  retract if ruled invalid

Distinct from the TTT-on-val class of violations I retracted in PR openai#1193,
PR openai#406, and PR openai#1127.
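For reference, a minimal sketch of the kind of backward-looking, Dirichlet-smoothed hash n-gram cache that commit message describes. Names, bucket count, and the alpha value are hypothetical, not the submission's code; the key property is that the cache is only updated after a position has been scored, so predictions never use information from the current or later positions.

```python
from collections import defaultdict

class CausalNgramCache:
    def __init__(self, order: int, vocab_size: int, alpha: float = 0.1, buckets: int = 1 << 20):
        self.order, self.vocab, self.alpha, self.buckets = order, vocab_size, alpha, buckets
        self.joint = defaultdict(int)    # count(context_hash, next_token)
        self.context = defaultdict(int)  # count(context_hash)

    def _hash(self, context):
        return hash(tuple(context[-self.order:])) % self.buckets

    def prob(self, context, token) -> float:
        # Dirichlet-Multinomial (add-alpha) smoothing over the vocabulary
        h = self._hash(context)
        return (self.joint[(h, token)] + self.alpha) / (self.context[h] + self.alpha * self.vocab)

    def update(self, context, token) -> None:
        # Called AFTER scoring position t with the token observed at t,
        # keeping the cache strictly backward-looking.
        h = self._hash(context)
        self.joint[(h, token)] += 1
        self.context[h] += 1
```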
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Same approach as PR openai#948 compliance note. This submission extends openai#948
with order-20 backoff but uses the same eval-time hash n-gram cache
architecture under the same community dispute (Issue openai#402, Issue openai#677,
PR openai#886, PR openai#900).

No code changes. README documents:
- The open dispute and relevant threads
- What this submission does (causal backward-looking cache, Dirichlet
  smoothing, model frozen)
- What it does NOT do (no training on val_tokens, no backward passes)
- Distinct from the TTT-on-val class I retracted in openai#1193, openai#406, openai#1127
- Will retract if maintainers rule the class invalid
@dentity007
Author

Self-flag (proactive compliance fix, April 13, 2026)

Following the community review pattern @MatoTeziTanka applied to my PR #1193 and PR #406, I audited this PR on my end and confirmed the same illegal TTT-on-val pattern.

What I found:

  • ttt_adapt() at line 936 takes val_tokens as a parameter
  • Body iterates val_tokens[raw_start:raw_end], calls loss.backward() and optimizer.step() in a single-pass loop
  • TTT_ENABLED defaulted to "1" so the reported 1.13112 BPB was computed with LoRA adapter updates on val_tokens
  • No score-first discipline: the final eval runs after all val_tokens have been trained on
  • The 'single-pass LoRA TTT' terminology in the original description does not rescue it; 1 epoch with no score-first is still illegal under Issue #402 (Invalid submissions due to information leakage during TTT) / Issue #677 (Illegal submissions megathread)

Fix pushed (commit 23dab4f):

  1. train_gpt.py: TTT_ENABLED default changed from "1" to "0" with an inline comment referencing #1193 (Non-record: Universal Transformer + Adaptive Density, val_bpb 1.4390), #406 (Non-record: 11L XSA4 + EMA + SDTTT, 3-seed mean val_bpb 1.1287), and #1376 (Record: SLOT-24 + Pre-quant TTT, val_bpb 0.7094, 3-seed mean)
  2. submission.json: val_bpb set to null, val_bpb_retracted preserves the original 1.13112 for record, status set to 'retracted'
  3. README.md: Update notice at top explaining the retraction, original summary struck through

No clean substitute BPB is available. Unlike PR #406, where the train logs preserved clean DIAGNOSTIC post_swa numbers, this submission does not have pre-TTT diagnostic numbers in the records folder. If a legal no-TTT rerun becomes available in the future, I will update this PR. Otherwise, I am treating this as withdrawn for the record track.

Other PRs audited on my end:

Thanks to @MatoTeziTanka and The Agora for the systematic compliance review. Self-flagging before the queue reaches a PR is faster than waiting for the audit.

