
Record: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed) #1127

Open
dentity007 wants to merge 2 commits into openai:main from dentity007:submission/dentity007-dim480-1.1311

Conversation

@dentity007

11L XSA4 + EMA + LoRA TTT + Partial RoPE + GPTQ-lite (dim480)

val_bpb: 1.13112 (3-seed mean, std 0.00051) | ~15.5 MB | 8×H100 SXM (Reykjavík, Iceland)

The architecture from PR #462, compressed to fit the 16 MB budget with MODEL_DIM=480.

3-seed validation

| Seed | Val BPB | Steps | ms/step | Size (bytes) |
|------|---------|-------|---------|--------------|
| 1337 | 1.13041826 | 7,847 | 76.47 | 15,489,698 |
| 42 | 1.13161931 | 7,958 | 75.40 | 15,462,345 |
| 7 | 1.13133583 | 7,935 | 75.62 | 15,436,056 |
| Mean | 1.13112 (std 0.00051) | | | |

Key components

  • 11L U-Net, MODEL_DIM=480, NUM_KV_HEADS=4, MLP_HIDDEN=1536 (3×)
  • EMA decay=0.9985
  • Partial RoPE (16/64 dims)
  • Late QAT int6 STE (threshold 0.15)
  • Single-pass LoRA TTT (rank=8, lr=0.01, 1 epoch)
  • XSA on deepest 4 layers
  • BigramHash (8192 buckets, dim=128) + SmearGate
  • int6 + zstd-22 compression (3.83× ratio; see the sketch below)
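
For reference, a minimal sketch of how an int6 + zstd-22 artifact ratio can be measured. This is illustrative only: the per-tensor symmetric scheme, function names, and the unpacked one-byte-per-value storage are assumptions, not the submission's code.

```python
import numpy as np
import zstandard as zstd  # assumes the `zstandard` package is installed

def quantize_int6(weights: np.ndarray):
    # Symmetric 6-bit range is [-32, 31]; a single scale per tensor in this sketch.
    scale = max(float(np.abs(weights).max()), 1e-8) / 31.0
    q = np.clip(np.round(weights / scale), -32, 31).astype(np.int8)
    return q, scale

def artifact_ratio(weights: np.ndarray):
    q, _ = quantize_int6(weights)
    # Stored one value per byte here; a real packer would use 6 bits per value.
    blob = zstd.ZstdCompressor(level=22).compress(q.tobytes())
    return len(blob), weights.nbytes / len(blob)

if __name__ == "__main__":
    w = np.random.randn(480, 1536).astype(np.float32)  # one MLP-sized matrix at dim 480
    size, ratio = artifact_ratio(w)
    print(f"compressed bytes: {size:,}  ratio vs fp32: {ratio:.2f}x")
```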

Compliance

  • Training: ≤600s on 8×H100 SXM
  • Artifact: ~15.5MB (under 16,000,000 bytes)
  • 3-seed verified

Note on reproduction

The current runpod/parameter-golf:latest image (PyTorch 2.9.1+cu128) requires a manual FlashAttention-3 install:

pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

@MatoTeziTanka

Community Review — Record: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed)

BPB: 1.13112 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA 8096c1b3c87a, file records/track_10min_16mb/2026-03-30_11L_XSA4_EMA_LoRATTT_PartialRoPE_dim480_1.1311/train_gpt.py):

At line 936 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

ttt_adapt(args, base_model, device, val_tokens, rank, world_size, log_fn) — for epoch in range(args.ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data (not val_tokens) with score-first-per-chunk discipline. The distinction is on the function signature itself — the argument tensor passed in.
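For concreteness, a minimal sketch of the score-first-per-chunk ordering described above. All names (score_first_ttt, adapter_optimizer, chunks) are hypothetical and this is not the repository's code; the point is only the ordering: each chunk is scored under torch.no_grad() before the adapter is updated on it, so no token contributes to the reported metric after the model has trained on it.

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, adapter_optimizer, chunks):
    """chunks: iterable of (inputs, targets) LongTensor pairs. Hypothetical names."""
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) score this chunk with the adapter state from BEFORE it has seen the chunk
        with torch.no_grad():
            logits = model(inputs)
            nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        # 2) only afterwards update the adapter on the same chunk
        loss = F.cross_entropy(model(inputs).flatten(0, 1), targets.flatten())
        adapter_optimizer.zero_grad()
        loss.backward()
        adapter_optimizer.step()
    # bits per token from the pre-update scores only (bpb would further divide by bytes per token)
    return total_nll / (total_tokens * math.log(2))
```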

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=64073 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of val_tokens (per #1416/#1423 reference implementation) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

…ka review pattern

Proactive self-flag before the Agora compliance review reaches this PR.
Same illegal pattern as PR openai#1193 and PR openai#406: ttt_adapt() runs on val_tokens
for 1 epoch with no score-first discipline before the final eval.

Changes:
- train_gpt.py: TTT_ENABLED default changed from "1" to "0". Added comment
  explaining the fix and cross-referencing the flagged sibling PRs.
- submission.json: val_bpb set to null, val_bpb_retracted preserved for
  record. Status set to "retracted".
- README.md: Update notice at top explaining the retraction, original
  summary struck through.

Unlike PR openai#406, which had clean DIAGNOSTIC post_swa numbers in the train logs, this submission has no pre-TTT diagnostic numbers preserved, so no clean substitute BPB is available.
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Proactive compliance documentation while awaiting maintainer ruling on
hash-based eval-time n-gram caches per Issue openai#402, Issue openai#677, and PR openai#886.

No code changes. Just README documenting:
- The open dispute (valerio-oai leaning legal, abaybektursun openai#886 disputing
  via hash collision density, Robert-Sneiderman openai#900 defending Dirichlet
  formula validity)
- What this submission does (backward-looking causal n-gram cache with
  Dirichlet-Multinomial smoothing)
- What it does NOT do (no training on val_tokens, no backward passes,
  model frozen during eval)
- Explicit statement that I asked on Issue openai#402 on April 2 and will
  retract if ruled invalid

Distinct from the TTT-on-val class of violations I retracted in PR openai#1193,
PR openai#406, and PR openai#1127.
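For reference, a minimal sketch of the kind of backward-looking, Dirichlet-smoothed hash n-gram cache that commit message describes. Names, bucket count, and the alpha value are hypothetical, not the submission's code; the key property is that the cache is only updated after a position has been scored, so predictions never use information from the current or later positions.

```python
from collections import defaultdict

class CausalNgramCache:
    def __init__(self, order: int, vocab_size: int, alpha: float = 0.1, buckets: int = 1 << 20):
        self.order, self.vocab, self.alpha, self.buckets = order, vocab_size, alpha, buckets
        self.joint = defaultdict(int)    # count(context_hash, next_token)
        self.context = defaultdict(int)  # count(context_hash)

    def _hash(self, context):
        return hash(tuple(context[-self.order:])) % self.buckets

    def prob(self, context, token) -> float:
        # Dirichlet-Multinomial (add-alpha) smoothing over the vocabulary
        h = self._hash(context)
        return (self.joint[(h, token)] + self.alpha) / (self.context[h] + self.alpha * self.vocab)

    def update(self, context, token) -> None:
        # Called AFTER scoring position t with the token observed at t,
        # keeping the cache strictly backward-looking.
        h = self._hash(context)
        self.joint[(h, token)] += 1
        self.context[h] += 1
```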
dentity007 added a commit to NathanMaine/parameter-golf that referenced this pull request Apr 13, 2026
Same approach as PR openai#948 compliance note. This submission extends openai#948
with order-20 backoff but uses the same eval-time hash n-gram cache
architecture under the same community dispute (Issue openai#402, Issue openai#677,
PR openai#886, PR openai#900).

No code changes. README documents:
- The open dispute and relevant threads
- What this submission does (causal backward-looking cache, Dirichlet
  smoothing, model frozen)
- What it does NOT do (no training on val_tokens, no backward passes)
- Distinct from the TTT-on-val class I retracted in openai#1193, openai#406, openai#1127
- Will retract if maintainers rule the class invalid
@dentity007
Author

Self-flag (proactive compliance fix, April 13, 2026)

Following the community review pattern @MatoTeziTanka applied to my PR #1193 and PR #406, I audited this PR on my end and confirmed the same illegal TTT-on-val pattern.

What I found:

  • ttt_adapt() at line 936 takes val_tokens as a parameter
  • Body iterates val_tokens[raw_start:raw_end], calls loss.backward() and optimizer.step() in a single-pass loop
  • TTT_ENABLED defaulted to "1" so the reported 1.13112 BPB was computed with LoRA adapter updates on val_tokens
  • No score-first discipline: the final eval runs after all val_tokens have been trained on
  • The 'single-pass LoRA TTT' terminology in the original description does not rescue it; 1 epoch with no score-first is still illegal under Issue #402 (Invalid submissions due to information leakage during TTT) / Issue #677 (Illegal submissions megathread)

Fix pushed (commit 23dab4f):

  1. train_gpt.py: TTT_ENABLED default changed from "1" to "0" with an inline comment referencing #1193 (Non-record: Universal Transformer + Adaptive Density, val_bpb 1.4390), #406 (Non-record: 11L XSA4 + EMA + SDTTT, 3-seed mean val_bpb 1.1287), and #1376 (Record: SLOT-24 + Pre-quant TTT, val_bpb 0.7094, 3-seed mean)
  2. submission.json: val_bpb set to null, val_bpb_retracted preserves the original 1.13112 for record, status set to 'retracted'
  3. README.md: Update notice at top explaining the retraction, original summary struck through

No clean substitute BPB is available. Unlike PR #406, where the train logs preserved clean DIAGNOSTIC post_swa numbers, this submission does not have pre-TTT diagnostic numbers in the records folder. If a legal no-TTT rerun becomes available in the future, I will update this PR. Otherwise, I am treating this as withdrawn for the record track.

Other PRs audited on my end:

Thanks to @MatoTeziTanka and The Agora for the systematic compliance review. Self-flagging before the queue reaches a PR is faster than waiting for the audit.

