Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence — val_bpb 1.0587 (3-seed mean) #1550
Non-record submission. Pre-quant TTT violates Condition 3 of Issue openai#1017 (score-before-update). Submitted as a technique study documenting:
- Condition 3 boundary quantification (illegal TTT -0.044 BPB vs legal -0.002 BPB)
- Compiled TTT (torch.compile 2x speedup, applicable to legal TTT)
- Artifact budget engineering (VE dim optimization, pruning analysis)

3-seed mean sliding BPB: 1.05869 (std 0.00038). All artifacts under 16,000,000 bytes. Zero pruning needed.
@translatingthename — compliance flag. Running our static compliance checker against… There is additionally a SLOT-style per-batch checker: https://github.com/Bortlesboat/parameter-golf-checker
Non-record: Pre-Quant AdamW TTT (Compiled) + SP8192 + Depth Recurrence
val_bpb = 1.0587 (3-seed mean, std 0.0004) | ~15.5 MB | 8xH100 SXM
Compliance
This is a non-record submission. The pre-quant TTT implementation (lines 2371-2458 of train_gpt.py) runs 6 AdamW epochs on the full validation token stream before GPTQ quantization, then scores the same tokens via sliding window. This violates Condition 3 of Issue #1017 (score-before-update) and is structurally identical to the pattern in closed PR #1376 and withdrawn PR #1485.

3-Seed Results
Mean sliding BPB across 3 seeds: 1.05869 (std 0.00038).
Contributions
1. Quantifying the Condition 3 boundary
Two measured points bound the illegal TTT contribution at -0.044 BPB (1.103 - 1.059):
- post-EMA, before TTT: 1.103 BPB
- post-GPTQ sliding, after TTT and quantization: 1.059 BPB
For comparison, the legal score-first TTT in merged PR #1493 contributes approximately -0.002 BPB. This is not an apples-to-apples comparison — different optimizers, epoch counts, chunk sizes, and base models — but the order-of-magnitude gap illustrates why Condition 3 is load-bearing.
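Concretely, the two orderings differ only in when scoring happens relative to the weight update. Below is a minimal sketch of the contrast, with hypothetical helper names (`get_batches`, `score_window`, `score_chunk`) standing in for the actual train_gpt.py machinery, and assuming the model's forward pass returns the loss:

```python
import torch

def prequant_ttt_bpb(model, opt, val_tokens, get_batches, score_window):
    # Illegal ordering (this PR): adapt on the FULL validation stream
    # first, then score the very same tokens with a sliding window.
    model.train()
    for _ in range(6):                        # 6 AdamW epochs over val
        for batch in get_batches(val_tokens):
            opt.zero_grad(set_to_none=True)
            model(batch).backward()           # assumed: forward returns loss
            opt.step()
    return score_window(model, val_tokens)    # tokens already seen -> inflated gain

def score_first_ttt_bpb(model, opt, val_chunks, score_chunk):
    # Legal ordering (Condition 3, as in merged PR #1493): every chunk is
    # scored by weights that have never seen it; adaptation happens after.
    bits, toks = 0.0, 0
    for chunk in val_chunks:
        with torch.no_grad():
            bits += score_chunk(model, chunk)  # score first...
        opt.zero_grad(set_to_none=True)
        model(chunk).backward()                # ...then update
        opt.step()
        toks += chunk.numel()
    return bits / toks
```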
On the theoretical ceiling: Issue #1017 states: "Corpus-level TTT has a ceiling of approximately 0.0003 bits" — referring to the gain from closing the train-val distribution gap. However, the author also notes "a model that undertrained on the training distribution can still benefit." Legal TTT's -0.002 exceeds the 0.0003 distribution-gap ceiling because our 600s-capped model is undertrained — this is legitimate undertraining compensation, not memorization. Our illegal TTT's -0.044, however, is 22x larger than legal TTT on a similar architecture, a magnitude not explainable by undertraining compensation alone.
A per-epoch ablation (not performed here) would strengthen this argument: gains scaling roughly linearly with epoch count would be a memorization signature, while rapid saturation would indicate generalization. A sketch of that ablation follows.
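The ablation itself is a few lines; this sketch assumes hypothetical helpers `run_ttt_epoch` (one AdamW pass over the validation stream) and `sliding_bpb` (the sliding-window scorer), with `model`, `opt`, and `val_tokens` as in the sketch above:

```python
# Log sliding BPB after each TTT epoch and inspect the curve's shape:
# roughly constant per-epoch gains suggest memorization; gains that
# saturate after one or two epochs suggest genuine adaptation.
bpbs = [sliding_bpb(model, val_tokens)]        # epoch 0 = no-TTT baseline
for epoch in range(6):
    run_ttt_epoch(model, opt, val_tokens)
    bpbs.append(sliding_bpb(model, val_tokens))
gains = [prev - cur for prev, cur in zip(bpbs, bpbs[1:])]
print("per-epoch BPB gains:", [f"{g:.4f}" for g in gains])
```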
2. Compiled TTT (2x speedup)
`torch.compile(dynamic=False, fullgraph=True)` reduces TTT from ~860s to ~426s. It is safe in train mode with `torch.autocast`; there is no `torch.inference_mode()` tensor poisoning. A fresh model instance avoids rotary cache contamination. This applies equally to legal score-first TTT; a sketch follows.
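A minimal sketch of that setup, assuming a `base_model` whose forward pass returns the loss (names other than the torch APIs are illustrative):

```python
import copy
import torch

# Fresh model instance: TTT updates and compilation happen on a copy, so
# rotary caches on the original model are never contaminated.
ttt_model = copy.deepcopy(base_model).train()   # base_model: assumed given
opt = torch.optim.AdamW(ttt_model.parameters(), lr=1e-4)  # lr illustrative

# Fixed shapes (dynamic=False) and a single graph (fullgraph=True) are what
# buy the ~2x speedup; recompilation would eat the gain.
fwd = torch.compile(ttt_model, dynamic=False, fullgraph=True)

for batch in batches:                           # batches: assumed iterable
    opt.zero_grad(set_to_none=True)
    # train mode + autocast, NOT torch.inference_mode(): tensors created
    # under inference_mode cannot be reused by autograd later.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = fwd(batch)                       # assumed: forward returns loss
    loss.backward()
    opt.step()
```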
3. Artifact budget engineering
With SP8192, fitting under 16 MB required component-level analysis: the value-embedding dim was cut to 44 on L9-10, and the final artifacts total under 16,000,000 bytes with zero pruning needed. A sketch of the audit follows.
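This is roughly what such a component-level audit looks like, assuming the artifact is a PyTorch state dict saved to disk (the path is illustrative; the threshold is the track's 16,000,000-byte budget):

```python
import os
import torch

def component_bytes(state_dict):
    """Bytes per top-level component of a checkpoint, largest first."""
    sizes = {}
    for name, tensor in state_dict.items():
        top = name.split(".")[0]
        sizes[top] = sizes.get(top, 0) + tensor.numel() * tensor.element_size()
    return sorted(sizes.items(), key=lambda kv: -kv[1])

sd = torch.load("artifact.pt", map_location="cpu")  # path illustrative
for comp, nbytes in component_bytes(sd):
    print(f"{comp:24s} {nbytes:>12,d} B")

BUDGET = 16_000_000                                 # track limit in bytes
print(f"on-disk total: {os.path.getsize('artifact.pt'):,d} / {BUDGET:,d}")
```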
Compliance Statement
Violates Condition 3 of Issue #1017: lines 2371-2458 run 6 AdamW epochs on the full validation stream before GPTQ, and the same tokens are scored afterward, with no score-before-adapt discipline. Structurally identical to closed PR #1376 and withdrawn PR #1485.
Architecture
11L × 512d × 8H/4KV, MLP 4× (2048), LeakyReLU(0.5)², Partial RoPE (16/64), LN scale, tied embeddings, softcap=30. Depth recurrence L3-5 (14 virtual). Parallel residuals L7+. XSA all layers. VE dim=44 L9-10. SmearGate.
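As a reading aid, the same spec rendered as a config sketch (field names are illustrative, not the actual train_gpt.py hyperparameters):

```python
from dataclasses import dataclass

@dataclass
class ArchConfig:
    n_layer: int = 11                 # physical layers
    d_model: int = 512
    n_head: int = 8
    n_kv_head: int = 4                # GQA: 8 query heads, 4 KV heads
    d_mlp: int = 2048                 # 4x expansion, LeakyReLU(0.5) squared
    head_dim: int = 64
    rope_dims: int = 16               # partial RoPE: 16 of 64 head dims
    logit_softcap: float = 30.0
    tied_embeddings: bool = True
    recur_layers: tuple = (3, 4, 5)   # depth recurrence -> 14 virtual layers
    parallel_resid_from: int = 7      # parallel residuals from L7 on
    ve_dim: int = 44                  # value embeddings, dim 44 on L9-10
    # plus XSA on all layers, SmearGate, and LN scale (not modeled here)
```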
Reproduction
Credits
PR #1394 @clarkkev, PR #1493 @bigbag, PR #1485 @ndokutovich, PR #1412 @Robby955, PR #1204 @msisovic, PR #1285 @dexhunter
Checklist
records/track_10min_16mb/