Record: dTTT + BigramHash 3072×112 — val_bpb 1.0800 (3-seed mean) #1408
aamodbhatt wants to merge 1 commit into openai:main
Conversation
Discriminative pre-quant AdamW TTT (per-block LR 0.3x-1.0x, 10 epochs, freeze=0) on BigramHash 3072x112 base. Builds on PR openai#1351 dTTT framework; BigramHash scaled from 2048x128 to 3072x112. 3-seed mean 1.0800 (std 0.0002), all artifacts under 16MB.
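For context on the base model's bigram path: a hashed bigram embedding of shape 3072×112 maps each (prev_token, cur_token) pair into a fixed table of learned vectors. A minimal sketch under those assumptions — the class name, hash constant, and padding choice are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hashed bigram embedding: each (prev_token, cur_token) pair is
    hashed into one of n_buckets learned vectors (here 3072 x 112,
    the table shape this PR scales up to from 2048 x 128)."""

    def __init__(self, n_buckets: int = 3072, dim: int = 112):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Shift right to pair each token with its predecessor
        # (position 0 is padded with token id 0 -- an assumption).
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0
        # Cheap multiplicative hash of the pair into the bucket range;
        # the constant is arbitrary, chosen only for illustration.
        idx = (prev * 1000003 + tokens) % self.n_buckets
        return self.table(idx)
```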
Corrections:
- N-gram Tilt bug: the PR openai#1420 kernel is non-causal; PR openai#1437 (dexhunter) found and fixed it (pre-fix 1.07807 → post-fix 1.08091). Primary reference updated to the PR openai#1437 kernel.
- PR openai#1423 flagged illegal (pre-quant TTT, same as openai#1351/openai#1408/openai#1416).
- Added full PR openai#1421–1444 scan results.
- Best open legal PR updated: ~1.08091 (PR openai#1437), not 1.08014 (openai#1420).
- Session 8 lessons learned added to CLAUDE.md.

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
Comprehensive leaderboard of openai/parameter-golf record submissions compiled from open PRs. Each entry is classified as valid/invalid/suspect based on source-code review against the PR openai#1017 validity rules. Key findings:
- Best verified-valid score: 1.0800 BPB (PR openai#1408)
- 3 submissions confirmed invalid (pre-quant TTT, unnormalized n-gram)
- Sub-0.70 BPB submissions violate normalization requirements
- 6 submissions fully code-reviewed and verified valid

https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
Deep review of train_gpt.py reveals ttt_adapt_adamw() trains on val data for 10 full epochs (TTT_EPOCHS=10, TTT_ENABLED=1 by default) before quantization. This is the same pre-quantization TTT violation as PRs openai#1423 and openai#1416 — the artifact encodes information from the entire validation set, violating strict causal dependence. The ~0.04-0.05 BPB improvement from dTTT is entirely attributable to fitting the test set. Best verified-valid score updated to 1.0801 BPB (PR openai#1420). https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
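In schematic form, the flagged pattern reads roughly as below. This is an illustrative reconstruction from the review, not the PR's actual code — the helper, the block attribute, and the exact LR schedule are assumptions: per-block AdamW groups with a 0.3x–1.0x LR ramp are fit to the full validation set for 10 epochs, and only afterwards is the model quantized and scored.

```python
import torch

def ttt_adapt_adamw(model, val_tokens, epochs=10, base_lr=3e-4):
    # Per-block AdamW groups with an LR multiplier ramping 0.3x -> 1.0x
    # across blocks (the "discriminative" part; exact schedule assumed).
    n = len(model.blocks)  # model.blocks: list of transformer blocks (assumed)
    groups = [{"params": list(b.parameters()),
               "lr": base_lr * (0.3 + 0.7 * i / max(n - 1, 1))}
              for i, b in enumerate(model.blocks)]
    opt = torch.optim.AdamW(groups)
    for _ in range(epochs):                       # 10 full passes over val
        for batch in val_tokens.split(4096):
            loss = next_token_loss(model, batch)  # CE helper, assumed
            loss.backward()
            opt.step()
            opt.zero_grad()
    # Only after this loop is the model quantized and val_bpb measured,
    # so the artifact has already seen every validation token.
```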
Local copy of aamodbhatt's train_gpt.py from PR openai#1408 used during the thorough validity review that identified the pre-quant dTTT violation (10 epochs on val data). https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3: with NUM_LOOPS=2 and LOOP_END=5 this gives a 3-layer recurrence over layers 3/4/5 instead of a 2-layer one over 4/5 (see the sketch after this comment). PRs openai#1485 / openai#1471 / openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5: PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, and openai#1408 are at 5; openai#1482 is at 5.25. PR openai#1477's default of 4 is below the leaderboard curve. Expected -0.001 BPB.

C1 (pre-quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real code — the agent is researching the PR openai#1485 / openai#1416 / openai#1306 implementations in the background.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
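A minimal sketch of how such loop env vars typically realize layer recurrence — the forward structure below is an assumption for illustration, not code from any of the PRs:

```python
import os

LOOP_START = int(os.environ.get("LOOP_START", "3"))  # was 4
LOOP_END = int(os.environ.get("LOOP_END", "5"))
NUM_LOOPS = int(os.environ.get("NUM_LOOPS", "2"))

def forward_blocks(blocks, x):
    """Run blocks [LOOP_START, LOOP_END] NUM_LOOPS times: with
    LOOP_START=3 and LOOP_END=5 the recurrent span covers layers
    3/4/5 instead of 4/5."""
    for i in range((LOOP_START)):
        x = blocks[i](x)
    for _ in range(NUM_LOOPS):
        for i in range(LOOP_START, LOOP_END + 1):
            x = blocks[i](x)
    for i in range(LOOP_END + 1, len(blocks)):
        x = blocks[i](x)
    return x
```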
Community Review — Record: dTTT + BigramHash 3072×112 — val_bpb 1.0800 (3-seed mean)

BPB: 1.0800 | Compliance: FLAG — pre-quant TTT runs multi-epoch on the validation data.

What I found in the code (head SHA …): at line 1208 the pre-quant TTT function takes the validation tokens and trains the adapter on them for multiple epochs before quantization and scoring.

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that got PR #1376 (stukenov) closed and that was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal pre-quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data, not the validation set.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=119311 B, SMOKE_TEST_PASS. Classification via deterministic AST-based analysis.

Verdict: COMPLIANCE FLAG — same pattern as the closed pre-quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of the validation tokens would resolve the flag.

Reviewed by @MatoTeziTanka — The Agora.
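For contrast, the score-before-train rule quoted from Issue #677 admits a streaming pattern like the following sketch — names, chunk size, and optimizer are illustrative assumptions. Each chunk is scored with adapter weights that have only seen strictly earlier chunks, and is used for an update only after it has been scored:

```python
import torch

def causal_ttt_eval(model, val_tokens, chunk=4096, lr=3e-4):
    opt = torch.optim.AdamW(model.adapter_params(), lr=lr)  # accessor assumed
    total_nats, n_tok = 0.0, 0
    for batch in val_tokens.split(chunk):
        # 1) Score this chunk BEFORE the adapter ever trains on it.
        with torch.no_grad():
            total_nats += next_token_loss(model, batch).item() * batch.numel()
            n_tok += batch.numel()
        # 2) Only now adapt on the chunk that was just scored.
        loss = next_token_loss(model, batch)  # CE helper, assumed
        loss.backward()
        opt.step()
        opt.zero_grad()
    return total_nats / n_tok  # mean loss in nats per token
```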
Record Summary
Final submitted score: val_bpb 1.0800 (std 0.0002)
Reference neural roundtrip: 1.09935 (std 0.00007)
Hardware: 8×H100 SXM | Artifact: ≤15.9 MB | Training: ≤600s
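For reference, val_bpb is bits per byte: next-token cross-entropy converted from nats to bits and normalized by the number of raw bytes scored. A one-liner, assuming the loss is accumulated as a sum in nats:

```python
import math

def bits_per_byte(total_loss_nats: float, n_bytes: int) -> float:
    # nats -> bits (divide by ln 2), then normalize per raw byte scored
    return (total_loss_nats / math.log(2)) / n_bytes
```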
What changed
3-Seed Results
Submission Checklist
records/track_10min_16mb/2026-04-06_dTTT_BH3072_11L_8xH100/

Metric Verification
final_int6_sliding_window_exact in each seed log
final_int6_roundtrip_exact in each seed log
Total submission size int6+lzma in each seed log (see the sketch below)

Credits
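The Metric Verification flags above presumably assert that the int6 quantize → dequantize roundtrip reproduces the evaluated weights exactly and that the lzma-compressed payload respects the 16 MB cap. A minimal sketch of such checks — the function names and the symmetric per-tensor scale scheme are assumptions, not the repo's actual verification code:

```python
import lzma
import numpy as np

def int6_roundtrip_exact(w: np.ndarray, scale: float) -> bool:
    """True if symmetric int6 quantization (range [-32, 31]) followed
    by dequantization reproduces w bit-exactly."""
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return bool(np.array_equal(q.astype(w.dtype) * scale, w))

def submission_size_bytes(int6_payload: bytes) -> int:
    """Size of the lzma-compressed artifact, to compare against the cap."""
    return len(lzma.compress(int6_payload, preset=9))
```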