Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean) #1060
3-seed mean val_bpb: 1.1123 (std 0.0005). All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
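For readers unfamiliar with the quantization step, here is a minimal sketch of what full-Hessian GPTQ with Cholesky error compensation does, assuming the standard GPTQ recipe (the actual per-group scales, column ordering, and damping in train_gpt.py may differ): quantize W one column at a time and push each column's rounding error onto the not-yet-quantized columns via the upper Cholesky factor of the inverse Hessian.

```python
import torch

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Toy full-Hessian GPTQ with Cholesky error compensation.
    Per-tensor symmetric scale for brevity; real records typically
    use per-group scales and column reordering."""
    W = W.clone()
    d = W.shape[1]
    H = H + damp * torch.diag(H).mean() * torch.eye(d, dtype=H.dtype, device=H.device)
    # Upper Cholesky factor of H^-1 supplies the compensation coefficients.
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().max() / qmax
    Q = torch.zeros_like(W)
    for j in range(d):
        w = W[:, j]
        q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        Q[:, j] = q
        # Push this column's rounding error onto the remaining columns.
        err = (w - q) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]
    return Q
```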
Seed logs now generated with the same 96,398-byte train_gpt.py that ships in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
- Seed 1337: 1.1118 BPB, 15,973,962 bytes
- Seed 42: 1.1127 BPB, 15,980,438 bytes
- Seed 2025: 1.1121 BPB, 15,983,626 bytes
- Mean: 1.1122 ± 0.0004
Updated: re-verified all 3 seeds with the stripped train_gpt.py (96,398 bytes) that ships in this record. Previous logs were generated with a pre-strip version (111,130 bytes) that included unused code paths. Scores are unchanged — 3-seed mean 1.1122 ± 0.0004, all artifacts under 16MB. Code size and logs are now fully consistent.
Follow-up cleanup for the stripped submission artifacts only. What changed:
Why:
I re-ran the local rule checker on all 3 bundled logs after the cleanup and they pass cleanly.
Competition moved while we were experimenting locally:
- PR openai#634: 1.1178 BPB (Full GPTQ + XSA-all + selective pruning)
- PR openai#1060: 1.1122 BPB (+ coprime loader + BigramHash 2816)

Our contribution: TTT periodic reset on the PR openai#1060 base. PR openai#1060 found TTT unnecessary with Full GPTQ, but they didn't test TTT with anti-drift reset. If TTT drift was the reason it stopped helping, a reset could unlock further gains.

Files:
- train_gpt_ours.py — PR openai#1060 + TTT reset mechanism
- train_gpt_pr634.py — Full GPTQ reference (for study)
- train_gpt_pr1060.py — Original PR openai#1060 (for comparison)
- run_h100.sh — Train once, sweep 4 TTT configs

TTT configs tested:
- A: SOTA (lr=0.002, 3ep) — baseline TTT
- B: PR openai#1039 (lr=0.0025, 4ep) — tuned TTT
- C: B + reset/100 — anti-drift, moderate
- D: B + reset/50 — anti-drift, aggressive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
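A minimal sketch of the anti-drift reset idea behind configs C and D, with assumed names (score_fn, update_fn, and the chunk granularity are placeholders, not the actual eval API): the adapter is snapshotted before TTT begins, each chunk is scored before the adapter updates on it (the score-first rule), and every reset_every chunks the adapter snaps back to the snapshot so drift cannot accumulate.

```python
import copy

def ttt_eval_with_reset(model, chunks, score_fn, update_fn, reset_every=100):
    """Score-first-per-chunk TTT with periodic anti-drift reset.
    score_fn returns (bits, n_bytes) for a chunk; update_fn runs one
    TTT step on it. Both are placeholders for the real eval loop."""
    snapshot = copy.deepcopy(model.state_dict())  # pre-TTT weights
    total_bits, total_bytes = 0.0, 0
    for i, chunk in enumerate(chunks):
        if reset_every and i > 0 and i % reset_every == 0:
            model.load_state_dict(snapshot)  # snap back: drift cannot accumulate
        bits, n = score_fn(model, chunk)     # score BEFORE updating (rule-legal)
        total_bits += bits
        total_bytes += n
        update_fn(model, chunk)              # then adapt on the chunk just scored
    return total_bits / total_bytes          # val BPB
```

In this sketch, config C corresponds to reset_every=100 and config D to reset_every=50.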
…-gram invalidation
- PR openai#771 closed (rule violation: multi-epoch TTT re-scored same eval tokens)
- N-gram eval cache banned: 33+ PRs closed by @valerio-oai on 2026-03-27 due to normalization bug; correct n-gram achieves ~1.51 BPB (worse than baseline)
- Update merged SOTA to 1.1194 (PR openai#549, was 1.1228)
- New target: PR openai#1060 (1.1122) — Full Hessian GPTQ + XSA-all + Coprime-stride
- Add Lessons 17-20 and v8.0 strategy to CLAUDE.md
- Add 2026-03-29 daily research report to logs/daily_research.md

https://claude.ai/code/session_01GabEptdqRohHFtkKNZNL17
…(3-seed mean)

3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003). Built on PR openai#549 + PR openai#1060 with optimized GPTQ reserve (10s vs 14s).
… reset

Combines the best of three approaches:
- PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all
- PR openai#1072 (1.117): fused Triton MLP (matmul+activation, 70ms/step)
- Ours: TTT periodic reset (anti-drift)

Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations = best training throughput + best quantization + best eval.

Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only). Falls back to the standard path on non-Hopper GPUs.

TTT sweep tests 4 configs on the same trained checkpoint: sota_ttt, pr1039, reset/100, reset/50.

Total H100 time: ~10min train + 4×7min TTT ≈ 40 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
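A hedged sketch of how such a Hopper-only fast path is typically gated (fused_mlp is a placeholder for PR openai#1072's Triton kernel, which is not reproduced here, and the activation is illustrative): check the CUDA compute capability and take the fused path only on SM 9.0+.

```python
import torch
import torch.nn.functional as F

def mlp_forward(x, w1, w2, fused_mlp=None):
    # TMA TensorDescriptors need Hopper (SM 9.0+); everything older
    # takes the plain two-matmul path. Activation choice is illustrative.
    if fused_mlp is not None and torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 9:
            return fused_mlp(x, w1, w2)  # fused matmul + activation kernel
    return F.relu(x @ w1) @ w2           # standard fallback path
```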
…Agreement

Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal n-gram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path.

Made-with: Cursor
Critical realization: our ported innovations (EngramLite, gated skips, LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline. PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port of PR openai#1089 innovations doesn't capture their interactions. Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s. Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used the old stack. We use PR openai#1060's modern stack (GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer. Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
…eferred (upstream stateless)

Two-subagent investigation of the coprime-stride loader from PR openai#1099/openai#1060. First subagent confirmed 26 PRs use it, the top merged record uses it, ~0.01 BPB estimated gain. Second subagent extracted the exact upstream DistributedTokenLoader code: it's COMPLETELY STATELESS (~10 lines, just slices TokenStream). PR openai#1099's implementation is NOT a small patch — it's a fundamental rewrite adding stateful per-shard cursor management. The real implementation is 60-100 LOC and needs to interact with the TokenStream class I haven't read yet.

DEFERRED because the data loader is on the critical path — a buggy patch could silently corrupt training data. Better to validate existing MS3/EL/MR cycle 2+3 results first. Spec captured for the next focused research fire.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
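To illustrate what "completely stateless" means here, a minimal sketch under assumed names (the TokenStream slicing, rank wiring, and batch shape are guesses, not the actual upstream code):

```python
def distributed_batch(stream, step, rank, world_size, batch_tokens):
    """Stateless slicing: each rank's batch is a pure function of
    (step, rank), so there are no per-shard cursors to save, restore,
    or silently corrupt."""
    start = (step * world_size + rank) * batch_tokens
    return stream[start : start + batch_tokens]
```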
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135, which use TOKEN-level coprime stride. Token-level needs a 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL variant: modify _advance_file() to use a coprime stride instead of +1, so nearby training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in the TokenStream class (none of the existing 24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1, falls back to the stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards before repeating. Max spacing diversity = better gradient noise reduction. Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY at near-zero risk vs. a 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram. This is the FIRST data-side patch in our 24-patch stack. Tests a completely new vector after the "neutrality plateau" of architectural/optimizer/training-time patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
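A minimal sketch of the shard-level idea (function names are illustrative; the real 13-LOC patch lives inside TokenStream and is not reproduced here). Since gcd(s, N) = 1 makes i -> (i*s) mod N a permutation of 0..N-1, every shard is visited exactly once per cycle:

```python
import math

def pick_coprime_stride(num_shards, preferred=3):
    """Find a stride s with gcd(s, N) == 1, so i -> (i * s) % N is a
    permutation: all N shards are visited before the cycle repeats."""
    if num_shards < 3:
        return 1  # degenerate case: keep the original +1 behaviour
    s = max(2, preferred % num_shards)
    while math.gcd(s, num_shards) != 1:
        s += 1
    return s

def next_shard(current, num_shards, stride):
    # Hypothetical stand-in for the +1 advance inside _advance_file().
    return (current + stride) % num_shards

# e.g. N=8 shards, s=3 -> visit order 0,3,6,1,4,7,2,5: consecutive
# training steps read topically distant shards, not adjacent ones.
```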
Community Review — Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)

BPB: 1.1122 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 1124 implements the score-first-per-chunk pattern: each chunk is scored under the pre-update weights before the adapter trains on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=11, vocab=1024, code=96398 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
…eed mean) (openai#1060)

* Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all

  3-seed mean val_bpb: 1.1123 (std 0.0005). All artifacts under 16MB, all eval under 600s.

  Key changes from PR openai#549:
  - Coprime-stride multi-shard data pipeline (PR openai#726 style)
  - Full Hessian GPTQ with Cholesky error compensation
  - XSA on all 11 layers
  - BigramHash(2816×112)
  - No TTT (sliding-only outperforms on this stack)

  Built on PR openai#549 by @abaybektursun.

* fix: add run command, requirements.txt for reproducibility

* chore: strip dead code from train_gpt.py (111KB→96KB, +14KB artifact headroom)

* fix: re-verify 3 seeds with stripped train_gpt.py for full consistency

  Seed logs now generated with the same 96,398-byte train_gpt.py that ships in this record. Previous logs were from the pre-strip 111,130-byte version.

  Updated results:
  - Seed 1337: 1.1118 BPB, 15,973,962 bytes
  - Seed 42: 1.1127 BPB, 15,980,438 bytes
  - Seed 2025: 1.1121 BPB, 15,983,626 bytes
  - Mean: 1.1122 ± 0.0004

* docs(record): clean stripped submission logs

Fixes openai#1060
Summary
What's New
3-Seed Results
Compliance
See README.md for full details.