Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean) #1060
3-seed mean val_bpb: 1.1123 (std 0.0005). All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
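For readers unfamiliar with the quantization step, here is a minimal sketch of what full-Hessian GPTQ with Cholesky error compensation does, assuming the standard GPTQ recipe (the actual per-group scales, column ordering, and damping in train_gpt.py may differ): quantize W one column at a time and push each column's rounding error onto the not-yet-quantized columns via the upper Cholesky factor of the inverse Hessian.

```python
import torch

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Toy full-Hessian GPTQ with Cholesky error compensation.
    Per-tensor symmetric scale for brevity; real records typically
    use per-group scales and column reordering."""
    W = W.clone()
    d = W.shape[1]
    H = H + damp * torch.diag(H).mean() * torch.eye(d, dtype=H.dtype, device=H.device)
    # Upper Cholesky factor of H^-1 supplies the compensation coefficients.
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().max() / qmax
    Q = torch.zeros_like(W)
    for j in range(d):
        w = W[:, j]
        q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        Q[:, j] = q
        # Push this column's rounding error onto the remaining columns.
        err = (w - q) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]
    return Q
```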
Seed logs now generated with the same 96,398-byte train_gpt.py that ships in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
- Seed 1337: 1.1118 BPB, 15,973,962 bytes
- Seed 42: 1.1127 BPB, 15,980,438 bytes
- Seed 2025: 1.1121 BPB, 15,983,626 bytes
- Mean: 1.1122 ± 0.0004
Updated: re-verified all 3 seeds with the stripped train_gpt.py (96,398 bytes) that ships in this record. Previous logs were generated with a pre-strip version (111,130 bytes) that included unused code paths. Scores are unchanged — 3-seed mean 1.1122 ± 0.0004, all artifacts under 16MB. Code size and logs are now fully consistent.
Follow-up cleanup for the stripped submission artifacts only. What changed:
Why:
I re-ran the local rule checker on all 3 bundled logs after the cleanup and they pass cleanly.
Competition moved while we were experimenting locally:
- PR openai#634: 1.1178 BPB (Full GPTQ + XSA-all + selective pruning)
- PR openai#1060: 1.1122 BPB (+ coprime loader + BigramHash 2816)

Our contribution: TTT periodic reset on the PR openai#1060 base. PR openai#1060 found TTT unnecessary with Full GPTQ, but they didn't test TTT with anti-drift reset. If TTT drift was the reason it stopped helping, a reset could unlock further gains.

Files:
- train_gpt_ours.py — PR openai#1060 + TTT reset mechanism
- train_gpt_pr634.py — Full GPTQ reference (for study)
- train_gpt_pr1060.py — Original PR openai#1060 (for comparison)
- run_h100.sh — Train once, sweep 4 TTT configs

TTT configs tested:
- A: SOTA (lr=0.002, 3ep) — baseline TTT
- B: PR openai#1039 (lr=0.0025, 4ep) — tuned TTT
- C: B + reset/100 — anti-drift, moderate
- D: B + reset/50 — anti-drift, aggressive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
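A minimal sketch of the anti-drift reset idea behind configs C and D, with assumed names (score_fn, update_fn, and the chunk granularity are placeholders, not the actual eval API): the adapter is snapshotted before TTT begins, each chunk is scored before the adapter updates on it (the score-first rule), and every reset_every chunks the adapter snaps back to the snapshot so drift cannot accumulate.

```python
import copy

def ttt_eval_with_reset(model, chunks, score_fn, update_fn, reset_every=100):
    """Score-first-per-chunk TTT with periodic anti-drift reset.
    score_fn returns (bits, n_bytes) for a chunk; update_fn runs one
    TTT step on it. Both are placeholders for the real eval loop."""
    snapshot = copy.deepcopy(model.state_dict())  # pre-TTT weights
    total_bits, total_bytes = 0.0, 0
    for i, chunk in enumerate(chunks):
        if reset_every and i > 0 and i % reset_every == 0:
            model.load_state_dict(snapshot)  # snap back: drift cannot accumulate
        bits, n = score_fn(model, chunk)     # score BEFORE updating (rule-legal)
        total_bits += bits
        total_bytes += n
        update_fn(model, chunk)              # then adapt on the chunk just scored
    return total_bits / total_bytes          # val BPB
```

In this sketch, config C corresponds to reset_every=100 and config D to reset_every=50.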
…-gram invalidation
- PR openai#771 closed (rule violation: multi-epoch TTT re-scored same eval tokens)
- N-gram eval cache banned: 33+ PRs closed by @valerio-oai on 2026-03-27 due to normalization bug; correct n-gram achieves ~1.51 BPB (worse than baseline)
- Update merged SOTA to 1.1194 (PR openai#549, was 1.1228)
- New target: PR openai#1060 (1.1122) — Full Hessian GPTQ + XSA-all + Coprime-stride
- Add Lessons 17-20 and v8.0 strategy to CLAUDE.md
- Add 2026-03-29 daily research report to logs/daily_research.md

https://claude.ai/code/session_01GabEptdqRohHFtkKNZNL17
…(3-seed mean)

3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003). Built on PR openai#549 + PR openai#1060 with optimized GPTQ reserve (10s vs 14s).
… reset

Combines the best of three approaches:
- PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all
- PR openai#1072 (1.117): fused Triton MLP (matmul+activation, 70ms/step)
- Ours: TTT periodic reset (anti-drift)

Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations = best training throughput + best quantization + best eval.

Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only). Falls back to the standard path on non-Hopper GPUs.

TTT sweep tests 4 configs on the same trained checkpoint: sota_ttt, pr1039, reset/100, reset/50.

Total H100 time: ~10min train + 4×7min TTT ≈ 40 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
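A hedged sketch of how such a Hopper-only fast path is typically gated (fused_mlp is a placeholder for PR openai#1072's Triton kernel, which is not reproduced here, and the activation is illustrative): check the CUDA compute capability and take the fused path only on SM 9.0+.

```python
import torch
import torch.nn.functional as F

def mlp_forward(x, w1, w2, fused_mlp=None):
    # TMA TensorDescriptors need Hopper (SM 9.0+); everything older
    # takes the plain two-matmul path. Activation choice is illustrative.
    if fused_mlp is not None and torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 9:
            return fused_mlp(x, w1, w2)  # fused matmul + activation kernel
    return F.relu(x @ w1) @ w2           # standard fallback path
```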
…Agreement

Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal n-gram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path.

Made-with: Cursor
Critical realization: our ported innovations (EngramLite, gated skips, LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline. PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port of PR openai#1089 innovations doesn't capture their interactions. Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s. Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used the old stack. We use PR openai#1060's modern stack (GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer. Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
…eferred (upstream stateless)

Two-subagent investigation of the coprime-stride loader from PR openai#1099/openai#1060. First subagent confirmed 26 PRs use it, the top merged record uses it, ~0.01 BPB estimated gain. Second subagent extracted the exact upstream DistributedTokenLoader code: it's COMPLETELY STATELESS (~10 lines, just slices TokenStream). PR openai#1099's implementation is NOT a small patch — it's a fundamental rewrite adding stateful per-shard cursor management. The real implementation is 60-100 LOC and needs to interact with the TokenStream class I haven't read yet.

DEFERRED because the data loader is on the critical path — a buggy patch could silently corrupt training data. Better to validate existing MS3/EL/MR cycle 2+3 results first. Spec captured for the next focused research fire.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
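To illustrate what "completely stateless" means here, a minimal sketch under assumed names (the TokenStream slicing, rank wiring, and batch shape are guesses, not the actual upstream code):

```python
def distributed_batch(stream, step, rank, world_size, batch_tokens):
    """Stateless slicing: each rank's batch is a pure function of
    (step, rank), so there are no per-shard cursors to save, restore,
    or silently corrupt."""
    start = (step * world_size + rank) * batch_tokens
    return stream[start : start + batch_tokens]
```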
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135, which use TOKEN-level coprime stride. Token-level needs a 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL variant: modify _advance_file() to use a coprime stride instead of +1, so nearby training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in the TokenStream class (none of the existing 24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1, falls back to the stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards before repeating. Max spacing diversity = better gradient noise reduction. Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY at near-zero risk vs. a 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram. This is the FIRST data-side patch in our 24-patch stack. Tests a completely new vector after the "neutrality plateau" of architectural/optimizer/training-time patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
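A minimal sketch of the shard-level idea (function names are illustrative; the real 13-LOC patch lives inside TokenStream and is not reproduced here). Since gcd(s, N) = 1 makes i -> (i*s) mod N a permutation of 0..N-1, every shard is visited exactly once per cycle:

```python
import math

def pick_coprime_stride(num_shards, preferred=3):
    """Find a stride s with gcd(s, N) == 1, so i -> (i * s) % N is a
    permutation: all N shards are visited before the cycle repeats."""
    if num_shards < 3:
        return 1  # degenerate case: keep the original +1 behaviour
    s = max(2, preferred % num_shards)
    while math.gcd(s, num_shards) != 1:
        s += 1
    return s

def next_shard(current, num_shards, stride):
    # Hypothetical stand-in for the +1 advance inside _advance_file().
    return (current + stride) % num_shards

# e.g. N=8 shards, s=3 -> visit order 0,3,6,1,4,7,2,5: consecutive
# training steps read topically distant shards, not adjacent ones.
```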
Community Review — Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)

BPB: 1.1122 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 1124 implements the score-first-per-chunk pattern: each chunk is scored under the pre-update weights before the adapter trains on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=11, vocab=1024, code=96398 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
…eed mean) (openai#1060)

* Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all

  3-seed mean val_bpb: 1.1123 (std 0.0005). All artifacts under 16MB, all eval under 600s.

  Key changes from PR openai#549:
  - Coprime-stride multi-shard data pipeline (PR openai#726 style)
  - Full Hessian GPTQ with Cholesky error compensation
  - XSA on all 11 layers
  - BigramHash(2816×112)
  - No TTT (sliding-only outperforms on this stack)

  Built on PR openai#549 by @abaybektursun.

* fix: add run command, requirements.txt for reproducibility

* chore: strip dead code from train_gpt.py (111KB→96KB, +14KB artifact headroom)

* fix: re-verify 3 seeds with stripped train_gpt.py for full consistency

  Seed logs now generated with the same 96,398-byte train_gpt.py that ships in this record. Previous logs were from the pre-strip 111,130-byte version.

  Updated results:
  - Seed 1337: 1.1118 BPB, 15,973,962 bytes
  - Seed 42: 1.1127 BPB, 15,980,438 bytes
  - Seed 2025: 1.1121 BPB, 15,983,626 bytes
  - Mean: 1.1122 ± 0.0004

* docs(record): clean stripped submission logs

Fixes openai#1060
Summary
What's New
3-Seed Results
Compliance
See README.md for full details.