
Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean) #1060

Merged
cocohearts merged 5 commits into openai:main from dexhunter:submission/2026-03-29-loader-fullgptq-xsa11
Apr 23, 2026

Conversation

Contributor

@dexhunter dexhunter commented Mar 29, 2026

Summary

What's New

  1. Coprime-stride multi-shard data pipeline (PR #726 style) — diverse batches from coprime-stride block sampling across shards
  2. Full Hessian GPTQ (PR #634 / PR #1019 style) — Cholesky error compensation replaces GPTQ-lite
  3. XSA on all 11 layers — extended from the last 4
  4. No TTT — sliding-only outperforms TTT on this stack (confirmed independently by PR #1019)
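For readers unfamiliar with item 1, a minimal sketch of coprime-stride block sampling follows. Names and the signature are illustrative assumptions, not this PR's actual loader API (the real loader also handles memmap shards and GPU prefetch):

```python
# Hypothetical sketch of coprime-stride block sampling; `block_order`
# is illustrative, not this PR's actual loader code.
import math
import random

def _pick_coprime_stride(n_blocks: int, rng: random.Random) -> int:
    # A stride s with gcd(s, n) == 1 makes 0, s, 2s, ... (mod n) a
    # permutation of all block indices: full coverage, no repeats.
    while True:
        s = rng.randrange(1, n_blocks)
        if math.gcd(s, n_blocks) == 1:
            return s

def block_order(n_blocks: int, seed: int = 0):
    """Yield every block index once, in coprime-stride order, so nearby
    training steps draw from widely separated parts of the data."""
    rng = random.Random(seed)
    stride = _pick_coprime_stride(n_blocks, rng)
    start = rng.randrange(n_blocks)
    for i in range(n_blocks):
        yield (start + i * stride) % n_blocks
```

Reseeding per epoch gives a fresh stride and offset while still touching every block exactly once.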

3-Seed Results

Seed   Sliding BPB   Artifact (bytes)
1337   1.1118        15,973,962
42     1.1127        15,980,438
2025   1.1121        15,983,626
Mean   1.1122

Compliance

  • 3-seed verification, all under budget
  • Standard F.cross_entropy scoring (no mixer, no cache)
  • Artifact < 16,000,000 bytes (all seeds)
  • Training < 600s, eval < 600s
  • No TTT — pure sliding window evaluation
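The "standard F.cross_entropy scoring" bullet maps to bits-per-byte roughly as below. This is a sketch under the assumption that `total_bytes` is bookkept from the raw validation text; it is not taken from the submission:

```python
# Sketch: convert summed cross-entropy (in nats) into bits per byte.
# `total_bytes` must come from the raw text, not the token count.
import math
import torch
import torch.nn.functional as F

def val_bpb(logits: torch.Tensor, targets: torch.Tensor, total_bytes: int) -> float:
    nats = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
    )
    bits = float(nats) / math.log(2)   # nats -> bits
    return bits / total_bytes          # normalize by raw byte count
```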

See README.md for full details.

3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
Seed logs now generated with the same 96,398-byte train_gpt.py that ships
in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
  Seed 1337: 1.1118 BPB, 15,973,962 bytes
  Seed 42:   1.1127 BPB, 15,980,438 bytes
  Seed 2025: 1.1121 BPB, 15,983,626 bytes
  Mean: 1.1122 ± 0.0004
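The "Full Hessian GPTQ with Cholesky error compensation" step can be sketched as follows. This is a toy illustration, not the submission's code; the round-to-nearest quantizer, scale, and damping values are assumptions:

```python
# Toy full-Hessian GPTQ with Cholesky error compensation (illustrative).
import numpy as np

def quantize_rtn(w: np.ndarray, scale: float = 0.05) -> np.ndarray:
    # Toy round-to-nearest quantizer on a fixed uniform grid.
    return np.round(w / scale) * scale

def gptq_full_hessian(W: np.ndarray, H: np.ndarray,
                      scale: float = 0.05, damp: float = 0.01) -> np.ndarray:
    """Quantize W column by column; push each column's rounding error
    onto later columns via the upper Cholesky factor of H^-1."""
    W = W.astype(np.float64).copy()
    d = W.shape[1]
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(d)  # damping for stability
    U = np.linalg.cholesky(np.linalg.inv(Hd)).T      # upper-triangular factor
    Q = np.zeros_like(W)
    for i in range(d):
        q = quantize_rtn(W[:, i], scale)
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        if i + 1 < d:
            # Shift the not-yet-quantized columns so their later rounding
            # absorbs (rather than compounds) this column's error.
            W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q
```

Compared with plain round-to-nearest, the compensation keeps the Hessian-weighted proxy loss tr(E H Eᵀ) lower, because each rounding decision accounts for input correlations captured by H.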
@dexhunter dexhunter changed the title from "Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)" to "Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)" on Mar 29, 2026
@dexhunter
Contributor Author

Updated: re-verified all 3 seeds with the stripped train_gpt.py (96,398 bytes) that ships in this record. Previous logs were generated with a pre-strip version (111,130 bytes) that included unused code paths. Scores are unchanged — 3-seed mean 1.1122 ± 0.0004, all artifacts under 16MB. Code size and logs are now fully consistent.

resouer added a commit to resouer/parameter-golf that referenced this pull request Mar 29, 2026
@dexhunter
Contributor Author

Follow-up cleanup for the stripped submission artifacts only.

What changed:

  • replaced the bundled train_seed1337.log short extract with the clean extract from the actual stripped-code run log
  • clarified in the record README that all 3 bundled seed results and the included train_gpt.py are from the stripped submission script (Code size: 96,398 bytes)
  • clarified reproduction from within the records folder and tightened the eval/rule-compliance wording

Why:

  • the previous train_seed1337.log extract accidentally included a launcher traceback / truncated preamble from an earlier invocation, which made the record bundle look inconsistent even though the underlying stripped run was valid
  • there is no model/code/score change here; all 3 seeds already match the stripped script, and the recorded metrics are unchanged

I re-ran the local rule checker on all 3 bundled logs after the cleanup and they pass cleanly.

icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 29, 2026
Competition moved while we were experimenting locally:
  PR openai#634: 1.1178 BPB (Full GPTQ + XSA-all + selective pruning)
  PR openai#1060: 1.1122 BPB (+ coprime loader + BigramHash 2816)

Our contribution: TTT periodic reset on the PR openai#1060 base.
PR openai#1060 found TTT unnecessary with Full GPTQ, but they
didn't test TTT with anti-drift reset. If TTT drift was the
reason it stopped helping, reset could unlock further gains.

Files:
  train_gpt_ours.py  — PR openai#1060 + TTT reset mechanism
  train_gpt_pr634.py — Full GPTQ reference (for study)
  train_gpt_pr1060.py — Original PR openai#1060 (for comparison)
  run_h100.sh — Train once, sweep 4 TTT configs

TTT configs tested:
  A: SOTA (lr=0.002, 3ep) — baseline TTT
  B: PR openai#1039 (lr=0.0025, 4ep) — tuned TTT
  C: B + reset/100 — anti-drift, moderate
  D: B + reset/50 — anti-drift, aggressive

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
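A minimal sketch of what configs C/D add over B, under loose assumptions: `adapt_fn` and `score_fn` are hypothetical stand-ins for the real TTT step and chunk scorer, not this repo's functions.

```python
# Hypothetical sketch of TTT with periodic anti-drift reset (configs C/D).
# `adapt_fn` / `score_fn` are stand-ins for the real TTT step and scorer.
import copy
import torch

def ttt_with_reset(model, chunks, adapt_fn, score_fn, reset_every=100):
    """Score each chunk first, then adapt (legal score-first order);
    every `reset_every` chunks, restore the trained checkpoint so the
    adapter cannot drift far from it."""
    base_state = copy.deepcopy(model.state_dict())  # frozen checkpoint
    losses = []
    for ci, chunk in enumerate(chunks):
        with torch.no_grad():
            losses.append(score_fn(model, chunk))   # score BEFORE adapting
        if ci + 1 < len(chunks):                    # last chunk: no adapt pass
            adapt_fn(model, chunk)
        if (ci + 1) % reset_every == 0:
            model.load_state_dict(base_state)       # anti-drift reset
    return losses
```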
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Mar 29, 2026
…-gram invalidation

- PR openai#771 closed (rule violation: multi-epoch TTT re-scored same eval tokens)
- N-gram eval cache banned: 33+ PRs closed by @valerio-oai on 2026-03-27 due to
  normalization bug; correct n-gram achieves ~1.51 BPB (worse than baseline)
- Update merged SOTA to 1.1194 (PR openai#549, was 1.1228)
- New target: PR openai#1060 (1.1122) — Full Hessian GPTQ + XSA-all + Coprime-stride
- Add Lessons 17-20 and v8.0 strategy to CLAUDE.md
- Add 2026-03-29 daily research report to logs/daily_research.md

https://claude.ai/code/session_01GabEptdqRohHFtkKNZNL17
Bortlesboat added a commit to Bortlesboat/parameter-golf that referenced this pull request Mar 29, 2026
…(3-seed mean)

3-seed results: 1.1136/1.1133/1.1139 (mean 1.1136, std 0.0003)
Built on PR openai#549 + PR openai#1060 with optimized GPTQ reserve (10s vs 14s).
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 29, 2026
… reset

Combines the best of three approaches:
  PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all
  PR openai#1072 (1.117):  fused Triton MLP (matmul+activation, 70ms/step)
  Ours:              TTT periodic reset (anti-drift)

Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations
= best training throughput + best quantization + best eval.

Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only).
Falls back to standard path on non-Hopper GPUs.

TTT sweep tests 4 configs on the same trained checkpoint:
  sota_ttt, pr1039, reset/100, reset/50

Total H100 time: ~10min train + 4×7min TTT ≈ 40 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gusanidas added a commit to Gusanidas/parameter-golf that referenced this pull request Mar 30, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AnirudhRahul pushed a commit to AnirudhRahul/parameter-golf that referenced this pull request Mar 30, 2026
…Agreement

Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal ngram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path.

Made-with: Cursor
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Critical realization: our ported innovations (EngramLite, gated skips,
LeakyReLU(0.3)², Turbo-Muon) HURT by 0.003 BPB vs the PR openai#1060 baseline.
PR openai#1060 gets 1.1122, our merged version gets 1.1151. The partial port
of PR openai#1089 innovations doesn't capture their interactions.

Clean path to record: run PR openai#1060's exact code with FA3 + GPTQ_RESERVE=9s.
Expected: 1.1115-1.1122 BPB (well below 1.1144 threshold).
icryo added a commit to icryo/parameter-golf that referenced this pull request Mar 31, 2026
Complete pipeline to beat openai#1 (1.0806 BPB):
- train_gpt_scylla_stack.py: PR openai#1060 + metadata-based tokenizer loading
- retokenize.py: TokenMonster retokenization of FineWeb
- deploy_scylla.sh: two-phase deploy (retokenize once, train many)

Strategy: PR openai#1143 used old stack. We use PR openai#1060's modern stack
(GPTQ, XSA-all, coprime loader) on the same Scylla tokenizer.
Expected: ~1.070-1.080 BPB (beating both openai#1143 and openai#1089).
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…eferred (upstream stateless)

Two-subagent investigation of coprime-stride loader from PR openai#1099/openai#1060.
First subagent confirmed 26 PRs use it, top merged record uses it, ~0.01 BPB
estimated gain. Second subagent extracted exact upstream DistributedTokenLoader
code: it's COMPLETELY STATELESS (~10 lines, just slices TokenStream).

PR openai#1099's implementation is NOT a small patch — it's a fundamental rewrite
adding stateful per-shard cursor management. Real implementation is 60-100 LOC,
needs to interact with TokenStream class I haven't read yet.

DEFERRED because data loader is on the critical path — buggy patch could
silently corrupt training data. Better to validate existing MS3/EL/MR cycle 2+3
results first. Spec captured for next focused research fire.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135 which use TOKEN-level coprime stride. Token-level
needs 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL
variant: modify _advance_file() to use a coprime stride instead of +1, so nearby
training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in TokenStream class (none of the existing
24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1,
falls back to stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards
before repeating. Max spacing diversity = better gradient noise reduction.
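The coverage claim above is easy to check directly; `_advance` below is an illustrative stand-in for the patched `_advance_file()`, not the actual patch.

```python
# Check: with gcd(stride, n_shards) == 1 the walk 0, s, 2s, ... (mod n)
# visits every shard exactly once before repeating.
import math

def _advance(idx: int, stride: int, n_shards: int) -> int:
    # Illustrative stand-in for the patched _advance_file().
    return (idx + stride) % n_shards

def shards_visited(n_shards: int, stride: int) -> list:
    idx, seen = 0, []
    for _ in range(n_shards):
        seen.append(idx)
        idx = _advance(idx, stride, n_shards)
    return seen
```

With gcd(s, N) > 1 the walk collapses onto only N/gcd(s, N) shards, which is why the stride must be coprime.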

Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY
at near-zero risk vs. 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram.

This is the FIRST data-side patch in our 24-patch stack. Tests a completely new
vector after the "neutrality plateau" of architectural/optimizer/training-time
patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)

BPB: 1.1122 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 87c1e24d6ebe, file records/track_10min_16mb/2026-03-29_Loader_FullGPTQ_XSA11_BigramHash2816/train_gpt.py):

The TTT path at line 1124 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
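Structurally, the pattern described above looks like the following sketch (illustrative names only; this is not the submission's code):

```python
# Sketch of the score-first-per-chunk TTT shape: chunk ci is scored
# under weights adapted only on chunks 0..ci-1, and the final chunk
# gets no adaptation pass (the is_last_chunk guard).
import torch
import torch.nn.functional as F

def eval_with_legal_ttt(base_model, chunks, lr=1e-3):
    opt = torch.optim.SGD(base_model.parameters(), lr=lr)
    total = 0.0
    for ci, (x, y) in enumerate(chunks):
        base_model.eval()
        with torch.no_grad():                       # score BEFORE adapting
            total += float(F.cross_entropy(base_model(x), y))
        if ci + 1 < len(chunks):                    # is_last_chunk guard
            base_model.train()
            opt.zero_grad()
            F.cross_entropy(base_model(x), y).backward()
            opt.step()                              # adapt AFTER scoring
    return total / len(chunks)
```

The legality hinges entirely on ordering: no token's score is ever computed under weights that have already seen that token.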

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=11, vocab=1024, code=96398 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

@cocohearts cocohearts merged commit f56ef88 into openai:main Apr 23, 2026
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
…eed mean) (openai#1060)

* Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all

3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.

* fix: add run command, requirements.txt for reproducibility

* chore: strip dead code from train_gpt.py (111KB→96KB, +14KB artifact headroom)

* fix: re-verify 3 seeds with stripped train_gpt.py for full consistency

Seed logs now generated with the same 96,398-byte train_gpt.py that ships
in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
  Seed 1337: 1.1118 BPB, 15,973,962 bytes
  Seed 42:   1.1127 BPB, 15,980,438 bytes
  Seed 2025: 1.1121 BPB, 15,983,626 bytes
  Mean: 1.1122 ± 0.0004

* docs(record): clean stripped submission logs

Fixes openai#1060
