
Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116)#1135

Open
barneywohl wants to merge 1 commit into openai:main from barneywohl:submission-fused-gptq-coprime

Conversation

@barneywohl

Record Submission

Author: @barneywohl
Date: 2026-03-30
val_bpb: 1.1116 ± 0.0005 (3-seed mean)

Results (8×H100 SXM)

| Seed | Sliding BPB | Artifact (bytes) |
| --- | --- | --- |
| 1337 | 1.1110 | 15,982,859 |
| 42 | 1.1121 | 15,981,083 |
| 2024 | 1.1118 | 15,982,475 |
| Mean ± Std | 1.1116 ± 0.0005 | |

Improvement over SOTA

Stack

  1. Fused Triton MLP — custom kernel for leaky_relu(x,0.5).square(), saves 1.8ms/step
  2. Full Hessian GPTQ — Cholesky + actorder + 5-way clip sweep
  3. Coprime-stride loader — multi-shard diversity with memmap
  4. XSA on all 11 layers — exclusive self-attention everywhere
  5. BigramHash(2816×112) — enlarged bigram features
  6. fullgraph=True torch.compile
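For reference, the activation that item 1's fused Triton kernel computes, `leaky_relu(x, 0.5).square()`, can be written unfused in plain PyTorch. This is a minimal sketch of the math only (the function name is illustrative, not from the PR's code); the record's kernel fuses these two ops into a single memory pass.

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """Unfused reference for leaky_relu(x, slope).square().

    The fused Triton kernel computes the same function in one pass over
    the tensor; this eager version defines the output it must match.
    """
    return F.leaky_relu(x, negative_slope=slope).square()

# leaky_relu(-2, 0.5) = -1 -> squared = 1; leaky_relu(3) = 3 -> squared = 9
y = leaky_relu_squared(torch.tensor([-2.0, 3.0]))
assert torch.allclose(y, torch.tensor([1.0, 9.0]))
```

Note the squared form is even in the negative branch's sign, so a fused kernel only needs one multiply and one select per element, which is why the fusion saving is bandwidth-bound rather than compute-bound.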

Built on PR #549 by @abaybektursun with techniques from PRs #726, #634, #1019, #287.

See records folder for full README, logs, and reproducible script.

…816 (val_bpb 1.1116)

3-seed mean: 1.1116 ± 0.0005
Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky)
+ coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112)
+ fullgraph=True torch.compile

Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
bigbag pushed a commit to bigbag/parameter-golf that referenced this pull request Mar 31, 2026
…ed mean)

3-seed mean: 1.0962 BPB (std 0.0005)
Seeds: 1337=1.0957, 42=1.0963, 2024=1.0966
Beats merged SOTA (1.1147) by 0.019 BPB

Built on PR openai#1135 with: QK_GAIN_INIT=4.0, XSA all 11 layers,
Muon-TTT (score-first, 3 epochs), SLOT eval-time delta optimization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135 which use TOKEN-level coprime stride. Token-level
needs 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL
variant: modify _advance_file() to use a coprime stride instead of +1, so nearby
training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in TokenStream class (none of the existing
24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1,
falls back to stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards
before repeating. Max spacing diversity = better gradient noise reduction.

Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY
at near-zero risk vs. 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram.

This is the FIRST data-side patch in our 24-patch stack. Tests a completely new
vector after the "neutrality plateau" of architectural/optimizer/training-time
patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
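The shard-visiting order this commit describes is easy to sketch: with N shards and any stride s satisfying gcd(s, N) = 1, the sequence (i·s) mod N is a permutation of all N shards. The function below is an illustrative standalone version; the actual patch modifies `_advance_file()` inside the `TokenStream` class rather than precomputing a list.

```python
from math import gcd

def coprime_shard_order(num_shards: int, stride: int) -> list[int]:
    """Visit order 0, s, 2s, ... (mod N).

    Covers every shard exactly once per cycle iff gcd(stride, num_shards)
    is 1; otherwise the walk would cycle through only N / gcd shards.
    """
    if gcd(stride, num_shards) != 1:
        raise ValueError("stride must be coprime to num_shards")
    return [(i * stride) % num_shards for i in range(num_shards)]

# With 10 shards and stride 7, consecutive steps land on distant shards,
# yet all 10 shards are covered before any repeats.
order = coprime_shard_order(10, 7)
assert sorted(order) == list(range(10))  # full coverage
assert order[:4] == [0, 7, 4, 1]         # non-adjacent neighbours
```

The stride=1 fallback the commit mentions is just the identity permutation, so gating on `USE_COPRIME_STRIDE=1` recovers the original sequential behaviour exactly.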
@MatoTeziTanka

Community Review — Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116)

BPB: 1.1116 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA fe30417220b1, file records/track_10min_16mb/2026-03-30_FusedMLP_FullGPTQ_CoprimeLoader_XSA11_BH2816/train_gpt.py):

The TTT path at line 1173 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
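A minimal sketch of the score-first-per-chunk shape described above, under illustrative names (this is not the PR's actual code): each chunk is scored under `no_grad()` with the current weights, then the model adapts on that same chunk, and the `is_last_chunk` guard means the final chunk is scored but never trained on.

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr=1e-3):
    """Score each chunk BEFORE adapting on it; skip adaptation on the
    last chunk. Chunk ci is therefore always scored under weights that
    were updated only on chunks 0..ci-1 (the legal TTT pattern)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for ci, (x, y) in enumerate(chunks):
        model.eval()
        with torch.no_grad():              # score first, untouched weights
            losses.append(loss_fn(model(x), y).item())
        if ci == len(chunks) - 1:          # is_last_chunk guard
            continue
        model.train()                      # then adapt on the same chunk
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return losses
```

The ordering is the whole compliance argument: swapping the two phases (adapt, then score) would let each chunk's targets leak into its own evaluation.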

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.13s, dim=512, layers=11, vocab=1024, code=100911 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka, The Agora. Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
