
Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116)#1135

Open
barneywohl wants to merge 1 commit into openai:main from barneywohl:submission-fused-gptq-coprime

Conversation

@barneywohl

Record Submission

Author: @barneywohl
Date: 2026-03-30
val_bpb: 1.1116 ± 0.0005 (3-seed mean)

Results (8×H100 SXM)

| Seed | Sliding BPB | Artifact (bytes) |
| --- | --- | --- |
| 1337 | 1.1110 | 15,982,859 |
| 42 | 1.1121 | 15,981,083 |
| 2024 | 1.1118 | 15,982,475 |
| Mean ± Std | 1.1116 ± 0.0005 | |

Improvement over SOTA

Stack

  1. Fused Triton MLP — custom kernel for leaky_relu(x,0.5).square(), saves 1.8ms/step
  2. Full Hessian GPTQ — Cholesky + actorder + 5-way clip sweep
  3. Coprime-stride loader — multi-shard diversity with memmap
  4. XSA on all 11 layers — exclusive self-attention everywhere
  5. BigramHash(2816×112) — enlarged bigram features
  6. fullgraph=True torch.compile
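For reference, the activation that item 1's fused Triton kernel computes, `leaky_relu(x, 0.5).square()`, can be written unfused in plain PyTorch. This is a minimal sketch of the math only (the function name is illustrative, not from the PR's code); the record's kernel fuses these two ops into a single memory pass.

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    """Unfused reference for leaky_relu(x, slope).square().

    The fused Triton kernel computes the same function in one pass over
    the tensor; this eager version defines the output it must match.
    """
    return F.leaky_relu(x, negative_slope=slope).square()

# leaky_relu(-2, 0.5) = -1 -> squared = 1; leaky_relu(3) = 3 -> squared = 9
y = leaky_relu_squared(torch.tensor([-2.0, 3.0]))
assert torch.allclose(y, torch.tensor([1.0, 9.0]))
```

Note the squared form is even in the negative branch's sign, so a fused kernel only needs one multiply and one select per element, which is why the fusion saving is bandwidth-bound rather than compute-bound.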

Built on PR #549 by @abaybektursun with techniques from PRs #726, #634, #1019, #287.

See records folder for full README, logs, and reproducible script.

…816 (val_bpb 1.1116)

3-seed mean: 1.1116 ± 0.0005
Seeds: 1337=1.1110, 42=1.1121, 2024=1.1118

Stack: LeakyReLU² fused Triton kernel + Full Hessian GPTQ (actorder+Cholesky)
+ coprime-stride multi-shard loader + XSA on all 11 layers + BigramHash(2816x112)
+ fullgraph=True torch.compile

Built on PR openai#549 scaffold with techniques from PRs openai#726, openai#634, openai#1019, openai#287.
bigbag pushed a commit to bigbag/parameter-golf that referenced this pull request Mar 31, 2026
…ed mean)

3-seed mean: 1.0962 BPB (std 0.0005)
Seeds: 1337=1.0957, 42=1.0963, 2024=1.0966
Beats merged SOTA (1.1147) by 0.019 BPB

Built on PR openai#1135 with: QK_GAIN_INIT=4.0, XSA all 11 layers,
Muon-TTT (score-first, 3 epochs), SLOT eval-time delta optimization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135 which use TOKEN-level coprime stride. Token-level
needs 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL
variant: modify _advance_file() to use a coprime stride instead of +1, so nearby
training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in TokenStream class (none of the existing
24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1,
falls back to stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.

Effect: with N shards and gcd(s,N)=1, iterates 0->s->2s->... covering all shards
before repeating. Max spacing diversity = better gradient noise reduction.

Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY
at near-zero risk vs. 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram.

This is the FIRST data-side patch in our 24-patch stack. Tests a completely new
vector after the "neutrality plateau" of architectural/optimizer/training-time
patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
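The shard-visiting order this commit describes is easy to sketch: with N shards and any stride s satisfying gcd(s, N) = 1, the sequence (i·s) mod N is a permutation of all N shards. The function below is an illustrative standalone version; the actual patch modifies `_advance_file()` inside the `TokenStream` class rather than precomputing a list.

```python
from math import gcd

def coprime_shard_order(num_shards: int, stride: int) -> list[int]:
    """Visit order 0, s, 2s, ... (mod N).

    Covers every shard exactly once per cycle iff gcd(stride, num_shards)
    is 1; otherwise the walk would cycle through only N / gcd shards.
    """
    if gcd(stride, num_shards) != 1:
        raise ValueError("stride must be coprime to num_shards")
    return [(i * stride) % num_shards for i in range(num_shards)]

# With 10 shards and stride 7, consecutive steps land on distant shards,
# yet all 10 shards are covered before any repeats.
order = coprime_shard_order(10, 7)
assert sorted(order) == list(range(10))  # full coverage
assert order[:4] == [0, 7, 4, 1]         # non-adjacent neighbours
```

The stride=1 fallback the commit mentions is just the identity permutation, so gating on `USE_COPRIME_STRIDE=1` recovers the original sequential behaviour exactly.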
@MatoTeziTanka

Community Review — Record: Fused Triton MLP + Full GPTQ + Coprime Loader + XSA-all + BH2816 (val_bpb 1.1116)

BPB: 1.1116 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA fe30417220b1, file records/track_10min_16mb/2026-03-30_FusedMLP_FullGPTQ_CoprimeLoader_XSA11_BH2816/train_gpt.py):

The TTT path at line 1173 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
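A minimal sketch of the score-first-per-chunk shape described above, under illustrative names (this is not the PR's actual code): each chunk is scored under `no_grad()` with the current weights, then the model adapts on that same chunk, and the `is_last_chunk` guard means the final chunk is scored but never trained on.

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr=1e-3):
    """Score each chunk BEFORE adapting on it; skip adaptation on the
    last chunk. Chunk ci is therefore always scored under weights that
    were updated only on chunks 0..ci-1 (the legal TTT pattern)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for ci, (x, y) in enumerate(chunks):
        model.eval()
        with torch.no_grad():              # score first, untouched weights
            losses.append(loss_fn(model(x), y).item())
        if ci == len(chunks) - 1:          # is_last_chunk guard
            continue
        model.train()                      # then adapt on the same chunk
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return losses
```

The ordering is the whole compliance argument: swapping the two phases (adapt, then score) would let each chunk's targets leak into its own evaluation.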

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.13s, dim=512, layers=11, vocab=1024, code=100911 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka, The Agora. Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
