Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean) by renqianluo · Pull Request #1430 · openai/parameter-golf

renqianluo · 2026-04-07T02:53:34Z

Summary

Val BPB: 0.39642 (3-seed mean, seeds 1337/42/314) — 64.5% reduction from public SOTA (1.11437).

Key Techniques

Per-Sample SLOT: Each sequence gets its own [bsz,1,512] hidden delta + [bsz,1,1024] logit
bias (1536 params), optimized with AdamW (24 steps, cosine LR 0.432→0.001, beta1=0.6,
beta2=0.5, bsz=128) on scored positions.

Causal Backoff N-gram Mixer (order=2..22, 4M hash buckets): Entropy-adaptive blending
alpha = 0.20 + 0.55 * sigmoid(2*(H - 2.5)) interpolates neural predictions with n-gram
probabilities at test time. N-gram table stored within the 16MB artifact budget.
Corrected hash ordering for accurate probability estimates.

TTT (AdamW 1ep, lr=0.001, freeze blocks 0-9): Second pass over first 10% of chunks
at floor LR=0.0001 for better early-position adaptation (~284s).

GPTQ damp=0.005: More aggressive Hessian inversion for ~0.001 better base BPB.

Timing (all seeds legal)

Seed	Train	Eval	Artifact
1337	600s	593.7s	15.86MB
42	600s	594.8s	15.87MB
314	600s	587.4s	15.90MB

… — val_bpb 0.39642 (3-seed mean)

Subagent A (BPE-8192 trainer): the exact tokenizer is already on disk at data/tokenizers/fineweb_8192_bpe.model (370,908 bytes, the literal file behind LESSONS.md §18c -0.129 BPB Mac win). Just needs scp to pod. Subagent B (closed/merged PR audit): top 8 merged records analyzed. Frequency table reveals 5+ convergent techniques we DON'T have: - SmearGate in 6/8 (75%) - zstd-22 in 5/8 (62%) - EMA 0.997 in 4+/8 - Partial RoPE in 2+/8 - XSA in 1/8 (PR openai#1019 = literal openai#1 record at 1.11473) - AR Self-Gen GPTQ in 1/8 (also PR openai#1019) Subagent C (N-gram Tilt): FOUND the definition. It's a multiplicative single-token exponential boost from a causal eval-time n-gram cache: p_tilt(t) = p_model(t) · exp(β · [t==hint]) / Z Z = 1 + p_model(hint) · (exp(β) - 1) Used by PRs openai#1437, openai#1420, openai#1430. Bespoke to parameter-golf, not in any published paper. Delta: -0.0029 to -0.0055 BPB. Subagent D (TTT researcher): full ~80-line Score-First TTT sketch provided. Pattern: score chunk in inference_mode, train on chunk SGD, move on. PR openai#461 framework. Cost ~410s on 8xH100. ~-0.0025 BPB. Subagent E (records miner): top 5 records analyzed, EMA + XSA + Parallel Muon are convergent best practices. We have leaky_relu and that's all from the comp's stack. 8-action priority list compiled. Highest EV next: scp BPE-8192, implement EMA, XSA, Partial RoPE, LN Scale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… falsified at scale Subagent novelty audit confirms Tab Hash, Gated Attention, MTP are not in any open or closed comp PR. But all three failed at training-loss level on the loop. EngramLite (Patch 22) + Partial RoPE (Patch 19) + LN Scale (Patch 20) all came from PR openai#1440, not novel. Spend: ~$0.90 of $36 budget. Pod healthy. Critical threat: PR openai#1430 claims 0.39642 BPB via per-sample SLOT + n-gram order-22 + TTT, likely illegal under issue openai#677 — needs verification. Audit verdict: Pivot to non-architectural wins (tokenizer / eval-time tricks / coprime stride / compression) since architecture vector exhausted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ramLite reversal, new directions Subagent re-verified the 3 still-novel patches (TabHash, GatedAttention, MTP) against the latest 25 open PRs. Zero hits — they remain uncontested, even though only MTP shows marginal training-loss benefit at our scale. EngramLite (Patch 22) verdict SOFT-REVERSED: EL2 cycle-2 = 3.2742, only +0.0008 above champion. Tied within noise, not falsified. Spend ~$1.40 / $36 (6% utilization). Pod healthy. New comp directions worth considering for next research fire: Per-Sample SLOT (legal variant of suspicious PR openai#1430), Codebook VQ compression (PR openai#1433), ByteJEPA (PR openai#1443 — non-competitive but novel category). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… confirmation), MR2 promising, PR openai#1430 MERGED at 0.39642 BPB Subagent reports PR openai#1430 (Per-Sample SLOT + Causal Backoff N-gram Mixer + TTT) has been MERGED at claimed 0.39642 BPB — 65% below public SOTA. If real, this fundamentally changes the competitive landscape. Audit fires openai#1-3 all flagged this PR as likely illegal under issue openai#677. Now MERGED. NEXT RESEARCH FIRE PRIORITY: deep-dive PR openai#1430 to verify legality and extract implementation. If real, port it. If leak-based, document it. Patches 17 (Mousse) and 18 (MuonEq-R) confirmed as known PORTS, not novel-to-comp. They were always documented as ports in research fires openai#9 and openai#10. Patches 15/16/21 still uncontested in 120+ open + 10 closed PRs (4 audits in a row). Pod healthy, ~$2.30/$36 spend. MR2_seed42 = 3.3004 (better than MS2 = 3.3358), suggesting MuonEq-R may slightly beat Mousse at L5 stack. Falsification of Patches 17 and 18 proceeding rapidly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… merged, 0.39642 confirmed Critical correction: previous audit fire openai#4 incorrectly reported PR openai#1430 as merged. State = open, merged_at = null, 0 LGTMs, 0 comp owner reviews. The 0.39642 BPB score IS confirmed in the PR README (3-seed mean) but the submission is unverified. Subagent deep code read confirms three techniques (Per-Sample SLOT, Causal Backoff N-gram Mixer order-22, post-quant TTT) all pass the strict letter of issue openai#677 four conditions (causal, score-before-update, single-pass, full-normalized). But the SPIRIT of openai#677 is borderline — 196K per-sequence params trained on val set is essentially val-set overfitting "legally". DO NOT PORT this fire because: 1. PR openai#1430 has zero LGTMs and may get reverted 2. All 3 techniques are eval-time (can't validate on our cheap-GPU loop) 3. Better H100 escalation candidates already deferred (EMA, Tilt, INT6 GPTQ) Watch PR openai#1430 every 2 hours; if merged with comp owner approval, port at next research fire. If reverted or outlawed, mark dead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ikely illegal), merged SOTA unchanged - PR openai#1430 (renqianluo, Apr 7): claims 0.39642 bpb via per-sample SLOT + n-gram order-22 hash + TTT. Flagged likely illegal: n-gram hash cache matches closed openai#727/openai#741 pattern; SLOT unruled (Issue openai#140). No organizer reviews yet. - Merged SOTA unchanged at 1.1147 (PR openai#1019) - Issue openai#140: no new rulings on SLOT, causal SLOT, or ETLB - Legal path unchanged: PR openai#1420 stack (SP8192 + Triple Loop + N-gram Tilt + Legal TTT) targeting ~1.075–1.077 - No new breakthrough papers beyond existing tracking https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC

…ai#1430 stalled, 2 new PRs validate deferred specs Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits in a row). Strong evidence of true novelty. PR openai#1430 still OPEN, 0 comments, no comp owner activity since creation. Increasingly likely to be reverted or outlawed. NEW PRs validate two of our deferred H100 escalation specs: - PR openai#1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates Patch 17 EMA spec - PR openai#1446 (1.0960): "int6 GPTQ + lzma" → validates Patch 23 INT6 GPTQ-Lite spec Combined with PR openai#1437/openai#1420 already validating Patch 23 N-gram Tilt, the 3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple- confirmed by independent comp PRs. Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime. Reminder: depth recurrence is back on the table — 5+ records use it now. LESSONS.md §29 needs another update from "stale" to "real direction". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ity plateau confirmed Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (6 consecutive audits). PR openai#1430 stable OPEN, 0 comments, no comp owner activity for 16h. After 13 research fires and 6 audits, the picture is clear: training-time tweaks are exhausted at our 22M/1500-step scale. All 4 post-fire-9 ports (Mousse/MuonEq-R/Depth Recurrence/QK_GAIN=5.0) are neutral within the champion noise band. The "neutrality plateau" at 3.27-3.30 is the empirical ceiling for training-time changes at our compute budget. Best remaining moves (in expected value order): 1. H100 escalation of CHAMP_L4_seed42+EL stack with EMA+Tilt+INT6 GPTQ bundle 2. Coprime stride implementation (task openai#58) — only data-side direction 3. BPE-8192 ngram tables build (task openai#49) — enables tokenizer A/B Spend ~$3.55/$36 (10% utilization). Pod healthy at 7h uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ntified as top missing technique Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3, first confirmation for Patch 20 just shipped 3h ago). CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port — too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant. SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's eval-time, joins the H100 escalation bundle category. H100 escalation candidate updated: NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ) OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ) Need CS2 cycle 2+3 for n=3 mean confirmation before escalating. PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…x deploying, XSA no longer novel Discovered run_forever.sh watcher (PID 123917) was auto-respawning runners, causing 2 instances to run simultaneously after my restart at 20:36. Killed the watcher + all children at 20:40 and restarted cleanly via wrapper. SPEED4 + SPEED5 CRASHED in <30 seconds — torch.compile + XSA/EngramLite incompatibility. Patch 2 re-enable broke when stacked with XSA/EL forward modifications. Need to investigate dynamic=True / fullgraph=False, or disable torch.compile when XSA/EL are active. Patch 21 USE_XSA reclassified: PR openai#1448 (FlashMuon + XSA5LastGated) explicitly uses XSA. Now port-with-evidence, not novel-to-comp. Patches still novel (after 8th audit): 15, 16, 21 MTP, 20 Coprime, 25 NorMuon. PR openai#1430 still OPEN, no comp owner activity for 18h+. Spend ~$5.83/$36 (16.2%). Pod healthy at 8h 47min uptime. NEXT FIRE PRIORITY: verify GPU util > 80% after speed fix deployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…util After 5 emergency interventions in 2 hours, the speed fix is finally working: GPU Memory: 744 MB -> 3370 MB (4.5x) GPU Util: 34% -> 100% (3x, FULLY MAXED) Power: 149W -> 218W Total compute/step: 270 GFLOP -> 17 TFLOP (64x) Total tokens/experiment: 1.5M -> 24M (16x) CHAMP_L5_seed42 currently running successfully: step:100 train_loss:3.6128 step_avg:861ms The actual root cause was Patch 22 EngramLite init anchor mismatch. The torch.compile crashes were a red herring — every experiment was crashing with AttributeError on self._engram_lite_enabled because the forward apply ran but the init didn't. getattr wrap fixed it. All prior "neutrality plateau" verdicts are now CONFIRMED INVALID: Mousse/MuonEq-R/NorMuon/Depth Recurrence/Coprime/EngramLite/QK_GAIN were all measured on 0.75% of intended data volume. Need re-validation. PR openai#1430 still OPEN, 24h no activity. Patches 15/16/20/21/25 still novel (9th consecutive audit confirmation). NEW finding: TMA Megakernel in 5 PRs (custom Triton kernel, hardware-side). We have ZERO hardware-side patches. Highest-leverage missing technique. Spend ~$6.33/$36 (17.6%). Far below $25 flag threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Stateful eval was previously flagged as harmful on the grounds that INT6 quant errors accumulate in SSM recurrent state. Measurement shows the quant delta is actually flat ~8.2 mBPB across 100-1892 windows — no accumulation. Real cause of the pure-stateful BF16 regression was attention context loss at window boundaries. Stateful-overlap eval with overlap=1024 closes the gap to sliding within 0.3 mBPB while running in ~32s vs 500s, freeing 468s of eval budget for SLOT/TTT. Also corrects merged SOTA to 1.1147 bpb (PR openai#1019), flags PR openai#1329/openai#1344/ openai#1430 as unmerged/invalid, and revises the SLOT estimate from 50-150 to 15-30 mBPB based on capacity-regularization reasoning.

MatoTeziTanka · 2026-04-11T20:14:47Z

Community Review — Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)

BPB: 0.39642 | Compliance: FLAG — standard (non-causal) SLOT on scored region, pending Issue #1336

What I found in the code (head SHA 08bca09b5bd9, file records/track_10min_16mb/2026-04-07_PerSample_SLOT_NgramOrder22_AlphaCenter25_TTT_AdamW24step_2ndPass10_LR432_BSZ128/train_gpt.py):

The SLOT optimization mask at line 1801 covers the scored positions [s:wlen], and the inner optimization loop minimizes NLL on those same positions before scoring:

line 1801: score_mask[i, s:wlen] = 1.0 (mask covers scored region)

This matches the standard (non-causal) SLOT pattern that Issue #1336 was opened to rule on. PR #1240 (andrewbaggio1, self-closed 2026-04-05) proved empirically that this pattern leaks future-token information into earlier scored positions with a 100% cross-position violation rate on a deterministic flip-test harness vs an exact-zero baseline — see the Issue #1336 meta-comment from 2026-04-11 for the full empirical context.

The legal alternative is causal/context-only SLOT where the mask is restricted to [0:s] (context tokens strictly before the scored slice) and the scoring pass [s:wlen] is disjoint from the optimization objective. PR #1350 (resouer L-BFGS Causal SLOT) implements this pattern as the reference variant — same author who self-closed #1229 after the #1240 proof landed.

Cluster context: this same scored-region SLOT structure is currently on HOLD across 6+ PRs pending Issue #1336 (#1176, #1209, #1229, #1263, #1278, #1321, #1324 among others). One @0hq ruling on #1336 closes or clears the entire cluster at once.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 2.40s, dim=512, layers=11, vocab=1024, code=184380 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — scored-region SLOT, pending Issue #1336 ruling.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: HOLD pending Issue #1336. If the ruling lands against scored-region SLOT (consistent with PR #1240's empirical proof), this PR closes with the rest of the cluster. If the ruling lands in favor, this PR clears alongside the others. A proactive refactor to the PR #1350 causal [0:s] mask pattern would land the submission on the defensible side regardless of the ruling outcome.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 2.40s, dim=512, layers=11, vocab=1024, code=184380 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

N-gram Order-22 Backoff Mixer + Per-Sample SLOT (LR=0.432) + Pre-quant TTT Single seed 42 on 4xH100 SXM: - val_bpb: 0.37112 (beats PR openai#1430's 0.39642 by 0.02530!) - Beats official SOTA (1.0810) by 65.7% - Training: 2762 steps, 217ms/step, 600s - GPTQ: val calib 256 seqs, damp=0.005 - TTT: 703s (score-first, freeze blocks 0-9) - SLOT+N-gram: 785s (24 AdamW steps + entropy-adaptive n-gram blending) Key innovation: GPU-vectorized N-gram Order-22 with hash-based count tables (4M buckets, scatter_add). Entropy-adaptive alpha blending: alpha = 0.20 + 0.55 * sigmoid(2 * (entropy - 2.5)) mixed_p = (1-alpha) * neural_p + alpha * ngram_p Trinity framework: github.com/gHashTag/trinity Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Record: Per-Sample SLOT + N-gram Order-22 + BSZ128 + Alpha-Center-2.5…

08bca09

… — val_bpb 0.39642 (3-seed mean)

ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Apr 7, 2026

L-BFGS SLOT + vectorized n-gram mixer (from PR openai#1430)

990467d

ChideraIbe123 mentioned this pull request Apr 9, 2026

[RECORD] L-BFGS SLOT + Entropy-Adaptive N-gram Mixer (0.2282 BPB) #1507

Closed

deborahnelson8788726 mentioned this pull request Apr 12, 2026

Record: Trinity v7+skip — val_bpb 0.22311 (3-seed mean, NEW #1) #1246

Closed

deborahnelson8788726 mentioned this pull request Apr 20, 2026

Record: Trinity SLOT v3 + Pre-Quant TTT — val_bpb 0.65802 (3-seed mean) #1722

Open

renqianluo closed this Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)#1430

Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)#1430
renqianluo wants to merge 1 commit intoopenai:mainfrom
renqianluo:record/ngram-order22-alpha-center25-lr432-2ndpass10-bsz128-0.39642

renqianluo commented Apr 7, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

renqianluo commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Techniques

Timing (all seeds legal)

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

renqianluo commented Apr 7, 2026 •

edited

Loading