Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)#1430
Conversation
… — val_bpb 0.39642 (3-seed mean)
Subagent A (BPE-8192 trainer): the exact tokenizer is already on disk at data/tokenizers/fineweb_8192_bpe.model (370,908 bytes, the literal file behind LESSONS.md §18c -0.129 BPB Mac win). Just needs scp to pod. Subagent B (closed/merged PR audit): top 8 merged records analyzed. Frequency table reveals 5+ convergent techniques we DON'T have: - SmearGate in 6/8 (75%) - zstd-22 in 5/8 (62%) - EMA 0.997 in 4+/8 - Partial RoPE in 2+/8 - XSA in 1/8 (PR openai#1019 = literal openai#1 record at 1.11473) - AR Self-Gen GPTQ in 1/8 (also PR openai#1019) Subagent C (N-gram Tilt): FOUND the definition. It's a multiplicative single-token exponential boost from a causal eval-time n-gram cache: p_tilt(t) = p_model(t) · exp(β · [t==hint]) / Z Z = 1 + p_model(hint) · (exp(β) - 1) Used by PRs openai#1437, openai#1420, openai#1430. Bespoke to parameter-golf, not in any published paper. Delta: -0.0029 to -0.0055 BPB. Subagent D (TTT researcher): full ~80-line Score-First TTT sketch provided. Pattern: score chunk in inference_mode, train on chunk SGD, move on. PR openai#461 framework. Cost ~410s on 8xH100. ~-0.0025 BPB. Subagent E (records miner): top 5 records analyzed, EMA + XSA + Parallel Muon are convergent best practices. We have leaky_relu and that's all from the comp's stack. 8-action priority list compiled. Highest EV next: scp BPE-8192, implement EMA, XSA, Partial RoPE, LN Scale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… falsified at scale Subagent novelty audit confirms Tab Hash, Gated Attention, MTP are not in any open or closed comp PR. But all three failed at training-loss level on the loop. EngramLite (Patch 22) + Partial RoPE (Patch 19) + LN Scale (Patch 20) all came from PR openai#1440, not novel. Spend: ~$0.90 of $36 budget. Pod healthy. Critical threat: PR openai#1430 claims 0.39642 BPB via per-sample SLOT + n-gram order-22 + TTT, likely illegal under issue openai#677 — needs verification. Audit verdict: Pivot to non-architectural wins (tokenizer / eval-time tricks / coprime stride / compression) since architecture vector exhausted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ramLite reversal, new directions Subagent re-verified the 3 still-novel patches (TabHash, GatedAttention, MTP) against the latest 25 open PRs. Zero hits — they remain uncontested, even though only MTP shows marginal training-loss benefit at our scale. EngramLite (Patch 22) verdict SOFT-REVERSED: EL2 cycle-2 = 3.2742, only +0.0008 above champion. Tied within noise, not falsified. Spend ~$1.40 / $36 (6% utilization). Pod healthy. New comp directions worth considering for next research fire: Per-Sample SLOT (legal variant of suspicious PR openai#1430), Codebook VQ compression (PR openai#1433), ByteJEPA (PR openai#1443 — non-competitive but novel category). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… confirmation), MR2 promising, PR openai#1430 MERGED at 0.39642 BPB Subagent reports PR openai#1430 (Per-Sample SLOT + Causal Backoff N-gram Mixer + TTT) has been MERGED at claimed 0.39642 BPB — 65% below public SOTA. If real, this fundamentally changes the competitive landscape. Audit fires openai#1-3 all flagged this PR as likely illegal under issue openai#677. Now MERGED. NEXT RESEARCH FIRE PRIORITY: deep-dive PR openai#1430 to verify legality and extract implementation. If real, port it. If leak-based, document it. Patches 17 (Mousse) and 18 (MuonEq-R) confirmed as known PORTS, not novel-to-comp. They were always documented as ports in research fires openai#9 and openai#10. Patches 15/16/21 still uncontested in 120+ open + 10 closed PRs (4 audits in a row). Pod healthy, ~$2.30/$36 spend. MR2_seed42 = 3.3004 (better than MS2 = 3.3358), suggesting MuonEq-R may slightly beat Mousse at L5 stack. Falsification of Patches 17 and 18 proceeding rapidly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… merged, 0.39642 confirmed Critical correction: previous audit fire openai#4 incorrectly reported PR openai#1430 as merged. State = open, merged_at = null, 0 LGTMs, 0 comp owner reviews. The 0.39642 BPB score IS confirmed in the PR README (3-seed mean) but the submission is unverified. Subagent deep code read confirms three techniques (Per-Sample SLOT, Causal Backoff N-gram Mixer order-22, post-quant TTT) all pass the strict letter of issue openai#677 four conditions (causal, score-before-update, single-pass, full-normalized). But the SPIRIT of openai#677 is borderline — 196K per-sequence params trained on val set is essentially val-set overfitting "legally". DO NOT PORT this fire because: 1. PR openai#1430 has zero LGTMs and may get reverted 2. All 3 techniques are eval-time (can't validate on our cheap-GPU loop) 3. Better H100 escalation candidates already deferred (EMA, Tilt, INT6 GPTQ) Watch PR openai#1430 every 2 hours; if merged with comp owner approval, port at next research fire. If reverted or outlawed, mark dead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ikely illegal), merged SOTA unchanged - PR openai#1430 (renqianluo, Apr 7): claims 0.39642 bpb via per-sample SLOT + n-gram order-22 hash + TTT. Flagged likely illegal: n-gram hash cache matches closed openai#727/openai#741 pattern; SLOT unruled (Issue openai#140). No organizer reviews yet. - Merged SOTA unchanged at 1.1147 (PR openai#1019) - Issue openai#140: no new rulings on SLOT, causal SLOT, or ETLB - Legal path unchanged: PR openai#1420 stack (SP8192 + Triple Loop + N-gram Tilt + Legal TTT) targeting ~1.075–1.077 - No new breakthrough papers beyond existing tracking https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
…ai#1430 stalled, 2 new PRs validate deferred specs Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits in a row). Strong evidence of true novelty. PR openai#1430 still OPEN, 0 comments, no comp owner activity since creation. Increasingly likely to be reverted or outlawed. NEW PRs validate two of our deferred H100 escalation specs: - PR openai#1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates Patch 17 EMA spec - PR openai#1446 (1.0960): "int6 GPTQ + lzma" → validates Patch 23 INT6 GPTQ-Lite spec Combined with PR openai#1437/openai#1420 already validating Patch 23 N-gram Tilt, the 3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple- confirmed by independent comp PRs. Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime. Reminder: depth recurrence is back on the table — 5+ records use it now. LESSONS.md §29 needs another update from "stale" to "real direction". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ity plateau confirmed Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (6 consecutive audits). PR openai#1430 stable OPEN, 0 comments, no comp owner activity for 16h. After 13 research fires and 6 audits, the picture is clear: training-time tweaks are exhausted at our 22M/1500-step scale. All 4 post-fire-9 ports (Mousse/MuonEq-R/Depth Recurrence/QK_GAIN=5.0) are neutral within the champion noise band. The "neutrality plateau" at 3.27-3.30 is the empirical ceiling for training-time changes at our compute budget. Best remaining moves (in expected value order): 1. H100 escalation of CHAMP_L4_seed42+EL stack with EMA+Tilt+INT6 GPTQ bundle 2. Coprime stride implementation (task openai#58) — only data-side direction 3. BPE-8192 ngram tables build (task openai#49) — enables tokenizer A/B Spend ~$3.55/$36 (10% utilization). Pod healthy at 7h uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntified as top missing technique Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3, first confirmation for Patch 20 just shipped 3h ago). CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port — too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant. SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's eval-time, joins the H100 escalation bundle category. H100 escalation candidate updated: NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ) OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ) Need CS2 cycle 2+3 for n=3 mean confirmation before escalating. PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x deploying, XSA no longer novel Discovered run_forever.sh watcher (PID 123917) was auto-respawning runners, causing 2 instances to run simultaneously after my restart at 20:36. Killed the watcher + all children at 20:40 and restarted cleanly via wrapper. SPEED4 + SPEED5 CRASHED in <30 seconds — torch.compile + XSA/EngramLite incompatibility. Patch 2 re-enable broke when stacked with XSA/EL forward modifications. Need to investigate dynamic=True / fullgraph=False, or disable torch.compile when XSA/EL are active. Patch 21 USE_XSA reclassified: PR openai#1448 (FlashMuon + XSA5LastGated) explicitly uses XSA. Now port-with-evidence, not novel-to-comp. Patches still novel (after 8th audit): 15, 16, 21 MTP, 20 Coprime, 25 NorMuon. PR openai#1430 still OPEN, no comp owner activity for 18h+. Spend ~$5.83/$36 (16.2%). Pod healthy at 8h 47min uptime. NEXT FIRE PRIORITY: verify GPU util > 80% after speed fix deployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…util After 5 emergency interventions in 2 hours, the speed fix is finally working: GPU Memory: 744 MB -> 3370 MB (4.5x) GPU Util: 34% -> 100% (3x, FULLY MAXED) Power: 149W -> 218W Total compute/step: 270 GFLOP -> 17 TFLOP (64x) Total tokens/experiment: 1.5M -> 24M (16x) CHAMP_L5_seed42 currently running successfully: step:100 train_loss:3.6128 step_avg:861ms The actual root cause was Patch 22 EngramLite init anchor mismatch. The torch.compile crashes were a red herring — every experiment was crashing with AttributeError on self._engram_lite_enabled because the forward apply ran but the init didn't. getattr wrap fixed it. All prior "neutrality plateau" verdicts are now CONFIRMED INVALID: Mousse/MuonEq-R/NorMuon/Depth Recurrence/Coprime/EngramLite/QK_GAIN were all measured on 0.75% of intended data volume. Need re-validation. PR openai#1430 still OPEN, 24h no activity. Patches 15/16/20/21/25 still novel (9th consecutive audit confirmation). NEW finding: TMA Megakernel in 5 PRs (custom Triton kernel, hardware-side). We have ZERO hardware-side patches. Highest-leverage missing technique. Spend ~$6.33/$36 (17.6%). Far below $25 flag threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stateful eval was previously flagged as harmful on the grounds that INT6 quant errors accumulate in SSM recurrent state. Measurement shows the quant delta is actually flat ~8.2 mBPB across 100-1892 windows — no accumulation. Real cause of the pure-stateful BF16 regression was attention context loss at window boundaries. Stateful-overlap eval with overlap=1024 closes the gap to sliding within 0.3 mBPB while running in ~32s vs 500s, freeing 468s of eval budget for SLOT/TTT. Also corrects merged SOTA to 1.1147 bpb (PR openai#1019), flags PR openai#1329/openai#1344/ openai#1430 as unmerged/invalid, and revises the SLOT estimate from 50-150 to 15-30 mBPB based on capacity-regularization reasoning.
Community Review — Record: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)BPB: 0.39642 | Compliance: FLAG — standard (non-causal) SLOT on scored region, pending Issue #1336 What I found in the code (head SHA The SLOT optimization mask at line 1801 covers the scored positions This matches the standard (non-causal) SLOT pattern that Issue #1336 was opened to rule on. PR #1240 (andrewbaggio1, self-closed 2026-04-05) proved empirically that this pattern leaks future-token information into earlier scored positions with a 100% cross-position violation rate on a deterministic flip-test harness vs an exact-zero baseline — see the Issue #1336 meta-comment from 2026-04-11 for the full empirical context. The legal alternative is causal/context-only SLOT where the mask is restricted to Cluster context: this same scored-region SLOT structure is currently on HOLD across 6+ PRs pending Issue #1336 (#1176, #1209, #1229, #1263, #1278, #1321, #1324 among others). One @0hq ruling on #1336 closes or clears the entire cluster at once. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 2.40s, dim=512, layers=11, vocab=1024, code=184380 B, SMOKE_TEST_PASS Verdict: COMPLIANCE FLAG — scored-region SLOT, pending Issue #1336 ruling. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: HOLD pending Issue #1336. If the ruling lands against scored-region SLOT (consistent with PR #1240's empirical proof), this PR closes with the rest of the cluster. If the ruling lands in favor, this PR clears alongside the others. A proactive refactor to the PR #1350 causal Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 2.40s, dim=512, layers=11, vocab=1024, code=184380 B, SMOKE_TEST_PASS. Classification via deterministic AST-based |
N-gram Order-22 Backoff Mixer + Per-Sample SLOT (LR=0.432) + Pre-quant TTT Single seed 42 on 4xH100 SXM: - val_bpb: 0.37112 (beats PR openai#1430's 0.39642 by 0.02530!) - Beats official SOTA (1.0810) by 65.7% - Training: 2762 steps, 217ms/step, 600s - GPTQ: val calib 256 seqs, damp=0.005 - TTT: 703s (score-first, freeze blocks 0-9) - SLOT+N-gram: 785s (24 AdamW steps + entropy-adaptive n-gram blending) Key innovation: GPU-vectorized N-gram Order-22 with hash-based count tables (4M buckets, scatter_add). Entropy-adaptive alpha blending: alpha = 0.20 + 0.55 * sigmoid(2 * (entropy - 2.5)) mixed_p = (1-alpha) * neural_p + alpha * ngram_p Trinity framework: github.com/gHashTag/trinity Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Val BPB: 0.39642 (3-seed mean, seeds 1337/42/314) — 64.5% reduction from public SOTA (1.11437).
Key Techniques
Per-Sample SLOT: Each sequence gets its own [bsz,1,512] hidden delta + [bsz,1,1024] logit
bias (1536 params), optimized with AdamW (24 steps, cosine LR 0.432→0.001, beta1=0.6,
beta2=0.5, bsz=128) on scored positions.
Causal Backoff N-gram Mixer (order=2..22, 4M hash buckets): Entropy-adaptive blending
alpha = 0.20 + 0.55 * sigmoid(2*(H - 2.5))interpolates neural predictions with n-gramprobabilities at test time. N-gram table stored within the 16MB artifact budget.
Corrected hash ordering for accurate probability estimates.
TTT (AdamW 1ep, lr=0.001, freeze blocks 0-9): Second pass over first 10% of chunks
at floor LR=0.0001 for better early-position adaptation (~284s).
GPTQ damp=0.005: More aggressive Hessian inversion for ~0.001 better base BPB.
Timing (all seeds legal)