Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)#1217
bigbag wants to merge 2 commits into openai:main
Conversation
3-seed mean 1.10272 BPB (std 0.00106), beats merged SOTA by 0.012. Built on PR openai#1179 with MuonEq-R optimizer, context-only SLOT (causal variant), and QK_GAIN=5.0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- train_gpt.py: LZMA2+base85 self-extracting wrapper (saves 49KB artifact)
- Added train_seed1337.log, train_seed42.log, train_seed2024.log
- Updated code_bytes in submission.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I think this version of SLOT may still leak information. Restricting the update to context tokens fixes the issue for a single window. However, in the current setup, minibatches contain overlapping windows. In that case, the training update from a later-positioned window in the minibatch can leak information to the earlier windows.
@clarkkev — good catch. The cross-window gradient leak through a shared delta is a valid concern. Here's the precise fix and analysis.

The problem, stated precisely

With a single shared delta, every window in the minibatch backpropagates into the same parameter, so tokens from a later-positioned window can influence the delta that is applied when scoring earlier windows.

The fix: per-window delta with masked loss

```python
# OLD (shared delta — has cross-window leak):
delta = torch.zeros(1, 1, d_model, device=device, requires_grad=True)

# NEW (per-window delta — no cross-window leak):
delta = torch.zeros(bsz, 1, d_model, device=device, requires_grad=True)
```

With shape (bsz, 1, d_model), each window optimizes its own slice of delta, and gradient from one window cannot reach another's. AdamW's running moments are also per-element, so each window's delta gets its own momentum and variance tracking. The loss mask remains per-window: each window's delta trains only on that window's own context positions. (Edge case: the first window, which has no preceding context, gets an empty context mask.)
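A minimal runnable sketch of the per-window mechanics above, using toy sizes and a stand-in linear head rather than the PR's actual model (all names here are illustrative assumptions): the delta is shaped `(bsz, 1, d_model)`, broadcast over sequence positions, and optimized only on masked context positions, so each window's slice receives gradient only from its own tokens.

```python
import torch

torch.manual_seed(0)
bsz, seq, d_model, vocab = 4, 16, 8, 32          # toy sizes, not the PR's
hidden = torch.randn(bsz, seq, d_model)          # stand-in for frozen activations
head = torch.nn.Linear(d_model, vocab)           # stand-in frozen output head
for p in head.parameters():
    p.requires_grad_(False)
targets = torch.randint(0, vocab, (bsz, seq))

# Per-window delta: one learnable offset per window in the minibatch.
delta = torch.zeros(bsz, 1, d_model, requires_grad=True)
opt = torch.optim.AdamW([delta], lr=5e-3)        # per-element running moments

# Context-only mask: the last 4 positions stand in for the scored tokens.
mask = torch.zeros(bsz, seq, dtype=torch.bool)
mask[:, :-4] = True

for _ in range(8):                               # SLOT_STEPS=8
    logits = head(hidden + delta)                # delta broadcasts over seq
    loss = torch.nn.functional.cross_entropy(logits[mask], targets[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Window 0's delta slice only ever saw gradient from window 0's context
# tokens, so zeroing out any other window's data cannot change it.
print(delta.shape)  # torch.Size([4, 1, 8])
```

Because the loss at window `i` depends only on `delta[i]`, the Jacobian across windows is identically zero, which is the "no cross-window leak" property in tensor form.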
Thanks @clarkkev and @AnubhavBharadwaaj for the detailed analysis. The cross-window gradient leak through a shared delta is a valid concern.

Fix implemented and tested

Changed delta shape from (1, 1, d_model) to (bsz, 1, d_model), so each window in the minibatch optimizes its own delta.

Result

Per-window delta is strictly causal but costs ~0.010 BPB:
Per-window SLOT provides almost no benefit over pure sliding (1.1120 vs 1.1104, a difference of only ~0.002). The shared delta's advantage came from aggregating gradients across 1984×32 = 63,488 context tokens per minibatch, versus only 1984 per window.
Previous SLOT-24 computed the optimization loss on all positions, including the newly scored tokens — non-causal. Context-Only SLOT restricts the loss to positions 0..wlen−stride (context only), so the scored tokens never influence the delta. Steps 24→8, lr 0.012→0.005, matching PR openai#1217, which achieves 1.1027 BPB with ~190s eval time.
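The context-only restriction above can be sketched as a one-line position mask (assuming the stated seq_len/stride; `loss_mask` is an illustrative name): with wlen=2048 and stride=64, the final 64 positions are the newly scored tokens, leaving exactly 1984 context positions per window, the count cited in the discussion.

```python
import torch

wlen, stride = 2048, 64
# Positions >= wlen - stride hold the tokens about to be scored; excluding
# them keeps every scored token out of the SLOT optimization loss.
loss_mask = torch.arange(wlen) < (wlen - stride)
n_context = int(loss_mask.sum())
print(n_context)  # 1984
```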
Community Review — Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)

BPB: 1.1027 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 409 implements the score-first-per-chunk pattern: each chunk is scored under torch.inference_mode() before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=11, vocab=1024, code=22718 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
Summary
val_bpb: 1.1027 (3-seed mean, std 0.0011) | ≤15.80 MB | 8×H100 SXM | ~88.8ms/step | ~6654 steps
Built on PR #1179 (@dexhunter) with three additions:
3-Seed Results
Beats merged SOTA (PR #1019, 1.1147) by 0.012 BPB (p ≪ 0.01).
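The p ≪ 0.01 claim can be sanity-checked with a back-of-envelope one-sample t-test on the reported 3-seed statistics (treating the merged SOTA value as a fixed reference; this is a sketch of one defensible test, not necessarily the test the authors ran):

```python
import math

sota = 1.1147                        # merged SOTA (PR #1019)
mean, std, n = 1.10272, 0.00106, 3   # this PR's 3-seed mean and std
t = (sota - mean) / (std / math.sqrt(n))
# Two-sided critical value for df = n - 1 = 2 at alpha = 0.01 is ~9.925,
# so t far above it implies p < 0.01.
print(round(t, 1), t > 9.925)
```

With these numbers t comes out near 19.6, well past the df=2 critical value, consistent with the p ≪ 0.01 claim despite only three seeds.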
Improvement Breakdown
Legality
Training (≤600s on 8×H100)
Evaluation — Context-Only SLOT (LEGAL, causal by construction)
This is a causal variant of SLOT that addresses all prior causality concerns.
Protocol for each sliding window (seq_len=2048, stride=64):
torch.no_grad() — model weights frozen, no gradients.

Why this is causal:
Comparison to standard SLOT (which had causality concerns):
This approach was proposed by @AnubhavBharadwaaj (original SLOT author) as a defensible causal variant in PR #1172 discussion, with claimed ~0.0002 BPB difference from standard SLOT.
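The window bookkeeping implied by the protocol (seq_len=2048, stride=64) can be checked with a tiny counting sketch. It assumes the standard sliding-window convention that the first window scores all of its positions and each later window scores only its final stride tokens (an assumption, since the PR text does not spell this out):

```python
seq_len, stride, doc_len = 2048, 64, 4096   # toy doc spanning two window lengths
scored, start = 0, 0
while start + seq_len <= doc_len:
    # First window scores every position; later windows only the last stride.
    scored += seq_len if start == 0 else stride
    start += stride
print(scored == doc_len)  # True: every token is scored exactly once
```

Under that convention the stride-64 windows tile the document with no token scored twice and none skipped, which is why per-window BPB numbers are directly comparable across stride settings.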
Evaluation — TTT (score-first, ≤10 min additional)
torch.inference_mode() FIRST. NLL recorded BEFORE any parameter update.

No illegal techniques
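The score-first ordering above can be sketched with a toy embedding plus a stand-in linear adapter (all sizes and names here are illustrative, not the PR's): each chunk's NLL is recorded under `torch.inference_mode()` before the optimizer ever sees that chunk, so no scored token influences its own score.

```python
import torch

torch.manual_seed(0)
vocab, d_model, chunk_len, n_chunks = 32, 8, 8, 3   # toy sizes
emb = torch.nn.Embedding(vocab, d_model)
adapter = torch.nn.Linear(d_model, vocab)           # stand-in TTT adapter
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
tokens = torch.randint(0, vocab, (n_chunks * chunk_len,))

nlls = []
for c in range(n_chunks):
    chunk = tokens[c * chunk_len:(c + 1) * chunk_len]
    inp, tgt = chunk[:-1], chunk[1:]                # next-token targets
    with torch.inference_mode():                    # 1. score FIRST, frozen
        nll = torch.nn.functional.cross_entropy(adapter(emb(inp)), tgt)
        nlls.append(nll.item())
    # 2. only AFTER the NLL is recorded does the adapter update on the chunk
    loss = torch.nn.functional.cross_entropy(adapter(emb(inp)), tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(len(nlls))  # 3: one pre-update NLL per chunk
```

The forward pass is run twice per chunk (once frozen for scoring, once with autograd for the update) because tensors created under `inference_mode` cannot participate in a later backward pass.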
Reproduction
```bash
pip install brotli
QK_GAIN_INIT=5.0 SLOT_ENABLED=1 SLOT_STEPS=8 SLOT_LR=0.005 SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Training: ~600s. Eval (sliding + context-only SLOT): ~190s. Total: ~13 min end-to-end.
Acknowledgments
PR #1179 (@dexhunter), MuonEq (arXiv:2603.28254), SLOT (Hu et al. arXiv:2505.12392v2), PR #549 (legal TTT pattern), @AnubhavBharadwaaj (context-only SLOT proposal).
🤖 Generated with Claude Code