Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)#1217
bigbag wants to merge 2 commits into openai:main
Conversation
3-seed mean 1.10272 BPB (std 0.00106), beats merged SOTA by 0.012. Built on PR openai#1179 with MuonEq-R optimizer, context-only SLOT (causal variant), and QK_GAIN=5.0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- train_gpt.py: LZMA2+base85 self-extracting wrapper (saves 49KB artifact)
- Added train_seed1337.log, train_seed42.log, train_seed2024.log
- Updated code_bytes in submission.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I think this version of SLOT may still leak information. Restricting the update to context tokens fixes the issue for a single window. However, in the current setup, minibatches contain overlapping windows. In that case, the training update from a later-positioned window in the minibatch can leak information to the earlier windows.
@clarkkev — good catch. The cross-window gradient leak through a shared delta is a valid concern. Here's the precise fix and analysis.

The problem, stated precisely

With a single shared delta, every window in the minibatch backpropagates into the same parameter, so tokens from a later-positioned window can influence the delta that is applied when scoring earlier windows.

The fix: per-window delta with masked loss

```python
# OLD (shared delta — has cross-window leak):
delta = torch.zeros(1, 1, d_model, device=device, requires_grad=True)

# NEW (per-window delta — no cross-window leak):
delta = torch.zeros(bsz, 1, d_model, device=device, requires_grad=True)
```

With shape (bsz, 1, d_model), each window optimizes its own slice of delta, and gradient from one window cannot reach another's. AdamW's running moments are also per-element, so each window's delta gets its own momentum and variance tracking. The loss mask remains per-window: each window's delta trains only on that window's own context positions. (Edge case: the first window, which has no preceding context, gets an empty context mask.)
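A minimal runnable sketch of the per-window mechanics above, using toy sizes and a stand-in linear head rather than the PR's actual model (all names here are illustrative assumptions): the delta is shaped `(bsz, 1, d_model)`, broadcast over sequence positions, and optimized only on masked context positions, so each window's slice receives gradient only from its own tokens.

```python
import torch

torch.manual_seed(0)
bsz, seq, d_model, vocab = 4, 16, 8, 32          # toy sizes, not the PR's
hidden = torch.randn(bsz, seq, d_model)          # stand-in for frozen activations
head = torch.nn.Linear(d_model, vocab)           # stand-in frozen output head
for p in head.parameters():
    p.requires_grad_(False)
targets = torch.randint(0, vocab, (bsz, seq))

# Per-window delta: one learnable offset per window in the minibatch.
delta = torch.zeros(bsz, 1, d_model, requires_grad=True)
opt = torch.optim.AdamW([delta], lr=5e-3)        # per-element running moments

# Context-only mask: the last 4 positions stand in for the scored tokens.
mask = torch.zeros(bsz, seq, dtype=torch.bool)
mask[:, :-4] = True

for _ in range(8):                               # SLOT_STEPS=8
    logits = head(hidden + delta)                # delta broadcasts over seq
    loss = torch.nn.functional.cross_entropy(logits[mask], targets[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Window 0's delta slice only ever saw gradient from window 0's context
# tokens, so zeroing out any other window's data cannot change it.
print(delta.shape)  # torch.Size([4, 1, 8])
```

Because the loss at window `i` depends only on `delta[i]`, the Jacobian across windows is identically zero, which is the "no cross-window leak" property in tensor form.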
Thanks @clarkkev and @AnubhavBharadwaaj for the detailed analysis. The cross-window gradient leak through a shared delta is a valid concern.

Fix implemented and tested

Changed delta shape from (1, 1, d_model) to (bsz, 1, d_model), so each window in the minibatch optimizes its own delta.

Result

Per-window delta is strictly causal but costs ~0.010 BPB:
Per-window SLOT provides almost no benefit over pure sliding (1.1120 vs 1.1104, a difference of only ~0.002). The shared delta's advantage came from aggregating gradients across 1984×32 = 63,488 context tokens per minibatch, versus only 1984 per window.
Previous SLOT-24 computed the optimization loss on all positions, including the newly scored tokens — non-causal. Context-Only SLOT restricts the loss to positions 0..wlen−stride (context only), so the scored tokens never influence the delta. Steps 24→8, lr 0.012→0.005, matching PR openai#1217, which achieves 1.1027 BPB with ~190s eval time.
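The context-only restriction above can be sketched as a one-line position mask (assuming the stated seq_len/stride; `loss_mask` is an illustrative name): with wlen=2048 and stride=64, the final 64 positions are the newly scored tokens, leaving exactly 1984 context positions per window, the count cited in the discussion.

```python
import torch

wlen, stride = 2048, 64
# Positions >= wlen - stride hold the tokens about to be scored; excluding
# them keeps every scored token out of the SLOT optimization loss.
loss_mask = torch.arange(wlen) < (wlen - stride)
n_context = int(loss_mask.sum())
print(n_context)  # 1984
```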
Community Review — Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean)

BPB: 1.1027 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 409 implements the score-first-per-chunk pattern: each chunk is scored under torch.inference_mode() before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.09s, dim=512, layers=11, vocab=1024, code=22718 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
Summary
val_bpb: 1.1027 (3-seed mean, std 0.0011) | ≤15.80 MB | 8×H100 SXM | ~88.8ms/step | ~6654 steps
Built on PR #1179 (@dexhunter) with three additions:
3-Seed Results
Beats merged SOTA (PR #1019, 1.1147) by 0.012 BPB (p ≪ 0.01).
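The p ≪ 0.01 claim can be sanity-checked with a back-of-envelope one-sample t-test on the reported 3-seed statistics (treating the merged SOTA value as a fixed reference; this is a sketch of one defensible test, not necessarily the test the authors ran):

```python
import math

sota = 1.1147                        # merged SOTA (PR #1019)
mean, std, n = 1.10272, 0.00106, 3   # this PR's 3-seed mean and std
t = (sota - mean) / (std / math.sqrt(n))
# Two-sided critical value for df = n - 1 = 2 at alpha = 0.01 is ~9.925,
# so t far above it implies p < 0.01.
print(round(t, 1), t > 9.925)
```

With these numbers t comes out near 19.6, well past the df=2 critical value, consistent with the p ≪ 0.01 claim despite only three seeds.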
Improvement Breakdown
Legality
Training (≤600s on 8×H100)
Evaluation — Context-Only SLOT (LEGAL, causal by construction)
This is a causal variant of SLOT that addresses all prior causality concerns.
Protocol for each sliding window (seq_len=2048, stride=64):
torch.no_grad() — model weights frozen, no gradients.

Why this is causal:
Comparison to standard SLOT (which had causality concerns):
This approach was proposed by @AnubhavBharadwaaj (original SLOT author) as a defensible causal variant in PR #1172 discussion, with claimed ~0.0002 BPB difference from standard SLOT.
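The window bookkeeping implied by the protocol (seq_len=2048, stride=64) can be checked with a tiny counting sketch. It assumes the standard sliding-window convention that the first window scores all of its positions and each later window scores only its final stride tokens (an assumption, since the PR text does not spell this out):

```python
seq_len, stride, doc_len = 2048, 64, 4096   # toy doc spanning two window lengths
scored, start = 0, 0
while start + seq_len <= doc_len:
    # First window scores every position; later windows only the last stride.
    scored += seq_len if start == 0 else stride
    start += stride
print(scored == doc_len)  # True: every token is scored exactly once
```

Under that convention the stride-64 windows tile the document with no token scored twice and none skipped, which is why per-window BPB numbers are directly comparable across stride settings.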
Evaluation — TTT (score-first, ≤10 min additional)
torch.inference_mode() FIRST. NLL recorded BEFORE any parameter update.

No illegal techniques
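The score-first ordering above can be sketched with a toy embedding plus a stand-in linear adapter (all sizes and names here are illustrative, not the PR's): each chunk's NLL is recorded under `torch.inference_mode()` before the optimizer ever sees that chunk, so no scored token influences its own score.

```python
import torch

torch.manual_seed(0)
vocab, d_model, chunk_len, n_chunks = 32, 8, 8, 3   # toy sizes
emb = torch.nn.Embedding(vocab, d_model)
adapter = torch.nn.Linear(d_model, vocab)           # stand-in TTT adapter
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
tokens = torch.randint(0, vocab, (n_chunks * chunk_len,))

nlls = []
for c in range(n_chunks):
    chunk = tokens[c * chunk_len:(c + 1) * chunk_len]
    inp, tgt = chunk[:-1], chunk[1:]                # next-token targets
    with torch.inference_mode():                    # 1. score FIRST, frozen
        nll = torch.nn.functional.cross_entropy(adapter(emb(inp)), tgt)
        nlls.append(nll.item())
    # 2. only AFTER the NLL is recorded does the adapter update on the chunk
    loss = torch.nn.functional.cross_entropy(adapter(emb(inp)), tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(len(nlls))  # 3: one pre-update NLL per chunk
```

The forward pass is run twice per chunk (once frozen for scoring, once with autograd for the update) because tensors created under `inference_mode` cannot participate in a later backward pass.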
Reproduction
```bash
pip install brotli
QK_GAIN_INIT=5.0 SLOT_ENABLED=1 SLOT_STEPS=8 SLOT_LR=0.005 SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Training: ~600s. Eval (sliding + context-only SLOT): ~190s. Total: ~13 min end-to-end.
Acknowledgments
PR #1179 (@dexhunter), MuonEq (arXiv:2603.28254), SLOT (Hu et al. arXiv:2505.12392v2), PR #549 (legal TTT pattern), @AnubhavBharadwaaj (context-only SLOT proposal).
🤖 Generated with Claude Code