Skip to content

Record: SLOT-48 — val_bpb 0.7406 (3-seed mean)#1321

Open
anthony-maio wants to merge 3 commits intoopenai:mainfrom
anthony-maio:submission/slot48-aggressive
Open

Record: SLOT-48 — val_bpb 0.7406 (3-seed mean)#1321
anthony-maio wants to merge 3 commits intoopenai:mainfrom
anthony-maio:submission/slot48-aggressive

Conversation

@anthony-maio
Copy link
Copy Markdown

Summary

  • val_bpb: 0.7406 (3-seed mean, std 0.0051)
  • Artifact: 15.75-15.82 MB (all seeds < 16MB)
  • Training: 600s on 8xH100 SXM | Eval: ~409s (sliding + SLOT)

3-Seed Results

Seed Sliding BPB + SLOT BPB Artifact
1337 1.126 0.7450 15,815,983
42 1.121 0.7350 15,751,595
2024 1.122 0.7416 15,793,375
Mean 1.123 0.7406

Beats merged SOTA (1.1147) by 0.374 BPB. Beats best pending (#1229, 0.9300) by 0.190 BPB.

What Changed vs PR #1313 (0.8637)

One parameter: SLOT_STEPS increased from 24 to 48. Same model, same training, same architecture.

SLOT Scaling (same model, different step counts)

Steps BPB Delta
16 (PR #1303) 0.946
24 (PR #1313) 0.864 -0.082
48 (this PR) 0.741 -0.123

SLOT-48 Details

  • Per-sample hidden delta [bsz, 1, 512] + logit bias [bsz, 1, 1024]
  • Scored-position masking (last stride=96 tokens per non-first window)
  • 48 AdamW steps, cosine LR 0.012 -> 0.001, weight_decay=1e-8
  • Model weights frozen, delta optimized through detached hidden states
  • Eval: ~409s (under 10-min eval budget)

Compliance

Reproduction

torchrun --standalone --nproc_per_node=8 train_gpt.py

Training: ~600s. Eval: ~409s. Total: ~17 min.

Credits

3-seed: 1337=0.7450, 42=0.7350, 2024=0.7416. All under 16MB.
Same model as openai#1313, only SLOT_STEPS increased 24->48.
Eval time 409s, within 10-min budget.
Copilot AI review requested due to automatic review settings April 4, 2026 04:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new 10min/16mb record entry for SLOT-48 evaluation-time tuning, reporting a 3-seed mean val_bpb of 0.7406 with artifacts under 16MB.

Changes:

  • Introduces a new record folder with the training/eval script (train_gpt.py) configured for SLOT_STEPS=48 by default.
  • Adds per-seed training logs and a submission.json summarizing 3-seed results/metadata.
  • Adds a README documenting results, deltas vs prior SLOT-24, and reproduction instructions.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_gpt.py Training + eval script for the SLOT-48 record run (incl. SLOT eval path).
records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_seed42.log Seed 42 training/eval log used as evidence for reported metrics.
records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_seed2024.log Seed 2024 training/eval log used as evidence for reported metrics.
records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_seed1337.log Seed 1337 training/eval log used as evidence for reported metrics.
records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/submission.json Machine-readable result summary for the record submission.
records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/README.md Human-readable summary of results, changes vs prior PRs, and reproduction steps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +11 to +13
"1337": {"val_loss": 1.25793247, "val_bpb": 0.74502015, "steps": 6034, "artifact_bytes": 15815983},
"42": {"val_loss": 1.24104846, "val_bpb": 0.73502047, "steps": 6563, "artifact_bytes": 15751595},
"2024": {"val_loss": 1.25222813, "val_bpb": 0.74164171, "steps": 6568, "artifact_bytes": 15793375}
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The steps values in seed_results don’t match the actual stop steps shown in the corresponding train_seed*.log files (e.g., seed 42 stops at step 6576, seed 2024 at 6588, seed 1337 at 6578). Please update the JSON to reflect the logged training steps (or clarify what steps represents if it’s intentionally different).

Suggested change
"1337": {"val_loss": 1.25793247, "val_bpb": 0.74502015, "steps": 6034, "artifact_bytes": 15815983},
"42": {"val_loss": 1.24104846, "val_bpb": 0.73502047, "steps": 6563, "artifact_bytes": 15751595},
"2024": {"val_loss": 1.25222813, "val_bpb": 0.74164171, "steps": 6568, "artifact_bytes": 15793375}
"1337": {"val_loss": 1.25793247, "val_bpb": 0.74502015, "steps": 6578, "artifact_bytes": 15815983},
"42": {"val_loss": 1.24104846, "val_bpb": 0.73502047, "steps": 6576, "artifact_bytes": 15751595},
"2024": {"val_loss": 1.25222813, "val_bpb": 0.74164171, "steps": 6588, "artifact_bytes": 15793375}

Copilot uses AI. Check for mistakes.
Comment on lines +9 to +11
| 1337 | 1.126 | **0.7450** | 6034 | 15,815,983 |
| 42 | 1.121 | **0.7350** | 6563 | 15,751,595 |
| 2024 | 1.122 | **0.7416** | 6568 | 15,793,375 |
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README’s “Steps” column doesn’t match the actual training stop steps in the included logs (e.g., seed 42 stops at 6576 in train_seed42.log, seed 2024 at 6588, seed 1337 at 6578). Please update the table so the reported step counts are consistent with the logs.

Suggested change
| 1337 | 1.126 | **0.7450** | 6034 | 15,815,983 |
| 42 | 1.121 | **0.7350** | 6563 | 15,751,595 |
| 2024 | 1.122 | **0.7416** | 6568 | 15,793,375 |
| 1337 | 1.126 | **0.7450** | 6578 | 15,815,983 |
| 42 | 1.121 | **0.7350** | 6576 | 15,751,595 |
| 2024 | 1.122 | **0.7416** | 6588 | 15,793,375 |

Copilot uses AI. Check for mistakes.
Comment on lines +948 to +952
num_layers_total = max(
(int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
default=0,
) + 1

Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_layers_total is computed here but never used, which makes the quantization path harder to read/maintain. Please remove it (or use it if it’s intended for validation/metadata).

Suggested change
num_layers_total = max(
(int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
default=0,
) + 1

Copilot uses AI. Check for mistakes.
@MatoTeziTanka
Copy link
Copy Markdown

MatoTeziTanka commented Apr 11, 2026

Community Review — SLOT-48 (per-window delta + logit bias)

BPB: 0.7406 (3-seed mean, std 0.0051) | Seeds: 3 (1337/42/2024) | Artifact: 15.75–15.82 MB | Compliance: HOLD pending Issue #1336

What this does: Scales SLOT from 24 to 48 inner AdamW steps on top of the PR #1313 stack, optimizing a per-window hidden delta [bsz, 1, 512] + logit bias [bsz, 1, 1024] at eval time. Model weights are frozen; only the delta/bias are trained then discarded.

What I found in the code (records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_gpt.py, head 92947c3):

  • SLOT inner loop lives in eval_val_slot() at lines 830–918.
  • Scored mask (lines 870–874): s = 0 if ws == 0 else max(wlen - stride, 0); mask[i, s:wlen] = 1.0. For non-first windows this is the last stride=96 tokens of the window.
  • SLOT training objective (lines 882–893): slot_loss = (nll * mask).sum() / valid_count, i.e. the delta/bias are optimized on the same positions that will be scored (the last-stride segment), for slot_steps=48 AdamW iterations with cosine LR 0.012 → 0.001.
  • Scoring pass (lines 894–909): after the 48 inner steps, chunk_nll = nll[i, s:wlen] is added to loss_sum — same [s:wlen] slice used as the training objective.
  • State reset: delta, logit_bias, and slot_opt are freshly allocated per sliding-window batch (lines 878–880), so state does not leak across windows. Good.
  • Frozen weights: only delta and logit_bias have requires_grad=True; the base model runs under torch.no_grad() for the hidden pass (line 867) and all grads flow only through the free-standing delta. The README's "frozen model" claim checks out for the model-weights axis.

Compliance reading (Issue #1336 / #1017 four conditions):

  • (1) Causal dependence at the scored positions: the delta is fit to the ground-truth targets at yb[:, s:wlen] before those same positions are scored. The delta is a single global [bsz,1,512] vector rather than per-position, but the optimization objective is the target distribution at the scored region.
  • (2) Full normalized distribution: yes, softcap-tanh logits into F.cross_entropy.
  • (3) Score-before-update: no — scoring happens after 48 gradient steps on the target tokens for this window.
  • (4) Single L→R pass: no — 48 optimization passes per window before the scored forward.

My read is that this is the "standard SLOT" pattern Issue #1336 was opened to ask about, not the causal/context-only variant (which would restrict the SLOT training loss to the pre-scored context, e.g. mask[i, 0:wlen-stride]). I want to flag this without pre-judging the mod ruling — cc @0hq.

Questions for @anthony-maio (asking, not accusing):

  1. Can you confirm my reading that the SLOT optimization loss (line 891) and the reported scoring loss (line 902) both operate on [s:wlen], i.e. the delta is fit on the tokens it is then scored on? If I'm misreading the mask, please point me at the line I missed.
  2. Do you view SLOT-48 here as "causal/context-only" under the Issue Legality question: Is context-only (causal) SLOT legal? #1336 framing, and if so, what's the argument? The README cites PRs Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) #1176 and Record: Scored-Position SLOT + Per-Sample Delta + GPTQ (val_bpb: 0.9300) #1229 as the same pattern — those are also in the Legality question: Is context-only (causal) SLOT legal? #1336 pending bucket as far as I can tell, so "accepted precedent" is load-bearing for that claim.
  3. Any objection to this PR sitting in HOLD until Legality question: Is context-only (causal) SLOT legal? #1336 gets a binding ruling? The technique and the engineering look clean; this is strictly about the legality question.

Gauntlet: Skipped — CPU import of this script hits FA3 / torch.compile paths (expected on your stack, and consistent with #280). Not a flag; just noting I couldn't run the standard gauntlet on CPU. Update — see "Gauntlet rerun on CT2038" below.

What is unambiguously clean:

  • 3-seed reproducibility with tight std (0.0051).
  • Artifact budget (15.75–15.82 MB), train/eval time budgets.
  • Per-window state reset.
  • Model weights genuinely frozen during eval.

Verdict: HOLD pending Issue #1336.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:
HOLD — pending Issue #1336 ruling on whether SLOT with the training objective on the scored positions (via a global per-window delta + logit bias) qualifies as causal/context-only SLOT or as standard SLOT. If the mods rule causal SLOT legal and this pattern counts, this PR and the adjacent #1324 family should all clear together. If not, the SLOT-48 family including #1176, #1229, #1313, #1321, and #1324 will need to be reconsidered collectively. Same standard applies across that whole group — I'm not singling this one out.


Update 2026-04-11 — Gauntlet rerun on CT2038

The first version of this review noted the CPU gauntlet was skipped because of FA3 / torch.compile import blockers on the local workstation. Provisioned a dedicated container (proteus-engine, 128 GB RAM, 32 cores) with Triton 3.6.0 + a flash_attn stub and re-ran the gauntlet from scratch on train_gpt.py at SHA 92947c3:

[PASS] Import (0.0s)
[PASS] Hyperparameters: dim=512, layers=11, heads=8, vocab=1024
[PASS] Model: 26,862,694 params (1.0s)
[PASS] Forward pass: loss=6.9682 (0.1s)
[INFO] Code size: 61,864 bytes (61.9 KB)
[PASS] Artifact: 4,635,256 bytes (29.0% of 16MB) [int6+lzma] (19.1s)
[INFO] Est. 1×H100: 365.8 ms/step, 1640 steps in 10 min
[INFO] Est. 8×H100: 45.7 ms/step, 13121 steps in 10 min
Result: ✓ ALL PASS

forward_loss=6.9682, total_params=26,862,694, model_create_time=0.97s, cpu_fwd_full_batch=11618ms, kurtosis=-1.01, max/std=3.14, 69 weight matrices. The import / forward / model-creation path is clean — the SLOT eval mechanism is the only open question, and that question is purely the Issue #1336 ruling, not a code-correctness question. Verdict and recommendation are unchanged: HOLD pending #1336.


Reviewed by @MatoTeziTankaThe Agora. CPU gauntlet (CT2038 proteus-engine, 2026-04-11): all 10 checks PASS — import, hyperparams, model creation (26.86M params), forward pass (loss 6.9682), code size 61,864 B, artifact 4.64 MB int6+lzma (29.0% of 16 MB budget), step-time projections, weight statistics. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 92947c33cf15a9f4a85e8fb4484b3369b0181766.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@Bortlesboat
Copy link
Copy Markdown

Automated compliance check flagged this as matching the same SLOT delta-bias adapt-score pattern as #1319 (which I posted evidence on earlier) and the #1376 ruling by @MatoTeziTanka.

Evidence — eval_val_slot (lines 830–918):

# line 878-880 — per-batch learnable delta + logit_bias
delta = torch.zeros(bsz, 1, hidden_f.size(-1), device=device,
                    dtype=torch.float32, requires_grad=True)
logit_bias = torch.zeros(bsz, 1, proj_w.size(0), device=device,
                         dtype=torch.float32, requires_grad=True)
slot_opt = torch.optim.AdamW([delta, logit_bias], lr=args.slot_lr, ...)

# line 881 — targets are the val tokens we're about to score
targets_flat = yb.reshape(-1)

# lines 882-893 — SLOT_STEPS AdamW updates minimizing loss against yb
for step_i in range(args.slot_steps):
    ...
    slot_opt.zero_grad()
    h = hidden_f + delta
    lp = F.linear(h, proj_w) + logit_bias
    lg = softcap * torch.tanh(lp / softcap)
    nll = F.cross_entropy(lg.reshape(-1, lg.size(-1)), targets_flat,
                          reduction="none").reshape(bsz, seq_s)
    slot_loss = (nll * mask).sum() / valid_count
    slot_loss.backward()
    slot_opt.step()

# lines 894-903 — recompute nll with optimized delta/logit_bias,
# report it as the score, accumulate into loss_sum

Same C3 violation as #1319: delta and logit_bias are optimized against the exact val targets (yb) that are then scored at lines 894–903. Every scored token was a supervised target for the parameters that score it.

Precedent: this is the same SLOT pattern that #1376 was flagged for by @MatoTeziTanka ("two independent violations"), and the same pattern I posted full line-level evidence on at #1319. Posting a shorter note here since the code is identical up to parameter values (SLOT-48).

Source: parameter-golf-checker, context in #1603. Happy to correct if I'm misreading — @anthony-maio please let me know if SLOT_ENABLED=0 was set for the run that produced the reported BPB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants