Record: SLOT-48 — val_bpb 0.7406 (3-seed mean) #1321
anthony-maio wants to merge 3 commits into openai:main from
Conversation
3-seed: 1337=0.7450, 42=0.7350, 2024=0.7416. All under 16MB. Same model as openai#1313, only SLOT_STEPS increased 24->48. Eval time 409s, within 10-min budget.
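For reviewers who want to double-check the headline number, the 3-seed mean and sample std follow directly from the full-precision per-seed values recorded in `submission.json` (a quick sketch using only the standard library):

```python
from statistics import mean, stdev

# Per-seed val_bpb values from submission.json (seeds 1337, 42, 2024)
bpb = [0.74502015, 0.73502047, 0.74164171]

print(round(mean(bpb), 4))   # 0.7406 — matches the reported 3-seed mean
print(round(stdev(bpb), 4))  # 0.0051 — sample std
```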
Pull request overview
Adds a new 10min/16mb record entry for SLOT-48 evaluation-time tuning, reporting a 3-seed mean val_bpb of 0.7406 with artifacts under 16MB.
Changes:
- Introduces a new record folder with the training/eval script (`train_gpt.py`) configured for SLOT_STEPS=48 by default.
- Adds per-seed training logs and a `submission.json` summarizing 3-seed results/metadata.
- Adds a README documenting results, deltas vs prior SLOT-24, and reproduction instructions.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_gpt.py | Training + eval script for the SLOT-48 record run (incl. SLOT eval path). |
| records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_seed42.log | Seed 42 training/eval log used as evidence for reported metrics. |
| records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_seed2024.log | Seed 2024 training/eval log used as evidence for reported metrics. |
| records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/train_seed1337.log | Seed 1337 training/eval log used as evidence for reported metrics. |
| records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/submission.json | Machine-readable result summary for the record submission. |
| records/track_10min_16mb/2026-04-03_SLOT48_LR012_Stride96/README.md | Human-readable summary of results, changes vs prior PRs, and reproduction steps. |
```json
"1337": {"val_loss": 1.25793247, "val_bpb": 0.74502015, "steps": 6034, "artifact_bytes": 15815983},
"42": {"val_loss": 1.24104846, "val_bpb": 0.73502047, "steps": 6563, "artifact_bytes": 15751595},
"2024": {"val_loss": 1.25222813, "val_bpb": 0.74164171, "steps": 6568, "artifact_bytes": 15793375}
```
The steps values in seed_results don’t match the actual stop steps shown in the corresponding train_seed*.log files (e.g., seed 42 stops at step 6576, seed 2024 at 6588, seed 1337 at 6578). Please update the JSON to reflect the logged training steps (or clarify what steps represents if it’s intentionally different).
Suggested change:

```diff
-"1337": {"val_loss": 1.25793247, "val_bpb": 0.74502015, "steps": 6034, "artifact_bytes": 15815983},
-"42": {"val_loss": 1.24104846, "val_bpb": 0.73502047, "steps": 6563, "artifact_bytes": 15751595},
-"2024": {"val_loss": 1.25222813, "val_bpb": 0.74164171, "steps": 6568, "artifact_bytes": 15793375}
+"1337": {"val_loss": 1.25793247, "val_bpb": 0.74502015, "steps": 6578, "artifact_bytes": 15815983},
+"42": {"val_loss": 1.24104846, "val_bpb": 0.73502047, "steps": 6576, "artifact_bytes": 15751595},
+"2024": {"val_loss": 1.25222813, "val_bpb": 0.74164171, "steps": 6588, "artifact_bytes": 15793375}
```
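A consistency check like the one behind this comment can be scripted. A minimal sketch, using synthetic stand-ins for the log contents and the `seed_results` block (the real `train_seed*.log` line format is an assumption here):

```python
import re

def last_logged_step(log_text: str) -> int:
    """Return the last step number that appears in a training log."""
    steps = re.findall(r"step[: ]+(\d+)", log_text)
    return int(steps[-1]) if steps else -1

# Synthetic stand-ins for train_seed*.log contents.
logs = {
    "42": "step 6500 val_loss 1.30\nstep 6576 val_loss 1.24\n",
    "2024": "step 6588 val_loss 1.25\n",
}
# Stand-in for the seed_results block of submission.json.
seed_results = {"42": {"steps": 6563}, "2024": {"steps": 6588}}

mismatches = {
    seed: (seed_results[seed]["steps"], last_logged_step(text))
    for seed, text in logs.items()
    if seed_results[seed]["steps"] != last_logged_step(text)
}
print(mismatches)  # {'42': (6563, 6576)} — JSON disagrees with the log
```

Running the same scan over the real logs would either confirm the suggested values or surface the intended meaning of `steps`.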
| 1337 | 1.126 | **0.7450** | 6034 | 15,815,983 |
| 42 | 1.121 | **0.7350** | 6563 | 15,751,595 |
| 2024 | 1.122 | **0.7416** | 6568 | 15,793,375 |
The README’s “Steps” column doesn’t match the actual training stop steps in the included logs (e.g., seed 42 stops at 6576 in train_seed42.log, seed 2024 at 6588, seed 1337 at 6578). Please update the table so the reported step counts are consistent with the logs.
Suggested change:

```diff
-| 1337 | 1.126 | **0.7450** | 6034 | 15,815,983 |
-| 42 | 1.121 | **0.7350** | 6563 | 15,751,595 |
-| 2024 | 1.122 | **0.7416** | 6568 | 15,793,375 |
+| 1337 | 1.126 | **0.7450** | 6578 | 15,815,983 |
+| 42 | 1.121 | **0.7350** | 6576 | 15,751,595 |
+| 2024 | 1.122 | **0.7416** | 6588 | 15,793,375 |
```
```python
num_layers_total = max(
    (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
    default=0,
) + 1
```
num_layers_total is computed here but never used, which makes the quantization path harder to read/maintain. Please remove it (or use it if it’s intended for validation/metadata).
Suggested change:

```diff
-num_layers_total = max(
-    (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")),
-    default=0,
-) + 1
```
Community Review — SLOT-48 (per-window delta + logit bias)

**BPB:** 0.7406 (3-seed mean, std 0.0051) | **Seeds:** 3 (1337/42/2024) | **Artifact:** 15.75–15.82 MB | **Compliance:** HOLD pending Issue #1336

**What this does:** Scales SLOT from 24 to 48 inner AdamW steps on top of the PR #1313 stack, optimizing a per-window hidden delta.

**What I found in the code** (
Compliance reading (Issue #1336 / #1017 four conditions):
My read is that this is the "standard SLOT" pattern Issue #1336 was opened to ask about, not the causal/context-only variant (which would restrict the SLOT training loss to the pre-scored context, e.g.

**Questions for @anthony-maio (asking, not accusing):**
**Gauntlet:**

**What is unambiguously clean:**
**Verdict:** HOLD pending Issue #1336. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

**Update 2026-04-11 — Gauntlet rerun on CT2038**

The first version of this review noted the CPU gauntlet was skipped because of FA3 /
Reviewed by @MatoTeziTanka — The Agora.

CPU gauntlet (CT2038 proteus-engine, 2026-04-11): all 10 checks PASS — import, hyperparams, model creation (26.86M params), forward pass (loss 6.9682), code size 61,864 B, artifact 4.64 MB int6+lzma (29.0% of 16 MB budget), step-time projections, weight statistics.

AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): `full_key = ((ctx_hash ^ (target * primes[k])) & mask)` — target token hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against `(per_token_nll * mask)` where mask = scored positions `[s:wlen]`. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Automated compliance check flagged this as matching the same SLOT delta-bias adapt-score pattern as #1319 (which I posted evidence on earlier) and the #1376 ruling by @MatoTeziTanka.

Evidence:

```python
# line 878-880 — per-batch learnable delta + logit_bias
delta = torch.zeros(bsz, 1, hidden_f.size(-1), device=device,
                    dtype=torch.float32, requires_grad=True)
logit_bias = torch.zeros(bsz, 1, proj_w.size(0), device=device,
                         dtype=torch.float32, requires_grad=True)
slot_opt = torch.optim.AdamW([delta, logit_bias], lr=args.slot_lr, ...)

# line 881 — targets are the val tokens we're about to score
targets_flat = yb.reshape(-1)

# lines 882-893 — SLOT_STEPS AdamW updates minimizing loss against yb
for step_i in range(args.slot_steps):
    ...
    slot_opt.zero_grad()
    h = hidden_f + delta
    lp = F.linear(h, proj_w) + logit_bias
    lg = softcap * torch.tanh(lp / softcap)
    nll = F.cross_entropy(lg.reshape(-1, lg.size(-1)), targets_flat,
                          reduction="none").reshape(bsz, seq_s)
    slot_loss = (nll * mask).sum() / valid_count
    slot_loss.backward()
    slot_opt.step()

# lines 894-903 — recompute nll with optimized delta/logit_bias,
# report it as the score, accumulate into loss_sum
```

Same C3 violation as #1319.

Precedent: this is the same SLOT pattern that #1376 was flagged for by @MatoTeziTanka ("two independent violations"), and the same pattern I posted full line-level evidence on at #1319. Posting a shorter note here since the code is identical up to parameter values (SLOT-48).

Source: parameter-golf-checker, context in #1603. Happy to correct if I'm misreading —
Summary
3-Seed Results
Beats merged SOTA (1.1147) by 0.374 BPB. Beats best pending (#1229, 0.9300) by 0.190 BPB.
What Changed vs PR #1313 (0.8637)
One parameter:
`SLOT_STEPS` increased from 24 to 48. Same model, same training, same architecture.

SLOT Scaling (same model, different step counts)
SLOT-48 Details
Compliance
Reproduction
Training: ~600s. Eval: ~409s. Total: ~17 min.
Credits