Record: SLOT-32 + Partial Depth Recurrence — val_bpb 0.7736 (3-seed mean) #1278
GitGeeks wants to merge 12 commits into openai:main
Conversation
- MLX recurrence prototype (train_gpt_mlx_recurrence.py): validated; 3Lx3R matches baseline val_bpb with 2.8x fewer params, and 3Lx7R beats baseline by 0.014 BPB at 21 effective depth
- CUDA recurrence script (train_gpt_recurrence.py): ready for GPU throughput testing, backward-compatible with NUM_REPEATS=1
- Updated submission README with experimental results tables

Key results:

| Config | val_bpb | Params | Compressed |
| --- | --- | --- | --- |
| Baseline 9L | 3.2273 | 17.1M | 5.1 MB |
| Rec 3Lx3R | 3.2264 | 6.0M | 1.87 MB |
| Rec 3Lx7R | 3.2134 | 6.1M | 1.89 MB |

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
H100 SXM benchmarks confirm recurrence overhead is negligible:

- Baseline 9x1: 518 ms/step (1.00x)
- Rec 3x3: 508 ms/step (0.98x, actually faster)
- Rec 3x5: 550 ms/step (1.06x)
- Rec 4x5: 693 ms/step (1.34x)

20 effective layers at only 1.34x overhead. Proceeding to Phase 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- train_gpt_sota_slot.py: SOTA PR openai#1303 baseline (SLOT + XSA-11 + QK-Gain 4.0 + VRL)
- train_gpt_slot_recurrence.py: SOTA + partial depth recurrence with per-iteration conditioning

RECUR_LAYERS=4,5 RECUR_START_STEP=3000 activates recurrence. Default (no env vars) = exact SOTA behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add missing _COMPRESSOR variable (zstd/lzma detection)
- Set recurrence_active=True on eval model so SLOT sees recurrent hidden states

Both fixes required for correct full-length runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
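The diff itself isn't shown here; a minimal sketch of what zstd/lzma detection for _COMPRESSOR typically looks like (the variable name comes from the commit message; the detection logic and compression levels are assumptions):

```python
import lzma

# Prefer zstd when the python-zstandard package is installed, otherwise
# fall back to the stdlib's lzma. The _COMPRESSOR name is from the commit;
# everything else is an illustrative guess.
try:
    import zstandard
    _COMPRESSOR = "zstd"
except ImportError:
    _COMPRESSOR = "lzma"

def compress(data: bytes) -> bytes:
    if _COMPRESSOR == "zstd":
        return zstandard.ZstdCompressor(level=19).compress(data)
    return lzma.compress(data, preset=9)
```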
- Set cache_size_limit=32 to allow graph recompilation when recurrence activates
- Use fullgraph=False to avoid FailOnRecompileLimitHit during eval
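For reference, the workaround above maps to two configuration lines (a sketch; the toy model stands in for the real one):

```python
import torch
import torch._dynamo

# Allow up to 32 recompiles so the graph can rebuild when recurrence
# activates at RECUR_START_STEP, and relax fullgraph so eval-time graph
# breaks don't raise FailOnRecompileLimitHit.
torch._dynamo.config.cache_size_limit = 32

model = torch.nn.Linear(8, 8)                   # stand-in for the GPT model
model_c = torch.compile(model, fullgraph=False)
```

As the next commit notes, this workaround costs roughly 28% throughput; the static-graph rewrite below removes the need for it.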
Adds RECUR_LAYERS=4,5 support with per-iteration conditioning (iter_embed + iter_gate) on repeated layers. Delayed activation via RECUR_START_STEP. Compatible with SLOT eval.
Graph never changes during training, so torch.compile fullgraph=True works. Removes RECUR_START_STEP delayed activation (was causing recompilation). Recurrence is always on if RECUR_LAYERS is set, off otherwise. Fixes ~28% throughput penalty from previous fullgraph=False workaround. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
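The two recurrence commits above describe the mechanism without showing the diff; below is a minimal sketch of partial depth recurrence with per-iteration conditioning, written so the repeat count is fixed at startup and the traced graph never changes. The names iter_embed, iter_gate, RECUR_LAYERS, and NUM_REPEATS come from the commit messages; the module structure and gating details are assumptions, not the PR's actual code.

```python
import os
import torch
import torch.nn as nn

class RecurrentBlockGroup(nn.Module):
    """Sketch: repeat a fixed slice of layers a constant number of times,
    adding a gated per-iteration embedding so each pass is conditioned on
    its iteration index. Because the repeat count is a plain Python int
    fixed before compilation, the loop unrolls into a static graph and
    torch.compile(..., fullgraph=True) never needs to recompile."""

    def __init__(self, layers: nn.ModuleList, dim: int, num_repeats: int):
        super().__init__()
        self.layers = layers               # e.g. the RECUR_LAYERS=4,5 slice
        self.num_repeats = num_repeats
        self.iter_embed = nn.Embedding(num_repeats, dim)   # one vector per pass
        self.iter_gate = nn.Parameter(torch.zeros(num_repeats, 1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, T, dim)
        for i in range(self.num_repeats):  # unrolled at trace time
            # condition the repeated pass on its iteration index
            x = x + torch.sigmoid(self.iter_gate[i]) * self.iter_embed.weight[i]
            for layer in self.layers:
                x = layer(x)
        return x

# Read once at startup; NUM_REPEATS=1 reproduces the non-recurrent baseline.
NUM_REPEATS = int(os.environ.get("NUM_REPEATS", "1"))
```

With the delayed RECUR_START_STEP activation removed, this count never changes mid-run, which is exactly what the commit above relies on.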
3-seed validation on 8xH100 SXM (Vast.ai):

- Seed 1337: 0.8664 BPB (15.73 MB)
- Seed 42: 0.8637 BPB (15.67 MB)
- Seed 314: 0.8643 BPB (15.75 MB)
- Mean: 0.8648 BPB (std 0.0014)

Partial depth recurrence (layers 4,5 repeated with per-iteration conditioning) + SLOT-24 eval + XSA-11 + QK-Gain 4.0 + VRL base.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
NEW SOTA. Beats PR openai#1313 (0.8637) by 0.0901 BPB.

3-seed validation on 8xH100 SXM (Vast.ai):

- Seed 42: 0.7732 BPB (15.66 MB)
- Seed 1337: 0.7764 BPB (15.73 MB)
- Seed 314: 0.7713 BPB (15.73 MB)
- Mean: 0.7736 BPB (std 0.0026)

SLOT-32 (32 AdamW steps, LR=0.015) + partial depth recurrence (layers 4,5 with per-iteration conditioning) + XSA-11 + QK-Gain 4.0 + VRL + BigramHash + EMA/SWA + Late QAT + int6+LZMA.

Author: Arnell Milhouse (@GitGeeks)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adjacent eval windows overlap by ~96%, so each window's optimal delta is close to the previous one's. Warmstarting at alpha=0.85 (initializing from 0.85x the previous window's solution) means each window's SLOT optimization starts near its optimum. SLOT_WARMSTART=0.85 enables it (default 0 = disabled). Based on PR openai#1319's approach (0.6951 BPB with warmstart + 64 steps). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
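A minimal sketch of the warmstart idea as described (the function shape, the model.embed/model.head accessors, and the masked-NLL objective are illustrative guesses assembled from this thread, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def slot_eval_window(model, tokens, mask, prev_delta=None,
                     steps=32, lr=0.015, alpha=0.85):
    """Sketch: SLOT on one eval window. Optimizes a per-window hidden-state
    delta against the masked NLL; warmstart initializes the delta from
    alpha * the previous (96%-overlapping) window's solution."""
    with torch.no_grad():
        hidden = model.embed(tokens)      # assumed accessor: (T, dim) states

    delta = torch.zeros_like(hidden)
    if prev_delta is not None and prev_delta.shape == delta.shape:
        delta.copy_(alpha * prev_delta)   # start near the previous optimum
    delta.requires_grad_(True)

    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):
        logits = model.head(hidden + delta)   # assumed accessor
        nll = F.cross_entropy(logits[:-1], tokens[1:], reduction="none")
        loss = (nll * mask[1:]).sum() / mask[1:].sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```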
Community Review — SLOT-32 + Partial Depth Recurrence

BPB: 0.7736 (3-seed mean, std 0.0026) | Seeds: 3 (42/1337/314) | Artifact: ~15.71 MB reported / 4.64 MB in CPU gauntlet at int6+lzma | Compliance: FLAG (SLOT legality)

What this does: Standard SLOT at eval time (32 AdamW steps optimizing a per-sample hidden-state delta against the masked per-token NLL).

What I found in the code:
The mask used for optimization (scored positions [s:wlen]) is identical to the scoring slice.

Gauntlet (CPU, int6+lzma): import OK, forward OK (loss=6.9445), 26.86M params, artifact 4.64 MB / 16 MB (29.0% of budget, 11.36 MB headroom; measured on the CPU path, which skips some components). All 10 checks pass.

Credits: README credits PR #1204 (@msisovic) and PR #1260 (@dexhunter) for depth recurrence, PR #1176/#1229 and arXiv:2505.12392v2 for SLOT, and PR #1303/#1019 for the base stack. Attribution looks clean.

Questions/flags:
Verdict: COMPLIANCE FLAG — standard SLOT; hinges on the #1336 ruling.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: HOLD pending Issue #1336 ruling on SLOT legality. If standard SLOT is ruled illegal, this PR (and the rest of the SLOT-16/24/32/48/64 cluster) would be closed. If causal/context-only SLOT is ruled legal but standard SLOT is not, this PR as written would still fall on the illegal side, because the mask and the scoring slice are identical (see the sketch below).

Reviewed by @MatoTeziTanka — The Agora.

CPU gauntlet (int6+lzma path): ALL PASS — import OK, forward OK (loss=6.9445), 26.86M params, artifact 4.64 MB / 16 MB (29.0% of budget).

AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at the reviewed SHA.
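To make the compliance question concrete, here is a toy illustration (all numbers hypothetical) of why "mask identical to scoring slice" matters: the SLOT objective and the reported loss select exactly the same positions, so SLOT descends directly on the quantity being scored.

```python
import torch

# Hypothetical window: wlen positions, of which [s, wlen) are scored.
wlen, s = 2048, 64
per_token_nll = torch.rand(wlen)   # stand-in for the model's per-token NLLs

mask = torch.zeros(wlen)
mask[s:wlen] = 1.0                 # optimization mask == scoring slice

slot_objective = (per_token_nll * mask).sum() / mask.sum()
scored_loss = per_token_nll[s:wlen].mean()
assert torch.isclose(slot_objective, scored_loss)  # same positions, same value
```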
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash ^ (target * primes[k])) & mask) — the target token is hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued). See the sketch after this message.
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against (per_token_nll * mask), where mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889, the anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
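For readers outside the review thread, the ruled-illegal n-gram pattern is easy to demonstrate. Only the quoted full_key expression comes from the PRs; the constants and the lookup below are hypothetical. Because the candidate target enters the key, an eval-time lookup amounts to probing labels:

```python
# Quoted pattern: full_key = ((ctx_hash ^ (target * primes[k])) & mask)
primes = [1000003, 10000019]   # hypothetical values
mask = (1 << 24) - 1           # hypothetical table size

def full_key(ctx_hash: int, target: int, k: int) -> int:
    return (ctx_hash ^ (target * primes[k])) & mask

# Cache built from data that already contains the targets...
cache = {full_key(12345, 42, 0)}

# ...so scoring a context reduces to enumerating candidate labels:
hits = [t for t in range(100) if full_key(12345, t, 0) in cache]
print(hits)  # [42]: the lookup key itself reveals the target token
```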
Record: SLOT-32 + Partial Depth Recurrence — val_bpb 0.7736
val_bpb: 0.7736 (3-seed mean, std 0.0026) | ~15.71 MB | 8xH100 SXM, 600s
Beats current SOTA (PR #1313, 0.8637 BPB) by 0.0901 BPB.
3-Seed Results

| Seed | val_bpb | Artifact |
| --- | --- | --- |
| 42 | 0.7732 | 15.66 MB |
| 1337 | 0.7764 | 15.73 MB |
| 314 | 0.7713 | 15.73 MB |
| Mean | 0.7736 (std 0.0026) | ~15.71 MB |
Key Techniques

- SLOT-32: 32 AdamW steps at eval time, LR=0.015
- Partial depth recurrence: layers 4,5 repeated with per-iteration conditioning (iter_embed + iter_gate)
- XSA-11 + QK-Gain 4.0 + VRL base stack (PR openai#1303)
- BigramHash + EMA/SWA + Late QAT
- int6 + LZMA compression
Reproduction