SP8192 + Depth Recurrence + Parallel Residuals (14.09MB) #1499
dippatel1994 wants to merge 7 commits into openai:main
Conversation
SP8192 tokenizer, 3-layer depth recurrence (layers 3-5 looped 2x), parallel residuals on layers 7+, U-Net skips, GPTQ int6/int5. 14.09MB artifact. val_bpb=1.6323 on 1xH100 (1/8 competition compute).
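For readers skimming the diff, here is a minimal sketch of how the two architectural changes could compose in the forward pass. This is illustrative only, not the PR's train_gpt.py code: `Block`, `forward_trunk`, and the attn/mlp submodules are assumed names, and the U-Net skips are omitted.

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block with an optional parallel-residual path."""
    def __init__(self, attn, mlp, dim):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp = attn, mlp

    def forward(self, x, parallel=False):
        if parallel:
            # parallel residual: attention and MLP read the same normed input
            h = self.ln1(x)
            return x + self.attn(h) + self.mlp(h)
        # sequential residual (early layers)
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

def forward_trunk(blocks, x, recur_lo=3, recur_hi=5, recur_times=2, parallel_from=7):
    """Depth recurrence: blocks recur_lo..recur_hi are applied recur_times times
    with shared weights; blocks >= parallel_from take the parallel-residual path."""
    i = 0
    while i < len(blocks):
        if i == recur_lo:
            for _ in range(recur_times):            # loop layers 3-5 twice
                for j in range(recur_lo, recur_hi + 1):
                    x = blocks[j](x)
            i = recur_hi + 1
            continue
        x = blocks[i](x, parallel=(i >= parallel_from))
        i += 1
    return x
```

Depth recurrence reuses the same three blocks' weights on the second pass, so it adds effective depth without adding parameters to the 14.09MB artifact.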
- 11 layers (from 10), MLP mult 4.0 (from 3.0)
- SP8192 as default tokenizer
- Depth recurrence layers 3-5 x2 enabled by default
- Parallel residuals on layers 7+ enabled by default
- Weight decay 0.085 (frontier-tuned)
- val_bpb=1.620 on 1xH100 (14.42MB artifact)
- SDClip (k=12.85) for GPTQ scale selection
- MuonEq-R (row-normalized Muon optimizer)
- Pre-quant TTT (10 epochs, AdamW lr=0.00045, cosine decay)
- Brotli compression with byte shuffle (sketched below)
- Delayed depth recurrence (step 3000)
- QK-Gain 5.25, XSA last 4, EMA 0.9965, WD 0.095

8xH100 validated: 915 steps, val_bpb=1.3079, pre-quant TTT loss 3.74->3.06. GPTQ artifact: 12.23 MB (brotli). Sliding eval needs competition infra.
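As a reference for the compression step, the Brotli + byte-shuffle idea is sketched below. This is a generic illustration under assumed details (dtype and packing of the quantized codes), not the artifact writer in this PR: shuffling groups bytes of equal significance together, which makes the quantized weight stream far more repetitive before Brotli sees it.

```python
import brotli
import numpy as np

def byte_shuffle(arr: np.ndarray) -> bytes:
    """Regroup bytes so all low bytes come first, then all high bytes, etc."""
    raw = arr.reshape(-1).view(np.uint8).reshape(-1, arr.dtype.itemsize)
    return raw.T.tobytes()

def byte_unshuffle(data: bytes, dtype, count: int) -> np.ndarray:
    itemsize = np.dtype(dtype).itemsize
    raw = np.frombuffer(data, dtype=np.uint8).reshape(itemsize, count).T
    return np.ascontiguousarray(raw).view(dtype).reshape(count)

# Example: compress a buffer of (already GPTQ-quantized) integer codes.
codes = np.random.randint(-32, 32, size=1 << 20).astype(np.int16)
blob = brotli.compress(byte_shuffle(codes), quality=11)
restored = byte_unshuffle(brotli.decompress(blob), np.int16, codes.size)
assert np.array_equal(codes, restored)
print(len(blob), "bytes after byte-shuffle + brotli")
```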
Full frontier config validated:
- SDClip GPTQ (k=12.85): fixed quantization for QK-Gain 5.25
- MuonEq-R: row-normalized optimizer
- Pre-quant TTT: rank-0 only with weight broadcast (fixed DDP issue; see sketch below)
- Brotli + byte shuffle compression: 14.09 MB artifact
- 2896 steps, val_bpb=1.261 pre-GPTQ, 1.479 post-GPTQ (standard eval)
- On 8xH100 with sliding eval + pre-quant TTT: estimated 1.10-1.20 BPB
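One note on the "rank-0 only with weight broadcast" DDP fix: the pattern it describes is presumably along the lines of the sketch below (function and argument names are illustrative, and the pre-quant TTT itself was later disabled by default per the review further down, but the broadcast pattern is the fix being referenced). Only rank 0 runs the adaptation loop; every other rank then overwrites its weights with rank 0's copy, so quantization and eval see identical parameters on all ranks.

```python
import torch.distributed as dist

def prequant_ttt_rank0_only(model, ttt_fn):
    """Run test-time training on rank 0 only, then broadcast the updated
    parameters/buffers so all DDP ranks hold identical weights afterwards."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        ttt_fn(model)  # e.g. a few AdamW epochs at lr=0.00045 with cosine decay
    if dist.is_initialized():
        for p in model.parameters():
            dist.broadcast(p.data, src=0)   # ranks != 0 receive rank 0's values
        for b in model.buffers():
            dist.broadcast(b, src=0)
        dist.barrier()
    return model
```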
- Added train_seed42_1xH100.log (required by competition rules)
- Updated submission.json with v4 confirmed results (1.4794 BPB)
- Updated README with full technique descriptions and reproduction steps
- Includes all required files: README.md, submission.json, train_gpt.py, train log
Community Review — SP8192 + Depth Recurrence + Parallel Residuals (14.09MB)
BPB: 1.6323 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on `val_tokens`

What I found in the code (at the head SHA): at line 996 the pre-quant TTT function takes `val_tokens` directly. Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see the Issue #677 meta-comment from 2026-04-11, which lists the 6+ PRs in the cluster.

Contrast with the legal Pre-Quant TTT pattern (e.g. the PR #1416 / PR #1423 lineage): those train the adapter on a held-out slice of training data, not `val_tokens`.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=8192, code=74713 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission with the TTT function taking a training-data slice instead of `val_tokens` would address the flag.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based analysis.
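For readers outside the TTT discussions, a rough sketch of the score-first discipline the ruling requires, versus the flagged multi-epoch pattern, is below. Names (`model(x, targets=y)` returning a mean loss, the block size) are assumptions for illustration, not this PR's code.

```python
import torch

def sliding_ttt_score_first(model, optimizer, eval_tokens, block=2048):
    """Legal pattern: every block is SCORED before the adapter trains on it,
    so no reported score ever comes from tokens the model has already fit."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, eval_tokens.numel() - block, block):
        chunk = eval_tokens[start:start + block + 1]
        x, y = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        with torch.no_grad():                      # 1) score first
            total_loss += model(x, targets=y).item() * y.numel()
            total_tokens += y.numel()
        optimizer.zero_grad(set_to_none=True)      # 2) then adapt on that block
        model(x, targets=y).backward()
        optimizer.step()
    return total_loss / total_tokens

# Flagged pattern (what the multi-epoch pre-quant TTT does): train several
# epochs over val_tokens first, then report loss from the final pass --
# every scored token has already been trained on, which #402/#677 rule out.
```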
…#402/openai#677) Pre-quant TTT that trains directly on val_tokens without score-first discipline is non-compliant per community review. Disabled by default (PREQUANT_TTT_ENABLED=0). The function remains in code but is not called unless explicitly enabled. All other techniques (SDClip GPTQ, MuonEq-R, depth recurrence, parallel residuals, brotli, QK-Gain 5.25) are unaffected. Confirmed val_bpb=1.4794 on 1xH100 WITHOUT pre-quant TTT.
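The disable-by-default gating described here is presumably just an environment-variable switch along these lines (the variable name is from the commit message; the wrapper and its arguments are placeholders, not the actual call site):

```python
import os

def maybe_run_prequant_ttt(model, tokens, run_ttt):
    """Call the TTT routine only when the env flag is explicitly set to 1;
    the function stays in the file but is otherwise never invoked."""
    if int(os.environ.get("PREQUANT_TTT_ENABLED", "0")):
        run_ttt(model, tokens)   # opt-in path, excluded from the 1.4794 BPB run
    return model
```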
Thanks @MatoTeziTanka for the detailed review and for catching the compliance issue. Fixed in commit fd9bde7: pre-quant TTT is now disabled by default (`PREQUANT_TTT_ENABLED=0`); the function remains in the code but is not called unless explicitly enabled. The confirmed val_bpb=1.4794 was already measured without pre-quant TTT (the 1xH100 test run had it disabled). All other techniques (SDClip GPTQ, MuonEq-R, depth recurrence, parallel residuals, brotli, QK-Gain 5.25) are compliant and unchanged.
Happy to make further adjustments if needed.
Training now stops at 590s (600s - 10s reserve), leaving time for GPTQ compression to complete within the total budget. Matches the pattern from PR openai#1487 (gptq_reserve_seconds=10).
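A sketch of what that budget check typically looks like (constants are from the commit message; the loop and the `train_step` / `run_gptq_and_save` names are placeholders, not the PR's code):

```python
import time

TOTAL_BUDGET_S = 600.0        # competition wall-clock budget
GPTQ_RESERVE_S = 10.0         # tail reserved for GPTQ + artifact writing

def train_with_reserve(train_step, run_gptq_and_save, max_steps):
    deadline = time.time() + TOTAL_BUDGET_S - GPTQ_RESERVE_S   # ~590s of training
    for step in range(max_steps):
        if time.time() >= deadline:
            break                          # stop early so compression still fits
        train_step(step)
    run_gptq_and_save()                    # runs inside the reserved 10s tail
```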
Re-audited at the head SHA.

Gauntlet result (CT2038, Python 3.10, torch 2.10.0+cpu): IMPORT_OK, MODEL_OK, FORWARD_OK. Pre-quant TTT fix confirmed. Line 1300: the pre-quant TTT call is gated behind `PREQUANT_TTT_ENABLED`, which defaults to 0.

What the code does at default competition settings: the flagged pre-quant TTT never runs; the active eval-time mechanisms are sliding TTT (lines 942-971) and an n-gram cache (line 789).

Updated verdict: LOOKS CLEAN. The pre-quant TTT that was flagged is disabled. The two active eval-time mechanisms (sliding TTT + n-gram cache) are both legal — score-first TTT matches #1413, and the n-gram uses context-only keys matching #803.

Citation correction: my original review cited #1416/#1423 as "the legal Pre-Quant TTT pattern." That was wrong — both have the illegal flat-epoch pattern. The correct legal TTT reference is PR #1413 (dexhunter).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks. Note that the reported 1.4794 BPB is a conservative 1×H100 standard-eval baseline — the 8×H100 score with sliding TTT + n-gram active will be lower. Thanks for the fast turnaround @dippatel1994.

Re-audit by @MatoTeziTanka. CPU gauntlet on CT2038 (Python 3.10, torch 2.10.0+cpu): IMPORT_OK, MODEL_OK, FORWARD_OK. Full code review: pre-quant TTT disabled (line 1300), sliding TTT is score-first (lines 942-971), n-gram cache uses context-only keys (line 789).
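"Context-only keys" in the n-gram cache means lookups are keyed purely on tokens that precede the position being scored, so the prediction never peeks at the token it is being evaluated on. A minimal illustration (not the line-789 implementation):

```python
from collections import defaultdict

class NGramCache:
    """Count-based next-token cache keyed only on the preceding n-1 tokens."""
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self, context):
        # key built from already-seen context tokens only
        key = tuple(context[-(self.n - 1):])
        dist = self.counts.get(key)
        return dict(dist) if dist else None

    def update(self, context, next_token):
        # called only after the position has been scored
        key = tuple(context[-(self.n - 1):])
        self.counts[key][next_token] += 1
```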
Hi @MatoTeziTanka, the competition ends in 6 days and this PR still isn't approved. I wanted to see my score to plan the next submission but couldn't, since the PR isn't merged. Any idea who on the team will merge this submission?
Summary
SP8192 tokenizer (kevclark/parameter-golf) — 8192-vocab BPE for lower tokens-per-byte.
Stacked with: BigramHash(10240), SmearGate, EMA(0.997), LeakyReLU squared, GQA(8q/4kv), partial RoPE(16), value residual, XSA(last 4), orthogonal init, Muon+AdamW, late QAT.
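Of the stacked pieces, GQA(8q/4kv) with partial RoPE(16) is the easiest to pin down mechanically; a rough sketch is below (head counts and rotary dim come from the list above; tensor shapes and the cos/sin layout are assumptions, not this repo's attention code):

```python
import torch
import torch.nn.functional as F

def gqa_partial_rope(q, k, v, cos, sin, rope_dim=16):
    """q: (B, 8, T, Dh); k, v: (B, 4, T, Dh); cos, sin: (T, rope_dim // 2).
    Rotary embedding is applied to the first rope_dim channels only, and each
    of the 4 kv heads is shared by 2 query heads."""
    def rope(x):
        xr, xp = x[..., :rope_dim], x[..., rope_dim:]     # rotated / pass-through
        x1, x2 = xr[..., 0::2], xr[..., 1::2]
        rot = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return torch.cat((rot.flatten(-2), xp), dim=-1)
    q, k = rope(q), rope(k)
    k = k.repeat_interleave(2, dim=1)                     # 4 kv heads -> 8
    v = v.repeat_interleave(2, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```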
Results
Reproduction
Ablation (sp1024, 2-min, 1xH100)
Test plan