Record: CaseOps Tokenizer + Recurrence Depth Curriculum + Base Arch Stack — val_bpb 1.06505#1756
romeerp wants to merge 4 commits into openai:main
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT, 6th attempt at an illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946): stable looped LMs via a spectral-norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025): supports PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586 + openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV
Hi @codemath3000 - Between PR #1729 (which built the actual CaseOps dataset/export I used in this PR) and PR #1736, discrepancies were introduced in the standalone rebuild path. In particular, the published I've now updated
Summary
3-seed results (8xH100 80GB SXM, 10-min train / 10-min eval budgets)
Compared with PR #1736's 3-seed mean of 1.06549, this improves the final endpoint by 0.00043 BPB.
Specific contribution here
The important new idea here is curriculum recurrence depth.
The inherited stack already existed:
This PR changes how recurrence depth is used during training and evaluation.
Once the loop path is enabled, training follows a deterministic equal-thirds curriculum over total passes through the recurrent loop block:
depth 1 for the first third of steps, depth 3 for the second third, and depth 4 for the final third; evaluation then runs at a fixed depth of 4.

So the hypothesis is not "train deeper everywhere." It is: teach the recurrent block to act like a scalable refinement operator, then evaluate at the deepest trained depth.
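A minimal sketch of such an equal-thirds schedule (the function name and signature are hypothetical; the actual implementation lives in `train_gpt.py`):

```python
def loop_depth_for_step(step: int, total_steps: int,
                        phase_depths=(1, 3, 4)) -> int:
    """Deterministic equal-phase curriculum: the run is split into
    len(phase_depths) equal phases, each trained at a fixed loop depth."""
    n_phases = len(phase_depths)
    # Index of the current phase; clamp so late/extra steps stay in the last phase.
    phase = min(step * n_phases // total_steps, n_phases - 1)
    return phase_depths[phase]

# First third trains at depth 1, second at depth 3, final at depth 4.
schedule = [loop_depth_for_step(s, 9) for s in range(9)]
# schedule == [1, 1, 1, 3, 3, 3, 4, 4, 4]
```

Evaluation then simply calls the model with the deepest value in `phase_depths` rather than consulting the schedule.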
The intended mechanism is that each pass through the loop block applies the same weight-tied refinement step, so training at progressively larger depths teaches that step to keep improving the representation when applied more times.
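As a sketch, a weight-tied loop block that can be unrolled to a variable depth might look like the following (module names and structure are hypothetical, not the PR's actual architecture):

```python
import torch
import torch.nn as nn

class LoopBlock(nn.Module):
    """One weight-tied refinement step, applied `depth` times.
    Training varies `depth` per the curriculum; eval fixes it at the
    deepest trained value (4 in this PR)."""
    def __init__(self, dim: int):
        super().__init__()
        self.step = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        for _ in range(depth):
            # Residual refinement with the SAME weights on every pass.
            x = x + self.step(self.norm(x))
        return x

block = LoopBlock(64)
x = torch.randn(2, 8, 64)
y_shallow = block(x, depth=1)
y_deep = block(x, depth=4)  # same parameters, more refinement passes
```

Because the weights are shared across passes, changing `depth` changes compute but not parameter count, which is what makes depth a curriculum knob rather than an architecture change.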
Empirically, that improves the 3-seed mean even though one seed (1234) is slightly worse than PR #1736; the gain comes from seeds 0 and 42 improving more strongly.

Attribution / lineage
New here: the 1 -> 3 -> 4 recurrence-depth curriculum with fixed-depth-4 eval.

Why this is legal
Reproducibility
The new record folder contains:
- train_gpt.py
- README.md
- submission.json
- train_seed0.log, train_seed42.log, train_seed1234.log
- lossless_caps.py
- prepare_caseops_data.py
- tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

Reproduction is the same as the prior stack except for the added recurrence-curriculum env vars:
- `TRAIN_LOOP_PHASE_DEPTHS=1,3,4`
- `TRAIN_LOOP_PREWARM_DEPTHS=3,4`
- `EVAL_LOOP_DEPTH=4`
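Assuming the training script reads these as comma-separated integer lists (the parsing below is a sketch with hypothetical helper names, not the PR's exact code):

```python
import os

def _depth_list(name: str, default: str) -> list[int]:
    """Parse a comma-separated list of loop depths from the environment."""
    return [int(d) for d in os.environ.get(name, default).split(",")]

# Defaults mirror the values documented for this record.
os.environ.setdefault("TRAIN_LOOP_PHASE_DEPTHS", "1,3,4")
os.environ.setdefault("TRAIN_LOOP_PREWARM_DEPTHS", "3,4")
os.environ.setdefault("EVAL_LOOP_DEPTH", "4")

phase_depths = _depth_list("TRAIN_LOOP_PHASE_DEPTHS", "1,3,4")    # curriculum phases
prewarm_depths = _depth_list("TRAIN_LOOP_PREWARM_DEPTHS", "3,4")  # depths pre-warmed ahead of use
eval_depth = int(os.environ["EVAL_LOOP_DEPTH"])                   # fixed depth at eval time
```

With no overrides set, this yields `phase_depths == [1, 3, 4]`, `prewarm_depths == [3, 4]`, and `eval_depth == 4`, matching the record's configuration.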