Record: CaseOps Tokenizer + Recurrence Depth Curriculum + Base Arch Stack — val_bpb 1.06505#1756
romeerp wants to merge 4 commits into openai:main
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT, 6th attempt at an illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946): stable looped LMs via a spectral-norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025): supports PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586 + openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV
Hi @codemath3000 - Between PR #1729 (which built the actual CaseOps dataset/export I used in this PR) and PR #1736, discrepancies were introduced in the standalone rebuild path. In particular, the published I've now updated
Summary
3-seed results (8xH100 80GB SXM, 10-min train / 10-min eval budgets)
Compared with PR #1736's 3-seed mean of 1.06549, this improves the final endpoint by 0.00043 BPB.
Specific contribution here
The important new idea here is curriculum recurrence depth.
The inherited stack already existed:
This PR changes how recurrence depth is used during training and evaluation.
Once the loop path is enabled, training follows a deterministic equal-thirds curriculum over total passes through the recurrent loop block:
depth 1 for the first third of steps, depth 3 for the second third, and depth 4 for the final third; evaluation then runs at a fixed depth of 4.

So the hypothesis is not "train deeper everywhere." It is: teach the recurrent block to act like a scalable refinement operator, then evaluate at the deepest trained depth.
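A minimal sketch of such an equal-thirds schedule (the function name and signature are hypothetical; the actual implementation lives in `train_gpt.py`):

```python
def loop_depth_for_step(step: int, total_steps: int,
                        phase_depths=(1, 3, 4)) -> int:
    """Deterministic equal-phase curriculum: the run is split into
    len(phase_depths) equal phases, each trained at a fixed loop depth."""
    n_phases = len(phase_depths)
    # Index of the current phase; clamp so late/extra steps stay in the last phase.
    phase = min(step * n_phases // total_steps, n_phases - 1)
    return phase_depths[phase]

# First third trains at depth 1, second at depth 3, final at depth 4.
schedule = [loop_depth_for_step(s, 9) for s in range(9)]
# schedule == [1, 1, 1, 3, 3, 3, 4, 4, 4]
```

Evaluation then simply calls the model with the deepest value in `phase_depths` rather than consulting the schedule.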
The intended mechanism is that each pass through the loop block applies the same weight-tied refinement step, so training at progressively larger depths teaches that step to keep improving the representation when applied more times.
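As a sketch, a weight-tied loop block that can be unrolled to a variable depth might look like the following (module names and structure are hypothetical, not the PR's actual architecture):

```python
import torch
import torch.nn as nn

class LoopBlock(nn.Module):
    """One weight-tied refinement step, applied `depth` times.
    Training varies `depth` per the curriculum; eval fixes it at the
    deepest trained value (4 in this PR)."""
    def __init__(self, dim: int):
        super().__init__()
        self.step = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        for _ in range(depth):
            # Residual refinement with the SAME weights on every pass.
            x = x + self.step(self.norm(x))
        return x

block = LoopBlock(64)
x = torch.randn(2, 8, 64)
y_shallow = block(x, depth=1)
y_deep = block(x, depth=4)  # same parameters, more refinement passes
```

Because the weights are shared across passes, changing `depth` changes compute but not parameter count, which is what makes depth a curriculum knob rather than an architecture change.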
Empirically, that improves the 3-seed mean even though one seed (1234) is slightly worse than PR #1736; the gain comes from seeds 0 and 42 improving more strongly.

Attribution / lineage
New here: the 1 -> 3 -> 4 recurrence-depth curriculum with fixed-depth-4 eval.

Why this is legal
Reproducibility
The new record folder contains:
- train_gpt.py
- README.md
- submission.json
- train_seed0.log, train_seed42.log, train_seed1234.log
- lossless_caps.py
- prepare_caseops_data.py
- tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

Reproduction is the same as the prior stack except for the added recurrence-curriculum env vars:
- `TRAIN_LOOP_PHASE_DEPTHS=1,3,4`
- `TRAIN_LOOP_PREWARM_DEPTHS=3,4`
- `EVAL_LOOP_DEPTH=4`
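Assuming the training script reads these as comma-separated integer lists (the parsing below is a sketch with hypothetical helper names, not the PR's exact code):

```python
import os

def _depth_list(name: str, default: str) -> list[int]:
    """Parse a comma-separated list of loop depths from the environment."""
    return [int(d) for d in os.environ.get(name, default).split(",")]

# Defaults mirror the values documented for this record.
os.environ.setdefault("TRAIN_LOOP_PHASE_DEPTHS", "1,3,4")
os.environ.setdefault("TRAIN_LOOP_PREWARM_DEPTHS", "3,4")
os.environ.setdefault("EVAL_LOOP_DEPTH", "4")

phase_depths = _depth_list("TRAIN_LOOP_PHASE_DEPTHS", "1,3,4")    # curriculum phases
prewarm_depths = _depth_list("TRAIN_LOOP_PREWARM_DEPTHS", "3,4")  # depths pre-warmed ahead of use
eval_depth = int(os.environ["EVAL_LOOP_DEPTH"])                   # fixed depth at eval time
```

With no overrides set, this yields `phase_depths == [1, 3, 4]`, `prewarm_depths == [3, 4]`, and `eval_depth == 4`, matching the record's configuration.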