
Record: CaseOps Tokenizer + Recurrence Depth Curriculum + Base Arch Stack — val_bpb 1.06505#1756

Open
romeerp wants to merge 4 commits into openai:main from romeerp:codex/loop134-curriculum-main

Conversation


@romeerp romeerp commented Apr 20, 2026

Summary

  • val_bpb 1.06505 (3-seed mean, std 0.00081), val_loss 2.33073 nats/token.
  • Uses the inherited SP8192 base architecture stack together with the CaseOps tokenizer / original-byte byte-sidecar path.
  • The novel contribution in this PR is a deterministic 1 -> 3 -> 4 recurrence-depth curriculum after loop activation, with eval / phased TTT at fixed depth 4.

3-seed results (8xH100 80GB SXM, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | Artifact (bytes) | train_time | eval_time |
|------|-------|-------------|--------------|------------------|------------|-----------|
| 0    | 4599  | 1.07689     | 1.06417      | 15,984,426       | 596.10s    | 470.6s    |
| 42   | 4603  | 1.07792     | 1.06521      | 15,986,579       | 596.15s    | 513.8s    |
| 1234 | 4604  | 1.07836     | 1.06578      | 15,982,914       | 596.14s    | 470.6s    |
| Mean | 4602  | 1.07772     | 1.06505      | 15,984,640       | 596.13s    | 484.98s   |
| Std  |       | 0.00076     | 0.00081      | 1,842            | 0.03s      | 24.9s     |

Compared with PR #1736's 3-seed mean of 1.06549, this improves the final endpoint by 0.00043 BPB.

Specific contribution here

The important new idea here is a curriculum over recurrence depth.

The architecture stack itself is inherited unchanged; this PR changes only how recurrence depth is scheduled during training and evaluation.

Once the loop path is enabled, training follows a deterministic equal-thirds curriculum over total passes through the recurrent loop block:

  • first third at depth 1
  • second third at depth 3
  • final third at depth 4
  • eval / phased TTT at fixed depth 4
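A minimal sketch of such an equal-thirds schedule (the function name and signature are illustrative, not taken from `train_gpt.py`): map each post-activation training step to a phase index, then look up the depth for that phase.

```python
def loop_depth_for_step(step, total_steps, phase_depths=(1, 3, 4)):
    """Deterministic equal-phase depth curriculum.

    Splits [0, total_steps) into len(phase_depths) equal phases and
    returns the recurrence depth for the phase containing `step`.
    """
    n_phases = len(phase_depths)
    # Integer arithmetic keeps the phase boundaries deterministic;
    # clamp so the final step stays in the last phase.
    phase = min(step * n_phases // total_steps, n_phases - 1)
    return phase_depths[phase]
```

With `total_steps=4602` this yields depth 1 for roughly the first 1534 steps, depth 3 for the middle third, and depth 4 thereafter, matching the 1 -> 3 -> 4 schedule described above.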

So the hypothesis is not "train deeper everywhere." It is: teach the recurrent block to act like a scalable refinement operator, then evaluate at the deepest trained depth.

The intended mechanism is:

  • learn a useful shallow refinement operator first
  • recover the normal middle-depth behavior next
  • only in the final phase require the same shared recurrent block to support one additional refinement pass
  • then cash in that extra trained depth at eval / phased TTT

Empirically, that improves the 3-seed mean even though one seed (1234) is slightly worse than PR #1736; the gain comes from seeds 0 and 42 improving more strongly.

Attribution / lineage

Why this is legal

  • The CaseOps tokenizer path remains fully lossless.
  • BPB is still scored on the original raw UTF-8 bytes via the validation byte sidecar, not on transformed text length.
  • Phased TTT remains score-first: each chunk is scored before the LoRA update step.
  • All three seeds stay under the decimal 16,000,000-byte artifact cap.
  • All three seeds stay under both the 600s train and 600s eval budgets.
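For reference, scoring BPB against the original raw bytes reduces to one conversion: summed cross-entropy in nats divided by ln(2) times the raw UTF-8 byte count from the sidecar. A minimal sketch (function name is illustrative):

```python
import math

def bits_per_byte(total_loss_nats, total_raw_bytes):
    """Convert summed cross-entropy (nats) to bits per original UTF-8 byte.

    Normalizing by raw byte count from the sidecar, rather than by
    transformed text length, keeps BPB comparable across tokenizers.
    """
    return total_loss_nats / (math.log(2) * total_raw_bytes)
```

This is why a lossy or length-changing tokenizer cannot game the metric: the denominator is fixed by the validation bytes, not by the tokenization.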

Reproducibility

The new record folder contains:

  • train_gpt.py
  • README.md
  • submission.json
  • train_seed0.log, train_seed42.log, train_seed1234.log
  • lossless_caps.py
  • prepare_caseops_data.py
  • tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

Reproduction is the same as the prior stack except for the added recurrence-curriculum env vars:

  • TRAIN_LOOP_PHASE_DEPTHS=1,3,4
  • TRAIN_LOOP_PREWARM_DEPTHS=3,4
  • EVAL_LOOP_DEPTH=4
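The env-var names above are the ones this PR adds; how `train_gpt.py` parses them is not shown here, so the following is only a plausible sketch of reading a comma-separated depth list and the fixed eval depth:

```python
import os

def parse_depth_list(var_name, default):
    """Parse a comma-separated depth list, e.g. TRAIN_LOOP_PHASE_DEPTHS=1,3,4."""
    raw = os.environ.get(var_name, default)
    return [int(d) for d in raw.split(",") if d.strip()]

phase_depths = parse_depth_list("TRAIN_LOOP_PHASE_DEPTHS", "1,3,4")
prewarm_depths = parse_depth_list("TRAIN_LOOP_PREWARM_DEPTHS", "3,4")
eval_depth = int(os.environ.get("EVAL_LOOP_DEPTH", "4"))
```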

@romeerp romeerp changed the title Record: SP8192 + CaseOps + GatedAttn + QuantGate + 1->3->4 Curriculum + PhasedTTT — val_bpb 1.06505 Record: CaseOps Tokenizer + Recurrence Depth Curriculum + Base Arch Stack — val_bpb 1.06505 Apr 20, 2026
@codemath3000

Running prepare_caseops_data.py as published, then running train_gpt.py with PHASED_TTT_ENABLED=1, reproducibly raises ZeroDivisionError: float division by zero at train_gpt.py:2408 in _loss_bpb_from_sums: byte_sum.item() is 0 because _find_docs (line 2314) returns an empty list. The prep script never inserts BOS markers, and the tokenizer reserves IDs 0–7 (<pad>, <s>, </s>, <unk>, and the four CaseOps operators), so sp.encode can never naturally output id 1. The training loop has a fallback at _init_shard lines 488-489 (if self.bos_idx.size == 0: self.bos_idx = np.array([0], ...)), so training completes, but the phased-TTT eval path has no analogous fallback. (This is the same issue I reported on PR #1736; prepare_caseops_data.py is byte-identical to that PR's, md5 4a987eeca7f1dcb5a3e4ee9e4aea10e4.) Am I missing a prep step, or should prepare_caseops_data.py be prepending bos_id=1 to each doc (matching download_hf_docs_and_tokenize.py:364-366)?
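One way the eval path could mirror the _init_shard fallback described in the report above is to treat position 0 as a document start whenever no BOS id is found. This is a sketch of that guard, not the actual fix shipped in the PR (the function name is illustrative):

```python
import numpy as np

def find_doc_starts(tokens, bos_id=1):
    """Locate document-start indices in a token array.

    Falls back to a single document starting at position 0 when no BOS
    token is present, mirroring the _init_shard fallback so the phased-TTT
    path cannot end up dividing by an empty document list.
    """
    starts = np.flatnonzero(tokens == bos_id)
    if starts.size == 0:
        starts = np.array([0], dtype=np.int64)
    return starts
```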

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 21, 2026
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT — 6th attempt at illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946) — stable looped LMs via spectral norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025) — backs PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586+openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV
@romeerp

romeerp commented Apr 21, 2026

Hi @codemath3000 - Between PR #1729 (which built the actual CaseOps dataset/export I used in this PR) and PR #1736, discrepancies were introduced in the standalone rebuild path. In particular, the published prepare_caseops_data.py no longer exactly matched the original CaseOps export format expected by train_gpt.py during phased TTT.

I’ve now updated prepare_caseops_data.py to match the original dataset I used for training byte-for-byte. Thanks for flagging this.
