Record: Casefold Tokenizer + Parallel Residuals + Systems Optimization — val_bpb 1.0639 (3-seed mean) by codemath3000 · Pull Request #1585 · openai/parameter-golf

codemath3000 · 2026-04-13T02:49:43Z

Summary

val_bpb: 1.0639 (3-seed mean, std 0.0006) | 8xH100 SXM, 600s | Legal TTT
Systems-level optimizations on PR #1578's casefold tokenizer + PR #1529's parallel residuals: fused Muon kernel, batched EMA, loader prealloc
Identical ML; faster step time yields extra training steps in the same 600s budget
Clears the 0.005-nat threshold vs the PR #1578 baseline (delta: -0.0083 nats)
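The nat delta quoted above can be roughly reproduced from the BPB figures. The sketch below is illustrative only. It assumes the quoted nats are per-token validation loss, that this PR and PR #1578 share the same bytes-per-token ratio (they use the same tokenizer), and that the ratio can be backed out of the "1.0639 BPB / 3.0705 nats" pair quoted in the commit message. The function names are hypothetical and not taken from the submission.

```python
import math

# Figures quoted in this PR thread (all rounded to 4 decimals).
THIS_BPB = 1.0639       # 3-seed mean val_bpb of this PR
THIS_NATS = 3.0705      # 3-seed mean loss in nats, from the commit message
BASE_BPB = 1.0668       # headline val_bpb of PR #1578
THRESHOLD_NATS = 0.005  # record threshold named in the summary


def bytes_per_token(bpb: float, nats_per_token: float) -> float:
    """Back out bytes/token from bpb = (nats_per_token / ln 2) / bytes_per_token."""
    return nats_per_token / (math.log(2) * bpb)


def bpb_to_nats(bpb: float, bpt: float) -> float:
    """Convert bits-per-byte to nats-per-token for a given bytes/token ratio."""
    return bpb * math.log(2) * bpt


# Assumption: the same ratio applies to the baseline because the tokenizer is shared.
bpt = bytes_per_token(THIS_BPB, THIS_NATS)
delta_nats = THIS_NATS - bpb_to_nats(BASE_BPB, bpt)
clears = -delta_nats >= THRESHOLD_NATS

print(f"bytes/token ~ {bpt:.3f}, delta ~ {delta_nats:.4f} nats, clears threshold: {clears}")
```

Because every input is rounded to four decimals, the result can differ from the quoted -0.0083 nats in the last digit. The comparison against the threshold is unaffected.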

Submission series: This PR is one of three related submissions applying the same systems optimizations to different base stacks (PR #1493, PR #1529, PR #1578). We submit against multiple bases so that a ready-to-merge option exists regardless of how the pending PRs are resolved. Judges should feel free to evaluate whichever base(s) they consider valid and disregard the rest.

Results

Seed	TTT BPB	Artifact (bytes)
1337	1.0646	15,985,530
2024	1.0634	15,980,244
42	1.0639	15,982,918
Mean	1.0639	15,982,897
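As a sanity check on the table, the summary statistics can be recomputed from the per-seed rows. This is a minimal sketch, assuming the reported std is the sample standard deviation and that "16MB" means 16,000,000 bytes. Neither assumption is confirmed in this PR.

```python
import statistics

# Per-seed results copied from the table above: seed -> (TTT BPB, artifact bytes).
runs = {
    1337: (1.0646, 15_985_530),
    2024: (1.0634, 15_980_244),
    42: (1.0639, 15_982_918),
}
LIMIT_BYTES = 16_000_000  # assumption: decimal megabytes

bpbs = [bpb for bpb, _ in runs.values()]
sizes = [size for _, size in runs.values()]

mean_bpb = statistics.mean(bpbs)
std_bpb = statistics.stdev(bpbs)  # sample std (n - 1 in the denominator)
mean_size = statistics.mean(sizes)
all_under_limit = all(size < LIMIT_BYTES for size in sizes)

print(f"mean BPB {mean_bpb:.5f}, std {std_bpb:.4f}")
print(f"mean artifact {mean_size:,.0f} bytes, all under limit: {all_under_limit}")
```

The mean recomputed from the rounded per-seed values sits very close to the 1.0640 rounding boundary, so the reported 1.0639 was presumably computed from unrounded per-seed scores.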

Tokenizer

Casefold v2 vocabulary from PR #1578: SP8192 retrained on NFKC + lowercased text, ~10.4% better compression. Byte counting verified correct on 15.4M FineWeb docs (0 mismatches). See CASEFOLD_TOKENIZER.md and verify_bytes.py.
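The normalization and the byte-count check described above can be illustrated with a short sketch. It is not the code in verify_bytes.py. It assumes the normalization is NFKC followed by `str.lower()`, and it stands in a toy whitespace-keeping regex splitter (hypothetical, not SP8192) for the real tokenizer. The invariant it checks is that the UTF-8 bytes attributed to the tokens sum to the UTF-8 byte length of the normalized document.

```python
import re
import unicodedata


def casefold_normalize(text: str) -> str:
    """NFKC-normalize, then lowercase (assumed reading of 'NFKC + lowercased')."""
    return unicodedata.normalize("NFKC", text).lower()


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in tokenizer: pieces keep their leading whitespace."""
    return re.findall(r"\s*\S+|\s+", text)


def count_mismatches(docs: list[str]) -> int:
    """Count documents whose per-token byte total differs from the document byte length."""
    mismatches = 0
    for doc in docs:
        norm = casefold_normalize(doc)
        token_bytes = sum(len(piece.encode("utf-8")) for piece in toy_tokenize(norm))
        if token_bytes != len(norm.encode("utf-8")):
            mismatches += 1
    return mismatches


# Fullwidth "Hello" and the "fi" ligature both fold to plain ASCII under NFKC.
sample = "\uff28\uff45\uff4c\uff4c\uff4f \ufb01ne"
print(casefold_normalize(sample))
print(count_mismatches([sample, "Mixed  CASE text ", ""]))
```

Whether BPB should be counted against the original or the normalized bytes is part of the legality question raised in the test plan. This sketch takes no position on it.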

CUTLASS EVT Build

Required for full throughput. Source included in cutlass_evt_fusion/:

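git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
# Check out the pinned commit in a subshell so the working directory does not change
(cd /opt/cutlass && git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157)
# Run from the directory that contains cutlass_evt_fusion/
pip install --no-build-isolation ./cutlass_evt_fusion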

Test plan

3-seed training on 8xH100 SXM (seeds 1337, 2024, 42)
All artifacts under 16MB
All runs under 600s training + 600s eval
Round-trip quantization + TTT verified
Byte counting verified on full FineWeb corpus
Judges verify reproducibility
Judges confirm case normalization legality

🤖 Generated with Claude Code

…0639 Systems-level optimizations (fused Muon, EMA foreach, loader prealloc) on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals. Identical ML; faster step time yields extra training steps. 3-seed mean: 1.0639 BPB / 3.0705 nats. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…ai#1586 per-layer GPTQ highest-EV
- PR openai#758 n-gram effectively dead: MatoTeziTanka (Apr 12) flagged XOR hash includes target token, same illegality as openai#727/openai#741
- GDN-Hybrid BPB bug confirmed: PR openai#1576 space-token double-count inflates denominator ~14%; actual score ~1.16-1.18, not 1.01671
- PR openai#1586 (dexhunter, 1.07493): Per-Layer Adaptive GPTQ MLP=12σ/Attn=13σ + int7 Emb (saves 530KB) + MLR=0.026; -0.0127 nats vs SOTA; implement now
- PR openai#1584: systems-only (fused Muon, batched EMA, loader prealloc) ~+20 steps
- Casefold Tokenizer (openai#1578/openai#1585): legality debated; await organizer ruling
- New paper: arXiv:2604.06169 In-Place TTT (Apr 7) NTP-aligned score-first TTT
- Merged SOTA 1.0810 unchanged (4-day stable streak); target ≤1.0760; 17 days
https://claude.ai/code/session_01BE8wc8zxvZAo52QBXSNiL8

…TTT — val_bpb 1.05733 (3-seed mean)
Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base. Zero-init gates (identity at init) add 1,056 + 13 parameters total.
- Seed 42: val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0: val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats
Casefold legality pending organizer review at Issue openai#1604. AttnOutGate and SmearGate are pure architectural additions and comply with all Issue openai#1017 conditions (causality, normalized distribution, score-before-update, single pass).

This was referenced Apr 13, 2026
New record submissions for review (#1583, #1584, #1585) #1587 (Open)
Record: Custom Casefold Tokenizer — 1.0668 BPB #1578 (Open)

dexhunter mentioned this pull request Apr 17, 2026
Record: Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT — val_bpb 1.05733 (3-seed mean) #1693 (Open, 7 tasks)

codemath3000 wants to merge 1 commit into openai:main from codemath3000:submission/systems-opt-casefold