Record: Casefold Tokenizer + Parallel Residuals + Systems Optimization — val_bpb 1.0639 (3-seed mean) #1585

Open
codemath3000 wants to merge 1 commit into openai:main from codemath3000:submission/systems-opt-casefold

Conversation

@codemath3000
Contributor

Summary

Submission series: This PR is one of three related submissions applying the same systems optimizations to different base stacks (PR #1493, PR #1529, PR #1578). We submit against multiple bases so that a ready-to-merge option exists regardless of how the pending PRs are resolved. Judges should feel free to evaluate whichever base(s) they consider valid and disregard the rest.

Results

| Seed | TTT BPB | Artifact (bytes) |
|------|---------|------------------|
| 1337 | 1.0646 | 15,985,530 |
| 2024 | 1.0634 | 15,980,244 |
| 42 | 1.0639 | 15,982,918 |
| Mean | 1.0639 | 15,982,897 |

Tokenizer

Casefold v2 vocabulary from PR #1578: SP8192 retrained on NFKC + lowercased text, ~10.4% better compression. Byte counting verified correct on 15.4M FineWeb docs (0 mismatches). See CASEFOLD_TOKENIZER.md and verify_bytes.py.
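As a rough illustration of the normalization described above (NFKC followed by lowercasing), a minimal sketch might look like the following. The function names `casefold_normalize` and `utf8_len` are hypothetical and are not taken from CASEFOLD_TOKENIZER.md or verify_bytes.py; the sample string is made up for the example.

```python
import unicodedata


def casefold_normalize(text: str) -> str:
    """Apply NFKC normalization, then lowercase (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text).lower()


def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Fullwidth "HELLO WORLD"-style sample with an ASCII space in the middle.
sample = "\uff28\uff45\uff4c\uff4c\uff4f \uff37\uff2f\uff32\uff2c\uff24"
normalized = casefold_normalize(sample)

print(normalized)
print(utf8_len(sample), utf8_len(normalized))
```

Normalization can change the UTF-8 length of a document, as the two byte counts printed here show, so any bits-per-byte figure depends on which byte count is used as the denominator. That is presumably why a separate byte-counting check accompanies the tokenizer; how verify_bytes.py performs it is not shown here.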

CUTLASS EVT Build

Required for full throughput. Source included in cutlass_evt_fusion/:

git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
cd /opt/cutlass && git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157
pip install --no-build-isolation ./cutlass_evt_fusion

Test plan

  • 3-seed training on 8xH100 SXM (seeds 1337, 2024, 42)
  • All artifacts under 16MB
  • All runs under 600s training + 600s eval
  • Round-trip quantization + TTT verified
  • Byte counting verified on full FineWeb corpus
  • Judges verify reproducibility
  • Judges confirm case normalization legality
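The artifact-size item in the checklist above can be sanity-checked against the reported numbers with a few lines of Python. This is only a sketch: it assumes "16MB" means 16,000,000 bytes (decimal megabytes), which is an assumption and not something stated in this PR, and `headroom_by_seed` is a hypothetical helper name.

```python
# Assumption: the 16MB cap is read as decimal megabytes (16,000,000 bytes).
ARTIFACT_LIMIT_BYTES = 16_000_000

# Artifact sizes in bytes, copied from the Results table.
artifacts = {1337: 15_985_530, 2024: 15_980_244, 42: 15_982_918}


def headroom_by_seed(sizes, limit=ARTIFACT_LIMIT_BYTES):
    """Return the remaining bytes under the limit for each seed."""
    return {seed: limit - size for seed, size in sizes.items()}


headroom = headroom_by_seed(artifacts)
all_under = all(remaining > 0 for remaining in headroom.values())
mean_size = round(sum(artifacts.values()) / len(artifacts))

print(headroom)
print(all_under, mean_size)
```

Under that reading of the limit, every seed leaves between roughly 14 KB and 20 KB of headroom, and the recomputed mean matches the 15,982,897 reported in the table.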

🤖 Generated with Claude Code

…0639

Systems-level optimizations (fused Muon, EMA foreach, loader prealloc)
on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals.
Identical ML; faster step time yields extra training steps. 3-seed mean:
1.0639 BPB / 3.0705 nats.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
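The commit message quotes the score both in bits per byte and in nats. Assuming the nats figure is a mean per-token cross-entropy (an assumption; the commit does not say so explicitly), the two are linked by the standard conversion BPB = loss_nats / ln(2) / bytes_per_token. The sketch below uses hypothetical helper names to back out the bytes-per-token ratio implied by the quoted pair.

```python
import math


def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert a per-token loss in nats to bits per byte."""
    return loss_nats_per_token / math.log(2) / bytes_per_token


def implied_bytes_per_token(loss_nats_per_token: float, bpb: float) -> float:
    """Invert the conversion: average bytes per token implied by a (nats, BPB) pair."""
    return loss_nats_per_token / (math.log(2) * bpb)


# Figures quoted in the commit message: 1.0639 BPB / 3.0705 nats.
ratio = implied_bytes_per_token(3.0705, 1.0639)
print(f"implied bytes per token: {ratio:.3f}")

# Round trip: feeding the ratio back in recovers the quoted BPB.
print(f"round-trip BPB: {bits_per_byte(3.0705, ratio):.4f}")
```

Under this assumption the quoted pair implies a little over four bytes per token. Because that ratio depends on the tokenizer, nats figures from runs with different vocabularies are not directly comparable even when their BPB values are.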
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 13, 2026
…ai#1586 per-layer GPTQ highest-EV

- PR openai#758 n-gram effectively dead: MatoTeziTanka (Apr 12) flagged XOR hash
  includes target token, same illegality as openai#727/openai#741
- GDN-Hybrid BPB bug confirmed: PR openai#1576 space-token double-count inflates
  denominator ~14%; actual score ~1.16-1.18, not 1.01671
- PR openai#1586 (dexhunter, 1.07493): Per-Layer Adaptive GPTQ MLP=12σ/Attn=13σ +
  int7 Emb (saves 530KB) + MLR=0.026; -0.0127 nats vs SOTA; implement now
- PR openai#1584: systems-only (fused Muon, batched EMA, loader prealloc) ~+20 steps
- Casefold Tokenizer (openai#1578/openai#1585): legality debated; await organizer ruling
- New paper: arXiv:2604.06169 In-Place TTT (Apr 7) NTP-aligned score-first TTT
- Merged SOTA 1.0810 unchanged (4-day stable streak); target ≤1.0760; 17 days

https://claude.ai/code/session_01BE8wc8zxvZAo52QBXSNiL8
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 17, 2026
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate
on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base.
Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42:   val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0:    val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality pending organizer review at Issue openai#1604.
AttnOutGate and SmearGate are pure architectural additions and comply with
all Issue openai#1017 conditions (causality, normalized distribution, score-before-
update, single pass).