Record: Casefold V4 Tokenizer + Multi-Phase Global SGD TTT — val_bpb 1.05970 (3-seed mean) #1670
Open
dexhunter wants to merge 1 commit into openai:main
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 16, 2026
…Output Gate; PR openai#1670 dexhunter 1.05970 casefold pending; PR openai#1647 SLOT-4 risky; Session 15 https://claude.ai/code/session_01VS9iDJJ7C5Qqpk8AAd1Avv
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 17, 2026
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base. Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42: val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0: val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality pending organizer review at Issue openai#1604. AttnOutGate and SmearGate are pure architectural additions and comply with all Issue openai#1017 conditions (causality, normalized distribution, score-before-update, single pass).
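The 3-seed means and the delta against this PR quoted in the commit message above can be re-derived from the per-seed figures. The snippet below is only an arithmetic check on the numbers as quoted. The variable names are illustrative and do not come from either PR, and the standard deviation is not recomputed because the per-seed values are rounded.

```python
from statistics import mean

# Per-seed (val_bpb, val_loss) pairs as quoted in the commit message.
per_seed = {
    42: (1.05693, 3.04604),
    0: (1.05730, 3.04712),
    1234: (1.05777, 3.04846),
}

mean_bpb = mean(bpb for bpb, _ in per_seed.values())
mean_loss = mean(loss for _, loss in per_seed.values())

# 3-seed mean val_bpb reported in this PR's title (the casefold base).
base_bpb = 1.05970
delta_bpb = mean_bpb - base_bpb

print(f"mean val_bpb  = {mean_bpb:.5f}")
print(f"mean val_loss = {mean_loss:.5f}")
print(f"delta vs base = {delta_bpb:+.5f} BPB")
```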
Summary
Key Innovation
Casefold tokenizer preprocessing normalizes text to lowercase before SP8192 tokenization, reducing vocabulary entropy. It is combined with our multi-phase global SGD TTT from PR #1626. Casefold legality is pending organizer review at Issue #1604.
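As a rough illustration of the preprocessing step described above, the sketch below folds case before any tokenizer would see the text. It assumes Python's built-in `str.casefold()` as the normalization; this description does not say whether the submission uses `str.lower()`, `str.casefold()`, or something else, and the helper names `casefold_text` and `distinct_surface_forms` are hypothetical. The SP8192 tokenizer itself is not invoked; counting whitespace-delimited strings is only a crude stand-in for vocabulary variety.

```python
def casefold_text(text: str) -> str:
    """Assumed normalization: Unicode case folding via str.casefold()."""
    return text.casefold()


def distinct_surface_forms(text: str) -> int:
    """Count distinct whitespace-delimited strings in the text."""
    return len(set(text.split()))


sample = "The cat saw THE dog. the dog saw The cat."

raw_forms = distinct_surface_forms(sample)
folded_forms = distinct_surface_forms(casefold_text(sample))

# "The", "THE", and "the" collapse to one form after folding,
# so the tokenizer would see fewer distinct strings.
print(raw_forms, folded_forms)
```

Note that `str.casefold()` is more aggressive than `str.lower()` for some scripts (for example, it maps "ß" to "ss"), so the two choices are not interchangeable.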
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, Phased TTT)
Lineage
PR #1530 (@samacqua) → PR #1626 (@dexhunter, multi-phase SGD TTT) → this PR (+ casefold tokenizer)
Credits
Note
Casefold tokenizer normalization is pending organizer review at Issue #1604. This submission is offered for evaluation under that pending ruling.