Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)#1796
Open
simon-marcus wants to merge 1 commit into openai:main from
Replacement for #1143 with the diff trimmed to the single intended submission folder per review request.
Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553)
Results
Against the currently accepted leader #549 at 1.1194, this is an improvement of 0.03883447 BPB, or about 3.47%.

Summary
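As a sanity check, the headline numbers above follow from two lines of arithmetic (1.1194 and 1.08056553 are the two val_bpb values quoted in this PR):

```python
leader_bpb = 1.1194      # currently accepted leader (#549)
ours_bpb = 1.08056553    # this submission

delta = leader_bpb - ours_bpb    # absolute improvement in BPB
rel = delta / leader_bpb * 100   # relative improvement, in percent

print(f"{delta:.8f} BPB, about {rel:.2f}%")  # 0.03883447 BPB, about 3.47%
```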
This submission combines three ideas:
- Backward-looking, score-first TTT following the #461 framework.
- A novel tokenizer (Scylla) selected through iterative autoresearch and proxy validation rather than manual guesswork.
- val_bpb accounting driven by explicit per-token metadata rather than SentencePiece runtime inspection.

Our strategy is a stack change that starts at the tokenizer and runs all the way through evaluation.
To the best of our knowledge, this is also among the first leaderboard-caliber submissions in the competition to change the tokenizer itself rather than inherit the published sp1024 tokenization. If reviewers spot an earlier example we missed, we would be happy to correct that framing; either way, we think tokenizer search is a genuinely promising avenue here and welcome scrutiny and follow-up work.

Tokenizer Journey
The tokenizer work went through several iterative stages. The short version is that we tried the obvious thing first, watched it flatten out, and then had the good sense to stop being sentimental about it.
1. SentencePiece autoresearch
We first built an autoresearch loop around SentencePiece. That loop optimized tokenizer candidates against a FineWeb-aligned screening metric and later against budget-aware heuristics.
This turned out to be useful exploration, but not the winning path.
That negative result mattered. It told us that “better tokenizer statistics” were not enough by themselves, and that larger vocabularies were often buying slim marginal gains with too much artifact budget. It also gave us permission to leave SentencePiece alone instead of continuing to hammer on a local maximum.
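For flavor, here is a minimal, self-contained sketch of a propose/screen/keep-best autoresearch loop. Everything in it is illustrative: the real loop optimized SentencePiece candidates against a FineWeb-aligned screening metric, whereas this toy treats a "config" as a single vocab-size integer and uses a made-up proxy score.

```python
import random

def autoresearch(seed_config, propose, cheap_screen, rounds=200, rng=None):
    """Toy autoresearch loop: mutate the best-so-far candidate and keep
    the mutation only when the (lower-is-better) screening score improves."""
    rng = rng or random.Random(0)
    best, best_score = seed_config, cheap_screen(seed_config)
    for _ in range(rounds):
        candidate = propose(best, rng)
        score = cheap_screen(candidate)
        if score < best_score:  # greedy hill climb on the screening metric
            best, best_score = candidate, score
    return best, best_score

# Illustrative usage: a "tokenizer config" is just a vocab size here,
# and the screening metric is a made-up stand-in that prefers ~1024.
best, score = autoresearch(
    seed_config=4096,
    propose=lambda cfg, rng: max(256, cfg + rng.choice([-512, -256, 256, 512])),
    cheap_screen=lambda cfg: abs(cfg - 1024),
)
```

The point of the sketch is the control flow, not the metric: as the negative result above shows, a cheap screening score like this can flatten out or mislead, which is exactly why the later proxy-validation stage mattered.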
2. TokenMonster sidecar and proxy calibration
We then evaluated TokenMonster as a challenger family. Early cheap-screen results suggested that small TokenMonster vocabularies, especially around the 1024 regime, were more promising than either larger TokenMonster vocabularies or the best SentencePiece variants. Proxy validation sharpened that impression.
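To make "proxy validation sharpened that impression" concrete, one standard check is whether the cheap screening metric rank-orders candidates the same way the more expensive proxy val_bpb does, e.g. via a Spearman rank correlation. The sketch below is generic (pure Python, assumes no tied scores), and the numbers are illustrative rather than taken from our runs.

```python
def ranks(xs):
    """0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = rank
    return out

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative: cheap-screen scores vs. proxy val_bpb for four candidates.
cheap = [0.91, 0.88, 0.95, 0.85]
proxy = [1.12, 1.09, 1.15, 1.11]
rho = spearman(cheap, proxy)  # near 1.0 => screening predicts proxy ranking
```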
3. TokenMonster-only autoresearch
We then narrowed the search into a TokenMonster-only lane. After broadening the proposal policy away from tiny local resize-only edits, the best line became a lightly pruned derivative of english-1024-clean-v1.

That candidate, tracked internally as tm0054 and nicknamed Scylla, kept the good byte efficiency of the parent vocabulary while reducing waste in the active vocabulary. It was then promoted through proxy validation and full runs.
The important negative result was that larger-vocab and SentencePiece-side improvements looked better on cheap screening than they did in proxy or full runs. The winning lesson was not “make the tokenizer bigger.” It was “make the tokenizer better aligned to the artifact budget and to the tiny-model learning dynamics.”
If this submission does end up being among the first tokenizer-changing entries seriously pushed to the top of the leaderboard, we would be delighted to see other people push on the same door. This competition has been especially exciting for cultivating unusual and interesting ideas, and we think tokenizer search deserves a place in that mix.
Full-Data Bundle
For the corrected competition path, we built a full-data Scylla bundle from the published sp1024 FineWeb export by retokenizing in shard order. The corrected bundle uses:

- 79 train shards
- 1 val shard

Runtime tokenizer assets:

- candidate.vocab
- candidate.meta.npz

The metadata artifact supplies explicit per-token metadata (exported lookup tables), so the runtime path does not need SentencePiece to inspect tokenizer internals during evaluation.
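For illustration, metadata-driven val_bpb accounting can be as simple as the sketch below. It assumes the metadata archive carries a per-token UTF-8 byte-length lookup table indexed by token id; the names val_bpb and byte_len_lut are hypothetical placeholders, not the actual contents of candidate.meta.npz.

```python
import numpy as np

def val_bpb(token_ids, nll_nats, byte_len_lut):
    """Bits-per-byte from per-token losses plus a byte-length LUT.

    token_ids:    validation token ids (int array)
    nll_nats:     per-token negative log-likelihood in nats
    byte_len_lut: byte_len_lut[t] = UTF-8 byte length of token t
    """
    total_bytes = int(byte_len_lut[token_ids].sum())
    total_bits = float(nll_nats.sum()) / np.log(2.0)  # nats -> bits
    return total_bits / total_bytes

# Tiny worked example: tokens of 1, 2, and 3 bytes, each scored at
# exactly ln(2) nats (= 1 bit) of loss => 3 bits over 6 bytes.
lut = np.array([1, 2, 3])
ids = np.array([0, 1, 2])
nll = np.full(3, np.log(2.0))
bpb = val_bpb(ids, nll, lut)  # ≈ 0.5
```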
A compact audit note is included in TOKENIZER_VALIDATION.md.

Legality
This record path is intended to stay within the currently accepted legality standard:
Backward-looking, score-first TTT following PR #461's framework.

Score-first protocol: the model scores each validation chunk before adapting on it. No token is ever re-scored after adaptation. This follows the causal score-before-update TTT pattern that organizers have treated as legal in the adaptive track discussion and accepted submissions.
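The invariant is easiest to see as a loop skeleton. This is a schematic sketch of the score-before-update pattern, not the submission's actual train_gpt.py; score_chunk and adapt_on_chunk are hypothetical stand-ins for the real scoring and update steps.

```python
def score_first_ttt(chunks, params, score_chunk, adapt_on_chunk):
    """Score-first TTT: each chunk is scored with parameters that have
    never seen it, and no chunk is ever re-scored after adaptation."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        nll, n_tokens = score_chunk(params, chunk)  # score first...
        total_nll += nll
        total_tokens += n_tokens
        params = adapt_on_chunk(params, chunk)      # ...then adapt
    return total_nll / total_tokens, params

# Toy run that records the order of operations.
log = []
avg_nll, final_params = score_first_ttt(
    chunks=["a", "b"],
    params=0,
    score_chunk=lambda p, c: (log.append(("score", c, p)) or (1.0, 1)),
    adapt_on_chunk=lambda p, c: (log.append(("adapt", c)) or p + 1),
)
# log records score-a, adapt-a, score-b, adapt-b: causal order preserved.
```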
Implementation Notes
The main script in this folder is the promoted legal TTT stack adapted for tokenizer bundles:
- TOKENIZER_PATH points to the promoted tokenizer vocab
- TOKENIZER_META_PATH points to the exported metadata LUTs
- TTT_ENABLED=1

The strongest path found so far combines the legal score-first TTT stack with the Scylla tokenizer.

Included Files
- train_gpt.py
- candidate.vocab
- candidate.meta.npz
- manifest.json
- train_seed42.log
- train_seed1337.log
- train_seed2026.log
- TOKENIZER_VALIDATION.md

Acknowledgements
Thanks to @0hq and @valerio-oai for organizing, maintaining, and moderating an unusually fun and technically demanding competition.
The tokenizer lane also benefited from reading and learning from other competitors’ public work, especially the broader discussion around legal evaluation methods and tokenizer tradeoffs.