Non-record: Competition Research Notes
Track: non-record / methodology
Author: Himanshu Dongre
Date: 2026-05-01
Leaderboard claim: none
These are research notes from the 10min/16MB track. They are separate from my
final-run evidence package in PR #2110. This folder does not contain a scored
model, logs, or a leaderboard claim.
The aim is to describe what I think the competition taught us about small
models, tokenization, quantization, eval-time memory, and benchmark semantics.
I have tried to keep this grounded in public PRs/issues and in my own failed
experiments, without pretending to give an official ruling on any open PR.
The Main Split
By the end, final BPB was no longer enough to understand a submission. Similar
numbers could come from very different mechanisms: clean neural and quantization
gains, tokenizer and representation changes, eval-time memory, and
byte-accounting differences.
Mixing those effects into one leaderboard number made review hard. The notes
below are organized around that split.
1. Clean Neural and Quantization Work
The clean neural frontier was incremental but real, built from the same
repeated ingredients across submissions. Public examples include PR #1855,
#1953, #2014, #2018, #2041, #2060, and #2101.
This was the easiest part of the leaderboard to reason about. The evaluated
object stayed close to a standard causal neural model over the token vocabulary.
The gains were smaller than the PPM/representation jumps, but the legality
story was clearer.
The practical lesson from my own final runs is also simple: in PR #2110, my
branches were about +0.013 BPB worse than the PR #2018 reference before
quantization, and that was already enough to stop.
2. Tokenizer and Representation Work
Tokenizer-side work was one of the largest levers. It also had the highest
burden of proof.
The public context: custom tokenizer transforms were a recurring theme, and
issue #1719 (build_sentencepiece_luts affects GDN-family submissions) shows how
byte-denominator bugs can create large phantom gains.
My strongest unfinished tokenizer result was CrossWS. The headline number came
from a 10 MB train-proxy slice decoded from an official train shard, and the
effect was stable on val-derived samples around 0.9466-0.9484. I did not finish
it as a record because a tokenizer result needs more than a token-count table.
The research signal remains interesting. Standard whitespace-splitting
assumptions appear to leave capacity unused for small compressed models.
3. Eval-Time N-gram and PPM Methods
Eval-time memory was the most sensitive part of the late competition. I would
separate it into at least two categories.
Token-level tilt
The cleanest form is a prefix-only token hint with closed-form renormalization:
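As a rough illustration (not the exact construction from any submission),
assume the eval loop maintains continuation counts built only from tokens
strictly before the scored position; mixing the normalized hint distribution
into the neural distribution stays normalized in closed form:

```python
import numpy as np

def tilt_with_prefix_hint(p_model, hint_counts, lam=0.3, alpha=1.0):
    """Mix a prefix-only n-gram hint into the neural token distribution.

    p_model:     (V,) normalized neural probabilities over the SP vocabulary.
    hint_counts: (V,) continuation counts gathered only from tokens strictly
                 before the scored position (prefix-only, so strictly causal).
    lam, alpha:  illustrative mixing weight and add-alpha smoothing.
    """
    q = hint_counts + alpha
    q = q / q.sum()                       # normalized hint distribution
    p = (1.0 - lam) * p_model + lam * q   # mixture of two normalized distributions
    return p / p.sum()                    # sums to 1 up to floating-point error
```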
This keeps a full normalized distribution over the SP token vocabulary. PR
#2018 and #2041 are useful public references for this style of method.
Byte-level PPM
Byte-level PPM can be strictly causal and score-first. The open question is
C2: whether the scored alphabet can be bytes rather than the official token
vocabulary. Issue #1872 is the main thread I would read here.
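To make "strictly causal and score-first" concrete, here is a toy byte-level
model that scores each byte before updating its counts. It uses add-alpha
smoothing rather than PPM's escape mechanism, so it is a stand-in for the idea,
not for the methods in the PRs below.

```python
from collections import defaultdict
import math

class CausalByteNGram:
    """Strictly causal byte model: every byte is scored before the counts are
    updated, so no future information leaks into any prediction."""

    def __init__(self, order=2, alpha=0.1):
        self.order = order
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _prob(self, ctx, b):
        c = self.counts[ctx]
        return (c[b] + self.alpha) / (self.totals[ctx] + 256 * self.alpha)

    def score_bits(self, data: bytes) -> float:
        """Total bits assigned to `data`, scoring first, then updating."""
        bits = 0.0
        for i, b in enumerate(data):
            ctx = data[max(0, i - self.order):i]
            bits += -math.log2(self._prob(ctx, b))   # score first
            self.counts[ctx][b] += 1                 # then update
            self.totals[ctx] += 1
        return bits

# bits per byte on a small repetitive sample
model = CausalByteNGram()
sample = b"the quick brown fox jumps over the lazy dog. " * 50
print(model.score_bits(sample) / len(sample))
```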
PR #1991, #2039, #2083, #2098, and #2103 all belong in this broader family,
with different arguments for how the byte distribution relates to the neural
token distribution. The mechanisms are interesting. The policy question is
separate from the engineering question.
4. Runtime Memory Needs a Better Gate
I spent a lot of time on the idea that eval-time working memory might be
underpriced: the artifact is capped, but cache/RAM at eval time is not.
The first copy-memory probe looked promising. After fixing a sliding-window
prefix-depth bug, the gain collapsed or turned negative at deeper context.
The reason was instructive: at deeper context, the base model already predicted
most of the spans the cache could copy, so the cache added little marginal
information. The principle I would carry forward: price memory by its marginal
gain over the base model, not by its raw hit rate.
This applies beyond the competition. Retrieval and caches for small assistants
need to know whether the base model is already confident.
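A minimal sketch of that gate, with illustrative thresholds rather than values
from any run: spend the cache's probability mass only where the base model is
not already confident.

```python
import numpy as np

def gated_mix(p_model, p_cache, conf_threshold=0.9, lam=0.5):
    """Mix a cache-derived distribution into the base model's distribution
    only when the base model is uncertain about the next token.
    conf_threshold and lam are illustrative, not tuned values."""
    if p_model.max() >= conf_threshold:
        return p_model                       # base model already knows this token
    p = (1.0 - lam) * p_model + lam * p_cache
    return p / p.sum()
```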
5. Validation Adaptation
Score-first TTT is useful when the update affects only future tokens. The
unsafe pattern is adapting on validation tokens and then reporting scores for
those same tokens after the adapted state has seen them.
For adaptive submissions, I would want score/update intervals in the logs:
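One JSON line per adaptation step would be enough. The field names below are
illustrative, not an official log schema, and the assert encodes the
score-first rule: the state used to score an interval must only have been
updated on tokens strictly before it.

```python
import json

def log_adaptation_step(log_file, score_span, update_span):
    """score_span / update_span are (start, end) token positions, end-exclusive."""
    # Score-first rule: the update applied before scoring this span may only
    # have used tokens strictly before the scored interval.
    assert update_span[1] <= score_span[0], "update overlaps the scored interval"
    log_file.write(json.dumps({
        "score_start": score_span[0], "score_end": score_span[1],
        "update_start": update_span[0], "update_end": update_span[1],
    }) + "\n")
```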
That turns a vague legality argument into something inspectable.
6. Byte Accounting
Byte accounting was not a side detail. It defined the metric.
For any custom tokenizer or sidecar method, the basic invariants should be:
decoding the token ids must reproduce the original bytes exactly, and the BPB
denominator must be the raw UTF-8 byte count of the validation text, never a
tokenizer-derived count.
The tests should cover byte fallback, NUL, U+2581, multi-byte Unicode, empty
documents, BOS boundaries, and packed documents.
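A minimal round-trip check along those lines, where `encode` and `decode` are
hypothetical hooks for whatever tokenizer the submission ships:

```python
def check_byte_accounting(encode, decode, docs):
    """encode/decode are hypothetical hooks for the submission's tokenizer."""
    for doc in docs:
        raw = doc.encode("utf-8")
        restored = decode(encode(doc))
        assert restored.encode("utf-8") == raw, "round-trip changed the bytes"
        # BPB denominator for this document: len(raw), nothing tokenizer-derived.

edge_cases = [
    "",                               # empty document
    "plain ascii",
    "embedded nul \x00 byte",
    "literal \u2581 marker in text",  # U+2581 collides with the SP space marker
    "multi-byte unicode: 日本語 🙂",
]
# check_byte_accounting(encode, decode, edge_cases)
```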
7. What Transfers Beyond Parameter Golf
The benchmark is artificial, but the pressure it creates is real. It asks how
much prediction quality can be extracted from a small persistent artifact under
tight training and evaluation budgets.
That resembles small OSS models, local models, cheap specialist models, and
adaptive assistants.
The useful object is the full compressed prediction system: the representation,
the weights, the quantizer, the eval-time memory policy, the evaluator, and the
update protocol.
I would study those pieces together rather than treating the tokenizer,
quantizer, and evaluation state as afterthoughts.
8. What Does Not Transfer Directly
I would not build a production small language model by copying the final
competition stack unchanged.
Outside the competition, the best path would likely include tools the contest
mostly rules out or makes unattractive:
Distillation is the clearest example. Under the competition rules, a large
teacher is hard to use because all useful training has to fit inside the
600-second training budget or inside the submitted artifact. In ordinary small
model work, a large teacher can supply soft targets, reasoning traces, data
selection, and curriculum. I would expect that to dominate many of the tiny
last-day leaderboard knobs once the rules allow it.
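For concreteness, this is the standard soft-target distillation loss that
becomes available once a large teacher is allowed; it is the textbook
formulation, not something taken from any competition submission.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Hinton-style soft-target distillation; T and alpha are illustrative."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients for temperature
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```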
So the claim here is narrower: the final competition stack is not a recipe for
production small models, but several of its lessons are portable.
The parts I think transfer are the systems lessons: exact byte accounting,
quantization-aware model selection, representation choices made early, and
memory policies priced by marginal gain.
9. Model Memory vs Working Memory
One unusual feature of this competition is the split between persistent model
memory and eval-time working memory.
The persistent artifact is capped at 16 MB. That strongly limits model weights,
code, and anything shipped as part of the predictor. Eval-time working memory
is different. During validation, the program can use H100 memory, KV cache,
temporary lookup tables, online statistics, and other prefix-derived state, as
long as it stays causal and finishes inside the eval-time budget.
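A back-of-envelope example of how lopsided that split is, with made-up model
dimensions:

```python
# Hypothetical small model: all dimensions are illustrative.
layers, heads, head_dim, seq_len = 12, 8, 64, 8192
bytes_per_elem = 2                                                   # bf16
kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem  # K and V
print(kv_bytes / 2**20)   # 192 MiB of transient KV cache vs a 16 MB artifact cap
```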
That makes Parameter Golf different from a normal deployed LLM: persistent
weights are extremely scarce, while eval-time working memory is comparatively
cheap.
This explains why eval-time n-gram caches, PPM-style memory, TTT state, and
large temporary statistics were so tempting in the competition. They spend the
resource that the rules price least directly.
For production, that tradeoff changes. A method that wins by using large
working memory may be unattractive if it increases latency or reduces batch
throughput. Techniques that compress KV cache, reduce activation memory, or
speed up inference can be more valuable in production than they look in this
contest.
The research question I would take forward is how to place capability between
persistent weights and eval-time working memory when both are priced
explicitly.
Parameter Golf put almost all pressure on persistent weights. Production
systems need both sides to be efficient.
10. Claims I Would Test Next
The notes above can be turned into testable claims. These are the ones I would
prioritize.
Claim A: representation first
In this competition, tokenization and representation often moved the target
more than another small gate, rank, or learning-rate tweak. My CrossWS result
is one example, not a proof. I would test whether this remains true after
byte accounting is fully controlled.
Test: hold the model and training budget fixed, swap only the
tokenizer/representation, and compare the BPB movement against the best
gate, rank, or learning-rate tweaks on the same base.
The important part is to keep the byte denominator exact. Otherwise the test
measures accounting, not modeling.
Claim B: memory needs marginal pricing
The repeated-span cache looked good until I fixed prefix depth. Then the base
model already knew many of the cache hits.
Test: for each cache or retrieval hit, log the base model's probability of the
true token alongside the memory method's, and report the marginal gain
conditioned on base-model confidence.
If a memory method cannot predict positive marginal gain before seeing the
token, it is not a memory policy. It is a hopeful cache.
Claim C: search the deployed model
A BF16 improvement that disappears after GPTQ is not useful for a 16 MB model.
Test: quantize every candidate to its deployable form and measure post-quant
BPB, artifact size, and eval time.
Then rank by the deployed tuple, not by pre-quant loss alone.
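A sketch of that selection loop, where `quantize` and `evaluate` are
hypothetical hooks for the submission's actual pipeline:

```python
def rank_by_deployed_tuple(candidates, quantize, evaluate):
    """candidates: iterable of (name, model). quantize packs the artifact;
    evaluate returns (post_quant_bpb, artifact_bytes, eval_seconds)."""
    results = []
    for name, model in candidates:
        artifact = quantize(model)
        bpb, size, secs = evaluate(artifact)
        results.append((bpb, size, secs, name))
    # Sort by post-quant BPB, then artifact size, then eval time.
    return sorted(results)
```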
Claim D: distillation outside the rules
The competition mostly prevents a large teacher from being useful because the
teacher has to be trained or encoded within the budget. In normal small-model
training, a teacher can shape data, targets, curriculum, and error correction.
Test: train the same small architecture with and without a large teacher (soft
targets, teacher-selected data), free of the 600-second constraint, and compare
against the best last-day hyperparameter variants.
My expectation is that distillation would beat many of the last-day
hyperparameter tricks, while the competition's quantization and tokenizer
lessons would still matter.
Claim E: serving cost decides memory placement
Parameter Golf made persistent memory scarce and working memory relatively
cheap. Production makes both expensive, but in different units.
Test: measure each method's latency, peak eval-time memory, and batch
throughput on the target serving hardware alongside its BPB.
Report quality per dollar or quality per token-second, not BPB alone.
My Research Arc
These notes also reflect how my own view changed during the competition.
The through-line is that the work moved from tricks toward measurement:
mechanism, denominator, legality, quantization, and hardware budget.
Source Map
PRs and issues referenced in these notes: #1719, #1855, #1872, #1953, #1991,
#2014, #2018, #2039, #2041, #2060, #2083, #2098, #2101, #2103, and my own
evidence package in PR #2110.
Closing
The main thing I would keep from this competition is the systems view. A tiny
language model is a representation, a set of weights, a quantizer, a memory
policy, an evaluator, and an update protocol.
Most confusing results came from mixing those pieces without saying which one
actually moved. Most useful results made the split visible.