diff --git a/records/track_non_record_16mb/2026-05-01_Competition_Research_Notes/README.md b/records/track_non_record_16mb/2026-05-01_Competition_Research_Notes/README.md
new file mode 100644
index 0000000000..bfcf2ea268
--- /dev/null
+++ b/records/track_non_record_16mb/2026-05-01_Competition_Research_Notes/README.md
@@ -0,0 +1,433 @@
+# Non-record: Competition Research Notes
+
+**Track:** non-record / methodology
+**Author:** Himanshu Dongre
+**Date:** 2026-05-01
+**Leaderboard claim:** none
+
+These are research notes from the 10min/16MB track. They are separate from my
+final-run evidence package in PR #2110. This folder does not contain a scored
+model, logs, or a leaderboard claim.
+
+The aim is to describe what I think the competition taught us about small
+models, tokenization, quantization, eval-time memory, and benchmark semantics.
+I have tried to keep this grounded in public PRs/issues and in my own failed
+experiments, without pretending to give an official ruling on any open PR.
+
+## The Main Split
+
+By the end, final BPB was no longer enough to understand a submission. Similar
+numbers could come from very different mechanisms:
+
+```text
+reported BPB =
+  neural model quality
++ quantization damage or recovery
++ tokenizer / normalization choice
++ byte-denominator accounting
++ legal eval-time memory
++ validation-adaptation protocol
++ timing boundary choices
+```
+
+Mixing those effects into one leaderboard number made review hard. The notes
+below are organized around that split.
+
+## 1. Clean Neural and Quantization Work
+
+The clean neural frontier was incremental but real. The repeated ingredients
+were:
+
+- gated attention / XSA / SmearGate,
+- LQER, AWQ-lite, and asymmetric quantization,
+- longer context around 2048-2560,
+- score-first phased TTT,
+- no-QV or retuned local TTT,
+- per-group compression and artifact-aware tensor routing.
+
+Public examples include PR #1855, #1953, #2014, #2018, #2041, #2060, and
+#2101.
+
+This was the easiest part of the leaderboard to reason about. The evaluated
+object stayed close to a standard causal neural model over the token vocabulary.
+The gains were smaller than the PPM/representation jumps, but the legality
+story was clearer.
+
+The practical lesson I got from my own final runs is also simple:
+
+```text
+pre-quant BPB is the first serious kill gate on a mature stack
+```
+
+In PR #2110, my branches were about `+0.013 BPB` worse than the PR #2018
+reference before quantization. That was already enough to stop.
+
+## 2. Tokenizer and Representation Work
+
+Tokenizer-side work was one of the largest levers. It also had the highest
+burden of proof.
+
+The public context:
+
+- Issue #43 discusses tokenizer artifact accounting.
+- Issue #1604 discusses tokenizer normalization, casefolding, and CaseOps-style
+  transforms.
+- Issue #897 and Issue #1719 show how byte-denominator bugs can create large
+  phantom gains.
+
+My strongest unfinished tokenizer result was CrossWS:
+
+| tokenizer | tokens | tokens/byte | ratio |
+|---|---:|---:|---:|
+| default SP8192 training | 2,880,110 | 0.26126 | 1.00000 |
+| cross-whitespace SP8192 | 2,731,553 | 0.24778 | 0.94842 |
+
+That number came from a 10 MB train-proxy slice decoded from an official train
+shard. The effect was stable on val-derived samples around `0.9466-0.9484`.
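+
+As a cross-check on tables like this one, here is a minimal sketch of how I
+would measure tokens/byte on a byte-exact slice. It assumes the standard
+`sentencepiece` Python bindings; the file names are placeholders, not the
+official artifacts:
+
+```python
+# Minimal sketch: compare tokens/byte for two tokenizers on one slice.
+# Model and data file names are placeholders.
+import sentencepiece as spm
+
+def tokens_per_byte(model_file: str, text: str) -> float:
+    sp = spm.SentencePieceProcessor(model_file=model_file)
+    n_tokens = len(sp.encode(text, out_type=int))
+    n_bytes = len(text.encode("utf-8"))  # exact UTF-8 byte denominator
+    return n_tokens / n_bytes
+
+with open("train_proxy_10mb.txt", encoding="utf-8") as f:
+    slice_text = f.read()
+
+base = tokens_per_byte("sp8192_default.model", slice_text)
+cross = tokens_per_byte("sp8192_crossws.model", slice_text)
+print(f"ratio: {cross / base:.5f}")  # below 1.0 means fewer tokens per byte
+```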
+
+I did not finish it as a record because a tokenizer result needs more than a
+token-count table:
+
+- raw-doc train/val split by row index,
+- tokenizer trained only on train rows,
+- exact validation byte sidecar,
+- byte-fallback handling,
+- U+2581 handling,
+- adversarial Unicode round-trip tests,
+- and full validation byte sums.
+
+The research signal remains interesting. Standard whitespace-splitting
+assumptions appear to leave capacity unused for small compressed models.
+
+## 3. Eval-Time N-gram and PPM Methods
+
+Eval-time memory was the most sensitive part of the late competition. I would
+separate it into at least two categories.
+
+### Token-level tilt
+
+The cleanest form is a prefix-only token hint with closed-form renormalization:
+
+```text
+p'(a) = exp(beta * 1[a = h]) * p(a) / Z
+Z = 1 + p(h) * (exp(beta) - 1)
+```
+
+This keeps a full normalized distribution over the SP token vocabulary. PR
+#2018 and #2041 are useful public references for this style of method.
+
+### Byte-level PPM
+
+Byte-level PPM can be strictly causal and score-first. The open question is
+C2: whether the scored alphabet can be bytes rather than the official token
+vocabulary. Issue #1872 is the main thread I would read here.
+
+PR #1991, #2039, #2083, #2098, and #2103 all belong in this broader family,
+with different arguments for how the byte distribution relates to the neural
+token distribution. The mechanisms are interesting. The policy question is
+separate from the engineering question.
+
+## 4. Runtime Memory Needs a Better Gate
+
+I spent a lot of time on the idea that eval-time working memory might be
+underpriced: the artifact is capped, but cache/RAM at eval time is not.
+
+The first copy-memory probe looked promising. After fixing a sliding-window
+prefix-depth bug, the gain collapsed or turned negative at deeper context.
+
+The reason was instructive:
+
+- repeated spans are high precision when they fire,
+- a strong long-context neural model already predicts many of those spans,
+- memory hits add little when the model already knows,
+- memory misses still cost probability mass.
+
+The principle I would carry forward:
+
+```text
+External memory should be gated by expected improvement over the model's own
+distribution, not by memory confidence alone.
+```
+
+This applies beyond the competition. Retrieval and caches for small assistants
+need to know whether the base model is already confident.
+
+## 5. Validation Adaptation
+
+Score-first TTT is useful when the update affects only future tokens. The
+unsafe pattern is adapting on validation tokens and then reporting scores for
+those same tokens after the adapted state has seen them.
+
+For adaptive submissions, I would want score/update intervals in the logs:
+
+```text
+score token range: [a, b)
+update token range: [c, d)
+assert updates affect only strictly future score ranges
+```
+
+That turns a vague legality argument into something inspectable.
+
+## 6. Byte Accounting
+
+Byte accounting was not a side detail. It defined the metric.
+
+For any custom tokenizer or sidecar method, the basic invariants should be:
+
+```text
+decode(encode(text)) == text
+sum(byte_sidecar) == len(text.encode("utf-8"))
+each original byte is counted exactly once
+each scored token contributes exactly one score term
+```
+
+The tests should cover byte fallback, NUL, U+2581, multi-byte Unicode, empty
+documents, BOS boundaries, and packed documents.
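+
+A minimal sketch of how I would assert these invariants in a test harness;
+`encode`, `decode`, and `byte_sidecar` are hypothetical stand-ins for a
+submission's own functions, not an official API:
+
+```python
+# Minimal sketch: invariant checks for a custom tokenizer plus byte sidecar.
+# `encode`, `decode`, and `byte_sidecar` are hypothetical stand-ins.
+
+def check_byte_accounting(text, encode, decode, byte_sidecar):
+    tokens = encode(text)
+    sidecar = byte_sidecar(text, tokens)  # bytes attributed to each token
+
+    # Lossless round trip: no byte may be altered or dropped.
+    assert decode(tokens) == text
+
+    # Each original byte is counted exactly once across the sidecar.
+    assert sum(sidecar) == len(text.encode("utf-8"))
+
+    # Each scored token contributes exactly one score term.
+    assert len(sidecar) == len(tokens)
+
+# Edge cases from the list above: NUL, U+2581, multi-byte Unicode,
+# empty documents, and packed documents.
+EDGE_CASES = ["", "\x00", "\u2581", "na\u00efve \U0001F600", "doc1\ndoc2"]
+```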
+
+## 7. What Transfers Beyond Parameter Golf
+
+The benchmark is artificial, but the pressure it creates is real. It asks:
+
+```text
+How much prediction quality can be bought with:
+  - small persistent weights,
+  - limited training time,
+  - bounded inference time,
+  - quantized artifacts,
+  - strict causal scoring?
+```
+
+That resembles small OSS models, local models, cheap specialist models, and
+adaptive assistants.
+
+The useful object is the full compressed prediction system:
+
+```text
+representation + weights + quantizer + memory + evaluator + update protocol
+```
+
+I would study those pieces together rather than treating the tokenizer,
+quantizer, and evaluation state as afterthoughts.
+
+## 8. What Does Not Transfer Directly
+
+I would not build a production small language model by copying the final
+competition stack unchanged.
+
+Outside the competition, the best path would likely include tools the contest
+mostly rules out or makes unattractive:
+
+- longer training,
+- larger training mixtures,
+- supervised and preference tuning,
+- distillation from a much larger teacher,
+- synthetic data from stronger models,
+- architecture search with more than one 10-minute shot,
+- and latency/throughput constraints measured on real serving hardware.
+
+Distillation is the clearest example. Under the competition rules, a large
+teacher is hard to use because all useful training has to fit inside the
+600-second training budget or inside the submitted artifact. In ordinary
+small-model work, a large teacher can supply soft targets, reasoning traces,
+data selection, and curriculum (a sketch of the objective follows at the end
+of this section). I would expect that to dominate many of the tiny last-day
+leaderboard knobs once the rules allow it.
+
+So the claim here is narrower:
+
+```text
+Parameter Golf is not the best recipe for training a production small LM.
+It is a useful stress test for compressed prediction systems.
+```
+
+The parts I think transfer are the systems lessons:
+
+- tokenizer and representation matter,
+- quantization has to be part of model design,
+- pre-quant/quant/post-adaptation metrics should be logged separately,
+- eval-time memory needs calibrated gating against the base model,
+- and adaptive benchmarks need explicit score/update timing.
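+
+To make the distillation point concrete, here is a minimal sketch of the
+soft-target objective I have in mind, written against standard PyTorch. The
+temperature, mixing weight, and the availability of a large teacher are all
+assumptions the competition rules mostly rule out:
+
+```python
+# Minimal sketch: classic soft-target distillation, mixing hard-label
+# cross-entropy with a temperature-softened KL term against the teacher.
+# All hyperparameters are illustrative, not tuned values.
+import torch.nn.functional as F
+
+def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
+    # Hard-label cross-entropy on the ground-truth next token.
+    ce = F.cross_entropy(student_logits, targets)
+    # KL(teacher || student) on softened distributions; the T*T factor
+    # keeps the gradient scale comparable across temperatures.
+    kd = F.kl_div(
+        F.log_softmax(student_logits / T, dim=-1),
+        F.log_softmax(teacher_logits / T, dim=-1),
+        reduction="batchmean",
+        log_target=True,
+    ) * (T * T)
+    return alpha * ce + (1.0 - alpha) * kd
+```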
+
+## 9. Model Memory vs Working Memory
+
+One unusual feature of this competition is the split between persistent model
+memory and eval-time working memory.
+
+The persistent artifact is capped at 16 MB. That strongly limits model weights,
+code, and anything shipped as part of the predictor. Eval-time working memory
+is different. During validation, the program can use H100 memory, KV cache,
+temporary lookup tables, online statistics, and other prefix-derived state, as
+long as it stays causal and finishes inside the eval-time budget.
+
+That makes Parameter Golf different from a normal deployed LLM:
+
+| Setting | Persistent model memory | Working memory / inference state |
+|---|---|---|
+| Parameter Golf | very expensive, capped at 16 MB | comparatively cheap until eval time runs out |
+| Production serving | expensive, but amortized across many users | also expensive because it drives latency, KV cache, batch size, and serving cost |
+
+This explains why eval-time n-gram caches, PPM-style memory, TTT state, and
+large temporary statistics were so tempting in the competition. They spend the
+resource that the rules price least directly.
+
+For production, that tradeoff changes. A method that wins by using large
+working memory may be unattractive if it increases latency or reduces batch
+throughput. Techniques that compress KV cache, reduce activation memory, or
+speed up inference can be more valuable in production than they look in this
+contest.
+
+The research question I would take forward is:
+
+```text
+Given a fixed serving budget, what should live in persistent weights and what
+should live in per-request working memory?
+```
+
+Parameter Golf put almost all pressure on persistent weights. Production
+systems need both sides to be efficient.
+
+## 10. Claims I Would Test Next
+
+The notes above can be turned into testable claims. These are the ones I would
+prioritize.
+
+### Claim A: representation first
+
+In this competition, tokenization and representation often moved the target
+more than another small gate, rank, or learning-rate tweak. My CrossWS result
+is one example, not a proof. I would test whether this remains true after
+byte accounting is fully controlled.
+
+Test:
+
+```text
+Fix the architecture, quantizer, training time, and eval protocol.
+Train several byte-exact tokenizers on the same train rows.
+Report tokens/byte, pre-quant BPB, quantized BPB, and eval latency.
+```
+
+The important part is to keep the byte denominator exact. Otherwise the test
+measures accounting, not modeling.
+
+### Claim B: memory needs marginal pricing
+
+The repeated-span cache looked good until I fixed prefix depth. Then the base
+model already knew many of the cache hits.
+
+Test (see the sketch after Claim E):
+
+```text
+For every memory event, log:
+  memory confidence
+  model probability on the memory hint
+  realized hit/miss
+  loss delta after normalized mixing
+```
+
+If a memory method cannot predict positive marginal gain before seeing the
+token, it is not a memory policy. It is a hopeful cache.
+
+### Claim C: search the deployed model
+
+A BF16 improvement that disappears after GPTQ is not useful for a 16 MB model.
+
+Test:
+
+```text
+For each candidate:
+  pre-quant BPB
+  quantized BPB
+  post-adaptation BPB
+  artifact bytes
+  eval seconds
+```
+
+Then rank by the deployed tuple, not by pre-quant loss alone.
+
+### Claim D: distillation outside the rules
+
+The competition mostly prevents a large teacher from being useful because the
+teacher has to be trained or encoded within the budget. In normal small-model
+training, a teacher can shape data, targets, curriculum, and error correction.
+
+Test:
+
+```text
+Train the same small quantized architecture with:
+  CE only
+  teacher soft targets
+  teacher-selected data
+  teacher-generated hard negatives
+  teacher reasoning traces when applicable
+Compare after quantization as well as before.
+```
+
+My expectation is that distillation would beat many of the last-day
+hyperparameter tricks, while the competition's quantization and tokenizer
+lessons would still matter.
+
+### Claim E: serving cost decides memory placement
+
+Parameter Golf made persistent memory scarce and working memory relatively
+cheap. Production makes both expensive, but in different units.
+
+Test:
+
+```text
+For a target latency and batch size, compare:
+  more weights
+  longer context / larger KV cache
+  retrieval or cache memory
+  online adaptation state
+  smaller weights plus better tokenizer
+```
+
+Report quality per dollar or quality per token-second, not BPB alone.
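+
+As referenced under Claim B, a minimal sketch of the per-event log, reusing
+the closed-form tilt from section 3. The field names and the single-hint
+mixing rule are illustrative assumptions:
+
+```python
+# Minimal sketch: price a memory hint by realized marginal gain, not by
+# memory confidence. `p` is the model's normalized token distribution.
+import math
+
+def tilt(p, hint, beta):
+    # p'(a) = exp(beta * 1[a = h]) * p(a) / Z with
+    # Z = 1 + p(h) * (exp(beta) - 1), so p' stays normalized.
+    Z = 1.0 + p[hint] * (math.exp(beta) - 1.0)
+    return [pa * (math.exp(beta) if a == hint else 1.0) / Z
+            for a, pa in enumerate(p)]
+
+def log_memory_event(p, hint, confidence, beta, realized):
+    p_mixed = tilt(p, hint, beta)
+    return {
+        "memory_confidence": confidence,
+        "model_prob_on_hint": p[hint],  # was the model already sure?
+        "hit": realized == hint,
+        # Negative delta means the memory hint helped on this token.
+        "loss_delta": -math.log(p_mixed[realized]) + math.log(p[realized]),
+    }
+```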
+
+## My Research Arc
+
+These notes also reflect how my own view changed during the competition.
+
+| Date | PR | Type | Lesson |
+|---|---:|---|---|
+| 2026-03-26 | #826 / #846 | closed record attempts | Large eval-time memory numbers need strict scoring semantics first. |
+| 2026-03-28 | #1012 / #1013 | non-record | Synthetic and SSM/JEPA-style successes did not transfer cleanly. |
+| 2026-04-01 | #1227 | non-record | Small-scale tests lie when they miss the real bottleneck. |
+| 2026-04-02 | #1259 | non-record | Retrieval that helps weak models can vanish on strong contextual models. |
+| 2026-04-03 | #1301 | non-record | Mechanical novelty is cheap; frontier transfer is hard. |
+| 2026-04-04 | #1341 | non-record | Adaptation and quantization have to be designed together. |
+| 2026-04-15 | #1642 | non-record | Legal eval-time memory can still be a null result. |
+| 2026-04-18 | #1716 | record attempt | Small causal input features can help in the right base. |
+| 2026-04-18 | #1718 | non-record | Ablations are necessary to avoid copying ingredients blindly. |
+| 2026-04-30 | #1965 | record candidate | Tail seeds matter; fixed seed policy matters. |
+| 2026-05-01 | #2110 | non-record | Final frontier transfer failed at pre-quant. |
+
+The through-line is that the work moved from tricks toward measurement:
+mechanism, denominator, legality, quantization, and hardware budget.
+
+## Source Map
+
+| Item | Why it matters |
+|---|---|
+| Issue #1017 | C1-C4 framing: causal dependence, normalized distribution, score-before-update, single pass. |
+| Issue #1604 | Custom tokenizer normalization and casefold/CaseOps policy. |
+| Issue #43 | Tokenizer artifact accounting. |
+| Issue #897 | U+2581 / byte-fallback denominator bug. |
+| Issue #1719 | Leading-space byte double-count bug. |
+| Issue #1872 | Byte-level PPM-D mixture legality question under C2. |
+| PR #1855 | Merged late SOTA with LQER, SparseAttnGate, BOS-fixed SmearGate, per-group compression, phased TTT. |
+| PR #1953 | Long-context 2560, no-QV TTT mask, local LR 0.75, QK_GAIN 5.25. |
+| PR #2018 | Gated XSA + LQER top-1 + strict token-only in-timer n-gram TTT. |
+| PR #2041 | V21 + inside-timer n-gram TTT without Gated XSA. |
+| PR #2060 | LongCtx/no-QV/AsymLogit/LQER retune. |
+| PR #2101 | AWQ-lite + AsymLogit + GradCentral + LabelSmooth. |
+| PR #1991 / #2083 / #2098 | Byte/PPM mixture line with large claimed gains and C2 sensitivity. |
+| PR #1972 | PreQuantTTT line, useful as a warning about validation adaptation. |
+
+## Closing
+
+The main thing I would keep from this competition is the systems view. A tiny
+language model is a representation, a set of weights, a quantizer, a memory
+policy, an evaluator, and an update protocol.
+
+Most confusing results came from mixing those pieces without saying which one
+actually moved. Most useful results made the split visible.
diff --git a/records/track_non_record_16mb/2026-05-01_Competition_Research_Notes/submission.json b/records/track_non_record_16mb/2026-05-01_Competition_Research_Notes/submission.json
new file mode 100644
index 0000000000..c048e5da75
--- /dev/null
+++ b/records/track_non_record_16mb/2026-05-01_Competition_Research_Notes/submission.json
@@ -0,0 +1,21 @@
+{
+  "kind": "non_record",
+  "title": "Competition Research Notes",
+  "author": "Himanshu Dongre",
+  "track": "track_non_record_16mb",
+  "leaderboard_claim": false,
+  "summary": "Research notes on mechanisms, legality, byte accounting, tokenization, eval-time memory, and small-model lessons from the 10min/16MB track.",
+  "companion_pr": 2110,
+  "major_topics": [
+    "clean neural and quantization frontier",
+    "tokenizer and representation work",
+    "token-level n-gram and byte/PPM eval methods",
+    "runtime memory gating",
+    "validation adaptation",
+    "byte accounting",
+    "small-model research lessons",
+    "production vs competition constraints",
+    "model memory vs working memory",
+    "distillation outside the competition rules"
+  ]
+}