Record: SP8192 + PPM-D byte mixture — 1.00136 BPB (3-seed mean)#1835
Record: SP8192 + PPM-D byte mixture — 1.00136 BPB (3-seed mean)#1835anmarhindi wants to merge 2 commits intoopenai:mainfrom
Conversation
…1835 PPM-D 1.00136 new watch; NgramRes stackable; Day 17 plateau; Session 22 - Upstream commit 7427de2 (Alex Zhao, OpenAI Apr 26): Scylla 0.9485 (PR openai#1184) removed as invalid record; PR openai#1813 (djeidy Scylla 0.94166) effectively dead by proxy - PR openai#1835 (anmarhindi, 1.00136): PPM-D order-5 byte mixture, binary-λ gate, score-first, 15,993,020 bytes — most credible extraordinary claim yet; wait 24h for community BPB check - PR openai#1834 (ghrua, 1.08034): NgramRes 3-gram MLP +0.6M params + sliding-window attn layers 0-3 — modest, stackable - PR openai#731 (Hedge Mixer): still OPEN, 2 seeds pending, no merge - Merged SOTA 1.0810 definitively confirmed; target ≤1.0760; 4 days to deadline https://claude.ai/code/session_01XbdTRT7zPHoGp3LfQV4yXF
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23 - Merged SOTA still 1.0810 (Day 18, no change since Apr 9) - PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed) - SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required - PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day - PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable - PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean - PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling - Added Session 23 lessons to CLAUDE.md - 3 days to deadline (Apr 30) — final GPU run window https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
|
Thanks for submitting! However, because the artifact size is 16071321 bytes (which is over the requirement of 16000000 bytes), this submission is ineligible for the record leaderboard. cc @cocohearts |
…tion — val_bpb 1.06777 (3-seed mean) 3-seed validated reproduction of PR openai#1854's neural stack with PHASED_TTT_PREFIX_DOCS=1500 to fit the 600s eval budget. Beats merged SOTA PR openai#1493 (bigbag, 1.0810) by 0.01323 BPB at ~13σ statistical significance. Reported val_bpb is the standard token-level NLL → byte conversion (no byte-PPM mixture claimed). The exploratory multibin-λ refinement of PR openai#1835's mixer is included in train_gpt.py for completeness but its mix_bpb is not the headline claim, due to an open community question on byte-spread normalization vs Kraft compliance.
Hi @regina-openai @cocohearts, Thanks for flagging. The line in the seed log is a reporting artifact, not an actual cap violation. The submission is compliant. Here's the breakdown: Actually shipped artifact:
What the seed log reports: The training script's "wrapped_code_bytes": 26420,
"total_submission_bytes_max": 15993020,
"compliant_max_under_16mb": trueFile sizes verifiable directly: |
…ixture class Per @OE-GOD review note on this PR — the byte-level PPM-D mixture technique class was first introduced in PR openai#1795 (2026-04-23), with anmarhindi's PR openai#1835 (2026-04-25, our port source) following two days later. Updates: - Opening summary cites PR openai#1795 as class introduction + PR openai#1835 as our port source - Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs - Acknowledgements section reordered to lead with PR openai#1795 chronologically - PPM-D cluster list in compliance section now includes openai#1795 No code or score changes.
|
Hi, thanks for your submission! There are a few points here where I'd like clarity, if possible:
Suppose you have two tokens, t1="ab" and t2="a", and under some context c the NN (operating at the token level) assigns P(t1|c)=0.36 and P(t2|c)=0.04. Both tokens begin with the same next byte "a". A real byte model must assign a single probability to that next byte, namely P("a"|c)=0.40. But the method used here would assign 0.36^(1/2)=0.6 to the first byte if t1 is the realized token, and 0.04 if t2 is the realized token. So the score assigned to the same next byte depends on which token later turns out to be correct, which means this seemingly breaks autoregressivity. Could you clarify these? Thanks! |
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day - Discovered organizer has 2 pending branches staging 14 new leaderboard records - BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records) - New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558 - Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion - PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement - SmearGate BOS fix required (top entry PR openai#1855 uses it) - Updated CLAUDE.md competition strategy + added Session 24 lessons learned - Added Apr 29 daily research log entry https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
Thanks for the careful review! Both concerns are valid, and they're exactly the issues I attempted to address in PR #2039 ( #2039 fixes both. (1) Full-val coverage is now attested in Quoting your two points and how the new submission handles each: 1. Eval-set coverage. You're right: in #1835 the PPM loss was computed over a 3M-token subset and reported as the full-val number. In #2039 the cond-PPM mixer is run over the full validation set (9,662,464 tokens, 32,756,252 canonical bytes per seed). Because the model runs across 8 ranks at eval time and PPM is sequential, this required
2. Autoregressivity at the byte alphabet (C2). Your example expresses the issue precisely. With In #2039, Running your example through this: This matches the "real byte model" value exactly, and it does not depend on which token later realizes. In the full SP8192 vocabulary, the sum extends over all tokens whose canonical first byte is "a", and the same property holds: the byte_0 probability is fixed before realization. For the remaining bytes within the realized token ( Continuing your example: if t1 = "ab" realizes, then So both mix steps in #2039 are convex combinations of two proper distributions over the same alphabet:
Their product is a proper distribution over the realized token's byte stream, and the byte_0 probability does not depend on which token later realizes. The PPM-D byte conditional is the standard Cleary-Witten construction over already-scored bytes (advanced strictly post-scoring per C3); the gate Happy to clarify any of the above further, and thanks again for surfacing the C2 issue on #1835. |
Summary
3-seed mean val_bpb 1.00136 (std 0.00111). Beats the current leaderboard 1.0810 by 0.0796 BPB, comfortably past the 0.005-nat threshold and well over 70× the inter-seed std. Stays under the 16 MB cap with 6,980 bytes to spare.
The submission adds one thing on top of the existing training stack: a binary-λ-gated PPM-D byte-level mixture applied to the sliding-window NN log-probs at eval time. PPM (Cleary-Witten 1984) turns out to be a useful non-parametric companion to a small parameter-constrained LM, and the mixture is constructed to fit cleanly inside the score-first discipline of Issue #1017.
The contribution
A binary-λ-gated PPM-D mixture over an already-scored byte stream, computed at eval time and mixed with the NN's per-byte log-probabilities in probability space.
For each predicted byte at position
twith byte contextc = stream[t-5..t-1]:|unique(c)| / (total(c) + |unique(c)|). Counts are built online from already-scored val tokens, never from training data, never reading future tokens.p_mix = λ * p_NN + (1 - λ) * p_PPM, then-logfor the byte's contribution to BPB.The PPM state is a Python
dict[bytes, dict[int, int]]of context to {byte: count}; runs in roughly 25 s on a 3M-token val subset, well within the eval budget.Why this seems to help on this specific challenge: the parameter-constrained LM has a known floor on byte-level surprisal coming from the long tail of low-frequency byte contexts (URLs, code identifiers, numerical literals). PPM's strength is that long tail: with no parameters and an order-5 byte context it routinely assigns near-1 probability to the next byte in a code block or a recurring proper noun where the NN is forced to spread mass thin. The binary gate on PPM's local confidence captures this conditionally, trusting PPM exactly when its top-symbol probability is high and falling back to NN otherwise. Across our experiments the conditional structure dominated any continuous learned mixture: a meta-mix variant we tried that learned per-expert weights from running loss regressed because it averaged out PPM's high-confidence local wins.
Per-seed results
Three independent seeds, all with
ppm_mix < 1.003. Pairwise std 0.00111. The 0.005-nat-significance bar is exceeded by over 70× the std, well past thep < 0.01threshold required by the contest rules. Sliding and TTT lines are reported for completeness; the headline number is the PPM mix line.Legality (Issue #1017)
PPM is added strictly within the score-first-then-update discipline that the rules require for eval-time adaptation:
torch.inference_mode). PPM counts are incremented after the byte's mixed log-prob is recorded, never before.Additionally:
eval_val_slidinginvocations, eliminating any test-leakage-from-prior-run concernCompliance numbers
final_model.int6.ptzmeanfinal_model.int6.ptzmaxtrain_gpt.py(lzma+base85 wrapped)The
train_gpt.pyis a 26.4 KB launcher that lzma-decompresses + execs the original 104.7 KB training script. Verbatim semantics preserved. The wrapper build is deterministic across Python 3.10 through 3.12+ (verified byte-identical), and the decompressed source is plain Python 3.10-compatible (no PEP 701 nested-quote f-strings) so the wrapper is robust to whatever Python the evaluator runs.To inspect the readable source without executing it:
Files
train_gpt.py, wrapped launcher (the actual submission code, 26.4 KB)final_model.int6.ptz, quantized + brotli-11 + byte-shuffled model weightstrain_seed{1337,42,7}.log, full per-seed training and eval logsfinal_model_source.log, best-seed log for the included artifact (seed 1337)submission.json, metadataThe full pipeline (data download, preflight, 3 seeds, eval, packaging) is in
run_submit_ref.sh. PPM hyperparameters (PPM_ORDER=5,PPM_LAMBDA_HI=0.9,PPM_LAMBDA_LO=0.05,PPM_CONF_THRESHOLD=0.9,PPM_SUBSET_TOKENS=3000000) are documented inline.Acknowledgments
This submission runs on top of an evolved chain of contributions, and we thank the authors who built that stack: @bigbag (PR #1493), @dexhunter (PR #1413), @clarkkev (PR #1394), and the score-first TTT framework (PR #549, #1413). The PPM construction itself is classical (Cleary & Witten 1984; Moffat 1990; Howard 1993); what's contributed here is the recognition that PPM works well as the eval-time companion to a parameter-constrained LM and that, applied carefully inside the score-first discipline, it adds a clean improvement.