Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621 #1881
Conversation
Direct successor to our PR openai#1854 with the data-coverage correction motivated by @dexhunter's PR openai#1858 comment. Inherits the PR openai#1797 verbatim base stack (CaseOps + SparseAttnGate + PolarNS + MIN_LR + FusedCE + LQER asym + Phased TTT) and ports the PPM-D byte mixture from @anmarhindi's PR openai#1835 (order-5, binary-lambda gate, score-before-update).

3-seed mean (8xH100 SXM, brotli, ~15.95MB):
- mix_bpb (8M PPM subset): 0.901886 (std 0.000803)
- quantized_ttt_phased (full 47.85M val): 1.062106 (std 0.001166)
- total_eval_time: 576.7s, train_time: 599.6s, all under 600s caps

Data parity correction vs PR openai#1854: PR openai#1854 used the `--val-docs` default of 10000 (9.66M val); this v2 uses an explicit `--val-docs 50000` matching @dexhunter's PR openai#1797 reference seed log (47,853,344 val tokens vs reference 47,851,520, parity 0.004%).

Neural-only quantized_ttt_phased on shared seeds vs @dexhunter's PR openai#1797:
- seed 42: ours 1.06181 vs dex 1.06181 (byte-identical)
- seed 314: ours 1.06112 vs dex 1.06083 (delta +0.00029, within seed noise)

Headline disclosure: mix_bpb is measured on PPM_SUBSET_TOKENS=8000000 (16.7% of full val). This is structural — PPM at 35M coverage measured 1041s eval, exceeding the 600s cap. All non-PPM diagnostics are computed on the full 47.85M val.

Compliance: PPM-D legality pending Issue openai#1872 (token-vs-byte alphabet question, called out by name); CaseOps legality pending Issue openai#1604. If openai#1872 rules the byte alphabet legal, the headline 0.9019 is valid. If openai#1872 rules against, the 1.062106 neural-only number remains, beating merged SOTA PR openai#1493 (1.0810) by -0.019 BPB on full val.
Hi @ndokutovich — thanks for the careful disclosure in the README. Two notes for the reviewers' benefit:

1. PPM-mixture lineage. The byte-level PPM-D mixture lever was first introduced in #1795 (filed 2026-04-23). PR #1835 followed on 2026-04-25. If the lineage table is meant to track the origin of the technique class, #1795 is the earlier reference.

2. Headline metric is on a subset. The headline `mix_bpb=0.9019` is computed on an 8M-token subset (16.7% of full val), so it is not directly comparable to full-val leaderboard numbers; the full-val figure in this submission is the neural-only 1.0621.

Not a blocker on the implementation quality — just want the lineage and the coverage caveat visible to anyone reading the leaderboard.
…ixture class — Per @OE-GOD's review note on this PR, the byte-level PPM-D mixture technique class was first introduced in PR openai#1795 (2026-04-23), with @anmarhindi's PR openai#1835 (2026-04-25, our port source) following two days later.

Updates:
- Opening summary cites PR openai#1795 as the class introduction and PR openai#1835 as our port source
- Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs
- Acknowledgements section reordered to lead with PR openai#1795 chronologically
- PPM-D cluster list in the compliance section now includes openai#1795

No code or score changes.
Thanks @OE-GOD — both points well taken.

1. Lineage correction. Updated in 4cbab86: PR #1795 is now credited as the earliest reference for the byte-level PPM-D mixture technique class (2026-04-23), with @anmarhindi's PR #1835 (2026-04-25) noted as the specific implementation we ported from. Acknowledgements reordered to lead chronologically.

2. Coverage caveat. Agreed — and the README's PPM coverage disclosure section was the explicit attempt to make this visible: `mix_bpb=0.901886` is on `PPM_SUBSET_TOKENS=8,000,000` (16.7% of the 47,853,344-token val), and the comparable full-val number in this submission is the neural-only `quantized_ttt_phased=1.062106`, which sits at parity with PR #1797 (1.06157; byte-identical to dexhunter on shared seed 42). The 8M-subset headline class metric is comparable to PR #1795/#1835/#1850/#1854 — and Issue #1872 is indeed where the class-level decision lies.

For what it's worth, on the C2 question your phrasing — "`p_NN` (bit-conserving spread) and `p_PPM` are both normalized over 256 bytes, convex combination is normalized" — matches the byte-alphabet reading in the issue. We're agnostic to which way the ruling goes; this submission is hedged with the 1.0621 full-val neural-only number as the structural fallback if the class is disallowed.

Thanks for the careful flag — the leaderboard's traceability is better with PR #1795 cited.
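For concreteness, the normalization identity that phrasing appeals to is a one-line check: if both component distributions sum to one over the 256-byte alphabet, any convex mixture of them does too:

$$
\sum_{b=0}^{255}\Bigl(\lambda\, p_{\mathrm{PPM}}(b) + (1-\lambda)\, p_{\mathrm{NN}}(b)\Bigr)
= \lambda \sum_b p_{\mathrm{PPM}}(b) + (1-\lambda)\sum_b p_{\mathrm{NN}}(b)
= \lambda + (1-\lambda) = 1, \qquad \lambda \in [0,1].
$$

With the binary-lambda gate, $\lambda \in \{0,1\}$, so this reduces to hard selection of one normalized model per position.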
Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage)
val_bpb (mix): 0.901886 (3-seed mean, std 0.000803, PPM_SUBSET_TOKENS=8,000,000)
val_bpb (neural-only quantized_ttt_phased): 1.062106 (3-seed mean, std 0.001166, full 47.85M val)
~15.95 MB | 8×H100 SXM | 599.6s train / 576.7s eval
What this is
Direct successor to our PR #1854, with the data-coverage correction motivated by @dexhunter's comment on PR #1858. Inherits @dexhunter's PR #1797 base stack verbatim (CaseOps + SparseAttnGate + PolarNS + MIN_LR + FusedCE + LQER asym + Phased TTT) and ports the PPM-D byte mixture from @anmarhindi's PR #1835 (order-5, binary-lambda gate, score-before-update).
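To make the ported lever concrete, here is a minimal sketch of an order-5 PPM-style byte model mixed with a neural next-byte distribution under a binary-lambda gate, with score-before-update ordering. This is an illustration, not PR #1835's actual code: the blending form of the back-off, the method-D escape, and the gate criterion (recent coding cost, a stand-in since the PR's exact rule isn't quoted here) are assumptions, and `nn_dist` is a hypothetical stand-in for the frozen neural model.

```python
import numpy as np

ORDER, K = 5, 256  # order-5 context, byte alphabet

class PPMByteModel:
    """PPM with method-D escapes, written in 'blending' form so dist()
    returns an exactly normalized 256-way distribution (exclusions omitted)."""
    def __init__(self, order=ORDER):
        self.order = order
        self.tables = [{} for _ in range(order + 1)]  # per-order: context -> counts

    def dist(self, history: bytes) -> np.ndarray:
        p = np.full(K, 1.0 / K)  # order -1 fallback: uniform over bytes
        for k in range(min(self.order, len(history)) + 1):
            counts = self.tables[k].get(history[len(history) - k:])
            if counts is None:
                continue
            n, d = counts.sum(), int((counts > 0).sum())
            q = np.where(counts > 0, (2 * counts - 1) / (2 * n), 0.0)  # method D
            p = q + (d / (2 * n)) * p  # escape mass backs off into lower orders
        return p

    def update(self, history: bytes, byte: int) -> None:
        for k in range(min(self.order, len(history)) + 1):
            ctx = history[len(history) - k:]
            self.tables[k].setdefault(ctx, np.zeros(K))[byte] += 1

def mix_bits(data: bytes, nn_dist, window=64) -> float:
    """Code `data` with the hard (binary-lambda) PPM/NN mixture.
    Each byte is scored BEFORE the PPM counts see it (no leakage)."""
    ppm, bits = PPMByteModel(), 0.0
    cost_ppm = cost_nn = 0.0             # decayed recent coding costs (gate state)
    decay = 1.0 - 1.0 / window
    for i, b in enumerate(data):
        hist = data[max(0, i - ORDER):i]
        p_ppm, p_nn = ppm.dist(hist), nn_dist(data[:i])
        lam = 1.0 if cost_ppm <= cost_nn else 0.0  # binary lambda: pick one model
        p = lam * p_ppm + (1.0 - lam) * p_nn
        bits += -np.log2(max(p[b], 1e-12))         # score first...
        cost_ppm = decay * cost_ppm - np.log2(max(p_ppm[b], 1e-12))
        cost_nn = decay * cost_nn - np.log2(max(p_nn[b], 1e-12))
        ppm.update(hist, b)                        # ...then update
    return bits / max(len(data), 1)                # bits per byte

# Example with a uniform stand-in for the neural model:
# bpb = mix_bits(b"abracadabra" * 50, lambda prefix: np.full(256, 1 / 256))
```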
What changed vs PR #1854
PR #1854 inherited @dexhunter's prep invocation with the argparse default `--val-docs 10000`, producing `val_tokens=9,662,464` (~17% of leaderboard val coverage). @dexhunter's own seed log, however, passes `--val-docs 50000` explicitly (47.85M val tokens), a detail easy to miss. This v2 reproduces that explicit reference invocation.
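The failure mode is the ordinary argparse one; a minimal sketch, with the script name and everything other than `--val-docs` hypothetical rather than the repo's actual prep code:

```python
# prep_val.py (hypothetical name). Omitting the flag quietly evaluates on
# ~9.66M of the ~47.85M val tokens, which is what PR #1854 inherited.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--val-docs", type=int, default=10000,
                    help="number of validation documents to tokenize")
args = parser.parse_args()
print(f"preparing val split from {args.val_docs} docs")
```

Invoked bare (`python prep_val.py`) it reproduces PR #1854's truncated coverage; invoked as `python prep_val.py --val-docs 50000` it matches PR #1797's reference seed log.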
Reproduction parity with PR #1797
On the two seeds shared with @dexhunter's PR #1797 (seeds 42 and 314), our quantized_ttt_phased on his exact val coverage:

| seed | ours | dexhunter PR #1797 | delta |
|------|------|--------------------|-------|
| 42 | 1.06181 | 1.06181 | byte-identical |
| 314 | 1.06112 | 1.06083 | +0.00029 (within seed noise) |
The PPM-D byte-mixture layer is the only delta from his stack, demonstrating clean additivity.
PPM coverage disclosure
Headline `mix_bpb=0.901886` is measured on `PPM_SUBSET_TOKENS=8,000,000` (16.7% of the full 47,853,344-token val). The subset is structural, not a choice of convenience: PPM mix at 35M coverage measured `total_eval_time:1041s` for seed 42 (over the 600s eval cap; see internal logs), and 8M is the largest subset that fits under the cap with the full Phased TTT pipeline.
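A back-of-envelope check of that sizing, assuming eval time is linear in PPM-scored tokens (the linearity is an assumption; the two data points are this PR's 3-seed mean at 8M and the seed-42 measurement at 35M):

```python
# Two measured eval times from this submission's logs:
t_8m, t_35m = 576.7, 1041.0        # seconds at 8M and 35M PPM tokens
per_m = (t_35m - t_8m) / (35 - 8)  # ~17.2 s per million PPM tokens
overhead = t_8m - 8 * per_m        # ~439 s of non-PPM eval work
max_m = (600 - overhead) / per_m   # ~9.4M tokens fit under the 600 s cap
print(f"{per_m:.1f} s/M PPM cost; cap admits ~{max_m:.1f}M tokens")
```

Under that model the 600s cap admits roughly 9.4M PPM tokens, so the 8M subset leaves ~23s of margin.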
The headline is therefore directly comparable to other PPM-D byte-mixture submissions using the same subset (PR #1835, PR #1850, PR #1854; PR #1858 if re-run on the subset). All non-PPM diagnostics (`quantized_ttt_phased`, `diagnostic_quantized_no_ttt`, `diagnostic_pre_quantization_post_ema`) are computed on the full 47,853,344-token val and *are* directly comparable to PR #1797 (1.06157) and merged SOTA PR #1493 (1.0810) per the leaderboard's standard byte-level BPB metric.
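For reference, byte-level BPB here is the mean negative log-likelihood in bits per byte (the standard definition; the repo's exact evaluation code is not reproduced in this record):

$$
\mathrm{bpb} \;=\; \frac{1}{N}\sum_{i=1}^{N} -\log_2 p\!\left(b_i \mid b_{<i}\right)
$$

where $N$ is the number of bytes scored and $p$ is the model's next-byte distribution.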
Hedged ruling outcomes
This submission contains both numbers in the same artifact:

- If Issue #1872 rules the byte alphabet legal: the headline `mix_bpb=0.9019` stands.
- If Issue #1872 rules against: the neural-only `quantized_ttt_phased=1.062106` remains, beating merged SOTA PR #1493 (1.0810) by -0.019 BPB on full val.
Compliance (Issue #1017)
- PPM-D byte-mixture legality is pending Issue #1872 (token-vs-byte alphabet question, called out by name); the PPM-D cluster list here includes PR #1795 per the lineage correction.
- CaseOps legality is pending Issue #1604.
Files
Acknowledgements
PR #1795 (2026-04-23) for introducing the byte-level PPM-D mixture technique class. @anmarhindi for PR #1835 (2026-04-25), the implementation our port is based on. @dexhunter for the PR #1797 base stack and the PR #1858 methodology comment that motivated this v2. @romeerp / PR #1729 lineage for the CaseOps bijective tokenizer.