Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621 #1881

Open
ndokutovich wants to merge 2 commits into openai:main from ndokutovich:submission-v2-fullval

Conversation

@ndokutovich

Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage)

val_bpb (mix): 0.901886 (3-seed mean, std 0.000803, PPM_SUBSET_TOKENS=8,000,000)
val_bpb (neural-only quantized_ttt_phased): 1.062106 (3-seed mean, std 0.001166, full 47.85M val)
~15.95 MB | 8×H100 SXM | 599.6s train / 576.7s eval

What this is

Direct successor to our PR #1854, with the data-coverage correction motivated by @dexhunter's comment on PR #1858. Inherits @dexhunter's PR #1797 base stack verbatim (CaseOps + SparseAttnGate + PolarNS + MIN_LR + FusedCE + LQER asym + Phased TTT) and ports the PPM-D byte mixture from @anmarhindi's PR #1835 (order-5, binary-lambda gate, score-before-update).
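For readers unfamiliar with the lever, here is a minimal sketch of an order-5 PPM-D byte model mixed with a neural byte distribution under score-before-update semantics. It is illustrative only: `neural_byte_probs` is a hypothetical stand-in for the neural model's byte-level output, the submission's binary-lambda gate presumably selects λ ∈ {0,1} per position whereas this sketch uses a fixed convex λ for brevity, and the actual implementation lives in `_ppm_mixture_bpb` in `train_gpt.py`. Full PPM-D also applies symbol exclusion on escape, which this sketch omits.

```python
import math
from collections import defaultdict

MAX_ORDER = 5  # order-5 PPM-D, as in the ported PR #1835 lever

class PPMD:
    def __init__(self):
        # contexts[k] maps the last k bytes of history to a {byte: count} table
        self.contexts = [defaultdict(lambda: defaultdict(int))
                         for _ in range(MAX_ORDER + 1)]

    def probs(self, history: bytes) -> list:
        """Blend orders 5..0 with method-D escape weights, uniform floor."""
        p = [0.0] * 256
        weight = 1.0  # probability mass escaped down to this order
        for k in range(min(MAX_ORDER, len(history)), -1, -1):
            table = self.contexts[k][history[len(history) - k:]]
            n = sum(table.values())
            if n == 0:
                continue
            for b, c in table.items():
                p[b] += weight * (c - 0.5) / n      # method-D discounted count
            weight *= (len(table) / 2) / n          # method-D escape probability
        for b in range(256):
            p[b] += weight / 256                    # order -1: uniform over bytes
        return p

    def update(self, history: bytes, byte: int) -> None:
        for k in range(min(MAX_ORDER, len(history)) + 1):
            self.contexts[k][history[len(history) - k:]][byte] += 1

def mixture_bpb(data: bytes, neural_byte_probs, lam: float = 0.5) -> float:
    """p_mix = lam * p_NN + (1 - lam) * p_PPM, both normalized over 256 bytes."""
    ppm, nats = PPMD(), 0.0
    for t in range(len(data)):
        hist = data[max(0, t - MAX_ORDER):t]
        p_nn = neural_byte_probs(data[:t])          # hypothetical model hook
        p_ppm = ppm.probs(hist)
        b = data[t]
        p = lam * p_nn[b] + (1 - lam) * p_ppm[b]
        nats += -math.log(max(p, 1e-12))            # C3: record -log p_mix(t) ...
        ppm.update(hist, b)                         # ... before the counts advance
    return nats / (max(len(data), 1) * math.log(2))  # bits per byte
```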

What changed vs PR #1854

PR #1854 inherited @dexhunter's prep with the argparse default `--val-docs 10000`, producing `val_tokens=9,662,464` (~17% of leaderboard val coverage). @dexhunter's own seed log, however, silently uses `--val-docs 50000` (47.85M val tokens). This v2 reproduces that reference invocation explicitly.
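For concreteness, the corrected prep reduces to making the flag explicit; a hedged sketch (the prep script may take further flags not shown here):

```python
import subprocess

# Explicit val coverage: do not rely on the argparse default of 10000 docs.
subprocess.run(
    ["python", "prepare_caseops_data.py", "--val-docs", "50000"],
    check=True,
)
```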

| Metric | PR #1854 (v1) | This (v2) |
| --- | --- | --- |
| `val_tokens` | 9,662,464 | 47,853,344 |
| `total_docs` | 10,000 | 50,000 |
| Reference parity vs PR #1797 (47,851,520) | 79.8% | 100.0% (delta 0.004%) |
| Headline `mix_bpb`, 3-seed mean | 0.90236 | 0.901886 |
| Neural-only `quantized_ttt_phased`, 3-seed mean | 1.06791 (on 9.66M, not comparable) | 1.062106 (on full 47.85M) |

Reproduction parity with PR #1797

On the two seeds shared with @dexhunter's PR #1797 (42 and 314), our `quantized_ttt_phased` numbers on his exact val coverage are:

| Seed | dexhunter PR #1797 | This v2 | Delta |
| --- | --- | --- | --- |
| 42 | 1.06181 | 1.06181 | +0.00000 (byte-identical) |
| 314 | 1.06083 | 1.06112 | +0.00029 (within seed noise) |

The PPM-D byte-mixture layer is the only delta from his stack, demonstrating clean additivity.

PPM coverage disclosure

Headline `mix_bpb=0.901886` is measured on `PPM_SUBSET_TOKENS=8,000,000` (16.7% of the full 47,853,344 val). The limit is structural: the PPM mix at 35M coverage measured `total_eval_time:1041s` for seed 42 (over the 600s eval cap; see internal logs), and the 8M subset is the largest that fits under the cap with the full Phased TTT pipeline.
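A minimal sketch of how such a subset cap can be wired, assuming the `PPM_SUBSET_TOKENS` environment variable gates a prefix of the val stream (the actual wiring inside `_ppm_mixture_bpb` may differ):

```python
import os

# Largest PPM coverage that fits under the 600s eval cap alongside Phased TTT.
PPM_SUBSET_TOKENS = int(os.environ.get("PPM_SUBSET_TOKENS", "8000000"))

def ppm_eval_tokens(val_tokens):
    # The PPM mixture scores only this prefix; neural diagnostics use full val.
    return val_tokens[:PPM_SUBSET_TOKENS]
```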

The headline is therefore directly comparable to other PPM-D byte-mixture submissions using the same subset (PR #1835, PR #1850, PR #1854, and PR #1858 if re-run on the subset). All non-PPM diagnostics (`quantized_ttt_phased`, `diagnostic_quantized_no_ttt`, `diagnostic_pre_quantization_post_ema`) are computed on the full 47,853,344 val and are directly comparable to PR #1797 (1.06157) and merged SOTA PR #1493 (1.0810) per the leaderboard's standard byte-level BPB metric.
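For clarity on the comparability claim, byte-level BPB is total negative log-likelihood converted to bits and divided by the original byte count; a minimal sketch:

```python
import math

def byte_level_bpb(total_nats: float, num_utf8_bytes: int) -> float:
    # Sum of -ln p over the original pre-transform UTF-8 bytes, in bits per byte.
    return total_nats / (math.log(2) * num_utf8_bytes)
```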

Hedged ruling outcomes

This submission carries both hedged numbers in the same artifact: if Issue #1872 rules the byte alphabet legal, the headline `mix_bpb=0.901886` stands; if it rules against the class, the full-val neural-only `quantized_ttt_phased=1.062106` remains, beating merged SOTA PR #1493 (1.0810) by 0.019 BPB.

Compliance (Issue #1017)

  • C1 (causal): PPM context at byte t uses bytes <t only. Phased TTT updates the per-document LoRA adapter only after scoring each chunk. SparseAttnGate / Smear gate are causal per the PR #1797 audit.
  • C2 (normalized): the token-vs-byte alphabet question is the subject of Issue #1872 ("Legality clarification: byte-level PPM-D mixture submissions under Issue #1017 C2"; cocohearts ruling pending). This submission is in the PPM-D cluster called out there by name.
  • C3 (score-before-update): Phased TTT scores each chunk before its SGD step (per-document LoRA reset); PPM-D counts at byte t are incremented only after `−log p_mix(t)` is recorded. See the sketch after this list.
  • C4 (single pass): one left-to-right traversal, sliding stride 64; no rescoring or selection.
  • Section V byte-level BPB: scored on the original pre-transform UTF-8 bytes via the per-token byte sidecar (`fineweb_val_bytes_*.bin`).
  • Caps: all 3 seeds at 599.575–599.628s train and 575.9–578.3s eval; artifact size 15,950,213–15,953,505 bytes.
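For reviewers, a minimal sketch of the C3/C4 eval discipline described above. The model API (`reset_lora`, `score_nats`, `sgd_step`) is hypothetical shorthand rather than the actual `train_gpt.py` interface, and the sketch advances in non-overlapping stride-64 chunks; the real sliding-window geometry may differ.

```python
def ttt_eval_nats(model, doc_tokens, stride: int = 64) -> float:
    model.reset_lora()                    # C1/C3: fresh per-document adapter
    total = 0.0
    for start in range(0, len(doc_tokens), stride):
        chunk = doc_tokens[start:start + stride]
        total += model.score_nats(chunk)  # C3: score the chunk first ...
        model.sgd_step(chunk)             # ... then take the TTT step on it
    return total                          # C4: one pass, no rescore/selection
```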

Files

  • `README.md` — full table of results, stack, reproducibility recipe
  • `submission.json` — machine-readable per-seed metrics + compliance notes
  • `train_gpt.py` — PR #1797 base + PPM-D byte-mixture port (208-line addition: `build_token_bytes_lut` + `_ppm_mixture_bpb`)
  • `prepare_caseops_data.py` — CaseOps prep (use `--val-docs 50000` explicitly)
  • `lossless_caps.py` — bijective CaseOps tokenizer transform
  • `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — SP8192 + CaseOps SP model
  • `train_seed{42,1337,314}.log` — full per-seed training/eval logs

Acknowledgements

@dexhunter for the PR #1797 base stack and the PR #1858 methodology comment that motivated this v2. @anmarhindi for the PR #1835 PPM-D byte-mixture port. @romeerp / PR #1729 lineage for the CaseOps bijective tokenizer.

Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621

Direct successor to our PR openai#1854 with the data-coverage correction motivated by
@dexhunter's PR openai#1858 comment. Inherits PR openai#1797 verbatim base stack
(CaseOps + SparseAttnGate + PolarNS + MIN_LR + FusedCE + LQER asym + Phased TTT)
and ports PPM-D byte mixture from @anmarhindi PR openai#1835 (order-5, binary-lambda
gate, score-before-update).

3-seed mean (8xH100 SXM, brotli, ~15.95MB):
  mix_bpb (8M PPM subset):              0.901886 (std 0.000803)
  quantized_ttt_phased (full 47.85M val): 1.062106 (std 0.001166)
  total_eval_time: 576.7s, train_time: 599.6s, all under 600s caps

Data parity correction vs PR openai#1854:
  PR openai#1854 used --val-docs default=10000 (9.66M val), this v2 uses explicit
  --val-docs 50000 matching dexhunter PR openai#1797 reference seed log
  (47,853,344 val tokens vs reference 47,851,520, parity 0.004%).

Neural-only quantized_ttt_phased on shared seeds vs dexhunter PR openai#1797:
  seed 42:  ours 1.06181 vs dex 1.06181 (byte-identical)
  seed 314: ours 1.06112 vs dex 1.06083 (delta +0.00029, within seed noise)

Headline disclosure: mix_bpb is on PPM_SUBSET_TOKENS=8000000 (16.7% of full
val); structural — PPM at 35M coverage measured 1041s eval, exceeding 600s
cap. All non-PPM diagnostics computed on full 47.85M val.

Compliance: PPM-D legality pending Issue openai#1872 (token-vs-byte alphabet
question, called out by name). CaseOps legality pending Issue openai#1604.
If openai#1872 rules byte-alphabet legal: headline 0.9019 valid.
If openai#1872 rules against: 1.062106 neural-only remains, beats merged SOTA
PR openai#1493 (1.0810) by -0.019 BPB on full val.
@OE-GOD

OE-GOD commented Apr 28, 2026

Hi @ndokutovich — thanks for the careful disclosure in the README. Two notes for the reviewers' benefit:

1. PPM-mixture lineage. The byte-level PPM-D mixture lever was first introduced in #1795 (filed 2026-04-23). PR #1835 followed on 2026-04-25. If the lineage table is meant to track origin of the technique class, #1795 is the earlier reference.

2. Headline metric is on a subset. The headline mix_bpb=0.901886 is measured on PPM_SUBSET_TOKENS=8,000,000 (16.7% of the 47.85M val). The full-val honest number in this submission is the neural-only quantized_ttt_phased=1.062106, which is at parity with the #1797 base (1.06157) — i.e. the PPM mix's contribution isn't measured at full leaderboard coverage. Other PPM-mixture submissions using the same 8M subset (#1835, #1850, #1854) are in the same situation, and #1872 seems to be where the legality of this class is being decided.

Not a blocker on the implementation quality — just want the lineage and the coverage caveat visible to anyone reading the leaderboard.

…ixture class

Per @OE-GOD review note on this PR — the byte-level PPM-D mixture technique
class was first introduced in PR openai#1795 (2026-04-23), with anmarhindi's
PR openai#1835 (2026-04-25, our port source) following two days later.

Updates:
- Opening summary cites PR openai#1795 as class introduction + PR openai#1835 as our port source
- Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs
- Acknowledgements section reordered to lead with PR openai#1795 chronologically
- PPM-D cluster list in compliance section now includes openai#1795

No code or score changes.
@ndokutovich
Author

Thanks @OE-GOD — both points well taken.

1. Lineage correction. Updated in 4cbab86: PR #1795 is now credited as the earliest reference of the byte-level PPM-D mixture technique class (2026-04-23), with anmarhindi's PR #1835 (2026-04-25) noted as the specific implementation we ported from. Acknowledgements reordered to lead chronologically.

2. Coverage caveat. Agreed — and the README's PPM coverage disclosure section was the explicit attempt to make this visible: `mix_bpb=0.901886` is on `PPM_SUBSET_TOKENS=8,000,000` (16.7% of the 47,853,344 val), and the comparable full-val number in this submission is the neural-only `quantized_ttt_phased=1.062106` — which sits at parity with PR #1797 (1.06157, byte-identical to dexhunter on shared seed 42). The 8M-subset headline class metric is comparable to PR #1795/#1835/#1850/#1854 — and Issue #1872 is indeed where the class-level decision lies.

For what it's worth, on the C2 question your phrasing of "`p_NN` (bit-conserving spread) and `p_PPM` are both normalized over 256 bytes, convex combination is normalized" matches the byte-alphabet reading in the issue. We're agnostic to which way the ruling goes — this submission is hedged with the 1.0621 full-val neural-only number as the structural fallback if the class is disallowed.
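To make that check explicit: with both components normalized over the 256-byte alphabet,

$$\sum_{b=0}^{255}\left[\lambda\,p_{\mathrm{NN}}(b)+(1-\lambda)\,p_{\mathrm{PPM}}(b)\right]=\lambda+(1-\lambda)=1\quad\text{for any }\lambda\in[0,1],$$

with the binary gate as the special case $\lambda\in\{0,1\}$.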

Thanks for the careful flag — the leaderboard's traceability is better with PR #1795 cited.
