Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621 #1881

Open
ndokutovich wants to merge 2 commits into openai:main from ndokutovich:submission-v2-fullval

Conversation

@ndokutovich

Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage)

val_bpb (mix): 0.901886 (3-seed mean, std 0.000803, PPM_SUBSET_TOKENS=8,000,000)
val_bpb (neural-only quantized_ttt_phased): 1.062106 (3-seed mean, std 0.001166, full 47.85M val)
~15.95 MB | 8×H100 SXM | 599.6s train / 576.7s eval

What this is

Direct successor to our PR #1854, with the data-coverage correction motivated by @dexhunter's comment on PR #1858. Inherits @dexhunter's PR #1797 base stack verbatim (CaseOps + SparseAttnGate + PolarNS + MIN_LR + FusedCE + LQER asym + Phased TTT) and ports the PPM-D byte mixture from @anmarhindi's PR #1835 (order-5, binary-lambda gate, score-before-update).
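For readers unfamiliar with the lever, here is a minimal sketch of an order-5 PPM-D byte model mixed with a neural byte distribution under score-before-update semantics. It is illustrative only: `neural_byte_probs` is a hypothetical stand-in for the neural model's byte-level output, the submission's binary-lambda gate presumably selects λ ∈ {0,1} per position whereas this sketch uses a fixed convex λ for brevity, and the actual implementation lives in `_ppm_mixture_bpb` in `train_gpt.py`. Full PPM-D also applies symbol exclusion on escape, which this sketch omits.

```python
import math
from collections import defaultdict

MAX_ORDER = 5  # order-5 PPM-D, as in the ported PR #1835 lever

class PPMD:
    def __init__(self):
        # contexts[k] maps the last k bytes of history to a {byte: count} table
        self.contexts = [defaultdict(lambda: defaultdict(int))
                         for _ in range(MAX_ORDER + 1)]

    def probs(self, history: bytes) -> list:
        """Blend orders 5..0 with method-D escape weights, uniform floor."""
        p = [0.0] * 256
        weight = 1.0  # probability mass escaped down to this order
        for k in range(min(MAX_ORDER, len(history)), -1, -1):
            table = self.contexts[k][history[len(history) - k:]]
            n = sum(table.values())
            if n == 0:
                continue
            for b, c in table.items():
                p[b] += weight * (c - 0.5) / n      # method-D discounted count
            weight *= (len(table) / 2) / n          # method-D escape probability
        for b in range(256):
            p[b] += weight / 256                    # order -1: uniform over bytes
        return p

    def update(self, history: bytes, byte: int) -> None:
        for k in range(min(MAX_ORDER, len(history)) + 1):
            self.contexts[k][history[len(history) - k:]][byte] += 1

def mixture_bpb(data: bytes, neural_byte_probs, lam: float = 0.5) -> float:
    """p_mix = lam * p_NN + (1 - lam) * p_PPM, both normalized over 256 bytes."""
    ppm, nats = PPMD(), 0.0
    for t in range(len(data)):
        hist = data[max(0, t - MAX_ORDER):t]
        p_nn = neural_byte_probs(data[:t])          # hypothetical model hook
        p_ppm = ppm.probs(hist)
        b = data[t]
        p = lam * p_nn[b] + (1 - lam) * p_ppm[b]
        nats += -math.log(max(p, 1e-12))            # C3: record -log p_mix(t) ...
        ppm.update(hist, b)                         # ... before the counts advance
    return nats / (max(len(data), 1) * math.log(2))  # bits per byte
```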

What changed vs PR #1854

PR #1854 inherited @dexhunter's prep with the argparse default `--val-docs 10000`, producing `val_tokens=9,662,464` (~17% of leaderboard val coverage). @dexhunter's own seed log, however, silently uses `--val-docs 50000` (47.85M val tokens). This v2 reproduces that reference invocation explicitly.
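For concreteness, the corrected prep reduces to making the flag explicit; a hedged sketch (the prep script may take further flags not shown here):

```python
import subprocess

# Explicit val coverage: do not rely on the argparse default of 10000 docs.
subprocess.run(
    ["python", "prepare_caseops_data.py", "--val-docs", "50000"],
    check=True,
)
```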

| Metric | PR #1854 (v1) | This (v2) |
| --- | --- | --- |
| `val_tokens` | 9,662,464 | 47,853,344 |
| `total_docs` | 10,000 | 50,000 |
| Reference parity vs PR #1797 (47,851,520) | 79.8% | 100.0% (delta 0.004%) |
| Headline `mix_bpb`, 3-seed mean | 0.90236 | 0.901886 |
| Neural-only `quantized_ttt_phased`, 3-seed mean | 1.06791 (on 9.66M, not comparable) | 1.062106 (on full 47.85M) |

Reproduction parity with PR #1797

On the two seeds shared with @dexhunter's PR #1797 (42 and 314), our `quantized_ttt_phased` numbers on his exact val coverage are:

| Seed | dexhunter PR #1797 | This v2 | Delta |
| --- | --- | --- | --- |
| 42 | 1.06181 | 1.06181 | +0.00000 (byte-identical) |
| 314 | 1.06083 | 1.06112 | +0.00029 (within seed noise) |

The PPM-D byte-mixture layer is the only delta from his stack, demonstrating clean additivity.

PPM coverage disclosure

Headline `mix_bpb=0.901886` is measured on `PPM_SUBSET_TOKENS=8,000,000` (16.7% of the full 47,853,344 val). The limit is structural: the PPM mix at 35M coverage measured `total_eval_time:1041s` for seed 42 (over the 600s eval cap; see internal logs), and the 8M subset is the largest that fits under the cap with the full Phased TTT pipeline.
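A minimal sketch of how such a subset cap can be wired, assuming the `PPM_SUBSET_TOKENS` environment variable gates a prefix of the val stream (the actual wiring inside `_ppm_mixture_bpb` may differ):

```python
import os

# Largest PPM coverage that fits under the 600s eval cap alongside Phased TTT.
PPM_SUBSET_TOKENS = int(os.environ.get("PPM_SUBSET_TOKENS", "8000000"))

def ppm_eval_tokens(val_tokens):
    # The PPM mixture scores only this prefix; neural diagnostics use full val.
    return val_tokens[:PPM_SUBSET_TOKENS]
```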

The headline is therefore directly comparable to other PPM-D byte-mixture submissions using the same subset (PR #1835, PR #1850, PR #1854, and PR #1858 if re-run on the subset). All non-PPM diagnostics (`quantized_ttt_phased`, `diagnostic_quantized_no_ttt`, `diagnostic_pre_quantization_post_ema`) are computed on the full 47,853,344 val and are directly comparable to PR #1797 (1.06157) and merged SOTA PR #1493 (1.0810) per the leaderboard's standard byte-level BPB metric.
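For clarity on the comparability claim, byte-level BPB is total negative log-likelihood converted to bits and divided by the original byte count; a minimal sketch:

```python
import math

def byte_level_bpb(total_nats: float, num_utf8_bytes: int) -> float:
    # Sum of -ln p over the original pre-transform UTF-8 bytes, in bits per byte.
    return total_nats / (math.log(2) * num_utf8_bytes)
```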

Hedged ruling outcomes

This submission carries both hedged numbers in the same artifact: if Issue #1872 rules the byte alphabet legal, the headline `mix_bpb=0.901886` stands; if it rules against the class, the full-val neural-only `quantized_ttt_phased=1.062106` remains, beating merged SOTA PR #1493 (1.0810) by 0.019 BPB.

Compliance (Issue #1017)

  • C1 (causal): PPM context at byte t uses bytes <t only. Phased TTT updates the per-document LoRA adapter only after scoring each chunk. SparseAttnGate / Smear gate are causal per the PR #1797 audit.
  • C2 (normalized): the token-vs-byte alphabet question is the subject of Issue #1872 ("Legality clarification: byte-level PPM-D mixture submissions under Issue #1017 C2"; cocohearts ruling pending). This submission is in the PPM-D cluster called out there by name.
  • C3 (score-before-update): Phased TTT scores each chunk before its SGD step (per-document LoRA reset); PPM-D counts at byte t are incremented only after `−log p_mix(t)` is recorded. See the sketch after this list.
  • C4 (single pass): one left-to-right traversal, sliding stride 64; no rescoring or selection.
  • Section V byte-level BPB: scored on the original pre-transform UTF-8 bytes via the per-token byte sidecar (`fineweb_val_bytes_*.bin`).
  • Caps: all 3 seeds at 599.575–599.628s train and 575.9–578.3s eval; artifact size 15,950,213–15,953,505 bytes.
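For reviewers, a minimal sketch of the C3/C4 eval discipline described above. The model API (`reset_lora`, `score_nats`, `sgd_step`) is hypothetical shorthand rather than the actual `train_gpt.py` interface, and the sketch advances in non-overlapping stride-64 chunks; the real sliding-window geometry may differ.

```python
def ttt_eval_nats(model, doc_tokens, stride: int = 64) -> float:
    model.reset_lora()                    # C1/C3: fresh per-document adapter
    total = 0.0
    for start in range(0, len(doc_tokens), stride):
        chunk = doc_tokens[start:start + stride]
        total += model.score_nats(chunk)  # C3: score the chunk first ...
        model.sgd_step(chunk)             # ... then take the TTT step on it
    return total                          # C4: one pass, no rescore/selection
```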

Files

  • `README.md` — full table of results, stack, reproducibility recipe
  • `submission.json` — machine-readable per-seed metrics + compliance notes
  • `train_gpt.py` — PR #1797 base + PPM-D byte-mixture port (208-line addition: `build_token_bytes_lut` + `_ppm_mixture_bpb`)
  • `prepare_caseops_data.py` — CaseOps prep (use `--val-docs 50000` explicitly)
  • `lossless_caps.py` — bijective CaseOps tokenizer transform
  • `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — SP8192 + CaseOps SP model
  • `train_seed{42,1337,314}.log` — full per-seed training/eval logs

Acknowledgements

@dexhunter for the PR #1797 base stack and the PR #1858 methodology comment that motivated this v2. @anmarhindi for the PR #1835 PPM-D byte-mixture port. @romeerp / PR #1729 lineage for the CaseOps bijective tokenizer.

Record: PR #1797 base + PPM-D byte mixture (v2, full-val coverage) — mix_bpb 0.9019 / quantized_ttt 1.0621

Direct successor to our PR openai#1854 with the data-coverage correction motivated by
@dexhunter's PR openai#1858 comment. Inherits PR openai#1797 verbatim base stack
(CaseOps + SparseAttnGate + PolarNS + MIN_LR + FusedCE + LQER asym + Phased TTT)
and ports PPM-D byte mixture from @anmarhindi PR openai#1835 (order-5, binary-lambda
gate, score-before-update).

3-seed mean (8xH100 SXM, brotli, ~15.95MB):
  mix_bpb (8M PPM subset):              0.901886 (std 0.000803)
  quantized_ttt_phased (full 47.85M val): 1.062106 (std 0.001166)
  total_eval_time: 576.7s, train_time: 599.6s, all under 600s caps

Data parity correction vs PR openai#1854:
  PR openai#1854 used --val-docs default=10000 (9.66M val), this v2 uses explicit
  --val-docs 50000 matching dexhunter PR openai#1797 reference seed log
  (47,853,344 val tokens vs reference 47,851,520, parity 0.004%).

Neural-only quantized_ttt_phased on shared seeds vs dexhunter PR openai#1797:
  seed 42:  ours 1.06181 vs dex 1.06181 (byte-identical)
  seed 314: ours 1.06112 vs dex 1.06083 (delta +0.00029, within seed noise)

Headline disclosure: mix_bpb is on PPM_SUBSET_TOKENS=8000000 (16.7% of full
val); structural — PPM at 35M coverage measured 1041s eval, exceeding 600s
cap. All non-PPM diagnostics computed on full 47.85M val.

Compliance: PPM-D legality pending Issue openai#1872 (token-vs-byte alphabet
question, called out by name). CaseOps legality pending Issue openai#1604.
If openai#1872 rules byte-alphabet legal: headline 0.9019 valid.
If openai#1872 rules against: 1.062106 neural-only remains, beats merged SOTA
PR openai#1493 (1.0810) by -0.019 BPB on full val.
@OE-GOD

OE-GOD commented Apr 28, 2026

Hi @ndokutovich — thanks for the careful disclosure in the README. Two notes for the reviewers' benefit:

1. PPM-mixture lineage. The byte-level PPM-D mixture lever was first introduced in #1795 (filed 2026-04-23). PR #1835 followed on 2026-04-25. If the lineage table is meant to track origin of the technique class, #1795 is the earlier reference.

2. Headline metric is on a subset. The headline mix_bpb=0.901886 is measured on PPM_SUBSET_TOKENS=8,000,000 (16.7% of the 47.85M val). The full-val honest number in this submission is the neural-only quantized_ttt_phased=1.062106, which is at parity with the #1797 base (1.06157) — i.e. the PPM mix's contribution isn't measured at full leaderboard coverage. Other PPM-mixture submissions using the same 8M subset (#1835, #1850, #1854) are in the same situation, and #1872 seems to be where the legality of this class is being decided.

Not a blocker on the implementation quality — just want the lineage and the coverage caveat visible to anyone reading the leaderboard.

…ixture class

Per @OE-GOD review note on this PR — the byte-level PPM-D mixture technique
class was first introduced in PR openai#1795 (2026-04-23), with anmarhindi's
PR openai#1835 (2026-04-25, our port source) following two days later.

Updates:
- Opening summary cites PR openai#1795 as class introduction + PR openai#1835 as our port source
- Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs
- Acknowledgements section reordered to lead with PR openai#1795 chronologically
- PPM-D cluster list in compliance section now includes openai#1795

No code or score changes.
@ndokutovich
Author

Thanks @OE-GOD — both points well taken.

1. Lineage correction. Updated in 4cbab86: PR #1795 is now credited as the earliest reference of the byte-level PPM-D mixture technique class (2026-04-23), with anmarhindi's PR #1835 (2026-04-25) noted as the specific implementation we ported from. Acknowledgements reordered to lead chronologically.

2. Coverage caveat. Agreed — and the README's PPM coverage disclosure section was the explicit attempt to make this visible: `mix_bpb=0.901886` is on `PPM_SUBSET_TOKENS=8,000,000` (16.7% of the 47,853,344 val), and the comparable full-val number in this submission is the neural-only `quantized_ttt_phased=1.062106` — which sits at parity with PR #1797 (1.06157, byte-identical to dexhunter on shared seed 42). The 8M-subset headline class metric is comparable to PR #1795/#1835/#1850/#1854 — and Issue #1872 is indeed where the class-level decision lies.

For what it's worth, on the C2 question your phrasing of "`p_NN` (bit-conserving spread) and `p_PPM` are both normalized over 256 bytes, convex combination is normalized" matches the byte-alphabet reading in the issue. We're agnostic to which way the ruling goes — this submission is hedged with the 1.0621 full-val neural-only number as the structural fallback if the class is disallowed.
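To make that check explicit: with both components normalized over the 256-byte alphabet,

$$\sum_{b=0}^{255}\left[\lambda\,p_{\mathrm{NN}}(b)+(1-\lambda)\,p_{\mathrm{PPM}}(b)\right]=\lambda+(1-\lambda)=1\quad\text{for any }\lambda\in[0,1],$$

with the binary gate as the special case $\lambda\in\{0,1\}$.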

Thanks for the careful flag — the leaderboard's traceability is better with PR #1795 cited.
