Record: SP8192 + Sliding-Window Eval + Conditional-PPM Byte Mixer Full-Val - val_bpb 1.015784 #2039
anmarhindi wants to merge 2 commits into openai:main
Conversation
This block of code breaks C1 (because `include_space` uses the current token to choose the probability distribution) and C2 (the flat fallback):

```python
b0 = token_bytes[0]
ppm_log_b0, conf_b0 = _ppm_byte_logprob_and_conf(b0, window, ctx_counts)
nn_p_b0 = float(byte0_nn_prob[i])  # <- the proper marginalization
if include_space:
    try:
        nn_p_b0 = math.exp(-float(nll_nats[i]) / n_bytes)  # <- FLAT FALLBACK, overwrites the marginal
    except OverflowError:
        nn_p_b0 = 0.0
```
…ixer Self-contained reference for byte-level NN scoring without the C1/C2 leak in PR openai#2039 / openai#1967 / openai#2018 / openai#2041. Shows ~-0.097 BPB legitimate gain on spec 250 seed_0 (1M val tokens), independent of include_space leak. Files: README, proper_ppm_mixer_rigorous.py (canonical), byte_bpb_proper.py (NN-only baseline), show_big_gains.py (inspection), test_byte0_3way.py (5-config leak validation).
…en-uniform fallback) — val_bpb 1.015784
Good catch — removed it. Re-ran 3 fresh seeds end-to-end with the corrected code as proof: val_bpb = 1.015784 (std 0.000524), full val (9,662,464 tokens / 32,756,252 canonical bytes per seed). Per-seed: 1.01519 / 1.01596 / 1.01620.
Leaderboard audit note (pre-cutoff state): I don't think this is valid as a record row. The include_space C1/C2 fallback was fixed before the cutoff, but the reported cond-PPM BPB still appears to use the wrong scoring surface/denominator: the logs score 9,662,464 tokens / 32,756,252 reconstructed token bytes, while the standard eval for the same split implies about 29.95M raw validation bytes. The official CaseOps BPB denominator should be the raw validation byte sidecar, not reconstructed transformed token-piece bytes; correcting seed 42's denominator puts the cond-PPM score around 1.11 BPB, not 1.015. It also does not cover the full current 50k-doc CaseOps validation set used by the accepted rows.
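The denominator correction above can be checked with quick arithmetic. A minimal sketch, using the approximate 29.95M raw-byte figure quoted in the note (the exact sidecar count would shift the third decimal slightly):

```python
# Quick check of the audit note's denominator correction.
reported_bpb = 1.015784
reconstructed_bytes = 32_756_252   # logged transformed token-piece bytes
raw_val_bytes = 29_950_000         # ~29.95M raw validation bytes (approximate)

# Total coded bits are fixed; BPB scales inversely with the byte denominator.
corrected_bpb = reported_bpb * reconstructed_bytes / raw_val_bytes
print(round(corrected_bpb, 3))  # ~1.111, i.e. "around 1.11 BPB"
```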
Summary
3-seed mean val_bpb = 1.015784 (std 0.000524), −0.065 BPB below the current leaderboard SOTA (1.0810). Hardware: 8×H100 SXM; training ≤600 s, eval ≤600 s; total submission 15,576,768 bytes (cap 16,000,000).
The contribution is a single eval-time post-processor on top of the PR #1493 stack: a conditional byte-level PPM mixer that derives `P_NN(byte_0 | history)` from the model's SP8192 softmax via a canonical first-byte LUT, then mixes it with a PPM-D byte conditional through a per-byte sigmoid gate. No new trainable parameters, no additional artifact bytes, no training-time changes.

Each seed's eval runs over the full validation set (9,662,464 tokens; 32,756,252 canonical bytes), gathered across all 8 ranks before the byte-level PPM-D state advances. `submission.json` records `eval_full_val_verified: true`, `eval_token_count_per_seed: 9662464`, and `eval_canonical_byte_count_per_seed: 32756252` as forensic attestations. A std of 0.000524 puts the −0.005-nat threshold at p ≪ 0.01.
Relationship to PR #1835
PR #1835 introduced byte-level mixing using the approximation

    P̃(b_i | history) ≈ exp(−NLL_NN(T) / k),   i = 0, …, k−1

for a realized token T of k bytes. This value is identical for every byte position in the realized token and does not sum to 1 over the byte alphabet — it is a per-byte-position constant rather than a probability distribution. The C2 concern raised on the #1835 thread follows directly: scoring the same realized byte yields different probabilities depending on which token later turns out to be correct, which is not autoregressive over the byte alphabet.
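A tiny numeric sketch of why that approximation fails C2, with illustrative numbers (a 3-byte token the NN assigns probability 0.2):

```python
import math

# Illustrative numbers: a 3-byte realized token with NN probability 0.2,
# i.e. NLL = -ln(0.2) nats.
nll_nats = -math.log(0.2)
n_bytes = 3

# The #1835-style per-byte value: one constant per byte position.
flat_p = math.exp(-nll_nats / n_bytes)  # geometric mean per byte

# Identical for every byte of the token, regardless of byte value...
per_byte = [flat_p] * n_bytes

# ...and summing it over the 256-byte alphabet gives 256 * flat_p,
# far from 1 -- not a probability distribution, so C2 fails.
alphabet_mass = 256 * flat_p
print(flat_p, alphabet_mass)
```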
This submission instead derives `P_NN(byte_0)` as a proper marginal over the SP8192 alphabet, weighted by a canonical first-byte LUT (rebuilt from the SP tokenizer at deserialize time, no extra storage):

    P_NN(byte_0 = b | history) = Σ_{T : fb(T, prev) = b} P_NN(T | history)

where fb(T, prev) is the canonical first byte of token T. The "canonical first byte" of token T at position t depends on whether the previous token x[t] is a boundary: if T carries the SP-BPE leading-space marker AND prev is non-boundary, the realized canonical first byte is `0x20`; otherwise it is the SP piece's natural first byte. `ValidationData.__init__` builds two first-byte masks (`mask_no_space` for prev-is-boundary, `mask_with_space` for prev-non-boundary), and `eval_val_sliding` selects per position via `is_boundary_token_lut[prev]`. This eliminates the per-token uniform-split fallback that earlier drafts of this mixer used for the include_space case, which would have re-introduced the C2 issue from #1835 for the ~40% of positions where include_space fires.

For the remaining bytes within the realized token (b₁..b_{k−1}), the chain-rule residual is

    P_NN(b₁ … b_{k−1} | b₀, history) = P_NN(T | history) / P_NN(byte_0 = b₀ | history).
Both terms are proper distributions over their respective alphabets, so their product is a proper distribution over the realized token's byte stream, and C2 holds by construction at every position regardless of leading-space status.
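A minimal, self-contained sketch of the two-mask first-byte marginalization. The toy vocabulary, uniform softmax, and helper name are illustrative stand-ins for SP8192; the mask names follow the description above:

```python
# Toy stand-in for the SP8192 vocabulary; "\u2581" is the SP leading-space marker.
pieces = ["\u2581the", "the", "\u2581cat", "cat", "\n"]
V = len(pieces)

def canonical_first_byte(piece: str, prev_is_boundary: bool) -> int:
    # The marker realizes as 0x20 only after a non-boundary token;
    # at a boundary it is dropped and the piece's natural byte applies.
    if piece.startswith("\u2581"):
        if prev_is_boundary:
            return piece[1:].encode("utf-8")[0]
        return 0x20
    return piece.encode("utf-8")[0]

# Two 256 x V first-byte masks, analogous to what ValidationData.__init__
# is described to build (mask_no_space for prev-is-boundary, mask_with_space otherwise).
mask_no_space = [[0.0] * V for _ in range(256)]
mask_with_space = [[0.0] * V for _ in range(256)]
for j, p in enumerate(pieces):
    mask_no_space[canonical_first_byte(p, True)][j] = 1.0
    mask_with_space[canonical_first_byte(p, False)][j] = 1.0

# Proper marginal: P_NN(byte_0 = b | h) sums softmax mass over all tokens
# whose canonical first byte is b. A uniform toy softmax stands in here.
softmax = [1.0 / V] * V
p_byte0 = [sum(m[j] * softmax[j] for j in range(V)) for m in mask_with_space]

assert abs(sum(p_byte0) - 1.0) < 1e-9       # a genuine distribution over bytes
assert abs(p_byte0[0x20] - 2.0 / V) < 1e-9  # the two leading-space pieces
```

Because every vocabulary column carries exactly one 1 in each mask, the marginal sums to 1 over the 256-byte alphabet by construction.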
The mix gate is a sigmoid on PPM context confidence c (the top-symbol probability of the matched PPM context):

    λ = σ(a·c + b),   P_mix(byte) = λ · P_PPM(byte) + (1 − λ) · P_NN(byte)

with fixed constants a, b (no trainable parameters).
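A runnable sketch of this conditional gate. The constants `a` and `b` below are illustrative assumptions, not the submission's actual values:

```python
import math

def mix_byte_prob(p_nn: float, p_ppm: float, ppm_conf: float,
                  a: float = 8.0, b: float = -4.0) -> float:
    """Sigmoid-gated mix of PPM and NN byte conditionals (sketch).

    ppm_conf is the PPM context confidence (top-symbol probability);
    a and b are illustrative fixed constants.
    """
    lam = 1.0 / (1.0 + math.exp(-(a * ppm_conf + b)))  # sigmoid gate
    return lam * p_ppm + (1.0 - lam) * p_nn

# High PPM top-symbol confidence -> the mix leans on the PPM conditional.
p_hi = mix_byte_prob(p_nn=0.10, p_ppm=0.95, ppm_conf=0.95)
# Low confidence -> fall back toward the NN marginal.
p_lo = mix_byte_prob(p_nn=0.10, p_ppm=0.30, ppm_conf=0.10)
print(p_hi, p_lo)
```

Because the gate depends only on PPM context confidence, never on the realized byte, it preserves the causality requirement noted in the compliance section.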
Why this works
The 16 MB cap forces a 35M-parameter model that cannot allocate enough mass to the long tail of byte sequences (URLs, code identifiers, repeated proper nouns, numerical literals). PPM's strength is exactly that tail: an order-5 byte context routinely assigns near-1 probability to the next byte inside a code block or a recurring named entity, while a parameter-constrained NN has to spread mass thin. The sigmoid gate captures this conditionally — trust PPM when its top-symbol probability is high, fall back to the NN otherwise. Deriving `P_NN(byte_0)` as a proper marginal of the SP8192 softmax (rather than a per-byte uniform-split approximation) means the gate weights two distributions that actually live on the same alphabet, so C2 holds at every position regardless of whether a leading-space prepend applies.

Ablation
3-seed means, applied sequentially:
The mixer contributes −0.091 BPB at eval time over the full validation set (post-sliding-window → post-cond-PPM). A local Blackwell ablation predicted larger absolute mixer gains on weaker base models — where the long-tail miss is more severe — so the better-trained base here narrows the absolute headroom while still landing well below SOTA.
Compliance (Issue #1017)
Per-position mask selection uses only the previous token, via `is_boundary_token_lut[prev]`. Mix gate weights depend on PPM context confidence only, never on the realized byte. No SLOT, no n-gram cache outside the legal byte-level PPM-D state, no logit bias, no ETLB, no pre-quant TTT (which would violate C3). Standard softmax over SP8192 at every scored position.
Eval-set coverage. The cond-PPM mixer runs over the full validation set (9,662,464 tokens / 32,756,252 canonical bytes per seed), with chunks gathered across all 8 ranks before byte-level PPM-D state advancement on rank 0. This is recorded in `submission.json` as `eval_full_val_verified: true`, `eval_token_count_per_seed: 9662464`, and `eval_canonical_byte_count_per_seed: 32756252`. Each seed's `train_seed*.log` contains a `cond_ppm tokens=9662464 bytes=32756252 cond_mix_bpb=...` line as forensic evidence.

Reproduction
Full reproduction commands, per-seed train + eval logs, and helper modules (CaseOps tokenizer + transform) are in `records/track_10min_16mb/2026-04-30_CondPPM_1.015784/`. The bundled `train_gpt.py` is lzma+base85 wrapped (49,485 bytes); the underlying source lives at `obliterate_0p9/train_gpt.py` on the `cond-ppm` branch of the author's repo. To inspect the readable source of `train_gpt.py` without executing it, base85-decode and lzma-decompress the wrapper offline (e.g. Python's `base64.b85decode` followed by `lzma.decompress`).

Artifact size
- `final_model.int6.ptz` (int6 GPTQ + brotli)
- `train_gpt.py` (lzma+base85 wrapped)

Helper modules (`lossless_caps.py`, `prepare_caseops_data.py`) and the SP8192 tokenizer are included for reproduction but do not count toward the cap, per Issue #1017 §III.

Acknowledgments
This submission stands on a chain of prior work: @clarkkev (PR #1394) for the SP8192 base; @samacqua (PR #1530) and @romeerp (PR #1729) for CaseOps; @nprime06 (PR #1787) for parallel residuals + looping; @dexhunter (PR #1797) for Smear + LQER; @cocohearts for the SmearGate BOS-mask fix; and the sliding-window stride-64 evaluator from PR #1493. The PPM-D byte conditional itself is classical (Cleary & Witten 1984; Moffat 1990; Howard 1993). What's contributed here is the per-position canonical first-byte marginalization that closes the C2 gap raised on PR #1835's earlier byte-mix attempt, and the empirical observation that an order-5 byte conditional pairs well with a parameter-constrained LM at this scale.