Record: SP8192 + Sliding-Window Eval + Conditional-PPM Byte Mixer - val_bpb 1.029282 (#2032)
Closed · anmarhindi wants to merge 1 commit into openai:main
Summary
3-seed mean val_bpb = 1.029282 (std 0.000782), an improvement of
−0.0517 bpb over the current SOTA (1.0810). 8×H100 SXM, eval ≤600 s,
artifact 15.59 MB.
The contribution is a single eval-time post-processor on top of the
PR #1493 stack: a conditional byte-level PPM mixer that derives
P_NN(byte_0 | history) from the model's SP8192 softmax via the
canonical first-byte LUT, then mixes it with a PPM-D byte conditional
through a per-byte sigmoid gate. No new trainable parameters, no
additional artifact bytes, no training-time changes.
With a 3-seed std of 0.0008, the improvement clears the −0.005
record threshold at p ≪ 0.01.
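As a back-of-envelope check on that claim (my own sketch, not part of the submission; it assumes the threshold means the 3-seed mean must beat the prior 1.0810 record by at least 0.005):

```python
from math import sqrt

# One-sample t-test against a hypothetical pass bar of 1.0810 - 0.005, using only
# numbers quoted above. The interpretation of the threshold is an assumption.
n, mean, std = 3, 1.029282, 0.000782
target = 1.0810 - 0.005
t_stat = (target - mean) / (std / sqrt(n))             # df = n - 1 = 2
p_one_sided = (1 - t_stat / sqrt(t_stat**2 + 2)) / 2   # closed-form Student-t tail for df = 2
print(f"t = {t_stat:.1f}, one-sided p = {p_one_sided:.1e}")  # t ≈ 103, p ≈ 5e-05: well below 0.01
```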
Relationship to PR #1835
PR #1835 introduced byte-level mixing using an approximation that
derives a single per-byte value from the realized token's NN
probability. This value is identical for every byte in the realized
token and does not sum to 1 over the byte alphabet: it is a
per-byte-position constant rather than a distribution. The C2 concern
raised in the review of #1835 follows from this directly.
This submission instead derives P_NN(byte_0) as a proper marginal
over the SP8192 alphabet: the softmax mass of every token sharing the
same first byte is summed, using the canonical first-byte LUT
(rebuilt from the SP tokenizer at deserialize, no extra storage).
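A minimal sketch of that marginalization (helper names and the piece-to-bytes mapping are assumptions; the bundled helpers do the real LUT construction):

```python
import numpy as np

def build_first_byte_lut(piece_bytes: list) -> np.ndarray:
    """piece_bytes[i] is the raw byte string of SP token i; how SP pieces map to
    bytes (the '▁' space marker, byte-fallback pieces) is glossed over here."""
    return np.array([b[0] if b else 0 for b in piece_bytes], dtype=np.int64)

def first_byte_marginal(token_probs: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """P_NN(byte_0 = b | history): scatter-add SP8192 token probabilities onto
    their first bytes; the result sums to 1 because token_probs does."""
    p_byte0 = np.zeros(256)
    np.add.at(p_byte0, lut, token_probs)
    return p_byte0
```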
For the remaining bytes within the realized token (b₁..b_{k−1}), the
chain-rule residual is the realized token's NN probability divided by
the first-byte marginal. Both terms are now normalized distributions
over their respective alphabets: the first-byte marginal over the
256-value byte alphabet, and the residual over byte continuations
of length k−1.
Their product is a proper distribution over the realized token's
byte stream, so C2 holds by construction.
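Spelled out, the factorization those two steps describe (reconstructed from the surrounding prose, not copied from the PR; $t^{*}$ is the realized token, $h$ the history, $V$ the SP8192 vocabulary):

$$
P_{\mathrm{NN}}(b_0 \mid h) \;=\; \sum_{t \in V:\ \mathrm{byte}_0(t) = b_0} P_{\mathrm{NN}}(t \mid h),
\qquad
P_{\mathrm{NN}}(b_1 \dots b_{k-1} \mid b_0, h) \;=\; \frac{P_{\mathrm{NN}}(t^{*} \mid h)}{P_{\mathrm{NN}}(b_0 \mid h)} .
$$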
The mix gate is a sigmoid on PPM context confidence, applied per byte.
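A minimal sketch of the gated mix (the confidence signal follows the description in the next section; `alpha` and `tau` are illustrative gate parameters, not the PR's constants):

```python
import numpy as np

def mix_byte_distributions(p_nn: np.ndarray, p_ppm: np.ndarray,
                           alpha: float = 10.0, tau: float = 0.5) -> np.ndarray:
    """Per-byte-position gate: trust PPM when its top-symbol probability is high,
    otherwise fall back to the NN-derived byte distribution. Both inputs are
    256-way distributions for the current byte position."""
    confidence = float(p_ppm.max())                          # PPM top-symbol probability
    g = 1.0 / (1.0 + np.exp(-alpha * (confidence - tau)))    # sigmoid gate (shape is illustrative)
    p_mix = g * p_ppm + (1.0 - g) * p_nn
    return p_mix / p_mix.sum()                               # renormalize for numerical safety
```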
Why this works
The 16 MB cap forces a 35M-parameter model that can't allocate enough
mass to the long tail of byte sequences (URLs, code identifiers,
repeated proper nouns). PPM's strength is exactly that tail: order-5
byte context routinely assigns near-1 probability to the next byte
inside a code block or a recurring named entity, while a
parameter-constrained NN has to spread mass thin. The sigmoid gate
captures this conditionally: trust PPM when its top-symbol
probability is high, fall back to the NN otherwise. The conditioning
step (deriving
P_NN(byte_0) from the SP8192 softmax instead of approximating per-byte) closes the C2 gap that prevented PR #1835's
version from being clean: the gate now weights two distributions that
actually live on the same alphabet.
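For concreteness, a toy order-5 PPM-D byte conditional in the spirit the paragraph describes; it omits exclusions and other refinements of the classical scheme (Cleary & Witten 1984; Howard 1993) and is not the PR's implementation:

```python
from collections import defaultdict

class TinyPPMD:
    """Order-5 byte model with PPM-D style counts: within a context,
    P(sym) = (2c - 1) / (2n) and P(escape) = d / (2n), where c is the symbol's
    count, n the total count, d the number of distinct symbols. Orders are
    blended rather than excluded, a simplification of true PPM-D."""

    def __init__(self, max_order: int = 5):
        self.max_order = max_order
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]

    def update(self, history: bytes, sym: int) -> None:
        for order in range(self.max_order + 1):
            if len(history) >= order:
                ctx = history[len(history) - order:]
                self.counts[order][ctx][sym] += 1

    def prob(self, history: bytes, sym: int) -> float:
        p, escape_mass = 0.0, 1.0
        for order in range(min(self.max_order, len(history)), -1, -1):
            table = self.counts[order].get(history[len(history) - order:])
            if not table:
                continue
            n = sum(table.values())
            c = table.get(sym, 0)
            if c:
                p += escape_mass * (2 * c - 1) / (2 * n)
            escape_mass *= len(table) / (2 * n)
        return p + escape_mass / 256.0   # whatever escapes every order falls back to uniform
```

After a few repetitions of the same identifier, the order-5 context drives the next-byte probability toward 1, which is exactly the behavior the gate is designed to exploit.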
Ablation
3-seed means, applied sequentially:
The mixer contributes −0.150 BPB at eval time. Local Blackwell
ablation predicted −0.187 (~80% transfer to H100 SXM).
Compliance (Issue #1017)
No SLOT, no n-gram cache outside the legal byte-level PPM-D state,
no logit bias, no ETLB, no pre-quant TTT (which would violate C3).
Standard softmax over SP8192 at every scored position.
Lineage
PR #1394 (clarkkev) → PR #1530 (samacqua) → PR #1729 (romeerp,
CaseOps) → PR #1787 (nprime06) → PR #1797 (dexhunter, Smear+LQER,
with SmearGate BOS-mask fix from cocohearts' review) → PR #1493
(sliding-window stride-64) → this submission.
Reproduction
Full reproduction commands, per-seed train and eval logs, and helper
modules (CaseOps tokenizer + transform) are in
records/track_10min_16mb/2026-04-30_OptE_SlidingWindow_CondPPM/.
The bundled train_gpt.py is lzma+base85 wrapped (49,750 bytes); the
underlying source lives at obliterate_0p9/train_gpt.py on the
cond-ppm branch. To inspect the readable source of train_gpt.py
without executing it, unwrap the lzma+base85 payload.
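A minimal sketch of that unwrapping (my own, not the PR's command; it assumes the wrapped file is a pure ASCII base85 payload, so adjust if the bundled file carries a stub or header):

```python
import base64, lzma, sys

# Decode the lzma+base85-wrapped train_gpt.py and print the underlying source
# without executing anything. Payload layout is an assumption, not taken from the PR.
wrapped = open(sys.argv[1], encoding="ascii").read()
payload = "".join(wrapped.split())          # drop whitespace around the base85 text
print(lzma.decompress(base64.b85decode(payload)).decode("utf-8"))
```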
Artifact size
- final_model.int6.ptz (int6 GPTQ + brotli)
- train_gpt.py (lzma+base85 wrapped)

Helper modules (lossless_caps.py, prepare_caseops_data.py) and the
SP8192 tokenizer are included for reproduction but do not count
toward the cap, per Issue #1017 §III.
Acknowledgments
This submission stands on a chain of prior work: @clarkkev (PR #1394)
for the SP8192 base, @samacqua (PR #1530), @romeerp (PR #1729) for
CaseOps, @nprime06 (PR #1787) for parallel residuals + looping,
@dexhunter (PR #1797) for Smear + LQER, @cocohearts for the
SmearGate BOS-mask fix, and the sliding-window stride-64 evaluator
from PR #1493. The PPM-D byte conditional itself is classical
(Cleary & Witten 1984; Moffat 1990; Howard 1993). What's contributed
here is the canonical first-byte marginalization that closes the C2
gap from PR #1835's earlier byte-mix attempt, and the empirical
observation that an order-5 byte conditional pairs well with a
35M-parameter LM at this scale.