Report: PPM-D byte-level scoring is not a valid probability distribution, and why it appears to gain #1905
Open
leon2k2k2k wants to merge 3 commits into openai:main
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This is an investigation of the recent byte-level PPM-D mixture submissions (#1835, #1850, #1854, #1858, #1862, #1833, #1871, #1865, and #1885 (mine) -- the cluster flagged in #1872). The cluster reports
val_bpb ranging from 0.90 to 1.014, all using the same scoring construction: standard token-level NN
log-probability "bit-conservingly spread" across each token's bytes, mixed in probability space with a
classical byte-level PPM-D order-5 model. We focus on
#1850 as a clean reference implementation -- it isolates
the mechanism without additional architectural changes, and the same scoring formula appears verbatim
in the others. This report is not about the
$\Sigma_{\mathrm{tok}}$ vs $\Sigma_{\mathrm{byte}}$ dispute in #1872. The problem is more fundamental: the
uniform-spread NN side is not a valid probability distribution over 256 bytes, and therefore does not
satisfy C2.
All errors in the analysis below are mine and mine alone.
Contents
This report has two parts.
Part 1 -- The PPM-D byte-level mixture is not a valid probability distribution.
Part 2 -- We investigate how this scoring system erroneously produces the reported gain.
Part 1 -- The PPM-D byte-level mixture is not a valid probability distribution
This part is due to @sharpobject in #1872. I am
restating that argument here because it is the premise for Part 2.
Under the byte-level reading of C2, the scorer must define a probability distribution over the 256
possible next bytes at each scored position.
The submission class scores bytes via

$$p_{\mathrm{mix}}(b) \;=\; (1-\lambda)\,p_{\mathrm{uniform}}(b) \;+\; \lambda\,p_{\mathrm{PPM}}(b),$$

where $p_{\mathrm{PPM}}$ is the PPM-D byte distribution, $p_{\mathrm{uniform}}$ is obtained by
uniformly spreading token log-probability across bytes, and $\lambda$ is the (possibly gated) mixture weight.
Since $p_{\mathrm{PPM}}$ already sums to 1, $p_{\mathrm{mix}}$ can sum to 1 only if
$p_{\mathrm{uniform}}$ does. So the question reduces to whether the NN-side byte object is itself a
probability distribution.
For the first byte of a token, the natural extension is

$$p_{\mathrm{uniform}}(b) \;=\; \sum_{t\,:\,t\ \text{starts with}\ b} p(t)^{1/n(t)},$$

where $n(t)$ is the byte length of token $t$. For later bytes one conditions on the already-realized within-token prefix first; the same normalization
problem remains, so I ignore that extra notation here.
But this object does not sum to 1. The reason is simple: for any multi-byte token with $0<p<1$, one has $p^{1/n}>p$. So the uniform-spread construction systematically inflates small token probabilities
before summing them by byte.
A toy example already breaks normalization. Suppose

$$p(t_1)=p(t_2)=\tfrac14, \qquad p(t_3)=\tfrac12,$$

where $t_1,t_2$ are two-byte tokens starting with byte a, and $t_3$ is a one-byte token starting with byte b. Then

$$p_{\mathrm{uniform}}(\texttt{a}) = \left(\tfrac14\right)^{1/2} + \left(\tfrac14\right)^{1/2} = 1, \qquad p_{\mathrm{uniform}}(\texttt{b}) = \tfrac12,$$

so

$$\sum_b p_{\mathrm{uniform}}(b) = \tfrac32 > 1.$$
Therefore $p_{\mathrm{uniform}}$ is not a probability distribution over bytes, and neither is
$p_{\mathrm{mix}}$. This is exactly the C2 failure pointed out in Issue #1872.
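The over-counting is easy to check numerically. A minimal sketch, using a hypothetical three-token vocabulary (the tokens and probabilities here are illustrative, not taken from any submission):

```python
# Hypothetical token distribution (sums to 1): two two-byte tokens
# starting with 'a', one one-byte token starting with 'b'.
vocab = {"aa": 0.25, "ab": 0.25, "b": 0.5}

# Uniform spread: each token contributes p(t)**(1/n(t)) to its first byte.
p_uniform = {}
for tok, p in vocab.items():
    b = tok[0]
    p_uniform[b] = p_uniform.get(b, 0.0) + p ** (1.0 / len(tok))

total = sum(p_uniform.values())
print(p_uniform)  # {'a': 1.0, 'b': 0.5}
print(total)      # 1.5 -- not a distribution over bytes

# Mixing with a genuine byte distribution cannot repair this:
# sum_b p_mix = (1 - lam) * total + lam, which equals 1 only if total == 1.
lam = 0.5
print((1 - lam) * total + lam)  # 1.25
```

Any vocabulary with multi-byte tokens exhibits the same inflation, since $p^{1/n}>p$ for $0<p<1$.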
The natural byte-level object induced by the same token softmax is instead the conditional distribution

$$p_{\mathrm{cond}}(b \mid \pi) \;=\; \frac{\sum_{t\,:\,t\ \text{starts with}\ \pi b} p(t)}{\sum_{t\,:\,t\ \text{starts with}\ \pi} p(t)},$$

where $\pi$ is the within-token byte prefix already realized. Unlike the uniform-spread construction,
this is a genuine distribution over next bytes. Part 2 uses it as the correct reference point.
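As a sketch, this conditional can be computed directly from token probabilities. The vocabulary below is hypothetical, and the code assumes no token is a proper prefix of another (otherwise an end-of-token event would be needed):

```python
from collections import defaultdict

def cond_next_byte(vocab, prefix):
    """Conditional next-byte distribution induced by the token probabilities.

    vocab: dict mapping token string -> probability.
    prefix: the within-token byte prefix already realized.
    """
    # Mass of all tokens consistent with the realized prefix.
    denom = sum(p for t, p in vocab.items() if t.startswith(prefix))
    dist = defaultdict(float)
    for t, p in vocab.items():
        # Assumes no token equals the prefix exactly (no end-of-token event).
        if t.startswith(prefix) and len(t) > len(prefix):
            dist[t[len(prefix)]] += p / denom
    return dict(dist)

vocab = {" today": 0.6, " tomorrow": 0.3, "x": 0.1}
print(cond_next_byte(vocab, ""))     # ' ' -> 0.9, 'x' -> 0.1 (up to rounding)
print(cond_next_byte(vocab, " to"))  # 'd' -> 2/3, 'm' -> 1/3

# Unlike the uniform-spread object, each conditional sums to 1.
for prefix in ("", " ", " to"):
    assert abs(sum(cond_next_byte(vocab, prefix).values()) - 1.0) < 1e-9
```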
Part 2 -- Why the apparent gain comes from the scoring system, not from PPM itself
The uniform-spread construction was chosen for a reason: if PPM is turned off, summing byte losses
reproduces the original token-level score exactly. But that same bookkeeping choice is what makes PPM
appear much stronger than it really is.
What the uniform-spread distribution tends to do is move uncertainty from the later parts of a token
toward the front and flatten it out across all of the token's bytes. In other words, it takes token
loss that in a natural conditional view would be concentrated on a few genuinely uncertain later bytes,
and redistributes that loss onto earlier bytes that may already be almost certain. The clearest
examples are tokens whose first byte is a space: the model may be very sure that the next byte is
a space, while still being unsure which full token follows after it. Uniform spread erases that distinction and
charges the early space byte as if it carried an equal share of the token's uncertainty.
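The redistribution can be made concrete on a toy vocabulary (hypothetical numbers; no token is a prefix of another, so the byte chain rule recovers the token loss exactly):

```python
import math

vocab = {" today": 0.6, " tomorrow": 0.3, "x": 0.1}
token = " today"

def loss(p):
    # -log p in nats; clamp p == 1 to exactly zero loss.
    return 0.0 if p >= 1.0 else -math.log(p)

# Uniform spread: every byte is charged an equal share of the token loss.
uniform = [loss(vocab[token]) / len(token)] * len(token)

# Conditional view: -log p(next byte | realized prefix), via the chain rule.
def cond_prob(prefix, b):
    num = sum(p for t, p in vocab.items() if t.startswith(prefix + b))
    den = sum(p for t, p in vocab.items() if t.startswith(prefix))
    return num / den

conditional = [loss(cond_prob(token[:i], token[i])) for i in range(len(token))]

print([round(x, 3) for x in uniform])      # [0.085, 0.085, 0.085, 0.085, 0.085, 0.085]
print([round(x, 3) for x in conditional])  # [0.105, 0.0, 0.0, 0.405, 0.0, 0.0]

# Same total loss, very different allocation across bytes.
assert abs(sum(uniform) - sum(conditional)) < 1e-9
```

In the conditional view, almost all of the loss sits on the genuinely uncertain bytes (here the space and the `d` that separates " today" from " tomorrow"); uniform spread smears that loss evenly, including onto bytes that are already certain.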
All numbers below use a post-quantized model, together with the same PPM configuration used in #1850.
The key comparison is this: under the submitted uniform-spread scoring rule, PPM appears to gain about 0.051 val_bpb. But under
the conditional distribution, the very same PPM scorer is not better than the baseline at all: it is
worse by about 0.038 val_bpb.
That is the central empirical fact of this report. The apparent gain is not coming from PPM
outperforming the model on a valid next-byte scoring problem. It is coming from replacing the
conditional distribution with the uniform-spread one before mixing.
A concrete token-level example makes the mechanism clear.

Consider the real token `" today"` in the context `...half the total starters) and today it is 17 of the 24...`. For this token, the two scoring systems assign the same total baseline loss,
but distribute it very differently across bytes.

This is the key point. For the very same realized token, the submitted scoring rule makes PPM look
helpful by +3.99 nats, while the conditional distribution shows that the same PPM configuration is actually slightly harmful (-0.17 nats).

The reason is visible byte-by-byte. Uniform spread assigns 1.47564 nats to every byte, including the
easy bytes such as `a` and `y`, where the conditional loss is already essentially zero and where `gate_hi=1` gives PPM high weight. PPM then gets credit for "improving" those bytes only because the scoring rule first assigned them artificial cost.
So the sign of the apparent token-level gain flips depending on how the same token loss is allocated
across bytes. That is exactly the claim of Part 2: the reported gain is not a stable property of PPM
itself, but of the scoring rule used to mix it with the model.
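The sign flip can be reproduced end-to-end on a toy vocabulary. Everything below is hypothetical (the vocabulary, the PPM byte probabilities, and the hard 0/1 gate standing in for `gate_hi`); the point is only that identical PPM numbers gain under the uniform-spread allocation and lose under the conditional one:

```python
import math

vocab = {" today": 0.6, " tomorrow": 0.3, "x": 0.1}
token = " today"
n = len(token)

def cond_prob(prefix, b):
    num = sum(p for t, p in vocab.items() if t.startswith(prefix + b))
    den = sum(p for t, p in vocab.items() if t.startswith(prefix))
    return num / den

# Same token loss, two different per-byte NN probability allocations.
p_unif = [vocab[token] ** (1.0 / n)] * n                      # uniform spread
p_cond = [cond_prob(token[:i], token[i]) for i in range(n)]   # chain rule

# Hypothetical PPM byte probabilities and a hard gate that trusts PPM
# on the easy letter bytes (a stand-in for gate_hi).
p_ppm = [0.5, 0.99, 0.99, 0.5, 0.99, 0.99]
gate  = [0,   1,    1,    0,   1,    1]

def nats(p_nn):
    mixed = [g * pp + (1 - g) * pn for pn, pp, g in zip(p_nn, p_ppm, gate)]
    return sum(-math.log(p) for p in mixed)

base = -math.log(vocab[token])  # identical baseline under both allocations
print(round(base - nats(p_unif), 3))  # 0.3   -> PPM appears to help
print(round(base - nats(p_cond), 3))  # -0.04 -> the same PPM hurts
```

Under uniform spread the easy bytes carry artificial cost (here 0.085 nats each), so a PPM that assigns them 0.99 looks like an improvement; under the conditional allocation those bytes already cost essentially zero, and the same 0.99 can only make them worse.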
This shows that a large part of the reported gain is created by the scoring construction itself. Under
the conditional distribution, the same PPM configuration does not improve the score; it makes it worse.
So the headline improvement is not evidence that PPM is winning on a valid next-byte prediction problem.
Reproducibility
- `testing/inspect_with_ppm.py` runs the uniform-spread versus conditional-distribution comparison with the same PPM configuration.
- `testing/ppm_scorer.c` is the byte-level PPM scorer used in both comparisons.
- `testing/show_5_artifact_examples.py` regenerates the worked artifact examples from a saved per-byte dump.