Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
181 changes: 181 additions & 0 deletions eval/2026-04-29_ppm_invalidity_report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,181 @@
---
title: "PPM-D Byte-Level Scoring Report Draft"
---

# Report: PPM-D byte-level scoring is not a valid probability distribution, and why it appears to gain

This is an investigation of the recent byte-level PPM-D mixture submissions
(https://github.com/openai/parameter-golf/pull/1835,
https://github.com/openai/parameter-golf/pull/1850, https://github.com/openai/parameter-golf/pull/1854,
https://github.com/openai/parameter-golf/pull/1858,
https://github.com/openai/parameter-golf/pull/1862, https://github.com/openai/parameter-golf/pull/1833,
https://github.com/openai/parameter-golf/pull/1871, https://github.com/openai/parameter-golf/pull/1865,
https://github.com/openai/parameter-golf/pull/1885 (mine)
-- the cluster flagged in https://github.com/openai/parameter-golf/issues/1872). The cluster reports
val_bpb ranging from 0.90 to 1.014, all using the same scoring construction: standard token-level NN
log-probability "bit-conservingly spread" across each token's bytes, mixed in probability space with a
classical byte-level PPM-D order-5 model. We focus on
https://github.com/openai/parameter-golf/pull/1850 as a clean reference implementation -- it isolates
the mechanism without additional architectural changes, and the same scoring formula appears verbatim
in the others. This report is not about the `Σ_tok` vs `Σ_byte` dispute in
https://github.com/openai/parameter-golf/issues/1872. The problem is more fundamental: the
uniform-spread NN side is not a valid probability distribution over 256 bytes, and therefore does not
satisfy C2.

All errors in the analysis below are mine and mine alone.

## Contents

This report has two parts.

Part 1 -- The PPM-D byte-level mixture is not a valid probability distribution.

Part 2 -- We investigate how this scoring system erroneously produces the reported gain.

## Part 1 -- The PPM-D byte-level mixture is not a valid probability distribution

This part is due to @sharpobject in https://github.com/openai/parameter-golf/issues/1872. I am
restating that argument here because it is the premise for Part 2.

Under the byte-level reading of C2, the scorer must define a probability distribution over the 256
possible next bytes at each scored position.

The submission class scores bytes via

$$
p_{\mathrm{mix}}(b)=\lambda p_{\mathrm{uniform}}(b)+(1-\lambda)p_{\mathrm{PPM}}(b),
$$

where $p_{\mathrm{PPM}}$ is the PPM-D byte distribution and $p_{\mathrm{uniform}}$ is obtained by
uniformly spreading token log-probability across bytes.

Since $p_{\mathrm{PPM}}$ already sums to 1, $p_{\mathrm{mix}}$ can sum to 1 only if
$p_{\mathrm{uniform}}$ does. So the question reduces to whether the NN-side byte object is itself a
probability distribution.

For the first byte of a token, the natural extension is

$$
p_{\mathrm{uniform}}(b)=\sum_{t=b\ldots} p_{\mathrm{NN}}(t)^{1/n(t)}.
$$

For later bytes one conditions on the already-realized within-token prefix first; the same normalization
problem remains, so I ignore that extra notation here.

But this object does not sum to 1. The reason is simple: for any multi-byte token with $0<p<1$, one
has $p^{1/n}>p$. So the uniform-spread construction systematically inflates small token probabilities
before summing them by byte.

A toy example already breaks normalization. Suppose

$$
p_{\mathrm{NN}}(t_1)=0.25,\qquad p_{\mathrm{NN}}(t_2)=0.25,\qquad p_{\mathrm{NN}}(t_3)=0.5,
$$

where $t_1,t_2$ are two-byte tokens starting with `a`, and $t_3$ is a one-byte token starting with
`b`. Then

$$
p_{\mathrm{uniform}}(\texttt{a})=0.25^{1/2}+0.25^{1/2}=1,\qquad
p_{\mathrm{uniform}}(\texttt{b})=0.5,
$$

so

$$
p_{\mathrm{uniform}}(\texttt{a})+p_{\mathrm{uniform}}(\texttt{b})=1.5>1.
$$

Therefore $p_{\mathrm{uniform}}$ is not a probability distribution over bytes, and neither is
$p_{\mathrm{mix}}$. This is exactly the C2 failure pointed out in Issue #1872.

The natural byte-level object induced by the same token softmax is instead the conditional distribution

$$
p_{\mathrm{cond}}(b\mid \pi)=
\frac{\sum_{t:\pi b \preceq \mathrm{bytes}(t)} p_{\mathrm{NN}}(t)}
{\sum_{t:\pi \preceq \mathrm{bytes}(t)} p_{\mathrm{NN}}(t)},
$$

where $\pi$ is the within-token byte prefix already realized. Unlike the uniform-spread construction,
this is a genuine distribution over next bytes. Part 2 uses it as the correct reference point.

## Part 2 -- Why the apparent gain comes from the scoring system, not from PPM itself

The uniform-spread construction was chosen for a reason: if PPM is turned off, summing byte losses
reproduces the original token-level score exactly. But that same bookkeeping choice is what makes PPM
appear much stronger than it really is.

What the uniform-spread distribution tends to do is move uncertainty from the later parts of a token
toward the front and flatten it out across all of the token's bytes. In other words, it takes token
loss that in a natural conditional view would be concentrated on a few genuinely uncertain later bytes,
and redistributes that loss onto earlier bytes that may already be almost certain. The clearest
examples are tokens whose first byte is a space: the model may be very sure that the next byte is ` `,
while still being unsure which full token follows after that. Uniform spread erases that distinction and
charges the early space byte as if it carried an equal share of the token's uncertainty.

All numbers below use a post-quantized model, together with the same PPM configuration used in #1850.

The key comparison is this:

| Method | val_bpb |
|---|---:|
| Token baseline | 1.08335 |
| Uniform-spread + PPM | 1.03242 |
| Conditional distribution + PPM | 1.12144 |

So under the submitted uniform-spread scoring rule, PPM appears to gain about 0.051 val_bpb. But under
the conditional distribution, the very same PPM scorer is not better than the baseline at all: it is
worse by about 0.038 val_bpb.

That is the central empirical fact of this report. The apparent gain is not coming from PPM
outperforming the model on a valid next-byte scoring problem. It is coming from replacing the
conditional distribution with the uniform-spread one before mixing.

A concrete token-level example makes the mechanism clear.

Consider the real token `" today"` in the context
`...half the total starters) and today it is 17 of the 24...`.

For this token, the two scoring systems assign the same total baseline loss,
but distribute it very differently across bytes:

| byte | gate_hi | uniform | conditional | PPM | mix(uniform) | mix(conditional) | gain vs uniform | gain vs conditional |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ` ` | 1 | 1.47564 | 0.00841 | 0.01511 | 0.05426 | 0.01478 | +1.42138 | -0.00637 |
| `t` | 0 | 1.47564 | 1.51869 | 1.94185 | 1.51362 | 1.55381 | -0.03797 | -0.03511 |
| `o` | 0 | 1.47564 | 2.38802 | 1.93680 | 1.51329 | 2.33256 | -0.03764 | +0.05546 |
| `d` | 0 | 1.47564 | 4.93874 | 5.98645 | 1.57978 | 5.00587 | -0.10414 | -0.06713 |
| `a` | 1 | 1.47564 | 0.00000 | 0.06454 | 0.10308 | 0.06121 | +1.37257 | -0.06121 |
| `y` | 1 | 1.47564 | 0.00000 | 0.05716 | 0.09579 | 0.05422 | +1.37985 | -0.05422 |

Token totals:

| token | uniform baseline | conditional baseline | uniform+PPM mix | conditional+PPM mix | PPM gain vs uniform | PPM gain vs conditional |
|---|---:|---:|---:|---:|---:|---:|
| `" today"` | 8.85387 | 8.85387 | 4.85983 | 9.02245 | +3.99404 | -0.16858 |

This is the key point. For the very same realized token, the submitted scoring rule makes PPM look
helpful by `+3.99` nats, while the conditional distribution shows that the same PPM configuration is
actually slightly harmful (`-0.17` nats).

The reason is visible byte-by-byte. Uniform spread assigns `1.47564` nats to every byte, including the
easy bytes ` `, `a`, and `y`, where the conditional distribution is already essentially zero and where
`gate_hi=1` gives PPM high weight. PPM then gets credit for “improving” those bytes only because the
scoring rule first assigned them artificial cost.

So the sign of the apparent token-level gain flips depending on how the same token loss is allocated
across bytes. That is exactly the claim of Part 2: the reported gain is not a stable property of PPM
itself, but of the scoring rule used to mix it with the model.

This shows that a large part of the reported gain is created by the scoring construction itself. Under
the conditional distribution, the same PPM configuration does not improve the score; it makes it worse.
So the headline improvement is not evidence that PPM is winning on a valid next-byte prediction problem.

## Reproducibility

- `testing/inspect_with_ppm.py` runs the uniform-spread versus conditional-distribution comparison with
the same PPM configuration.
- `testing/ppm_scorer.c` is the byte-level PPM scorer used in both comparisons.
- `testing/show_5_artifact_examples.py` regenerates the worked artifact examples from a saved per-byte
dump.
68 changes: 68 additions & 0 deletions experiments/run_all.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
#!/bin/bash
# Master runner: runs all experiments back-to-back.
# Deploy pod → SSH in → bash experiments/run_all.sh → come back to results.
# Auto-terminates pod when done.

set -uo pipefail

echo "========================================"
echo "=== Parameter Golf Experiment Sweep ==="
echo "=== $(date) ==="
echo "========================================"

# Setup
echo ""
echo "=== Phase 1: Setup ==="
bash experiments/setup_pod.sh

# Run planned experiments
EXPERIMENTS=(5 6 7 8 9 10 11 12 13 14)
RESULTS_FILE=/workspace/logs/sweep_results.txt
echo "Experiment | val_bpb (final) | Steps | Status" > "$RESULTS_FILE"

for n in "${EXPERIMENTS[@]}"; do
script=$(ls experiments/run_exp${n}_*.sh 2>/dev/null | head -1)
if [ -z "$script" ]; then
echo "Skipping exp $n: no script found"
continue
fi

echo ""
echo "========================================"
echo "=== Running Exp $n: $script ==="
echo "=== $(date) ==="
echo "========================================"

if bash "$script"; then
# Extract final val_bpb from log
logfile=$(ls /workspace/logs/exp${n}_*.log 2>/dev/null | head -1)
if [ -n "$logfile" ]; then
final_bpb=$(grep "final_int8_zlib_roundtrip_exact" "$logfile" | grep -oP 'val_bpb:\K[0-9.]+' | tail -1)
steps=$(grep "^step:" "$logfile" | tail -1 | grep -oP 'step:\K[0-9]+')
echo "Exp $n | ${final_bpb:-MISSING} | ${steps:-?} | OK" >> "$RESULTS_FILE"
echo ">>> Exp $n result: val_bpb=${final_bpb:-MISSING} steps=${steps:-?}"
fi
else
echo "Exp $n | FAILED | - | ERROR" >> "$RESULTS_FILE"
echo ">>> Exp $n FAILED, continuing..."
fi
done

echo ""
echo "========================================"
echo "=== All planned experiments done ==="
echo "=== $(date) ==="
echo "========================================"
echo ""
echo "=== Summary ==="
cat "$RESULTS_FILE"

# Auto-terminate pod
echo ""
echo "=== Terminating pod ==="
if command -v runpodctl &>/dev/null && [ -n "${RUNPOD_POD_ID:-}" ]; then
echo "Stopping pod $RUNPOD_POD_ID..."
runpodctl stop pod "$RUNPOD_POD_ID"
else
echo "WARNING: Cannot auto-terminate. Please terminate pod manually!"
fi
19 changes: 19 additions & 0 deletions experiments/run_exp10_3layer_recur.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
#!/bin/bash
# Exp 10 — SP1024 + 3-Layer Recurrence [2,3,4]
# Branch: exp/depth-recurrence
# Block schedule: [0,1,2,3,4,2,3,4,5,6,7,8] — 12 virtual passes
# Question: does recurring 3 layers beat 2?

set -euo pipefail
cd /workspace/parameter-golf
git checkout exp/depth-recurrence

VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
RECUR_LAYERS=2,3,4 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp10_3layer_recur.log

echo "Exp 10 done."
22 changes: 22 additions & 0 deletions experiments/run_exp11_11L_narrow.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#!/bin/bash
# Exp 11 — 11 Layers + narrow (model_dim=448) + SP1024 + Recurrence
# Branch: exp/depth-recurrence
# Params: ~15.5M (fits in 16 MB with INT8)
# head_dim = 448/8 = 56 (even, OK for RoPE)
# Question: is more layers + narrower better than fewer + wider?

set -euo pipefail
cd /workspace/parameter-golf
git checkout exp/depth-recurrence

NUM_LAYERS=11 \
MODEL_DIM=448 \
VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
RECUR_LAYERS=3,4 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp11_11L_narrow.log

echo "Exp 11 done."
22 changes: 22 additions & 0 deletions experiments/run_exp12_wide_mlp.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#!/bin/bash
# Exp 12 — Wide MLP (mlp_mult=3) + narrow (model_dim=384) + SP1024 + Recurrence
# Branch: exp/depth-recurrence
# Params: ~13.5M (fits in 16 MB with INT8)
# head_dim = 384/8 = 48 (even, OK for RoPE)
# Question: is wider MLP worth the reduced model_dim?

set -euo pipefail
cd /workspace/parameter-golf
git checkout exp/depth-recurrence

MLP_MULT=3 \
MODEL_DIM=384 \
VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
RECUR_LAYERS=3,4 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp12_wide_mlp.log

echo "Exp 12 done."
18 changes: 18 additions & 0 deletions experiments/run_exp13_seq2048.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/bin/bash
# Exp 13 — SP1024 + Depth Recurrence + train_seq_len=2048
# Branch: exp/depth-recurrence
# Question: longer context helps? Top submissions use 2048.

set -euo pipefail
cd /workspace/parameter-golf
git checkout exp/depth-recurrence

VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
TRAIN_SEQ_LEN=2048 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp13_seq2048.log

echo "Exp 13 done."
18 changes: 18 additions & 0 deletions experiments/run_exp14_warmdown2400.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/bin/bash
# Exp 14 — SP1024 + Depth Recurrence + warmdown_iters=2400
# Branch: exp/depth-recurrence
# Question: longer warmdown helps? (default 1200, top submissions use 3500)

set -euo pipefail
cd /workspace/parameter-golf
git checkout exp/depth-recurrence

VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
WARMDOWN_ITERS=2400 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp14_warmdown2400.log

echo "Exp 14 done."
18 changes: 18 additions & 0 deletions experiments/run_exp5_sp4096_baseline.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/bin/bash
# Exp 5 — SP1024 Baseline
# Branch: main (no code changes, env vars only)
# Hypothesis: clean baseline with frequent val readings for comparison, same model architecture as baseline
# Compare to: Exp 3 (baseline SP1024, 2×H100, val_bpb=1.2732)

set -euo pipefail
cd /workspace/parameter-golf
git checkout main

VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp5_sp4096_baseline.log

echo "Exp 5 done."
18 changes: 18 additions & 0 deletions experiments/run_exp6_sp4096_recurrence.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#!/bin/bash
# Exp 6 — SP1024 + Depth Recurrence
# Branch: exp/depth-recurrence (commit ea1898a)
# Hypothesis: combining SP1024 + recurrence stacks both improvements
# Compare to: Exp 5 (SP1024 baseline) and Exp 4 (recurrence SP1024)

set -euo pipefail
cd /workspace/parameter-golf
git checkout exp/depth-recurrence

VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=2 train_gpt.py \
2>&1 | tee /workspace/logs/exp6_sp4096_recurrence.log

echo "Exp 6 done."
Loading