diff --git a/eval/2026-04-29_ppm_invalidity_report.md b/eval/2026-04-29_ppm_invalidity_report.md
new file mode 100644
index 0000000000..086b33def5
--- /dev/null
+++ b/eval/2026-04-29_ppm_invalidity_report.md
@@ -0,0 +1,181 @@
+---
+title: "PPM-D Byte-Level Scoring Report Draft"
+---
+
+# Report: PPM-D byte-level scoring is not a valid probability distribution, and why it appears to gain
+
+This is an investigation of the recent byte-level PPM-D mixture submissions
+(https://github.com/openai/parameter-golf/pull/1835,
+https://github.com/openai/parameter-golf/pull/1850, https://github.com/openai/parameter-golf/pull/1854,
+https://github.com/openai/parameter-golf/pull/1858,
+https://github.com/openai/parameter-golf/pull/1862, https://github.com/openai/parameter-golf/pull/1833,
+https://github.com/openai/parameter-golf/pull/1871, https://github.com/openai/parameter-golf/pull/1865,
+https://github.com/openai/parameter-golf/pull/1885 (mine)
+-- the cluster flagged in https://github.com/openai/parameter-golf/issues/1872). The cluster reports
+val_bpb ranging from 0.90 to 1.014, all using the same scoring construction: standard token-level NN
+log-probability "bit-conservingly spread" across each token's bytes, mixed in probability space with a
+classical byte-level PPM-D order-5 model. We focus on
+https://github.com/openai/parameter-golf/pull/1850 as a clean reference implementation -- it isolates
+the mechanism without additional architectural changes, and the same scoring formula appears verbatim
+in the others. This report is not about the `Σ_tok` vs `Σ_byte` dispute in
+https://github.com/openai/parameter-golf/issues/1872. The problem is more fundamental: the
+uniform-spread NN side is not a valid probability distribution over 256 bytes, and therefore does not
+satisfy C2.
+
+All errors in the analysis below are mine and mine alone.
+
+## Contents
+
+This report has two parts.
+
+Part 1 -- The PPM-D byte-level mixture is not a valid probability distribution.
+
+Part 2 -- We investigate how this scoring system erroneously produces the reported gain.
+
+## Part 1 -- The PPM-D byte-level mixture is not a valid probability distribution
+
+This part is due to @sharpobject in https://github.com/openai/parameter-golf/issues/1872. I am
+restating that argument here because it is the premise for Part 2.
+
+Under the byte-level reading of C2, the scorer must define a probability distribution over the 256
+possible next bytes at each scored position.
+
+The submission class scores bytes via
+
+$$
+p_{\mathrm{mix}}(b)=\lambda p_{\mathrm{uniform}}(b)+(1-\lambda)p_{\mathrm{PPM}}(b),
+$$
+
+where $p_{\mathrm{PPM}}$ is the PPM-D byte distribution and $p_{\mathrm{uniform}}$ is obtained by
+uniformly spreading token log-probability across bytes.
+
+Since $p_{\mathrm{PPM}}$ already sums to 1, $p_{\mathrm{mix}}$ (with $\lambda>0$) can sum to 1 only if
+$p_{\mathrm{uniform}}$ does. So the question reduces to whether the NN-side byte object is itself a
+probability distribution.
+
+For the first byte of a token, with $n(t)$ the byte length of token $t$, the natural extension is
+
+$$
+p_{\mathrm{uniform}}(b)=\sum_{t:\, b \preceq \mathrm{bytes}(t)} p_{\mathrm{NN}}(t)^{1/n(t)}.
+$$
+
+For later bytes one conditions on the already-realized within-token prefix first; the same normalization
+problem remains, so I ignore that extra notation here.
+
+But this object does not sum to 1. The reason is simple: for any multi-byte token with $0<p<1$, one
+has $p^{1/n}>p$. So the uniform-spread construction inflates the probability of every multi-byte token
+before summing them by byte.
+
+A toy example already breaks normalization. Suppose
+
+$$
+p_{\mathrm{NN}}(t_1)=0.25,\qquad p_{\mathrm{NN}}(t_2)=0.25,\qquad p_{\mathrm{NN}}(t_3)=0.5,
+$$
+
+where $t_1,t_2$ are two-byte tokens starting with `a`, and $t_3$ is a one-byte token starting with
+`b`. Then
+
+$$
+p_{\mathrm{uniform}}(\texttt{a})=0.25^{1/2}+0.25^{1/2}=1,\qquad
+p_{\mathrm{uniform}}(\texttt{b})=0.5,
+$$
+
+so
+
+$$
+p_{\mathrm{uniform}}(\texttt{a})+p_{\mathrm{uniform}}(\texttt{b})=1.5>1.
+$$
+
+Therefore $p_{\mathrm{uniform}}$ is not a probability distribution over bytes, and neither is
+$p_{\mathrm{mix}}$. This is exactly the C2 failure pointed out in Issue #1872.
+
+The natural byte-level object induced by the same token softmax is instead the conditional distribution
+
+$$
+p_{\mathrm{cond}}(b\mid \pi)=
+\frac{\sum_{t:\pi b \preceq \mathrm{bytes}(t)} p_{\mathrm{NN}}(t)}
+     {\sum_{t:\pi \preceq \mathrm{bytes}(t)} p_{\mathrm{NN}}(t)},
+$$
+
+where $\pi$ is the within-token byte prefix already realized. Unlike the uniform-spread construction,
+this is a genuine distribution over next bytes. Part 2 uses it as the correct reference point.
+
+## Part 2 -- Why the apparent gain comes from the scoring system, not from PPM itself
+
+The uniform-spread construction was chosen for a reason: if PPM is turned off, summing byte losses
+reproduces the original token-level score exactly. But that same bookkeeping choice is what makes PPM
+appear much stronger than it really is.
+
+What the uniform-spread construction tends to do is move uncertainty from the later parts of a token
+toward the front and flatten it out across all of the token's bytes. In other words, it takes token
+loss that in a natural conditional view would be concentrated on a few genuinely uncertain later bytes,
+and redistributes that loss onto earlier bytes that may already be almost certain. The clearest
+examples are tokens whose first byte is a space: the model may be very sure that the next byte is ` `,
+while still being unsure which full token follows after that. Uniform spread erases that distinction and
+charges the early space byte as if it carried an equal share of the token's uncertainty.
+
+All numbers below use a post-quantized model, together with the same PPM configuration used in #1850.
+
+The key comparison is this:
+
+| Method | val_bpb |
+|---|---:|
+| Token baseline | 1.08335 |
+| Uniform-spread + PPM | 1.03242 |
+| Conditional distribution + PPM | 1.12144 |
+
+So under the submitted uniform-spread scoring rule, PPM appears to gain about 0.051 val_bpb. But under
+the conditional distribution, the very same PPM scorer is not better than the baseline at all: it is
+worse by about 0.038 val_bpb.
+
+That is the central empirical fact of this report. The apparent gain is not coming from PPM
+outperforming the model on a valid next-byte scoring problem. It is coming from replacing the
+conditional distribution with the uniform-spread one before mixing.
+
+A concrete token-level example makes the mechanism clear.
+
+Consider the real token `" today"` in the context
+`...half the total starters) and today it is 17 of the 24...`.
+
+For this token, the two scoring systems assign the same total baseline loss,
+but distribute it very differently across bytes (all loss and gain columns are in nats):
+
+| byte | gate_hi | uniform | conditional | PPM | mix(uniform) | mix(conditional) | gain vs uniform | gain vs conditional |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| ` ` | 1 | 1.47564 | 0.00841 | 0.01511 | 0.05426 | 0.01478 | +1.42138 | -0.00637 |
+| `t` | 0 | 1.47564 | 1.51869 | 1.94185 | 1.51362 | 1.55381 | -0.03797 | -0.03511 |
+| `o` | 0 | 1.47564 | 2.38802 | 1.93680 | 1.51329 | 2.33256 | -0.03764 | +0.05546 |
+| `d` | 0 | 1.47564 | 4.93874 | 5.98645 | 1.57978 | 5.00587 | -0.10414 | -0.06713 |
+| `a` | 1 | 1.47564 | 0.00000 | 0.06454 | 0.10308 | 0.06121 | +1.37257 | -0.06121 |
+| `y` | 1 | 1.47564 | 0.00000 | 0.05716 | 0.09579 | 0.05422 | +1.37985 | -0.05422 |
+
+Token totals:
+
+| token | uniform baseline | conditional baseline | uniform+PPM mix | conditional+PPM mix | PPM gain vs uniform | PPM gain vs conditional |
+|---|---:|---:|---:|---:|---:|---:|
+| `" today"` | 8.85387 | 8.85387 | 4.85983 | 9.02245 | +3.99404 | -0.16858 |
+
+This is the key point. For the very same realized token, the submitted scoring rule makes PPM look
+helpful by `+3.99` nats, while the conditional distribution shows that the same PPM configuration is
+actually slightly harmful (`-0.17` nats).
+
+The reason is visible byte-by-byte. Uniform spread assigns `1.47564` nats to every byte, including the
+easy bytes ` `, `a`, and `y`, where the conditional distribution is already essentially zero and where
+`gate_hi=1` gives PPM high weight. PPM then gets credit for “improving” those bytes only because the
+scoring rule first assigned them artificial cost.
+
+So the sign of the apparent token-level gain flips depending on how the same token loss is allocated
+across bytes. That is exactly the claim of Part 2: the reported gain is not a stable property of PPM
+itself, but of the scoring rule used to mix it with the model.
+
+This shows that a large part of the reported gain is created by the scoring construction itself. Under
+the conditional distribution, the same PPM configuration does not improve the score; it makes it worse.
+So the headline improvement is not evidence that PPM is winning on a valid next-byte prediction problem.
+
+## Reproducibility
+
+- `testing/inspect_with_ppm.py` runs the uniform-spread versus conditional-distribution comparison with
+  the same PPM configuration.
+- `testing/ppm_scorer.c` is the byte-level PPM scorer used in both comparisons.
+- `testing/show_5_artifact_examples.py` regenerates the worked artifact examples from a saved per-byte
+  dump.
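
The two numeric claims of the report can be checked with a few lines of Python. The sketch below first recomputes the Part 1 toy example (uniform spread versus the conditional distribution), then re-derives the `mix(uniform)` and `mix(conditional)` columns of the `" today"` table from its `uniform`, `conditional`, and `PPM` columns. Two things here are assumptions rather than anything taken from the submissions' code: the byte strings `ax`/`ay` are placeholders for the two-byte tokens, and the NN weight is taken to be 0.05 when `gate_hi=1` and 0.9 otherwise. Those weights are inferred from the table values and happen to match the `--lambda_lo`/`--lambda_hi` defaults in `testing/inspect_with_ppm.py`.

```python
import math

# --- Part 1 toy example ------------------------------------------------------
# (token bytes, NN probability). "ax"/"ay" are placeholder byte strings: only
# the first byte and the token length matter for this check.
toy_vocab = [(b"ax", 0.25), (b"ay", 0.25), (b"b", 0.5)]


def uniform_spread_first_byte(vocab):
    """Sum p(t) ** (1 / len(t)) over tokens, grouped by their first byte."""
    mass = {}
    for tok, p in vocab:
        mass[tok[:1]] = mass.get(tok[:1], 0.0) + p ** (1.0 / len(tok))
    return mass


def conditional_next_byte(vocab, prefix=b""):
    """p_cond(b | prefix): mass of tokens extending prefix+b over mass of tokens extending prefix."""
    denom = sum(p for tok, p in vocab if tok.startswith(prefix))
    mass = {}
    for tok, p in vocab:
        if tok.startswith(prefix) and len(tok) > len(prefix):
            b = tok[len(prefix):len(prefix) + 1]
            mass[b] = mass.get(b, 0.0) + p / denom
    return mass


# --- Part 2 worked example ---------------------------------------------------
# Rows copied from the per-byte table: (byte, gate_hi, uniform, conditional, ppm), in nats.
ROWS = [
    (" ", 1, 1.47564, 0.00841, 0.01511),
    ("t", 0, 1.47564, 1.51869, 1.94185),
    ("o", 0, 1.47564, 2.38802, 1.93680),
    ("d", 0, 1.47564, 4.93874, 5.98645),
    ("a", 1, 1.47564, 0.00000, 0.06454),
    ("y", 1, 1.47564, 0.00000, 0.05716),
]

# ASSUMPTION: NN-side weight per gate value, inferred from the table itself.
NN_WEIGHT = {1: 0.05, 0: 0.9}


def mix_loss(nn_nats, ppm_nats, gate_hi):
    """Loss of the probability-space mixture lam * p_nn + (1 - lam) * p_ppm, in nats."""
    lam = NN_WEIGHT[gate_hi]
    return -math.log(lam * math.exp(-nn_nats) + (1.0 - lam) * math.exp(-ppm_nats))


mix_uniform_total = sum(mix_loss(u, p, g) for _, g, u, _, p in ROWS)
mix_conditional_total = sum(mix_loss(c, p, g) for _, g, _, c, p in ROWS)

if __name__ == "__main__":
    print("uniform-spread mass:", sum(uniform_spread_first_byte(toy_vocab).values()))
    print("conditional mass:   ", sum(conditional_next_byte(toy_vocab).values()))
    print("mix(uniform) total:     %.5f" % mix_uniform_total)
    print("mix(conditional) total: %.5f" % mix_conditional_total)
```

Under the assumed weights, the two totals should land within rounding of the table's 4.85983 and 9.02245 nats, so the sign flip in the token-level gain follows from the per-byte columns alone.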
diff --git a/experiments/run_all.sh b/experiments/run_all.sh
new file mode 100755
index 0000000000..b06354ea13
--- /dev/null
+++ b/experiments/run_all.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+# Master runner: runs all experiments back-to-back.
+# Deploy pod → SSH in → bash experiments/run_all.sh → come back to results.
+# Auto-stops the pod when done (uses `runpodctl stop pod`, which stops but does not delete it).
+
+set -uo pipefail
+
+echo "========================================"
+echo "=== Parameter Golf Experiment Sweep ==="
+echo "=== $(date) ==="
+echo "========================================"
+
+# Setup
+echo ""
+echo "=== Phase 1: Setup ==="
+bash experiments/setup_pod.sh
+
+# Run planned experiments
+EXPERIMENTS=(5 6 7 8 9 10 11 12 13 14)
+RESULTS_FILE=/workspace/logs/sweep_results.txt
+echo "Experiment | val_bpb (final) | Steps | Status" > "$RESULTS_FILE"
+
+for n in "${EXPERIMENTS[@]}"; do
+    script=$(ls experiments/run_exp${n}_*.sh 2>/dev/null | head -1)
+    if [ -z "$script" ]; then
+        echo "Skipping exp $n: no script found"
+        continue
+    fi
+
+    echo ""
+    echo "========================================"
+    echo "=== Running Exp $n: $script ==="
+    echo "=== $(date) ==="
+    echo "========================================"
+
+    if bash "$script"; then
+        # Extract final val_bpb from log
+        logfile=$(ls /workspace/logs/exp${n}_*.log 2>/dev/null | head -1)
+        if [ -n "$logfile" ]; then
+            final_bpb=$(grep "final_int8_zlib_roundtrip_exact" "$logfile" | grep -oP 'val_bpb:\K[0-9.]+' | tail -1)
+            steps=$(grep "^step:" "$logfile" | tail -1 | grep -oP 'step:\K[0-9]+')
+            echo "Exp $n | ${final_bpb:-MISSING} | ${steps:-?} | OK" >> "$RESULTS_FILE"
+            echo ">>> Exp $n result: val_bpb=${final_bpb:-MISSING} steps=${steps:-?}"
+        fi
+    else
+        echo "Exp $n | FAILED | - | ERROR" >> "$RESULTS_FILE"
+        echo ">>> Exp $n FAILED, continuing..."
+    fi
+done
+
+echo ""
+echo "========================================"
+echo "=== All planned experiments done ==="
+echo "=== $(date) ==="
+echo "========================================"
+echo ""
+echo "=== Summary ==="
+cat "$RESULTS_FILE"
+
+# Auto-terminate pod
+echo ""
+echo "=== Terminating pod ==="
+if command -v runpodctl &>/dev/null && [ -n "${RUNPOD_POD_ID:-}" ]; then
+    echo "Stopping pod $RUNPOD_POD_ID..."
+    runpodctl stop pod "$RUNPOD_POD_ID"
+else
+    echo "WARNING: Cannot auto-terminate. Please terminate pod manually!"
+fi
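
For post-hoc analysis outside the pod, the log extraction done with `grep` in `run_all.sh` can be mirrored in Python. This is a minimal sketch that assumes only what the grep patterns imply: the final line contains `final_int8_zlib_roundtrip_exact` and a `val_bpb:<number>` field, and progress lines start with `step:<digits>`. The function name `summarize_log` and the sample log text are hypothetical and exist only for illustration.

```python
import re

VAL_BPB_RE = re.compile(r"val_bpb:([0-9.]+)")
STEP_RE = re.compile(r"step:([0-9]+)")


def summarize_log(text):
    """Return (final_bpb, last_step) as strings, or None where nothing matched."""
    final_bpb = None
    last_step = None
    for line in text.splitlines():
        # Same filter as: grep "final_int8_zlib_roundtrip_exact" | grep -oP 'val_bpb:\K[0-9.]+' | tail -1
        if "final_int8_zlib_roundtrip_exact" in line:
            found = VAL_BPB_RE.findall(line)
            if found:
                final_bpb = found[-1]
        # Same filter as: grep "^step:" | tail -1 | grep -oP 'step:\K[0-9]+'
        if line.startswith("step:"):
            m = STEP_RE.search(line)
            last_step = m.group(1) if m else None
    return final_bpb, last_step


# Hypothetical log excerpt; real logs will carry more fields per line.
SAMPLE = """step:200/1000 train_loss:3.1
step:1000/1000 train_loss:2.9
final_int8_zlib_roundtrip_exact val_loss:2.1 val_bpb:1.2732
"""

if __name__ == "__main__":
    print(summarize_log(SAMPLE))
```

Returning `None` instead of an empty string makes the `MISSING` case explicit for the caller, which is the role `${final_bpb:-MISSING}` plays in the shell version.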
diff --git a/experiments/run_exp10_3layer_recur.sh b/experiments/run_exp10_3layer_recur.sh
new file mode 100755
index 0000000000..704069b416
--- /dev/null
+++ b/experiments/run_exp10_3layer_recur.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Exp 10 — SP1024 + 3-Layer Recurrence [2,3,4]
+# Branch: exp/depth-recurrence
+# Block schedule: [0,1,2,3,4,2,3,4,5,6,7,8] — 12 virtual passes
+# Question: does recurring 3 layers beat 2?
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+RECUR_LAYERS=2,3,4 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp10_3layer_recur.log
+
+echo "Exp 10 done."
diff --git a/experiments/run_exp11_11L_narrow.sh b/experiments/run_exp11_11L_narrow.sh
new file mode 100755
index 0000000000..685cb8fa27
--- /dev/null
+++ b/experiments/run_exp11_11L_narrow.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+# Exp 11 — 11 Layers + narrow (model_dim=448) + SP1024 + Recurrence
+# Branch: exp/depth-recurrence
+# Params: ~15.5M (fits in 16 MB with INT8)
+# head_dim = 448/8 = 56 (even, OK for RoPE)
+# Question: is more layers + narrower better than fewer + wider?
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+NUM_LAYERS=11 \
+MODEL_DIM=448 \
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+RECUR_LAYERS=3,4 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp11_11L_narrow.log
+
+echo "Exp 11 done."
diff --git a/experiments/run_exp12_wide_mlp.sh b/experiments/run_exp12_wide_mlp.sh
new file mode 100755
index 0000000000..33c185b1b0
--- /dev/null
+++ b/experiments/run_exp12_wide_mlp.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+# Exp 12 — Wide MLP (mlp_mult=3) + narrow (model_dim=384) + SP1024 + Recurrence
+# Branch: exp/depth-recurrence
+# Params: ~13.5M (fits in 16 MB with INT8)
+# head_dim = 384/8 = 48 (even, OK for RoPE)
+# Question: is wider MLP worth the reduced model_dim?
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+MLP_MULT=3 \
+MODEL_DIM=384 \
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+RECUR_LAYERS=3,4 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp12_wide_mlp.log
+
+echo "Exp 12 done."
diff --git a/experiments/run_exp13_seq2048.sh b/experiments/run_exp13_seq2048.sh
new file mode 100755
index 0000000000..655fa3f30e
--- /dev/null
+++ b/experiments/run_exp13_seq2048.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Exp 13 — SP1024 + Depth Recurrence + train_seq_len=2048
+# Branch: exp/depth-recurrence
+# Question: does longer context help? Top submissions use 2048.
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+TRAIN_SEQ_LEN=2048 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp13_seq2048.log
+
+echo "Exp 13 done."
diff --git a/experiments/run_exp14_warmdown2400.sh b/experiments/run_exp14_warmdown2400.sh
new file mode 100755
index 0000000000..2321131638
--- /dev/null
+++ b/experiments/run_exp14_warmdown2400.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Exp 14 — SP1024 + Depth Recurrence + warmdown_iters=2400
+# Branch: exp/depth-recurrence
+# Question: does a longer warmdown help? (default 1200, top submissions use 3500)
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+WARMDOWN_ITERS=2400 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp14_warmdown2400.log
+
+echo "Exp 14 done."
diff --git a/experiments/run_exp5_sp4096_baseline.sh b/experiments/run_exp5_sp4096_baseline.sh
new file mode 100755
index 0000000000..d9ec2e5b2e
--- /dev/null
+++ b/experiments/run_exp5_sp4096_baseline.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Exp 5 — SP1024 Baseline (file and log names say sp4096, but the env vars below select SP1024)
+# Branch: main (no code changes, env vars only)
+# Hypothesis: clean baseline with frequent val readings for comparison, same model architecture as baseline
+# Compare to: Exp 3 (baseline SP1024, 2×H100, val_bpb=1.2732)
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout main
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp5_sp4096_baseline.log
+
+echo "Exp 5 done."
diff --git a/experiments/run_exp6_sp4096_recurrence.sh b/experiments/run_exp6_sp4096_recurrence.sh
new file mode 100755
index 0000000000..3b4a29bfa5
--- /dev/null
+++ b/experiments/run_exp6_sp4096_recurrence.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Exp 6 — SP1024 + Depth Recurrence (file and log names say sp4096; env vars below select SP1024)
+# Branch: exp/depth-recurrence (commit ea1898a)
+# Hypothesis: combining SP1024 + recurrence stacks both improvements
+# Compare to: Exp 5 (SP1024 baseline) and Exp 4 (recurrence SP1024)
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp6_sp4096_recurrence.log
+
+echo "Exp 6 done."
diff --git a/experiments/run_exp7_sp4096_recur_qkgain.sh b/experiments/run_exp7_sp4096_recur_qkgain.sh
new file mode 100755
index 0000000000..6651514d2d
--- /dev/null
+++ b/experiments/run_exp7_sp4096_recur_qkgain.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Exp 7 — SP1024 + Depth Recurrence + QK-Gain 5.25 (file and log names say sp4096; env vars select SP1024)
+# Branch: exp/depth-recurrence (commit ea1898a)
+# Hypothesis: higher q_gain sharpens attention → better quality (top submissions used 5.25)
+# Compare to: Exp 6 (same but q_gain=1.5 default)
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+QK_GAIN_INIT=5.25 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp7_sp4096_recur_qkgain.log
+
+echo "Exp 7 done."
diff --git a/experiments/run_exp8_qkgain3.sh b/experiments/run_exp8_qkgain3.sh
new file mode 100755
index 0000000000..bf9440e3f9
--- /dev/null
+++ b/experiments/run_exp8_qkgain3.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Exp 8 — SP1024 + Depth Recurrence + q_gain=3.0
+# Branch: exp/depth-recurrence
+# Question: is q_gain=3.0 better than default 1.5?
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+QK_GAIN_INIT=3.0 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp8_qkgain3.log
+
+echo "Exp 8 done."
diff --git a/experiments/run_exp9_qkgain4.sh b/experiments/run_exp9_qkgain4.sh
new file mode 100755
index 0000000000..f6fbe37ae7
--- /dev/null
+++ b/experiments/run_exp9_qkgain4.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+# Exp 9 — SP1024 + Depth Recurrence + q_gain=4.0
+# Branch: exp/depth-recurrence
+# Question: is q_gain=4.0 better than 3.0 or 5.25?
+
+set -euo pipefail
+cd /workspace/parameter-golf
+git checkout exp/depth-recurrence
+
+VOCAB_SIZE=1024 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+QK_GAIN_INIT=4.0 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=2 train_gpt.py \
+  2>&1 | tee /workspace/logs/exp9_qkgain4.log
+
+echo "Exp 9 done."
diff --git a/experiments/setup_pod.sh b/experiments/setup_pod.sh
new file mode 100755
index 0000000000..6b5a83df7e
--- /dev/null
+++ b/experiments/setup_pod.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# Run this once when deploying a new pod.
+# Installs deps, then downloads the SP1024 and SP4096 data if not already present.
+
+set -euo pipefail
+
+echo "=== Installing dependencies ==="
+pip install -q sentencepiece huggingface_hub numpy torch==2.5.1 brotli
+
+echo "=== Checking SP1024 data ==="
+if [ -d /workspace/parameter-golf/data/datasets/fineweb10B_sp1024 ]; then
+    echo "SP1024 data: OK"
+else
+    echo "SP1024 data: MISSING — downloading..."
+    cd /workspace/parameter-golf
+    python3 data/cached_challenge_fineweb.py --variant sp1024
+fi
+
+echo "=== Checking SP4096 data ==="
+if [ -d /workspace/parameter-golf/data/datasets/fineweb10B_sp4096 ]; then
+    echo "SP4096 data: OK"
+else
+    echo "SP4096 data: MISSING — downloading..."
+    cd /workspace/parameter-golf
+    if ! python3 data/cached_challenge_fineweb.py --variant sp4096; then
+        # Under `set -e` a separate `$?` check is never reached, so test the command directly.
+        echo "Default repo failed, trying kevclark/parameter-golf..."
+        MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096
+    fi
+fi
+
+echo "=== Setup complete ==="
+echo "Datasets available:"
+ls /workspace/parameter-golf/data/datasets/
+echo "Tokenizers available:"
+ls /workspace/parameter-golf/data/tokenizers/
diff --git a/testing/inspect_with_ppm.py b/testing/inspect_with_ppm.py
new file mode 100644
index 0000000000..cbcec71356
--- /dev/null
+++ b/testing/inspect_with_ppm.py
@@ -0,0 +1,808 @@
+"""Forward-pass + PPM-D byte-mixture inspection.
+
+Loads a post-EMA pre-quant final_model.pt (or a post-quant .ptz checkpoint),
+runs forward on N val tokens, applies the PPM-D byte mixture (extracted from
+PR #1857), and outputs a markdown report comparing NN-only vs NN+PPM bpb.
+
+Run on a 1xH100 pod. Usage:
+    python3 testing/inspect_with_ppm.py --ckpt <path> --train_gpt <path>
+"""
+import argparse, ctypes, importlib.util, math, os, struct, subprocess, sys, tempfile, time
+from pathlib import Path
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+import sentencepiece as spm
+
+PPM_C_SRC_FILE = Path(__file__).parent / "ppm_scorer.c"
+
+# ------------------------- helpers -------------------------
+
+def load_train_gpt_module(path):
+    spec = importlib.util.spec_from_file_location("tg", path)
+    m = importlib.util.module_from_spec(spec)
+    sys.modules["tg"] = m
+    spec.loader.exec_module(m)
+    return m
+
+def seed_env_from_train_log(log_path):
+    """Read 'Hyperparameters:' block from a train.log and set values as env vars
+    BEFORE importing train_gpt.py. This makes the constructed model match the
+    saved checkpoint's architecture exactly."""
+    import re
+    txt = Path(log_path).read_text()
+    # Find the hparam block
+    in_block = False
+    n_set = 0
+    for line in txt.split("\n"):
+        if line.strip().startswith("Hyperparameters:"):
+            in_block = True
+            continue
+        if in_block:
+            if not line.startswith("  "):  # block ended
+                break
+            m = re.match(r"  (\w+): (.+)$", line)
+            if not m: continue
+            key, val = m.group(1).upper(), m.group(2).strip()
+            # Convert booleans to 0/1 (env vars are strings)
+            if val == "True": val = "1"
+            elif val == "False": val = "0"
+            elif val == "None": val = ""
+            # Convert "4.0" -> "4" to match int() parsers (logger formats ints as floats)
+            elif re.match(r"^-?\d+\.0$", val): val = val[:-2]
+            os.environ[key] = val
+            n_set += 1
+    print(f"[hparams] seeded {n_set} env vars from {log_path}")
+
+def build_token_bytes_lut(sp, vocab_size):
+    sz = max(int(sp.vocab_size()), vocab_size)
+    bytestrs = [b""] * sz
+    has_space = np.zeros(sz, dtype=np.uint8)
+    is_boundary = np.zeros(sz, dtype=np.uint8)
+    for tid in range(int(sp.vocab_size())):
+        if sp.is_control(tid) or sp.is_unknown(tid) or sp.is_unused(tid):
+            is_boundary[tid] = 1
+            continue
+        piece = sp.id_to_piece(tid)
+        if sp.is_byte(tid):
+            bytestrs[tid] = bytes([int(piece[3:-1], 16)])
+            continue
+        if piece.startswith("▁"):
+            has_space[tid] = 1
+            piece = piece[1:]
+        bytestrs[tid] = piece.encode("utf-8")
+    flat = b"".join(bytestrs)
+    lens = np.array([len(b) for b in bytestrs], dtype=np.int32)
+    offs = np.zeros(sz, dtype=np.int32)
+    offs[1:] = np.cumsum(lens[:-1])
+    return np.frombuffer(flat, dtype=np.uint8).copy(), offs, lens, has_space, is_boundary
+
+def compile_ppm_lib(c_src_path, antihijack=False):
+    so_name = "ppm_scorer_antihijack.so" if antihijack else "ppm_scorer.so"
+    so_path = Path(tempfile.gettempdir()) / so_name
+    cmd = ["gcc", "-O3", "-march=native", "-fopenmp", "-shared", "-fPIC",
+           "-o", str(so_path), str(c_src_path), "-lm"]
+    print(f"[compile] {' '.join(cmd)}")
+    subprocess.run(cmd, check=True)
+    lib = ctypes.CDLL(str(so_path))
+    if antihijack:
+        # Anti-hijack version: extra c_double arg for nn_skip_thr
+        lib.ppm_score_omp.argtypes = [
+            ctypes.POINTER(ctypes.c_int64), ctypes.POINTER(ctypes.c_int64),
+            ctypes.POINTER(ctypes.c_double), ctypes.c_int64,
+            ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_int32),
+            ctypes.POINTER(ctypes.c_int32), ctypes.POINTER(ctypes.c_uint8),
+            ctypes.POINTER(ctypes.c_uint8), ctypes.c_int, ctypes.c_int,
+            ctypes.c_double, ctypes.c_double, ctypes.c_double,
+            ctypes.c_double,  # nn_skip_thr (anti-hijack)
+            ctypes.c_uint32, ctypes.c_int64, ctypes.c_int,
+            ctypes.POINTER(ctypes.c_double),
+        ]
+    else:
+        lib.ppm_score_omp.argtypes = [
+            ctypes.POINTER(ctypes.c_int64), ctypes.POINTER(ctypes.c_int64),
+            ctypes.POINTER(ctypes.c_double), ctypes.c_int64,
+            ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_int32),
+            ctypes.POINTER(ctypes.c_int32), ctypes.POINTER(ctypes.c_uint8),
+            ctypes.POINTER(ctypes.c_uint8), ctypes.c_int, ctypes.c_int,
+            ctypes.c_double, ctypes.c_double, ctypes.c_double, ctypes.c_uint32,
+            ctypes.c_int64, ctypes.c_int,
+            ctypes.POINTER(ctypes.c_double),
+        ]
+    lib.ppm_score_omp.restype = ctypes.c_int
+    return lib
+
+def categorize(piece):
+    p = piece[1:] if piece.startswith("▁") else piece
+    if not p:
+        return "empty"
+    if any(s in p for s in ["http", "://", "www.", ".com", ".org", ".net", ".io", "@"]):
+        return "URL"
+    s = p.replace(".", "").replace(",", "").replace("-", "")
+    if s and s.isdigit():
+        return "NUMERIC"
+    if any(c in p for c in ["{", "}", "[", "]", "()", ";", "==", "!=", "->", "::"]):
+        return "CODE"
+    if all(c in "0123456789abcdef" for c in p) and 4 <= len(p) <= 64:
+        return "HEX"
+    if "/" in p and all(c in "/_-" or c.isalnum() for c in p):
+        return "PATH"
+    return "PROSE"
+
+# ------------------------- main -------------------------
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--ckpt", required=True)
+    ap.add_argument("--train_gpt", required=True)
+    ap.add_argument("--val_tok", default="/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_000000.bin")
+    ap.add_argument("--val_bytes", default="/workspace/parameter-golf/data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_000000.bin")
+    ap.add_argument("--tokenizer", default="")
+    ap.add_argument("--tokens", type=int, default=8192)
+    ap.add_argument("--out", default="/tmp/ppm_inspect.md")
+    ap.add_argument("--ppm_order", type=int, default=4)
+    ap.add_argument("--lambda_hi", type=float, default=0.9)
+    ap.add_argument("--lambda_lo", type=float, default=0.05)
+    ap.add_argument("--ppm_threshold", type=float, default=0.9)
+    ap.add_argument("--ppm_nn_skip_thr_nats", type=float, default=0.0, help="anti-hijack: suppress gate when NN per-byte logp > -this. 0 disables.")
+    ap.add_argument("--ppm_c_src", default="", help="override path to PPM C source (default = ppm_scorer.c next to this file)")
+    ap.add_argument("--ppm_omp_threads", type=int, default=8)
+    ap.add_argument("--ppm_chunk_tokens", type=int, default=4194304)
+    ap.add_argument("--ppm_log_cache", type=int, default=1048576)
+    ap.add_argument("--train_log", default="", help="train.log to read hparams from (for env-var seeding)")
+    args = ap.parse_args()
+
+    # Seed env vars from train.log BEFORE importing train_gpt
+    if args.train_log:
+        seed_env_from_train_log(args.train_log)
+    else:
+        # Default: try to find a train.log next to the checkpoint
+        cand = Path(args.ckpt).parent / "train.log"
+        if cand.exists():
+            seed_env_from_train_log(cand)
+
+    device = torch.device("cuda")
+
+    # ---- tokenizer ----
+    tk_path = args.tokenizer
+    if not tk_path:
+        for p in [
+            "/workspace/parameter-golf/data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model",
+            str(Path(args.train_gpt).parent / "tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model"),
+        ]:
+            if Path(p).exists():
+                tk_path = p
+                break
+    sp = spm.SentencePieceProcessor()
+    sp.load(tk_path)
+    vocab = sp.vocab_size()
+    print(f"[tk] vocab={vocab}")
+
+    # ---- val tokens ----
+    # Format: 1024-byte header (256 int32s) + uint16 tokens
+    raw = Path(args.val_tok).read_bytes()
+    HEADER = 1024
+    payload = raw[HEADER:]
+    n_total = len(payload) // 2
+    all_tokens = list(struct.unpack(f"<{n_total}H", payload))
+    n = min(args.tokens, n_total)
+    tokens = all_tokens[:n]
+    print(f"[val] {n}/{n_total} tokens loaded (max id={max(tokens)})")
+    assert max(tokens) < 8192, f"token id out of range: {max(tokens)} (expected < 8192)"
+
+    # ---- val_bytes sidecar (authoritative bytes-per-token) ----
+    val_bytes_per_tok_full = None
+    if Path(args.val_bytes).exists():
+        bts_raw = Path(args.val_bytes).read_bytes()
+        # Sidecar is uint16 per token (NOT int32). Total ~151M bytes for full val.
+        bts_payload = bts_raw[HEADER:]
+        n_bts = len(bts_payload) // 2
+        val_bytes_per_tok_full = np.frombuffer(bts_payload, dtype=np.uint16)[:n].astype(np.int64)
+        print(f"[val] sidecar bytes loaded: {n_bts} entries, sum first {n} = {int(val_bytes_per_tok_full.sum()):,} bytes")
+    else:
+        print(f"[val] WARNING: sidecar {args.val_bytes} not found; will fall back to piece-encoding")
+
+    # ---- model ----
+    print(f"[model] importing {args.train_gpt}")
+    mod = load_train_gpt_module(Path(args.train_gpt))
+    H = mod.Hyperparameters()
+
+    use_quantized = args.ckpt.endswith(".ptz")  # also matches ".int6.ptz"
+    if use_quantized:
+        # POST-QUANT path: use train_gpt.deserialize() which reads .int6.ptz
+        # (and the SpinQuant template from .pt) to reconstruct the quantized
+        # eval model. This matches what eval_val(diagnostic quantized) sees.
+        print(f"[model] POST-QUANT path: deserialize() from {args.ckpt}")
+        H.quantized_model_path = args.ckpt
+        # deserialize also reads model_path (final_model.pt) for SpinQuant template
+        pt_path = args.ckpt.replace(".int6.ptz", ".pt").replace(".ptz", ".pt")
+        H.model_path = pt_path
+        print(f"[model] template .pt path: {pt_path}")
+        # cwd matters for deserialize since it uses relative paths;
+        # chdir into the checkpoint directory and always restore afterwards.
+        orig_cwd = os.getcwd()
+        os.chdir(str(Path(args.ckpt).parent))
+        try:
+            model = mod.deserialize(H, device)
+        finally:
+            os.chdir(orig_cwd)
+        print("[model] deserialize OK")
+    else:
+        print(f"[model] PRE-QUANT path: constructing {mod.GPT.__name__}")
+        try:
+            model = mod.GPT(H).to(device).bfloat16()
+            print("[model] constructed via GPT(h) signature")
+        except TypeError as e:
+            print(f"[model] GPT(h) failed ({e}); trying kwargs")
+            gpt_kwargs = dict(
+                vocab_size=H.vocab_size, num_layers=H.num_layers, model_dim=H.model_dim,
+                num_heads=H.num_heads, num_kv_heads=H.num_kv_heads, mlp_mult=H.mlp_mult,
+                tie_embeddings=H.tie_embeddings, tied_embed_init_std=H.tied_embed_init_std,
+                logit_softcap=H.logit_softcap, rope_base=H.rope_base, qk_gain_init=H.qk_gain_init,
+                recur_layers=H.recur_layers, recur_start_step=H.recur_start_step,
+                parallel_start_layer=H.parallel_start_layer, rope_dims=H.rope_dims,
+            )
+            model = mod.GPT(**gpt_kwargs).to(device).bfloat16()
+        # Mirror train_gpt.py post-construction precision tweaks
+        if hasattr(mod, "CastedLinear"):
+            for m in model.modules():
+                if isinstance(m, mod.CastedLinear):
+                    m.float()
+        for fn in ["restore_low_dim_params_to_fp32", "restore_fp32_params"]:
+            if hasattr(mod, fn):
+                getattr(mod, fn)(model)
+                break
+        print(f"[model] loading state from {args.ckpt}")
+        state = torch.load(args.ckpt, map_location=device, weights_only=False)
+        if isinstance(state, dict):
+            if "state_dict" in state: state = state["state_dict"]
+            elif "model" in state: state = state["model"]
+        try:
+            model.load_state_dict(state, strict=True)
+            print("[model] strict load OK")
+        except Exception as e:
+            missing, unexpected = model.load_state_dict(state, strict=False)
+            print(f"[model] non-strict load: missing={len(missing)} unexpected={len(unexpected)}")
+            if len(missing) <= 5: print(f"  missing: {missing}")
+            if len(unexpected) <= 5: print(f"  unexpected: {unexpected}")
+    model.eval()
+    # CRITICAL: enable loop layers (depth recurrence). Without this the model
+    # runs in non-looped mode (much higher val_bpb, doesn't match training).
+    if H.num_loops > 0 and hasattr(model, "looping_active"):
+        model.looping_active = True
+        print("[model] looping_active=True (depth recurrence enabled)")
+
+    # ---- forward (chunked, MATCHES official eval_val: seq_len=2048 with varlen attn) ----
+    # Official pipeline = torch.compile(forward_logits) + torch.autocast(bf16)
+    seq_len = 2048
+    n_pred = n - 1
+    nll_nats = np.zeros(n_pred, dtype=np.float64)
+    TOP_K_CACHE = int(os.environ.get("TOP_K_CACHE", "50"))
+    top_ids = np.zeros((n_pred, TOP_K_CACHE), dtype=np.int32)
+    top_logp = np.zeros((n_pred, TOP_K_CACHE), dtype=np.float32)
+
+    # Optional: online TRUE full-Σ proper-margin accumulation
+    PROPER_MARGIN_ONLINE = int(os.environ.get("PROPER_MARGIN_ONLINE", "0"))
+    if PROPER_MARGIN_ONLINE:
+        print("[pm] PROPER_MARGIN_ONLINE=1 — building per-prefix tid lists")
+        bytestrs_pm = [b""] * vocab
+        has_space_pm = [False] * vocab
+        is_boundary_pm = [False] * vocab
+        for tid in range(vocab):
+            if sp.is_control(tid) or sp.is_unknown(tid) or sp.is_unused(tid):
+                is_boundary_pm[tid] = True; continue
+            piece = sp.id_to_piece(tid)
+            if sp.is_byte(tid):
+                bytestrs_pm[tid] = bytes([int(piece[3:-1], 16)]); continue
+            if piece.startswith("▁"):
+                has_space_pm[tid] = True; piece = piece[1:]
+            bytestrs_pm[tid] = piece.encode("utf-8")
+        pc_w_list, pc_n_list = {}, {}        # prefix -> tids that START WITH prefix
+        pc_w_exact, pc_n_exact = {}, {}      # prefix -> tids whose bytes EQUAL prefix exactly
+        for use_space, tgt_starts, tgt_exact in [(True, pc_w_list, pc_w_exact), (False, pc_n_list, pc_n_exact)]:
+            tmp_starts = {}; tmp_exact = {}
+            for tid in range(vocab):
+                if is_boundary_pm[tid]: continue
+                inc = has_space_pm[tid] and use_space
+                bs = (b" " if inc else b"") + bytestrs_pm[tid]
+                if not bs: continue
+                for jj in range(1, len(bs) + 1):
+                    tmp_starts.setdefault(bs[:jj], []).append(tid)
+                tmp_exact.setdefault(bs, []).append(tid)  # full bytes only
+            for k_, v_ in tmp_starts.items():
+                tgt_starts[k_] = torch.tensor(v_, dtype=torch.long, device=device)
+            for k_, v_ in tmp_exact.items():
+                tgt_exact[k_] = torch.tensor(v_, dtype=torch.long, device=device)
+        pm_byte_total_nats = 0.0
+        pm_bytes_total = 0
+        pm_token_total_nats = 0.0
+        # Buffers for proper-margin per-byte mix experiment
+        pm_byte_stream_list = []  # list of uint8 (the actual byte at each position)
+        pm_nn_logp_list = []      # list of float (proper-margin log-prob at each position)
+        print(f"[pm] tables ready (with-space={len(pc_w_list)}, no-space={len(pc_n_list)})")
+    print(f"[fwd] running on {n} tokens in chunks of {seq_len} with varlen attention (BOS-aware)")
+
+    # Find _build_cu_seqlens helper from the train_gpt module
+    build_cu = getattr(mod, "_build_cu_seqlens", None)
+    if build_cu is None:
+        print("[fwd] WARNING: _build_cu_seqlens not found; falling back to no-mask attention")
+    else:
+        print("[fwd] using mod._build_cu_seqlens (BOS-aware varlen attention)")
+
+    # torch.compile the forward (dynamic=True so the last short chunk doesn't recompile-bomb).
+    try:
+        forward_compiled = torch.compile(model.forward_logits, dynamic=True, fullgraph=False)
+        print("[fwd] torch.compile(forward_logits) enabled (dynamic=True)")
+    except Exception as e:
+        forward_compiled = model.forward_logits
+        print(f"[fwd] torch.compile failed ({e}); falling back to eager")
+
+    BOS_ID_VAL = 1 if getattr(mod, "BOS_ID", None) is None else mod.BOS_ID  # keep a legitimate BOS_ID of 0
+    t0 = time.time()
+    pos = 0
+    # Each chunk has seq_len input tokens and seq_len target tokens shifted by 1; the final chunk may be shorter.
+    # Mirrors official eval: x = local[:-1], y = local[1:], chunks span seq_len each.
+    while pos < n_pred:
+        end = min(pos + seq_len + 1, n)
+        if end - pos < 2:
+            break
+        chunk_tokens = tokens[pos:end]
+        x_t = torch.tensor(chunk_tokens[:-1], dtype=torch.long, device=device)
+        cu_seqlens, max_seqlen = None, 0
+        if build_cu is not None:
+            bos_pos = (x_t == BOS_ID_VAL).nonzero(as_tuple=True)[0].tolist()
+            cu_seqlens, max_seqlen = build_cu(bos_pos, x_t.numel(), x_t.device, seq_len, 64)
+        with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+            if cu_seqlens is not None:
+                logits = forward_compiled(x_t[None], cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)
+            else:
+                logits = forward_compiled(x_t[None])
+            log_probs = F.log_softmax(logits.float(), dim=-1)
+            tgt = torch.tensor(chunk_tokens[1:], dtype=torch.long, device=device)
+            nll_chunk = -log_probs[0].gather(1, tgt.unsqueeze(-1)).squeeze(-1)
+            n_chunk = nll_chunk.shape[0]
+            nll_nats[pos:pos + n_chunk] = nll_chunk.cpu().numpy()
+            tk = log_probs[0].topk(TOP_K_CACHE, dim=-1)
+            top_ids[pos:pos + n_chunk] = tk.indices.cpu().numpy()
+            top_logp[pos:pos + n_chunk] = tk.values.cpu().numpy()
+
+            if PROPER_MARGIN_ONLINE:
+                probs_chunk = log_probs[0].exp()  # (n_chunk, vocab)
+                for j_in_chunk in range(n_chunk):
+                    i_global = pos + j_in_chunk
+                    tid_g = int(chunk_tokens[j_in_chunk + 1])
+                    pid_g = int(chunk_tokens[j_in_chunk])
+                    if tid_g < 0 or tid_g >= vocab or is_boundary_pm[tid_g]:
+                        continue
+                    inc_sp = has_space_pm[tid_g] and (pid_g < 0 or not is_boundary_pm[pid_g])
+                    full_b = (b" " if inc_sp else b"") + bytestrs_pm[tid_g]
+                    if not full_b:
+                        continue
+                    n_b_g = int(val_bytes_per_tok_full[i_global + 1]) if val_bytes_per_tok_full is not None else len(full_b)
+                    use_sp_ctx = (pid_g < 0 or is_boundary_pm[pid_g])
+                    # pc_w_list: has-space tokens get leading space (matches "prev NOT boundary")
+                    # pc_n_list: no leading space ever (matches "prev IS boundary")
+                    pc_g = pc_n_list if use_sp_ctx else pc_w_list
+                    pc_g_exact = pc_n_exact if use_sp_ctx else pc_w_exact
+                    pj = probs_chunk[j_in_chunk]
+                    prefix_b = b""
+                    prefix_mass = 1.0
+                    nn_byte_t = 0.0
+                    n_full = len(full_b)
+                    for byte_idx_, bt in enumerate(full_b):
+                        np_b = prefix_b + bytes([bt])
+                        is_last_byte = (byte_idx_ == n_full - 1)
+                        if is_last_byte:
+                            # FAITHFUL TERMINATION: at the last byte of the canonical span,
+                            # use only tokens whose bytes EXACTLY equal the full span bytes.
+                            # This is what makes byte-level NLL = token-level NLL when no
+                            # same-byte alternates exist (proper bit-conservation).
+                            tids_t = pc_g_exact.get(np_b)
+                        else:
+                            tids_t = pc_g.get(np_b)
+                        if tids_t is None:
+                            ext_m = 1e-30
+                        else:
+                            ext_m = float(pj[tids_t].sum().item())
+                        if ext_m < 1e-30: ext_m = 1e-30
+                        per_byte_logp = math.log(ext_m / max(prefix_mass, 1e-30))
+                        nn_byte_t += -per_byte_logp
+                        pm_byte_stream_list.append(bt)
+                        pm_nn_logp_list.append(per_byte_logp)
+                        prefix_mass = ext_m
+                        prefix_b = np_b
+                    pm_byte_total_nats += nn_byte_t
+                    pm_bytes_total += n_b_g
+                    pm_token_total_nats += float(nll_chunk[j_in_chunk].item())
+        pos += seq_len
+        if pos == seq_len or pos % (seq_len * 256) == 0:
+            elapsed = time.time() - t0
+            print(f"[fwd] pos={pos}/{n_pred} ({100*pos/n_pred:.1f}%) elapsed={elapsed:.1f}s")
+    print(f"[fwd] done in {time.time()-t0:.1f}s; mean NLL/tok = {nll_nats.mean():.4f} nats")
+
+    if PROPER_MARGIN_ONLINE:
+        LOG2_ = math.log(2.0)
+        tok_bpb_pm = pm_token_total_nats / max(pm_bytes_total, 1) / LOG2_
+        byte_bpb_pm = pm_byte_total_nats / max(pm_bytes_total, 1) / LOG2_
+        print(f"[pm] === FULL-Σ PROPER MARGIN ===")
+        print(f"[pm] bytes={pm_bytes_total:,}")
+        print(f"[pm] token-level NN BPB:                {tok_bpb_pm:.5f}")
+        print(f"[pm] byte-level NN BPB (proper, full):  {byte_bpb_pm:.5f}")
+        print(f"[pm] diff (should be 0 by chain rule):  {tok_bpb_pm - byte_bpb_pm:+.5f}")
+
+
+    # ---- build PPM args ----
+    print(f"[ppm] building byte LUT (vocab={vocab})")
+    flat, offs, lens, has_space, is_boundary = build_token_bytes_lut(sp, vocab)
+    target_ids = np.array(tokens[1:n], dtype=np.int64)         # next-token at each pos
+    prev_ids = np.array(tokens[0:n-1], dtype=np.int64)         # previous token
+    print(f"[ppm] target shape={target_ids.shape} flat={len(flat)}B")
+
+    # Per-token NN log-prob = -nll_nats; passed to scorer as nll (positive)
+    # The scorer expects nll in nats, will divide by n_bytes internally.
+
+    out = np.zeros(6, dtype=np.float64)
+    use_antihijack = args.ppm_nn_skip_thr_nats > 0
+    c_src_path = args.ppm_c_src if args.ppm_c_src else str(PPM_C_SRC_FILE)
+    if use_antihijack and not args.ppm_c_src:
+        # default to anti-hijack source on the pod
+        for cand in ["/workspace/their_ppm_antihijack.c", str(PPM_C_SRC_FILE.parent / "ppm_scorer_antihijack.c")]:
+            if Path(cand).exists():
+                c_src_path = cand
+                break
+    print(f"[ppm] using C source: {c_src_path}, antihijack={use_antihijack}")
+    lib = compile_ppm_lib(c_src_path, antihijack=use_antihijack)
+
+    print(f"[ppm] order={args.ppm_order} λ_hi={args.lambda_hi} λ_lo={args.lambda_lo} thr={args.ppm_threshold} nn_skip_thr_nats={args.ppm_nn_skip_thr_nats}")
+    t1 = time.time()
+    if use_antihijack:
+        rc = lib.ppm_score_omp(
+            target_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
+            prev_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
+            nll_nats.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+            ctypes.c_int64(len(target_ids)),
+            flat.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+            offs.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
+            lens.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
+            has_space.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+            is_boundary.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+            ctypes.c_int(vocab),
+            ctypes.c_int(args.ppm_order),
+            ctypes.c_double(args.lambda_hi),
+            ctypes.c_double(args.lambda_lo),
+            ctypes.c_double(args.ppm_threshold),
+            ctypes.c_double(args.ppm_nn_skip_thr_nats),
+            ctypes.c_uint32(args.ppm_log_cache),
+            ctypes.c_int64(args.ppm_chunk_tokens),
+            ctypes.c_int(args.ppm_omp_threads),
+            out.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+        )
+    else:
+        rc = lib.ppm_score_omp(
+            target_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
+            prev_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
+            nll_nats.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+            ctypes.c_int64(len(target_ids)),
+            flat.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+            offs.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
+            lens.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
+            has_space.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+            is_boundary.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+            ctypes.c_int(vocab),
+            ctypes.c_int(args.ppm_order),
+            ctypes.c_double(args.lambda_hi),
+            ctypes.c_double(args.lambda_lo),
+            ctypes.c_double(args.ppm_threshold),
+            ctypes.c_uint32(args.ppm_log_cache),
+            ctypes.c_int64(args.ppm_chunk_tokens),
+            ctypes.c_int(args.ppm_omp_threads),
+            out.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+        )
+    print(f"[ppm] score_omp returned {rc} in {time.time()-t1:.1f}s")
+    if rc != 0:
+        print(f"[ppm] ERROR: rc={rc}")
+        sys.exit(1)
+
+    mix_bpb, ppm_only_bpb, nn_byte_bpb, token_bpb, n_bytes, gate_high_frac = out
+    print(f"[ppm] mix_bpb={mix_bpb:.5f} ppm_only={ppm_only_bpb:.5f} nn_byte_bpb={nn_byte_bpb:.5f} bytes={int(n_bytes)} gate_high={gate_high_frac:.4f}")
+
+    # === proper-margin + PPM mix experiment ===
+    if PROPER_MARGIN_ONLINE and pm_byte_stream_list:
+        try:
+            nbs = len(pm_byte_stream_list)
+            byte_stream_np = np.array(pm_byte_stream_list, dtype=np.uint8)
+            nn_logp_np = np.array(pm_nn_logp_list, dtype=np.float64)
+            print(f"[pm-mix] computing PPM mix on {nbs:,} bytes (proper-margin nn_logp)")
+            lib.ppm_score_bytewise.restype = ctypes.c_int
+            lib.ppm_score_bytewise.argtypes = [
+                ctypes.POINTER(ctypes.c_uint8),
+                ctypes.POINTER(ctypes.c_double),
+                ctypes.c_int64,
+                ctypes.c_int,
+                ctypes.c_double, ctypes.c_double, ctypes.c_double,
+                ctypes.c_uint32,
+                ctypes.POINTER(ctypes.c_double),
+            ]
+            out_pm = np.zeros(6, dtype=np.float64)
+            rc = lib.ppm_score_bytewise(
+                byte_stream_np.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+                nn_logp_np.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+                ctypes.c_int64(nbs),
+                ctypes.c_int(args.ppm_order),
+                ctypes.c_double(args.lambda_hi),
+                ctypes.c_double(args.lambda_lo),
+                ctypes.c_double(args.ppm_threshold),
+                ctypes.c_uint32(args.ppm_log_cache),
+                out_pm.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+            )
+            print(f"[pm-mix] PROPER-MARGIN + PPM:")
+            print(f"[pm-mix]   rc={rc}  mix_bpb={out_pm[0]:.5f}  ppm_only={out_pm[1]:.5f}  "
+                  f"nn_proper_bpb={out_pm[2]:.5f}  bytes={int(out_pm[4])}  gate_hi={out_pm[5]:.4f}")
+            print(f"[pm-mix]   COMPARISON:")
+            print(f"[pm-mix]     uniform-spread mix_bpb (spec 055):  {mix_bpb:.5f}")
+            print(f"[pm-mix]     proper-margin   mix_bpb (this run): {out_pm[0]:.5f}")
+            print(f"[pm-mix]     diff: {out_pm[0]-mix_bpb:+.5f}")
+
+            # === Per-byte dump comparison: find explicit examples of bookkeeping artifact ===
+            print(f"\n[pm-dump] running dump scorer to capture per-byte uniform/PPM/gate data")
+            max_bytes = nbs + 1024
+            dump_mix = np.zeros(max_bytes, dtype=np.float32)
+            dump_ppm = np.zeros(max_bytes, dtype=np.float32)
+            dump_nn_uniform = np.zeros(max_bytes, dtype=np.float32)  # uniform-spread nn_nll per byte
+            dump_conf = np.zeros(max_bytes, dtype=np.float32)
+            dump_gate_hi = np.zeros(max_bytes, dtype=np.uint8)
+            dump_byte = np.zeros(max_bytes, dtype=np.uint8)
+            dump_n_bytes = np.zeros(1, dtype=np.uint64)
+            lib.ppm_score_dump.restype = ctypes.c_int
+            lib.ppm_score_dump.argtypes = [
+                ctypes.POINTER(ctypes.c_int64), ctypes.POINTER(ctypes.c_int64),
+                ctypes.POINTER(ctypes.c_double), ctypes.c_int64,
+                ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_int32), ctypes.POINTER(ctypes.c_int32),
+                ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_uint8),
+                ctypes.c_int, ctypes.c_int, ctypes.c_double, ctypes.c_double, ctypes.c_double,
+                ctypes.c_uint32, ctypes.POINTER(ctypes.c_double),
+                ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_float),
+                ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_uint8),
+                ctypes.POINTER(ctypes.c_uint64),
+            ]
+            out_dump = np.zeros(6, dtype=np.float64)
+            rc_dump = lib.ppm_score_dump(
+                target_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
+                prev_ids.ctypes.data_as(ctypes.POINTER(ctypes.c_int64)),
+                nll_nats.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+                ctypes.c_int64(len(target_ids)),
+                flat.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+                offs.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
+                lens.ctypes.data_as(ctypes.POINTER(ctypes.c_int32)),
+                has_space.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+                is_boundary.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+                ctypes.c_int(vocab), ctypes.c_int(args.ppm_order),
+                ctypes.c_double(args.lambda_hi), ctypes.c_double(args.lambda_lo), ctypes.c_double(args.ppm_threshold),
+                ctypes.c_uint32(args.ppm_log_cache),
+                out_dump.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
+                dump_mix.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
+                dump_ppm.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
+                dump_nn_uniform.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
+                dump_conf.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
+                dump_gate_hi.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+                dump_byte.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
+                dump_n_bytes.ctypes.data_as(ctypes.POINTER(ctypes.c_uint64)),
+            )
+            n_dump = int(dump_n_bytes[0])
+            print(f"[pm-dump] dumped {n_dump} bytes (proper-margin had {nbs}); aligned: {n_dump==nbs}")
+
+            # Save and find examples
+            if n_dump == nbs:
+                # Sanity: dump_byte should match pm_byte_stream
+                mismatches = int((byte_stream_np != dump_byte[:n_dump]).sum())
+                print(f"[pm-dump] byte-stream mismatches: {mismatches} (should be 0)")
+                # Per-byte: dump_nn_uniform = -log(p_NN_byte_uniform) in nats (positive NLL)
+                # dump_ppm = -log(p_PPM_byte) in nats
+                # nn_logp_np = log(p_NN_byte_proper) in nats (negative log-prob)
+                # so per-byte NLL: nn_proper_nats = -nn_logp_np
+                nn_proper_nats = -nn_logp_np[:n_dump]
+                nn_uniform_nats = dump_nn_uniform[:n_dump].astype(np.float64)
+                ppm_nats = dump_ppm[:n_dump].astype(np.float64)
+                gate_hi = dump_gate_hi[:n_dump]
+                bytes_arr = dump_byte[:n_dump]
+
+                # Rank "artifact bytes": keep only positions with gate_hi=1 (mix uses λ_lo=0.05, i.e. 95% PPM)
+                #   and score them by how far nn_uniform_nats exceeds nn_proper_nats (proper says NN was confident).
+                #   ppm_nats is saved and printed for comparison but is not part of the score.
+                artifact_score = (nn_uniform_nats - nn_proper_nats) * gate_hi.astype(np.float64)
+                top_idx = np.argsort(-artifact_score)[:30]
+
+                # Save full per-byte arrays
+                np.savez_compressed("/tmp/per_byte_compare.npz",
+                    bytes=bytes_arr,
+                    nn_proper_nats=nn_proper_nats.astype(np.float32),
+                    nn_uniform_nats=nn_uniform_nats.astype(np.float32),
+                    ppm_nats=ppm_nats.astype(np.float32),
+                    gate_hi=gate_hi,
+                )
+                print(f"[pm-dump] saved per-byte data → /tmp/per_byte_compare.npz")
+
+                LOG2_n = math.log(2.0)
+                print(f"\n[examples] TOP 30 'artifact bytes' (gate_hi=1, big proper-vs-uniform divergence):")
+                print(f"  fmt: idx | byte | uniform_nat | proper_nat | ppm_nat | savings_per_byte_nat")
+                for k_, idx in enumerate(top_idx):
+                    b = int(bytes_arr[idx])
+                    ch = chr(b) if 32 <= b < 127 else f"\\x{b:02x}"
+                    # Show context: the 8 preceding bytes plus the byte itself
+                    pre_start = max(0, int(idx) - 8)
+                    pre_bytes = bytes(bytes_arr[pre_start:int(idx)+1])
+                    pre_str = pre_bytes.decode("utf-8", errors="replace")
+                    print(f"  [{int(idx):8d}] byte={ch!r:>5}  "
+                          f"uniform={nn_uniform_nats[idx]:6.3f}n  proper={nn_proper_nats[idx]:6.3f}n  "
+                          f"ppm={ppm_nats[idx]:6.3f}n  savings={(nn_uniform_nats[idx]-nn_proper_nats[idx]):.3f}n  "
+                          f"ctx={pre_str!r}")
+        except Exception as e:
+            print(f"[pm-mix/dump] FAILED: {e}")
+            import traceback; traceback.print_exc()
+
+    # ---- SAVE RAW DATA IMMEDIATELY (before any report-writing that could crash) ----
+    npz_path = Path(args.out).with_suffix(".npz")
+    save_dict = dict(
+        tokens=np.array(tokens, dtype=np.int32),
+        nll_nats=nll_nats.astype(np.float32),
+        top_ids=top_ids,
+        top_logp=top_logp,
+        ppm_out=out,
+        ckpt_path=np.array([args.ckpt]),
+        ppm_config=np.array([args.ppm_order, args.lambda_hi, args.lambda_lo, args.ppm_threshold], dtype=np.float64),
+    )
+    if val_bytes_per_tok_full is not None:
+        save_dict["val_bytes_per_tok"] = val_bytes_per_tok_full.astype(np.int32)
+    np.savez_compressed(npz_path, **save_dict)
+    print(f"[save] raw data → {npz_path} ({npz_path.stat().st_size//1024//1024} MB)")
+
+    # ---- official-style val_bpb computation using sidecar bytes ----
+    if val_bytes_per_tok_full is not None:
+        # Match official: byte budget for target tokens (positions 1..n-1)
+        target_bytes = val_bytes_per_tok_full[1:n].astype(np.float64)
+        total_nll_bits = nll_nats.sum() / math.log(2.0)
+        total_bytes_official = float(target_bytes.sum())
+        nn_bpb_official = total_nll_bits / total_bytes_official
+        print(f"[official] NN val_bpb (sidecar bytes, varlen attn) = {nn_bpb_official:.5f}")
+        print(f"           total bits = {total_nll_bits:,.0f}, total bytes = {int(total_bytes_official):,}")
+
+    # ---- failure analysis breakdown by token category ----
+    pieces = [sp.id_to_piece(t) if t < vocab else "<oor>" for t in tokens]
+    actual_pieces = pieces[1:n]
+    cats = [categorize(p) for p in actual_pieces]
+    NLL_bits = nll_nats / math.log(2)
+    bytes_per = lens[target_ids]
+    bpb_per = NLL_bits / np.maximum(bytes_per, 1)
+
+    from collections import defaultdict
+    cat_count = defaultdict(int)
+    cat_total_bits = defaultdict(float)
+    cat_total_bytes = defaultdict(int)
+    for i, c in enumerate(cats):
+        cat_count[c] += 1
+        cat_total_bits[c] += float(NLL_bits[i])
+        cat_total_bytes[c] += int(bytes_per[i])
+    total_bits = float(NLL_bits.sum())
+    total_bytes_meas = int(bytes_per.sum())
+    measured_nn_bpb = total_bits / max(total_bytes_meas, 1)
+
+    # NLL distribution buckets
+    buckets = [(0, 0.5), (0.5, 1.5), (1.5, 3.0), (3.0, 5.0), (5.0, 999)]
+    bucket_count = [0]*len(buckets); bucket_sum = [0.0]*len(buckets)
+    for x in bpb_per:
+        for j, (lo, hi) in enumerate(buckets):
+            if lo <= x < hi:
+                bucket_count[j] += 1
+                bucket_sum[j] += float(x)
+                break
+
+    # Top/bottom NLL positions for context
+    sorted_idx = np.argsort(-NLL_bits)
+    worst = sorted_idx[:25]
+    best = sorted_idx[-25:][::-1]
+
+    def ctx_str(pos, k=25):
+        end = pos + 1
+        start = max(0, end - k)
+        s = "".join(pieces[i].replace("▁", " ") for i in range(start, end))
+        return s.replace("\n", "↵")[-80:]
+
+    # ---- write report ----
+    out_lines = []
+    out_lines.append(f"# PPM-D byte-mixture inspection")
+    out_lines.append(f"")
+    out_lines.append(f"**Checkpoint:** `{args.ckpt}`")
+    out_lines.append(f"**Tokens scored:** {len(target_ids)} (bytes: {int(n_bytes)})")
+    out_lines.append(f"**PPM config:** order={args.ppm_order} λ_hi={args.lambda_hi} λ_lo={args.lambda_lo} threshold={args.ppm_threshold}")
+    out_lines.append(f"")
+    out_lines.append(f"## Headline")
+    out_lines.append(f"")
+    out_lines.append(f"| Metric | bits/byte |")
+    out_lines.append(f"|---|---:|")
+    out_lines.append(f"| **NN only** (`nn_byte_bpb`) | **{nn_byte_bpb:.5f}** |")
+    out_lines.append(f"| PPM only (`ppm_only`) | {ppm_only_bpb:.5f} |")
+    out_lines.append(f"| **NN + PPM mix** (`mix_bpb`) | **{mix_bpb:.5f}** |")
+    out_lines.append(f"| **Δ from PPM** | **{mix_bpb - nn_byte_bpb:+.5f}** |")
+    out_lines.append(f"| Gate high-confidence fraction | {gate_high_frac:.4f} ({100*gate_high_frac:.1f}%) |")
+    out_lines.append(f"| Token-level reference (`token_bpb`) | {token_bpb:.5f} |")
+    out_lines.append(f"")
+    out_lines.append(f"## Comparison to dexhunter's #1857")
+    out_lines.append(f"")
+    out_lines.append(f"| | #1857 | this run |")
+    out_lines.append(f"|---|---:|---:|")
+    out_lines.append(f"| nn_byte_bpb | 1.10020 | {nn_byte_bpb:.5f} |")
+    out_lines.append(f"| ppm_only | 2.34028 | {ppm_only_bpb:.5f} |")
+    out_lines.append(f"| mix_bpb | 1.03176 | {mix_bpb:.5f} |")
+    out_lines.append(f"| gate_high_frac | 0.14241 | {gate_high_frac:.5f} |")
+    out_lines.append(f"| Δ from PPM | -0.06844 | {mix_bpb - nn_byte_bpb:+.5f} |")
+    out_lines.append(f"")
+    out_lines.append(f"## Per-category contribution to NN-only val_bpb")
+    out_lines.append(f"")
+    out_lines.append(f"| category | count | % positions | mean bits/byte | bits | % NN val_bpb |")
+    out_lines.append(f"|---|---:|---:|---:|---:|---:|")
+    for cat in sorted(cat_count, key=lambda c: -cat_total_bits[c]):
+        c = cat_count[cat]
+        pct = 100*c/max(len(cats),1)
+        bb = cat_total_bytes[cat]
+        mean = cat_total_bits[cat] / max(bb, 1) if bb else 0
+        contrib = 100 * cat_total_bits[cat] / max(total_bits, 1e-9)
+        out_lines.append(f"| {cat} | {c} | {pct:.1f}% | {mean:.2f} | {cat_total_bits[cat]:.1f} | {contrib:.1f}% |")
+    out_lines.append(f"")
+    ppm_addr_pct = 100 * sum(cat_total_bits[c] for c in ['URL','NUMERIC','CODE','HEX','PATH']) / max(total_bits, 1e-9)
+    out_lines.append(f"**PPM-addressable categories (URL+NUMERIC+CODE+HEX+PATH): {ppm_addr_pct:.1f}% of NN val_bpb**")
+    out_lines.append(f"")
+    out_lines.append(f"## NLL distribution (NN only, bits/byte)")
+    out_lines.append(f"")
+    out_lines.append(f"| bucket | count | % | mean | contribution |")
+    out_lines.append(f"|---|---:|---:|---:|---:|")
+    for j, (lo, hi) in enumerate(buckets):
+        c = bucket_count[j]; pct = 100*c/max(len(cats),1)
+        mean = bucket_sum[j]/max(c,1)
+        contrib = bucket_sum[j]/max(len(cats),1)
+        contrib_pct = 100*contrib/max(measured_nn_bpb, 1e-9)
+        hi_str = "∞" if hi > 100 else f"{hi:.1f}"
+        out_lines.append(f"| {lo:.1f}–{hi_str} | {c} | {pct:.1f}% | {mean:.3f} | {contrib:.4f} ({contrib_pct:.1f}%) |")
+    out_lines.append(f"")
+    out_lines.append(f"## Top-50 most catastrophic NN predictions")
+    out_lines.append(f"")
+    out_lines.append(f"These are the tokens where the model was most surprised; a high NLL means the model assigned near-zero probability to the token that actually came next.")
+    out_lines.append(f"")
+    out_lines.append(f"| pos | NLL (bits) | actual | top-1 prediction (prob) | category | left context (last 50 tokens, truncated to 80 chars) |")
+    out_lines.append(f"|---|---:|---|---|---|---|")
+    for pos in sorted_idx[:50]:
+        top1_id = int(top_ids[pos][0])
+        top1_piece = sp.id_to_piece(top1_id) if top1_id < vocab else "<oor>"
+        top1_p = math.exp(float(top_logp[pos][0]))
+        out_lines.append(f"| {int(pos)} | {NLL_bits[pos]:.2f} | `{actual_pieces[pos]!r}` | `{top1_piece!r}` ({top1_p:.3f}) | {cats[pos]} | `{ctx_str(int(pos), k=50)}` |")
+    out_lines.append(f"")
+    out_lines.append(f"## Worst-10 per category")
+    out_lines.append(f"")
+    out_lines.append(f"Where each kind of token fails most. The PPM-addressable categories (URL/NUMERIC/CODE/HEX/PATH) are where PPM is expected to help.")
+    out_lines.append(f"")
+    for target_cat in ["URL", "NUMERIC", "CODE", "HEX", "PATH", "PROSE"]:
+        cat_positions = [i for i, c in enumerate(cats) if c == target_cat]
+        if not cat_positions: continue
+        cat_sorted = sorted(cat_positions, key=lambda i: -NLL_bits[i])[:10]
+        out_lines.append(f"### {target_cat} ({len(cat_positions)} positions, mean NLL {sum(NLL_bits[i] for i in cat_positions)/len(cat_positions):.2f} bits)")
+        out_lines.append(f"")
+        out_lines.append(f"| pos | NLL | actual | top-1 (prob) | left context |")
+        out_lines.append(f"|---|---:|---|---|---|")
+        for pos in cat_sorted:
+            top1_id = int(top_ids[pos][0])
+            top1_piece = sp.id_to_piece(top1_id) if top1_id < vocab else "<oor>"
+            top1_p = math.exp(float(top_logp[pos][0]))
+            out_lines.append(f"| {int(pos)} | {NLL_bits[pos]:.2f} | `{actual_pieces[pos]!r}` | `{top1_piece!r}` ({top1_p:.3f}) | `{ctx_str(int(pos), k=50)}` |")
+        out_lines.append(f"")
+    out_lines.append(f"## Top-25 best NN predictions (contrast)")
+    out_lines.append(f"")
+    out_lines.append(f"| pos | NLL (bits) | actual | category | left context |")
+    out_lines.append(f"|---|---:|---|---|---|")
+    for pos in best:
+        out_lines.append(f"| {int(pos)} | {NLL_bits[pos]:.2f} | `{actual_pieces[pos]!r}` | {cats[pos]} | `{ctx_str(int(pos))}` |")
+
+    Path(args.out).write_text("\n".join(out_lines), encoding="utf-8")
+    print(f"\n[done] wrote {len(out_lines)} lines to {args.out}")
+
+if __name__ == "__main__":
+    main()
diff --git a/testing/ppm_scorer.c b/testing/ppm_scorer.c
new file mode 100644
index 0000000000..4617e5c7ff
--- /dev/null
+++ b/testing/ppm_scorer.c
@@ -0,0 +1,210 @@
+
+#include <math.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <omp.h>
+typedef struct{uint64_t key;uint32_t total,max_count,unique,head;uint8_t used,ib[4];uint32_t ic[4];} Ctx;
+typedef struct{uint32_t next,ctx,count;uint8_t byte;} Edge;
+typedef struct{Ctx*ctx;uint64_t cap,used;Edge*edges;uint64_t ecap,eused;} Table;
+static uint64_t mix64(uint64_t x){x^=x>>33;x*=0xff51afd7ed558ccdULL;x^=x>>33;x*=0xc4ceb9fe1a85ec53ULL;x^=x>>33;return x;}
+static int table_init(Table*t,uint64_t cap){uint64_t c=1;while(c<cap)c<<=1;t->cap=c;t->used=0;t->ctx=(Ctx*)calloc(c,sizeof(Ctx));t->ecap=cap*2+1024;t->eused=1;t->edges=(Edge*)calloc(t->ecap,sizeof(Edge));return t->ctx&&t->edges?0:-1;}
+static void table_free(Table*t){free(t->ctx);free(t->edges);memset(t,0,sizeof(*t));}
+static int grow_edges(Table*t){uint64_t nc=t->ecap*2;Edge*ne=(Edge*)realloc(t->edges,nc*sizeof(Edge));if(!ne)return-1;memset(ne+t->ecap,0,(nc-t->ecap)*sizeof(Edge));t->edges=ne;t->ecap=nc;return 0;}
+static Ctx* table_find(Table*t,uint64_t key){uint64_t m=t->cap-1,i=mix64(key)&m;for(;;){Ctx*c=&t->ctx[i];if(!c->used)return 0;if(c->key==key)return c;i=(i+1)&m;}}
+static int table_rehash(Table*t){
+    Table nt;if(table_init(&nt,t->cap*2))return-1;
+    free(nt.edges);nt.edges=t->edges;nt.ecap=t->ecap;nt.eused=t->eused;
+    for(uint64_t j=0;j<t->cap;j++)if(t->ctx[j].used){uint64_t m=nt.cap-1,i=mix64(t->ctx[j].key)&m;while(nt.ctx[i].used)i=(i+1)&m;nt.ctx[i]=t->ctx[j];nt.used++;}
+    free(t->ctx);*t=nt;return 0;
+}
+static Ctx* table_get_or_add(Table*t,uint64_t key){
+    if((t->used+1)*10>t->cap*7)if(table_rehash(t))return 0;
+    uint64_t m=t->cap-1,i=mix64(key)&m;
+    for(;;){Ctx*c=&t->ctx[i];if(!c->used){c->used=1;c->key=key;c->head=0;t->used++;return c;}if(c->key==key)return c;i=(i+1)&m;}
+}
+static uint32_t edge_count(Table*t,Ctx*c,uint8_t b){uint32_t m=c->unique<4?c->unique:4;for(uint32_t i=0;i<m;i++)if(c->ib[i]==b)return c->ic[i];for(uint32_t e=c->head;e;e=t->edges[e].next)if(t->edges[e].byte==b)return t->edges[e].count;return 0;}
+static int edge_inc(Table*t,Ctx*c,uint8_t b){
+    uint32_t m=c->unique<4?c->unique:4;for(uint32_t i=0;i<m;i++)if(c->ib[i]==b){uint32_t nc=++c->ic[i];c->total++;if(nc>c->max_count)c->max_count=nc;return 0;}
+    for(uint32_t e=c->head;e;e=t->edges[e].next)if(t->edges[e].byte==b){uint32_t nc=++t->edges[e].count;c->total++;if(nc>c->max_count)c->max_count=nc;return 0;}
+    if(c->unique<4){uint32_t i=c->unique;c->ib[i]=b;c->ic[i]=1;c->total++;c->unique++;if(c->max_count<1)c->max_count=1;return 0;}
+    if(t->eused>=t->ecap)if(grow_edges(t))return-1;
+    uint32_t e=(uint32_t)t->eused++;t->edges[e].byte=b;t->edges[e].count=1;t->edges[e].ctx=(uint32_t)(c-t->ctx);t->edges[e].next=c->head;c->head=e;c->total++;c->unique++;if(c->max_count<1)c->max_count=1;return 0;
+}
+static uint64_t mask_for(int K){return K>=8?~0ULL:((1ULL<<(8*K))-1ULL);}
+static inline double lgi(uint32_t x,double*lc,uint32_t lcap){if(lc&&x<lcap){double v=lc[x];if(v>=0.0)return v;v=log((double)x);lc[x]=v;return v;}return log((double)x);}
+
+/* score_byte_with_dump: extension of score_byte that ALSO writes per-byte
+ * info (mix_nll, ppm_nll, nn_nll, conf, gate_high, actual_byte) to dump
+ * arrays at index *dump_idx, then increments dump_idx. If dump arrays are
+ * NULL, behaves identically to score_byte. */
+static int score_byte_with_dump(Table*tables,uint32_t*c0,uint32_t*tot0,uint32_t*uniq0,uint32_t*max0,uint64_t*hist,int*wlen,int order,uint8_t b,double nn_logp,double lambda_hi,double lambda_lo,double lhi,double llo,double l1hi,double l1lo,double thr,double*lc,uint32_t lcap,double*mix_nll,double*ppm_nll,double*nn_nll,uint64_t*bytes,uint64_t*gate_high,uint64_t*gate_total,
+        float*dump_mix,float*dump_ppm,float*dump_nn,float*dump_conf,uint8_t*dump_gate_hi,uint8_t*dump_byte,uint64_t*dump_idx){
+    const double uni=log(1.0/256.0);double ppm_log=0.0,conf=0.0,esc=0.0;int found=0,seen=0,maxk=*wlen<order?*wlen:order;uint64_t keys[9];keys[0]=0;for(int K=1;K<=maxk;K++)keys[K]=(*hist)&mask_for(K);
+    for(int K=maxk;K>=1;K--){Ctx*c=table_find(&tables[K],keys[K]);if(!c)continue;uint32_t den=c->total+c->unique;if(!den)continue;double denom=(double)den;if(!seen){conf=(double)c->max_count/denom;seen=1;}uint32_t cnt=edge_count(&tables[K],c,b);if(cnt){ppm_log=esc+(lgi(cnt,lc,lcap)-lgi(den,lc,lcap));found=1;break;}if(c->unique>0)esc+=lgi(c->unique,lc,lcap)-lgi(den,lc,lcap);}
+    if(!found){uint32_t den0=*tot0+*uniq0;if(den0>0){double denom0=(double)den0;if(!seen){conf=(double)(*max0)/denom0;seen=1;}uint32_t cnt=c0[b];if(cnt){ppm_log=esc+(lgi(cnt,lc,lcap)-lgi(den0,lc,lcap));found=1;}else if(*uniq0>0)esc+=lgi(*uniq0,lc,lcap)-lgi(den0,lc,lcap);}}
+    if(!found)ppm_log=esc+uni;
+    int hi=conf>=thr;double lam=hi?lambda_lo:lambda_hi;(*gate_total)++;if(hi)(*gate_high)++;
+    double log_mix;if(lam<=0.0)log_mix=ppm_log;else if(lam>=1.0)log_mix=nn_logp;else{double a=(hi?llo:lhi)+nn_logp,c=(hi?l1lo:l1hi)+ppm_log,m=a>c?a:c;log_mix=m+log(exp(a-m)+exp(c-m));}
+    *mix_nll-=log_mix;*ppm_nll-=ppm_log;*nn_nll-=nn_logp;(*bytes)++;
+    /* PER-BYTE DUMP */
+    if(dump_mix){
+        uint64_t idx=*dump_idx;
+        dump_mix[idx]=(float)(-log_mix);
+        dump_ppm[idx]=(float)(-ppm_log);
+        dump_nn[idx]=(float)(-nn_logp);
+        dump_conf[idx]=(float)conf;
+        dump_gate_hi[idx]=(uint8_t)hi;
+        dump_byte[idx]=b;
+        (*dump_idx)++;
+    }
+    uint32_t nc=++c0[b];(*tot0)++;if(nc==1)(*uniq0)++;if(nc>*max0)*max0=nc;
+    for(int K=1;K<=maxk;K++){Ctx*c=table_get_or_add(&tables[K],keys[K]);if(!c||edge_inc(&tables[K],c,b))return-1;}
+    if(order>0){*hist=((*hist)<<8|b)&mask_for(order);if(*wlen<order)(*wlen)++;}
+    return 0;
+}
+
+/* Backward-compat wrapper: calls score_byte_with_dump with NULL dump pointers. */
+static int score_byte(Table*tables,uint32_t*c0,uint32_t*tot0,uint32_t*uniq0,uint32_t*max0,uint64_t*hist,int*wlen,int order,uint8_t b,double nn_logp,double lambda_hi,double lambda_lo,double lhi,double llo,double l1hi,double l1lo,double thr,double*lc,uint32_t lcap,double*mix_nll,double*ppm_nll,double*nn_nll,uint64_t*bytes,uint64_t*gate_high,uint64_t*gate_total){
+    return score_byte_with_dump(tables,c0,tot0,uniq0,max0,hist,wlen,order,b,nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,lcap,mix_nll,ppm_nll,nn_nll,bytes,gate_high,gate_total,
+        NULL,NULL,NULL,NULL,NULL,NULL,NULL);
+}
+
+int ppm_score(const int64_t*target,const int64_t*prev,const double*nll,int64_t n,const uint8_t*flat,const int32_t*offs,const int32_t*lens,const uint8_t*has_space,const uint8_t*is_boundary,int vocab,int order,double lambda_hi,double lambda_lo,double thr,uint32_t log_cache_size,double*out){
+    if(order<0||order>8)return-2;Table tables[9];uint64_t cap=(uint64_t)n*2+1024;for(int k=1;k<=order;k++)if(table_init(&tables[k],cap/(k+1)+1024))return-3;
+    double*lc=0;if(log_cache_size>1){lc=(double*)malloc((size_t)log_cache_size*sizeof(double));if(!lc)return-6;for(uint32_t i=0;i<log_cache_size;i++)lc[i]=-1.0;}double lhi=log(lambda_hi),llo=log(lambda_lo),l1hi=log(1.0-lambda_hi),l1lo=log(1.0-lambda_lo);
+    uint32_t c0[256];memset(c0,0,sizeof(c0));uint32_t tot0=0,uniq0=0,max0=0;uint64_t hist=0;int wlen=0;double mix_nll=0,ppm_nll=0,nn_nll=0,token_nll=0;uint64_t bytes=0,gate_high=0,gate_total=0;
+    for(int64_t i=0;i<n;i++){int tid=(int)target[i],pid=(int)prev[i];if(tid<0||tid>=vocab)continue;int len=lens[tid];int inc_space=has_space[tid]&&(pid<0||!is_boundary[pid]);int nb=len+(inc_space?1:0);if(nb<=0)continue;double nn_logp=-nll[i]/(double)nb;token_nll+=nll[i];if(inc_space)if(score_byte(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,32,nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,&mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total))return-4;const uint8_t*p=flat+offs[tid];for(int j=0;j<len;j++)if(score_byte(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,p[j],nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,&mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total))return-5;}
+    const double log2v=log(2.0);out[0]=bytes?mix_nll/(double)bytes/log2v:0;out[1]=bytes?ppm_nll/(double)bytes/log2v:0;out[2]=bytes?nn_nll/(double)bytes/log2v:0;out[3]=bytes?token_nll/(double)bytes/log2v:0;out[4]=(double)bytes;out[5]=gate_total?(double)gate_high/(double)gate_total:0;
+    if(lc)free(lc);for(int k=1;k<=order;k++)table_free(&tables[k]);return 0;
+}
+
+/* ppm_score_bytewise: takes a flat byte stream + per-byte nn_logp array
+ * (proper marginalization log-prob from external Python walker). Loops
+ * byte-by-byte and applies PPM-D mix. No token-based reconstruction.
+ * Returns out[0..5] with same layout as ppm_score, but out[3] (token_nll)
+ * is set to 0 (no token info here). */
+int ppm_score_bytewise(const uint8_t*byte_stream,const double*nn_logp_per_byte,int64_t n_bytes,
+        int order,double lambda_hi,double lambda_lo,double thr,uint32_t log_cache_size,double*out){
+    if(order<0||order>8)return-2;
+    Table tables[9];uint64_t cap=(uint64_t)n_bytes*2+1024;
+    for(int k=1;k<=order;k++)if(table_init(&tables[k],cap/(k+1)+1024))return-3;
+    double*lc=0;if(log_cache_size>1){lc=(double*)malloc((size_t)log_cache_size*sizeof(double));if(!lc)return-6;for(uint32_t i=0;i<log_cache_size;i++)lc[i]=-1.0;}
+    double lhi=log(lambda_hi),llo=log(lambda_lo),l1hi=log(1.0-lambda_hi),l1lo=log(1.0-lambda_lo);
+    uint32_t c0[256];memset(c0,0,sizeof(c0));uint32_t tot0=0,uniq0=0,max0=0;uint64_t hist=0;int wlen=0;
+    double mix_nll=0,ppm_nll=0,nn_nll=0;uint64_t bytes=0,gate_high=0,gate_total=0;
+    for(int64_t i=0;i<n_bytes;i++){
+        uint8_t b=byte_stream[i];
+        double nn_logp=nn_logp_per_byte[i];
+        if(score_byte(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,b,nn_logp,
+                lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,
+                &mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total))return-5;
+    }
+    const double log2v=log(2.0);
+    out[0]=bytes?mix_nll/(double)bytes/log2v:0;
+    out[1]=bytes?ppm_nll/(double)bytes/log2v:0;
+    out[2]=bytes?nn_nll/(double)bytes/log2v:0;
+    out[3]=0;out[4]=(double)bytes;out[5]=gate_total?(double)gate_high/(double)gate_total:0;
+    if(lc)free(lc);for(int k=1;k<=order;k++)table_free(&tables[k]);return 0;
+}
+
+/* ppm_score_dump: single-threaded ppm_score with per-byte data dump.
+ * Caller pre-allocates dump arrays of size >= max possible byte count
+ * (suggest: n_tokens * 17 to be safe). On return, `*dump_n_bytes`
+ * contains actual bytes written. Other behavior identical to ppm_score. */
+int ppm_score_dump(const int64_t*target,const int64_t*prev,const double*nll,int64_t n,const uint8_t*flat,const int32_t*offs,const int32_t*lens,const uint8_t*has_space,const uint8_t*is_boundary,int vocab,int order,double lambda_hi,double lambda_lo,double thr,uint32_t log_cache_size,double*out,
+        float*dump_mix,float*dump_ppm,float*dump_nn,float*dump_conf,uint8_t*dump_gate_hi,uint8_t*dump_byte,uint64_t*dump_n_bytes){
+    if(order<0||order>8)return-2;Table tables[9];uint64_t cap=(uint64_t)n*2+1024;for(int k=1;k<=order;k++)if(table_init(&tables[k],cap/(k+1)+1024))return-3;
+    double*lc=0;if(log_cache_size>1){lc=(double*)malloc((size_t)log_cache_size*sizeof(double));if(!lc)return-6;for(uint32_t i=0;i<log_cache_size;i++)lc[i]=-1.0;}double lhi=log(lambda_hi),llo=log(lambda_lo),l1hi=log(1.0-lambda_hi),l1lo=log(1.0-lambda_lo);
+    uint32_t c0[256];memset(c0,0,sizeof(c0));uint32_t tot0=0,uniq0=0,max0=0;uint64_t hist=0;int wlen=0;double mix_nll=0,ppm_nll=0,nn_nll=0,token_nll=0;uint64_t bytes=0,gate_high=0,gate_total=0;
+    uint64_t dump_idx=0;
+    for(int64_t i=0;i<n;i++){
+        int tid=(int)target[i],pid=(int)prev[i];if(tid<0||tid>=vocab)continue;
+        int len=lens[tid];int inc_space=has_space[tid]&&(pid<0||!is_boundary[pid]);int nb=len+(inc_space?1:0);if(nb<=0)continue;
+        double nn_logp=-nll[i]/(double)nb;token_nll+=nll[i];
+        if(inc_space)if(score_byte_with_dump(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,32,nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,&mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total,
+            dump_mix,dump_ppm,dump_nn,dump_conf,dump_gate_hi,dump_byte,&dump_idx))return-4;
+        const uint8_t*p=flat+offs[tid];
+        for(int j=0;j<len;j++)if(score_byte_with_dump(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,p[j],nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,&mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total,
+            dump_mix,dump_ppm,dump_nn,dump_conf,dump_gate_hi,dump_byte,&dump_idx))return-5;
+    }
+    const double log2v=log(2.0);
+    out[0]=bytes?mix_nll/(double)bytes/log2v:0;out[1]=bytes?ppm_nll/(double)bytes/log2v:0;out[2]=bytes?nn_nll/(double)bytes/log2v:0;out[3]=bytes?token_nll/(double)bytes/log2v:0;out[4]=(double)bytes;out[5]=gate_total?(double)gate_high/(double)gate_total:0;
+    *dump_n_bytes=dump_idx;
+    if(lc)free(lc);for(int k=1;k<=order;k++)table_free(&tables[k]);return 0;
+}
+
+/* OpenMP parallel chunked scorer. Splits the token stream into chunks of
+ * size `chunk_tokens`; each chunk gets its own PPM-D state (tables, c0,
+ * hist) and is processed sequentially within the chunk. Chunks are
+ * distributed across OMP threads via dynamic scheduling. PPM state
+ * RESETS at chunk boundaries -- this CHANGES the scored BPB vs the
+ * single-context legacy ppm_score path (smaller chunks => more cold-start).
+ * For chunk_tokens >= n the result is bit-identical to ppm_score.
+ * `lc` (log-cache) is a lazily populated memo with `lc[i] = log(i)`. Sharing
+ * it across threads would only produce idempotent writes (every thread
+ * stores the same value), but that is still a data race, so each chunk
+ * allocates its own private log cache. */
+int ppm_score_omp(const int64_t*target,const int64_t*prev,const double*nll,int64_t n,const uint8_t*flat,const int32_t*offs,const int32_t*lens,const uint8_t*has_space,const uint8_t*is_boundary,int vocab,int order,double lambda_hi,double lambda_lo,double thr,uint32_t log_cache_size,int64_t chunk_tokens,int num_threads,double*out){
+    if(order<0||order>8)return-2;
+    if(chunk_tokens<=0)return-7;
+    if(num_threads>0)omp_set_num_threads(num_threads);
+    double lhi=log(lambda_hi),llo=log(lambda_lo),l1hi=log(1.0-lambda_hi),l1lo=log(1.0-lambda_lo);
+    int64_t num_chunks=(n+chunk_tokens-1)/chunk_tokens;
+    double mix_nll_total=0,ppm_nll_total=0,nn_nll_total=0,token_nll_total=0;
+    uint64_t bytes_total=0,gate_high_total=0,gate_total_total=0;
+    int err_code=0;
+    #pragma omp parallel for schedule(dynamic,1) reduction(+:mix_nll_total,ppm_nll_total,nn_nll_total,token_nll_total,bytes_total,gate_high_total,gate_total_total)
+    for(int64_t ci=0;ci<num_chunks;ci++){
+        int ec;_Pragma("omp atomic read") ec=err_code;if(ec)continue; /* race-free read of the shared error flag */
+        int64_t s=ci*chunk_tokens;
+        int64_t e=s+chunk_tokens;
+        if(e>n)e=n;
+        int64_t cn=e-s;
+        Table tables[9];memset(tables,0,sizeof(tables));
+        uint64_t cap=(uint64_t)cn*2+1024;
+        int local_err=0;
+        for(int k=1;k<=order;k++)if(table_init(&tables[k],cap/(k+1)+1024)){local_err=-3;break;}
+        double*lc=0;
+        if(!local_err&&log_cache_size>1){lc=(double*)malloc((size_t)log_cache_size*sizeof(double));if(!lc)local_err=-6;else for(uint32_t i=0;i<log_cache_size;i++)lc[i]=-1.0;}
+        if(!local_err){
+            uint32_t c0[256];memset(c0,0,sizeof(c0));
+            uint32_t tot0=0,uniq0=0,max0=0;
+            uint64_t hist=0;int wlen=0;
+            double mix_nll=0,ppm_nll=0,nn_nll=0,token_nll=0;
+            uint64_t bytes=0,gate_high=0,gate_total=0;
+            for(int64_t i=s;i<e&&!local_err;i++){
+                int tid=(int)target[i],pid=(int)prev[i];
+                if(tid<0||tid>=vocab)continue;
+                int len=lens[tid];
+                int inc_space=has_space[tid]&&(pid<0||!is_boundary[pid]);
+                int nb=len+(inc_space?1:0);
+                if(nb<=0)continue;
+                double nn_logp=-nll[i]/(double)nb;
+                token_nll+=nll[i];
+                if(inc_space)if(score_byte(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,32,nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,&mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total)){local_err=-4;break;}
+                const uint8_t*p=flat+offs[tid];
+                for(int j=0;j<len;j++)if(score_byte(tables,c0,&tot0,&uniq0,&max0,&hist,&wlen,order,p[j],nn_logp,lambda_hi,lambda_lo,lhi,llo,l1hi,l1lo,thr,lc,log_cache_size,&mix_nll,&ppm_nll,&nn_nll,&bytes,&gate_high,&gate_total)){local_err=-5;break;}
+            }
+            if(!local_err){
+                mix_nll_total+=mix_nll;ppm_nll_total+=ppm_nll;nn_nll_total+=nn_nll;token_nll_total+=token_nll;
+                bytes_total+=bytes;gate_high_total+=gate_high;gate_total_total+=gate_total;
+            }
+        }
+        if(lc)free(lc);
+        for(int k=1;k<=order;k++)if(tables[k].ctx)table_free(&tables[k]);
+        if(local_err){
+            #pragma omp atomic write
+            err_code=local_err;
+        }
+    }
+    if(err_code)return err_code;
+    const double log2v=log(2.0);
+    out[0]=bytes_total?mix_nll_total/(double)bytes_total/log2v:0;
+    out[1]=bytes_total?ppm_nll_total/(double)bytes_total/log2v:0;
+    out[2]=bytes_total?nn_nll_total/(double)bytes_total/log2v:0;
+    out[3]=bytes_total?token_nll_total/(double)bytes_total/log2v:0;
+    out[4]=(double)bytes_total;
+    out[5]=gate_total_total?(double)gate_high_total/(double)gate_total_total:0;
+    return 0;
+}
diff --git a/testing/show_5_artifact_examples.py b/testing/show_5_artifact_examples.py
new file mode 100644
index 0000000000..9f9de8d40c
--- /dev/null
+++ b/testing/show_5_artifact_examples.py
@@ -0,0 +1,89 @@
+"""Walk through 5 specific artifact examples from a saved per-byte compare dump.
+
+For each: shows nn_uniform, nn_proper, ppm log-probs and the resulting mix
+charges under both spec 055 (uniform+PPM) and proper+PPM, plus context bytes.
+"""
+import argparse
+import numpy as np
+import math
+from pathlib import Path
+
+DEFAULT_CANDIDATES = [
+    Path("eval/data/2026-04-29_047B_200k_per_byte_compare.npz"),
+    Path("/tmp/per_byte_compare.npz"),
+]
+
+
+def resolve_input(path_arg: str | None) -> Path:
+    if path_arg:
+        return Path(path_arg)
+    for path in DEFAULT_CANDIDATES:
+        if path.exists():
+            return path
+    raise FileNotFoundError(
+        "No per-byte compare dump found. Pass --input or place one of these files:\n"
+        + "\n".join(f"  - {p}" for p in DEFAULT_CANDIDATES)
+    )
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--input", help="Path to per-byte compare .npz dump")
+args = parser.parse_args()
+
+input_path = resolve_input(args.input)
+d = np.load(input_path)
+bytes_arr = d["bytes"]
+nn_proper = d["nn_proper_nats"].astype(np.float64)
+nn_uniform = d["nn_uniform_nats"].astype(np.float64)
+ppm = d["ppm_nats"].astype(np.float64)
+gate_hi = d["gate_hi"]
+n = len(bytes_arr)
+
+mask = gate_hi == 1
+savings = nn_uniform - nn_proper
+
+def show(idx, label):
+    idx = int(idx)
+    b = int(bytes_arr[idx])
+    ch = chr(b) if 32 <= b < 127 else f"\\x{b:02x}"
+    nu = nn_uniform[idx]; np_ = nn_proper[idx]; pp = ppm[idx]
+    p_uni = math.exp(-nu); p_ppm = math.exp(-pp); p_pro = math.exp(-np_)
+    mix_uni = -math.log(0.05*p_uni + 0.95*p_ppm)  # NN weight 0.05 = λ_lo, applied on gate_hi=1 bytes
+    mix_pro = -math.log(0.05*p_pro + 0.95*p_ppm)  # same weights, with the proper NN marginal
+    pre = bytes(bytes_arr[max(0, idx-12):idx+1]).decode("utf-8", errors="replace")
+    nxt = bytes(bytes_arr[idx+1:min(n, idx+5)]).decode("utf-8", errors="replace")
+    print(f"\n[{label}] idx={idx} realized byte={ch!r}")
+    print(f"  context (last 13 bytes incl. realized): {pre!r}")
+    print(f"  next 4 bytes:                            {nxt!r}")
+    print(f"  ----")
+    print(f"  nn_uniform-spread  = {nu:6.3f} nats   (p ≈ {p_uni:.4g})")
+    print(f"  nn_proper-margin   = {np_:6.3f} nats   (p ≈ {p_pro:.4g})")
+    print(f"  ppm                = {pp:6.3f} nats   (p ≈ {p_ppm:.4g})")
+    print(f"  spec055 mix charge = {mix_uni:.3f} nats   →  spec055 \"saves\" {nu - mix_uni:+.3f} vs uniform")
+    print(f"  proper+ppm charge  = {mix_pro:.3f} nats   →  proper+PPM gains {np_ - mix_pro:+.3f} vs proper-alone")
+    print(f"  TRUTH (proper) - SPEC055_MIX: {np_ - mix_uni:+.3f} nats  (positive = spec 055 charges less than the true NN cost)")
+
+ranked = np.argsort(-(savings * mask.astype(np.float64)))
+
+print("=" * 100)
+print("TOP 5 ARTIFACT BYTES — gate_hi=1, ranked by (nn_uniform − nn_proper)")
+print("These are bytes where uniform-spread *fakes* a high NN cost that PPM then 'rescues'.")
+print(f"  input: {input_path}")
+print(f"  Total bytes: {n:,}   gate_hi rate: {gate_hi.mean():.4f}")
+print("=" * 100)
+for k in range(5):
+    show(ranked[k], f"#{k+1}")
+
+# bonus: aggregate impact
+gate_high_mask = gate_hi == 1
+spec055_savings = (nn_uniform - np.minimum(nn_uniform, ppm))  # rough proxy under λ_lo=0.05
+total_uniform = nn_uniform.sum()
+total_proper = nn_proper.sum()
+print(f"\n{'='*100}")
+print(f"AGGREGATE (across {n:,} bytes):")
+print(f"  Sum nn_uniform NLL:   {total_uniform:11.1f} nats  → bpb = {total_uniform/n/math.log(2):.5f}")
+print(f"  Sum nn_proper  NLL:   {total_proper:11.1f} nats  → bpb = {total_proper/n/math.log(2):.5f}")
+print(f"  (these MUST be equal by bit-conservation: the same total, just redistributed across bytes)")
+print(f"  diff = {total_uniform - total_proper:+.6f} nats")
+print(f"  fraction of bytes with gate_hi=1: {gate_hi.mean():.4f}")
+print(f"  on those bytes: avg(uniform-proper) = {savings[mask].mean():.3f} nats per byte")