
Record: SP8192 + Sliding-Window Eval + Conditional-PPM Byte Mixer - val_bpb 1.029282 #2032

Closed

anmarhindi wants to merge 1 commit into openai:main from anmarhindi:submission-2026-04-30-cond-ppm

Conversation

@anmarhindi

Summary

3-seed mean val_bpb = 1.029282 (std 0.000782), an improvement of
0.0517 bpb over the current SOTA (1.0810). 8×H100 SXM, eval ≤600s,
artifact 15.59 MB.

The contribution is a single eval-time post-processor on top of the
PR #1493 stack: a conditional byte-level PPM mixer that derives
P_NN(byte_0 | history) from the model's SP8192 softmax via the
canonical first-byte LUT, then mixes it with a PPM-D byte conditional
through a per-byte sigmoid gate. No new trainable parameters, no
additional artifact bytes, no training-time changes.

| Seed | val_bpb |
| --- | --- |
| 42 | 1.02849 |
| 1337 | 1.03005 |
| 314 | 1.02931 |
| **Mean** | **1.029282** |
| **Std** | **0.000782** |

A std of 0.0008 puts the margin over the −0.005 bpb record threshold at p ≪ 0.01.

Relationship to PR #1835

PR #1835 introduced byte-level mixing using the approximation

P_NN(byte = b) ≈ exp(token_logp / n_bytes_in_token)

This value is identical for every byte in the realized token and does
not sum to 1 over the byte alphabet: it is a per-byte-position
constant rather than a distribution. For example, a 3-byte token with
log-probability −2.0 assigns exp(−2/3) ≈ 0.51 to each of its three
bytes, with no normalized distribution over the remaining 255
candidate bytes at any position. The C2 concern raised in review of
#1835 follows from this directly.

This submission instead derives P_NN(byte_0) as a proper marginal
over the SP8192 alphabet, weighted by the canonical first-byte LUT
(rebuilt from the SP tokenizer at deserialization; no extra storage):

P_NN(byte_0 = b | history)
    = Σ_{T : canonical_first_byte(T) = b}  P_NN(T | history)
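
A minimal sketch of this marginalization, assuming a SentencePiece
processor `sp` and a softmax vector `probs` over the 8192-token
vocabulary. The piece-to-first-byte rule shown here (decode the piece,
take byte 0 of its UTF-8 form) is illustrative; the canonical LUT
construction lives in the bundled source:

```python
import numpy as np
import sentencepiece as spm

def build_first_byte_lut(sp: spm.SentencePieceProcessor) -> np.ndarray:
    """Map each SP8192 token id to the first byte of its surface form.

    Illustrative rule only: replace the SentencePiece word-boundary
    marker with a space, UTF-8 encode, take byte 0.
    """
    lut = np.zeros(sp.vocab_size(), dtype=np.int64)
    for t in range(sp.vocab_size()):
        surface = sp.id_to_piece(t).replace("\u2581", " ").encode("utf-8")
        lut[t] = surface[0] if surface else 0
    return lut

def marginal_first_byte(probs: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """P_NN(byte_0 = b | history): scatter-add token probabilities onto
    their canonical first bytes. Returns a 256-vector summing to 1."""
    p_byte0 = np.zeros(256)
    np.add.at(p_byte0, lut, probs)
    return p_byte0
```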

For the remaining bytes within the realized token (b₁..b_{k−1}),
the chain-rule residual is:

P_NN_rem(b₁..b_{k−1} | b₀, history)
    = P_NN(token | history) / P_NN(byte_0 | history)

Both terms are now normalized distributions over their respective
alphabets:

  • byte_0: convex combination over the 256-byte alphabet.
  • remainder: convex combination over the joint byte-sequence alphabet
    of length k−1.

Their product is a proper distribution over the realized token's
byte stream, so C2 holds by construction.
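
In log space the residual is a one-line subtraction. A sketch, reusing
`marginal_first_byte` from above (`token_logp` is the model's
log P_NN(token | history), `b0` the realized first byte):

```python
import math
import numpy as np

def split_token_logp(token_logp: float, p_byte0: np.ndarray,
                     b0: int) -> tuple[float, float]:
    """Split log P_NN(token | history) into the byte_0 marginal term
    and the chain-rule remainder log P_NN(b_1..b_{k-1} | b_0, history).

    The two pieces sum back to token_logp, so the product of the two
    distributions covers the realized token's byte stream exactly.
    """
    logp_b0 = math.log(p_byte0[b0])
    return logp_b0, token_logp - logp_b0
```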

The mix gate is a sigmoid on PPM context confidence:

λ = 1 − sigmoid(α · (conf_PPM − β))    α = 15.0, β = 0.80
P_mix(b) = λ · P_NN(b) + (1 − λ) · P_PPM(b)
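
A sketch of the gate and mix in probability space, with α and β as
above; `conf_ppm` is PPM's top-symbol probability for the current
context (names illustrative):

```python
import numpy as np

ALPHA, BETA = 15.0, 0.80

def mix_byte_dist(p_nn: np.ndarray, p_ppm: np.ndarray,
                  conf_ppm: float) -> np.ndarray:
    """Convex mix of two normalized 256-byte distributions.

    High PPM confidence -> small lambda -> lean on PPM; low confidence
    falls back to the NN marginal. A convex combination of two
    distributions is itself a distribution, which is what C2 requires.
    """
    lam = 1.0 - 1.0 / (1.0 + np.exp(-ALPHA * (conf_ppm - BETA)))
    return lam * p_nn + (1.0 - lam) * p_ppm
```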

Why this works

The 16 MB cap forces a 35M-parameter model that can't allocate enough
mass to the long tail of byte sequences (URLs, code identifiers,
repeated proper nouns). PPM's strength is exactly that tail: order-5
byte context routinely assigns near-1 probability to the next byte
inside a code block or a recurring named entity, while a
parameter-constrained NN has to spread mass thin. The sigmoid gate
captures this conditionally: trust PPM when its top-symbol
probability is high, fall back to the NN otherwise. The conditioning
step (deriving P_NN(byte_0) from the SP8192 softmax instead of
approximating per-byte) closes the C2 gap that prevented PR #1835's
version from being clean: the gate now weights two distributions that
actually live on the same alphabet.
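
For intuition, a toy order-5 context table showing where that near-1
top-symbol confidence comes from. This ignores PPM-D's escape
mechanism entirely; the real state machine is in the bundled source:

```python
from collections import Counter, defaultdict

class ToyOrder5:
    """Toy order-5 byte counter -- illustrates top-symbol confidence,
    not the full PPM-D escape handling used in the submission."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def confidence(self, history: bytes) -> float:
        """Top-symbol probability under the current 5-byte context."""
        counts = self.table[history[-5:]]
        total = sum(counts.values())
        return max(counts.values()) / total if total else 0.0

    def update(self, history: bytes, b: int) -> None:
        """Advance state; called only after scoring (see C1/C3)."""
        self.table[history[-5:]][b] += 1
```

Inside a code block, a context like `b"self."` is followed by the same
byte almost every time, so confidence saturates and the gate hands
control to PPM.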

Ablation

3-seed means, with each stage applied cumulatively:

| Stage | val_bpb |
| --- | --- |
| pre-EMA, pre-quant | 1.147 |
| post-quant (no mixer) | 1.179 |
| sliding-window stride-64 (PR #1493) | 1.184 |
| + cond-PPM mixer | 1.029 |

The mixer contributes −0.150 bpb at eval time. A local ablation on
Blackwell predicted −0.187; roughly 80% of that transferred to H100 SXM.

Compliance (Issue #1017)

| Condition | How this submission satisfies it |
| --- | --- |
| C1, causality | Sliding-window scoring is strict-past only. Cond-PPM byte state advances only after each byte's mix log-prob is recorded. Marginalization at byte_0 derives from the position's softmax, which sees only the strict past. Mix gate weights depend on PPM context confidence only, never on the realized byte. |
| C2, normalization | byte_0 mix is between two byte-alphabet distributions; remainder mix is between two joint-byte-sequence distributions. Both are proper. |
| C3, score-first | Both NN softmax and PPM byte conditional commit before observing the realized byte at each step. PPM state advances post-scoring. |
| C4, single L→R pass | Each validation token contributes exactly one bpb term. Sliding windows overlap, but each token is scored at exactly one position. |

No SLOT, no n-gram cache outside the legal byte-level PPM-D state,
no logit bias, no ETLB, no pre-quant TTT (which would violate C3).
Standard softmax over SP8192 at every scored position.
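
The per-byte ordering that C1/C3 require, sketched over a hypothetical
`ppm` object with `predict`/`update` methods (reusing
`marginal_first_byte` and `mix_byte_dist` from the sketches above; the
remainder bytes are handled analogously via the chain-rule residual):

```python
import math

def score_first_byte(nn_probs, token_bytes, lut, ppm, history):
    """Score byte_0 of a realized token under C1/C3 ordering."""
    p_nn0 = marginal_first_byte(nn_probs, lut)  # committed: strict-past softmax only (C1)
    p_ppm0, conf = ppm.predict(history)         # committed: gate sees conf, never the byte
    b0 = token_bytes[0]
    logp = math.log(mix_byte_dist(p_nn0, p_ppm0, conf)[b0])  # score first...
    ppm.update(history, b0)                     # ...then advance PPM state (C3)
    return logp
```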

Lineage

PR #1394 (clarkkev) → PR #1530 (samacqua) → PR #1729 (romeerp,
CaseOps) → PR #1787 (nprime06) → PR #1797 (dexhunter, Smear+LQER,
with SmearGate BOS-mask fix from cocohearts' review) → PR #1493
(sliding-window stride-64) → this submission.

Reproduction

Full reproduction commands, per-seed train and eval logs, and helper
modules (CaseOps tokenizer + transform) are in
records/track_10min_16mb/2026-04-30_OptE_SlidingWindow_CondPPM/.
The bundled train_gpt.py is lzma+base85 wrapped (49,750 bytes);
underlying source lives at obliterate_0p9/train_gpt.py on the
cond-ppm branch.

```bash
git clone https://github.com/anmarhindi/parameter-golf
cd parameter-golf/obliterate_0p9
git checkout cond-ppm
bash run.sh                  # 3 seeds × (≤600s train + ≤600s eval)
bash build_submissions.sh    # produces submission folder + tarball
```

To inspect the readable source of train_gpt.py without executing it:

```python
import lzma, base64, re

# Pull the base85 payload out of the wrapper, then decompress it.
src = open("train_gpt.py").read()
blob = re.search(r'b85decode\("([^"]+)"\)', src).group(1)
print(lzma.decompress(base64.b85decode(blob)).decode())
```

Artifact size

| Component | Bytes |
| --- | --- |
| final_model.int6.ptz (int6 GPTQ + brotli) | 15,542,968 |
| train_gpt.py (lzma+base85 wrapped) | 49,750 |
| **Total counted toward 16 MB cap** | **15,592,718** |
| **Headroom** | **407,282 (2.5%)** |

Helper modules (lossless_caps.py, prepare_caseops_data.py) and
the SP8192 tokenizer are included for reproduction but do not count
toward the cap, per Issue #1017 §III.

Acknowledgments

This submission stands on a chain of prior work: @clarkkev (PR #1394)
for the SP8192 base, @samacqua (PR #1530), @romeerp (PR #1729) for
CaseOps, @nprime06 (PR #1787) for parallel residuals + looping,
@dexhunter (PR #1797) for Smear + LQER, @cocohearts for the
SmearGate BOS-mask fix, and the sliding-window stride-64 evaluator
from PR #1493. The PPM-D byte conditional itself is classical
(Cleary & Witten 1984; Moffat 1990; Howard 1993). What's contributed
here is the canonical first-byte marginalization that closes the C2
gap from PR #1835's earlier byte-mix attempt, and the empirical
observation that an order-5 byte conditional pairs well with a
35M-parameter LM at this scale.

@anmarhindi anmarhindi closed this Apr 30, 2026