
Record: SP8192 + Sliding-Window Eval + Conditional-PPM Byte Mixer - val_bpb 1.029282 #2032

Closed

anmarhindi wants to merge 1 commit into openai:main from anmarhindi:submission-2026-04-30-cond-ppm

Conversation

@anmarhindi

Summary

3-seed mean val_bpb = 1.029282 (std 0.000782), an improvement of
0.0517 bpb over the current SOTA (1.0810). 8×H100 SXM, eval ≤600s,
artifact 15.59 MB.

The contribution is a single eval-time post-processor on top of the
PR #1493 stack: a conditional byte-level PPM mixer that derives
P_NN(byte_0 | history) from the model's SP8192 softmax via the
canonical first-byte LUT, then mixes it with a PPM-D byte conditional
through a per-byte sigmoid gate. No new trainable parameters, no
additional artifact bytes, no training-time changes.

| Seed | val_bpb |
| --- | --- |
| 42 | 1.02849 |
| 1337 | 1.03005 |
| 314 | 1.02931 |
| **Mean** | **1.029282** |
| **Std** | **0.000782** |

A std of 0.0008 puts the margin over the −0.005 bpb record threshold at p ≪ 0.01.

Relationship to PR #1835

PR #1835 introduced byte-level mixing using the approximation

P_NN(byte = b) ≈ exp(token_logp / n_bytes_in_token)

This value is identical for every byte in the realized token and does
not sum to 1 over the byte alphabet: it is a per-byte-position
constant rather than a distribution. For example, a 3-byte token with
log-probability −2.0 assigns exp(−2/3) ≈ 0.51 to each of its three
bytes, with no normalized distribution over the remaining 255
candidate bytes at any position. The C2 concern raised in review of
#1835 follows from this directly.

This submission instead derives P_NN(byte_0) as a proper marginal
over the SP8192 alphabet, weighted by the canonical first-byte LUT
(rebuilt from the SP tokenizer at deserialization; no extra storage):

P_NN(byte_0 = b | history)
    = Σ_{T : canonical_first_byte(T) = b}  P_NN(T | history)
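
A minimal sketch of this marginalization, assuming a SentencePiece
processor `sp` and a softmax vector `probs` over the 8192-token
vocabulary. The piece-to-first-byte rule shown here (decode the piece,
take byte 0 of its UTF-8 form) is illustrative; the canonical LUT
construction lives in the bundled source:

```python
import numpy as np
import sentencepiece as spm

def build_first_byte_lut(sp: spm.SentencePieceProcessor) -> np.ndarray:
    """Map each SP8192 token id to the first byte of its surface form.

    Illustrative rule only: replace the SentencePiece word-boundary
    marker with a space, UTF-8 encode, take byte 0.
    """
    lut = np.zeros(sp.vocab_size(), dtype=np.int64)
    for t in range(sp.vocab_size()):
        surface = sp.id_to_piece(t).replace("\u2581", " ").encode("utf-8")
        lut[t] = surface[0] if surface else 0
    return lut

def marginal_first_byte(probs: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """P_NN(byte_0 = b | history): scatter-add token probabilities onto
    their canonical first bytes. Returns a 256-vector summing to 1."""
    p_byte0 = np.zeros(256)
    np.add.at(p_byte0, lut, probs)
    return p_byte0
```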

For the remaining bytes within the realized token (b₁..b_{k−1}),
the chain-rule residual is:

P_NN_rem(b₁..b_{k−1} | b₀, history)
    = P_NN(token | history) / P_NN(byte_0 | history)

Both terms are now normalized distributions over their respective
alphabets:

  • byte_0: convex combination over the 256-byte alphabet.
  • remainder: convex combination over the joint byte-sequence alphabet
    of length k−1.

Their product is a proper distribution over the realized token's
byte stream, so C2 holds by construction.
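
In log space the residual is a one-line subtraction. A sketch, reusing
`marginal_first_byte` from above (`token_logp` is the model's
log P_NN(token | history), `b0` the realized first byte):

```python
import math
import numpy as np

def split_token_logp(token_logp: float, p_byte0: np.ndarray,
                     b0: int) -> tuple[float, float]:
    """Split log P_NN(token | history) into the byte_0 marginal term
    and the chain-rule remainder log P_NN(b_1..b_{k-1} | b_0, history).

    The two pieces sum back to token_logp, so the product of the two
    distributions covers the realized token's byte stream exactly.
    """
    logp_b0 = math.log(p_byte0[b0])
    return logp_b0, token_logp - logp_b0
```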

The mix gate is a sigmoid on PPM context confidence:

λ = 1 − sigmoid(α · (conf_PPM − β))    α = 15.0, β = 0.80
P_mix(b) = λ · P_NN(b) + (1 − λ) · P_PPM(b)
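
A sketch of the gate and mix in probability space, with α and β as
above; `conf_ppm` is PPM's top-symbol probability for the current
context (names illustrative):

```python
import numpy as np

ALPHA, BETA = 15.0, 0.80

def mix_byte_dist(p_nn: np.ndarray, p_ppm: np.ndarray,
                  conf_ppm: float) -> np.ndarray:
    """Convex mix of two normalized 256-byte distributions.

    High PPM confidence -> small lambda -> lean on PPM; low confidence
    falls back to the NN marginal. A convex combination of two
    distributions is itself a distribution, which is what C2 requires.
    """
    lam = 1.0 - 1.0 / (1.0 + np.exp(-ALPHA * (conf_ppm - BETA)))
    return lam * p_nn + (1.0 - lam) * p_ppm
```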

Why this works

The 16 MB cap forces a 35M-parameter model that can't allocate enough
mass to the long tail of byte sequences (URLs, code identifiers,
repeated proper nouns). PPM's strength is exactly that tail: order-5
byte context routinely assigns near-1 probability to the next byte
inside a code block or a recurring named entity, while a
parameter-constrained NN has to spread mass thin. The sigmoid gate
captures this conditionally: trust PPM when its top-symbol
probability is high, fall back to the NN otherwise. The conditioning
step (deriving P_NN(byte_0) from the SP8192 softmax instead of
approximating per-byte) closes the C2 gap that prevented PR #1835's
version from being clean: the gate now weights two distributions that
actually live on the same alphabet.
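
For intuition, a toy order-5 context table showing where that near-1
top-symbol confidence comes from. This ignores PPM-D's escape
mechanism entirely; the real state machine is in the bundled source:

```python
from collections import Counter, defaultdict

class ToyOrder5:
    """Toy order-5 byte counter -- illustrates top-symbol confidence,
    not the full PPM-D escape handling used in the submission."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def confidence(self, history: bytes) -> float:
        """Top-symbol probability under the current 5-byte context."""
        counts = self.table[history[-5:]]
        total = sum(counts.values())
        return max(counts.values()) / total if total else 0.0

    def update(self, history: bytes, b: int) -> None:
        """Advance state; called only after scoring (see C1/C3)."""
        self.table[history[-5:]][b] += 1
```

Inside a code block, a context like `b"self."` is followed by the same
byte almost every time, so confidence saturates and the gate hands
control to PPM.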

Ablation

3-seed means, with each stage applied cumulatively:

| Stage | val_bpb |
| --- | --- |
| pre-EMA, pre-quant | 1.147 |
| post-quant (no mixer) | 1.179 |
| sliding-window stride-64 (PR #1493) | 1.184 |
| + cond-PPM mixer | 1.029 |

The mixer contributes −0.150 bpb at eval time. A local ablation on
Blackwell predicted −0.187; roughly 80% of that transferred to H100 SXM.

Compliance (Issue #1017)

| Condition | How this submission satisfies it |
| --- | --- |
| C1, causality | Sliding-window scoring is strict-past only. Cond-PPM byte state advances only after each byte's mix log-prob is recorded. Marginalization at byte_0 derives from the position's softmax, which sees only the strict past. Mix gate weights depend on PPM context confidence only, never on the realized byte. |
| C2, normalization | byte_0 mix is between two byte-alphabet distributions; remainder mix is between two joint-byte-sequence distributions. Both are proper. |
| C3, score-first | Both NN softmax and PPM byte conditional commit before observing the realized byte at each step. PPM state advances post-scoring. |
| C4, single L→R pass | Each validation token contributes exactly one bpb term. Sliding windows overlap, but each token is scored at exactly one position. |

No SLOT, no n-gram cache outside the legal byte-level PPM-D state,
no logit bias, no ETLB, no pre-quant TTT (which would violate C3).
Standard softmax over SP8192 at every scored position.
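
The per-byte ordering that C1/C3 require, sketched over a hypothetical
`ppm` object with `predict`/`update` methods (reusing
`marginal_first_byte` and `mix_byte_dist` from the sketches above; the
remainder bytes are handled analogously via the chain-rule residual):

```python
import math

def score_first_byte(nn_probs, token_bytes, lut, ppm, history):
    """Score byte_0 of a realized token under C1/C3 ordering."""
    p_nn0 = marginal_first_byte(nn_probs, lut)  # committed: strict-past softmax only (C1)
    p_ppm0, conf = ppm.predict(history)         # committed: gate sees conf, never the byte
    b0 = token_bytes[0]
    logp = math.log(mix_byte_dist(p_nn0, p_ppm0, conf)[b0])  # score first...
    ppm.update(history, b0)                     # ...then advance PPM state (C3)
    return logp
```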

Lineage

PR #1394 (clarkkev) → PR #1530 (samacqua) → PR #1729 (romeerp,
CaseOps) → PR #1787 (nprime06) → PR #1797 (dexhunter, Smear+LQER,
with SmearGate BOS-mask fix from cocohearts' review) → PR #1493
(sliding-window stride-64) → this submission.

Reproduction

Full reproduction commands, per-seed train and eval logs, and helper
modules (CaseOps tokenizer + transform) are in
records/track_10min_16mb/2026-04-30_OptE_SlidingWindow_CondPPM/.
The bundled train_gpt.py is lzma+base85 wrapped (49,750 bytes);
underlying source lives at obliterate_0p9/train_gpt.py on the
cond-ppm branch.

```bash
git clone https://github.com/anmarhindi/parameter-golf
cd parameter-golf/obliterate_0p9
git checkout cond-ppm
bash run.sh                  # 3 seeds × (≤600s train + ≤600s eval)
bash build_submissions.sh    # produces submission folder + tarball
```

To inspect the readable source of train_gpt.py without executing it:

```python
import lzma, base64, re

# Pull the base85 payload out of the wrapper, then decompress it.
src = open("train_gpt.py").read()
blob = re.search(r'b85decode\("([^"]+)"\)', src).group(1)
print(lzma.decompress(base64.b85decode(blob)).decode())
```

Artifact size

| Component | Bytes |
| --- | --- |
| final_model.int6.ptz (int6 GPTQ + brotli) | 15,542,968 |
| train_gpt.py (lzma+base85 wrapped) | 49,750 |
| **Total counted toward 16 MB cap** | **15,592,718** |
| **Headroom** | **407,282 (2.5%)** |

Helper modules (lossless_caps.py, prepare_caseops_data.py) and
the SP8192 tokenizer are included for reproduction but do not count
toward the cap, per Issue #1017 §III.

Acknowledgments

This submission stands on a chain of prior work: @clarkkev (PR #1394)
for the SP8192 base, @samacqua (PR #1530), @romeerp (PR #1729) for
CaseOps, @nprime06 (PR #1787) for parallel residuals + looping,
@dexhunter (PR #1797) for Smear + LQER, @cocohearts for the
SmearGate BOS-mask fix, and the sliding-window stride-64 evaluator
from PR #1493. The PPM-D byte conditional itself is classical
(Cleary & Witten 1984; Moffat 1990; Howard 1993). What's contributed
here is the canonical first-byte marginalization that closes the C2
gap from PR #1835's earlier byte-mix attempt, and the empirical
observation that an order-5 byte conditional pairs well with a
35M-parameter LM at this scale.

@anmarhindi anmarhindi closed this Apr 30, 2026