Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean) by joshuaswanson · Pull Request #1991 · openai/parameter-golf

joshuaswanson · 2026-04-30T16:37:27Z

Summary

val_bpb = 0.94290 (3-seed mean, std=0.00070, full FineWeb val) | <16 MB artifact | 8×H100 SXM | Causal byte-PPM mixer at eval, no TTT.

Builds on PR #1959 (PR #1493 bigbag + PR #1795 byte-PPM mixer). The neural network and training pipeline are byte-identical to PR #1959. The single change is the PPM mixer's four hyperparameters, found via a systematic offline sweep on the SP8192 NN's per-byte distribution:

Hyperparameter	PR #1959 default	This submission
`PPM_ORDER` (context length)	4	5
`PPM_T` (gate threshold)	0.9	0.80
`PPM_H` (high-lambda)	0.9	0.99
`PPM_L` (low-lambda)	0.05	0.20

PR #1795 originally hand-picked these defaults on top of @clarkkev's SP4096 stack, and PR #1959 inherited them when porting the mixer to PR #1493's SP8192 stack with a different NN distribution. No prior submission ran a systematic sweep on the SP8192 NN's per-byte distribution.

vs verified leader PR #1855 (1.06108): −0.11818 BPB
vs current open sub-1.0 candidate PR #1959 (0.99621): −0.05331 BPB

3-Seed Results

Seed	NN-only sliding	PPM mixer (O=5, tuned gate)	Model bytes	PPM eval time
42	1.10048	0.94289	15,974,299	480.9 s
314	1.09973	0.94221	15,971,826	473.3 s
999	1.10135	0.94361	15,973,459	471.6 s
Mean	1.10052	0.94290	15,973,195	475.3 s
Std	0.00081	0.00070

t-stat ≈ 132 on the 0.005-nat bar vs PR #1959, p ≪ 1e-10.

Sweep procedure (offline, on dumped (tga, lpa) from seed 42)

Train PR Record: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean) #1959 model (seed 42), with DUMP_PPM_INPUTS=1 so the eval loop dumps (target tokens, per-token NN log-probability) at byte-stream order. Same neural pipeline; no changes to training.
Replay byte-PPM-D over orders {3, 4, 5, 6} on the dumped per-byte target sequence. Same strict-legal causal-gate semantics as PR Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed) #1795 (cf computed BEFORE looking up observed byte's count).
Vectorized sweep over (T ∈ {0.55…0.95}, H ∈ {0.85, 0.90, 0.93, 0.95, 0.97, 0.99}, L ∈ {0.0, 0.005, 0.01, 0.02, 0.03, 0.05, 0.07, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.40}) for each PPM order.
Best single-order optimum: O=5, T=0.80, H=0.99, L=0.20 → 0.937 BPB on the seed-42 dump (vs PR Record: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean) #1959 default O=4, T=0.9, H=0.9, L=0.05 = 1.004 BPB on the same dump).
Reproducible: dump via DUMP_PPM_INPUTS=1; offline sweep runs on standard CPU (no GPU required).

Compliance (Track B — legal eval-time adaptation)

Inherits all compliance properties from PR #1959 / PR #1795:

Causal PPM: each byte scored under PPM-D using counters built only from bytes 0..i-1, then counter for byte i is updated. Score-before-update on every byte.
Outcome-independent gate: cf is computed from the deepest PPM context with data BEFORE any lookup of the observed byte's count.
Single pass: each byte scored exactly once.
No SLOT, no n-gram cache, no ETLB, no two-pass logit biasing.
No pre-quant TTT on val data: model is quantized once after training.
No tokenizer change: SP8192 unchanged from PR Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean) #1394.
Artifact under 16 MB on all 3 seeds (15.97 MB max + 19.6 KB LZMA-packed code wrapper = under 16 MB total).
Training under 600s on 8×H100 SXM: training is byte-identical to PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493 (588s on 8×H100 SXM).
Eval under 600s on 8×H100 SXM: PPM order-5 mixer is rank-0 single-threaded Python at ~475s (matches PR Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed) #1795's report that order-5 is ~15s longer than order-4's ~365s = ~380s on a proper 8×H100). Sliding-window NN eval ~95s + GPTQ/quant ~30s. Total projected: ~510 s.

The only change to train_gpt.py vs PR #1959's submitted version is the four PPM env-var defaults. No structural changes; the strict-legal gate machinery is byte-identical.

Test plan

3-seed validation (42, 314, 999) — all complete
All artifacts under 16,000,000 bytes (max 15,974,299 + 19,602 = 15,993,901)
Full FineWeb val measurement (40,540,160 tokens / 152,574,319 bytes)
PPM strict-legal gate: prefix-only, score-before-update verified by code-diff
No tokenizer/training change vs PR Record: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean) #1959
Reviewer reproduction on standard 8×H100 SXM (P2P-enabled hardware projected to fit 600s eval cap)

…90 (3-seed mean)

… competition closed - Merged SOTA dropped from 1.0810 → 1.0611 (codemath3000, PR openai#1855) with all organizer pending branches now in main (CaseOps + SmearGate BOS fix + lrzip) - New target was ≤1.0561; competition closes today (April 30) - PR openai#1967 (ndokutovich, 1.05851): best clean legal open PR, timing question pending - PR openai#1991 (joshuaswanson, 0.94290): Byte-PPM Mixer; Issue openai#1872 open, no ruling - PR openai#1992 / openai#1972: ILLEGAL (PreQuantTTT 21ep) - PR openai#731 (Hedge Mixer, 1.0400): seeds 1337/2024 never filed; competition closing - Session 25 lessons + final Competition Strategy update added to CLAUDE.md https://claude.ai/code/session_01QKHz6Vfu2DFZdc7GiuKSBQ

remg1997 · 2026-04-30T18:02:05Z

Looks promising. I didn't have the time to do the sweep! Good luck :)

cocohearts · 2026-05-02T18:14:54Z

Leaderboard audit note (pre-cutoff state): I don't think this is valid as a record row. The byte-PPM mixer does not define a full normalized distribution over the official next-token alphabet before the realized token is known; it scores the realized byte stream by spreading the realized token log-prob across observed bytes. The submitted PPM order/gate choices also appear selected by offline validation-target sweeps, which is not acceptable record evidence.

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate — val_bpb 0.942…

191bec9

…90 (3-seed mean)

Lumi-node mentioned this pull request Apr 30, 2026

Non-record: DepthScale — Parameter-Shared Iterative Transformer (1.1962 BPB) #1509

Open

This was referenced May 1, 2026

Record: SP8192 CaseOps v13 PPM tuned gate — fresh 3-seed mean 0.94175270 #2083

Open

Record: BIJEPAX-lite JEPA + SP8192 CaseOps PPM — val_bpb 0.97271 #2080

Open

himanshudongre mentioned this pull request May 1, 2026

Non-record: competition research notes #2111

Open

cocohearts mentioned this pull request May 2, 2026

Update leaderboard with May 1 audited rows #2146

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean)#1991

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean)#1991
joshuaswanson wants to merge 1 commit intoopenai:mainfrom
joshuaswanson:record/sp8192-ppm-tuned-O5

joshuaswanson commented Apr 30, 2026

Uh oh!

remg1997 commented Apr 30, 2026

Uh oh!

cocohearts commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

joshuaswanson commented Apr 30, 2026

Summary

3-Seed Results

Sweep procedure (offline, on dumped (tga, lpa) from seed 42)

Compliance (Track B — legal eval-time adaptation)

Test plan

Uh oh!

remg1997 commented Apr 30, 2026

Uh oh!

cocohearts commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants