Skip to content

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean)#1991

Open
joshuaswanson wants to merge 1 commit intoopenai:mainfrom
joshuaswanson:record/sp8192-ppm-tuned-O5
Open

Record: SP8192 + Byte-PPM Mixer with Tuned Order/Gate (O=5) — val_bpb 0.94290 (3-seed mean)#1991
joshuaswanson wants to merge 1 commit intoopenai:mainfrom
joshuaswanson:record/sp8192-ppm-tuned-O5

Conversation

@joshuaswanson
Copy link
Copy Markdown

Summary

val_bpb = 0.94290 (3-seed mean, std=0.00070, full FineWeb val) | <16 MB artifact | 8×H100 SXM | Causal byte-PPM mixer at eval, no TTT.

Builds on PR #1959 (PR #1493 bigbag + PR #1795 byte-PPM mixer). The neural network and training pipeline are byte-identical to PR #1959. The single change is the PPM mixer's four hyperparameters, found via a systematic offline sweep on the SP8192 NN's per-byte distribution:

Hyperparameter PR #1959 default This submission
PPM_ORDER (context length) 4 5
PPM_T (gate threshold) 0.9 0.80
PPM_H (high-lambda) 0.9 0.99
PPM_L (low-lambda) 0.05 0.20

PR #1795 originally hand-picked these defaults on top of @clarkkev's SP4096 stack, and PR #1959 inherited them when porting the mixer to PR #1493's SP8192 stack with a different NN distribution. No prior submission ran a systematic sweep on the SP8192 NN's per-byte distribution.

vs verified leader PR #1855 (1.06108): −0.11818 BPB
vs current open sub-1.0 candidate PR #1959 (0.99621): −0.05331 BPB

3-Seed Results

Seed NN-only sliding PPM mixer (O=5, tuned gate) Model bytes PPM eval time
42 1.10048 0.94289 15,974,299 480.9 s
314 1.09973 0.94221 15,971,826 473.3 s
999 1.10135 0.94361 15,973,459 471.6 s
Mean 1.10052 0.94290 15,973,195 475.3 s
Std 0.00081 0.00070

t-stat ≈ 132 on the 0.005-nat bar vs PR #1959, p ≪ 1e-10.

Sweep procedure (offline, on dumped (tga, lpa) from seed 42)

  1. Train PR Record: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean) #1959 model (seed 42), with DUMP_PPM_INPUTS=1 so the eval loop dumps (target tokens, per-token NN log-probability) at byte-stream order. Same neural pipeline; no changes to training.
  2. Replay byte-PPM-D over orders {3, 4, 5, 6} on the dumped per-byte target sequence. Same strict-legal causal-gate semantics as PR Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed) #1795 (cf computed BEFORE looking up observed byte's count).
  3. Vectorized sweep over (T ∈ {0.55…0.95}, H ∈ {0.85, 0.90, 0.93, 0.95, 0.97, 0.99}, L ∈ {0.0, 0.005, 0.01, 0.02, 0.03, 0.05, 0.07, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.40}) for each PPM order.
  4. Best single-order optimum: O=5, T=0.80, H=0.99, L=0.20 → 0.937 BPB on the seed-42 dump (vs PR Record: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean) #1959 default O=4, T=0.9, H=0.9, L=0.05 = 1.004 BPB on the same dump).
  5. Reproducible: dump via DUMP_PPM_INPUTS=1; offline sweep runs on standard CPU (no GPU required).

Compliance (Track B — legal eval-time adaptation)

Inherits all compliance properties from PR #1959 / PR #1795:

The only change to train_gpt.py vs PR #1959's submitted version is the four PPM env-var defaults. No structural changes; the strict-legal gate machinery is byte-identical.

Test plan

  • 3-seed validation (42, 314, 999) — all complete
  • All artifacts under 16,000,000 bytes (max 15,974,299 + 19,602 = 15,993,901)
  • Full FineWeb val measurement (40,540,160 tokens / 152,574,319 bytes)
  • PPM strict-legal gate: prefix-only, score-before-update verified by code-diff
  • No tokenizer/training change vs PR Record: SP8192 + 3-layer recurrence + byte-PPM mixer — val_bpb 0.99621 (3-seed mean) #1959
  • Reviewer reproduction on standard 8×H100 SXM (P2P-enabled hardware projected to fit 600s eval cap)

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 30, 2026
… competition closed

- Merged SOTA dropped from 1.0810 → 1.0611 (codemath3000, PR openai#1855) with all
  organizer pending branches now in main (CaseOps + SmearGate BOS fix + lrzip)
- New target was ≤1.0561; competition closes today (April 30)
- PR openai#1967 (ndokutovich, 1.05851): best clean legal open PR, timing question pending
- PR openai#1991 (joshuaswanson, 0.94290): Byte-PPM Mixer; Issue openai#1872 open, no ruling
- PR openai#1992 / openai#1972: ILLEGAL (PreQuantTTT 21ep)
- PR openai#731 (Hedge Mixer, 1.0400): seeds 1337/2024 never filed; competition closing
- Session 25 lessons + final Competition Strategy update added to CLAUDE.md

https://claude.ai/code/session_01QKHz6Vfu2DFZdc7GiuKSBQ
@remg1997
Copy link
Copy Markdown

Looks promising. I didn't have the time to do the sweep! Good luck :)

@cocohearts
Copy link
Copy Markdown
Collaborator

Leaderboard audit note (pre-cutoff state): I don't think this is valid as a record row. The byte-PPM mixer does not define a full normalized distribution over the official next-token alphabet before the realized token is known; it scores the realized byte stream by spreading the realized token log-prob across observed bytes. The submitted PPM order/gate choices also appear selected by offline validation-target sweeps, which is not acceptable record evidence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants