Record: SP8192 CaseOps v13 PPM tuned gate — fresh 3-seed mean 0.94175270 by NewyorkDev · Pull Request #2083 · openai/parameter-golf

NewyorkDev · 2026-05-01T03:53:24Z

Summary

fresh val_bpb = 0.94175270 (3-seed mean, sample std=0.00026331, full FineWeb val) | strict <16 MB artifact | 8xH100 SXM | causal sidecar-aware byte PPM, no TTT.

This is our strongest v13 lane: SP8192 CaseOps + SmearGate BOS masking + per-group lrzip compression + PPM order-5 evaluator. The final delta is a narrow PPM gate retune:

| Hyperparameter | Previous tuned line | v13 submission |
| --- | --- | --- |
| `PPM_ORDER` | 5 | 5 |
| `PPM_H` | 0.99 / 0.995 | 0.999 |
| `PPM_L` | 0.20 | 0.18 |
| `PPM_T` | 0.80 | 0.80 |

Relative to PR #1991's open 0.94290 three-seed mean, this is about -0.00115 BPB on the same seed set.
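The quoted delta can be reproduced from the two means stated in this PR. This is only a sanity check on the arithmetic; the variable names are illustrative and the values are copied from the text, not recomputed from logs.

```python
# Means as quoted in this PR description.
pr1991_mean_bpb = 0.94290    # PR #1991 open three-seed mean
v13_mean_bpb = 0.94175270    # v13 fresh three-seed mean

delta_bpb = v13_mean_bpb - pr1991_mean_bpb
print(f"delta = {delta_bpb:+.5f} BPB")  # prints "delta = -0.00115 BPB"
```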

Fresh 3-seed results

| Seed | Final `ppm_sliding val_bpb` | Artifact bytes | Train stop | Eval time |
| --- | --- | --- | --- | --- |
| 42 | 0.94182660 | 15,987,305 | 4773 steps / 599.686s | 507.652s |
| 314 | 0.94146034 | 15,983,753 | 4770 steps / 599.628s | 516.897s |
| 999 | 0.94197117 | 15,988,348 | 4772 steps / 599.644s | 519.029s |
| Mean | 0.94175270 | | | |
| Std | 0.00026331 | | | |

All three fresh evals finish under the 600s eval cap. All three artifacts are under the strict decimal 16,000,000 byte cap; the largest measured total is 15,988,348 bytes.
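A minimal sketch for re-deriving the headline numbers and the cap checks from the per-seed rows above. The tuple layout and constant names are assumptions made for illustration; only the numeric values come from this PR.

```python
from statistics import mean, stdev

# (seed, ppm_sliding val_bpb, artifact bytes, eval seconds), copied from the results table.
RUNS = [
    (42, 0.94182660, 15_987_305, 507.652),
    (314, 0.94146034, 15_983_753, 516.897),
    (999, 0.94197117, 15_988_348, 519.029),
]
ARTIFACT_CAP_BYTES = 16_000_000  # strict decimal cap
EVAL_CAP_SECONDS = 600.0

bpbs = [bpb for _, bpb, _, _ in RUNS]
mean_bpb = mean(bpbs)
std_bpb = stdev(bpbs)  # sample standard deviation (n - 1 in the denominator)

max_bytes = max(size for _, _, size, _ in RUNS)
max_eval = max(secs for _, _, _, secs in RUNS)
caps_ok = max_bytes < ARTIFACT_CAP_BYTES and max_eval < EVAL_CAP_SECONDS

print(f"mean={mean_bpb:.8f} std={std_bpb:.8f}")
print(f"largest artifact={max_bytes} bytes, headroom={ARTIFACT_CAP_BYTES - max_bytes} bytes")
print(f"slowest eval={max_eval:.3f}s, caps_ok={caps_ok}")
```

Using `stdev` rather than `pstdev` matters here: with only three seeds, the sample and population estimates differ noticeably, and the PR reports the sample value.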

The earlier eval-only three-seed mean was 0.94174862; the fresh end-to-end rerun set is cleaner and is now used in submission.json.

Compliance notes

Causal PPM: score-before-update, prefix counts only.
TTT_ENABLED=0: no validation-set gradient update for the submitted score.
SmearGate BOS leak fix is present in both normal forward and TTT forward paths.
The byte sidecar is used for BPB accounting and byte-stream alignment, not as a learned answer table.
lrzip is required as a preinstalled system binary for the per-group compressor; the script does not download packages during training/eval.
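To make "score-before-update, prefix counts only" concrete, here is a toy sketch of that ordering. It is deliberately not the PPM evaluator from this PR: it uses a single fixed-order context with add-one smoothing, no escape mechanism, and no gate, and the function name is hypothetical. The only point is that each byte is scored from counts built on the preceding bytes and is added to the counts afterwards.

```python
import math
from collections import defaultdict


def score_bytes_causally(data: bytes, order: int = 2) -> float:
    """Return bits per byte under a toy adaptive context model (illustrative only)."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> next byte -> count
    totals = defaultdict(int)                       # context -> total observations
    total_bits = 0.0
    for i, byte in enumerate(data):
        ctx = data[max(0, i - order):i]
        # 1) Score first, using counts gathered from data[:i] only.
        p = (counts[ctx][byte] + 1) / (totals[ctx] + 256)
        total_bits -= math.log2(p)
        # 2) Update afterwards, so byte i never informs its own prediction.
        counts[ctx][byte] += 1
        totals[ctx] += 1
    return total_bits / max(1, len(data))


print(score_bytes_causally(b"a"))         # first byte has no history: 8.0 bits
print(score_bytes_causally(b"ab" * 200))  # repetitive input costs far fewer bits per byte
```

Swapping steps 1 and 2 would let each byte raise its own probability before being scored, which is the leak the compliance note rules out.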

Test plan

python3 -m py_compile train_gpt.py
python3 -m json.tool submission.json
3 fresh end-to-end PPM evals under 600s
Artifacts under 16,000,000 bytes
SmearGate BOS mask audited in both forward paths
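A small preflight sketch that could sit alongside the checks above, assuming one wants to confirm the `lrzip` binary is on `PATH` and that `submission.json` parses before launching a run. The helper names are hypothetical and not part of `train_gpt.py`.

```python
import json
import shutil
from pathlib import Path


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def is_valid_json(path):
    """Return True if `path` exists and contains parseable JSON."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    print("missing tools:", missing_tools(["lrzip"]))
    print("submission.json parses:", is_valid_json("submission.json"))
```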

Thanks to Claude for late-stage experiment design help, to Codex for implementation/audit/packaging/run coordination, and to the Parameter Golf community for the public SP8192, PPM, SmearGate, compression, and quantization ideas this builds on.

Attribution

README.md and REFERENCES.md explicitly credit inherited public Parameter Golf components: SP8192/tokenizer and recurrence lineage (PR #1394, #1493, #1855), byte-PPM lineage (PR #1795, #1959, #1991), SmearGate/BOS masking lineage (modded-nanogpt @classiclarryd, PR #1667, #1797, #2014), compression lineage (PR #1586, #1667, #1729), and quantization/optimizer/scoring pieces (PR #1530, #1886, #1923, #1344, #1145, #1967). The v13-specific contribution is the consolidation, sidecar-aware packaging, and final PPM gate retune to H=0.999/L=0.18/T=0.80.


NewyorkDev · 2026-05-01T04:16:10Z

Fresh end-to-end rerun evidence is still being collected. Seed 42 has completed cleanly and is included in this PR; seed 314 is currently running; seed 999 is queued to start after seed 314 exits. The current headline score remains based on the existing three-seed evidence until the fresh rerun set is complete.

NewyorkDev · 2026-05-01T04:38:36Z

Fresh rerun status update: seed 42 completed cleanly at ppm_sliding val_bpb 0.94182660; seed 314 completed cleanly at ppm_sliding val_bpb 0.94146034; seed 999 is now running as v13_submit_clean_s999_20260501_043637. I left the headline score unchanged until seed 999 finishes, because the fresh set is still incomplete. Thanks again to the public Parameter Golf contributors credited in REFERENCES.md, Claude for experiment/design help, and Codex for orchestration, implementation, audit, packaging, and PR maintenance.

NewyorkDev · 2026-05-01T05:02:37Z

Fresh clean rerun set is now complete and pushed in commit 4214ca9.

Final fresh end-to-end evidence with submitted defaults:

seed 42:  ppm_sliding val_bpb 0.94182660, bytes 15,987,305, eval 507.652s, rc=0
seed 314: ppm_sliding val_bpb 0.94146034, bytes 15,983,753, eval 516.897s, rc=0
seed 999: ppm_sliding val_bpb 0.94197117, bytes 15,988,348, eval 519.029s, rc=0
mean: 0.94175270
sample std: 0.00026331

All three artifacts are under the strict 16,000,000 byte cap and eval stays under 600s. The earlier eval-only mean was 0.94174862; the fresh full rerun set is cleaner and is now used in submission.json.

NewyorkDev added 3 commits April 30, 2026 23:52

Add SP8192 CaseOps v13 PPM record

08d11cb

Expand v13 attribution notes

a0e1834

Add fresh v13 seed42 rerun evidence

ff75681

NewyorkDev changed the title ~~Record: SP8192 CaseOps v13 PPM tuned gate — val_bpb 0.94175~~ [PENDING fresh reruns] Record: SP8192 CaseOps v13 PPM tuned gate — val_bpb 0.94175 May 1, 2026

Add fresh v13 seed314 rerun evidence

6aeb2a9

Add fresh seed999 v13 rerun evidence

4214ca9

NewyorkDev changed the title ~~[PENDING fresh reruns] Record: SP8192 CaseOps v13 PPM tuned gate — val_bpb 0.94175~~ Record: SP8192 CaseOps v13 PPM tuned gate — fresh 3-seed mean 0.94175270 May 1, 2026

himanshudongre mentioned this pull request May 1, 2026

Non-record: competition research notes #2111

Open

NewyorkDev wants to merge 5 commits into openai:main from NewyorkDev:codex/v13-caseops-ppm-094175
