Record: SP8192 CaseOps v13 PPM tuned gate — fresh 3-seed mean 0.94175270 #2083

Open
NewyorkDev wants to merge 5 commits into openai:main from NewyorkDev:codex/v13-caseops-ppm-094175

@NewyorkDev commented on May 1, 2026

Summary

fresh val_bpb = 0.94175270 (3-seed mean, sample std=0.00026331, full FineWeb val) | strict <16 MB artifact | 8xH100 SXM | causal sidecar-aware byte PPM, no TTT.

This is our strongest v13 lane: SP8192 CaseOps + SmearGate BOS masking + per-group lrzip compression + PPM order-5 evaluator. The final delta is a narrow PPM gate retune:

| Hyperparameter | Previous tuned line | v13 submission |
| --- | --- | --- |
| PPM_ORDER | 5 | 5 |
| PPM_H | 0.99 / 0.995 | 0.999 |
| PPM_L | 0.20 | 0.18 |
| PPM_T | 0.80 | 0.80 |

Relative to PR #1991's open 0.94290 three-seed mean, this is about -0.00115 BPB on the same seed set.

Fresh 3-seed results

| Seed | Final ppm_sliding val_bpb | Artifact bytes | Train stop | Eval time |
| --- | --- | --- | --- | --- |
| 42 | 0.94182660 | 15,987,305 | 4773 steps / 599.686s | 507.652s |
| 314 | 0.94146034 | 15,983,753 | 4770 steps / 599.628s | 516.897s |
| 999 | 0.94197117 | 15,988,348 | 4772 steps / 599.644s | 519.029s |
| Mean | 0.94175270 | | | |
| Std | 0.00026331 | | | |

All three fresh evals finish under the 600s eval cap. All three artifacts are under the strict decimal 16,000,000 byte cap; the largest measured total is 15,988,348 bytes.
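The summary statistics and cap checks can be re-derived from the per-seed rows. The following is a minimal sketch, not part of the submission: the values are copied from the results table, the PR #1991 figure is the 0.94290 mean quoted earlier, and the variable names are chosen here for illustration.

```python
import statistics

# Per-seed values copied from the results table.
val_bpb = {42: 0.94182660, 314: 0.94146034, 999: 0.94197117}
artifact_bytes = {42: 15_987_305, 314: 15_983_753, 999: 15_988_348}
eval_seconds = {42: 507.652, 314: 516.897, 999: 519.029}

BYTE_CAP = 16_000_000    # strict decimal byte cap stated in the PR
EVAL_CAP_S = 600.0       # eval time cap stated in the PR
PR_1991_MEAN = 0.94290   # three-seed mean quoted for PR #1991

scores = list(val_bpb.values())
mean_bpb = statistics.mean(scores)
std_bpb = statistics.stdev(scores)   # sample std, divides by n - 1
delta = mean_bpb - PR_1991_MEAN      # negative means lower (better) BPB

all_under_byte_cap = all(b < BYTE_CAP for b in artifact_bytes.values())
all_under_eval_cap = all(t < EVAL_CAP_S for t in eval_seconds.values())

print(f"mean  = {mean_bpb:.8f}")
print(f"std   = {std_bpb:.8f}")
print(f"delta = {delta:+.5f}")
print(f"bytes ok = {all_under_byte_cap}, eval ok = {all_under_eval_cap}")
```

`statistics.stdev` is the sample (n - 1) estimator, which is the convention the "sample std" label in the summary implies.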

The earlier eval-only three-seed mean was 0.94174862; the fresh end-to-end rerun set is cleaner and is now used in submission.json.

Compliance notes

  • Causal PPM: score-before-update, prefix counts only.
  • TTT_ENABLED=0: no validation-set gradient update for the submitted score.
  • SmearGate BOS leak fix is present in both normal forward and TTT forward paths.
  • The byte sidecar is used for BPB accounting and byte-stream alignment, not as a learned answer table.
  • lrzip is required as a preinstalled system binary for the per-group compressor; the script does not download packages during training/eval.
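To illustrate what "score-before-update, prefix counts only" means, here is a toy sketch. It is not the submitted evaluator: it assumes a single fixed-order context and replaces PPM's escape and back-off mechanism with add-one smoothing over 256 byte values, and the function name `causal_byte_bpb` is hypothetical. The point is only the ordering of the two steps inside the loop.

```python
import math
from collections import defaultdict


def causal_byte_bpb(data: bytes, order: int = 2) -> float:
    """Bits per byte under a toy fixed-order byte count model.

    Assumption: add-one smoothing stands in for PPM escape/back-off.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
    totals = defaultdict(int)                       # context -> total count
    total_bits = 0.0
    for i, byte in enumerate(data):
        ctx = data[max(0, i - order):i]  # bytes strictly before position i
        # 1) Score: the probability uses counts from positions < i only.
        p = (counts[ctx][byte] + 1) / (totals[ctx] + 256)
        total_bits += -math.log2(p)
        # 2) Update: only now does byte i enter the counts.
        counts[ctx][byte] += 1
        totals[ctx] += 1
    return total_bits / len(data) if data else 0.0


if __name__ == "__main__":
    print(causal_byte_bpb(b"abababababababab" * 16))
```

Because the update happens after scoring, the first byte of any stream always costs the full 8 bits here: the model cannot see a byte before it has been charged for it.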

Test plan

  • python3 -m py_compile train_gpt.py
  • python3 -m json.tool submission.json
  • 3 fresh end-to-end PPM evals under 600s
  • Artifacts under 16,000,000 bytes
  • SmearGate BOS mask audited in both forward paths

Thanks to Claude for late-stage experiment design help, to Codex for implementation/audit/packaging/run coordination, and to the Parameter Golf community for the public SP8192, PPM, SmearGate, compression, and quantization ideas this builds on.

Attribution

README.md and REFERENCES.md explicitly credit inherited public Parameter Golf components: SP8192/tokenizer and recurrence lineage (PR #1394, #1493, #1855), byte-PPM lineage (PR #1795, #1959, #1991), SmearGate/BOS masking lineage (modded-nanogpt @classiclarryd, PR #1667, #1797, #2014), compression lineage (PR #1586, #1667, #1729), and quantization/optimizer/scoring pieces (PR #1530, #1886, #1923, #1344, #1145, #1967). The v13-specific contribution is the consolidation, sidecar-aware packaging, and final PPM gate retune to H=0.999/L=0.18/T=0.80.

@NewyorkDev changed the title from "Record: SP8192 CaseOps v13 PPM tuned gate — val_bpb 0.94175" to "[PENDING fresh reruns] Record: SP8192 CaseOps v13 PPM tuned gate — val_bpb 0.94175" on May 1, 2026
@NewyorkDev (Author) commented:

Fresh end-to-end rerun evidence is still being collected. Seed 42 has completed cleanly and is included in this PR; seed 314 is currently running; seed 999 is queued to start after seed 314 exits. The current headline score remains based on the existing three-seed evidence until the fresh rerun set is complete.

@NewyorkDev (Author) commented:

Fresh rerun status update: seed 42 completed cleanly at ppm_sliding val_bpb 0.94182660; seed 314 completed cleanly at ppm_sliding val_bpb 0.94146034; seed 999 is now running as v13_submit_clean_s999_20260501_043637. I left the headline score unchanged until seed 999 finishes, because the fresh set is still incomplete. Thanks again to the public Parameter Golf contributors credited in REFERENCES.md, Claude for experiment/design help, and Codex for orchestration, implementation, audit, packaging, and PR maintenance.

@NewyorkDev (Author) commented:

Fresh clean rerun set is now complete and pushed in commit 4214ca9.

Final fresh end-to-end evidence with submitted defaults:

seed 42:  ppm_sliding val_bpb 0.94182660, bytes 15,987,305, eval 507.652s, rc=0
seed 314: ppm_sliding val_bpb 0.94146034, bytes 15,983,753, eval 516.897s, rc=0
seed 999: ppm_sliding val_bpb 0.94197117, bytes 15,988,348, eval 519.029s, rc=0
mean: 0.94175270
sample std: 0.00026331

All three artifacts are under the strict 16,000,000 byte cap and eval stays under 600s. The earlier eval-only mean was 0.94174862; the fresh full rerun set is cleaner and is now used in submission.json.

@NewyorkDev changed the title from "[PENDING fresh reruns] Record: SP8192 CaseOps v13 PPM tuned gate — val_bpb 0.94175" to "Record: SP8192 CaseOps v13 PPM tuned gate — fresh 3-seed mean 0.94175270" on May 1, 2026