Update leaderboard with May 1 audited rows #2146
Conversation
Co-authored-by: Codex <[email protected]>
|
Thank you @cocohearts and all the participants. Until next time! |
|
@cocohearts Thanks for the audit work. I'm on the same page regarding most of the exclusions. One pushback on the rationale for PR #2135.

**Precedent.** PR #1851 was filed without sufficient run logs/results, and PR #1868 supplied that evidence later, after PR #1855 had already beaten #1851/#1868. The combined submission was still accepted as part of the leaderboard record. That is the structural situation PR #2135 cites: code pre-cutoff, logs and results landing afterward.

**Consistency.** The rationale for excluding PR #2135 appears to be "all logs and results in by cutoff." That criterion is not stated in the README, and applying it consistently would retroactively invalidate PR #1851's leaderboard spot: PR #1851's scored 3-seed state was only complete once PR #1868 supplied the missing logs, by which point PR #1855 had already beaten the #1851/#1868 record. Under a "logs and results in by cutoff" rule, PR #1851/#1868 would have been beaten before its submission was complete, and could never have taken its leaderboard spot in the first place.

**README rule.** "Since all submissions are public, we're accepting record submissions chronologically depending on their PR creation time." That pins ordering on PR creation time, not on when logs and results reach completion.

**Timing and code surface.** PR #2135 was opened at 2026-05-01 23:48:57Z (4:48 PM PT, ~11 minutes before the 5 PM PT cutoff). The finalized 3-seed results were pushed afterward, but the PR itself was filed pre-cutoff. The pre-cutoff commit (be7420d) shipped the full code surface: train_gpt.py, lossless_caps.py, online_ngram_state.c, online_ngram_tilt.py, prepare_caseops_data.py, the tokenizer .model, requirements.txt, and scaffold README/submission.json. The two post-cutoff commits (f086a9f, ff90522) only added train_seed*.log run outputs and filled result numerics into README.md and submission.json. No methodology, architecture, training script, or tokenizer changed afterward.

**Reproducibility.** The seed log files added afterward are run outputs of the pre-cutoff code, not new submission content. Anyone with 8xH100 SXM access could clone be7420d and regenerate the runs within run-to-run noise. What was "missing" at cutoff wasn't the submission; it was the empirical measurement of a submission that was already public and reproducible.

**Conclusion.** The README's PR-creation-time rule and the #1851/#1868 precedent both place PR #2135 on the same footing as PR #1851/#1868 for leaderboard inclusion: PR opened pre-cutoff, full code surface in-tree pre-cutoff, logs and results landing afterward. PR #2135 is a valid record submission and should be included on the leaderboard. Thanks so much again for taking the time to review these submissions. |
|
congrats everyone!! |
|
Thanks for the thorough audit @cocohearts. One flag before this merges: PR #2130 has the same train/val document overlap as PR #2018, which you correctly excluded.

**Evidence.** `SUBMISSION_FINAL/train_seed314.log` line 1 reports `train_shards: 1499`. Per Issue #2127, `train_shards: 1499` is the fingerprint of a local prepare_caseops_data.py run with the default `--val-docs=10000`. Train starts at document 10,000; the competition val set covers documents 0–49,999. Result: documents 10,000–49,999 (40k docs, 80% of the val set) appear in both the training data and the scored val split.

By contrast, the other three rows you included (#1945, #1953, #2014) all use snapshot_download from romeerp/parameter-golf-caseops-v1 and report `train_shards: 80` with an explicit 50,000-doc val split; those are clean.

PR #2130's README claims "Same dependencies and CaseOps tokenizer/shards as merged PR #1855," but the log contradicts this: PR #1855 uses the canonical HF dataset (80 shards), whereas #2130 generated data locally with the leaky default. The claimed 1.05670 should be excluded on the same grounds as #2018. |
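The overlap arithmetic above can be sketched directly. This is a hedged illustration only: the constants (`--val-docs=10000` default, train starting at document 10,000, scored val split covering documents 0–49,999) come from this thread and Issue #2127, not from reading prepare_caseops_data.py itself.

```python
# Hedged sketch of the leakage arithmetic described above, under the stated
# assumptions: a local prepare_caseops_data.py run with the default
# --val-docs=10000 starts training data at document 10,000, while the scored
# competition val split covers documents 0-49,999.
VAL_DOCS_DEFAULT = 10_000            # assumed local default --val-docs
COMPETITION_VAL = range(0, 50_000)   # assumed scored val split: docs 0-49,999

train_start = VAL_DOCS_DEFAULT       # first document that enters training data
overlap = [doc for doc in COMPETITION_VAL if doc >= train_start]

leak_fraction = len(overlap) / len(COMPETITION_VAL)
print(len(overlap), leak_fraction)   # 40000 docs, 0.8 of the val set
```

Under those assumptions, 40,000 of the 50,000 scored val documents are also seen in training, which is the 80% figure cited above.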
|
@cocohearts @leon2k2k2k Acknowledged on PR #2130; the `train_shards: 1499` plus the doubled datasets/datasets path is hard to read any other way. Worth noting for the leaderboard reconciliation: PR #2135 does not share the same validity issue as PR #2130 even though it's based on #2130, so #2135 is valid regardless of the validity of #2130. Semi-unrelated, but also flagging that numerous people on the Discord have raised reproducibility questions about PR #2014. Relevant since the chain backs up to PR #2014 if PR #2130 is excluded. |
|
Small update on #2140: I agree with the exclusion rationale for the originally submitted state because it accidentally had the within-word / word-start / agreement n-gram channels active. I’ve pushed a corrective commit restoring the intended token-only posture I had in #2018. The corrected logs now report:
The corrected 3-seed mean is given in the logs above. I defer to maintainers on how to treat the timing/eligibility question for #2140 in this audit PR, but wanted to make the technical correction visible here. Thanks again to @cocohearts @0hq @valerio-oai and the other golfers for a great competition! |
|
@cocohearts @0hq @valerio-oai Separate from any of the specific-submission discussions, just want to say thank you to all three of you for the work that went into making this contest possible. It's a genuinely interesting challenge, accessible to people without industrial-scale compute, with a records archive that's already a serious body of method ideas. The audit and review work on top of that, at this scale, is real labor. None of that exists without you. Appreciated. |
Co-authored-by: Codex <[email protected]>
|
Updated after applying the grace policy:
The README diff is now four rows: #1945 V21 v2, #1953, #2014, and #2135. |
|
Thanks @cocohearts. One small note, broadly consistent with @codemath3000’s point about timing and precedent: like his approved #2135, my PR was opened before the cutoff. My #2140 is admittedly not identical to #2135, since my corrective commit did touch code, not just logs. But that change was narrowly compliance-restoring: it disabled the unintended target-token-gated channels and moved the score worse than the originally submitted number. I'd humbly hope maintainers decide there is room for a little grace here, for a compliance correction clearly made in the spirit of the competition, especially since there was some uncertainty about precisely when the cutoff would be and what adjustments might be permitted. If the rule is strictly “final scored state pushed before cutoff,” I understand the exclusion. Thanks for your consideration. |
|
@cocohearts why did #2019 not make it? |
|
Thanks @simon-marcus. Just to clarify the current state for context: under @cocohearts's most recent call, PR #2135 is also still on the excluded side, applying the "scored-state-pushed-before-cutoff" rule. @simon-marcus, I do agree with your broader point about leaving room for narrowly measurement-completing or compliance-restoring work that lands shortly after cutoff, especially when the PR itself was filed pre-cutoff. I'd similarly hope #2135 receives the same consideration. |
Co-authored-by: Codex <[email protected]>
|
Applying the maintainer grace policy here: if the original idea/code was submitted before cutoff, then later validation/results or narrowly compliance-restoring fixes can be considered. Changes made in this PR:
|
|
Thank you so much @cocohearts! |
Summary
Adds the audited post-#1902 leaderboard progression rows using the maintainer grace policy: a PR opened before the May 1, 2026 5:00 PM Pacific cutoff can count when the original idea/code was public before cutoff and later commits only supplied validation/results or narrowly compliance-restoring fixes.
Rows added:
- 70067534: 1.05943, p=0.034 vs PR #1855 (Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108, 3-seed mean)

Audit Notes
I scanned 192 PRs (#1944-#2140) created after the #1902 leaderboard merge and before the cutoff, using parallel Codex shard graders plus a final global chronological reconciliation.
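On the p-values attached to the rows: the audit's exact statistical test isn't specified in this PR. As a hedged sketch of one conventional choice, a Welch t-statistic over per-seed val_bpb samples can be computed with the standard library; the seed values below are illustrative placeholders, not audited numbers.

```python
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two independent samples.

    Note: this returns only t; converting t to a p-value additionally
    needs a t-distribution CDF (e.g. from scipy), omitted here.
    """
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se = (va / len(a) + vb / len(b)) ** 0.5                  # standard error of the mean difference
    return (statistics.mean(a) - statistics.mean(b)) / se

# Illustrative 3-seed val_bpb samples; real ones would come from the
# submissions' train_seed*.log files.
candidate = [1.0590, 1.0595, 1.0598]   # hypothetical new-record seeds
record    = [1.0608, 1.0610, 1.0614]   # hypothetical prior-record seeds

t = welch_t(candidate, record)
print(round(t, 2))   # strongly negative: candidate bpb is lower across seeds
```

A strongly negative t across seeds is what a small p-value like 0.034 summarizes; with only 3 seeds per side, the degrees of freedom are small, so the t-to-p conversion matters.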
Grace-policy handling:
`train_shards: 80`, no doubled local `datasets/datasets` path, `val_tokens: 47851520`).

Notable exclusions: