Update leaderboard with May 1 audited rows #2146
Conversation
Co-authored-by: Codex <[email protected]>
|
Thank you @cocohearts and all the participants. Until next time! |
|
@cocohearts Thanks for the audit work. I'm on the same page regarding most of the exclusions. One pushback on the rationale for PR #2135.

**Precedent.** PR #1851 was filed without sufficient run logs/results, and PR #1868 supplied that evidence later, after PR #1855 had already beaten #1851/#1868. The combined submission was still accepted as part of the leaderboard record. That is the structural situation PR #2135 cites: code pre-cutoff, logs and results landing afterward.

**Consistency.** The rationale for excluding PR #2135 appears to be "all logs and results in by cutoff." That criterion is not stated in the README, and applying it consistently would retroactively invalidate PR #1851's leaderboard spot: PR #1851's scored 3-seed state was only complete once PR #1868 supplied the missing logs, by which point PR #1855 had already beaten the #1851/#1868 record. Under a "logs and results in by cutoff" rule, PR #1851/#1868 would have been beaten before its submission was complete, and could never have taken its leaderboard spot in the first place.

**README rule.** "Since all submissions are public, we're accepting record submissions chronologically depending on their PR creation time." That pins ordering on PR creation time, not on when logs and results reach completion.

**Timing and code surface.** PR #2135 was opened at 2026-05-01 23:48:57Z (4:48 PM PT, ~11 minutes before the 5 PM PT cutoff). The finalized 3-seed results were pushed afterward, but the PR itself was filed pre-cutoff. The pre-cutoff commit (be7420d) shipped the full code surface: train_gpt.py, lossless_caps.py, online_ngram_state.c, online_ngram_tilt.py, prepare_caseops_data.py, the tokenizer .model, requirements.txt, and scaffold README/submission.json. The two post-cutoff commits (f086a9f, ff90522) only added train_seed*.log run outputs and filled result numerics into README.md and submission.json. No methodology, architecture, training script, or tokenizer changed afterward.

**Reproducibility.** The seed log files added afterward are run outputs of the pre-cutoff code, not new submission content. Anyone with 8xH100 SXM access could clone be7420d and regenerate the runs within run-to-run noise. What was "missing" at cutoff wasn't the submission; it was the empirical measurement of a submission that was already public and reproducible.

**Conclusion.** The README's PR-creation-time rule and the #1851/#1868 precedent both place PR #2135 on the same footing as PR #1851/#1868 for leaderboard inclusion: PR opened pre-cutoff, full code surface in-tree pre-cutoff, logs and results landing afterward. PR #2135 is a valid record submission and should be included on the leaderboard. Thanks so much again for taking the time to review these submissions. |
|
congrats everyone!! |
|
Thanks for the thorough audit @cocohearts. One flag before this merges: PR #2130 has the same train/val document overlap as PR #2018, which you correctly excluded.

**Evidence.** `SUBMISSION_FINAL/train_seed314.log` line 1 reports `train_shards: 1499`. Per Issue #2127, `train_shards: 1499` is the fingerprint of a local prepare_caseops_data.py run with the default `--val-docs=10000`. Train starts at document 10,000; the competition val set covers documents 0–49,999. Result: documents 10,000–49,999 (40k docs, 80% of the val set) appear in both the training data and the scored val split.

By contrast, the other three rows you included (#1945, #1953, #2014) all use snapshot_download from romeerp/parameter-golf-caseops-v1 and report `train_shards: 80` with an explicit 50,000-doc val split; those are clean.

PR #2130's README claims "Same dependencies and CaseOps tokenizer/shards as merged PR #1855," but the log contradicts this: PR #1855 uses the canonical HF dataset (80 shards), whereas #2130 generated data locally with the leaky default. The claimed 1.05670 should be excluded on the same grounds as #2018. |
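The overlap arithmetic above can be sketched directly. This is a hedged illustration only: the constants (`--val-docs=10000` default, train starting at document 10,000, scored val split covering documents 0–49,999) come from this thread and Issue #2127, not from reading prepare_caseops_data.py itself.

```python
# Hedged sketch of the leakage arithmetic described above, under the stated
# assumptions: a local prepare_caseops_data.py run with the default
# --val-docs=10000 starts training data at document 10,000, while the scored
# competition val split covers documents 0-49,999.
VAL_DOCS_DEFAULT = 10_000            # assumed local default --val-docs
COMPETITION_VAL = range(0, 50_000)   # assumed scored val split: docs 0-49,999

train_start = VAL_DOCS_DEFAULT       # first document that enters training data
overlap = [doc for doc in COMPETITION_VAL if doc >= train_start]

leak_fraction = len(overlap) / len(COMPETITION_VAL)
print(len(overlap), leak_fraction)   # 40000 docs, 0.8 of the val set
```

Under those assumptions, 40,000 of the 50,000 scored val documents are also seen in training, which is the 80% figure cited above.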
|
@cocohearts @leon2k2k2k Acknowledged on PR #2130; the `train_shards: 1499` plus the doubled datasets/datasets path is hard to read any other way. Worth noting for the leaderboard reconciliation: PR #2135 does not share the same validity issue as PR #2130 even though it's based on #2130, so #2135 is valid regardless of the validity of #2130. Semi-unrelated, but also flagging that numerous people on the Discord have raised reproducibility questions about PR #2014. Relevant since the chain backs up to PR #2014 if PR #2130 is excluded. |
|
Small update on #2140: I agree with the exclusion rationale for the originally submitted state because it accidentally had the within-word / word-start / agreement n-gram channels active. I’ve pushed a corrective commit restoring the intended token-only posture I had in #2018. The corrected logs now report:
The corrected 3-seed mean is given in the logs above. I defer to maintainers on how to treat the timing/eligibility question for #2140 in this audit PR, but wanted to make the technical correction visible here. Thanks again to @cocohearts @0hq @valerio-oai and the other golfers for a great competition! |
|
@cocohearts @0hq @valerio-oai Separate from any of the specific-submission discussions, just want to say thank you to all three of you for the work that went into making this contest possible. It's a genuinely interesting challenge, accessible to people without industrial-scale compute, with a records archive that's already a serious body of method ideas. The audit and review work on top of that, at this scale, is real labor. None of that exists without you. Appreciated. |
Co-authored-by: Codex <[email protected]>
|
Updated after applying the grace policy:
The README diff is now four rows: #1945 V21 v2, #1953, #2014, and #2135. |
|
Thanks @cocohearts. One small note, broadly consistent with @codemath3000’s point about timing and precedent: like his approved #2135, my PR was opened before the cutoff. My #2140 is admittedly not identical to #2135, since my corrective commit did touch code, not just logs. But that change was narrowly compliance-restoring: it disabled the unintended target-token-gated channels and moved the score worse than the originally submitted number. I'd humbly hope maintainers decide there is room for a little grace here, for a compliance correction clearly made in the spirit of the competition, especially since there was some uncertainty about precisely when the cutoff would be and what adjustments might be permitted. If the rule is strictly “final scored state pushed before cutoff,” I understand the exclusion. Thanks for your consideration. |
|
@cocohearts why did #2019 not make it? |
|
Thanks @simon-marcus. Just to clarify the current state for context: under @cocohearts's most recent call, PR #2135 is also still on the excluded side, applying the "scored-state-pushed-before-cutoff" rule. @simon-marcus, I do agree with your broader point about leaving room for narrowly measurement-completing or compliance-restoring work that lands shortly after cutoff, especially when the PR itself was filed pre-cutoff. I'd similarly hope #2135 receives the same consideration. |
Co-authored-by: Codex <[email protected]>
|
Applying the maintainer grace policy here: if the original idea/code was submitted before cutoff, then later validation/results or narrowly compliance-restoring fixes can be considered. Changes made in this PR:
|
|
Thank you so much @cocohearts! |
Summary
Adds the audited post-#1902 leaderboard progression rows using the maintainer grace policy: a PR opened before the May 1, 2026 5:00 PM Pacific cutoff can count when the original idea/code was public before cutoff and later commits only supplied validation/results or narrowly compliance-restoring fixes.
Rows added:
- 70067534: 1.05943, p=0.034 vs PR #1855 (Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108, 3-seed mean)

Audit Notes
I scanned 192 PRs (#1944-#2140) created after the #1902 leaderboard merge and before the cutoff, using parallel Codex shard graders plus a final global chronological reconciliation.
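On the p-values attached to the rows: the audit's exact statistical test isn't specified in this PR. As a hedged sketch of one conventional choice, a Welch t-statistic over per-seed val_bpb samples can be computed with the standard library; the seed values below are illustrative placeholders, not audited numbers.

```python
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two independent samples.

    Note: this returns only t; converting t to a p-value additionally
    needs a t-distribution CDF (e.g. from scipy), omitted here.
    """
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se = (va / len(a) + vb / len(b)) ** 0.5                  # standard error of the mean difference
    return (statistics.mean(a) - statistics.mean(b)) / se

# Illustrative 3-seed val_bpb samples; real ones would come from the
# submissions' train_seed*.log files.
candidate = [1.0590, 1.0595, 1.0598]   # hypothetical new-record seeds
record    = [1.0608, 1.0610, 1.0614]   # hypothetical prior-record seeds

t = welch_t(candidate, record)
print(round(t, 2))   # strongly negative: candidate bpb is lower across seeds
```

A strongly negative t across seeds is what a small p-value like 0.034 summarizes; with only 3 seeds per side, the degrees of freedom are small, so the t-to-p conversion matters.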
Grace-policy handling:
`train_shards: 80`, no doubled local `datasets/datasets` path, `val_tokens: 47851520`).

Notable exclusions: