## Summary

This audit was conducted in collaboration with @andrewbaggio1. We audited all 34 CaseOps-lineage record-track PRs since #1729 (2026-04-18) for train/val data overlap. 15 have a systematic leak caused by the default behaviour of `prepare_caseops_data.py`. The current claimed frontier (#2118 at 1.04350) is one of them. The true clean frontier is #2014 at 1.05759.

This is posted to the best of our ability. If you are a PR author and believe your record has been misclassified, please reply and we will revisit.
## The bug

`prepare_caseops_data.py` ships with `--val-docs=10000` as its default. With that default, train shards start at canonical-stream document 10,000, while val is generated from the first 50,000 documents. Documents 10,000–49,999 (40,000 docs, 80% of val) therefore appear in both train and val. The model partially memorizes them during training, then is scored on them as if they were held out.
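The overlap is plain range arithmetic. A minimal sketch, assuming (per the description above, not the script itself) that val always covers the first 50,000 canonical-stream documents while train shards start at document `--val-docs`:

```python
# Range arithmetic for the split described above. The model of the script's
# behaviour (val = first 50,000 canonical-stream docs, train starting at
# document `val_docs`) is inferred from this post, not from the script.
VAL_END = 50_000  # val is always built from docs [0, 50_000)

def train_val_overlap(val_docs: int) -> range:
    """Docs that appear in both train and val for a given --val-docs."""
    return range(val_docs, VAL_END)  # empty once val_docs >= VAL_END

leak = train_val_overlap(10_000)       # the shipped default
print(len(leak))                       # 40000 docs leaked
print(len(leak) / VAL_END)             # 0.8 -> 80% of val seen in training
print(len(train_val_overlap(50_000)))  # 0 -> clean split
```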
The canonical HF dataset (`romeerp/parameter-golf-caseops-v1`) does not have this problem — its manifest has `docs_val=50000, docs_train=8,181,945` with a strictly disjoint partition by construction.

Every shipped copy of `prepare_caseops_data.py` across all 34 PRs is byte-identical. No PR ever passes `--val-docs=50000`.

The gold-standard description of the leak is in PR #2018's `DATASET_AUDIT.md`, which explicitly documents the `--val-docs=10000` train shards and independently verifies the first 80 shards byte-for-byte.

## How we classified each PR
We applied two questions to each PR's reproduce flow (README data-setup section + all shipped shell scripts):
- Q1: Is there an explicit HF download command? (`snapshot_download`, `cached_challenge_fineweb.py`, `huggingface-cli download` — all targeting `romeerp/parameter-golf-caseops-v1`)
- Q2: Is `prepare_caseops_data.py` invoked? (`train_shards: 39` → CLEAN; `train_shards > 1000` → LEAK; triple-nested path → LEAK; otherwise AMBIGUOUS)
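The rubric above can be written out as a small decision function. This is only a sketch: the field names and exact precedence below are our own, and the audit itself was done by reading each PR's README and scripts by hand, not programmatically.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PRFlow:
    """Hypothetical summary of one PR's reproduce flow."""
    hf_download: bool            # Q1: explicit canonical HF download present
    runs_prepare_script: bool    # Q2: prepare_caseops_data.py invoked
    train_shards: Optional[int]  # shard count observed in the repro flow
    triple_nested_path: bool     # tell-tale layout of the leaky local prep

def classify(pr: PRFlow) -> str:
    if pr.hf_download and not pr.runs_prepare_script:
        return "CLEAN"  # canonical dataset is disjoint by construction
    if pr.train_shards == 39:
        return "CLEAN"
    if pr.train_shards is not None and pr.train_shards > 1000:
        return "LEAK"
    if pr.triple_nested_path:
        return "LEAK"
    return "AMBIGUOUS"
```

For example, `classify(PRFlow(False, True, 1729, False))` yields `"LEAK"`, while a PR that only downloads the canonical dataset yields `"CLEAN"`.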
## Results

## Updated timeline, reconstructed to the best of our ability
## Recommendation

- Records produced with `prepare_caseops_data.py` without `--val-docs=50000` should be flagged pending re-run on clean data.
- The script's default should be updated to `--val-docs=50000`, or submissions should be required to use the canonical HF dataset.
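Beyond changing the default, the script could refuse the known-leaky setting outright. A sketch of such a guard, assuming the script parses its flags with `argparse` — we have not seen its internals, so the wiring here is illustrative:

```python
import argparse

# Sketch: reject any --val-docs value that reintroduces train/val overlap,
# instead of silently accepting it. Assumes argparse; names mirror the post.
CANONICAL_VAL_DOCS = 50_000

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="prepare CaseOps data (sketch)")
    p.add_argument("--val-docs", type=int, default=CANONICAL_VAL_DOCS)
    args = p.parse_args(argv)
    if args.val_docs != CANONICAL_VAL_DOCS:
        # argparse's error() prints usage and exits with status 2
        p.error(f"--val-docs={args.val_docs} overlaps train with val; "
                f"use --val-docs={CANONICAL_VAL_DOCS} "
                f"or the canonical HF dataset")
    return args
```

With this guard, `parse_args([])` succeeds with the safe default, while `--val-docs=10000` aborts with an explanatory message.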
## Credit

Thanks to @aquariouseworkman who both identified and attempted to fix this issue (PR #1851, the first clean record post-leak), and who raised it with the community. Full audit with per-PR evidence: `caseops-memory-leakage/` on the `research` branch of our fork.

@cocohearts @valerio-oai