
Train/val data leakage in CaseOps records — prepare_caseops_data.py default overlaps 80% of val docs with training data #2127

@leon2k2k2k


Summary

This audit was conducted in collaboration with @andrewbaggio1. We audited all 34 CaseOps-lineage record-track PRs since #1729 (2026-04-18) for train/val data overlap. 15 have a systematic leak caused by the default behaviour of prepare_caseops_data.py. The current claimed frontier (#2118 at 1.04350) is one of them. The true clean frontier is #2014 at 1.05759.

This is posted to the best of our ability. If you are a PR author and believe your record has been misclassified, please reply and we will revisit.


The bug

prepare_caseops_data.py ships with:

SHARD_TOKENS = 10_000_000
parser.add_argument("--val-docs", default=10_000)

With the default --val-docs=10000, train shards start at canonical-stream document 10,000. Val is generated from the first 50,000 documents. Documents 10,000–49,999 (40,000 docs, 80% of val) appear in both train and val. The model partially memorizes them during training, then is scored on them as if they were held out.
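The overlap arithmetic can be sketched directly (an illustrative calculation, not the shipped script; the constant names are ours):

```python
# Illustrative reproduction of the leak arithmetic described above.
VAL_DOCS_DEFAULT = 10_000   # the --val-docs default every PR inherits
VAL_POOL = 50_000           # val is generated from the first 50,000 canonical docs

# Train shards start at canonical-stream document index VAL_DOCS_DEFAULT,
# so every document in [10_000, 50_000) appears in both train and val.
train_start = VAL_DOCS_DEFAULT
overlap_docs = VAL_POOL - train_start       # 40,000 leaked documents
overlap_frac = overlap_docs / VAL_POOL      # 0.8 -> 80% of val seen in training
```

With `--val-docs=50000`, `train_start` moves to 50,000 and the overlap collapses to zero.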

The canonical HF dataset (romeerp/parameter-golf-caseops-v1) does not have this problem — its manifest has docs_val=50000, docs_train=8,181,945 with a strict disjoint partition by construction.

Every shipped copy of prepare_caseops_data.py across all 34 PRs is byte-identical. No PR ever passes --val-docs=50000.
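The byte-identical claim is straightforward to re-verify with stdlib hashing; this is a hedged sketch, and the `prs/` glob path is hypothetical:

```python
# Sketch: confirm all shipped copies of prepare_caseops_data.py are identical
# by hashing their raw bytes. Paths are illustrative, not from the audit repo.
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# All copies are byte-identical iff the digests collapse to a single value:
# digests = {file_digest(p) for p in Path("prs").glob("*/prepare_caseops_data.py")}
# assert len(digests) == 1
```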

The gold-standard description of the leak is in PR #2018's DATASET_AUDIT.md, which explicitly documents --val-docs=10000 train shards and independently verifies the first 80 shards byte-for-byte.


How we classified each PR

We applied two questions to each PR's reproduce flow (README data-setup section + all shipped shell scripts):

  • Q1: Is there an explicit HF download command? (snapshot_download, cached_challenge_fineweb.py, huggingface-cli download — all targeting romeerp/parameter-golf-caseops-v1)
  • Q2: Is prepare_caseops_data.py invoked?
| Q1 | Q2 | Verdict |
|----|----|---------|
| Yes | No | CLEAN |
| No | Yes | LEAK |
| Yes | Yes | Check which is the real reproduce step |
| No | No | Check train log (train_shards: 39 → CLEAN; train_shards > 1000 → LEAK; triple-nested path → LEAK; otherwise AMBIGUOUS) |
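The decision procedure above can be expressed as a small function (a sketch in our own names, not the audit tooling):

```python
# Hypothetical encoding of the two-question classifier described above.
def classify(has_hf_download: bool, runs_prepare_script: bool) -> str:
    """Map (Q1, Q2) answers for a PR's reproduce flow to a verdict."""
    if has_hf_download and not runs_prepare_script:
        return "CLEAN"                 # canonical disjoint HF dataset only
    if runs_prepare_script and not has_hf_download:
        return "LEAK"                  # default --val-docs=10000 path only
    if has_hf_download and runs_prepare_script:
        return "CHECK_REPRODUCE_STEP"  # determine which step is actually run
    return "CHECK_TRAIN_LOG"           # fall back to train-log heuristics
```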

Results

| Bucket | Count | PRs |
|--------|-------|-----|
| CLEAN | 12 | #1729, #1851, #1855, #1868, #1945, #1953, #1967, #2014, #2019, #2027, #2031, #2068, #2123, #2124 |
| LEAK | 15 | #1736, #1769, #1797, #1923, #2007, #2018, #2060, #2071, #2078, #2100, #2101, #2109, #2118 |
| AMBIGUOUS | 6 | #1787, #1908, #2041, #2075, #2117, #2121 |

Updated timeline, reconstructed to the best of our ability

| Date | PR | val_bpb | Status |
|--------|-------|---------|--------|
| Apr 18 | #1729 | 1.06780 | CLEAN |
| Apr 23 | #1787 | 1.06335 | AMBIGUOUS, lean LEAK |
| Apr 27 | #1851 | 1.06128 | CLEAN |
| Apr 25 | #1855 | 1.0611 | CLEAN |
| Apr 28 | #1908 | 1.06081 | AMBIGUOUS, lean CLEAN |
| Apr 29 | #1868 | 1.06141 | CLEAN |
| Apr 29 | #1945 | 1.05943 | CLEAN |
| Apr 30 | #1953 | 1.05855 | CLEAN |
| Apr 30 | #1967 | 1.05851 | CLEAN |
| Apr 30 | #2014 | 1.05759 | CLEAN |

Recommendation

  1. Records using prepare_caseops_data.py without --val-docs=50000 should be flagged pending re-run on clean data.
  2. The current merged SOTA (#1851 "SmearGate BOS Fix", val_bpb 1.06128, and its 3-seed compliance re-run #1868 at 1.06141) is clean and stands.
  3. The script's default should be updated to --val-docs=50000, or submissions should be required to use the canonical HF dataset.
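A submission harness could also enforce disjointness mechanically before scoring. A minimal sketch (function and argument names are ours, not an existing hook):

```python
# Hypothetical pre-scoring guard: fail fast if any val document id also
# appears in the train split.
def check_split_disjoint(train_doc_ids, val_doc_ids):
    """Raise if train and val share any documents; return True otherwise."""
    leaked = set(train_doc_ids) & set(val_doc_ids)
    if leaked:
        raise ValueError(f"{len(leaked)} val docs leaked into train")
    return True
```

Run against the default `--val-docs=10000` split, this check would have rejected every leaked record at submission time.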

Credit

Thanks to @aquariouseworkman who both identified and attempted to fix this issue (PR #1851, the first clean record post-leak), and who raised it with the community. Full audit with per-PR evidence: caseops-memory-leakage/ on the research branch of our fork.

@cocohearts @valerio-oai
