
Train/val data leakage in CaseOps records — prepare_caseops_data.py default overlaps 80% of val docs with training data #2127

@leon2k2k2k


Summary

This audit was conducted in collaboration with @andrewbaggio1. We audited all 34 CaseOps-lineage record-track PRs since #1729 (2026-04-18) for train/val data overlap. 15 have a systematic leak caused by the default behaviour of prepare_caseops_data.py. The current claimed frontier (#2118 at 1.04350) is one of them. The true clean frontier is #2014 at 1.05759.

This is posted to the best of our ability. If you are a PR author and believe your record has been misclassified, please reply and we will revisit.


The bug

prepare_caseops_data.py ships with:

SHARD_TOKENS = 10_000_000
parser.add_argument("--val-docs", default=10_000)

With the default --val-docs=10000, train shards start at canonical-stream document 10,000. Val is generated from the first 50,000 documents. Documents 10,000–49,999 (40,000 docs, 80% of val) appear in both train and val. The model partially memorizes them during training, then is scored on them as if they were held out.
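The overlap arithmetic can be sketched directly (an illustrative calculation, not the shipped script; the constant names are ours):

```python
# Illustrative reproduction of the leak arithmetic described above.
VAL_DOCS_DEFAULT = 10_000   # the --val-docs default every PR inherits
VAL_POOL = 50_000           # val is generated from the first 50,000 canonical docs

# Train shards start at canonical-stream document index VAL_DOCS_DEFAULT,
# so every document in [10_000, 50_000) appears in both train and val.
train_start = VAL_DOCS_DEFAULT
overlap_docs = VAL_POOL - train_start       # 40,000 leaked documents
overlap_frac = overlap_docs / VAL_POOL      # 0.8 -> 80% of val seen in training
```

With `--val-docs=50000`, `train_start` moves to 50,000 and the overlap collapses to zero.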

The canonical HF dataset (romeerp/parameter-golf-caseops-v1) does not have this problem — its manifest has docs_val=50000, docs_train=8,181,945 with a strict disjoint partition by construction.

Every shipped copy of prepare_caseops_data.py across all 34 PRs is byte-identical. No PR ever passes --val-docs=50000.
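The byte-identical claim is straightforward to re-verify with stdlib hashing; this is a hedged sketch, and the `prs/` glob path is hypothetical:

```python
# Sketch: confirm all shipped copies of prepare_caseops_data.py are identical
# by hashing their raw bytes. Paths are illustrative, not from the audit repo.
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# All copies are byte-identical iff the digests collapse to a single value:
# digests = {file_digest(p) for p in Path("prs").glob("*/prepare_caseops_data.py")}
# assert len(digests) == 1
```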

The gold-standard description of the leak is in PR #2018's DATASET_AUDIT.md, which explicitly documents --val-docs=10000 train shards and independently verifies the first 80 shards byte-for-byte.


How we classified each PR

We applied two questions to each PR's reproduce flow (README data-setup section + all shipped shell scripts):

  • Q1: Is there an explicit HF download command? (snapshot_download, cached_challenge_fineweb.py, huggingface-cli download — all targeting romeerp/parameter-golf-caseops-v1)
  • Q2: Is prepare_caseops_data.py invoked?
| Q1 | Q2 | Verdict |
|----|----|---------|
| Yes | No | CLEAN |
| No | Yes | LEAK |
| Yes | Yes | Check which is the real reproduce step |
| No | No | Check train log (train_shards: 39 → CLEAN; train_shards > 1000 → LEAK; triple-nested path → LEAK; otherwise AMBIGUOUS) |
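The decision procedure above can be expressed as a small function (a sketch in our own names, not the audit tooling):

```python
# Hypothetical encoding of the two-question classifier described above.
def classify(has_hf_download: bool, runs_prepare_script: bool) -> str:
    """Map (Q1, Q2) answers for a PR's reproduce flow to a verdict."""
    if has_hf_download and not runs_prepare_script:
        return "CLEAN"                 # canonical disjoint HF dataset only
    if runs_prepare_script and not has_hf_download:
        return "LEAK"                  # default --val-docs=10000 path only
    if has_hf_download and runs_prepare_script:
        return "CHECK_REPRODUCE_STEP"  # determine which step is actually run
    return "CHECK_TRAIN_LOG"           # fall back to train-log heuristics
```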

Results

| Bucket | Count | PRs |
|--------|-------|-----|
| CLEAN | 12 | #1729, #1851, #1855, #1868, #1945, #1953, #1967, #2014, #2019, #2027, #2031, #2068, #2123, #2124 |
| LEAK | 15 | #1736, #1769, #1797, #1923, #2007, #2018, #2060, #2071, #2078, #2100, #2101, #2109, #2118 |
| AMBIGUOUS | 6 | #1787, #1908, #2041, #2075, #2117, #2121 |

Updated timeline, reconstructed to the best of our ability

| Date | PR | val_bpb | Status |
|--------|-------|---------|--------|
| Apr 18 | #1729 | 1.06780 | CLEAN |
| Apr 23 | #1787 | 1.06335 | AMBIGUOUS, lean LEAK |
| Apr 27 | #1851 | 1.06128 | CLEAN |
| Apr 25 | #1855 | 1.0611 | CLEAN |
| Apr 28 | #1908 | 1.06081 | AMBIGUOUS, lean CLEAN |
| Apr 29 | #1868 | 1.06141 | CLEAN |
| Apr 29 | #1945 | 1.05943 | CLEAN |
| Apr 30 | #1953 | 1.05855 | CLEAN |
| Apr 30 | #1967 | 1.05851 | CLEAN |
| Apr 30 | #2014 | 1.05759 | CLEAN |

Recommendation

  1. Records using prepare_caseops_data.py without --val-docs=50000 should be flagged pending re-run on clean data.
  2. The current merged SOTA (#1851 "SmearGate BOS Fix", val_bpb 1.06128, and its 3-seed compliance re-run #1868 at 1.06141) is clean and stands.
  3. The script's default should be updated to --val-docs=50000, or submissions should be required to use the canonical HF dataset.
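A submission harness could also enforce disjointness mechanically before scoring. A minimal sketch (function and argument names are ours, not an existing hook):

```python
# Hypothetical pre-scoring guard: fail fast if any val document id also
# appears in the train split.
def check_split_disjoint(train_doc_ids, val_doc_ids):
    """Raise if train and val share any documents; return True otherwise."""
    leaked = set(train_doc_ids) & set(val_doc_ids)
    if leaked:
        raise ValueError(f"{len(leaked)} val docs leaked into train")
    return True
```

Run against the default `--val-docs=10000` split, this check would have rejected every leaked record at submission time.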

Credit

Thanks to @aquariouseworkman who both identified and attempted to fix this issue (PR #1851, the first clean record post-leak), and who raised it with the community. Full audit with per-PR evidence: caseops-memory-leakage/ on the research branch of our fork.

@cocohearts @valerio-oai
