
Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 — val_bpb 1.05855 (3-seed mean)#1953

Open
andrewbaggio1 wants to merge 2 commits into openai:main from andrewbaggio1:record/longctx-noqv-qk525-on-1945-1.0586

Conversation

@andrewbaggio1

Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 (val_bpb 1.05855)

val_bpb = 1.05855370 (3-seed mean, std 0.00029539) | max artifact 15,992,914 B | 8x H100 SXM | 600s train / 600s eval

Stacks four small, individually validated levers on the exact PR #1945 alertcat V21 record source (which is itself PR #1855 + PR #1908 AWQ-lite + PR #1923 Asymmetric Logit Rescale). Each lever was already measured on prior bases. The contribution here is the orthogonal stack and the production verification.

3-seed Results

| Seed | Stop step | Train ms | Pre-quant BPB | Quant no-TTT BPB | Post-TTT BPB | Eval s | Artifact bytes |
|------|-----------|----------|---------------|------------------|--------------|--------|----------------|
| 42   | 4895 | 595955 | 1.06163175 | 1.06993750 | 1.05824720 | 430.0 | 15,988,861 |
| 0    | 4896 | 596123 | 1.06196584 | 1.07029420 | 1.05846113 | 441.5 | 15,988,757 |
| 1234 | 4916 | 596130 | 1.06199757 | 1.07068689 | 1.05895276 | 513.1 | 15,992,914 |
| Mean | 4902 | 596069 | 1.06186505 | 1.07030620 | 1.05855370 | 461.5 | 15,990,177 |

Population std on final BPB: 0.00029539.

vs current rank 1 (PR #1855 at 1.06108): -0.00253 BPB.
vs PR #1945 reported mean (1.05943381): -0.00088 BPB.
vs merge bar (1.05893): -0.00038 BPB.
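The headline mean, population std, and the three deltas can be cross-checked directly from the per-seed table (a quick verification sketch; every input is a figure quoted above):

```python
import statistics

# Post-TTT BPB per seed, from the results table
post_ttt = {42: 1.05824720, 0: 1.05846113, 1234: 1.05895276}

mean = statistics.fmean(post_ttt.values())
std = statistics.pstdev(post_ttt.values())  # population std, as reported

# Deltas vs the reference points quoted above
vs_rank1 = 1.06108 - mean     # current rank 1, PR #1855
vs_1945 = 1.05943381 - mean   # PR #1945 reported mean
vs_bar = 1.05893 - mean       # merge bar
```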

All seeds clear the 600s train cap, 600s eval cap, and 16,000,000-byte artifact cap.

What changed vs PR #1945

Seven literal-constant changes, implementing four levers, on top of the exact alertcat V21 source. No new code paths, no new mechanisms, no architectural changes:

```
EVAL_SEQ_LEN          = 2560     # was 2048
TTT_EVAL_SEQ_LEN      = 2560     # was 2048
TTT_MASK              = no_qv    # was default (Q/V LoRA active)
TTT_Q_LORA            = 0        # disable Q LoRA in TTT
TTT_V_LORA            = 0        # disable V LoRA in TTT
TTT_LOCAL_LR_MULT     = 0.75     # was 1.0
QK_GAIN_INIT          = 5.25     # was 5.0
```

Everything else is verbatim PR #1945. AWQ-lite, Asymmetric Logit Rescale, CaseOps tokenizer, Polar Express NS, MIN_LR, fused softcapped CE, LQER asymmetric rank-4, sparse attention gate, BOS-fixed SmearGate, phased TTT (3 phases, 2500 prefix docs), per-group lrzip + brotli compression, GPTQ int6 + int7 embeddings.

Why each lever

Each lever was already publicly measured on a closely related base. None alone clears the merge bar. Combined on the PR #1945 base, they compose into a clearing stack.

EVAL_SEQ_LEN=2560 with TTT_MASK=no_qv: extends eval and TTT score-first context past 2048. The baseline #1855 measurement reported 2560 + no_qv at val_bpb 1.06109776 with 473.4s eval time, an improvement of about -0.00058 BPB vs the 2048 anchor. Legal under the 600s eval cap.

TTT_LOCAL_LR_MULT=0.75: scales local LoRA-TTT optimizer LR. The baseline #1855 sweep at 2560 no_qv showed 0.75 was the best multiplier in {0.50, 0.75, 1.00, 1.25, 1.50, 2.00} at val_bpb 1.06104597. Same direction holds here.

QK_GAIN_INIT=5.25: replaces the 5.0 default per-head learnable Q-gain initialization. The baseline #1855 measurement reported QK_GAIN_INIT=5.25 seed-1234 post-TTT at -0.00019364 vs 5.0. Train-time init only.
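Where exactly the gain enters is not spelled out in this PR; a toy sketch, assuming the learnable per-head gain multiplies the scaled Q·K dot product (the placement and head dim are illustrative, not taken from train_gpt.py):

```python
import math

QK_GAIN_INIT = 5.25  # the lever changed here (default was 5.0)

def qk_logit(q, k, gain=QK_GAIN_INIT):
    # Assumed placement: learnable per-head gain multiplying the
    # 1/sqrt(d) scaled dot-product attention logit.
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return gain * dot / math.sqrt(len(q))
```

At init this only rescales logits; since the gain is learnable, training can move it per head, which is why the change is train-time init only.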

Asymmetric Logit Rescale via PR #1945 / PR #1923: replaces the single logit_softcap=30.0 with two learnable scalars softcap_pos and softcap_neg, trained inside Phased TTT global SGD. PR #1945 finds Asym is positive when stacked with AWQ-lite due to better TTT recovery. Initialized at the symmetric value (30.0) so eval is identity at start. Inherited from PR #1945.
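PR #1923's actual implementation is not reproduced here; a minimal scalar sketch of the idea, assuming the caps apply tanh-style as with the usual logit softcap, shows why the symmetric init makes eval start identical to the baseline:

```python
import math

def softcap(x, cap=30.0):
    # Standard symmetric logit softcap: bounded in (-cap, cap).
    return cap * math.tanh(x / cap)

def asym_softcap(x, cap_pos=30.0, cap_neg=30.0):
    # Hypothetical form: positive logits bounded by cap_pos,
    # negative logits by cap_neg (both learnable scalars in TTT).
    cap = cap_pos if x >= 0 else cap_neg
    return cap * math.tanh(x / cap)
```

At the symmetric init (30.0, 30.0) this is exactly the baseline softcap, so the rescale only diverges from it once Phased TTT global SGD moves the two scalars apart.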

AWQ-lite mixed precision via PR #1945 / PR #1908: during GPTQ calibration, collect activation RMS per layer, select the most-salient 64-column group, keep that group at int8 inside the GPTQ solve. Inherited from PR #1945.
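The group-selection step can be sketched in a few lines (illustrative only; the 64-column group size and int8 promotion are from the description above, while the helper name and scoring detail are assumptions):

```python
def select_salient_group(act_rms, group_size=64):
    """Pick the column group with the highest mean activation RMS.

    act_rms: per-column activation RMS for one layer, collected during
    GPTQ calibration. Returns the index of the group to keep at int8
    inside the GPTQ solve; all other groups stay at the base precision.
    """
    groups = [act_rms[i:i + group_size]
              for i in range(0, len(act_rms), group_size)]
    scores = [sum(g) / len(g) for g in groups]
    return max(range(len(scores)), key=scores.__getitem__)
```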

Compliance (Issue #1017)

Reproduction

```bash
SEED=42 \
DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
CASEOPS_ENABLED=1 \
VOCAB_SIZE=8192 \
ITERATIONS=20000 \
MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_NUM_PHASES=3 \
PHASED_TTT_PREFIX_DOCS=2500 \
TTT_LORA_RANK=80 \
TTT_MASK=no_qv \
TTT_Q_LORA=0 \
TTT_V_LORA=0 \
TTT_LOCAL_LR_MULT=0.75 \
EVAL_SEQ_LEN=2560 \
TTT_EVAL_SEQ_LEN=2560 \
QK_GAIN_INIT=5.25 \
MATRIX_LR=0.026 \
MIN_LR=0.1 \
EMBED_BITS=7 \
MATRIX_CLIP_SIGMAS=12.85 \
ATTN_CLIP_SIGMAS=13.0 \
MLP_CLIP_SIGMAS=11.5 \
EMBED_CLIP_SIGMAS=14.0 \
GRAD_CLIP_NORM=0.3 \
FUSED_CE_ENABLED=1 \
SMEAR_GATE_ENABLED=1 \
GATE_WINDOW=12 \
SPARSE_ATTN_GATE_ENABLED=1 \
LQER_ENABLED=1 \
LQER_RANK=4 \
LQER_TOP_K=3 \
LQER_GROUP_SIZE=64 \
LQER_ASYM_ENABLED=1 \
LQER_ASYM_GROUP=64 \
AWQ_LITE_ENABLED=1 \
ASYM_LOGIT_RESCALE=1 \
GPTQ_RESERVE_SECONDS=4.0 \
GPTQ_CALIBRATION_BATCHES=16 \
COMPRESSOR=pergroup \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Repeat for SEED=0 and SEED=1234.

Lineage

This stands on a long chain of prior submissions. The four added levers and the PR #1945 core are all from public PRs:

The diagnostic context (long-context score gating at 2560, no_qv mask, TTT_LOCAL_LR_MULT sweep, QK_GAIN_INIT sweep) was originally measured on exact #1855 in private experiments before this stack. None alone cleared the merge bar on #1855. The contribution here is recognizing that they compose orthogonally on the PR #1945 base.

Files

…T LR 0.75 + QK_GAIN 5.25

3-seed mean val_bpb 1.05855370 (std 0.00029539). Clears merge bar
1.05893 by -0.00038 BPB. Improves on PR openai#1855 (1.06108) by -0.00253 BPB.

All seeds under 16 MB artifact, 600s train cap, 600s eval cap.
@romeerp
Contributor

romeerp commented Apr 30, 2026

I played around with long context during eval time earlier, but tried more complicated things involving dynamic selection that never made it in. It's cool to see that just changing the eval sequence length hyperparameter was able to improve performance without it ever appearing in training. Really nice result.

TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request Apr 30, 2026
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
- Pull PR openai#1953's record dir (train_gpt.py + README) from openai/parameter-golf
- Phase U script combines retokenize SP8192 (no HF dataset cached) + 3-seed
  train using PR openai#1953's exact stack with one new lever: PREFIX_DOCS 2500 -> 2800
- Goal: clear record bar (1.05914) by stacking on top of openai#1953's 1.05855
- Robust: heartbeat, GPU keepalive, no trap-on-EXIT, per-seed HF upload,
  apt-installs lrzip for pergroup compressor
- Aborts seed-1 on BPB > 1.060 OR artifact > 16MB

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aerosta pushed a commit to aerosta/parameter-golf that referenced this pull request Apr 30, 2026
3-seed mean val_bpb = 1.05851479 (std 0.000762, seeds 42/0/1234) on track_10min_16mb.

Stack:
- PR openai#1945 (alertcat) V21 base = PR openai#1908 + AWQ-Lite + AsymLogit Rescale
- PR openai#1953 (andrewbaggio1) TTT/QK env knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask)
- PR openai#1948 (TimS-ml + lijuncheng16) LeakyReLU squared slope 0.3
- PR openai#1145 (AnirudhRahul, valerio-endorsed) closed-form n-gram tilt with Σ P=1 Z renormalization

Compliance: causal hints, single-pass, Σ P=1 by construction, no SLOT,
no n-gram cache, no Pre-Quant TTT.

System deps: gcc + lrzip auto-installed by setup.sh; PyTorch 2.9.1 + Triton + Flash Attn 3.

One-command reproduction:
  bash setup.sh
  SEED={42,0,1234} bash run.sh
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 30, 2026
Hypothesis: PR openai#1953 verbatim + LeakyReLU2 slope 0.5->0.3 lands at
~1.0578 (3-seed). Skip 2xH100 mini (5-site numeric flip on validated
commit). Ladder = 8xH100 official direct, 3 seeds (42, 0, 1234).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
…d-attempt arms

Phase V: TTT_OPTIMIZER=muon for LoRA SGD, swap from default AdamW. New env vars
TTT_MUON_LR_MULT (default 8x adam) and TTT_MUON_BACKEND_STEPS (default 5).
Hypothesis: Newton-Schulz orthogonalized momentum better suits low-rank LoRA.

Phase W: TTT_LORA_RANK 80->96, PHASED_TTT_NUM_PHASES 3->4, PHASED_TTT_PREFIX_DOCS 2500->2000
on PR openai#1953 stack. Hypothesis: more LoRA capacity + extra phase boundary captures more
cross-doc structure. Artifact-cap risk noted; seed-1 abort guards in script.

Both phases run openai#1953 base + their respective lever changes only. Robust pattern
(heartbeat, GPU keepalive, no trap-on-EXIT, per-seed HF upload) preserved from
Phase U / Phase S debugging.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 30, 2026
250D: PHASED_TTT_PREFIX_DOCS 2500 -> 3000 (~+80s, 6s cap margin — risky)
250E: TTT_LOCAL_LR_MULT 0.75 -> 0.65 (compute-neutral, sub-step around openai#1953 optimum)

Both eval-only via TTT_EVAL_ONLY=1 + RESUME_FROM_CKPT on spec 250's
final_model.pt. ~9-15 USD per spec for 3 seeds. No code change, no retrain.

Family complete: 250B (PREFIX 2750), 250C (PHASES 4), 250D (PREFIX 3000),
250E (LR_MULT 0.65). Independent eval-only sweeps; do not stack.
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 30, 2026
Final attempt to overtake PR openai#1953 (1.05855) and PR openai#1967 (1.05851).

Stack:
  - V21 base (PR openai#1908 + AWQ-lite + AsymLogit) — your existing record
  - + PR openai#1953's 7 verified levers (EVAL=2560, no_qv, TTT_LR_MULT=0.75, QK_GAIN=5.25)
  - + EVAL_SEQ_LEN=2816 (intermediate safe value, ~5% eval timing risk)
  - All other hparams identical to V21

Safety: EVAL_SEQ_LEN=2816 vs PR openai#1953's 2560 = ~10% eval time penalty.
Expected eval times: 470s/485s/564s (PR openai#1953 was 430/441/513).
Seed 1234 has thinnest margin (564s of 600s cap = 36s buffer).

Expected V22 BPB: 1.0578-1.0586 (3-seed mean)
P(beat PR openai#1953 1.05855): ~50%
P(beat PR openai#1967 1.05851): ~30-35% (timing-pending PR ahead)
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 30, 2026
…mean 1.05877

Layers PR openai#1953 (@andrewbaggio1)'s 7 hparam levers (TTT_MASK=no_qv,
TTT_Q_LORA=0, TTT_V_LORA=0, TTT_LOCAL_LR_MULT=0.75, QK_GAIN_INIT=5.25,
EVAL_SEQ_LEN, TTT_EVAL_SEQ_LEN) on top of V21 v2 base (PR openai#1908 + AWQ-lite
+ Asymmetric Logit Rescale + WD=2.0). EVAL_SEQ_LEN raised from PR openai#1953's
2560 to 2816 for longer eval context.

3-seed mean 1.05877 (std 0.00102), all strict <600s train wallclock
(596.087-596.152s) and 475-522s eval. Improvement over V21 v2 mean 1.05943
is -0.00066 BPB (matches community 0.0006 floor for meaningful delta).

Run on Hyperbolic eu-north-4 Iceland VM (8xH100 SXM5 80GB, PyTorch
2.9.1+cu128 with CUDA 13 forward-compat driver 580).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request Apr 30, 2026
# Record: PR openai#1953 stack — no_qv TTT + AWQ-lite + AsymLogit + long-context eval

**val_bpb = 1.05847** (3-seed mean, std 0.00063) | **max artifact 15,985,934 bytes** | 8x H100 SXM | strict 600s train + eval

## Results

| Seed | Stop step | Train time | Pre-quant BPB | Quantized BPB | **Post-TTT BPB** | Eval time | Artifact bytes |
|------|-----------|------------|---------------|---------------|------------------|-----------|----------------|
| 42   | 4892      | 595.97s ✅ | 1.06126       | 1.06962       | **1.05788**      | 493.2s ✅ | 15,979,342     |
| 0    | 4884      | 595.97s ✅ | 1.06181       | 1.07019       | **1.05840**      | 420.5s ✅ | 15,979,187     |
| 1234 | 4894      | 596.14s ✅ | 1.06232       | 1.07093       | **1.05914**      | 428.4s ✅ | 15,985,934     |
| **Mean** | **4890** | **596.03s** | **1.06180** | **1.07025** | **1.05847** | **447.4s** | **15,981,488** |

vs merged PR openai#1855 (1.06108): **-0.00261 BPB / -0.00571 nats**

## Stack

Inherits the full PR openai#1855 base (codemath3000) and layers:

1. **AWQ-lite mixed-precision GPTQ** (PR openai#1908, romeerp) — activation-aware salient-group int8 promotion
2. **Asymmetric Logit Rescale** (PR openai#1923, jorge-asenjo) — learnable pos/neg softcap during TTT eval
3. **no_qv TTT mask** (PR openai#1953, himanshudongre) — disable Q/V LoRA in TTT, keep K/MLP/O
4. **TTT_LOCAL_LR_MULT=0.75** — scaled TTT optimizer LR
5. **QK_GAIN_INIT=5.25** — per-head Q-gain initialization
6. **EVAL_SEQ_LEN=2560** — extended eval context
7. **PHASED_TTT_PREFIX_DOCS=3000** — larger global-TTT prefix
8. **TTT_LORA_RANK=56** — reduced LoRA rank (compute reallocation)

## Compliance

- [x] Artifact under 16,000,000 bytes (max 15,985,934)
- [x] Train wallclock under 600s (max 596.14s)
- [x] Eval wallclock under 600s (max 493.2s)
- [x] No PPM, no SLOT, no pre-quant TTT, no n-gram cache
- [x] Single left-to-right pass, score-before-update
- [x] Full normalized softmax distribution

## Reproduction

```bash
apt-get install -y lrzip
pip install sentencepiece brotli huggingface_hub numpy python-minifier
pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Dataset
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', repo_type='dataset', local_dir='/workspace/caseops_data')
"

# Run
for SEED in 42 0 1234; do
  SEED=$SEED \
  DATA_PATH=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 VOCAB_SIZE=8192 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 FUSED_CE_ENABLED=1 QK_GAIN_INIT=5.25 \
  EMBED_BITS=7 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
  GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 COMPRESSOR=pergroup \
  LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
  AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=3000 \
  TTT_LORA_RANK=56 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 TTT_LOCAL_LR_MULT=0.75 \
  TTT_CHUNK_SIZE=48 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 \
  EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
  WARMDOWN_FRAC=0.85 BETA2=0.99 GRAD_CLIP_NORM=0.3 MIN_LR=0.1 MATRIX_LR=0.026 \
  NCCL_NET=Socket GLOBAL_TTT_MOMENTUM=0.9 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
Pull PR openai#2014's record dir from openai/parameter-golf and reproduce its 1.05759
3-seed mean. Key new levers vs openai#1953: EVAL_SEQ_LEN=3072, train_seq_schedule
1024->2048->3072, single-phase TTT (NUM_PHASES=1, PREFIX=2500), short-doc
score-first chunking (TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24).

Even with our infra's ~1.5-2 milli-BPB inflation pattern, reproducing openai#2014
should land ~1.0590 — close enough to record bar to potentially clear it.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 1, 2026
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108),
p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under
the 600s wallclock budget.

Per-seed:
- 42:   ttt=1.05793  art=15,986,149  eval=572.6s
- 314:  ttt=1.05852  art=15,987,257  eval=553.7s
- 1234: ttt=1.05849  art=15,989,895  eval=574.1s

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/
contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a
detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923
-> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition
C1-C4 legality check. submission.json author/github_id are placeholders pending
the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single
8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769openai#1787openai#1797openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance
and path heuristics, applied stricter criterion: a LEAK verdict requires at
least one of (a) explicit shell-script invocation of prepare_caseops_data.py
without --val-docs=50000, (b) README "Data setup" matching actual train log
path, (c) audit/submission.json admission text, (d) train log path with
`_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>`
(which only local prep produces; HF always gives double-nesting).

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS
unless they meet at least one of those tests.

Changes:
  - openai#1945 LEAK → CLEAN  (finalize_v18.sh has snapshot_download from HF;
    actual run path matches HF target; README's prepare_caseops_data.py
    section is stale documentation)
  - openai#1953 LEAK → AMBIGUOUS  (PR ships only train_gpt.py + logs; no prep
    evidence; path matches HF target; parent openai#1945 confirmed CLEAN —
    leans CLEAN but no direct PR evidence)
  - openai#2041 LEAK → AMBIGUOUS  (no prep invocation; double-nested path
    consistent with EITHER HF or local prep)
  - openai#2075 LEAK → AMBIGUOUS  (ships prep file but no explicit invocation;
    path matches HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: realistic clean SOTA is at most ~0.012 bpb below the
claimed frontier openai#2118 (1.04350). Best clean BPB candidates in order:
  openai#2019 1.05847 (HF, confirmed)
  openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
  openai#1945 1.05943 (HF, confirmed via re-audit)
  openai#2031 1.05985 (HF, confirmed)
  openai#1908 1.06081 (HF, confirmed)
  openai#1851 1.06128 (HF, MERGED SOTA)
