
Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193 (3-seed mean)#1626

Merged
cocohearts merged 1 commit into openai:main from dexhunter:dexhunter/multiphase-sgd-ttt on Apr 29, 2026

Conversation

@dexhunter
Contributor

Summary

Results

| Seed | Post-TTT BPB | val_loss (nats) | Artifact size (bytes) |
|------|--------------|-----------------|------------------------|
| 42   | 1.07280      | 2.77116         | 15,932,897             |
| 0    | 1.07134      | 2.76739         | 15,939,841             |
| 1234 | 1.07164      | 2.76815         | 15,932,419             |
| Mean | 1.07193      | 2.76890         |                        |
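As a sanity check on the table, BPB and nats are linked through the val set's mean bytes per token; back-solving from any row gives roughly 3.7266 bytes/token (an inferred constant, not something this PR states). A minimal sketch:

```python
import math

# Inferred from the Mean row: bytes/token = nats / (ln 2 * bpb) ≈ 3.7266.
# Assumption: this ratio is a property of the val set, constant across seeds.
BYTES_PER_TOKEN = 2.76890 / (math.log(2) * 1.07193)

def nats_to_bpb(val_loss_nats: float) -> float:
    """Convert mean validation loss (nats/token) to bits per byte."""
    return val_loss_nats / (math.log(2) * BYTES_PER_TOKEN)

# Each seed row reproduces to within ~1e-5 BPB, e.g. seed 42:
assert abs(nats_to_bpb(2.77116) - 1.07280) < 1e-4
```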

Key Innovation

Multi-phase global SGD: instead of a single SGD round on the prefix docs (PR #1610), we split evaluation into 3 phases — score a chunk, run SGD on it, then score the next chunk with the improved model. This progressively adapts the base model while preserving strict score-before-update legality. The 3-phase schedule improves val BPB by 0.0008 over single-phase.
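A minimal sketch of the phase loop, assuming a PyTorch decoder that returns next-token logits; `doc_loss`, the learning rate, and the per-chunk update window are illustrative assumptions, not this PR's exact code:

```python
import torch
import torch.nn.functional as F

def doc_loss(model, tokens):
    """Mean next-token cross-entropy (nats/token) for one tokenized doc."""
    logits = model(tokens[:-1].unsqueeze(0))           # [1, T-1, vocab]
    return F.cross_entropy(logits.squeeze(0), tokens[1:])

def multi_phase_ttt(model, val_docs, n_phases=3, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    chunk = (len(val_docs) + n_phases - 1) // n_phases
    total, n = 0.0, 0
    for start in range(0, len(val_docs), chunk):
        docs = val_docs[start:start + chunk]
        # 1) Score first, with frozen weights, so no doc's score reflects
        #    an update that has already seen that doc.
        model.eval()
        with torch.no_grad():
            for t in docs:
                total += doc_loss(model, t).item() * (len(t) - 1)
                n += len(t) - 1
        # 2) Update after: one SGD round over the just-scored chunk, so the
        #    NEXT chunk is scored by a progressively adapted model.
        #    (Updating on all docs scored so far would be an equally legal
        #    variant; the exact window here is an assumption.)
        model.train()
        for t in docs:
            opt.zero_grad()
            doc_loss(model, t).backward()
            opt.step()
    return total / n  # mean val loss in nats/token
```

The legality invariant is simply that step 1 for a chunk completes before step 2 touches it, which is what "score-before-update" ordering means below.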

Test plan

  • Verify 3-seed mean and std
  • Check artifact sizes < 16 MB
  • Verify score-before-update ordering in TTT logs (log-check sketch below)
  • Check code consistency across seeds
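A sketch of the ordering check from the test plan, assuming the TTT logs emit lines like `SCORE doc=<id>` / `UPDATE doc=<id>` (a hypothetical format; adapt the regex to the real log schema):

```python
import re
import sys

def check_score_before_update(log_path: str) -> bool:
    """Fail if any doc receives an UPDATE before (or without) a SCORE.
    Assumes 'SCORE doc=123' / 'UPDATE doc=123' lines (hypothetical format)."""
    scored = set()
    pattern = re.compile(r"\b(SCORE|UPDATE) doc=(\d+)")
    with open(log_path) as f:
        for lineno, line in enumerate(f, 1):
            m = pattern.search(line)
            if not m:
                continue
            event, doc = m.groups()
            if event == "SCORE":
                scored.add(doc)
            elif doc not in scored:
                print(f"{log_path}:{lineno}: doc {doc} updated before scoring")
                return False
    return True

if __name__ == "__main__":
    sys.exit(0 if check_score_before_update(sys.argv[1]) else 1)
```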

Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193 (3-seed mean)

Novel multi-phase global SGD during phased TTT evaluation.
Builds on the phased TTT concept from PR openai#1530 (@samacqua) and PR openai#1610 (@romeerp).
3-seed mean: 1.07193 BPB (2.76890 nats), std 0.00063.
Seeds: 42, 0, 1234. All artifacts <16 MB.
@romeerp
Contributor

romeerp commented Apr 15, 2026

Wanted to implement this multi-phased strategy but didn't have compute to run tests for it. Glad you were able to do it and show improvement!

cocohearts merged commit 5c8e045 into openai:main Apr 29, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims; a hash-audit sketch follows this list):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"
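The byte-identity and override claims above are mechanically checkable. A sketch, assuming each audited PR is checked out under prs/<number>/ (a hypothetical layout, not this commit's actual tooling):

```python
import hashlib
import pathlib

def audit(prs_root: str = "prs") -> None:
    """Hash every shipped prepare_caseops_data.py and flag --val-docs
    overrides in shell scripts, per PR checkout directory."""
    hashes = set()
    for pr_dir in sorted(pathlib.Path(prs_root).iterdir()):
        # Byte-identity check: SHA-256 of each shipped prepare script.
        for script in pr_dir.rglob("prepare_caseops_data.py"):
            hashes.add(hashlib.sha256(script.read_bytes()).hexdigest())
        # Override check: any .sh file passing --val-docs explicitly.
        for sh in pr_dir.rglob("*.sh"):
            if "--val-docs" in sh.read_text(errors="ignore"):
                print(f"override found: {sh}")
    print(f"{len(hashes)} distinct prepare_caseops_data.py hash(es)")

if __name__ == "__main__":
    audit()
```

A single reported hash and zero override hits would reproduce the "byte-identical, no overrides" finding.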

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
