# Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB) #2050
AidenGeunGeun wants to merge 5 commits into openai:main from …
## Engineering rationale for this submission

I am adding this comment so reviewers do not have to reconstruct the reasoning chain from the full engineering log.

### 1. Frozen starting point

PR #1915 remains the conservative anchor and was not modified here. It used the SP8192 CaseOps / PR #1855-style legal frontier stack with per-document score-first LoRA TTT and produced the baseline result this PR starts from.
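For concreteness, here is a minimal sketch of per-document score-first LoRA TTT as described above. This is an assumed shape, not the PR's code: `load_lora`, `lora_parameters`, and `chunks` are hypothetical hooks, and the optimizer and LR are placeholders.

```python
# Sketch only: per-document score-first LoRA TTT. Each chunk is scored BEFORE
# the model adapts on it, and LoRA state is reset between documents.
import math
import torch
import torch.nn.functional as F

def eval_per_doc_ttt(model, documents, lora_init, ttt_lr=1e-4):
    total_nats, total_bytes = 0.0, 0
    for doc in documents:
        model.load_lora(lora_init)                        # fresh LoRA state per doc
        opt = torch.optim.SGD(model.lora_parameters(), lr=ttt_lr)
        for x, y, nbytes in doc.chunks():                 # (inputs, targets, bytes)
            with torch.no_grad():                         # score-first: no adaptation
                total_nats += F.cross_entropy(model(x), y, reduction="sum").item()
                total_bytes += nbytes
            opt.zero_grad()                               # then adapt on that chunk
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return total_nats / (math.log(2) * total_bytes)       # bits per byte
```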
This PR is a separate follow-up folder so the clean anchor remains intact.

### 2. What was tried and closed

Several plausible mechanisms were tested and closed before selecting this one. The final method was chosen by elimination, not by piling on knobs.
The one small stable positive setting was lower eval-time TTT LR, improving all three seeds by a similar small margin.

### 3. Same-execution counters

One important lesson from the context-mixer work was that separate eval-time TTT runs are not reliable for tiny BPB deltas. Fresh per-document LoRA state, BF16/fused kernels, and distributed scheduling can move otherwise identical runs at the scale of these deltas.

For scoring-only transforms, I therefore used same-execution counters: same logits, same TTT trajectory, same document order, same hints, same token/byte accounting. That is why the small Adaptive Hedge deltas are meaningful.

### 4. Token-level causal n-gram tilt

The main late signal came from a normalized token-level causal n-gram tilt over the official SP8192 alphabet. For a strict-prefix hint, the hinted token's logit is boosted and the result renormalized, which keeps the scored distribution normalized over SP8192. The n-gram state is strict-prefix only and updates after the current token is scored. It is not byte PPM and it does not change the tokenizer.
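A minimal sketch of the overlay just described, assuming an additive logit boost on a single hinted token followed by renormalization; the class name, n-gram order, and boost value are illustrative rather than the PR's actual code.

```python
# Minimal sketch of a strict-prefix causal n-gram tilt (assumed form: additive
# logit boost on the hinted token, then renormalize over the full vocabulary).
from collections import defaultdict

import torch.nn.functional as F

class CausalNgramTilt:
    def __init__(self, order=3, beta=1.0):
        self.order = order
        self.beta = beta
        # strict-prefix counts: context tuple -> {next_token: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def hint(self, prefix):
        """Most frequent continuation of the current context, or None."""
        ctx = tuple(prefix[-(self.order - 1):])
        nxt = self.counts.get(ctx)
        return max(nxt, key=nxt.get) if nxt else None

    def score(self, logits, prefix):
        """Tilted log-distribution; still normalized over SP8192."""
        h = self.hint(prefix)
        if h is not None:
            logits = logits.clone()
            logits[h] += self.beta          # pay the normalizer for the boost
        return F.log_softmax(logits, dim=-1)

    def update(self, prefix, token):
        """Called only AFTER `token` is scored, so the hint stays strict-prefix."""
        ctx = tuple(prefix[-(self.order - 1):])
        self.counts[ctx][token] += 1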
The first full seed42 validation showed the first effect large enough to justify final packaging work.

### 5. Why Adaptive-Beta Hedge

A fixed boost strength can overpay the normalizer on weak hints and underboost on strong hints. Adaptive-Beta Hedge behaves like a small online universal code over boost temperatures. Across multiple same-execution settings, Hedge improved fixed n-gram by almost the same amount.
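As a concrete reading of "online universal code over boost temperatures", here is a minimal sketch assuming Hedge (exponential weights) over a small grid of boost strengths, each expert being the tilt at one fixed beta. The grid, the learning rate `eta`, and the class name are illustrative, not the PR's implementation.

```python
# Minimal sketch: Hedge over a grid of boost strengths. Each expert tilts the
# logits with one fixed beta; experts are reweighted by their per-token loss.
import torch
import torch.nn.functional as F

class AdaptiveBetaHedge:
    def __init__(self, betas=(0.0, 0.5, 1.0, 2.0), eta=1.0):
        self.betas = betas
        self.log_w = torch.zeros(len(betas))   # uniform prior over betas
        self.eta = eta
        self.log_p = None                      # per-expert log-probs of last step

    def score(self, logits, hint):
        per_beta = []
        for b in self.betas:
            z = logits.clone()
            if hint is not None:
                z[hint] += b                   # expert-specific boost
            per_beta.append(F.log_softmax(z, dim=-1))
        self.log_p = torch.stack(per_beta)                 # [n_betas, vocab]
        w = F.log_softmax(self.log_w, dim=-1)              # normalized log-weights
        return torch.logsumexp(w[:, None] + self.log_p, dim=0)  # mixture log-probs

    def update(self, token):
        # Hedge step: weight *= exp(-eta * loss), with loss = -log p_expert(token).
        self.log_w = self.log_w + self.eta * self.log_p[:, token]
```

With `eta = 1` this reduces to a Bayesian mixture over the beta grid, which is the universal-code reading: its cumulative log-loss is within log(number of betas) of the best fixed beta in hindsight.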
That consistency is the main reason I trust the mechanism: Hedge appears to correct a systematic n-gram boost calibration error rather than exploit a one-off seed42 trajectory.

Limits of the claim: the broader cross-base table is seed42-only per base. The selected family has the three-seed evidence included in this PR.

### 6. Why 2560 context + no Q/V LoRA

I separated scoring transforms from trajectory changes. The n-gram/Hedge overlay changes the scoring distribution; context length and LoRA target branches change the eval-time TTT trajectory. The strongest local trajectory interaction was the 2560-context, no-Q/V-LoRA branch.
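Illustrative only: one way the selected branch could be expressed as a configuration. Module names, rank, and LR are placeholders, not the PR's actual settings.

```python
# Hypothetical config for the selected trajectory branch.
ttt_config = {
    "context_len": 2560,                       # the selected longer eval context
    "lora_rank": 8,                            # placeholder rank
    "lora_targets": ["attn.out_proj", "mlp.fc_in", "mlp.fc_out"],
    # q_proj / v_proj deliberately absent ("no Q/V"): this changes only the
    # eval-time TTT trajectory, never the scoring transform.
    "eval_ttt_lr": 1e-4,                       # placeholder eval-time TTT LR
}
```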
That interaction showed the trajectory change and the scoring transform were mostly complementary, so this combination was selected for final packaging.

### 7. Final package proof

The final package is self-contained and under cap.
The final eval path uses a validation-only data view.
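A hypothetical sketch of what a validation-only data view means here: the eval path is handed only shards the manifest marks as validation, so train shards cannot be scored. The field names are assumptions, not the contents of `eval_data_manifest.json`.

```python
# Sketch: restrict the eval path to validation shards only.
import json

def validation_view(manifest_path):
    manifest = json.load(open(manifest_path))
    return [shard for shard in manifest["shards"] if shard["split"] == "val"]
```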
The main runtime optimization was implementation-only. The previous path computed CE and then separately computed a full log-softmax for the n-gram hint probability. The optimized path reuses the log-softmax already computed for CE to read off the hint probability. This removes a duplicate vocabulary-wide normalization without changing scoring constants or legality.
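A minimal sketch of the reuse, assuming per-position logits; the function name is illustrative.

```python
# One log_softmax serves both the CE accounting and the hint probability,
# removing the duplicate vocabulary-wide normalization.
import torch.nn.functional as F

def score_step(logits, target, hint):
    logp = F.log_softmax(logits, dim=-1)   # single full-vocabulary normalization
    ce_nats = -logp[target]                # cross-entropy term (official accounting)
    hint_logp = logp[hint] if hint is not None else None   # read off, not recomputed
    return ce_nats, hint_logp
```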
### 8. Final per-seed evidence

All three runs use exact official accounting.
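For reference, the accounting reduces to the standard bits-per-byte definition, consistent with the scored-token / scored-byte denominators described above:

```math
\mathrm{BPB} \;=\; \frac{\sum_{i \in \text{scored tokens}} -\ln p\!\left(t_i \mid t_{<i}\right)}{(\ln 2)\cdot B_{\text{scored}}}
```

where \(B_{\text{scored}}\) is the total number of scored bytes.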
The 3-seed mean is …
## Supplemental study: Adaptive-Beta Hedge transfer screen on a PR #1934-like trajectory

After preparing this PR, I ran one additional isolated transfer screen to test whether the strict-prefix token n-gram + Adaptive-Beta Hedge overlay was specific to this submission's base, or whether it behaves like a reusable scoring calibration layer.

This is score evidence only. It is not a new submission claim, and it is not a faithful reproduction of PR #1934. I am documenting it here because the paired result is useful evidence about the mechanism.

### Question

The final method in this PR uses a normalized causal token n-gram overlay with Adaptive-Beta Hedge. Across the experiments in this PR, Hedge repeatedly improved the fixed n-gram tilt by about the same margin. The question was whether that same Hedge-over-fixed gain appears on a different base trajectory.
### Source basis

The transfer screen used PR #1934 source as the starting point.
The overlay added the same style of scoring counters used in this PR: a base counter, a fixed n-gram tilt counter, and an Adaptive-Beta Hedge counter, as sketched below.
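A one-pass sketch of the paired counters, reusing the `AdaptiveBetaHedge` shape from earlier; the stream format and the fixed `BETA` are assumptions for illustration.

```python
# One-pass paired counters (sketch): base, fixed-beta tilt, and adaptive Hedge
# are scored from the SAME logits stream, so their BPB deltas are paired.
import math
import torch.nn.functional as F

BETA = 1.0  # illustrative fixed boost strength

def paired_counters(step_stream, hedge):
    """`step_stream` yields (logits, target, hint, token_bytes) from one eval pass;
    `hedge` is an AdaptiveBetaHedge-style object as sketched earlier."""
    nats = {"base": 0.0, "ngram": 0.0, "hedge": 0.0}
    total_bytes = 0
    for logits, target, hint, token_bytes in step_stream:
        nats["base"] += -F.log_softmax(logits, dim=-1)[target].item()
        tilted = logits.clone()
        if hint is not None:
            tilted[hint] += BETA                          # fixed n-gram tilt
        nats["ngram"] += -F.log_softmax(tilted, dim=-1)[target].item()
        nats["hedge"] += -hedge.score(logits, hint)[target].item()
        hedge.update(target)                              # advance the mixture state
        total_bytes += token_bytes
    return {k: v / (math.log(2) * total_bytes) for k, v in nats.items()}
```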
All three counters were computed in the same execution, using the same logits, TTT trajectory, document order, and scored-token and scored-byte denominators.

### Important deviations from PR #1934

This was deliberately a fast transfer screen, not a full PR #1934 audit or reproduction. The run differs from the public PR #1934 setup in important ways.
Because of these deviations, the result below should be read as a mechanism transfer screen, not as "PR #1934 + Hedge" record evidence.

### Paired same-execution results

The screen was a successful eval-only transfer run from the saved artifact, at seed42, in a memory-safe configuration.
The screen used exact official accounting.
### Why this is interesting

The absolute BPB is not the main finding, because the base trajectory was weaker than PR #1934's reported seed42 run. The important signal is the paired overlay gain.
The Hedge-over-fixed gain is very close to the gains observed in this PR's own experiments.
This supports the interpretation that Adaptive-Beta Hedge is correcting a fairly stable n-gram boost calibration error, rather than exploiting a one-off seed or one particular model trajectory.

### Why I am not submitting this as a separate result

The transfer screen is not submission-ready.
So the classification is: supplemental mechanism evidence, not a submission claim.
### Takeaway

This screen does not create a new record claim. It does add useful evidence for the mechanism behind this PR.
If there were more time, the proper hardening path would be a faithful PR #1934 reproduction with paired overlay counters across multiple seeds.
I am leaving it as a documented supplemental study rather than turning it into a formal submission.
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:

- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier, 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):

- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- No PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is a gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:

- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851 / openai#1868 at 1.06128 / 1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 BPB gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:

- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
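A small illustrative check mirroring the audit's "CLEAN by construction" argument: the pinned manifest's split sizes must partition the corpus, so no val doc can also be a train doc. Field names and the corpus total are assumptions, not the actual manifest schema.

```python
# Hypothetical manifest check for the pinned HF dataset.
import json

def check_manifest(path):
    m = json.load(open(path))
    assert m["docs_val"] == 50_000
    assert m["docs_train"] == 8_181_945
    # 50_000 + 8_181_945 == 8_231_945: the splits sum to the full corpus,
    # which is the audit's "sums match" condition.
    assert m["docs_val"] + m["docs_train"] == 8_231_945
```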
This is a separate follow-up candidate to PR #1915. PR #1915 remains untouched as the conservative 3-seed anchor.
## Headline
`33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41`

## Per-seed package proofs
## Method
## Compliance notes
- `train_gpt.py` wrapper

## Reviewer guide
- `README.md` contains the concise result/package/runtime summary.
- `ENGINEERING_LOG.md` contains the professional engineering record: starting point, closed mechanisms, same-execution counter methodology, n-gram/Adaptive Hedge math, trajectory interactions, and final runtime/package proof.
- `submission.json`, `package_size.json`, and `eval_data_manifest.json` provide machine-readable metadata.
- `train_seed42.log`, `train_seed0.log`, and `train_seed1234.log` contain the exact validation proof logs.

The intended framing is: a seed42 record-track proof with three under-600 seed proofs for reproducibility, while explicitly not overclaiming that the 3-seed mean beats the displayed leaderboard mean.