# Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB) #2050
AidenGeunGeun wants to merge 5 commits into openai:main from …
## Engineering rationale for this submission

I am adding this comment so reviewers do not have to reconstruct the reasoning chain from the full engineering log.

### 1. Frozen starting point

PR #1915 remains the conservative anchor and was not modified here. It used the SP8192 CaseOps / PR #1855-style legal frontier stack with per-document score-first LoRA TTT and produced the baseline result this PR starts from.
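For concreteness, here is a minimal sketch of per-document score-first LoRA TTT as described above. This is an assumed shape, not the PR's code: `load_lora`, `lora_parameters`, and `chunks` are hypothetical hooks, and the optimizer and LR are placeholders.

```python
# Sketch only: per-document score-first LoRA TTT. Each chunk is scored BEFORE
# the model adapts on it, and LoRA state is reset between documents.
import math
import torch
import torch.nn.functional as F

def eval_per_doc_ttt(model, documents, lora_init, ttt_lr=1e-4):
    total_nats, total_bytes = 0.0, 0
    for doc in documents:
        model.load_lora(lora_init)                        # fresh LoRA state per doc
        opt = torch.optim.SGD(model.lora_parameters(), lr=ttt_lr)
        for x, y, nbytes in doc.chunks():                 # (inputs, targets, bytes)
            with torch.no_grad():                         # score-first: no adaptation
                total_nats += F.cross_entropy(model(x), y, reduction="sum").item()
                total_bytes += nbytes
            opt.zero_grad()                               # then adapt on that chunk
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return total_nats / (math.log(2) * total_bytes)       # bits per byte
```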
This PR is a separate follow-up folder so the clean anchor remains intact.

### 2. What was tried and closed

Several plausible mechanisms were tested and closed before selecting this one. The final method was chosen by elimination, not by piling on knobs.
The one small stable positive setting was lower eval-time TTT LR, improving all three seeds by a similar small margin.

### 3. Same-execution counters

One important lesson from the context-mixer work was that separate eval-time TTT runs are not reliable for tiny BPB deltas. Fresh per-document LoRA state, BF16/fused kernels, and distributed scheduling can move otherwise identical runs at the scale of these deltas.

For scoring-only transforms, I therefore used same-execution counters: same logits, same TTT trajectory, same document order, same hints, same token/byte accounting. That is why the small Adaptive Hedge deltas are meaningful.

### 4. Token-level causal n-gram tilt

The main late signal came from a normalized token-level causal n-gram tilt over the official SP8192 alphabet. For a strict-prefix hint, the hinted token's logit is boosted and the result renormalized, which keeps the scored distribution normalized over SP8192. The n-gram state is strict-prefix only and updates after the current token is scored. It is not byte PPM and it does not change the tokenizer.
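A minimal sketch of the overlay just described, assuming an additive logit boost on a single hinted token followed by renormalization; the class name, n-gram order, and boost value are illustrative rather than the PR's actual code.

```python
# Minimal sketch of a strict-prefix causal n-gram tilt (assumed form: additive
# logit boost on the hinted token, then renormalize over the full vocabulary).
from collections import defaultdict

import torch.nn.functional as F

class CausalNgramTilt:
    def __init__(self, order=3, beta=1.0):
        self.order = order
        self.beta = beta
        # strict-prefix counts: context tuple -> {next_token: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def hint(self, prefix):
        """Most frequent continuation of the current context, or None."""
        ctx = tuple(prefix[-(self.order - 1):])
        nxt = self.counts.get(ctx)
        return max(nxt, key=nxt.get) if nxt else None

    def score(self, logits, prefix):
        """Tilted log-distribution; still normalized over SP8192."""
        h = self.hint(prefix)
        if h is not None:
            logits = logits.clone()
            logits[h] += self.beta          # pay the normalizer for the boost
        return F.log_softmax(logits, dim=-1)

    def update(self, prefix, token):
        """Called only AFTER `token` is scored, so the hint stays strict-prefix."""
        ctx = tuple(prefix[-(self.order - 1):])
        self.counts[ctx][token] += 1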
The first full seed42 validation showed the first effect large enough to justify final packaging work.

### 5. Why Adaptive-Beta Hedge

A fixed boost strength can overpay the normalizer on weak hints and underboost on strong hints. Adaptive-Beta Hedge behaves like a small online universal code over boost temperatures. Across multiple same-execution settings, Hedge improved fixed n-gram by almost the same amount.
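As a concrete reading of "online universal code over boost temperatures", here is a minimal sketch assuming Hedge (exponential weights) over a small grid of boost strengths, each expert being the tilt at one fixed beta. The grid, the learning rate `eta`, and the class name are illustrative, not the PR's implementation.

```python
# Minimal sketch: Hedge over a grid of boost strengths. Each expert tilts the
# logits with one fixed beta; experts are reweighted by their per-token loss.
import torch
import torch.nn.functional as F

class AdaptiveBetaHedge:
    def __init__(self, betas=(0.0, 0.5, 1.0, 2.0), eta=1.0):
        self.betas = betas
        self.log_w = torch.zeros(len(betas))   # uniform prior over betas
        self.eta = eta
        self.log_p = None                      # per-expert log-probs of last step

    def score(self, logits, hint):
        per_beta = []
        for b in self.betas:
            z = logits.clone()
            if hint is not None:
                z[hint] += b                   # expert-specific boost
            per_beta.append(F.log_softmax(z, dim=-1))
        self.log_p = torch.stack(per_beta)                 # [n_betas, vocab]
        w = F.log_softmax(self.log_w, dim=-1)              # normalized log-weights
        return torch.logsumexp(w[:, None] + self.log_p, dim=0)  # mixture log-probs

    def update(self, token):
        # Hedge step: weight *= exp(-eta * loss), with loss = -log p_expert(token).
        self.log_w = self.log_w + self.eta * self.log_p[:, token]
```

With `eta = 1` this reduces to a Bayesian mixture over the beta grid, which is the universal-code reading: its cumulative log-loss is within log(number of betas) of the best fixed beta in hindsight.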
That consistency is the main reason I trust the mechanism: Hedge appears to correct a systematic n-gram boost calibration error rather than exploit a one-off seed42 trajectory.

Limits of the claim: the broader cross-base table is seed42-only per base. The selected family has the three-seed evidence included in this PR.

### 6. Why 2560 context + no Q/V LoRA

I separated scoring transforms from trajectory changes. The n-gram/Hedge overlay changes the scoring distribution; context length and LoRA target branches change the eval-time TTT trajectory. The strongest local trajectory interaction was the 2560-context, no-Q/V-LoRA branch.
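Illustrative only: one way the selected branch could be expressed as a configuration. Module names, rank, and LR are placeholders, not the PR's actual settings.

```python
# Hypothetical config for the selected trajectory branch.
ttt_config = {
    "context_len": 2560,                       # the selected longer eval context
    "lora_rank": 8,                            # placeholder rank
    "lora_targets": ["attn.out_proj", "mlp.fc_in", "mlp.fc_out"],
    # q_proj / v_proj deliberately absent ("no Q/V"): this changes only the
    # eval-time TTT trajectory, never the scoring transform.
    "eval_ttt_lr": 1e-4,                       # placeholder eval-time TTT LR
}
```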
That interaction showed the trajectory change and the scoring transform were mostly complementary, so this combination was selected for final packaging.

### 7. Final package proof

The final package is self-contained and under cap.
The final eval path uses a validation-only data view.
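A hypothetical sketch of what a validation-only data view means here: the eval path is handed only shards the manifest marks as validation, so train shards cannot be scored. The field names are assumptions, not the contents of `eval_data_manifest.json`.

```python
# Sketch: restrict the eval path to validation shards only.
import json

def validation_view(manifest_path):
    manifest = json.load(open(manifest_path))
    return [shard for shard in manifest["shards"] if shard["split"] == "val"]
```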
The main runtime optimization was implementation-only. The previous path computed CE and then separately computed a full log-softmax for the n-gram hint probability. The optimized path reuses the log-softmax already computed for CE to read off the hint probability. This removes a duplicate vocabulary-wide normalization without changing scoring constants or legality.
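A minimal sketch of the reuse, assuming per-position logits; the function name is illustrative.

```python
# One log_softmax serves both the CE accounting and the hint probability,
# removing the duplicate vocabulary-wide normalization.
import torch.nn.functional as F

def score_step(logits, target, hint):
    logp = F.log_softmax(logits, dim=-1)   # single full-vocabulary normalization
    ce_nats = -logp[target]                # cross-entropy term (official accounting)
    hint_logp = logp[hint] if hint is not None else None   # read off, not recomputed
    return ce_nats, hint_logp
```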
### 8. Final per-seed evidence

All three runs use exact official accounting.
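For reference, the accounting reduces to the standard bits-per-byte definition, consistent with the scored-token / scored-byte denominators described above:

```math
\mathrm{BPB} \;=\; \frac{\sum_{i \in \text{scored tokens}} -\ln p\!\left(t_i \mid t_{<i}\right)}{(\ln 2)\cdot B_{\text{scored}}}
```

where \(B_{\text{scored}}\) is the total number of scored bytes.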
The 3-seed mean is …
## Supplemental study: Adaptive-Beta Hedge transfer screen on a PR #1934-like trajectory

After preparing this PR, I ran one additional isolated transfer screen to test whether the strict-prefix token n-gram + Adaptive-Beta Hedge overlay was specific to this submission's base, or whether it behaves like a reusable scoring calibration layer.

This is score evidence only. It is not a new submission claim, and it is not a faithful reproduction of PR #1934. I am documenting it here because the paired result is useful evidence about the mechanism.

### Question

The final method in this PR uses a normalized causal token n-gram overlay with Adaptive-Beta Hedge. Across the experiments in this PR, Hedge repeatedly improved the fixed n-gram tilt by about the same margin. The question was whether that same Hedge-over-fixed gain appears on a different base trajectory.
### Source basis

The transfer screen used PR #1934 source as the starting point.
The overlay added the same style of scoring counters used in this PR: a base counter, a fixed n-gram tilt counter, and an Adaptive-Beta Hedge counter, as sketched below.
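A one-pass sketch of the paired counters, reusing the `AdaptiveBetaHedge` shape from earlier; the stream format and the fixed `BETA` are assumptions for illustration.

```python
# One-pass paired counters (sketch): base, fixed-beta tilt, and adaptive Hedge
# are scored from the SAME logits stream, so their BPB deltas are paired.
import math
import torch.nn.functional as F

BETA = 1.0  # illustrative fixed boost strength

def paired_counters(step_stream, hedge):
    """`step_stream` yields (logits, target, hint, token_bytes) from one eval pass;
    `hedge` is an AdaptiveBetaHedge-style object as sketched earlier."""
    nats = {"base": 0.0, "ngram": 0.0, "hedge": 0.0}
    total_bytes = 0
    for logits, target, hint, token_bytes in step_stream:
        nats["base"] += -F.log_softmax(logits, dim=-1)[target].item()
        tilted = logits.clone()
        if hint is not None:
            tilted[hint] += BETA                          # fixed n-gram tilt
        nats["ngram"] += -F.log_softmax(tilted, dim=-1)[target].item()
        nats["hedge"] += -hedge.score(logits, hint)[target].item()
        hedge.update(target)                              # advance the mixture state
        total_bytes += token_bytes
    return {k: v / (math.log(2) * total_bytes) for k, v in nats.items()}
```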
All three counters were computed in the same execution, using the same logits, TTT trajectory, document order, and scored-token and scored-byte denominators.

### Important deviations from PR #1934

This was deliberately a fast transfer screen, not a full PR #1934 audit or reproduction. The run differs from the public PR #1934 setup in important ways.
Because of these deviations, the result below should be read as a mechanism transfer screen, not as "PR #1934 + Hedge" record evidence.

### Paired same-execution results

The screen was a successful eval-only transfer run from the saved artifact, at seed42, in a memory-safe configuration.
The screen used exact official accounting.
### Why this is interesting

The absolute BPB is not the main finding, because the base trajectory was weaker than PR #1934's reported seed42 run. The important signal is the paired overlay gain.
The Hedge-over-fixed gain is very close to the gains observed in this PR's own experiments.
This supports the interpretation that Adaptive-Beta Hedge is correcting a fairly stable n-gram boost calibration error, rather than exploiting a one-off seed or one particular model trajectory.

### Why I am not submitting this as a separate result

The transfer screen is not submission-ready.
So the classification is: supplemental mechanism evidence, not a submission claim.
### Takeaway

This screen does not create a new record claim. It does add useful evidence for the mechanism behind this PR.
If there were more time, the proper hardening path would be a faithful PR #1934 reproduction with paired overlay counters across multiple seeds.
I am leaving it as a documented supplemental study rather than turning it into a formal submission.
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:

- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier, 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):

- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- No PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is a gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:

- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851 / openai#1868 at 1.06128 / 1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 BPB gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:

- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
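A small illustrative check mirroring the audit's "CLEAN by construction" argument: the pinned manifest's split sizes must partition the corpus, so no val doc can also be a train doc. Field names and the corpus total are assumptions, not the actual manifest schema.

```python
# Hypothetical manifest check for the pinned HF dataset.
import json

def check_manifest(path):
    m = json.load(open(path))
    assert m["docs_val"] == 50_000
    assert m["docs_train"] == 8_181_945
    # 50_000 + 8_181_945 == 8_231_945: the splits sum to the full corpus,
    # which is the audit's "sums match" condition.
    assert m["docs_val"] + m["docs_train"] == 8_231_945
```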
This is a separate follow-up candidate to PR #1915. PR #1915 remains untouched as the conservative 3-seed anchor.
## Headline
`33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41`

## Per-seed package proofs
## Method
## Compliance notes
- `train_gpt.py` wrapper

## Reviewer guide
- `README.md` contains the concise result/package/runtime summary.
- `ENGINEERING_LOG.md` contains the professional engineering record: starting point, closed mechanisms, same-execution counter methodology, n-gram/Adaptive Hedge math, trajectory interactions, and final runtime/package proof.
- `submission.json`, `package_size.json`, and `eval_data_manifest.json` provide machine-readable metadata.
- `train_seed42.log`, `train_seed0.log`, and `train_seed1234.log` contain the exact validation proof logs.

The intended framing is: a seed42 record-track proof with three under-600 seed proofs for reproducibility, while explicitly not overclaiming that the 3-seed mean beats the displayed leaderboard mean.