
Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB) #2050

Open

AidenGeunGeun wants to merge 5 commits into openai:main from AidenGeunGeun:add-aiden-b2-adaptive-hedge-ngram

Conversation


AidenGeunGeun commented Apr 30, 2026

This is a separate follow-up candidate to PR #1915. PR #1915 remains untouched as the conservative 3-seed anchor.

Headline

  • Seed42 BPB: 1.06082922
  • 3-seed mean BPB: 1.06157781, reported for transparency and not claimed to beat the displayed leaderboard mean
  • Exact counts: 47,851,520 scored tokens / 151,074,499 scored bytes
  • Doc-order hash: 33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41
  • Max artifact size: 15,932,067 bytes, 67,933 bytes under the 16,000,000-byte cap

Per-seed package proofs

| Seed | BPB | Inner TTT eval | Total eval wallclock | Wrapper wallclock | Status |
| --- | --- | --- | --- | --- | --- |
| 42 | 1.06082922 | 544.1s | 566.3s | 585s | under 600s |
| 0 | 1.06158291 | 546.7s | 568.5s | 587s | under 600s |
| 1234 | 1.06232130 | 545.6s | 568.1s | 586s | under 600s |

Method

Compliance notes

  • Score-before-update per-document TTT (a minimal sketch follows this list)
  • Strict-prefix n-gram state only
  • No byte PPM, custom tokenizer, validation pre-quant adaptation, global validation SGD, or cross-document adaptive state
  • Final eval data view exposes validation shards only and zero train shards
  • N-gram helper logic is embedded inside the counted train_gpt.py wrapper
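
A minimal sketch of the score-before-update, strict-prefix discipline described in the first two notes. All helper names (`fresh_lora_state`, `NGramState`, `score_token`, `ttt_step`, `chunk.positions`) are hypothetical placeholders, not the counted wrapper's API, and the chunking granularity is also an assumption.

```python
import math

def evaluate(model, documents):
    """Sketch: per-document, score-first TTT with strict-prefix n-gram state."""
    total_nll_nats, total_bytes = 0.0, 0
    for doc in documents:
        adapter = model.fresh_lora_state()   # per-document state; nothing crosses documents
        ngram = NGramState()                 # strict-prefix token counts, reset per document
        for chunk in doc.chunks:
            # score first: each position sees only the strict-prefix n-gram state
            for token, hint in chunk.positions():
                total_nll_nats += score_token(model, adapter, ngram, token, hint)
                ngram.update(token)          # n-gram state updates only after scoring
            ttt_step(model, adapter, chunk)  # TTT update happens after the chunk is scored
        total_bytes += doc.num_bytes
    return total_nll_nats / (math.log(2) * total_bytes)   # bits per byte
```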

Reviewer guide

  • README.md contains the concise result/package/runtime summary.
  • ENGINEERING_LOG.md contains the professional engineering record: starting point, closed mechanisms, same-execution counter methodology, n-gram/Adaptive Hedge math, trajectory interactions, and final runtime/package proof.
  • submission.json, package_size.json, and eval_data_manifest.json provide machine-readable metadata.
  • train_seed42.log, train_seed0.log, and train_seed1234.log contain the exact validation proof logs.

The intended framing is: a seed42 record-track proof with three under-600s seed proofs for reproducibility, while explicitly not overclaiming that the 3-seed mean beats the displayed leaderboard mean.

AidenGeunGeun changed the title from "Draft: Add SP8192 CaseOps + Adaptive Hedge n-gram candidate (1.06083 BPB)" to "Add SP8192 CaseOps + Adaptive Hedge n-gram candidate (1.06083 BPB)" Apr 30, 2026
AidenGeunGeun changed the title from "Add SP8192 CaseOps + Adaptive Hedge n-gram candidate (1.06083 BPB)" to "Add SP8192 CaseOps + 2560 no-Q/V Adaptive Hedge n-gram (1.06083 BPB)" Apr 30, 2026
AidenGeunGeun marked this pull request as ready for review April 30, 2026
AidenGeunGeun (Author) commented:

Engineering rationale for this submission

I am adding this comment so reviewers do not have to reconstruct the reasoning chain from the full ENGINEERING_LOG.md. The short version: this submission is a focused eval-time follow-up to PR #1915. It was selected because a normalized causal token n-gram overlay produced a large full-validation gain, and Adaptive-Beta Hedge made that overlay transfer reliably across seeds and trajectory settings.

1. Frozen starting point

PR #1915 remains the conservative anchor and was not modified here. It used the SP8192 CaseOps / PR #1855-style legal frontier stack with per-document score-first LoRA TTT and produced:

  • 3-seed mean: 1.06504520 BPB
  • max package: 15,922,155 bytes
  • legal per-document TTT: no global validation SGD, no cross-document adaptive state, score-before-update

This PR is a separate follow-up folder so the clean anchor remains intact.

2. What was tried and closed

Several plausible mechanisms were tested before selecting this one. I am listing them because the final method was chosen by elimination, not by piling on knobs.

| Mechanism | Observation | Decision |
| --- | --- | --- |
| Weighted LQER | best gain about 0.000013 BPB | closed |
| AWQ/no-embedding rescue | package-safe but essentially neutral | closed |
| D-prime / phase bucketing | slight BPB gain, too slow | closed |
| First-order TTT-aware training | seed42 worsened to 1.07164685 | closed |
| Random-map adapters | sampled BPB worsened by 0.008918 | closed |
| Long-context 4K | full validation worsened by 0.002438 BPB | closed |
| LeakyReLU-slope retrain | seed42 worsened to 1.13888025 | closed |
| Neural Dirichlet context mixer | same-execution identity check repaired, then valid slices regressed and runtime projected about 9180s | closed |

The one small stable positive setting was lower eval-time TTT LR, improving all three seeds by about 0.0005-0.0006 BPB. That became the base eval-time setting for the final follow-up.

3. Same-execution counters

One important lesson from the context-mixer work was that separate eval-time TTT runs are not reliable for tiny BPB deltas. Fresh per-document LoRA state, BF16/fused kernels, and distributed scheduling can move otherwise identical runs at the 1e-5 BPB scale.

For scoring-only transforms, I therefore used same-execution counters: same logits, same TTT trajectory, same document order, same hints, same token/byte accounting. That is why the small Adaptive Hedge deltas are meaningful.
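
As an illustration of the bookkeeping, here is a minimal sketch of accumulating the paired counters from one forward pass, assuming a PyTorch scoring path. The `overlays` argument and the counter names are hypothetical stand-ins for the fixed-tilt and Adaptive Hedge transforms, not the wrapper's actual interface.

```python
import torch
import torch.nn.functional as F

def accumulate_counters(logits, targets, hints, overlays, counters):
    """Sketch: one forward pass, several scoring variants counted side by side."""
    # logits: [T, V] from the shared TTT trajectory; targets, hints: [T]
    log_p = F.log_softmax(logits, dim=-1)
    idx = torch.arange(targets.numel())

    counters["control"] += (-log_p[idx, targets]).sum().item()
    for name, overlay in overlays.items():          # e.g. fixed tilt, Adaptive Hedge
        tilted_log_p = overlay(log_p, hints)        # rescales the same shared log-probs
        counters[name] += (-tilted_log_p[idx, targets]).sum().item()
    # every counter sees the same positions, so the scored-token and
    # scored-byte denominators are identical by construction
```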

4. Token-level causal n-gram tilt

The main late signal came from a normalized token-level causal n-gram tilt over the official SP8192 alphabet.

For a strict-prefix hint h, the fixed tilt rescores the model distribution p over the SP8192 alphabet as:

p'(a) = exp(beta * 1[a == h]) * p(a) / Z
Z = 1 + p(h) * (exp(beta) - 1)

This keeps the scored distribution normalized over SP8192. The n-gram state is strict-prefix only and updates after the current token is scored. It is not byte PPM and it does not change the tokenizer.
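
A minimal sketch of that tilt in log space, assuming the per-position target and hint log-probabilities are already available; the function name is illustrative.

```python
import math

def tilted_target_logprob(logp_target, logp_hint, target_is_hint, beta):
    """Sketch: log p'(target) under the normalized single-token boost above."""
    # Z = 1 + p(h) * (exp(beta) - 1), computed in log space for stability
    log_Z = math.log1p(math.exp(logp_hint) * math.expm1(beta))
    boost = beta if target_is_hint else 0.0
    return logp_target + boost - log_Z   # still a normalized distribution over SP8192
```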

The first full seed42 validation showed:

  • paired lower-LR control: 1.06373091 BPB
  • fixed token n-gram tilt: 1.06182936 BPB
  • gain: 0.00190155 BPB
  • exact counts: 47,851,520 scored tokens / 151,074,499 scored bytes

That was the first effect large enough to justify final packaging work.

5. Why Adaptive-Beta Hedge

A fixed boost strength can overpay the normalizer on weak hints and underboost on strong hints. Adaptive-Beta Hedge behaves like a small online universal code over boost temperatures.
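
For illustration, here is a minimal sketch of such a mixture over a small grid of boost strengths, reusing the tilt formula from the previous section. The beta grid, class name, and interface are assumptions for the example, not the submission's actual implementation.

```python
import math

class AdaptiveBetaHedge:
    """Sketch: online Bayes/Hedge mixture over a small grid of n-gram boost strengths."""

    def __init__(self, betas=(0.0, 0.5, 1.0, 1.5, 2.0)):
        self.betas = betas
        self.log_w = [0.0] * len(betas)          # uniform prior over boost strengths

    def score_and_update(self, logp_target, logp_hint, target_is_hint):
        # each "expert" scores the token with its own fixed tilt strength
        expert_logps = []
        for beta in self.betas:
            log_Z = math.log1p(math.exp(logp_hint) * math.expm1(beta))
            boost = beta if target_is_hint else 0.0
            expert_logps.append(logp_target + boost - log_Z)

        def logsumexp(xs):
            m = max(xs)
            return m + math.log(sum(math.exp(x - m) for x in xs))

        # mixture prediction: weighted average of the experts' probabilities
        mixture_logp = (logsumexp([w + lp for w, lp in zip(self.log_w, expert_logps)])
                        - logsumexp(self.log_w))
        # Hedge update on log-loss: reward each strength by the probability it
        # assigned to the token that was just scored (update only after scoring)
        self.log_w = [w + lp for w, lp in zip(self.log_w, expert_logps)]
        return mixture_logp
```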

Across multiple same-execution settings, Hedge improved fixed n-gram by almost the same amount:

| Setting | Fixed n-gram | Adaptive Hedge | Hedge gain |
| --- | --- | --- | --- |
| seed42, default trajectory | 1.06182636 | 1.06143959 | 0.00038677 |
| seed0, default trajectory | 1.06260772 | 1.06221801 | 0.00038971 |
| seed1234, default trajectory | 1.06331328 | 1.06292962 | 0.00038366 |
| seed42, Q disabled | 1.06182914 | 1.06144170 | 0.00038744 |
| seed42, public-frontier diagnostic base | 1.06135233 | 1.06096543 | 0.00038690 |
| seed42, 2560 context + no Q/V | 1.06121689 | 1.06083091 | 0.00038599 |

That consistency is the main reason I trust the mechanism: Hedge appears to correct a systematic n-gram boost calibration error rather than exploit a one-off seed42 trajectory.

Limits of the claim: in the broader cross-base table, each base was evaluated with seed42 only; only the selected configuration has the three-seed evidence included in this PR.

6. Why 2560 context + no Q/V LoRA

I separated scoring transforms from trajectory changes. The n-gram/Hedge overlay changes the scoring distribution. Context length and LoRA target branches change the eval-time TTT trajectory.

The strongest local trajectory interaction was:

  • paired 2560/no-QV trajectory control: 1.06306045 BPB
  • fixed n-gram: 1.06121689 BPB
  • Adaptive Hedge: 1.06083091 BPB
  • Hedge gain over fixed: 0.00038599 BPB

This showed the trajectory change and the scoring transform were mostly complementary, so this was selected for final packaging.

7. Final package proof

The final package is self-contained and under cap:

  • max compressed model: 15,874,515 bytes
  • counted train_gpt.py wrapper: 57,552 bytes
  • max total package: 15,932,067 bytes
  • margin under cap: 67,933 bytes
  • custom n-gram Python/C helper logic embedded in counted wrapper
  • no uncounted helper files required

The final eval path uses a validation-only data view:

  • train shards visible: 0
  • validation token shards: 5
  • validation byte shards: 5

The main runtime optimization was implementation-only. The previous path computed CE and then separately computed a full log-softmax for the n-gram hint probability. The optimized path reuses the CE loss to recover the normalizer:

loss = logZ - target_logit
logZ = loss + target_logit
hint_log_prob = hint_logit - logZ

This removes a duplicate vocabulary-wide normalization without changing scoring constants or legality.
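
A minimal PyTorch sketch of that reuse, with illustrative tensor names:

```python
import torch
import torch.nn.functional as F

def hint_log_prob_from_ce(logits, targets, hints):
    """Sketch: recover logZ from the CE already computed, skipping a second log-softmax."""
    # logits: [T, V]; targets, hints: [T]
    idx = torch.arange(logits.size(0))
    loss = F.cross_entropy(logits, targets, reduction="none")  # loss = logZ - target_logit
    log_Z = loss + logits[idx, targets]                        # logZ = loss + target_logit
    return logits[idx, hints] - log_Z                          # hint_log_prob = hint_logit - logZ
```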

8. Final per-seed evidence

| Seed | BPB | Inner TTT eval | Total eval wallclock | Wrapper wallclock | Note |
| --- | --- | --- | --- | --- | --- |
| 42 | 1.06082922 | 544.1s | 566.3s | 585s | paired control/fixed/adaptive proof |
| 0 | 1.06158291 | 546.7s | 568.5s | 587s | paired control/fixed/adaptive proof |
| 1234 | 1.06232130 | 545.6s | 568.1s | 586s | selected Adaptive Hedge runtime proof |

All three runs use exact official accounting:

  • scored tokens: 47,851,520
  • scored bytes: 151,074,499
  • doc-order hash: 33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41

The 3-seed mean is 1.06157781 BPB. I am reporting that transparently and not claiming it beats the displayed leaderboard mean. The headline claim is the seed42 record-track proof, with two additional under-600s seed proofs for reproducibility.


AidenGeunGeun commented May 1, 2026

Supplemental study: Adaptive-Beta Hedge transfer screen on a PR #1934-like trajectory

After preparing this PR, I ran one additional isolated transfer screen to test whether the strict-prefix token n-gram + Adaptive-Beta Hedge overlay was specific to this submission's base, or whether it behaves like a reusable scoring calibration layer.

This is score evidence only. It is not a new submission claim, and it is not a faithful reproduction of PR #1934. I am documenting it here because the paired result is useful evidence about the mechanism.

Question

The final method in this PR uses a normalized causal token n-gram overlay with Adaptive-Beta Hedge. Across the experiments in this PR, Hedge repeatedly improved fixed n-gram tilt by about 0.000386 BPB.

The question was:

If the same overlay is applied to a stronger public PR #1934-style trajectory, does the Hedge gain persist?

Source basis

The transfer screen used PR #1934 source as the starting point:

  • PR branch: liujshi/parameter-golf, record-lrzip
  • PR commit: ae80c9fd1ee854d1529fe11524a9fa6a1e084f9e
  • Record folder: records/track_10min_16mb/2026-04-29_SP8192_LQER_CaseOp_Per-group_Lrzip

The overlay added the same style of scoring counters used in this PR:

  • paired control from the same eval-time TTT trajectory
  • fixed normalized token n-gram tilt
  • Adaptive-Beta Hedge over n-gram boost strengths

All three counters were computed in the same execution, using the same logits / TTT trajectory / document order / scored-token and scored-byte denominators.

Important deviations from PR #1934

This was deliberately a fast transfer screen, not a full PR #1934 audit or reproduction. The run differs from the public PR #1934 setup in important ways:

  1. Training data mismatch

  2. Memory-safe eval batching

    • The default overlay TTT scoring path OOMed at the original larger batch setting.
    • The successful eval-only retry used TTT_BATCH_SIZE=32.
    • This made the screen complete, but it changed scheduling/batching and pushed eval runtime over the contest target.
  3. No broad validity audit of PR #1934 (Record: SP8192 CaseOps + TTT + GPTQ + LRZIP, val_bpb 1.05993 3-seed mean)

Because of these deviations, the result below should be read as a mechanism transfer screen, not as “PR #1934 + Hedge” record evidence.

Paired same-execution results

Successful eval-only transfer run from the saved artifact, seed42, memory-safe TTT_BATCH_SIZE=32:

| Counter | Loss (nats/token) | BPB | Gain vs control | Gain vs fixed |
| --- | --- | --- | --- | --- |
| Paired PR1934-like control | 2.32386471 | 1.06191549 | | |
| Fixed n-gram tilt | 2.31960433 | 1.05996866 | 0.00194683 | 0.00000000 |
| Adaptive-Beta Hedge | 2.31876583 | 1.05958550 | 0.00232999 | 0.00038316 |

Official accounting for the screen:

  • scored tokens: 47,851,520
  • scored bytes: 151,074,499
  • validation docs: 50,000
  • n-gram active hint positions: 13,023,303
  • doc-order hash: 33236cc6bd19fa6b89e06d441d3fcd8eb37dc8540f6a4f2b627b20af10894a41

Why this is interesting

The absolute BPB is not the main finding, because the base trajectory was weaker than PR #1934's reported seed42 run. The important signal is the paired overlay gain:

  • fixed n-gram improved the paired control by 0.00194683 BPB
  • Adaptive-Beta Hedge improved the paired control by 0.00232999 BPB
  • Adaptive-Beta Hedge improved fixed n-gram by 0.00038316 BPB

That last number is very close to the Hedge-over-fixed gains observed in this PR's own experiments:

| Setting | Hedge gain vs fixed n-gram |
| --- | --- |
| seed42, default trajectory | 0.00038677 BPB |
| seed0, default trajectory | 0.00038971 BPB |
| seed1234, default trajectory | 0.00038366 BPB |
| seed42, Q disabled | 0.00038744 BPB |
| seed42, public-frontier diagnostic base | 0.00038690 BPB |
| seed42, 2560 context + no Q/V | 0.00038599 BPB |
| PR #1934-like transfer screen | 0.00038316 BPB |

This supports the interpretation that Adaptive-Beta Hedge is correcting a fairly stable n-gram boost calibration error, rather than exploiting a one-off seed or one particular model trajectory.

Why I am not submitting this as a separate result

The transfer screen is not submission-ready: the base is not a faithful PR #1934 reproduction, the training data do not match, the memory-safe batching pushed eval runtime over the contest target, and no broad validity audit was done.

So the classification is:

SCORE-EVIDENCE-ONLY / NO SUBMIT

Takeaway

This screen does not create a new record claim. It does add useful evidence for the mechanism behind this PR:

A strict-prefix normalized token n-gram overlay gives a large paired gain, and Adaptive-Beta Hedge appears to add a stable additional calibration gain even when moved onto a different public base trajectory.

If there were more time, the proper hardening path would be:

  1. reproduce or obtain the full PR #1934 seed42 artifact/trajectory;
  2. apply the same overlay with original scheduling or prove the memory-safe batch path equivalent enough;
  3. bring runtime below 600s;
  4. increase package margin;
  5. then rerun seed42 and decide whether more seeds are justified.

I am leaving it as a documented supplemental study rather than turning it into a formal submission.

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 2, 2026