Non-record: Confidence-Adaptive N-gram Boost on PR #2018 stack, val_bpb=1.05874 #2129

Open

okezue wants to merge 1 commit into openai:main from okezue:adaptive-boost-nonrecord

Conversation

okezue commented May 1, 2026

Non-record submission

val_bpb = 1.05874 (seed 42, single-seed, ADAPTIVE_BOOST_GAMMA=1) | artifact 15,990,227 bytes | 8xH100 80GB SXM | strict 600s train + eval

This is a non-record submission per README §"Non-record Submissions". It does not clear the 0.005-nat threshold versus the PR #1855 SOTA (1.06108): the margin is 0.00234 bpb, which converts to about 0.00162 nats/byte (multiply by ln 2), below the 0.005 floor. It documents a clean two-line novel addition to the strict token-only n-gram tilt of PR #2018, with a small but consistent improvement at both gamma=1 and gamma=2.
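
For concreteness, the unit conversion behind that margin (a quick standalone check, not part of the submission code):

```python
import math

SOTA_BPB = 1.06108                      # PR #1855 record, bits/byte
THIS_BPB = 1.05874                      # this submission, bits/byte
margin_bpb = SOTA_BPB - THIS_BPB        # 0.00234 bits/byte
margin_nats = margin_bpb * math.log(2)  # ~0.00162 nats/byte
print(f"{margin_nats:.5f} nats/byte; clears 0.005 floor: {margin_nats >= 0.005}")
# -> 0.00162 nats/byte; clears 0.005 floor: False
```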

What is novel

Confidence-Adaptive N-gram Boost. PR #2018's tilt applies a fixed boost beta whenever the prefix-derived hint counter exceeds threshold. This submission scales beta per-position by the NN's own predictive confidence:

beta_t = TOKEN_BOOST * (1 - q_hint_t)^gamma

where q_hint_t = p(h_t | x_<t) is the prefix-only NN distribution at position t evaluated on the hinted token, and gamma is a tunable exponent (env var ADAPTIVE_BOOST_GAMMA, default 0 = original behavior).
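
A sketch of how the knob might be read (only the env var name ADAPTIVE_BOOST_GAMMA comes from this PR; the surrounding code is an assumption):

```python
import os

# ADAPTIVE_BOOST_GAMMA is the env knob named above. The default of 0 keeps
# PR #2018's fixed-boost behavior, since (1 - q_hint) ** 0 == 1 makes
# beta_t = TOKEN_BOOST at every position.
gamma = float(os.environ.get("ADAPTIVE_BOOST_GAMMA", "0"))
```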

When the NN already places high probability on the hinted token (q_hint -> 1), we apply almost no tilt, since the NN already agrees. When the NN disagrees (q_hint -> 0), we apply the full tilt. This matches the intuition that the n-gram expert is most useful as a corrective signal precisely when the NN is uncertain or wrong.
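
A minimal PyTorch sketch of the adaptive tilt under the definitions above; the function name, shapes, and signature are illustrative assumptions, not the submission's actual code:

```python
import torch

def adaptive_ngram_tilt(log_p: torch.Tensor, hint_ids: torch.Tensor,
                        token_boost: float, gamma: float) -> torch.Tensor:
    """Tilt prefix-only NN log-probs toward n-gram hints, confidence-adaptively.

    log_p:    (T, V) log-probs p(. | x_<t) at each scored position t.
    hint_ids: (T,) hinted token h_t from the prefix-only n-gram counter.
    """
    # q_hint_t = p(h_t | x_<t): the NN's own probability on the hinted token
    q_hint = log_p.gather(-1, hint_ids.unsqueeze(-1)).squeeze(-1).exp()
    # beta_t = TOKEN_BOOST * (1 - q_hint_t)^gamma: full tilt when the NN
    # disagrees with the hint, near-zero tilt when it already concentrates on it
    beta = token_boost * (1.0 - q_hint) ** gamma
    # closed-form normalizer: Z_t = 1 + q_hint_t * (exp(beta_t) - 1)
    log_Z = torch.log1p(q_hint * torch.expm1(beta))
    # add beta_t only at the hinted token, then renormalize analytically
    tilted = log_p.scatter_add(-1, hint_ids.unsqueeze(-1), beta.unsqueeze(-1))
    return tilted - log_Z.unsqueeze(-1)
```

Everything beta_t reads (q_hint_t and h_t) is prefix-derived, which is what keeps the tilt causal per C1 below.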

Compliance

The closed-form tilt p'(a) = p(a) * exp(beta * 1[a==h]) / Z is normalized over the vocab axis for any beta_t >= 0 that depends only on prefix-derived state: summing the numerator over the vocab gives Z = (1-q) + q*exp(beta) = 1 + q*(exp(beta)-1), with q the NN probability on h (a numeric check follows the list below). Since q_hint_t and h_t are both prefix-only:

  • C1 causal: ✓ — q_hint_t is the NN distribution at t conditioned on tokens <t. h_t from prefix-only n-gram counter.
  • C2 normalized: ✓ — Z_t closed-form is the analytic normalizer for the per-position beta_t.
  • C3 score-before-update: ✓ — applied at scoring time only, no parameter updates from val tokens.
  • C4 single pass: ✓ — inherited from PR #2018's single-pass sliding eval.
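
A toy check of the closed-form normalizer against the brute-force vocab sum (vocab size, hint id, and beta are arbitrary; a standalone sanity check, not submission code):

```python
import math
import torch

torch.manual_seed(0)
p = torch.softmax(torch.randn(8192), dim=-1)  # stand-in prefix-only NN distribution
h, beta = 123, 0.7                            # arbitrary hint id and tilt strength
q = p[h]

unnorm = p.clone()
unnorm[h] *= math.exp(beta)            # p(a) * exp(beta * 1[a==h])
Z = 1 + q * math.expm1(beta)           # closed-form normalizer from above
assert torch.isclose(unnorm.sum(), Z)  # matches the brute-force sum over the vocab
assert torch.isclose((unnorm / Z).sum(), torch.tensor(1.0))  # p' sums to 1
```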

Strict token-only path inherited (WITHIN_BOOST=0 WORD_BOOST=0 AGREE_ADD_BOOST=0). Run logs confirm within_gate=0 word_gate=0 agree2plus=0. This addresses the C1 concerns flagged on PR #2118 and resolved on PR #2018 / PR #1514.

Result

| variant | seed | val_bpb | eval_ms | artifact_bytes |
| --- | --- | --- | --- | --- |
| baseline (gamma=0, our reproduction) | 42 | 1.05900 | 479,477 | 15,991,083 |
| adaptive gamma=1 (this submission) | 42 | 1.05874 | 449,899 | 15,990,227 |
| adaptive gamma=2 | 42 | 1.05878 | 450,385 | 15,990,227 |

The improvement over the gamma=0 baseline is positive for both gamma=1 and gamma=2; gamma=1 is the best so far.

Reproduction gap honesty

Our reproduction of PR #2018 lands at val_bpb 1.05900 vs the PR #2018 README's reported 1.04617 for seed 42, a +0.013 bpb gap. The gap is consistent across pre-quant (1.06351 vs 1.04931), quantized (1.07123 vs 1.05773), and post-TTT (1.05900 vs 1.04617), so it is a base-model issue in our reproduction, not a tilt issue. The adaptive-boost layer is reported on top of our reproduction baseline. With a properly reproduced PR #2018 baseline (~1.046), gamma=1 could plausibly land below the record threshold versus PR #1855 (1.06108 - 0.005/ln 2 ≈ 1.054 bpb); we are not confirming that here.

Compliance summary

  • Train: 596,039 ms < 600,000 ms cap.
  • Eval: 449,899 ms < 600,000 ms cap.
  • Artifact: 15,990,227 bytes < 16,000,000 byte cap.
  • Tokenizer: SP8192 lossless caps caseops v1 reserved (md5 b73929616bf6303b953396b767a29b99).
  • 8xH100 80GB SXM.

Credits

cc @cocohearts @valerio-oai @simon-marcus for visibility.

… val_bpb=1.05874

Single-seed non-record submission documenting a two-line novel addition to the strict token-only n-gram tilt path from PR openai#2018: scale the per-position boost beta_t by (1 - q_hint_t)^gamma, where q_hint_t is the prefix-only NN distribution at the scored position evaluated on the hinted token. Down-weights the n-gram tilt when NN already agrees with the hint, up-weights when NN disagrees.

Result on seed 42 with ADAPTIVE_BOOST_GAMMA=1: val_bpb 1.05874 vs the 1.05900 gamma=0 baseline, with a positive delta at both gamma=1 and gamma=2. The margin is below the 0.005-nat record threshold versus the PR openai#1855 SOTA, hence non-record. The closed-form tilt Z_t = 1 + q_hint*(exp(beta_t)-1) preserves C2 normalization for any prefix-derived beta_t, and q_hint comes from the same prefix-only NN distribution used for scoring, so C1 / C2 / C3 / C4 are all satisfied.
