Non-record: Confidence-Adaptive N-gram Boost on PR #2018 stack, val_bpb=1.05874 #2129
Open
okezue wants to merge 1 commit into openai:main from
Conversation
… val_bpb=1.05874

Single-seed non-record submission documenting a two-line novel addition to the strict token-only n-gram tilt path from PR openai#2018: scale the per-position boost beta_t by (1 - q_hint_t)^gamma, where q_hint_t is the prefix-only NN distribution at the scored position evaluated on the hinted token. This down-weights the n-gram tilt when the NN already agrees with the hint and up-weights it when the NN disagrees. Result on seed 42 with ADAPTIVE_BOOST_GAMMA=1: val_bpb 1.05874 vs the 1.05900 gamma=0 baseline, monotonically positive across gamma in {1, 2}. Below the 0.005-nat record threshold versus the PR openai#1855 SOTA, hence non-record. The closed-form tilt Z_t = 1 + q_hint*(exp(beta_t)-1) preserves C2 normalization for any prefix-derived beta_t, and q_hint comes from the same prefix-only NN distribution used for scoring, so C1 / C2 / C3 / C4 are all satisfied.
Non-record submission
val_bpb = 1.05874 (seed 42, single-seed, ADAPTIVE_BOOST_GAMMA=1) | artifact 15,990,227 bytes | 8xH100 80GB SXM | strict 600s train + eval
This is a non-record submission per README §"Non-record Submissions". It does not clear the 0.005-nat threshold versus the PR #1855 SOTA (1.06108): the margin is 1.06108 - 1.05874 = 0.00234 bits/byte, i.e. about 0.00234 * ln 2 ≈ 0.00162 nats/byte, below the 0.005 floor. It documents a clean two-line novel addition to the strict token-only n-gram tilt of PR #2018, with a small but consistent positive effect across gamma in {1, 2}.

What is novel
Confidence-Adaptive N-gram Boost. PR #2018's tilt applies a fixed boost `beta` whenever the prefix-derived hint counter exceeds its threshold. This submission scales `beta` per-position by the NN's own predictive confidence:

`beta_t = beta * (1 - q_hint_t)^gamma`

where `q_hint_t = p(h_t | x_<t)` is the prefix-only NN distribution at position `t` evaluated on the hinted token `h_t`, and `gamma` is a tunable exponent (env var `ADAPTIVE_BOOST_GAMMA`, default 0 = original behavior). When the NN already places high probability on the hinted token (`q_hint -> 1`), we apply almost no tilt, since the NN already agrees; when the NN disagrees (`q_hint -> 0`), we apply the full tilt. This matches the intuition that the n-gram expert is most useful as a corrective signal precisely when the NN is uncertain or wrong.
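A minimal sketch of the rule in PyTorch, assuming probabilities `p` of shape `[T, V]` and hint ids `h` of shape `[T]`; the function name `adaptive_tilt` and its signature are illustrative, not the actual `apply_tilt_to_ptl_torch` from PR #2018:

```python
import torch

def adaptive_tilt(p: torch.Tensor, h: torch.Tensor, beta: float, gamma: float) -> torch.Tensor:
    """Confidence-adaptive n-gram tilt on prefix-only probabilities.

    p: [T, V] prefix-only NN distribution at each position t.
    h: [T]    hinted token id at each position (prefix-only n-gram counter).
    """
    q_hint = p.gather(-1, h.unsqueeze(-1)).squeeze(-1)  # q_hint_t = p(h_t | x_<t), shape [T]
    beta_t = beta * (1.0 - q_hint).pow(gamma)           # the two-line novelty; gamma=0 recovers fixed beta
    Z = 1.0 + q_hint * (beta_t.exp() - 1.0)             # closed-form normalizer per position
    tilted = p.clone()
    rows = torch.arange(p.size(0))
    tilted[rows, h] = p[rows, h] * beta_t.exp()         # boost only the hinted token
    return tilted / Z.unsqueeze(-1)                     # rows sum to 1 for any prefix-derived beta_t
```

With `gamma=0`, `beta_t == beta` at every position, so the original fixed-boost behavior of PR #2018 is recovered exactly.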
Compliance
The closed-form tilt `p'(a) = p(a) * exp(beta * 1[a==h]) / Z` with `Z = 1 + q*(exp(beta)-1)` is normalized over the vocab axis for any `beta_t >= 0` that depends only on prefix-derived state (a numerical check follows below). Since `q_hint_t` and `h_t` are both prefix-only:

- `q_hint_t` is the NN distribution at `t`, conditioned on tokens `< t`.
- `h_t` comes from the prefix-only n-gram counter.
- `beta_t = beta * (1 - q_hint_t)^gamma` is therefore itself prefix-derived.
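A quick numerical check of the C2 normalization claim, reusing the hypothetical `adaptive_tilt` sketch above:

```python
import torch

T, V = 8, 32
p = torch.softmax(torch.randn(T, V), dim=-1)  # stand-in for the prefix-only NN probs
h = torch.randint(0, V, (T,))                 # stand-in for the prefix-only hints
for gamma in (0.0, 1.0, 2.0):
    out = adaptive_tilt(p, h, beta=0.5, gamma=gamma)
    assert torch.allclose(out.sum(-1), torch.ones(T), atol=1e-6)  # each row is still a distribution
```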
Strict token-only path inherited (`WITHIN_BOOST=0 WORD_BOOST=0 AGREE_ADD_BOOST=0`). Run logs confirm `within_gate=0 word_gate=0 agree2plus=0`. This addresses the C1 concerns flagged on PR #2118 and resolved on PR #2018 / PR #1514.
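As a sketch of how one might enforce that configuration up front (the flag names come from this PR's text; the guard itself is an assumption, not code from PR #2018):

```python
import os

# Fail fast unless the strict token-only gates are all disabled.
for flag in ("WITHIN_BOOST", "WORD_BOOST", "AGREE_ADD_BOOST"):
    assert os.environ.get(flag, "0") == "0", f"{flag} must be 0 on the strict token-only path"
```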
Result

Monotonically positive across gamma in {1, 2}: both improve on the 1.05900 gamma=0 baseline, and gamma=1 (val_bpb 1.05874) is the best so far.

Reproduction gap honesty
Our reproduction of PR #2018 lands at val_bpb 1.05900 vs the 1.04617 the PR #2018 README reports for seed 42, a +0.013 BPB gap. The gap is consistent across pre-quant (1.06351 vs 1.04931), quantized (1.07123 vs 1.05773), and post-TTT (1.05900 vs 1.04617), so it is a base-model issue in our reproduction, not a tilt issue. The adaptive-boost layer is reported on top of our reproduction baseline. With a properly reproduced PR #2018 baseline (~1.046), gamma=1 could plausibly land below the 1.054 record threshold versus PR #1855; we are not confirming that here.

Compliance summary

C1 / C2 / C3 / C4 all satisfied: the closed-form Z_t = 1 + q_hint*(exp(beta_t)-1) preserves C2 normalization for any prefix-derived beta_t, and q_hint comes from the same prefix-only NN distribution used for scoring.
Files
- README.md, submission.json
- train_gpt.py (train script from PR #2018, "Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047)", plus the 2-line adaptive-boost wiring at Hyperparameters.__init__ and the apply_tilt_to_ptl_torch call site; see the sketch after this list)
- online_ngram_tilt.py (PR #2018 tilt module + adaptive-gamma multiplication in both apply_tilt_to_ptl_torch and apply_tilt_to_ptl_torch_fast)
- online_ngram_state.c, prepare_caseops_data.py, lossless_caps.py, run.sh (unchanged from PR #2018)
- train_seed42.log
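A hedged sketch of what that 2-line wiring could look like; the surrounding class and call site are assumptions reconstructed from the file names above, not code copied from PR #2018:

```python
import os

class Hyperparameters:
    def __init__(self):
        # Line 1: expose the new knob; default 0 reproduces PR #2018's fixed boost.
        self.adaptive_boost_gamma = float(os.environ.get("ADAPTIVE_BOOST_GAMMA", "0"))

# Line 2, at the (assumed) apply_tilt_to_ptl_torch call site:
#   ptl = apply_tilt_to_ptl_torch(ptl, hints, beta, gamma=hp.adaptive_boost_gamma)
```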
Credits

cc @cocohearts @valerio-oai @simon-marcus for visibility.