Record: Order-Adaptive Entropy Gating + XSA-All (val_bpb=0.9370) #774

Open

travispchen wants to merge 1 commit into openai:main from travispchen:order-adaptive-entropy-gating

Conversation

@travispchen

N-gram7 BPB: 0.9370 (±0.0003) across seeds 1337/42/2025
Sliding BPB: 1.1222 (±0.0003)
Artifact: ~15.9 MB (within 16MB cap)
Training: 600s on 8xH100

Key innovation: order-adaptive entropy gating assigns different entropy thresholds per n-gram order. High-order matches (7-gram) trusted at moderate model confidence; low-order matches (2-gram) only trusted when model is very uncertain.

Built on PR #753 (Podracing II) with XSA extended to all 11 layers and entropy_center=3.0.
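Roughly, the gate looks like the sketch below (illustrative names and placeholder constants; the slope, temperature, and soft-sigmoid form are not the exact values or form used in train_gpt.py):

    import torch

    def model_entropy(logits):
        # Shannon entropy (nats) of the model's next-token distribution
        logp = torch.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(dim=-1)

    def order_adaptive_alpha(entropy, matched_order, ent_center=3.0, slope=0.35,
                             min_order=2, temperature=0.5):
        # Per-order center: ent_center_n = ent_center - slope * (n - min_order).
        # A 7-gram match gets a lower center (trusted at moderate model confidence);
        # a 2-gram match keeps the full center (trusted only when the model is very uncertain).
        ent_center_n = ent_center - slope * (matched_order - min_order)
        # Soft gate around the per-order center
        return torch.sigmoid((entropy - ent_center_n) / temperature)

    # usage at one scored position:
    #   alpha = order_adaptive_alpha(model_entropy(logits), matched_order=7)
    #   p_mix = alpha * p_ngram + (1 - alpha) * p_model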

@newjordan

newjordan commented Mar 26, 2026

we need to share notes! i mean.. we just did =) but calibrations. I haven't calibrated it yet

@MatoTeziTanka

Community Review — Order-Adaptive Entropy Gating + XSA-All

BPB: 0.9370 (n-gram7) / 1.1222 (sliding) | Seeds: 3 (1337/42/2025) | Artifact: 15.83 MB | Compliance: FLAG (target-in-key n-gram cache)

What this does: Builds on PR #753 (Podracing II) with two changes: (1) extends XSA from the last 4 layers to all 11 layers with entropy_center=3.0, and (2) introduces a per-order entropy threshold so 7-gram matches are trusted at lower model entropy than 2-gram matches. Pure-eval changes — training is unchanged.
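At a config level the delta versus #753 amounts to roughly the following overrides (the field names and the last-4 layer range are my shorthand, not the actual Hyperparameters fields; the slope value is a placeholder):

    # (1) XSA on every layer instead of only the last 4, with entropy_center=3.0
    xsa_layers = list(range(11))       # baseline reportedly applied XSA to the last 4 layers only
    xsa_entropy_center = 3.0
    # (2) per-order entropy threshold for the eval-time n-gram mixing
    ngram_ent_center = 3.0             # base center of the gate (value assumed; the PR text quotes 3.0 for XSA)
    ngram_ent_slope = 0.35             # placeholder; lowers the center as the matched order grows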

What I found in the code:

  • The n-gram eval cache (eval_val_sliding_ngram_hashed) uses a hashed n-gram table at train_gpt.py lines 1109–1196.
  • The hard-backoff path (lines 1140–1164) computes the lookup key for the "full" (context+target) table as:
    full_key = ((ctx_hash ^ (tgt_np[v_idx] * primes[ctx_width % len(primes)])) & mask)
    full_counts = full_tables[n][full_key]
    where tgt_np = val_np[global_j] is the target/label token at the position being scored (line 1106).
  • The blend path (lines 1109–1139) does the same target-in-key construction.
  • The cache update at lines 1180–1195 happens after segment scoring (the README's "score-first" claim), but the lookup itself at line 1154/1125 uses the target token to build the hash address. The README at line 60 says "matched order comes from the n-gram cache (built from already-scored tokens only)"; that part is true for updates, but the read address still depends on tgt_np[v_idx]. A toy demonstration of why the read address matters follows this list.
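To make the residency point concrete, here is a toy demonstration of the construction (simplified hashing and table size, not the actual train_gpt.py layout):

    import numpy as np

    MASK = (1 << 20) - 1            # toy table size
    PRIME = 1_000_003               # toy mixing prime

    def full_key(ctx_hash, tgt):
        # same shape as the flagged construction: the read address mixes in the target token
        return (ctx_hash ^ (tgt * PRIME)) & MASK

    full_counts = np.zeros(MASK + 1, dtype=np.int32)

    # build phase: only (ctx, tgt) pairs that actually occurred get a non-zero slot
    for ctx_hash, tgt in [(123_456, 7), (123_456, 7), (987_654, 3)]:
        full_counts[full_key(ctx_hash, tgt)] += 1

    # read phase: a lookup with the *true* target is non-zero exactly when that
    # (ctx, tgt) pair was previously observed (up to collisions), so the returned
    # count is already a target-conditioned signal; a context-only table cannot
    # reveal this about the label.
    print(full_counts[full_key(123_456, 7)])   # 2 -> true continuation seen before
    print(full_counts[full_key(123_456, 5)])   # 0 -> wrong candidate lands in an (almost surely) empty slot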

Questions:

  1. Per @valerio-oai's ruling on PR #779 (Record: BackoffNgramMixer + Drift-Free TTT, 3-seed mean val_bpb=0.6683; see the Illegal submissions megathread #677, comment 4145781641, 2026-03-27), an n-gram cache lookup keyed on the target token leaks information about the target via hash-bucket residency: a slot only contains a non-zero count for (ctx, tgt) pairs that were previously observed, so reading full_counts[hash(ctx, tgt)] with the true tgt is materially different from "looking up ctx in the cache." Could you walk through why this construction is not equivalent to indexing by (ctx, tgt) and using the resulting count as a target-conditioned signal?
  2. A standard backward-looking n-gram table indexes by context only and returns a distribution (or top-k) over possible next tokens; the model's mixing alpha then operates on that distribution without ever touching the true target. Would it be possible to refactor this to read ctx_tables[n][ctx_key] only, derive p_ng per candidate from the context bucket alone, and report whether the BPB delta survives? (A sketch of this context-only read follows this list.)
  3. The same author's PR #798 (Record: Order-Adaptive Entropy Gating + BackoffNgramMixer, val_bpb=0.5466) was closed today for what looks like the same full_key = hash(ctx ^ tgt * prime) construction. Is this PR a refinement of that approach, or were both PRs developed in parallel before the #779 ruling was visible? (No accusation; purely to understand whether the order-adaptive gating idea can be salvaged on top of a context-only n-gram eval.)
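For reference, a minimal sketch of the context-only read proposed in question 2 (function and table names are hypothetical, not existing train_gpt.py helpers):

    import numpy as np

    def ctx_only_ngram_probs(ctx_tables, n, ctx_key):
        # ctx_tables[n][ctx_key] is assumed to hold per-candidate next-token counts
        # for this context bucket (dense over the 1024-token vocab in this sketch)
        counts = ctx_tables[n][ctx_key]
        total = counts.sum()
        if total == 0:
            return None                 # no match at this order; back off to order n-1
        return counts / total           # context-conditioned next-token distribution

    # The true target never touches the read address; it only enters when the
    # blended distribution is scored, e.g.:
    #   p_ng  = ctx_only_ngram_probs(ctx_tables, n=7, ctx_key=ctx_hash & mask)
    #   p_mix = alpha * p_ng + (1 - alpha) * p_model
    #   loss  = -log2(p_mix[target])    # then normalized by byte count for BPB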

Standalone merits (independent of the cache question):

  • The order-adaptive entropy gating idea itself (ent_center_n = ent_center - slope * (n - min_order)) is conceptually clean and uses only the model's own logits + the matched order — no target dependence in the gating logic. If the underlying n-gram lookup were refactored to be context-only, the gating mechanism would be straightforwardly legal.
  • XSA extended to all 11 layers (vs last-4) is an independently interesting ablation worth keeping.
  • Sliding-only BPB is reported as 1.1222, consistent with the stack of PR #753 (Podracing II: Electric Bugaloo, 0.9625 BPB 3-seed mean, all sub-0.964), whose train.log shows final_int6_sliding_window val_bpb=1.1195. In other words, the 0.9370 number comes entirely from the n-gram eval mixing, not from training improvements.
  • The result is a 3-seed mean with std=0.0003; the improvement over the previous record is well above the 0.005 nats record threshold.
  • Artifact 15,828,199–15,964,115 bytes across seeds, all under the 16 MB cap.
  • CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, Hyperparameters and GPT classes load, model_dim=512 num_layers=11 vocab=1024 train_seq_len=2048, code_bytes=110175.

Verdict: COMPLIANCE FLAG. The n-gram eval cache appears to use the same target-in-key hashed construction that @valerio-oai ruled out on PR #779, and that led to PR #798 (same author) being closed earlier today. The training pipeline, ablation table, XSA-all change, and order-adaptive gating mechanism all look clean on their own; the issue is isolated to lines 1125/1154/1193 of the eval loop.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:
NEEDS AUTHOR ACTION — refactor eval_val_sliding_ngram_hashed to look up the hashed cache by context only (drop tgt_np from the full_key computation, or read ctx_tables[n][ctx_key] and derive a context-only next-token estimate), rerun the 3-seed n-gram7 number, and confirm the order-adaptive gating delta vs PR #753 still holds. If the BPB survives the refactor, the order-adaptive idea + XSA-all should stand on its own merits. If it does not, the improvement is attributable to the cache pattern that #779 ruled out, and the PR should be closed alongside #798.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK 0.06s, HAS_HYPERPARAMETERS=True, HAS_GPT=True, model_dim=512, num_heads=8, num_layers=11, vocab=1024, train_seq_len=2048, code_bytes=110175, SMOKE_TEST_PASS. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA b4edf2a84b069b2adae9ee590736aa2dd02fcf4b.
