…t integrity) Tooling + methodology contribution systematizing the build_sentencepiece_luts bug disclosed in @yahya010's PR openai#1734 closure (2026-04-19). Static LUT inspection tool detecting three byte-count bug variants (leading_space_plus_one, byte_token_wrong_size, missing_is_unused) without running the model. Applied to current top-10 open PRs on 2026-04-23: 6 CORRECT, 4 OBFUSCATED, 0 BUGGY. Frontier of verified correct-LUT PRs: openai#1735 (AjAnubolu, 1.04290). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the audit — useful tooling, and the byte-count bug analysis is clear. A couple of notes on how this intersects with my PRs:

PR #1785 is closed and superseded

The audit lists PR #1785 (OE-GOD, 1.01925, OBFUSCATED). That PR is closed (by me, on 2026-04-23, after @nprime06's gate-legality review on PR #1795). The earlier commit that wound up OBFUSCATED in your scan shipped an lzma/b85-wrapped train_gpt.py. The successor is PR #1795 (open, current commit …).

My LUT is canonical (inherited from @clarkkev unchanged)

Grepping build_sentencepiece_luts in PR #1795's train_gpt.py shows the relevant loop:

```python
for token_id in range(sp_vocab_size):
    if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
        continue  # boundary predicate: full (is_unused included)
    is_boundary_token_np[token_id] = False
    if sp.is_byte(token_id):
        base_bytes_np[token_id] = 1  # byte tokens: sized as 1 (not len('<0xXX>'.encode()))
        continue
    piece = sp.id_to_piece(token_id)
    if piece.startswith("\u2581"):
        has_leading_space_np[token_id] = True
        piece = piece[1:]
    base_bytes_np[token_id] = len(piece.encode("utf-8"))  # no + 1 baked into LUT
```

All three canonical properties satisfied: leading_space_noplus, byte_token_one, and boundary_predicate_full.

So the inflation-ratio correction doesn't apply to PR #1795's NN-only column: our NN-only sliding BPB mean is 1.09764, matching @clarkkev's 2026-04-01 record of 1.09785 within seed noise. The −0.074 Δ from the mixture is computed on top of that canonical NN byte count.

Aside on the "naive 1.1671 if bug present" arithmetic

Your note that "naive application of the 1.1671 ratio if the bug were present would yield #1785 → ~1.190" — appreciate you flagging it as a hypothetical. For the record: my PR's LUT is verifiably not buggy, so 1.190 is not a correction that applies.

Suggestion

When re-running the audit (or adding PR #1795 to the corrected leaderboard), the classification should be CORRECT.
This is a non-record submission: a static audit tool + snapshot report, not a model. Intended as a reusable self-check for submitters and a clarity aid for reviewers. Full details below.
Measurement Integrity Note: BPB Byte-Count Audit of the #1698 Lineage
Type: Non-record PR — tooling + methodology contribution.
Track: track_non_record_16mb
Authors of this PR: (filer)
Acknowledgement: This work systematizes the byte-count discrepancy that
yahya010 discovered and self-reported in the PR #1734 closure on 2026-04-19.
TL;DR
- build_sentencepiece_luts in the #1698 lineage bakes a +1 into the byte LUT for leading-space tokens, while eval_val_sliding then adds the same +1 again, double-counting.
- The resulting inflation ratio is 1.1671 over the sliding-window scored subset that PR #1727's eval_val_sliding actually uses (151,080,891 canonical vs 176,332,748 buggy bytes on SP8192 fineweb val, 633,420 windows of seq_len=2048, stride=64). yahya010's closure quoted ~17.46% against a different reference — his own #1734 LUT applied to the decoded-stream ground truth. Both ratios characterize the same underlying bug; the small numerical difference is a scoring-strategy + LUT-construction artefact, documented in audit/methodology.md §4. Reported buggy BPBs translate to canonical BPBs via canonical = reported × inflation_ratio, where the ratio is whichever one matches the PR's own scoring.
- scripts/canonical_rescore.py: a static LUT inspection + byte-count tool that requires no GPU, no checkpoint, and no reproduction run. Drop in any train_gpt.py and it returns the LUT classification, the exact inflation ratio over the actual scored-token subset, and the inferred canonical BPB. The tool supports three --scoring-mode variants so reviewers can reproduce both the 1.1671 and 1.1746 numbers.
- Beyond the leading-space bake (leading_space_plus_one), the tool also checks byte_token_wrong_size (the sp.is_byte branch sizing byte tokens by the UTF-8 length of the literal "<0xXX>" string) and missing_is_unused (boundary predicate omits sp.is_unused). yahya010's PR #1734 train_gdn_7k.py is the case where multiple variants co-occur. The extended classifier applied to the current top-10 PRs produces the same classification as the single-bug detector (6 CORRECT, 4 OBFUSCATED) — see audit/changelog_v2.md.
- Of the current top-10 open PRs on 2026-04-23: 6 are CORRECT (canonical LUT verified), 4 are OBFUSCATED (lzma.decompress(base64.b85decode(...)) — LUT cannot be verified statically). The LUT-verified frontier is PR #1735 (AjAnubolu, 1.04290), followed by the cluster of 1.064-1.071 PRs anchored by the reproducible PR #1727 stack.
This is a tooling and methodology contribution, not a disqualification
petition. The intent is to give future submitters a one-command self-check
("did I inherit the #1698 LUT bug?") and to help reviewers separate
LUT-verified results from unverified ones.
The bug, in one paragraph
Canonical SentencePiece BPB attributes one byte to the leading space of a
piece beginning with the ▁ marker, but only when the previous token is
not a boundary token (UNK / control / unused). The #1700-line
implementation (PR #1727, line 196) writes
base_bytes_np[token_id] = len(piece.encode("utf-8")) after stripping the ▁,
then in eval_val_sliding adds (has_leading_space[y] & ~is_boundary[x_prev]).
The #1698 line writes
base_bytes_np[token_id] = len(piece.encode("utf-8")) + 1 inside the
leading-space branch — so the +1 is already baked into the LUT — and then
also adds the boundary-gated +1 at eval time. Each leading-space scored
token is therefore credited with one extra byte beyond canonical. On SP8192
fineweb val, leading-space tokens account for 62.3% of all val tokens, so
the byte denominator is inflated by ~16.71% and the reported BPB is
correspondingly deflated.
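To make the double count concrete, here is a minimal sketch (illustrative, not either PR's actual code) of the two LUT conventions and the eval-time byte attribution described above:

```python
# Minimal sketch of the two conventions (illustrative, not the actual train_gpt.py code).
piece = "\u2581the"                    # a SentencePiece piece with the leading-space marker
stripped = piece[1:]                   # "the"

canonical_lut_bytes = len(stripped.encode("utf-8"))       # 3: no space byte baked in
buggy_lut_bytes     = len(stripped.encode("utf-8")) + 1   # 4: the #1698-lineage bake

def scored_bytes(lut_bytes: int, has_leading_space: bool, prev_is_boundary: bool) -> int:
    # eval_val_sliding's boundary-gated +1 is applied in BOTH implementations
    return lut_bytes + int(has_leading_space and not prev_is_boundary)

print(scored_bytes(canonical_lut_bytes, True, False))  # 4 -> " the" is 4 bytes (correct)
print(scored_bytes(buggy_lut_bytes, True, False))      # 5 -> one extra byte per leading-space token
```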
Why we can correct without re-running the model: the cross-entropy
numerator is independent of the LUT.
bpb = (loss × N_tokens) / (ln(2) × byte_count). Multiply the reported
(buggy) value by the buggy_bytes / canonical_bytes ratio and you recover
the canonical BPB.
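As a worked instance of that identity, using the byte counts from the TL;DR (the 1.01925 reported value is only borrowed from the naive-arithmetic aside further below; the correction applies only when a LUT is actually buggy):

```python
# Worked correction: the cross-entropy numerator cancels; only the byte denominator changes.
canonical_bytes = 151_080_891
buggy_bytes     = 176_332_748

inflation_ratio = buggy_bytes / canonical_bytes
print(round(inflation_ratio, 4))            # 1.1671

reported_bpb  = 1.01925                     # hypothetical: a BPB scored against the buggy denominator
canonical_bpb = reported_bpb * inflation_ratio
print(round(canonical_bpb, 3))              # ~1.190
```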
Methodology (full version: audit/methodology.md)

For each PR:

- git fetch upstream pull/<N>/head:pr-<N> and check it out.
- Locate the latest train_gpt.py under records/track_10min_16mb/<latest-dated-dir>/.
- Run scripts/canonical_rescore.py against that script + the SP8192 tokenizer + the fineweb_val shard.

The tool emits, per PR (a simplified sketch of the classification step follows this list):

- lut_status: CORRECT / BUGGY / OBFUSCATED / UNKNOWN
- inflation_ratio: 1.0 for CORRECT, computed buggy/canonical for BUGGY (~1.1671 on SP8192), null otherwise.
- inferred_canonical_bpb: reported_bpb × inflation_ratio if both are known; null otherwise.
- passes_merged_sota_threshold: boolean, threshold default 1.0738 (one record-class margin under the merged-SOTA reference).
Hardware parity is anchored by exp_001: a verbatim PR #1727 reproduction on
8×H100 SXM, seed 1337, val_bpb = 1.07431, within 0.00214 of the reported
3-seed mean of 1.07217 — confirming our toolchain (torch 2.8.0+cu128) sees
the same numbers as upstream and that the audit's analytic correction can
be trusted. See experiments/exp_001/analysis.md.

Scope and limitations
What "LUT-verified CORRECT" does and does not mean:
build_sentencepiece_lutsfunction in the PR'strain_gpt.pyuses the canonicallen(piece.encode("utf-8"))pattern(no
+1for leading-space tokens) and is not wrapped inlzma.decompress(base64.b85decode(...)).BPB. The tool verifies the LUT only; the cross-entropy numerator of BPB
is taken as given.
eval_val_slidingitself is canonical. A PRthat modified the eval loop would not be caught by this tool. We assume
upstream-faithful eval logic.
shards, different tokenizers, custom BPB definitions. Independent
reproduction remains the gold standard for a contested record.
What the OBFUSCATED verdict does and does not mean:

- It means the tool found a *.decompress(*.b85decode(...)) chain and could not locate a readable build_sentencepiece_luts implementation.
- It does not mean the LUT is buggy; verifying the LUT inside the wrapper requires sandbox execution, which is out of scope for this audit.
PR #1735's 0.021 BPB lead over the next-best CORRECT result (#1779 at
1.06421) is sufficiently large that independent reproduction is warranted
before treating it as authoritative for record-class comparisons. The tool
verifies only the LUT, not the full training pipeline; a wide gap like
this could be real or could reflect some other path that the tool does not
inspect. The frontier PR #1735 reading is "LUT-verified, reproduction
pending", not "verified as the true top".
Tool usage
```bash
python scripts/canonical_rescore.py \
  --train-script <path-to-PR-train_gpt.py> \
  --tokenizer data/tokenizers/fineweb_8192_bpe.model \
  --val-data 'data/datasets/fineweb10B_sp8192/fineweb_val_*.bin' \
  --reported-bpb 1.02840 \
  --pr-number 1758
```

Output (JSON to stdout / --output):

```json
{
  "pr_number": 1758,
  "script_path": "...",
  "lut_status": "OBFUSCATED",
  "inflation_ratio": null,
  "inferred_canonical_bpb": null,
  "passes_merged_sota_threshold": null,
  "notes": "Code is lzma/b85-obfuscated; LUT cannot be verified statically."
}
```

For a CORRECT script the output looks like:

```json
{
  "pr_number": 1735,
  "lut_status": "CORRECT",
  "inflation_ratio": 1.0,
  "inferred_canonical_bpb": 1.0429,
  "passes_merged_sota_threshold": true
}
```

For a BUGGY script the output reports the exact byte counts, the inflation
ratio, and the corrected BPB.
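For concreteness, a hedged illustration of what that BUGGY output could look like: the pr_number is fictional, the reported value is borrowed from the --reported-bpb 1.02840 in the usage example purely for the arithmetic, and any field not listed under Methodology (the byte counts, bug_variants) is an assumed field name rather than the tool's documented schema.

```json
{
  "pr_number": 9999,
  "lut_status": "BUGGY",
  "bug_variants": ["leading_space_plus_one"],
  "canonical_bytes": 151080891,
  "buggy_bytes": 176332748,
  "inflation_ratio": 1.1671,
  "inferred_canonical_bpb": 1.2003,
  "passes_merged_sota_threshold": false
}
```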
Tests covering CORRECT (PR #1727), BUGGY (four synthetic fixtures — one
per bug variant plus the triple-bug case), OBFUSCATED (both inline-exec
and runpy-style wrappers), UNKNOWN, the three scoring-mode variants,
and the full end-to-end rescore are in tests/test_canonical_rescore.py
(20 tests, all green).
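As an illustration of the fixture style, a test along these lines could exercise the simplified classify_lut sketch from the Methodology section above (this is not the shipped test suite; the real tests/test_canonical_rescore.py presumably targets the actual tool):

```python
# Hypothetical fixture-style test against the simplified classify_lut sketch above.
BUGGY_FIXTURE = '''
def build_sentencepiece_luts(sp, sp_vocab_size):
    for token_id in range(sp_vocab_size):
        piece = sp.id_to_piece(token_id)
        if piece.startswith("\\u2581"):
            base_bytes_np[token_id] = len(piece.encode("utf-8")) + 1   # the baked-in +1
'''

def test_leading_space_plus_one_is_flagged():
    result = classify_lut(BUGGY_FIXTURE)
    assert result["lut_status"] == "BUGGY"
    assert "leading_space_plus_one" in result["bugs"]
```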
Results (full version: audit/results.md and audit/corrected_leaderboard.md)

† "LUT-verified" means the tool statically confirmed a canonical
build_sentencepiece_luts. Under the v2 (three-variant) classifier this
requires all three canonical properties — leading_space_noplus,
byte_token_one, and boundary_predicate_full — to match. This is
necessary but not sufficient for a trustworthy BPB — see "Scope and
limitations" above. The v2 classifier reproduces the same classification
as v1 on every row of this table; see audit/changelog_v2.md for the
side-by-side.

LUT-verified frontier: PR #1735 (AjAnubolu) at reported BPB 1.04290,
with PR #1779 the next-best LUT-verified entry at 1.06421. The 0.021 BPB
gap is large enough that independent reproduction is warranted before
treating #1735 as the authoritative record.
Four PRs in the top 10 (#1785, #1758, #1738, #1771) returned OBFUSCATED
and could not be statically audited. We do not claim these are buggy; we
state the observation neutrally: the three lowest reported BPBs on the
current top-10 snapshot are all in obfuscated code, and the only sub-1.05
submission with a self-disclosed LUT classification (yahya010's PR #1734,
1.0108 → ~1.1873) was buggy. This is a pattern, not a causal claim. A
naive application of the 1.1671 ratio if the bug were present would
yield #1785 → ~1.190, #1758 → ~1.200, #1738 → ~1.208, and #1771 → ~1.243,
but this arithmetic is only meaningful if the obfuscated LUTs actually
match the #1698 lineage, which we have not verified and cannot verify
without sandbox execution of the wrapped code.
Attribution
Verbatim from the PR #1734 closure comment by yahya010, 2026-04-19:
yahya010's quoted ratio (1.1746) was computed against his own #1734 LUT,
which has two byte-counting differences from the #1727-style LUT: byte
tokens are sized by len("<0xXX>".encode("utf-8")) (6 bytes) rather than
1, and sp.is_unused tokens are not treated as boundary. Our tool's three
--scoring-mode variants converge to 1.1671 on SP8192 fineweb val when
applied to the #1727-style LUT shape; running yahya's LUT directly
against the same val stream gives 1.1770 — within 0.2% of the quoted
1.1746. Both characterizations describe the same underlying defect
(leading-space bytes baked into the LUT and re-added at eval); the
numerical correction to any particular PR depends on which flavour of
LUT that PR uses. Full analysis in audit/methodology.md §4, and the
per-property detection design in §5.
This audit extends yahya010's finding by:

- pinning each inflation ratio to the LUT flavour and scoring strategy that produce it (1.1671 vs 1.1746), so the differently quoted ratios are no longer a source of confusion.
- classifying the two additional byte-counting deviations in his own train_gdn_7k.py (byte-token sizing, missing is_unused in the boundary predicate) as explicitly-named deviations in the tool's JSON output, so future submissions can be checked for each variant individually.
Framing
We do not request any PR be re-classified or closed. The competition
maintainers and authors are best positioned to decide whether obfuscated
submissions are eligible for record consideration. Our contribution is:
- A one-command self-check (scripts/canonical_rescore.py) that any submitter can run before filing — including a regex check that catches the buggy +1 pattern in seconds (a minimal version of that check is sketched after this list).
- A methodology document (audit/methodology.md) defining canonical BPB rigorously enough that disagreements about "what is canonical" can be resolved by code rather than discussion.
- A corrected-leaderboard snapshot (audit/corrected_leaderboard.md, audit/results.md) that distinguishes verified canonical BPB from reported BPB, so reviewers do not have to re-derive that distinction per-PR.
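A minimal version of the regex self-check mentioned in the first bullet might look like this (a hypothetical one-off, not the shipped tool; the full script performs the complete three-variant classification):

```python
import re
import sys

# Quick self-check for the baked-in leading-space "+ 1" in a train_gpt.py.
source = open(sys.argv[1], encoding="utf-8").read()
if re.search(r'len\(piece\.encode\("utf-8"\)\)\s*\+\s*1', source):
    print("possible leading_space_plus_one bake; run the full audit tool")
else:
    print("no baked +1 found")
```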
The LUT-verified frontier (PR #1735 at canonical 1.04290, leading the
cluster around 1.064-1.071) is the cleanest statement we can make from
static inspection alone. Whether the 0.021 BPB gap between #1735 and the
next-best LUT-verified entry reflects a genuine capability step-change
or a reporting artefact is outside the scope of this audit; we flag it as
"reproduction-pending" rather than "verified record".