Audit 1698 lineage bpb bytecount #1804

Open
abi2024 wants to merge 2 commits into openai:main from abi2024:audit-1698-lineage-bpb-bytecount

Conversation


abi2024 commented Apr 24, 2026

This is a non-record submission: a static audit tool + snapshot report, not a model. Intended as a reusable self-check for submitters and a clarity aid for reviewers. Full details below.

Measurement Integrity Note: BPB Byte-Count Audit of the #1698 Lineage

Type: Non-record PR — tooling + methodology contribution.
Track: track_non_record_16mb
Authors of this PR: (filer)
Acknowledgement: This work systematizes the byte-count discrepancy that
yahya010 discovered and self-reported in the PR #1734 closure comment on 2026-04-19.


TL;DR

This is a tooling and methodology contribution, not a disqualification
petition. The intent is to give future submitters a one-command self-check
("did I inherit the #1698 LUT bug?") and to help reviewers separate
LUT-verified results from unverified ones.


The bug, in one paragraph

Canonical SentencePiece BPB attributes one byte to the leading space of a
piece beginning with the ▁ (U+2581) marker, but only when the previous token
is not a boundary token (UNK / control / unused). The #1700-lineage
implementation (PR #1727, line 196) writes base_bytes_np[token_id] =
len(piece.encode("utf-8")) after stripping the ▁, then in eval_val_sliding
adds (has_leading_space[y] & ~is_boundary[x_prev]). The #1698 lineage writes
base_bytes_np[token_id] = len(piece.encode("utf-8")) + 1 inside the
leading-space branch, so the +1 is already baked into the LUT, and then also
adds the boundary-gated +1 at eval time. Each leading-space scored token is
therefore credited with one extra byte beyond canonical. On SP8192 fineweb
val, leading-space tokens account for 62.3% of all val tokens, so the byte
denominator is inflated by ~16.71% and the reported BPB is correspondingly
deflated.
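
A minimal sketch of the two constructions, assuming a loaded SentencePiece
processor sp and the array names quoted above (the loop scaffolding is
illustrative, not either PR's verbatim code):

import numpy as np

def build_base_bytes(sp, vocab_size, lineage="1700"):
    # LUT of per-token byte counts; has_leading_space_np feeds the eval-time gate.
    base_bytes_np = np.zeros(vocab_size, dtype=np.int64)
    has_leading_space_np = np.zeros(vocab_size, dtype=bool)
    for token_id in range(vocab_size):
        piece = sp.id_to_piece(token_id)
        if piece.startswith("\u2581"):              # leading-space marker
            has_leading_space_np[token_id] = True
            piece = piece[1:]
            # The #1698 lineage bakes the leading-space byte into the LUT here;
            # eval_val_sliding then adds the boundary-gated +1 a second time.
            extra = 1 if lineage == "1698" else 0
            base_bytes_np[token_id] = len(piece.encode("utf-8")) + extra
        else:
            base_bytes_np[token_id] = len(piece.encode("utf-8"))
    return base_bytes_np, has_leading_space_np

At eval time both lineages add (has_leading_space[y] & ~is_boundary[x_prev]),
so under the #1698 lineage every scored leading-space token is counted twice.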

Why we can correct without re-running the model: the cross-entropy numerator
is independent of the LUT. Since bpb = (loss × N_tokens) / (ln(2) × byte_count),
multiplying the buggy reported value by the buggy_bytes / canonical_bytes
ratio recovers the canonical BPB.
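
The correction itself is one line; a sketch with an illustrative helper name,
using the per-PR byte counts the tool computes:

def corrected_bpb(reported_bpb, buggy_bytes, canonical_bytes):
    # reported  = (loss * n_tokens) / (ln 2 * buggy_bytes)
    # canonical = (loss * n_tokens) / (ln 2 * canonical_bytes)
    #           = reported * (buggy_bytes / canonical_bytes)
    return reported_bpb * (buggy_bytes / canonical_bytes)

# e.g. the PR #1734 flavour quoted below: 1.0108 * 1.1746 ≈ 1.1873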


Methodology (full version: audit/methodology.md)

For each PR:

  1. git fetch upstream pull/<N>/head:pr-<N> and check it out.
  2. Find the train_gpt.py under records/track_10min_16mb/<latest-dated-dir>/.
  3. Run scripts/canonical_rescore.py against that script + the SP8192
    tokenizer + the fineweb_val shard.
  4. Tool returns:
    • lut_status: CORRECT / BUGGY / OBFUSCATED / UNKNOWN
    • inflation_ratio: 1.0 for CORRECT, computed buggy/canonical for
      BUGGY (~1.1671 on SP8192), null otherwise.
    • inferred_canonical_bpb: reported_bpb × inflation_ratio if both are
      known; null otherwise.
    • passes_merged_sota_threshold: boolean, threshold default 1.0738 (one
      record-class margin under the merged-SOTA reference).
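
A sketch of the verdict record those four fields populate (the dataclass
shape and the finalize helper are assumptions; the field names and the
threshold default are the tool's):

from dataclasses import dataclass
from typing import Optional

@dataclass
class AuditVerdict:
    lut_status: str                               # CORRECT / BUGGY / OBFUSCATED / UNKNOWN
    inflation_ratio: Optional[float]              # 1.0, buggy/canonical, or None
    inferred_canonical_bpb: Optional[float]       # reported * ratio, or None
    passes_merged_sota_threshold: Optional[bool]

def finalize(verdict: AuditVerdict, reported_bpb: float, threshold: float = 1.0738):
    # Only CORRECT / BUGGY verdicts carry a ratio; OBFUSCATED / UNKNOWN stay null.
    if verdict.inflation_ratio is not None:
        verdict.inferred_canonical_bpb = reported_bpb * verdict.inflation_ratio
        verdict.passes_merged_sota_threshold = (
            verdict.inferred_canonical_bpb <= threshold
        )
    return verdict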

Hardware parity is anchored by exp_001: a verbatim PR #1727 reproduction on
8×H100 SXM, seed 1337, val_bpb = 1.07431, within 0.00214 of the reported
3-seed mean of 1.07217 — confirming our toolchain (torch 2.8.0+cu128) sees
the same numbers as upstream and that the audit's analytic correction can
be trusted. See experiments/exp_001/analysis.md.


Scope and limitations

What "LUT-verified CORRECT" does and does not mean:

  • Does mean the build_sentencepiece_luts function in the PR's
    train_gpt.py uses the canonical len(piece.encode("utf-8")) pattern
    (no +1 for leading-space tokens) and is not wrapped in
    lzma.decompress(base64.b85decode(...)).
  • Does not imply the model artifact the PR ships achieves its reported
    BPB. The tool verifies the LUT only; the cross-entropy numerator of BPB
    is taken as given.
  • Does not imply that eval_val_sliding itself is canonical. A PR
    that modified the eval loop would not be caught by this tool. We assume
    upstream-faithful eval logic.
  • Does not rule out other measurement irregularities — modified val
    shards, different tokenizers, custom BPB definitions. Independent
    reproduction remains the gold standard for a contested record.

What the OBFUSCATED verdict does and does not mean:

  • Does mean the tool's static regex found a *.decompress(*.b85decode(...))
    chain and could not locate a readable build_sentencepiece_luts
    implementation.
  • Does not mean the PR is buggy. The OBFUSCATED verdict is neutral;
    verifying the LUT inside the wrapper requires sandbox execution, which
    is out of scope for this audit.
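
For illustration, a static check in the spirit of the OBFUSCATED detector
(this regex is an assumption, not the shipped pattern in
scripts/canonical_rescore.py):

import re

# Matches e.g. lzma.decompress(base64.b85decode(...)) wrapper chains.
OBFUSCATION_RE = re.compile(r"\w+\.decompress\s*\(\s*\w+\.b85decode\s*\(")

def looks_obfuscated(source: str) -> bool:
    # A match is neutral: it means the LUT cannot be read statically,
    # which says nothing about whether the wrapped code is buggy.
    return OBFUSCATION_RE.search(source) is not None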

PR #1735's 0.021 BPB lead over the next-best CORRECT result (#1779 at
1.06421) is sufficiently large that independent reproduction is warranted
before treating it as authoritative for record-class comparisons. The tool
verifies only the LUT, not the full training pipeline; a wide gap like
this could be real or could reflect some other path that the tool does not
inspect. The frontier PR #1735 reading is "LUT-verified, reproduction
pending", not "verified as the true top".


Tool usage

python scripts/canonical_rescore.py \
    --train-script <path-to-PR-train_gpt.py> \
    --tokenizer    data/tokenizers/fineweb_8192_bpe.model \
    --val-data     'data/datasets/fineweb10B_sp8192/fineweb_val_*.bin' \
    --reported-bpb 1.02840 \
    --pr-number    1758

Output (JSON to stdout / --output):

{
  "pr_number": 1758,
  "script_path": "...",
  "lut_status": "OBFUSCATED",
  "inflation_ratio": null,
  "inferred_canonical_bpb": null,
  "passes_merged_sota_threshold": null,
  "notes": "Code is lzma/b85-obfuscated; LUT cannot be verified statically."
}

For a CORRECT script the output looks like:

{
  "pr_number": 1735,
  "lut_status": "CORRECT",
  "inflation_ratio": 1.0,
  "inferred_canonical_bpb": 1.0429,
  "passes_merged_sota_threshold": true
}

For a BUGGY script the output reports the exact byte counts, the inflation
ratio, and the corrected BPB.

Tests covering CORRECT (PR #1727), BUGGY (four synthetic fixtures — one
per bug variant plus the triple-bug case), OBFUSCATED (both inline-exec
and runpy-style wrappers), UNKNOWN, the three scoring-mode variants,
and the full end-to-end rescore are in tests/test_canonical_rescore.py
(20 tests, all green).


Results (full version: audit/results.md and audit/corrected_leaderboard.md)

| Rank | PR | Author | Reported BPB | LUT status | LUT-verified† | Canonical BPB |
|------|-------|------------|--------------|------------|---------------|---------------|
| 1 | #1785 | OE-GOD | 1.01925 | OBFUSCATED | no | unverified |
| 2 | #1758 | kilojoules | 1.02840 | OBFUSCATED | no | unverified |
| 3 | #1738 | alertcat | 1.03540 | OBFUSCATED | no | unverified |
| 4 | #1735 | AjAnubolu | 1.04290 | CORRECT | yes | 1.04290 |
| 5 | #1779 | leon2k2k2k | 1.06421 | CORRECT | yes | 1.06421 |
| 6 | #1769 | dexhunter | 1.06453 | CORRECT | yes | 1.06453 |
| 7 | #1756 | romeerp | 1.06505 | CORRECT | yes | 1.06505 |
| 8 | #1771 | bigbag | 1.06513 | OBFUSCATED | no | unverified |
| 9 | #1736 | dexhunter | 1.06549 | CORRECT | yes | 1.06549 |
| 10 | #1784 | renqianluo | 1.07081 | CORRECT | yes | 1.07081 |

† "LUT-verified" means the tool statically confirmed a canonical
build_sentencepiece_luts. Under the v2 (three-variant) classifier
this requires all three canonical properties — leading_space_noplus,
byte_token_one, and boundary_predicate_full — to match. This is
necessary but not sufficient for a trustworthy BPB — see "Scope and
limitations" above. The v2 classifier reproduces the same
classification as v1 on every row of this table; see
audit/changelog_v2.md for the side-by-side.
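
For concreteness, the v2 conjunction as a predicate (the property names are
the classifier's labels; how each property is detected is left to the tool):

def lut_verified(props: dict) -> bool:
    # props maps each canonical property to the static scan's boolean verdict.
    return (props["leading_space_noplus"]
            and props["byte_token_one"]
            and props["boundary_predicate_full"])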

LUT-verified frontier: PR #1735 (AjAnubolu) at reported BPB 1.04290,
with PR #1779 the next-best LUT-verified entry at 1.06421. The 0.021 BPB
gap is large enough that independent reproduction is warranted before
treating #1735 as the authoritative record.

Four PRs in the top 10 (#1785, #1758, #1738, #1771) returned OBFUSCATED
and could not be statically audited. We do not claim these are buggy; we
state the observation neutrally: the three lowest reported BPBs on the
current top-10 snapshot are all in obfuscated code, and the only sub-1.05
submission with a self-disclosed LUT classification (yahya010's PR #1734,
1.0108 → ~1.1873) was buggy. This is a pattern, not a causal claim. If the
bug were present, naively applying the 1.1671 ratio would yield #1785 →
~1.190, #1758 → ~1.200, #1738 → ~1.208, and #1771 → ~1.243, but this
arithmetic is only meaningful if the obfuscated LUTs actually match the
#1698 lineage, which we have not verified and cannot verify without sandbox
execution of the wrapped code.
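
Spelling out that hypothetical arithmetic (valid only under the unverified
assumption that these LUTs match the #1698 lineage):

RATIO_1698_SP8192 = 1.1671   # buggy/canonical byte ratio on SP8192 fineweb val

for pr, reported in [(1785, 1.01925), (1758, 1.02840),
                     (1738, 1.03540), (1771, 1.06513)]:
    print(f"#{pr}: {reported:.5f} -> ~{reported * RATIO_1698_SP8192:.3f}")
# #1785: 1.01925 -> ~1.190
# #1758: 1.02840 -> ~1.200
# #1738: 1.03540 -> ~1.208
# #1771: 1.06513 -> ~1.243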


Attribution

Verbatim from the PR #1734 closure comment by yahya010, 2026-04-19:

"build_sentencepiece_luts bakes +1 into LUT for leading-space tokens,
then eval_val_sliding adds +1 again at eval. Buggy code overcounts bytes
by 17.46% vs canonical sp.decode_ids().encode('utf-8'). Reported
val_bpb=1.0108 corresponds to canonical val_bpb≈1.1873..."

yahya010's quoted ratio (1.1746) was computed against his own #1734 LUT,
which has two byte-counting differences from the #1727-style LUT: byte
tokens are sized by len("<0xXX>".encode("utf-8")) (6 bytes) rather than
1, and sp.is_unused tokens are not treated as boundary. Our tool's
three --scoring-mode variants converge to 1.1671 on SP8192 fineweb val
when applied to the #1727-style LUT shape; running yahya's LUT directly
against the same val stream gives 1.1770 — within 0.2% of the quoted
1.1746. Both characterizations describe the same underlying defect
(leading-space bytes baked into the LUT and re-added at eval); the
numerical correction to any particular PR depends on which flavour of
LUT that PR uses. Full analysis in audit/methodology.md §4, and the
per-property detection design in §5.
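
Those two flavour differences, side by side (minimal sketch; t is a token id
and sp a loaded SentencePiece processor):

# Byte-token sizing in the LUT:
BYTE_TOKEN_BYTES_1727 = 1                               # canonical
BYTE_TOKEN_BYTES_1734 = len("<0xXX>".encode("utf-8"))   # 6: the literal piece string

# Boundary predicate:
def is_boundary_1727(sp, t):   # full: UNK / control / unused
    return sp.is_unknown(t) or sp.is_control(t) or sp.is_unused(t)

def is_boundary_1734(sp, t):   # is_unused missing
    return sp.is_unknown(t) or sp.is_control(t)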

This audit extends yahya010's finding by:

  1. Publishing a tool anyone can run without reproducing on GPU.
  2. Applying it to the full set of currently-open top-10 PRs.
  3. Documenting the scoring-strategy sensitivity explicitly so the two
    quoted ratios are no longer a source of confusion.
  4. Detecting the two additional LUT-construction bugs in yahya's
    own train_gdn_7k.py (byte-token sizing, missing is_unused in the
    boundary predicate) as explicitly-named deviations in the tool's
    JSON output, so future submissions can be checked for each variant
    individually.

Framing

We do not request any PR be re-classified or closed. The competition
maintainers and authors are best positioned to decide whether obfuscated
submissions are eligible for record consideration. Our contribution is:

  1. A reusable tool (scripts/canonical_rescore.py) that any submitter
    can run before filing — including a regex check that catches the buggy
    +1 pattern in seconds.
  2. A clean methodology document (audit/methodology.md) defining
    canonical BPB rigorously enough that disagreements about "what is
    canonical" can be resolved by code rather than discussion.
  3. A snapshot leaderboard (audit/corrected_leaderboard.md,
    audit/results.md) that distinguishes verified canonical BPB from
    reported BPB, so reviewers do not have to re-derive that distinction
    per-PR.

The LUT-verified frontier (PR #1735 at canonical 1.04290, leading the
cluster around 1.064-1.071) is the cleanest statement we can make from
static inspection alone. Whether the 0.021 BPB gap between #1735 and the
next-best LUT-verified entry reflects a genuine capability step-change
or a reporting artefact is outside the scope of this audit; we flag it as
"reproduction-pending" rather than "verified record".

abi2024 and others added 2 commits April 24, 2026 07:46
…t integrity)

Tooling + methodology contribution systematizing the build_sentencepiece_luts
bug disclosed in @yahya010's PR openai#1734 closure (2026-04-19). Static LUT
inspection tool detecting three byte-count bug variants
(leading_space_plus_one, byte_token_wrong_size, missing_is_unused) without
running the model. Applied to current top-10 open PRs on 2026-04-23:
6 CORRECT, 4 OBFUSCATED, 0 BUGGY. Frontier of verified correct-LUT PRs:
openai#1735 (AjAnubolu, 1.04290).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OE-GOD commented Apr 24, 2026

Thanks for the audit — useful tooling, and the byte-count bug analysis is clear.

A couple of notes on how this intersects with my PRs:

PR #1785 is closed and superseded

The audit lists PR #1785 (OE-GOD, 1.01925, OBFUSCATED). That PR is closed (by me, on 2026-04-23, after @nprime06's gate-legality review on PR #1795). The earlier commit that wound up OBFUSCATED in your scan shipped an lzma.decompress(base64.b85decode(...)) stub because I thought I needed it to fit 16 MB.

The successor is PR #1795 (open, current commit cb5ad95 pushed today). Headline there is 1.01252 ± 0.00044 (3-seed mean, full val, strict-legal outcome-independent gate — fixed in response to @nprime06). More importantly for your audit: the commit ships train_gpt.py as readable source, not an lzma-compressed stub. If you rerun canonical_rescore.py against PR #1795 @ cb5ad95, lut_status should classify as CORRECT, not OBFUSCATED.

My LUT is canonical (inherited from @clarkkev unchanged)

Grepping build_sentencepiece_luts in the current PR #1795 file:

for token_id in range(sp_vocab_size):
    if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
        continue                         # boundary predicate: full (is_unused included)
    is_boundary_token_np[token_id] = False
    if sp.is_byte(token_id):
        base_bytes_np[token_id] = 1      # byte tokens: sized as 1 (not len('<0xXX>'.encode()))
        continue
    piece = sp.id_to_piece(token_id)
    if piece.startswith("\u2581"):
        has_leading_space_np[token_id] = True
        piece = piece[1:]
    base_bytes_np[token_id] = len(piece.encode("utf-8"))  # no + 1 baked into LUT

All three canonical properties satisfied:

  • leading_space_noplus ✓ (line: base_bytes_np[token_id] = len(piece.encode("utf-8")), no +1)
  • byte_token_one ✓ (line: base_bytes_np[token_id] = 1)
  • boundary_predicate_full ✓ (predicate includes sp.is_unused)

And eval_val_sliding adds the leading-space byte exactly once, gated by ~is_boundary_token_lut[prev_ids] — the canonical #1727 pattern. This is verbatim from @clarkkev's PR #1334; my changes to the file are only the _ppm_mixture_bpb addition and a small hook in eval_val_sliding to collect per-token target logprobs and call the mixture. I did not touch build_sentencepiece_luts or the byte-count arithmetic.

So the inflation-ratio correction doesn't apply to PR #1795's NN-only column: our NN-only sliding BPB mean is 1.09764, matching @clarkkev's 2026-04-01 record of 1.09785 within seed noise. The −0.074 Δ from the mixture is computed on top of that canonical NN byte count.

Aside on the "naive 1.1671 if bug present" arithmetic

Your note that "naive application of the 1.1671 ratio if the bug were present would yield #1785 → ~1.190" is appreciated, especially that you flag it as a hypothetical. For the record: my PR's LUT is verifiably not buggy, so 1.190 is not a correction that applies.

Suggestion

When re-running the audit (or adding PR #1795 to the corrected leaderboard), the classification should be CORRECT on the current commit. Happy to answer any specific questions about the byte-count path in my code.

