Summary
Several open GDN/FLA-family submissions report `val_bpb` figures that are inconsistent with Section V of the README. The root cause is a shared `build_sentencepiece_luts` helper that double-credits the leading-space byte for `▁`-prefixed SentencePiece tokens, inflating the byte denominator by ≈17% and proportionally deflating the reported `val_bpb`.

Defect
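Schematically, the double count looks like the sketch below. This is a hypothetical reconstruction, not the PR's actual code: the function names `build_base_bytes` and `total_bytes` are illustrative, and only the quoted `len(piece[1:].encode("utf-8")) + 1` pattern is taken from the affected files.

```python
import numpy as np

def build_base_bytes(pieces):
    """Per-piece byte LUT, with the buggy extra +1 for '▁'-prefixed pieces."""
    base_bytes = np.zeros(len(pieces), dtype=np.int64)
    for i, piece in enumerate(pieces):
        if piece.startswith("▁"):
            # BUG: pre-credits the leading-space byte that the eval
            # accumulator will credit again.
            base_bytes[i] = len(piece[1:].encode("utf-8")) + 1
        else:
            base_bytes[i] = len(piece.encode("utf-8"))
    return base_bytes

def total_bytes(token_ids, base_bytes, has_leading_space, is_boundary):
    """Eval-time byte count (sketch): adds +1 for each '▁'-prefixed token
    that follows a non-boundary token -- the second credit."""
    ids = np.asarray(token_ids)
    n = base_bytes[ids].sum()
    follows_non_boundary = np.concatenate(([False], ~is_boundary[ids[:-1]]))
    n += (has_leading_space[ids] & follows_non_boundary).sum()
    return int(n)
```

With pieces `["▁a", "b"]` and the sequence `[1, 0]` (decoded text `"b a"`, 3 bytes), the buggy LUT stores `[2, 1]` and the accumulator reports 4 bytes: the space in `"▁a"` is credited once in the LUT and once at eval time.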
Example from PR #1712, `train_gdn_7k.py:204-217`, where the LUT builder pre-credits the leading-space byte via `len(piece[1:].encode("utf-8")) + 1`. The eval accumulator (PR #1712, `train_gdn_7k.py:373-375`) then adds the `+1` a second time: for every `▁`-prefixed target following a non-boundary token, one byte is counted twice.

Canonical reference
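In the canonical scheme the LUT stores only the bytes of the piece body, so the leading-space byte is credited exactly once, at eval time. A hedged sketch (the function name is illustrative, not the helper's actual signature):

```python
import numpy as np

def build_base_bytes_fixed(pieces):
    """Per-piece byte LUT without pre-crediting the leading space."""
    base_bytes = np.zeros(len(pieces), dtype=np.int64)
    for i, piece in enumerate(pieces):
        if piece.startswith("▁"):
            # No +1 here: the eval accumulator alone adds the space byte.
            base_bytes[i] = len(piece[1:].encode("utf-8"))
        else:
            base_bytes[i] = len(piece.encode("utf-8"))
    return base_bytes
```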
Merged PR #1019 — `records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py:266-290`: the `+1` is applied exactly once, in the eval accumulator.

Quantified impact (sp8192 val stream, 40,540,160 tokens)
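The arithmetic behind the self-check below, using the figures quoted in this issue (≈3.73 bytes/token, ≈17% denominator inflation): since `val_bpb = val_loss / (bytes_per_token × ln 2)`, the ratio `val_loss / val_bpb` equals `bytes_per_token × ln 2` when the byte count is correct, and is inflated proportionally when it is not.

```python
import math

BYTES_PER_TOKEN = 3.73   # sp8192 val stream, as quoted in this issue
INFLATION = 1.17         # ≈17% extra bytes from the double count

healthy_ratio = BYTES_PER_TOKEN * math.log(2)  # expected val_loss / val_bpb
buggy_ratio = healthy_ratio * INFLATION        # ratio with inflated denominator

print(round(healthy_ratio, 2))  # 2.59
print(round(buggy_ratio, 2))    # 3.02
```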
Self-check: the ratio `val_loss / val_bpb` should be ≈2.58 for sp8192 (≈3.73 bytes/token × ln 2); a ratio ≈3.04 indicates the bug.

Affected submissions
All LUTs examined contain the byte-identical `len(piece[1:].encode("utf-8")) + 1` pattern. After correction, none of the listed submissions surpass the merged SOTA (PR #1493 at 1.0810).

Recommended action

Affected authors: replace `base_bytes[i] = len(piece[1:].encode("utf-8")) + 1` with `base_bytes[i] = len(piece[1:].encode("utf-8"))` and re-evaluate. The eval accumulator already adds the `+1` via `has_leading_space_lut & ~is_boundary_token_lut`.
Evaluation pipeline: consider a semantic check that rejects any submission where the byte LUT pre-credits the leading-space byte while the eval accumulator also adds it.
New submissions: source `build_sentencepiece_luts` from merged PR #1019 (Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)) rather than inheriting from family source files.
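The proposed pipeline-side semantic check could be as simple as the sketch below, which flags any LUT that stores body bytes plus one for a `▁`-prefixed piece. The function name is illustrative, not an existing pipeline API.

```python
def lut_precredits_leading_space(pieces, base_bytes):
    """True if any '▁'-prefixed piece's LUT entry pre-credits the
    leading-space byte (i.e. stores body bytes + 1)."""
    for i, piece in enumerate(pieces):
        if piece.startswith("▁"):
            body = len(piece[1:].encode("utf-8"))
            if base_bytes[i] == body + 1:
                return True
    return False

# A submission whose LUT stores [2, 1] for ["▁a", "b"] would be rejected;
# the corrected LUT [1, 1] passes.
```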