Byte-accounting bug in build_sentencepiece_luts affects GDN-family submissions #1719

@dexhunter

Description

Summary

Several open GDN/FLA-family submissions report val_bpb figures that are inconsistent with Section V of the README. The root cause is a shared build_sentencepiece_luts helper that double-credits the leading-space byte for ▁-prefixed SentencePiece tokens, inflating the byte denominator by ≈17% and proportionally deflating the reported val_bpb.

Defect

Example from PR #1712, train_gdn_7k.py:204-217:

if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1   # pre-credits +1

The eval accumulator (PR #1712, train_gdn_7k.py:373-375) then adds +1 a second time:

tb = base_bytes_lut[tgt].to(torch.float64)
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)   # +1 again

For every ▁-prefixed target following a non-boundary token, the leading-space byte is counted twice: once in the LUT and once in the accumulator.
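A minimal, self-contained sketch of the double count on a toy token (variable names here are illustrative, not the repo's):

```python
# Toy illustration of the double count for a single ▁-prefixed token.
SP_SPACE = "\u2581"  # SentencePiece meta-symbol for a leading space

piece = SP_SPACE + "cat"  # decodes to " cat" → 4 UTF-8 bytes

# Buggy LUT: pre-credits the space byte at build time.
buggy_base = len(piece[1:].encode("utf-8")) + 1      # 3 + 1 = 4

# Canonical LUT: stores only the stripped piece's bytes.
canonical_base = len(piece[1:].encode("utf-8"))      # 3

# Eval accumulator (identical in both variants): +1 when the previous
# token is not a boundary token, i.e. the space byte is actually emitted.
prev_is_boundary = False
accum_credit = 0 if prev_is_boundary else 1

print(buggy_base + accum_credit)      # 5 — one byte too many
print(canonical_base + accum_credit)  # 4 — correct (" cat" is 4 bytes)
```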

Canonical reference

Merged PR #1019, records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py:266-290:

if piece.startswith("\u2581"):
    has_leading_space_np[token_id] = True
    piece = piece[1:]                                    # strip ▁ first
base_bytes_np[token_id] = len(piece.encode("utf-8"))     # no +1 in LUT

The +1 is applied exactly once, in the eval accumulator.

Quantified impact (sp8192 val stream, 40,540,160 tokens)

| Scheme | Total val bytes | Bytes/token |
| --- | --- | --- |
| Buggy | 177,825,759 | 4.386 |
| Canonical | ~151–152M | ~3.73 |
| Inflation | +17.7% | |

Self-check: the ratio val_loss / val_bpb should be ≈2.58 for sp8192 (≈3.73 bytes/token × ln 2); a ratio ≈3.04 indicates the bug.
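The self-check follows directly from the figures above; a quick sketch (pure arithmetic, no repo code assumed):

```python
import math

# Numbers from the impact table above (sp8192 val stream).
tokens = 40_540_160
buggy_bytes = 177_825_759
canonical_bytes_per_token = 3.73  # ≈151–152M bytes / tokens

# val_bpb = total_loss_nats / (ln 2 * total_bytes), so
# val_loss / val_bpb = bytes_per_token * ln 2.
healthy_ratio = canonical_bytes_per_token * math.log(2)
buggy_ratio = (buggy_bytes / tokens) * math.log(2)

print(f"{healthy_ratio:.2f}")  # 2.59 — matches the README's ≈2.58
print(f"{buggy_ratio:.2f}")    # 3.04 — flags the double-counting bug
```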

Affected submissions

| PR | Reported val_bpb | Canonical estimate |
| --- | --- | --- |
| #1576 (closed) | 1.0167 | |
| #1632 (closed) | 1.028 | ~1.20 |
| #1698 | 1.00995 | ~1.177 |
| #1711 | 1.00980 | ~1.177 |
| #1712 | 1.01902 | ~1.200 |

All LUTs examined contain the byte-identical len(piece[1:].encode("utf-8")) + 1 pattern. After correction, none of the listed submissions surpass the merged SOTA (PR #1493 at 1.0810).

Recommended action

  1. Affected authors: replace base_bytes[i] = len(piece[1:].encode("utf-8")) + 1 with base_bytes[i] = len(piece[1:].encode("utf-8")) and re-evaluate. The eval accumulator already adds the +1 via has_leading_space_lut & ~is_boundary_token_lut.
  2. Evaluation pipeline: consider a semantic check that rejects any submission where the byte LUT pre-credits the leading-space byte while the eval accumulator also adds it.
  3. New submissions: derive build_sentencepiece_luts from merged PR #1019 (Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473, 3-seed mean) rather than inheriting from family source files.
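For reference, a self-contained sketch of the corrected LUT build following the canonical pattern from #1019 (the function signature and list-based return are illustrative, not the repo's exact API):

```python
SP_SPACE = "\u2581"  # SentencePiece leading-space meta-symbol

def build_sentencepiece_luts(pieces):
    """Build byte-count LUTs for a SentencePiece vocab.

    The leading-space byte is NOT pre-credited here; the eval
    accumulator adds it exactly once, conditioned on the previous
    token not being a boundary token.
    """
    base_bytes = []
    has_leading_space = []
    for piece in pieces:
        leading = piece.startswith(SP_SPACE)
        if leading:
            piece = piece[1:]  # strip ▁ first
        has_leading_space.append(leading)
        base_bytes.append(len(piece.encode("utf-8")))  # no +1 in LUT
    return base_bytes, has_leading_space

# Usage on a toy 3-piece vocab:
base, space = build_sentencepiece_luts(["\u2581cat", "s", "\u2581dog"])
print(base)   # [3, 1, 3]
print(space)  # [True, False, True]
```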
