Byte-accounting bug in build_sentencepiece_luts affects GDN-family submissions #1719

@dexhunter

Description

Summary

Several open GDN/FLA-family submissions report val_bpb figures that are inconsistent with Section V of the README. The root cause is a shared build_sentencepiece_luts helper that double-credits the leading-space byte for ▁-prefixed SentencePiece tokens, inflating the byte denominator by ≈17% and proportionally deflating the reported val_bpb.

Defect

Example from PR #1712, train_gdn_7k.py:204-217:

if piece.startswith("\u2581"):
    has_space[i] = True
    base_bytes[i] = len(piece[1:].encode("utf-8")) + 1   # pre-credits +1

The eval accumulator (PR #1712, train_gdn_7k.py:373-375) then adds +1 a second time:

tb = base_bytes_lut[tgt].to(torch.float64)
tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64)   # +1 again

For every ▁-prefixed target following a non-boundary token, the leading-space byte is counted twice: once in the LUT and once in the accumulator.
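A minimal, self-contained sketch of the double count on a toy token (variable names here are illustrative, not the repo's):

```python
# Toy illustration of the double count for a single ▁-prefixed token.
SP_SPACE = "\u2581"  # SentencePiece meta-symbol for a leading space

piece = SP_SPACE + "cat"  # decodes to " cat" → 4 UTF-8 bytes

# Buggy LUT: pre-credits the space byte at build time.
buggy_base = len(piece[1:].encode("utf-8")) + 1      # 3 + 1 = 4

# Canonical LUT: stores only the stripped piece's bytes.
canonical_base = len(piece[1:].encode("utf-8"))      # 3

# Eval accumulator (identical in both variants): +1 when the previous
# token is not a boundary token, i.e. the space byte is actually emitted.
prev_is_boundary = False
accum_credit = 0 if prev_is_boundary else 1

print(buggy_base + accum_credit)      # 5 — one byte too many
print(canonical_base + accum_credit)  # 4 — correct (" cat" is 4 bytes)
```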

Canonical reference

Merged PR #1019, records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py:266-290:

if piece.startswith("\u2581"):
    has_leading_space_np[token_id] = True
    piece = piece[1:]                                    # strip ▁ first
base_bytes_np[token_id] = len(piece.encode("utf-8"))     # no +1 in LUT

The +1 is applied exactly once, in the eval accumulator.

Quantified impact (sp8192 val stream, 40,540,160 tokens)

| Scheme | Total val bytes | Bytes/token |
| --- | --- | --- |
| Buggy | 177,825,759 | 4.386 |
| Canonical | ~151–152M | ~3.73 |
| Inflation | +17.7% | |

Self-check: the ratio val_loss / val_bpb should be ≈2.58 for sp8192 (≈3.73 bytes/token × ln 2); a ratio ≈3.04 indicates the bug.
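The self-check follows directly from the figures above; a quick sketch (pure arithmetic, no repo code assumed):

```python
import math

# Numbers from the impact table above (sp8192 val stream).
tokens = 40_540_160
buggy_bytes = 177_825_759
canonical_bytes_per_token = 3.73  # ≈151–152M bytes / tokens

# val_bpb = total_loss_nats / (ln 2 * total_bytes), so
# val_loss / val_bpb = bytes_per_token * ln 2.
healthy_ratio = canonical_bytes_per_token * math.log(2)
buggy_ratio = (buggy_bytes / tokens) * math.log(2)

print(f"{healthy_ratio:.2f}")  # 2.59 — matches the README's ≈2.58
print(f"{buggy_ratio:.2f}")    # 3.04 — flags the double-counting bug
```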

Affected submissions

| PR | Reported val_bpb | Canonical estimate |
| --- | --- | --- |
| #1576 (closed) | 1.0167 | |
| #1632 (closed) | 1.028 | ~1.20 |
| #1698 | 1.00995 | ~1.177 |
| #1711 | 1.00980 | ~1.177 |
| #1712 | 1.01902 | ~1.200 |

All LUTs examined contain the byte-identical len(piece[1:].encode("utf-8")) + 1 pattern. After correction, none of the listed submissions surpass the merged SOTA (PR #1493 at 1.0810).

Recommended action

  1. Affected authors: replace base_bytes[i] = len(piece[1:].encode("utf-8")) + 1 with base_bytes[i] = len(piece[1:].encode("utf-8")) and re-evaluate. The eval accumulator already adds the +1 via has_leading_space_lut & ~is_boundary_token_lut.
  2. Evaluation pipeline: consider a semantic check that rejects any submission where the byte LUT pre-credits the leading-space byte while the eval accumulator also adds it.
  3. New submissions: derive build_sentencepiece_luts from merged PR #1019 (Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473, 3-seed mean) rather than inheriting from family source files.
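For reference, a self-contained sketch of the corrected LUT build following the canonical pattern from #1019 (the function signature and list-based return are illustrative, not the repo's exact API):

```python
SP_SPACE = "\u2581"  # SentencePiece leading-space meta-symbol

def build_sentencepiece_luts(pieces):
    """Build byte-count LUTs for a SentencePiece vocab.

    The leading-space byte is NOT pre-credited here; the eval
    accumulator adds it exactly once, conditioned on the previous
    token not being a boundary token.
    """
    base_bytes = []
    has_leading_space = []
    for piece in pieces:
        leading = piece.startswith(SP_SPACE)
        if leading:
            piece = piece[1:]  # strip ▁ first
        has_leading_space.append(leading)
        base_bytes.append(len(piece.encode("utf-8")))  # no +1 in LUT
    return base_bytes, has_leading_space

# Usage on a toy 3-piece vocab:
base, space = build_sentencepiece_luts(["\u2581cat", "s", "\u2581dog"])
print(base)   # [3, 1, 3]
print(space)  # [True, False, True]
```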
