
Update README leaderboard with recent record submissions#1806

Merged
cocohearts merged 4 commits into main from codex/update-readme-format-blocked-leaders
Apr 26, 2026

Conversation

@cocohearts
Collaborator

@cocohearts cocohearts commented Apr 24, 2026

@cocohearts cocohearts force-pushed the codex/update-readme-format-blocked-leaders branch from a53635a to c810781 on April 24, 2026 at 15:45
@cocohearts cocohearts changed the title from "Update README leaderboard with blocked record candidates" to "Update README leaderboard with recent record submissions" on Apr 24, 2026
@cocohearts cocohearts force-pushed the codex/update-readme-format-blocked-leaders branch from c810781 to 5f87b21 on April 24, 2026 at 15:52
@cocohearts cocohearts force-pushed the codex/update-readme-format-blocked-leaders branch from 5f87b21 to f47b26c on April 24, 2026 at 15:55
@romeerp
Contributor

romeerp commented Apr 24, 2026

Hi @cocohearts - wanted to flag the two Scylla tokenizer submissions. In both PR conversations the authors acknowledge byte-accounting errors that invalidate the reported BPB.

@codemath3000
Contributor

codemath3000 commented Apr 24, 2026

@cocohearts, I second @romeerp's concern. Thank you for looking into this!

@dexhunter
Contributor

dexhunter commented Apr 25, 2026

Thanks @cocohearts for getting these entries onto the leaderboard — useful for newcomers tracing lineage. Two of the six new rows have known byte-accounting issues that meaningfully change their ranking. Flagging here so the table reflects the actual canonical numbers.

TL;DR — affected entries

| PR | Claimed val_bpb | Actual canonical BPB | Status |
| --- | --- | --- | --- |
| #1184 Scylla + FullGPTQ + XSA + FA3 (icryo) | 0.9485 | ≈ 1.13 | Verified by independent rerun in PR #1271 (andrewbaggio1) using the exact same `train_gpt.py` with corrected `candidate.meta.npz`: 1.1289 BPB. Author @icryo acknowledged the issue and committed to re-run before resubmitting. |
| #1143 Scylla + TTT (simon-marcus) | 1.0806 | ≈ 1.28 (est.) | Same Scylla codebase, same `candidate.meta.npz` bug. Author @simon-marcus explicitly withdrew this PR; superseded by PR #1314 / PR #1796. |
| #1060, #1148, #1031, #1120 | as listed | as listed | Clean — canonical PR #1019-style byte LUT |

What the issue is

A "record" entry's val_bpb is only meaningful if the byte-count denominator matches what every other submission uses on the same val stream. Per Issue #1017 §V (byte-level BPB via the canonical sentencepiece piece table; no hardcoded bytes/token; full val shards) and Issue #677 (organizer enforcement of the canonical eval), the leaderboard implicitly assumes uniform byte accounting.

Two known mismatch classes inflate the byte denominator and thus deflate the reported BPB:

  1. Scylla / TokenMonster meta bug — applies to #1184 (Scylla + Full GPTQ + XSA-all + FA3, val_bpb 0.9485, 3-seed mean) and #1143 (Scylla novel tokenizer + Legal Score-First TTT, val_bpb 1.08056553). The shipped `candidate.meta.npz` has 27 byte-fallback tokens (IDs 75-101) with `base_bytes=3` instead of 1, plus 38 capcode space-stripping modifier tokens not flagged as `is_boundary_token=True`. Per @andrewbaggio1's verified rerun in PR #1271, correcting only the byte-fallback meta and reconciling val-token boundary differences gives +0.180 BPB vs. the claim, broken down as +0.133 byte accounting, +0.037 val text boundary, and +0.010 model quality. So #1184's 0.9485 is really ~1.129 on canonical accounting; the Scylla tokenizer provides no meaningful advantage over SP1024 (~1.11-1.12).

  2. GDN/FLA -prefix double-count (Issue #1719) — different mechanism but coincidentally similar magnitude (~17.7% inflation). Not in this PR's added entries; noting for future curators.
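The inflation mechanism in item 1 can be made concrete with a short sketch. This is illustrative only: `val_loss` and the inflated denominator are made-up numbers; 2.93 is the canonical Scylla bytes/token figure cited from PR #1271.

```python
import math

def bpb(val_loss: float, bytes_per_token: float) -> float:
    """Bits-per-byte: convert per-token nats to bits, divide by bytes/token."""
    return val_loss / (bytes_per_token * math.log(2))

# Illustrative per-token val loss in nats; only the byte accounting differs below.
val_loss = 2.30

# Counting 27 byte-fallback tokens as 3 bytes each instead of 1 inflates the
# mean bytes/token denominator, which deflates the reported BPB.
true_bytes_per_token = 2.93      # canonical accounting (figure from PR #1271)
inflated_bytes_per_token = 3.49  # hypothetical buggy denominator

print(bpb(val_loss, true_bytes_per_token))      # ~1.13 (correct)
print(bpb(val_loss, inflated_bytes_per_token))  # ~0.95 (spuriously good)
```

The reported number moves by the ratio of the two denominators, so no amount of re-running the model fixes it; only correcting the meta file does.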

A 1-line diagnostic anyone can run (no rerun needed)

For canonical sp8192:

```
bytes_per_token = val_loss / (val_bpb × ln(2))    # canonical sp8192 ≈ 3.7266
```
| Observed ratio | Interpretation | Approx. corrected BPB |
| --- | --- | --- |
| ≈ 3.73 | canonical sp8192 | as reported |
| ≈ 4.39 | double-count (#1719 family) | reported × 1.177 |
| ≈ 2.93 (Scylla family, different tokenizer) | meta byte-accounting bug per PR #1271 | reported × ~1.19; verify via direct rerun |

The Scylla case is special because its tokenizer isn't sp8192 — the diagnostic ratio test alone won't catch it; the actual reproducer is in PR #1271's "Reproducing" section.
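The ratio test in the table is easy to script. A sketch, assuming each entry's reported `val_loss` (nats/token) and `val_bpb` are at hand; the `classify` helper and the 5% tolerance are my own choices, not anything canonical:

```python
import math

CANONICAL_SP8192 = 3.7266  # canonical sp8192 bytes/token (from this thread)

def implied_bytes_per_token(val_loss: float, val_bpb: float) -> float:
    # Invert bpb = val_loss / (bytes_per_token * ln 2).
    return val_loss / (val_bpb * math.log(2))

def classify(val_loss: float, val_bpb: float, tol: float = 0.05) -> str:
    r = implied_bytes_per_token(val_loss, val_bpb)
    if abs(r - CANONICAL_SP8192) / CANONICAL_SP8192 < tol:
        return "canonical sp8192"
    double = CANONICAL_SP8192 * 1.177
    if abs(r - double) / double < tol:
        return "double-count (#1719 family); corrected BPB ~ reported * 1.177"
    return "non-sp8192 tokenizer or unknown accounting; verify via direct rerun"

# A ratio near 2.93 (the Scylla family) lands in the last bucket, which is the
# point: the ratio test alone can only flag such entries, not certify them.
```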

Happy to help cross-check any specific submission.

@cocohearts
Collaborator Author

hi sorry thanks for correction lemme triple check

@codemath3000
Contributor

> hi sorry thanks for correction lemme triple check

Thank you so much, @cocohearts !

@cocohearts cocohearts marked this pull request as ready for review April 26, 2026 04:49
@cocohearts cocohearts merged commit 7427de2 into main Apr 26, 2026
@andrewbaggio1

@cocohearts thank you!! could we please also get a ruling on the legality of caseops tokenizers? it's being debated in issue #1604 and it's currently splitting the leaderboard of unmerged record submissions

hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… merges (openai#1806)

* Update leaderboard with recent record submissions

* Keep only valid recent leaderboard rows

* Remove invalid Scylla record

* Remove non-record Muon TTT submission
