
Update README leaderboard with recent record submissions#1806

Merged
cocohearts merged 4 commits into main from codex/update-readme-format-blocked-leaders
Apr 26, 2026

Conversation

@cocohearts
Collaborator

@cocohearts cocohearts commented Apr 24, 2026

@cocohearts cocohearts force-pushed the codex/update-readme-format-blocked-leaders branch from a53635a to c810781 on April 24, 2026 at 15:45
@cocohearts cocohearts changed the title from "Update README leaderboard with blocked record candidates" to "Update README leaderboard with recent record submissions" on Apr 24, 2026
@cocohearts cocohearts force-pushed the codex/update-readme-format-blocked-leaders branch from c810781 to 5f87b21 on April 24, 2026 at 15:52
@cocohearts cocohearts force-pushed the codex/update-readme-format-blocked-leaders branch from 5f87b21 to f47b26c on April 24, 2026 at 15:55
@romeerp
Contributor

romeerp commented Apr 24, 2026

Hi @cocohearts - wanted to flag the two Scylla tokenizer submissions. In both PR conversations the authors acknowledge byte-accounting errors that invalidate the reported BPB.

@codemath3000
Contributor

codemath3000 commented Apr 24, 2026

@cocohearts, I second @romeerp's concern. Thank you for looking into this!

@dexhunter
Contributor

dexhunter commented Apr 25, 2026

Thanks @cocohearts for getting these entries onto the leaderboard — useful for newcomers tracing lineage. Two of the six new rows have known byte-accounting issues that meaningfully change their ranking. Flagging here so the table reflects the actual canonical numbers.

TL;DR — affected entries

| PR | Claimed val_bpb | Actual canonical BPB | Status |
| --- | --- | --- | --- |
| #1184 Scylla + FullGPTQ + XSA + FA3 (icryo) | 0.9485 | ≈ 1.13 | Verified by independent rerun in PR #1271 (andrewbaggio1) using the exact same `train_gpt.py` with corrected `candidate.meta.npz`: 1.1289 BPB. Author @icryo acknowledged the issue and committed to re-run before resubmitting. |
| #1143 Scylla + TTT (simon-marcus) | 1.0806 | ≈ 1.28 (est.) | Same Scylla codebase, same `candidate.meta.npz` bug. Author @simon-marcus explicitly withdrew this PR; superseded by PR #1314 / PR #1796. |
| #1060, #1148, #1031, #1120 | as listed | as listed | Clean — canonical PR #1019-style byte LUT |

What the issue is

A "record" entry's val_bpb is only meaningful if the byte-count denominator matches what every other submission uses on the same val stream. Per Issue #1017 §V (byte-level BPB via the canonical sentencepiece piece table; no hardcoded bytes/token; full val shards) and Issue #677 (organizer enforcement of the canonical eval), the leaderboard implicitly assumes uniform byte accounting.

Two known mismatch classes inflate the byte denominator and thus deflate the reported BPB:

  1. Scylla / TokenMonster meta bug — applies to #1184 (Scylla + Full GPTQ + XSA-all + FA3, val_bpb 0.9485, 3-seed mean) and #1143 (Scylla novel tokenizer + Legal Score-First TTT, val_bpb 1.08056553). The shipped `candidate.meta.npz` has 27 byte-fallback tokens (IDs 75-101) with `base_bytes=3` instead of 1, plus 38 capcode space-stripping modifier tokens not flagged as `is_boundary_token=True`. Per @andrewbaggio1's verified rerun in PR #1271, correcting only the byte-fallback meta and reconciling val-token boundary differences gives +0.180 BPB vs. the claim, broken down as +0.133 byte accounting, +0.037 val text boundary, and +0.010 model quality. So #1184's 0.9485 is really ~1.129 on canonical accounting; the Scylla tokenizer provides no meaningful advantage over SP1024 (~1.11-1.12).

  2. GDN/FLA -prefix double-count (Issue #1719) — different mechanism but coincidentally similar magnitude (~17.7% inflation). Not in this PR's added entries; noting for future curators.
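The inflation mechanism in item 1 can be made concrete with a short sketch. This is illustrative only: `val_loss` and the inflated denominator are made-up numbers; 2.93 is the canonical Scylla bytes/token figure cited from PR #1271.

```python
import math

def bpb(val_loss: float, bytes_per_token: float) -> float:
    """Bits-per-byte: convert per-token nats to bits, divide by bytes/token."""
    return val_loss / (bytes_per_token * math.log(2))

# Illustrative per-token val loss in nats; only the byte accounting differs below.
val_loss = 2.30

# Counting 27 byte-fallback tokens as 3 bytes each instead of 1 inflates the
# mean bytes/token denominator, which deflates the reported BPB.
true_bytes_per_token = 2.93      # canonical accounting (figure from PR #1271)
inflated_bytes_per_token = 3.49  # hypothetical buggy denominator

print(bpb(val_loss, true_bytes_per_token))      # ~1.13 (correct)
print(bpb(val_loss, inflated_bytes_per_token))  # ~0.95 (spuriously good)
```

The reported number moves by the ratio of the two denominators, so no amount of re-running the model fixes it; only correcting the meta file does.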

A 1-line diagnostic anyone can run (no rerun needed)

For canonical sp8192:

```
bytes_per_token = val_loss / (val_bpb × ln(2))    # canonical sp8192 ≈ 3.7266
```
| Observed ratio | Interpretation | Approx. corrected BPB |
| --- | --- | --- |
| ≈ 3.73 | canonical sp8192 | as reported |
| ≈ 4.39 | double-count (#1719 family) | reported × 1.177 |
| ≈ 2.93 (Scylla family, different tokenizer) | meta byte-accounting bug per PR #1271 | reported × ~1.19; verify via direct rerun |

The Scylla case is special because its tokenizer isn't sp8192 — the diagnostic ratio test alone won't catch it; the actual reproducer is in PR #1271's "Reproducing" section.
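The ratio test in the table is easy to script. A sketch, assuming each entry's reported `val_loss` (nats/token) and `val_bpb` are at hand; the `classify` helper and the 5% tolerance are my own choices, not anything canonical:

```python
import math

CANONICAL_SP8192 = 3.7266  # canonical sp8192 bytes/token (from this thread)

def implied_bytes_per_token(val_loss: float, val_bpb: float) -> float:
    # Invert bpb = val_loss / (bytes_per_token * ln 2).
    return val_loss / (val_bpb * math.log(2))

def classify(val_loss: float, val_bpb: float, tol: float = 0.05) -> str:
    r = implied_bytes_per_token(val_loss, val_bpb)
    if abs(r - CANONICAL_SP8192) / CANONICAL_SP8192 < tol:
        return "canonical sp8192"
    double = CANONICAL_SP8192 * 1.177
    if abs(r - double) / double < tol:
        return "double-count (#1719 family); corrected BPB ~ reported * 1.177"
    return "non-sp8192 tokenizer or unknown accounting; verify via direct rerun"

# A ratio near 2.93 (the Scylla family) lands in the last bucket, which is the
# point: the ratio test alone can only flag such entries, not certify them.
```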

Happy to help cross-check any specific submission.

@cocohearts
Collaborator Author

hi sorry thanks for correction lemme triple check

@codemath3000
Contributor

> hi sorry thanks for correction lemme triple check

Thank you so much, @cocohearts !

@cocohearts cocohearts marked this pull request as ready for review April 26, 2026 04:49
@cocohearts cocohearts merged commit 7427de2 into main Apr 26, 2026
@andrewbaggio1

@cocohearts thank you!! could we please also get a ruling on the legality of caseops tokenizers? it's being debated in issue #1604 and it's currently splitting the leaderboard of unmerged record submissions

hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… merges (openai#1806)

* Update leaderboard with recent record submissions

* Keep only valid recent leaderboard rows

* Remove invalid Scylla record

* Remove non-record Muon TTT submission
