Update Parameter Golf leaderboard #1900

Open
cocohearts wants to merge 1 commit into main from codex/update-parameter-golf-leaderboard-p025
Co-authored-by: Codex <noreply@openai.com>
@codemath3000

Thanks for the leaderboard update, @cocohearts. A couple of FYIs for context:

#1797 / #1855. Just a heads-up: while the base PR #1797 has the validity concern you raised in your audit comment, the downstream PR #1855, which is built on the #1797 stack, is itself valid because that specific concern is fixed there. #1855 applies the not_bos = (input_ids[:, 1:] != BOS_ID) mask in both _forward_hidden and forward_ttt, exactly as your audit recommended. 3-seed mean: 1.06108 BPB (std 0.00090), independently reproduced by @okezue.
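
For readers unfamiliar with the mask, here is a minimal pure-Python sketch of what the quoted expression evaluates to. The value of `BOS_ID`, the toy batch, and the helper name `not_bos_mask` are illustrative assumptions, not code from #1855, and how the PR consumes the mask inside `_forward_hidden` and `forward_ttt` is not shown here.

```python
# Illustrative only: BOS_ID and the toy batch are assumed values, not taken from #1855.
BOS_ID = 0


def not_bos_mask(input_ids):
    """List-of-lists analogue of the tensor expression `input_ids[:, 1:] != BOS_ID`.

    For each row, entry t is True when the token at position t + 1 is not BOS.
    """
    return [[tok != BOS_ID for tok in row[1:]] for row in input_ids]


# Two packed rows; a BOS token appears mid-row wherever a new document starts.
batch = [
    [0, 5, 7, 0, 9],
    [0, 3, 0, 4, 4],
]
mask = not_bos_mask(batch)
print(mask)
```

The mask has one fewer column than the input because it is aligned with positions 1 onward, so each entry marks whether the token at that shifted position starts a new document.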

#1530. For reference, this PR has its own structural concern open in its thread: the TTT compile warmup runs backward() / step() on actual validation tokens before the main eval loop, which @dexhunter and @msisovic flagged as structurally matching the pattern called out in #677. @samacqua confirmed the gap is within run-to-run variance and offered a synthetic-token warmup as the fix, but the merged head still appears to use val tokens for the compile warmup. Whether this rises to the same kind of validity blocker that was applied to #1797 is the maintainers' call, but I'm flagging it explicitly since the structural pattern (adapt on validation before the reported eval pass, per #677) looks similar.
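
To make the proposed fix concrete, the following is a rough sketch of a synthetic-token warmup in plain Python, under stated assumptions. The function names, batch shape, and vocabulary size are hypothetical, and `fake_ttt_step` only stands in for the compiled forward / backward() / step() call; none of this is taken from #1530.

```python
import random


def synthetic_warmup_batches(num_batches, batch_size, seq_len, vocab_size, seed=0):
    """Yield batches of random token ids shaped like an eval batch.

    No validation data is read, so the warmup cannot adapt on val tokens.
    """
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield [
            [rng.randrange(vocab_size) for _ in range(seq_len)]
            for _ in range(batch_size)
        ]


def warmup(step_fn, batches):
    """Call step_fn once per batch to trigger compilation; return the step count."""
    count = 0
    for batch in batches:
        step_fn(batch)
        count += 1
    return count


# Hypothetical stand-in for the compiled TTT step; it only records batch shapes.
seen_shapes = []


def fake_ttt_step(batch):
    seen_shapes.append((len(batch), len(batch[0])))


steps = warmup(fake_ttt_step, synthetic_warmup_batches(2, 4, 8, vocab_size=1024))
```

Because a real warmup would still call step(), one would presumably also snapshot and restore the adapted parameters afterwards; whether #1530 does so is not stated in this thread.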

@msisovic
Contributor

msisovic commented Apr 28, 2026

@cocohearts Hi, I'd like to note that #1518 changed its score after it was opened, which muddled the timeline a bit. At the time it was opened, my PR #1529 was better than PR #1518, which can be checked in the commit history of PR #1518, so I think it should be included in the leaderboard as well. It did have some legality tweaks after opening, but no structural changes that improved the model since.

Additionally, #1530, which you mention here for inclusion, cites my PR as the SOTA at the time, as additional proof.

@codemath3000

codemath3000 commented Apr 28, 2026

> @cocohearts Hi, I'd like to note that #1518 changed its score after it was opened, which muddled the timeline a bit. At the time it was opened, my PR #1529 was better than PR #1518, which can be checked in the commit history of PR #1518, so I think it should be included in the leaderboard as well. It did have some legality tweaks, but no structural changes that improved the model since.

@cocohearts, the same is true for my subsequent PR #1584: when #1584 was first opened, its score was ahead of #1518's score at #1518's opening, and that ordering hasn't been disturbed by structural improvements since. By score-at-opening, #1584 also came in ahead of #1518.

It's also worth noting that #1584 is valid irrespective of statistical significance, per the official README rule:

"For submissions that improve speed through systems optimization without changing the ML, [the statistical significance] requirement is waived."

#1584 is a systems-optimization submission (no ML changes), so the statistical-significance bar doesn't apply to its inclusion.

@codemath3000

codemath3000 commented Apr 28, 2026

@cocohearts Thank you so much for taking a look; I know how busy you likely are and really appreciate you taking the time to review these PRs.
