
Update Parameter Golf leaderboard #1899

Open
cocohearts wants to merge 1 commit into main from codex/update-parameter-golf-leaderboard-p025-worktree


@cocohearts
Collaborator

README-only leaderboard update adding the p<0.25 accepted chain #1518/#1530/#1610/#1626/#1667/#1784; #1787/#1797/#1801 are intentionally excluded due to validity/provenance blockers.

Co-authored-by: Codex <noreply@openai.com>
@codemath3000

Thanks for the leaderboard update, @cocohearts. A couple of FYIs for context:

#1797 / #1855. Just a heads-up: the base PR #1797 has the validity concern you raised in your audit comment, but the downstream PR #1855, which is built on the #1797 stack, is itself valid because that specific concern is fixed there. #1855 applies the not_bos = (input_ids[:, 1:] != BOS_ID) mask in both _forward_hidden and forward_ttt, exactly as your audit recommended. 3-seed mean: 1.06108 BPB (std 0.00090), independently reproduced by @okezue.
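
For anyone who has not read the audit, here is a minimal NumPy sketch of what that mask expression computes. BOS_ID = 0, the toy input_ids, and the gate_prev_token_mix helper are hypothetical and exist only for illustration. The PR itself presumably works on PyTorch tensors, and how #1855 actually consumes the mask inside _forward_hidden and forward_ttt is not reproduced here.

```python
import numpy as np

BOS_ID = 0  # hypothetical BOS token id; the real value comes from the PR's code


def not_bos_mask(input_ids: np.ndarray) -> np.ndarray:
    """Boolean mask over positions 1..T-1, True where the token is not BOS.

    input_ids has shape (batch, T); the result has shape (batch, T - 1).
    """
    return input_ids[:, 1:] != BOS_ID


def gate_prev_token_mix(x: np.ndarray, input_ids: np.ndarray) -> np.ndarray:
    """Hypothetical use of the mask (an assumption, not the PR's code).

    Adds the previous position's features to each position, except where
    the current token is BOS, so nothing is carried into a new document.
    x has shape (batch, T, dim).
    """
    not_bos = not_bos_mask(input_ids)
    out = x.copy()
    # Broadcast the (batch, T-1) mask over the feature dimension.
    out[:, 1:] += x[:, :-1] * not_bos[..., None]
    return out


if __name__ == "__main__":
    ids = np.array([[BOS_ID, 5, 6, BOS_ID, 7]])  # two toy documents
    print(not_bos_mask(ids))
    print(gate_prev_token_mix(np.ones((1, 5, 1)), ids)[0, :, 0])
```

The point of the sketch is only the alignment: the mask is one element shorter than the sequence and lines up with positions 1 onward, so a BOS token at position t switches off whatever is gated at t.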

#1530. For reference, this PR has its own structural concern open in its thread: the TTT compile warmup runs backward() / step() on actual validation tokens before the main eval loop, which @dexhunter and @msisovic flagged as structurally matching the pattern called out in #677. @samacqua confirmed the gap is within run-to-run variance and offered a synthetic-token warmup as the fix, but the merged head still appears to use val tokens for the compile warmup. Whether this rises to the same kind of validity blocker that was applied to #1797 is the maintainers' call, but I'm flagging it explicitly since the structural pattern (adapting on validation tokens before the reported eval pass, per #677) looks similar.
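
To make the suggested fix concrete, here is a rough sketch of a synthetic-token warmup, under the assumption that the warmup only needs batches of the right shape and dtype to trigger compilation. The names make_synthetic_batch, compile_warmup, and train_step are hypothetical and are not taken from #1530; the real warmup involves the compiled model and its backward() / step() calls, which are stood in for here by an arbitrary callable.

```python
import numpy as np


def make_synthetic_batch(vocab_size: int, batch_size: int, seq_len: int,
                         seed: int = 0) -> np.ndarray:
    """Random token ids in [0, vocab_size); no validation data is touched."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, vocab_size, size=(batch_size, seq_len),
                        dtype=np.int64)


def compile_warmup(train_step, vocab_size: int, batch_size: int,
                   seq_len: int, n_steps: int = 2) -> None:
    """Run the (hypothetical) train_step callable on synthetic batches only.

    train_step stands in for whatever forward/backward/step function the
    real warmup exercises to trigger compilation.
    """
    for i in range(n_steps):
        batch = make_synthetic_batch(vocab_size, batch_size, seq_len, seed=i)
        train_step(batch)


if __name__ == "__main__":
    seen = []
    compile_warmup(seen.append, vocab_size=50, batch_size=2, seq_len=8)
    print([b.shape for b in seen])
```

This sketch does not address whether model or optimizer state should be reset after warmup; it only shows that the warmup batches can be generated without reading validation tokens.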

