Update Parameter Golf leaderboard with BOS fix #1902

Open
cocohearts wants to merge 5 commits into main from codex/update-parameter-golf-leaderboard-with-bosfix

Conversation

@cocohearts
Collaborator

@cocohearts cocohearts commented Apr 28, 2026

README-only p-value progression leaderboard update. Applies the p<0.25 chronological progression cutoff after scanning PRs #1494-#1908 and addressing follow-up review comments. Adds #1855 as the new top row using 6-sample evidence plus independent reproduction (p≈0.127 vs #1851/#1868); adds #1529 and #1584 under score-at-opening chronology, with #1584 treated as systems-only so significance is waived. Retains #1514/#1518/#1530/#1610/#1626/#1667/#1729/#1736/#1769/#1787/#1851/#1868; excludes valid-but-non-progression rows plus invalid/conditional rows (PPM-D/byte-mixture C2, pre-quant/future-validation leakage, over-cap artifacts, duplicates, missing-evidence submissions, p-fail rows, and single-seed tiny-margin rows).

cocohearts and others added 3 commits April 28, 2026 11:57
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Codex <noreply@openai.com>
@codemath3000

codemath3000 commented Apr 28, 2026

@cocohearts Thank you so much for taking a look! I was looking over the results, and, including the independent reproduction done on #1855 (comment) as additional samples of #1855's distribution, the 6-sample picture is:

| seed | source | val_bpb |
|------|--------|---------|
| 42 | #1855 submitted runs | 1.05989 |
| 0 | #1855 submitted runs | 1.06125 |
| 1234 | #1855 submitted runs | 1.06209 |
| 42 | @okezue reproduction | 1.05965 |
| 314 | @okezue reproduction | 1.06041 |
| 999 | @okezue reproduction | 1.06124 |

6-sample mean 1.060755 BPB, sample std (n−1) 0.000933.

Welch's two-sample t-test vs #1851/#1868 (n=3, mean 1.06145, std 0.00068):

  • Mean delta: 0.000695 BPB (~0.00152 nats)
  • SE ≈ 0.000547, t ≈ 1.27, df ≈ 5.6
  • One-sided p ≈ 0.127

That's under the 0.25 threshold compared to #1851/#1868. Therefore, #1855 does appear to be valid per my understanding.
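
The figures above can be rechecked with a short standard-library script. This is a minimal sketch, not the script used in this thread: the helper names `welch` and `t_sf` are illustrative, the baseline enters only through the quoted summary statistics (its individual runs are not listed here), and the tail probability is obtained by numerically integrating the Student's t density rather than by calling a statistics library.

```python
import math
from statistics import mean, stdev

# val_bpb samples for #1855 from the table above:
# three submitted runs followed by three reproduction runs.
pr1855 = [1.05989, 1.06125, 1.06209, 1.05965, 1.06041, 1.06124]

# Baseline (#1851/#1868) summary statistics as quoted in the comment.
base_mean, base_std, base_n = 1.06145, 0.00068, 3


def welch(m1, s1, n1, m2, s2, n2):
    """Return (t, df) for Welch's unequal-variance t-test from summary stats.

    t is positive when group 1 has the lower (better) mean.
    """
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m2 - m1) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def t_sf(t, df, steps=20000):
    """Upper-tail probability P(T > t) for t >= 0, via Simpson's rule on [0, t]."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


m, s, n = mean(pr1855), stdev(pr1855), len(pr1855)
t, df = welch(m, s, n, base_mean, base_std, base_n)
p = t_sf(t, df)
print(f"mean={m:.6f} std={s:.6f} t={t:.2f} df={df:.1f} one-sided p={p:.3f}")
```

Because the baseline standard deviation is only quoted to two significant figures, digits of p beyond the third decimal place should not be read as meaningful.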

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855
only on significance grounds (p=0.325). Our prior 050 line built on openai#1797
which is under validity-cloud per cocohearts. Re-anchor research baseline
on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from
codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack
@ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/.

Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time
levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.)
on this baseline.
@codemath3000

codemath3000 commented Apr 28, 2026

@cocohearts Separate issue from the #1855 chain-inclusion question above — flagging a concern about #1518 that's independent of anything on #1855. It came up on #1900 (here), but it bears on the chain decisions being made in this leaderboard update, so consolidating here as well.

Timeline / scored value at opening

Per @msisovic's note on #1900: when #1518 was first opened, its score was worse than #1529's score at #1529's opening, and worse than #1584's score at #1584's opening as well. By score-at-opening, both #1529 and #1584 came in ahead of #1518, and we'd appreciate them being included in the chain on that basis.

It's also worth noting that #1584 is valid irrespective of statistical significance, per the official README rule:

"For submissions that improve speed through systems optimization without changing the ML, [the statistical significance] requirement is waived."

#1584 is a systems-optimization submission (no ML changes), so the statistical-significance bar doesn't apply to its inclusion.

This is consequential because chain inclusion of #1518 currently displaces #1529 and #1584 from the SOTA timeline. Sharing for the maintainers' chain-inclusion call — and very much appreciate the careful work going into reconstructing the chain.

Co-authored-by: Codex <noreply@openai.com>
@msisovic
Contributor

Cross posting my comment from the other PR:

#1900 (comment)

Co-authored-by: Codex <noreply@openai.com>
@cocohearts
Collaborator Author

Addressed both follow-up comments in the README table. I added #1855 as the new top row using the combined 6-sample evidence from the submission plus independent reproduction (one-sided Welch p≈0.127 vs #1851/#1868). I also added #1529 and #1584 to reflect score-at-opening chronology before #1518's later score update; #1584 is marked as a systems-only progression row, so the statistical-significance requirement is waived under the README rule. Direct #1784/#1797 remain excluded under the p<0.25 progression cutoff, with #1797 credited through downstream BOS-fixed rows.
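
One way to read the inclusion rules described in this comment is as a single chronological pass over submissions. The sketch below is an illustration under stated assumptions, not the maintainers' procedure: the `Submission` fields, the `build_chain` helper, and every entry in the example list are hypothetical, and the p-value against the current best row is assumed to be supplied rather than computed.

```python
from dataclasses import dataclass
from typing import List, Optional

P_CUTOFF = 0.25  # progression threshold quoted in this thread


@dataclass
class Submission:
    name: str                    # hypothetical label, not a real PR number
    opened_order: int            # chronological position by opening date
    bpb_at_opening: float        # score when first opened (lower is better)
    p_vs_best: Optional[float]   # one-sided p vs. the current best row, if measured
    systems_only: bool = False   # speed-only change with no ML changes


def build_chain(subs: List[Submission]) -> List[Submission]:
    """Keep rows that beat the current best at opening and clear the cutoff.

    Assumption: systems-only rows skip the significance check but must
    still improve on the current best score.
    """
    chain: List[Submission] = []
    best: Optional[Submission] = None
    for sub in sorted(subs, key=lambda x: x.opened_order):
        if best is not None:
            if sub.bpb_at_opening >= best.bpb_at_opening:
                continue  # not an improvement at opening time
            significant = sub.p_vs_best is not None and sub.p_vs_best < P_CUTOFF
            if not (sub.systems_only or significant):
                continue  # improvement is not significant enough
        chain.append(sub)
        best = sub
    return chain


# Entirely made-up entries, for illustration only.
example = [
    Submission("A", 1, 1.100, None),
    Submission("B", 2, 1.090, 0.10),
    Submission("C", 3, 1.089, 0.40),                      # fails the p cutoff
    Submission("D", 4, 1.088, None, systems_only=True),   # significance waived
    Submission("E", 5, 1.095, 0.01),                      # worse than current best
]
chain_names = [sub.name for sub in build_chain(example)]
print(chain_names)
```

Ordering by opening date and comparing scores at opening is what lets an earlier-opened row stay in the chain even if a different PR later updates its score.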

@msisovic
Contributor

Thanks for taking a look @cocohearts!

@codemath3000

@cocohearts Thank you so much for working through all of this and for handling the resolution. I really appreciate the time you put into the leaderboard update. Please feel free to follow up if any further questions or concerns come up on my end; I'm happy to dig into anything further.

@CiprianFlorin-Ifrim
Contributor

@cocohearts Would there be a chance for you to look at the PRs that were opened before #1400? Some of them held the best score at the time, ahead of some of the newer PRs, and they were not considered.

Separately, will you have a chance to look at PRs that specifically target the 2nd leaderboard? I have 3 PRs: Ternary #923 (which adds to the binary run that's already present in the 2nd leaderboard), XNOR-net #1388, and LeWorldModel Mamba2 #903, all 10 mins and unlimited compute. I'm sure others have many PRs for the 2nd leaderboard too.
