benchmark: 4 new corpus episodes, grok-4.3 swap-in, deprecated-model report cleanup #228
Closed
ttlequals0 wants to merge 3 commits into
… agreement, buckets

The Deprecated Models footnote correctly preserved historical F1/cost for deprecated entries, but several report sections still pulled raw data across all models -- so x-ai/grok-4.1-fast's 165 upstream 404s leaked into "Failures and provider issues" and its data still showed up in the calibration table, detection buckets, alignment table, and the corresponding chart SVGs.

Added _Extras.without(model_ids) to filter calibration/agreement/detection_buckets in one place, and trimmed the calls list at the _render_failures call site. Active charts and tables now exclude deprecated models everywhere; the Deprecated Models footnote is the only place they appear.
Superseding -- branch contained two commits already squash-merged in #227. Opening a clean PR with just the deprecated-model report-filter fix.
Summary
Three changes on this branch, all benchmark-tooling / data only (no production runtime impact, no `version.py` bump):

1. 4 new corpus episodes promoted from `data/candidates/` to `data/corpus/` -- `ep-daily-tech-news-show-b576979e1fe8`, `ep-on-air-with-dan-and-alex2-574e4f303730`, `ep-oxide-and-friends-ce789ff5b62e`, `ep-tosh-show-5f6894439bb6`. Truth files realigned against `segments.json` word-level timings via a new one-shot script. Adds ~6,076 new rows to `calls.jsonl` and the corresponding prompt/response artifacts.
2. `x-ai/grok-4.3` swapped in for `x-ai/grok-4.1-fast` (the latter is now upstream-deprecated by xAI -- 165 production 404s returning "Grok 4.1 Fast is deprecated. xAI recommends switching to Grok 4.3"). `grok-4.1-fast` is marked `deprecated = true` in `benchmark.toml`; historical F1 data preserved in the Deprecated Models footnote.

   README `Recommended Models` table refreshed against the latest 32-model report. New top is `qwen/qwen3.5-plus-02-15` at F1 0.649 (free, rank 1 of all models, paid or free). `openai/gpt-5.5` is the new "best paid" pick at F1 0.636 / $4.66 (beats `claude-opus-4-7` at 0.618 / $5.54 on both axes). `claude-opus-4-7` retains the "Best Anthropic-direct" slot. The stale `grok-4.1-fast` "0.64 / $0.15" claim is removed.
3. Deprecated-model leak fix in `report.py`. The `Deprecated Models` footnote section was correctly preserving historical F1/cost, but the Failures section, calibration table, detection buckets, cross-model alignment table, and the corresponding chart SVGs all pulled raw data across every model -- so `grok-4.1-fast`'s 165 upstream 404s and its full per-window vote history were leaking through. Consolidated into `_Extras.without(model_ids)`; active charts and tables now exclude deprecated models everywhere, and the Deprecated Models footnote is the only place they appear.

Test plan

- `grep -c "grok-4.1-fast" benchmarks/llm/results/report.md` returns `1` (only the Deprecated Models footnote, not the Failures section)
- `report_assets/alignment.svg` no longer renders a `grok-4.1-fast` row
- `report_assets/calibration.svg` no longer renders a `grok-4.1-fast` line
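The three test-plan checks could be scripted along these lines. This is a sketch under assumptions: it supposes that `report_assets/` sits next to `report.md` in the results directory and that the SVGs embed the model id as literal text. The `check_report` function name is hypothetical and not part of the PR.

```python
from pathlib import Path

DEPRECATED_ID = "grok-4.1-fast"


def count_matching_lines(path: Path, needle: str) -> int:
    """Mirror `grep -c` with a literal needle: count matching lines, not occurrences."""
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if needle in line)


def check_report(results_dir: Path, needle: str = DEPRECATED_ID) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems: list[str] = []

    # The id should appear on exactly one line: the Deprecated Models footnote.
    hits = count_matching_lines(results_dir / "report.md", needle)
    if hits != 1:
        problems.append(f"report.md: expected 1 matching line, found {hits}")

    # Assumed layout: chart SVGs live in report_assets/ beside report.md.
    for name in ("alignment.svg", "calibration.svg"):
        svg = results_dir / "report_assets" / name
        if count_matching_lines(svg, needle) != 0:
            problems.append(f"{name}: still mentions {needle}")

    return problems
```

Calling `check_report(Path("benchmarks/llm/results"))` from the repository root would then return an empty list when the report is clean.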