fix(benchmark): exclude deprecated models from failures, calibration, agreement, buckets #229
Merged
… agreement, buckets

The Deprecated Models footnote correctly preserved historical F1/cost for deprecated entries, but several report sections still pulled raw data across all models, so x-ai/grok-4.1-fast's 165 upstream 404s leaked into "Failures and provider issues" and its data still showed up in the calibration table, detection buckets, alignment table, and the corresponding chart SVGs.

Added _Extras.without(model_ids) to filter calibration/agreement/detection_buckets in one place, and trimmed the calls list at the _render_failures call site. Active charts and tables now exclude deprecated models everywhere; the Deprecated Models footnote is the only place they appear.
Summary

The `Deprecated Models` footnote section in the LLM benchmark report correctly preserved historical F1/cost for entries flagged `deprecated = true` in `benchmark.toml`, but several other report sections still pulled raw data across every model. After flipping `x-ai/grok-4.1-fast` to deprecated (xAI returns 404 with "Grok 4.1 Fast is deprecated. xAI recommends switching to Grok 4.3"), its 165 upstream 404 errors leaked into the Failures section, and its historical per-window vote data still appeared in the calibration table, detection-bucket tables, cross-model alignment table, and the corresponding chart SVGs (`calibration.svg`, `alignment.svg`, `detection_by_length.svg`, `detection_by_position.svg`).

This PR threads `deprecated_ids` through the leaking call sites in one place via a new `_Extras.without(model_ids)` helper. Active charts and tables now exclude deprecated models everywhere; the `### Deprecated Models` footnote remains the only place they appear.

Benchmark-tooling only (`benchmarks/` is dockerignored). No `version.py` bump, no CHANGELOG entry.

Test plan

- `grep -c "grok-4.1-fast" benchmarks/llm/results/report.md` returns `1` (Deprecated Models footnote only)
- `report_assets/alignment.svg` no longer renders a `grok-4.1-fast` row
- `report_assets/calibration.svg` no longer renders a `grok-4.1-fast` line