
fix(benchmark): exclude deprecated models from failures, calibration, agreement, buckets #229

Merged
ttlequals0 merged 1 commit into main from fix/benchmark-deprecated-leak
May 16, 2026


@ttlequals0
Owner

Summary

The Deprecated Models footnote section in the LLM benchmark report correctly preserved historical F1/cost for entries flagged deprecated = true in benchmark.toml, but several other report sections still pulled raw data across every model. The leak surfaced after x-ai/grok-4.1-fast was flipped to deprecated (xAI now returns 404 with "Grok 4.1 Fast is deprecated. xAI recommends switching to Grok 4.3"). Its 165 upstream 404 errors showed up in the Failures section, and its historical per-window vote data still appeared in the calibration table, the detection-bucket tables, the cross-model alignment table, and the corresponding chart SVGs (calibration.svg, alignment.svg, detection_by_length.svg, detection_by_position.svg).

This PR threads deprecated_ids through the leaking call sites. A new _Extras.without(model_ids) helper filters calibration, agreement, and detection buckets in one place, and the calls list is trimmed at the _render_failures call site. Active charts and tables now exclude deprecated models everywhere; the ### Deprecated Models footnote remains the only place they appear.
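A rough sketch of what a helper like `_Extras.without(model_ids)` might look like. The field shapes here are assumptions (calibration and detection buckets keyed by model ID, agreement keyed by model pair), and `example/active-model` is a placeholder; the actual `_Extras` class in the repo may be structured differently.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Iterable


@dataclass(frozen=True)
class _Extras:
    # Hypothetical shapes: the real fields may be structured differently.
    calibration: dict[str, Any] = field(default_factory=dict)  # model_id -> rows
    detection_buckets: dict[str, Any] = field(default_factory=dict)  # model_id -> stats
    agreement: dict[tuple[str, str], float] = field(default_factory=dict)  # pair -> score

    def without(self, model_ids: Iterable[str]) -> "_Extras":
        """Return a copy with every entry for the given models removed."""
        drop = set(model_ids)
        return replace(
            self,
            calibration={m: v for m, v in self.calibration.items() if m not in drop},
            detection_buckets={
                m: v for m, v in self.detection_buckets.items() if m not in drop
            },
            # A pair is dropped if either side is an excluded model.
            agreement={
                pair: score
                for pair, score in self.agreement.items()
                if pair[0] not in drop and pair[1] not in drop
            },
        )


extras = _Extras(
    calibration={"x-ai/grok-4.1-fast": [0.1], "example/active-model": [0.2]},
    detection_buckets={"x-ai/grok-4.1-fast": {}, "example/active-model": {}},
    agreement={("x-ai/grok-4.1-fast", "example/active-model"): 0.9},
)
active = extras.without({"x-ai/grok-4.1-fast"})
```

Returning a filtered copy rather than mutating in place keeps the unfiltered data available for the Deprecated Models footnote.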

Benchmark-tooling only (benchmarks/ is dockerignored). No version.py bump, no CHANGELOG entry.

Test plan

  • CI lint + tests pass
  • grep -c "grok-4.1-fast" benchmarks/llm/results/report.md returns 1 (Deprecated Models footnote only)
  • Failures section reads "No call errors observed across this run"
  • report_assets/alignment.svg no longer renders a grok-4.1-fast row
  • report_assets/calibration.svg no longer renders a grok-4.1-fast line
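
The two text checks on the report could also be scripted. This is only a sketch: `count_matching_lines` and `check_report` are hypothetical helpers, not part of the repo, and the match is a fixed-string one (closer to `grep -cF` than to the regex match in the command above).

```python
def count_matching_lines(text: str, needle: str) -> int:
    """Count lines containing needle, like `grep -c` with a fixed string."""
    return sum(1 for line in text.splitlines() if needle in line)


def check_report(text: str, model_id: str = "grok-4.1-fast") -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    mentions = count_matching_lines(text, model_id)
    if mentions != 1:
        problems.append(f"expected 1 line mentioning {model_id}, found {mentions}")
    if "No call errors observed across this run" not in text:
        problems.append("Failures section still lists call errors")
    return problems


# Tiny inline stand-in for benchmarks/llm/results/report.md.
sample_report = (
    "## Failures\n"
    "No call errors observed across this run\n"
    "### Deprecated Models\n"
    "x-ai/grok-4.1-fast (historical F1/cost only)\n"
)
problems = check_report(sample_report)
```

To run it against the real file, pass the contents of benchmarks/llm/results/report.md to `check_report` in place of `sample_report`.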

… agreement, buckets

The Deprecated Models footnote correctly preserved historical F1/cost for
deprecated entries, but several report sections still pulled raw data
across all models -- so x-ai/grok-4.1-fast's 165 upstream 404s leaked
into "Failures and provider issues" and its data still showed up in the
calibration table, detection buckets, alignment table, and the
corresponding chart SVGs.

Added _Extras.without(model_ids) to filter calibration/agreement/
detection_buckets in one place, and trimmed the calls list at the
_render_failures call site. Active charts and tables now exclude
deprecated models everywhere; the Deprecated Models footnote is the
only place they appear.
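
As an illustration of the call-site trim, the sketch below filters a calls list before failures are derived from it. The `Call` record, its fields, and `active_calls` are assumptions; the real call records and the `_render_failures` signature are not shown in this PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    # Hypothetical record: the real call objects may carry other fields.
    model_id: str
    error: Optional[str] = None


def active_calls(calls: list[Call], deprecated_ids: set[str]) -> list[Call]:
    """Drop calls made to deprecated models before rendering failures."""
    return [call for call in calls if call.model_id not in deprecated_ids]


calls = [
    Call("x-ai/grok-4.1-fast", error="404: Grok 4.1 Fast is deprecated."),
    Call("example/active-model"),
]
trimmed = active_calls(calls, {"x-ai/grok-4.1-fast"})
failures = [call for call in trimmed if call.error]
```

With the deprecated model's calls removed up front, the failures list is empty and the section can report that no call errors were observed.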
@ttlequals0 ttlequals0 merged commit 28c876f into main May 16, 2026
13 checks passed
@ttlequals0 ttlequals0 deleted the fix/benchmark-deprecated-leak branch May 16, 2026 00:14