benchmark: 4 new corpus episodes, grok-4.3 swap-in, deprecated-model report cleanup #228

Closed
ttlequals0 wants to merge 3 commits into main from benchmarks/05-15-26

@ttlequals0
Owner

Summary

Four changes on this branch, all benchmark tooling or data only (no production runtime impact, no version.py bump):

  • 4 new corpus episodes promoted from data/candidates/ to data/corpus/ -- ep-daily-tech-news-show-b576979e1fe8, ep-on-air-with-dan-and-alex2-574e4f303730, ep-oxide-and-friends-ce789ff5b62e, ep-tosh-show-5f6894439bb6. Truth files realigned against segments.json word-level timings via a new one-shot script. Adds ~6,076 new rows to calls.jsonl and the corresponding prompt/response artifacts.

  • x-ai/grok-4.3 swapped in for x-ai/grok-4.1-fast (the latter is now upstream-deprecated by xAI -- 165 production 404s returning "Grok 4.1 Fast is deprecated. xAI recommends switching to Grok 4.3"). grok-4.1-fast is marked deprecated = true in benchmark.toml; historical F1 data preserved in the Deprecated Models footnote.

  • README Recommended Models table refreshed against the latest 32-model report. The new top model is qwen/qwen3.5-plus-02-15 at F1 0.649 (free; rank 1 across all models, paid or free). openai/gpt-5.5 is the new "best paid" pick at F1 0.636 / $4.66 (it beats claude-opus-4-7 at 0.618 / $5.54 on both F1 and cost). claude-opus-4-7 retains the "Best Anthropic-direct" slot. The stale grok-4.1-fast "0.64 / $0.15" claim is removed.

  • Deprecated-model leak fix in report.py. The Deprecated Models footnote section was correctly preserving historical F1/cost, but the Failures section, calibration table, detection buckets, cross-model alignment table, and the corresponding chart SVGs all pulled raw data across every model -- so grok-4.1-fast's 165 upstream 404s and its full per-window vote history were leaking through. Consolidated into _Extras.without(model_ids); active charts and tables now exclude deprecated models everywhere, and the Deprecated Models footnote is the only place they appear.
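The filter described in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the code in report.py: it assumes `_Extras` is a dataclass whose `calibration`, `agreement`, and `detection_buckets` tables are dicts keyed by model ID, which the PR text does not confirm, and the numeric values are made-up placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from typing import Any, Iterable


@dataclass(frozen=True)
class _Extras:
    # Hypothetical shapes: each table is assumed to be keyed by model ID.
    calibration: dict[str, Any] = field(default_factory=dict)
    agreement: dict[str, Any] = field(default_factory=dict)
    detection_buckets: dict[str, Any] = field(default_factory=dict)

    def without(self, model_ids: Iterable[str]) -> "_Extras":
        """Return a copy with the given model IDs dropped from every table."""
        excluded = set(model_ids)

        def keep(table: dict[str, Any]) -> dict[str, Any]:
            return {m: v for m, v in table.items() if m not in excluded}

        # replace() builds a new instance, so the original stays untouched
        # and can still feed the Deprecated Models footnote.
        return replace(
            self,
            calibration=keep(self.calibration),
            agreement=keep(self.agreement),
            detection_buckets=keep(self.detection_buckets),
        )


# Placeholder data, for illustration only.
extras = _Extras(
    calibration={"x-ai/grok-4.1-fast": [0.2, 0.7], "x-ai/grok-4.3": [0.3, 0.8]},
    agreement={"x-ai/grok-4.1-fast": 0.5, "x-ai/grok-4.3": 0.6},
    detection_buckets={"x-ai/grok-4.1-fast": {"short": 1}, "x-ai/grok-4.3": {"short": 2}},
)
active = extras.without(["x-ai/grok-4.1-fast"])
print(sorted(active.calibration))  # ['x-ai/grok-4.3']
```

Returning a filtered copy rather than mutating in place keeps the unfiltered data available for the footnote while every active chart and table reads from `active`.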

Test plan

  • CI lint + tests pass
  • grep -c "grok-4.1-fast" benchmarks/llm/results/report.md returns 1 (only the Deprecated Models footnote, not the Failures section)
  • Failures section header reads "No call errors observed across this run" (or only lists non-deprecated models if there are any)
  • report_assets/alignment.svg no longer renders a grok-4.1-fast row
  • report_assets/calibration.svg no longer renders a grok-4.1-fast line
  • README Recommended Models table reflects current report numbers
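The grep and SVG checks above could be scripted along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the report and SVG contents are passed in as strings so the caller decides how to read the files. Note that `grep -c` counts matching lines, not occurrences, so the expected count of 1 holds only if the footnote names the model on a single line.

```python
def count_matching_lines(text: str, needle: str) -> int:
    """Count lines containing needle, like `grep -c` with a fixed string."""
    return sum(1 for line in text.splitlines() if needle in line)


def check_report(report_md: str, svgs: dict[str, str], deprecated_id: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    n = count_matching_lines(report_md, deprecated_id)
    if n != 1:
        problems.append(f"report.md: expected 1 line mentioning {deprecated_id}, found {n}")
    for name, svg_text in svgs.items():
        if deprecated_id in svg_text:
            problems.append(f"{name}: still renders {deprecated_id}")
    return problems


# Tiny in-memory example; real use would read report.md and the SVG files.
report = "## Failures\nNo call errors observed across this run\n## Deprecated Models\nx-ai/grok-4.1-fast\n"
svgs = {"alignment.svg": "<svg><text>x-ai/grok-4.3</text></svg>"}
print(check_report(report, svgs, "grok-4.1-fast"))  # []
```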

… agreement, buckets

The Deprecated Models footnote correctly preserved historical F1/cost for
deprecated entries, but several report sections still pulled raw data
across all models -- so x-ai/grok-4.1-fast's 165 upstream 404s leaked
into "Failures and provider issues" and its data still showed up in the
calibration table, detection buckets, alignment table, and the
corresponding chart SVGs.

Added _Extras.without(model_ids) to filter calibration/agreement/
detection_buckets in one place, and trimmed the calls list at the
_render_failures call site. Active charts and tables now exclude
deprecated models everywhere; the Deprecated Models footnote is the
only place they appear.
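Trimming the calls list before the Failures section is rendered might look like the sketch below. The record fields (`model`, `error`) and the `render_failures` stand-in are assumptions for illustration; the real `_render_failures` signature is not shown in this PR.

```python
def active_calls(calls: list[dict], deprecated_ids: set[str]) -> list[dict]:
    """Drop call records that belong to deprecated models."""
    return [c for c in calls if c.get("model") not in deprecated_ids]


def render_failures(calls: list[dict]) -> str:
    """Hypothetical stand-in for the Failures section renderer."""
    errors = [c for c in calls if c.get("error")]
    if not errors:
        return "No call errors observed across this run"
    return "\n".join(f"{c['model']}: {c['error']}" for c in errors)


# Illustrative records only.
calls = [
    {"model": "x-ai/grok-4.1-fast", "error": "404 Grok 4.1 Fast is deprecated."},
    {"model": "x-ai/grok-4.3", "error": None},
]
section = render_failures(active_calls(calls, {"x-ai/grok-4.1-fast"}))
print(section)  # No call errors observed across this run
```

Filtering at the call site leaves the renderer itself unaware of deprecation, so the same function still works on an unfiltered list if a full historical view is ever needed.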
@ttlequals0
Owner Author

Superseding this PR: the branch contained two commits that were already squash-merged in #227. Opening a clean PR with just the deprecated-model report-filter fix.

@ttlequals0 ttlequals0 closed this May 16, 2026
@ttlequals0 ttlequals0 deleted the benchmarks/05-15-26 branch May 16, 2026 00:14