benchmark: 4 new corpus episodes, grok-4.3 swap-in, deprecated-model report cleanup by ttlequals0 · Pull Request #228 · ttlequals0/MinusPod

ttlequals0 · 2026-05-16T00:08:00Z

Summary

Four changes on this branch, all benchmark-tooling / data only (no production runtime impact, no version.py bump):

- 4 new corpus episodes promoted from data/candidates/ to data/corpus/ -- ep-daily-tech-news-show-b576979e1fe8, ep-on-air-with-dan-and-alex2-574e4f303730, ep-oxide-and-friends-ce789ff5b62e, ep-tosh-show-5f6894439bb6. Truth files realigned against segments.json word-level timings via a new one-shot script. Adds ~6,076 new rows to calls.jsonl and the corresponding prompt/response artifacts.
- x-ai/grok-4.3 swapped in for x-ai/grok-4.1-fast (the latter is now upstream-deprecated by xAI -- 165 production 404s returning "Grok 4.1 Fast is deprecated. xAI recommends switching to Grok 4.3"). grok-4.1-fast is marked deprecated = true in benchmark.toml; historical F1 data preserved in the Deprecated Models footnote.
- README Recommended Models table refreshed against the latest 32-model report. New top is qwen/qwen3.5-plus-02-15 at F1 0.649 (free, rank 1 of all models, paid or free). openai/gpt-5.5 is the new "best paid" pick at F1 0.636 / $4.66 (beats claude-opus-4-7 at 0.618 / $5.54 on both axes). claude-opus-4-7 retains the "Best Anthropic-direct" slot. The stale grok-4.1-fast "0.64 / $0.15" claim is removed.
- Deprecated-model leak fix in report.py. The Deprecated Models footnote section was correctly preserving historical F1/cost, but the Failures section, calibration table, detection buckets, cross-model alignment table, and the corresponding chart SVGs all pulled raw data across every model -- so grok-4.1-fast's 165 upstream 404s and its full per-window vote history were leaking through. Consolidated into _Extras.without(model_ids); active charts and tables now exclude deprecated models everywhere, and the Deprecated Models footnote is the only place they appear.
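For reviewers who want a picture of the filter, here is a minimal sketch of what a `without(model_ids)` helper could look like. It is not the code in report.py: the class name `Extras`, the dict-keyed-by-model-id shape of each table, and the placeholder values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Iterable


@dataclass(frozen=True)
class Extras:
    """Hypothetical stand-in for report.py's _Extras; the field shapes are assumed."""

    calibration: dict[str, Any] = field(default_factory=dict)
    agreement: dict[str, Any] = field(default_factory=dict)
    detection_buckets: dict[str, Any] = field(default_factory=dict)

    def without(self, model_ids: Iterable[str]) -> "Extras":
        """Return a copy with the given model ids dropped from every table."""
        excluded = set(model_ids)

        def keep(table: dict[str, Any]) -> dict[str, Any]:
            # Keep only entries whose model id is not excluded.
            return {m: v for m, v in table.items() if m not in excluded}

        return replace(
            self,
            calibration=keep(self.calibration),
            agreement=keep(self.agreement),
            detection_buckets=keep(self.detection_buckets),
        )


# Placeholder rows; the real tables hold per-model report data.
extras = Extras(
    calibration={"x-ai/grok-4.1-fast": "old", "x-ai/grok-4.3": "new"},
    agreement={"x-ai/grok-4.1-fast": "old", "x-ai/grok-4.3": "new"},
    detection_buckets={"x-ai/grok-4.1-fast": "old"},
)
active = extras.without(["x-ai/grok-4.1-fast"])
```

Returning a filtered copy, rather than mutating in place, leaves the unfiltered data available for the Deprecated Models footnote while every active table and chart reads from the filtered copy.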

Test plan

- CI lint + tests pass
- grep -c "grok-4.1-fast" benchmarks/llm/results/report.md returns 1 (only the Deprecated Models footnote, not the Failures section)
- Failures section header reads "No call errors observed across this run" (or only lists non-deprecated models if there are any)
- report_assets/alignment.svg no longer renders a grok-4.1-fast row
- report_assets/calibration.svg no longer renders a grok-4.1-fast line
- README Recommended Models table reflects current report numbers
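The grep and SVG checks above could be scripted roughly as follows. This is a sketch, not part of the branch: the function names are hypothetical, and a plain substring test is only an approximation of "no longer renders a row/line" in an SVG.

```python
DEPRECATED_ID = "grok-4.1-fast"


def count_matching_lines(text: str, needle: str) -> int:
    """Mimic `grep -c`: count lines containing needle, not total occurrences."""
    return sum(1 for line in text.splitlines() if needle in line)


def check_report(report_md: str, svgs: dict[str, str],
                 needle: str = DEPRECATED_ID) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    n = count_matching_lines(report_md, needle)
    if n != 1:
        problems.append(
            f"report.md: expected 1 line mentioning {needle}, found {n}"
        )
    for name, svg in svgs.items():
        # Assumption: a model that is not drawn is also absent from the SVG text.
        if needle in svg:
            problems.append(f"{name}: still mentions {needle}")
    return problems
```

To run it against real output, read benchmarks/llm/results/report.md and the two files under report_assets/ with `pathlib.Path.read_text()` and pass their contents in; where report_assets/ sits relative to report.md is not stated here, so adjust the paths as needed.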

… agreement, buckets

The Deprecated Models footnote correctly preserved historical F1/cost for deprecated entries, but several report sections still pulled raw data across all models -- so x-ai/grok-4.1-fast's 165 upstream 404s leaked into "Failures and provider issues" and its data still showed up in the calibration table, detection buckets, alignment table, and the corresponding chart SVGs. Added _Extras.without(model_ids) to filter calibration/agreement/detection_buckets in one place, and trimmed the calls list at the _render_failures call site. Active charts and tables now exclude deprecated models everywhere; the Deprecated Models footnote is the only place they appear.
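Trimming the calls list at the call site can be pictured like this. It is a sketch under assumptions: the record shape (`model` and `error` keys), the helper name `active_calls`, and the body of `render_failures` are hypothetical stand-ins, not the actual `_render_failures` in report.py.

```python
from typing import Iterable


def active_calls(calls: Iterable[dict], deprecated_ids: set[str]) -> list[dict]:
    """Drop call records that belong to deprecated models."""
    return [c for c in calls if c.get("model") not in deprecated_ids]


def render_failures(calls: list[dict]) -> str:
    """Hypothetical renderer: summarize errored calls per model."""
    errors: dict[str, int] = {}
    for call in calls:
        if call.get("error"):
            errors[call["model"]] = errors.get(call["model"], 0) + 1
    if not errors:
        return "No call errors observed across this run"
    return "\n".join(
        f"{model}: {count} failed call(s)"
        for model, count in sorted(errors.items())
    )


# Illustrative records only; real rows come from calls.jsonl.
calls = [
    {"model": "x-ai/grok-4.1-fast", "error": "404"},
    {"model": "x-ai/grok-4.3", "error": None},
]
section = render_failures(active_calls(calls, {"x-ai/grok-4.1-fast"}))
```

Filtering before the renderer is called keeps the renderer itself unaware of deprecation, so the same function can still be pointed at the full list if the raw failures are ever needed.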

ttlequals0 · 2026-05-16T00:09:58Z

Superseding -- branch contained two commits already squash-merged in #227. Opening a clean PR with just the deprecated-model report-filter fix.

ttlequals0 added 3 commits May 15, 2026 19:39

New episodes added plus grok-4.3

3430238

update top models list

0b28d03

ttlequals0 closed this May 16, 2026

ttlequals0 deleted the benchmarks/05-15-26 branch May 16, 2026 00:14

ttlequals0 wants to merge 3 commits into main from benchmarks/05-15-26
