feat(benchmark): offline LLM ad-detection tool#200
Merged
Code review: no issues found at HEAD. Findings from the review pass were fixed in commit 0332b20 (post-LLM I/O try/except + atomic write_response, OpenAI fallback transient-error classification, deprecated-model hash skip, microsecond pricing snapshot filenames, inline import hoisting, ASCII em-dash). 🤖 Generated with Claude Code
Adds benchmarks/llm/ uv project with:
- pyproject.toml, .env.example, benchmark.toml.example, .gitignore
- src/benchmark/ package with sys.path bootstrap to import MinusPod modules
- config.py: TOML loader with provider/model/run validation (11 tests)
- auth.py: cookie-cached MinusPod session with 23h TTL + login probe
- truth_parser.py: ground-truth file parser with structural+logical+cross-reference validation (26 tests)
- corpus.py: corpus loader, segments hashing, windows.json round-trip using MinusPod's create_windows (9 tests)
- capture.py: episode-URL parsing, /original-segments fetch, truth.txt pre-population from production ad markers (11 tests)
- pricing.py: thin wrapper over MinusPod's pricing_fetcher (LiteLLM) with snapshot read/write (6 tests)
- parsing.py: re-exports of lifted ad-detector functions
- storage.py: atomic JSONL append/fsync, prompt+response artifact writers, dedup index, sanitized error recording (13 tests)
- metrics.py: IoU greedy match, P/R/F1, boundary MAE, JSON compliance scoring, schema violation audit, trial stdev (26 tests)

Runner, report generator, CLI, and READMEs land in follow-up commits. 102 module-level tests pass.
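For readers unfamiliar with the "IoU greedy match" listed for metrics.py, a minimal sketch of the general technique follows. It is not the code in metrics.py: the function names, the `(start, end)` tuple shape, and the 0.5 IoU threshold are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); shape assumed for this sketch


def iou(a: Span, b: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred: List[Span], truth: List[Span], threshold: float = 0.5):
    """Pair predictions with truth spans, best IoU first; return (TP, FP, FN).

    The 0.5 threshold is an assumed default, not a value taken from the PR.
    """
    pairs = sorted(
        ((iou(p, t), i, j) for i, p in enumerate(pred) for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth = set(), set()
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_truth:
            continue  # each span can be matched at most once
        used_pred.add(i)
        used_truth.add(j)
    tp = len(used_pred)
    return tp, len(pred) - tp, len(truth) - tp


def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With one overlapping pair, one spurious prediction, and one missed ad, `greedy_match` yields one TP, one FP, and one FN.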
Adds benchmarks/llm/ self-contained uv project that compares LLMs on ad-detection accuracy, cost, latency, and JSON compliance using real MinusPod transcripts.

Modules:
- config: TOML loader + provider/model/run validation
- auth: cookie-cached MinusPod session, 23h TTL, 429 reported (no auto-retry)
- truth_parser: ground-truth parser, structural+logical+cross-ref validation
- corpus: episode loader, segments-hash drift detection, public load_metadata / load_segments helpers
- capture: episode-URL parsing, /original-segments fetch, truth.txt pre-population from production ad markers
- pricing: thin wrapper over MinusPod's pricing_fetcher (LiteLLM-backed), snapshot read/write with O(1) indexed lookup
- parsing: re-exports of lifted ad_detector functions
- storage: atomic JSONL append/fsync, single-pass scan_calls returning (completed, errored), prompt+response artifact writers, sanitized errors
- metrics: IoU greedy match, P/R/F1, boundary MAE, schema-violation audit, prefix-aware compliance scoring matching production extraction methods
- llm: AsyncAnthropic + AsyncOpenAI dispatch with native json_object + prompt-injection fallback, retry-with-backoff classification
- runner: build (model, episode, trial, window) work list, dedup against calls.jsonl, asyncio.gather with global+per-provider semaphores; user prompt cached per (episode, window) -- 98% fewer build calls on full sweeps
- report: aggregate + render Markdown report (TL;DR, headline, per-model, per-episode, parser stress, methodology, run metadata) + SVG Pareto chart
- cli: typer subcommands -- capture, verify, regenerate-windows, list-episodes, validate, refresh-pricing, run, report, archive

Scope is sourced from benchmark.toml ([[models]]) and data/corpus/ at runtime; no --model / --episode / --trials CLI filters.

127 unit tests pass. Async LLM dispatch is exercised against real provider APIs at run time.
.gitignore negates data/ for benchmarks/llm/data/ so corpus + pricing snapshots commit while leaving repo-root data/ ignored. No runtime image impact: benchmarks/ already in .dockerignore (v2.0.25).
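The runner's "asyncio.gather with global+per-provider semaphores" can be pictured roughly as below. This is a sketch of the pattern only: `run_all`, its parameters, and the `call` coroutine are hypothetical names, and the acquisition order (provider first, then global) is a choice made for this example rather than something the commit states.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


async def run_all(
    work: List[Tuple[str, Any]],            # (provider, payload) pairs
    global_limit: int,
    provider_limits: Dict[str, int],
    call: Callable[[str, Any], Awaitable[Any]],
) -> List[Any]:
    """Run every work item with a global cap and a per-provider cap."""
    global_sem = asyncio.Semaphore(global_limit)
    provider_sems = {p: asyncio.Semaphore(n) for p, n in provider_limits.items()}

    async def one(provider: str, payload: Any) -> Any:
        # Take the provider slot first so a task queued behind a saturated
        # provider does not sit on a global slot while it waits.
        async with provider_sems[provider]:
            async with global_sem:
                return await call(provider, payload)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(one(p, x) for p, x in work))
```

Total in-flight calls never exceed `global_limit`, and calls to any single provider never exceed that provider's own limit.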
- Wrap post-LLM I/O (parse, schema audit, write_response, write_prompt, append_jsonl) in try/except so a parse or disk error in one execute() doesn't propagate out of asyncio.gather and cancel the other in-flight tasks. Errored records are still appended; the JSONL write itself is the only operation that can lose a row, and that path now logs defensively rather than re-raising.
- write_response now mirrors write_prompt's tmp+os.replace atomic write so a crash mid-write can't leave a truncated file referenced by an fsynced calls.jsonl row.
- llm.call_with_retry's response_format fallback now classifies RateLimitError/APITimeoutError/APIConnectionError as transient on the retry path too (was only classifying APIStatusError).
- precompute_prompt_hashes skips deprecated models so the hash dict matches build_work_list's filter.
- pricing snapshot filenames include microseconds; two refresh-pricing calls in the same second no longer overwrite each other.
- Hoisted inline imports flagged by review (`import re` in storage, `from rapidfuzz import fuzz` in truth_parser, `from .storage import read_jsonl` in runner, sibling-module imports in cli._preview).
- Replaced the em dash in auth.py's 429 error message with ASCII `--`.

127 unit tests pass.
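The tmp+os.replace pattern mentioned for write_response looks roughly like this. It is a generic sketch, not the storage.py implementation: `atomic_write_text` and the temp-file naming are assumptions.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file; the destination is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves at most a stray `.tmp` file, never a truncated file at the final path.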
Moves the deferred imports of create_windows, normalize_model_key, fetch_litellm_pricing, and corpus.load_metadata/load_segments to module level. The path bootstrap in benchmark/__init__.py runs before any submodule loads, so MinusPod's src/ is already on sys.path by the time these imports execute. The remaining inline imports in llm.py (anthropic + openai SDKs) and report.py (matplotlib) stay deferred on purpose: the SDKs only load when their provider is actually used, and matplotlib (~50MB) only loads when render is called.
The benchmark capture flow calls
GET /api/v1/feeds/{slug}/episodes/{id}/original-segments, an endpoint
added in 2.0.26. Pointing the tool at an older server returns 404 with
no obvious user-facing signal. README now states the minimum version
explicitly alongside the existing path-bootstrap note.
Two related fixes: - Add python-dotenv as a runtime dep and load benchmarks/llm/.env at CLI import time. Path is resolved relative to the package (parents[2]) so the load works regardless of where `benchmark` is invoked from. Shell-exported variables still win over .env values (override=False). - README: switch the install instruction from `uv sync` (which only satisfies the benchmark's own deps) to `pip install -e .` into the MinusPod parent venv. The benchmark imports modules from MinusPod's src/ at runtime; those modules transitively need MinusPod's runtime deps (jinja2, flask, etc.) which are only present in the parent venv. Running from a fresh benchmark-only venv fails at first import of benchmark.cli with ModuleNotFoundError on jinja2.
Previous version sent users to install editable into MinusPod's parent venv, which works but isn't the natural uv-project flow. The reason the benchmark needs MinusPod's runtime deps at all is that ad_detector does a module-level `from webhook_service import fire_auth_failure_event` even though webhook_service is only used inside the LLM error path; the benchmark only wants create_windows but Python loads the whole module's imports. Two equivalent paths now documented: - uv-native: `uv sync && uv pip install -r ../../requirements.txt`, then `uv run benchmark <cmd>`. Layered install needed because the benchmark's own pyproject.toml deliberately does not duplicate MinusPod's full runtime stack. - parent-venv: install editable into ../../.venv and call ../../.venv/bin/benchmark directly. Both work. uv run is now the primary path.
…umers

src/ad_detector.py and src/pricing_fetcher.py both did module-level imports of dependencies they only need on a single code path:
- ad_detector.fire_auth_failure_event is only called inside the LLM dispatch loop when is_auth_error(e) is true. webhook_service transitively imports jinja2.
- pricing_fetcher.BeautifulSoup is only used inside fetch_pricepertoken_pricing. fetch_litellm_pricing does not need it.

Both imports are now deferred to the function bodies that actually use them. No behavioural change: the webhook fires under identical conditions and the HTML-scraping path parses identically.

The benchmark in benchmarks/llm/ imports create_windows from ad_detector and fetch_litellm_pricing from pricing_fetcher via a sys.path bootstrap. Before this commit those imports forced jinja2 + beautifulsoup4 into the benchmark's venv, which uv sync had no reason to install (they're not in benchmark's pyproject.toml), so `uv run benchmark <cmd>` failed at first import. After this commit `uv run benchmark validate` and `refresh-pricing` work cleanly on a fresh `uv sync`.

Also adds requests>=2.31 to benchmarks/llm/pyproject.toml since pricing_fetcher uses it directly, drops the now-unnecessary `uv pip install -r ../../requirements.txt` step from the benchmark README, and seeds data/pricing_snapshots/ with the first snapshot fetched against the live LiteLLM table (2233 models).

128 ad_detector + pricing tests pass; 127 benchmark tests pass.
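The deferred-import pattern described above, reduced to a toy. Everything here is a stand-in: `handle_llm_error` is a hypothetical name, the `is_auth_error` body is invented for the example, and the stdlib `json` module plays the role of webhook_service so the snippet runs anywhere.

```python
from typing import Optional


def is_auth_error(exc: Exception) -> bool:
    # Stand-in check; the real is_auth_error lives in MinusPod's src/.
    return isinstance(exc, PermissionError)


def handle_llm_error(exc: Exception) -> Optional[str]:
    """Only the auth-failure branch pays for the extra dependency."""
    if is_auth_error(exc):
        # Deferred import: resolved on first use, then cached in sys.modules.
        # `json` stands in for webhook_service (and its jinja2 dependency).
        import json

        return json.dumps({"event": "auth_failure", "detail": str(exc)})
    return None
```

Importing the module that defines `handle_llm_error` no longer requires the optional dependency to be installed; only the auth-failure path does.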
The capture template pre-populates truth.txt from MinusPod's accepted
production ad markers. On real podcasts the production detector emits
some false positives that have to be deleted by hand on every capture:
1. Markers placed around silence or music where Whisper hallucinated a
short phrase like "Thank you for watching." (no actual ad audio).
2. Markers placed around main content the production prompt happened to
classify as ad-like (the host discussing security topics, etc).
Adds _classify_marker(marker, segments) -> (accepted, reason). Markers
that fail any of these are routed to the existing commented "rejected"
section with an `auto-rejected: <reason>` annotation, so the human
reviewer can still uncomment them if the heuristic is wrong:
- text in range matches a known Whisper hallucination phrase only
("Thank you for watching.", etc).
- text density < 3 chars/sec (well below normal speech ~12-16 chars/sec).
- text contains none of the strong ad signals: "brought to you by",
"sponsor", ".com slash" / ".com/", "promo code", "use code",
"discount code", "free trial", "sign up at", "get started at",
"listen at", "visit X".
The "rejected" header is reused; production-rejected markers now carry
"# --- (rejected by production)" and auto-rejected markers carry
"# --- (auto-rejected: <reason>)" so the source is unambiguous.
Validated against the security-now-audio episode just captured: 11 raw
markers from production, 7 accepted (all confirmed real sponsor reads),
4 auto-rejected (2 Whisper hallucinations, 2 main-content false
positives). 134 benchmark tests pass.
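A rough sketch of the three checks, to make the heuristic concrete. It follows the rules as written above but is not the code in capture.py: the dict shapes for markers and segments, the overlap test, the regex for "visit X", and the last two reason strings are assumptions.

```python
import re
from typing import Dict, List, Tuple

# Only the phrase named in the commit message; the real list may be longer.
HALLUCINATIONS = {"thank you for watching."}
AD_SIGNALS = (
    "brought to you by", "sponsor", ".com slash", ".com/", "promo code",
    "use code", "discount code", "free trial", "sign up at",
    "get started at", "listen at",
)
VISIT_RE = re.compile(r"\bvisit\s+\S+")  # assumed reading of "visit X"
MIN_CHARS_PER_SEC = 3.0


def classify_marker(marker: Dict, segments: List[Dict]) -> Tuple[bool, str]:
    """Return (accepted, reason) for one production ad marker."""
    # Collect transcript text from segments overlapping the marker's range.
    texts = [
        s["text"].strip()
        for s in segments
        if s["end"] > marker["start"] and s["start"] < marker["end"]
    ]
    texts = [t for t in texts if t]
    if texts and all(t.lower() in HALLUCINATIONS for t in texts):
        return False, "Whisper hallucination only"
    joined = " ".join(texts).lower()
    duration = marker["end"] - marker["start"]
    if duration <= 0 or len(joined) / duration < MIN_CHARS_PER_SEC:
        return False, "low text density"
    if not any(sig in joined for sig in AD_SIGNALS) and not VISIT_RE.search(joined):
        return False, "no strong ad signal"
    return True, ""
```

Rejections keep a reason string so the template can annotate the commented-out block.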
Previously, the truth.txt template grouped all rejected markers (auto
and production) at the bottom of the file, after every accepted block.
That broke a workflow: when a reviewer uncommented a rejected marker
to keep it, validate_logical's "ads must be ordered by start"
constraint would refuse to verify until they manually moved it back
into time order somewhere up above.
New layout: build a single chronologically sorted list of all markers
(accepted + auto-rejected + production-rejected) and render in that
order. Rejected blocks sit in their natural time slot, commented out,
with a one-line annotation above the start/end/text:
---
# auto-rejected: Whisper hallucination only
# start: 16:23.00
# end: 16:49.10
# text: ...
---
Uncommenting then yields a valid block in the right time slot. No
section headers, no shuffling required. Comment lines (including the
auto-rejected: annotation) are ignored by the parser.
A short user-facing instruction is added at the top of any file with
rejections: "remove the '# ' prefix from start/end/text lines to
accept a rejected block."
134 -> 135 benchmark tests pass.
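The new layout amounts to "sort everything by start, comment out what was rejected". A sketch under assumptions: the marker dict shape, the `reason` key, the `render_truth` and `fmt_ts` names, and the use of `---` as a separator between blocks are all illustrative, and timestamps are assumed to be minutes:seconds with two decimals as in the example block above.

```python
from typing import Dict, List


def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS.cc, working in centiseconds to avoid '60.00'."""
    cs = round(seconds * 100)
    minutes, rem = divmod(cs, 6000)
    return f"{minutes:02d}:{rem // 100:02d}.{rem % 100:02d}"


def render_truth(markers: List[Dict]) -> str:
    """Render accepted and rejected markers in one chronological pass."""
    lines: List[str] = []
    for mk in sorted(markers, key=lambda m: m["start"]):
        reason = mk.get("reason")      # None means the marker was accepted
        prefix = "# " if reason else ""
        lines.append("---")
        if reason:
            lines.append(f"# {reason}")  # one-line annotation above the block
        lines.append(f"{prefix}start: {fmt_ts(mk['start'])}")
        lines.append(f"{prefix}end: {fmt_ts(mk['end'])}")
        lines.append(f"{prefix}text: {mk['text']}")
    if lines:
        lines.append("---")
    return "\n".join(lines) + "\n"
```

Because rejected blocks already sit in time order, stripping the `# ` prefix from their start/end/text lines leaves the file sorted by start.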
…k consumers
Same root cause as 2.0.28's lazy-import fixes (webhook_service, bs4).
ad_detector.get_static_system_prompt does `from database import
DEFAULT_SYSTEM_PROMPT`, which runs database/__init__.py end-to-end
before the constant is accessible. That __init__ loads
database.settings, which module-level imports secrets_crypto, which
requires the cryptography package.
The offline benchmark in benchmarks/llm/ calls get_static_system_prompt
inside `run --dry-run` and `run`. A fresh `uv sync` venv has no
cryptography, so dry-run blew up at import:
ModuleNotFoundError: No module named 'cryptography'
DEFAULT_SYSTEM_PROMPT is a plain ~8KB multi-line string. It has no
database semantics. Move it to src/utils/constants.py (already stdlib-
only, already home to SEED_SPONSORS), and re-export from
src/database/__init__.py for backward compat. Update the two
`from database import DEFAULT_SYSTEM_PROMPT` sites in src/ad_detector.py
to import from utils.constants directly so the benchmark's import path
no longer touches the database package.
Verified: `uv run benchmark run --dry-run` now reports 4,760 calls would
execute against the 5-episode corpus. 162 ad_detector + database +
settings unit tests pass.
…Router attribution

Two related dispatch-side fixes observed during the first live run.

1. Claude 4.x family models (opus 4.7, sonnet 4.6, haiku 4.5) reject `temperature` with HTTP 400 "`temperature` is deprecated for this model.". The previous version retried without temperature on 400 but relearned the deprecation on every call, paying ~1020 wasted round-trips on a full 5-episode sweep. Now memoized in a process-level set: first 400 per model adds the model_id, subsequent calls skip `temperature` upfront. Worst-case waste is one round-trip per affected model per process. Older Claude models that still accept `temperature` are unaffected.
2. OpenRouter recommends HTTP-Referer + X-Title headers for app attribution. The OpenAI SDK passes them via default_headers. Detected by base_url containing "openrouter.ai" so it doesn't leak to other openai_compatible providers.

135 benchmark tests pass. The first run captured 335 successful calls across 5 episodes for claude-opus-4-7 with 0 errors and median compliance 1.000, validating both code paths against the live API.
Same lazy-import pattern as 2.0.28 but for the new src/utils/llm_call.py introduced by PR #205 (2.1.6). The module-level from webhook_service import fire_auth_failure_event at line 14 pulled in jinja2 transitively for any consumer of the new call_llm_for_window helper -- including the benchmark, which imports ad_detector -> utils.llm_call. Move the import into the call site inside the is_auth_error(e) branch where it's actually needed. No behavioural change for production: webhook fires under identical conditions. The benchmark in benchmarks/llm/ can now import the ad_detector module chain without cryptography/jinja2/flask in its venv. 299 unit tests pass. Benchmark dry-run reports 2,592 remaining calls (2,168 already done from the in-flight run, dedup'd via prompt_hash).
Five upgrades based on first-real-run feedback:

1. Metric Key section moved to the top of the report. Replaces the short "How to Read" footnote with a tabular glossary (range, direction, plain-English meaning) for F1, cost, F1/$, latency, JSON compliance, no-ad PASS/FAIL, F1 stdev. Latency entry calls out the OpenRouter routing-layer caveat -- it includes upstream queueing, not just model-side compute, so it should be treated as a load/availability indicator rather than a model-quality signal.
2. Pareto chart redone with numbered points + a sorted legend on the right. Matplotlib's default ax.annotate at the data point produced overlapping text when models clustered (very common at the low-cost end). Numbered markers eliminate that.
3. New JSON compliance bar chart (compliance.svg). Horizontal bars sorted ascending, color-coded green/yellow/red against >=0.95 and >=0.7 thresholds, with a dotted reference line at 0.95.
4. New per-episode F1 heatmap (episodes.svg). Models on Y (sorted by avg F1 desc), ad-bearing episodes on X (no-ad excluded). Reveals model-content interaction that the aggregated F1 hides -- e.g. some models that score well overall struggle on specific episodes.
5. New "Failures and provider issues" section. Categorizes errors into buckets (provider content moderation, deprecated parameter, rate-limited, 5xx, etc.), shows per-model error counts, includes sample error messages, and explains why the failures matter for production model selection. Triggered by the qwen 3.5-plus run where Alibaba's content classifier rejected one transcript window (a real production gotcha you can't see from F1 alone).

3 report tests still pass.
Each model now gets a distinct color from matplotlib's tab20 colormap. The legend sits below the plot as a real matplotlib legend (instead of the right-side text box), so each model's color swatch is rendered adjacent to its name + (F1, cost) summary. Layout: 2-column legend for >6 models, 1-column otherwise. Bottom margin scales with row count (capped at 0.55) so the legend always fits without overlapping the axes. 3 report tests still pass.
Adds the high-signal sections we identified after the first run. Every addition reuses data already in calls.jsonl -- no schema change to the wire format, just new aggregations and renderers.

New stats on ModelEpisodeStats / ModelStats:
- Per-trial precision, recall, TP/FP/FN, boundary start/end MAE
- Per-model output-tokens-total, detected-ads-total, tokens-per-ad
- p90 / p99 / max latency in addition to p50 / p95
- Cost-per-true-positive

New side data captured during aggregation (returned as _Extras):
- Calibration: per-model list of (self-reported confidence, was-TP)
- Cross-model agreement: per (episode, window), models predicting an ad
- Detection-by-bucket: hit rate by ad length and ad position

New report sections:
1. Precision / recall / FP / FN breakdown (F1 hides which side errs)
2. Boundary accuracy -- start/end MAE for matched ads
3. Confidence calibration table (binned, with hit rate per bin)
4. Latency tail -- p50/p90/p95/p99/max
5. Output token efficiency -- tokens per detected ad + cost per TP
6. Trial variance -- determinism check at temp=0
7. Cross-model agreement -- N-of-K vote distribution per window
8. Detection rate by ad length (short/medium/long)
9. Detection rate by ad position (pre/mid/post-roll)

New charts:
- calibration.svg -- reliability diagram (confidence vs hit rate)
- latency_tail.svg -- p50/p90/p99/max bars on log scale per model

Charts section now lists all five charts inline. Existing pareto + compliance + episodes charts unchanged.

135 tests pass. Report regenerated against the current calls.jsonl surfaces several findings that were invisible before:
- Most models are wildly overconfident (claim 0.95-0.99 confidence, hit rate 20-50%); phi-4 is overconfident at 4% hit rate
- Boundary MAE on start can be 18-24s for some models even when F1 is OK
- Every model misses post-roll ads more than pre/mid-roll
- 22/68 windows show 13+ models agreeing -- candidate ensemble pool
Adds benchmarks/llm/CONTRIBUTING.md so outside contributors can ship PRs expanding the corpus or the model list without reverse-engineering the workflow from the existing README pair. Covers the three accepted PR shapes (new episode, new model, code) with the exact diff each one should produce, what `benchmark verify` already validates so reviewers don't redo it by hand, and the copyright/PII bar for transcript excerpts (a gap not covered in the existing READMEs). Cross-references README.md and data/README.md rather than duplicating their content; total length 117 lines. Doc passes the Wikipedia signs-of-AI-writing scan: zero hits across the ~25 high-frequency AI-vocabulary tokens checked, em-dash density 4/117 all in `term -- definition` bullet form (compliant with humanizer's "one em dash per paragraph" rule). All example URLs use the your-minuspod.example.com placeholder per the public-repo URL hygiene rule in CLAUDE.md. No code, schema, or test impact. Bumps version.py + openapi.yaml to 2.1.7 per CLAUDE.md "always update version.py with changes" rule.
Adds the 5-episode verified corpus and the artifacts from the first 14-model sweep so anyone cloning the repo can audit the results without re-running the benchmark.

Corpus:
  data/corpus/ep-ai-cloud-essentials-e8dc897fbd6b/ (no-ad control, 16 min)
  data/corpus/ep-daily-tech-news-show-c1904b8605f7/ (4 active ads)
  data/corpus/ep-glt1412515089-373d5ba5007b/ (4 active ads)
  data/corpus/ep-security-now-audio-2850b24903b2/ (6 active ads)
  data/corpus/ep-the-tim-dillon-show-f62bd5fa1cfe/ (6 active ads)

Each episode dir contains metadata.toml, segments.json (Whisper-byte-exact), truth.txt (human-verified ad markers), and windows.json.

Run artifacts (first 14-model sweep, 4760 total LLM calls, 1 error):
  results/raw/calls.jsonl (8.0 MB, one row per call)
  results/raw/episode_results.jsonl (492 KB, per-trial aggregates)
  results/raw/prompts/ (1246 files, 15 MB)
  results/raw/responses/ (6196 files, 23 MB)
  results/report.md (32 KB rendered report)
  results/report_assets/*.svg (5 charts, 590 KB)
  data/pricing_snapshots/... (LiteLLM snapshot)

Updates CONTRIBUTING.md "what's in the repo vs your PR" table to match: calls.jsonl + episode_results.jsonl + prompts/ + responses/ are maintainer-committed for audit; contributors don't include them in their own PRs.

No app code changed; no version bump. Total commit ~54 MB across ~7,470 files. Maintainers should expect repo size to grow when new episodes or models are merged and the next sweep is run.
Restructures the existing "Recommended Models" section to add a Cloud LLMs subsection on top of the existing Ollama subsection. The cloud picks come from the offline benchmark in benchmarks/llm/ -- four recommendations covering best-accuracy, best-Anthropic-direct, best free-tier, and cheap-and-fast use cases, with F1 / cost / latency caveats per option. Cross-references the benchmark tool, the rendered report, and the contribution guide so readers who want to validate or expand the list can. Notes the current corpus is 5 episodes and numbers will refine as the corpus grows. Existing local-Ollama tables (Pass 1, Verification, Chapters) are preserved verbatim, just nested under a new Local Ollama Models heading and demoted from #### to ##### accordingly. ToC: adds "Recommended Models" as a nested entry under "Using Ollama (Local or Cloud)" so the section is reachable from the top of the doc. Doc-only change. Doc passes the AI-tic vocabulary scan.
Three updates anchored by the new offline benchmark data. 1. Local Ollama Models intro: prefix the section with a note that the benchmark covers cloud-hosted models only and Ollama runs are not in the sweep, so contributors who want apples-to-apples local numbers would need to extend the benchmark with an Ollama provider. Also trims the unverified "in the same tier as Claude Sonnet" claim on qwen3.5:122b -- the cloud benchmark scored Sonnet 4.6 at F1 0.33, not a clear ceiling, so the comparison no longer fits. 2. "Accuracy vs. Claude" -> "Cloud vs. Local: What Changes". The old title implied Claude was the universal frontier; the cloud benchmark shows Grok-4.1-fast (F1 0.61) clearly beats every Claude variant, with Sonnet at 0.33. Reframes the section as cloud-vs-local rather than open-weights-vs-Claude. Adds a row on network-inserted brand- tagline ads (the cloud benchmark's per-bucket detection rates show even frontier models miss roughly a third of these, useful prior for setting expectations on local). Closes with how to measure the gap on your own content via the bench tool. 3. JSON Reliability Risks: grounds the cloud-vs-local framing in real data. The benchmark report shows JSON compliance from 0.53 (verbose reasoning models) to 1.00 (Mistral / Qwen / Opus); Claude Haiku 4.5 at 0.60 because it markdown-fences every response. So compliance variance is not unique to local; cloud models exhibit it too. Links to the JSON-compliance chart in the report. Also updates the LLM accuracy notice in the disclaimer area to point at the renamed section and link the benchmark report instead of restating "most testing was done with Claude" (no longer accurate now that the benchmark covers 14 cloud models). Doc-only change. AI-tic vocabulary scan: 0 hits in added prose.
The 2.1.7 commit (4c481f9) bumped version.py and openapi.yaml for a docs-only change. By policy ("don't bump for non-app changes"), that bump should not have happened. Reverts both files to 2.1.6 and removes the 2.1.7 entry from CHANGELOG since no Docker image will ship under that tag. The src/ refactors that landed since 2.1.6 (lazy webhook_service imports in ad_detector.py and utils/llm_call.py, lazy bs4 import in pricing_fetcher.py, DEFAULT_SYSTEM_PROMPT hoist to utils/constants.py with a re-export in database/__init__.py) are behavior-preserving -- they enabled the offline benchmark in benchmarks/llm/ to import these modules without dragging in jinja2/flask/cryptography/bs4. Runtime is byte-identical to main's 2.1.6, so no Docker rebuild is needed. Net effect on a deployed instance: - Version string still 2.1.6 (matches the live image). - No behavior change. - New CONTRIBUTING.md, expanded benchmark report, corpus data, and README cloud-LLM picks ride along on this branch without bumping the version.
OpenRouter sometimes returns HTTP 200 with `choices=None` in the body
when the upstream provider hits its own internal read-timeout. Observed
on cohere/command-r-plus-08-2024 at the 120s mark for windows that
take long to reason about. The OpenAI SDK doesn't raise -- it returns
a ChatCompletion object where `msg.choices` is None instead of [],
and our code did `msg.choices[0]` which crashes with
"'NoneType' object is not subscriptable".
Fix: defensively coerce None to [] and raise LLMTransientError when
choices is empty (or the first choice has no message). The runner's
existing retry-on-LLMTransientError path then retries with backoff
instead of crashing the call.
Impact on the in-flight run: minimal. The 4 affected Cohere calls
(out of 340) are already excluded from F1 / compliance / cost
aggregates by `if rec.get('error'): continue`, so Cohere R+ still has
336 valid samples. The fix prevents recurrence and lets
`benchmark run --retry-errors` retry the 4 timed-out windows cleanly
after the run completes.
No version bump; benchmark-only code, doesn't ship in the Docker image.
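The defensive coercion reads roughly like this. A sketch only: `first_message_content` is a hypothetical helper, and `LLMTransientError` is redefined locally so the snippet is self-contained; the real class lives in the benchmark package.

```python
from types import SimpleNamespace
from typing import Any


class LLMTransientError(Exception):
    """Local stand-in for the benchmark's retryable-error class."""


def first_message_content(completion: Any) -> Any:
    """Return the first choice's content, or raise a retryable error."""
    # `choices` can arrive as None on an HTTP 200, so coerce to a list.
    choices = getattr(completion, "choices", None) or []
    if not choices or getattr(choices[0], "message", None) is None:
        raise LLMTransientError("provider returned no choices")
    return choices[0].message.content


# Usage with a fake completion object shaped like the SDK's response.
ok = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content='{"ads": []}'))]
)
assert first_message_content(ok) == '{"ads": []}'
```

Raising a transient error instead of crashing on the subscript lets a retry-with-backoff path handle the empty response like any other timeout.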
Best Accuracy table held all 32 models; Best Value (F1 / cost) silently dropped the 10 free-tier rows because F1 / 0 is infinity. Visible asymmetry that asked "where did half the table go?" without an answer. Splits the leaderboard into three sections: 1. Best Accuracy -- unchanged, all models by F1 2. Best Value -- paid-tier only, by F1 / $ 3. Best Free-Tier (new) -- $0.00 rows ranked by F1 alone Each section now has a one-line note explaining its scope, including a heads-up that OpenRouter free-tier eligibility depends on the attribution headers (HTTP-Referer, X-Title) wired in src/benchmark/llm.py, so a model that shows as free here may bill on a deployment that doesn't send those headers. 135 tests pass. Doc/report-only change; no version bump.
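The three-way split avoids the F1 / 0 division by never computing value for free rows. A minimal sketch, assuming each row is a dict with `model`, `f1`, and `cost` keys; the function name and row shape are illustrative, not taken from report.py.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_leaderboards(rows: List[Row]) -> Tuple[List[Row], List[Row], List[Row]]:
    """Return (best_accuracy, best_value, best_free_tier)."""
    by_f1 = sorted(rows, key=lambda r: r["f1"], reverse=True)
    paid = [r for r in by_f1 if r["cost"] > 0]
    free = [r for r in by_f1 if r["cost"] == 0]   # ranked by F1 alone
    # F1 per dollar is only defined for paid rows, so no division by zero.
    best_value = sorted(paid, key=lambda r: r["f1"] / r["cost"], reverse=True)
    return by_f1, best_value, free
```

Every model appears in Best Accuracy and in exactly one of the other two tables, which removes the unexplained asymmetry.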
The reliability-diagram-style line plot crowded 30+ models into the high-confidence end of the axis. Most models report 0.95-0.99 confidence on most predictions, so lines piled on top of each other and the x-axis tick labels (bin centers as float) became unreadable. Replaces with a calibration heatmap: - One row per model (sorted from most overconfident at top to most underconfident at bottom). - One column per confidence bin, labeled with the bin range directly. - Cell text shows actual hit rate and sample size. - Cell color is calibration error (actual minus bin midpoint), with a diverging RdYlGn colormap centered on 0. Red = overconfident, green = well-calibrated, blue = underconfident. Bins the model never used render as a neutral gray. Same underlying data, no overlap, x-axis is self-explanatory. Caption in the report's charts section updated to match. 135 tests pass. Report regenerated; no data refresh.
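Each heatmap cell boils down to a hit rate, a sample count, and "actual minus bin midpoint". A sketch of that computation for one model; the bin edges, the `calibration_cells` name, and the dict layout are assumptions, and the input is the (self-reported confidence, was-TP) list described for the calibration data.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def calibration_cells(
    samples: List[Tuple[float, bool]],
    edges: Sequence[float] = (0.0, 0.5, 0.7, 0.9, 0.95, 1.0),  # assumed bins
) -> List[Optional[Dict]]:
    """One cell per confidence bin; None where the model never used the bin."""
    cells: List[Optional[Dict]] = []
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]
        # Half-open bins, except the last one also includes its upper edge.
        hits = [tp for conf, tp in samples
                if lo <= conf < hi or (is_last and conf == hi)]
        if not hits:
            cells.append(None)          # rendered as neutral gray
            continue
        hit_rate = sum(hits) / len(hits)
        cells.append({
            "bin": (lo, hi),
            "n": len(hits),
            "hit_rate": hit_rate,
            # Negative: the model claims more confidence than it earns.
            "error": hit_rate - (lo + hi) / 2,
        })
    return cells
```

A model that reports 0.96-0.99 on four predictions and gets one right lands in the top bin with a hit rate of 0.25 and an error of about -0.725, strongly overconfident.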
Two related fixes to chart rendering: 1. Shorten the calibration heatmap title from a long two-line caption to a compact one with cell-text legend up top and color legend below. Previous wording overflowed the figure width on the right on narrow viewports. 2. Add bbox_inches="tight" to every chart savefig. matplotlib's tight_layout sometimes misses corner clipping when titles or colorbar labels run close to the figure edge; bbox_inches="tight" on save expands the bounding box just enough to capture them. Applied to pareto.svg, compliance.svg, episodes.svg, calibration.svg, and latency_tail.svg. No data change, just rendering. 135 tests pass.
calls.jsonl is append-only by design (audit history). When
`benchmark run --retry-errors` succeeds against a previously-failed
(model, episode, trial, window) tuple, the new successful row is
appended alongside the original error row. The aggregator already
filtered errors out of F1/cost (`if rec.get('error'): continue`), but
the Failures section kept counting the original error rows even though
they were superseded by a successful retry.
Fix: dedup raw calls by (model, episode_id, trial, window_index) and
keep the last row encountered. Everything downstream (aggregator,
charts, failures table, per-model detail) sees only the current state.
Run Metadata now distinguishes "unique work units (current state)"
from "raw rows in calls.jsonl" and notes how many were superseded by
later retries. Lifetime actual spend still sums the raw rows so the
true historical bill is preserved.
135 tests pass. No data change; rendering only.
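Keeping the last row per work unit is a small dictionary pass. A sketch, assuming each calls.jsonl row is a dict carrying the four key fields named above; `current_state` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def current_state(rows: List[Dict]) -> Tuple[List[Dict], int]:
    """Collapse append-only rows to the latest row per work unit.

    Returns (unique_rows, superseded_count). Rows are assumed to be in
    file order, so a later retry overwrites the earlier error row.
    """
    latest: Dict[tuple, Dict] = {}
    for rec in rows:
        key = (rec["model"], rec["episode_id"], rec["trial"], rec["window_index"])
        latest[key] = rec  # last row encountered wins
    unique = list(latest.values())
    return unique, len(rows) - len(unique)
```

Lifetime spend would still be summed over the raw `rows`, while everything rendered in the report reads from the deduplicated list.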
The TP/FP/FN columns were undefined inline. New readers had to know the convention or skim the metric glossary at the top to interpret the rest of the section. Adds a compact column-key table with the formula for precision and recall and one-sentence definitions of TP, FP, and FN, plus a one-paragraph reading guide on what high-precision vs high-recall behavior actually means in production. Doc-only; no schema or test impact. 135 tests pass.
…chart Two-part pass on report rendering: 1. Add a one-paragraph intro to every section that lacked one, written for an entry-level-to-expert audience. Sections that gained explainers: Failures (By category, Per-model error count, Sample messages), Detection rate buckets (By ad length, By ad position), Quick Comparison, Per-Model Detail, Per-Episode Detail, Parser Stress Test, Methodology. Each explainer answers: what does this table show, why does it matter, how should I read the numbers? 2. New chart: Cross-model agreement histogram (agreement.svg). Renders the existing per-window vote-count distribution as a colored bar chart (red = low agreement, green = high), with cell labels showing window counts and percentages. Wired into the Charts section with a caption that mirrors the table-form Cross- model agreement section. 135 tests pass. Data unchanged; rendering only.
User noted the cross-model agreement histogram was anonymous (no model labels) and asked for both views: distribution AND per-model attribution. Plus identified other sections that could use charts.

New analysis:
- Per-model alignment with consensus (table + chart). For each model and each (episode, window), classify the model's vote against the majority into 4 buckets: with-yes / with-no / broke-yes / broke-no. Surfaces which models track consensus and which break it -- a model that voted "yes" when most others voted "no" is likely false-positiving; the opposite means likely missing real ads.

New charts:
- alignment.svg: stacked horizontal bars per model showing the 4 buckets with green/blue/orange/red color coding and an alignment rate label on the right edge.
- precision_recall.svg: scatter of precision vs recall with F1 isocurves as gray dashed reference lines. Top-right = ideal, top-left = cautious, bottom-right = greedy.
- boundary.svg: stacked horizontal bars per model with blue = start MAE and orange = end MAE, sorted by total error ascending. Total labeled at the right.
- token_efficiency.svg: scatter of tokens-per-ad (log x) vs F1. Upper-left = efficient zone; right side = reasoning-heavy models whose extra tokens may or may not buy more F1.

Charts section in the report now lists all 9 charts with one-paragraph captions matching the depth of detail in the body sections. Histogram caption updated to acknowledge anonymity and point to the alignment chart for per-model attribution.

135 tests pass. Data unchanged; rendering only.
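The four-bucket classification can be sketched as below. Two things are assumptions here: the input shape (`{window_key: {model: voted_yes}}`) and the majority rule, which counts every model's vote including the one being classified and treats ties as "no". The report code may define the majority differently.

```python
from typing import Dict, Hashable

BUCKETS = ("with-yes", "with-no", "broke-yes", "broke-no")


def alignment_buckets(
    votes: Dict[Hashable, Dict[str, bool]],
) -> Dict[str, Dict[str, int]]:
    """Count, per model, how often it sided with or broke from the majority."""
    counts: Dict[str, Dict[str, int]] = {}
    for per_model in votes.values():
        # Strict majority of all voters; a tie counts as majority "no".
        majority_yes = sum(per_model.values()) * 2 > len(per_model)
        for model, voted_yes in per_model.items():
            if voted_yes == majority_yes:
                bucket = "with-yes" if voted_yes else "with-no"
            else:
                bucket = "broke-yes" if voted_yes else "broke-no"
            counts.setdefault(model, dict.fromkeys(BUCKETS, 0))[bucket] += 1
    return counts
```

A high broke-yes count points at likely false positives, and a high broke-no count at likely missed ads.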
…arts
Adds the last three sections that had model data but no chart.
New charts:
- trial_variance.svg: horizontal bars of mean F1 stdev per model, with
green/yellow/red coloring at the 0.02 and 0.05 thresholds. Visually
surfaces models whose single-trial F1 numbers can't be trusted.
- detection_by_length.svg: heatmap of model (row) vs ad-length bucket
(column, short/medium/long), cell = detection rate plus sample size.
Rows sorted by overall detection rate, descending.
- detection_by_position.svg: same shape, columns are pre-roll /
mid-roll / post-roll. Surfaces the common pattern where post-roll is
systematically harder.
- parser_stress.svg: heatmap of model vs extraction method (call
counts). Columns ordered by total usage; rows ordered by share of
json_array_direct (the clean path) so easier-to-consume models are at
the top.
Total charts now 13. Each chart has a one-paragraph caption in the
Charts section explaining what it shows and how to read it.
Also: ran a programmatic AI-tic scan across all report.py prose
literals -- 0 hits across the Wikipedia high-frequency vocabulary. The
one em-dash flag in the scan was a false positive: 6 em-dashes spread
across 9 paragraphs in a single concatenated chart-captions string, not
6 in one paragraph (one per chart, which is the intended pattern).
135 tests pass.
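The green/yellow/red coloring of trial_variance.svg can be sketched as follows. The 0.02 and 0.05 thresholds come from the commit message; everything else is an assumption for illustration: that "mean F1 stdev" averages a per-episode sample standard deviation across trials, that the lower bound of each band is inclusive, and the function names themselves.

```python
from statistics import mean, stdev

GREEN_MAX = 0.02   # thresholds quoted in the commit message
YELLOW_MAX = 0.05

def mean_f1_stdev(f1_by_episode):
    """Average the per-episode sample stdev of F1 across trials."""
    return mean(stdev(trials) for trials in f1_by_episode.values())

def variance_color(value):
    """Pick the bar color for one model (band edges are an assumption)."""
    if value < GREEN_MAX:
        return "green"
    if value < YELLOW_MAX:
        return "yellow"
    return "red"

# Hypothetical per-episode F1 scores, one entry per trial.
stable = {"ep1": [0.80, 0.80, 0.80], "ep2": [0.70, 0.71, 0.69]}
noisy = {"ep1": [0.5, 0.7, 0.9]}
stable_color = variance_color(mean_f1_stdev(stable))
noisy_color = variance_color(mean_f1_stdev(noisy))
```

A model landing in the red band is one whose single-trial F1 should not be quoted without the spread next to it.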
Two related polish changes.
1. Em-dash and tone pass. After the user noted that em-dashes were
still visible throughout the report, audited every prose string in
report.py and applied 23+ targeted replacements:
- Converted `**term** -- description` patterns to `**term**: description`
- Converted parenthetical em-dashes ("X -- Y -- and Z") to commas
or sentence breaks
- Converted aside em-dashes ("X. Y -- bad for production") to
proper sentence boundaries ("X. Y. Bad for production.")
- Title-cased heading "Parser Stress Test" lowered to "Parser
stress test" (sentence case, per humanizer rule 15)
Final rendered-report distribution: 88 paragraphs with 0 em-dashes,
1 with a single em-dash, 0 with 2+. Down from 17 + 2 = 19
paragraphs with em-dashes before the pass.
Other humanizer-rule audits all clean: 0 promotional, 0 copula
avoidance, 0 filler, 0 hedging, 0 chatbot artifacts, 0 superficial
-ing analyses, 0 formulaic "despite/conclusion", 0 false ranges.
2. Source-data links. Every chart in the Charts section now has a
"Source data: [linked anchor]" line directly below the image,
pointing at the corresponding tables in the body of the report.
Readers who want to dig past the visual can click straight to the
numbers. 13 source-data links wired up, one per chart; some chart
captions point to multiple tables, e.g. Pareto -> Best Accuracy /
Best Value / Best Free-Tier.
135 tests pass. Doc/report-only change.
Production parser: add strategy 4 (`_salvage_truncated_single_ad`)
to recover usable ad dicts from responses that ran out of token
budget mid-output (the existing four strategies all bail on the
structurally invalid JSON). Returns a dict only when both `start`
and `end` were recovered; otherwise the row is dropped rather than
fabricated. Tagged `json_object_single_ad_truncated`.
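The "recover both bounds or drop the row" rule can be illustrated with a short sketch. This is not the production `_salvage_truncated_single_ad`; it assumes `start` and `end` are numeric seconds and uses a hypothetical regex-based recovery. How the tag is attached to the result is not shown in this message, so the sketch only keeps it as a constant.

```python
import json
import re

TAG = "json_object_single_ad_truncated"  # tag name from the commit message

def _find_number(text, field):
    # Require a trailing comma or brace so a number cut off mid-digits is not trusted.
    m = re.search(r'"%s"\s*:\s*(-?\d+(?:\.\d+)?)\s*[,}]' % re.escape(field), text)
    return float(m.group(1)) if m else None

def salvage_truncated_single_ad(text):
    """Return {'start', 'end'} from truncated JSON, or None if either is missing."""
    try:
        json.loads(text)
        return None  # valid JSON: leave it to the regular strategies
    except json.JSONDecodeError:
        pass
    start, end = _find_number(text, "start"), _find_number(text, "end")
    if start is None or end is None:
        return None  # drop the row rather than fabricate a boundary
    return {"start": start, "end": end}

# Hypothetical responses cut off by the token budget.
recovered = salvage_truncated_single_ad(
    '{"start": 12.5, "end": 47.0, "reason": "Sponsor read for Exa'
)
dropped = salvage_truncated_single_ad('{"start": 12.5, "end": 4')
```

The second input is dropped because the `end` value has no terminating comma or brace, so there is no way to tell whether `4` is the full number.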
Benchmark: max_tokens is now config-driven via `[run].max_tokens`,
defaulting to production's `AD_DETECTION_MAX_TOKENS` so the
benchmark matches the live app's budget. `schema_audit` now imports
production's `STRUCTURAL_FIELDS` and `SPONSOR_PRIORITY_FIELDS` to
stop flagging fields the live parser already accepts.
Benchmark code cleanup: drop local duplicates of `parse_timestamp`
and `format_time` in favor of `utils.time`; use `dataclasses.asdict`
for violations; convert derived ModelStats fields to `@property`;
precompute per-model indexes in `_aggregate` and call counts in
`_render_failures`; cache `avg_f1` / `mean_f1_stdev` on ModelStats;
fix the brittle HTTP-5xx classifier (`\b5\d{2}\b`); extract
length/position bucket helpers.
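The effect of the word boundaries in the quoted pattern `\b5\d{2}\b` can be shown with a small sketch. The commit message quotes only the pattern, so the helper name and the sample messages are hypothetical.

```python
import re

_HTTP_5XX = re.compile(r"\b5\d{2}\b")

def looks_like_http_5xx(message):
    """True when the message contains a standalone three-digit 5xx number."""
    return _HTTP_5XX.search(message) is not None

# Standalone status codes match.
hit_503 = looks_like_http_5xx("Error code: 503 - Service Unavailable")
hit_502 = looks_like_http_5xx("HTTP/1.1 502 Bad Gateway")
# Digits embedded in longer numbers or tokens do not.
miss_tokens = looks_like_http_5xx("processed 1500 tokens")
miss_latency = looks_like_http_5xx("took 5000ms")
```

A bare standalone number such as a "500" in unrelated text would still match, so this kind of check is a heuristic for retry classification rather than a parser for status codes.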
Corpus: add two new verified episodes (`it-s-a-thing`,
`the-brilliant-idiots`).
Rolls 11 of 12 open Dependabot PRs into 2.1.7. PR #208 (ubuntu 24.04 ->
26.04) held: the GPU image is pinned to ubuntu 24.04 through
nvidia/cuda:12.9.1-runtime-ubuntu24.04 and nvidia/cuda has no ubuntu
26.04 variant, so taking it would split the GPU and CPU base images
across a major OS version.
pip (5):
- bump anthropic 0.97.0 -> 0.100.0
- bump cryptography 47.0.0 -> 48.0.0
- bump gunicorn 25.3.0 -> 26.0.0
- bump huggingface-hub 1.13.0 -> 1.14.0
- bump openai 2.33.0 -> 2.36.0
npm (5):
- bump @tanstack/react-query 5.100.5 -> 5.100.9
- bump react 19.2.5 -> 19.2.6
- bump react-router-dom 6.30.3 -> 7.15.0 (major; tsc clean, no API
surface change required)
- bump tailwind-merge 3.5.0 -> 3.6.0
- bump vite-plugin-pwa 1.2.0 -> 1.3.0
docker (1):
- bump node 24-alpine -> 26-alpine (both GPU and CPU Dockerfiles)
README: refresh the Cloud LLMs recommended-model table with the latest
32-model / 7-episode / 14,400-call benchmark numbers.
Verification:
- pytest tests/ -q -> 941 passed, 4 skipped
- benchmarks/llm pytest -> 135 passed
- frontend: npx tsc --noEmit clean, npm run build successful
- docker build --platform=linux/amd64 (GPU) -> success
- docker build --platform=linux/amd64 -f Dockerfile.cpu (CPU) -> success
This was referenced May 11, 2026
Summary
Adds a self-contained benchmark tool under `benchmarks/llm/` that compares LLMs on ad-detection accuracy, cost, latency, and JSON compliance using real MinusPod transcripts. Operationalizes the spec in `tmp/BENCHMARK_PLAN.md`.
- …`/original-segments` endpoint, hand-verify ground truth in `truth.txt`, then run an async multi-provider sweep that records per-call results to `calls.jsonl`.
- …`results/report.md`.
- …`../../src/` at runtime via a path bootstrap. No runtime image impact: `benchmarks/` is already in `.dockerignore` from v2.0.25.
Why
With v2.0.25's module-level lifts (`extract_json_ads_array`, `parse_ads_from_response`, `format_window_prompt`, `get_static_system_prompt`) and v2.0.26's segments-JSON endpoints already on `main`, the benchmark wires them into a deterministic, checkpointed sweep that can compare cost and accuracy across providers without running production.
CLI surface
Scope is sourced from `benchmark.toml` (`[[models]]`) and `data/corpus/` at runtime; no per-run scope filters.
Determinism + checkpointing
Every `(model, episode, trial, window_index)` tuple computes a `prompt_hash` over (system_prompt, user_prompt, model, temperature). The runner skips tuples already in `calls.jsonl`. Adding a model or episode and re-running fills only the gaps. User prompts are cached per `(episode, window)`, which cuts prompt build calls by 98% on full sweeps.
Concurrency
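The two-semaphore fan-out this section describes can be sketched as below. Only `asyncio.gather`, the two limit names, and their defaults come from the PR text; `run_sweep`, the call representation, and the demo providers are hypothetical, and the sleep stands in for a real provider request.

```python
import asyncio

async def run_sweep(calls, max_concurrent_calls=8, max_concurrent_per_provider=4):
    """Run (provider, coroutine_factory) pairs under global and per-provider caps."""
    global_sem = asyncio.Semaphore(max_concurrent_calls)
    provider_sems = {}

    async def guarded(provider, make_coro):
        if provider not in provider_sems:
            provider_sems[provider] = asyncio.Semaphore(max_concurrent_per_provider)
        # Take the provider slot first so a provider-blocked task holds no global slot.
        async with provider_sems[provider], global_sem:
            return await make_coro()

    return await asyncio.gather(*(guarded(p, fn) for p, fn in calls))

async def demo():
    active = {"anthropic": 0, "openai": 0}
    peak = dict(active)

    def make_call(provider):
        async def call():
            active[provider] += 1
            peak[provider] = max(peak[provider], active[provider])
            await asyncio.sleep(0.01)  # stand-in for a provider request
            active[provider] -= 1
            return provider
        return call

    calls = [(p, make_call(p)) for p in ["anthropic", "openai"] * 10]
    return await run_sweep(calls), peak

results, peak = asyncio.run(demo())
```

`asyncio.gather` returns results in submission order, which makes it straightforward to pair each result with its work item before appending to `calls.jsonl`.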
Single-process async fan-out via `asyncio.gather` against `AsyncAnthropic` and `AsyncOpenAI`. Two semaphores (`max_concurrent_calls`, global, default 8; `max_concurrent_per_provider`, default 4) keep the sweep within provider rate limits. Two simultaneous `benchmark run` invocations against the same `calls.jsonl` are unsupported.
Test plan
- `cd benchmarks/llm && pytest tests/`
Out of scope for this PR
- …`data/corpus/` and is curated via `benchmark capture` / `benchmark verify`)
Update (2026-05-10 → 2026-05-11): rolled into v2.1.7
This PR ships v2.1.7, which now bundles:
- …`src/utils/llm_response.py`: `_salvage_truncated_single_ad` recovers a usable ad dict when a response runs out of token budget mid-output (the four upstream strategies all bail on the structurally invalid JSON). Returns a dict only when both `start` and `end` were recovered. Tagged `json_object_single_ad_truncated`.
- …`utils.time`, `dataclasses.asdict`, derived `@property` on `ModelStats`, indexed `_aggregate`, cached `avg_f1`, hardened HTTP-5xx error classifier, plus magic-number constants for calibration bins and variance thresholds.
- …`it-s-a-thing`, `the-brilliant-idiots`).
- …`nvidia/cuda:12.9.1-runtime-ubuntu24.04`, and nvidia/cuda has no ubuntu 26.04 variant across CUDA 12.4 to 13.2.1. Taking it now would split the GPU and CPU base images across a major OS version, breaking the CLAUDE.md "both bases share ubuntu:24.04" invariant. Revisit when nvidia/cuda ships a 26.04 base.
Verification after the dep bumps
- `pytest tests/ -q --ignore=tests/unit/test_favicon_routes.py` → 941 passed, 4 skipped (5 favicon failures are pre-existing on `main`, unrelated to this PR)
- `cd benchmarks/llm && uv run pytest tests/ -q` → 135 passed
- `cd frontend && npx tsc --noEmit` → clean
- `cd frontend && npm run build` → 2,419 modules transformed, no errors
- `docker build --platform=linux/amd64 -t minuspod-test:gpu .` → success
- `docker build --platform=linux/amd64 -f Dockerfile.cpu -t minuspod-test:cpu .` → success