feat(benchmark): offline LLM ad-detection tool by ttlequals0 · Pull Request #200 · ttlequals0/MinusPod

ttlequals0 · 2026-05-07T05:26:54Z

Summary

Adds a self-contained benchmark tool under benchmarks/llm/ that compares LLMs on ad-detection accuracy, cost, latency, and JSON compliance using real MinusPod transcripts. Operationalizes the spec in tmp/BENCHMARK_PLAN.md.

Capture episodes from MinusPod's /original-segments endpoint, hand-verify ground truth in truth.txt, then run an async multi-provider sweep that records per-call results to calls.jsonl.
Generates a Markdown report at results/report.md with TL;DR rankings, per-model and per-episode breakdowns, a parser-stress catalog, and an SVG Pareto chart.
Imports MinusPod modules from ../../src/ at runtime via a path bootstrap. No runtime image impact: benchmarks/ is already in .dockerignore from v2.0.25.

Why

With v2.0.25's module-level lifts (extract_json_ads_array, parse_ads_from_response, format_window_prompt, get_static_system_prompt) and v2.0.26's segments-JSON endpoints already on main, the benchmark wires them into a deterministic, checkpointed sweep that can compare cost and accuracy across providers without running production.

CLI surface

benchmark capture --episode-url <url>
benchmark verify <ep-id>
benchmark regenerate-windows <ep-id> --force
benchmark list-episodes
benchmark validate
benchmark refresh-pricing
benchmark run [--retry-errors] [--dry-run] [--force] [--no-report-on-failure]
benchmark report
benchmark archive

Scope is sourced from benchmark.toml ([[models]]) and data/corpus/ at runtime; no per-run scope filters.

Determinism + checkpointing

Every (model, episode, trial, window_index) tuple computes a prompt_hash over (system_prompt, user_prompt, model, temperature). The runner skips tuples already in calls.jsonl, so adding a model or episode and re-running fills only the gaps. User prompts are cached per (episode, window), which cuts prompt build calls by about 98% on full sweeps.
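The checkpointing idea above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function names, the JSON-list serialization, the SHA-256 digest, and the shape of the work items are all assumptions.

```python
import hashlib
import json


def prompt_hash(system_prompt: str, user_prompt: str, model: str, temperature: float) -> str:
    """Stable digest over the four inputs named in the PR description."""
    # Serializing as a JSON list keeps field boundaries unambiguous (assumed encoding).
    payload = json.dumps([system_prompt, user_prompt, model, temperature], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def work_key(item: dict) -> tuple:
    """Identity of one unit of work; the exact key layout is hypothetical."""
    return (item["model"], item["episode"], item["trial"], item["window_index"], item["prompt_hash"])


def pending(work_list: list, completed: set) -> list:
    """Return only the work items whose key is not already recorded."""
    return [item for item in work_list if work_key(item) not in completed]
```

Because the hash covers the prompt text, editing the system prompt changes every key and forces a fresh run, while adding one model leaves every existing key untouched.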

Concurrency

Single-process async fan-out via asyncio.gather against AsyncAnthropic and AsyncOpenAI. Two semaphores (max_concurrent_calls global default 8, max_concurrent_per_provider default 4) keep within provider rate limits. Two simultaneous benchmark run invocations against the same calls.jsonl are unsupported.
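A rough sketch of the two-semaphore fan-out, assuming each call is supplied as a (provider, coroutine function) pair. The helper name `run_all` and the acquisition order are illustrative choices, not taken from the runner.

```python
import asyncio


async def run_all(calls, max_concurrent_calls=8, max_concurrent_per_provider=4):
    """calls: list of (provider_name, zero-argument coroutine function)."""
    global_sem = asyncio.Semaphore(max_concurrent_calls)
    provider_sems = {}

    async def guarded(provider, fn):
        sem = provider_sems.setdefault(provider, asyncio.Semaphore(max_concurrent_per_provider))
        # Take the per-provider slot first so tasks queued behind a busy
        # provider do not sit on a global slot while they wait.
        async with sem:
            async with global_sem:
                return await fn()

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(provider, fn) for provider, fn in calls))
```

With the defaults, at most 8 calls are in flight overall and at most 4 against any single provider.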

Test plan

127 unit tests pass: cd benchmarks/llm && pytest tests/
Main MinusPod test suite still passes (881 tests)
Frontend typecheck unaffected
/simplify run, applied 6 fixes (compliance scoring matches actual production extraction methods, prompt cache, indexed pricing lookup, single-pass scan_calls, public load_* helpers, dead-import cleanup, shared test fixtures)
Async LLM dispatch exercised via a single real call per provider once the corpus has at least one episode
First full sweep runs locally before report archive

Out of scope for this PR

Building the 6-episode corpus (runtime data; lives in data/corpus/ and is curated via benchmark capture/benchmark verify)
Running the first benchmark sweep

Update (2026-05-10 → 2026-05-11): rolled into v2.1.7

This PR ships v2.1.7, which now bundles:

Offline LLM ad-detection benchmark (the original PR scope, above).
Production parser strategy 4 in src/utils/llm_response.py: _salvage_truncated_single_ad recovers a usable ad dict when a response runs out of token budget mid-output (the four upstream strategies all bail on the structurally invalid JSON). Returns a dict only when both start and end were recovered. Tagged json_object_single_ad_truncated.
Benchmark code-quality cleanup from a 3-agent /simplify pass: DRY against utils.time, dataclasses.asdict, derived @property on ModelStats, indexed _aggregate, cached avg_f1, hardened HTTP-5xx error classifier, plus magic-number constants for calibration bins and variance thresholds.
Two new corpus episodes (it-s-a-thing, the-brilliant-idiots).
Dependency bumps (11 of 12 open Dependabot PRs):
- pip: anthropic 0.97 → 0.100, cryptography 47 → 48, gunicorn 25.3 → 26, huggingface-hub 1.13 → 1.14, openai 2.33 → 2.36
- npm: @tanstack/react-query 5.100.5 → 5.100.9, react 19.2.5 → 19.2.6, react-router-dom 6 → 7 (major; tsc clean), tailwind-merge 3.5 → 3.6, vite-plugin-pwa 1.2 → 1.3
- docker: node 24-alpine → 26-alpine (both Dockerfiles)
- Held: Dependabot PR #208, docker(deps): bump ubuntu from 24.04 to 26.04. The GPU image is pinned to ubuntu 24.04 through nvidia/cuda:12.9.1-runtime-ubuntu24.04, and nvidia/cuda has no ubuntu 26.04 variant across CUDA 12.4 to 13.2.1. Taking it now would split the GPU and CPU base images across a major OS version, breaking the CLAUDE.md "both bases share ubuntu:24.04" invariant. Revisit when nvidia/cuda ships a 26.04 base.
README: refreshed the Cloud LLMs recommended-model table with the latest 32-model / 7-episode / 14,400-call benchmark numbers.
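For readers unfamiliar with the truncated-output salvage described for parser strategy 4 above, here is a minimal sketch of the general idea. It is not the code in src/utils/llm_response.py: the regex approach, the function name, and the assumption that start and end are plain numbers are all illustrative.

```python
import re
from typing import Optional

# A number only counts as recovered when a delimiter follows it; a number
# sitting at the very end of the text may itself have been cut off.
_NUMBER = r"(-?\d+(?:\.\d+)?)(?=\s*[,}])"


def salvage_truncated_single_ad(text: str) -> Optional[dict]:
    """Pull start/end out of a JSON object that stops mid-output."""
    start = re.search(r'"start"\s*:\s*' + _NUMBER, text)
    end = re.search(r'"end"\s*:\s*' + _NUMBER, text)
    if not (start and end):
        # Mirrors the rule stated above: no dict unless both fields were recovered.
        return None
    return {"start": float(start.group(1)), "end": float(end.group(1))}
```

The delimiter check is a conservative choice: it rejects a response that ends inside the `end` value rather than guessing at the missing digits.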

Verification after the dep bumps

pytest tests/ -q --ignore=tests/unit/test_favicon_routes.py → 941 passed, 4 skipped (5 favicon failures are pre-existing on main, unrelated to this PR)
cd benchmarks/llm && uv run pytest tests/ -q → 135 passed
cd frontend && npx tsc --noEmit → clean
cd frontend && npm run build → 2,419 modules transformed, no errors
docker build --platform=linux/amd64 -t minuspod-test:gpu . → success
docker build --platform=linux/amd64 -f Dockerfile.cpu -t minuspod-test:cpu . → success

ttlequals0 · 2026-05-07T05:54:55Z

Code review

No issues found at HEAD. Findings from the review pass were fixed in commit 0332b20 (post-LLM I/O try/except + atomic write_response, OpenAI fallback transient-error classification, deprecated-model hash skip, microsecond pricing snapshot filenames, inline import hoisting, ASCII em-dash).

🤖 Generated with Claude Code

Adds benchmarks/llm/ uv project with:
- pyproject.toml, .env.example, benchmark.toml.example, .gitignore
- src/benchmark/ package with sys.path bootstrap to import MinusPod modules
- config.py: TOML loader with provider/model/run validation (11 tests)
- auth.py: cookie-cached MinusPod session with 23h TTL + login probe
- truth_parser.py: ground-truth file parser with structural+logical+cross-reference validation (26 tests)
- corpus.py: corpus loader, segments hashing, windows.json round-trip using MinusPod's create_windows (9 tests)
- capture.py: episode-URL parsing, /original-segments fetch, truth.txt pre-population from production ad markers (11 tests)
- pricing.py: thin wrapper over MinusPod's pricing_fetcher (LiteLLM) with snapshot read/write (6 tests)
- parsing.py: re-exports of lifted ad-detector functions
- storage.py: atomic JSONL append/fsync, prompt+response artifact writers, dedup index, sanitized error recording (13 tests)
- metrics.py: IoU greedy match, P/R/F1, boundary MAE, JSON compliance scoring, schema violation audit, trial stdev (26 tests)
Runner, report generator, CLI, and READMEs land in follow-up commits. 102 module-level tests pass.
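As a rough illustration of what "IoU greedy match, P/R/F1" means for ad spans, consider the sketch below. Spans are assumed to be (start, end) pairs in seconds, and the 0.5 IoU threshold is a placeholder; the actual threshold and data shapes in metrics.py are not stated here.

```python
def iou(a, b):
    """Intersection over union of two (start, end) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(predicted, truth, threshold=0.5):
    """Pair spans best-IoU-first; each span is used at most once."""
    pairs = sorted(
        ((iou(p, t), i, j) for i, p in enumerate(predicted) for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth, tp = set(), set(), 0
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_truth:
            continue
        used_pred.add(i)
        used_truth.add(j)
        tp += 1
    fp, fn = len(predicted) - tp, len(truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}
```

Greedy matching is simple and deterministic, though unlike an optimal assignment it can occasionally leave a matchable pair unpaired.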

Adds benchmarks/llm/ self-contained uv project that compares LLMs on ad-detection accuracy, cost, latency, and JSON compliance using real MinusPod transcripts.
Modules:
- config: TOML loader + provider/model/run validation
- auth: cookie-cached MinusPod session, 23h TTL, 429 reported (no auto-retry)
- truth_parser: ground-truth parser, structural+logical+cross-ref validation
- corpus: episode loader, segments-hash drift detection, public load_metadata / load_segments helpers
- capture: episode-URL parsing, /original-segments fetch, truth.txt pre-population from production ad markers
- pricing: thin wrapper over MinusPod's pricing_fetcher (LiteLLM-backed), snapshot read/write with O(1) indexed lookup
- parsing: re-exports of lifted ad_detector functions
- storage: atomic JSONL append/fsync, single-pass scan_calls returning (completed, errored), prompt+response artifact writers, sanitized errors
- metrics: IoU greedy match, P/R/F1, boundary MAE, schema-violation audit, prefix-aware compliance scoring matching production extraction methods
- llm: AsyncAnthropic + AsyncOpenAI dispatch with native json_object + prompt-injection fallback, retry-with-backoff classification
- runner: build (model, episode, trial, window) work list, dedup against calls.jsonl, asyncio.gather with global+per-provider semaphores; user prompt cached per (episode, window) -- 98% fewer build calls on full sweeps
- report: aggregate + render Markdown report (TL;DR, headline, per-model, per-episode, parser stress, methodology, run metadata) + SVG Pareto chart
- cli: typer subcommands -- capture, verify, regenerate-windows, list-episodes, validate, refresh-pricing, run, report, archive
Scope is sourced from benchmark.toml ([[models]]) and data/corpus/ at runtime; no --model / --episode / --trials CLI filters. 127 unit tests pass. Async LLM dispatch is exercised against real provider APIs at run time.
.gitignore negates data/ for benchmarks/llm/data/ so corpus + pricing snapshots commit while leaving repo-root data/ ignored. No runtime image impact: benchmarks/ already in .dockerignore (v2.0.25).

- Wrap post-LLM I/O (parse, schema audit, write_response, write_prompt, append_jsonl) in try/except so a parse or disk error in one execute() doesn't propagate out of asyncio.gather and cancel the other in-flight tasks. Errored records are still appended; the JSONL write itself is the only operation that can lose a row, and that path now logs defensively rather than re-raising.
- write_response now mirrors write_prompt's tmp+os.replace atomic write so a crash mid-write can't leave a truncated file referenced by an fsynced calls.jsonl row.
- llm.call_with_retry's response_format fallback now classifies RateLimitError/APITimeoutError/APIConnectionError as transient on the retry path too (was only classifying APIStatusError).
- precompute_prompt_hashes skips deprecated models so the hash dict matches build_work_list's filter.
- pricing snapshot filenames include microseconds; two refresh-pricing calls in the same second no longer overwrite each other.
- Hoisted inline imports flagged by review (`import re` in storage, `from rapidfuzz import fuzz` in truth_parser, `from .storage import read_jsonl` in runner, sibling-module imports in cli._preview).
- Replaced the em dash in auth.py's 429 error message with ASCII `--`.
127 unit tests pass.
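The tmp+os.replace pattern mentioned for write_response looks roughly like this. It is a generic sketch with a hypothetical helper name, not the storage module's code.

```python
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, data: str) -> None:
    """Write data so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves at most an orphaned temp file, never a half-written target.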

Moves the deferred imports of create_windows, normalize_model_key, fetch_litellm_pricing, and corpus.load_metadata/load_segments to module level. The path bootstrap in benchmark/__init__.py runs before any submodule loads, so MinusPod's src/ is already on sys.path by the time these imports execute. The remaining inline imports in llm.py (anthropic + openai SDKs) and report.py (matplotlib) stay deferred on purpose: the SDKs only load when their provider is actually used, and matplotlib (~50MB) only loads when render is called.

The benchmark capture flow calls GET /api/v1/feeds/{slug}/episodes/{id}/original-segments, an endpoint added in 2.0.26. Pointing the tool at an older server returns 404 with no obvious user-facing signal. README now states the minimum version explicitly alongside the existing path-bootstrap note.

Two related fixes:
- Add python-dotenv as a runtime dep and load benchmarks/llm/.env at CLI import time. Path is resolved relative to the package (parents[2]) so the load works regardless of where `benchmark` is invoked from. Shell-exported variables still win over .env values (override=False).
- README: switch the install instruction from `uv sync` (which only satisfies the benchmark's own deps) to `pip install -e .` into the MinusPod parent venv. The benchmark imports modules from MinusPod's src/ at runtime; those modules transitively need MinusPod's runtime deps (jinja2, flask, etc.) which are only present in the parent venv. Running from a fresh benchmark-only venv fails at first import of benchmark.cli with ModuleNotFoundError on jinja2.

Previous version sent users to install editable into MinusPod's parent venv, which works but isn't the natural uv-project flow. The reason the benchmark needs MinusPod's runtime deps at all is that ad_detector does a module-level `from webhook_service import fire_auth_failure_event` even though webhook_service is only used inside the LLM error path; the benchmark only wants create_windows but Python loads the whole module's imports.
Two equivalent paths now documented:
- uv-native: `uv sync && uv pip install -r ../../requirements.txt`, then `uv run benchmark <cmd>`. Layered install needed because the benchmark's own pyproject.toml deliberately does not duplicate MinusPod's full runtime stack.
- parent-venv: install editable into ../../.venv and call ../../.venv/bin/benchmark directly.
Both work. uv run is now the primary path.

…umers
src/ad_detector.py and src/pricing_fetcher.py both did module-level imports of dependencies they only need on a single code path:
- ad_detector.fire_auth_failure_event is only called inside the LLM dispatch loop when is_auth_error(e) is true. webhook_service transitively imports jinja2.
- pricing_fetcher.BeautifulSoup is only used inside fetch_pricepertoken_pricing. fetch_litellm_pricing does not need it.
Both imports are now deferred to the function bodies that actually use them. No behavioural change: the webhook fires under identical conditions and the HTML-scraping path parses identically.
The benchmark in benchmarks/llm/ imports create_windows from ad_detector and fetch_litellm_pricing from pricing_fetcher via a sys.path bootstrap. Before this commit those imports forced jinja2 + beautifulsoup4 into the benchmark's venv, which uv sync had no reason to install (they're not in benchmark's pyproject.toml), so `uv run benchmark <cmd>` failed at first import. After this commit `uv run benchmark validate` and `refresh-pricing` work cleanly on a fresh `uv sync`.
Also adds requests>=2.31 to benchmarks/llm/pyproject.toml since pricing_fetcher uses it directly, drops the now-unnecessary `uv pip install -r ../../requirements.txt` step from the benchmark README, and seeds data/pricing_snapshots/ with the first snapshot fetched against the live LiteLLM table (2233 models).
128 ad_detector + pricing tests pass; 127 benchmark tests pass.
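The deferred-import pattern this commit describes can be illustrated with a small stand-in. The example uses the standard library's html.parser in place of BeautifulSoup so it runs anywhere; both function names are hypothetical.

```python
def normalize_prices(raw: dict) -> dict:
    """Path that needs no HTML tooling, so importing this module stays cheap."""
    return {name.lower(): float(price) for name, price in raw.items()}


def extract_table_cells(html: str) -> list:
    """Only this code path pays for the parser import."""
    # Deferred import: callers of normalize_prices never load the dependency.
    from html.parser import HTMLParser

    class _Cells(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_td = False
            self.cells = []

        def handle_starttag(self, tag, attrs):
            if tag == "td":
                self.in_td = True

        def handle_endtag(self, tag):
            if tag == "td":
                self.in_td = False

        def handle_data(self, data):
            if self.in_td and data.strip():
                self.cells.append(data.strip())

    parser = _Cells()
    parser.feed(html)
    return parser.cells
```

The trade-off is that a missing dependency surfaces at call time rather than at import time, which is acceptable when the path is rarely taken.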

The capture template pre-populates truth.txt from MinusPod's accepted production ad markers. On real podcasts the production detector emits some false positives that have to be deleted by hand on every capture:
1. Markers placed around silence or music where Whisper hallucinated a short phrase like "Thank you for watching." (no actual ad audio).
2. Markers placed around main content the production prompt happened to classify as ad-like (the host discussing security topics, etc).
Adds _classify_marker(marker, segments) -> (accepted, reason). Markers that fail any of these are routed to the existing commented "rejected" section with an `auto-rejected: <reason>` annotation, so the human reviewer can still uncomment them if the heuristic is wrong:
- text in range matches a known Whisper hallucination phrase only ("Thank you for watching.", etc).
- text density < 3 chars/sec (well below normal speech ~12-16 chars/sec).
- text contains none of the strong ad signals: "brought to you by", "sponsor", ".com slash" / ".com/", "promo code", "use code", "discount code", "free trial", "sign up at", "get started at", "listen at", "visit X".
The "rejected" header is reused; production-rejected markers now carry "# --- (rejected by production)" and auto-rejected markers carry "# --- (auto-rejected: <reason>)" so the source is unambiguous.
Validated against the security-now-audio episode just captured: 11 raw markers from production, 7 accepted (all confirmed real sponsor reads), 4 auto-rejected (2 Whisper hallucinations, 2 main-content false positives).
134 benchmark tests pass.
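The three rejection rules above could be sketched as follows. This is an approximation under stated assumptions: markers and segments are taken to be dicts with start/end in seconds, the hallucination list contains only the one phrase quoted above, "visit X" is approximated with a regex, and every reason string except "Whisper hallucination only" is made up for illustration.

```python
import re

HALLUCINATION_PHRASES = {"thank you for watching."}
AD_SIGNALS = (
    "brought to you by", "sponsor", ".com slash", ".com/", "promo code", "use code",
    "discount code", "free trial", "sign up at", "get started at", "listen at",
)
MIN_CHARS_PER_SEC = 3.0


def classify_marker(marker: dict, segments: list) -> tuple:
    """Return (accepted, reason) for one production ad marker."""
    # Collect transcript text from every segment overlapping the marker.
    text = " ".join(
        seg["text"].strip()
        for seg in segments
        if seg["end"] > marker["start"] and seg["start"] < marker["end"]
    ).strip()
    lowered = text.lower()
    if lowered in HALLUCINATION_PHRASES:
        return False, "Whisper hallucination only"
    duration = marker["end"] - marker["start"]
    if duration <= 0 or len(text) / duration < MIN_CHARS_PER_SEC:
        return False, "low text density"
    has_signal = any(sig in lowered for sig in AD_SIGNALS) or re.search(r"\bvisit \S+", lowered)
    if not has_signal:
        return False, "no ad signal"
    return True, "accepted"
```

Rejected markers stay recoverable in the template, so a false rejection costs the reviewer one uncomment rather than a retyped block.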

Previously, the truth.txt template grouped all rejected markers (auto and production) at the bottom of the file, after every accepted block. That broke a workflow: when a reviewer uncommented a rejected marker to keep it, validate_logical's "ads must be ordered by start" constraint would refuse to verify until they manually moved it back into time order somewhere up above.
New layout: build a single chronologically sorted list of all markers (accepted + auto-rejected + production-rejected) and render in that order. Rejected blocks sit in their natural time slot, commented out, with a one-line annotation above the start/end/text:
    ---
    # auto-rejected: Whisper hallucination only
    # start: 16:23.00
    # end: 16:49.10
    # text: ...
    ---
Uncommenting then yields a valid block in the right time slot. No section headers, no shuffling required. Comment lines (including the auto-rejected: annotation) are ignored by the parser.
A short user-facing instruction is added at the top of any file with rejections: "remove the '# ' prefix from start/end/text lines to accept a rejected block."
134 -> 135 benchmark tests pass.

…k consumers
Same root cause as 2.0.28's lazy-import fixes (webhook_service, bs4). ad_detector.get_static_system_prompt does `from database import DEFAULT_SYSTEM_PROMPT`, which runs database/__init__.py end-to-end before the constant is accessible. That __init__ loads database.settings, which module-level imports secrets_crypto, which requires the cryptography package.
The offline benchmark in benchmarks/llm/ calls get_static_system_prompt inside `run --dry-run` and `run`. A fresh `uv sync` venv has no cryptography, so dry-run blew up at import:
    ModuleNotFoundError: No module named 'cryptography'
DEFAULT_SYSTEM_PROMPT is a plain ~8KB multi-line string. It has no database semantics. Move it to src/utils/constants.py (already stdlib-only, already home to SEED_SPONSORS), and re-export from src/database/__init__.py for backward compat. Update the two `from database import DEFAULT_SYSTEM_PROMPT` sites in src/ad_detector.py to import from utils.constants directly so the benchmark's import path no longer touches the database package.
Verified: `uv run benchmark run --dry-run` now reports 4,760 calls would execute against the 5-episode corpus. 162 ad_detector + database + settings unit tests pass.

…Router attribution
Two related dispatch-side fixes observed during the first live run.
1. Claude 4.x family models (opus 4.7, sonnet 4.6, haiku 4.5) reject `temperature` with HTTP 400 "`temperature` is deprecated for this model.". The previous version retried without temperature on 400 but relearned the deprecation on every call, paying ~1020 wasted round-trips on a full 5-episode sweep. Now memoized in a process-level set: first 400 per model adds the model_id, subsequent calls skip `temperature` upfront. Worst-case waste is one round-trip per affected model per process. Older Claude models that still accept `temperature` are unaffected.
2. OpenRouter recommends HTTP-Referer + X-Title headers for app attribution. The OpenAI SDK passes them via default_headers. Detected by base_url containing "openrouter.ai" so it doesn't leak to other openai_compatible providers.
135 benchmark tests pass. The first run captured 335 successful calls across 5 episodes for claude-opus-4-7 with 0 errors and median compliance 1.000, validating both code paths against the live API.
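The memoization in item 1 might look something like the sketch below. The `send` callable and the `TemperatureDeprecated` exception are stand-ins invented for the example; the real dispatch code classifies the provider's HTTP 400 instead.

```python
# Process-level memo of models known to reject `temperature`.
_TEMPERATURE_REJECTED = set()


class TemperatureDeprecated(Exception):
    """Stand-in for the provider's HTTP 400 on `temperature` (hypothetical)."""


def call_with_temperature_memo(send, model_id, prompt, temperature=0.0):
    """Send a request, dropping `temperature` for models that refused it before."""
    kwargs = {"model": model_id, "prompt": prompt}
    if model_id not in _TEMPERATURE_REJECTED:
        kwargs["temperature"] = temperature
    try:
        return send(**kwargs)
    except TemperatureDeprecated:
        if "temperature" not in kwargs:
            raise  # the rejection was not caused by our parameter
        # Remember the rejection so later calls skip the wasted round-trip.
        _TEMPERATURE_REJECTED.add(model_id)
        del kwargs["temperature"]
        return send(**kwargs)
```

The first call to an affected model costs two round-trips; every later call in the same process costs one.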

Same lazy-import pattern as 2.0.28 but for the new src/utils/llm_call.py introduced by PR #205 (2.1.6). The module-level from webhook_service import fire_auth_failure_event at line 14 pulled in jinja2 transitively for any consumer of the new call_llm_for_window helper -- including the benchmark, which imports ad_detector -> utils.llm_call. Move the import into the call site inside the is_auth_error(e) branch where it's actually needed. No behavioural change for production: webhook fires under identical conditions. The benchmark in benchmarks/llm/ can now import the ad_detector module chain without cryptography/jinja2/flask in its venv. 299 unit tests pass. Benchmark dry-run reports 2,592 remaining calls (2,168 already done from the in-flight run, dedup'd via prompt_hash).

Five upgrades based on first-real-run feedback:
1. Metric Key section moved to the top of the report. Replaces the short "How to Read" footnote with a tabular glossary (range, direction, plain-English meaning) for F1, cost, F1/$, latency, JSON compliance, no-ad PASS/FAIL, F1 stdev. Latency entry calls out the OpenRouter routing-layer caveat -- it includes upstream queueing, not just model-side compute, so it should be treated as a load/availability indicator rather than a model-quality signal.
2. Pareto chart redone with numbered points + a sorted legend on the right. Matplotlib's default ax.annotate at the data point produced overlapping text when models clustered (very common at the low-cost end). Numbered markers eliminate that.
3. New JSON compliance bar chart (compliance.svg). Horizontal bars sorted ascending, color-coded green/yellow/red against >=0.95 and >=0.7 thresholds, with a dotted reference line at 0.95.
4. New per-episode F1 heatmap (episodes.svg). Models on Y (sorted by avg F1 desc), ad-bearing episodes on X (no-ad excluded). Reveals model-content interaction that the aggregated F1 hides -- e.g. some models that score well overall struggle on specific episodes.
5. New "Failures and provider issues" section. Categorizes errors into buckets (provider content moderation, deprecated parameter, rate-limited, 5xx, etc.), shows per-model error counts, includes sample error messages, and explains why the failures matter for production model selection. Triggered by the qwen 3.5-plus run where Alibaba's content classifier rejected one transcript window (a real production gotcha you can't see from F1 alone).
3 report tests still pass.

Each model now gets a distinct color from matplotlib's tab20 colormap. The legend sits below the plot as a real matplotlib legend (instead of the right-side text box), so each model's color swatch is rendered adjacent to its name + (F1, cost) summary. Layout: 2-column legend for >6 models, 1-column otherwise. Bottom margin scales with row count (capped at 0.55) so the legend always fits without overlapping the axes. 3 report tests still pass.

Adds the high-signal sections we identified after the first run. Every addition reuses data already in calls.jsonl -- no schema change to the wire format, just new aggregations and renderers.
New stats on ModelEpisodeStats / ModelStats:
- Per-trial precision, recall, TP/FP/FN, boundary start/end MAE
- Per-model output-tokens-total, detected-ads-total, tokens-per-ad
- p90 / p99 / max latency in addition to p50 / p95
- Cost-per-true-positive
New side data captured during aggregation (returned as _Extras):
- Calibration: per-model list of (self-reported confidence, was-TP)
- Cross-model agreement: per (episode, window), models predicting an ad
- Detection-by-bucket: hit rate by ad length and ad position
New report sections:
1. Precision / recall / FP / FN breakdown (F1 hides which side errs)
2. Boundary accuracy -- start/end MAE for matched ads
3. Confidence calibration table (binned, with hit rate per bin)
4. Latency tail -- p50/p90/p95/p99/max
5. Output token efficiency -- tokens per detected ad + cost per TP
6. Trial variance -- determinism check at temp=0
7. Cross-model agreement -- N-of-K vote distribution per window
8. Detection rate by ad length (short/medium/long)
9. Detection rate by ad position (pre/mid/post-roll)
New charts:
- calibration.svg -- reliability diagram (confidence vs hit rate)
- latency_tail.svg -- p50/p90/p99/max bars on log scale per model
Charts section now lists all five charts inline. Existing pareto + compliance + episodes charts unchanged. 135 tests pass.
Report regenerated against the current calls.jsonl surfaces several findings that were invisible before:
- Most models are wildly overconfident (claim 0.95-0.99 confidence, hit rate 20-50%); phi-4 is overconfident at 4% hit rate
- Boundary MAE on start can be 18-24s for some models even when F1 is OK
- Every model misses post-roll ads more than pre/mid-roll
- 22/68 windows show 13+ models agreeing -- candidate ensemble pool

Adds benchmarks/llm/CONTRIBUTING.md so outside contributors can ship PRs expanding the corpus or the model list without reverse-engineering the workflow from the existing README pair. Covers the three accepted PR shapes (new episode, new model, code) with the exact diff each one should produce, what `benchmark verify` already validates so reviewers don't redo it by hand, and the copyright/PII bar for transcript excerpts (a gap not covered in the existing READMEs). Cross-references README.md and data/README.md rather than duplicating their content; total length 117 lines. Doc passes the Wikipedia signs-of-AI-writing scan: zero hits across the ~25 high-frequency AI-vocabulary tokens checked, em-dash density 4/117 all in `term -- definition` bullet form (compliant with humanizer's "one em dash per paragraph" rule). All example URLs use the your-minuspod.example.com placeholder per the public-repo URL hygiene rule in CLAUDE.md. No code, schema, or test impact. Bumps version.py + openapi.yaml to 2.1.7 per CLAUDE.md "always update version.py with changes" rule.

Adds the 5-episode verified corpus and the artifacts from the first 14-model sweep so anyone cloning the repo can audit the results without re-running the benchmark.
Corpus:
- data/corpus/ep-ai-cloud-essentials-e8dc897fbd6b/ (no-ad control, 16 min)
- data/corpus/ep-daily-tech-news-show-c1904b8605f7/ (4 active ads)
- data/corpus/ep-glt1412515089-373d5ba5007b/ (4 active ads)
- data/corpus/ep-security-now-audio-2850b24903b2/ (6 active ads)
- data/corpus/ep-the-tim-dillon-show-f62bd5fa1cfe/ (6 active ads)
Each episode dir contains metadata.toml, segments.json (Whisper-byte-exact), truth.txt (human-verified ad markers), and windows.json.
Run artifacts (first 14-model sweep, 4760 total LLM calls, 1 error):
- results/raw/calls.jsonl (8.0 MB, one row per call)
- results/raw/episode_results.jsonl (492 KB, per-trial aggregates)
- results/raw/prompts/ (1246 files, 15 MB)
- results/raw/responses/ (6196 files, 23 MB)
- results/report.md (32 KB rendered report)
- results/report_assets/*.svg (5 charts, 590 KB)
- data/pricing_snapshots/... (LiteLLM snapshot)
Updates CONTRIBUTING.md "what's in the repo vs your PR" table to match: calls.jsonl + episode_results.jsonl + prompts/ + responses/ are maintainer-committed for audit; contributors don't include them in their own PRs.
No app code changed; no version bump. Total commit ~54 MB across ~7,470 files. Maintainers should expect repo size to grow when new episodes or models are merged and the next sweep is run.

Restructures the existing "Recommended Models" section to add a Cloud LLMs subsection on top of the existing Ollama subsection. The cloud picks come from the offline benchmark in benchmarks/llm/ -- four recommendations covering best-accuracy, best-Anthropic-direct, best free-tier, and cheap-and-fast use cases, with F1 / cost / latency caveats per option. Cross-references the benchmark tool, the rendered report, and the contribution guide so readers who want to validate or expand the list can. Notes the current corpus is 5 episodes and numbers will refine as the corpus grows. Existing local-Ollama tables (Pass 1, Verification, Chapters) are preserved verbatim, just nested under a new Local Ollama Models heading and demoted from #### to ##### accordingly. ToC: adds "Recommended Models" as a nested entry under "Using Ollama (Local or Cloud)" so the section is reachable from the top of the doc. Doc-only change. Doc passes the AI-tic vocabulary scan.

Three updates anchored by the new offline benchmark data.
1. Local Ollama Models intro: prefix the section with a note that the benchmark covers cloud-hosted models only and Ollama runs are not in the sweep, so contributors who want apples-to-apples local numbers would need to extend the benchmark with an Ollama provider. Also trims the unverified "in the same tier as Claude Sonnet" claim on qwen3.5:122b -- the cloud benchmark scored Sonnet 4.6 at F1 0.33, not a clear ceiling, so the comparison no longer fits.
2. "Accuracy vs. Claude" -> "Cloud vs. Local: What Changes". The old title implied Claude was the universal frontier; the cloud benchmark shows Grok-4.1-fast (F1 0.61) clearly beats every Claude variant, with Sonnet at 0.33. Reframes the section as cloud-vs-local rather than open-weights-vs-Claude. Adds a row on network-inserted brand-tagline ads (the cloud benchmark's per-bucket detection rates show even frontier models miss roughly a third of these, useful prior for setting expectations on local). Closes with how to measure the gap on your own content via the bench tool.
3. JSON Reliability Risks: grounds the cloud-vs-local framing in real data. The benchmark report shows JSON compliance from 0.53 (verbose reasoning models) to 1.00 (Mistral / Qwen / Opus); Claude Haiku 4.5 at 0.60 because it markdown-fences every response. So compliance variance is not unique to local; cloud models exhibit it too. Links to the JSON-compliance chart in the report.
Also updates the LLM accuracy notice in the disclaimer area to point at the renamed section and link the benchmark report instead of restating "most testing was done with Claude" (no longer accurate now that the benchmark covers 14 cloud models).
Doc-only change. AI-tic vocabulary scan: 0 hits in added prose.

The 2.1.7 commit (4c481f9) bumped version.py and openapi.yaml for a docs-only change. By policy ("don't bump for non-app changes"), that bump should not have happened. Reverts both files to 2.1.6 and removes the 2.1.7 entry from CHANGELOG since no Docker image will ship under that tag. The src/ refactors that landed since 2.1.6 (lazy webhook_service imports in ad_detector.py and utils/llm_call.py, lazy bs4 import in pricing_fetcher.py, DEFAULT_SYSTEM_PROMPT hoist to utils/constants.py with a re-export in database/__init__.py) are behavior-preserving -- they enabled the offline benchmark in benchmarks/llm/ to import these modules without dragging in jinja2/flask/cryptography/bs4. Runtime is byte-identical to main's 2.1.6, so no Docker rebuild is needed. Net effect on a deployed instance: - Version string still 2.1.6 (matches the live image). - No behavior change. - New CONTRIBUTING.md, expanded benchmark report, corpus data, and README cloud-LLM picks ride along on this branch without bumping the version.

OpenRouter sometimes returns HTTP 200 with `choices=None` in the body when the upstream provider hits its own internal read-timeout. Observed on cohere/command-r-plus-08-2024 at the 120s mark for windows that take long to reason about. The OpenAI SDK doesn't raise -- it returns a ChatCompletion object where `msg.choices` is None instead of [], and our code did `msg.choices[0]` which crashes with "'NoneType' object is not subscriptable". Fix: defensively coerce None to [] and raise LLMTransientError when choices is empty (or the first choice has no message). The runner's existing retry-on-LLMTransientError path then retries with backoff instead of crashing the call. Impact on the in-flight run: minimal. The 4 affected Cohere calls (out of 340) are already excluded from F1 / compliance / cost aggregates by `if rec.get('error'): continue`, so Cohere R+ still has 336 valid samples. The fix prevents recurrence and lets `benchmark run --retry-errors` retry the 4 timed-out windows cleanly after the run completes. No version bump; benchmark-only code, doesn't ship in the Docker image.
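A minimal sketch of the defensive coercion described above. `LLMTransientError` is the name used in this commit; the helper name and the use of duck-typed objects in place of the SDK's ChatCompletion are assumptions made to keep the example self-contained.

```python
from types import SimpleNamespace


class LLMTransientError(Exception):
    """Signals a failure that is worth retrying with backoff."""


def first_message_content(completion):
    """Return the first choice's content, or raise if the body is hollow."""
    # Some gateways send HTTP 200 with choices=None; treat that like [].
    choices = getattr(completion, "choices", None) or []
    if not choices or getattr(choices[0], "message", None) is None:
        raise LLMTransientError("response contained no usable choices")
    return choices[0].message.content
```

Routing the hollow response through the transient-error path reuses the existing retry logic instead of adding a special case for this provider.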

Best Accuracy table held all 32 models; Best Value (F1 / cost) silently dropped the 10 free-tier rows because F1 / 0 is infinity. Visible asymmetry that asked "where did half the table go?" without an answer.
Splits the leaderboard into three sections:
1. Best Accuracy -- unchanged, all models by F1
2. Best Value -- paid-tier only, by F1 / $
3. Best Free-Tier (new) -- $0.00 rows ranked by F1 alone
Each section now has a one-line note explaining its scope, including a heads-up that OpenRouter free-tier eligibility depends on the attribution headers (HTTP-Referer, X-Title) wired in src/benchmark/llm.py, so a model that shows as free here may bill on a deployment that doesn't send those headers.
135 tests pass. Doc/report-only change; no version bump.
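The three-way split can be sketched like this, assuming each row is a dict with `f1` and `cost` keys. The function name and row shape are hypothetical.

```python
def split_leaderboard(rows: list) -> tuple:
    """Return (best_accuracy, best_value, best_free_tier) rankings."""
    best_accuracy = sorted(rows, key=lambda r: r["f1"], reverse=True)
    # F1 / cost is undefined at cost 0, so free rows get their own table.
    paid = [r for r in rows if r["cost"] > 0]
    free = [r for r in rows if r["cost"] == 0]
    best_value = sorted(paid, key=lambda r: r["f1"] / r["cost"], reverse=True)
    best_free_tier = sorted(free, key=lambda r: r["f1"], reverse=True)
    return best_accuracy, best_value, best_free_tier
```

Every model appears in the accuracy table and in exactly one of the other two, so no row disappears without explanation.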

The reliability-diagram-style line plot crowded 30+ models into the high-confidence end of the axis. Most models report 0.95-0.99 confidence on most predictions, so lines piled on top of each other and the x-axis tick labels (bin centers as float) became unreadable.

Replaces with a calibration heatmap:
- One row per model (sorted from most overconfident at top to most underconfident at bottom).
- One column per confidence bin, labeled with the bin range directly.
- Cell text shows actual hit rate and sample size.
- Cell color is calibration error (actual minus bin midpoint), with a diverging RdYlGn colormap centered on 0. Red = overconfident, green = well-calibrated, blue = underconfident. Bins the model never used render as a neutral gray.

Same underlying data, no overlap, x-axis is self-explanatory. Caption in the report's charts section updated to match.

135 tests pass. Report regenerated; no data refresh.
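The per-cell numbers behind such a heatmap could be computed along these lines. This is a sketch under assumptions: `calibration_cells`, the bin edges, and the `(confidence, is_correct)` input shape are all illustrative, and the actual binning in the report may differ.

```python
def calibration_cells(predictions, edges):
    """Compute hit rate and calibration error per confidence bin.

    predictions: list of (confidence, is_correct) pairs.
    edges: ascending bin edges, e.g. [0.5, 0.7, 0.9, 1.0].
    """
    cells = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        # Bins are half-open; the final bin also includes its upper edge
        # so a confidence of exactly 1.0 is counted.
        hits = [ok for conf, ok in predictions
                if lo <= conf < hi or (i == last and conf == hi)]
        if not hits:
            # Unused bin: rendered as neutral gray in the heatmap.
            cells.append({"bin": (lo, hi), "n": 0,
                          "hit_rate": None, "error": None})
            continue
        hit_rate = sum(hits) / len(hits)
        midpoint = (lo + hi) / 2
        # Negative error: the model claims more confidence than it earns.
        cells.append({"bin": (lo, hi), "n": len(hits),
                      "hit_rate": hit_rate, "error": hit_rate - midpoint})
    return cells


# Made-up predictions for illustration only.
preds = [(0.95, True), (0.97, False), (0.99, True), (1.0, True), (0.6, True)]
cells = calibration_cells(preds, [0.5, 0.7, 0.9, 1.0])
```

With this sign convention (actual minus bin midpoint), overconfident cells come out negative and underconfident cells positive.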

Two related fixes to chart rendering:

1. Shorten the calibration heatmap title from a long two-line caption to a compact one with cell-text legend up top and color legend below. Previous wording overflowed the figure width on the right on narrow viewports.
2. Add bbox_inches="tight" to every chart savefig. matplotlib's tight_layout sometimes misses corner clipping when titles or colorbar labels run close to the figure edge; bbox_inches="tight" on save expands the bounding box just enough to capture them. Applied to pareto.svg, compliance.svg, episodes.svg, calibration.svg, and latency_tail.svg.

No data change, just rendering. 135 tests pass.

calls.jsonl is append-only by design (audit history). When `benchmark run --retry-errors` succeeds against a previously-failed (model, episode, trial, window) tuple, the new successful row is appended alongside the original error row. The aggregator already filtered errors out of F1/cost (`if rec.get('error'): continue`), but the Failures section kept counting the original error rows even though they were superseded by a successful retry.

Fix: dedup raw calls by (model, episode_id, trial, window_index) and keep the last row encountered. Everything downstream (aggregator, charts, failures table, per-model detail) sees only the current state.

Run Metadata now distinguishes "unique work units (current state)" from "raw rows in calls.jsonl" and notes how many were superseded by later retries. Lifetime actual spend still sums the raw rows so the true historical bill is preserved.

135 tests pass. No data change; rendering only.
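The keep-last dedup can be sketched as below. The key fields follow the tuple named in this commit; `current_state`, `lifetime_spend`, and the `cost_usd` field are hypothetical names chosen for the illustration.

```python
import json


def current_state(lines):
    """Collapse append-only JSONL rows to the latest row per work unit."""
    latest = {}
    for line in lines:
        rec = json.loads(line)
        key = (rec["model"], rec["episode_id"],
               rec["trial"], rec["window_index"])
        latest[key] = rec  # a later row supersedes an earlier one
    return list(latest.values())


def lifetime_spend(lines):
    """Sum cost over every raw row, superseded rows included."""
    return sum(json.loads(line).get("cost_usd", 0.0) for line in lines)


# Made-up rows: one failed call, its successful retry, one unrelated call.
raw = [
    '{"model": "m1", "episode_id": "ep1", "trial": 0, "window_index": 3,'
    ' "error": "timeout", "cost_usd": 0.25}',
    '{"model": "m1", "episode_id": "ep1", "trial": 0, "window_index": 4,'
    ' "cost_usd": 0.5}',
    '{"model": "m1", "episode_id": "ep1", "trial": 0, "window_index": 3,'
    ' "cost_usd": 0.25}',
]
state = current_state(raw)
superseded = len(raw) - len(state)
```

Here the failed window 3 row disappears from the current state, while the spend total still reflects all three calls.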

The TP/FP/FN columns were undefined inline. New readers had to know the convention or skim the metric glossary at the top to interpret the rest of the section. Adds a compact column-key table with the formula for precision and recall and one-sentence definitions of TP, FP, and FN, plus a one-paragraph reading guide on what high-precision vs high-recall behavior actually means in production. Doc-only; no schema or test impact. 135 tests pass.
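For reference, the standard formulas behind those columns can be written as a short function. `precision_recall_f1` is an illustrative name, and returning 0.0 when a denominator is zero is an assumed convention, not something this PR states.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true/false positives and false negatives."""
    # Precision: of the ads the model flagged, what fraction were real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real ads, what fraction the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: 8 ads found, 2 false alarms, 4 ads missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

A high-precision, low-recall model rarely cuts real content but leaves ads in; the reverse removes more ads at the risk of cutting content.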

…chart

Two-part pass on report rendering:

1. Add a one-paragraph intro to every section that lacked one, written for an entry-level-to-expert audience. Sections that gained explainers: Failures (By category, Per-model error count, Sample messages), Detection rate buckets (By ad length, By ad position), Quick Comparison, Per-Model Detail, Per-Episode Detail, Parser Stress Test, Methodology. Each explainer answers: what does this table show, why does it matter, how should I read the numbers?
2. New chart: Cross-model agreement histogram (agreement.svg). Renders the existing per-window vote-count distribution as a colored bar chart (red = low agreement, green = high), with cell labels showing window counts and percentages. Wired into the Charts section with a caption that mirrors the table-form Cross-model agreement section.

135 tests pass. Data unchanged; rendering only.

User noted the cross-model agreement histogram was anonymous (no model labels) and asked for both views: distribution AND per-model attribution. Plus identified other sections that could use charts.

New analysis:
- Per-model alignment with consensus (table + chart). For each model and each (episode, window), classify the model's vote against the majority into 4 buckets: with-yes / with-no / broke-yes / broke-no. Surfaces which models track consensus and which break it -- a model that voted "yes" when most others voted "no" is likely false-positiving; the opposite means likely missing real ads.

New charts:
- alignment.svg: stacked horizontal bars per model showing the 4 buckets with green/blue/orange/red color coding and an alignment rate label on the right edge.
- precision_recall.svg: scatter of precision vs recall with F1 isocurves as gray dashed reference lines. Top-right = ideal, top-left = cautious, bottom-right = greedy.
- boundary.svg: stacked horizontal bars per model with blue = start MAE and orange = end MAE, sorted by total error ascending. Total labeled at the right.
- token_efficiency.svg: scatter of tokens-per-ad (log x) vs F1. Upper-left = efficient zone; right side = reasoning-heavy models whose extra tokens may or may not buy more F1.

Charts section in the report now lists all 9 charts with one-paragraph captions matching the depth of detail in the body sections. Histogram caption updated to acknowledge anonymity and point to the alignment chart for per-model attribution.

135 tests pass. Data unchanged; rendering only.
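The four-bucket classification might look like the sketch below. Several details are assumptions: that "broke-yes" means the model voted yes against a no majority, that the majority is taken over all models including the one being classified, and that tied windows are skipped. `alignment_buckets` and the input shape are illustrative.

```python
from collections import Counter


def alignment_buckets(votes):
    """Classify each model's vote against the per-window majority.

    votes: {(episode_id, window_index): {model: bool}}, where True means
    the model reported an ad in that window.
    """
    buckets = {}
    for by_model in votes.values():
        yes = sum(by_model.values())
        no = len(by_model) - yes
        if yes == no:
            continue  # assumption: a tie has no consensus to align with
        majority = yes > no
        for model, vote in by_model.items():
            if vote == majority:
                label = "with-yes" if vote else "with-no"
            else:
                label = "broke-yes" if vote else "broke-no"
            buckets.setdefault(model, Counter())[label] += 1
    return buckets


def alignment_rate(counts):
    """Share of windows where the model sided with the majority."""
    total = sum(counts.values())
    return (counts["with-yes"] + counts["with-no"]) / total if total else 0.0


# Made-up votes for two windows of one episode.
votes = {
    ("ep1", 0): {"a": True, "b": True, "c": False},
    ("ep1", 1): {"a": False, "b": False, "c": True},
}
buckets = alignment_buckets(votes)
```

In this toy input, model `c` breaks consensus in both windows, once in each direction, so its alignment rate is zero.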

…arts

Adds the last three sections that had model data but no chart.

New charts:
- trial_variance.svg: horizontal bars of mean F1 stdev per model, with green/yellow/red coloring at the 0.02 and 0.05 thresholds. Visually surfaces models whose single-trial F1 numbers can't be trusted.
- detection_by_length.svg: heatmap of model (row) vs ad-length bucket (column, short/medium/long), cell = detection rate plus sample size. Rows sorted by overall detection rate, descending.
- detection_by_position.svg: same shape, columns are pre-roll / mid-roll / post-roll. Surfaces the common pattern where post-roll is systematically harder.
- parser_stress.svg: heatmap of model vs extraction method (call counts). Columns ordered by total usage; rows ordered by share of json_array_direct (the clean path) so easier-to-consume models are at the top.

Total charts now 13. Each chart has a one-paragraph caption in the Charts section explaining what it shows and how to read it.

Also: ran a programmatic AI-tic scan across all report.py prose literals -- 0 hits across the Wikipedia high-frequency vocabulary. The one em-dash flag in the scan was a false positive: 6 em-dashes spread across 9 paragraphs in a single concatenated chart-captions string, not 6 in one paragraph (one per chart, which is the intended pattern).

135 tests pass.
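The trial-variance coloring could be derived roughly as follows. This is a sketch: the commit only names the 0.02 and 0.05 thresholds, so the use of population standard deviation, the per-episode averaging, and the treatment of values exactly on a threshold are assumptions, as are the function names.

```python
from statistics import mean, pstdev


def mean_f1_stdev(f1_by_episode):
    """Average the per-episode stdev of F1 across repeated trials.

    f1_by_episode: {episode_id: [f1 for each trial]}.
    """
    return mean([pstdev(trials) for trials in f1_by_episode.values()])


def stability_color(stdev):
    """Map a mean F1 stdev to a bar color at the 0.02 / 0.05 thresholds."""
    if stdev < 0.02:
        return "green"   # trials agree closely
    if stdev < 0.05:
        return "yellow"  # some run-to-run drift
    return "red"         # a single-trial F1 is not trustworthy


# Made-up trial results for two hypothetical models.
stable = {"ep1": [0.80, 0.80, 0.80], "ep2": [0.75, 0.75]}
noisy = {"ep1": [0.80, 0.80, 0.80], "ep2": [0.60, 1.00]}
```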

Two related polish changes.

1. Em-dash and tone pass. After the user noted that em-dashes were still visible throughout the report, audited every prose string in report.py and applied 23+ targeted replacements:
   - Converted `**term** -- description` patterns to `**term**: description`
   - Converted parenthetical em-dashes ("X -- Y -- and Z") to commas or sentence breaks
   - Converted aside em-dashes ("X. Y -- bad for production") to proper sentence boundaries ("X. Y. Bad for production.")
   - Title-cased heading "Parser Stress Test" lowered to "Parser stress test" (sentence case, per humanizer rule 15)

   Final rendered-report distribution: 88 paragraphs with 0 em-dashes, 1 with a single em-dash, 0 with 2+. Down from 17 + 2 = 19 paragraphs with em-dashes before the pass. Other humanizer-rule audits all clean: 0 promotional, 0 copula avoidance, 0 filler, 0 hedging, 0 chatbot artifacts, 0 superficial -ing analyses, 0 formulaic "despite/conclusion", 0 false ranges.

2. Source-data links. Every chart in the Charts section now has a "Source data: [linked anchor]" line directly below the image, pointing at the corresponding tables in the body of the report. Readers who want to dig past the visual can click straight to the numbers. 13 source-data links wired up (one per chart, some chart captions point to multiple tables like Pareto -> Best Accuracy / Best Value / Best Free-Tier).

135 tests pass. Doc/report-only change.

Production parser: add strategy 4 (`_salvage_truncated_single_ad`) to recover usable ad dicts from responses that ran out of token budget mid-output (the existing four strategies all bail on the structurally invalid JSON). Returns a dict only when both `start` and `end` were recovered; otherwise the row is dropped rather than fabricated. Tagged `json_object_single_ad_truncated`.

Benchmark: max_tokens is now config-driven via `[run].max_tokens`, defaulting to production's `AD_DETECTION_MAX_TOKENS` so the benchmark matches the live app's budget. `schema_audit` now imports production's `STRUCTURAL_FIELDS` and `SPONSOR_PRIORITY_FIELDS` to stop flagging fields the live parser already accepts.

Benchmark code cleanup: drop local duplicates of `parse_timestamp` and `format_time` in favor of `utils.time`; use `dataclasses.asdict` for violations; convert derived ModelStats fields to `@property`; precompute per-model indexes in `_aggregate` and call counts in `_render_failures`; cache `avg_f1` / `mean_f1_stdev` on ModelStats; fix the brittle HTTP-5xx classifier (`\b5\d{2}\b`); extract length/position bucket helpers.

Corpus: add two new verified episodes (`it-s-a-thing`, `the-brilliant-idiots`).
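As an illustration of the word-boundary pattern quoted for the HTTP-5xx classifier, the sketch below shows what `\b5\d{2}\b` does and does not match. `looks_like_http_5xx` is a hypothetical wrapper, and the sample messages are made up; the benchmark's actual classifier may apply further context checks.

```python
import re

# Three-digit number starting with 5, bounded on both sides, so digits
# embedded in longer numbers (e.g. "1500") do not count.
_HTTP_5XX = re.compile(r"\b5\d{2}\b")


def looks_like_http_5xx(message):
    """Return True when the message contains a standalone 5xx number."""
    return bool(_HTTP_5XX.search(message))


samples = {
    "Error code: 503 - upstream unavailable": True,
    "HTTP 429 Too Many Requests": False,
    "processed 1500 tokens": False,
    "request took 5000ms": False,
}
results = {msg: looks_like_http_5xx(msg) for msg in samples}
```

The boundaries rule out substrings of longer numbers, but a standalone number such as "500 tokens" would still match, so a pattern like this is best applied to text already known to be an error message.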

Rolls 11 of 12 open Dependabot PRs into 2.1.7. PR #208 (ubuntu 24.04 -> 26.04) held: the GPU image is pinned to ubuntu 24.04 through nvidia/cuda:12.9.1-runtime-ubuntu24.04 and nvidia/cuda has no ubuntu 26.04 variant, so taking it would split the GPU and CPU base images across a major OS version.

pip (5):
- bump anthropic 0.97.0 -> 0.100.0
- bump cryptography 47.0.0 -> 48.0.0
- bump gunicorn 25.3.0 -> 26.0.0
- bump huggingface-hub 1.13.0 -> 1.14.0
- bump openai 2.33.0 -> 2.36.0

npm (5):
- bump @tanstack/react-query 5.100.5 -> 5.100.9
- bump react 19.2.5 -> 19.2.6
- bump react-router-dom 6.30.3 -> 7.15.0 (major; tsc clean, no API surface change required)
- bump tailwind-merge 3.5.0 -> 3.6.0
- bump vite-plugin-pwa 1.2.0 -> 1.3.0

docker (1):
- bump node 24-alpine -> 26-alpine (both GPU and CPU Dockerfiles)

README: refresh the Cloud LLMs recommended-model table with the latest 32-model / 7-episode / 14,400-call benchmark numbers.

Verification:
- pytest tests/ -q -> 941 passed, 4 skipped
- benchmarks/llm pytest -> 135 passed
- frontend: npx tsc --noEmit clean, npm run build successful
- docker build --platform=linux/amd64 (GPU) -> success
- docker build --platform=linux/amd64 -f Dockerfile.cpu (CPU) -> success

ttlequals0 changed the title ~~feat(2.0.27): offline LLM ad-detection benchmark tool~~ feat(benchmark): offline LLM ad-detection tool May 7, 2026

ttlequals0 force-pushed the feat/benchmark-tool branch from 64c1ab5 to 9552780 Compare May 9, 2026 22:28

ttlequals0 added 13 commits May 9, 2026 21:42

ttlequals0 force-pushed the feat/benchmark-tool branch from dee91c4 to 4d76eb1 Compare May 10, 2026 01:46

ttlequals0 added 13 commits May 10, 2026 13:48

ttlequals0 added 9 commits May 10, 2026 19:19

updated report

fba24d1

docs(2.1.7): note accepted CVE-2026-31431 in linux-libc-dev

17ddbbb

ttlequals0 merged commit ed73901 into main May 11, 2026
8 checks passed

ttlequals0 deleted the feat/benchmark-tool branch May 11, 2026 16:47

feat(benchmark): offline LLM ad-detection tool #200
ttlequals0 merged 35 commits into main from feat/benchmark-tool

1 participant

ttlequals0 commented May 7, 2026 • edited

Summary

Why

CLI surface

Determinism + checkpointing

Concurrency

Test plan

Out of scope for this PR

Update (2026-05-10 → 2026-05-11): rolled into v2.1.7

Verification after the dep bumps

ttlequals0 commented May 7, 2026

Code review
