feat(benchmark): offline LLM ad-detection tool#200

Merged: ttlequals0 merged 35 commits into main from feat/benchmark-tool on May 11, 2026

@ttlequals0
Owner

@ttlequals0 ttlequals0 commented May 7, 2026

Summary

Adds a self-contained benchmark tool under benchmarks/llm/ that compares LLMs on ad-detection accuracy, cost, latency, and JSON compliance using real MinusPod transcripts. Operationalizes the spec in tmp/BENCHMARK_PLAN.md.

  • Capture episodes from MinusPod's /original-segments endpoint, hand-verify ground truth in truth.txt, then run an async multi-provider sweep that records per-call results to calls.jsonl.
  • Generates a Markdown report with TL;DR rankings, per-model and per-episode breakdowns, parser-stress catalog, and an SVG Pareto chart at results/report.md.
  • Imports MinusPod modules from ../../src/ at runtime via a path bootstrap. No runtime image impact: benchmarks/ is already in .dockerignore from v2.0.25.

Why

With v2.0.25's module-level lifts (extract_json_ads_array, parse_ads_from_response, format_window_prompt, get_static_system_prompt) and v2.0.26's segments-JSON endpoints already on main, the benchmark wires them into a deterministic, checkpointed sweep that can compare cost and accuracy across providers without running production.

CLI surface

benchmark capture --episode-url <url>
benchmark verify <ep-id>
benchmark regenerate-windows <ep-id> --force
benchmark list-episodes
benchmark validate
benchmark refresh-pricing
benchmark run [--retry-errors] [--dry-run] [--force] [--no-report-on-failure]
benchmark report
benchmark archive

Scope is sourced from benchmark.toml ([[models]]) and data/corpus/ at runtime; no per-run scope filters.

Determinism + checkpointing

Every (model, episode, trial, window_index) tuple computes a prompt_hash over (system_prompt, user_prompt, model, temperature). The runner skips tuples already recorded in calls.jsonl, so adding a model or episode and re-running fills only the gaps. User prompts are cached per (episode, window), which avoids about 98% of prompt builds on a full sweep.
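
A minimal sketch of the hash-and-skip idea, for illustration only. The SHA-256 digest, the JSON canonicalization, and the helper names `prompt_hash` and `pending` are assumptions; the PR only states which four inputs feed the hash and that completed tuples are skipped.

```python
import hashlib
import json


def prompt_hash(system_prompt: str, user_prompt: str, model: str, temperature: float) -> str:
    """Stable digest over the four inputs named in the PR description."""
    payload = json.dumps(
        {
            "system": system_prompt,
            "user": user_prompt,
            "model": model,
            "temperature": temperature,
        },
        sort_keys=True,      # key order must not change the digest
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def pending(work, done_keys):
    """Keep only work items whose key is not already recorded.

    Each item is a hypothetical (model, episode, trial, window_index, hash) tuple.
    """
    return [item for item in work if item not in done_keys]
```

Because the temperature is part of the digest, changing it produces new keys and the affected calls are re-run instead of being treated as done.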

Concurrency

Single-process async fan-out via asyncio.gather against AsyncAnthropic and AsyncOpenAI. Two semaphores keep the sweep within provider rate limits: max_concurrent_calls (global, default 8) and max_concurrent_per_provider (default 4). Running two benchmark run invocations against the same calls.jsonl at the same time is unsupported.
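
The two-semaphore fan-out could look roughly like the sketch below. `run_all` and its `call` parameter are hypothetical names, and the acquisition order (per-provider slot first, then the global slot) is an assumption rather than something the PR states.

```python
import asyncio
from collections import defaultdict


async def run_all(jobs, call, max_concurrent_calls=8, max_concurrent_per_provider=4):
    """Run every job concurrently under a global and a per-provider cap.

    jobs: iterable of (provider, payload) pairs.
    call: async callable taking (provider, payload).
    """
    global_sem = asyncio.Semaphore(max_concurrent_calls)
    provider_sems = defaultdict(lambda: asyncio.Semaphore(max_concurrent_per_provider))

    async def guarded(provider, payload):
        # Wait for the provider slot first so a task queued behind a busy
        # provider does not sit on one of the global slots.
        async with provider_sems[provider]:
            async with global_sem:
                return await call(provider, payload)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(p, x) for p, x in jobs))
```

With the defaults and two providers, at most eight calls are in flight at once, four per provider.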

Test plan

  • 127 unit tests pass: cd benchmarks/llm && pytest tests/
  • Main MinusPod test suite: 881 tests still pass
  • Frontend typecheck unaffected
  • /simplify run, applied 6 fixes (compliance scoring matches actual production extraction methods, prompt cache, indexed pricing lookup, single-pass scan_calls, public load_* helpers, dead-import cleanup, shared test fixtures)
  • Async LLM dispatch exercised via a single real call per provider once the corpus has at least one episode
  • First full sweep runs locally before report archive

Out of scope for this PR

  • Building the 6-episode corpus (runtime data; lives in data/corpus/ and is curated via benchmark capture/benchmark verify)
  • Running the first benchmark sweep

Update (2026-05-10 → 2026-05-11): rolled into v2.1.7

This PR ships v2.1.7, which now bundles:

  1. Offline LLM ad-detection benchmark (the original PR scope, above).
  2. Production parser strategy 4 in src/utils/llm_response.py: _salvage_truncated_single_ad recovers a usable ad dict when a response runs out of token budget mid-output (the four upstream strategies all bail on the structurally invalid JSON). Returns a dict only when both start and end were recovered. Tagged json_object_single_ad_truncated.
  3. Benchmark code-quality cleanup from a 3-agent /simplify pass: DRY against utils.time, dataclasses.asdict, derived @property on ModelStats, indexed _aggregate, cached avg_f1, hardened HTTP-5xx error classifier, plus magic-number constants for calibration bins and variance thresholds.
  4. Two new corpus episodes (it-s-a-thing, the-brilliant-idiots).
  5. Dependency bumps (11 of 12 open Dependabot PRs):
    • pip: anthropic 0.97 → 0.100, cryptography 47 → 48, gunicorn 25.3 → 26, huggingface-hub 1.13 → 1.14, openai 2.33 → 2.36
    • npm: @tanstack/react-query 5.100.5 → 5.100.9, react 19.2.5 → 19.2.6, react-router-dom 6 → 7 (major; tsc clean), tailwind-merge 3.5 → 3.6, vite-plugin-pwa 1.2 → 1.3
    • docker: node 24-alpine → 26-alpine (both Dockerfiles)
    • Held: #208, the Dependabot PR "docker(deps): bump ubuntu from 24.04 to 26.04". The GPU image is pinned to ubuntu 24.04 through nvidia/cuda:12.9.1-runtime-ubuntu24.04, and nvidia/cuda has no ubuntu 26.04 variant across CUDA 12.4 to 13.2.1. Taking it now would split the GPU and CPU base images across a major OS version, breaking the CLAUDE.md "both bases share ubuntu:24.04" invariant. Revisit when nvidia/cuda ships a 26.04 base.
  6. README: refreshed the Cloud LLMs recommended-model table with the latest 32-model / 7-episode / 14,400-call benchmark numbers.
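
For item 2 above, a rough sketch of what salvaging a truncated single-ad response might involve. This is not the code in src/utils/llm_response.py: the regex approach, the `start` and `end` field names as JSON keys, and the function name without the leading underscore are all assumptions. It keeps the one behaviour the description does state, returning a dict only when both boundaries were recovered.

```python
import re
from typing import Optional

_NUM = r"(-?\d+(?:\.\d+)?)"
# Require a ',' or '}' after the number so a value cut off mid-digits
# (for example '"end": 4' truncated from 47.0) is not mistaken for a full one.
_START_RE = re.compile(r'"start"\s*:\s*' + _NUM + r"\s*[,}]")
_END_RE = re.compile(r'"end"\s*:\s*' + _NUM + r"\s*[,}]")


def salvage_truncated_single_ad(text: str) -> Optional[dict]:
    """Recover start/end from a JSON object that ran out of tokens mid-output."""
    start = _START_RE.search(text)
    end = _END_RE.search(text)
    if not (start and end):
        return None  # one boundary is missing or incomplete: do not guess
    return {"start": float(start.group(1)), "end": float(end.group(1))}
```

A response cut off inside a later field, such as a free-text reason, still yields both boundaries; one cut off inside `end` yields nothing.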

Verification after the dep bumps

  • pytest tests/ -q --ignore=tests/unit/test_favicon_routes.py → 941 passed, 4 skipped (5 favicon failures are pre-existing on main, unrelated to this PR)
  • cd benchmarks/llm && uv run pytest tests/ -q → 135 passed
  • cd frontend && npx tsc --noEmit → clean
  • cd frontend && npm run build → 2,419 modules transformed, no errors
  • docker build --platform=linux/amd64 -t minuspod-test:gpu . → success
  • docker build --platform=linux/amd64 -f Dockerfile.cpu -t minuspod-test:cpu . → success

@ttlequals0
Owner Author

Code review

No issues found at HEAD. Findings from the review pass were fixed in commit 0332b20 (post-LLM I/O try/except + atomic write_response, OpenAI fallback transient-error classification, deprecated-model hash skip, microsecond pricing snapshot filenames, inline import hoisting, ASCII em-dash).

🤖 Generated with Claude Code

@ttlequals0 ttlequals0 changed the title feat(2.0.27): offline LLM ad-detection benchmark tool feat(benchmark): offline LLM ad-detection tool May 7, 2026
@ttlequals0 ttlequals0 force-pushed the feat/benchmark-tool branch from 64c1ab5 to 9552780 Compare May 9, 2026 22:28
ttlequals0 added 13 commits May 9, 2026 21:42
Adds benchmarks/llm/ uv project with:
- pyproject.toml, .env.example, benchmark.toml.example, .gitignore
- src/benchmark/ package with sys.path bootstrap to import MinusPod modules
- config.py: TOML loader with provider/model/run validation (11 tests)
- auth.py: cookie-cached MinusPod session with 23h TTL + login probe
- truth_parser.py: ground-truth file parser with structural+logical+
  cross-reference validation (26 tests)
- corpus.py: corpus loader, segments hashing, windows.json round-trip
  using MinusPod's create_windows (9 tests)
- capture.py: episode-URL parsing, /original-segments fetch, truth.txt
  pre-population from production ad markers (11 tests)
- pricing.py: thin wrapper over MinusPod's pricing_fetcher (LiteLLM)
  with snapshot read/write (6 tests)
- parsing.py: re-exports of lifted ad-detector functions
- storage.py: atomic JSONL append/fsync, prompt+response artifact
  writers, dedup index, sanitized error recording (13 tests)
- metrics.py: IoU greedy match, P/R/F1, boundary MAE, JSON compliance
  scoring, schema violation audit, trial stdev (26 tests)

Runner, report generator, CLI, and READMEs land in follow-up commits.
102 module-level tests pass.
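
As a rough illustration of the "IoU greedy match, P/R/F1" scoring listed for metrics.py, the sketch below pairs predicted and true ad spans by descending IoU. The 0.5 threshold and all function names are assumptions; the commit message does not give them.

```python
def iou(a, b):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred, truth, threshold=0.5):
    """Pair spans best-IoU-first; each span is used at most once."""
    pairs = sorted(
        ((iou(p, t), i, j) for i, p in enumerate(pred) for j, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs are all below the threshold
        if i in used_pred or j in used_truth:
            continue
        used_pred.add(i)
        used_truth.add(j)
        matches.append((i, j))
    return matches


def precision_recall_f1(pred, truth, threshold=0.5):
    tp = len(greedy_match(pred, truth, threshold))
    fp = len(pred) - tp
    fn = len(truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
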
Adds benchmarks/llm/ self-contained uv project that compares LLMs on
ad-detection accuracy, cost, latency, and JSON compliance using real
MinusPod transcripts.

Modules:
- config: TOML loader + provider/model/run validation
- auth: cookie-cached MinusPod session, 23h TTL, 429 reported (no auto-retry)
- truth_parser: ground-truth parser, structural+logical+cross-ref validation
- corpus: episode loader, segments-hash drift detection, public load_metadata
  / load_segments helpers
- capture: episode-URL parsing, /original-segments fetch, truth.txt
  pre-population from production ad markers
- pricing: thin wrapper over MinusPod's pricing_fetcher (LiteLLM-backed),
  snapshot read/write with O(1) indexed lookup
- parsing: re-exports of lifted ad_detector functions
- storage: atomic JSONL append/fsync, single-pass scan_calls returning
  (completed, errored), prompt+response artifact writers, sanitized errors
- metrics: IoU greedy match, P/R/F1, boundary MAE, schema-violation audit,
  prefix-aware compliance scoring matching production extraction methods
- llm: AsyncAnthropic + AsyncOpenAI dispatch with native json_object +
  prompt-injection fallback, retry-with-backoff classification
- runner: build (model, episode, trial, window) work list, dedup against
  calls.jsonl, asyncio.gather with global+per-provider semaphores;
  user prompt cached per (episode, window) -- 98% fewer build calls on
  full sweeps
- report: aggregate + render Markdown report (TL;DR, headline, per-model,
  per-episode, parser stress, methodology, run metadata) + SVG Pareto chart
- cli: typer subcommands -- capture, verify, regenerate-windows,
  list-episodes, validate, refresh-pricing, run, report, archive

Scope is sourced from benchmark.toml ([[models]]) and data/corpus/ at
runtime; no --model / --episode / --trials CLI filters.

127 unit tests pass. Async LLM dispatch is exercised against real provider
APIs at run time. .gitignore negates data/ for benchmarks/llm/data/ so
corpus + pricing snapshots commit while leaving repo-root data/ ignored.

No runtime image impact: benchmarks/ already in .dockerignore (v2.0.25).
- Wrap post-LLM I/O (parse, schema audit, write_response, write_prompt,
  append_jsonl) in try/except so a parse or disk error in one execute()
  doesn't propagate out of asyncio.gather and cancel the other in-flight
  tasks. Errored records are still appended; the JSONL write itself is
  the only operation that can lose a row, and that path now logs
  defensively rather than re-raising.
- write_response now mirrors write_prompt's tmp+os.replace atomic write
  so a crash mid-write can't leave a truncated file referenced by an
  fsynced calls.jsonl row.
- llm.call_with_retry's response_format fallback now classifies
  RateLimitError/APITimeoutError/APIConnectionError as transient on the
  retry path too (was only classifying APIStatusError).
- precompute_prompt_hashes skips deprecated models so the hash dict
  matches build_work_list's filter.
- pricing snapshot filenames include microseconds; two refresh-pricing
  calls in the same second no longer overwrite each other.
- Hoisted inline imports flagged by review (`import re` in storage,
  `from rapidfuzz import fuzz` in truth_parser, `from .storage import
  read_jsonl` in runner, sibling-module imports in cli._preview).
- Replaced the em dash in auth.py's 429 error message with ASCII `--`.

127 unit tests pass.
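
The tmp+os.replace pattern mentioned for write_response can be sketched as follows. `atomic_write_text` is a hypothetical helper, and the fsync before the rename is an assumption about how strict the real writer is.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one
    # filesystem, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file, then let the error propagate.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves only a stray temp file; the path referenced from calls.jsonl never points at a half-written response.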
Moves the deferred imports of create_windows, normalize_model_key,
fetch_litellm_pricing, and corpus.load_metadata/load_segments to
module level. The path bootstrap in benchmark/__init__.py runs
before any submodule loads, so MinusPod's src/ is already on
sys.path by the time these imports execute.

The remaining inline imports in llm.py (anthropic + openai SDKs)
and report.py (matplotlib) stay deferred on purpose: the SDKs only
load when their provider is actually used, and matplotlib (~50MB)
only loads when render is called.
The benchmark capture flow calls
GET /api/v1/feeds/{slug}/episodes/{id}/original-segments, an endpoint
added in 2.0.26. Pointing the tool at an older server returns 404 with
no obvious user-facing signal. README now states the minimum version
explicitly alongside the existing path-bootstrap note.
Two related fixes:

- Add python-dotenv as a runtime dep and load benchmarks/llm/.env at CLI
  import time. Path is resolved relative to the package (parents[2]) so
  the load works regardless of where `benchmark` is invoked from.
  Shell-exported variables still win over .env values (override=False).

- README: switch the install instruction from `uv sync` (which only
  satisfies the benchmark's own deps) to `pip install -e .` into the
  MinusPod parent venv. The benchmark imports modules from MinusPod's
  src/ at runtime; those modules transitively need MinusPod's runtime
  deps (jinja2, flask, etc.) which are only present in the parent venv.
  Running from a fresh benchmark-only venv fails at first import of
  benchmark.cli with ModuleNotFoundError on jinja2.
Previous version sent users to install editable into MinusPod's parent
venv, which works but isn't the natural uv-project flow. The reason the
benchmark needs MinusPod's runtime deps at all is that ad_detector does
a module-level `from webhook_service import fire_auth_failure_event`
even though webhook_service is only used inside the LLM error path; the
benchmark only wants create_windows but Python loads the whole module's
imports.

Two equivalent paths now documented:

- uv-native: `uv sync && uv pip install -r ../../requirements.txt`,
  then `uv run benchmark <cmd>`. Layered install needed because the
  benchmark's own pyproject.toml deliberately does not duplicate
  MinusPod's full runtime stack.

- parent-venv: install editable into ../../.venv and call
  ../../.venv/bin/benchmark directly.

Both work. uv run is now the primary path.
…umers

src/ad_detector.py and src/pricing_fetcher.py both did module-level imports
of dependencies they only need on a single code path:

- ad_detector.fire_auth_failure_event is only called inside the LLM dispatch
  loop when is_auth_error(e) is true. webhook_service transitively imports
  jinja2.
- pricing_fetcher.BeautifulSoup is only used inside fetch_pricepertoken_pricing.
  fetch_litellm_pricing does not need it.

Both imports are now deferred to the function bodies that actually use them.
No behavioural change: the webhook fires under identical conditions and the
HTML-scraping path parses identically.

The benchmark in benchmarks/llm/ imports create_windows from ad_detector and
fetch_litellm_pricing from pricing_fetcher via a sys.path bootstrap. Before
this commit those imports forced jinja2 + beautifulsoup4 into the benchmark's
venv, which uv sync had no reason to install (they're not in benchmark's
pyproject.toml), so `uv run benchmark <cmd>` failed at first import. After
this commit `uv run benchmark validate` and `refresh-pricing` work cleanly
on a fresh `uv sync`.
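
The deferred-import pattern described above, shown with stand-ins so it runs on the standard library alone. `fetch_table_pricing` and `fetch_scraped_pricing` are hypothetical names, and `html.parser` stands in for bs4; none of this is the actual pricing_fetcher code.

```python
import json  # cheap dependency, needed on every code path


def fetch_table_pricing(raw_json: str) -> dict:
    """Stand-in for the JSON path, which never needs an HTML parser."""
    return json.loads(raw_json)


def fetch_scraped_pricing(html: str) -> list:
    """Stand-in for the scraping path; the parser import is paid only here."""
    from html.parser import HTMLParser  # deferred import (stand-in for bs4)

    class _CellCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.cells = []
            self._in_cell = False

        def handle_starttag(self, tag, attrs):
            if tag == "td":
                self._in_cell = True

        def handle_endtag(self, tag):
            if tag == "td":
                self._in_cell = False

        def handle_data(self, data):
            if self._in_cell and data.strip():
                self.cells.append(data.strip())

    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells
```

A consumer that only calls the JSON path can import this module in an environment where the scraping dependency is not installed, which is the property the benchmark needed.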

Also adds requests>=2.31 to benchmarks/llm/pyproject.toml since pricing_fetcher
uses it directly, drops the now-unnecessary `uv pip install -r ../../requirements.txt`
step from the benchmark README, and seeds data/pricing_snapshots/ with the
first snapshot fetched against the live LiteLLM table (2233 models).

128 ad_detector + pricing tests pass; 127 benchmark tests pass.
The capture template pre-populates truth.txt from MinusPod's accepted
production ad markers. On real podcasts the production detector emits
some false positives that have to be deleted by hand on every capture:

1. Markers placed around silence or music where Whisper hallucinated a
   short phrase like "Thank you for watching." (no actual ad audio).
2. Markers placed around main content the production prompt happened to
   classify as ad-like (the host discussing security topics, etc).

Adds _classify_marker(marker, segments) -> (accepted, reason). Markers
that fail any of these are routed to the existing commented "rejected"
section with an `auto-rejected: <reason>` annotation, so the human
reviewer can still uncomment them if the heuristic is wrong:

- text in range matches a known Whisper hallucination phrase only
  ("Thank you for watching.", etc).
- text density < 3 chars/sec (well below normal speech ~12-16 chars/sec).
- text contains none of the strong ad signals: "brought to you by",
  "sponsor", ".com slash" / ".com/", "promo code", "use code",
  "discount code", "free trial", "sign up at", "get started at",
  "listen at", "visit X".

The "rejected" header is reused; production-rejected markers now carry
"# --- (rejected by production)" and auto-rejected markers carry
"# --- (auto-rejected: <reason>)" so the source is unambiguous.

Validated against the security-now-audio episode just captured: 11 raw
markers from production, 7 accepted (all confirmed real sponsor reads),
4 auto-rejected (2 Whisper hallucinations, 2 main-content false
positives). 134 benchmark tests pass.
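
A simplified sketch of the three checks, assuming the marker text and its start/end times have already been pulled out of the segments. The real `_classify_marker(marker, segments)` has a different signature; the function below, its reason strings other than "Whisper hallucination only", and the regex for "visit X" are illustrative assumptions.

```python
import re

# Only one hallucination phrase is named in the commit message.
HALLUCINATION_PHRASES = {"thank you for watching."}
AD_SIGNALS = (
    "brought to you by", "sponsor", ".com slash", ".com/", "promo code",
    "use code", "discount code", "free trial", "sign up at",
    "get started at", "listen at",
)
VISIT_RE = re.compile(r"\bvisit\s+\S+")  # approximates the "visit X" signal
MIN_CHARS_PER_SEC = 3.0


def classify_marker(text: str, start: float, end: float):
    """Return (accepted, reason) for one production ad marker."""
    normalized = " ".join(text.lower().split())
    if normalized in HALLUCINATION_PHRASES:
        return False, "Whisper hallucination only"
    duration = end - start
    if duration <= 0 or len(normalized) / duration < MIN_CHARS_PER_SEC:
        return False, "low text density"
    has_signal = any(s in normalized for s in AD_SIGNALS) or bool(VISIT_RE.search(normalized))
    if not has_signal:
        return False, "no strong ad signal"
    return True, ""
```
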
Previously, the truth.txt template grouped all rejected markers (auto
and production) at the bottom of the file, after every accepted block.
That broke a workflow: when a reviewer uncommented a rejected marker
to keep it, validate_logical's "ads must be ordered by start"
constraint would refuse to verify until they manually moved it back
into time order somewhere up above.

New layout: build a single chronologically sorted list of all markers
(accepted + auto-rejected + production-rejected) and render in that
order. Rejected blocks sit in their natural time slot, commented out,
with a one-line annotation above the start/end/text:

    ---
    # auto-rejected: Whisper hallucination only
    # start: 16:23.00
    # end:   16:49.10
    # text:  ...
    ---

Uncommenting then yields a valid block in the right time slot. No
section headers, no shuffling required. Comment lines (including the
auto-rejected: annotation) are ignored by the parser.

A short user-facing instruction is added at the top of any file with
rejections: "remove the '# ' prefix from start/end/text lines to
accept a rejected block."
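
The chronological layout could be rendered along these lines. The marker dict shape, the `rejected` key, the accepted-block format (the commented form minus the `# ` prefix), and the placement of the `---` separators are assumptions drawn from the example block above.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS.cc, working in centiseconds to avoid '60.00'."""
    total_cs = round(seconds * 100)
    minutes, cs = divmod(total_cs, 6000)
    return f"{minutes:02d}:{cs // 100:02d}.{cs % 100:02d}"


def render_truth(markers) -> str:
    """Render accepted and rejected markers in one time-ordered list."""
    lines = []
    for m in sorted(markers, key=lambda m: m["start"]):
        note = m.get("rejected")       # annotation text for rejected markers, else None
        prefix = "# " if note else ""  # rejected blocks stay commented out
        lines.append("---")
        if note:
            lines.append(f"# {note}")
        lines.append(f"{prefix}start: {fmt_ts(m['start'])}")
        lines.append(f"{prefix}end:   {fmt_ts(m['end'])}")
        lines.append(f"{prefix}text:  {m['text']}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```
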

134 -> 135 benchmark tests pass.
…k consumers

Same root cause as 2.0.28's lazy-import fixes (webhook_service, bs4).
ad_detector.get_static_system_prompt does `from database import
DEFAULT_SYSTEM_PROMPT`, which runs database/__init__.py end-to-end
before the constant is accessible. That __init__ loads
database.settings, which module-level imports secrets_crypto, which
requires the cryptography package.

The offline benchmark in benchmarks/llm/ calls get_static_system_prompt
inside `run --dry-run` and `run`. A fresh `uv sync` venv has no
cryptography, so dry-run blew up at import:

    ModuleNotFoundError: No module named 'cryptography'

DEFAULT_SYSTEM_PROMPT is a plain ~8KB multi-line string. It has no
database semantics. Move it to src/utils/constants.py (already stdlib-
only, already home to SEED_SPONSORS), and re-export from
src/database/__init__.py for backward compat. Update the two
`from database import DEFAULT_SYSTEM_PROMPT` sites in src/ad_detector.py
to import from utils.constants directly so the benchmark's import path
no longer touches the database package.

Verified: `uv run benchmark run --dry-run` now reports 4,760 calls would
execute against the 5-episode corpus. 162 ad_detector + database +
settings unit tests pass.
…Router attribution

Two related dispatch-side fixes observed during the first live run.

1. Claude 4.x family models (opus 4.7, sonnet 4.6, haiku 4.5) reject
   `temperature` with HTTP 400 "`temperature` is deprecated for this
   model.". The previous version retried without temperature on 400 but
   relearned the deprecation on every call, paying ~1020 wasted
   round-trips on a full 5-episode sweep. Now memoized in a process-
   level set: first 400 per model adds the model_id, subsequent calls
   skip `temperature` upfront. Worst-case waste is one round-trip per
   affected model per process. Older Claude models that still accept
   `temperature` are unaffected.

2. OpenRouter recommends HTTP-Referer + X-Title headers for app
   attribution. The OpenAI SDK passes them via default_headers. Detected
   by base_url containing "openrouter.ai" so it doesn't leak to other
   openai_compatible providers.
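
The memoization in item 1 amounts to a process-level set consulted before each call. In the sketch below, `send`, `TemperatureRejected`, and `call_with_temperature_memo` are hypothetical stand-ins for the SDK call, the HTTP 400 response, and the dispatch function.

```python
_MODELS_WITHOUT_TEMPERATURE = set()  # process-level memo of model ids


class TemperatureRejected(Exception):
    """Stand-in for the provider's HTTP 400 'temperature is deprecated' error."""


def call_with_temperature_memo(send, model_id, prompt, temperature=0.0):
    """Send temperature unless this model is already known to reject it."""
    kwargs = {"model": model_id, "prompt": prompt}
    if model_id not in _MODELS_WITHOUT_TEMPERATURE:
        try:
            return send(temperature=temperature, **kwargs)
        except TemperatureRejected:
            # Remember the rejection so later calls skip the wasted round-trip.
            _MODELS_WITHOUT_TEMPERATURE.add(model_id)
    return send(**kwargs)
```

The first call for an affected model costs two round-trips; every later call in the same process costs one.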

135 benchmark tests pass. The first run captured 335 successful calls
across 5 episodes for claude-opus-4-7 with 0 errors and median
compliance 1.000, validating both code paths against the live API.
Same lazy-import pattern as 2.0.28 but for the new src/utils/llm_call.py
introduced by PR #205 (2.1.6). The module-level

    from webhook_service import fire_auth_failure_event

at line 14 pulled in jinja2 transitively for any consumer of the new
call_llm_for_window helper -- including the benchmark, which imports
ad_detector -> utils.llm_call. Move the import into the call site
inside the is_auth_error(e) branch where it's actually needed.

No behavioural change for production: webhook fires under identical
conditions. The benchmark in benchmarks/llm/ can now import the
ad_detector module chain without cryptography/jinja2/flask in its venv.

299 unit tests pass. Benchmark dry-run reports 2,592 remaining calls
(2,168 already done from the in-flight run, dedup'd via prompt_hash).
@ttlequals0 ttlequals0 force-pushed the feat/benchmark-tool branch from dee91c4 to 4d76eb1 Compare May 10, 2026 01:46
ttlequals0 added 13 commits May 10, 2026 13:48
Five upgrades based on first-real-run feedback:

1. Metric Key section moved to the top of the report. Replaces the
   short "How to Read" footnote with a tabular glossary (range,
   direction, plain-English meaning) for F1, cost, F1/$, latency,
   JSON compliance, no-ad PASS/FAIL, F1 stdev. Latency entry calls
   out the OpenRouter routing-layer caveat -- it includes upstream
   queueing, not just model-side compute, so it should be treated
   as a load/availability indicator rather than a model-quality
   signal.

2. Pareto chart redone with numbered points + a sorted legend on
   the right. Matplotlib's default ax.annotate at the data point
   produced overlapping text when models clustered (very common at
   the low-cost end). Numbered markers eliminate that.

3. New JSON compliance bar chart (compliance.svg). Horizontal bars
   sorted ascending, color-coded green/yellow/red against >=0.95 and
   >=0.7 thresholds, with a dotted reference line at 0.95.

4. New per-episode F1 heatmap (episodes.svg). Models on Y (sorted by
   avg F1 desc), ad-bearing episodes on X (no-ad excluded). Reveals
   model-content interaction that the aggregated F1 hides -- e.g. some
   models that score well overall struggle on specific episodes.

5. New "Failures and provider issues" section. Categorizes errors
   into buckets (provider content moderation, deprecated parameter,
   rate-limited, 5xx, etc.), shows per-model error counts, includes
   sample error messages, and explains why the failures matter for
   production model selection. Triggered by the qwen 3.5-plus run
   where Alibaba's content classifier rejected one transcript window
   (a real production gotcha you can't see from F1 alone).

3 report tests still pass.
Each model now gets a distinct color from matplotlib's tab20 colormap.
The legend sits below the plot as a real matplotlib legend (instead of
the right-side text box), so each model's color swatch is rendered
adjacent to its name + (F1, cost) summary.

Layout: 2-column legend for >6 models, 1-column otherwise. Bottom
margin scales with row count (capped at 0.55) so the legend always
fits without overlapping the axes.

3 report tests still pass.
Adds the high-signal sections we identified after the first run.
Every addition reuses data already in calls.jsonl -- no schema change
to the wire format, just new aggregations and renderers.

New stats on ModelEpisodeStats / ModelStats:
- Per-trial precision, recall, TP/FP/FN, boundary start/end MAE
- Per-model output-tokens-total, detected-ads-total, tokens-per-ad
- p90 / p99 / max latency in addition to p50 / p95
- Cost-per-true-positive

New side data captured during aggregation (returned as _Extras):
- Calibration: per-model list of (self-reported confidence, was-TP)
- Cross-model agreement: per (episode, window), models predicting an ad
- Detection-by-bucket: hit rate by ad length and ad position

New report sections:
1. Precision / recall / FP / FN breakdown (F1 hides which side errs)
2. Boundary accuracy -- start/end MAE for matched ads
3. Confidence calibration table (binned, with hit rate per bin)
4. Latency tail -- p50/p90/p95/p99/max
5. Output token efficiency -- tokens per detected ad + cost per TP
6. Trial variance -- determinism check at temp=0
7. Cross-model agreement -- N-of-K vote distribution per window
8. Detection rate by ad length (short/medium/long)
9. Detection rate by ad position (pre/mid/post-roll)
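
The binned calibration table in section 3 can be approximated as below. The bin edges are placeholders (the real ones live in the benchmark's calibration-bin constants), and `calibration_table` is a hypothetical name; the error column follows the "actual minus bin midpoint" definition used later in this PR for the heatmap.

```python
def calibration_table(samples, edges=(0.0, 0.5, 0.7, 0.9, 0.95, 1.0)):
    """Bin (confidence, was_tp) pairs and report the hit rate per bin.

    Bins are half-open [lo, hi), except the last, which also includes hi.
    """
    rows = []
    last_index = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        hits = [
            was_tp
            for conf, was_tp in samples
            if lo <= conf < hi or (i == last_index and conf == hi)
        ]
        if not hits:
            continue  # skip empty bins instead of reporting 0/0
        hit_rate = sum(hits) / len(hits)
        rows.append({
            "bin": (lo, hi),
            "n": len(hits),
            "hit_rate": hit_rate,
            "error": hit_rate - (lo + hi) / 2,  # negative means overconfident
        })
    return rows
```
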

New charts:
- calibration.svg -- reliability diagram (confidence vs hit rate)
- latency_tail.svg -- p50/p90/p99/max bars on log scale per model

Charts section now lists all five charts inline. Existing pareto +
compliance + episodes charts unchanged.

135 tests pass. Report regenerated against the current calls.jsonl
surfaces several findings that were invisible before:
- Most models are wildly overconfident (claim 0.95-0.99 confidence,
  hit rate 20-50%); phi-4 is overconfident at 4% hit rate
- Boundary MAE on start can be 18-24s for some models even when F1 is OK
- Every model misses post-roll ads more than pre/mid-roll
- 22/68 windows show 13+ models agreeing -- candidate ensemble pool
Adds benchmarks/llm/CONTRIBUTING.md so outside contributors can ship PRs
expanding the corpus or the model list without reverse-engineering the
workflow from the existing README pair.

Covers the three accepted PR shapes (new episode, new model, code) with
the exact diff each one should produce, what `benchmark verify` already
validates so reviewers don't redo it by hand, and the copyright/PII bar
for transcript excerpts (a gap not covered in the existing READMEs).
Cross-references README.md and data/README.md rather than duplicating
their content; total length 117 lines.

Doc passes the Wikipedia signs-of-AI-writing scan: zero hits across the
~25 high-frequency AI-vocabulary tokens checked, em-dash density 4/117
all in `term -- definition` bullet form (compliant with humanizer's
"one em dash per paragraph" rule). All example URLs use the
your-minuspod.example.com placeholder per the public-repo URL hygiene
rule in CLAUDE.md.

No code, schema, or test impact. Bumps version.py + openapi.yaml to
2.1.7 per CLAUDE.md "always update version.py with changes" rule.
Adds the 5-episode verified corpus and the artifacts from the first
14-model sweep so anyone cloning the repo can audit the results
without re-running the benchmark.

Corpus:
  data/corpus/ep-ai-cloud-essentials-e8dc897fbd6b/  (no-ad control, 16 min)
  data/corpus/ep-daily-tech-news-show-c1904b8605f7/ (4 active ads)
  data/corpus/ep-glt1412515089-373d5ba5007b/        (4 active ads)
  data/corpus/ep-security-now-audio-2850b24903b2/   (6 active ads)
  data/corpus/ep-the-tim-dillon-show-f62bd5fa1cfe/  (6 active ads)
Each episode dir contains metadata.toml, segments.json (Whisper-byte-
exact), truth.txt (human-verified ad markers), and windows.json.

Run artifacts (first 14-model sweep, 4760 total LLM calls, 1 error):
  results/raw/calls.jsonl                  (8.0 MB, one row per call)
  results/raw/episode_results.jsonl        (492 KB, per-trial aggregates)
  results/raw/prompts/                     (1246 files, 15 MB)
  results/raw/responses/                   (6196 files, 23 MB)
  results/report.md                        (32 KB rendered report)
  results/report_assets/*.svg              (5 charts, 590 KB)
  data/pricing_snapshots/...               (LiteLLM snapshot)

Updates CONTRIBUTING.md "what's in the repo vs your PR" table to match:
calls.jsonl + episode_results.jsonl + prompts/ + responses/ are
maintainer-committed for audit; contributors don't include them in
their own PRs.

No app code changed; no version bump. Total commit ~54 MB across ~7,470
files. Maintainers should expect repo size to grow when new episodes
or models are merged and the next sweep is run.
Restructures the existing "Recommended Models" section to add a Cloud
LLMs subsection on top of the existing Ollama subsection. The cloud
picks come from the offline benchmark in benchmarks/llm/ -- four
recommendations covering best-accuracy, best-Anthropic-direct, best
free-tier, and cheap-and-fast use cases, with F1 / cost / latency
caveats per option.

Cross-references the benchmark tool, the rendered report, and the
contribution guide so readers who want to validate or expand the list
can. Notes the current corpus is 5 episodes and numbers will refine as
the corpus grows.

Existing local-Ollama tables (Pass 1, Verification, Chapters) are
preserved verbatim, just nested under a new Local Ollama Models
heading and demoted from #### to ##### accordingly.

ToC: adds "Recommended Models" as a nested entry under
"Using Ollama (Local or Cloud)" so the section is reachable from the
top of the doc.

Doc-only change. Doc passes the AI-tic vocabulary scan.
Three updates anchored by the new offline benchmark data.

1. Local Ollama Models intro: prefix the section with a note that the
   benchmark covers cloud-hosted models only and Ollama runs are not in
   the sweep, so contributors who want apples-to-apples local numbers
   would need to extend the benchmark with an Ollama provider. Also
   trims the unverified "in the same tier as Claude Sonnet" claim on
   qwen3.5:122b -- the cloud benchmark scored Sonnet 4.6 at F1 0.33,
   not a clear ceiling, so the comparison no longer fits.

2. "Accuracy vs. Claude" -> "Cloud vs. Local: What Changes". The old
   title implied Claude was the universal frontier; the cloud benchmark
   shows Grok-4.1-fast (F1 0.61) clearly beats every Claude variant,
   with Sonnet at 0.33. Reframes the section as cloud-vs-local rather
   than open-weights-vs-Claude. Adds a row on network-inserted brand-
   tagline ads (the cloud benchmark's per-bucket detection rates show
   even frontier models miss roughly a third of these, useful prior
   for setting expectations on local). Closes with how to measure the
   gap on your own content via the bench tool.

3. JSON Reliability Risks: grounds the cloud-vs-local framing in real
   data. The benchmark report shows JSON compliance from 0.53 (verbose
   reasoning models) to 1.00 (Mistral / Qwen / Opus); Claude Haiku 4.5
   at 0.60 because it markdown-fences every response. So compliance
   variance is not unique to local; cloud models exhibit it too. Links
   to the JSON-compliance chart in the report.

Also updates the LLM accuracy notice in the disclaimer area to point
at the renamed section and link the benchmark report instead of
restating "most testing was done with Claude" (no longer accurate now
that the benchmark covers 14 cloud models).

Doc-only change. AI-tic vocabulary scan: 0 hits in added prose.
The 2.1.7 commit (4c481f9) bumped version.py and openapi.yaml for a
docs-only change. By policy ("don't bump for non-app changes"), that
bump should not have happened. Reverts both files to 2.1.6 and removes
the 2.1.7 entry from CHANGELOG since no Docker image will ship under
that tag.

The src/ refactors that landed since 2.1.6 (lazy webhook_service
imports in ad_detector.py and utils/llm_call.py, lazy bs4 import in
pricing_fetcher.py, DEFAULT_SYSTEM_PROMPT hoist to utils/constants.py
with a re-export in database/__init__.py) are behavior-preserving --
they enabled the offline benchmark in benchmarks/llm/ to import these
modules without dragging in jinja2/flask/cryptography/bs4. Runtime is
byte-identical to main's 2.1.6, so no Docker rebuild is needed.

Net effect on a deployed instance:
- Version string still 2.1.6 (matches the live image).
- No behavior change.
- New CONTRIBUTING.md, expanded benchmark report, corpus data, and
  README cloud-LLM picks ride along on this branch without bumping
  the version.
OpenRouter sometimes returns HTTP 200 with `choices=None` in the body
when the upstream provider hits its own internal read-timeout. Observed
on cohere/command-r-plus-08-2024 at the 120s mark for windows that
take long to reason about. The OpenAI SDK doesn't raise -- it returns
a ChatCompletion object where `msg.choices` is None instead of [],
and our code did `msg.choices[0]` which crashes with
"'NoneType' object is not subscriptable".

Fix: defensively coerce None to [] and raise LLMTransientError when
choices is empty (or the first choice has no message). The runner's
existing retry-on-LLMTransientError path then retries with backoff
instead of crashing the call.
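
A minimal sketch of the guard described above, not the actual code in the benchmark. The `LLMTransientError` class here is a local stand-in and `first_message` is a hypothetical helper name; the response is modelled with `SimpleNamespace` so the example runs without the OpenAI SDK.

```python
from types import SimpleNamespace


class LLMTransientError(Exception):
    """Local stand-in for the benchmark's retryable error type."""


def first_message(response):
    """Return the first choice's message, or raise a retryable error.

    `first_message` is a hypothetical helper name used for illustration.
    """
    # choices may arrive as None instead of []; normalise both to [].
    choices = getattr(response, "choices", None) or []
    if not choices:
        raise LLMTransientError("response contained no choices")
    message = getattr(choices[0], "message", None)
    if message is None:
        raise LLMTransientError("first choice has no message")
    return message


# Usage: an HTTP 200 body whose choices field is None.
timed_out = SimpleNamespace(choices=None)
try:
    first_message(timed_out)
except LLMTransientError as exc:
    print(f"retryable: {exc}")
```

Raising a dedicated exception type, rather than letting the `TypeError` propagate, is what allows a retry loop to tell this case apart from a real bug.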

Impact on the in-flight run: minimal. The 4 affected Cohere calls
(out of 340) are already excluded from F1 / compliance / cost
aggregates by `if rec.get('error'): continue`, so Cohere R+ still has
336 valid samples. The fix prevents recurrence and lets
`benchmark run --retry-errors` retry the 4 timed-out windows cleanly
after the run completes.

No version bump; benchmark-only code, doesn't ship in the Docker image.

Best Accuracy table held all 32 models; Best Value (F1 / cost) silently
dropped the 10 free-tier rows because F1 / 0 is infinity. The visible
asymmetry raised the question "where did half the table go?" without
answering it.

Splits the leaderboard into three sections:

1. Best Accuracy -- unchanged, all models by F1
2. Best Value -- paid-tier only, by F1 / $
3. Best Free-Tier (new) -- $0.00 rows ranked by F1 alone

Each section now has a one-line note explaining its scope, including
a heads-up that OpenRouter free-tier eligibility depends on the
attribution headers (HTTP-Referer, X-Title) wired in src/benchmark/llm.py,
so a model that shows as free here may bill on a deployment that
doesn't send those headers.
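
The three-way split can be sketched as follows. This is an illustration under assumptions, not the code in report.py: the row shape (`model`, `f1`, `cost_usd`), the function name, and the sample model names are all hypothetical.

```python
def split_leaderboard(rows):
    """Split rows into (best_accuracy, best_value, best_free_tier).

    Each row is a dict with 'model', 'f1', and 'cost_usd' keys; these
    field names are assumptions made for this sketch.
    """
    # Best Accuracy: every model, ranked by F1.
    best_accuracy = sorted(rows, key=lambda r: r["f1"], reverse=True)

    # Best Value: paid rows only, so F1 / cost never divides by zero.
    paid = [r for r in rows if r["cost_usd"] > 0]
    best_value = sorted(paid, key=lambda r: r["f1"] / r["cost_usd"], reverse=True)

    # Best Free-Tier: $0.00 rows, ranked by F1 alone.
    free = [r for r in rows if r["cost_usd"] == 0]
    best_free_tier = sorted(free, key=lambda r: r["f1"], reverse=True)

    return best_accuracy, best_value, best_free_tier


# Usage with made-up rows.
rows = [
    {"model": "paid-a", "f1": 0.8, "cost_usd": 2.0},
    {"model": "paid-b", "f1": 0.6, "cost_usd": 0.5},
    {"model": "free-c", "f1": 0.7, "cost_usd": 0.0},
]
accuracy, value, free_tier = split_leaderboard(rows)
print([r["model"] for r in value])
```

Every row lands in Best Accuracy and in exactly one of the other two sections, so no model disappears from the report.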

135 tests pass. Doc/report-only change; no version bump.

The reliability-diagram-style line plot crowded 30+ models into the
high-confidence end of the axis. Most models report 0.95-0.99
confidence on most predictions, so lines piled on top of each other
and the x-axis tick labels (bin centers as float) became unreadable.

Replaces with a calibration heatmap:
- One row per model (sorted from most overconfident at top to most
  underconfident at bottom).
- One column per confidence bin, labeled with the bin range directly.
- Cell text shows actual hit rate and sample size.
- Cell color is calibration error (actual minus bin midpoint), with a
  diverging RdYlGn colormap centered on 0. Red = overconfident, green
  = well-calibrated, blue = underconfident. Bins the model never used
  render as a neutral gray.

Same underlying data, no overlap, x-axis is self-explanatory. Caption
in the report's charts section updated to match.
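
For reference, the per-cell numbers (hit rate, sample size, and actual minus bin midpoint) can be computed along these lines. This is a sketch only: the equal-width bins, the default of five bins, and the function name are assumptions, and the report may bin differently.

```python
def calibration_cells(predictions, n_bins=5):
    """Compute one heatmap row for a single model.

    predictions: iterable of (confidence in [0, 1], was_correct) pairs.
    Returns a list with one entry per bin: None for a bin the model never
    used, otherwise (hit_rate, sample_size, calibration_error).
    Equal-width bins are an assumption of this sketch.
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    for confidence, was_correct in predictions:
        # Multiply rather than divide by the bin width to avoid float drift;
        # a confidence of exactly 1.0 is clamped into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        counts[index] += 1
        if was_correct:
            hits[index] += 1

    cells = []
    for index in range(n_bins):
        if counts[index] == 0:
            cells.append(None)  # rendered as neutral gray
            continue
        hit_rate = hits[index] / counts[index]
        midpoint = (index + 0.5) / n_bins
        # Negative error: the model claimed more confidence than it earned.
        cells.append((hit_rate, counts[index], hit_rate - midpoint))
    return cells


# Usage: four high-confidence predictions, half of them correct.
row = calibration_cells([(0.95, True), (0.97, True), (0.92, False), (0.99, False)])
print(row)
```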

135 tests pass. Report regenerated; no data refresh.

Two related fixes to chart rendering:

1. Shorten the calibration heatmap title from a long two-line caption
   to a compact one with cell-text legend up top and color legend
   below. Previous wording overflowed the figure width on the right
   on narrow viewports.

2. Add bbox_inches="tight" to every chart savefig. matplotlib's
   tight_layout sometimes misses corner clipping when titles or
   colorbar labels run close to the figure edge; bbox_inches="tight"
   on save expands the bounding box just enough to capture them.
   Applied to pareto.svg, compliance.svg, episodes.svg,
   calibration.svg, and latency_tail.svg.

No data change, just rendering. 135 tests pass.

calls.jsonl is append-only by design (audit history). When
`benchmark run --retry-errors` succeeds against a previously-failed
(model, episode, trial, window) tuple, the new successful row is
appended alongside the original error row. The aggregator already
filtered errors out of F1/cost (`if rec.get('error'): continue`), but
the Failures section kept counting the original error rows even though
they were superseded by a successful retry.

Fix: dedup raw calls by (model, episode_id, trial, window_index) and
keep the last row encountered. Everything downstream (aggregator,
charts, failures table, per-model detail) sees only the current state.

Run Metadata now distinguishes "unique work units (current state)"
from "raw rows in calls.jsonl" and notes how many were superseded by
later retries. Lifetime actual spend still sums the raw rows so the
true historical bill is preserved.
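
A rough sketch of the keep-last dedup over an append-only log. The key fields come from the description above; the `cost_usd` field name and the function name are assumptions for illustration.

```python
import json


def current_state(jsonl_lines):
    """Collapse an append-only call log to the latest row per work unit."""
    latest = {}
    for line in jsonl_lines:
        rec = json.loads(line)
        key = (rec["model"], rec["episode_id"], rec["trial"], rec["window_index"])
        latest[key] = rec  # a later row for the same key supersedes the earlier one
    return list(latest.values())


# Usage: one failed call, its successful retry, and an unrelated call.
raw = [
    '{"model": "m", "episode_id": "e", "trial": 0, "window_index": 0, "error": "timeout", "cost_usd": 0.01}',
    '{"model": "m", "episode_id": "e", "trial": 0, "window_index": 0, "cost_usd": 0.02}',
    '{"model": "m", "episode_id": "e", "trial": 0, "window_index": 1, "cost_usd": 0.03}',
]
current = current_state(raw)
superseded = len(raw) - len(current)

# Lifetime spend deliberately sums the raw rows, including superseded ones.
lifetime_spend = sum(json.loads(line).get("cost_usd", 0.0) for line in raw)
print(len(current), superseded, round(lifetime_spend, 2))
```

Because the log itself is never rewritten, the audit history stays intact and only the reporting view changes.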

135 tests pass. No data change; rendering only.

The TP/FP/FN columns were undefined inline. New readers had to know
the convention or skim the metric glossary at the top to interpret
the rest of the section. Adds a compact column-key table with the
formula for precision and recall and one-sentence definitions of TP,
FP, and FN, plus a one-paragraph reading guide on what high-precision
vs high-recall behavior actually means in production.
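
The standard formulas behind those columns, as a small sketch. Returning 0.0 when a denominator is zero is a convention chosen here, not necessarily what the report does.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true/false positive and false negative counts.

    tp: ads the model flagged that match ground truth
    fp: spans the model flagged that are not ads
    fn: real ads the model missed
    Zero denominators yield 0.0 (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Usage: 8 correct detections, 2 false alarms, 4 missed ads.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A high-precision, low-recall model rarely cuts real content but leaves ads in; the reverse removes more ads at the risk of cutting show content.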

Doc-only; no schema or test impact. 135 tests pass.

…chart

Two-part pass on report rendering:

1. Add a one-paragraph intro to every section that lacked one, written
   for an entry-level-to-expert audience. Sections that gained
   explainers: Failures (By category, Per-model error count, Sample
   messages), Detection rate buckets (By ad length, By ad position),
   Quick Comparison, Per-Model Detail, Per-Episode Detail, Parser
   Stress Test, Methodology. Each explainer answers: what does this
   table show, why does it matter, how should I read the numbers?

2. New chart: Cross-model agreement histogram (agreement.svg).
   Renders the existing per-window vote-count distribution as a
   colored bar chart (red = low agreement, green = high), with cell
   labels showing window counts and percentages. Wired into the
   Charts section with a caption that mirrors the table-form Cross-
   model agreement section.

135 tests pass. Data unchanged; rendering only.

User noted the cross-model agreement histogram was anonymous (no model
labels) and asked for both views: distribution AND per-model
attribution. Plus identified other sections that could use charts.

New analysis:
- Per-model alignment with consensus (table + chart). For each model
  and each (episode, window), classify the model's vote against the
  majority into 4 buckets: with-yes / with-no / broke-yes / broke-no.
  Surfaces which models track consensus and which break it -- a model
  that voted "yes" when most others voted "no" is likely
  false-positiving; the opposite means likely missing real ads.
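
The four-bucket classification could look roughly like this. Assumptions in this sketch: the majority includes the model's own vote, tied windows are skipped, "broke-yes" means the model voted yes against a "no" majority, and the input shape and function name are hypothetical.

```python
from collections import Counter


def alignment_buckets(votes):
    """Classify each model's vote against the per-window majority.

    votes: {(episode_id, window_index): {model_name: voted_yes (bool)}}
    Returns {model_name: Counter of with-yes / with-no / broke-yes / broke-no}.
    """
    buckets = {}
    for window_votes in votes.values():
        yes = sum(1 for v in window_votes.values() if v)
        no = len(window_votes) - yes
        if yes == no:
            continue  # no majority to compare against (assumed handling)
        majority_yes = yes > no
        for model, voted_yes in window_votes.items():
            if voted_yes == majority_yes:
                label = "with-yes" if voted_yes else "with-no"
            else:
                # broke-yes: said "ad" while most models said "no ad".
                label = "broke-yes" if voted_yes else "broke-no"
            buckets.setdefault(model, Counter())[label] += 1
    return buckets


# Usage: model "c" disagrees with the majority in both windows.
votes = {
    ("ep1", 0): {"a": True, "b": True, "c": False},
    ("ep1", 1): {"a": False, "b": False, "c": True},
}
result = alignment_buckets(votes)
print(dict(result["c"]))
```

Consensus is not ground truth, so a high broke-yes or broke-no count is a prompt to check that model's FP and FN numbers rather than a verdict on its own.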

New charts:
- alignment.svg: stacked horizontal bars per model showing the 4
  buckets with green/blue/orange/red color coding and an alignment
  rate label on the right edge.
- precision_recall.svg: scatter of precision vs recall with F1
  isocurves as gray dashed reference lines. Top-right = ideal,
  top-left = cautious, bottom-right = greedy.
- boundary.svg: stacked horizontal bars per model with blue = start
  MAE and orange = end MAE, sorted by total error ascending. Total
  labeled at the right.
- token_efficiency.svg: scatter of tokens-per-ad (log x) vs F1.
  Upper-left = efficient zone; right side = reasoning-heavy models
  whose extra tokens may or may not buy more F1.

Charts section in the report now lists all 9 charts with one-paragraph
captions matching the depth of detail in the body sections. Histogram
caption updated to acknowledge anonymity and point to the alignment
chart for per-model attribution.

135 tests pass. Data unchanged; rendering only.

…arts

Adds the last three sections that had model data but no chart.

New charts:
- trial_variance.svg: horizontal bars of mean F1 stdev per model, with
  green/yellow/red coloring at the 0.02 and 0.05 thresholds. Visually
  surfaces models whose single-trial F1 numbers can't be trusted.
- detection_by_length.svg: heatmap of model (row) vs ad-length bucket
  (column, short/medium/long), cell = detection rate plus sample size.
  Rows sorted by overall detection rate, descending.
- detection_by_position.svg: same shape, columns are pre-roll /
  mid-roll / post-roll. Surfaces the common pattern where post-roll
  is systematically harder.
- parser_stress.svg: heatmap of model vs extraction method (call
  counts). Columns ordered by total usage; rows ordered by share of
  json_array_direct (the clean path) so easier-to-consume models are
  at the top.
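
As an illustration of the trial-variance metric and its color thresholds: the sketch below assumes sample standard deviation across trials, averaged over episodes, with boundary values falling into the higher band. The actual report may define these details differently, and both function names are hypothetical.

```python
from statistics import mean, stdev


def mean_f1_stdev(f1_by_episode):
    """Average, over episodes, of the F1 standard deviation across trials.

    f1_by_episode: {episode_id: [f1 for each trial]}
    Episodes with fewer than two trials have no spread and are skipped.
    """
    spreads = [stdev(trials) for trials in f1_by_episode.values() if len(trials) >= 2]
    return mean(spreads) if spreads else 0.0


def stability_color(value, low=0.02, high=0.05):
    """Map a mean stdev to the chart's green / yellow / red bands."""
    if value < low:
        return "green"
    if value < high:
        return "yellow"
    return "red"


# Usage: one perfectly stable episode and one that swings by 0.1 per trial.
spread = mean_f1_stdev({"ep1": [0.80, 0.80, 0.80], "ep2": [0.70, 0.80, 0.90]})
print(round(spread, 3))
```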

Total charts now 13. Each chart has a one-paragraph caption in the
Charts section explaining what it shows and how to read it.

Also: ran a programmatic AI-tic scan across all report.py prose
literals -- 0 hits across the Wikipedia high-frequency vocabulary.
The one em-dash flag in the scan was a false positive: 6 em-dashes
spread across 9 paragraphs in a single concatenated chart-captions
string, not 6 in one paragraph (one per chart, which is the intended
pattern).

135 tests pass.

Two related polish changes.

1. Em-dash and tone pass. After the user noted that em-dashes were
   still visible throughout the report, audited every prose string in
   report.py and applied 23+ targeted replacements:
   - Converted `**term** -- description` patterns to `**term**: description`
   - Converted parenthetical em-dashes ("X -- Y -- and Z") to commas
     or sentence breaks
   - Converted aside em-dashes ("X. Y -- bad for production") to
     proper sentence boundaries ("X. Y. Bad for production.")
   - Title-cased heading "Parser Stress Test" lowered to "Parser
     stress test" (sentence case, per humanizer rule 15)
   Final rendered-report distribution: 88 paragraphs with 0 em-dashes,
   1 with a single em-dash, 0 with 2+. Down from 17 + 2 = 19
   paragraphs with em-dashes before the pass.

   Other humanizer-rule audits all clean: 0 promotional, 0 copula
   avoidance, 0 filler, 0 hedging, 0 chatbot artifacts, 0 superficial
   -ing analyses, 0 formulaic "despite/conclusion", 0 false ranges.

2. Source-data links. Every chart in the Charts section now has a
   "Source data: [linked anchor]" line directly below the image,
   pointing at the corresponding tables in the body of the report.
   Readers who want to dig past the visual can click straight to the
   numbers. 13 source-data links wired up (one per chart, some chart
   captions point to multiple tables like Pareto -> Best Accuracy /
   Best Value / Best Free-Tier).

135 tests pass. Doc/report-only change.

Production parser: add strategy 4 (`_salvage_truncated_single_ad`)
to recover usable ad dicts from responses that ran out of token
budget mid-output (the existing four strategies all bail on the
structurally invalid JSON). Returns a dict only when both `start`
and `end` were recovered; otherwise the row is dropped rather than
fabricated. Tagged `json_object_single_ad_truncated`.
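
To illustrate the "both fields or nothing" rule, here is a simplified sketch. It is not the production `_salvage_truncated_single_ad`: it assumes numeric `start` / `end` values in seconds, uses a plain regex, and the `extraction_method` key is a hypothetical place to carry the tag.

```python
import re


def _recover_number(text, name):
    """Find a complete numeric value for a JSON field in possibly truncated text.

    A trailing ',' or '}' is required so that a number cut off mid-digit
    (for example '"end": 4' at the very end of the output) is rejected.
    """
    pattern = r'"%s"\s*:\s*(-?\d+(?:\.\d+)?)\s*[,}]' % re.escape(name)
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def salvage_truncated_single_ad(text):
    """Return an ad dict only when both start and end were recovered."""
    start = _recover_number(text, "start")
    end = _recover_number(text, "end")
    if start is None or end is None:
        return None  # drop the row rather than fabricate a boundary
    return {
        "start": start,
        "end": end,
        "extraction_method": "json_object_single_ad_truncated",
    }


# Usage: the response ran out of tokens inside the free-text field.
truncated = '{"start": 12.5, "end": 48.0, "reason": "Sponsor read for'
print(salvage_truncated_single_ad(truncated))
```

Returning `None` when either boundary is missing keeps a truncated response from turning into an ad span with a guessed end time.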

Benchmark: max_tokens is now config-driven via `[run].max_tokens`,
defaulting to production's `AD_DETECTION_MAX_TOKENS` so the
benchmark matches the live app's budget. `schema_audit` now imports
production's `STRUCTURAL_FIELDS` and `SPONSOR_PRIORITY_FIELDS` to
stop flagging fields the live parser already accepts.

Benchmark code cleanup: drop local duplicates of `parse_timestamp`
and `format_time` in favor of `utils.time`; use `dataclasses.asdict`
for violations; convert derived ModelStats fields to `@property`;
precompute per-model indexes in `_aggregate` and call counts in
`_render_failures`; cache `avg_f1` / `mean_f1_stdev` on ModelStats;
fix the brittle HTTP-5xx classifier (`\b5\d{2}\b`); extract
length/position bucket helpers.

Corpus: add two new verified episodes (`it-s-a-thing`,
`the-brilliant-idiots`).

Rolls 11 of 12 open Dependabot PRs into 2.1.7. PR #208 (ubuntu
24.04 -> 26.04) held: the GPU image is pinned to ubuntu 24.04
through nvidia/cuda:12.9.1-runtime-ubuntu24.04 and nvidia/cuda has
no ubuntu 26.04 variant, so taking it would split the GPU and CPU
base images across a major OS version.

pip (5):
- bump anthropic 0.97.0 -> 0.100.0
- bump cryptography 47.0.0 -> 48.0.0
- bump gunicorn 25.3.0 -> 26.0.0
- bump huggingface-hub 1.13.0 -> 1.14.0
- bump openai 2.33.0 -> 2.36.0

npm (5):
- bump @tanstack/react-query 5.100.5 -> 5.100.9
- bump react 19.2.5 -> 19.2.6
- bump react-router-dom 6.30.3 -> 7.15.0 (major; tsc clean, no API surface change required)
- bump tailwind-merge 3.5.0 -> 3.6.0
- bump vite-plugin-pwa 1.2.0 -> 1.3.0

docker (1):
- bump node 24-alpine -> 26-alpine (both GPU and CPU Dockerfiles)

README: refresh the Cloud LLMs recommended-model table with the
latest 32-model / 7-episode / 14,400-call benchmark numbers.

Verification:
- pytest tests/ -q -> 941 passed, 4 skipped
- benchmarks/llm pytest -> 135 passed
- frontend: npx tsc --noEmit clean, npm run build successful
- docker build --platform=linux/amd64 (GPU) -> success
- docker build --platform=linux/amd64 -f Dockerfile.cpu (CPU) -> success