
docker(deps): bump ubuntu from 24.04 to 26.04 #208

Open

dependabot[bot] wants to merge 1 commit into main from dependabot/docker/ubuntu-26.04

Conversation


dependabot[bot] commented on behalf of GitHub on May 11, 2026

Bumps ubuntu from 24.04 to 26.04.

dependabot[bot] added the dependencies label (Pull requests that update a dependency file) on May 11, 2026
dependabot[bot] (Author) commented on behalf of GitHub on May 11, 2026

Labels

The following labels could not be found: docker. Please create it before Dependabot can add it to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

ttlequals0 added a commit that referenced this pull request May 11, 2026
Rolls 11 of 12 open Dependabot PRs into 2.1.7. PR #208 (ubuntu
24.04 -> 26.04) held: the GPU image is pinned to ubuntu 24.04
through nvidia/cuda:12.9.1-runtime-ubuntu24.04 and nvidia/cuda has
no ubuntu 26.04 variant, so taking it would split the GPU and CPU
base images across a major OS version.

pip (5):
- bump anthropic 0.97.0 -> 0.100.0
- bump cryptography 47.0.0 -> 48.0.0
- bump gunicorn 25.3.0 -> 26.0.0
- bump huggingface-hub 1.13.0 -> 1.14.0
- bump openai 2.33.0 -> 2.36.0

npm (5):
- bump @tanstack/react-query 5.100.5 -> 5.100.9
- bump react 19.2.5 -> 19.2.6
- bump react-router-dom 6.30.3 -> 7.15.0 (major; tsc clean, no API surface change required)
- bump tailwind-merge 3.5.0 -> 3.6.0
- bump vite-plugin-pwa 1.2.0 -> 1.3.0

docker (1):
- bump node 24-alpine -> 26-alpine (both GPU and CPU Dockerfiles)

README: refresh the Cloud LLMs recommended-model table with the
latest 32-model / 7-episode / 14,400-call benchmark numbers.

Verification:
- pytest tests/ -q -> 941 passed, 4 skipped
- benchmarks/llm pytest -> 135 passed
- frontend: npx tsc --noEmit clean, npm run build successful
- docker build --platform=linux/amd64 (GPU) -> success
- docker build --platform=linux/amd64 -f Dockerfile.cpu (CPU) -> success
ttlequals0 added a commit that referenced this pull request May 11, 2026
* feat(benchmark): scaffold + foundation modules

Adds benchmarks/llm/ uv project with:
- pyproject.toml, .env.example, benchmark.toml.example, .gitignore
- src/benchmark/ package with sys.path bootstrap to import MinusPod modules
- config.py: TOML loader with provider/model/run validation (11 tests)
- auth.py: cookie-cached MinusPod session with 23h TTL + login probe
- truth_parser.py: ground-truth file parser with structural+logical+
  cross-reference validation (26 tests)
- corpus.py: corpus loader, segments hashing, windows.json round-trip
  using MinusPod's create_windows (9 tests)
- capture.py: episode-URL parsing, /original-segments fetch, truth.txt
  pre-population from production ad markers (11 tests)
- pricing.py: thin wrapper over MinusPod's pricing_fetcher (LiteLLM)
  with snapshot read/write (6 tests)
- parsing.py: re-exports of lifted ad-detector functions
- storage.py: atomic JSONL append/fsync, prompt+response artifact
  writers, dedup index, sanitized error recording (13 tests)
- metrics.py: IoU greedy match, P/R/F1, boundary MAE, JSON compliance
  scoring, schema violation audit, trial stdev (26 tests; matching sketch
  below)

Runner, report generator, CLI, and READMEs land in follow-up commits.
102 module-level tests pass.
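
A minimal sketch of the IoU greedy match named in the metrics bullet,
assuming ads are plain (start, end) second intervals; the 0.5 threshold
and every name here are illustrative, not the module's actual API:

    # Sketch only: interval shape and match threshold are assumptions.
    def iou(a, b):
        """Intersection-over-union of two (start, end) intervals."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def greedy_match(preds, truths, threshold=0.5):
        """Pair predictions with truths by descending IoU, then score."""
        pairs = sorted(
            ((iou(p, t), pi, ti)
             for pi, p in enumerate(preds)
             for ti, t in enumerate(truths)),
            reverse=True,
        )
        used_p, used_t, matches = set(), set(), []
        for score, pi, ti in pairs:
            if score < threshold:
                break  # sorted descending, nothing later can match
            if pi not in used_p and ti not in used_t:
                used_p.add(pi)
                used_t.add(ti)
                matches.append((pi, ti, score))
        tp = len(matches)
        fp, fn = len(preds) - tp, len(truths) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return matches, precision, recall, f1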

* feat(benchmark): offline LLM ad-detection benchmark tool

Adds benchmarks/llm/ self-contained uv project that compares LLMs on
ad-detection accuracy, cost, latency, and JSON compliance using real
MinusPod transcripts.

Modules:
- config: TOML loader + provider/model/run validation
- auth: cookie-cached MinusPod session, 23h TTL, 429 reported (no auto-retry)
- truth_parser: ground-truth parser, structural+logical+cross-ref validation
- corpus: episode loader, segments-hash drift detection, public load_metadata
  / load_segments helpers
- capture: episode-URL parsing, /original-segments fetch, truth.txt
  pre-population from production ad markers
- pricing: thin wrapper over MinusPod's pricing_fetcher (LiteLLM-backed),
  snapshot read/write with O(1) indexed lookup
- parsing: re-exports of lifted ad_detector functions
- storage: atomic JSONL append/fsync, single-pass scan_calls returning
  (completed, errored), prompt+response artifact writers, sanitized errors
- metrics: IoU greedy match, P/R/F1, boundary MAE, schema-violation audit,
  prefix-aware compliance scoring matching production extraction methods
- llm: AsyncAnthropic + AsyncOpenAI dispatch with native json_object +
  prompt-injection fallback, retry-with-backoff classification
- runner: build (model, episode, trial, window) work list, dedup against
  calls.jsonl, asyncio.gather with global+per-provider semaphores (see the
  concurrency sketch below); user prompt cached per (episode, window) --
  98% fewer build calls on full sweeps
- report: aggregate + render Markdown report (TL;DR, headline, per-model,
  per-episode, parser stress, methodology, run metadata) + SVG Pareto chart
- cli: typer subcommands -- capture, verify, regenerate-windows,
  list-episodes, validate, refresh-pricing, run, report, archive

Scope is sourced from benchmark.toml ([[models]]) and data/corpus/ at
runtime; no --model / --episode / --trials CLI filters.

127 unit tests pass. Async LLM dispatch is exercised against real provider
APIs at run time. .gitignore negates data/ for benchmarks/llm/data/ so
corpus + pricing snapshots commit while leaving repo-root data/ ignored.

No runtime image impact: benchmarks/ already in .dockerignore (v2.0.25).
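
The runner's dispatch shape, sketched. The caps, provider names, and
work-item shape are illustrative assumptions, not the real config:

    import asyncio

    GLOBAL_CAP = asyncio.Semaphore(16)  # illustrative limits
    PROVIDER_CAPS = {"anthropic": asyncio.Semaphore(4),
                     "openai": asyncio.Semaphore(8)}

    async def execute(item, call_fn):
        # Both caps must be free before the call is dispatched.
        async with GLOBAL_CAP, PROVIDER_CAPS[item["provider"]]:
            return await call_fn(item)

    async def run_all(work_list, call_fn):
        # Every task is scheduled up front; the semaphores throttle
        # how many LLM calls are actually in flight.
        return await asyncio.gather(*(execute(w, call_fn) for w in work_list))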

* chore(benchmark): code-review followups

- Wrap post-LLM I/O (parse, schema audit, write_response, write_prompt,
  append_jsonl) in try/except so a parse or disk error in one execute()
  doesn't propagate out of asyncio.gather and cancel the other in-flight
  tasks. Errored records are still appended; the JSONL write itself is
  the only operation that can lose a row, and that path now logs
  defensively rather than re-raising.
- write_response now mirrors write_prompt's tmp+os.replace atomic write
  (sketch below) so a crash mid-write can't leave a truncated file
  referenced by an fsynced calls.jsonl row.
- llm.call_with_retry's response_format fallback now classifies
  RateLimitError/APITimeoutError/APIConnectionError as transient on the
  retry path too (was only classifying APIStatusError).
- precompute_prompt_hashes skips deprecated models so the hash dict
  matches build_work_list's filter.
- pricing snapshot filenames include microseconds; two refresh-pricing
  calls in the same second no longer overwrite each other.
- Hoisted inline imports flagged by review (`import re` in storage,
  `from rapidfuzz import fuzz` in truth_parser, `from .storage import
  read_jsonl` in runner, sibling-module imports in cli._preview).
- Replaced the em dash in auth.py's 429 error message with ASCII `--`.

127 unit tests pass.
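
A sketch of the tmp + os.replace pattern described above, plus the
fsynced JSONL append; function names and paths are assumptions, not the
storage module's real API:

    import json
    import os
    import tempfile

    def write_atomic(path, data: bytes):
        # Write to a temp file in the same directory, fsync, then rename.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic: no reader sees a partial file
        except BaseException:
            os.unlink(tmp)
            raise

    def append_jsonl(path, record: dict):
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # row is durable before the caller moves on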

* chore(benchmark): hoist remaining sibling-package and MinusPod imports

Moves the deferred imports of create_windows, normalize_model_key,
fetch_litellm_pricing, and corpus.load_metadata/load_segments to
module level. The path bootstrap in benchmark/__init__.py runs
before any submodule loads, so MinusPod's src/ is already on
sys.path by the time these imports execute.

The remaining inline imports in llm.py (anthropic + openai SDKs)
and report.py (matplotlib) stay deferred on purpose: the SDKs only
load when their provider is actually used, and matplotlib (~50MB)
only loads when render is called.

* docs(benchmark): note MinusPod >=2.0.26 requirement

The benchmark capture flow calls
GET /api/v1/feeds/{slug}/episodes/{id}/original-segments, an endpoint
added in 2.0.26. Pointing the tool at an older server returns 404 with
no obvious user-facing signal. README now states the minimum version
explicitly alongside the existing path-bootstrap note.

* feat(benchmark): auto-load .env, document parent-venv install

Two related fixes:

- Add python-dotenv as a runtime dep and load benchmarks/llm/.env at CLI
  import time. Path is resolved relative to the package (parents[2]) so
  the load works regardless of where `benchmark` is invoked from.
  Shell-exported variables still win over .env values (override=False).

- README: switch the install instruction from `uv sync` (which only
  satisfies the benchmark's own deps) to `pip install -e .` into the
  MinusPod parent venv. The benchmark imports modules from MinusPod's
  src/ at runtime; those modules transitively need MinusPod's runtime
  deps (jinja2, flask, etc.) which are only present in the parent venv.
  Running from a fresh benchmark-only venv fails at first import of
  benchmark.cli with ModuleNotFoundError on jinja2.
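
The load itself, sketched. This assumes the CLI module sits at
benchmarks/llm/src/benchmark/cli.py, so parents[2] resolves to
benchmarks/llm/:

    from pathlib import Path

    from dotenv import load_dotenv  # python-dotenv

    # parents[2] from src/benchmark/cli.py -> benchmarks/llm/ (assumed layout)
    ENV_PATH = Path(__file__).resolve().parents[2] / ".env"
    load_dotenv(ENV_PATH, override=False)  # shell-exported variables still win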

* docs(benchmark): document uv run as primary install path

Previous version sent users to install editable into MinusPod's parent
venv, which works but isn't the natural uv-project flow. The reason the
benchmark needs MinusPod's runtime deps at all is that ad_detector does
a module-level `from webhook_service import fire_auth_failure_event`
even though webhook_service is only used inside the LLM error path; the
benchmark only wants create_windows but Python loads the whole module's
imports.

Two equivalent paths now documented:

- uv-native: `uv sync && uv pip install -r ../../requirements.txt`,
  then `uv run benchmark <cmd>`. Layered install needed because the
  benchmark's own pyproject.toml deliberately does not duplicate
  MinusPod's full runtime stack.

- parent-venv: install editable into ../../.venv and call
  ../../.venv/bin/benchmark directly.

Both work. uv run is now the primary path.

* fix(2.0.28): defer webhook_service and bs4 imports for benchmark consumers

src/ad_detector.py and src/pricing_fetcher.py both did module-level imports
of dependencies they only need on a single code path:

- ad_detector.fire_auth_failure_event is only called inside the LLM dispatch
  loop when is_auth_error(e) is true. webhook_service transitively imports
  jinja2.
- pricing_fetcher.BeautifulSoup is only used inside fetch_pricepertoken_pricing.
  fetch_litellm_pricing does not need it.

Both imports are now deferred to the function bodies that actually use them.
No behavioural change: the webhook fires under identical conditions and the
HTML-scraping path parses identically.

The benchmark in benchmarks/llm/ imports create_windows from ad_detector and
fetch_litellm_pricing from pricing_fetcher via a sys.path bootstrap. Before
this commit those imports forced jinja2 + beautifulsoup4 into the benchmark's
venv, which uv sync had no reason to install (they're not in benchmark's
pyproject.toml), so `uv run benchmark <cmd>` failed at first import. After
this commit `uv run benchmark validate` and `refresh-pricing` work cleanly
on a fresh `uv sync`.

Also adds requests>=2.31 to benchmarks/llm/pyproject.toml since pricing_fetcher
uses it directly, drops the now-unnecessary `uv pip install -r ../../requirements.txt`
step from the benchmark README, and seeds data/pricing_snapshots/ with the
first snapshot fetched against the live LiteLLM table (2233 models).

128 ad_detector + pricing tests pass; 127 benchmark tests pass.
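
The lazy-import shape in question, sketched with an illustrative body
(the real fetch_pricepertoken_pricing does more than this):

    def fetch_pricepertoken_pricing(html: str):
        # Deferred so importing pricing_fetcher never requires
        # beautifulsoup4; the dependency loads only when the
        # HTML-scraping path actually runs.
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(html, "html.parser")
        return [cell.get_text(strip=True) for cell in soup.select("td")]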

* feat(benchmark): auto-reject suspect ad markers during capture

The capture template pre-populates truth.txt from MinusPod's accepted
production ad markers. On real podcasts the production detector emits
some false positives that have to be deleted by hand on every capture:

1. Markers placed around silence or music where Whisper hallucinated a
   short phrase like "Thank you for watching." (no actual ad audio).
2. Markers placed around main content the production prompt happened to
   classify as ad-like (the host discussing security topics, etc).

Adds _classify_marker(marker, segments) -> (accepted, reason); a sketch
of the heuristics follows this message. Markers that fail any of these
checks are routed to the existing commented "rejected" section with an
`auto-rejected: <reason>` annotation, so the human reviewer can still
uncomment them if the heuristic is wrong:

- text in range matches a known Whisper hallucination phrase only
  ("Thank you for watching.", etc).
- text density < 3 chars/sec (well below normal speech ~12-16 chars/sec).
- text contains none of the strong ad signals: "brought to you by",
  "sponsor", ".com slash" / ".com/", "promo code", "use code",
  "discount code", "free trial", "sign up at", "get started at",
  "listen at", "visit X".

The "rejected" header is reused; production-rejected markers now carry
"# --- (rejected by production)" and auto-rejected markers carry
"# --- (auto-rejected: <reason>)" so the source is unambiguous.

Validated against the security-now-audio episode just captured: 11 raw
markers from production, 7 accepted (all confirmed real sponsor reads),
4 auto-rejected (2 Whisper hallucinations, 2 main-content false
positives). 134 benchmark tests pass.
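
A sketch of those heuristics. The 3 chars/sec floor and the signal
phrases come from this message; the marker shape, the helper name, and
the (abbreviated) hallucination list are assumptions:

    HALLUCINATION_PHRASES = {"thank you for watching."}  # subset
    AD_SIGNALS = (
        "brought to you by", "sponsor", ".com slash", ".com/",
        "promo code", "use code", "discount code", "free trial",
        "sign up at", "get started at", "listen at",
    )

    def classify_marker(text: str, start_sec: float, end_sec: float):
        """Return (accepted, reason) for one candidate ad marker."""
        lowered = text.strip().lower()
        if lowered in HALLUCINATION_PHRASES:
            return False, "Whisper hallucination only"
        duration = max(end_sec - start_sec, 1e-3)
        if len(text) / duration < 3:  # normal speech ~12-16 chars/sec
            return False, "text density below 3 chars/sec"
        if not any(signal in lowered for signal in AD_SIGNALS):
            return False, "no strong ad signal"
        return True, "accepted"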

* feat(benchmark): interleave rejected ad markers in chronological order

Previously, the truth.txt template grouped all rejected markers (auto
and production) at the bottom of the file, after every accepted block.
That broke a workflow: when a reviewer uncommented a rejected marker
to keep it, validate_logical's "ads must be ordered by start"
constraint would refuse to verify until they manually moved it back
into time order somewhere up above.

New layout: build a single chronologically sorted list of all markers
(accepted + auto-rejected + production-rejected) and render in that
order. Rejected blocks sit in their natural time slot, commented out,
with a one-line annotation above the start/end/text:

    ---
    # auto-rejected: Whisper hallucination only
    # start: 16:23.00
    # end:   16:49.10
    # text:  ...
    ---

Uncommenting then yields a valid block in the right time slot. No
section headers, no shuffling required. Comment lines (including the
auto-rejected: annotation) are ignored by the parser.

A short user-facing instruction is added at the top of any file with
rejections: "remove the '# ' prefix from start/end/text lines to
accept a rejected block."

134 -> 135 benchmark tests pass.

* fix(2.0.29): hoist DEFAULT_SYSTEM_PROMPT out of database for benchmark consumers

Same root cause as 2.0.28's lazy-import fixes (webhook_service, bs4).
ad_detector.get_static_system_prompt does `from database import
DEFAULT_SYSTEM_PROMPT`, which runs database/__init__.py end-to-end
before the constant is accessible. That __init__ loads
database.settings, which module-level imports secrets_crypto, which
requires the cryptography package.

The offline benchmark in benchmarks/llm/ calls get_static_system_prompt
inside `run --dry-run` and `run`. A fresh `uv sync` venv has no
cryptography, so dry-run blew up at import:

    ModuleNotFoundError: No module named 'cryptography'

DEFAULT_SYSTEM_PROMPT is a plain ~8KB multi-line string. It has no
database semantics. Move it to src/utils/constants.py (already stdlib-
only, already home to SEED_SPONSORS), and re-export from
src/database/__init__.py for backward compat. Update the two
`from database import DEFAULT_SYSTEM_PROMPT` sites in src/ad_detector.py
to import from utils.constants directly so the benchmark's import path
no longer touches the database package.

Verified: `uv run benchmark run --dry-run` now reports 4,760 calls would
execute against the 5-episode corpus. 162 ad_detector + database +
settings unit tests pass.
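
The hoist-and-re-export shape, sketched (the prompt text itself elided):

    # src/utils/constants.py -- stdlib-only, safe for offline consumers
    DEFAULT_SYSTEM_PROMPT = """...full ~8KB prompt text lives here..."""

    # src/database/__init__.py -- backward-compat re-export; importing it
    # still runs the whole database package, so benchmark-facing code in
    # ad_detector imports the constant from utils.constants directly.
    from utils.constants import DEFAULT_SYSTEM_PROMPT  # noqa: F401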

* feat(benchmark): memoize Anthropic temperature deprecation + add OpenRouter attribution

Two related dispatch-side fixes observed during the first live run.

1. Claude 4.x family models (opus 4.7, sonnet 4.6, haiku 4.5) reject
   `temperature` with HTTP 400 "`temperature` is deprecated for this
   model.". The previous version retried without temperature on 400 but
   relearned the deprecation on every call, paying ~1020 wasted
   round-trips on a full 5-episode sweep. Now memoized in a process-
   level set: first 400 per model adds the model_id, subsequent calls
   skip `temperature` upfront. Worst-case waste is one round-trip per
   affected model per process. Older Claude models that still accept
   `temperature` are unaffected.

2. OpenRouter recommends HTTP-Referer + X-Title headers for app
   attribution. The OpenAI SDK passes them via default_headers. Detected
   by base_url containing "openrouter.ai" so it doesn't leak to other
   openai_compatible providers.

135 benchmark tests pass. The first run captured 335 successful calls
across 5 episodes for claude-opus-4-7 with 0 errors and median
compliance 1.000, validating both code paths against the live API.
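
Both fixes sketched together. The memo-set name and cap logic are
illustrative; default_headers is the OpenAI SDK's documented constructor
argument, and the referer uses the repo's placeholder hostname:

    from openai import AsyncOpenAI

    # Process-level memo: model IDs that 400'd on `temperature`.
    _TEMPERATURE_DEPRECATED: set[str] = set()

    def request_kwargs(model_id: str, temperature: float) -> dict:
        kwargs = {"model": model_id}
        if model_id not in _TEMPERATURE_DEPRECATED:
            kwargs["temperature"] = temperature  # skipped after first 400
        return kwargs

    def note_temperature_deprecated(model_id: str) -> None:
        _TEMPERATURE_DEPRECATED.add(model_id)  # one wasted trip per model

    def make_client(base_url: str, api_key: str) -> AsyncOpenAI:
        headers = {}
        if "openrouter.ai" in base_url:  # attribution stays OpenRouter-only
            headers = {"HTTP-Referer": "https://your-minuspod.example.com",
                       "X-Title": "MinusPod LLM benchmark"}
        return AsyncOpenAI(base_url=base_url, api_key=api_key,
                           default_headers=headers)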

* fix(benchmark): defer webhook_service import in utils/llm_call.py

Same lazy-import pattern as 2.0.28 but for the new src/utils/llm_call.py
introduced by PR #205 (2.1.6). The module-level

    from webhook_service import fire_auth_failure_event

at line 14 pulled in jinja2 transitively for any consumer of the new
call_llm_for_window helper -- including the benchmark, which imports
ad_detector -> utils.llm_call. Move the import into the call site
inside the is_auth_error(e) branch where it's actually needed.

No behavioural change for production: webhook fires under identical
conditions. The benchmark in benchmarks/llm/ can now import the
ad_detector module chain without cryptography/jinja2/flask in its venv.

299 unit tests pass. Benchmark dry-run reports 2,592 remaining calls
(2,168 already done from the in-flight run, dedup'd via prompt_hash).

* feat(benchmark): metric key, charts, failure surfacing in report

Five upgrades based on first-real-run feedback:

1. Metric Key section moved to the top of the report. Replaces the
   short "How to Read" footnote with a tabular glossary (range,
   direction, plain-English meaning) for F1, cost, F1/$, latency,
   JSON compliance, no-ad PASS/FAIL, F1 stdev. Latency entry calls
   out the OpenRouter routing-layer caveat -- it includes upstream
   queueing, not just model-side compute, so it should be treated
   as a load/availability indicator rather than a model-quality
   signal.

2. Pareto chart redone with numbered points + a sorted legend on
   the right. Matplotlib's default ax.annotate at the data point
   produced overlapping text when models clustered (very common at
   the low-cost end). Numbered markers eliminate that.

3. New JSON compliance bar chart (compliance.svg). Horizontal bars
   sorted ascending, color-coded green/yellow/red against >=0.95 and
   >=0.7 thresholds, with a dotted reference line at 0.95.

4. New per-episode F1 heatmap (episodes.svg). Models on Y (sorted by
   avg F1 desc), ad-bearing episodes on X (no-ad excluded). Reveals
   model-content interaction that the aggregated F1 hides -- e.g. some
   models that score well overall struggle on specific episodes.

5. New "Failures and provider issues" section. Categorizes errors
   into buckets (provider content moderation, deprecated parameter,
   rate-limited, 5xx, etc.), shows per-model error counts, includes
   sample error messages, and explains why the failures matter for
   production model selection. Triggered by the qwen 3.5-plus run
   where Alibaba's content classifier rejected one transcript window
   (a real production gotcha you can't see from F1 alone).

3 report tests still pass.

* fix(benchmark): pareto chart -- distinct colors per model, legend below

Each model now gets a distinct color from matplotlib's tab20 colormap.
The legend sits below the plot as a real matplotlib legend (instead of
the right-side text box), so each model's color swatch is rendered
adjacent to its name + (F1, cost) summary.

Layout: 2-column legend for >6 models, 1-column otherwise. Bottom
margin scales with row count (capped at 0.55) so the legend always
fits without overlapping the axes.

3 report tests still pass.

* feat(benchmark): expanded report with 10 new analytical sections

Adds the high-signal sections we identified after the first run.
Every addition reuses data already in calls.jsonl -- no schema change
to the wire format, just new aggregations and renderers.

New stats on ModelEpisodeStats / ModelStats:
- Per-trial precision, recall, TP/FP/FN, boundary start/end MAE
- Per-model output-tokens-total, detected-ads-total, tokens-per-ad
- p90 / p99 / max latency in addition to p50 / p95
- Cost-per-true-positive

New side data captured during aggregation (returned as _Extras):
- Calibration: per-model list of (self-reported confidence, was-TP)
- Cross-model agreement: per (episode, window), models predicting an ad
- Detection-by-bucket: hit rate by ad length and ad position

New report sections:
1. Precision / recall / FP / FN breakdown (F1 hides which side errs)
2. Boundary accuracy -- start/end MAE for matched ads
3. Confidence calibration table (binned, with hit rate per bin)
4. Latency tail -- p50/p90/p95/p99/max
5. Output token efficiency -- tokens per detected ad + cost per TP
6. Trial variance -- determinism check at temp=0
7. Cross-model agreement -- N-of-K vote distribution per window
8. Detection rate by ad length (short/medium/long)
9. Detection rate by ad position (pre/mid/post-roll)

New charts:
- calibration.svg -- reliability diagram (confidence vs hit rate)
- latency_tail.svg -- p50/p90/p99/max bars on log scale per model

Charts section now lists all five charts inline. Existing pareto +
compliance + episodes charts unchanged.

135 tests pass. Report regenerated against the current calls.jsonl
surfaces several findings that were invisible before:
- Most models are wildly overconfident (claim 0.95-0.99 confidence,
  hit rate 20-50%); phi-4 is overconfident at 4% hit rate
- Boundary MAE on start can be 18-24s for some models even when F1 is OK
- Every model misses post-roll ads more than pre/mid-roll
- 22/68 windows show 13+ models agreeing -- candidate ensemble pool

* docs(2.1.7): contribution guide for benchmarks/llm/

Adds benchmarks/llm/CONTRIBUTING.md so outside contributors can ship PRs
expanding the corpus or the model list without reverse-engineering the
workflow from the existing README pair.

Covers the three accepted PR shapes (new episode, new model, code) with
the exact diff each one should produce, what `benchmark verify` already
validates so reviewers don't redo it by hand, and the copyright/PII bar
for transcript excerpts (a gap not covered in the existing READMEs).
Cross-references README.md and data/README.md rather than duplicating
their content; total length 117 lines.

Doc passes the Wikipedia signs-of-AI-writing scan: zero hits across the
~25 high-frequency AI-vocabulary tokens checked, em-dash density 4/117
all in `term -- definition` bullet form (compliant with humanizer's
"one em dash per paragraph" rule). All example URLs use the
your-minuspod.example.com placeholder per the public-repo URL hygiene
rule in CLAUDE.md.

No code, schema, or test impact. Bumps version.py + openapi.yaml to
2.1.7 per CLAUDE.md "always update version.py with changes" rule.

* data(benchmark): commit verified corpus + first full-run results

Adds the 5-episode verified corpus and the artifacts from the first
14-model sweep so anyone cloning the repo can audit the results
without re-running the benchmark.

Corpus:
  data/corpus/ep-ai-cloud-essentials-e8dc897fbd6b/  (no-ad control, 16 min)
  data/corpus/ep-daily-tech-news-show-c1904b8605f7/ (4 active ads)
  data/corpus/ep-glt1412515089-373d5ba5007b/        (4 active ads)
  data/corpus/ep-security-now-audio-2850b24903b2/   (6 active ads)
  data/corpus/ep-the-tim-dillon-show-f62bd5fa1cfe/  (6 active ads)
Each episode dir contains metadata.toml, segments.json (Whisper-byte-
exact), truth.txt (human-verified ad markers), and windows.json.

Run artifacts (first 14-model sweep, 4760 total LLM calls, 1 error):
  results/raw/calls.jsonl                  (8.0 MB, one row per call)
  results/raw/episode_results.jsonl        (492 KB, per-trial aggregates)
  results/raw/prompts/                     (1246 files, 15 MB)
  results/raw/responses/                   (6196 files, 23 MB)
  results/report.md                        (32 KB rendered report)
  results/report_assets/*.svg              (5 charts, 590 KB)
  data/pricing_snapshots/...               (LiteLLM snapshot)

Updates CONTRIBUTING.md "what's in the repo vs your PR" table to match:
calls.jsonl + episode_results.jsonl + prompts/ + responses/ are
maintainer-committed for audit; contributors don't include them in
their own PRs.

No app code changed; no version bump. Total commit ~54 MB across ~7,470
files. Maintainers should expect repo size to grow when new episodes
or models are merged and the next sweep is run.

* docs(readme): add cloud LLM picks from benchmark + link to bench tool

Restructures the existing "Recommended Models" section to add a Cloud
LLMs subsection on top of the existing Ollama subsection. The cloud
picks come from the offline benchmark in benchmarks/llm/ -- four
recommendations covering best-accuracy, best-Anthropic-direct, best
free-tier, and cheap-and-fast use cases, with F1 / cost / latency
caveats per option.

Cross-references the benchmark tool, the rendered report, and the
contribution guide so readers who want to validate or expand the list
can. Notes the current corpus is 5 episodes and numbers will refine as
the corpus grows.

Existing local-Ollama tables (Pass 1, Verification, Chapters) are
preserved verbatim, just nested under a new Local Ollama Models
heading and demoted from #### to ##### accordingly.

ToC: adds "Recommended Models" as a nested entry under
"Using Ollama (Local or Cloud)" so the section is reachable from the
top of the doc.

Doc-only change. Doc passes the AI-tic vocabulary scan.

* docs(readme): rework local-Ollama + cloud-vs-local + JSON sections

Three updates anchored by the new offline benchmark data.

1. Local Ollama Models intro: prefix the section with a note that the
   benchmark covers cloud-hosted models only and Ollama runs are not in
   the sweep, so contributors who want apples-to-apples local numbers
   would need to extend the benchmark with an Ollama provider. Also
   trims the unverified "in the same tier as Claude Sonnet" claim on
   qwen3.5:122b -- the cloud benchmark scored Sonnet 4.6 at F1 0.33,
   not a clear ceiling, so the comparison no longer fits.

2. "Accuracy vs. Claude" -> "Cloud vs. Local: What Changes". The old
   title implied Claude was the universal frontier; the cloud benchmark
   shows Grok-4.1-fast (F1 0.61) clearly beats every Claude variant,
   with Sonnet at 0.33. Reframes the section as cloud-vs-local rather
   than open-weights-vs-Claude. Adds a row on network-inserted brand-
   tagline ads (the cloud benchmark's per-bucket detection rates show
   even frontier models miss roughly a third of these, useful prior
   for setting expectations on local). Closes with how to measure the
   gap on your own content via the bench tool.

3. JSON Reliability Risks: grounds the cloud-vs-local framing in real
   data. The benchmark report shows JSON compliance from 0.53 (verbose
   reasoning models) to 1.00 (Mistral / Qwen / Opus); Claude Haiku 4.5
   at 0.60 because it markdown-fences every response. So compliance
   variance is not unique to local; cloud models exhibit it too. Links
   to the JSON-compliance chart in the report.

Also updates the LLM accuracy notice in the disclaimer area to point
at the renamed section and link the benchmark report instead of
restating "most testing was done with Claude" (no longer accurate now
that the benchmark covers 14 cloud models).

Doc-only change. AI-tic vocabulary scan: 0 hits in added prose.

* revert(2.1.7): unbump version; CONTRIBUTING.md add is doc-only

The 2.1.7 commit (4c481f9) bumped version.py and openapi.yaml for a
docs-only change. By policy ("don't bump for non-app changes"), that
bump should not have happened. Reverts both files to 2.1.6 and removes
the 2.1.7 entry from CHANGELOG since no Docker image will ship under
that tag.

The src/ refactors that landed since 2.1.6 (lazy webhook_service
imports in ad_detector.py and utils/llm_call.py, lazy bs4 import in
pricing_fetcher.py, DEFAULT_SYSTEM_PROMPT hoist to utils/constants.py
with a re-export in database/__init__.py) are behavior-preserving --
they enabled the offline benchmark in benchmarks/llm/ to import these
modules without dragging in jinja2/flask/cryptography/bs4. Runtime is
byte-identical to main's 2.1.6, so no Docker rebuild is needed.

Net effect on a deployed instance:
- Version string still 2.1.6 (matches the live image).
- No behavior change.
- New CONTRIBUTING.md, expanded benchmark report, corpus data, and
  README cloud-LLM picks ride along on this branch without bumping
  the version.

* fix(benchmark): classify empty-choices Cohere responses as transient

OpenRouter sometimes returns HTTP 200 with `choices=None` in the body
when the upstream provider hits its own internal read-timeout. Observed
on cohere/command-r-plus-08-2024 at the 120s mark for windows that
take long to reason about. The OpenAI SDK doesn't raise -- it returns
a ChatCompletion object where `msg.choices` is None instead of [],
and our code did `msg.choices[0]` which crashes with
"'NoneType' object is not subscriptable".

Fix: defensively coerce None to [] and raise LLMTransientError when
choices is empty (or the first choice has no message). The runner's
existing retry-on-LLMTransientError path then retries with backoff
instead of crashing the call.

Impact on the in-flight run: minimal. The 4 affected Cohere calls
(out of 340) are already excluded from F1 / compliance / cost
aggregates by `if rec.get('error'): continue`, so Cohere R+ still has
336 valid samples. The fix prevents recurrence and lets
`benchmark run --retry-errors` retry the 4 timed-out windows cleanly
after the run completes.

No version bump; benchmark-only code, doesn't ship in the Docker image.
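
The defensive coercion, sketched; LLMTransientError stands in for the
benchmark's own transient-error type:

    class LLMTransientError(Exception):
        """Retryable failure; the runner backs off and retries."""

    def first_message(completion):
        choices = completion.choices or []  # SDK can hand back None, not []
        if not choices or choices[0].message is None:
            raise LLMTransientError("empty choices (upstream provider timeout)")
        return choices[0].message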

* fix(benchmark): split Best Value vs Best Free-Tier in report TLDR

The Best Accuracy table held all 32 models; Best Value (F1 / cost)
silently dropped the 10 free-tier rows because F1 / 0 is infinity. That
visible asymmetry raised the question "where did half the table go?"
without answering it.

Splits the leaderboard into three sections:

1. Best Accuracy -- unchanged, all models by F1
2. Best Value -- paid-tier only, by F1 / $
3. Best Free-Tier (new) -- $0.00 rows ranked by F1 alone

Each section now has a one-line note explaining its scope, including
a heads-up that OpenRouter free-tier eligibility depends on the
attribution headers (HTTP-Referer, X-Title) wired in src/benchmark/llm.py,
so a model that shows as free here may bill on a deployment that
doesn't send those headers.

135 tests pass. Doc/report-only change; no version bump.

* fix(benchmark): calibration chart -- heatmap instead of line overlay

The reliability-diagram-style line plot crowded 30+ models into the
high-confidence end of the axis. Most models report 0.95-0.99
confidence on most predictions, so lines piled on top of each other
and the x-axis tick labels (bin centers as float) became unreadable.

Replaces with a calibration heatmap:
- One row per model (sorted from most overconfident at top to most
  underconfident at bottom).
- One column per confidence bin, labeled with the bin range directly.
- Cell text shows actual hit rate and sample size.
- Cell color is calibration error (actual minus bin midpoint), with a
  diverging RdYlGn colormap centered on 0. Red = overconfident, green
  = well-calibrated, blue = underconfident. Bins the model never used
  render as a neutral gray.

Same underlying data, no overlap, x-axis is self-explanatory. Caption
in the report's charts section updated to match.

135 tests pass. Report regenerated; no data refresh.

* fix(benchmark): tighten chart bounding boxes; shorten calibration title

Two related fixes to chart rendering:

1. Shorten the calibration heatmap title from a long two-line caption
   to a compact one with cell-text legend up top and color legend
   below. Previous wording overflowed the figure width on the right
   on narrow viewports.

2. Add bbox_inches="tight" to every chart savefig. matplotlib's
   tight_layout sometimes misses corner clipping when titles or
   colorbar labels run close to the figure edge; bbox_inches="tight"
   on save expands the bounding box just enough to capture them.
   Applied to pareto.svg, compliance.svg, episodes.svg,
   calibration.svg, and latency_tail.svg.

No data change, just rendering. 135 tests pass.
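
The savefig change, for reference (bbox_inches is matplotlib's
documented argument; the filename is illustrative):

    import matplotlib
    matplotlib.use("Agg")  # headless rendering
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.set_title("a title long enough to run past the default figure edge")
    # bbox_inches="tight" recomputes the bounding box at save time,
    # catching text that tight_layout leaves clipped at the corners.
    fig.savefig("pareto.svg", bbox_inches="tight")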

* fix(benchmark): dedup calls.jsonl in report rendering (last-write-wins)

calls.jsonl is append-only by design (audit history). When
`benchmark run --retry-errors` succeeds against a previously-failed
(model, episode, trial, window) tuple, the new successful row is
appended alongside the original error row. The aggregator already
filtered errors out of F1/cost (`if rec.get('error'): continue`), but
the Failures section kept counting the original error rows even though
they were superseded by a successful retry.

Fix: dedup raw calls by (model, episode_id, trial, window_index) and
keep the last row encountered. Everything downstream (aggregator,
charts, failures table, per-model detail) sees only the current state.

Run Metadata now distinguishes "unique work units (current state)"
from "raw rows in calls.jsonl" and notes how many were superseded by
later retries. Lifetime actual spend still sums the raw rows so the
true historical bill is preserved.

135 tests pass. No data change; rendering only.
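
The last-write-wins pass, sketched with the key fields named in this
message (the function name is illustrative):

    def dedup_calls(rows):
        """Keep the newest row per work unit; file order is append order."""
        latest = {}
        for row in rows:
            key = (row["model"], row["episode_id"],
                   row["trial"], row["window_index"])
            latest[key] = row  # a later retry row replaces the error row
        return list(latest.values())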

* docs(benchmark): add column key to precision/recall/FP/FN section

The TP/FP/FN columns were undefined inline. New readers had to know
the convention or skim the metric glossary at the top to interpret
the rest of the section. Adds a compact column-key table with the
formula for precision and recall and one-sentence definitions of TP,
FP, and FN, plus a one-paragraph reading guide on what high-precision
vs high-recall behavior actually means in production.

Doc-only; no schema or test impact. 135 tests pass.

* docs(benchmark): explainers on every section + cross-model agreement chart

Two-part pass on report rendering:

1. Add a one-paragraph intro to every section that lacked one, written
   for an entry-level-to-expert audience. Sections that gained
   explainers: Failures (By category, Per-model error count, Sample
   messages), Detection rate buckets (By ad length, By ad position),
   Quick Comparison, Per-Model Detail, Per-Episode Detail, Parser
   Stress Test, Methodology. Each explainer answers: what does this
   table show, why does it matter, how should I read the numbers?

2. New chart: Cross-model agreement histogram (agreement.svg).
   Renders the existing per-window vote-count distribution as a
   colored bar chart (red = low agreement, green = high), with cell
   labels showing window counts and percentages. Wired into the
   Charts section with a caption that mirrors the table-form Cross-
   model agreement section.

135 tests pass. Data unchanged; rendering only.

* feat(benchmark): per-model agreement + 3 new charts

User noted the cross-model agreement histogram was anonymous (no model
labels) and asked for both views: distribution AND per-model
attribution. Plus identified other sections that could use charts.

New analysis:
- Per-model alignment with consensus (table + chart). For each model
  and each (episode, window), classify the model's vote against the
  majority into 4 buckets: with-yes / with-no / broke-yes / broke-no
  (sketch below). Surfaces which models track consensus and which break
  it -- a model that voted "yes" when most others voted "no" is likely
  false-positiving; the opposite means likely missing real ads.

New charts:
- alignment.svg: stacked horizontal bars per model showing the 4
  buckets with green/blue/orange/red color coding and an alignment
  rate label on the right edge.
- precision_recall.svg: scatter of precision vs recall with F1
  isocurves as gray dashed reference lines. Top-right = ideal,
  top-left = cautious, bottom-right = greedy.
- boundary.svg: stacked horizontal bars per model with blue = start
  MAE and orange = end MAE, sorted by total error ascending. Total
  labeled at the right.
- token_efficiency.svg: scatter of tokens-per-ad (log x) vs F1.
  Upper-left = efficient zone; right side = reasoning-heavy models
  whose extra tokens may or may not buy more F1.

Charts section in the report now lists all 9 charts with one-paragraph
captions matching the depth of detail in the body sections. Histogram
caption updated to acknowledge anonymity and point to the alignment
chart for per-model attribution.

135 tests pass. Data unchanged; rendering only.
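
The four-bucket vote classification, sketched under an assumed majority
representation (yes-vote count vs. total voters):

    def classify_vote(said_ad: bool, yes_votes: int, voters: int) -> str:
        """Bucket one model's vote against the majority for a window."""
        majority_yes = yes_votes * 2 > voters
        if said_ad == majority_yes:
            return "with-yes" if majority_yes else "with-no"
        return "broke-yes" if said_ad else "broke-no"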

* feat(benchmark): trial variance + detection-bucket + parser-stress charts

Adds the last three sections that had model data but no chart.

New charts:
- trial_variance.svg: horizontal bars of mean F1 stdev per model, with
  green/yellow/red coloring at the 0.02 and 0.05 thresholds. Visually
  surfaces models whose single-trial F1 numbers can't be trusted.
- detection_by_length.svg: heatmap of model (row) vs ad-length bucket
  (column, short/medium/long), cell = detection rate plus sample size.
  Rows sorted by overall detection rate, descending.
- detection_by_position.svg: same shape, columns are pre-roll /
  mid-roll / post-roll. Surfaces the common pattern where post-roll
  is systematically harder.
- parser_stress.svg: heatmap of model vs extraction method (call
  counts). Columns ordered by total usage; rows ordered by share of
  json_array_direct (the clean path) so easier-to-consume models are
  at the top.

Total charts now 13. Each chart has a one-paragraph caption in the
Charts section explaining what it shows and how to read it.

Also: ran a programmatic AI-tic scan across all report.py prose
literals -- 0 hits across the Wikipedia high-frequency vocabulary.
The one em-dash flag in the scan was a false positive: 6 em-dashes
spread across 9 paragraphs in a single concatenated chart-captions
string, not 6 in one paragraph (one per chart, which is the intended
pattern).

135 tests pass.

* docs(benchmark): humanizer pass + source-data links under each chart

Two related polish changes.

1. Em-dash and tone pass. After the user noted that em-dashes were
   still visible throughout the report, audited every prose string in
   report.py and applied 23+ targeted replacements:
   - Converted `**term** -- description` patterns to `**term**: description`
   - Converted parenthetical em-dashes ("X -- Y -- and Z") to commas
     or sentence breaks
   - Converted aside em-dashes ("X. Y -- bad for production") to
     proper sentence boundaries ("X. Y. Bad for production.")
   - Title-cased heading "Parser Stress Test" lowered to "Parser
     stress test" (sentence case, per humanizer rule 15)
   Final rendered-report distribution: 88 paragraphs with 0 em-dashes,
   1 with a single em-dash, 0 with 2+. Down from 17 + 2 = 19
   paragraphs with em-dashes before the pass.

   Other humanizer-rule audits all clean: 0 promotional, 0 copula
   avoidance, 0 filler, 0 hedging, 0 chatbot artifacts, 0 superficial
   -ing analyses, 0 formulaic "despite/conclusion", 0 false ranges.

2. Source-data links. Every chart in the Charts section now has a
   "Source data: [linked anchor]" line directly below the image,
   pointing at the corresponding tables in the body of the report.
   Readers who want to dig past the visual can click straight to the
   numbers. 13 source-data links wired up (one per chart, some chart
   captions point to multiple tables like Pareto -> Best Accuracy /
   Best Value / Best Free-Tier).

135 tests pass. Doc/report-only change.

* feat(2.1.7): salvage truncated single-ad JSON + benchmark cleanup

Production parser: add strategy 4 (`_salvage_truncated_single_ad`)
to recover usable ad dicts from responses that ran out of token
budget mid-output (the existing four strategies all bail on the
structurally invalid JSON). Returns a dict only when both `start`
and `end` were recovered; otherwise the row is dropped rather than
fabricated. Tagged `json_object_single_ad_truncated`.

Benchmark: max_tokens is now config-driven via `[run].max_tokens`,
defaulting to production's `AD_DETECTION_MAX_TOKENS` so the
benchmark matches the live app's budget. `schema_audit` now imports
production's `STRUCTURAL_FIELDS` and `SPONSOR_PRIORITY_FIELDS` to
stop flagging fields the live parser already accepts.

Benchmark code cleanup: drop local duplicates of `parse_timestamp`
and `format_time` in favor of `utils.time`; use `dataclasses.asdict`
for violations; convert derived ModelStats fields to `@property`;
precompute per-model indexes in `_aggregate` and call counts in
`_render_failures`; cache `avg_f1` / `mean_f1_stdev` on ModelStats;
fix the brittle HTTP-5xx classifier (`\b5\d{2}\b`); extract
length/position bucket helpers.

Corpus: add two new verified episodes (`it-s-a-thing`,
`the-brilliant-idiots`).
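
A sketch of the salvage contract only, not the production strategy's
real implementation: recover start and end from a JSON object that was
cut off mid-output, or drop the row:

    import re

    def salvage_truncated_single_ad(raw: str):
        """Recover {start, end} from truncated JSON, else None."""
        start = re.search(r'"start"\s*:\s*([\d.]+)', raw)
        end = re.search(r'"end"\s*:\s*([\d.]+)', raw)
        if not (start and end):
            return None  # drop the row rather than fabricate a boundary
        return {"start": float(start.group(1)), "end": float(end.group(1))}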

* updated report

* chore(deps): bulk dependency bumps + benchmark numbers refresh for 2.1.7

Rolls 11 of 12 open Dependabot PRs into 2.1.7. PR #208 (ubuntu
24.04 -> 26.04) held: the GPU image is pinned to ubuntu 24.04
through nvidia/cuda:12.9.1-runtime-ubuntu24.04 and nvidia/cuda has
no ubuntu 26.04 variant, so taking it would split the GPU and CPU
base images across a major OS version.

pip (5):
- bump anthropic 0.97.0 -> 0.100.0
- bump cryptography 47.0.0 -> 48.0.0
- bump gunicorn 25.3.0 -> 26.0.0
- bump huggingface-hub 1.13.0 -> 1.14.0
- bump openai 2.33.0 -> 2.36.0

npm (5):
- bump @tanstack/react-query 5.100.5 -> 5.100.9
- bump react 19.2.5 -> 19.2.6
- bump react-router-dom 6.30.3 -> 7.15.0 (major; tsc clean, no API surface change required)
- bump tailwind-merge 3.5.0 -> 3.6.0
- bump vite-plugin-pwa 1.2.0 -> 1.3.0

docker (1):
- bump node 24-alpine -> 26-alpine (both GPU and CPU Dockerfiles)

README: refresh the Cloud LLMs recommended-model table with the
latest 32-model / 7-episode / 14,400-call benchmark numbers.

Verification:
- pytest tests/ -q -> 941 passed, 4 skipped
- benchmarks/llm pytest -> 135 passed
- frontend: npx tsc --noEmit clean, npm run build successful
- docker build --platform=linux/amd64 (GPU) -> success
- docker build --platform=linux/amd64 -f Dockerfile.cpu (CPU) -> success

* docs(2.1.7): note accepted CVE-2026-31431 in linux-libc-dev

ttlequals0 (Owner) commented

Deferring this bump. Rationale:

The MinusPod GPU image is pinned to ubuntu 24.04 through its base image nvidia/cuda:12.9.1-runtime-ubuntu24.04. nvidia/cuda currently publishes no ubuntu 26.04 variant -- the newest available is ubuntu24.04 across CUDA 12.4 through 13.2.1. Verified at https://hub.docker.com/r/nvidia/cuda/tags.

This PR touches Dockerfile.cpu only (the CPU image uses FROM ubuntu:24.04 directly, where dependabot can see and bump it). Merging it as-is would split the two production images across a major OS version: GPU on ubuntu 24.04, CPU on ubuntu 26.04. That breaks the project invariant documented in CLAUDE.md under "Image variants":

> Both bases share ubuntu:24.04 underneath, so CVE overlap is high but not total.

The intent there is to keep both images on the same OS so their CVE surface stays in sync and per-variant Trivy reports are comparable. Diverging across an LTS jump would widen that gap considerably (different glibc, different default Python, different apt package versions) for the lifetime of the divergence.

The other 11 dependabot PRs from this batch shipped in v2.1.7 (#200, merged as ed73901d2b42f52a7c7892f0454792a3878b7a4e). See CHANGELOG.md under [2.1.7] -> ### Dependencies for the explicit "holding" note.

This PR will become viable when either:

  1. nvidia/cuda ships an ubuntu 26.04 base variant, in which case both Dockerfile and Dockerfile.cpu can move together, or
  2. The GPU image migrates off the nvidia/cuda base entirely (e.g. plain ubuntu + manual CUDA runtime install).

Keeping this PR open so dependabot doesn't re-raise it on every weekly rebase. Will revisit when (1) or (2) lands.

Bumps ubuntu from 24.04 to 26.04.

---
updated-dependencies:
- dependency-name: ubuntu
  dependency-version: '26.04'
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot[bot] force-pushed the dependabot/docker/ubuntu-26.04 branch from 7d47aba to d477df4 on May 16, 2026 at 00:14