fix(evals): un-break the R@5 failure-mode audit against the enveloped batch output by jamie8johnson · Pull Request #2247 · jamie8johnson/cqs

jamie8johnson · 2026-07-05T13:51:08Z

The R@5 failure-mode audit script (evals/audit_r5_failure_modes.py, April-era) had silently broken: batch output moved its search payload under the {"data": ...} envelope (json_envelope work) and the script read top-level results — every query scored 0/109 with results: [] and no error surfaced. A measurement tool reporting zeros instead of failing loudly is the docs-lying class applied to eval tooling.

What lands

Envelope unwrap with pre-envelope fallback (out.get("data", out)) + a typed error check (the current error line is {"error": {code, message}}).
Stale hardcoded header ("v1.27.0 shipping, BGE-large") replaced with a config-neutral line.
docs/audit-r5-failure-modes.md + evals/queries/v3_r5_audit.json refreshed with a current-config run (gemma-300m, PARSER_VERSION 17, 18.3k chunks, v3 test split; dev also run, recorded in the research log).
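
A minimal sketch of the unwrap and error check described in the first item, for readers who have not opened the diff. Only `out.get("data", out)` and the `{"error": {code, message}}` shape come from this PR. The function name `parse_batch_line`, the `BatchError` class, and the guard on a missing `results` key are hypothetical and illustrate one way to fail loudly; they are not necessarily what the script does.

```python
import json


class BatchError(RuntimeError):
    """Hypothetical error type for a batch line that cannot be scored."""


def parse_batch_line(line: str) -> list:
    """Return the search results from one batch output line.

    Accepts the enveloped shape {"data": {"results": [...]}} and the
    pre-envelope shape {"results": [...]}.
    """
    out = json.loads(line)
    if not isinstance(out, dict):
        raise BatchError(f"expected a JSON object, got {type(out).__name__}")

    # Typed error check: the current error line is {"error": {code, message}}.
    err = out.get("error")
    if isinstance(err, dict):
        raise BatchError(f"{err.get('code')}: {err.get('message')}")

    # Envelope unwrap with pre-envelope fallback.
    payload = out.get("data", out)

    # Assumed extra guard: treat a missing "results" key as a failure
    # rather than silently scoring the query as an empty result list.
    if not isinstance(payload, dict) or "results" not in payload:
        raise BatchError("no 'results' key in batch output")
    return payload["results"]
```

With a guard like this, a future change to the output shape would surface as an exception instead of a run of zeros.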

The refreshed finding (both splits, 14 near-misses each)

near_dup_crowding still dominates and grew: 71.4% test / 57.1% dev (April BGE-era: 60%). Paired-fixture agreement — real signal.
The concrete shadowing shape: #[cfg(test)] functions inside src files (e.g. src/parser/chunk.rs tests) outranking gold production chunks (src/schema.sql tables) — test-code crowding is cfg-test-mod chunks in src/, not tests/-dir files.
Fixture debt: 3 golds across the splits live in docs/ plan-file code blocks (eval_artifact_docs) — regeneration targets for the fixture program.
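
A quick arithmetic check on the crowding shares, assuming they are computed over the 14 near-misses per split. The counts 10 and 8 are inferred from the reported percentages and are not stated in this PR; `share` is a hypothetical helper.

```python
def share(count: int, total: int) -> float:
    """Percentage of near-misses in one failure mode, to one decimal."""
    return round(100 * count / total, 1)


NEAR_MISSES_PER_SPLIT = 14  # from the heading above

# Inferred counts (assumption): 10 on test, 8 on dev.
test_share = share(10, NEAR_MISSES_PER_SPLIT)
dev_share = share(8, NEAR_MISSES_PER_SPLIT)
print(test_share, dev_share)
```

Under that assumption the shares come out to 71.4 and 57.1, matching the reported figures. With only 14 near-misses per split, each one moves the share by about 7 points.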

🤖 Generated with Claude Code

… batch output

The batch line moved its search payload under {"data": ...} (json_envelope); the script read top-level results and scored 0/109 on every query with no error surfaced. Unwrap the envelope (with pre-envelope fallback), type the error check, and refresh docs/audit-r5-failure-modes.md + the raw record with a current-config run (gemma-300m, PARSER_VERSION 17): crowding still dominates near-misses on both splits (71.4% test / 57.1% dev).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jamie8johnson merged commit 55457d5 into main Jul 5, 2026
10 checks passed

jamie8johnson deleted the fix/audit-script-envelope branch July 5, 2026 13:59
