fix(evals): un-break the R@5 failure-mode audit against the enveloped batch output #2247

Merged: jamie8johnson merged 1 commit into main from fix/audit-script-envelope on Jul 5, 2026

@jamie8johnson

Owner

The R@5 failure-mode audit script (evals/audit_r5_failure_modes.py, April-era) had silently broken: batch output moved its search payload under the {"data": ...} envelope (json_envelope work) and the script read top-level results — every query scored 0/109 with results: [] and no error surfaced. A measurement tool reporting zeros instead of failing loudly is the docs-lying class applied to eval tooling.

What lands

  • Envelope unwrap with pre-envelope fallback (out.get("data", out)) + a typed error check (the current error line is {"error": {code, message}}).
  • Stale hardcoded header ("v1.27.0 shipping, BGE-large") replaced with a config-neutral line.
  • docs/audit-r5-failure-modes.md + evals/queries/v3_r5_audit.json refreshed with a current-config run (gemma-300m, PARSER_VERSION 17, 18.3k chunks, v3 test split; dev also run, recorded in the research log).
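A minimal sketch of the unwrap and error check described in the first bullet, not the script's actual code. The function name `extract_results`, the exception `BatchLineError`, and the choice to raise when `results` is missing are assumptions for illustration; only the `{"data": ...}` envelope, the `{"error": {code, message}}` shape, and the `out.get("data", out)` fallback come from this PR.

```python
import json
from typing import Any


class BatchLineError(RuntimeError):
    """Hypothetical error type: the batch line is an error or has no payload."""


def extract_results(line: str) -> list[Any]:
    """Return the search results from one batch output line.

    Accepts both the enveloped shape {"data": {"results": [...]}} and the
    pre-envelope shape {"results": [...]}.
    """
    out = json.loads(line)
    if not isinstance(out, dict):
        raise BatchLineError(f"expected a JSON object, got {type(out).__name__}")

    # Typed error check: the current error line is {"error": {code, message}}.
    err = out.get("error")
    if isinstance(err, dict):
        raise BatchLineError(f"{err.get('code')}: {err.get('message')}")

    # Envelope unwrap with pre-envelope fallback.
    payload = out.get("data", out)

    # Assumption: fail loudly rather than treat a missing key as "no results",
    # so a future shape change cannot score every query as a miss again.
    if not isinstance(payload, dict) or "results" not in payload:
        raise BatchLineError("batch line has no 'results' payload")
    return payload["results"]
```

Raising on a missing `results` key is the design choice that would have turned the silent 0/109 into an immediate failure; an empty list is still accepted when the key is present.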

The refreshed finding (both splits, 14 near-misses each)

  • near_dup_crowding still dominates and grew: 71.4% test / 57.1% dev (April BGE-era: 60%). Paired-fixture agreement — real signal.
  • The concrete shadowing shape: #[cfg(test)] functions inside src files (e.g. src/parser/chunk.rs tests) outranking gold production chunks (src/schema.sql tables) — test-code crowding is cfg-test-mod chunks in src/, not tests/-dir files.
  • Fixture debt: 3 golds across the splits live in docs/ plan-file code blocks (eval_artifact_docs) — regeneration targets for the fixture program.
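For scale, a small worked check of the percentages above. The counts 10 and 8 are inferred from the stated shares and the 14 near-misses per split, not read from the audit record, and `share` is a hypothetical helper.

```python
def share(count: int, total: int) -> float:
    """Percentage of near-misses in one failure mode, rounded to one decimal."""
    return round(100 * count / total, 1)


NEAR_MISSES = 14  # per split, from the finding above

# Inferred counts: the only integers out of 14 that round to the stated shares.
test_crowding = share(10, NEAR_MISSES)  # 71.4
dev_crowding = share(8, NEAR_MISSES)    # 57.1

# One near-miss moves the share by about 7.1 points at this sample size.
per_query_step = share(1, NEAR_MISSES)  # 7.1
```

Under that inference, the test/dev gap is two queries, so the agreement between splits carries more weight than the exact percentages.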

🤖 Generated with Claude Code

… batch output

The batch line moved its search payload under {"data": ...} (json_envelope);
the script read top-level results and scored 0/109 on every query with no
error surfaced. Unwrap the envelope (with pre-envelope fallback), type the
error check, and refresh docs/audit-r5-failure-modes.md + the raw record
with a current-config run (gemma-300m, PARSER_VERSION 17): crowding still
dominates near-misses on both splits (71.4% test / 57.1% dev).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jamie8johnson jamie8johnson merged commit 55457d5 into main Jul 5, 2026
10 checks passed
@jamie8johnson jamie8johnson deleted the fix/audit-script-envelope branch July 5, 2026 13:59