epic(eval): pr_review eval expansion (n≥50) + per-voter precision — advisory until it earns enforce #3845

Description

@williamzujkowski

Narrative

pr_review stays advisory until it earns enforcement. Grow the eval set from n=10 to n≥50 (target 100) with a documented labeling methodology that fixes the v5 mislabeling; track precision/recall per voter role so chronically noisy voters can be demoted on the Epic D ladder; define and encode the explicit pr_review→enforce promotion criterion.

Grounding (Phase 0 recon — docs/research/pr-review-experiment-results-v5.md)

  • v5 baseline (2026-04-26): n=10 (5 synthetic buggy, 3 historical, 2 synthetic clean). Bug-catch 100% (6/6), location-match 83.3%, raw FP rate 50%. After triage, the raw FPs were re-classified as dataset error plus borderline judgment calls: #2235 ("fix(research): repair github + semantic_scholar source coverage (#2234)") was labeled "clean" but shipped a real bug (wrong env var in a rate-limit message) that the voters caught. The "false positives" were partly the dataset being wrong, so the labeling methodology is the bottleneck, not just the panel.
  • No per-voter metrics exist; v5 reports panel-level only. The 5 pr_review roles: architect, security, devex, catfish, scope_steward.
  • The README claim "100% bug-catch" enters the claims registry (Epic A) linked to the v5 artifact WITH the n=10 caveat; v6 supersedes it.
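One way to read the n=10 caveat: a 6/6 catch rate over so few buggy cases is compatible with a wide range of true catch rates. A minimal sketch using the Wilson score interval follows; the function name and the 95% level are illustrative assumptions, not part of the v5 artifact.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# v5 reported 6 of 6 bugs caught.
low, high = wilson_interval(6, 6)
print(f"95% interval for 6/6: [{low:.2f}, {high:.2f}]")
```

For 6/6 the lower bound is about 0.61, so "100% bug-catch" at this sample size does not rule out a true catch rate well below 100%. An all-caught run over 50 buggy cases would lift the lower bound to about 0.93.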

Children

Absorbed

Dependency notes

Depends on Epic D for the demotion path (noisy voter → ladder demotion) and the promotion-criterion machinery. #3134 (author identity input) related, not absorbed.

Definition of done

  • n≥50 labeled eval set with a published rubric; v5 mislabels corrected
  • Per-voter precision/recall in the outcome store, queryable over time
  • v6 results doc; promotion criterion encoded in the claims registry
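The per-voter precision/recall item could be computed along these lines. This is a sketch under stated assumptions: the `VoterOutcome` record shape, its field names, and the binary flagged/has_bug framing are hypothetical, and the actual outcome store schema is not described in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VoterOutcome:
    voter: str     # role name, e.g. "architect" or "security"
    case_id: str   # hypothetical eval-case identifier
    flagged: bool  # the voter raised a finding on this case
    has_bug: bool  # ground-truth label from the labeling rubric


def per_voter_metrics(outcomes):
    """Return {voter: {"precision": ..., "recall": ...}}; None when undefined."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for o in outcomes:
        c = counts[o.voter]
        if o.flagged and o.has_bug:
            c["tp"] += 1
        elif o.flagged:
            c["fp"] += 1
        elif o.has_bug:
            c["fn"] += 1
        # Not flagged and no bug is a true negative; it affects neither metric.
    metrics = {}
    for voter, c in counts.items():
        flagged_total = c["tp"] + c["fp"]
        buggy_total = c["tp"] + c["fn"]
        metrics[voter] = {
            "precision": c["tp"] / flagged_total if flagged_total else None,
            "recall": c["tp"] / buggy_total if buggy_total else None,
        }
    return metrics


# Illustrative data only; these are not v5 results.
sample = [
    VoterOutcome("security", "case-1", flagged=True, has_bug=True),
    VoterOutcome("security", "case-2", flagged=True, has_bug=False),
    VoterOutcome("devex", "case-1", flagged=False, has_bug=True),
    VoterOutcome("devex", "case-2", flagged=False, has_bug=False),
]
print(per_voter_metrics(sample))
```

Returning `None` rather than 0 when a voter never flagged anything keeps "no data" distinct from "always wrong", which matters if these numbers feed a demotion decision.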

Checkbox reconciliation 2026-06-21: flipped boxes for children closed since the last edit (backlog deep-dive). Remaining unchecked items are genuinely open.

Metadata

Assignees

No one assigned

Labels

  • epic: Parent tracking issue
  • eval: Eval sets, per-voter precision (Epic E)
  • p2: Priority 2 - Medium impact, moderate changes needed
  • roadmap:control-plane: Control Plane roadmap (M1-M4)
