
Bimodal INFERRED confidence-score distribution — calibration needed #540

@saxster


Labels: extraction, calibration

Body:

Problem

The extraction prompt template (~/.claude/skills/graphify/SKILL.md, Step B2) instructs subagents:

- INFERRED edges: reason about each edge individually. Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. Reasonable inference with some uncertainty: 0.6-0.7. Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.

In practice, on a graph with 10,129 INFERRED edges produced by Claude Sonnet subagents (2026-04-25), the actual distribution is:

| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4         | 0     | 0%            |
| 0.4–0.6      | 5,807 | 57.3%         |
| 0.6–0.8      | 14    | 0.1%          |
| 0.8+         | 4,308 | 42.5%         |

The "0.6-0.9 should be most" guidance is being ignored. Subagents appear to default to 0.5 (exactly the midpoint of the 0-1 scale) for uncertain edges and 0.85/0.9 for confident ones, with almost nothing in between (14 edges out of 10,129).

Impact

Downstream filtering by confidence_score is effectively a binary switch, not a continuum. The calibration promised by the prompt does not materialize.
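To make the "binary switch" concrete, here is a small sketch that replays a `confidence_score >= threshold` filter against the bucket counts reported above. It works at bucket resolution only and assumes each bucket is half-open on the right (for example, 0.6 belongs to 0.6–0.8), which the table does not state.

```python
# Bucket counts as reported in this issue: (lower bound of bucket, edge count).
# Assumption: buckets are half-open [lower, next lower).
REPORTED = [(0.0, 0), (0.4, 5807), (0.6, 14), (0.8, 4308)]


def kept_at(threshold):
    """Edges surviving `confidence_score >= threshold`, at bucket resolution."""
    return sum(count for lower, count in REPORTED if lower >= threshold)


total = sum(count for _, count in REPORTED)
for t in (0.4, 0.6, 0.8):
    print(f"threshold {t}: keeps {kept_at(t)} of {total}")
```

Moving the cutoff from 0.4 to 0.6 drops 5,807 edges, while moving it from 0.6 to 0.8 drops only 14, so any threshold in that upper range selects almost the same set.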

Suggested options

  1. Post-process. Add a calibrate_confidence() step that shifts the 0.5 bucket up by a small jitter derived from per-edge features (degree of target, relation type, etc.) rather than from a random source. Not ideal, but deterministic.
  2. Prompt surgery. Reformulate the scoring guidance as a forced-rank schema ("pick from this exact set: 0.40, 0.55, 0.65, 0.75, 0.85, 0.95") — models are known to follow discrete rubrics better than continuous ones.
  3. Second pass. A lightweight "re-score INFERRED edges" pass that looks at each edge in isolation and picks from a smaller forced-rank set.
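A rough sketch of what options 1 and 2 could look like in post-processing. Nothing here exists in graphify today: `calibrate_confidence()` is only the name proposed above, `snap_to_rubric()` is a hypothetical helper, and the edge keys `source`, `target` and `relation` are assumptions about the edge schema. The jitter is derived from a hash of the edge, so it is deterministic, but it only spreads the 0.5 bucket out and adds no real signal unless the hash input is replaced by meaningful features.

```python
import hashlib

# Discrete rubric from option 2.
ALLOWED_SCORES = (0.40, 0.55, 0.65, 0.75, 0.85, 0.95)


def calibrate_confidence(edge, max_shift=0.1):
    """Shift exact-0.5 INFERRED edges up by a deterministic, hash-derived amount.

    All other edges are returned unchanged. The keys "source", "target"
    and "relation" are assumed field names, not a confirmed schema.
    """
    score = edge.get("confidence_score")
    if edge.get("confidence") != "INFERRED" or score != 0.5:
        return score
    key = "|".join(str(edge.get(k, "")) for k in ("source", "target", "relation"))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return round(0.5 + max_shift * fraction, 3)


def snap_to_rubric(score):
    """Map a free-form score to the nearest value in ALLOWED_SCORES."""
    return min(ALLOWED_SCORES, key=lambda allowed: abs(allowed - score))
```

`snap_to_rubric()` could also serve as a validator: if a subagent returns a value outside the allowed set, the prompt change in option 2 is not being followed.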

Reproducer

Run graphify on any repo with >50k LOC of Python + a few markdown design docs. Query the graph.json for the confidence_score distribution on edges with confidence=INFERRED.
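The query step can be sketched roughly as below. The field names `confidence` and `confidence_score` come from this issue; the assumption that edges sit in a top-level `edges` or `links` list is mine and may need adjusting to the actual graph.json layout. Bucket edges mirror the table above and are treated as half-open.

```python
import json
from collections import Counter

# (lower, upper, label); half-open [lower, upper) is an assumption.
BUCKETS = [
    (0.0, 0.4, "<0.4"),
    (0.4, 0.6, "0.4-0.6"),
    (0.6, 0.8, "0.6-0.8"),
    (0.8, 1.01, "0.8+"),
]


def bucket_label(score):
    for lower, upper, label in BUCKETS:
        if lower <= score < upper:
            return label
    return "out of range"


def inferred_distribution(graph):
    """Count INFERRED edges per score bucket in a parsed graph.json dict."""
    # Assumption: edges are stored under "edges" or "links".
    edges = graph.get("edges") or graph.get("links") or []
    return Counter(
        bucket_label(float(edge["confidence_score"]))
        for edge in edges
        if edge.get("confidence") == "INFERRED"
        and edge.get("confidence_score") is not None
    )


def print_distribution(path):
    with open(path, encoding="utf-8") as fh:
        counts = inferred_distribution(json.load(fh))
    total = sum(counts.values()) or 1  # avoid division by zero on empty graphs
    for _, _, label in BUCKETS:
        print(f"{label:8} {counts[label]:6d} {100 * counts[label] / total:5.1f}%")
```

Calling `print_distribution()` with the path to the generated graph.json should print one row per bucket, comparable to the table in the Problem section.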

Motivation

Calibration matters for graph-powered queries. /graphify query results cite confidence, and when the confidence is bimodal rather than graded, the user can't distinguish a well-supported answer from a merely plausible one.
