Bimodal INFERRED confidence-score distribution — calibration needed


**Labels:** extraction, calibration

**Body:**

## Problem

The extraction prompt template (`~/.claude/skills/graphify/SKILL.md`, Step B2) instructs subagents:

> `- INFERRED edges: reason about each edge individually. Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. Reasonable inference with some uncertainty: 0.6-0.7. Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.`

In practice, on a 10,129-INFERRED-edge graph produced by Claude Sonnet subagents (2026-04-25), the actual distribution is:

| Score bucket | Count | % of INFERRED |
|---|---|---|
| <0.4 | 0 | 0% |
| 0.4–0.6 | 5,807 | 57% |
| 0.6–0.8 | 14 | 0.1% |
| 0.8+ | 4,308 | 42% |

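The guidance that most edges should score 0.6-0.9 is not being followed: 57% of INFERRED edges sit in the 0.4–0.6 bucket, and only 14 edges fall between 0.6 and 0.8. Subagents appear to default to `0.5` (exactly the midpoint) for uncertain edges and `0.85`/`0.9` for confident ones, with almost nothing in between.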

## Impact

Downstream filtering by `confidence_score` is effectively a binary switch, not a continuum: any threshold placed between the two modes keeps or drops essentially the same set of edges. The calibration promised by the prompt does not materialize.
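
To make the "binary switch" concrete, here is a minimal sketch that reuses the bucket counts from the table in the Problem section and counts how many INFERRED edges survive a `confidence_score >= t` filter. It assumes the buckets are half-open (`[0.4, 0.6)`, `[0.6, 0.8)`, `[0.8, 1.0]`), which the table does not state, and it only resolves thresholds at bucket boundaries. `surviving` is a hypothetical helper, not part of graphify.

```python
# Bucket counts copied from the table in the Problem section.
# Assumption: buckets are half-open [lower, upper); the issue does not say.
BUCKETS = [
    (0.0, 0),     # <0.4
    (0.4, 5807),  # 0.4-0.6
    (0.6, 14),    # 0.6-0.8
    (0.8, 4308),  # 0.8+
]


def surviving(threshold: float) -> int:
    """Edges kept by `confidence_score >= threshold`, at bucket granularity."""
    return sum(count for lower, count in BUCKETS if lower >= threshold)


total = surviving(0.0)
for t in (0.4, 0.6, 0.8):
    kept = surviving(t)
    print(f"threshold {t:.1f}: {kept:>6} kept ({kept / total:.1%})")
```

Raising the threshold from 0.4 to 0.6 drops 5,807 edges at once, while raising it further from 0.6 to 0.8 drops only 14.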

## Suggested options

1. **Post-process.** Add a `calibrate_confidence()` step that shifts the `0.5` bucket up by a small jitter derived from per-edge features (degree of target, relation type, etc.). Not ideal, but deterministic as long as the jitter is a pure function of those features rather than a fresh random draw.
2. **Prompt surgery.** Reformulate the scoring guidance as a forced-choice schema ("pick from this exact set: 0.40, 0.55, 0.65, 0.75, 0.85, 0.95"). Models tend to follow discrete rubrics better than continuous ones.
3. **Second pass.** A lightweight "re-score INFERRED edges" pass that looks at each edge in isolation and picks from a smaller forced-choice set.
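
A rough sketch of what options 1 and 2 could look like in code, under several assumptions. The edge fields `source`, `target`, and `relation`, the 0.10 jitter width, and the body of `calibrate_confidence()` are all hypothetical; graphify's actual edge schema may differ. The jitter here is derived from a hash of the edge's identity, which is a simple stand-in for real per-edge features such as target degree. `snap_to_rubric` is a hypothetical validator that maps a raw score onto the discrete set quoted in option 2.

```python
import hashlib

# Discrete rubric quoted from option 2.
RUBRIC = (0.40, 0.55, 0.65, 0.75, 0.85, 0.95)


def snap_to_rubric(score: float) -> float:
    """Return the rubric value closest to `score`."""
    return min(RUBRIC, key=lambda r: abs(r - score))


def calibrate_confidence(edge: dict, width: float = 0.10) -> float:
    """Option 1 sketch: nudge scores of exactly 0.5 up by a reproducible offset.

    The offset is a pure function of the edge's (assumed) fields, so repeated
    runs over the same graph give the same result.
    """
    score = edge["confidence_score"]
    if score != 0.5:
        return score  # only the default-looking bucket is touched
    key = f'{edge["source"]}|{edge["target"]}|{edge["relation"]}'.encode()
    digest = hashlib.sha256(key).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    return round(score + width * fraction, 3)


# Hypothetical edge, for illustration only.
edge = {"source": "a.py", "target": "b.py", "relation": "uses",
        "confidence_score": 0.5}
print(calibrate_confidence(edge))  # some value between 0.5 and 0.6
print(snap_to_rubric(0.5))         # 0.55
```

Note that hash-based jitter only spreads the `0.5` spike out; it adds no information about which edges are actually better supported, which is why options 2 and 3 may be the more meaningful fixes.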

## Reproducer

Run graphify on any repo with >50k LOC of Python + a few markdown design docs. Query the `graph.json` for the `confidence_score` distribution on edges with `confidence=INFERRED`.
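
One way to run that query, as a minimal sketch. It assumes `graph.json` stores its edge list under an `"edges"` or `"links"` key and that each edge is a dict carrying `confidence` and `confidence_score` fields as named above; the exact file layout is not confirmed here. `load_edges`, `bucket`, and `score_distribution` are hypothetical helpers.

```python
import json
import sys
from collections import Counter


def load_edges(path: str) -> list:
    """Load the edge list from graph.json.

    Assumption: edges live under "edges" or "links"; adjust to the real schema.
    """
    with open(path, encoding="utf-8") as fh:
        graph = json.load(fh)
    return graph.get("edges") or graph.get("links") or []


def bucket(score: float) -> str:
    """Map a score to the buckets used in the Problem table (half-open)."""
    if score < 0.4:
        return "<0.4"
    if score < 0.6:
        return "0.4-0.6"
    if score < 0.8:
        return "0.6-0.8"
    return "0.8+"


def score_distribution(edges: list) -> Counter:
    """Count INFERRED edges per score bucket; other edge types are skipped."""
    return Counter(
        bucket(e["confidence_score"])
        for e in edges
        if e.get("confidence") == "INFERRED" and "confidence_score" in e
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    dist = score_distribution(load_edges(sys.argv[1]))
    total = sum(dist.values()) or 1  # avoid dividing by zero on empty graphs
    for name in ("<0.4", "0.4-0.6", "0.6-0.8", "0.8+"):
        print(f"{name:>8}: {dist[name]:>6} ({dist[name] / total:.1%})")
```

Saved under any filename, the script takes the path to `graph.json` as its only argument.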

## Motivation

Calibration matters for graph-powered queries. `/graphify query` results cite confidence, and when the confidence is bimodal rather than graded, the user can't distinguish "likely" from merely "plausible" answers.



Bimodal INFERRED confidence-score distribution — calibration needed #540
