Labels: extraction, calibration
Body:
Problem
The extraction prompt template (`~/.claude/skills/graphify/SKILL.md`, Step B2) instructs subagents:
- INFERRED edges: reason about each edge individually. Direct structural evidence (shared data structure, clear dependency): 0.8-0.9. Reasonable inference with some uncertainty: 0.6-0.7. Weak or speculative: 0.4-0.5. Most edges should be 0.6-0.9, not 0.5.
In practice, on a graph of 10,129 INFERRED edges produced by Claude Sonnet subagents (2026-04-25), the actual distribution is:
| Score bucket | Count | % of INFERRED |
|--------------|-------|---------------|
| <0.4         | 0     | 0%            |
| 0.4–0.6      | 5,807 | 57%           |
| 0.6–0.8      | 14    | 0.1%          |
| 0.8+         | 4,308 | 42%           |
The "0.6-0.9 should be most" guidance is being ignored. Subagents appear to default to 0.5 (exactly the midpoint) for uncertain edges and 0.85/0.9 for confident ones, with nothing in between.
Impact
Downstream filtering by `confidence_score` is effectively a binary switch, not a continuum. The calibration promised by the prompt does not materialize.
Suggested options
- Post-process. Add a `calibrate_confidence()` step that shifts the 0.5 bucket up by a small jitter derived from per-edge features (degree of target, relation type, etc.) rather than a random draw. Not ideal, but deterministic; a sketch follows this list.
- Prompt surgery. Reformulate the scoring guidance as a forced-rank schema ("pick from this exact set: 0.40, 0.55, 0.65, 0.75, 0.85, 0.95") — models are known to follow discrete rubrics better than continuous ones.
- Second pass. A lightweight "re-score INFERRED edges" pass that looks at each edge in isolation and picks from a smaller forced-rank set.
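A minimal sketch of the post-process option, assuming edges are dicts carrying `source`, `target`, `relation`, and `confidence_score` fields (those field names are assumptions, not the confirmed graph.json schema):

```python
import hashlib

def calibrate_confidence(edge: dict, target_degree: int) -> float:
    """Shift defaulted 0.5 scores up by a small feature-derived jitter."""
    score = edge["confidence_score"]
    if abs(score - 0.5) > 1e-9:
        return score  # only touch the defaulted midpoint bucket

    # Hash stable per-edge features so the jitter is reproducible across
    # runs: no RNG state, same input graph -> same output scores.
    key = f"{edge['source']}|{edge['relation']}|{edge['target']}|{target_degree}"
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    jitter = (digest % 1000) / 1000 * 0.15  # in [0.0, 0.15)
    return round(0.5 + jitter, 3)
```

Snapping the result to a discrete set (the forced ranks from option 2) could be layered on top if downstream filters prefer a small number of levels.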
Reproducer
Run graphify on any repo with >50k LOC of Python plus a few markdown design docs. Query `graph.json` for the `confidence_score` distribution on edges with `confidence=INFERRED`.
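A quick sketch of that distribution check, assuming graph.json has a top-level `edges` array and the field names used above (both assumptions):

```python
import json
from collections import Counter

BUCKETS = ["<0.4", "0.4-0.6", "0.6-0.8", "0.8+"]

def bucket(score: float) -> str:
    if score < 0.4:
        return "<0.4"
    if score < 0.6:
        return "0.4-0.6"
    if score < 0.8:
        return "0.6-0.8"
    return "0.8+"

with open("graph.json") as f:
    edges = json.load(f)["edges"]  # assumed top-level "edges" key

inferred = [e for e in edges if e.get("confidence") == "INFERRED"]
counts = Counter(bucket(e["confidence_score"]) for e in inferred)
total = len(inferred) or 1
for name in BUCKETS:
    print(f"{name:>8}: {counts[name]:6d}  ({counts[name] / total:.1%})")
```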
Motivation
Calibration matters for graph-powered queries. `/graphify query` results cite confidence, and when the confidence is bimodal rather than graded, the user can't distinguish "likely" from "probable" answers.