## Context

Ran the dead code benchmark suite on the `feat/supermodel-benchmark` branch to generate data for a blog post on AI-assisted dead code detection. This issue captures the full context, results, and observations.
## Runs Completed

### Run 1: Baseline Configuration (`deadcode-baseline.yaml`)

- Task: `typescript-express-app` (synthetic corpus, 35 files, 102 dead functions)
- MCP: Filesystem MCP with a grep-based approach
- Result: MCP 90% P / 92% R / 91% F1 vs. baseline 92% P / 85% R / 88% F1 — both PASS

### Run 2: Precomputed Analysis (`deadcode-precomputed.yaml`)

- Task: same synthetic corpus
- MCP: pre-generated `.supermodel/dead-code-analysis.json`
- Result: MCP 91% P / 99% R / 95% F1 vs. baseline 90% P / 84% R / 87% F1 — both PASS
- Key insight: the precomputed analysis achieves 99% recall with only 5 tool calls

### Run 3: Real PR (`supermodel-deadcode-pr.yaml -t tyr_pr258`)

- Task: `tyr_pr258` — uncovering-world/track-your-regions PR #258 (22 ground-truth items)
- MCP: Supermodel staging API (95.5% API recall)
- Result: MCP 53% P / 41% R / 46% F1 vs. baseline 14% P / 5% R / 7% F1 — both FAIL
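For reference, the P/R/F1 figures in these runs follow the standard definitions. A minimal sketch, using counts inferred from Run 3's rounded MCP percentages (9 true positives out of 17 reported items against 22 ground-truth items; these counts are an assumption consistent with the rounding, not figures taken from the logs):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Run 3, MCP condition: 9 TP of 17 reported, 22 ground-truth items (inferred counts)
p, r, f1 = prf1(tp=9, fp=8, fn=13)
print(f"{p:.0%} P / {r:.0%} R / {f1:.0%} F1")  # 53% P / 41% R / 46% F1
```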
## Results Summary Table

| Run | Task | MCP P/R/F1 | Baseline P/R/F1 | Resolved |
|---|---|---|---|---|
| 01 Baseline | ts-express-app | 90/92/91% | 92/85/88% | Both YES |
| 02 Precomputed | ts-express-app | 91/99/95% | 90/84/87% | Both YES |
| 03 Real PR | tyr_pr258 | 53/41/46% | 14/5/7% | Both NO |
## Key Findings

### 1. Precomputed analysis dramatically improves recall

Run 02 achieved 99% recall with 5 tool calls ($0.47, 1m 55s) vs. the baseline's 84% recall with 88 tool calls ($0.62, 3m 26s).
### 2. Agent-API gap on real-world codebases

The Supermodel API has 95.5% recall on tyr_pr258, but the agent achieves only 41% recall. Possible causes:

- The analysis JSON's size (204 KB) causing truncation
- The agent selectively filtering instead of faithfully transcribing
- Path normalization mismatches between agent output and ground truth
- Context-window limits on tool output
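The path-normalization hypothesis is cheap to rule out by canonicalizing both sides before matching. A minimal sketch; the `normalize` helper and the repo-root-stripping convention are hypothetical illustrations, not part of the evaluation harness:

```python
from pathlib import PurePosixPath

def normalize(path: str, repo_root: str = "") -> str:
    """Hypothetical canonicalization: force POSIX separators, collapse
    './' segments, and strip a leading repository-root prefix."""
    parts = list(PurePosixPath(path.replace("\\", "/")).parts)
    if repo_root and parts and parts[0] == repo_root:
        parts = parts[1:]
    return "/".join(parts)

# Paths that differ only in prefix, separators, or './' segments
# should collapse to the same comparison key:
print(normalize("track-your-regions/src/api/./utils.js", "track-your-regions"))
print(normalize("src\\api\\utils.js"))
# both print: src/api/utils.js
```

If agent output and ground truth match after normalization but not before, the gap is bookkeeping rather than detection quality.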
### 3. Baseline agents give up on complex repos

The baseline agent on tyr_pr258 ran only 5 iterations and reported 7 items (1 true positive). It essentially surrendered early.
### 4. Scaling pattern confirmed

Consistent with earlier manual tests (antiwork/helper):

- Synthetic (35 files): baseline ~85% recall, MCP ~92-99% recall
- Real-world (100+ files): baseline <5% recall, MCP ~41% recall
- Baseline performance degrades with repository size; the MCP advantage grows
## Cost Summary

- Run 01: $1.26 (MCP $0.69 + baseline $0.57)
- Run 02: $1.09 (MCP $0.47 + baseline $0.62)
- Run 03: $3.41 (MCP $1.77 + baseline $1.64)
- Total: ~$5.76
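The per-run totals above can be sanity-checked with a quick sum:

```python
# Per-run (MCP, baseline) costs in USD, as reported above
runs = {"01": (0.69, 0.57), "02": (0.47, 0.62), "03": (1.77, 1.64)}

for run, (mcp, bl) in runs.items():
    print(f"Run {run}: ${mcp + bl:.2f}")
print(f"Total: ${sum(m + b for m, b in runs.values()):.2f}")  # Total: $5.76
```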
## Environment

- mcpbr: v0.13.4 on `feat/supermodel-benchmark`
- Model: `claude-sonnet-4-20250514`
- Agent harness: `claude-code`
- Docker: v28.1.1
- Python: 3.14.0
- Platform: macOS Darwin 25.2.0
## Artifacts Location

All run artifacts (configs, logs, ground truth, evaluation results) are saved in `~/Downloads/blogpost-deadcode/`. Includes:

- 6 runs (01-03 without verbose logging, 04-06 with `-vv` verbose logging)
- Reference data from earlier manual runs (antiwork/helper benchmark, dead-code-endpoint-analysis)
- Full agent transcripts for each condition
- Ground-truth files and Supermodel API analysis caches
## Open Questions
## Related

- Branch: `feat/supermodel-benchmark`
- Earlier manual benchmark: `~/Downloads/supermodel-benchmark-results/supermodel-dead-code-benchmark.md`
- Dead code corpus: `supermodeltools/dead-code-benchmark-corpus`