## Context

Ran the dead code benchmark suite on the `feat/supermodel-benchmark` branch to generate data for a blog post on AI-assisted dead code detection. This issue captures the full context, results, and observations.
## Runs Completed

### Run 1: Baseline Configuration (`deadcode-baseline.yaml`)

- Task: `typescript-express-app` (synthetic corpus, 35 files, 102 dead functions)
- MCP: Filesystem MCP with a grep-based approach
- Result: MCP 90% P / 92% R / 91% F1 vs. baseline 92% P / 85% R / 88% F1 — both PASS

### Run 2: Precomputed Analysis (`deadcode-precomputed.yaml`)

- Task: same synthetic corpus
- MCP: pre-generated `.supermodel/dead-code-analysis.json`
- Result: MCP 91% P / 99% R / 95% F1 vs. baseline 90% P / 84% R / 87% F1 — both PASS
- Key insight: the precomputed analysis achieves 99% recall with only 5 tool calls

### Run 3: Real PR (`supermodel-deadcode-pr.yaml -t tyr_pr258`)

- Task: `tyr_pr258` — uncovering-world/track-your-regions PR #258 (22 ground-truth items)
- MCP: Supermodel staging API (95.5% API recall)
- Result: MCP 53% P / 41% R / 46% F1 vs. baseline 14% P / 5% R / 7% F1 — both FAIL
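For reference, the P/R/F1 figures in these runs follow the standard definitions. A minimal sketch, using counts inferred from Run 3's rounded MCP percentages (9 true positives out of 17 reported items against 22 ground-truth items; these counts are an assumption consistent with the rounding, not figures taken from the logs):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Run 3, MCP condition: 9 TP of 17 reported, 22 ground-truth items (inferred counts)
p, r, f1 = prf1(tp=9, fp=8, fn=13)
print(f"{p:.0%} P / {r:.0%} R / {f1:.0%} F1")  # 53% P / 41% R / 46% F1
```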
## Results Summary Table

| Run | Task | MCP P/R/F1 | Baseline P/R/F1 | Resolved |
|---|---|---|---|---|
| 01 Baseline | ts-express-app | 90/92/91% | 92/85/88% | Both YES |
| 02 Precomputed | ts-express-app | 91/99/95% | 90/84/87% | Both YES |
| 03 Real PR | tyr_pr258 | 53/41/46% | 14/5/7% | Both NO |
## Key Findings

### 1. Precomputed analysis dramatically improves recall

Run 02 achieved 99% recall with 5 tool calls ($0.47, 1m 55s) vs. the baseline's 84% recall with 88 tool calls ($0.62, 3m 26s).
### 2. Agent-API gap on real-world codebases

The Supermodel API has 95.5% recall on tyr_pr258, but the agent achieves only 41% recall. Possible causes:

- The analysis JSON's size (204 KB) causing truncation
- The agent selectively filtering instead of faithfully transcribing
- Path normalization mismatches between agent output and ground truth
- Context-window limits on tool output
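The path-normalization hypothesis is cheap to rule out by canonicalizing both sides before matching. A minimal sketch; the `normalize` helper and the repo-root-stripping convention are hypothetical illustrations, not part of the evaluation harness:

```python
from pathlib import PurePosixPath

def normalize(path: str, repo_root: str = "") -> str:
    """Hypothetical canonicalization: force POSIX separators, collapse
    './' segments, and strip a leading repository-root prefix."""
    parts = list(PurePosixPath(path.replace("\\", "/")).parts)
    if repo_root and parts and parts[0] == repo_root:
        parts = parts[1:]
    return "/".join(parts)

# Paths that differ only in prefix, separators, or './' segments
# should collapse to the same comparison key:
print(normalize("track-your-regions/src/api/./utils.js", "track-your-regions"))
print(normalize("src\\api\\utils.js"))
# both print: src/api/utils.js
```

If agent output and ground truth match after normalization but not before, the gap is bookkeeping rather than detection quality.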
### 3. Baseline agents give up on complex repos

The baseline agent on tyr_pr258 ran only 5 iterations and reported 7 items (1 true positive). It essentially surrendered early.
### 4. Scaling pattern confirmed

Consistent with earlier manual tests (antiwork/helper):

- Synthetic (35 files): baseline ~85% recall, MCP ~92-99% recall
- Real-world (100+ files): baseline <5% recall, MCP ~41% recall
- Baseline performance degrades with repository size; the MCP advantage grows
## Cost Summary

- Run 01: $1.26 (MCP $0.69 + baseline $0.57)
- Run 02: $1.09 (MCP $0.47 + baseline $0.62)
- Run 03: $3.41 (MCP $1.77 + baseline $1.64)
- Total: ~$5.76
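The per-run totals above can be sanity-checked with a quick sum:

```python
# Per-run (MCP, baseline) costs in USD, as reported above
runs = {"01": (0.69, 0.57), "02": (0.47, 0.62), "03": (1.77, 1.64)}

for run, (mcp, bl) in runs.items():
    print(f"Run {run}: ${mcp + bl:.2f}")
print(f"Total: ${sum(m + b for m, b in runs.values()):.2f}")  # Total: $5.76
```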
## Environment

- mcpbr: v0.13.4 on `feat/supermodel-benchmark`
- Model: `claude-sonnet-4-20250514`
- Agent harness: `claude-code`
- Docker: v28.1.1
- Python: 3.14.0
- Platform: macOS Darwin 25.2.0
## Artifacts Location

All run artifacts (configs, logs, ground truth, evaluation results) are saved in `~/Downloads/blogpost-deadcode/`. Includes:

- 6 runs (01-03 without verbose logging, 04-06 with `-vv` verbose logging)
- Reference data from earlier manual runs (antiwork/helper benchmark, dead-code-endpoint-analysis)
- Full agent transcripts for each condition
- Ground-truth files and Supermodel API analysis caches
## Open Questions
## Related

- Branch: `feat/supermodel-benchmark`
- Earlier manual benchmark: `~/Downloads/supermodel-benchmark-results/supermodel-dead-code-benchmark.md`
- Dead code corpus: `supermodeltools/dead-code-benchmark-corpus`