Problem
In our SWE-bench Verified evaluation, the agent spends 13.7% of all tool calls on TodoWrite (Claude Code's built-in task tracking tool), averaging 3.6 calls per task. This overhead provides no measurable benefit: tasks with heavy TodoWrite usage do not resolve at higher rates.
Data
3.6 TodoWrite calls per task on average (across all MCP tasks)
13.7% of total tool budget consumed by TodoWrite
No correlation between TodoWrite usage and task resolution
With a 30-iteration limit, each wasted call is ~3.3% of the total budget
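As a rough illustration of how the figures above could be derived from per-task logs, here is a minimal sketch. The `TaskRun` record layout, the `todo_stats` helper, and the `heavy_threshold` cutoff are assumptions made for this example, not the evaluation's actual tooling, and the two sample runs are synthetic toy data.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One evaluated task (hypothetical record layout)."""
    tool_calls: list[str]  # tool names, in call order
    resolved: bool         # whether the task was resolved


def rate(flags):
    """Fraction of True values, or None for an empty group."""
    return sum(flags) / len(flags) if flags else None


def todo_stats(runs, tool="TodoWrite", heavy_threshold=4):
    """Summarize usage of `tool` and compare resolve rates by usage level."""
    counts = [r.tool_calls.count(tool) for r in runs]
    total_calls = sum(len(r.tool_calls) for r in runs)
    # Split tasks into heavy and light users of the tool (assumed cutoff).
    heavy = [r.resolved for r, n in zip(runs, counts) if n >= heavy_threshold]
    light = [r.resolved for r, n in zip(runs, counts) if n < heavy_threshold]
    return {
        "avg_calls_per_task": sum(counts) / len(runs),
        "share_of_tool_calls": sum(counts) / total_calls,
        "resolve_rate_heavy": rate(heavy),
        "resolve_rate_light": rate(light),
    }


# Synthetic toy data, only to show the call shape.
sample_runs = [
    TaskRun(["Read", "TodoWrite", "Edit", "TodoWrite"], resolved=True),
    TaskRun(["Read", "Edit"], resolved=False),
]
stats = todo_stats(sample_runs, heavy_threshold=2)
```

Comparing `resolve_rate_heavy` with `resolve_rate_light` is only a coarse check; a proper correlation test over all tasks would be more informative.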
Root Cause
Claude Code's default behavior includes proactive task list management. When the agent receives a complex problem statement, it creates a todo list, updates it as it works, and marks items complete — all consuming tool call turns that could be spent on actual exploration and coding.
Impact
Recovering even half of these wasted calls would give the agent ~2 additional exploration or editing turns per task (half of 3.6 is 1.8). Across 500 tasks, that is roughly 900 recovered turns.
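The arithmetic behind these estimates can be reproduced in a few lines. The 30-iteration limit, 3.6 calls per task, and 500 tasks are taken from this issue; the constant and function names are made up for this sketch.

```python
ITERATION_LIMIT = 30        # per-task tool call budget (from this issue)
TODO_CALLS_PER_TASK = 3.6   # average TodoWrite calls per task (from this issue)
NUM_TASKS = 500             # number of tasks considered (from this issue)


def budget_impact(recovered_fraction=0.5):
    """Estimate the turns regained if a fraction of TodoWrite calls is avoided."""
    cost_per_call = 1 / ITERATION_LIMIT  # share of the budget one call uses
    recovered_per_task = TODO_CALLS_PER_TASK * recovered_fraction
    return {
        "cost_per_call": cost_per_call,
        "recovered_per_task": recovered_per_task,
        "recovered_total": recovered_per_task * NUM_TASKS,
    }


impact = budget_impact()
print(f"Each call costs {impact['cost_per_call']:.1%} of the budget")
print(f"Turns recovered per task: {impact['recovered_per_task']:.1f}")
print(f"Turns recovered overall: {impact['recovered_total']:.0f}")
```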
Recommended Fixes
Add an instruction to the MCP server: "Do not use TodoWrite or task management tools. Focus all tool calls on exploration and code editing."
Labels
performance, swe-bench