MCP vs Direct vs CLI Benchmark

Benchmarking three approaches to wiring tools into AI agents — so you can pick the right one.

The Question

When building AI agents, how should tools be connected to the LLM? MCP (Model Context Protocol) is the emerging standard for tool integration, but it introduces protocol overhead: server lifecycle, JSON-RPC serialization, and dynamic discovery. Is that overhead worth it when your tools run in the same process? This benchmark measures the concrete cost of each approach across latency, token usage, and end-to-end task completion.

Three Approaches

MCP (Model Context Protocol)

Spawns a server process, connects via stdio JSON-RPC, and discovers tools dynamically at runtime. This is the standard for cross-boundary tool communication — ideal when tools live in separate processes, machines, or are provided by third parties.

Direct API (Pydantic)

In-process Python functions with Pydantic models for input validation. Tool schemas are generated from model_json_schema() and passed directly to the Anthropic API. Zero serialization overhead, zero process boundaries.

CLI (subprocess)

Standalone Python scripts invoked via asyncio.create_subprocess_exec. Each tool is an independently runnable script that accepts JSON on stdin and returns JSON on stdout. The agent shells out for every call.

What We Measure

Metric	Description
Token overhead	Size of tool definitions sent in each API request
Per-call latency	Wall-clock time from tool invocation to result
End-to-end task time	Total time to complete a multi-step agent task
Total API tokens	Input + output tokens consumed across the full task
All three approaches execute the same task (file operations in a temp directory) with the same underlying tool implementations, so differences reflect pure integration overhead.

Run It Yourself

Prerequisites

Python 3.12+
uv package manager
Node.js (required by MCP server runtime)
Anthropic API key and/or OpenAI API key

Setup

git clone https://github.com/odedha-dr/mcp-vs-direct-benchmark.git
cd mcp-vs-direct-benchmark

cp .env.example .env
# Add ANTHROPIC_API_KEY and/or OPENAI_API_KEY to .env

uv sync --group dev

Run

uv run python -m src.benchmark           # default: 3 runs per approach
uv run python -m src.benchmark --runs 2  # faster: 2 runs

The benchmark auto-detects which API keys are present and runs Claude, GPT, or both.

# Override models
uv run python -m src.benchmark --claude-model claude-sonnet-4-20250514 --openai-model gpt-4o-mini

Results are printed to the terminal and saved to results/.

Results

Benchmark run on 2026-03-09, 2 runs per approach, same filesystem task (list directory, read file, write summary file). All approaches expose the same 3 tools for a fair apples-to-apples comparison.

Claude (Sonnet 4)

Metric	Direct (Pydantic)	CLI (subprocess)	MCP
Tool definition tokens	203	186	461
Avg tool call latency	1.8 ms	28.8 ms	4.6 ms
Avg total task time	10.8 s	10.3 s	10.0 s
Avg API input tokens	3,447	3,360	4,373
Avg API output tokens	490	482	495
Avg API turns	4	4	4
Avg tool calls	3	3	3

GPT-4o

Metric	Direct (Pydantic)	CLI (subprocess)	MCP
Tool definition tokens	227	210	485
Avg tool call latency	2.8 ms	26.9 ms	3.3 ms
Avg total task time	3.8 s	3.6 s	3.6 s
Avg API input tokens	1,206	1,164	1,877
Avg API output tokens	204	209	210
Avg API turns	4	4	4
Avg tool calls	3	3	3

Cross-LLM Comparison (Direct approach)

Metric	Claude Sonnet 4	GPT-4o
Avg total task time	10.8 s	3.8 s
Avg API input tokens	3,447	1,206
Avg API output tokens	490	204
Avg API turns	4	4
Avg tool calls	3	3

Analysis

All runs completed the same task with identical behavior: 4 API turns, 3 tool calls (list, read, write). No errors, no retries.

1. MCP overhead exists but isn't dramatic

With the same 3 tools, MCP adds measurable but modest cost:

~2x more tool definition tokens (461 vs 203) — MCP schemas include richer metadata (annotations, readOnlyHint, etc.)
~27% more API input tokens with Claude (4,373 vs 3,447)
No meaningful difference in end-to-end time — within noise for both LLMs

All three approaches (Direct, CLI, MCP) produce the same task behavior and comparable wall-clock times. The integration method doesn't matter much — LLM response latency dominates.

2. Claude and GPT-4o differ significantly

This is the more interesting finding. With identical task behavior (same turns, same tool calls), GPT-4o completes in ~3.8s vs Claude's ~10.8s — nearly 3x faster. The gap shows up across every metric:

2.9x more input tokens with Claude (3,447 vs 1,206) — same 3 tool definitions, same conversation, but Claude's message format is significantly more verbose
2.4x more output tokens (490 vs 204) — Claude generates longer responses for the same task
~3x slower end-to-end — a combination of higher per-turn latency and more tokens to process

Both models complete the task correctly with the same number of steps. The difference is pure efficiency: GPT-4o's message encoding is more compact and its responses are more concise. For tool-heavy agentic workloads where you're paying per-token and per-second, this gap compounds.

Takeaway

MCP vs Direct: MCP adds ~2x tool definition tokens due to richer schemas, but the impact on total cost and latency is modest. Pick your integration approach based on architecture (same-process vs cross-boundary), not performance.

Claude vs GPT-4o: The bigger surprise is the cross-LLM gap. GPT-4o uses ~3x fewer input tokens and ~2.4x fewer output tokens for the same task, resulting in ~3x faster completion. For token-heavy agentic workloads, the choice of LLM matters more than the choice of tool integration.

When to Use What

Approach	Use when...	Avoid when...
MCP	Tools cross process/machine boundaries; third-party tool providers; marketplace/plugin scenarios; you need dynamic tool discovery	Tools are local to the agent process and you control them all
Direct	Tools run in the same process as the agent; you control all tool implementations; latency and token cost matter	You need to expose tools to external consumers or support a plugin model
CLI	Wrapping existing scripts or binaries; tools need to be independently testable from the terminal; polyglot toolchains	High-frequency tool calls where subprocess overhead adds up

Project Structure

mcp-vs-direct-benchmark/
├── src/
│   ├── benchmark.py              # Main benchmark entry point
│   ├── __main__.py               # Module runner
│   ├── harness/
│   │   ├── runner.py             # Benchmark runner (drives the agent loop)
│   │   ├── reporter.py           # Results formatting and comparison
│   │   └── token_counter.py      # Token usage tracking
│   └── tools/
│       ├── interface.py           # Common tool interface
│       ├── mcp/
│       │   └── client.py          # MCP client + server lifecycle
│       ├── direct/
│       │   └── tools.py           # Pydantic-based direct tools
│       └── cli/
│           ├── wrapper.py         # Subprocess wrapper for CLI tools
│           ├── list_dir.py        # Standalone list_directory script
│           ├── read_file.py       # Standalone read_file script
│           └── write_file.py      # Standalone write_file script
├── tests/
│   ├── test_mcp_tools.py
│   ├── test_direct_tools.py
│   ├── test_cli_tools.py
│   └── test_harness.py
├── docs/plans/                    # Design docs and implementation plan
├── pyproject.toml
└── .env.example

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MCP vs Direct vs CLI Benchmark

The Question

Three Approaches

MCP (Model Context Protocol)

Direct API (Pydantic)

CLI (subprocess)

What We Measure

Run It Yourself

Prerequisites

Setup

Run

Results

Claude (Sonnet 4)

GPT-4o

Cross-LLM Comparison (Direct approach)

Analysis

Takeaway

When to Use What

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
docs/plans		docs/plans
results		results
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

MCP vs Direct vs CLI Benchmark

The Question

Three Approaches

MCP (Model Context Protocol)

Direct API (Pydantic)

CLI (subprocess)

What We Measure

Run It Yourself

Prerequisites

Setup

Run

Results

Claude (Sonnet 4)

GPT-4o

Cross-LLM Comparison (Direct approach)

Analysis

Takeaway

When to Use What

Project Structure

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages