| status | active |
|---|---|
| lead | Oded Har-Tal |
| people | |
| created | 2026-03-10 |
Understanding when to inject a skill into your context vs spawn an isolated sub-agent — and what it costs.
AI coding assistants like Claude Code offer two patterns for extending agent capabilities: skills (prompt templates injected into the current context) and sub-agents (isolated agent processes with their own context window). These patterns have fundamentally different trade-offs around token cost, latency, context management, and composability. This project measures those trade-offs concretely.
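The structural difference between the two patterns can be sketched in a few lines. This is an illustrative sketch with a stubbed model call, not real Claude Code internals; all function and variable names here are hypothetical.

```python
def call_model(messages):
    """Stand-in for an LLM API call; returns a canned reply."""
    return f"[reply after reading {len(messages)} message(s)]"

def run_skill(parent_messages, skill_prompt, task):
    # Skill pattern: the skill text is injected into the PARENT context,
    # so the prompt and the full reply both accumulate there.
    parent_messages.append({"role": "user", "content": skill_prompt + "\n\n" + task})
    reply = call_model(parent_messages)
    parent_messages.append({"role": "assistant", "content": reply})
    return reply

def run_subagent(parent_messages, skill_prompt, task):
    # Sub-agent pattern: the child gets its OWN message list; only a
    # short summary flows back into the parent context.
    child_messages = [{"role": "user", "content": skill_prompt + "\n\n" + task}]
    detailed = call_model(child_messages)   # heavy work stays isolated
    summary = detailed[:40]                 # the compression point
    parent_messages.append({"role": "assistant", "content": summary})
    return summary

parent_a, parent_b = [], []
run_skill(parent_a, "SKILL: analyze the repo...", "find the entry point")
run_subagent(parent_b, "SKILL: analyze the repo...", "find the entry point")
print(len(parent_a), len(parent_b))  # skill pollutes the parent; sub-agent doesn't
```

The asymmetry in what accumulates in `parent_messages` is exactly what the benchmark below measures as "parent input tokens".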
Four tasks of increasing complexity, each implemented as both a skill pattern and a sub-agent pattern, are run against the MCP vs Direct benchmark codebase as the target repo. A cross-platform comparison with Codex CLI and OpenClaw is also included.
| Dimension | Key Question |
|---|---|
| Token overhead | When does context pollution from skills cost more than sub-agent overhead? |
| Latency | When does parallelism offset spawn cost? |
| Context isolation | When is isolation a feature vs a limitation? |
| Composability | Which composes better for multi-step workflows? |
Benchmark run on 2026-03-10 using Claude Sonnet 4, 2 runs per pattern, targeting a ~15-file Python codebase.
| Metric | Skill | Sub-Agent |
|---|---|---|
| Parent input tokens | 5,639 | 344 |
| Child input tokens | 0 | 5,713 |
| Total tokens | 6,053 | 6,884 |
| Total time | 10.7s | 22.1s |
Winner: Skill. For simple tasks, the skill pattern is faster and cheaper; the sub-agent's overhead (spawn + synthesis) isn't justified.
| Metric | Skill | Sub-Agent |
|---|---|---|
| Parent input tokens | 90,298 | 137 |
| Child input tokens | 0 | 90,437 |
| Total tokens | 92,008 | 92,583 |
| Total time | 54.8s | 54.6s |
Winner: Tie. Total cost is effectively the same, but note the parent context: the skill leaves 90K tokens in the parent, while the sub-agent leaves only 137. If you have follow-up work, the sub-agent wins on context hygiene.
| Metric | Skill | Sub-Agent | Parallel Sub-Agent |
|---|---|---|---|
| Parent input tokens | 84,719 | 161 | 1,274 |
| Child input tokens | 0 | 98,633 | 133,260 |
| Total tokens | 86,533 | 100,872 | 139,865 |
| Total time | 51.4s | 56.5s | 79.0s |
Winner: Skill on total cost. Parallel sub-agents are the most expensive option: each agent builds its own context independently. However, parallel sub-agents keep the parent context to about 2K tokens, vs 86,533 for the skill.
| Metric | Skill | Sub-Agent |
|---|---|---|
| Parent input tokens | 96,711 | 946 |
| Child input tokens | 0 | 71,193 |
| Total tokens | 99,249 | 74,687 |
| Total time | 61.8s | 67.8s |
Winner: Sub-Agent on total tokens (-25%). The child agent produced a concise summary, avoiding the conversation history accumulation that inflated the skill pattern. Parent context stayed under 1,400 tokens.
| Task | Skill Tokens | Sub-Agent Tokens | Delta |
|---|---|---|---|
| Single Lookup | 6,053 | 6,884 | +14% |
| Multi-File Analysis | 92,008 | 92,583 | ~0% |
| Parallel Analysis | 86,533 | 100,872 | +17% |
| Composed Workflow | 99,249 | 74,687 | -25% |
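The Delta column follows directly from the Total tokens columns; a quick check:

```python
# Recomputing the Delta column from the totals in the summary table.
totals = {
    "Single Lookup":       (6_053, 6_884),
    "Multi-File Analysis": (92_008, 92_583),
    "Parallel Analysis":   (86_533, 100_872),
    "Composed Workflow":   (99_249, 74_687),
}
deltas = {
    task: round((sub_agent - skill) / skill * 100)
    for task, (skill, sub_agent) in totals.items()
}
for task, pct in deltas.items():
    print(f"{task}: {pct:+d}%")
# Single Lookup: +14%, Multi-File Analysis: +1%,
# Parallel Analysis: +17%, Composed Workflow: -25%
```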
1. Skills win for simple, focused tasks. No spawn overhead, no context handoff. For single lookups and straightforward analyses, skills are the right choice.
2. Sub-agents win for composed workflows. When a task involves multiple steps that build on each other, the sub-agent's summary acts as a natural compression point — the child does the heavy reading, and only the insight flows back to the parent. This saved 25% on tokens for T4.
3. Context isolation is the real value. The headline number isn't total tokens — it's parent context tokens. After T4, the skill pattern left 99K tokens in the parent context. The sub-agent left 1.4K. In a real session where you have follow-up tasks, that 97.6K difference compounds.
4. Parallel sub-agents are expensive but clean. Running 3 independent sub-agents used 62% more tokens than a single skill — each agent needs its own tool definitions and conversation setup. The parallelism advantage only materializes when tasks are truly independent AND the time savings (from concurrent API calls) outweigh the token cost.
5. The crossover point is task complexity. Simple tasks → skill. Complex multi-step tasks → sub-agent. The break-even is around T2 complexity (~20 tool calls), where both patterns cost roughly the same but sub-agents preserve parent context.
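The parallel pattern from finding 4 can be sketched with a thread pool. The sub-agent stub and file names below are illustrative, not the benchmark's actual implementation:

```python
# Illustrative sketch of the parallel sub-agent pattern: three independent
# children run concurrently, each building context in isolation, and the
# parent collects only their short summaries.
from concurrent.futures import ThreadPoolExecutor

def sub_agent(target: str) -> str:
    # Stand-in for a child agent: it "reads" a lot but returns a summary.
    detailed_context = f"full analysis of {target}... " * 100  # stays here
    return f"summary of {target} ({len(detailed_context)} chars read)"

targets = ["tools.py", "runner.py", "reporter.py"]
with ThreadPoolExecutor(max_workers=len(targets)) as pool:
    summaries = list(pool.map(sub_agent, targets))

for s in summaries:
    print(s)  # the parent holds three short lines, not three full analyses
```

Each child's `detailed_context` is the per-agent cost that made T3's parallel run 62% more expensive; the short return values are why the parent context stayed near 2K tokens.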
| Use Skill When... | Use Sub-Agent When... |
|---|---|
| Task needs < 5 tool calls | Task needs > 10 tool calls |
| You need the result in-context for follow-up | You need context isolation |
| Task is a single focused operation | Task has multiple composed steps |
| Latency matters (no spawn overhead) | You want concurrent execution |
| Parent context isn't near capacity | Parent context is getting full |
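As a rule of thumb, the table above reads as a simple decision function. The thresholds are this benchmark's heuristics, not a universal rule, and the helper itself is hypothetical:

```python
# Hypothetical helper encoding the decision table above. The thresholds
# (< 5 / > 10 tool calls) come from the table's rules of thumb.
def choose_pattern(est_tool_calls: int, needs_result_in_context: bool,
                   wants_concurrency: bool, parent_near_capacity: bool) -> str:
    if wants_concurrency or parent_near_capacity:
        return "sub-agent"
    if needs_result_in_context or est_tool_calls < 5:
        return "skill"
    if est_tool_calls > 10:
        return "sub-agent"
    return "either"  # ~T2 complexity: total cost is roughly a wash

print(choose_pattern(3, needs_result_in_context=True,
                     wants_concurrency=False, parent_near_capacity=False))
print(choose_pattern(25, needs_result_in_context=False,
                     wants_concurrency=False, parent_near_capacity=False))
```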
See `docs/analysis/cross-platform-comparison.md` for a detailed analysis of how Claude Code, Codex CLI, and OpenClaw handle these patterns differently.
- Python 3.12+
- uv package manager
- Anthropic API key
- The MCP vs Direct benchmark codebase (as the target)
```bash
cd projects/personal/skills-vs-subagents
cp .env.example .env
# Add ANTHROPIC_API_KEY to .env
uv sync --group dev
```

```bash
# Full benchmark (all 4 tasks, 2 runs each), ~10-15 min
uv run python -m src --runs 2

# Quick single task
uv run python -m src --runs 1 --tasks t1

# Specific tasks
uv run python -m src --runs 2 --tasks t1,t4
```

Results are saved to `results/` (markdown + JSON) and charts to `charts/` (PNG).
```
skills-vs-subagents/
├── src/
│   ├── benchmark.py          # Main entry point + CLI
│   ├── tools.py              # File operation tools (shared by both patterns)
│   ├── harness/
│   │   ├── runner.py         # Skill, sub-agent, and parallel runners
│   │   └── reporter.py       # Markdown results formatting
│   ├── tasks/
│   │   └── definitions.py    # 4 benchmark task definitions
│   └── charts/
│       └── token_chart.py    # Stacked bar chart generation
├── tests/                    # Full test suite
├── docs/
│   ├── plans/                # Design doc + implementation plan
│   └── analysis/             # Cross-platform comparison
├── results/                  # Benchmark output (markdown + JSON)
└── charts/                   # Generated charts (PNG)
```
MIT
