| status | active |
|---|---|
| lead | Oded Har-Tal |
| people | |
| created | 2026-03-10 |
Understanding when to inject a skill into your context vs spawn an isolated sub-agent — and what it costs.
AI coding assistants like Claude Code offer two patterns for extending agent capabilities: skills (prompt templates injected into the current context) and sub-agents (isolated agent processes with their own context window). These patterns have fundamentally different trade-offs around token cost, latency, context management, and composability. This project measures those trade-offs concretely.
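The structural difference between the two patterns can be sketched in a few lines. This is an illustrative sketch with a stubbed model call, not real Claude Code internals; all function and variable names here are hypothetical.

```python
def call_model(messages):
    """Stand-in for an LLM API call; returns a canned reply."""
    return f"[reply after reading {len(messages)} message(s)]"

def run_skill(parent_messages, skill_prompt, task):
    # Skill pattern: the skill text is injected into the PARENT context,
    # so the prompt and the full reply both accumulate there.
    parent_messages.append({"role": "user", "content": skill_prompt + "\n\n" + task})
    reply = call_model(parent_messages)
    parent_messages.append({"role": "assistant", "content": reply})
    return reply

def run_subagent(parent_messages, skill_prompt, task):
    # Sub-agent pattern: the child gets its OWN message list; only a
    # short summary flows back into the parent context.
    child_messages = [{"role": "user", "content": skill_prompt + "\n\n" + task}]
    detailed = call_model(child_messages)   # heavy work stays isolated
    summary = detailed[:40]                 # the compression point
    parent_messages.append({"role": "assistant", "content": summary})
    return summary

parent_a, parent_b = [], []
run_skill(parent_a, "SKILL: analyze the repo...", "find the entry point")
run_subagent(parent_b, "SKILL: analyze the repo...", "find the entry point")
print(len(parent_a), len(parent_b))  # skill pollutes the parent; sub-agent doesn't
```

The asymmetry in what accumulates in `parent_messages` is exactly what the benchmark below measures as "parent input tokens".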
Four tasks of increasing complexity, each implemented as both a skill pattern and a sub-agent pattern, are run against the MCP vs Direct benchmark codebase as the target repo. A cross-platform comparison with Codex CLI and OpenClaw is also included.
| Dimension | Key Question |
|---|---|
| Token overhead | When does context pollution from skills cost more than sub-agent overhead? |
| Latency | When does parallelism offset spawn cost? |
| Context isolation | When is isolation a feature vs a limitation? |
| Composability | Which composes better for multi-step workflows? |
Benchmark run on 2026-03-10 using Claude Sonnet 4, 2 runs per pattern, targeting a ~15-file Python codebase.
| Metric | Skill | Sub-Agent |
|---|---|---|
| Parent input tokens | 5,639 | 344 |
| Child input tokens | 0 | 5,713 |
| Total tokens | 6,053 | 6,884 |
| Total time | 10.7s | 22.1s |
Winner: Skill. For simple tasks, the skill pattern is faster and cheaper; the sub-agent's overhead (spawn + synthesis) isn't justified.
| Metric | Skill | Sub-Agent |
|---|---|---|
| Parent input tokens | 90,298 | 137 |
| Child input tokens | 0 | 90,437 |
| Total tokens | 92,008 | 92,583 |
| Total time | 54.8s | 54.6s |
Winner: Tie. Total cost is effectively the same, but note the parent context: the skill leaves 90K tokens in the parent, while the sub-agent leaves only 137. If you have follow-up work, the sub-agent wins on context hygiene.
| Metric | Skill | Sub-Agent | Parallel Sub-Agent |
|---|---|---|---|
| Parent input tokens | 84,719 | 161 | 1,274 |
| Child input tokens | 0 | 98,633 | 133,260 |
| Total tokens | 86,533 | 100,872 | 139,865 |
| Total time | 51.4s | 56.5s | 79.0s |
Winner: Skill on total cost. Parallel sub-agents are the most expensive option: each agent builds its own context independently. However, parallel sub-agents keep the parent context to about 2K tokens, vs 86,533 for the skill.
| Metric | Skill | Sub-Agent |
|---|---|---|
| Parent input tokens | 96,711 | 946 |
| Child input tokens | 0 | 71,193 |
| Total tokens | 99,249 | 74,687 |
| Total time | 61.8s | 67.8s |
Winner: Sub-Agent on total tokens (-25%). The child agent produced a concise summary, avoiding the conversation history accumulation that inflated the skill pattern. Parent context stayed under 1,400 tokens.
| Task | Skill Tokens | Sub-Agent Tokens | Delta |
|---|---|---|---|
| Single Lookup | 6,053 | 6,884 | +14% |
| Multi-File Analysis | 92,008 | 92,583 | ~0% |
| Parallel Analysis | 86,533 | 100,872 | +17% |
| Composed Workflow | 99,249 | 74,687 | -25% |
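The Delta column follows directly from the Total tokens columns; a quick check:

```python
# Recomputing the Delta column from the totals in the summary table.
totals = {
    "Single Lookup":       (6_053, 6_884),
    "Multi-File Analysis": (92_008, 92_583),
    "Parallel Analysis":   (86_533, 100_872),
    "Composed Workflow":   (99_249, 74_687),
}
deltas = {
    task: round((sub_agent - skill) / skill * 100)
    for task, (skill, sub_agent) in totals.items()
}
for task, pct in deltas.items():
    print(f"{task}: {pct:+d}%")
# Single Lookup: +14%, Multi-File Analysis: +1%,
# Parallel Analysis: +17%, Composed Workflow: -25%
```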
1. Skills win for simple, focused tasks. No spawn overhead, no context handoff. For single lookups and straightforward analyses, skills are the right choice.
2. Sub-agents win for composed workflows. When a task involves multiple steps that build on each other, the sub-agent's summary acts as a natural compression point — the child does the heavy reading, and only the insight flows back to the parent. This saved 25% on tokens for T4.
3. Context isolation is the real value. The headline number isn't total tokens — it's parent context tokens. After T4, the skill pattern left 99K tokens in the parent context. The sub-agent left 1.4K. In a real session where you have follow-up tasks, that 97.6K difference compounds.
4. Parallel sub-agents are expensive but clean. Running 3 independent sub-agents used 62% more tokens than a single skill — each agent needs its own tool definitions and conversation setup. The parallelism advantage only materializes when tasks are truly independent AND the time savings (from concurrent API calls) outweigh the token cost.
5. The crossover point is task complexity. Simple tasks → skill. Complex multi-step tasks → sub-agent. The break-even is around T2 complexity (~20 tool calls), where both patterns cost roughly the same but sub-agents preserve parent context.
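The parallel pattern from finding 4 can be sketched with a thread pool. The sub-agent stub and file names below are illustrative, not the benchmark's actual implementation:

```python
# Illustrative sketch of the parallel sub-agent pattern: three independent
# children run concurrently, each building context in isolation, and the
# parent collects only their short summaries.
from concurrent.futures import ThreadPoolExecutor

def sub_agent(target: str) -> str:
    # Stand-in for a child agent: it "reads" a lot but returns a summary.
    detailed_context = f"full analysis of {target}... " * 100  # stays here
    return f"summary of {target} ({len(detailed_context)} chars read)"

targets = ["tools.py", "runner.py", "reporter.py"]
with ThreadPoolExecutor(max_workers=len(targets)) as pool:
    summaries = list(pool.map(sub_agent, targets))

for s in summaries:
    print(s)  # the parent holds three short lines, not three full analyses
```

Each child's `detailed_context` is the per-agent cost that made T3's parallel run 62% more expensive; the short return values are why the parent context stayed near 2K tokens.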
| Use Skill When... | Use Sub-Agent When... |
|---|---|
| Task needs < 5 tool calls | Task needs > 10 tool calls |
| You need the result in-context for follow-up | You need context isolation |
| Task is a single focused operation | Task has multiple composed steps |
| Latency matters (no spawn overhead) | You want concurrent execution |
| Parent context isn't near capacity | Parent context is getting full |
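As a rule of thumb, the table above reads as a simple decision function. The thresholds are this benchmark's heuristics, not a universal rule, and the helper itself is hypothetical:

```python
# Hypothetical helper encoding the decision table above. The thresholds
# (< 5 / > 10 tool calls) come from the table's rules of thumb.
def choose_pattern(est_tool_calls: int, needs_result_in_context: bool,
                   wants_concurrency: bool, parent_near_capacity: bool) -> str:
    if wants_concurrency or parent_near_capacity:
        return "sub-agent"
    if needs_result_in_context or est_tool_calls < 5:
        return "skill"
    if est_tool_calls > 10:
        return "sub-agent"
    return "either"  # ~T2 complexity: total cost is roughly a wash

print(choose_pattern(3, needs_result_in_context=True,
                     wants_concurrency=False, parent_near_capacity=False))
print(choose_pattern(25, needs_result_in_context=False,
                     wants_concurrency=False, parent_near_capacity=False))
```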
See `docs/analysis/cross-platform-comparison.md` for a detailed analysis of how Claude Code, Codex CLI, and OpenClaw handle these patterns differently.
- Python 3.12+
- uv package manager
- Anthropic API key
- The MCP vs Direct benchmark codebase (as the target)
```bash
cd projects/personal/skills-vs-subagents
cp .env.example .env
# Add ANTHROPIC_API_KEY to .env
uv sync --group dev
```

```bash
# Full benchmark (all 4 tasks, 2 runs each), ~10-15 min
uv run python -m src --runs 2

# Quick single task
uv run python -m src --runs 1 --tasks t1

# Specific tasks
uv run python -m src --runs 2 --tasks t1,t4
```

Results are saved to `results/` (markdown + JSON) and charts to `charts/` (PNG).
```
skills-vs-subagents/
├── src/
│   ├── benchmark.py          # Main entry point + CLI
│   ├── tools.py              # File operation tools (shared by both patterns)
│   ├── harness/
│   │   ├── runner.py         # Skill, sub-agent, and parallel runners
│   │   └── reporter.py       # Markdown results formatting
│   ├── tasks/
│   │   └── definitions.py    # 4 benchmark task definitions
│   └── charts/
│       └── token_chart.py    # Stacked bar chart generation
├── tests/                    # Full test suite
├── docs/
│   ├── plans/                # Design doc + implementation plan
│   └── analysis/             # Cross-platform comparison
├── results/                  # Benchmark output (markdown + JSON)
└── charts/                   # Generated charts (PNG)
```
MIT
