
**Status:** active · **Lead:** Oded Har-Tal · **People:** Oded Har-Tal · **Created:** 2026-03-10

# Skills vs Sub-Agents: Pattern Analysis & Benchmark

Understanding when to inject a skill into your context vs spawn an isolated sub-agent — and what it costs.

## The Question

AI coding assistants like Claude Code offer two patterns for extending agent capabilities: skills (prompt templates injected into the current context) and sub-agents (isolated agent processes with their own context window). These patterns have fundamentally different trade-offs around token cost, latency, context management, and composability. This project measures those trade-offs concretely.
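The core difference between the two patterns can be sketched with plain lists standing in for context windows. This is an illustrative model, not this project's code; the function names are hypothetical:

```python
# Illustrative sketch of the two patterns, with plain lists standing in for
# context windows. Function names are hypothetical, not from this project.

def run_skill(parent_context, skill_prompt, transcript):
    """Skill pattern: the prompt and the full tool transcript land in the PARENT context."""
    parent_context.append(skill_prompt)
    parent_context.extend(transcript)  # every tool result stays in-context
    return parent_context

def run_subagent(parent_context, task, transcript):
    """Sub-agent pattern: work happens in a fresh child context; only a summary returns."""
    child_context = [task, *transcript]               # isolated context window
    summary = f"summary({len(child_context)} items)"  # child compresses its work
    parent_context.append(summary)                    # parent sees one message
    return parent_context
```

With a two-result transcript, the skill variant leaves three items in the parent context while the sub-agent variant leaves one; that gap is the context-hygiene effect the benchmark measures.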

## Approach

Four tasks of increasing complexity, each implemented as both a skill pattern and a sub-agent pattern, run against the MCP vs Direct benchmark codebase as the target repo, plus a cross-platform comparison with Codex CLI and OpenClaw.

## Dimensions

| Dimension | Key Question |
| --- | --- |
| Token overhead | When does context pollution from skills cost more than sub-agent overhead? |
| Latency | When does parallelism offset spawn cost? |
| Context isolation | When is isolation a feature vs a limitation? |
| Composability | Which pattern composes better for multi-step workflows? |

## Results

Benchmark run on 2026-03-10 using Claude Sonnet 4, 2 runs per pattern, targeting a ~15-file Python codebase.

*Chart: Token Consumption by Context (stacked-bar PNG generated in `charts/`).*

### Per-Task Results

#### T1: Single Lookup (find a function signature)

| Metric | Skill | Sub-Agent |
| --- | --- | --- |
| Parent input tokens | 5,639 | 344 |
| Child input tokens | 0 | 5,713 |
| Total tokens | 6,053 | 6,884 |
| Total time | 10.7s | 22.1s |

Winner: Skill. For simple tasks, skill pattern is faster and cheaper. Sub-agent overhead (spawn + synthesis) isn't justified.

#### T2: Multi-File Analysis (analyze all files, produce summary)

| Metric | Skill | Sub-Agent |
| --- | --- | --- |
| Parent input tokens | 90,298 | 137 |
| Child input tokens | 0 | 90,437 |
| Total tokens | 92,008 | 92,583 |
| Total time | 54.8s | 54.6s |

Winner: Tie. Total cost is essentially the same. But note the parent context: the skill pattern leaves 90K tokens in the parent, while the sub-agent leaves only 137. If you have follow-up work, the sub-agent wins on context hygiene.
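The context-hygiene point can be quantified with back-of-envelope arithmetic: each follow-up turn re-sends whatever is left in the parent context as input tokens. A minimal sketch using the T2 parent-context figures from the table above; the linear per-turn model (which ignores new messages added each turn) and the function name are ours:

```python
# Back-of-envelope: every follow-up API turn re-sends the leftover parent
# context as input tokens, so residue compounds linearly with turn count.
# Figures are the T2 parent-input numbers; the model is a simplification.
SKILL_PARENT = 90_298     # tokens the skill pattern leaves in the parent context
SUBAGENT_PARENT = 137     # tokens the sub-agent pattern leaves behind

def followup_cost(parent_tokens: int, turns: int) -> int:
    """Input tokens spent purely re-reading leftover context across follow-ups."""
    return parent_tokens * turns

print(followup_cost(SKILL_PARENT, 3))     # 270894
print(followup_cost(SUBAGENT_PARENT, 3))  # 411
```

Three follow-up turns after the skill-pattern T2 would spend roughly 270K input tokens just re-reading stale context; after the sub-agent version, a few hundred.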

#### T3: Parallel Analysis (3 independent analyses)

| Metric | Skill | Sub-Agent | Parallel Sub-Agent |
| --- | --- | --- | --- |
| Parent input tokens | 84,719 | 161 | 1,274 |
| Child input tokens | 0 | 98,633 | 133,260 |
| Total tokens | 86,533 | 100,872 | 139,865 |
| Total time | 51.4s | 56.5s | 79.0s |

Winner: Skill on total cost. Parallel sub-agents are the most expensive: each agent builds its own context independently. However, parallel sub-agents keep the parent input context to only 1,274 tokens, versus 84,719 for the skill pattern.

#### T4: Composed Workflow (research → analyze → recommend)

| Metric | Skill | Sub-Agent |
| --- | --- | --- |
| Parent input tokens | 96,711 | 946 |
| Child input tokens | 0 | 71,193 |
| Total tokens | 99,249 | 74,687 |
| Total time | 61.8s | 67.8s |

Winner: Sub-Agent on total tokens (-25%). The child agent produced a concise summary, avoiding the conversation history accumulation that inflated the skill pattern. Parent context stayed under 1,400 tokens.

### Summary

| Task | Skill Tokens | Sub-Agent Tokens | Delta |
| --- | --- | --- | --- |
| Single Lookup | 6,053 | 6,884 | +14% |
| Multi-File Analysis | 92,008 | 92,583 | ~0% |
| Parallel Analysis | 86,533 | 100,872 | +17% |
| Composed Workflow | 99,249 | 74,687 | -25% |
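As a sanity check, the Delta column can be reproduced from the raw totals in the table. The figures below come straight from the table; the rounding to whole percentages is ours ("~0%" is +0.6% exactly):

```python
# Reproduce the Delta column of the summary table from the raw token totals.
# (skill_total, subagent_total) pairs are copied from the table above.
totals = {
    "Single Lookup": (6_053, 6_884),
    "Multi-File Analysis": (92_008, 92_583),
    "Parallel Analysis": (86_533, 100_872),
    "Composed Workflow": (99_249, 74_687),
}

deltas = {
    task: round((subagent - skill) / skill * 100)
    for task, (skill, subagent) in totals.items()
}
# deltas == {"Single Lookup": 14, "Multi-File Analysis": 1,
#            "Parallel Analysis": 17, "Composed Workflow": -25}
```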

## Analysis

1. Skills win for simple, focused tasks. No spawn overhead, no context handoff. For single lookups and straightforward analyses, skills are the right choice.

2. Sub-agents win for composed workflows. When a task involves multiple steps that build on each other, the sub-agent's summary acts as a natural compression point — the child does the heavy reading, and only the insight flows back to the parent. This saved 25% on tokens for T4.

3. Context isolation is the real value. The headline number isn't total tokens — it's parent context tokens. After T4, the skill pattern left 99K tokens in the parent context. The sub-agent left 1.4K. In a real session where you have follow-up tasks, that 97.6K difference compounds.

4. Parallel sub-agents are expensive but clean. Running 3 independent sub-agents used 62% more tokens than a single skill — each agent needs its own tool definitions and conversation setup. The parallelism advantage only materializes when tasks are truly independent AND the time savings (from concurrent API calls) outweigh the token cost.

5. The crossover point is task complexity. Simple tasks → skill. Complex multi-step tasks → sub-agent. The break-even is around T2 complexity (~20 tool calls), where both patterns cost roughly the same but sub-agents preserve parent context.

## When to Use Which

| Use Skill When... | Use Sub-Agent When... |
| --- | --- |
| Task needs < 5 tool calls | Task needs > 10 tool calls |
| You need the result in-context for follow-up | You need context isolation |
| Task is a single focused operation | Task has multiple composed steps |
| Latency matters (no spawn overhead) | You want concurrent execution |
| Parent context isn't near capacity | Parent context is getting full |
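These rules of thumb collapse into a small decision helper. The function and its priority order are our encoding of the heuristics above (with the 5/10 tool-call thresholds from the table), not part of the benchmark code:

```python
# Hypothetical decision helper encoding the README's rules of thumb.
# Thresholds (5 / 10 tool calls) come from the table above.

def choose_pattern(expected_tool_calls: int,
                   needs_result_in_context: bool,
                   parent_context_near_capacity: bool) -> str:
    """Return 'skill' or 'sub-agent' per the heuristics above."""
    if parent_context_near_capacity:
        return "sub-agent"   # isolation protects a crowded parent context
    if needs_result_in_context:
        return "skill"       # the result must stay in the parent for follow-up
    if expected_tool_calls <= 5:
        return "skill"       # simple task: spawn + synthesis overhead not justified
    if expected_tool_calls > 10:
        return "sub-agent"   # multi-step work: the summary compresses the transcript
    return "skill"           # gray zone around T2 complexity: roughly a cost tie
```

For example, a 3-tool-call lookup maps to `"skill"`, while a 20-call composed workflow maps to `"sub-agent"`.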

## Cross-Platform Comparison

See `docs/analysis/cross-platform-comparison.md` for a detailed analysis of how Claude Code, Codex CLI, and OpenClaw handle these patterns differently.

## Run It Yourself

### Prerequisites

- uv
- An Anthropic API key (set as `ANTHROPIC_API_KEY` in `.env`)

### Setup

```shell
cd projects/personal/skills-vs-subagents

cp .env.example .env
# Add ANTHROPIC_API_KEY to .env

uv sync --group dev
```

### Run

```shell
# Full benchmark (all 4 tasks, 2 runs each), ~10-15 min
uv run python -m src --runs 2

# Quick single task
uv run python -m src --runs 1 --tasks t1

# Specific tasks
uv run python -m src --runs 2 --tasks t1,t4
```

Results are saved to `results/` (markdown + JSON) and charts to `charts/` (PNG).

## Project Structure

```
skills-vs-subagents/
├── src/
│   ├── benchmark.py             # Main entry point + CLI
│   ├── tools.py                 # File operation tools (shared by both patterns)
│   ├── harness/
│   │   ├── runner.py            # Skill, sub-agent, and parallel runners
│   │   └── reporter.py          # Markdown results formatting
│   ├── tasks/
│   │   └── definitions.py       # 4 benchmark task definitions
│   └── charts/
│       └── token_chart.py       # Stacked bar chart generation
├── tests/                       # Full test suite
├── docs/
│   ├── plans/                   # Design doc + implementation plan
│   └── analysis/                # Cross-platform comparison
├── results/                     # Benchmark output (markdown + JSON)
└── charts/                      # Generated charts (PNG)
```

## License

MIT
