Hermes Agent Self-Evolution — Evolutionary Self-Improvement for Hermes Agent

Vision

A standalone optimization pipeline that systematically improves Hermes Agent's performance by evolving skills, prompts, tool descriptions, and agent configurations using automated optimization loops. Lives in its own repo (NousResearch/hermes-agent-self-evolution), operates ON hermes-agent — not part of it.

Three complementary engines, unified under one workflow:

Engine	What It Optimizes	License	Integration
DSPy + GEPA	Skills, prompts, instructions, tool descriptions	MIT	Native Python, primary engine
Darwinian Evolver	Code files, algorithms, tool implementations	AGPL v3	External CLI only
DSPy MIPROv2	Few-shot examples, instruction text	MIT	Native Python, fallback optimizer

GEPA is the star — it's integrated into DSPy, reads execution traces to understand WHY things fail (not just that they fail), and works with as few as 3 examples. It outperforms both RL and previous DSPy optimizers.

Important: No GPU training required. Everything in this plan operates via API calls only. DSPy+GEPA and MIPROv2 optimize the text of prompts, instructions, and few-shot examples — they mutate and evaluate strings, not model weights. The Darwinian Evolver evolves code files (also text). The only DSPy component that trains weights (BootstrapFinetune) is explicitly excluded from this plan. All evaluation runs through batch_runner making standard LLM API calls.

What Can Be Improved

Tier 1: Skill Files (Highest Value, Lowest Risk)

What: SKILL.md files — procedural instructions the agent follows
How: Wrap skill text as a DSPy module, evaluate on test tasks via batch_runner, evolve with GEPA
Why it works: Skills are pure text, easily mutated, and directly measurable (did the agent complete the task correctly when following this skill?)
Example: Evolve the github-code-review skill to produce better reviews by testing against a dataset of known-good code reviews

Tier 2: Tool Descriptions (Medium Value, Low Risk)

What: The description field in tool schemas (what the agent sees when deciding which tool to use)
How: GEPA evolves descriptions, evaluates whether the agent picks the right tool for given tasks
Why it works: Tool selection is a classification problem — perfect for DSPy optimization
Example: Evolve the search_files description so the agent picks it over terminal(grep) more reliably

Tier 3: System Prompt Components (High Value, Higher Risk)

What: Sections of the system prompt (persona, policies, formatting instructions)
How: Parameterize prompt_builder.py sections as DSPy Signatures, optimize with GEPA
Why it works: System prompt quality directly determines agent behavior quality
Risk: Must be careful not to break prompt caching — only optimize offline, deploy as new versions
Example: Evolve the "tool usage guidelines" section to reduce unnecessary tool calls

Tier 4: Code Evolution (High Value, Highest Risk)

What: Tool implementation code, helper functions
How: Darwinian Evolver with GitBasedOrganism, test via pytest + batch_runner
Why it works: Some tool implementations have subtle bugs or inefficiencies that evolutionary search can find
Risk: Code changes can break things — requires strong test suites as guardrails
Example: Evolve file_tools.py patch matching to handle more edge cases

Architecture

The Optimization Loop

┌─────────────────────────────────────────────┐
│  1. SELECT TARGET                           │
│     - Pick a skill, prompt section, or tool  │
│     - Load current version as baseline       │
│                                             │
│  2. BUILD EVALUATION DATASET                │
│     - Mine session_db for real usage examples │
│     - Or use hand-crafted test cases         │
│     - Split: train / validation / test       │
│                                             │
│  3. WRAP AS DSPy MODULE                     │
│     - Skill text → dspy.Signature            │
│     - Agent workflow → dspy.ReAct             │
│     - Tool selection → dspy.Predict           │
│                                             │
│  4. RUN OPTIMIZER                           │
│     - Primary: dspy.GEPA (reflective evolution)│
│     - Fallback: dspy.MIPROv2 (bayesian opt)  │
│     - Code: Darwinian Evolver (external CLI)  │
│                                             │
│  5. EVALUATE & COMPARE                      │
│     - Run optimized version on held-out test  │
│     - Compare: accuracy, cost, latency        │
│     - Statistical significance check          │
│                                             │
│  6. DEPLOY (with approval)                  │
│     - Git commit the improved version         │
│     - A/B test in production (optional)       │
│     - Rollback mechanism via git revert       │
└─────────────────────────────────────────────┘

Integration Points with Existing Hermes Infrastructure

Hermes Component	Role in Self-Improvement
`batch_runner.py`	Evaluation harness — run agent on test tasks in parallel
`agent/trajectory.py`	Collect execution traces for GEPA's reflective analysis
`hermes_state.py` (SessionDB)	Mine real usage data for evaluation datasets
`skills/` directory	The primary optimization targets
`tools/registry.py`	Tool descriptions to optimize
`agent/prompt_builder.py`	System prompt components to optimize
`tests/`	Guardrails — evolved code must pass all tests
Git history	Track all evolution lineage, enable rollback

Data Flow

SessionDB (real conversations)
    │
    ▼
Evaluation Dataset Builder
    │
    ├──► DSPy Module Wrapper (wraps skill/prompt/tool as optimizable module)
    │        │
    │        ▼
    │    GEPA Optimizer ◄── Execution Traces (from batch_runner)
    │        │                    ▲
    │        │                    │
    │        ▼                    │
    │    Candidate Variants ──► batch_runner (parallel evaluation)
    │        │
    │        ├──► Constraint Validation (tests, char limits, caching compat)
    │        │
    │        ▼
    │    Best Valid Variant
    │        │
    ▼        ▼
Git Branch + PR (with diff, metrics, before/after comparison)
    │
    ▼
Human Review & Merge

Implementation Structure

Where It Lives

Hermes Agent Self-Evolution lives in its own repo (NousResearch/hermes-agent-self-evolution), separate from hermes-agent. It pip-installs or clones hermes-agent to access its infrastructure, and outputs PRs against the hermes-agent repo.

hermes-agent-self-evolution/             # Standalone repo
├── PLAN.md                             # This file
├── README.md                           # Setup, usage, examples
├── pyproject.toml                      # Package config + dependencies (dspy, gepa)
│
├── evolution/                          # Main package
│   ├── core/                           # Shared infrastructure
│   │   ├── __init__.py
│   │   ├── dataset_builder.py          # Eval dataset generation (synthetic, SessionDB mining)
│   │   ├── fitness.py                  # Fitness functions (LLM-as-judge, rubrics, length penalties)
│   │   ├── constraints.py              # Constraint validators (char limits, caching compat, test suite)
│   │   ├── benchmark_gate.py           # Benchmark gating (run TBLite/YC-Bench, check regression)
│   │   └── pr_builder.py              # Auto-generate PR with metrics, diffs, comparison
│   │
│   ├── skills/                         # Phase 1: Skill evolution
│   │   ├── __init__.py
│   │   ├── evolve_skill.py            # Main entry: python -m evolution.skills.evolve_skill --skill <name>
│   │   └── skill_module.py            # Wraps SKILL.md as DSPy module
│   │
│   ├── tools/                          # Phase 2: Tool description evolution
│   ├── prompts/                        # Phase 3: System prompt evolution
│   ├── code/                           # Phase 4: Code evolution (Darwinian Evolver)
│   └── monitor/                        # Phase 5: Continuous loop
│
├── datasets/                           # Generated eval datasets (gitignored, local)
│   ├── skills/
│   └── tools/
│
└── tests/                              # Test suite

How It's Invoked

# Clone and install
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution
pip install -e ".[dev]"

# Point at hermes-agent repo (auto-detected from ~/.hermes/hermes-agent or env var)
export HERMES_AGENT_REPO=~/.hermes/hermes-agent

# Phase 1: Evolve a skill
python -m evolution.skills.evolve_skill \
    --skill github-code-review \
    --iterations 10 \
    --eval-source synthetic         # or: sessiondb, golden, auto

# Phase 2: Evolve tool descriptions
python -m evolution.tools.evolve_tool_descriptions \
    --iterations 5 \
    --benchmark-gate tblite-fast

# Phase 3: Evolve a system prompt section
python -m evolution.prompts.evolve_prompt_section \
    --section MEMORY_GUIDANCE \
    --iterations 5

# Phase 4: Evolve tool code (uses Darwinian Evolver CLI)
python -m evolution.code.evolve_tool_code \
    --tool file_tools \
    --bug-issue 742 \
    --iterations 10

# All commands output a PR branch + summary against hermes-agent. Human merges.

Relationship to hermes-agent

hermes-agent-self-evolution operates ON hermes-agent, not inside it. Zero changes to the agent repo are needed. It reads from the hermes-agent codebase and writes evolved versions to git branches, creating PRs for human review.

hermes-agent Component	How Self-Evolution Uses It
`batch_runner.py`	Run agent on eval tasks in parallel
`environments/benchmarks/tblite/`	Benchmark gating
`environments/benchmarks/yc_bench/`	Coherence checks
`hermes_state.py` (SessionDB)	Mine real usage for eval data
`agent/prompt_builder.py`	Read current prompt sections (read-only)
`tools/registry.py`	Read current tool descriptions (read-only)
`skills/` directory	Read current skills, write evolved versions to branch

Execution Plan

How Phases Work

Phases are sequential — each one builds on infrastructure from the previous one and must prove itself before we move on. The flow is:

Phase 1 ──► Validation Gate ──► Phase 2 ──► Validation Gate ──► Phase 3 ──► ...
  Build       "Did it actually       Build       "Did it work         Build
  & test       make things            & test       without breaking     & test
               better?"                            anything?"

Between every phase:

Run full benchmark suite (TBLite + YC-Bench fast_test) to establish a new baseline
Review all evolved artifacts — are the changes sensible to a human?
Merge the proven improvements via PR
Retrospective: what worked, what didn't, adjust approach for next phase

Each phase has three stages:

Build (~1-2 weeks): Write the optimization infrastructure for that tier
Run (~1 week): Execute optimization on real targets, iterate on eval datasets
Validate (~1 week): Benchmark, review, merge. Decide if results justify moving to next phase

If a phase doesn't produce meaningful improvements (evolved variants aren't better than baseline), we stop and reassess before moving on. No point optimizing tool descriptions if we can't even improve skills.

Timeline Overview

Phase	What	Duration	Depends On	Gate to Next
Phase 1	Skill evolution	3-4 weeks	Nothing — starts here	≥1 skill measurably improved, no benchmark regression
Phase 2	Tool descriptions	2-3 weeks	Phase 1 infra (GEPA runner, eval framework)	Tool selection accuracy improved, no benchmark regression
Phase 3	System prompt	2-3 weeks	Phase 1-2 infra + validated benchmark gating	Behavioral tests pass, benchmarks hold or improve
Phase 4	Code evolution	3-4 weeks	Phases 1-3 + strong eval pipeline	Bugs fixed, tests pass, benchmarks hold
Phase 5	Continuous loop	2 weeks	All above working	Automated pipeline runs unattended

Total: ~13-17 weeks if all phases prove valuable. But we may stop at Phase 1 or 2 if the returns diminish — no obligation to do all five.

Detailed Phase Breakdown

Phase 1: Skill Evolution via DSPy+GEPA (Core Capability)

Goal: The agent can optimize any SKILL.md file by running it through GEPA.

Week 1-2 (Build):

Install DSPy + GEPA, verify they work in Hermes' .venv
Build the skill-as-DSPy-module wrapper (takes a SKILL.md → DSPy module)
Build the eval dataset generator (strong model reads skill → generates test cases)
Build the GEPA optimization runner (wraps dspy.GEPA with Hermes config)
Unit tests for all components

Week 2-3 (Run):

Pick 2-3 target skills: github-code-review, systematic-debugging, arxiv
Generate eval datasets for each (15-30 examples per skill)
Run GEPA optimization (5-10 iterations per skill)
Compare baseline vs evolved on holdout set
Iterate on eval dataset quality if results are noisy

Week 3-4 (Validate):

Run TBLite + YC-Bench fast_test with evolved skills vs baseline
Human review of all evolved skill diffs — do the changes make sense?
Create PRs for improvements that pass all gates
Document what worked and what didn't

Done when:

≥1 skill shows measurable improvement on its eval dataset (≥10% score increase)
No benchmark regression (TBLite score holds within 2%)
The evolved skill diff reads sensibly to a human reviewer
The optimization pipeline is reusable (can point it at any skill and run)

What to build:

Skill-as-DSPy-Module wrapper — Takes a SKILL.md, creates a DSPy module that:
- Injects the skill text as the system prompt
- Runs the agent on a test task
- Returns the result for scoring
Evaluation dataset builder — Creates train/val/holdout splits from multiple sources:

Source A: Synthetic generation (primary, bootstrapping) Use a strong model (e.g., Claude Opus) to generate test cases for a skill:
- Read the skill file → understand what it does
- Generate 15-30 realistic (task_input, expected_behavior) pairs
- Expected_behavior is a rubric, not exact text — e.g., "should identify the SQL injection on line 42" not "output this exact string"
- Split: 10 train / 5 val / 5-10 holdout
- GEPA works with as few as 3 examples, so this is sufficient to start
Source B: SessionDB mining (real usage, LLM-as-judge scored)
- Query SessionDB for sessions where the skill was loaded (search for skill name in messages)
- Extract the task the user gave and the agent's full response
- Use LLM-as-judge to score each (task, response) pair on a rubric
- High-scoring pairs become "good" examples; low-scoring pairs become failure cases for GEPA's reflective analysis
- This improves over time as more real usage accumulates
Source C: Hand-curated golden sets (optional, high-value skills)
- Manually written test cases with expected outputs
- Stored as JSONL in ~/.hermes/evolution/datasets/<skill-name>/golden.jsonl
- Highest quality signal but requires manual effort — reserve for critical skills
Source D: Skill-specific auto-evaluation (where applicable)
- systematic-debugging: Plant a bug, run the skill, check if tests pass after
- arxiv: Search for known papers, check if they're found
- github-code-review: Create a PR with planted issues, check if they're caught
- Not all skills have natural auto-eval — this is a bonus, not a requirement
Scoring: LLM-as-judge with rubrics For most skills, there's no binary right/wrong — quality is subjective. The fitness function uses an LLM judge that scores on a rubric:
- Did the agent follow the skill's procedure? (0-1)
- Was the output correct/useful? (0-1)
- Was it concise (within token budget)? (0-1)
- Rubrics are skill-specific and stored alongside the eval dataset
GEPA optimization runner — Wraps dspy.GEPA with Hermes-specific config:
- Uses batch_runner for parallel evaluation
- Captures execution traces (trajectories) for GEPA's reflective analysis
- Saves snapshots for pause/resume
Comparison & deployment — Side-by-side evaluation:
- Runs baseline vs optimized on held-out test set
- Shows diff of what changed
- Commits improved version with evolution metadata

CLI interface:

# Evolve a skill with auto-generated eval data from session history
hermes evolve skill github-code-review --iterations 10

# Evolve with a custom evaluation dataset
hermes evolve skill arxiv --dataset eval_tasks.jsonl --iterations 5

# Compare baseline vs evolved
hermes evolve compare github-code-review --version latest

# Deploy evolved version
hermes evolve deploy github-code-review --version 3

Or as agent tool calls:

The agent can self-invoke optimization:
"I notice this skill could be improved. Let me run GEPA optimization on it."
→ Uses execute_code to run DSPy+GEPA
→ Evaluates results
→ Proposes the improved skill for human approval

Phase 2: Tool Description Optimization

Goal: Optimize the natural language descriptions in tool schemas so the agent picks the right tools more reliably and uses them correctly.

Prerequisite: Phase 1 gate passed — GEPA optimization loop proven to work on skills.

Week 1 (Build): Adapt Phase 1's GEPA runner for tool descriptions. Build tool selection evaluator and synthetic dataset generator. The hard part is cross-tool evaluation — ensuring one tool's improvement doesn't steal from another.

Week 2 (Run): Generate tool selection dataset (~200-400 triples). Run GEPA on all tool descriptions simultaneously. Mine SessionDB for misselection patterns.

Week 3 (Validate): Benchmark gate. Human review of evolved descriptions — do they still accurately describe the tools? PR.

Done when:

Tool selection accuracy improves on holdout set (≥5% improvement)
No individual tool's selection rate regresses
Benchmarks hold (TBLite within 2%)
Evolved descriptions are factually accurate and ≤500 chars

What gets evolved: Tool descriptions are hardcoded string constants in tools/*.py files, registered via registry.register(). Each tool has:

A top-level description field (what the tool does, when to use it, behavioral guidance)
Per-parameter description fields (what each parameter means, valid values)
Some tools have separate description constants (e.g., TERMINAL_TOOL_DESCRIPTION)

These descriptions are sent with every API call as part of the tool schema — every extra character multiplies across the entire conversation.

What to build:

Tool selection evaluator — Given a task description, does the agent pick the right tool?
- Build a dataset of (task_description, correct_tool, correct_params) triples
- Example: "find all Python files containing 'import os'" → search_files (not terminal(grep))
- Example: "read lines 50-100 of config.py" → read_file (not terminal(cat))
- Score: tool_selection_accuracy + parameter_correctness
Description optimizer — GEPA evolves description text to improve selection accuracy
- Wrap each tool description as a DSPy Signature parameter
- Mutate descriptions, evaluate on tool selection dataset
- GEPA reads traces of WRONG tool selections to understand why the agent was confused
Cross-tool evaluation — Ensure improving one description doesn't hurt others
- Always evaluate ALL tool descriptions together (not in isolation)
- Fitness function penalizes regressions on any tool's selection rate
- This prevents a search_files description from "stealing" selections from read_file

Evaluation data sources:

Source A: Synthetic tool selection dataset Generate (task, correct_tool, correct_params) triples using a strong model:

For each tool, generate 10-20 tasks where that tool is clearly the right choice
Include 10-20 "confusing" tasks where two tools could work but one is better
Include 10 tasks where the agent should use NO tool (just respond directly)
Total: ~200-400 triples, split 60/20/20 train/val/holdout

Source B: SessionDB mining — tool selection patterns

Find conversations where the agent used a tool
Identify cases where the agent used terminal(grep) when search_files was better (or similar mismatches)
LLM-as-judge scores whether the tool choice was optimal
Misselections become high-value training examples

Source C: Benchmark-derived tool selection

Run TBLite with baseline descriptions, log every tool call
Identify tasks where wrong tool selection caused failures
These become hard examples in the eval dataset

Constraints specific to tool descriptions:

Max 500 chars per tool description (sent every API call)
Max 200 chars per parameter description
Must remain factually accurate (can't claim a tool does something it doesn't)
Schema structure (parameter names, types, required fields) is FROZEN — only text evolves

Phase 3: System Prompt Evolution

Goal: Optimize the sections of the system prompt that guide agent behavior.

Prerequisite: Phase 2 gate passed — benchmark gating validated, GEPA producing sensible text mutations.

Week 1 (Build): Build section-as-DSPy-parameter wrapper for the 5 evolvable prompt sections. Build behavioral test suite generator. This is the riskiest tier so far — system prompt changes affect everything.

Week 2 (Run): Generate behavioral test scenarios (~60-80 total across all sections). Run GEPA on each section independently first, then jointly. Run benchmarks after each optimization round.

Week 2-3 (Validate): Full benchmark suite (TBLite + YC-Bench). Extra scrutiny here — system prompt changes have the widest blast radius. Multiple human reviewers if possible.

Done when:

Behavioral test scores improve (≥10% on targeted sections)
Benchmarks hold or improve (zero tolerance for regression here)
The agent's personality/tone hasn't drifted noticeably
Prompt stays within caching boundaries

What gets evolved: The system prompt is assembled in run_agent.py / agent/prompt_builder.py from 8 distinct sections:

Section	Location	What It Does	Evolvable?
`DEFAULT_AGENT_IDENTITY`	prompt_builder.py	Core persona, behavioral traits	✅ Yes — tone, priorities, approach
`MEMORY_GUIDANCE`	prompt_builder.py	How to use persistent memory	✅ Yes — when to save, what to save
`SESSION_SEARCH_GUIDANCE`	prompt_builder.py	When to search past sessions	✅ Yes — trigger conditions
`SKILLS_GUIDANCE`	prompt_builder.py	When to save/load skills	✅ Yes — trigger conditions
`PLATFORM_HINTS`	prompt_builder.py	Per-platform formatting guidance	✅ Yes — per platform
Memory block	memory_store.py	User's actual memories	❌ No — user data
Skills index	prompt_builder.py	Auto-generated skill list	❌ No — auto-generated
Context files	prompt_builder.py	AGENTS.md, .cursorrules	❌ No — project-specific

What to build:

Section-as-DSPy-parameter wrapper — Each evolvable section becomes a DSPy Signature field
- The optimizer can mutate each section independently
- Sections are evaluated together (the full system prompt matters, not individual sections)
Behavioral evaluator — Does the agent behave correctly with this system prompt?
- Measure: tool usage patterns, response quality, memory usage, skill loading
- Use batch_runner to run the agent on diverse tasks with the evolved prompt
Benchmark-gated validation — Evolved prompts must not regress on benchmarks
- Run TBLite (fast, ~1-2 hours) as a regression check
- If TBLite score drops, reject the variant regardless of other metrics

Evaluation data sources:

Source A: Behavioral test suite (synthetic) Generate scenarios that test specific prompt sections:

Memory guidance: "Does the agent save important user preferences?" (10 scenarios)
Session search: "Does the agent search history when the user says 'like last time'?" (10 scenarios)
Skills guidance: "Does the agent load relevant skills before starting?" (10 scenarios)
Identity: "Is the response helpful, direct, and not overly verbose?" (20 scenarios)
Platform hints: "Does CLI output avoid markdown? Does Telegram use formatting?" (10 per platform)

Source B: Benchmark scores as fitness signal

TBLite (100 tasks, ~1-2 hours, binary pass/fail) — primary regression check
YC-Bench fast_test preset (~50 turns, composite score) — tests long-term coherence
These don't measure specific prompt sections, but they catch broad regressions
A prompt variant that scores higher on behavioral tests but lower on TBLite is rejected

Source C: SessionDB — behavioral pattern mining

Find sessions where the agent failed to search memory when it should have
Find sessions where the agent was too verbose or used wrong formatting
These become targeted test cases for the relevant prompt section

Constraints specific to system prompt sections:

Each section must not exceed its current size by >20% (prevents prompt bloat)
Total system prompt must stay under the model's prompt caching boundary
Identity section must retain core traits (helpful, direct, admits uncertainty)
Platform hints must remain platform-accurate (don't tell Telegram to use ANSI codes)

Phase 4: Code Evolution via Darwinian Evolver

Goal: Evolve tool implementation code for better performance and fewer bugs.

Prerequisite: Phases 1-3 complete — strong evaluation pipeline, validated benchmark gating, confidence in the optimization loop.

Week 1-2 (Build): Set up Darwinian Evolver as external CLI. Build code-as-organism wrapper mapping tool files to GitBasedOrganism. Build composite fitness function (pytest + benchmarks + bug reproduction). This phase uses a different engine (Darwinian Evolver instead of DSPy+GEPA) so there's new infrastructure to build.

Week 2-3 (Run): Start with known bugs from GitHub issues — create reproduction scripts, run evolution to find fixes. Then try edge case hardening on 1-2 tools (e.g., file_tools.py, search_files).

Week 3-4 (Validate): Full test suite + full benchmark suite (including TerminalBench2 for thorough validation). Strictest human review — every line of evolved code reviewed before merge.

Done when:

≥1 known bug fixed by evolution (validated by reproduction script)
Full test suite passes (2550+ tests)
Full benchmark suite holds (TBLite + TerminalBench2 + YC-Bench)
No function signatures or registry calls changed
Human reviewer approves all code changes

What gets evolved: Actual Python source code in tools/*.py files. This is the highest-risk tier — code changes can break everything.

What to build:

Code-as-organism wrapper — Maps tool source files to Darwinian Evolver's GitBasedOrganism
- Each tool file is a separate organism
- Mutations are proposed by the LLM based on specific failure cases
- All mutations are committed to a git branch for traceability
Test-driven fitness function — Composite score from multiple signals:
- pytest results (hard gate — must pass 100%)
- Benchmark scores (TBLite pass rate)
- Specific failure case resolution (did the mutation fix the bug it targeted?)
- Code quality heuristics (no regressions in error handling, no removed safety checks)
Safety guardrails — Strictest of all tiers:
- Full test suite must pass
- No changes to function signatures (would break callers)
- No changes to registry.register() calls (would break tool discovery)
- No removal of error handling or safety checks
- Human review required on every PR

Evaluation data sources:

Source A: pytest suite (2550+ tests, primary gate)

Every code mutation must pass the full test suite
Test failures = immediate rejection, no exceptions
This is the hard floor — it prevents regressions

Source B: Benchmark scores (broad capability check)

Run TBLite with evolved code to verify tool implementations still work end-to-end
TerminalBench2 for thorough validation (89 tasks, but expensive — use selectively)
YC-Bench for long-horizon coherence (does the agent still handle 200-turn sessions?)

Source C: Known bug reproduction datasets

Collect GitHub issues that report tool bugs
Create reproduction scripts that trigger the bug
Fitness: does the evolved code fix the bug while passing all tests?
This is the most targeted and efficient use of code evolution

Source D: Edge case generation

Use a strong model to generate adversarial inputs for each tool
Example: read_file with symlinks, binary files, huge files, missing files, permission errors
Example: search_files with regex edge cases, unicode, very large repos
Score: does the tool handle all edge cases gracefully?

Constraints specific to code evolution:

Full test suite (2550+ tests) must pass — zero tolerance
Function signatures frozen (no breaking API changes)
registry.register() calls frozen (no tool discovery changes)
Error handling coverage must not decrease
Darwinian Evolver runs as external CLI only (AGPL v3)
All PRs require human review — no auto-merge for code changes

Phase 5: Continuous Self-Improvement Loop

Goal: The agent automatically identifies its weakest areas and improves them over time.

Prerequisite: Phases 1-4 proven — manual optimization works reliably for skills, tools, prompts, and code. Now we automate it.

Week 1 (Build): Build performance monitor (tracks skill success rates, tool selection accuracy, benchmark scores over time). Build auto-triage logic (ranks optimization targets by impact × frequency). Wire up to Hermes cron scheduler.

Week 2 (Deploy & Monitor): Set up weekly benchmark runs via cron. Set up threshold-triggered optimization (when a skill's failure rate exceeds X%, auto-trigger GEPA). All automated PRs still require human merge.

Done when:

Weekly benchmark runs execute unattended and report scores
Auto-triage correctly identifies underperforming skills
At least one optimization cycle runs end-to-end (detect problem → optimize → PR) without manual intervention
Human still reviews and merges every PR — this phase automates detection and optimization, not deployment

What to build:

Performance monitor — Track metrics from real usage:
- Per-skill success rates (from SessionDB — was the skill loaded? did the task succeed?)
- Tool selection accuracy (from trajectories — did the agent pick the right tools?)
- Benchmark scores over time (periodic TBLite + YC-Bench runs)
- User corrections (when the user says "no, use X instead" — that's a signal)
Auto-triage — Identify what to optimize next:
- Skills with declining success rates or high failure rates
- Tools that are frequently misselected
- Benchmark categories with low pass rates
- Rank by (potential improvement × usage frequency)
Scheduled optimization — Cron job pipeline:
- Weekly: Run TBLite + YC-Bench fast_test, log scores
- When scores drop or skill failure rate exceeds threshold: trigger GEPA optimization
- Generate PR with evolved improvements
- Notify for human review
Feedback loop — Real usage improves evaluation datasets:
- User corrections are logged and added to eval datasets
- High-quality sessions become positive examples
- Failed sessions become failure cases for GEPA's reflective analysis
- Evaluation datasets grow organically over time

Benchmarks as Fitness Signals

The three existing benchmarks serve different roles in the optimization pipeline:

Benchmark	What It Tests	Speed	Cost	Role in Self-Improvement
TBLite	Coding/sysadmin (100 tasks, calibrated difficulty)	~1-2 hours	~$20-50	Primary regression gate — fast enough to run on every candidate
TerminalBench2	Coding/sysadmin (89 harder tasks, Docker sandboxes)	~2-4 hours	~$50-200	Thorough validation — run on final candidates before PR
YC-Bench	Long-horizon strategic coherence (100-500 turns)	~3-6 hours	~$50-200	Coherence check — ensures evolved prompts don't break multi-turn behavior

How benchmarks fit into the optimization loop:

Candidate Variant
    │
    ├──► pytest (must pass 100%) ────────── GATE 1: functional correctness
    │
    ├──► TBLite fast subset (20 tasks) ──── GATE 2: quick capability check (~20 min)
    │
    ├──► Task-specific eval dataset ──────── FITNESS: skill/tool/prompt quality score
    │
    ▼
Top Candidates Only (top 3)
    │
    ├──► Full TBLite (100 tasks) ─────────── GATE 3: thorough regression check
    │
    ├──► YC-Bench fast_test ──────────────── GATE 4: coherence check
    │
    ▼
Best Candidate → PR with full metrics

Key principle: Benchmarks are GATES, not fitness functions. The fitness function is task-specific (did the skill/tool/prompt do its job better?). Benchmarks ensure the improvement didn't break something else. A variant that improves skill quality by 20% but drops TBLite by 5% is REJECTED.

Constraints & Guardrails

Every candidate variant must pass ALL of these before it can be considered valid. Variants that fail any constraint are discarded — GEPA/MIPROv2 never see them as successful.

1. Full Test Suite

python -m pytest tests/ -q  # Must pass 100% — zero tolerance

Every evolved variant (skill text, tool description, code) triggers the full test suite. If any test fails, the variant is rejected. This is the hard floor — nothing ships that breaks existing functionality.

2. Character/Token Limits

Evolved text must stay within strict size budgets:

Target	Max Size	Why
Skill files (SKILL.md)	Configurable per skill, default 15KB	Skills are injected as user messages — bloated skills waste context window
Tool descriptions	500 chars	Tool schemas are sent every turn — every extra char multiplies across the entire conversation
System prompt sections	Must not exceed current section size by >20%	Prevents prompt bloat that degrades model attention and increases cost

The optimizer's fitness function applies a length penalty — variants that approach the limit get scored lower even if they're otherwise better. This prevents evolutionary drift toward verbose solutions.

3. Prompt Caching Compatibility

Hermes relies on prompt caching to keep costs manageable. Evolved content must not break this:

Skills: Injected as user messages at conversation start. Evolved skills are deployed as new versions — they take effect on NEW sessions only, never mid-conversation.
Tool descriptions: Part of the tool schema sent with every API call. Changes take effect on next session start. Schema structure (parameter names, types) must NOT change — only the description text.
System prompt sections: Rebuilt once at session start. Evolved sections deploy as config updates, applied on next session. No mid-session prompt rebuilds.

Rule: No evolved content is ever hot-swapped into an active conversation. All changes take effect on the next fresh session.

4. Semantic Preservation

The optimizer must preserve the core behavior/intent of what it's evolving:

A skill for "GitHub code review" must still perform code reviews, not drift into something else
Tool descriptions must still accurately describe what the tool does
System prompt sections must maintain their functional role

This is enforced by including semantic similarity checks in the fitness function — the evolved text is compared against the original to ensure it hasn't drifted too far in meaning, only improved in effectiveness.

5. Deployment via PR (Never Direct Commit)

All evolved changes go through a pull request:

git checkout -b evolve/<target>-<timestamp>
# Apply evolved changes
git add <files>
git commit -m "evolve: <target> — score improved X% → Y%

Optimizer: GEPA (N iterations, M candidates evaluated)
Eval dataset: <dataset name> (K examples)
Before: <baseline score>
After: <evolved score>
Holdout: <holdout score>"
git push -u origin evolve/<target>-<timestamp>
gh pr create --title "evolve: <target>" --body "<metrics, diff, comparison>"

The PR body includes:

Before/after scores on train, validation, AND holdout sets
The full diff of what changed
Cost of the optimization run
Any constraint violations that were caught and rejected during evolution

Practical Considerations

Cost

GEPA optimization: ~$2-10 per run (depending on eval dataset size)
Darwinian Evolver: ~$2-9 per task
Batch evaluation: depends on number of test cases and model cost
Recommendation: Start with small eval sets (10-20 examples), scale up for important skills

Safety

See the Constraints & Guardrails section above for the full enforcement list. Summary:

Human approval required — all changes deploy via PR, never direct commit
Full test suite gate — zero tolerance, every variant must pass 100%
Character/token budgets — prevents evolutionary bloat
Caching compatibility — no mid-conversation changes, ever
Semantic preservation — evolved text must not drift from its original purpose
Git-tracked lineage — every evolution step is a commit, rollback is trivial
Holdout test sets — separate from training data to catch overfitting

Licensing

DSPy: MIT ✓ (can import and integrate freely)
GEPA: MIT ✓ (integrated into DSPy, also standalone pip install gepa)
Darwinian Evolver: AGPL v3 ⚠️ (external CLI only, no Python imports)
All Hermes-native code: MIT ✓

Relationship to Existing Issues

Issue	Status	Relationship
#336 (Darwinian Evolver Skill)	Open	Subsumed by Phase 3 of this plan
#337 (Evolutionary Self-Improvement)	Open	This plan IS the implementation of #337
#339 (PR: Darwinian Evolver skill)	Open	Close — replaced by this unified approach

Open Questions

Should the optimization skill live in the repo (bundled) or Skills Hub (optional install)?
- Recommendation: Core orchestration in repo, optimization engines as optional dependencies
How do we build evaluation datasets for skills that don't have much usage history?
- Option A: LLM-generated synthetic test cases
- Option B: Manual curation by skill authors
- Option C: Community-contributed eval sets
Should evolved skills be versioned separately from the main repo?
- Recommendation: Git branches per evolution run, merge winning variants to main
What's the minimum viable first target?
- Recommendation: Pick 2-3 well-used skills with clear success metrics (e.g., arxiv paper search, github-code-review, systematic-debugging)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hermes Agent Self-Evolution — Evolutionary Self-Improvement for Hermes Agent

Vision

What Can Be Improved

Tier 1: Skill Files (Highest Value, Lowest Risk)

Tier 2: Tool Descriptions (Medium Value, Low Risk)

Tier 3: System Prompt Components (High Value, Higher Risk)

Tier 4: Code Evolution (High Value, Highest Risk)

Architecture

The Optimization Loop

Integration Points with Existing Hermes Infrastructure

Data Flow

Implementation Structure

Where It Lives

How It's Invoked

Relationship to hermes-agent

Execution Plan

How Phases Work

Timeline Overview

Detailed Phase Breakdown

Phase 1: Skill Evolution via DSPy+GEPA (Core Capability)

Phase 2: Tool Description Optimization

Phase 3: System Prompt Evolution

Phase 4: Code Evolution via Darwinian Evolver

Phase 5: Continuous Self-Improvement Loop

Benchmarks as Fitness Signals

Constraints & Guardrails

1. Full Test Suite

2. Character/Token Limits

3. Prompt Caching Compatibility

4. Semantic Preservation

5. Deployment via PR (Never Direct Commit)

Practical Considerations

Cost

Safety

Licensing

Relationship to Existing Issues

Open Questions

FilesExpand file tree

PLAN.md

Latest commit

History

PLAN.md

File metadata and controls

Hermes Agent Self-Evolution — Evolutionary Self-Improvement for Hermes Agent

Vision

What Can Be Improved

Tier 1: Skill Files (Highest Value, Lowest Risk)

Tier 2: Tool Descriptions (Medium Value, Low Risk)

Tier 3: System Prompt Components (High Value, Higher Risk)

Tier 4: Code Evolution (High Value, Highest Risk)

Architecture

The Optimization Loop

Integration Points with Existing Hermes Infrastructure

Data Flow

Implementation Structure

Where It Lives

How It's Invoked

Relationship to hermes-agent

Execution Plan

How Phases Work

Timeline Overview

Detailed Phase Breakdown

Phase 1: Skill Evolution via DSPy+GEPA (Core Capability)

Phase 2: Tool Description Optimization

Phase 3: System Prompt Evolution

Phase 4: Code Evolution via Darwinian Evolver

Phase 5: Continuous Self-Improvement Loop

Benchmarks as Fitness Signals

Constraints & Guardrails

1. Full Test Suite

2. Character/Token Limits

3. Prompt Caching Compatibility

4. Semantic Preservation

5. Deployment via PR (Never Direct Commit)

Practical Considerations

Cost

Safety

Licensing

Relationship to Existing Issues

Open Questions