After extensive testing of the Phase 1 skill evolution pipeline, we identified a core architectural issue in SkillModule that prevents GEPA from actually mutating skill content. We also implemented several patches to make the pipeline runnable on the Nous Research API. This issue documents our findings and proposes concrete next steps.
## 🔬 Environment

| Component | Version |
| --- | --- |
| hermes-agent-self-evolution | 4693c8f (latest main) |
| dspy | 3.1.3 |
| gepa | 0.2.1 |
| Python | 3.12 |
| OS | Linux Mint 22.3 |
| API Provider | Nous Research (OpenRouter gateway) |
## 🐛 Bug 1: `SkillModule` Architecture Prevents Skill Text Mutation

### The Problem

The `SkillModule` class in `evolution/skills/skill_module.py` passes `skill_text` as a runtime input field, not as part of the optimizable signature:
```python
class TaskWithSkill(dspy.Signature):
    """Complete a task following the provided skill instructions."""

    skill_instructions: str = dspy.InputField(desc="The skill instructions to follow")
    task_input: str = dspy.InputField(desc="The task to complete")
    output: str = dspy.OutputField(desc="Your response following the skill instructions")


class SkillModule(dspy.Module):
    def __init__(self, skill_text: str):
        super().__init__()
        self.skill_text = skill_text  # ← Stored as instance attribute
        self.predictor = dspy.ChainOfThought(TaskWithSkill)

    def forward(self, task_input: str):
        return self.predictor(
            skill_instructions=self.skill_text,  # ← Passed as input data
            task_input=task_input,
        )
```
### Why This Breaks GEPA

GEPA (and all DSPy optimizers) mutates `Signature` instructions: the docstring and `desc=` fields of the signature class. It does NOT mutate instance attributes or runtime input values.

What happens during optimization:

1. GEPA sees `TaskWithSkill.__doc__` ("Complete a task following...") as the optimizable text
2. GEPA evolves that docstring into a better "meta-prompt" for the agent wrapper
3. `self.skill_text` (the actual SKILL.md content) is never touched
4. `evolved_skill.md` ends up as ~2.4 KB of wrapper instructions, not an improved skill
### Evidence

After 10 iterations on the `hermes-agent` skill:

- Baseline skill size: 27,043 chars
- "Evolved" output size: 2,433 chars
- Diff: empty (the extracted `skill_text` was identical to the baseline)

We traced this by inspecting `optimized_module.predictor.predict.signature.instructions` and confirming it diverged from `optimized_module.skill_text`.
### Our Workaround (Not Ideal)

We patched `evolve_skill.py` to extract from the signature instead:

```python
def _extract_evolved_instructions(optimized_module):
    """Extract the evolved text from the GEPA-mutated signature instructions."""
    try:
        sig = optimized_module.predictor.predict.signature
        instructions = getattr(sig, "instructions", None)
        if instructions and len(instructions) > 100:
            return instructions
    except Exception:
        pass
    return optimized_module.skill_text  # Fallback
This produces a different file, but it's an evolved wrapper prompt, not an evolved skill. The skill's actual procedure, pitfalls, and examples remain unchanged.
## 🔌 Enhancement: Nous Research API Integration

The current code hardcodes OpenAI API assumptions. We patched three files (`evolution/skills/evolve_skill.py`, `evolution/core/fitness.py`, and `evolution/core/dataset_builder.py`) to support Nous Research's OpenRouter-compatible endpoint.

This allows `PROVIDER=nous` in `run-evolution.sh` with models like `claude-3-5-sonnet-20241022` (optimizer) and `kimi-k2.6` (free evaluator on Nous).

Suggested improvement: make `api_base` a first-class config option in `EvolutionConfig` rather than requiring `os.getenv` patches.
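A sketch of what that could look like (the field names and `make_lm` helper are illustrative, not the repo's actual config surface):

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvolutionConfig:
    # Existing options elided; api_base is the proposed addition.
    model: str = "claude-3-5-sonnet-20241022"
    # Fall back to an env var so current setups keep working unconfigured.
    api_base: Optional[str] = field(
        default_factory=lambda: os.getenv("EVOLUTION_API_BASE")
    )


def make_lm(cfg: EvolutionConfig):
    """Build the dspy LM from config instead of patched globals."""
    import dspy  # local import keeps the config module dependency-light

    kwargs = {"api_base": cfg.api_base} if cfg.api_base else {}
    return dspy.LM(cfg.model, **kwargs)
```

With this in place, `PROVIDER=nous` would set `api_base` once in config instead of patching every `os.getenv` call site.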
## 📏 Config Issue: `max_skill_size` Too Small for Real Skills

Current default: 15,000 chars.

Real Hermes skills (`hermes-agent`, `github-code-review`) exceed this. We bumped the limit to 50,000 chars, which broke `tests/core/test_constraints.py::test_skill_over_limit` (it expects 15K).

Suggestion: either raise the default or auto-detect it from the baseline skill size plus `max_prompt_growth`.
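One way to make the limit dynamic, as a sketch (the function name and 15K floor are our choices; the growth factor mirrors `max_prompt_growth`):

```python
def dynamic_max_skill_size(baseline_skill: str,
                           max_prompt_growth: float = 0.5,
                           floor: int = 15_000) -> int:
    """Derive the size limit from the baseline skill instead of a constant.

    The evolved skill may grow by max_prompt_growth (e.g. 0.5 = 50%) over
    its baseline length, but the limit never drops below the current floor.
    """
    return max(floor, int(len(baseline_skill) * (1 + max_prompt_growth)))
```

For the 27,043-char `hermes-agent` baseline this yields a 40,564-char budget, while tiny skills keep the 15K floor.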
## 🧪 Test Gap: No Test for "Did GEPA Actually Mutate the Skill?"

The test suite verifies parsing and reassembly, but there is no test that confirms:

- `GEPA.compile()` produces a module whose extractable text differs from the baseline
- the evolved text is semantically a skill (has frontmatter, procedures, etc.)
- the fitness metric improves on held-out data with real skill mutations

This gap allowed the architecture issue to go unnoticed.
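A sketch of the missing assertion helper (the name, regex, and 50% shrink threshold are our choices), to be called from an integration test once the evolved text is extracted:

```python
import re

# Minimal frontmatter check: the text must start with a `---` block.
FRONTMATTER_RE = re.compile(r"\A---\n.+?\n---\n", re.DOTALL)


def assert_skill_actually_evolved(baseline_text: str, evolved_text: str) -> None:
    # 1. GEPA must have changed something.
    assert evolved_text != baseline_text, "evolved text identical to baseline"
    # 2. The result must still look like a skill, not a wrapper meta-prompt.
    assert FRONTMATTER_RE.match(evolved_text), "evolved skill lost its frontmatter"
    # 3. It must not have collapsed to a fraction of the original
    #    (the 27KB -> 2.4KB symptom reported above).
    assert len(evolved_text) >= 0.5 * len(baseline_text), \
        "evolved skill shrank suspiciously"
```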
## 🎯 Proposed Fix: Redesign `SkillModule` for True Skill Evolution

### Option A: Inline Skill Body as Signature Instructions

Make the skill's markdown body the `instructions` string of the signature:
Challenge: large skills (27 KB) may exceed DSPy's instruction token budget or cause GEPA to compress or collapse the content into a short meta-prompt.

### Option B: Two-Stage Evolution

- Stage 1: GEPA evolves a condensed outline of the skill, small enough for the instruction budget
- Stage 2: An LLM expands the outline back into full markdown
- Validation: Constraint checker ensures the expanded form preserves frontmatter
### Option C: MIPROv2 for Skill Body

MIPROv2 produces optimized instructions we can extract directly. Trade-off: it may bloat the output size, but it actually mutates the full instruction text rather than a wrapper.
## 🚀 Path to Phase 2: Tool Description Evolution

The `evolution/tools/` directory is empty (only `__init__.py`). To implement Phase 2:

### What's Needed

1. `evolution/tools/tool_description_module.py` — a DSPy module where:
   - Input: task description + available tools
   - Optimizable: a specific tool's `description` field
   - Output: correct tool selection
   - Metric: accuracy of tool selection on a dataset
2. `evolution/tools/evolve_tool_descriptions.py` — a CLI entry point mirroring `evolve_skill.py`
3. Dataset builder extension — mine `tools/registry.py` for tool schemas and generate task→tool mapping examples
4. Integration with `batch_runner.py` — the PLAN.md references `batch_runner.py` from hermes-agent as the evaluation harness; it is currently missing from this repo.
### Critical Dependency

The PLAN.md describes deep integration with hermes-agent components (`batch_runner.py`, `agent/trajectory.py`, `hermes_state.py`) that do not exist in this repo. Phase 2+ realistically requires either:

1. `pip install hermes-agent` as a dependency, importing its internals, or
2. copying/adapting those components into this repo.

Question for maintainers: is hermes-agent published as a pip-installable package? If not, should this repo vendor the evaluation harness?
## ✅ What We Verified Works

| Component | Status | Notes |
| --- | --- | --- |
| Synthetic dataset generation | ✅ | Produces valid `EvalDataset` |
| SessionDB mining | ✅ | 50 real sessions extracted |
| GEPA execution | ✅ | Completes N iterations without crash |
| Constraint validation | ✅ | Size, structure, growth checks pass |
| Skill reassembly | ✅ | Frontmatter preserved, body replaced |
| Nous Research API | ✅ | With patches, runs end-to-end |
## 📋 Recommended Action Items

1. Fix the `SkillModule` architecture so GEPA mutates actual skill content, not wrapper prompts
2. Add an integration test asserting `evolved_text != baseline_text` and that the result has valid skill structure
3. Expose `api_base` / `base_url` as a config option for OpenRouter/Nous users
4. Clarify hermes-agent integration — is `batch_runner.py` available as a package? Should Phase 2 wait for it?
5. Update the `max_skill_size` default or make it dynamic
We're excited about this project's potential and happy to contribute PRs for any of the above. Just let us know which direction the maintainers prefer.