
Phase 1 SkillModule architecture prevents GEPA from mutating actual skill content + Nous API integration patches #38

@steezkelly

Description


Summary

After extensive testing of the Phase 1 skill evolution pipeline, we identified a core architectural issue in SkillModule that prevents GEPA from actually mutating skill content. We also implemented several patches to make the pipeline runnable on the Nous Research API. This issue documents our findings and proposes concrete next steps.


🔬 Environment

| Component | Version |
|---|---|
| hermes-agent-self-evolution | 4693c8f (latest main) |
| dspy | 3.1.3 |
| gepa | 0.2.1 |
| Python | 3.12 |
| OS | Linux Mint 22.3 |
| API Provider | Nous Research (OpenRouter gateway) |

🐛 Bug 1: SkillModule Architecture Prevents Skill Text Mutation

The Problem

The SkillModule class in evolution/skills/skill_module.py passes skill_text as a runtime input field, not as part of the optimizable signature:

class TaskWithSkill(dspy.Signature):
    """Complete a task following the provided skill instructions."""
    skill_instructions: str = dspy.InputField(desc="The skill instructions to follow")
    task_input: str = dspy.InputField(desc="The task to complete")
    output: str = dspy.OutputField(desc="Your response following the skill instructions")

class SkillModule(dspy.Module):
    def __init__(self, skill_text: str):
        super().__init__()
        self.skill_text = skill_text          # ← Stored as instance attribute
        self.predictor = dspy.ChainOfThought(TaskWithSkill)  # module-level class, not an attribute

    def forward(self, task_input: str):
        return self.predictor(
            skill_instructions=self.skill_text,  # ← Passed as input data
            task_input=task_input,
        )

Why This Breaks GEPA

GEPA (like every DSPy prompt optimizer) mutates Signature instructions — the docstring and desc= fields of the signature class. It does NOT mutate instance attributes or runtime input values.

What happens during optimization:

  1. GEPA sees TaskWithSkill.__doc__ ("Complete a task following...") as the optimizable text
  2. GEPA evolves that docstring into a better "meta-prompt" for the agent wrapper
  3. The self.skill_text (the actual SKILL.md content) is never touched
  4. evolved_skill.md ends up as ~2.4KB of wrapper instructions, not an improved skill

Evidence

After 10 iterations on hermes-agent skill:

  • Baseline skill size: 27,043 chars
  • "Evolved" output size: 2,433 chars
  • Diff: Empty (the extracted skill_text was identical to baseline)

We traced this by inspecting optimized_module.predictor.predict.signature.instructions and confirming it diverged from optimized_module.skill_text.
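That inspection can be packaged as a small debugging helper (the attribute path comes from the trace above; the helper name and return keys are ours):

```python
def diff_report(optimized_module) -> dict:
    """Compare the two texts GEPA could have touched: the signature
    instructions (which it mutates) and skill_text (which it ignores)."""
    sig = optimized_module.predictor.predict.signature
    evolved = getattr(sig, "instructions", "") or ""
    stored = optimized_module.skill_text
    return {
        "instructions_len": len(evolved),
        "skill_text_len": len(stored),
        "diverged": evolved != stored,  # True == GEPA rewrote only the wrapper
    }
```

Running it after compile makes the failure mode obvious: `diverged` is True while `skill_text_len` still equals the baseline size.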

Our Workaround (Not Ideal)

We patched evolve_skill.py to extract from the signature instead:

def _extract_evolved_instructions(optimized_module):
    """Extract the evolved text from the GEPA-mutated signature instructions."""
    try:
        sig = optimized_module.predictor.predict.signature
        instructions = getattr(sig, "instructions", None)
        if instructions and len(instructions) > 100:
            return instructions
    except Exception:
        pass
    return optimized_module.skill_text  # Fallback

This produces a different file, but it's an evolved wrapper prompt, not an evolved skill. The skill's actual procedure, pitfalls, and examples remain unchanged.


🔌 Enhancement: Nous Research API Integration

The current code hardcodes OpenAI API assumptions. We patched three files to support Nous Research's OpenRouter-compatible endpoint:

evolution/skills/evolve_skill.py:

api_base = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
optimizer_lm = dspy.LM(model=optimizer_model, api_base=api_base, ...)

evolution/core/fitness.py:

llm = dspy.LM(model=config.judge_model, api_base=api_base, ...)

evolution/core/dataset_builder.py:

llm = dspy.LM(model=config.judge_model, api_base=api_base, ...)

This allows PROVIDER=nous in run-evolution.sh with models like claude-3-5-sonnet-20241022 (optimizer) and kimi-k2.6 (free evaluator on Nous).

Suggested improvement: Make api_base a first-class config option in EvolutionConfig rather than requiring os.getenv patches.
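A sketch of that config change, assuming a dataclass-style EvolutionConfig (field names and defaults here are illustrative, not the repo's):

```python
import os
from dataclasses import dataclass, field

@dataclass
class EvolutionConfig:
    # Existing fields elided; these names are illustrative.
    optimizer_model: str = "claude-3-5-sonnet-20241022"
    judge_model: str = "kimi-k2.6"
    # New: first-class endpoint config, with the env var kept as a fallback
    # so existing setups keep working.
    api_base: str = field(
        default_factory=lambda: os.getenv(
            "OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"
        )
    )
```

Call sites in all three patched files would then read `dspy.LM(model=config.judge_model, api_base=config.api_base, ...)` with no per-file os.getenv patches.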


📏 Config Issue: max_skill_size Too Small for Real Skills

Current default: 15,000 chars.

Real Hermes skills (hermes-agent, github-code-review) already exceed this. We bumped it to 50,000 chars, which broke tests/core/test_constraints.py::test_skill_over_limit (it expects the 15K default).

Suggestion: Either raise the default or make it auto-detect from the baseline skill size + max_prompt_growth.
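The auto-detect option could be as simple as this sketch (the function name, and reading max_prompt_growth as a growth ratio, are our assumptions):

```python
def dynamic_skill_limit(baseline_len: int,
                        max_prompt_growth: float = 1.5,
                        floor: int = 15_000) -> int:
    """Allow the evolved skill to grow up to max_prompt_growth x the
    baseline size, but never drop below the current 15K default."""
    return max(floor, int(baseline_len * max_prompt_growth))
```

For the 27,043-char hermes-agent baseline this yields a 40,564-char budget instead of a hard-coded constant, and the existing test's 15K expectation stays valid for small skills.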


🧪 Test Gap: No Test for "Did GEPA Actually Mutate the Skill?"

The test suite verifies parsing and reassembly, but there is no test that confirms:

  1. GEPA.compile() produces a module whose extractable text differs from baseline
  2. The evolved text is semantically a skill (has frontmatter, procedures, etc.)
  3. The fitness metric improves on held-out data with real skill mutations

This allowed the architecture issue to go unnoticed.
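A minimal regression test covering points 1–2 might look like this sketch (the helper name is ours; point 3 additionally needs held-out eval data and a real GEPA run):

```python
def assert_skill_actually_evolved(baseline_text: str, evolved_text: str) -> None:
    """Fail loudly if GEPA returned the baseline, dropped the frontmatter,
    or collapsed the skill into a short meta-prompt."""
    assert evolved_text != baseline_text, "GEPA returned the baseline unchanged"
    assert evolved_text.lstrip().startswith("---"), "evolved skill lost its frontmatter"
    assert len(evolved_text) > 0.5 * len(baseline_text), \
        "evolved skill collapsed into a short meta-prompt"
```

Wired into CI after `GEPA.compile()`, this alone would have flagged the 27KB→2.4KB collapse documented above.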


🎯 Proposed Fix: Redesign SkillModule for True Skill Evolution

Option A: Inline Skill Body as Signature Instructions

Make the skill's markdown body the instructions string of the signature:

class SkillModule(dspy.Module):
    def __init__(self, skill_body: str):
        super().__init__()

        class DynamicTask(dspy.Signature):
            task_input: str = dspy.InputField(desc="The task to complete")
            output: str = dspy.OutputField(desc="Response following the skill")

        # Install the skill body as the optimizable instructions.
        # with_instructions() is the supported way to replace a Signature's
        # instructions; assigning __doc__ after class creation is fragile.
        self.predictor = dspy.ChainOfThought(
            DynamicTask.with_instructions(skill_body)   # ← GEPA will mutate this
        )

Challenge: Large skills (27KB) may exceed DSPy's instruction token budget or cause GEPA to compress/collapse the content into a short meta-prompt.

Option B: Two-Stage Evolution

  1. Stage 1: GEPA evolves a "skill outline" (sections, key points, structure)
  2. Stage 2: An LLM expands the outline back into full markdown
  3. Validation: Constraint checker ensures the expanded form preserves frontmatter
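A hedged sketch of how the three stages could be wired, with the optimizer and LLM injected as callables so both can be stubbed in tests (all names here are ours):

```python
def evolve_two_stage(skill_md: str, evolve_outline, expand_outline) -> str:
    """Stage 1: evolve a compact outline; Stage 2: expand it back to
    markdown; Stage 3: verify the frontmatter survived reassembly.
    evolve_outline and expand_outline are injected callables (GEPA run
    and LLM call in production, plain functions in tests)."""
    parts = skill_md.split("---", 2)          # ["", frontmatter, body]
    frontmatter, body = parts[1], parts[2]
    outline = evolve_outline(body)            # Stage 1: GEPA mutates this small text
    new_body = expand_outline(outline)        # Stage 2: LLM rehydrates full markdown
    evolved = f"---{frontmatter}---{new_body}"
    assert evolved.startswith("---"), "frontmatter lost during reassembly"
    return evolved
```

The win over Option A is that the text GEPA sees stays small; the cost is a second LLM call and a fidelity risk in the expansion step.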

Option C: MIPROv2 for Skill Body

MIPROv2 produces optimized instructions we can extract directly. Trade-off: it may bloat output size, but it actually mutates the full instruction text rather than a wrapper.


🚀 Path to Phase 2: Tool Description Evolution

The evolution/tools/ directory is empty (only __init__.py). To implement Phase 2:

What's Needed

  1. evolution/tools/tool_description_module.py — A DSPy module where:

    • Input: Task description + available tools
    • Optimizable: A specific tool's description field
    • Output: Correct tool selection
    • Metric: Accuracy of tool selection on a dataset
  2. evolution/tools/evolve_tool_descriptions.py — CLI entry point mirroring evolve_skill.py

  3. Dataset builder extension — Mine tools/registry.py for tool schemas and generate task→tool mapping examples

  4. Integration with batch_runner.py — The PLAN.md references batch_runner.py from hermes-agent as the evaluation harness. This is currently missing from the repo.
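The selection metric from item 1 is expressible as plain Python today (function name ours); it would double as the fitness function the optimizer maximizes:

```python
def tool_selection_accuracy(examples, predict) -> float:
    """examples: iterable of (task, expected_tool_name) pairs.
    predict: callable mapping a task string to a chosen tool name
    (in production, a DSPy module whose optimizable text is the
    target tool's description — mirroring the SkillModule fix above).
    Returns the fraction of tasks routed to the expected tool."""
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(1 for task, expected in examples if predict(task) == expected)
    return hits / len(examples)
```

Keeping the predictor injectable means the dataset-builder extension (item 3) can be tested against a deterministic stub before any API calls happen.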

Critical Dependency

The PLAN.md describes deep integration with hermes-agent components (batch_runner.py, agent/trajectory.py, hermes_state.py) that do not exist in this repo. Phase 2+ realistically requires either:

  • pip install hermes-agent as a dependency and importing its internals
  • Or copying/adapting those components into this repo

Question for maintainers: Is hermes-agent published as a pip-installable package? If not, should this repo vendor the evaluation harness?


✅ What We Verified Works

| Component | Status | Notes |
|---|---|---|
| Synthetic dataset generation | ✅ | Produces valid EvalDataset |
| SessionDB mining | ✅ | 50 real sessions extracted |
| GEPA execution | ✅ | Completes N iterations without crash |
| Constraint validation | ✅ | Size, structure, growth checks pass |
| Skill reassembly | ✅ | Frontmatter preserved, body replaced |
| Nous Research API | ✅ | With patches, runs end-to-end |

📋 Recommended Action Items

  1. Fix SkillModule architecture so GEPA mutates actual skill content, not wrapper prompts
  2. Add integration test that asserts evolved_text != baseline_text and has valid skill structure
  3. Review and merge community PRs addressing DSPy 3.1 compatibility (GEPA optimizer fails with DSPy >=3.1: max_steps is not a valid parameter #10), constraint validation (Constraint validator rejects every evolved skill: checks frontmatter on body-only text #11), and fitness metric (Fitness metric uses keyword overlap only — insufficient signal for optimization #12)
  4. Expose api_base / base_url as a config option for OpenRouter/Nous users
  5. Clarify hermes-agent integration — Is batch_runner.py available as a package? Should Phase 2 wait for it?
  6. Update max_skill_size default or make it dynamic

We're excited about this project's potential and happy to contribute PRs for any of the above. Just let us know which direction the maintainers prefer.
