
Phase 1 SkillModule architecture prevents GEPA from mutating actual skill content + Nous API integration patches #38

@steezkelly

Description


Summary

After extensive testing of the Phase 1 skill evolution pipeline, we identified a core architectural issue in SkillModule that prevents GEPA from actually mutating skill content. We also implemented several patches to make the pipeline runnable on the Nous Research API. This issue documents our findings and proposes concrete next steps.


🔬 Environment

| Component | Version |
|---|---|
| hermes-agent-self-evolution | 4693c8f (latest main) |
| dspy | 3.1.3 |
| gepa | 0.2.1 |
| Python | 3.12 |
| OS | Linux Mint 22.3 |
| API Provider | Nous Research (OpenRouter gateway) |

🐛 Bug 1: SkillModule Architecture Prevents Skill Text Mutation

The Problem

The SkillModule class in evolution/skills/skill_module.py passes skill_text as a runtime input field, not as part of the optimizable signature:

class TaskWithSkill(dspy.Signature):
    """Complete a task following the provided skill instructions."""
    skill_instructions: str = dspy.InputField(desc="The skill instructions to follow")
    task_input: str = dspy.InputField(desc="The task to complete")
    output: str = dspy.OutputField(desc="Your response following the skill instructions")

class SkillModule(dspy.Module):
    def __init__(self, skill_text: str):
        super().__init__()
        self.skill_text = skill_text          # ← Stored as instance attribute
        self.predictor = dspy.ChainOfThought(TaskWithSkill)  # module-level class, not an attribute

    def forward(self, task_input: str):
        return self.predictor(
            skill_instructions=self.skill_text,  # ← Passed as input data
            task_input=task_input,
        )

Why This Breaks GEPA

GEPA (like every DSPy prompt optimizer) mutates Signature instructions — the docstring and desc= fields of the signature class. It does NOT mutate instance attributes or runtime input values.

What happens during optimization:

  1. GEPA sees TaskWithSkill.__doc__ ("Complete a task following...") as the optimizable text
  2. GEPA evolves that docstring into a better "meta-prompt" for the agent wrapper
  3. The self.skill_text (the actual SKILL.md content) is never touched
  4. evolved_skill.md ends up as ~2.4KB of wrapper instructions, not an improved skill

Evidence

After 10 iterations on hermes-agent skill:

  • Baseline skill size: 27,043 chars
  • "Evolved" output size: 2,433 chars
  • Diff: Empty (the extracted skill_text was identical to baseline)

We traced this by inspecting optimized_module.predictor.predict.signature.instructions and confirming it diverged from optimized_module.skill_text.
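That inspection can be packaged as a small debugging helper (the attribute path comes from the trace above; the helper name and return keys are ours):

```python
def diff_report(optimized_module) -> dict:
    """Compare the two texts GEPA could have touched: the signature
    instructions (which it mutates) and skill_text (which it ignores)."""
    sig = optimized_module.predictor.predict.signature
    evolved = getattr(sig, "instructions", "") or ""
    stored = optimized_module.skill_text
    return {
        "instructions_len": len(evolved),
        "skill_text_len": len(stored),
        "diverged": evolved != stored,  # True == GEPA rewrote only the wrapper
    }
```

Running it after compile makes the failure mode obvious: `diverged` is True while `skill_text_len` still equals the baseline size.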

Our Workaround (Not Ideal)

We patched evolve_skill.py to extract from the signature instead:

def _extract_evolved_instructions(optimized_module):
    """Extract the evolved text from the GEPA-mutated signature instructions."""
    try:
        sig = optimized_module.predictor.predict.signature
        instructions = getattr(sig, "instructions", None)
        if instructions and len(instructions) > 100:
            return instructions
    except Exception:
        pass
    return optimized_module.skill_text  # Fallback

This produces a different file, but it's an evolved wrapper prompt, not an evolved skill. The skill's actual procedure, pitfalls, and examples remain unchanged.


🔌 Enhancement: Nous Research API Integration

The current code hardcodes OpenAI API assumptions. We patched three files to support Nous Research's OpenRouter-compatible endpoint:

evolution/skills/evolve_skill.py:

api_base = os.getenv("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
optimizer_lm = dspy.LM(model=optimizer_model, api_base=api_base, ...)

evolution/core/fitness.py:

llm = dspy.LM(model=config.judge_model, api_base=api_base, ...)

evolution/core/dataset_builder.py:

llm = dspy.LM(model=config.judge_model, api_base=api_base, ...)

This allows PROVIDER=nous in run-evolution.sh with models like claude-3-5-sonnet-20241022 (optimizer) and kimi-k2.6 (free evaluator on Nous).

Suggested improvement: Make api_base a first-class config option in EvolutionConfig rather than requiring os.getenv patches.
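A sketch of that config change, assuming a dataclass-style EvolutionConfig (field names and defaults here are illustrative, not the repo's):

```python
import os
from dataclasses import dataclass, field

@dataclass
class EvolutionConfig:
    # Existing fields elided; these names are illustrative.
    optimizer_model: str = "claude-3-5-sonnet-20241022"
    judge_model: str = "kimi-k2.6"
    # New: first-class endpoint config, with the env var kept as a fallback
    # so existing setups keep working.
    api_base: str = field(
        default_factory=lambda: os.getenv(
            "OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"
        )
    )
```

Call sites in all three patched files would then read `dspy.LM(model=config.judge_model, api_base=config.api_base, ...)` with no per-file os.getenv patches.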


📏 Config Issue: max_skill_size Too Small for Real Skills

Current default: 15,000 chars.

Real Hermes skills (hermes-agent, github-code-review) already exceed this. We bumped it to 50,000 chars, which broke tests/core/test_constraints.py::test_skill_over_limit (it expects the 15K default).

Suggestion: Either raise the default or make it auto-detect from the baseline skill size + max_prompt_growth.
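The auto-detect option could be as simple as this sketch (the function name, and reading max_prompt_growth as a growth ratio, are our assumptions):

```python
def dynamic_skill_limit(baseline_len: int,
                        max_prompt_growth: float = 1.5,
                        floor: int = 15_000) -> int:
    """Allow the evolved skill to grow up to max_prompt_growth x the
    baseline size, but never drop below the current 15K default."""
    return max(floor, int(baseline_len * max_prompt_growth))
```

For the 27,043-char hermes-agent baseline this yields a 40,564-char budget instead of a hard-coded constant, and the existing test's 15K expectation stays valid for small skills.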


🧪 Test Gap: No Test for "Did GEPA Actually Mutate the Skill?"

The test suite verifies parsing and reassembly, but there is no test that confirms:

  1. GEPA.compile() produces a module whose extractable text differs from baseline
  2. The evolved text is semantically a skill (has frontmatter, procedures, etc.)
  3. The fitness metric improves on held-out data with real skill mutations

This allowed the architecture issue to go unnoticed.
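A minimal regression test covering points 1–2 might look like this sketch (the helper name is ours; point 3 additionally needs held-out eval data and a real GEPA run):

```python
def assert_skill_actually_evolved(baseline_text: str, evolved_text: str) -> None:
    """Fail loudly if GEPA returned the baseline, dropped the frontmatter,
    or collapsed the skill into a short meta-prompt."""
    assert evolved_text != baseline_text, "GEPA returned the baseline unchanged"
    assert evolved_text.lstrip().startswith("---"), "evolved skill lost its frontmatter"
    assert len(evolved_text) > 0.5 * len(baseline_text), \
        "evolved skill collapsed into a short meta-prompt"
```

Wired into CI after `GEPA.compile()`, this alone would have flagged the 27KB→2.4KB collapse documented above.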


🎯 Proposed Fix: Redesign SkillModule for True Skill Evolution

Option A: Inline Skill Body as Signature Instructions

Make the skill's markdown body the instructions string of the signature:

class SkillModule(dspy.Module):
    def __init__(self, skill_body: str):
        super().__init__()

        class DynamicTask(dspy.Signature):
            task_input: str = dspy.InputField(desc="The task to complete")
            output: str = dspy.OutputField(desc="Response following the skill")

        # Install the skill body as the optimizable instructions.
        # with_instructions() is the supported way to replace a Signature's
        # instructions; assigning __doc__ after class creation is fragile.
        self.predictor = dspy.ChainOfThought(
            DynamicTask.with_instructions(skill_body)   # ← GEPA will mutate this
        )

Challenge: Large skills (27KB) may exceed DSPy's instruction token budget or cause GEPA to compress/collapse the content into a short meta-prompt.

Option B: Two-Stage Evolution

  1. Stage 1: GEPA evolves a "skill outline" (sections, key points, structure)
  2. Stage 2: An LLM expands the outline back into full markdown
  3. Validation: Constraint checker ensures the expanded form preserves frontmatter
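A hedged sketch of how the three stages could be wired, with the optimizer and LLM injected as callables so both can be stubbed in tests (all names here are ours):

```python
def evolve_two_stage(skill_md: str, evolve_outline, expand_outline) -> str:
    """Stage 1: evolve a compact outline; Stage 2: expand it back to
    markdown; Stage 3: verify the frontmatter survived reassembly.
    evolve_outline and expand_outline are injected callables (GEPA run
    and LLM call in production, plain functions in tests)."""
    parts = skill_md.split("---", 2)          # ["", frontmatter, body]
    frontmatter, body = parts[1], parts[2]
    outline = evolve_outline(body)            # Stage 1: GEPA mutates this small text
    new_body = expand_outline(outline)        # Stage 2: LLM rehydrates full markdown
    evolved = f"---{frontmatter}---{new_body}"
    assert evolved.startswith("---"), "frontmatter lost during reassembly"
    return evolved
```

The win over Option A is that the text GEPA sees stays small; the cost is a second LLM call and a fidelity risk in the expansion step.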

Option C: MIPROv2 for Skill Body

MIPROv2 produces optimized instructions we can extract directly. Trade-off: it may bloat output size, but it actually mutates the full instruction text rather than a wrapper.


🚀 Path to Phase 2: Tool Description Evolution

The evolution/tools/ directory is empty (only __init__.py). To implement Phase 2:

What's Needed

  1. evolution/tools/tool_description_module.py — A DSPy module where:

    • Input: Task description + available tools
    • Optimizable: A specific tool's description field
    • Output: Correct tool selection
    • Metric: Accuracy of tool selection on a dataset
  2. evolution/tools/evolve_tool_descriptions.py — CLI entry point mirroring evolve_skill.py

  3. Dataset builder extension — Mine tools/registry.py for tool schemas and generate task→tool mapping examples

  4. Integration with batch_runner.py — The PLAN.md references batch_runner.py from hermes-agent as the evaluation harness. This is currently missing from the repo.
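The selection metric from item 1 is expressible as plain Python today (function name ours); it would double as the fitness function the optimizer maximizes:

```python
def tool_selection_accuracy(examples, predict) -> float:
    """examples: iterable of (task, expected_tool_name) pairs.
    predict: callable mapping a task string to a chosen tool name
    (in production, a DSPy module whose optimizable text is the
    target tool's description — mirroring the SkillModule fix above).
    Returns the fraction of tasks routed to the expected tool."""
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(1 for task, expected in examples if predict(task) == expected)
    return hits / len(examples)
```

Keeping the predictor injectable means the dataset-builder extension (item 3) can be tested against a deterministic stub before any API calls happen.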

Critical Dependency

The PLAN.md describes deep integration with hermes-agent components (batch_runner.py, agent/trajectory.py, hermes_state.py) that do not exist in this repo. Phase 2+ realistically requires either:

  • pip install hermes-agent as a dependency and importing its internals
  • Or copying/adapting those components into this repo

Question for maintainers: Is hermes-agent published as a pip-installable package? If not, should this repo vendor the evaluation harness?


✅ What We Verified Works

| Component | Status | Notes |
|---|---|---|
| Synthetic dataset generation | ✅ | Produces valid EvalDataset |
| SessionDB mining | ✅ | 50 real sessions extracted |
| GEPA execution | ✅ | Completes N iterations without crash |
| Constraint validation | ✅ | Size, structure, growth checks pass |
| Skill reassembly | ✅ | Frontmatter preserved, body replaced |
| Nous Research API | ✅ | With patches, runs end-to-end |

📋 Recommended Action Items

  1. Fix SkillModule architecture so GEPA mutates actual skill content, not wrapper prompts
  2. Add integration test that asserts evolved_text != baseline_text and has valid skill structure
  3. Review and merge community PRs addressing DSPy 3.1 compatibility (GEPA optimizer fails with DSPy >=3.1: max_steps is not a valid parameter #10), constraint validation (Constraint validator rejects every evolved skill: checks frontmatter on body-only text #11), and fitness metric (Fitness metric uses keyword overlap only — insufficient signal for optimization #12)
  4. Expose api_base / base_url as a config option for OpenRouter/Nous users
  5. Clarify hermes-agent integration — Is batch_runner.py available as a package? Should Phase 2 wait for it?
  6. Update max_skill_size default or make it dynamic

We're excited about this project's potential and happy to contribute PRs for any of the above. Just let us know which direction the maintainers prefer.
