
Real session data evolution: fix GEPA + expand sessiondb filter#26

Open
rc-int wants to merge 2 commits into NousResearch:main from rc-international:feat/real-data-evolution

Conversation


@rc-int rc-int commented Apr 16, 2026

Summary

Two fixes that unblocked evolution on real session history:

  1. GEPA wasn't mutating skill text. SkillModule stored the skill as a plain attribute and passed it as an InputField, but GEPA only optimizes signature instructions, so every mutation was a no-op (prior runs showed 0% improvement). Fixed: the skill text is now the signature's instructions field.

  2. Sessiondb filter missed 96% of relevant messages. The heuristic extracted keywords from only the first 500 chars of the skill (usually frontmatter), required 2 exact-word matches, and capped candidates at max_examples * 3; for github-code-review that filter found 17 relevant examples from 2482 messages. New three-stage pipeline: one LLM call expands the skill into 30-50 relevance keywords (synonyms, abbreviations, domain verbs), then a substring pre-filter scans the entire corpus before LLM relevance scoring.
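A minimal sketch of the three-stage pipeline, with both LLM calls stubbed out (the real RelevanceFilter issues them through its configured model, e.g. cerebras/zai-glm-4.7); function names and the toy corpus here are illustrative, not the actual code:

```python
def expand_keywords(skill_text: str) -> list[str]:
    """Stage 1 (stubbed): one LLM call that turns the skill description
    into 30-50 relevance keywords (synonyms, abbreviations, domain verbs)."""
    return ["code review", "pull request", "pr", "git diff", "code audit"]

def prefilter(corpus: list[str], keywords: list[str]) -> list[str]:
    """Stage 2: cheap case-insensitive substring scan over the ENTIRE
    corpus, not a capped prefix of it."""
    lowered = [kw.lower() for kw in keywords]
    return [msg for msg in corpus
            if any(kw in msg.lower() for kw in lowered)]

def score_candidates(candidates: list[str], budget: int) -> list[str]:
    """Stage 3 (stubbed): LLM relevance scoring of each pre-filtered
    candidate, keeping at most `budget` (max_examples * 8) of them."""
    return candidates[:budget]

corpus = [
    "Please review my PR, the git diff is attached",
    "What should we have for lunch?",
    "The code audit flagged two issues",
]
keywords = expand_keywords("github-code-review skill")
candidates = prefilter(corpus, keywords)
kept = score_candidates(candidates, budget=8)
print(kept)  # the two review-related messages survive; the lunch one does not
```

The point of the expanded keyword list is visible even in this toy: "code audit" would never match a strict exact-word overlap against the skill's frontmatter.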

Also adds CodexLM — a DSPy LM provider that delegates to codex exec --json, using ChatGPT Plus OAuth. Lets GEPA use GPT-5.4 as the reflection model without an API key.
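A rough sketch of the delegation idea, not the actual CodexLM code: shell out to the codex CLI and parse its streamed JSON back into text. The one-JSON-object-per-line event shape and the "text" field are assumptions here; the real provider adapts the output to DSPy's LM interface.

```python
import json
import subprocess

class CodexLM:
    """Sketch of an LM provider that delegates to `codex exec --json`,
    relying on the CLI's own ChatGPT Plus OAuth instead of an API key."""

    def __call__(self, prompt: str) -> str:
        proc = subprocess.run(
            ["codex", "exec", "--json", prompt],
            capture_output=True, text=True, check=True,
        )
        return self._extract_text(proc.stdout)

    @staticmethod
    def _extract_text(stdout: str) -> str:
        # Assumed event format: one JSON object per line; concatenate
        # any objects that carry a "text" field, skip everything else.
        parts = []
        for line in stdout.splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(event, dict) and "text" in event:
                parts.append(event["text"])
        return "".join(parts)
```

Keeping the parsing in a separate static method makes the JSON handling testable without invoking the CLI.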

Expected outcomes

| Metric | Before | After |
| --- | --- | --- |
| Sessiondb relevant examples (github-code-review) | 17 | 44 (2.6x) |
| Pre-filter corpus coverage | capped at 150 | 2590 (full corpus) |
| Expanded relevance keywords | 0 | 51 (e.g. "code review", "pull request", "git diff", "code audit") |
| GEPA holdout score (synthetic data) | 0.414 → 0.414 (no-op) | 0.414 → 0.564 (+36.2%) |
| GEPA holdout score (real session data) | N/A (couldn't get enough data) | 0.481 → 0.536 (+11.5%) |
| Train/val/holdout split on real data | 10/5/5 | 22/11/11 |

Test plan

  • python -m evolution.core.external_importers --skill github-code-review --source all --model cerebras/zai-glm-4.7 --max-examples 100 → 44 examples, 22/11/11 split
  • python -m evolution.skills.evolve_skill --skill github-code-review --iterations 20 --eval-source golden --dataset-path datasets/skills/github-code-review --optimizer-model cerebras/zai-glm-4.7 --eval-model cerebras/zai-glm-4.7 → +11.5% holdout improvement
  • python -m evolution.skills.evolve_skill --skill github-code-review --iterations 15 --eval-source synthetic ... → +36.2% holdout improvement on synthetic (sanity check)
  • Repeat on a second skill to confirm generality

Merge strategy

Please use merge-commit (not squash). Each commit is a logical unit.

rc-int added 2 commits April 16, 2026 17:07
WHY: GEPA runs produced zero improvement because skill text was stored as
a plain attribute and passed as an InputField, not an optimizable parameter.
GEPA mutates signature instructions, not module state, so every mutation
was a no-op.
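A toy mock (not real DSPy classes) of the bug described above: a GEPA-style mutation rewrites signature.instructions and nothing else, so skill text stored in a plain module attribute is invisible to it.

```python
class Signature:
    def __init__(self, instructions: str):
        self.instructions = instructions

class BrokenSkillModule:
    def __init__(self, skill_text: str):
        self.skill_text = skill_text            # plain attribute: the optimizer never sees it
        self.signature = Signature("Answer the question.")

class FixedSkillModule:
    def __init__(self, skill_text: str):
        self.signature = Signature(skill_text)  # skill text IS the optimizable field

def gepa_mutate(module, new_instructions: str):
    """All a GEPA-style mutation does: rewrite the signature's instructions."""
    module.signature.instructions = new_instructions

broken = BrokenSkillModule("Review the diff carefully.")
fixed = FixedSkillModule("Review the diff carefully.")
gepa_mutate(broken, "Review the diff; flag security issues first.")
gepa_mutate(fixed, "Review the diff; flag security issues first.")

print(broken.skill_text)             # unchanged: the mutation was a no-op
print(fixed.signature.instructions)  # evolved skill text
```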

WHAT:
- SkillModule: skill text is now the signature.instructions field (the
  actual GEPA-optimizable parameter), not a separate InputField.
- evolve_skill: fixed GEPA constructor (max_metric_calls not max_steps,
  added reflection_lm), split eval/optimizer LMs, extract evolved body
  from predictor.predict.signature.instructions.
- fitness: metric signature extended to (example, prediction, trace,
  pred_name, pred_trace) — GEPA calls metrics with 5 args.
- constraints: passed full skill text (frontmatter + body) to validator.
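The extended metric shape can be sketched as below; the scoring rule (token overlap with a reference answer) is a placeholder for the project's real fitness function, but the five-positional-argument signature matches how GEPA invokes metrics:

```python
def skill_fitness_metric(example, prediction, trace=None,
                         pred_name=None, pred_trace=None) -> float:
    """Placeholder fitness metric with the 5-argument signature GEPA
    expects: (example, prediction, trace, pred_name, pred_trace)."""
    expected = set(str(getattr(example, "answer", example)).lower().split())
    got = set(str(getattr(prediction, "answer", prediction)).lower().split())
    if not expected:
        return 0.0
    return len(expected & got) / len(expected)

# GEPA passes all five positionals:
score = skill_fitness_metric("flag the bug", "flag the bug now", None, "predict", None)
print(score)  # 1.0
```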

ADD: CodexLM — DSPy LM provider that delegates to codex exec --json,
using ChatGPT Plus OAuth (no API key). Lets GEPA use GPT-5.4 as the
reflection model via the local codex CLI.

AFFECT: evolution/core/codex_lm.py, evolution/core/fitness.py,
evolution/skills/evolve_skill.py, evolution/skills/skill_module.py

Verified: GEPA on real session data (22/11/11 split) gave +11.5%
holdout gain (0.481 to 0.536) on github-code-review skill.
WHY: The sessiondb relevance filter found only 17 relevant examples from
2482 messages of real session history. Heuristic pre-filter extracted
keywords from the first 500 chars of the skill (often frontmatter), used
strict 2-match-minimum exact-word overlap, and capped candidates at
max_examples * 3. Synonyms and abbreviations ("PR" for "pull request")
were missed entirely.

WHAT: RelevanceFilter now has a three-stage pipeline. Stage 1: one LLM
call generates 30-50 relevance keywords/phrases for the target skill
(synonyms, verbs, artifacts). Stage 2: substring pre-filter against the
ENTIRE corpus using expanded keywords. Stage 3: LLM scores each
pre-filtered candidate as before. Budget raised to max_examples * 8 so
the LLM can reject noisy borderline cases instead of the cap discarding
them before scoring.

AFFECT: evolution/core/external_importers.py — RelevanceFilter class,
new ExpandKeywords signature, _expand_keywords method.

Verified (github-code-review skill, Cerebras zai-glm-4.7):
  Before: 17 examples from 150 candidates (2590 total corpus)
  After:  44 examples from 384 candidates (same 2590 corpus)
  Expanded to 51 keywords: "code review", "pull request", "git diff",
  "code audit", "review changes", ...
Downstream: real-data GEPA run produced +11.5% holdout improvement on
evolved skill with train/val/holdout = 22/11/11 (vs 10/5/5 synthetic).

ALSO: log pre-existing _read_copilot_workspace exception (S110 lint fix).
rc-int added a commit to rc-international/hermes-agent-self-evolution that referenced this pull request Apr 16, 2026
steezkelly added a commit to steezkelly/hermes-agent-self-evolution that referenced this pull request Apr 25, 2026
…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body, body becomes instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
  - constraint_validator.py: body/frontmatter separation — validate body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies

- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA

- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA previously fell back to MIPROv2 under the DSPy 3.2.0 API, because
max_metric_calls conflicts with auto='light'. Fixed by passing max_metric_calls alone.
steezkelly added a commit to steezkelly/hermes-agent-self-evolution that referenced this pull request Apr 25, 2026
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
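The sentinel round-trip described in this patch can be sketched as follows; function names are illustrative, but the sentinel string and the collision rationale (a "\n\n---\n\n" separator also appears inside skill bodies) come from the commit above:

```python
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def embed_skill_body(preamble: str, body: str) -> str:
    """Join the optimizer-facing preamble and the skill body with an
    HTML-comment sentinel, which, unlike a '---' rule, does not occur
    naturally in skill bodies."""
    return f"{preamble}\n{SENTINEL}\n{body}"

def extract_skill_body(instructions: str) -> str:
    """Take everything after the sentinel; fall back to the whole
    instructions string if the optimizer dropped the sentinel."""
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[1].strip()
    return instructions.strip()

combined = embed_skill_body("You are optimizing a skill.",
                            "Review diffs\n---\nBe strict.")
print(extract_skill_body(combined))         # body survives, internal --- intact
print(extract_skill_body("raw body only"))  # fallback path
```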
