Real session data evolution: fix GEPA + expand sessiondb filter#26
Open
rc-int wants to merge 2 commits into
Conversation
WHY: GEPA runs produced zero improvement because skill text was stored as
a plain attribute and passed as an InputField, not an optimizable
parameter. GEPA mutates signature instructions, not module state, so
every mutation was a no-op.
WHAT:
- SkillModule: skill text is now the signature.instructions field (the
  actual GEPA-optimizable parameter), not a separate InputField.
- evolve_skill: fixed GEPA constructor (max_metric_calls, not max_steps;
  added reflection_lm), split eval/optimizer LMs, extract evolved body
  from predictor.predict.signature.instructions.
- fitness: metric signature extended to (example, prediction, trace,
  pred_name, pred_trace); GEPA calls metrics with 5 args.
- constraints: passed full skill text (frontmatter + body) to validator.
ADD: CodexLM, a DSPy LM provider that delegates to codex exec --json
using ChatGPT Plus OAuth (no API key). Lets GEPA use GPT-5.4 as the
reflection model via the local codex CLI.
AFFECT: evolution/core/codex_lm.py, evolution/core/fitness.py,
evolution/skills/evolve_skill.py, evolution/skills/skill_module.py
Verified: GEPA on real session data (22/11/11 split) gave +11.5% holdout
gain (0.481 to 0.536) on the github-code-review skill.
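The root cause can be illustrated with a minimal, self-contained sketch. These are stand-in classes, not the real DSPy API, and `gepa_mutate` is a hypothetical stand-in for GEPA's instruction-rewriting step:

```python
# Stand-in classes (not real DSPy) showing why storing the skill text
# outside signature.instructions made every GEPA mutation a no-op.
class Signature:
    def __init__(self, instructions: str = ""):
        self.instructions = instructions

class Predict:
    def __init__(self, signature: Signature):
        self.signature = signature

def gepa_mutate(predictor: Predict, new_instructions: str) -> None:
    # GEPA proposes new signature instructions; it never touches
    # other module attributes.
    predictor.signature.instructions = new_instructions

class BrokenSkillModule:
    def __init__(self, skill_body: str):
        self.skill_body = skill_body           # plain attribute: invisible to GEPA
        self.predict = Predict(Signature(""))  # mutations land on empty instructions

class FixedSkillModule:
    def __init__(self, skill_body: str):
        # The skill body IS the optimizable parameter.
        self.predict = Predict(Signature(skill_body))

broken = BrokenSkillModule("Review PRs.")
fixed = FixedSkillModule("Review PRs.")
gepa_mutate(broken.predict, "Improved skill text")
gepa_mutate(fixed.predict, "Improved skill text")
```

After the mutation, `broken.skill_body` is still "Review PRs." (the optimizer's work is discarded), while `fixed.predict.signature.instructions` carries the evolved text, which is why extraction now reads from `signature.instructions`.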
WHY: The sessiondb relevance filter found only 17 relevant examples from
2482 messages of real session history. Heuristic pre-filter extracted
keywords from the first 500 chars of the skill (often frontmatter), used
strict 2-match-minimum exact-word overlap, and capped candidates at
max_examples * 3. Synonyms and abbreviations ("PR" for "pull request")
were missed entirely.
WHAT: RelevanceFilter now has a three-stage pipeline. Stage 1: one LLM
call generates 30-50 relevance keywords/phrases for the target skill
(synonyms, verbs, artifacts). Stage 2: substring pre-filter against the
ENTIRE corpus using expanded keywords. Stage 3: LLM scores each
pre-filtered candidate as before. Budget raised to max_examples * 8 so
the LLM rejects noisy borderline cases instead of the cap discarding them.
AFFECT: evolution/core/external_importers.py — RelevanceFilter class,
new ExpandKeywords signature, _expand_keywords method.
Verified (github-code-review skill, Cerebras zai-glm-4.7):
Before: 17 examples from 150 candidates (2590 total corpus)
After: 44 examples from 384 candidates (same 2590 corpus)
Expanded to 51 keywords: "code review", "pull request", "git diff",
"code audit", "review changes", ...
Downstream: real-data GEPA run produced +11.5% holdout improvement on
evolved skill with train/val/holdout = 22/11/11 (vs 10/5/5 synthetic).
ALSO: log pre-existing _read_copilot_workspace exception (S110 lint fix).
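The three-stage pipeline above can be sketched as follows. This is a minimal sketch with the two LLM calls stubbed out as callables; the function names are hypothetical, not the actual RelevanceFilter API:

```python
from typing import Callable

def three_stage_filter(
    skill_text: str,
    corpus: list[str],
    max_examples: int,
    expand_keywords: Callable[[str], list[str]],  # stage 1: one LLM call (stubbed)
    score_relevance: Callable[[str], bool],       # stage 3: per-candidate LLM call (stubbed)
) -> list[str]:
    # Stage 1: a single LLM call expands 30-50 relevance keywords/phrases
    # for the skill (synonyms, abbreviations, domain verbs).
    keywords = [k.lower() for k in expand_keywords(skill_text)]

    # Stage 2: cheap substring pre-filter over the ENTIRE corpus,
    # not just a truncated slice of it.
    candidates = [msg for msg in corpus
                  if any(k in msg.lower() for k in keywords)]

    # Budget of max_examples * 8 lets stage 3 reject borderline cases
    # instead of the cap silently dropping them.
    candidates = candidates[: max_examples * 8]

    # Stage 3: LLM scores each pre-filtered candidate.
    return [msg for msg in candidates if score_relevance(msg)][:max_examples]
```

With stubs such as `expand_keywords = lambda s: ["pull request", "PR", "code review"]`, a message like "Please review this PR" survives the substring pre-filter via the "PR" keyword even though it never contains the literal phrase "code review", which is exactly the class of matches the old exact-word heuristic missed.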
rc-int added a commit to rc-international/hermes-agent-self-evolution
that referenced this pull request on Apr 16, 2026:
…nding upstream)
steezkelly added a commit to steezkelly/hermes-agent-self-evolution
that referenced this pull request on Apr 25, 2026:
…sResearch#24, NousResearch#26, NousResearch#35)
- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
- _load_skill_body() splits frontmatter from body; the body becomes the instruction
- _extract_evolved_instructions() extracts from signature.instructions (not the wrapper)
- constraint_validator.py: body/frontmatter separation; validate that the body has substance
- dataset_builder.py: robust JSON parsing with 6 fallback strategies
- PR NousResearch#26: GEPA wiring fix; reflection_lm passed to GEPA
- PR NousResearch#35: constraint validator for GEPA args; max_metric_calls not mixed with auto
Note: GEPA still falls back to MIPROv2 due to the DSPy 3.2.0 API (max_metric_calls conflicts with auto='light'). Use max_metric_calls alone (fixed).
steezkelly added a commit to steezkelly/hermes-agent-self-evolution
that referenced this pull request on Apr 25, 2026:
…traint validator, JSON parsing robustness
Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via an HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter and substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
Summary
Two fixes that unblocked evolution on real session history:
GEPA wasn't mutating skill text: SkillModule stored the skill as a plain attribute and passed it as an InputField. GEPA only optimizes signature instructions, so every mutation was a no-op (prior runs showed 0% improvement). Fixed: skill text is now the signature's instructions field.
Sessiondb filter missed 96% of relevant messages: the heuristic extracted keywords from only the first 500 chars of the skill (usually frontmatter), required 2 exact-word matches, and capped candidates at max_examples * 3. For github-code-review that filter found 17 relevant examples from 2482 messages. New three-stage pipeline: one LLM call expands 30-50 relevance keywords (synonyms, abbreviations, domain verbs), then a substring pre-filter scans the entire corpus before LLM relevance scoring.
Also adds CodexLM, a DSPy LM provider that delegates to codex exec --json, using ChatGPT Plus OAuth. Lets GEPA use GPT-5.4 as the reflection model without an API key.
Expected outcomes
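A wrapper like CodexLM can be sketched as below. This is a rough sketch only: the event schema parsed from `codex exec --json` output (JSON-lines with a `text` field) is an assumption for illustration, not the documented codex CLI format, and the real class plugs into DSPy's LM interface rather than a bare `__call__`:

```python
import json
import subprocess

class CodexLM:
    """Sketch of an LM that shells out to the local codex CLI.

    ASSUMPTION: output is JSON lines where some events carry a "text"
    field; the real codex --json schema may differ.
    """

    def __init__(self, model: str = "gpt-5.4"):
        self.model = model

    def build_command(self, prompt: str) -> list[str]:
        # Delegate generation to the codex CLI; auth comes from the
        # CLI's own ChatGPT Plus OAuth session, so no API key is needed.
        return ["codex", "exec", "--json", "--model", self.model, prompt]

    def __call__(self, prompt: str) -> str:
        out = subprocess.run(
            self.build_command(prompt),
            capture_output=True, text=True, check=True,
        ).stdout
        return self.parse_output(out)

    @staticmethod
    def parse_output(jsonl: str) -> str:
        # Keep the text from the last parseable event that carries one.
        text = ""
        for line in jsonl.splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            text = event.get("text", text)
        return text
```

GEPA would then receive an instance of this class as its `reflection_lm`, keeping reflection traffic on the local CLI while eval traffic goes to a separate, cheaper model.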
Test plan
- python -m evolution.core.external_importers --skill github-code-review --source all --model cerebras/zai-glm-4.7 --max-examples 100 → 44 examples, 22/11/11 split
- python -m evolution.skills.evolve_skill --skill github-code-review --iterations 20 --eval-source golden --dataset-path datasets/skills/github-code-review --optimizer-model cerebras/zai-glm-4.7 --eval-model cerebras/zai-glm-4.7 → +11.5% holdout improvement
- python -m evolution.skills.evolve_skill --skill github-code-review --iterations 15 --eval-source synthetic ... → +36.2% holdout improvement on synthetic (sanity check)
Merge strategy
Please use merge-commit (not squash). Each commit is a logical unit.