
Real session data evolution: fix GEPA + expand sessiondb filter#26

Open
rc-int wants to merge 2 commits into NousResearch:main from rc-international:feat/real-data-evolution

Conversation


@rc-int rc-int commented Apr 16, 2026

Summary

Two fixes that unblocked evolution on real session history:

  1. GEPA wasn't mutating skill text. SkillModule stored the skill as a plain attribute and passed it as an InputField, but GEPA only optimizes signature instructions, so every mutation was a no-op (prior runs showed 0% improvement). Fixed: the skill text is now the signature's instructions field.

  2. Sessiondb filter missed 96% of relevant messages. The heuristic extracted keywords from only the first 500 chars of the skill (usually frontmatter), required 2 exact-word matches, and capped candidates at max_examples * 3; for github-code-review that filter found 17 relevant examples from 2482 messages. New three-stage pipeline: one LLM call expands the skill into 30-50 relevance keywords (synonyms, abbreviations, domain verbs), then a substring pre-filter scans the entire corpus before LLM relevance scoring.
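A minimal sketch of the three-stage pipeline, with both LLM calls stubbed out (the real RelevanceFilter issues them through its configured model, e.g. cerebras/zai-glm-4.7); function names and the toy corpus here are illustrative, not the actual code:

```python
def expand_keywords(skill_text: str) -> list[str]:
    """Stage 1 (stubbed): one LLM call that turns the skill description
    into 30-50 relevance keywords (synonyms, abbreviations, domain verbs)."""
    return ["code review", "pull request", "pr", "git diff", "code audit"]

def prefilter(corpus: list[str], keywords: list[str]) -> list[str]:
    """Stage 2: cheap case-insensitive substring scan over the ENTIRE
    corpus, not a capped prefix of it."""
    lowered = [kw.lower() for kw in keywords]
    return [msg for msg in corpus
            if any(kw in msg.lower() for kw in lowered)]

def score_candidates(candidates: list[str], budget: int) -> list[str]:
    """Stage 3 (stubbed): LLM relevance scoring of each pre-filtered
    candidate, keeping at most `budget` (max_examples * 8) of them."""
    return candidates[:budget]

corpus = [
    "Please review my PR, the git diff is attached",
    "What should we have for lunch?",
    "The code audit flagged two issues",
]
keywords = expand_keywords("github-code-review skill")
candidates = prefilter(corpus, keywords)
kept = score_candidates(candidates, budget=8)
print(kept)  # the two review-related messages survive; the lunch one does not
```

The point of the expanded keyword list is visible even in this toy: "code audit" would never match a strict exact-word overlap against the skill's frontmatter.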

Also adds CodexLM — a DSPy LM provider that delegates to codex exec --json, using ChatGPT Plus OAuth. Lets GEPA use GPT-5.4 as the reflection model without an API key.
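A rough sketch of the delegation idea, not the actual CodexLM code: shell out to the codex CLI and parse its streamed JSON back into text. The one-JSON-object-per-line event shape and the "text" field are assumptions here; the real provider adapts the output to DSPy's LM interface.

```python
import json
import subprocess

class CodexLM:
    """Sketch of an LM provider that delegates to `codex exec --json`,
    relying on the CLI's own ChatGPT Plus OAuth instead of an API key."""

    def __call__(self, prompt: str) -> str:
        proc = subprocess.run(
            ["codex", "exec", "--json", prompt],
            capture_output=True, text=True, check=True,
        )
        return self._extract_text(proc.stdout)

    @staticmethod
    def _extract_text(stdout: str) -> str:
        # Assumed event format: one JSON object per line; concatenate
        # any objects that carry a "text" field, skip everything else.
        parts = []
        for line in stdout.splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(event, dict) and "text" in event:
                parts.append(event["text"])
        return "".join(parts)
```

Keeping the parsing in a separate static method makes the JSON handling testable without invoking the CLI.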

Expected outcomes

| Metric | Before | After |
| --- | --- | --- |
| Sessiondb relevant examples (github-code-review) | 17 | 44 (2.6x) |
| Pre-filter corpus coverage | capped at 150 | 2590 (full corpus) |
| Expanded relevance keywords | 0 | 51 (e.g. "code review", "pull request", "git diff", "code audit") |
| GEPA holdout score (synthetic data) | 0.414 → 0.414 (no-op) | 0.414 → 0.564 (+36.2%) |
| GEPA holdout score (real session data) | N/A (couldn't get enough data) | 0.481 → 0.536 (+11.5%) |
| Train/val/holdout split on real data | 10/5/5 | 22/11/11 |

Test plan

  • python -m evolution.core.external_importers --skill github-code-review --source all --model cerebras/zai-glm-4.7 --max-examples 100 → 44 examples, 22/11/11 split
  • python -m evolution.skills.evolve_skill --skill github-code-review --iterations 20 --eval-source golden --dataset-path datasets/skills/github-code-review --optimizer-model cerebras/zai-glm-4.7 --eval-model cerebras/zai-glm-4.7 → +11.5% holdout improvement
  • python -m evolution.skills.evolve_skill --skill github-code-review --iterations 15 --eval-source synthetic ... → +36.2% holdout improvement on synthetic (sanity check)
  • Repeat on a second skill to confirm generality

Merge strategy

Please use merge-commit (not squash). Each commit is a logical unit.

rc-int added 2 commits April 16, 2026 17:07
WHY: GEPA runs produced zero improvement because skill text was stored as
a plain attribute and passed as an InputField, not an optimizable parameter.
GEPA mutates signature instructions, not module state, so every mutation
was a no-op.
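A toy mock (not real DSPy classes) of the bug described above: a GEPA-style mutation rewrites signature.instructions and nothing else, so skill text stored in a plain module attribute is invisible to it.

```python
class Signature:
    def __init__(self, instructions: str):
        self.instructions = instructions

class BrokenSkillModule:
    def __init__(self, skill_text: str):
        self.skill_text = skill_text            # plain attribute: the optimizer never sees it
        self.signature = Signature("Answer the question.")

class FixedSkillModule:
    def __init__(self, skill_text: str):
        self.signature = Signature(skill_text)  # skill text IS the optimizable field

def gepa_mutate(module, new_instructions: str):
    """All a GEPA-style mutation does: rewrite the signature's instructions."""
    module.signature.instructions = new_instructions

broken = BrokenSkillModule("Review the diff carefully.")
fixed = FixedSkillModule("Review the diff carefully.")
gepa_mutate(broken, "Review the diff; flag security issues first.")
gepa_mutate(fixed, "Review the diff; flag security issues first.")

print(broken.skill_text)             # unchanged: the mutation was a no-op
print(fixed.signature.instructions)  # evolved skill text
```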

WHAT:
- SkillModule: skill text is now the signature.instructions field (the
  actual GEPA-optimizable parameter), not a separate InputField.
- evolve_skill: fixed GEPA constructor (max_metric_calls not max_steps,
  added reflection_lm), split eval/optimizer LMs, extract evolved body
  from predictor.predict.signature.instructions.
- fitness: metric signature extended to (example, prediction, trace,
  pred_name, pred_trace) — GEPA calls metrics with 5 args.
- constraints: passed full skill text (frontmatter + body) to validator.
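The extended metric shape can be sketched as below; the scoring rule (token overlap with a reference answer) is a placeholder for the project's real fitness function, but the five-positional-argument signature matches how GEPA invokes metrics:

```python
def skill_fitness_metric(example, prediction, trace=None,
                         pred_name=None, pred_trace=None) -> float:
    """Placeholder fitness metric with the 5-argument signature GEPA
    expects: (example, prediction, trace, pred_name, pred_trace)."""
    expected = set(str(getattr(example, "answer", example)).lower().split())
    got = set(str(getattr(prediction, "answer", prediction)).lower().split())
    if not expected:
        return 0.0
    return len(expected & got) / len(expected)

# GEPA passes all five positionals:
score = skill_fitness_metric("flag the bug", "flag the bug now", None, "predict", None)
print(score)  # 1.0
```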

ADD: CodexLM — DSPy LM provider that delegates to codex exec --json,
using ChatGPT Plus OAuth (no API key). Lets GEPA use GPT-5.4 as the
reflection model via the local codex CLI.

AFFECT: evolution/core/codex_lm.py, evolution/core/fitness.py,
evolution/skills/evolve_skill.py, evolution/skills/skill_module.py

Verified: GEPA on real session data (22/11/11 split) gave +11.5%
holdout gain (0.481 to 0.536) on github-code-review skill.
WHY: The sessiondb relevance filter found only 17 relevant examples from
2482 messages of real session history. Heuristic pre-filter extracted
keywords from the first 500 chars of the skill (often frontmatter), used
strict 2-match-minimum exact-word overlap, and capped candidates at
max_examples * 3. Synonyms and abbreviations ("PR" for "pull request")
were missed entirely.

WHAT: RelevanceFilter now has a three-stage pipeline. Stage 1: one LLM
call generates 30-50 relevance keywords/phrases for the target skill
(synonyms, verbs, artifacts). Stage 2: substring pre-filter against the
ENTIRE corpus using expanded keywords. Stage 3: LLM scores each
pre-filtered candidate as before. Budget raised to max_examples * 8 so
the LLM can reject noisy borderline cases instead of the cap discarding
them before scoring.

AFFECT: evolution/core/external_importers.py — RelevanceFilter class,
new ExpandKeywords signature, _expand_keywords method.

Verified (github-code-review skill, Cerebras zai-glm-4.7):
  Before: 17 examples from 150 candidates (2590 total corpus)
  After:  44 examples from 384 candidates (same 2590 corpus)
  Expanded to 51 keywords: "code review", "pull request", "git diff",
  "code audit", "review changes", ...
Downstream: real-data GEPA run produced +11.5% holdout improvement on
evolved skill with train/val/holdout = 22/11/11 (vs 10/5/5 synthetic).

ALSO: log pre-existing _read_copilot_workspace exception (S110 lint fix).
rc-int added a commit to rc-international/hermes-agent-self-evolution that referenced this pull request Apr 16, 2026
steezkelly added a commit to steezkelly/hermes-agent-self-evolution that referenced this pull request Apr 25, 2026
…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body, body becomes instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
  - constraint_validator.py: body/frontmatter separation — validate body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies

- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA

- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA previously fell back to MIPROv2 under the DSPy 3.2.0 API, because
max_metric_calls conflicts with auto='light'. Fixed by passing max_metric_calls alone.
steezkelly added a commit to steezkelly/hermes-agent-self-evolution that referenced this pull request Apr 25, 2026
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
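The sentinel round-trip described in this patch can be sketched as follows; function names are illustrative, but the sentinel string and the collision rationale (a "\n\n---\n\n" separator also appears inside skill bodies) come from the commit above:

```python
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def embed_skill_body(preamble: str, body: str) -> str:
    """Join the optimizer-facing preamble and the skill body with an
    HTML-comment sentinel, which, unlike a '---' rule, does not occur
    naturally in skill bodies."""
    return f"{preamble}\n{SENTINEL}\n{body}"

def extract_skill_body(instructions: str) -> str:
    """Take everything after the sentinel; fall back to the whole
    instructions string if the optimizer dropped the sentinel."""
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[1].strip()
    return instructions.strip()

combined = embed_skill_body("You are optimizing a skill.",
                            "Review diffs\n---\nBe strict.")
print(extract_skill_body(combined))         # body survives, internal --- intact
print(extract_skill_body("raw body only"))  # fallback path
```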
