fix(gepa): wire ContentSemanticScorer into v2 dispatch to catch purpose drift #69
Open
steezkelly wants to merge 50 commits into
Conversation
…sResearch#24, NousResearch#26, NousResearch#35)
- PR NousResearch#24:
  - skill_module.py stores the skill body as an InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body; the body becomes the instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not the wrapper)
  - constraint_validator.py: body/frontmatter separation — validate that the body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies
- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA
- PR NousResearch#35: constraint validator for GEPA args; max_metric_calls not mixed with auto

Note: GEPA still falls back to MIPROv2 due to the DSPy 3.2.0 API — max_metric_calls conflicts with auto='light'. Use max_metric_calls alone (fixed).
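A minimal sketch of the frontmatter/body split described above, assuming a SKILL.md-style file with a leading `---` frontmatter block (the helper name is illustrative; the real logic is _load_skill_body() in skill_module.py):

```python
def split_frontmatter(skill_text: str) -> tuple[str, str]:
    """Return (frontmatter, body); the body is what feeds signature.instructions."""
    if skill_text.startswith("---"):
        parts = skill_text.split("---", 2)
        if len(parts) == 3:
            return parts[1].strip(), parts[2].strip()
    # No frontmatter block found: treat the whole document as the body.
    return "", skill_text.strip()
```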
…traint validator, JSON parsing robustness
Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed the skill body in signature instructions via an HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter and substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (which appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
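A rough sketch of the sentinel-based extraction with fallback (the helper name is hypothetical; the sentinel value matches the commit above and the real extraction lives in evolve_skill.py):

```python
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def extract_evolved_body(instructions: str, original_body: str) -> str:
    # Everything after the sentinel is treated as the evolved skill body; if the
    # optimizer dropped the sentinel, fall back to the original body unchanged.
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[1].strip()
    return original_body
```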
…<20KB) +50%, large +20%. Fixes the companion-interview-workflow rejection (its +28.5% growth was genuine operational detail, not bloat). Also caps the RelevanceFilter pre-filter at 20 candidates to prevent 30+ minute timeouts.
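A sketch of the dynamic growth allowance tiers implied by this commit and the test notes later in the thread (<5KB: +100%, <20KB: +50%, larger: +20%); the function name and exact boundaries are assumptions:

```python
def growth_allowance(baseline_bytes: int) -> float:
    """Maximum allowed relative growth for an evolved skill (assumed tier boundaries)."""
    if baseline_bytes < 5 * 1024:
        return 1.00   # very small skills may double
    if baseline_bytes < 20 * 1024:
        return 0.50   # medium skills (<20KB) get +50%
    return 0.20       # large (>20KB) skills get +20%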
Completes the v2.1 build phase:
1. GEPA/MIPROv2 logger (Cassian's #1 production risk)
   - Logs the optimizer type (GEPA vs MIPROv2) after compile in evolve_skill.py
   - Added optimizer_type field to the stats CSV schema
2. Router (evolution/core/router.py)
   - 3-action classification: fix / extend / abstain
   - Heuristic-based (no LLM calls): failure pattern detection by reason keyword, structural change detection via conditional counts, confidence scaling
   - All thresholds labeled as unvalidated novel design per Aris Thorne review
3. Backtrack Controller (evolution/core/backtrack.py; plateau detection sketched after this list)
   - 3-iteration sliding window plateau detection
   - Float-epsilon threshold comparison (fixes an IEEE 754 precision edge case)
   - Walk-back: finds the last adjacent improvement > 1%, returns the checkpoint before it
   - Force-archive after N consecutive backtracks
   - Resets the backtrack count after any improvement
4. Robustness Checkers (evolution/core/constraints_v2.py)
   - ConfigDriftChecker: frontmatter name/description stability
   - SkillRegressionChecker: holdout score retains 90%+ of baseline
   - ScopeCreepChecker: length-normalized term frequency drift detection
   - Small baselines (<3 meaningful words) are gracefully skipped
5. Pareto Selector (evolution/core/pareto_selector.py)
   - Multi-objective: holdout score (primary) + skill size delta (secondary)
   - min_improvement_delta=0.03 noise floor (evaluation noise guard)
   - growth_threshold cap prevents 400%+ bloat with small gains
   - Robustness gate: a failed check means the baseline is retained regardless
6. Shared Types (evolution/core/types.py)
   - 5 dataclasses: EvolutionSnapshot, RouterDecision, BacktrackDecision, ComputeBudget, EvolutionReport
7. Tests: 30 new tests, all passing
   - Router: 6 tests (empty extend, edge case fix, low budget, structural, confidence scaling, all-pass)
   - Backtrack: 6 tests (insufficient data, plateau, improving, force archive, reset, walk-back)
   - Pareto: 5 tests (better, noise floor, robustness fail, growth penalty, zero growth)
   - Constraints: 9 tests (5 config drift, 4 scope creep)
   - Integration: 4 tests (dry run, no-gepa skeleton, backtrack integration, Pareto integration)
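A minimal sketch of the Backtrack Controller's plateau detection (item 3), assuming illustrative names, a 1% adjacent-improvement threshold, and a float-epsilon guard:

```python
EPS = 1e-9  # float-epsilon guard for IEEE 754 comparisons

def is_plateaued(scores: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """Plateau = no adjacent improvement above min_gain inside the sliding window."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain <= min_gain + EPS for gain in gains)
```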
Adds the top-level integration layer that connects v2.1 modules
to the live evolution pipeline:
1. gepa_v2_dispatch.py (437 lines)
- Wraps v1's GEPA loop with v2.1 decision gates
- Top-level backtrack: re-runs GEPA if ParetoSelector rejects,
up to N attempts (3 or iterations//5)
- Runs ConfigDriftChecker, SkillRegressionChecker, ScopeCreepChecker
on each evolved candidate
- Captures per-scenario holdout results for Router classification
- Returns EvolutionReport with deploy/review/reject recommendation
- Saves output to output/<skill>/v2_<timestamp>/ with report.json
2. --v2 CLI flag
- python -m evolution.skills.evolve_skill --skill X --v2
- Dispatches through v2_dispatch() instead of v1 evolve()
- v1 path unchanged when --v2 is absent
3. EvolutionReport simplified
- Replaced 10 fields (baseline_score, evolved_score, budget, etc.)
with 8 focused fields: skill_name, n_iterations_executed,
improvement, recommendation, details, router_decision,
backtrack_decision, elapsed_seconds
- All dependent modules (evolve_skill_v2.py, types.py, tests)
updated to match
4. Backtrack checkpoint_for_score convenience method
- Records EvolutionSnapshot from raw score/body/iteration values
5. Tests: 162 passing (3 pre-existing failures)
- 3 new dispatch tests: dry_run, no_skill, report_type
- 30 v2.1 unit tests still passing
- Integration tests updated for new EvolutionReport shape
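A hedged sketch of the top-level backtrack loop from item 1 above (names are illustrative, and the attempt cap is assumed to be max(3, iterations // 5); the real logic lives in gepa_v2_dispatch.py):

```python
def v2_dispatch_sketch(baseline, iterations, run_gepa, checkers, selector):
    """Re-run GEPA while the candidate is rejected, up to a capped number of attempts."""
    max_attempts = max(3, iterations // 5)    # assumption: "3 or iterations//5"
    best = baseline
    for _ in range(max_attempts):
        candidate = run_gepa(best)
        if not all(check(candidate) for check in checkers):
            continue                          # robustness gate failed; backtrack and retry
        if selector.accepts(candidate, best):
            return candidate                  # ParetoSelector accepted the evolved skill
    return best                               # every attempt rejected; keep the baseline
```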
1. EvolutionRouter fixes (threshold validation):
- Fixed priority order: structural checked BEFORE coverage (not after)
- Fixed structural check: uses evolution_history[-1] (needs >=1, not >=2)
- coverage check now actually uses coverage_cluster_ratio (was defined but
never implemented — classified ANY multi-reason failure set as coverage)
- Added coverage_min_failures=3 guard (1 failure isn't a coverage pattern)
- Added _dominant_category helper for logging which category dominates
2. PostHocAnalyzer (evolution/core/posthoc_analyzer.py)
- Fits power-law curve: score = a * iteration^c + b via scipy.optimize.curve_fit
- Scipy fallback: log-log linear regression with R² estimate
- Phase classification: early_discovery (c>0.2), diminishing_returns (0.05<c<0.2),
plateau (c<0.05)
- Crossover detection: finds iteration where marginal gain < min_improvement_delta
- Predicted score at 2x iterations
- Pure analytical — no API calls
- scipy added to project dependencies (needed for curve_fit)
3. Router Benchmark (tests/core/router_benchmark.py)
- 11 synthetic test cases: edge_case, coverage, structural, noise,
all_pass, low_budget, edge_case ratio sweep, empty, no_history,
zero_hard, mixed_priority
- All 11 passing
- Run standalone: python tests/core/router_benchmark.py
4. Tests: 175 passing (3 pre-existing failures)
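A rough sketch of the PostHocAnalyzer power-law fit from item 2 above (function name and return shape are illustrative; the module's actual API may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_power_law(scores: list[float]) -> tuple[float, str]:
    """Fit score = a * iteration^c + b and classify the evolution phase from c."""
    iterations = np.arange(1, len(scores) + 1, dtype=float)

    def power_law(x, a, c, b):
        return a * np.power(x, c) + b

    (a, c, b), _ = curve_fit(power_law, iterations, np.asarray(scores), maxfev=10_000)
    if c > 0.2:
        phase = "early_discovery"
    elif c > 0.05:
        phase = "diminishing_returns"
    else:
        phase = "plateau"
    return c, phase
```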
…n test
Bugs surfaced by running the full gate pipeline against real skill data:
1. SkillRegressionChecker.check() interface mismatch
- Filesystem-based check() takes (skill_name, threshold) — not inline scores
- Added check_score(evolved_score, baseline_score) for the direct score
comparison that v2_dispatch and tests use
- Fixed v2_dispatch.py call → check_score()
2. SelectionResult missing 'reason' field
- ParetoSelector selected evolved vs baseline but didn't explain WHY
- Added reason: str field with human-readable explanations at every branch
- All selection paths now log: robustness failure, noise floor, weighted win,
growth penalty, and improvement
- Growth penalty now appears in reason string for size-override decisions
3. ParetoSelector reason edge case: growth info missing
- When size penalty was the deciding factor (400% growth → penalty=1.0),
the reason only said 'baseline wins on weighted score' without mentioning
growth or the penalty value
- Fixed: all weighted-score reasons now include growth ratio and penalty
4. Test fixes:
- ConfigDrift: the test used different descriptions (which correctly triggers drift);
changed it to use different tags (which correctly does not)
- Regression: asserted r2[0] == 'pass' but check_score returns (bool, str)
- Pipeline tests: 4/4 passing against real companion-workflows skill data
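An illustrative shape of the check_score() helper from item 1 above, assuming the 90% retention threshold used by SkillRegressionChecker (the real signature in constraints_v2.py may differ):

```python
def check_score(evolved_score: float, baseline_score: float,
                retention: float = 0.90) -> tuple[bool, str]:
    """Pass if the evolved skill keeps at least `retention` of the baseline holdout score."""
    if baseline_score <= 0:
        return True, "baseline score is zero or negative; regression check skipped"
    ratio = evolved_score / baseline_score
    if ratio + 1e-9 >= retention:
        return True, f"retained {ratio:.0%} of baseline (threshold {retention:.0%})"
    return False, f"regression: retained only {ratio:.0%} of baseline"
```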
- PostHocAnalyzer runs after the GEPA loop completes, before Router classification, using the per-attempt score trajectory
- Shows power-law phase classification and the recommended action in the console
- Appends posthoc analysis to the report.json output
- Adds Phase and Power-Law c rows to the summary table
- Imports the PostHocReport type and PostHocAnalyzer class in dispatch
- All 17 posthoc + pipeline integration tests passing
1. test_skill_over_limit: a 20KB input didn't exceed the 50KB max_skill_size → increased to 60KB to actually trigger the limit
2. test_excessive_growth: 30% growth on a 1KB baseline was within the 100% dynamic allowance for small skills (<5KB) → changed to 30% growth on a 25KB baseline (max 20% for >20KB skills)
3. test_valid_skill: the minimal body lacked 2 of 3 structural checks → added a substantive body with steps, headings, and >100 chars
4. PostHoc integration: fixed a spurious PostHocReport import from types.py (it doesn't exist there; PostHocAnalyzer resolves its own dependencies)

Full suite: 189/189 passing — all tests clean
Bug: score_trajectory extracted avg_baseline (always the same value) instead of tracking the best score over time. This meant PostHoc never had enough variance for power-law fitting and always returned None.
Fix: start with the initial baseline score, then for each attempt take max(avg_baseline, avg_evolved) compared against the running best. This produces a non-decreasing trajectory that the power-law fitter can actually analyze.
189/189 tests passing.
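A minimal sketch of the corrected trajectory construction (the attempt record keys are illustrative):

```python
def score_trajectory(initial_baseline: float, attempts: list[dict]) -> list[float]:
    """Running-best score per attempt; non-decreasing so the power-law fitter has variance to use."""
    best = initial_baseline
    trajectory = [best]
    for attempt in attempts:
        candidate = max(attempt["avg_baseline"], attempt["avg_evolved"])
        best = max(best, candidate)
        trajectory.append(best)
    return trajectory
```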
Part A — captured-skill plugin (Hermes Agent gateway):
plugin.yaml — registers on_session_end hook
__init__.py — loads session data, builds candidates, slash commands
capture.py — core logic: is_capturable heuristics, tool sequence
extraction, domain tagging, skill body generation, overlap
detection (Jaccard word similarity, no embeddings needed)
Hooks: on_session_end — runs after every completed session with
3+ tool calls, extracts task description, tool sequence, domain
tags, and success pattern, saves to ~/.hermes/captured/<name>.json
Slash command: /captured — list, show, inspect, validate, stats
Part B — ingest-captured CLI (self-evolution repo):
python -m evolution.tools.ingest_captured
list [status] List captured candidates
validate <file> Validate candidate structure and overlap
deploy <file> Deploy a validated candidate as ~/.hermes/skills/<name>/SKILL.md
auto Bulk validate + deploy all pending candidates
evolve <file> Run v2 evolution pipeline then deploy if improved
stats Capture statistics
Validation: body length > 50 chars, frontmatter or heading structure,
overlap with existing skills (blocked at J > 0.5)
196/196 tests passing (7 new + 189 existing)
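A sketch of the Jaccard word-similarity overlap check used by both the capture plugin and the validate command (function names are illustrative; candidates are blocked at J > 0.5):

```python
def jaccard_overlap(candidate_body: str, existing_body: str) -> float:
    """Word-set Jaccard similarity; no embeddings needed."""
    a = set(candidate_body.lower().split())
    b = set(existing_body.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def overlap_blocked(candidate_body: str, existing_bodies: list[str], threshold: float = 0.5) -> bool:
    return any(jaccard_overlap(candidate_body, body) > threshold for body in existing_bodies)
```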
…n (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog
New modules:
- evolution/tools/tool_module.py: ToolDescriptionStore (loads from the Hermes registry) + ToolDescriptionModule (DSPy Predict wrapper)
- evolution/tools/tool_dataset_builder.py: (task, tool_name) eval dataset from synthetic templates + SessionDB mining
- evolution/tools/tool_description_v2.py: GEPA v2 pipeline with BacktrackController, PostHocAnalyzer, ParetoSelector, EvolutionRouter
- evolution/tools/evolve_tool_description.py: CLI entry point (--tool, --iterations, --eval-source, --dry-run)

Architecture:
- Tool selection as classification: given a task description → predict the correct tool
- GEPA optimizes tool descriptions (≤500 chars) to maximize classification accuracy
- The v2 pipeline wraps v1 GEPA with decision gates (reject/accept/review)
- Constraint validator enforces the 500-char description limit
- Output: output/tool_descriptions/<tool>_<timestamp>/report.json

CLI: python -m evolution.tools.evolve_tool_description --tool search_files --iterations 10
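A small sketch of the "tool selection as classification" scoring idea from the architecture notes above (the metric shape is an assumption; the real evaluation lives in tool_dataset_builder.py / tool_description_v2.py):

```python
def tool_selection_accuracy(examples: list[dict], predict_tool) -> float:
    """Fraction of (task, tool_name) examples where the predicted tool matches the label."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples
        if predict_tool(ex["task"]).strip().lower() == ex["tool_name"].lower()
    )
    return correct / len(examples)
```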
…itness, parallel scoring
- MultiComponentSkillModule: section-level mutation via split_into_sections/reconstruct
- PurposePreservationChecker (4th hard gate): blocks type-changing mutations via keyword survival + TF-IDF cosine similarity + consultant-prompt structure detection
- ContentSemanticScorer: sklearn TfidfVectorizer (unigrams+bigrams, sublinear TF)
- ParallelRelevanceFilter: ThreadPoolExecutor for LLM relevance calls (13s→30s for 20)
- Ollama Cloud thinking wrapper fix: promotes reasoning_content→content when content is empty
- evolve_skill.py: multi_component_extract() replaces _extract_evolved_skill_body()
- gepa_v2_dispatch: uses MultiComponentSkillModule, fixed the total_improvement calculation
- Test coverage: test_constraints_v2.py (7 test cases)
- CLI entry points: run_batch_evolution.py, run_deep_evolution.py, run_v2_validate*.py
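A sketch of the ContentSemanticScorer idea (unigrams+bigrams, sublinear TF, cosine similarity); class and method names here are illustrative rather than the module's actual API:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ContentScorerSketch:
    """Fit on a baseline body, then score evolved bodies by TF-IDF cosine similarity."""

    def __init__(self) -> None:
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
        self.baseline_vec = None

    def fit(self, baseline_body: str) -> None:
        self.baseline_vec = self.vectorizer.fit_transform([baseline_body])

    def score(self, evolved_body: str) -> float:
        evolved_vec = self.vectorizer.transform([evolved_body])
        return float(cosine_similarity(self.baseline_vec, evolved_vec)[0, 0])
```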
…eration
- Fix syntax error in seed_to_skill.py: Python versions before 3.12 don't allow backslash escapes inside f-string expressions. Extracted coherence_issues_escaped and timestamp to variables before the multi-line return statement.
- Add run_batch_seed_generation.py: generates skills from seeds for Phase 3 of the skill-generation-from-seed plan. Seeds: personal-osint-audit, exploratory-data-analysis, research-planning
- Kanban: move 3 regression skills to STALE: companion-personas (plateau, best=+0.1988, latest=-0.1247), companion-system-orchestration (plateau, best=+0.0418), github-code-review (noise-level changes, best=+0.0000). Root cause: evolving existing 600-line skills hits plateaus; these should use seed-based generation instead.
- Batch seed generation running in background (proc_ce421498c3d0)
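An illustrative version of the f-string fix: values needing backslash escapes are computed outside the f-string expression. Only the names coherence_issues_escaped and timestamp come from the commit; the surrounding data is made up for the example:

```python
from datetime import datetime, timezone

coherence_issues = ["missing overview section", "duplicate heading"]  # example data

# Pre-compute anything that needs "\n" so the escape never appears inside an
# f-string expression (e.g. f"{'\n'.join(...)}"), which older Pythons reject.
coherence_issues_escaped = "\n".join(f"- {issue}" for issue in coherence_issues)
timestamp = datetime.now(timezone.utc).isoformat()

report = f"Coherence issues:\n{coherence_issues_escaped}\nGenerated at {timestamp}"
```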
- seed_to_skill.py: add --timestamp CLI arg to prevent cross-run arXiv pollution
- run_batch_seed_fast.py: switch the eval model from the broken minimax/minimax-m2.7 → deepseek/deepseek-v4-flash
- gepa_kanban: add 3 new skills (hermes-agent-author, design-a-multi-agent-companion-coordinat, github-pr-review) in VALIDATING; mark the old skills STALE

New skills installed to ~/.hermes/skills/:
- companion-system/hermes-agent-author/ (replaces companion-personas)
- companion-system/design-a-multi-agent-companion-coordinat/ (replaces companion-system-orchestration)
- github/github-pr-review/ (replaces github-code-review)

All 3 skills: 0 arXiv refs, all 5 sections exit=0, coherence PASS. 228 tests still passing.
Results:
- hermes-agent-author: 0.500 (INCOMPARABLE - generator vs the old persona catalog)
- design-a-multi-agent-companion-coordinat: 0.621 vs old 0.731 (-0.11 regression)
- github-pr-review: 0.578 vs old 0.650 (-0.07 regression)

Key insight: hermes-agent-author is a fundamentally different task from the old companion-personas (it generates personas rather than providing a fixed catalog). The other two are modest regressions: the seed skills are more focused/narrow than the original 11-section skills they replaced.

Scripts added: score_new_skills.py, score_compare.py, score_baseline.py
Cleanup: removed bad arXiv-polluted seed dirs from the previous failed run.
GEPA ran 5 iterations on:
- design-a-multi-agent-companion-coordinat: 0.5 -> 0.5, all mutations rejected
- github-pr-review: 0.5 -> 0.5, all mutations rejected
- hermes-agent-author: failed to load the skill

Root cause: seed skills are ~50% smaller (5 sections) than the old archived skills (9-15 sections). The seed generates a GEPA-friendly skeleton but lacks the depth and complexity that made the originals effective. GEPA can't hallucinate the missing sections in 5 iterations.

Key finding: seed-to-skill creates useful starting skeletons, not direct replacements for highly refined multi-section skills. The pipeline is working correctly — the gap is in seed density.
NEW SKILLS (Phase 3 - no regression baseline):
- research-synthesis: 0.560 — web/arxiv/wiki research report synthesis
- linear-issue-creator: 0.341 — natural language to Linear issue creation
- codebase-metrics: 0.527 — codebase metrics via pygount

PHASE 2 REPLACEMENTS (with old baselines):
- hermes-agent-author: 0.528 vs 0.627 old (-0.099, INCOMPARABLE - different task)
- design-a-multi-agent-companion-coordinat: 0.622 vs 0.731 old (-0.109)
- github-pr-review: 0.585 vs 0.650 old (-0.065)

GEPA PHASE 4 EVOLUTION:
- All 3 seed skills: 0 improvement (5 iterations, all mutations rejected)
- Root cause: seed skills are 5-section skeletons vs old 9-15 section skills
- The pipeline is working correctly; seed density is the bottleneck

Phase 3 scripts: run_batch_seed_phase3.py, updated score_new_skills.py
Datasets: research-synthesis, linear-issue-creator, codebase-metrics
…0% improvement
- linear-issue-creator: GEPA 5 iter, valset 72%, holdout 50%, no improvement
- codebase-metrics: GEPA 5 iter, valset 55.8%, holdout 50%, no improvement
- Both seeds locally optimal on the synthetic eval — seed density ceiling confirmed
- research-synthesis killed (22% valset, poor synthetic fit)
- Updated card-registry.json + baseline_scores_20260501.json
…er, same 0.5, same 10544 chars)
…d hermes CLI commands)
Added: overview, when-to-use, troubleshooting, variants, related-skills
Evidence: Phase 3 seeds averaged 5 sections (~9KB) vs comparable mature skills averaging 10-15 sections (~10-20KB). The seed density ceiling is structural — GEPA can't discover missing sections. Mature skill section counts: systematic-debugging=15, claude-code=23, hermes-quickref=19, obliterator=19.
Seed density fix: double the template sections.
- EvalExample: add tool_sequence, complexity_score, session_id, success_pattern
- EvalDataset: add merge() dedup and save_atomic() write-temp-then-rename
- CapturedExampleEnricher: rule-based rubric extraction from the first section
- assign_split(): deterministic MD5 hash -> exactly one split (train/val/holdout); see the sketch after this list
- enrich_and_merge(): single-split append with dedup, replaces save_as_sessiondb_example
- Capture plugin: on_session_end hook + /captured slash command (plugin.yaml + __init__.py)
- Verification script: docs/phase5_verification.py
- Tests: 34 passing across dataset_builder, ingest_captured, capture_plugin

Design decisions:
- D1 (3+ tools) enforced in both the plugin _is_capturable and _save_candidate
- D2 (rule-based rubric) — no LLM call; first post-frontmatter section + fallback
- D3 (silent failure) — errors logged to ~/.hermes/capture_errors/<date>.jsonl

Fixes 6 gaps from the Phase 5 gap audit:
- Gap B (data leakage): single split assignment
- Gap C (rubric mismatch): structured expected_behavior with Task/Expected tools/Procedure
- Gap F (field loss): metadata preserved to the dataset
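A sketch of the deterministic split assignment (the split ratios and key choice are assumptions; the commit only specifies an MD5 hash mapping each example to exactly one of train/val/holdout):

```python
import hashlib

def assign_split(example_key: str, ratios=(0.70, 0.15, 0.15)) -> str:
    """Deterministically map an example to exactly one split via its MD5 hash."""
    bucket = int(hashlib.md5(example_key.encode("utf-8")).hexdigest(), 16) % 100
    train_cut = int(ratios[0] * 100)
    val_cut = train_cut + int(ratios[1] * 100)
    if bucket < train_cut:
        return "train"
    if bucket < val_cut:
        return "val"
    return "holdout"
```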
Local verification before merge: targeted stack tests, 41 passed; full suite, 291 passed; GitHub checks absent. The rebased conflict resolution preserved the active lab main improvements plus the #8 ingestion/gate fixes.
Adds the Foundry-side action-router trace-to-eval fixture and local artifact contract for the thin-bootstrap pipeline.
- session_import_demo.py: the baseline raw-extractor fails, the candidate structured-extractor passes (task_input, expected_behavior, difficulty, category)
- deterministic CLI: --mode fixture --no-network --no-external-writes
- emits: run_report.json, eval_examples.json, promotion_dossier.md, artifact_manifest.json
- 6 tests in test_session_import_demo.py, all green
- failure_class: raw_session_trace_without_structured_eval_example
* feat: add deterministic tool-underuse fixture demo
  - tool_underuse_demo.py: the baseline describes actions verbally (zero tool calls), the candidate makes actual terminal calls with output
  - deterministic CLI: --mode fixture --no-network --no-external-writes
  - emits: run_report.json, tool_usage_snapshot.json, promotion_dossier.md, artifact_manifest.json
  - 8 tests in test_tool_underuse_demo.py, all green
  - failure_class: agent_describes_instead_of_calls_tools
* feat: add deterministic skill-drift fixture demo
  - skill_drift_demo.py: the baseline never checks for staleness, the candidate diffs the skill body against a last-reviewed reference
  - emits: run_report.json, skill_diff.txt, promotion_dossier.md, artifact_manifest.json
  - 6 tests in test_skill_drift_demo.py, all green
  - failure_class: stale_skill_body_without_drift_detection
…se drift
Root cause: PurposePreservationChecker compared evolved content against the run baseline, not the canonical source. After the first purpose drift (mnemosyne tools → Background Process Analyzer), all v2 runs anchored to the drifted baseline and the checker passed drift-to-drift trivially.
Fix:
- Import and fit ContentSemanticScorer on the canonical baseline_body
- Inject the scorer into PurposePreservationChecker before the backtrack loop
- Add a regression test using the canonical mnemosyne baseline vs the drifted output

Card-registry also corrected: companion-roundtable→STALE (plateaued, never deployed), github-code-review→ARCHIVED (deployed, stable), mnemosyne→REGRESSION with full cascade diagnostic.
22/22 constraints tests + 3/3 dispatch tests pass.
Root cause
PurposePreservationChecker in gepa_v2_dispatch.py compared evolved content against the run's own baseline — not against the canonical source. After a single purpose drift (e.g. mnemosyne tools → Background Process Analyzer), all subsequent v2 runs anchored to the drifted content, and the checker compared drift-to-drift, always passing.
Fix
Fit the ContentSemanticScorer on the canonical baseline_body and inject it into the checker via purpose_check.set_content_scorer(content_scorer) before the backtrack loop runs.
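A minimal sketch of that wiring, assuming the constraints_v2 import path, a no-argument constructor, and an illustrative canonical path (the actual call sites are in gepa_v2_dispatch.py):

```python
from pathlib import Path
from evolution.core.constraints_v2 import ContentSemanticScorer, PurposePreservationChecker  # assumed path

baseline_body = Path("skills/mnemosyne-tools/SKILL.md").read_text()  # canonical source, not the run baseline (path illustrative)

content_scorer = ContentSemanticScorer()
content_scorer.fit(baseline_body)                 # fit on the canonical baseline body
purpose_check = PurposePreservationChecker()
purpose_check.set_content_scorer(content_scorer)  # injected before the backtrack loop
```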
Regression test
Added test_mnemosyne_self_evolution_tools_purpose_drift_is_caught — loads the canonical mnemosyne tools baseline and tests that the drifted 'Background Process Completion Analyzer' content fails the checker. (Previously it passed with recommendation 'deploy'.)
Verification