fix(gepa): wire ContentSemanticScorer into v2 dispatch to catch purpose drift#69

Open
steezkelly wants to merge 50 commits into NousResearch:main from steezkelly:feat/skill-drift-fixture
Conversation

@steezkelly

Root cause

PurposePreservationChecker in gepa_v2_dispatch.py compared evolved content against the run's own baseline — not against the canonical source. After a single purpose drift (e.g. mnemosyne tools → Background Process Analyzer), all subsequent v2 runs anchored to the drifted content, and the checker compared drift-to-drift, always passing.

Fix

  • Import and fit ContentSemanticScorer on the canonical baseline_body
  • Inject scorer into PurposePreservationChecker before the backtrack loop via purpose_check.set_content_scorer(content_scorer)
  • TF-IDF cosine similarity now catches text-distribution divergence even when keyword survival passes
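A minimal sketch of the ContentSemanticScorer idea described above: fit TF-IDF on the canonical baseline only, then gate candidates on cosine similarity. The class/method names, the 0.35 threshold, and the `passes` helper are illustrative, not the repo's actual API; the unigram+bigram, sublinear-TF configuration follows the description later in this PR.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ContentSemanticScorer:
    def __init__(self, threshold: float = 0.35):
        self.threshold = threshold
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
        self.baseline_vec = None

    def fit(self, baseline_body: str) -> None:
        # Fit on the canonical baseline, so every run compares against the
        # same anchor instead of a possibly drifted per-run baseline.
        self.baseline_vec = self.vectorizer.fit_transform([baseline_body])

    def score(self, candidate_body: str) -> float:
        cand_vec = self.vectorizer.transform([candidate_body])
        return float(cosine_similarity(self.baseline_vec, cand_vec)[0, 0])

    def passes(self, candidate_body: str) -> bool:
        return self.score(candidate_body) >= self.threshold
```

The key point is that `fit` is called once on the canonical `baseline_body`, not on the output of the previous run, which is what made drift-to-drift comparisons pass trivially.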

Regression test

Added test_mnemosyne_self_evolution_tools_purpose_drift_is_caught — loads canonical mnemosyne tools baseline, tests that the drifted 'Background Process Completion Analyzer' content fails the checker. (Previously it passed with recommendation 'deploy'.)

Verification

  • 22/22 constraints tests pass
  • 3/3 dispatch tests pass
  • 263/265 core tests pass (2 pre-existing failures in test_issue54_promotion — python PATH in CliRunner, unrelated)

…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body, body becomes instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
  - constraint_validator.py: body/frontmatter separation — validate body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies

- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA

- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA still falls back to MIPROv2 due to DSPy 3.2.0 API — max_metric_calls
conflicts with auto='light'. Use max_metric_calls alone (fixed).
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
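The sentinel scheme can be sketched as follows: the skill body is embedded behind an HTML-comment sentinel that cannot collide with Markdown horizontal rules (`---`) inside skill bodies, and extraction falls back gracefully when the sentinel is absent. Function names here are hypothetical; only the sentinel string comes from the change above.

```python
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def embed_skill_body(preamble: str, body: str) -> str:
    # Place the body after a sentinel that never appears in skill content.
    return f"{preamble}\n{SENTINEL}\n{body}"

def extract_skill_body(instructions: str) -> str:
    # Primary path: split on the sentinel. Fallback: return the whole text
    # so a missing sentinel degrades gracefully instead of crashing.
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[1].strip()
    return instructions.strip()
```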
…<20KB) +50%, large +20%. Fixes companion-interview-workflow rejection (+28.5% bloat was genuine operational detail, not bloat). Also cap pre-filter to 20 candidates in RelevanceFilter to prevent 30+ minute timeouts.
Completes v2.1 build phase:

1. GEPA/MIPROv2 logger (Cassian's #1 production risk)
   - Logs optimizer type (GEPA vs MIPROv2) after compile in evolve_skill.py
   - Added optimizer_type field to stats CSV schema

2. Router (evolution/core/router.py)
   - 3-action classification: fix / extend / abstain
   - Heuristic-based (no LLM calls): failure pattern detection by reason keyword,
     structural change detection via conditional counts, confidence scaling
   - All thresholds labeled as unvalidated novel design per Aris Thorne review

3. Backtrack Controller (evolution/core/backtrack.py)
   - 3-iteration sliding window plateau detection
   - Float-epsilon threshold comparison (fixes IEEE 754 precision edge case)
   - Walk-back: finds last adjacent improvement > 1%, returns checkpoint before it
   - Force-archive after N consecutive backtracks
   - Resets backtrack count after any improvement
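The plateau check above can be sketched as a 3-iteration sliding window with a float-epsilon comparison; the window size and epsilon below are illustrative defaults, not the repo's tuned values.

```python
EPS = 1e-9  # guards the IEEE 754 precision edge case in the comparison

def is_plateau(scores: list[float], window: int = 3, min_delta: float = 0.01) -> bool:
    if len(scores) < window:
        return False  # not enough data to call a plateau
    recent = scores[-window:]
    # Plateau: no adjacent improvement in the window exceeds min_delta,
    # using an epsilon so floating-point rounding can't flip the result.
    return all((b - a) <= min_delta + EPS for a, b in zip(recent, recent[1:]))
```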

4. Robustness Checkers (evolution/core/constraints_v2.py)
   - ConfigDriftChecker: frontmatter name/description stability
   - SkillRegressionChecker: holdout score retains 90%+ of baseline
   - ScopeCreepChecker: length-normalized term frequency drift detection
   - Small-baseline (<3 meaningful words) gracefully skipped

5. Pareto Selector (evolution/core/pareto_selector.py)
   - Multi-objective: holdout score (primary) + skill size delta (secondary)
   - min_improvement_delta=0.03 noise floor (evaluation noise guard)
   - growth_threshold cap prevents 400%+ bloat with small gains
   - Robustness gate: failed check = baseline retained regardless
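A hedged sketch of the selection rule: holdout score primary, size growth a secondary gate, a noise floor against evaluation noise, and any failed robustness check retaining the baseline. The function name and exact penalty shape are illustrative; the 0.03 noise floor comes from the bullet above.

```python
def select(evolved_score: float, baseline_score: float,
           evolved_size: int, baseline_size: int,
           robustness_ok: bool,
           min_improvement_delta: float = 0.03,
           growth_threshold: float = 0.5) -> str:
    if not robustness_ok:
        return "baseline"  # robustness gate: failed check always retains baseline
    improvement = evolved_score - baseline_score
    if improvement < min_improvement_delta:
        return "baseline"  # below the noise floor
    growth = (evolved_size - baseline_size) / max(baseline_size, 1)
    if growth > growth_threshold:
        return "baseline"  # growth cap: large bloat outweighs a small gain
    return "evolved"
```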

6. Shared Types (evolution/core/types.py)
   - 5 dataclasses: EvolutionSnapshot, RouterDecision, BacktrackDecision,
     ComputeBudget, EvolutionReport

7. Tests: 30 new tests, all passing
   - Router: 6 tests (empty extend, edge case fix, low budget, structural,
     confidence scaling, all-pass)
   - Backtrack: 6 tests (insufficient data, plateau, improving, force archive,
     reset, walk-back)
   - Pareto: 5 tests (better, noise floor, robustness fail, growth penalty,
     zero growth)
   - Constraints: 9 tests (5 config drift, 4 scope creep)
   - Integration: 4 tests (dry run, no-gepa skeleton, backtrack integration,
     Pareto integration)
Adds the top-level integration layer that connects v2.1 modules
to the live evolution pipeline:

1. gepa_v2_dispatch.py (437 lines)
   - Wraps v1's GEPA loop with v2.1 decision gates
   - Top-level backtrack: re-runs GEPA if ParetoSelector rejects,
     up to N attempts (3 or iterations//5)
   - Runs ConfigDriftChecker, SkillRegressionChecker, ScopeCreepChecker
     on each evolved candidate
   - Captures per-scenario holdout results for Router classification
   - Returns EvolutionReport with deploy/review/reject recommendation
   - Saves output to output/<skill>/v2_<timestamp>/ with report.json

2. --v2 CLI flag
   - python -m evolution.skills.evolve_skill --skill X --v2
   - Dispatches through v2_dispatch() instead of v1 evolve()
   - v1 path unchanged when --v2 is absent

3. EvolutionReport simplified
   - Replaced 10 fields (baseline_score, evolved_score, budget, etc.)
     with 8 focused fields: skill_name, n_iterations_executed,
     improvement, recommendation, details, router_decision,
     backtrack_decision, elapsed_seconds
   - All dependent modules (evolve_skill_v2.py, types.py, tests)
     updated to match

4. Backtrack checkpoint_for_score convenience method
   - Records EvolutionSnapshot from raw score/body/iteration values

5. Tests: 162 passing (3 pre-existing failures)
   - 3 new dispatch tests: dry_run, no_skill, report_type
   - 30 v2.1 unit tests still passing
   - Integration tests updated for new EvolutionReport shape
1. EvolutionRouter fixes (threshold validation):
   - Fixed priority order: structural checked BEFORE coverage (not after)
   - Fixed structural check: uses evolution_history[-1] (needs >=1, not >=2)
   - coverage check now actually uses coverage_cluster_ratio (was defined but
     never implemented — classified ANY multi-reason failure set as coverage)
   - Added coverage_min_failures=3 guard (1 failure isn't a coverage pattern)
   - Added _dominant_category helper for logging which category dominates
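The corrected priority order can be sketched roughly as below: structural change is checked before coverage, and coverage requires both a minimum failure count and a real cluster ratio. The action mapping and the 0.5 ratio are hypothetical, and per the note above these thresholds are an unvalidated novel design.

```python
def classify(failures: list[str], structural_change: bool,
             coverage_cluster_ratio: float,
             coverage_min_failures: int = 3) -> str:
    if not failures:
        return "abstain"
    if structural_change:
        return "extend"  # structural is checked BEFORE coverage
    if len(failures) >= coverage_min_failures and coverage_cluster_ratio >= 0.5:
        return "extend"  # a genuine coverage pattern, not a lone failure
    return "fix"
```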

2. PostHocAnalyzer (evolution/core/posthoc_analyzer.py)
   - Fits power-law curve: score = a * iteration^c + b via scipy.optimize.curve_fit
   - Scipy fallback: log-log linear regression with R² estimate
   - Phase classification: early_discovery (c>0.2), diminishing_returns (0.05<c<0.2),
     plateau (c<0.05)
   - Crossover detection: finds iteration where marginal gain < min_improvement_delta
   - Predicted score at 2x iterations
   - Pure analytical — no API calls
   - scipy added to project dependencies (needed for curve_fit)
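The power-law fit and phase classification can be sketched with `scipy.optimize.curve_fit`; the thresholds mirror the ones listed above (c>0.2 early discovery, 0.05<c<0.2 diminishing returns, otherwise plateau), while the function signature is illustrative and the log-log fallback path is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, c, b):
    # score = a * iteration^c + b
    return a * np.power(x, c) + b

def classify_phase(scores: list[float]) -> str:
    iters = np.arange(1, len(scores) + 1, dtype=float)
    (a, c, b), _ = curve_fit(power_law, iters, np.asarray(scores),
                             p0=(1.0, 0.5, 0.0), maxfev=10000)
    if c > 0.2:
        return "early_discovery"
    if c > 0.05:
        return "diminishing_returns"
    return "plateau"
```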

3. Router Benchmark (tests/core/router_benchmark.py)
   - 11 synthetic test cases: edge_case, coverage, structural, noise,
     all_pass, low_budget, edge_case ratio sweep, empty, no_history,
     zero_hard, mixed_priority
   - All 11 passing
   - Run standalone: python tests/core/router_benchmark.py

4. Tests: 175 passing (3 pre-existing failures)
…n test

Bugs surfaced by running the full gate pipeline against real skill data:

1. SkillRegressionChecker.check() interface mismatch
   - Filesystem-based check() takes (skill_name, threshold) — not inline scores
   - Added check_score(evolved_score, baseline_score) for the direct score
     comparison that v2_dispatch and tests use
   - Fixed v2_dispatch.py call → check_score()

2. SelectionResult missing 'reason' field
   - ParetoSelector selected evolved vs baseline but didn't explain WHY
   - Added reason: str field with human-readable explanations at every branch
   - All selection paths now log: robustness failure, noise floor, weighted win,
     growth penalty, and improvement
   - Growth penalty now appears in reason string for size-override decisions

3. ParetoSelector reason edge case: growth info missing
   - When size penalty was the deciding factor (400% growth → penalty=1.0),
     the reason only said 'baseline wins on weighted score' without mentioning
     growth or the penalty value
   - Fixed: all weighted-score reasons now include growth ratio and penalty

4. Test fixes:
   - ConfigDrift: used different descriptions (which correctly triggers drift)
     changed to different tags (which correctly does not)
   - Regression: asserted r2[0] == 'pass' but check_score returns (bool, str)
   - Pipeline tests: 4/4 passing against real companion-workflows skill data
- PostHocAnalyzer runs after GEPA loop completes, before Router
  classification, using the per-attempt score trajectory
- Shows power-law phase classification and recommended action in console
- Appends posthoc analysis to report.json output
- Adds Phase and Power-Law c rows to summary table
- Import PostHocReport type and PostHocAnalyzer class in dispatch
- All 17 posthoc + pipeline integration tests passing
1. test_skill_over_limit: 20KB input didn't exceed 50KB max_skill_size
   → increased to 60KB to actually trigger the limit

2. test_excessive_growth: 30% growth on 1KB baseline was within the
   100% dynamic allowance for small skills (<5KB)
   → changed to 30% growth on 25KB baseline (max 20% for >20KB skills)
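A hedged reconstruction of the dynamic growth allowance implied by these test fixes and the earlier thresholds (small skills may grow up to 100%, <20KB up to 50%, larger ones only 20%); the exact tier boundaries are inferred from the text, not copied from the validator.

```python
def growth_allowance(baseline_bytes: int) -> float:
    if baseline_bytes < 5 * 1024:
        return 1.0   # small skills (<5KB): up to 100% growth
    if baseline_bytes < 20 * 1024:
        return 0.5   # mid-size (<20KB): up to 50%
    return 0.2       # large (>20KB): up to 20%

def excessive_growth(baseline_bytes: int, evolved_bytes: int) -> bool:
    growth = (evolved_bytes - baseline_bytes) / baseline_bytes
    return growth > growth_allowance(baseline_bytes)
```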

3. test_valid_skill: minimal body lacked 2-of-3 structural checks
   → added substantive body with steps, headings, and >100 chars

4. PostHoc integration: fixed spurious PostHocReport import from types.py
   (doesn't exist there; PostHocAnalyzer resolves its own dependencies)

Full suite: 189/189 passing — all tests clean
Bug: score_trajectory extracted avg_baseline (always the same value)
instead of tracking best score over time. This meant PostHoc never had
enough variance for power-law fitting and always returned None.

Fix: starts with the initial baseline score, then for each attempt
takes max(avg_baseline, avg_evolved) compared against the running best.
This produces a non-decreasing trajectory that the power-law fitter
can actually analyze.
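The corrected extraction can be sketched as a running maximum: start from the initial baseline, then carry forward the best of each attempt's baseline/evolved averages so the series is non-decreasing. Field names below are illustrative.

```python
def score_trajectory(initial_baseline: float, attempts: list[dict]) -> list[float]:
    best = initial_baseline
    trajectory = [best]
    for attempt in attempts:
        # Take the better of this attempt's baseline/evolved averages,
        # compared against the running best so far.
        candidate = max(attempt["avg_baseline"], attempt["avg_evolved"])
        best = max(best, candidate)
        trajectory.append(best)
    return trajectory
```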

189/189 tests passing.
Part A — captured-skill plugin (Hermes Agent gateway):
  plugin.yaml — registers on_session_end hook
  __init__.py — loads session data, builds candidates, slash commands
  capture.py — core logic: is_capturable heuristics, tool sequence
    extraction, domain tagging, skill body generation, overlap
    detection (Jaccard word similarity, no embeddings needed)

  Hooks: on_session_end — runs after every completed session with
    3+ tool calls, extracts task description, tool sequence, domain
    tags, and success pattern, saves to ~/.hermes/captured/<name>.json

  Slash command: /captured — list, show, inspect, validate, stats

Part B — ingest-captured CLI (self-evolution repo):
  python -m evolution.tools.ingest_captured
    list [status]     List captured candidates
    validate <file>   Validate candidate structure and overlap
    deploy <file>     Deploy a validated candidate as ~/.hermes/skills/<name>/SKILL.md
    auto              Bulk validate + deploy all pending candidates
    evolve <file>     Run v2 evolution pipeline then deploy if improved
    stats             Capture statistics

  Validation: body length > 50 chars, frontmatter or heading structure,
    overlap with existing skills (blocked at J > 0.5)
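The Jaccard word-similarity overlap check (no embeddings needed) can be sketched as below, with the J > 0.5 blocking threshold from the validation rules above; the whitespace tokenization is a simplified stand-in.

```python
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def overlaps_existing(candidate: str, existing_bodies: list[str],
                      threshold: float = 0.5) -> bool:
    # Block deployment if the candidate is too similar to any existing skill.
    return any(jaccard(candidate, body) > threshold for body in existing_bodies)
```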

196/196 tests passing (7 new + 189 existing)
…n (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog
New modules:
- evolution/tools/tool_module.py: ToolDescriptionStore (loads from Hermes registry) + ToolDescriptionModule (DSPy Predict wrapper)
- evolution/tools/tool_dataset_builder.py: (task, tool_name) eval dataset from synthetic templates + SessionDB mining
- evolution/tools/tool_description_v2.py: GEPA v2 pipeline with BacktrackController, PostHocAnalyzer, ParetoSelector, EvolutionRouter
- evolution/tools/evolve_tool_description.py: CLI entry point (--tool, --iterations, --eval-source, --dry-run)

Architecture:
- Tool selection as classification: given task description → predict correct tool
- GEPA optimizes tool descriptions (≤500 chars) to maximize classification accuracy
- v2 pipeline wraps v1 GEPA with decision gates (reject/accept/review)
- Constraint validator enforces 500-char description limit
- Output: output/tool_descriptions/<tool>_<timestamp>/report.json

CLI: python -m evolution.tools.evolve_tool_description --tool search_files --iterations 10
…itness, parallel scoring

- MultiComponentSkillModule: section-level mutation via split_into_sections/reconstruct
- PurposePreservationChecker (4th hard gate): blocks type-changing mutations via keyword
  survival + TF-IDF cosine similarity + consultant-prompt structure detection
- ContentSemanticScorer: sklearn TfidfVectorizer (unigrams+bigrams, sublinear TF)
- ParallelRelevanceFilter: ThreadPoolExecutor for LLM relevance calls (13s→30s for 20)
- Ollama Cloud thinking wrapper fix: promotes reasoning_content→content when empty
- evolve_skill.py: multi_component_extract() replaces _extract_evolved_skill_body()
- gepa_v2_dispatch: uses MultiComponentSkillModule, fixed total_improvement calc
- Test coverage: test_constraints_v2.py (7 test cases)
- CLI entry points: run_batch_evolution.py, run_deep_evolution.py, run_v2_validate*.py
…eration

- Fix syntax error in seed_to_skill.py: Python 3.12 doesn't allow backslash
  escapes inside f-string expressions. Extracted coherence_issues_escaped
  and timestamp to variables before the multi-line return statement.

- Add run_batch_seed_generation.py: generates skills from seeds for
  Phase 3 of skill-generation-from-seed plan.
  Seeds: personal-osint-audit, exploratory-data-analysis, research-planning

- Kanban: move 3 regression skills to STALE:
  companion-personas (plateau, best=+0.1988, latest=-0.1247),
  companion-system-orchestration (plateau, best=+0.0418),
  github-code-review (noise-level changes, best=+0.0000)
  Root cause: evolving existing 600-line skills hits plateaus.
  These should use seed-based generation instead.

- Batch seed generation running in background (proc_ce421498c3d0)
- seed_to_skill.py: add --timestamp CLI arg to prevent cross-run arXiv pollution
- run_batch_seed_fast.py: switch eval model from broken minimax/minimax-m2.7 → deepseek/deepseek-v4-flash
- gepa_kanban: add 3 new skills (hermes-agent-author, design-a-multi-agent-companion-coordinat, github-pr-review) in VALIDATING; mark old skills STALE

New skills installed to ~/.hermes/skills/:
  companion-system/hermes-agent-author/ (replaces companion-personas)
  companion-system/design-a-multi-agent-companion-coordinat/ (replaces companion-system-orchestration)
  github/github-pr-review/ (replaces github-code-review)

All 3 skills: 0 arXiv refs, all 5 sections exit=0, coherence PASS.
228 tests still passing.
Results:
- hermes-agent-author: 0.500 (INCOMPARABLE - generator vs old persona catalog)
- design-a-multi-agent-companion-coordinat: 0.621 vs old 0.731 (-0.11 regression)
- github-pr-review: 0.578 vs old 0.650 (-0.07 regression)

Key insight: hermes-agent-author is a fundamentally different task from
the old companion-personas (generates personas vs. provides a fixed catalog).
The other two are modest regressions - the seed skills are more focused/narrow
than the original 11-section skills they replaced.

Scripts added: score_new_skills.py, score_compare.py, score_baseline.py
Cleanup: removed bad arXiv-polluted seed dirs from previous failed run.
GEPA ran 5 iterations on:
- design-a-multi-agent-companion-coordinat: 0.5->0.5, all mutations rejected
- github-pr-review: 0.5->0.5, all mutations rejected
- hermes-agent-author: failed to load skill

Root cause: seed skills are ~50% smaller (5 sections) vs old archived
skills (9-15 sections). The seed generates a GEPA-friendly skeleton
but lacks the depth/complexity that made the originals effective. GEPA
can't conjure the missing sections into existence in 5 iterations.

Key finding: seed-to-skill creates useful starting skeletons, not
direct replacements for highly-refined multi-section skills. The
pipeline is working correctly — the gap is in seed density.
NEW SKILLS (Phase 3 - no regression baseline):
- research-synthesis: 0.560 — web/arxiv/wiki research report synthesis
- linear-issue-creator: 0.341 — natural language to Linear issue creation
- codebase-metrics: 0.527 — codebase metrics via pygount

PHASE 2 REPLACEMENTS (with old baselines):
- hermes-agent-author: 0.528 vs 0.627 old (-0.099, INCOMPARABLE - different task)
- design-a-multi-agent-companion-coordinat: 0.622 vs 0.731 old (-0.109)
- github-pr-review: 0.585 vs 0.650 old (-0.065)

GEPA PHASE 4 EVOLUTION:
- All 3 seed skills: 0 improvement (5 iterations, all mutations rejected)
- Root cause: seed skills are 5-section skeletons vs old 9-15 section skills
- Pipeline is working correctly; seed density is the bottleneck

Phase 3 scripts: run_batch_seed_phase3.py, updated score_new_skills.py
Datasets: research-synthesis, linear-issue-creator, codebase-metrics
…0% improvement

- linear-issue-creator: GEPA 5 iter, valset 72%, holdout 50%, no improvement
- codebase-metrics: GEPA 5 iter, valset 55.8%, holdout 50%, no improvement
- Both seeds locally optimal on synthetic eval — seed density ceiling confirmed
- research-synthesis killed (22% valset, poor synthetic fit)
- Updated card-registry.json + baseline_scores_20260501.json
steezkelly and others added 20 commits May 1, 2026 04:13
Added: overview, when-to-use, troubleshooting, variants, related-skills

Evidence: Phase 3 seeds averaged 5 sections (~9KB) vs comparable
mature skills averaging 10-15 sections (~10-20KB). The seed density
ceiling is structural — GEPA can't discover missing sections.

Mature skill section counts: systematic-debugging=15, claude-code=23,
hermes-quickref=19, obliterator=19. Seed density fix: double the
template sections.
- EvalExample: add tool_sequence, complexity_score, session_id, success_pattern
- EvalDataset: add merge() dedup and save_atomic() write-temp-then-rename
- CapturedExampleEnricher: rule-based rubric extraction from first section
- assign_split(): deterministic MD5 hash -> exactly one split (train/val/holdout)
- enrich_and_merge(): single-split append with dedup, replaces save_as_sessiondb_example
- Capture plugin: on_session_end hook + /captured slash command (plugin.yaml + __init__.py)
- Verification script: docs/phase5_verification.py
- Tests: 34 passing across dataset_builder, ingest_captured, capture_plugin
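The deterministic split assignment can be sketched as hashing a stable example ID with MD5 so each example lands in exactly one of train/val/holdout on every run (closing the Gap B data-leakage issue noted below); the 80/10/10 proportions are illustrative.

```python
import hashlib

def assign_split(example_id: str) -> str:
    # MD5 of a stable ID gives a deterministic bucket in [0, 100).
    bucket = int(hashlib.md5(example_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "holdout"
```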

Design decisions:
- D1 (3+ tools) enforced in both plugin _is_capturable and _save_candidate
- D2 (rule-based rubric) — no LLM call, first post-frontmatter section + fallback
- D3 (silent failure) — errors logged to ~/.hermes/capture_errors/<date>.jsonl

Fixes 6 gaps from Phase 5 gap audit:
- Gap B (data leakage): single split assignment
- Gap C (rubric mismatch): structured expected_behavior with Task/Expected tools/Procedure
- Gap F (field loss): metadata preserved to dataset
Local verification before merge: targeted stack tests 41 passed; full suite 291 passed; GitHub checks absent. Rebased conflict resolution preserved active lab main improvements plus #8 ingestion/gate fixes.
Locally trial-merged PRs #2-#6 after #8. Verification before merge: py_compile scripts/archive/diagnostics/kanban bridge passed; bash -n scripts/run-evolution.sh passed; full pytest 291 passed, 11 warnings.
Adds the Foundry-side action-router trace-to-eval fixture and local artifact contract for the thin-bootstrap pipeline.
- session_import_demo.py: baseline raw-extractor fails, candidate structured-
  extractor passes (task_input, expected_behavior, difficulty, category)
- deterministic CLI: --mode fixture --no-network --no-external-writes
- emits: run_report.json, eval_examples.json, promotion_dossier.md,
  artifact_manifest.json
- 6 tests in test_session_import_demo.py, all green
- failure_class: raw_session_trace_without_structured_eval_example
* feat: add deterministic tool-underuse fixture demo

- tool_underuse_demo.py: baseline describes actions verbally (zero tool calls),
  candidate makes actual terminal calls with output
- deterministic CLI: --mode fixture --no-network --no-external-writes
- emits: run_report.json, tool_usage_snapshot.json, promotion_dossier.md,
  artifact_manifest.json
- 8 tests in test_tool_underuse_demo.py, all green
- failure_class: agent_describes_instead_of_calls_tools
* feat: add deterministic skill-drift fixture demo

- skill_drift_demo.py: baseline never checks for staleness, candidate diffs
  skill body against last-reviewed reference
- emits: run_report.json, skill_diff.txt, promotion_dossier.md, artifact_manifest.json
- 6 tests in test_skill_drift_demo.py, all green
- failure_class: stale_skill_body_without_drift_detection
…se drift

Root cause: PurposePreservationChecker compared evolved content against
run baseline, not canonical source. After first purpose drift (mnemosyne
tools → Background Process Analyzer), all v2 runs anchored to drifted
baseline and checker passed drift-to-drift trivially.

Fix:
- Import and fit ContentSemanticScorer on canonical baseline_body
- Inject scorer into PurposePreservationChecker before backtrack loop
- Add regression test using canonical mnemosyne baseline vs drifted output

Card-registry also corrected: companion-roundtable→STALE (plateaued,
never deployed), github-code-review→ARCHIVED (deployed, stable),
mnemosyne→REGRESSION with full cascade diagnostic.

22/22 constraints tests + 3/3 dispatch tests pass.