fix(gepa): wire ContentSemanticScorer into v2 dispatch to catch purpose drift#69

Open
steezkelly wants to merge 50 commits into NousResearch:main from steezkelly:feat/skill-drift-fixture
Conversation

@steezkelly

Root cause

PurposePreservationChecker in gepa_v2_dispatch.py compared evolved content against the run's own baseline — not against the canonical source. After a single purpose drift (e.g. mnemosyne tools → Background Process Analyzer), all subsequent v2 runs anchored to the drifted content, and the checker compared drift-to-drift, always passing.

Fix

  • Import and fit ContentSemanticScorer on the canonical baseline_body
  • Inject scorer into PurposePreservationChecker before the backtrack loop via purpose_check.set_content_scorer(content_scorer)
  • TF-IDF cosine similarity now catches text-distribution divergence even when keyword survival passes
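A minimal sketch of the ContentSemanticScorer idea described above: fit TF-IDF on the canonical baseline only, then gate candidates on cosine similarity. The class/method names, the 0.35 threshold, and the `passes` helper are illustrative, not the repo's actual API; the unigram+bigram, sublinear-TF configuration follows the description later in this PR.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ContentSemanticScorer:
    def __init__(self, threshold: float = 0.35):
        self.threshold = threshold
        self.vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
        self.baseline_vec = None

    def fit(self, baseline_body: str) -> None:
        # Fit on the canonical baseline, so every run compares against the
        # same anchor instead of a possibly drifted per-run baseline.
        self.baseline_vec = self.vectorizer.fit_transform([baseline_body])

    def score(self, candidate_body: str) -> float:
        cand_vec = self.vectorizer.transform([candidate_body])
        return float(cosine_similarity(self.baseline_vec, cand_vec)[0, 0])

    def passes(self, candidate_body: str) -> bool:
        return self.score(candidate_body) >= self.threshold
```

The key point is that `fit` is called once on the canonical `baseline_body`, not on the output of the previous run, which is what made drift-to-drift comparisons pass trivially.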

Regression test

Added test_mnemosyne_self_evolution_tools_purpose_drift_is_caught — loads canonical mnemosyne tools baseline, tests that the drifted 'Background Process Completion Analyzer' content fails the checker. (Previously it passed with recommendation 'deploy'.)

Verification

  • 22/22 constraints tests pass
  • 3/3 dispatch tests pass
  • 263/265 core tests pass (2 pre-existing failures in test_issue54_promotion — python PATH in CliRunner, unrelated)

…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body, body becomes instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
  - constraint_validator.py: body/frontmatter separation — validate body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies

- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA

- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA still falls back to MIPROv2 due to DSPy 3.2.0 API — max_metric_calls
conflicts with auto='light'. Use max_metric_calls alone (fixed).
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
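The sentinel scheme can be sketched as follows: the skill body is embedded behind an HTML-comment sentinel that cannot collide with Markdown horizontal rules (`---`) inside skill bodies, and extraction falls back gracefully when the sentinel is absent. Function names here are hypothetical; only the sentinel string comes from the change above.

```python
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def embed_skill_body(preamble: str, body: str) -> str:
    # Place the body after a sentinel that never appears in skill content.
    return f"{preamble}\n{SENTINEL}\n{body}"

def extract_skill_body(instructions: str) -> str:
    # Primary path: split on the sentinel. Fallback: return the whole text
    # so a missing sentinel degrades gracefully instead of crashing.
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[1].strip()
    return instructions.strip()
```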
…<20KB) +50%, large +20%. Fixes companion-interview-workflow rejection (+28.5% bloat was genuine operational detail, not bloat). Also cap pre-filter to 20 candidates in RelevanceFilter to prevent 30+ minute timeouts.
Completes v2.1 build phase:

1. GEPA/MIPROv2 logger (Cassian's #1 production risk)
   - Logs optimizer type (GEPA vs MIPROv2) after compile in evolve_skill.py
   - Added optimizer_type field to stats CSV schema

2. Router (evolution/core/router.py)
   - 3-action classification: fix / extend / abstain
   - Heuristic-based (no LLM calls): failure pattern detection by reason keyword,
     structural change detection via conditional counts, confidence scaling
   - All thresholds labeled as unvalidated novel design per Aris Thorne review

3. Backtrack Controller (evolution/core/backtrack.py)
   - 3-iteration sliding window plateau detection
   - Float-epsilon threshold comparison (fixes IEEE 754 precision edge case)
   - Walk-back: finds last adjacent improvement > 1%, returns checkpoint before it
   - Force-archive after N consecutive backtracks
   - Resets backtrack count after any improvement
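The plateau check above can be sketched as a 3-iteration sliding window with a float-epsilon comparison; the window size and epsilon below are illustrative defaults, not the repo's tuned values.

```python
EPS = 1e-9  # guards the IEEE 754 precision edge case in the comparison

def is_plateau(scores: list[float], window: int = 3, min_delta: float = 0.01) -> bool:
    if len(scores) < window:
        return False  # not enough data to call a plateau
    recent = scores[-window:]
    # Plateau: no adjacent improvement in the window exceeds min_delta,
    # using an epsilon so floating-point rounding can't flip the result.
    return all((b - a) <= min_delta + EPS for a, b in zip(recent, recent[1:]))
```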

4. Robustness Checkers (evolution/core/constraints_v2.py)
   - ConfigDriftChecker: frontmatter name/description stability
   - SkillRegressionChecker: holdout score retains 90%+ of baseline
   - ScopeCreepChecker: length-normalized term frequency drift detection
   - Small-baseline (<3 meaningful words) gracefully skipped

5. Pareto Selector (evolution/core/pareto_selector.py)
   - Multi-objective: holdout score (primary) + skill size delta (secondary)
   - min_improvement_delta=0.03 noise floor (evaluation noise guard)
   - growth_threshold cap prevents 400%+ bloat with small gains
   - Robustness gate: failed check = baseline retained regardless
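A hedged sketch of the selection rule: holdout score primary, size growth a secondary gate, a noise floor against evaluation noise, and any failed robustness check retaining the baseline. The function name and exact penalty shape are illustrative; the 0.03 noise floor comes from the bullet above.

```python
def select(evolved_score: float, baseline_score: float,
           evolved_size: int, baseline_size: int,
           robustness_ok: bool,
           min_improvement_delta: float = 0.03,
           growth_threshold: float = 0.5) -> str:
    if not robustness_ok:
        return "baseline"  # robustness gate: failed check always retains baseline
    improvement = evolved_score - baseline_score
    if improvement < min_improvement_delta:
        return "baseline"  # below the noise floor
    growth = (evolved_size - baseline_size) / max(baseline_size, 1)
    if growth > growth_threshold:
        return "baseline"  # growth cap: large bloat outweighs a small gain
    return "evolved"
```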

6. Shared Types (evolution/core/types.py)
   - 5 dataclasses: EvolutionSnapshot, RouterDecision, BacktrackDecision,
     ComputeBudget, EvolutionReport

7. Tests: 30 new tests, all passing
   - Router: 6 tests (empty extend, edge case fix, low budget, structural,
     confidence scaling, all-pass)
   - Backtrack: 6 tests (insufficient data, plateau, improving, force archive,
     reset, walk-back)
   - Pareto: 5 tests (better, noise floor, robustness fail, growth penalty,
     zero growth)
   - Constraints: 9 tests (5 config drift, 4 scope creep)
   - Integration: 4 tests (dry run, no-gepa skeleton, backtrack integration,
     Pareto integration)
Adds the top-level integration layer that connects v2.1 modules
to the live evolution pipeline:

1. gepa_v2_dispatch.py (437 lines)
   - Wraps v1's GEPA loop with v2.1 decision gates
   - Top-level backtrack: re-runs GEPA if ParetoSelector rejects,
     up to N attempts (3 or iterations//5)
   - Runs ConfigDriftChecker, SkillRegressionChecker, ScopeCreepChecker
     on each evolved candidate
   - Captures per-scenario holdout results for Router classification
   - Returns EvolutionReport with deploy/review/reject recommendation
   - Saves output to output/<skill>/v2_<timestamp>/ with report.json

2. --v2 CLI flag
   - python -m evolution.skills.evolve_skill --skill X --v2
   - Dispatches through v2_dispatch() instead of v1 evolve()
   - v1 path unchanged when --v2 is absent

3. EvolutionReport simplified
   - Replaced 10 fields (baseline_score, evolved_score, budget, etc.)
     with 8 focused fields: skill_name, n_iterations_executed,
     improvement, recommendation, details, router_decision,
     backtrack_decision, elapsed_seconds
   - All dependent modules (evolve_skill_v2.py, types.py, tests)
     updated to match

4. Backtrack checkpoint_for_score convenience method
   - Records EvolutionSnapshot from raw score/body/iteration values

5. Tests: 162 passing (3 pre-existing failures)
   - 3 new dispatch tests: dry_run, no_skill, report_type
   - 30 v2.1 unit tests still passing
   - Integration tests updated for new EvolutionReport shape
1. EvolutionRouter fixes (threshold validation):
   - Fixed priority order: structural checked BEFORE coverage (not after)
   - Fixed structural check: uses evolution_history[-1] (needs >=1, not >=2)
   - coverage check now actually uses coverage_cluster_ratio (was defined but
     never implemented — classified ANY multi-reason failure set as coverage)
   - Added coverage_min_failures=3 guard (1 failure isn't a coverage pattern)
   - Added _dominant_category helper for logging which category dominates
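The corrected priority order can be sketched roughly as below: structural change is checked before coverage, and coverage requires both a minimum failure count and a real cluster ratio. The action mapping and the 0.5 ratio are hypothetical, and per the note above these thresholds are an unvalidated novel design.

```python
def classify(failures: list[str], structural_change: bool,
             coverage_cluster_ratio: float,
             coverage_min_failures: int = 3) -> str:
    if not failures:
        return "abstain"
    if structural_change:
        return "extend"  # structural is checked BEFORE coverage
    if len(failures) >= coverage_min_failures and coverage_cluster_ratio >= 0.5:
        return "extend"  # a genuine coverage pattern, not a lone failure
    return "fix"
```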

2. PostHocAnalyzer (evolution/core/posthoc_analyzer.py)
   - Fits power-law curve: score = a * iteration^c + b via scipy.optimize.curve_fit
   - Scipy fallback: log-log linear regression with R² estimate
   - Phase classification: early_discovery (c>0.2), diminishing_returns (0.05<c<0.2),
     plateau (c<0.05)
   - Crossover detection: finds iteration where marginal gain < min_improvement_delta
   - Predicted score at 2x iterations
   - Pure analytical — no API calls
   - scipy added to project dependencies (needed for curve_fit)
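The power-law fit and phase classification can be sketched with `scipy.optimize.curve_fit`; the thresholds mirror the ones listed above (c>0.2 early discovery, 0.05<c<0.2 diminishing returns, otherwise plateau), while the function signature is illustrative and the log-log fallback path is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, c, b):
    # score = a * iteration^c + b
    return a * np.power(x, c) + b

def classify_phase(scores: list[float]) -> str:
    iters = np.arange(1, len(scores) + 1, dtype=float)
    (a, c, b), _ = curve_fit(power_law, iters, np.asarray(scores),
                             p0=(1.0, 0.5, 0.0), maxfev=10000)
    if c > 0.2:
        return "early_discovery"
    if c > 0.05:
        return "diminishing_returns"
    return "plateau"
```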

3. Router Benchmark (tests/core/router_benchmark.py)
   - 11 synthetic test cases: edge_case, coverage, structural, noise,
     all_pass, low_budget, edge_case ratio sweep, empty, no_history,
     zero_hard, mixed_priority
   - All 11 passing
   - Run standalone: python tests/core/router_benchmark.py

4. Tests: 175 passing (3 pre-existing failures)
…n test

Bugs surfaced by running the full gate pipeline against real skill data:

1. SkillRegressionChecker.check() interface mismatch
   - Filesystem-based check() takes (skill_name, threshold) — not inline scores
   - Added check_score(evolved_score, baseline_score) for the direct score
     comparison that v2_dispatch and tests use
   - Fixed v2_dispatch.py call → check_score()

2. SelectionResult missing 'reason' field
   - ParetoSelector selected evolved vs baseline but didn't explain WHY
   - Added reason: str field with human-readable explanations at every branch
   - All selection paths now log: robustness failure, noise floor, weighted win,
     growth penalty, and improvement
   - Growth penalty now appears in reason string for size-override decisions

3. ParetoSelector reason edge case: growth info missing
   - When size penalty was the deciding factor (400% growth → penalty=1.0),
     the reason only said 'baseline wins on weighted score' without mentioning
     growth or the penalty value
   - Fixed: all weighted-score reasons now include growth ratio and penalty

4. Test fixes:
   - ConfigDrift: used different descriptions (which correctly triggers drift)
     changed to different tags (which correctly does not)
   - Regression: asserted r2[0] == 'pass' but check_score returns (bool, str)
   - Pipeline tests: 4/4 passing against real companion-workflows skill data
- PostHocAnalyzer runs after GEPA loop completes, before Router
  classification, using the per-attempt score trajectory
- Shows power-law phase classification and recommended action in console
- Appends posthoc analysis to report.json output
- Adds Phase and Power-Law c rows to summary table
- Import PostHocReport type and PostHocAnalyzer class in dispatch
- All 17 posthoc + pipeline integration tests passing
1. test_skill_over_limit: 20KB input didn't exceed 50KB max_skill_size
   → increased to 60KB to actually trigger the limit

2. test_excessive_growth: 30% growth on 1KB baseline was within the
   100% dynamic allowance for small skills (<5KB)
   → changed to 30% growth on 25KB baseline (max 20% for >20KB skills)
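A hedged reconstruction of the dynamic growth allowance implied by these test fixes and the earlier thresholds (small skills may grow up to 100%, <20KB up to 50%, larger ones only 20%); the exact tier boundaries are inferred from the text, not copied from the validator.

```python
def growth_allowance(baseline_bytes: int) -> float:
    if baseline_bytes < 5 * 1024:
        return 1.0   # small skills (<5KB): up to 100% growth
    if baseline_bytes < 20 * 1024:
        return 0.5   # mid-size (<20KB): up to 50%
    return 0.2       # large (>20KB): up to 20%

def excessive_growth(baseline_bytes: int, evolved_bytes: int) -> bool:
    growth = (evolved_bytes - baseline_bytes) / baseline_bytes
    return growth > growth_allowance(baseline_bytes)
```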

3. test_valid_skill: minimal body lacked 2-of-3 structural checks
   → added substantive body with steps, headings, and >100 chars

4. PostHoc integration: fixed spurious PostHocReport import from types.py
   (doesn't exist there; PostHocAnalyzer resolves its own dependencies)

Full suite: 189/189 passing — all tests clean
Bug: score_trajectory extracted avg_baseline (always the same value)
instead of tracking best score over time. This meant PostHoc never had
enough variance for power-law fitting and always returned None.

Fix: starts with the initial baseline score, then for each attempt
takes max(avg_baseline, avg_evolved) compared against the running best.
This produces a non-decreasing trajectory that the power-law fitter
can actually analyze.
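The corrected extraction can be sketched as a running maximum: start from the initial baseline, then carry forward the best of each attempt's baseline/evolved averages so the series is non-decreasing. Field names below are illustrative.

```python
def score_trajectory(initial_baseline: float, attempts: list[dict]) -> list[float]:
    best = initial_baseline
    trajectory = [best]
    for attempt in attempts:
        # Take the better of this attempt's baseline/evolved averages,
        # compared against the running best so far.
        candidate = max(attempt["avg_baseline"], attempt["avg_evolved"])
        best = max(best, candidate)
        trajectory.append(best)
    return trajectory
```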

189/189 tests passing.
Part A — captured-skill plugin (Hermes Agent gateway):
  plugin.yaml — registers on_session_end hook
  __init__.py — loads session data, builds candidates, slash commands
  capture.py — core logic: is_capturable heuristics, tool sequence
    extraction, domain tagging, skill body generation, overlap
    detection (Jaccard word similarity, no embeddings needed)

  Hooks: on_session_end — runs after every completed session with
    3+ tool calls, extracts task description, tool sequence, domain
    tags, and success pattern, saves to ~/.hermes/captured/<name>.json

  Slash command: /captured — list, show, inspect, validate, stats

Part B — ingest-captured CLI (self-evolution repo):
  python -m evolution.tools.ingest_captured
    list [status]     List captured candidates
    validate <file>   Validate candidate structure and overlap
    deploy <file>     Deploy a validated candidate as ~/.hermes/skills/<name>/SKILL.md
    auto              Bulk validate + deploy all pending candidates
    evolve <file>     Run v2 evolution pipeline then deploy if improved
    stats             Capture statistics

  Validation: body length > 50 chars, frontmatter or heading structure,
    overlap with existing skills (blocked at J > 0.5)
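The Jaccard word-similarity overlap check (no embeddings needed) can be sketched as below, with the J > 0.5 blocking threshold from the validation rules above; the whitespace tokenization is a simplified stand-in.

```python
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def overlaps_existing(candidate: str, existing_bodies: list[str],
                      threshold: float = 0.5) -> bool:
    # Block deployment if the candidate is too similar to any existing skill.
    return any(jaccard(candidate, body) > threshold for body in existing_bodies)
```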

196/196 tests passing (7 new + 189 existing)
…n (rejected), nous_auth module, v2 entry point, evolution stats/tools catalog
New modules:
- evolution/tools/tool_module.py: ToolDescriptionStore (loads from Hermes registry) + ToolDescriptionModule (DSPy Predict wrapper)
- evolution/tools/tool_dataset_builder.py: (task, tool_name) eval dataset from synthetic templates + SessionDB mining
- evolution/tools/tool_description_v2.py: GEPA v2 pipeline with BacktrackController, PostHocAnalyzer, ParetoSelector, EvolutionRouter
- evolution/tools/evolve_tool_description.py: CLI entry point (--tool, --iterations, --eval-source, --dry-run)

Architecture:
- Tool selection as classification: given task description → predict correct tool
- GEPA optimizes tool descriptions (≤500 chars) to maximize classification accuracy
- v2 pipeline wraps v1 GEPA with decision gates (reject/accept/review)
- Constraint validator enforces 500-char description limit
- Output: output/tool_descriptions/<tool>_<timestamp>/report.json

CLI: python -m evolution.tools.evolve_tool_description --tool search_files --iterations 10
…itness, parallel scoring

- MultiComponentSkillModule: section-level mutation via split_into_sections/reconstruct
- PurposePreservationChecker (4th hard gate): blocks type-changing mutations via keyword
  survival + TF-IDF cosine similarity + consultant-prompt structure detection
- ContentSemanticScorer: sklearn TfidfVectorizer (unigrams+bigrams, sublinear TF)
- ParallelRelevanceFilter: ThreadPoolExecutor for LLM relevance calls (13s→30s for 20)
- Ollama Cloud thinking wrapper fix: promotes reasoning_content→content when empty
- evolve_skill.py: multi_component_extract() replaces _extract_evolved_skill_body()
- gepa_v2_dispatch: uses MultiComponentSkillModule, fixed total_improvement calc
- Test coverage: test_constraints_v2.py (7 test cases)
- CLI entry points: run_batch_evolution.py, run_deep_evolution.py, run_v2_validate*.py
…eration

- Fix syntax error in seed_to_skill.py: Python 3.12 doesn't allow backslash
  escapes inside f-string expressions. Extracted coherence_issues_escaped
  and timestamp to variables before the multi-line return statement.

- Add run_batch_seed_generation.py: generates skills from seeds for
  Phase 3 of skill-generation-from-seed plan.
  Seeds: personal-osint-audit, exploratory-data-analysis, research-planning

- Kanban: move 3 regression skills to STALE:
  companion-personas (plateau, best=+0.1988, latest=-0.1247),
  companion-system-orchestration (plateau, best=+0.0418),
  github-code-review (noise-level changes, best=+0.0000)
  Root cause: evolving existing 600-line skills hits plateaus.
  These should use seed-based generation instead.

- Batch seed generation running in background (proc_ce421498c3d0)
- seed_to_skill.py: add --timestamp CLI arg to prevent cross-run arXiv pollution
- run_batch_seed_fast.py: switch eval model from broken minimax/minimax-m2.7 → deepseek/deepseek-v4-flash
- gepa_kanban: add 3 new skills (hermes-agent-author, design-a-multi-agent-companion-coordinat, github-pr-review) in VALIDATING; mark old skills STALE

New skills installed to ~/.hermes/skills/:
  companion-system/hermes-agent-author/ (replaces companion-personas)
  companion-system/design-a-multi-agent-companion-coordinat/ (replaces companion-system-orchestration)
  github/github-pr-review/ (replaces github-code-review)

All 3 skills: 0 arXiv refs, all 5 sections exit=0, coherence PASS.
228 tests still passing.
Results:
- hermes-agent-author: 0.500 (INCOMPARABLE - generator vs old persona catalog)
- design-a-multi-agent-companion-coordinat: 0.621 vs old 0.731 (-0.11 regression)
- github-pr-review: 0.578 vs old 0.650 (-0.07 regression)

Key insight: hermes-agent-author is a fundamentally different task from
the old companion-personas (generates personas vs. provides a fixed catalog).
The other two are modest regressions - the seed skills are more focused/narrow
than the original 11-section skills they replaced.

Scripts added: score_new_skills.py, score_compare.py, score_baseline.py
Cleanup: removed bad arXiv-polluted seed dirs from previous failed run.
GEPA ran 5 iterations on:
- design-a-multi-agent-companion-coordinat: 0.5->0.5, all mutations rejected
- github-pr-review: 0.5->0.5, all mutations rejected
- hermes-agent-author: failed to load skill

Root cause: seed skills are ~50% smaller (5 sections) vs old archived
skills (9-15 sections). The seed generates a GEPA-friendly skeleton
but lacks the depth/complexity that made the originals effective. GEPA
can't conjure the missing sections into existence in 5 iterations.

Key finding: seed-to-skill creates useful starting skeletons, not
direct replacements for highly-refined multi-section skills. The
pipeline is working correctly — the gap is in seed density.
NEW SKILLS (Phase 3 - no regression baseline):
- research-synthesis: 0.560 — web/arxiv/wiki research report synthesis
- linear-issue-creator: 0.341 — natural language to Linear issue creation
- codebase-metrics: 0.527 — codebase metrics via pygount

PHASE 2 REPLACEMENTS (with old baselines):
- hermes-agent-author: 0.528 vs 0.627 old (-0.099, INCOMPARABLE - different task)
- design-a-multi-agent-companion-coordinat: 0.622 vs 0.731 old (-0.109)
- github-pr-review: 0.585 vs 0.650 old (-0.065)

GEPA PHASE 4 EVOLUTION:
- All 3 seed skills: 0 improvement (5 iterations, all mutations rejected)
- Root cause: seed skills are 5-section skeletons vs old 9-15 section skills
- Pipeline is working correctly; seed density is the bottleneck

Phase 3 scripts: run_batch_seed_phase3.py, updated score_new_skills.py
Datasets: research-synthesis, linear-issue-creator, codebase-metrics
…0% improvement

- linear-issue-creator: GEPA 5 iter, valset 72%, holdout 50%, no improvement
- codebase-metrics: GEPA 5 iter, valset 55.8%, holdout 50%, no improvement
- Both seeds locally optimal on synthetic eval — seed density ceiling confirmed
- research-synthesis killed (22% valset, poor synthetic fit)
- Updated card-registry.json + baseline_scores_20260501.json
steezkelly and others added 20 commits May 1, 2026 04:13
Added: overview, when-to-use, troubleshooting, variants, related-skills

Evidence: Phase 3 seeds averaged 5 sections (~9KB) vs comparable
mature skills averaging 10-15 sections (~10-20KB). The seed density
ceiling is structural — GEPA can't discover missing sections.

Mature skill section counts: systematic-debugging=15, claude-code=23,
hermes-quickref=19, obliterator=19. Seed density fix: double the
template sections.
- EvalExample: add tool_sequence, complexity_score, session_id, success_pattern
- EvalDataset: add merge() dedup and save_atomic() write-temp-then-rename
- CapturedExampleEnricher: rule-based rubric extraction from first section
- assign_split(): deterministic MD5 hash -> exactly one split (train/val/holdout)
- enrich_and_merge(): single-split append with dedup, replaces save_as_sessiondb_example
- Capture plugin: on_session_end hook + /captured slash command (plugin.yaml + __init__.py)
- Verification script: docs/phase5_verification.py
- Tests: 34 passing across dataset_builder, ingest_captured, capture_plugin
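The deterministic split assignment can be sketched as hashing a stable example ID with MD5 so each example lands in exactly one of train/val/holdout on every run (closing the Gap B data-leakage issue noted below); the 80/10/10 proportions are illustrative.

```python
import hashlib

def assign_split(example_id: str) -> str:
    # MD5 of a stable ID gives a deterministic bucket in [0, 100).
    bucket = int(hashlib.md5(example_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "holdout"
```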

Design decisions:
- D1 (3+ tools) enforced in both plugin _is_capturable and _save_candidate
- D2 (rule-based rubric) — no LLM call, first post-frontmatter section + fallback
- D3 (silent failure) — errors logged to ~/.hermes/capture_errors/<date>.jsonl

Fixes 6 gaps from Phase 5 gap audit:
- Gap B (data leakage): single split assignment
- Gap C (rubric mismatch): structured expected_behavior with Task/Expected tools/Procedure
- Gap F (field loss): metadata preserved to dataset
Local verification before merge: targeted stack tests 41 passed; full suite 291 passed; GitHub checks absent. Rebased conflict resolution preserved active lab main improvements plus #8 ingestion/gate fixes.
Locally trial-merged PRs #2-#6 after #8. Verification before merge: py_compile scripts/archive/diagnostics/kanban bridge passed; bash -n scripts/run-evolution.sh passed; full pytest 291 passed, 11 warnings.
Adds the Foundry-side action-router trace-to-eval fixture and local artifact contract for the thin-bootstrap pipeline.
- session_import_demo.py: baseline raw-extractor fails, candidate structured-
  extractor passes (task_input, expected_behavior, difficulty, category)
- deterministic CLI: --mode fixture --no-network --no-external-writes
- emits: run_report.json, eval_examples.json, promotion_dossier.md,
  artifact_manifest.json
- 6 tests in test_session_import_demo.py, all green
- failure_class: raw_session_trace_without_structured_eval_example
* feat: add deterministic tool-underuse fixture demo

- tool_underuse_demo.py: baseline describes actions verbally (zero tool calls),
  candidate makes actual terminal calls with output
- deterministic CLI: --mode fixture --no-network --no-external-writes
- emits: run_report.json, tool_usage_snapshot.json, promotion_dossier.md,
  artifact_manifest.json
- 8 tests in test_tool_underuse_demo.py, all green
- failure_class: agent_describes_instead_of_calls_tools
* feat: add deterministic skill-drift fixture demo

- skill_drift_demo.py: baseline never checks for staleness, candidate diffs
  skill body against last-reviewed reference
- emits: run_report.json, skill_diff.txt, promotion_dossier.md, artifact_manifest.json
- 6 tests in test_skill_drift_demo.py, all green
- failure_class: stale_skill_body_without_drift_detection
…se drift

Root cause: PurposePreservationChecker compared evolved content against
run baseline, not canonical source. After first purpose drift (mnemosyne
tools → Background Process Analyzer), all v2 runs anchored to drifted
baseline and checker passed drift-to-drift trivially.

Fix:
- Import and fit ContentSemanticScorer on canonical baseline_body
- Inject scorer into PurposePreservationChecker before backtrack loop
- Add regression test using canonical mnemosyne baseline vs drifted output

Card-registry also corrected: companion-roundtable→STALE (plateaued,
never deployed), github-code-review→ARCHIVED (deployed, stable),
mnemosyne→REGRESSION with full cascade diagnostic.

22/22 constraints tests + 3/3 dispatch tests pass.