Code Audit Report: hermes-agent-self-evolution
Audited by: Hermes Agent (NousResearch)
Scope: Phase 1 implementation — evolution/skills/, evolution/core/, tests/
Files reviewed: 21 Python files, PLAN.md, README.md, pyproject.toml
Critical Issues (Must Fix Before Phase 1 Production Use)
C1. skill_fitness_metric — Keyword Overlap Used Instead of LLM Judge
Severity: Critical | Confirmed by: SE Review + Tech Lead Review
The evolve() function passes skill_fitness_metric (keyword-overlap heuristic) to dspy.GEPA.compile(). The LLMJudge class is imported but never instantiated — it is dead code.
Impact: GEPA/MIPROv2 optimize for word overlap, not semantic quality. A skill that spam-adds expected keywords would score 0.9+ while being functionally broken. The evolved skill is never evaluated for correctness, procedure_following, or conciseness.
```python
# fitness.py:107 — the actual optimization objective:
score = 0.3 + (0.7 * overlap)  # Keyword overlap. That's it.
```
The LLMJudge class exists and is well-designed — it should be used for val-set evaluation during optimization.
Tech Lead note: This will produce fake improvements that appear rigorous (metrics, constraints, holdout sets) while actually converging on keyword density, not task competence.
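A minimal sketch of the intended wiring, assuming LLMJudge lives in fitness.py and exposes a score() method (both unverified; the exact metric signature GEPA expects in the installed DSPy version may also take extra feedback arguments):

```python
from evolution.skills.fitness import LLMJudge  # assumed import path

judge = LLMJudge()  # currently imported but never instantiated

def skill_fitness_metric(example, prediction, trace=None):
    """Score semantic quality with the LLM judge instead of keyword overlap."""
    # judge.score() is a hypothetical method name; adapt to the real interface.
    return judge.score(example, prediction)
```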
C2. run_pytest Config is Dead — --run-tests is a No-Op
Severity: Critical | Found by: Tech Lead Review
In evolve_skill.py:54, run_pytest=run_tests is stored in EvolutionConfig, but validator.run_test_suite() is never called anywhere in the codebase.
Impact: The --run-tests CLI flag does nothing. Constraint validation runs but pytest execution is skipped. A skill with test failures would pass through the pipeline as if tests passed.
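A sketch of the missing call site (the return shape of run_test_suite() is an assumption; adapt to the validator's real interface):

```python
# In evolve_skill.py, after constraint validation: hypothetical wiring.
if config.run_pytest:
    result = validator.run_test_suite()
    if not result.passed:  # `passed` / `failures` attributes are assumed
        raise RuntimeError(f"pytest gate failed: {result.failures}")
```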
C3. No Benchmark Gate — TBLite Regression Not Implemented
Severity: Critical | Confirmed by: Tech Lead Review
benchmark_gate.py is listed in PLAN.md as a Phase 1 core infrastructure component but does not exist. Config fields run_tblite and tblite_regression_threshold in EvolutionConfig are defined but never consumed.
Impact: Phase 1 could deploy a skill that regresses TBLite by any amount. The pipeline would appear to work (scores improve, constraints pass) while producing a skill that performs worse on real benchmarks.
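A minimal sketch of the missing gate (run_tblite_benchmark() is a hypothetical stand-in for however the TBLite suite is actually invoked; config field names mirror EvolutionConfig):

```python
# benchmark_gate.py, sketch only
def benchmark_gate(baseline_score: float, config) -> None:
    """Reject an evolved skill that regresses TBLite beyond the threshold."""
    if not config.run_tblite:
        return
    evolved_score = run_tblite_benchmark()  # hypothetical helper
    regression = baseline_score - evolved_score
    if regression > config.tblite_regression_threshold:
        raise RuntimeError(
            f"TBLite regression {regression:.3f} exceeds threshold "
            f"{config.tblite_regression_threshold}; rejecting skill"
        )
```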
High Severity Issues
H1. create_pr: bool = True — Config Never Consumed
Severity: High | Confirmed by: Tech Lead Review + Audit
create_pr is defined in EvolutionConfig (config.py:44) but referenced exactly once — only its definition. PLAN.md shows PR output as the end of the optimization loop, and Phase 1's "Validate" stage explicitly says "Create PRs for improvements that pass all gates." The output is saved locally but no PR is ever created.
Impact: Phase 1 cannot actually deploy improvements. The end-to-end pipeline is broken at the final step.
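One plausible way to consume the flag, shelling out to the GitHub CLI (branch and title conventions here are illustrative, not the project's); it would run only when config.create_pr is set:

```python
import subprocess

def create_pull_request(branch: str, title: str) -> None:
    """Push the evolved skill branch and open a PR via the gh CLI."""
    subprocess.run(["git", "push", "origin", branch], check=True)
    subprocess.run(
        ["gh", "pr", "create", "--title", title,
         "--body", "Automated skill evolution; all validation gates passed."],
        check=True,
    )
```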
H2. GEPA Fallback to MIPROv2 Triggers Silently
Severity: High | Found by: Tech Lead Review
If dspy.GEPA is unavailable in the installed DSPy version, the fallback to MIPROv2 triggers without any log message. The user gets MIPROv2 results unaware that GEPA failed.
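The fix is a one-line log at the fallback point; a sketch, assuming the current selection is a bare attribute check:

```python
import logging

import dspy

logger = logging.getLogger(__name__)

try:
    optimizer_cls = dspy.GEPA
except AttributeError:
    logger.warning(
        "dspy.GEPA not available in this DSPy version; falling back to MIPROv2"
    )
    optimizer_cls = dspy.MIPROv2
```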
H3. Evolution Proceeds Despite Baseline Constraint Violations
Severity: High | Found by: Audit
evolve_skill.py:131-132 logs a warning but proceeds with optimization even when the baseline skill fails constraint validation. This makes the improvement metrics unreliable: the evolved output is compared against a malformed baseline.
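A sketch of failing fast instead (the validator and skill names are illustrative, mirroring the warning path at evolve_skill.py:131):

```python
# Hypothetical replacement for the warn-and-continue branch.
violations = constraint_validator.validate(baseline_skill)
if violations:
    raise ValueError(
        f"Baseline fails {len(violations)} constraint(s); "
        "refusing to evolve from a malformed baseline"
    )
```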
Medium / Low Severity Issues
M1. SECRET_PATTERNS Regex False Negatives
Short API keys (sk-abc123), Bearer tokens shorter than 20 characters, and password == "..." style assignments and comparisons all bypass detection.
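Illustrative tightened patterns that close these specific gaps (the real SECRET_PATTERNS list is larger; these are not a drop-in replacement):

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{6,}"),                  # catches short keys like sk-abc123
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{8,}"),        # tokens shorter than 20 chars
    re.compile(r"password\s*==?\s*[\"'][^\"']+[\"']"),  # assignment and comparison forms
]
```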
M2. Empty Holdout Set Produces 0.000 Scores Silently
If the data split produces zero holdout examples, both score lists are empty but no warning is emitted.
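A guard at the split point would make the failure loud (the variable name is illustrative):

```python
if not holdout_examples:  # hypothetical name for the holdout split
    raise ValueError(
        "Holdout split is empty; every score would be a vacuous 0.000. "
        "Increase dataset size or adjust split fractions."
    )
```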
M3. DSPy 3.x API Instability
No version pins — dspy>=3.0.0 could auto-upgrade to a breaking version. GEPA and MIPROv2 are experimental in DSPy 3.x.
M4. reportlab Not in Dependencies
generate_report.py imports reportlab but it is absent from pyproject.toml. PDF generation fails at runtime.
M5. No End-to-End Integration Test
Test suite covers individual components (constraints, skill_module, external_importers) but no test runs the full evolve() pipeline against a real skill file.
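A sketch of the missing test (the evolve() signature, result shape, and import path are all assumptions to verify):

```python
import pytest

@pytest.mark.slow  # exercises the full optimization loop
def test_evolve_pipeline_end_to_end(tmp_path):
    from evolution.core.evolve_skill import evolve  # assumed location

    skill_file = tmp_path / "toy_skill.md"
    skill_file.write_text("# Toy skill\nAlways answer in one sentence.\n")

    result = evolve(skill_path=str(skill_file), output_dir=str(tmp_path))

    assert result.evolved_skill_path.exists()
    assert result.holdout_score >= result.baseline_score
```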
M6. Dataset Splitting Logic Duplicated 3 Times
The max(1, int(len * fraction)) 50/25/25 split pattern appears in SyntheticDatasetBuilder.generate(), GoldenDatasetLoader.load(), and build_dataset_from_external(); a shared helper is sketched below.
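A single shared helper would eliminate the duplication, using the same 50/25/25 fractions:

```python
def split_dataset(examples, fractions=(0.50, 0.25, 0.25)):
    """Split into train/val/holdout, guaranteeing at least one train/val example."""
    n = len(examples)
    n_train = max(1, int(n * fractions[0]))
    n_val = max(1, int(n * fractions[1]))
    # Note: the holdout slice can still be empty for tiny datasets (see M2).
    return (
        examples[:n_train],
        examples[n_train:n_train + n_val],
        examples[n_train + n_val:],
    )
```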
Summary
| Severity | Count | Key Issues |
| --- | --- | --- |
| Critical | 3 | Keyword fitness, dead pytest flag, no benchmark gate |
| High | 3 | Dead PR creation, silent GEPA fallback, baseline violations |
| Medium | 4 | Regex gaps, empty holdout, DSPy version, missing deps |
| Low | 2 | Missing e2e test, duplicated split logic |
SE Review verdict: Critical blockers for Phase 1 dogfooding — the system optimizes for keyword overlap, not task competence.
Tech Lead verdict: Phase 1 has the right architecture but 3 broken critical-path links that would allow fake improvements to ship to production.
Audit performed by Hermes Agent; independent reviews by a Staff Engineer (SE) and the project's Tech Lead.