Code Audit: Phase 1 Critical Issues — Fitness Signal, Dead Config, Missing Benchmark Gate #33

@ether-btc

Description

Code Audit Report: hermes-agent-self-evolution

Audited by: Hermes Agent (NousResearch)
Scope: Phase 1 implementation — evolution/skills/, evolution/core/, tests/
Files reviewed: 21 Python files, PLAN.md, README.md, pyproject.toml


Critical Issues (Must Fix Before Phase 1 Production Use)

C1. skill_fitness_metric — Keyword Overlap Used Instead of LLM Judge

Severity: Critical | Confirmed by: SE Review + Tech Lead Review

The evolve() function passes skill_fitness_metric (keyword-overlap heuristic) to dspy.GEPA.compile(). The LLMJudge class is imported but never instantiated — it is dead code.

Impact: GEPA/MIPROv2 optimize for word overlap, not semantic quality. A skill that spam-adds expected keywords would score 0.9+ while being functionally broken. The evolved skill is never evaluated for correctness, procedure_following, or conciseness.

# fitness.py:107 — the actual optimization objective:
score = 0.3 + (0.7 * overlap)  # Keyword overlap. That's it.

The LLMJudge class exists and is well-designed — it should be used for val-set evaluation during optimization.

Tech Lead note: This will produce fake improvements that appear rigorous (metrics, constraints, holdout sets) while actually converging on keyword density, not task competence.
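A minimal sketch of what routing fitness through the existing LLMJudge could look like. JudgeScores, the weight values, and judged_fitness are hypothetical illustrations, not the actual fitness.py API; the rubric dimensions are the ones named above.

```python
from dataclasses import dataclass

@dataclass
class JudgeScores:
    """Stand-in for the rubric the existing LLMJudge already evaluates."""
    correctness: float
    procedure_following: float
    conciseness: float

def judged_fitness(scores: JudgeScores,
                   weights: tuple = (0.6, 0.3, 0.1)) -> float:
    # Weighted rubric score in [0, 1], replacing `0.3 + 0.7 * overlap`.
    dims = (scores.correctness, scores.procedure_following, scores.conciseness)
    return sum(w * d for w, d in zip(weights, dims))
```

Passing a metric shaped like this to the optimizer would make GEPA/MIPROv2 converge on judged quality rather than keyword density.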


C2. run_pytest Config is Dead — --run-tests is a No-Op

Severity: Critical | Found by: Tech Lead Review

In evolve_skill.py:54, run_pytest=run_tests is stored in EvolutionConfig, but validator.run_test_suite() is never called anywhere in the codebase.

Impact: The --run-tests CLI flag does nothing. Constraint validation runs but pytest execution is skipped. A skill with test failures would pass through the pipeline as if tests passed.
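One way to make the flag real: a generic gate that shells out to the test command and propagates the exit code. The wiring shown in the comment, and the exact shape of validator.run_test_suite(), are assumptions.

```python
import subprocess

def run_test_suite(cmd: list) -> bool:
    """Run a test command (e.g. ["pytest", "tests/", "-q"]) and report pass/fail."""
    result = subprocess.run(cmd, capture_output=True)
    return result.returncode == 0

# Hypothetical wiring inside evolve():
# if config.run_pytest and not run_test_suite(["pytest", "tests/", "-q"]):
#     raise RuntimeError("Test suite failed; aborting evolution")
```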


C3. No Benchmark Gate — TBLite Regression Not Implemented

Severity: Critical | Confirmed by: Tech Lead Review

benchmark_gate.py is listed in PLAN.md as a Phase 1 core infrastructure component but does not exist. Config fields run_tblite and tblite_regression_threshold in EvolutionConfig are defined but never consumed.

Impact: Phase 1 could deploy a skill that regresses TBLite by any amount. The pipeline would appear to work (scores improve, constraints pass) while producing a skill that performs worse on real benchmarks.
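A sketch of the gate's core check, assuming higher benchmark scores are better. The function name mirrors the missing module; the 0.02 default is illustrative, not the actual tblite_regression_threshold value.

```python
def benchmark_gate(baseline_score: float, evolved_score: float,
                   regression_threshold: float = 0.02) -> bool:
    """Pass only if the evolved skill has not regressed past the threshold.

    Consumes the otherwise-dead tblite_regression_threshold config field.
    """
    return evolved_score >= baseline_score * (1.0 - regression_threshold)
```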


High Severity Issues

H1. create_pr: bool = True — Config Never Consumed

Severity: High | Confirmed by: Tech Lead Review + Audit

create_pr is defined in EvolutionConfig (config.py:44) but referenced exactly once — only its definition. PLAN.md shows PR output as the end of the optimization loop, and Phase 1's "Validate" stage explicitly says "Create PRs for improvements that pass all gates." The output is saved locally but no PR is ever created.

Impact: Phase 1 cannot actually deploy improvements. The end-to-end pipeline is broken at the final step.
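A sketch of the missing final step: assembling a `gh pr create` invocation that evolve() could run when config.create_pr is True. The branch-naming convention and helper name are made up for illustration.

```python
def build_pr_command(title: str, head_branch: str, body: str) -> list:
    """Assemble the GitHub CLI call for a skill-improvement PR."""
    return ["gh", "pr", "create",
            "--title", title,
            "--head", head_branch,
            "--body", body]
```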


H2. GEPA Fallback to MIPROv2 Triggers Silently

Severity: High | Found by: Tech Lead Review

If dspy.GEPA is unavailable in the installed DSPy version, the fallback to MIPROv2 triggers without any log message. The user gets MIPROv2 results unaware that GEPA failed.
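A minimal fix is to log the fallback loudly at selection time. This sketch takes the imported dspy module as a parameter so the decision is testable; the logger name is an assumption.

```python
import logging

logger = logging.getLogger("evolution")

def pick_optimizer(dspy_module) -> str:
    """Return the optimizer name, warning when GEPA is missing.

    `dspy_module` stands in for the imported dspy package.
    """
    if hasattr(dspy_module, "GEPA"):
        return "GEPA"
    logger.warning("dspy.GEPA unavailable in this DSPy version; "
                   "falling back to MIPROv2")
    return "MIPROv2"
```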


H3. Evolution Proceeds Despite Baseline Constraint Violations

Severity: High | Found by: Audit

evolve_skill.py:131-132 logs a warning but proceeds with optimization even when the baseline skill fails constraint validation. This makes improvement metrics unreliable: the evolved output is being compared against a malformed baseline.
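A hypothetical fail-fast guard that would replace the warning, aborting before optimization starts. The function name and strict flag are illustrative.

```python
def require_clean_baseline(violations: list, strict: bool = True) -> None:
    """Refuse to optimize against a baseline that fails constraint validation."""
    if violations and strict:
        raise ValueError(
            f"Baseline fails {len(violations)} constraint(s): {violations}; "
            "refusing to optimize against a malformed baseline")
```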


Medium / Low Severity Issues

M1. SECRET_PATTERNS Regex False Negatives

Short API keys (sk-abc123), Bearer tokens shorter than 20 characters, and password == "..."-style comparisons all bypass detection.
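Illustrative patterns that would catch the listed false negatives. The real SECRET_PATTERNS list and its length thresholds live in the audited codebase; these regexes are a sketch, not the fix as shipped.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{6,}"),                  # catch short API keys too
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{8,}"),        # tokens under 20 chars
    re.compile(r"password\s*==?\s*[\"'][^\"']+[\"']"),  # comparison or assignment
]

def contains_secret(text: str) -> bool:
    return any(p.search(text) for p in SECRET_PATTERNS)
```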

M2. Empty Holdout Set Produces 0.000 Scores Silently

If the data split produces zero holdout examples, both score lists are empty but no warning is emitted.
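A guarded scorer that surfaces the empty-split condition instead of silently reporting 0.000. The helper name and raise-vs-warn choice are assumptions.

```python
def mean_holdout_score(holdout: list, metric) -> float:
    """Score the holdout set, refusing to report a score for zero examples."""
    if not holdout:
        raise ValueError("holdout set is empty: the data split produced "
                         "zero examples, so any score would be meaningless")
    return sum(metric(ex) for ex in holdout) / len(holdout)
```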

M3. DSPy 3.x API Instability

No version pins — dspy>=3.0.0 could auto-upgrade to a breaking version. GEPA and MIPROv2 are experimental in DSPy 3.x.
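A pinned range in pyproject.toml could look like the following; the `<3.1` upper bound is illustrative, chosen to block silent minor-version upgrades while GEPA and MIPROv2 remain experimental.

```toml
[project]
dependencies = [
    "dspy>=3.0.0,<3.1",  # pin the minor version until GEPA/MIPROv2 stabilize
]
```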

M4. reportlab Not in Dependencies

generate_report.py imports reportlab but it is absent from pyproject.toml. PDF generation fails at runtime.

M5. No End-to-End Integration Test

Test suite covers individual components (constraints, skill_module, external_importers) but no test runs the full evolve() pipeline against a real skill file.

M6. Dataset Splitting Logic Duplicated 3 Times

The max(1, int(len * fraction)) 50/25/25 split pattern appears in SyntheticDatasetBuilder.generate(), GoldenDatasetLoader.load(), and build_dataset_from_external().
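The duplicated logic could be consolidated into one helper, sketched here. The function name and signature are assumptions; the split arithmetic mirrors the pattern quoted above.

```python
def split_dataset(examples: list, fractions=(0.5, 0.25, 0.25)):
    """Single home for the 50/25/25 train/val/holdout split."""
    n = len(examples)
    n_train = max(1, int(n * fractions[0]))
    n_val = max(1, int(n * fractions[1]))
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    holdout = examples[n_train + n_val:]
    return train, val, holdout
```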


Summary

Severity | Count | Key Issues
Critical | 3     | Keyword fitness, dead pytest flag, no benchmark gate
High     | 3     | Dead PR creation, silent GEPA fallback, baseline violations
Medium   | 4     | Regex gaps, empty holdout, DSPy version, missing deps

SE Review verdict: Critical blockers for Phase 1 dogfooding — the system optimizes for keyword overlap, not task competence.

Tech Lead verdict: Phase 1 has the right architecture but 3 broken critical-path links that would allow fake improvements to ship to production.


Audit performed by Hermes Agent. SE review by Staff Engineer. Tech Lead review by Tech Lead.
