Code Audit Report: hermes-agent-self-evolution
Audited by: Hermes Agent (NousResearch)
Scope: Phase 1 implementation — evolution/skills/, evolution/core/, tests/
Files reviewed: 21 Python files, PLAN.md, README.md, pyproject.toml
Critical Issues (Must Fix Before Phase 1 Production Use)
C1. skill_fitness_metric — Keyword Overlap Used Instead of LLM Judge
Severity: Critical | Confirmed by: SE Review + Tech Lead Review
The evolve() function passes skill_fitness_metric (keyword-overlap heuristic) to dspy.GEPA.compile(). The LLMJudge class is imported but never instantiated — it is dead code.
Impact: GEPA/MIPROv2 optimize for word overlap, not semantic quality. A skill that spam-adds expected keywords would score 0.9+ while being functionally broken. The evolved skill is never evaluated for correctness, procedure_following, or conciseness.
```python
# fitness.py:107 — the actual optimization objective:
score = 0.3 + (0.7 * overlap)  # Keyword overlap. That's it.
```
The LLMJudge class exists and is well-designed — it should be used for val-set evaluation during optimization.
Tech Lead note: This will produce fake improvements that appear rigorous (metrics, constraints, holdout sets) while actually converging on keyword density, not task competence.
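A minimal sketch of the intended wiring, assuming LLMJudge lives in fitness.py and exposes a score() method (both unverified; the exact metric signature GEPA expects in the installed DSPy version may also take extra feedback arguments):

```python
from evolution.skills.fitness import LLMJudge  # assumed import path

judge = LLMJudge()  # currently imported but never instantiated

def skill_fitness_metric(example, prediction, trace=None):
    """Score semantic quality with the LLM judge instead of keyword overlap."""
    # judge.score() is a hypothetical method name; adapt to the real interface.
    return judge.score(example, prediction)
```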
C2. run_pytest Config is Dead — --run-tests is a No-Op
Severity: Critical | Found by: Tech Lead Review
In evolve_skill.py:54, run_pytest=run_tests is stored in EvolutionConfig, but validator.run_test_suite() is never called anywhere in the codebase.
Impact: The --run-tests CLI flag does nothing. Constraint validation runs but pytest execution is skipped. A skill with test failures would pass through the pipeline as if tests passed.
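A sketch of the missing call site (the return shape of run_test_suite() is an assumption; adapt to the validator's real interface):

```python
# In evolve_skill.py, after constraint validation: hypothetical wiring.
if config.run_pytest:
    result = validator.run_test_suite()
    if not result.passed:  # `passed` / `failures` attributes are assumed
        raise RuntimeError(f"pytest gate failed: {result.failures}")
```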
C3. No Benchmark Gate — TBLite Regression Not Implemented
Severity: Critical | Confirmed by: Tech Lead Review
benchmark_gate.py is listed in PLAN.md as a Phase 1 core infrastructure component but does not exist. Config fields run_tblite and tblite_regression_threshold in EvolutionConfig are defined but never consumed.
Impact: Phase 1 could deploy a skill that regresses TBLite by any amount. The pipeline would appear to work (scores improve, constraints pass) while producing a skill that performs worse on real benchmarks.
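A minimal sketch of the missing gate (run_tblite_benchmark() is a hypothetical stand-in for however the TBLite suite is actually invoked; config field names mirror EvolutionConfig):

```python
# benchmark_gate.py, sketch only
def benchmark_gate(baseline_score: float, config) -> None:
    """Reject an evolved skill that regresses TBLite beyond the threshold."""
    if not config.run_tblite:
        return
    evolved_score = run_tblite_benchmark()  # hypothetical helper
    regression = baseline_score - evolved_score
    if regression > config.tblite_regression_threshold:
        raise RuntimeError(
            f"TBLite regression {regression:.3f} exceeds threshold "
            f"{config.tblite_regression_threshold}; rejecting skill"
        )
```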
High Severity Issues
H1. create_pr: bool = True — Config Never Consumed
Severity: High | Confirmed by: Tech Lead Review + Audit
create_pr is defined in EvolutionConfig (config.py:44) but referenced exactly once — only its definition. PLAN.md shows PR output as the end of the optimization loop, and Phase 1's "Validate" stage explicitly says "Create PRs for improvements that pass all gates." The output is saved locally but no PR is ever created.
Impact: Phase 1 cannot actually deploy improvements. The end-to-end pipeline is broken at the final step.
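One plausible way to consume the flag, shelling out to the GitHub CLI (branch and title conventions here are illustrative, not the project's); it would run only when config.create_pr is set:

```python
import subprocess

def create_pull_request(branch: str, title: str) -> None:
    """Push the evolved skill branch and open a PR via the gh CLI."""
    subprocess.run(["git", "push", "origin", branch], check=True)
    subprocess.run(
        ["gh", "pr", "create", "--title", title,
         "--body", "Automated skill evolution; all validation gates passed."],
        check=True,
    )
```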
H2. GEPA Fallback to MIPROv2 Triggers Silently
Severity: High | Found by: Tech Lead Review
If dspy.GEPA is unavailable in the installed DSPy version, the fallback to MIPROv2 triggers without any log message. The user gets MIPROv2 results unaware that GEPA failed.
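The fix is a one-line log at the fallback point; a sketch, assuming the current selection is a bare attribute check:

```python
import logging

import dspy

logger = logging.getLogger(__name__)

try:
    optimizer_cls = dspy.GEPA
except AttributeError:
    logger.warning(
        "dspy.GEPA not available in this DSPy version; falling back to MIPROv2"
    )
    optimizer_cls = dspy.MIPROv2
```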
H3. Evolution Proceeds Despite Baseline Constraint Violations
Severity: High | Found by: Audit
evolve_skill.py:131-132 logs a warning but proceeds with optimization even when the baseline skill fails constraint validation. This makes the improvement metrics unreliable: the evolved output is compared against a malformed baseline.
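A sketch of failing fast instead (the validator and skill names are illustrative, mirroring the warning path at evolve_skill.py:131):

```python
# Hypothetical replacement for the warn-and-continue branch.
violations = constraint_validator.validate(baseline_skill)
if violations:
    raise ValueError(
        f"Baseline fails {len(violations)} constraint(s); "
        "refusing to evolve from a malformed baseline"
    )
```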
Medium / Low Severity Issues
M1. SECRET_PATTERNS Regex False Negatives
Short API keys (sk-abc123), Bearer tokens shorter than 20 characters, and password == "..." style assignments and comparisons all bypass detection.
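Illustrative tightened patterns that close these specific gaps (the real SECRET_PATTERNS list is larger; these are not a drop-in replacement):

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{6,}"),                  # catches short keys like sk-abc123
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]{8,}"),        # tokens shorter than 20 chars
    re.compile(r"password\s*==?\s*[\"'][^\"']+[\"']"),  # assignment and comparison forms
]
```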
M2. Empty Holdout Set Produces 0.000 Scores Silently
If the data split produces zero holdout examples, both score lists are empty but no warning is emitted.
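A guard at the split point would make the failure loud (the variable name is illustrative):

```python
if not holdout_examples:  # hypothetical name for the holdout split
    raise ValueError(
        "Holdout split is empty; every score would be a vacuous 0.000. "
        "Increase dataset size or adjust split fractions."
    )
```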
M3. DSPy 3.x API Instability
No version pins — dspy>=3.0.0 could auto-upgrade to a breaking version. GEPA and MIPROv2 are experimental in DSPy 3.x.
M4. reportlab Not in Dependencies
generate_report.py imports reportlab but it is absent from pyproject.toml. PDF generation fails at runtime.
M5. No End-to-End Integration Test
Test suite covers individual components (constraints, skill_module, external_importers) but no test runs the full evolve() pipeline against a real skill file.
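A sketch of the missing test (the evolve() signature, result shape, and import path are all assumptions to verify):

```python
import pytest

@pytest.mark.slow  # exercises the full optimization loop
def test_evolve_pipeline_end_to_end(tmp_path):
    from evolution.core.evolve_skill import evolve  # assumed location

    skill_file = tmp_path / "toy_skill.md"
    skill_file.write_text("# Toy skill\nAlways answer in one sentence.\n")

    result = evolve(skill_path=str(skill_file), output_dir=str(tmp_path))

    assert result.evolved_skill_path.exists()
    assert result.holdout_score >= result.baseline_score
```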
M6. Dataset Splitting Logic Duplicated 3 Times
The max(1, int(len * fraction)) 50/25/25 split pattern appears in SyntheticDatasetBuilder.generate(), GoldenDatasetLoader.load(), and build_dataset_from_external(); a shared helper is sketched below.
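A single shared helper would eliminate the duplication, using the same 50/25/25 fractions:

```python
def split_dataset(examples, fractions=(0.50, 0.25, 0.25)):
    """Split into train/val/holdout, guaranteeing at least one train/val example."""
    n = len(examples)
    n_train = max(1, int(n * fractions[0]))
    n_val = max(1, int(n * fractions[1]))
    # Note: the holdout slice can still be empty for tiny datasets (see M2).
    return (
        examples[:n_train],
        examples[n_train:n_train + n_val],
        examples[n_train + n_val:],
    )
```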
Summary
| Severity | Count | Key Issues |
| --- | --- | --- |
| Critical | 3 | Keyword fitness, dead pytest flag, no benchmark gate |
| High | 3 | Dead PR creation, silent GEPA fallback, baseline violations |
| Medium | 4 | Regex gaps, empty holdout, DSPy version, missing deps |
| Low | 2 | Missing e2e test, duplicated split logic |
SE Review verdict: Critical blockers for Phase 1 dogfooding — the system optimizes for keyword overlap, not task competence.
Tech Lead verdict: Phase 1 has the right architecture but 3 broken critical-path links that would allow fake improvements to ship to production.
Audit performed by Hermes Agent; independent reviews by a Staff Engineer (SE) and the project's Tech Lead.