Objective
Implement the next usable slice of hermes-agent-self-evolution so it can mine evaluation examples from all supported local agent/session sources, run a bounded evolution pass, and produce reviewable outputs that are safe to promote into hermes-agent.
This is intended as a concrete implementation track, not a redesign of the existing PLAN.md.
Current repo evidence
From main at the time of filing:
- PLAN.md describes evolution/core/benchmark_gate.py and evolution/core/pr_builder.py, but those files are not present in the current tree.
- Phase 1 skill evolution exists via evolution/skills/evolve_skill.py and evolution/skills/skill_module.py.
- Session/history ingestion exists in evolution/core/external_importers.py for Claude Code, GitHub Copilot, and Hermes Agent.
- Dataset abstractions exist in evolution/core/dataset_builder.py with train / val / holdout splits.
- Constraints exist in evolution/core/constraints.py, but promotion gating is still mostly local and not yet tied to benchmark/PR artifacts.
Proposed implementation plan
1. Normalize ingestion across all agent/session sources
Add a stable ingestion boundary around the current importers:
- Keep current sources: Claude Code, GitHub Copilot, Hermes Agent.
- Define one canonical event/message schema before converting to EvalExample (a minimal sketch follows this list).
- Preserve source metadata needed for audit/debugging: source, project/repo, session id, timestamp, message role, and extraction reason.
- Keep secret filtering mandatory before persistence.
- Make source availability explicit in dry-run output rather than silently returning empty datasets.
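A minimal sketch of what that canonical schema could look like; the SessionEvent name and every field below are illustrative assumptions, not existing repo code:

```python
# Sketch only: a possible canonical event schema for the ingestion boundary.
# The name SessionEvent and all fields here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SessionEvent:
    source: str             # "claude_code" | "github_copilot" | "hermes_agent"
    project: str            # project/repo the session belongs to
    session_id: str
    timestamp: str          # ISO-8601, as recorded by the source
    role: str               # "user" | "assistant" | "tool"
    content: str            # must already be secret-filtered before persistence
    extraction_reason: str  # why this event was selected as a candidate
    extra: dict = field(default_factory=dict)  # source-specific audit metadata
```

Having every importer emit this one shape first keeps the EvalExample conversion and mixed-source deduplication logic source-agnostic.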
Acceptance criteria:
- python -m evolution.core.external_importers --source all --skill <skill> --dry-run reports per-source availability and candidate counts.
- Generated JSONL examples contain no raw secrets and include source metadata.
- Unit tests cover unavailable source paths, malformed JSONL, secret filtering, short/irrelevant prompts, and mixed-source deduplication.
2. Add promotion artifacts before any PR automation
Implement a report artifact for each evolution run before opening/creating PRs (one possible shape is sketched after this list):
- baseline artifact hash and size
- optimized artifact hash and size
- dataset source and split counts
- optimizer/eval model names
- constraint results
- holdout score delta
- cost/latency estimate if available
- full diff or path to diff
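One possible shape for that report, written as the dict a run would serialize; every field name and value below is an assumption for illustration, not an existing schema:

```python
# Sketch only: an illustrative run-report payload; field names are assumptions.
import json
import time
from pathlib import Path

report = {
    "target": "github-code-review",
    "baseline": {"sha256": "<hash>", "size_bytes": 4096},
    "optimized": {"sha256": "<hash>", "size_bytes": 4352},
    "dataset": {"source": "sessiondb", "train": 80, "val": 10, "holdout": 10},
    "models": {"optimizer": "<model>", "eval": "<model>"},
    "constraints": {"passed": True, "details": []},
    "holdout_delta": 0.04,
    "cost_estimate_usd": None,  # optional, only when available
    "diff_path": "reports/runs/<run>/diff.patch",
}

run_id = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
out = Path(f"reports/runs/{run_id}-{report['target']}.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(report, indent=2))
print(out)  # the CLI prints the report path at the end of the run
```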
Acceptance criteria:
- Every non-dry-run invocation writes a machine-readable run report, e.g. reports/runs/<timestamp>-<target>.json.
- The CLI prints the report path at the end of the run.
- Report generation is tested without requiring live LLM calls.
3. Implement benchmark_gate.py
Add the missing benchmark gate described by PLAN.md as a conservative first version (sketched after this list):
- Accept a run report plus optional benchmark command(s).
- Fail closed when required benchmark data is absent.
- Support thresholds such as minimum holdout improvement, maximum cost increase, maximum artifact growth, and mandatory constraint pass.
- Return a structured pass/fail result for PR body generation.
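A fail-closed version could look like the following; the threshold names and defaults are assumptions for illustration, not an agreed interface:

```python
# Sketch only: a conservative, fail-closed gate over a run report.
# Threshold names and defaults are illustrative assumptions.
import json
import sys

DEFAULTS = {
    "min_holdout_delta": 0.0,    # optimized must not regress on holdout
    "max_artifact_growth": 1.5,  # optimized size / baseline size
}

def evaluate_gate(report: dict, thresholds: dict = DEFAULTS) -> dict:
    failures = []
    delta = report.get("holdout_delta")
    if delta is None:
        failures.append("missing holdout_delta (fail closed)")
    elif delta < thresholds["min_holdout_delta"]:
        failures.append(f"holdout delta {delta} below minimum")
    if not report.get("constraints", {}).get("passed", False):
        failures.append("constraints missing or failed (fail closed)")
    base = report.get("baseline", {}).get("size_bytes") or 0
    opt = report.get("optimized", {}).get("size_bytes") or 0
    if base and opt / base > thresholds["max_artifact_growth"]:
        failures.append("artifact growth exceeds maximum")
    return {"passed": not failures, "failures": failures}

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        result = evaluate_gate(json.load(f))
    print(json.dumps(result, indent=2))
    sys.exit(0 if result["passed"] else 1)  # non-zero exit when any gate fails
```

The structured result dict is what pr_builder would consume for PR body generation.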
Acceptance criteria:
- python -m evolution.core.benchmark_gate --report <report.json> exits non-zero when gates fail.
- Tests cover pass, fail, missing report fields, and threshold override behavior.
4. Implement pr_builder.py as local-first PR preparation
Start with local branch/commit/PR-body preparation, not automatic upstream mutation by default (see the sketch after this list):
- Generate a PR body from the run report.
- Include summary, before/after metrics, constraints, benchmark gate result, risk notes, rollback command, and test plan.
- Provide --dry-run and --no-push as defaults.
- Only push/open a PR behind an explicit flag.
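A sketch of the deterministic rendering step, assuming the report and gate shapes from the earlier sketches; the section layout is illustrative:

```python
# Sketch only: deterministic PR title/body from a run report plus gate result,
# assuming the shapes sketched in steps 2 and 3.
def render_pr_body(report: dict, gate: dict) -> tuple[str, str]:
    title = (
        f"evolution: update {report['target']} "
        f"(holdout {report['holdout_delta']:+.3f})"
    )
    body = "\n".join([
        "## Summary",
        f"Automated evolution run for {report['target']}.",
        "## Before/after metrics",
        f"- holdout delta: {report['holdout_delta']:+.3f}",
        f"- artifact size: {report['baseline']['size_bytes']} -> "
        f"{report['optimized']['size_bytes']} bytes",
        "## Constraints and benchmark gate",
        f"- constraints passed: {report['constraints']['passed']}",
        f"- gate passed: {gate['passed']}",
        "## Risk notes and rollback",
        "- rollback: revert the merge commit; no other state is mutated",
        "## Test plan",
        "- re-run the holdout evaluation against the baseline artifact",
    ])
    return title, body
```

Because the function is pure over its inputs, the --dry-run output stays deterministic and easy to snapshot-test.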
Acceptance criteria:
- python -m evolution.core.pr_builder --report <report.json> --dry-run prints a deterministic PR title/body.
- Tests verify that no git remote mutation occurs unless an explicit push/open flag is provided.
5. Wire the flow into evolve_skill.py
After optimization and constraint validation (wiring sketched after this list):
- evaluate baseline vs optimized on holdout
- write run report
- run benchmark gate if configured
- optionally prepare PR body/branch
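How the tail of evolve_skill could chain these pieces; the three imports below are hypothetical module paths matching the earlier sketches, not existing code:

```python
# Sketch only: wiring the post-optimization flow. All three imports are
# hypothetical, matching the sketches in steps 2-4 rather than real modules.
from evolution.core.run_report import write_report        # hypothetical
from evolution.core.benchmark_gate import evaluate_gate   # step 3 sketch
from evolution.core.pr_builder import render_pr_body      # step 4 sketch

def finish_run(report_inputs: dict, args) -> int:
    report = write_report(**report_inputs)       # always written on non-dry runs
    print(f"report: {report['path']}")           # CLI prints the report path
    gate = {"passed": True, "failures": []}      # gate is optional
    if args.benchmark_gate:
        gate = evaluate_gate(report)             # fail closed on missing data
    if args.prepare_pr and gate["passed"]:
        title, _body = render_pr_body(report, gate)
        print(title)                             # local-first: never pushes here
    return 0 if gate["passed"] else 1
```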
Acceptance criteria:
- --dry-run remains non-mutating.
- Default non-dry-run writes local reports but does not push/open PRs unless explicitly requested.
- A minimal golden dataset path can run in CI without real local Claude/Copilot/Hermes histories.
Suggested command UX
python -m evolution.skills.evolve_skill \
--skill github-code-review \
--eval-source sessiondb \
--iterations 10 \
--write-report
python -m evolution.core.benchmark_gate \
--report reports/runs/<run>.json
python -m evolution.core.pr_builder \
--report reports/runs/<run>.json \
--dry-run
Safety constraints
- No direct mutation of a user’s real ~/.hermes files.
- No direct mutation of upstream hermes-agent without an explicit flag.
- No persistence of secrets, tokens, API keys, raw private session dumps, or full tool outputs into datasets/reports.
- Treat external session sources as local/private raw material; generated eval datasets should be sanitized and reviewable (a minimal redaction sketch follows this list).
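As one concrete example of that sanitization step, a minimal redaction pass could run before any persistence; the patterns below are illustrative and deliberately incomplete:

```python
# Sketch only: a minimal redaction pass applied before persistence.
# Patterns are illustrative assumptions; a real filter needs a broader rule set.
import re

REDACTIONS = [
    (re.compile(r"(?i)\b(api[_-]?key|token|secret)\b\s*[:=]\s*\S+"),
     r"\1=<REDACTED>"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "<REDACTED>"),   # common key shapes
    (re.compile(r"\bghp_[A-Za-z0-9]{20,}\b"), "<REDACTED>"),  # GitHub PATs
]

def redact(text: str) -> str:
    for pattern, repl in REDACTIONS:
        text = pattern.sub(repl, text)
    return text
```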
Why this slice first
It converts the repo from “Phase 1 prototype can run” into a repeatable, auditable improvement loop:
source sessions → sanitized eval examples → evolution run → constraints → benchmark gate → PR-ready artifact.
That is the smallest path to making self-evolution useful across agents and sessions while keeping review and promotion human-controlled.