Implement all-agent session ingestion and promotion gates #54

@rolldav

Description

Objective

Implement the next usable slice of hermes-agent-self-evolution so it can mine evaluation examples from all supported local agent/session sources, run a bounded evolution pass, and produce reviewable outputs that are safe to promote into hermes-agent.

This is intended as a concrete implementation track, not a redesign of the existing PLAN.md.

Current repo evidence

From main at the time of filing:

  • PLAN.md describes evolution/core/benchmark_gate.py and evolution/core/pr_builder.py, but those files are not present in the current tree.
  • Phase 1 skill evolution exists via evolution/skills/evolve_skill.py and evolution/skills/skill_module.py.
  • Session/history ingestion exists in evolution/core/external_importers.py for Claude Code, GitHub Copilot, and Hermes Agent.
  • Dataset abstractions exist in evolution/core/dataset_builder.py with train / val / holdout splits.
  • Constraints exist in evolution/core/constraints.py, but promotion gating is still mostly local and not yet tied to benchmark/PR artifacts.

Proposed implementation plan

1. Normalize all-agent/session ingestion

Add a stable ingestion boundary around the current importers:

  • Keep current sources: Claude Code, GitHub Copilot, Hermes Agent.
  • Define one canonical event/message schema before converting to EvalExample (see the sketch after this list).
  • Preserve source metadata needed for audit/debugging: source, project/repo, session id, timestamp, message role, and extraction reason.
  • Keep secret filtering mandatory before persistence.
  • Make source availability explicit in dry-run output rather than silently returning empty datasets.
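
A minimal sketch of what the canonical event record could look like, assuming a dataclass-based schema; every name below is illustrative rather than the repo's actual API:

```python
# Hypothetical canonical session event, produced by each importer before
# conversion to EvalExample. Fields mirror the audit metadata listed above.
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionEvent:
    source: str             # e.g. "claude-code", "copilot", "hermes-agent"
    project: str            # project/repo identifier for audit trails
    session_id: str
    timestamp: str          # ISO-8601 string, as recorded by the source
    role: str               # "user" | "assistant" | "tool"
    content: str            # must already be secret-filtered at this point
    extraction_reason: str  # why this event was selected as a candidate
```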

Acceptance criteria:

  • python -m evolution.core.external_importers --source all --skill <skill> --dry-run reports per-source availability and candidate counts.
  • Generated JSONL examples contain no raw secrets and include source metadata.
  • Unit tests cover unavailable source paths, malformed JSONL, secret filtering, short/irrelevant prompts, and mixed-source deduplication (see the test sketch below).
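
A sketch of the intended test style, with minimal stand-in implementations included so the example runs under pytest on its own; filter_secrets and load_session_events are assumed names, not existing functions in the repo:

```python
# Sketch of ingestion-boundary tests; the two helpers are minimal stand-ins
# for the eventual importer API so this file runs under pytest as-is.
import json
import re
from pathlib import Path

def filter_secrets(text: str) -> str:
    # Stand-in: redact obvious API-key shapes before persistence.
    return re.sub(r"sk-[A-Za-z0-9]+", "[REDACTED]", text)

def load_session_events(path: Path):
    # Stand-in: skip malformed JSONL lines instead of failing the import.
    for line in path.read_text().splitlines():
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def test_malformed_jsonl_is_skipped(tmp_path):
    session = tmp_path / "session.jsonl"
    session.write_text('{"role": "user", "content": "hi"}\nnot-json\n')
    assert len(list(load_session_events(session))) == 1

def test_secret_filtering_redacts_keys():
    assert "sk-abc123" not in filter_secrets("key=sk-abc123")
```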

2. Add promotion artifacts before any PR automation

Implement a report artifact for each evolution run before any PR is opened or created (an illustrative example follows the field list):

  • baseline artifact hash and size
  • optimized artifact hash and size
  • dataset source and split counts
  • optimizer/eval model names
  • constraint results
  • holdout score delta
  • cost/latency estimate if available
  • full diff or path to diff
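
For illustration only, a run report covering those fields might look like this; field names and values are assumptions, not a fixed schema:

```json
{
  "target": "github-code-review",
  "timestamp": "2025-01-01T00:00:00Z",
  "baseline": {"sha256": "…", "bytes": 4096},
  "optimized": {"sha256": "…", "bytes": 4352},
  "dataset": {"source": "sessiondb", "train": 80, "val": 10, "holdout": 10},
  "models": {"optimizer": "…", "eval": "…"},
  "constraints": {"passed": true, "failures": []},
  "holdout_delta": 0.04,
  "cost_estimate_usd": 0.12,
  "diff_path": "reports/runs/…-github-code-review.diff"
}
```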

Acceptance criteria:

  • Every non-dry-run invocation writes a machine-readable run report, e.g. reports/runs/<timestamp>-<target>.json.
  • The CLI prints the report path at the end of the run.
  • Report generation is tested without requiring live LLM calls.

3. Implement benchmark_gate.py

Add the missing benchmark gate described in PLAN.md as a conservative first version:

  • Accept a run report plus optional benchmark command(s).
  • Fail closed when required benchmark data is absent.
  • Support thresholds such as minimum holdout improvement, maximum cost increase, maximum artifact growth, and mandatory constraint pass.
  • Return a structured pass/fail result for PR body generation (see the sketch after this list).
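
A minimal sketch of the fail-closed threshold logic, assuming the report shape illustrated in step 2; names, defaults, and the omitted cost check are placeholders rather than the final benchmark_gate.py interface:

```python
# Sketch of conservative, fail-closed gate evaluation over a run report.
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)

def evaluate_gate(report: dict,
                  min_holdout_delta: float = 0.0,
                  max_artifact_growth: float = 0.25) -> GateResult:
    reasons = []
    delta = report.get("holdout_delta")
    if delta is None:
        reasons.append("missing holdout_delta (failing closed)")
    elif delta < min_holdout_delta:
        reasons.append(f"holdout delta {delta} below {min_holdout_delta}")
    base = report.get("baseline", {}).get("bytes")
    opt = report.get("optimized", {}).get("bytes")
    if base is None or opt is None:
        reasons.append("missing artifact sizes (failing closed)")
    elif opt > base * (1 + max_artifact_growth):
        reasons.append(f"artifact grew from {base} to {opt} bytes")
    if not report.get("constraints", {}).get("passed", False):
        reasons.append("constraints did not pass or were not recorded")
    return GateResult(passed=not reasons, reasons=reasons)
```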

Acceptance criteria:

  • python -m evolution.core.benchmark_gate --report <report.json> exits non-zero when gates fail.
  • Tests cover pass, fail, missing report fields, and threshold override behavior.

4. Implement pr_builder.py as local-first PR preparation

Start with local branch/commit/PR-body preparation; automatic upstream mutation stays off by default:

  • Generate a PR body from the run report (see the sketch after this list).
  • Include summary, before/after metrics, constraints, benchmark gate result, risk notes, rollback command, and test plan.
  • Provide --dry-run and --no-push defaults.
  • Only push/open a PR behind an explicit flag.
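
A sketch of deterministic PR-body rendering under the same assumed report shape; the section layout and function name are illustrative:

```python
# Sketch: render a deterministic PR body from a run report dict. The exact
# sections would follow the acceptance criteria below; names are assumptions.
def build_pr_body(report: dict, gate_passed: bool) -> str:
    lines = [
        f"## Evolution run: {report['target']}",
        "",
        f"- Holdout delta: {report['holdout_delta']:+.3f}",
        f"- Constraints: {'pass' if report['constraints']['passed'] else 'fail'}",
        f"- Benchmark gate: {'PASS' if gate_passed else 'FAIL'}",
        "",
        "### Rollback",
        "Revert this PR, or restore the baseline artifact recorded in the report.",
    ]
    return "\n".join(lines)
```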

Acceptance criteria:

  • python -m evolution.core.pr_builder --report <report.json> --dry-run prints a deterministic PR title/body.
  • Tests verify that no git remote mutation occurs unless an explicit push/open flag is provided.

5. Wire the flow into evolve_skill.py

After optimization and constraint validation (a flow sketch follows these steps):

  1. evaluate baseline vs optimized on holdout
  2. write run report
  3. run benchmark gate if configured
  4. optionally prepare PR body/branch
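
A flow sketch of that wiring, with the four steps injected as callables so the sequence can be unit-tested with fakes instead of live LLM calls; every name is an assumption about the eventual module split:

```python
# Sketch of the post-optimization flow. The four callables correspond to
# steps 1-4 above and are injected so the flow is testable without LLMs.
def finalize_run(evaluate_holdout, write_run_report, run_benchmark_gate,
                 prepare_pr, config: dict) -> str:
    scores = evaluate_holdout()                  # 1. baseline vs optimized
    report_path = write_run_report(scores)       # 2. persist the run report
    if config.get("benchmark_gate"):
        gate = run_benchmark_gate(report_path)   # 3. gate if configured
        if not gate.passed:
            return report_path                   # stop before PR preparation
    if config.get("prepare_pr"):                 # 4. optional PR body/branch
        prepare_pr(report_path, dry_run=config.get("dry_run", True))
    return report_path
```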

Acceptance criteria:

  • --dry-run remains non-mutative.
  • Default non-dry-run writes local reports but does not push/open PRs unless explicitly requested.
  • A minimal golden dataset path can run in CI without real local Claude/Copilot/Hermes histories (see the fixture sketch below).
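
Such a golden dataset could be a tiny checked-in JSONL fixture shaped like the canonical events from step 1; both records below are invented for illustration:

```
{"source": "hermes-agent", "project": "example/repo", "session_id": "golden-1", "timestamp": "2025-01-01T00:00:00Z", "role": "user", "content": "Please review this diff.", "extraction_reason": "review-request"}
{"source": "hermes-agent", "project": "example/repo", "session_id": "golden-1", "timestamp": "2025-01-01T00:00:01Z", "role": "assistant", "content": "The change looks safe; consider adding a regression test.", "extraction_reason": "review-response"}
```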

Suggested command UX

```bash
python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --eval-source sessiondb \
  --iterations 10 \
  --write-report

python -m evolution.core.benchmark_gate \
  --report reports/runs/<run>.json

python -m evolution.core.pr_builder \
  --report reports/runs/<run>.json \
  --dry-run
```

Safety constraints

  • No direct mutation of a user’s real ~/.hermes files.
  • No direct mutation of upstream hermes-agent without an explicit flag.
  • No persistence of secrets, tokens, API keys, raw private session dumps, or full tool outputs into datasets/reports.
  • Treat external session sources as local/private raw material; generated eval datasets should be sanitized and reviewable.

Why this slice first

It converts the repo from “Phase 1 prototype can run” into a repeatable, auditable improvement loop:

source sessions → sanitized eval examples → evolution run → constraints → benchmark gate → PR-ready artifact.

That is the smallest path to making self-evolution useful across agents and sessions while keeping review and promotion human-controlled.
