Objective
Implement the next usable slice of hermes-agent-self-evolution so it can mine evaluation examples from all supported local agent/session sources, run a bounded evolution pass, and produce reviewable outputs that are safe to promote into hermes-agent.
This is intended as a concrete implementation track, not a redesign of the existing PLAN.md.
Current repo evidence
From main at the time of filing:
- PLAN.md describes evolution/core/benchmark_gate.py and evolution/core/pr_builder.py, but those files are not present in the current tree.
- Phase 1 skill evolution exists via evolution/skills/evolve_skill.py and evolution/skills/skill_module.py.
- Session/history ingestion exists in evolution/core/external_importers.py for Claude Code, GitHub Copilot, and Hermes Agent.
- Dataset abstractions exist in evolution/core/dataset_builder.py with train / val / holdout splits.
- Constraints exist in evolution/core/constraints.py, but promotion gating is still mostly local and not yet tied to benchmark/PR artifacts.
Proposed implementation plan
1. Normalize ingestion across all agent/session sources
Add a stable ingestion boundary around the current importers:
- Keep current sources: Claude Code, GitHub Copilot, Hermes Agent.
- Define one canonical event/message schema before converting to EvalExample (a minimal sketch follows this list).
- Preserve source metadata needed for audit/debugging: source, project/repo, session id, timestamp, message role, and extraction reason.
- Keep secret filtering mandatory before persistence.
- Make source availability explicit in dry-run output rather than silently returning empty datasets.
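A minimal sketch of what that canonical schema could look like; the SessionEvent name and every field below are illustrative assumptions, not existing repo code:

```python
# Sketch only: a possible canonical event schema for the ingestion boundary.
# The name SessionEvent and all fields here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SessionEvent:
    source: str             # "claude_code" | "github_copilot" | "hermes_agent"
    project: str            # project/repo the session belongs to
    session_id: str
    timestamp: str          # ISO-8601, as recorded by the source
    role: str               # "user" | "assistant" | "tool"
    content: str            # must already be secret-filtered before persistence
    extraction_reason: str  # why this event was selected as a candidate
    extra: dict = field(default_factory=dict)  # source-specific audit metadata
```

Having every importer emit this one shape first keeps the EvalExample conversion and mixed-source deduplication logic source-agnostic.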
Acceptance criteria:
- python -m evolution.core.external_importers --source all --skill <skill> --dry-run reports per-source availability and candidate counts.
- Generated JSONL examples contain no raw secrets and include source metadata.
- Unit tests cover unavailable source paths, malformed JSONL, secret filtering, short/irrelevant prompts, and mixed-source deduplication.
2. Add promotion artifacts before any PR automation
Implement a report artifact for each evolution run before opening/creating PRs (one possible shape is sketched after this list):
- baseline artifact hash and size
- optimized artifact hash and size
- dataset source and split counts
- optimizer/eval model names
- constraint results
- holdout score delta
- cost/latency estimate if available
- full diff or path to diff
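One possible shape for that report, written as the dict a run would serialize; every field name and value below is an assumption for illustration, not an existing schema:

```python
# Sketch only: an illustrative run-report payload; field names are assumptions.
import json
import time
from pathlib import Path

report = {
    "target": "github-code-review",
    "baseline": {"sha256": "<hash>", "size_bytes": 4096},
    "optimized": {"sha256": "<hash>", "size_bytes": 4352},
    "dataset": {"source": "sessiondb", "train": 80, "val": 10, "holdout": 10},
    "models": {"optimizer": "<model>", "eval": "<model>"},
    "constraints": {"passed": True, "details": []},
    "holdout_delta": 0.04,
    "cost_estimate_usd": None,  # optional, only when available
    "diff_path": "reports/runs/<run>/diff.patch",
}

run_id = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
out = Path(f"reports/runs/{run_id}-{report['target']}.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(report, indent=2))
print(out)  # the CLI prints the report path at the end of the run
```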
Acceptance criteria:
- Every non-dry-run invocation writes a machine-readable run report, e.g. reports/runs/<timestamp>-<target>.json.
- The CLI prints the report path at the end of the run.
- Report generation is tested without requiring live LLM calls.
3. Implement benchmark_gate.py
Add the missing benchmark gate described by PLAN.md as a conservative first version (sketched after this list):
- Accept a run report plus optional benchmark command(s).
- Fail closed when required benchmark data is absent.
- Support thresholds such as minimum holdout improvement, maximum cost increase, maximum artifact growth, and mandatory constraint pass.
- Return a structured pass/fail result for PR body generation.
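A fail-closed version could look like the following; the threshold names and defaults are assumptions for illustration, not an agreed interface:

```python
# Sketch only: a conservative, fail-closed gate over a run report.
# Threshold names and defaults are illustrative assumptions.
import json
import sys

DEFAULTS = {
    "min_holdout_delta": 0.0,    # optimized must not regress on holdout
    "max_artifact_growth": 1.5,  # optimized size / baseline size
}

def evaluate_gate(report: dict, thresholds: dict = DEFAULTS) -> dict:
    failures = []
    delta = report.get("holdout_delta")
    if delta is None:
        failures.append("missing holdout_delta (fail closed)")
    elif delta < thresholds["min_holdout_delta"]:
        failures.append(f"holdout delta {delta} below minimum")
    if not report.get("constraints", {}).get("passed", False):
        failures.append("constraints missing or failed (fail closed)")
    base = report.get("baseline", {}).get("size_bytes") or 0
    opt = report.get("optimized", {}).get("size_bytes") or 0
    if base and opt / base > thresholds["max_artifact_growth"]:
        failures.append("artifact growth exceeds maximum")
    return {"passed": not failures, "failures": failures}

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        result = evaluate_gate(json.load(f))
    print(json.dumps(result, indent=2))
    sys.exit(0 if result["passed"] else 1)  # non-zero exit when any gate fails
```

The structured result dict is what pr_builder would consume for PR body generation.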
Acceptance criteria:
- python -m evolution.core.benchmark_gate --report <report.json> exits non-zero when gates fail.
- Tests cover pass, fail, missing report fields, and threshold override behavior.
4. Implement pr_builder.py as local-first PR preparation
Start with local branch/commit/PR-body preparation, not automatic upstream mutation by default (see the sketch after this list):
- Generate a PR body from the run report.
- Include summary, before/after metrics, constraints, benchmark gate result, risk notes, rollback command, and test plan.
- Provide --dry-run and --no-push as defaults.
- Only push/open a PR behind an explicit flag.
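A sketch of the deterministic rendering step, assuming the report and gate shapes from the earlier sketches; the section layout is illustrative:

```python
# Sketch only: deterministic PR title/body from a run report plus gate result,
# assuming the shapes sketched in steps 2 and 3.
def render_pr_body(report: dict, gate: dict) -> tuple[str, str]:
    title = (
        f"evolution: update {report['target']} "
        f"(holdout {report['holdout_delta']:+.3f})"
    )
    body = "\n".join([
        "## Summary",
        f"Automated evolution run for {report['target']}.",
        "## Before/after metrics",
        f"- holdout delta: {report['holdout_delta']:+.3f}",
        f"- artifact size: {report['baseline']['size_bytes']} -> "
        f"{report['optimized']['size_bytes']} bytes",
        "## Constraints and benchmark gate",
        f"- constraints passed: {report['constraints']['passed']}",
        f"- gate passed: {gate['passed']}",
        "## Risk notes and rollback",
        "- rollback: revert the merge commit; no other state is mutated",
        "## Test plan",
        "- re-run the holdout evaluation against the baseline artifact",
    ])
    return title, body
```

Because the function is pure over its inputs, the --dry-run output stays deterministic and easy to snapshot-test.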
Acceptance criteria:
- python -m evolution.core.pr_builder --report <report.json> --dry-run prints a deterministic PR title/body.
- Tests verify that no git remote mutation occurs unless an explicit push/open flag is provided.
5. Wire the flow into evolve_skill.py
After optimization and constraint validation (wiring sketched after this list):
- evaluate baseline vs optimized on holdout
- write run report
- run benchmark gate if configured
- optionally prepare PR body/branch
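How the tail of evolve_skill could chain these pieces; the three imports below are hypothetical module paths matching the earlier sketches, not existing code:

```python
# Sketch only: wiring the post-optimization flow. All three imports are
# hypothetical, matching the sketches in steps 2-4 rather than real modules.
from evolution.core.run_report import write_report        # hypothetical
from evolution.core.benchmark_gate import evaluate_gate   # step 3 sketch
from evolution.core.pr_builder import render_pr_body      # step 4 sketch

def finish_run(report_inputs: dict, args) -> int:
    report = write_report(**report_inputs)       # always written on non-dry runs
    print(f"report: {report['path']}")           # CLI prints the report path
    gate = {"passed": True, "failures": []}      # gate is optional
    if args.benchmark_gate:
        gate = evaluate_gate(report)             # fail closed on missing data
    if args.prepare_pr and gate["passed"]:
        title, _body = render_pr_body(report, gate)
        print(title)                             # local-first: never pushes here
    return 0 if gate["passed"] else 1
```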
Acceptance criteria:
- --dry-run remains non-mutating.
- Default non-dry-run writes local reports but does not push/open PRs unless explicitly requested.
- A minimal golden dataset path can run in CI without real local Claude/Copilot/Hermes histories.
Suggested command UX
python -m evolution.skills.evolve_skill \
--skill github-code-review \
--eval-source sessiondb \
--iterations 10 \
--write-report
python -m evolution.core.benchmark_gate \
--report reports/runs/<run>.json
python -m evolution.core.pr_builder \
--report reports/runs/<run>.json \
--dry-run
Safety constraints
- No direct mutation of a user’s real ~/.hermes files.
- No direct mutation of upstream hermes-agent without an explicit flag.
- No persistence of secrets, tokens, API keys, raw private session dumps, or full tool outputs into datasets/reports.
- Treat external session sources as local/private raw material; generated eval datasets should be sanitized and reviewable (a minimal redaction sketch follows this list).
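As one concrete example of that sanitization step, a minimal redaction pass could run before any persistence; the patterns below are illustrative and deliberately incomplete:

```python
# Sketch only: a minimal redaction pass applied before persistence.
# Patterns are illustrative assumptions; a real filter needs a broader rule set.
import re

REDACTIONS = [
    (re.compile(r"(?i)\b(api[_-]?key|token|secret)\b\s*[:=]\s*\S+"),
     r"\1=<REDACTED>"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "<REDACTED>"),   # common key shapes
    (re.compile(r"\bghp_[A-Za-z0-9]{20,}\b"), "<REDACTED>"),  # GitHub PATs
]

def redact(text: str) -> str:
    for pattern, repl in REDACTIONS:
        text = pattern.sub(repl, text)
    return text
```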
Why this slice first
It converts the repo from “Phase 1 prototype can run” into a repeatable, auditable improvement loop:
source sessions → sanitized eval examples → evolution run → constraints → benchmark gate → PR-ready artifact.
That is the smallest path to making self-evolution useful across agents and sessions while keeping review and promotion human-controlled.