NousResearch · breakneo · May 7, 2026
diff --git a/.gitignore b/.gitignore
@@ -9,11 +9,19 @@ build/
 .venv/
 venv/
 .env
+.pytest_cache/
 
 # Generated eval datasets (local, not shared)
 datasets/**/*.jsonl
 datasets/**/*.json
 !datasets/.gitkeep
+!datasets/skills/dogfood/baidu-homepage/train.jsonl
+!datasets/skills/dogfood/baidu-homepage/val.jsonl
+!datasets/skills/dogfood/baidu-homepage/holdout.jsonl
+
+# Generated run artifacts
+output/
+dogfood-output/
 
 # Evolution snapshots
 snapshots/

diff --git a/README.md b/README.md
@@ -31,7 +31,9 @@ GEPA reads execution traces to understand *why* things fail (not just that they
 # Install
 git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
 cd hermes-agent-self-evolution
-pip install -e ".[dev]"
+uv venv --python 3.11 .venv
+source .venv/bin/activate
+uv pip install -e '.[dev]'
 
 # Point at your hermes-agent repo
 export HERMES_AGENT_REPO=~/.hermes/hermes-agent
@@ -49,6 +51,57 @@ python -m evolution.skills.evolve_skill \
     --eval-source sessiondb
 ```
 
+## Advanced Usage
+
+### Evaluate through a real Hermes runtime
+
+```bash
+python -m evolution.skills.evolve_skill \
+    --skill dogfood \
+    --eval-source golden \
+    --dataset-path datasets/skills/dogfood/baidu-homepage \
+    --eval-backend hermes
+```
+
+### Add an optional TBLite regression gate
+
+```bash
+python -m evolution.skills.evolve_skill \
+    --skill github-code-review \
+    --eval-source synthetic \
+    --run-tblite \
+    --tblite-mode fast
+```
+
+### Generate git / PR handoff artifacts or execute them directly
+
+```bash
+python -m evolution.skills.evolve_skill \
+    --skill github-code-review \
+    --eval-source synthetic \
+    --execute-git-apply
+
+python -m evolution.skills.evolve_skill \
+    --skill github-code-review \
+    --eval-source synthetic \
+    --execute-git-apply \
+    --execute-push \
+    --execute-pr
+```
+
+Notes:
+- By default the tool loads credentials from `~/.hermes/.env` and local `.env` when present, without overwriting already exported values.
+- When the local Hermes runtime uses `model.provider: custom`, default self-evolution model settings are aligned to the active Hermes config automatically.
+- `--execute-pr` requires `--execute-push`.
+
+## Golden Dataset Sample
+
+A browser-heavy golden sample for the `dogfood` skill is included at:
+
+- `datasets/skills/dogfood/baidu-homepage/`
+
+It captures both positive paths and blockers from a real Baidu homepage dogfood run, which makes it useful for evaluating browser QA skills more realistically than purely synthetic examples.
+
 ## What It Optimizes
 
 | Phase | Target | Engine | Status |
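
Review note on the README "Notes" added above: two of the bullets describe behavior that is easy to misread, namely that credentials load from `~/.hermes/.env` and a local `.env` "without overwriting already exported values", and that `--execute-pr` requires `--execute-push`. The sketch below shows one way such rules could be expressed. It is not taken from this PR: `parse_env_lines`, `load_env_files`, and `validate_execute_flags` are hypothetical names, the `KEY=VALUE` parsing is deliberately simplistic, and the precedence between the two files (first file wins here) is an assumption, since the README does not state it.

```python
import os
from pathlib import Path


def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines; skip blanks, '#' comments, and malformed lines."""
    pairs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes; real .env parsers handle more cases.
        pairs[key.strip()] = value.strip().strip('"').strip("'")
    return pairs


def load_env_files(paths, environ=None):
    """Apply env files in order without overwriting keys that are already set."""
    environ = os.environ if environ is None else environ
    for path in paths:
        path = Path(path).expanduser()
        if not path.is_file():
            continue  # "when present": missing files are skipped
        text = path.read_text(encoding="utf-8")
        for key, value in parse_env_lines(text.splitlines()).items():
            # setdefault keeps exported values and values from earlier files
            # (assumed precedence, not confirmed by the README).
            environ.setdefault(key, value)
    return environ


def validate_execute_flags(execute_push, execute_pr):
    """Mirror the documented rule: --execute-pr requires --execute-push."""
    if execute_pr and not execute_push:
        raise ValueError("--execute-pr requires --execute-push")
```

Under these assumptions, `load_env_files(["~/.hermes/.env", ".env"])` would leave any variable already exported in the shell untouched, and `validate_execute_flags(False, True)` would raise before any git side effect runs.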

diff --git a/datasets/skills/dogfood/baidu-homepage/README.md b/datasets/skills/dogfood/baidu-homepage/README.md
@@ -0,0 +1,34 @@
+# Dogfood Golden Sample: Baidu Homepage
+
+This dataset incorporates the real dogfood run against <https://www.baidu.com/> on 2026-04-15 into the self-evolution sample set for the `dogfood` skill.
+
+## Source Artifacts
+
+- Source report: `datasets/skills/dogfood/baidu-homepage/source_report.md`
+- Dataset directory: `datasets/skills/dogfood/baidu-homepage`
+
+## What This Sample Covers
+
+- Homepage load health and console cleanliness
+- Search submission flow
+- Search suggestion relevance
+- Wenxin assistant entry and back-navigation chain
+- Top-nav News entry health
+
+## Why It Matters
+
+This sample gives `dogfood` a real browser-heavy golden set with both:
+
+- **positive paths**: homepage load, Wenxin single-turn QA, News page load
+- **negative/blocking paths**: search flow interrupted by Baidu security verification, unrelated suggestions, unstable back-navigation chain
+
+That makes it more useful than a purely synthetic sample when evaluating whether the evolved skill:
+
+1. tests the intended user flows,
+2. distinguishes blockers from non-blockers,
+3. captures evidence correctly, and
+4. writes a balanced QA report with both working and broken paths.
+
+## Notes
+
+- The source report is stored in-repo as text only; screenshot binaries from the original run are intentionally not committed.
diff --git a/datasets/skills/dogfood/baidu-homepage/holdout.jsonl b/datasets/skills/dogfood/baidu-homepage/holdout.jsonl
@@ -0,0 +1 @@
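+{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on the “News” entry in the top navigation, and summarize which paths work and which are blocked.", "expected_behavior": "Should verify whether the “News” entry opens the Baidu News page, and distinguish successful paths from failed ones: for example, the News page should be recorded as opening normally with no obvious layout anomalies or console errors; if the main search flow is interrupted by security verification, the summary should also list that as a blocker rather than misjudging every path as failed.", "difficulty": "hard", "category": "navigation-health", "source": "golden"}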
diff --git a/datasets/skills/dogfood/baidu-homepage/source_report.md b/datasets/skills/dogfood/baidu-homepage/source_report.md
@@ -0,0 +1,165 @@
+# Dogfood QA Report
+
+**Target:** https://www.baidu.com/
+**Date:** 2026-04-15
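+**Scope:** Small-sample exploratory test of the Baidu homepage desktop site: homepage load, top navigation, main search input and submit flow, Wenxin assistant entry, and Baidu News entry.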
+**Tester:** Hermes Agent (automated exploratory QA)
+
+---
+
+## Executive Summary
+
+| Severity | Count |
+|----------|-------|
+| 🔴 Critical | 0 |
+| 🟠 High | 1 |
+| 🟡 Medium | 2 |
+| 🔵 Low | 0 |
+| **Total** | **3** |
+
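+**Overall Assessment:** The Baidu homepage and its main entry points are usable overall, but the main search flow triggers a security-verification interstitial, and the search suggestions and back-navigation chain have usability problems that hurt continuous use in logged-out and automated scenarios.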
+
+---
+
+## Issues
+
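+### Issue #1: Homepage main search flow is blocked by security verification and cannot reach the results page directly
+
+| Field | Value |
+|-------|-------|
+| **Severity** | High |
+| **Category** | Functional |
+| **URL** | https://www.baidu.com/ |
+
+**Description:**
+After a test keyword was entered and submitted on the homepage, the browser did not land on a normal search results page and was instead intercepted by “Baidu Security Verification” (“百度安全验证”). The page requires the user to complete an image-rotation check, “drag the slider on the left to turn the image upright” (“拖动左侧滑块使图片为正”), which interrupts the standard search flow. For automation agents, assistive-technology users, and users who want to search quickly, this is a clear blocker.
+
+**Steps to Reproduce:**
+1. Open https://www.baidu.com/
+2. Type “Hermes Agent dogfood 测试” into the homepage search box
+3. Press Enter to submit the search
+
+**Expected Behavior:**
+The results page for the keyword opens directly, and the user can continue browsing results.
+
+**Actual Behavior:**
+The page redirects to “Baidu Security Verification” and requires the slider image-rotation check before continuing; normal search results are not shown.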
+
+**Screenshot:**
+Original screenshot captured during the source run (binary not committed in this repo).
+
+**Console Errors** (if applicable):
+```text
+None observed.
+```
+
+---
+
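+### Issue #2: Search suggestions are clearly unrelated to the typed query
+
+| Field | Value |
+|-------|-------|
+| **Severity** | Medium |
+| **Category** | UX |
+| **URL** | https://www.baidu.com/ |
+
+**Description:**
+After “Hermes Agent dogfood 测试” was typed into the homepage search box, the suggestion dropdown did not build on the full query or on the “dogfood 测试” (dogfood testing) intent. Instead it showed many generic English brand terms and entries such as “hermes tracking”, “hermes track”, and “hermes trismegistus”. These suggestions diverge substantially from the query intent and can mislead users into clicking toward unrelated searches.
+
+**Steps to Reproduce:**
+1. Open https://www.baidu.com/
+2. Type “Hermes Agent dogfood 测试” into the homepage search box
+3. Observe the suggestion dropdown list
+
+**Expected Behavior:**
+Suggestions should stay as close as possible to the full query, or at least relate to the “Agent / dogfood / 测试 (testing)” intent.
+
+**Actual Behavior:**
+Suggestions mainly revolve around the generic “Hermes” brand and English entries, with weak relevance to the full query.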
+
+**Screenshot:**
+Original screenshot captured during the source run (binary not committed in this repo).
+
+**Console Errors** (if applicable):
+```text
+None observed.
+```
+
+---
+
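+### Issue #3: Back navigation from the Wenxin assistant page does not return to the Baidu homepage; the history chain is unstable
+
+| Field | Value |
+|-------|-------|
+| **Severity** | Medium |
+| **Category** | Functional |
+| **URL** | https://chat.baidu.com/?enter_type=home_operate |
+
+**Description:**
+After clicking “复杂问题就找文心助手” (“For complex questions, ask the Wenxin assistant”) on the Baidu homepage to enter the Wenxin assistant page, the browser Back action did not return to the Baidu homepage and stayed on a Wenxin-related page. For users, this makes the page chain hard to follow; for automated workflows, it raises the cost of flow recovery.
+
+**Steps to Reproduce:**
+1. Open https://www.baidu.com/
+2. Click the “复杂问题就找文心助手，深入思考回答更优” entry (“For complex questions, ask the Wenxin assistant; deeper thinking gives better answers”)
+3. Perform a browser Back action on the Wenxin assistant page
+
+**Expected Behavior:**
+Back should return to the original Baidu homepage.
+
+**Actual Behavior:**
+After going back, the browser still shows a Wenxin-related page and the homepage is not restored; the user has to navigate to the Baidu homepage again.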
+
+**Screenshot:**
+Original screenshot captured during the source run (binary not committed in this repo).
+
+**Console Errors** (if applicable):
+```text
+None observed.
+```
+
+---
+
+## Issues Summary Table
+
+| # | Title | Severity | Category | URL |
+|---|-------|----------|----------|-----|
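+| 1 | Homepage main search flow is blocked by security verification and cannot reach the results page directly | High | Functional | https://www.baidu.com/ |
+| 2 | Search suggestions are clearly unrelated to the typed query | Medium | UX | https://www.baidu.com/ |
+| 3 | Back navigation from the Wenxin assistant page does not return to the Baidu homepage; the history chain is unstable | Medium | Functional | https://chat.baidu.com/?enter_type=home_operate |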
+
+## Testing Coverage
+
+### Pages Tested
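+- Baidu homepage (https://www.baidu.com/)
+- Baidu security verification page (triggered after search)
+- Wenxin assistant entry page / chat page (https://chat.baidu.com/)
+- Baidu News page (http://news.baidu.com)
+
+### Features Tested
+- Homepage load and visual check
+- Basic browser console check
+- Homepage search input and submission
+- Search suggestion observation
+- Wenxin assistant entry navigation
+- Wenxin assistant single-turn question and answer
+- Top-nav “News” entry navigation
+
+### Not Tested / Out of Scope
+- Login flow
+- In-depth testing of the other top-nav entries (Images, Video, Maps, Tieba, Netdisk, Wenku, etc.)
+- Homepage “Settings” menu expansion behavior
+- Click-through verification of each trending-search item
+- Mobile layout and responsive behavior
+- Manual completion of the security verification slider and the quality of the results page afterward
+
+### Blockers
+- The main search flow is blocked by Baidu security verification, so result relevance, results layout, and pagination on the normal results page could not be checked in this session.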
+
+---
+
+## Notes
+
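+1. The Baidu homepage itself is stable in first-screen load, layout, and visual rendering; no obvious blank screen, JS errors, or layout misalignment was observed.
+2. The Wenxin assistant entry opens normally, and a single-turn Q&A can be completed without logging in, so this entry is quite usable.
+3. The Baidu News page opens normally, which shows that at least some top-nav entries work.
+4. The main problems in this run are “the main search flow being interrupted by risk control” and “the unstable back-navigation chain”; these two have the largest impact on the real end-to-end experience.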
diff --git a/datasets/skills/dogfood/baidu-homepage/train.jsonl b/datasets/skills/dogfood/baidu-homepage/train.jsonl
@@ -0,0 +1,2 @@
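+{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on the homepage search input and main submit flow.", "expected_behavior": "Should navigate to the Baidu homepage, check the console and first-screen state, then type an explicit test term into the search box and submit it; if Baidu security verification appears, it should be identified as a high-severity Functional blocker, recording the verification text, the triggering steps, and the fact that the results page was not shown, with screenshot evidence attached.", "difficulty": "medium", "category": "search-flow", "source": "golden"}
+{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on whether the homepage search suggestions match the query intent.", "expected_behavior": "Should type a test query with a clear intent into the homepage search box, observe the suggestion dropdown, and judge whether the suggestions relate to the full query; if the suggestions lean heavily toward generic brand terms rather than the current test intent, this should be recorded as a medium-severity UX issue, with the risk of misleading users explained and a screenshot saved.", "difficulty": "medium", "category": "search-suggestions", "source": "golden"}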
diff --git a/datasets/skills/dogfood/baidu-homepage/val.jsonl b/datasets/skills/dogfood/baidu-homepage/val.jsonl
@@ -0,0 +1 @@
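+{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on the “复杂问题就找文心助手” (“For complex questions, ask the Wenxin assistant”) entry and on the path back to the homepage from that page.", "expected_behavior": "Should click the Wenxin assistant entry, verify that the page opens normally and that at least one Q&A turn can be completed while logged out, then perform a Back action; if Back does not restore the original Baidu homepage, this should be recorded as a medium-severity Functional issue, while also noting that the Wenxin page itself is usable, with no obvious console errors or login blocker.", "difficulty": "hard", "category": "subpage-flow", "source": "golden"}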
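
Review note on the three split files above (`train.jsonl`, `val.jsonl`, `holdout.jsonl`): every record in this diff carries the same five keys, `task_input`, `expected_behavior`, `difficulty`, `category`, and `source`. A minimal sketch of a loader that checks for those keys follows. `REQUIRED_KEYS` and `parse_golden_records` are hypothetical names for illustration; the PR does not show how the evolution code actually reads these files, and no constraint on allowed values is assumed.

```python
import json

# Keys observed in every record added by this diff (not a published schema).
REQUIRED_KEYS = {"task_input", "expected_behavior", "difficulty", "category", "source"}


def parse_golden_records(text):
    """Parse JSONL text into dicts, rejecting records that lack an observed key."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records
```

A check like this could run in CI against each split so that a malformed golden record fails fast instead of surfacing mid-evaluation.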
diff --git a/evolution/core/__init__.py b/evolution/core/__init__.py
@@ -1,3 +1,25 @@
 """Core infrastructure shared across all evolution phases."""
 
 from evolution.core.config import EvolutionConfig, get_hermes_agent_path
+from evolution.core.benchmark_gate import TBLiteGateResult, run_tblite_benchmark_gate
+from evolution.core.git_pr_automation import (
+    build_evolution_branch_name,
+    build_git_apply_plan,
+    build_target_skill_path,
+    write_git_apply_plan_artifacts,
+    write_git_pr_automation_artifacts,
+    write_skill_patch_artifacts,
+)
+from evolution.core.report_artifact import (
+    build_diff_summary,
+    build_evolution_report,
+    build_github_pr_body,
+    build_github_pr_title,
+    build_gh_pr_create_command,
+    build_pr_draft,
+    build_review_checklist,
+    summarize_recommendation,
+    write_github_pr_artifacts,
+    write_pr_ready_artifacts,
+    write_report_artifacts,
+)