8 changes: 8 additions & 0 deletions .gitignore
@@ -9,11 +9,19 @@ build/
.venv/
venv/
.env
.pytest_cache/

# Generated eval datasets (local, not shared)
datasets/**/*.jsonl
datasets/**/*.json
!datasets/.gitkeep
!datasets/skills/dogfood/baidu-homepage/train.jsonl
!datasets/skills/dogfood/baidu-homepage/val.jsonl
!datasets/skills/dogfood/baidu-homepage/holdout.jsonl

# Generated run artifacts
output/
dogfood-output/

# Evolution snapshots
snapshots/
55 changes: 54 additions & 1 deletion README.md
@@ -31,7 +31,9 @@ GEPA reads execution traces to understand *why* things fail (not just that they
# Install
git clone https://github.com/NousResearch/hermes-agent-self-evolution.git
cd hermes-agent-self-evolution
pip install -e ".[dev]"
uv venv --python 3.11 .venv
source .venv/bin/activate
uv pip install -e '.[dev]'

# Point at your hermes-agent repo
export HERMES_AGENT_REPO=~/.hermes/hermes-agent
@@ -49,6 +51,57 @@ python -m evolution.skills.evolve_skill \
--eval-source sessiondb
```

## Advanced Usage

### Evaluate through a real Hermes runtime

```bash
python -m evolution.skills.evolve_skill \
--skill dogfood \
--eval-source golden \
--dataset-path datasets/skills/dogfood/baidu-homepage \
--eval-backend hermes
```

### Add an optional TBLite regression gate

```bash
python -m evolution.skills.evolve_skill \
--skill github-code-review \
--eval-source synthetic \
--run-tblite \
--tblite-mode fast
```

### Generate git / PR handoff artifacts or execute them directly

```bash
python -m evolution.skills.evolve_skill \
--skill github-code-review \
--eval-source synthetic \
--execute-git-apply

python -m evolution.skills.evolve_skill \
--skill github-code-review \
--eval-source synthetic \
--execute-git-apply \
--execute-push \
--execute-pr
```
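
The second command above chains `--execute-push` and `--execute-pr` onto `--execute-git-apply`. As a minimal sketch of how a dependency between such flags could be enforced, the snippet below uses `argparse` from the standard library. The parser, its `prog` name, and the function names are hypothetical and are not taken from `evolve_skill`; only the three flag names and the rule that `--execute-pr` needs `--execute-push` come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that models only the three execute flags."""
    parser = argparse.ArgumentParser(prog="evolve_skill_sketch")
    parser.add_argument("--execute-git-apply", action="store_true")
    parser.add_argument("--execute-push", action="store_true")
    parser.add_argument("--execute-pr", action="store_true")
    return parser


def parse_execute_flags(argv: list[str]) -> argparse.Namespace:
    """Parse argv and reject --execute-pr when --execute-push is absent."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Documented rule: --execute-pr requires --execute-push.
    if args.execute_pr and not args.execute_push:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--execute-pr requires --execute-push")
    return args
```

Failing at parse time keeps an invalid flag combination from reaching any git or GitHub side effects, which may or may not be how the real tool orders its checks.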

Notes:
- By default the tool loads credentials from `~/.hermes/.env` and a local `.env` when present, without overwriting values that are already exported.
- When the local Hermes runtime uses `model.provider: custom`, default self-evolution model settings are aligned to the active Hermes config automatically.
- `--execute-pr` requires `--execute-push`.
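
The first note describes loading credentials without overriding exported values. The sketch below illustrates that idea under stated assumptions: `parse_env_file` and `load_env_files` are hypothetical helpers (not the tool's actual loader), the parser handles only plain `KEY=VALUE` lines, and the precedence between the two files (the first file listed wins) is an assumption, since the notes do not specify it.

```python
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks, # comments, and missing files."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_files(paths, environ=None) -> list[str]:
    """Apply files in order; never overwrite a key that is already set."""
    environ = os.environ if environ is None else environ
    loaded = []
    for path in paths:
        for key, value in parse_env_file(Path(path)).items():
            if key not in environ:  # exported or earlier values win
                environ[key] = value
                loaded.append(key)
    return loaded
```

A call such as `load_env_files([Path.home() / ".hermes" / ".env", Path(".env")])` would then fill in only the variables that the shell has not already exported.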

## Golden Dataset Sample

A browser-heavy golden sample for the `dogfood` skill is included at:

- `datasets/skills/dogfood/baidu-homepage/`

It captures both positive paths and blockers from a real Baidu homepage dogfood run, which makes it useful for evaluating browser QA skills more realistically than purely synthetic examples.
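
To inspect the sample before running an evaluation, a small reader such as the one below may help. It is a sketch, not the loader used by `evolve_skill`: `load_split` and `load_dataset` are hypothetical names, and the required field list is inferred from the records in `train.jsonl`, `val.jsonl`, and `holdout.jsonl`.

```python
import json
from pathlib import Path

# Keys observed in the golden records; treated here as required (an assumption).
REQUIRED_FIELDS = ("task_input", "expected_behavior", "difficulty", "category", "source")


def load_split(path: Path) -> list[dict]:
    """Read one JSONL split and check that every record has the expected keys."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, 1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                raise ValueError(f"{path.name}:{line_no} missing fields: {missing}")
            records.append(record)
    return records


def load_dataset(directory: Path) -> dict[str, list[dict]]:
    """Load whichever of the train/val/holdout splits exist in the directory."""
    splits = {}
    for name in ("train", "val", "holdout"):
        path = directory / f"{name}.jsonl"
        if path.is_file():
            splits[name] = load_split(path)
    return splits
```

Calling `load_dataset(Path("datasets/skills/dogfood/baidu-homepage"))` from the repository root should return the three splits keyed by name, provided the files are present as committed in this change.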

## What It Optimizes

| Phase | Target | Engine | Status |
34 changes: 34 additions & 0 deletions datasets/skills/dogfood/baidu-homepage/README.md
@@ -0,0 +1,34 @@
# Dogfood Golden Sample: Baidu Homepage

This dataset incorporates the real dogfood run against <https://www.baidu.com/> on 2026-04-15 into the self-evolution sample set for the `dogfood` skill.

## Source Artifacts

- Source report: `datasets/skills/dogfood/baidu-homepage/source_report.md`
- Dataset directory: `datasets/skills/dogfood/baidu-homepage`

## What This Sample Covers

- Homepage load health and console cleanliness
- Search submission flow
- Search suggestion relevance
- Wenxin assistant entry and back-navigation chain
- Top-nav News entry health

## Why It Matters

This sample gives `dogfood` a real browser-heavy golden set with both:

- **positive paths**: homepage load, Wenxin single-turn QA, News page load
- **negative/blocking paths**: search flow interrupted by Baidu security verification, unrelated suggestions, unstable back-navigation chain

That makes it more useful than a purely synthetic sample when evaluating whether the evolved skill:

1. tests the intended user flows,
2. distinguishes blockers from non-blockers,
3. captures evidence correctly, and
4. writes a balanced QA report with both working and broken paths.

## Notes

- The source report is stored in-repo as text only; screenshot binaries from the original run are intentionally not committed.
1 change: 1 addition & 0 deletions datasets/skills/dogfood/baidu-homepage/holdout.jsonl
@@ -0,0 +1 @@
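{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on the “新闻” (News) entry in the top navigation, and summarize which paths work and which are blocked.", "expected_behavior": "Should verify whether the “新闻” (News) entry opens the Baidu News page, and distinguish successful paths from failed ones: for example, the News page should be recorded as opening normally with no obvious layout anomalies or console errors; if the main search flow is interrupted by security verification, that should also be listed as a blocker in the summary, rather than misjudging every path as failed.", "difficulty": "hard", "category": "navigation-health", "source": "golden"}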
165 changes: 165 additions & 0 deletions datasets/skills/dogfood/baidu-homepage/source_report.md
@@ -0,0 +1,165 @@
# Dogfood QA Report

**Target:** https://www.baidu.com/
**Date:** 2026-04-15
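**Scope:** Small-sample exploratory test of the Baidu homepage (desktop site): homepage load, top navigation, the main search input and submission flow, the Wenxin assistant entry, and the Baidu News entry.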
**Tester:** Hermes Agent (automated exploratory QA)

---

## Executive Summary

| Severity | Count |
|----------|-------|
| 🔴 Critical | 0 |
| 🟠 High | 1 |
| 🟡 Medium | 2 |
| 🔵 Low | 0 |
| **Total** | **3** |

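**Overall Assessment:** The Baidu homepage and its main entry points are usable overall, but the main search flow is intercepted by a security verification, and the search suggestions and back-navigation chain have usability problems. These affect continuous use in logged-out and automated scenarios.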

---

## Issues

### Issue #1: Homepage search flow is intercepted by security verification and cannot reach the results page directly

| Field | Value |
|-------|-------|
| **Severity** | High |
| **Category** | Functional |
| **URL** | https://www.baidu.com/ |

**Description:**
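After a test keyword is entered on the homepage and submitted, the browser does not reach a normal search results page; it is intercepted by “百度安全验证” (Baidu Security Verification). The page requires the user to complete an image-rotation challenge, “拖动左侧滑块使图片为正” (drag the slider on the left until the image is upright), which interrupts the standard search flow. For automated agents, assistive-technology users, and users who want to search quickly, this is a clear blocker.

**Steps to Reproduce:**
1. Open https://www.baidu.com/
2. Enter “Hermes Agent dogfood 测试” in the homepage search box
3. Press Enter to submit the search

**Expected Behavior:**
The results page for the keyword opens directly, and the user can continue browsing results.

**Actual Behavior:**
The page redirects to “百度安全验证” (Baidu Security Verification) and requires completing the slider image-rotation challenge before continuing; normal search results are not shown.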

**Screenshot:**
Original screenshot captured during the source run (binary not committed in this repo).

**Console Errors** (if applicable):
```text
None observed.
```

---

### Issue #2: Search suggestions are clearly unrelated to the entered query

| Field | Value |
|-------|-------|
| **Severity** | Medium |
| **Category** | UX |
| **URL** | https://www.baidu.com/ |

**Description:**
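After “Hermes Agent dogfood 测试” is entered in the homepage search box, the suggestion dropdown does not build on the full query or on the “dogfood 测试” (dogfood testing) intent. Instead it shows many generic English brand names and entries, such as “hermes tracking”, “hermes track”, and “hermes trismegistus”. These suggestions deviate substantially from the query intent and can mislead users into clicking unrelated search directions.

**Steps to Reproduce:**
1. Open https://www.baidu.com/
2. Enter “Hermes Agent dogfood 测试” in the homepage search box
3. Observe the suggestion dropdown

**Expected Behavior:**
Suggestions should stay as close as possible to the full query, or at least relate to the “Agent / dogfood / 测试” intent.

**Actual Behavior:**
Suggestions mainly revolve around generic “Hermes” brand names and English entries, and are only weakly related to the full query.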

**Screenshot:**
Original screenshot captured during the source run (binary not committed in this repo).

**Console Errors** (if applicable):
```text
None observed.
```

---

### Issue #3: Back navigation from the Wenxin assistant page does not return to the Baidu homepage; the history chain is unstable

| Field | Value |
|-------|-------|
| **Severity** | Medium |
| **Category** | Functional |
| **URL** | https://chat.baidu.com/?enter_type=home_operate |

**Description:**
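After clicking “复杂问题就找文心助手” on the Baidu homepage to enter the Wenxin assistant page, using the browser's back action does not return to the Baidu homepage; the browser stays on a Wenxin-related page. For users, this makes the page chain hard to follow; for automated workflows, it increases the cost of recovering the flow.

**Steps to Reproduce:**
1. Open https://www.baidu.com/
2. Click the “复杂问题就找文心助手,深入思考回答更优” entry
3. Perform a browser back action on the Wenxin assistant page

**Expected Behavior:**
The back action should return to the original Baidu homepage.

**Actual Behavior:**
After going back, the browser is still on a Wenxin-related page and has not returned to the homepage; the user has to navigate to the Baidu homepage again.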

**Screenshot:**
Original screenshot captured during the source run (binary not committed in this repo).

**Console Errors** (if applicable):
```text
None observed.
```

---

## Issues Summary Table

| # | Title | Severity | Category | URL |
|---|-------|----------|----------|-----|
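| 1 | Homepage search flow is intercepted by security verification and cannot reach the results page directly | High | Functional | https://www.baidu.com/ |
| 2 | Search suggestions are clearly unrelated to the entered query | Medium | UX | https://www.baidu.com/ |
| 3 | Back navigation from the Wenxin assistant page does not return to the Baidu homepage; the history chain is unstable | Medium | Functional | https://chat.baidu.com/?enter_type=home_operate |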

## Testing Coverage

### Pages Tested
- Baidu homepage (https://www.baidu.com/)
- Baidu Security Verification page (triggered after search)
- Wenxin assistant entry page / chat page (https://chat.baidu.com/)
- Baidu News page (http://news.baidu.com)

### Features Tested
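- Homepage load and visual check
- Basic browser console check
- Homepage search input and submission
- Search suggestion observation
- Wenxin assistant entry navigation
- Wenxin assistant single-turn question and answer
- Top-nav “新闻” (News) entry navigation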

### Not Tested / Out of Scope
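- Login flow
- In-depth testing of the other top-nav entries (Images, Video, Maps, Tieba, Netdisk, Wenku, and so on)
- Expansion behavior of the homepage “设置” (Settings) menu
- Click-through verification of each trending-search item
- Mobile layout and responsive behavior
- Manual completion of the security-verification slider and the quality of the results page after verification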

### Blockers
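- The main search flow is intercepted by Baidu Security Verification, so the relevance, layout, and pagination chain of the normal search results page could not be checked in this session.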

---

## Notes

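1. The Baidu homepage itself is stable in first-screen load, layout, and visual presentation; no obvious white screen, JS errors, or layout misalignment was observed.
2. The Wenxin assistant entry opens normally, and a single-turn Q&A can be completed without logging in, so this entry point is in good shape.
3. The Baidu News page opens normally, which shows that at least some top-nav entries work.
4. The main problems in this run are “main search flow interrupted by risk control” and “unstable back-navigation chain”; these two have the largest impact on the real end-to-end experience.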
2 changes: 2 additions & 0 deletions datasets/skills/dogfood/baidu-homepage/train.jsonl
@@ -0,0 +1,2 @@
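{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on the homepage search input and submission flow.", "expected_behavior": "Should navigate to the Baidu homepage, check the console and first-screen state, then enter a clear test term in the search box and submit it; if Baidu Security Verification appears, it should be identified as a high-severity Functional blocker, recording the verification text, the triggering steps, and the fact that the results page was not shown, with screenshot evidence attached.", "difficulty": "medium", "category": "search-flow", "source": "golden"}
{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on whether the homepage search suggestions match the query intent.", "expected_behavior": "Should enter a test query with clear intent in the homepage search box, observe the suggestion dropdown, and judge whether the suggestions relate to the full query; if they lean heavily toward generic brand terms rather than the current test intent, it should be recorded as a medium-severity UX issue, explaining the risk of misleading users and saving a screenshot.", "difficulty": "medium", "category": "search-suggestions", "source": "golden"}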
1 change: 1 addition & 0 deletions datasets/skills/dogfood/baidu-homepage/val.jsonl
@@ -0,0 +1 @@
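{"task_input": "Use dogfood to test this site: https://www.baidu.com/. Focus on the “复杂问题就找文心助手” entry and on the chain for returning from that page to the homepage.", "expected_behavior": "Should click the Wenxin assistant entry, verify whether the page opens normally and whether at least one Q&A turn can be completed while logged out, then perform a back action; if going back does not restore the original Baidu homepage, it should be recorded as a medium-severity Functional issue, while noting that the Wenxin page itself is usable with no obvious console errors or login blocker.", "difficulty": "hard", "category": "subpage-flow", "source": "golden"}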
22 changes: 22 additions & 0 deletions evolution/core/__init__.py
@@ -1,3 +1,25 @@
"""Core infrastructure shared across all evolution phases."""

from evolution.core.config import EvolutionConfig, get_hermes_agent_path
from evolution.core.benchmark_gate import TBLiteGateResult, run_tblite_benchmark_gate
from evolution.core.git_pr_automation import (
build_evolution_branch_name,
build_git_apply_plan,
build_target_skill_path,
write_git_apply_plan_artifacts,
write_git_pr_automation_artifacts,
write_skill_patch_artifacts,
)
from evolution.core.report_artifact import (
build_diff_summary,
build_evolution_report,
build_github_pr_body,
build_github_pr_title,
build_gh_pr_create_command,
build_pr_draft,
build_review_checklist,
summarize_recommendation,
write_github_pr_artifacts,
write_pr_ready_artifacts,
write_report_artifacts,
)