From a017afdc6298fac8fa697d666717349917a629ef Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 10 Apr 2026 05:44:20 +0000 Subject: [PATCH 1/2] Sync skills with harbor upstream (2026-04-10) - harbor-cli: add --verifier-env/--ve flag to harbor jobs start and harbor trials start; add copilot-cli and pi to supported agents table - harbor-task-creator: document Harbor Reward Kit as the recommended testing pattern (uvx harbor-rewardkit, @criterion, TOML judge config) --- skills/harbor-cli/SKILL.md | 4 ++ skills/harbor-cli/references/flags.md | 2 + skills/harbor-task-creator/SKILL.md | 73 +++++++++++++++++++++++++-- 3 files changed, 74 insertions(+), 5 deletions(-) diff --git a/skills/harbor-cli/SKILL.md b/skills/harbor-cli/SKILL.md index 5a2ba5b..a388190 100644 --- a/skills/harbor-cli/SKILL.md +++ b/skills/harbor-cli/SKILL.md @@ -54,6 +54,7 @@ The primary command. `harbor run` is an alias for `harbor jobs start`. | `--jobs-dir` | `-o` | Output directory (default: `./jobs`) | | `--agent-env` | `--ae` | Pass env var to agent: `KEY=VALUE`. Repeatable | | `--agent-kwarg` | `--ak` | Agent kwarg: `key=value`. Repeatable | +| `--verifier-env` | `--ve` | Env var for verifier: `KEY=VALUE`. Repeatable | | `--task-name` | `-t` | Include tasks by glob pattern. Repeatable | | `--exclude-task-name` | `-x` | Exclude tasks by glob. Repeatable | | `--n-tasks` | `-l` | Max tasks to run | @@ -144,6 +145,7 @@ Run a single trial. Useful for debugging and task development. | `--trials-dir` | | Output directory (default: `./trials`) | | `--agent-env` | `--ae` | Env var for agent: `KEY=VALUE`. Repeatable | | `--agent-kwarg` | `--ak` | Agent kwarg: `key=value`. Repeatable | +| `--verifier-env` | `--ve` | Env var for verifier: `KEY=VALUE`. 
Repeatable | | `--no-cleanup` | | Keep environment after trial | | `--no-verify` | | Skip running tests | @@ -411,10 +413,12 @@ Key flags: `--tasks-dir / -t` (default: `tasks`), `--registry / -r` (required), | Aider | `aider` | | | Codex | `codex` | | | Cline CLI | `cline-cli` | Model format: `provider:model-id` | +| Copilot CLI | `copilot-cli` | | | Cursor CLI | `cursor-cli` | | | Gemini CLI | `gemini-cli` | Needs `GOOGLE_API_KEY` or `GEMINI_API_KEY` | | Goose | `goose` | Model format: `provider/model_name` | | Mini SWE-agent | `mini-swe-agent` | | +| Pi | `pi` | | | SWE-agent | `swe-agent` | | | OpenCode | `opencode` | | | OpenHands | `openhands` | Needs `LLM_API_KEY` and `LLM_MODEL` | diff --git a/skills/harbor-cli/references/flags.md b/skills/harbor-cli/references/flags.md index eb906ea..c9b671e 100644 --- a/skills/harbor-cli/references/flags.md +++ b/skills/harbor-cli/references/flags.md @@ -76,6 +76,7 @@ Detailed flags for every Harbor CLI command. This file is the authoritative refe |------|-------|------|---------|-------------| | `--verifier-import-path` | | str | | Python import path for custom verifier | | `--verifier-kwarg` | `--vk` | str list | | Verifier kwarg as `key=value`. Can repeat | +| `--verifier-env` | `--ve` | str list | | Env var for verifier as `KEY=VALUE`. Overrides task.toml `[verifier.env]`. Can repeat | ### Job control flags @@ -183,6 +184,7 @@ Shares many flags with `harbor jobs start` but operates on a single task. Key di | `--trials-dir` | | Path | `./trials` | Output directory | | `--agent-env` | `--ae` | str list | | Env var for agent as `KEY=VALUE`. Can repeat | | `--agent-kwarg` | `--ak` | str list | | Agent kwarg as `key=value`. Can repeat | +| `--verifier-env` | `--ve` | str list | | Env var for verifier as `KEY=VALUE`. Overrides task.toml `[verifier.env]`. 
Can repeat | | `--agent-import-path` | | str | | Custom agent import path | | `--environment-import-path` | | str | | Custom environment import path | | `--environment-kwarg` | `--ek` | str list | | Environment kwarg as `key=value`. Can repeat | diff --git a/skills/harbor-task-creator/SKILL.md b/skills/harbor-task-creator/SKILL.md index 26d1a69..3d40edc 100644 --- a/skills/harbor-task-creator/SKILL.md +++ b/skills/harbor-task-creator/SKILL.md @@ -231,7 +231,70 @@ The test script runs inside the container AFTER the agent finishes. It determine - `reward.json` supports multiple named rewards (e.g., `{"reward": 1.0, "quality": 0.8, "style": 0.6}`) - MUST write the reward file in ALL code paths — if missing, the trial fails with `RewardFileNotFoundError`; if empty, `RewardFileEmptyError` -**Pattern 1: Pytest + CTRF (recommended for complex tasks)** +**Pattern 1: Harbor Reward Kit (recommended for new tasks)** + +[`harbor-rewardkit`](https://harborframework.com/docs/rewardkit) is a lightweight package that lets you define criteria as Python functions or TOML-based LLM judges. The test.sh becomes a one-liner and rewardkit handles reward file writing. + +```bash +#!/bin/bash +uvx harbor-rewardkit@0.1 /tests +``` + +Write criteria as Python files in `tests/` using the `@criterion` decorator or built-in criteria: + +```python +# tests/correctness/check.py — programmatic criteria +from rewardkit import criteria, criterion +from pathlib import Path + +# Built-in criteria (file checks, HTTP, SQLite, etc.) 
+criteria.file_exists("/app/output.txt") +criteria.file_contains("/app/output.txt", "expected") + +# Custom criterion using @criterion decorator +@criterion(shared=True) +def output_is_valid(workspace: Path) -> float: + """Check the output meets requirements.""" + p = workspace / "output.txt" + if not p.exists(): + return 0.0 + lines = p.read_text().strip().splitlines() + return 1.0 if len(lines) >= 5 else 0.0 +``` + +For LLM-as-judge evaluation, add a TOML config file in `tests/`: + +```toml +# tests/quality/quality.toml +[judge] +judge = "anthropic/claude-sonnet-4-6" +files = ["/app/main.py"] + +[[criterion]] +name = "code_quality" +description = "Is the code well-structured and readable?" +type = "likert" +points = 5 +weight = 1.0 + +[[criterion]] +name = "edge_case_handling" +description = "Does the code handle edge cases like empty strings?" +type = "binary" +weight = 1.0 +``` + +Rewardkit aggregates all criteria scores into a weighted average and writes `reward.txt` automatically. Set `[verifier.env]` in task.toml to pass LLM API keys: + +```toml +[verifier] +timeout_sec = 120.0 + +[verifier.env] +ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}" +``` + +**Pattern 2: Pytest + CTRF (for tasks with complex test logic)** ```bash #!/bin/bash @@ -258,7 +321,7 @@ Use `set -uo pipefail` (without `-e`) so the script continues to write the rewar CTRF (`ctrf.json`) is displayed in the Harbor Viewer UI for detailed per-test inspection. It is NOT used for reward calculation — the reward always comes from `reward.txt` or `reward.json`. 
-**Pattern 2: Simple file check (for straightforward tasks)** +**Pattern 3: Simple file check (for straightforward tasks)** ```bash #!/bin/bash @@ -269,7 +332,7 @@ else fi ``` -**Pattern 3: Partial credit with multiple named rewards** +**Pattern 4: Partial credit with multiple named rewards** ```bash #!/bin/bash @@ -291,9 +354,9 @@ echo "scale=2; $SCORE / $TOTAL" | bc > /logs/verifier/reward.txt **Key rules:** - Working directory when test.sh runs is `/tests/` - Install test dependencies in test.sh, NOT in the Dockerfile -- Always write the reward file, even on error +- Always write the reward file, even on error (rewardkit handles this automatically) - Use CTRF output (`--ctrf`) for pytest — Harbor displays it in the Viewer for human inspection -- Pin all dependency versions in test.sh (e.g., `pytest==8.4.1`) +- Pin all dependency versions in test.sh (e.g., `pytest==8.4.1`, `harbor-rewardkit@0.1`) ## Writing solution/solve.sh From 0d69ef5d5dcfa429e9ce7020e0b99496be245289 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 10 Apr 2026 15:51:45 +0000 Subject: [PATCH 2/2] Sync harbor-adapter-creator and harbor-cli with upstream changes - adapter-creator: clarify that task.toml author_name/author_email credit original benchmark authors, not the adapter builder; add gotcha entries - adapter-creator: document parity notes requirement for run-count asymmetry - harbor-cli: document Codex auth.json support (auto-detect ~/.codex/auth.json, CODEX_AUTH_JSON_PATH, CODEX_FORCE_API_KEY env vars) Upstream commits: 1e1455d (task.toml author fields clarification), 1336775 (Codex auth.json support) --- skills/harbor-adapter-creator/SKILL.md | 9 ++++++++- skills/harbor-cli/SKILL.md | 2 +- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/skills/harbor-adapter-creator/SKILL.md b/skills/harbor-adapter-creator/SKILL.md index 6f7379f..461390f 100644 --- a/skills/harbor-adapter-creator/SKILL.md +++ b/skills/harbor-adapter-creator/SKILL.md @@ -334,7 +334,10 @@ version = "1.0" 
source = "mybenchmark/instance-42"

[metadata]
-author_name = "Your Name"
+# author_name/author_email are optional; when present they credit the ORIGINAL
+# benchmark authors, not the adapter builder. Adapter builder info belongs in
+# adapter_metadata.json under "adapter_builders".
+author_name = "Jane Doe" # original benchmark author
difficulty = "medium" # easy, medium, hard
category = "qa"
tags = ["factual", "reasoning"]
@@ -398,6 +401,8 @@ The `adapter_pr`, `dataset_pr`, and `parity_pr` fields in `parity_experiment.jso

The `harbor_adapter` entry must include `parity_matching_agents` (array of `"agent@version+model"` strings). In the harbor-datasets `registry.json`, use `"version": "parity"` to enable `harbor jobs start -d mybenchmark@parity`.

+**notes field:** Use `notes` to document anything that affects the comparison's validity. In particular, if the original benchmark published a single score while you ran 3 Harbor trials (or vice versa), you **must** explain the asymmetry here — e.g., `"50 tasks; original authors published a single score, compared here against the mean of 3 Harbor runs"`. Reviewers will flag unexplained run-count mismatches.
+
## adapter_metadata.json

A JSON **array** describing the adapter and its relationship to the original benchmark. Required fields per entry: `adapter_name`, `adapter_builders`, `original_benchmark`, `harbor_adapter`.
@@ -491,6 +496,8 @@ See [references/adapter-anatomy.md](references/adapter-anatomy.md#6-post-impleme

- **Using /app vs /workspace inconsistently**: Pick one working directory and be consistent between instruction.md, test.sh, and Dockerfile WORKDIR. Some adapters use `/app`, others use `/workspace`.
- **Not escaping shell-unsafe characters**: Answers containing single quotes, backticks, or dollar signs can break test.sh or solve.sh. Escape them when embedding in shell scripts.
- **Missing `mkdir -p /logs/verifier`**: Some base images may not have this directory. Create it in test.sh before writing reward.txt. 
+- **`author_name`/`author_email` in task.toml credit original authors, not you**: These optional fields are for the benchmark's original authors. Your contact info as adapter builder goes in `adapter_metadata.json` under `adapter_builders`.
+- **Unexplained parity run-count asymmetry**: If the original benchmark published a single score but you ran 3 Harbor trials (or there is any other run-count mismatch), explain it in the `notes` field of `parity_experiment.json`. Reviewers will reject submissions where the number of runs differs without explanation.

## Existing Adapter Patterns

diff --git a/skills/harbor-cli/SKILL.md b/skills/harbor-cli/SKILL.md
index a388190..fd30ade 100644
--- a/skills/harbor-cli/SKILL.md
+++ b/skills/harbor-cli/SKILL.md
@@ -411,7 +411,7 @@ Key flags: `--tasks-dir / -t` (default: `tasks`), `--registry / -r` (required),
| Nop | `nop` | Does nothing (baseline) |
| Claude Code | `claude-code` | Needs `ANTHROPIC_API_KEY` |
| Aider | `aider` | |
-| Codex | `codex` | |
+| Codex | `codex` | Auth via `~/.codex/auth.json` (auto-detected) or `CODEX_AUTH_JSON_PATH`; set `CODEX_FORCE_API_KEY=1` to use `OPENAI_API_KEY` instead |
| Cline CLI | `cline-cli` | Model format: `provider:model-id` |
| Copilot CLI | `copilot-cli` | |
| Cursor CLI | `cursor-cli` | |