9 changes: 8 additions & 1 deletion skills/harbor-adapter-creator/SKILL.md
@@ -334,7 +334,10 @@ version = "1.0"
source = "mybenchmark/instance-42"

[metadata]
author_name = "Your Name"
# author_name/author_email are optional; when present they credit the ORIGINAL
# benchmark authors, not the adapter builder. Adapter builder info belongs in
# adapter_metadata.json under "adapter_builders".
author_name = "Jane Doe" # original benchmark author
difficulty = "medium" # easy, medium, hard
category = "qa"
tags = ["factual", "reasoning"]
@@ -398,6 +401,8 @@ The `adapter_pr`, `dataset_pr`, and `parity_pr` fields in `parity_experiment.json`

The `harbor_adapter` entry must include `parity_matching_agents` (array of `"agent@version+model"` strings). In the harbor-datasets `registry.json`, use `"version": "parity"` to enable `harbor jobs start -d mybenchmark@parity`.

**notes field:** Use `notes` to document anything that affects the comparison's validity. In particular, if the original benchmark published a single score while you ran 3 Harbor trials (or vice versa), you **must** explain the asymmetry here — e.g., `"50 tasks; original published 1 score vs. 3 Harbor runs; see notes for variance"`. Reviewers will flag unexplained run-count mismatches.
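
As a sketch, a `parity_experiment.json` entry combining the fields discussed above might look like this (the overall file shape and all values are illustrative assumptions; only the field names `notes` and `parity_matching_agents` come from this section, and the agent version string is made up):

```json
{
  "notes": "50 tasks; original published 1 score vs. 3 Harbor runs",
  "harbor_adapter": {
    "parity_matching_agents": ["claude-code@2.0.1+claude-sonnet-4-6"]
  }
}
```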

## adapter_metadata.json

A JSON **array** describing the adapter and its relationship to the original benchmark. Required fields per entry: `adapter_name`, `adapter_builders`, `original_benchmark`, `harbor_adapter`.
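
A minimal sketch of one entry (the four top-level field names are the required ones listed above; the nested shapes and values are illustrative assumptions, not the documented schema):

```json
[
  {
    "adapter_name": "mybenchmark",
    "adapter_builders": [{ "name": "Ada Lovelace", "email": "ada@example.com" }],
    "original_benchmark": { "name": "MyBenchmark", "url": "https://example.com/mybenchmark" },
    "harbor_adapter": { "parity_matching_agents": ["claude-code@2.0.1+claude-sonnet-4-6"] }
  }
]
```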
@@ -491,6 +496,8 @@ See [references/adapter-anatomy.md](references/adapter-anatomy.md#6-post-impleme
- **Using /app vs /workspace inconsistently**: Pick one working directory and be consistent between instruction.md, test.sh, and Dockerfile WORKDIR. Some adapters use `/app`, others use `/workspace`.
- **Not escaping shell-unsafe characters**: Answers containing single quotes, backticks, or dollar signs can break test.sh or solve.sh. Escape them when embedding in shell scripts.
- **Missing `mkdir -p /logs/verifier`**: Some base images may not have this directory. Create it in test.sh before writing reward.txt.
- **`author_name`/`author_email` in task.toml credit original authors, not you**: These optional fields are for the benchmark's original authors. Your contact info as adapter builder goes in `adapter_metadata.json` under `adapter_builders`.
- **Unexplained parity run-count asymmetry**: If the original benchmark published a single score but you ran 3 Harbor trials (or any mismatch), explain this in the `notes` field of `parity_experiment.json`. Reviewers will reject submissions where the number of runs differs without explanation.
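
The escaping pitfall above can be handled with `printf %q`, which emits a shell-safe quoted form of an arbitrary string (the answer string here is a made-up example):

```shell
#!/bin/bash
# Hypothetical expected answer containing a quote, backticks, and a dollar sign.
answer="it's \`true\` that \$HOME varies"

# printf %q quotes the string so it can be embedded verbatim in test.sh.
printf -v quoted '%q' "$answer"
echo "EXPECTED=$quoted"

# Round trip: evaluating the quoted form restores the original string exactly.
eval "restored=$quoted"
[ "$restored" = "$answer" ] && echo "round-trip ok"
```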

## Existing Adapter Patterns

6 changes: 5 additions & 1 deletion skills/harbor-cli/SKILL.md
@@ -54,6 +54,7 @@ The primary command. `harbor run` is an alias for `harbor jobs start`.
| `--jobs-dir` | `-o` | Output directory (default: `./jobs`) |
| `--agent-env` | `--ae` | Pass env var to agent: `KEY=VALUE`. Repeatable |
| `--agent-kwarg` | `--ak` | Agent kwarg: `key=value`. Repeatable |
| `--verifier-env` | `--ve` | Env var for verifier: `KEY=VALUE`. Repeatable |
| `--task-name` | `-t` | Include tasks by glob pattern. Repeatable |
| `--exclude-task-name` | `-x` | Exclude tasks by glob. Repeatable |
| `--n-tasks` | `-l` | Max tasks to run |
@@ -144,6 +145,7 @@ Run a single trial. Useful for debugging and task development.
| `--trials-dir` | | Output directory (default: `./trials`) |
| `--agent-env` | `--ae` | Env var for agent: `KEY=VALUE`. Repeatable |
| `--agent-kwarg` | `--ak` | Agent kwarg: `key=value`. Repeatable |
| `--verifier-env` | `--ve` | Env var for verifier: `KEY=VALUE`. Repeatable |
| `--no-cleanup` | | Keep environment after trial |
| `--no-verify` | | Skip running tests |
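
For instance, a usage sketch combining only flags documented in these tables (the dataset name is the placeholder from the parity example earlier; this is illustrative, not a tested invocation):

```bash
# Pass an API key to the verifier (e.g., for an LLM judge) without relying on
# the agent's environment; --ve mirrors --ae but targets the verifier.
harbor run -d mybenchmark@parity \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  --ve ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY"
```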

@@ -409,12 +411,14 @@ Key flags: `--tasks-dir / -t` (default: `tasks`), `--registry / -r` (required),
| Nop | `nop` | Does nothing (baseline) |
| Claude Code | `claude-code` | Needs `ANTHROPIC_API_KEY` |
| Aider | `aider` | |
| Codex | `codex` | |
| Codex | `codex` | Auth via `~/.codex/auth.json` (auto-detected) or `CODEX_AUTH_JSON_PATH`; set `CODEX_FORCE_API_KEY=1` to use `OPENAI_API_KEY` instead |
| Cline CLI | `cline-cli` | Model format: `provider:model-id` |
| Copilot CLI | `copilot-cli` | |
| Cursor CLI | `cursor-cli` | |
| Gemini CLI | `gemini-cli` | Needs `GOOGLE_API_KEY` or `GEMINI_API_KEY` |
| Goose | `goose` | Model format: `provider/model_name` |
| Mini SWE-agent | `mini-swe-agent` | |
| Pi | `pi` | |
| SWE-agent | `swe-agent` | |
| OpenCode | `opencode` | |
| OpenHands | `openhands` | Needs `LLM_API_KEY` and `LLM_MODEL` |
2 changes: 2 additions & 0 deletions skills/harbor-cli/references/flags.md
@@ -76,6 +76,7 @@ Detailed flags for every Harbor CLI command. This file is the authoritative reference
|------|-------|------|---------|-------------|
| `--verifier-import-path` | | str | | Python import path for custom verifier |
| `--verifier-kwarg` | `--vk` | str list | | Verifier kwarg as `key=value`. Can repeat |
| `--verifier-env` | `--ve` | str list | | Env var for verifier as `KEY=VALUE`. Overrides task.toml `[verifier.env]`. Can repeat |

### Job control flags

@@ -183,6 +184,7 @@ Shares many flags with `harbor jobs start` but operates on a single task. Key differences
| `--trials-dir` | | Path | `./trials` | Output directory |
| `--agent-env` | `--ae` | str list | | Env var for agent as `KEY=VALUE`. Can repeat |
| `--agent-kwarg` | `--ak` | str list | | Agent kwarg as `key=value`. Can repeat |
| `--verifier-env` | `--ve` | str list | | Env var for verifier as `KEY=VALUE`. Overrides task.toml `[verifier.env]`. Can repeat |
| `--agent-import-path` | | str | | Custom agent import path |
| `--environment-import-path` | | str | | Custom environment import path |
| `--environment-kwarg` | `--ek` | str list | | Environment kwarg as `key=value`. Can repeat |
73 changes: 68 additions & 5 deletions skills/harbor-task-creator/SKILL.md
@@ -231,7 +231,70 @@ The test script runs inside the container AFTER the agent finishes. It determines
- `reward.json` supports multiple named rewards (e.g., `{"reward": 1.0, "quality": 0.8, "style": 0.6}`)
- MUST write the reward file in ALL code paths — if missing, the trial fails with `RewardFileNotFoundError`; if empty, `RewardFileEmptyError`

**Pattern 1: Pytest + CTRF (recommended for complex tasks)**
**Pattern 1: Harbor Reward Kit (recommended for new tasks)**

[`harbor-rewardkit`](https://harborframework.com/docs/rewardkit) is a lightweight package that lets you define criteria as Python functions or TOML-based LLM judges. The test.sh becomes a one-liner and rewardkit handles reward file writing.

```bash
#!/bin/bash
uvx [email protected] /tests
```

Write criteria as Python files in `tests/` using the `@criterion` decorator or built-in criteria:

```python
# tests/correctness/check.py — programmatic criteria
from rewardkit import criteria, criterion
from pathlib import Path

# Built-in criteria (file checks, HTTP, SQLite, etc.)
criteria.file_exists("/app/output.txt")
criteria.file_contains("/app/output.txt", "expected")

# Custom criterion using @criterion decorator
@criterion(shared=True)
def output_is_valid(workspace: Path) -> float:
    """Check the output meets requirements."""
    p = workspace / "output.txt"
    if not p.exists():
        return 0.0
    lines = p.read_text().strip().splitlines()
    return 1.0 if len(lines) >= 5 else 0.0
```

For LLM-as-judge evaluation, add a TOML config file in `tests/`:

```toml
# tests/quality/quality.toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]

[[criterion]]
name = "code_quality"
description = "Is the code well-structured and readable?"
type = "likert"
points = 5
weight = 1.0

[[criterion]]
name = "edge_case_handling"
description = "Does the code handle edge cases like empty strings?"
type = "binary"
weight = 1.0
```

Rewardkit aggregates all criteria scores into a weighted average and writes `reward.txt` automatically. Set `[verifier.env]` in task.toml to pass LLM API keys:

```toml
[verifier]
timeout_sec = 120.0

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```
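
The weighted-average aggregation described above can be sketched as follows (an illustrative re-implementation of the idea, not rewardkit's actual code):

```python
def aggregate(criteria_scores):
    """Weighted average of criterion scores.

    criteria_scores: list of (score, weight) pairs, each score in [0, 1].
    """
    total_weight = sum(weight for _, weight in criteria_scores)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in criteria_scores) / total_weight

# A passing binary criterion plus a 4/5 likert criterion, equal weights:
print(aggregate([(1.0, 1.0), (4 / 5, 1.0)]))  # → 0.9
```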

**Pattern 2: Pytest + CTRF (for tasks with complex test logic)**

```bash
#!/bin/bash
```

@@ -258,7 +321,7 @@ Use `set -uo pipefail` (without `-e`) so the script continues to write the reward file

CTRF (`ctrf.json`) is displayed in the Harbor Viewer UI for detailed per-test inspection. It is NOT used for reward calculation — the reward always comes from `reward.txt` or `reward.json`.

**Pattern 2: Simple file check (for straightforward tasks)**
**Pattern 3: Simple file check (for straightforward tasks)**

```bash
#!/bin/bash
@@ -269,7 +332,7 @@
else
fi
```

**Pattern 3: Partial credit with multiple named rewards**
**Pattern 4: Partial credit with multiple named rewards**

```bash
#!/bin/bash
@@ -291,9 +354,9 @@
echo "scale=2; $SCORE / $TOTAL" | bc > /logs/verifier/reward.txt
```

**Key rules:**
- Working directory when test.sh runs is `/tests/`
- Install test dependencies in test.sh, NOT in the Dockerfile
- Always write the reward file, even on error
- Always write the reward file, even on error (rewardkit handles this automatically)
- Use CTRF output (`--ctrf`) for pytest — Harbor displays it in the Viewer for human inspection
- Pin all dependency versions in test.sh (e.g., `pytest==8.4.1`)
- Pin all dependency versions in test.sh (e.g., `pytest==8.4.1`, `[email protected]`)

## Writing solution/solve.sh
