9 changes: 8 additions & 1 deletion skills/harbor-adapter-creator/SKILL.md
@@ -334,7 +334,10 @@ version = "1.0"
source = "mybenchmark/instance-42"

[metadata]
author_name = "Your Name"
# author_name/author_email are optional; when present they credit the ORIGINAL
# benchmark authors, not the adapter builder. Adapter builder info belongs in
# adapter_metadata.json under "adapter_builders".
author_name = "Jane Doe" # original benchmark author
difficulty = "medium" # easy, medium, hard
category = "qa"
tags = ["factual", "reasoning"]
@@ -398,6 +401,8 @@ The `adapter_pr`, `dataset_pr`, and `parity_pr` fields in `parity_experiment.json`

The `harbor_adapter` entry must include `parity_matching_agents` (array of `"agent@version+model"` strings). In the harbor-datasets `registry.json`, use `"version": "parity"` to enable `harbor jobs start -d mybenchmark@parity`.

**notes field:** Use `notes` to document anything that affects the comparison's validity. In particular, if the original benchmark published a single score while you ran 3 Harbor trials (or vice versa), you **must** explain the asymmetry here — e.g., `"50 tasks; original published 1 score vs. 3 Harbor runs; see notes for variance"`. Reviewers will flag unexplained run-count mismatches.
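
As a sketch, a `parity_experiment.json` entry combining the fields discussed above might look like this (the overall file shape and all values are illustrative assumptions; only the field names `notes` and `parity_matching_agents` come from this section, and the agent version string is made up):

```json
{
  "notes": "50 tasks; original published 1 score vs. 3 Harbor runs",
  "harbor_adapter": {
    "parity_matching_agents": ["claude-code@2.0.1+claude-sonnet-4-6"]
  }
}
```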

## adapter_metadata.json

A JSON **array** describing the adapter and its relationship to the original benchmark. Required fields per entry: `adapter_name`, `adapter_builders`, `original_benchmark`, `harbor_adapter`.
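
A minimal sketch of one entry (the four top-level field names are the required ones listed above; the nested shapes and values are illustrative assumptions, not the documented schema):

```json
[
  {
    "adapter_name": "mybenchmark",
    "adapter_builders": [{ "name": "Ada Lovelace", "email": "ada@example.com" }],
    "original_benchmark": { "name": "MyBenchmark", "url": "https://example.com/mybenchmark" },
    "harbor_adapter": { "parity_matching_agents": ["claude-code@2.0.1+claude-sonnet-4-6"] }
  }
]
```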
@@ -491,6 +496,8 @@ See [references/adapter-anatomy.md](references/adapter-anatomy.md#6-post-impleme
- **Using /app vs /workspace inconsistently**: Pick one working directory and be consistent between instruction.md, test.sh, and Dockerfile WORKDIR. Some adapters use `/app`, others use `/workspace`.
- **Not escaping shell-unsafe characters**: Answers containing single quotes, backticks, or dollar signs can break test.sh or solve.sh. Escape them when embedding in shell scripts.
- **Missing `mkdir -p /logs/verifier`**: Some base images may not have this directory. Create it in test.sh before writing reward.txt.
- **`author_name`/`author_email` in task.toml credit original authors, not you**: These optional fields are for the benchmark's original authors. Your contact info as adapter builder goes in `adapter_metadata.json` under `adapter_builders`.
- **Unexplained parity run-count asymmetry**: If the original benchmark published a single score but you ran 3 Harbor trials (or any mismatch), explain this in the `notes` field of `parity_experiment.json`. Reviewers will reject submissions where the number of runs differs without explanation.
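
The escaping pitfall above can be handled with `printf %q`, which emits a shell-safe quoted form of an arbitrary string (the answer string here is a made-up example):

```shell
#!/bin/bash
# Hypothetical expected answer containing a quote, backticks, and a dollar sign.
answer="it's \`true\` that \$HOME varies"

# printf %q quotes the string so it can be embedded verbatim in test.sh.
printf -v quoted '%q' "$answer"
echo "EXPECTED=$quoted"

# Round trip: evaluating the quoted form restores the original string exactly.
eval "restored=$quoted"
[ "$restored" = "$answer" ] && echo "round-trip ok"
```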

## Existing Adapter Patterns

6 changes: 5 additions & 1 deletion skills/harbor-cli/SKILL.md
@@ -54,6 +54,7 @@ The primary command. `harbor run` is an alias for `harbor jobs start`.
| `--jobs-dir` | `-o` | Output directory (default: `./jobs`) |
| `--agent-env` | `--ae` | Pass env var to agent: `KEY=VALUE`. Repeatable |
| `--agent-kwarg` | `--ak` | Agent kwarg: `key=value`. Repeatable |
| `--verifier-env` | `--ve` | Env var for verifier: `KEY=VALUE`. Repeatable |
| `--task-name` | `-t` | Include tasks by glob pattern. Repeatable |
| `--exclude-task-name` | `-x` | Exclude tasks by glob. Repeatable |
| `--n-tasks` | `-l` | Max tasks to run |
@@ -144,6 +145,7 @@ Run a single trial. Useful for debugging and task development.
| `--trials-dir` | | Output directory (default: `./trials`) |
| `--agent-env` | `--ae` | Env var for agent: `KEY=VALUE`. Repeatable |
| `--agent-kwarg` | `--ak` | Agent kwarg: `key=value`. Repeatable |
| `--verifier-env` | `--ve` | Env var for verifier: `KEY=VALUE`. Repeatable |
| `--no-cleanup` | | Keep environment after trial |
| `--no-verify` | | Skip running tests |
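
For instance, a usage sketch combining only flags documented in these tables (the dataset name is the placeholder from the parity example earlier; this is illustrative, not a tested invocation):

```bash
# Pass an API key to the verifier (e.g., for an LLM judge) without relying on
# the agent's environment; --ve mirrors --ae but targets the verifier.
harbor run -d mybenchmark@parity \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  --ve ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY"
```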

@@ -409,12 +411,14 @@ Key flags: `--tasks-dir / -t` (default: `tasks`), `--registry / -r` (required),
| Nop | `nop` | Does nothing (baseline) |
| Claude Code | `claude-code` | Needs `ANTHROPIC_API_KEY` |
| Aider | `aider` | |
| Codex | `codex` | |
| Codex | `codex` | Auth via `~/.codex/auth.json` (auto-detected) or `CODEX_AUTH_JSON_PATH`; set `CODEX_FORCE_API_KEY=1` to use `OPENAI_API_KEY` instead |
| Cline CLI | `cline-cli` | Model format: `provider:model-id` |
| Copilot CLI | `copilot-cli` | |
| Cursor CLI | `cursor-cli` | |
| Gemini CLI | `gemini-cli` | Needs `GOOGLE_API_KEY` or `GEMINI_API_KEY` |
| Goose | `goose` | Model format: `provider/model_name` |
| Mini SWE-agent | `mini-swe-agent` | |
| Pi | `pi` | |
| SWE-agent | `swe-agent` | |
| OpenCode | `opencode` | |
| OpenHands | `openhands` | Needs `LLM_API_KEY` and `LLM_MODEL` |
2 changes: 2 additions & 0 deletions skills/harbor-cli/references/flags.md
@@ -76,6 +76,7 @@ Detailed flags for every Harbor CLI command. This file is the authoritative reference
|------|-------|------|---------|-------------|
| `--verifier-import-path` | | str | | Python import path for custom verifier |
| `--verifier-kwarg` | `--vk` | str list | | Verifier kwarg as `key=value`. Can repeat |
| `--verifier-env` | `--ve` | str list | | Env var for verifier as `KEY=VALUE`. Overrides task.toml `[verifier.env]`. Can repeat |

### Job control flags

@@ -183,6 +184,7 @@ Shares many flags with `harbor jobs start` but operates on a single task. Key differences
| `--trials-dir` | | Path | `./trials` | Output directory |
| `--agent-env` | `--ae` | str list | | Env var for agent as `KEY=VALUE`. Can repeat |
| `--agent-kwarg` | `--ak` | str list | | Agent kwarg as `key=value`. Can repeat |
| `--verifier-env` | `--ve` | str list | | Env var for verifier as `KEY=VALUE`. Overrides task.toml `[verifier.env]`. Can repeat |
| `--agent-import-path` | | str | | Custom agent import path |
| `--environment-import-path` | | str | | Custom environment import path |
| `--environment-kwarg` | `--ek` | str list | | Environment kwarg as `key=value`. Can repeat |
73 changes: 68 additions & 5 deletions skills/harbor-task-creator/SKILL.md
@@ -231,7 +231,70 @@ The test script runs inside the container AFTER the agent finishes. It determines
- `reward.json` supports multiple named rewards (e.g., `{"reward": 1.0, "quality": 0.8, "style": 0.6}`)
- MUST write the reward file in ALL code paths — if missing, the trial fails with `RewardFileNotFoundError`; if empty, `RewardFileEmptyError`

**Pattern 1: Pytest + CTRF (recommended for complex tasks)**
**Pattern 1: Harbor Reward Kit (recommended for new tasks)**

[`harbor-rewardkit`](https://harborframework.com/docs/rewardkit) is a lightweight package that lets you define criteria as Python functions or TOML-based LLM judges. The test.sh becomes a one-liner and rewardkit handles reward file writing.

```bash
#!/bin/bash
uvx [email protected] /tests
```

Write criteria as Python files in `tests/` using the `@criterion` decorator or built-in criteria:

```python
# tests/correctness/check.py — programmatic criteria
from rewardkit import criteria, criterion
from pathlib import Path

# Built-in criteria (file checks, HTTP, SQLite, etc.)
criteria.file_exists("/app/output.txt")
criteria.file_contains("/app/output.txt", "expected")

# Custom criterion using @criterion decorator
@criterion(shared=True)
def output_is_valid(workspace: Path) -> float:
    """Check the output meets requirements."""
    p = workspace / "output.txt"
    if not p.exists():
        return 0.0
    lines = p.read_text().strip().splitlines()
    return 1.0 if len(lines) >= 5 else 0.0
```

For LLM-as-judge evaluation, add a TOML config file in `tests/`:

```toml
# tests/quality/quality.toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]

[[criterion]]
name = "code_quality"
description = "Is the code well-structured and readable?"
type = "likert"
points = 5
weight = 1.0

[[criterion]]
name = "edge_case_handling"
description = "Does the code handle edge cases like empty strings?"
type = "binary"
weight = 1.0
```

Rewardkit aggregates all criteria scores into a weighted average and writes `reward.txt` automatically. Set `[verifier.env]` in task.toml to pass LLM API keys:

```toml
[verifier]
timeout_sec = 120.0

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```
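
The weighted-average aggregation described above can be sketched as follows (an illustrative re-implementation of the idea, not rewardkit's actual code):

```python
def aggregate(criteria_scores):
    """Weighted average of criterion scores.

    criteria_scores: list of (score, weight) pairs, each score in [0, 1].
    """
    total_weight = sum(weight for _, weight in criteria_scores)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in criteria_scores) / total_weight

# A passing binary criterion plus a 4/5 likert criterion, equal weights:
print(aggregate([(1.0, 1.0), (4 / 5, 1.0)]))  # → 0.9
```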

**Pattern 2: Pytest + CTRF (for tasks with complex test logic)**

```bash
#!/bin/bash
```

@@ -258,7 +321,7 @@ Use `set -uo pipefail` (without `-e`) so the script continues to write the reward file

CTRF (`ctrf.json`) is displayed in the Harbor Viewer UI for detailed per-test inspection. It is NOT used for reward calculation — the reward always comes from `reward.txt` or `reward.json`.

**Pattern 2: Simple file check (for straightforward tasks)**
**Pattern 3: Simple file check (for straightforward tasks)**

```bash
#!/bin/bash
@@ -269,7 +332,7 @@
else
fi
```

**Pattern 3: Partial credit with multiple named rewards**
**Pattern 4: Partial credit with multiple named rewards**

```bash
#!/bin/bash
@@ -291,9 +354,9 @@
echo "scale=2; $SCORE / $TOTAL" | bc > /logs/verifier/reward.txt
```

**Key rules:**
- Working directory when test.sh runs is `/tests/`
- Install test dependencies in test.sh, NOT in the Dockerfile
- Always write the reward file, even on error
- Always write the reward file, even on error (rewardkit handles this automatically)
- Use CTRF output (`--ctrf`) for pytest — Harbor displays it in the Viewer for human inspection
- Pin all dependency versions in test.sh (e.g., `pytest==8.4.1`)
- Pin all dependency versions in test.sh (e.g., `pytest==8.4.1`, `[email protected]`)

## Writing solution/solve.sh
