diff --git a/Dockerfile.claude-code b/Dockerfile.claude-code new file mode 100644 index 00000000..6e0d0a2f --- /dev/null +++ b/Dockerfile.claude-code @@ -0,0 +1,17 @@ +FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim + +RUN apt-get update && \ + apt-get install -y --no-install-recommends ca-certificates git && \ + rm -rf /var/lib/apt/lists/* + +WORKDIR /app + +# Python deps — only what the agent needs (harbor excluded via .dockerignore) +COPY pyproject.toml ./ +RUN uv pip install --system . + +# Agent code +COPY agent-claude-code.py ./ + +RUN ln -sf $(which python3) /usr/local/bin/python +RUN mkdir -p /logs /app/output /task/output diff --git a/README.md b/README.md index 60a54620..6888ac5e 100644 --- a/README.md +++ b/README.md @@ -1,26 +1,37 @@ -# autoagent +
+
+
+
+
+ Built by thirdlayer.inc +
- +> We're launching a product around self-configuring agents soon. [Sign up here](https://form.typeform.com/to/ZQbnbO09). + +# AutoAgent -Like [autoresearch](https://github.com/karpathy/autoresearch) but for agent engineering. Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats. +> Like autoresearch but for agent engineering. Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats. -The core idea is the same: you're not touching the harness Python files like you normally would as an engineer. Instead, you program `program.md` — the Markdown file that provides context to the meta-agent and defines the agent-engineering loop. + + +The core idea is the same: you're not touching the harness Python files like you normally would as an engineer. Instead, you program `program.md`, the Markdown file that provides context to the meta-agent and defines the agent-engineering loop. ## How it works The repo has a few files and directories that matter: -- **`agent.py`** — the entire harness under test in a single file. It contains +- **`agent.py`** -- the entire harness under test in a single file. It contains config, tool definitions, agent registry, routing/orchestration, and the Harbor adapter boundary. The adapter section is explicitly marked as fixed; the rest is the primary edit surface for the meta-agent. -- **`program.md`** — instructions for the meta-agent + the directive (what +- **`program.md`** -- instructions for the meta-agent + the directive (what kind of agent to build). **This file is edited by the human**. -- **`tasks/`** — evaluation tasks in +- **`tasks/`** -- evaluation tasks in [harbor](https://github.com/laude-institute/harbor) format. In a clean baseline branch, benchmark payloads may be omitted and added in benchmark-specific branches. -- **`.agent/`** — optional workspace artifacts for reusable instructions, +- **`.agent/`** -- optional workspace artifacts for reusable instructions, notes, prompts, or skills. The metric is total **score** produced by the benchmark's task test suites. The @@ -70,16 +81,16 @@ benchmark, diagnose failures, modify `agent.py`, and iterate. ## Project structure ```text -agent.py — single-file harness under test - editable harness section — prompt, registries, tools, routing - fixed adapter section — Harbor integration + trajectory serialization -program.md — meta-agent instructions + directive -Dockerfile.base — base image -.agent/ — optional agent workspace artifacts -tasks/ — benchmark tasks, typically added in benchmark-specific branches -jobs/ — Harbor job outputs -results.tsv — experiment log (created by meta-agent, gitignored) -run.log — latest run output +agent.py -- single-file harness under test + editable harness section -- prompt, registries, tools, routing + fixed adapter section -- Harbor integration + trajectory serialization +program.md -- meta-agent instructions + directive +Dockerfile.base -- base image +.agent/ -- optional agent workspace artifacts +tasks/ -- benchmark tasks, typically added in benchmark-specific branches +jobs/ -- Harbor job outputs +results.tsv -- experiment log (created by meta-agent, gitignored) +run.log -- latest run output ``` ## Task format @@ -88,17 +99,17 @@ When present, tasks follow [harbor](https://github.com/laude-institute/harbor)'s ```text tasks/my-task/ - task.toml — config (timeouts, metadata) - instruction.md — prompt sent to the agent + task.toml -- config (timeouts, metadata) + instruction.md -- prompt sent to the agent tests/ - test.sh — entry point, writes /logs/reward.txt - test.py — verification (deterministic or LLM-as-judge) + test.sh -- entry point, writes /logs/reward.txt + test.py -- verification (deterministic or LLM-as-judge) environment/ - Dockerfile — task container (FROM autoagent-base) - files/ — reference files mounted into container + Dockerfile -- task container (FROM autoagent-base) + files/ -- reference files mounted into container ``` -Tests write a score (0.0–1.0) to the verifier logs. The meta-agent hill-climbs +Tests write a score (0.0-1.0) to the verifier logs. The meta-agent hill-climbs on this. ## Design choices @@ -108,8 +119,7 @@ on this. - **Single-file, registry-driven harness.** The implementation lives in one file for simplicity, but agent and tool registration stay structured so the harness can still evolve cleanly. -- **Docker isolation.** The agent-under-test runs in a container. It can't - damage the host. +- **Docker isolation.** The agent runs in a container. It can't damage the host. - **Score-driven.** Every experiment produces a numeric score. Keep if better, discard if not. Same loop as autoresearch. - **Harbor-compatible tasks.** Tasks use the same format as harbor benchmarks, @@ -130,7 +140,7 @@ docker system prune -a -f docker container prune -f ``` -If Docker becomes unresponsive (e.g. after many concurrent runs), restart +If Docker becomes unresponsive (for example after many concurrent runs), restart Docker Desktop: ```bash @@ -144,3 +154,4 @@ You can equip the agent with [Agent Skills for Context Engineering](https://gith ## License MIT + diff --git a/agent-claude-code.py b/agent-claude-code.py new file mode 100644 index 00000000..946e4cd9 --- /dev/null +++ b/agent-claude-code.py @@ -0,0 +1,351 @@ +""" +Claude Code CLI adapter for Harbor. + +Runs the locally-installed `claude` CLI on the HOST machine (not inside Docker). +No API key needed — uses your existing Claude Code OAuth session. + +SECURITY NOTE: Unlike the SDK-based adapters that run inside containers, this +adapter executes `claude` on the host with bypassPermissions. The CLI can access +any host file/network resource. Only run on trusted task sets. + +Prerequisites: + - Claude Code CLI installed: npm install -g @anthropic-ai/claude-code + - Authenticated locally: claude login + +Run all tasks: + docker build -f Dockerfile.claude-code -t autoagent-base . + uv run harbor run -p tasks/ --agent-import-path agent-claude-code:AutoAgent -o jobs --job-name latest > run.log 2>&1 +""" + +from __future__ import annotations + +import asyncio +import json +import logging +import subprocess +import tempfile +import time +from datetime import datetime, timezone +from pathlib import Path + +logger = logging.getLogger(__name__) + + +# =========================================================================== +# AGENT CONFIG — meta-agent modifies this section +# =========================================================================== + +SYSTEM_PROMPT = """You are a highly capable task-completion agent working inside a sandboxed environment. + +## Approach +1. Read the task instruction to understand what's required. +2. Explore the working environment — check what files, tools, and libraries are available. +3. Plan your approach, then execute step by step. +4. Write output files to the exact paths specified in the instructions. +5. Verify your output before finishing. + +## Key rules +- All paths referenced in the task are relative to your current working directory. +- Use python3 (not python) for running scripts. +- For data analysis: pandas, numpy, openpyxl are available. +- For file manipulation: use standard Python or shell tools. +- Always verify output files exist and contain valid content before finishing. +- If a task involves git repos, use git commands directly. +- If a task involves databases, use sqlite3 CLI or Python sqlite3 module. +- Read error messages carefully and fix issues iteratively. +- Never give up — try multiple approaches if one fails. +""" + +MODEL = "sonnet" +MAX_TURNS = 30 +MAX_BUDGET_USD = 1.0 +TIMEOUT_SEC = 540 +PERMISSION_MODE = "bypassPermissions" +ALLOWED_TOOLS: list[str] = [] +CLI_EXTRA_FLAGS: list[str] = [] + + +def build_cli_args(instruction_text: str) -> list[str]: + """Build the claude CLI argument list. Modify to change execution strategy.""" + args = [ + "claude", + "--print", + "--output-format", "stream-json", + "--verbose", + "--model", MODEL, + "--max-turns", str(MAX_TURNS), + "--max-budget-usd", str(MAX_BUDGET_USD), + "--permission-mode", PERMISSION_MODE, + ] + if SYSTEM_PROMPT: + args.extend(["--system-prompt", SYSTEM_PROMPT]) + for tool in ALLOWED_TOOLS: + args.extend(["--allowedTools", tool]) + args.extend(CLI_EXTRA_FLAGS) + args.append(instruction_text) + return args + + +# =========================================================================== +# HARBOR ADAPTER — fixed harness, do not modify +# =========================================================================== + +from harbor.agents.base import BaseAgent # noqa: E402 +from harbor.environments.base import BaseEnvironment # noqa: E402 +from harbor.models.agent.context import AgentContext # noqa: E402 + + +def _parse_claude_json_output(raw: str) -> list[dict]: + """Parse Claude Code NDJSON output (newline-delimited JSON objects). + + Non-JSON lines (e.g. CLI error messages) are logged as warnings. + """ + messages = [] + for line in raw.strip().splitlines(): + line = line.strip() + if not line: + continue + try: + messages.append(json.loads(line)) + except json.JSONDecodeError: + logger.warning("Non-JSON line in claude output: %s", line[:200]) + return messages + + +def _to_atif(messages: list[dict], duration_ms: int) -> dict: + """Convert Claude Code stream-json output to an ATIF (Agent Trajectory + Interchange Format) v1.6 trajectory dict.""" + steps: list[dict] = [] + step_id = 0 + now = datetime.now(timezone.utc).isoformat() + + cost_usd = None + session_id = "unknown" + num_turns = 0 + model_name = MODEL + total_input_tokens = 0 + total_output_tokens = 0 + total_cache_tokens = 0 + + pending_tools: dict[str, dict] = {} + + def _step(source: str, message: str, **extra: object) -> dict: + nonlocal step_id + step_id += 1 + step = { + "step_id": step_id, + "timestamp": now, + "source": source, + "message": message, + } + step.update({k: v for k, v in extra.items() if v is not None}) + return step + + for msg in messages: + msg_type = msg.get("type", "") + + if msg_type == "assistant": + content_blocks = msg.get("message", {}).get("content", []) + texts = [] + reasoning = None + for block in content_blocks: + block_type = block.get("type", "") + if block_type == "text": + texts.append(block.get("text", "")) + elif block_type == "thinking": + reasoning = block.get("thinking", "") + elif block_type == "tool_use": + pending_tools[block.get("id", "")] = { + "tool_call_id": block.get("id", ""), + "function_name": block.get("name", "unknown"), + "arguments": block.get("input", {}), + } + if texts or reasoning: + steps.append(_step( + "agent", + "\n".join(texts) or "(thinking)", + reasoning_content=reasoning, + model_name=msg.get("message", {}).get("model", model_name), + )) + if msg.get("message", {}).get("model"): + model_name = msg["message"]["model"] + if msg.get("session_id"): + session_id = msg["session_id"] + + elif msg_type == "user": + content = msg.get("message", {}).get("content", []) + if isinstance(content, list): + for block in content: + if block.get("type") == "tool_result": + tool_id = block.get("tool_use_id", "") + result_content = block.get("content", "") + if isinstance(result_content, list): + result_content = "\n".join( + b.get("text", str(b)) for b in result_content + ) + tc = pending_tools.pop(tool_id, None) + if tc: + steps.append(_step( + "agent", + f"Tool: {tc['function_name']}", + tool_calls=[tc], + observation={"results": [{ + "source_call_id": tool_id, + "content": str(result_content), + }]}, + )) + + elif msg_type == "result": + cost_usd = msg.get("total_cost_usd") + if msg.get("session_id"): + session_id = msg["session_id"] + num_turns = msg.get("num_turns", num_turns) + result_usage = msg.get("usage", {}) + if result_usage: + total_input_tokens = result_usage.get("input_tokens", 0) + total_output_tokens = result_usage.get("output_tokens", 0) + total_cache_tokens = result_usage.get("cache_read_input_tokens", 0) + + for tc in pending_tools.values(): + steps.append(_step( + "agent", f"Tool: {tc['function_name']}", tool_calls=[tc], + )) + + if not steps: + steps.append(_step("user", "(empty)")) + + return { + "schema_version": "ATIF-v1.6", + "session_id": session_id, + "agent": {"name": "autoagent-claude-code", "version": "0.1.0", "model_name": model_name}, + "steps": steps, + "final_metrics": { + "total_prompt_tokens": total_input_tokens, + "total_completion_tokens": total_output_tokens, + "total_cached_tokens": total_cache_tokens, + "total_cost_usd": cost_usd, + "total_steps": len(steps), + "extra": { + "duration_ms": duration_ms, + "num_turns": num_turns, + }, + }, + } + + +class AutoAgent(BaseAgent): + """Harbor agent adapter. Runs Claude Code CLI on the HOST in a temp + directory, syncing task files from the container beforehand and results + back afterward.""" + + SUPPORTS_ATIF = True + + def __init__(self, *args, extra_env: dict[str, str] | None = None, **kwargs): + super().__init__(*args, **kwargs) + self._extra_env = dict(extra_env) if extra_env else {} + + @staticmethod + def name() -> str: + return "autoagent" + + def version(self) -> str | None: + return "0.1.0" + + async def setup(self, environment: BaseEnvironment) -> None: + pass + + async def run( + self, instruction: str, environment: BaseEnvironment, context: AgentContext + ) -> None: + # 1. Create a temp workdir on the host that mirrors /task in container + with tempfile.TemporaryDirectory(prefix="autoagent_") as tmpdir: + workdir = Path(tmpdir) + task_dir = workdir / "task" + task_dir.mkdir() + + instr_file = task_dir / "instruction.md" + instr_file.write_text(instruction) + + # Download pre-existing files from the container's /task/ + await environment.download_dir( + source_dir="/task", target_dir=task_dir + ) + + # 2. Rewrite absolute /task/ paths to the temp workdir path + # so claude writes files to the correct location + task_prefix = str(task_dir) + local_instruction = instruction.replace("/task/", f"{task_prefix}/") + + cli_args = build_cli_args(local_instruction) + + # 3. Run claude CLI on the HOST (async to avoid blocking event loop) + t0 = time.time() + try: + result = await asyncio.to_thread( + subprocess.run, + cli_args, + capture_output=True, + text=True, + timeout=TIMEOUT_SEC, + cwd=str(task_dir), + ) + duration_ms = int((time.time() - t0) * 1000) + raw_output = result.stdout or "" + stderr_output = result.stderr or "" + + if result.returncode != 0: + logger.error( + "claude CLI exited with code %d. stderr: %s", + result.returncode, + stderr_output[:500], + ) + except subprocess.TimeoutExpired as exc: + duration_ms = int((time.time() - t0) * 1000) + raw_output = ( + exc.stdout.decode("utf-8", errors="replace") + if isinstance(exc.stdout, bytes) + else (exc.stdout or "") + ) + stderr_output = f"ERROR: claude CLI timed out after {TIMEOUT_SEC}s" + logger.error("claude CLI timed out after %ds (partial output preserved)", TIMEOUT_SEC) + except FileNotFoundError: + duration_ms = int((time.time() - t0) * 1000) + raw_output = "" + stderr_output = ( + "ERROR: claude CLI not found on host. " + "Install: npm install -g @anthropic-ai/claude-code" + ) + logger.error("claude CLI not found on host") + + if raw_output: + (self.logs_dir / "claude_raw_output.txt").write_text(raw_output) + if stderr_output: + (self.logs_dir / "claude_stderr.txt").write_text(stderr_output) + + # 4. Sync files created by claude back to the container + await environment.upload_dir( + source_dir=task_dir, target_dir="/task" + ) + + # 5. Build ATIF trajectory + messages = _parse_claude_json_output(raw_output) + atif = _to_atif(messages, duration_ms) + + traj_path = self.logs_dir / "trajectory.json" + traj_path.write_text(json.dumps(atif, indent=2)) + + # 6. Populate Harbor metrics + fm = atif.get("final_metrics", {}) + context.cost_usd = fm.get("total_cost_usd") + context.n_input_tokens = fm.get("total_prompt_tokens", 0) + context.n_output_tokens = fm.get("total_completion_tokens", 0) + context.n_cache_tokens = fm.get("total_cached_tokens", 0) + + cost = fm.get("total_cost_usd") or 0 + turns = fm.get("extra", {}).get("num_turns", 0) + logger.info( + "cost_usd=%.4f turns=%d duration_ms=%d", cost, turns, duration_ms + ) + + +__all__ = ["AutoAgent"] diff --git a/program-claude-code.md b/program-claude-code.md new file mode 100644 index 00000000..ebbc7e6a --- /dev/null +++ b/program-claude-code.md @@ -0,0 +1,105 @@ +# autoagent — Claude Code CLI variant + +Autonomous agent engineering. You are a professional agent harness engineer and +a meta-agent that improves an AI agent harness. + +Your job is not to solve benchmark tasks directly. Your job is to improve the +harness in `agent-claude-code.py` so the agent gets better at solving tasks on +its own. + +## Directive + +Build a generally capable autonomous coding and terminal agent using Claude Code +CLI as the execution backend. + +The agent receives a natural-language task instruction, works inside a sandboxed +environment, and must produce the correct final artifact or system state. +Evaluation is done by task-specific verifiers. + +## Architecture + +This variant runs the Claude Code CLI on the **host machine** (not inside the +container) using the user's existing OAuth session. The CLI handles tool +execution, file operations, and shell commands internally. The adapter syncs +task files from the container to a host temp directory, runs `claude`, then +syncs results back for verification. + +**Security note:** Because the CLI runs on the host with `bypassPermissions`, +it can access any host file or network resource. Only run on trusted task sets. + +The harness controls: + +- **System prompt**: what instructions the agent receives +- **Model selection**: which Claude model to use +- **CLI flags**: permission mode, max turns, allowed tools +- **Prompt framing**: how the task instruction is presented to the CLI + +## Prerequisites + +- Claude Code CLI installed on the host: `npm install -g @anthropic-ai/claude-code` +- Authenticated: `claude login` (no API key needed — uses your subscription) + +## Setup + +Before starting a new experiment: + +1. Read `README.md`, this file, and `agent-claude-code.py`. +2. Verify `claude --version` works on your host machine. +3. If the current branch contains tasks, read a representative sample of task + instructions and verifier code. +4. Build the base image and verify the agent imports cleanly: + ```bash + docker build -f Dockerfile.claude-code -t autoagent-base . + ``` +5. Initialize `results.tsv` if it does not exist. + +The first run must always be the unmodified baseline. + +## What You Can Modify + +Everything above the `HARBOR ADAPTER` comment in `agent-claude-code.py`: + +- `SYSTEM_PROMPT` — agent instructions +- `MODEL` — Claude model to use (sonnet, haiku, opus) +- `MAX_TURNS` — maximum conversation turns +- `PERMISSION_MODE` — CLI permission mode +- `ALLOWED_TOOLS` — restrict which tools the CLI can use +- `CLI_EXTRA_FLAGS` — additional CLI flags +- `build_cli_args()` — how the CLI is invoked + +## Improvement Axes + +Since Claude Code CLI handles its own tool execution internally, the main +levers are: + +1. **System prompt engineering** — task decomposition strategies, verification + steps, error recovery patterns +2. **Model selection** — balancing capability vs cost +3. **Prompt framing** — how the task instruction is presented +4. **Tool restrictions** — limiting tools to reduce distraction +5. **Turn budget** — balancing thoroughness vs cost + +## What You Must Not Modify + +Inside `agent-claude-code.py`, there is a fixed adapter boundary marked by +comments. Do not modify that fixed section unless the human explicitly asks. + +## How to Run + +```bash +docker build -f Dockerfile.claude-code -t autoagent-base . +rm -rf jobs && mkdir -p jobs && uv run harbor run -p tasks/ -n 100 --agent-import-path agent-claude-code:AutoAgent -o jobs --job-name latest > run.log 2>&1 +``` + +No `.env` file or API key needed — the CLI runs on the host and uses your +existing Claude Code session directly. + +## Goal, Logging, Experiment Loop, Keep/Discard Rules + +Same as `program.md`. Maximize passed tasks. Log to `results.tsv`. Keep if +passed improves or harness is simpler at same performance. Discard otherwise. + +## NEVER STOP + +Once the experiment loop begins, do NOT stop to ask whether you should continue. +Continue iterating until the human explicitly interrupts you.