17 changes: 17 additions & 0 deletions Dockerfile.claude-code
@@ -0,0 +1,17 @@
FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim

RUN apt-get update && \
apt-get install -y --no-install-recommends ca-certificates git && \
rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Python deps — only what the agent needs (harbor excluded via .dockerignore)
COPY pyproject.toml ./
RUN uv pip install --system .

# Agent code
COPY agent-claude-code.py ./

RUN ln -sf $(which python3) /usr/local/bin/python
RUN mkdir -p /logs /app/output /task/output
67 changes: 39 additions & 28 deletions README.md
@@ -1,26 +1,37 @@
# autoagent
<p align="center">
<a href="https://www.thirdlayer.inc">
<img src="https://www.thirdlayer.inc/thirdlayer-logo.svg" alt="thirdlayer" width="200">
</a>
</p>
<p align="center">
Built by <a href="https://www.thirdlayer.inc">thirdlayer.inc</a>
</p>

![teaser](progress.png)
> We're launching a product around self-configuring agents soon. [Sign up here](https://form.typeform.com/to/ZQbnbO09).

# AutoAgent

Like [autoresearch](https://github.com/karpathy/autoresearch) but for agent engineering. Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.
> Like autoresearch but for agent engineering. Give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.

The core idea is the same: you're not touching the harness Python files like you normally would as an engineer. Instead, you program `program.md` — the Markdown file that provides context to the meta-agent and defines the agent-engineering loop.
![teaser](progress.png)

The core idea is the same: you're not touching the harness Python files like you normally would as an engineer. Instead, you program `program.md`, the Markdown file that provides context to the meta-agent and defines the agent-engineering loop.
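The keep-or-discard loop described in the introduction can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: `hill_climb`, `propose`, and `evaluate` are hypothetical names, and the toy configuration stands in for an edit to the harness followed by a benchmark run.

```python
from typing import Callable, Tuple


def hill_climb(
    initial: dict,
    propose: Callable[[dict], dict],
    evaluate: Callable[[dict], float],
    iterations: int,
) -> Tuple[dict, float]:
    """Greedy loop: keep a candidate only if it scores strictly better."""
    best, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = propose(best)       # stand-in for "modify the harness"
        score = evaluate(candidate)     # stand-in for "run the benchmark"
        if score > best_score:          # keep if better, discard otherwise
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_propose(cfg: dict) -> dict:
    return {"x": cfg["x"] + 1}


def toy_evaluate(cfg: dict) -> float:
    return -abs(cfg["x"] - 3)  # best possible score is at x == 3


if __name__ == "__main__":
    print(hill_climb({"x": 0}, toy_propose, toy_evaluate, iterations=10))
```

In the real workflow the proposal step is the meta-agent editing `agent.py`, and the evaluation step is a full benchmark run, so each iteration is far more expensive than in this toy.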

## How it works

The repo has a few files and directories that matter:

- **`agent.py`** the entire harness under test in a single file. It contains
- **`agent.py`** -- the entire harness under test in a single file. It contains
config, tool definitions, agent registry, routing/orchestration, and the
Harbor adapter boundary. The adapter section is explicitly marked as fixed;
the rest is the primary edit surface for the meta-agent.
- **`program.md`** instructions for the meta-agent + the directive (what
- **`program.md`** -- instructions for the meta-agent + the directive (what
kind of agent to build). **This file is edited by the human**.
- **`tasks/`** evaluation tasks in
- **`tasks/`** -- evaluation tasks in
[harbor](https://github.com/laude-institute/harbor) format. In a clean
baseline branch, benchmark payloads may be omitted and added in
benchmark-specific branches.
- **`.agent/`** optional workspace artifacts for reusable instructions,
- **`.agent/`** -- optional workspace artifacts for reusable instructions,
notes, prompts, or skills.

The metric is total **score** produced by the benchmark's task test suites. The
@@ -70,16 +81,16 @@ benchmark, diagnose failures, modify `agent.py`, and iterate.
## Project structure

```text
agent.py single-file harness under test
editable harness section prompt, registries, tools, routing
fixed adapter section Harbor integration + trajectory serialization
program.md meta-agent instructions + directive
Dockerfile.base base image
.agent/ optional agent workspace artifacts
tasks/ benchmark tasks, typically added in benchmark-specific branches
jobs/ Harbor job outputs
results.tsv experiment log (created by meta-agent, gitignored)
run.log latest run output
agent.py -- single-file harness under test
editable harness section -- prompt, registries, tools, routing
fixed adapter section -- Harbor integration + trajectory serialization
program.md -- meta-agent instructions + directive
Dockerfile.base -- base image
.agent/ -- optional agent workspace artifacts
tasks/ -- benchmark tasks, typically added in benchmark-specific branches
jobs/ -- Harbor job outputs
results.tsv -- experiment log (created by meta-agent, gitignored)
run.log -- latest run output
```
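The README only says that `results.tsv` is an experiment log created by the meta-agent; it does not define its columns. As a hedged sketch, one way such a log could be appended to is shown here. The column names and the `log_experiment` helper are assumptions made for illustration, not the format the meta-agent actually writes.

```python
import csv
from pathlib import Path

# Hypothetical columns; the real results.tsv layout is not specified.
FIELDS = ["experiment", "score", "kept", "note"]


def log_experiment(path: str, experiment: str, score: float,
                   kept: bool, note: str = "") -> None:
    """Append one tab-separated row, writing the header on first use."""
    out = Path(path)
    is_new = not out.exists() or out.stat().st_size == 0
    with out.open("a", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow(
            [experiment, f"{score:.4f}", "keep" if kept else "discard", note]
        )
```

An append-only log like this keeps discarded experiments visible, which can help when diagnosing why a change did not improve the score.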

## Task format
@@ -88,17 +99,17 @@ When present, tasks follow [harbor](https://github.com/laude-institute/harbor)'s

```text
tasks/my-task/
task.toml config (timeouts, metadata)
instruction.md prompt sent to the agent
task.toml -- config (timeouts, metadata)
instruction.md -- prompt sent to the agent
tests/
test.sh entry point, writes /logs/reward.txt
test.py verification (deterministic or LLM-as-judge)
test.sh -- entry point, writes /logs/reward.txt
test.py -- verification (deterministic or LLM-as-judge)
environment/
Dockerfile task container (FROM autoagent-base)
files/ reference files mounted into container
Dockerfile -- task container (FROM autoagent-base)
files/ -- reference files mounted into container
```

Tests write a score (0.0–1.0) to the verifier logs. The meta-agent hill-climbs
Tests write a score (0.0-1.0) to the verifier logs. The meta-agent hill-climbs
on this.
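As a rough sketch of the verification step, a deterministic `test.py` might compute a pass fraction and write it to the reward file. The plain-text single-float file format and the helper names `fraction_passed` and `write_reward` are assumptions; only the `/logs/reward.txt` path and the 0.0 to 1.0 range come from the task layout described here.

```python
from pathlib import Path


def fraction_passed(results: list[bool]) -> float:
    """Share of checks that passed; an empty list scores 0.0."""
    return sum(results) / len(results) if results else 0.0


def write_reward(score: float, path: str = "/logs/reward.txt") -> float:
    """Clamp the score to [0.0, 1.0] and write it as a single line."""
    clamped = max(0.0, min(1.0, float(score)))
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(f"{clamped}\n")  # assumed format: one float per file
    return clamped
```

Clamping before writing guards against a buggy check producing a score outside the expected range, which would otherwise distort the keep-or-discard comparison.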

## Design choices
@@ -108,8 +119,7 @@ on this.
- **Single-file, registry-driven harness.** The implementation lives in one
file for simplicity, but agent and tool registration stay structured so the
harness can still evolve cleanly.
- **Docker isolation.** The agent-under-test runs in a container. It can't
damage the host.
- **Docker isolation.** The agent runs in a container. It can't damage the host.
- **Score-driven.** Every experiment produces a numeric score. Keep if better,
discard if not. Same loop as autoresearch.
- **Harbor-compatible tasks.** Tasks use the same format as harbor benchmarks,
@@ -130,7 +140,7 @@ docker system prune -a -f
docker container prune -f
```

If Docker becomes unresponsive (e.g. after many concurrent runs), restart
If Docker becomes unresponsive (for example after many concurrent runs), restart
Docker Desktop:

```bash
@@ -144,3 +154,4 @@ You can equip the agent with [Agent Skills for Context Engineering](https://gith
## License

MIT
