Autonomous research agent built directly on Anthropic native tool use — no LangGraph, no LangChain. Plans, browses, computes, remembers, cites.
Live demo · Trace example · Eval results
I wanted to understand what a production agent loop actually looks like at the primitive level. So I wrote one — about 150 lines of Python — and built the rest of the project around making that loop legible: every tool call traced, every retry visible, every run evaluable. The trace viewer is the demo; the eval harness is the receipt.
         ┌─────────────┐
query ──▶│ agent loop  │──▶ stream of events ──▶ frontend (SSE)
         │ (~150 LoC)  │
         └──────┬──────┘
                │
      ┌─────────┼─────────┬──────────────┐
      ▼         ▼         ▼              ▼
 web_search web_fetch python_exec remember/recall
  (Tavily)   (httpx)   (sandbox)      (Chroma)
                │
                ▼
           save_brief
                │
                ▼
           cited brief
Plus: structlog → trace store (SQLite) → /traces/{id} → trace viewer in the UI.
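
The trace path (JSON event → SQLite → lookup by trace id) can be pictured with a small sketch. The table name `trace_events`, its columns, and the helpers `open_store`, `record_event`, and `load_trace` are assumptions made for illustration, not the project's actual schema:

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a trace store. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trace_events ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " trace_id TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def record_event(conn: sqlite3.Connection, trace_id: str, type_: str, payload: dict) -> None:
    """Append one event to a trace, storing the payload as JSON text."""
    conn.execute(
        "INSERT INTO trace_events (trace_id, type, payload) VALUES (?, ?, ?)",
        (trace_id, type_, json.dumps(payload)),
    )
    conn.commit()


def load_trace(conn: sqlite3.Connection, trace_id: str) -> list[dict]:
    """Return the events of one trace in insertion order."""
    rows = conn.execute(
        "SELECT type, payload FROM trace_events WHERE trace_id = ? ORDER BY seq",
        (trace_id,),
    )
    return [{"type": t, "payload": json.loads(p)} for t, p in rows]
```

A `/traces/{id}` handler would then only need to call something like `load_trace(conn, id)` and return the list as JSON.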
The whole loop lives in backend/agent.py — about 180 lines including event emission, retry tracking, and trace recording. The control flow itself fits in this skeleton:
async def run_agent(*, query, session_id, trace_id):
    # 1. Pre-load relevant long-term memory into the system prompt.
    preloaded = await memory.recall(query=query, k=5)
    system = _build_system_prompt(preloaded)
    conversation = [{"role": "user", "content": query}]
    tool_retry_counts: dict[str, int] = {}

    # 2. Plan / act / observe, capped at MAX_ITERATIONS.
    for iteration in range(1, settings.max_iterations + 1):
        yield AgentEvent(type="thinking", ...)
        response = await messages_create(system=system, messages=conversation, tools=tool_schemas)
        conversation.append({"role": "assistant", "content": response.content})

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:  # plain text → done
            yield AgentEvent(type="final", ...)
            return

        tool_results = []
        for tu in tool_uses:
            yield AgentEvent(type="tool_call", ...)
            try:
                validated = registry[tu.name].input_model.model_validate(tu.input)
                output, is_error = await registry[tu.name].run(validated), False
            except ToolError as e:
                output, is_error = str(e), True

            if is_error:
                # .get() avoids a KeyError the first time a given tool fails.
                tool_retry_counts[tu.name] = tool_retry_counts.get(tu.name, 0) + 1
                if tool_retry_counts[tu.name] > settings.max_tool_retries:
                    raise RuntimeError("tool exceeded retry budget")

            yield AgentEvent(type="tool_result", ...)
            tool_results.append({"type": "tool_result", "tool_use_id": tu.id,
                                 "content": str(output), "is_error": is_error})

            if tu.name == "save_brief" and not is_error:
                yield AgentEvent(type="final", payload={"brief": output})
                return

        # 3. Feed the results back to Claude as the next user turn.
        conversation.append({"role": "user", "content": tool_results})
    # Falling out of the loop is the third exit: MAX_ITERATIONS hit.

That's it. Three exits (save_brief called, plain text response, MAX_ITERATIONS hit), one error path (is_error → tool_result, with a hard fail after N retries on the same tool), and pure async generators for streaming. The full file adds structlog binding, SQLite trace recording, contextvars for memory namespacing, and proper Anthropic SDK content-block handling, but the shape above is the whole story.
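
To get from an async generator of events to the SSE stream the frontend reads, each event has to be serialized as a frame. A minimal sketch, assuming an `AgentEvent` with only `type` and `payload` fields (the real class may carry more) and a hypothetical `fake_run` standing in for `run_agent`:

```python
import asyncio
import json
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class AgentEvent:
    # Stand-in for the project's AgentEvent; the real fields may differ.
    type: str
    payload: dict = field(default_factory=dict)


def to_sse(event: AgentEvent) -> str:
    """Encode one event as a Server-Sent Events frame."""
    return f"event: {event.type}\ndata: {json.dumps(event.payload)}\n\n"


async def sse_stream(events: AsyncIterator[AgentEvent]) -> AsyncIterator[str]:
    """Re-yield agent events as SSE frames, stopping after the final one."""
    async for event in events:
        yield to_sse(event)
        if event.type == "final":
            break


async def fake_run() -> AsyncIterator[AgentEvent]:
    # Hypothetical replacement for run_agent(...), scripted for the example.
    yield AgentEvent(type="thinking")
    yield AgentEvent(type="tool_call", payload={"name": "web_search"})
    yield AgentEvent(type="final", payload={"brief": "done"})


async def collect() -> list[str]:
    """Drain the stream into a list so the frames can be inspected."""
    return [frame async for frame in sse_stream(fake_run())]
```

In FastAPI, a generator like `sse_stream(...)` could be passed to `StreamingResponse` with `media_type="text/event-stream"`; whether the project wires it exactly that way is not stated here.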
cp .env.example .env # fill in ANTHROPIC_API_KEY, OPENAI_API_KEY, TAVILY_API_KEY
docker compose up
open http://localhost:3000

See evals/results.md. Summary:

| Metric | v0.1 |
|---|---|
| Task completion rate | TBD |
| Tool success rate | TBD |
| Median iterations | TBD |
| Cost per task (USD) | TBD |
| LLM-judged quality (1-5) | TBD |
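
The numeric rows of the table could be aggregated from per-task run records roughly as follows. This is a sketch under assumptions: the `RunRecord` fields and the `summarize` helper are hypothetical, and the LLM-judged quality score is left out because it needs a separate judging step.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunRecord:
    # Hypothetical per-task record; field names are illustrative only.
    completed: bool
    tool_calls: int
    tool_errors: int
    iterations: int
    cost_usd: float


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate the table's numeric metrics over a batch of runs."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    total_calls = sum(r.tool_calls for r in runs)
    total_errors = sum(r.tool_errors for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / len(runs),
        # With no tool calls at all there is nothing to fail, so report 1.0.
        "tool_success_rate": (total_calls - total_errors) / total_calls if total_calls else 1.0,
        "median_iterations": median(r.iterations for r in runs),
        "cost_per_task_usd": sum(r.cost_usd for r in runs) / len(runs),
    }
```

Tool success rate is computed over all calls rather than averaged per task, so a single tool-heavy run weighs more; averaging per task would be an equally defensible choice.
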

| Tool | Purpose | Safety |
|---|---|---|
| `web_search` | Tavily search, ranked snippets | API-rate-limited |
| `web_fetch` | Fetch URL, extract clean article text | 10s timeout, byte cap |
| `python_exec` | Run Python in subprocess | timeout, RLIMIT, no net, tmpdir-only |
| `remember` | Persist a fact + source to long-term memory | per-user namespacing |
| `recall` | Semantic search over long-term memory | per-user namespacing |
| `save_brief` | Finalize the cited research brief | terminal tool |

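
The `python_exec` safety column (timeout, RLIMIT, tmpdir-only) maps fairly directly onto the standard library. The sketch below is an assumption about how such a tool could look, not the project's implementation: the limit values and the `_cap` / `_apply_limits` helpers are made up for illustration, it is POSIX-only, and it does not show network isolation or scrubbing secrets from the child's environment.

```python
import resource
import subprocess
import sys
import tempfile


def _cap(limit: int, value: int) -> None:
    """Lower a resource limit to `value`, never above the current hard limit."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def _apply_limits() -> None:
    # Runs in the child process just before exec (POSIX only).
    _cap(resource.RLIMIT_CPU, 5)              # seconds of CPU time
    _cap(resource.RLIMIT_FSIZE, 1024 * 1024)  # max bytes per written file


def python_exec(code: str, timeout_s: float = 10.0) -> tuple[str, bool]:
    """Run `code` in a fresh interpreter; return (output, is_error)."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,                         # scratch dir, deleted afterwards
                capture_output=True,
                text=True,
                timeout=timeout_s,                   # wall-clock limit
                preexec_fn=_apply_limits,
            )
        except subprocess.TimeoutExpired:
            return f"timed out after {timeout_s}s", True
    if proc.returncode != 0:
        return proc.stderr, True
    return proc.stdout, False
```

Returning `(output, is_error)` instead of raising keeps the result in the same shape the agent loop feeds back to the model as a `tool_result`.
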
- Python 3.11 + FastAPI — async, typed, well understood.
- Anthropic SDK directly — no agent framework. Owning the loop is the point of the project.
- ChromaDB — embedded vector DB, zero ops for v1.
- Tavily — agent-purpose search; one less dependency I'd otherwise wrap myself.
- structlog — every tool call becomes an inspectable JSON event.
- Next.js 14 + Tailwind + shadcn/ui — fast to ship, looks intentional.
- Fly.io — single-region deploy is fine for a portfolio piece.
A research agent at this scope is a while loop with a tool dispatcher. LangGraph hides that loop behind a state-machine DSL. For something this small, the abstraction cost is higher than the benefit, and the project loses its main pedagogical value: showing exactly how an agent works. If the loop ever grows multi-agent or branches into parallel investigations, that's the moment to revisit.
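
To make "a tool dispatcher" concrete, here is a stripped-down sketch. The loop skeleton calls `input_model.model_validate`, which looks like Pydantic; to stay dependency-free this example swaps in plain dataclasses for input validation. `SearchInput`, `Tool`, `dispatch`, and the placeholder `web_search` body are all hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


class ToolError(Exception):
    """Raised by a tool to report a recoverable failure to the model."""


@dataclass
class SearchInput:
    query: str
    max_results: int = 5


@dataclass
class Tool:
    input_model: type
    run: Callable[[Any], Awaitable[str]]


async def web_search(args: SearchInput) -> str:
    # Placeholder body; a real tool would call a search API here.
    if not args.query.strip():
        raise ToolError("query must not be empty")
    return f"{args.max_results} results for {args.query!r}"


registry: dict[str, Tool] = {"web_search": Tool(SearchInput, web_search)}


async def dispatch(name: str, raw_input: dict) -> tuple[str, bool]:
    """Return (output, is_error); failures go back to the model as text."""
    tool = registry.get(name)
    if tool is None:
        return f"unknown tool: {name}", True
    try:
        validated = tool.input_model(**raw_input)
    except TypeError as exc:  # missing or unexpected keys
        return f"invalid input: {exc}", True
    try:
        return await tool.run(validated), False
    except ToolError as exc:
        return str(exc), True
```

Everything a framework would add on top of this (state graphs, node routing) is optional at this size, which is the trade-off the paragraph above describes.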
fly launch --copy-config --no-deploy
fly secrets set ANTHROPIC_API_KEY=... OPENAI_API_KEY=... TAVILY_API_KEY=...
fly deploy

- v0.1 scaffold + agent loop + 5 evals
- v0.2 trace viewer + 15 evals
- v0.3 deployed at scout.example.dev
- v0.4 streaming tokens inside the brief
- v0.5 user accounts + private memory
- v1.0 multi-agent: planner / researcher / critic
MIT — see LICENSE.
