scout

Autonomous research agent built directly on Anthropic native tool use — no LangGraph, no LangChain. Plans, browses, computes, remembers, cites.

Live demo · Trace example · Eval results

Why this exists

I wanted to understand what a production agent loop actually looks like at the primitive level. So I wrote one — about 150 lines of Python — and built the rest of the project around making that loop legible: every tool call traced, every retry visible, every run evaluable. The trace viewer is the demo; the eval harness is the receipt.

Architecture

            ┌─────────────┐
   query ──▶│  agent loop │──▶ stream of events ──▶ frontend (SSE)
            │  (~150 LoC) │
            └──────┬──────┘
                   │
         ┌─────────┼─────────┬──────────────┐
         ▼         ▼         ▼              ▼
     web_search  web_fetch  python_exec  remember/recall
       (Tavily)  (httpx)    (sandbox)    (Chroma)
                   │
                   ▼
              save_brief
                   │
                   ▼
              cited brief

Plus: structlog → trace store (SQLite) → /traces/{id} → trace viewer in the UI.
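The trace path above can be sketched with nothing but the standard library. This is a minimal illustration, not the project's actual schema: the `trace_events` table, its columns, and the `open_trace_store` / `record_event` / `load_trace` helpers are all assumed names, and the real store is fed by structlog rather than called directly.

```python
import json
import sqlite3
import time


def open_trace_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical trace_events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS trace_events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               trace_id TEXT NOT NULL,
               ts REAL NOT NULL,
               event_type TEXT NOT NULL,
               payload TEXT NOT NULL
           )"""
    )
    return conn


def record_event(conn: sqlite3.Connection, trace_id: str,
                 event_type: str, payload: dict) -> None:
    """Append one event; the payload is stored as a JSON string."""
    conn.execute(
        "INSERT INTO trace_events (trace_id, ts, event_type, payload) "
        "VALUES (?, ?, ?, ?)",
        (trace_id, time.time(), event_type, json.dumps(payload)),
    )
    conn.commit()


def load_trace(conn: sqlite3.Connection, trace_id: str) -> list[dict]:
    """Return the events of one trace in insertion order."""
    rows = conn.execute(
        "SELECT event_type, payload FROM trace_events "
        "WHERE trace_id = ? ORDER BY id",
        (trace_id,),
    ).fetchall()
    return [{"type": t, "payload": json.loads(p)} for t, p in rows]
```

A `/traces/{id}` handler would then only need to call something like `load_trace(conn, id)` and return the list as JSON.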

The agent loop

The whole loop lives in backend/agent.py — about 180 lines including event emission, retry tracking, and trace recording. The control flow itself fits in this skeleton:

async def run_agent(*, query, session_id, trace_id):
    # 1. Pre-load relevant long-term memory into the system prompt.
    preloaded = await memory.recall(query=query, k=5)
    system = _build_system_prompt(preloaded)

    conversation = [{"role": "user", "content": query}]
    tool_retry_counts: dict[str, int] = {}

    # 2. Plan / act / observe, capped at MAX_ITERATIONS.
    for iteration in range(1, settings.max_iterations + 1):
        yield AgentEvent(type="thinking", ...)
        response = await messages_create(system=system, messages=conversation, tools=tool_schemas)
        conversation.append({"role": "assistant", "content": response.content})

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:                          # plain text → done
            yield AgentEvent(type="final", ...)
            return

        tool_results = []
        for tu in tool_uses:
            yield AgentEvent(type="tool_call", ...)
            try:
                validated = registry[tu.name].input_model.model_validate(tu.input)
                output, is_error = await registry[tu.name].run(validated), False
            except (ToolError, ValidationError) as e:  # ValidationError: pydantic's error for malformed tool input
                output, is_error = str(e), True

            if is_error:
                tool_retry_counts[tu.name] = tool_retry_counts.get(tu.name, 0) + 1
                if tool_retry_counts[tu.name] > settings.max_tool_retries:
                    raise RuntimeError("tool exceeded retry budget")

            yield AgentEvent(type="tool_result", ...)
            tool_results.append({"type": "tool_result", "tool_use_id": tu.id,
                                 "content": str(output), "is_error": is_error})

            if tu.name == "save_brief" and not is_error:
                yield AgentEvent(type="final", payload={"brief": output})
                return

        # 3. Feed the results back to Claude as the next user turn.
        conversation.append({"role": "user", "content": tool_results})

That's it. Three exits (save_brief called, plain-text response, MAX_ITERATIONS hit), one error path (a failed call goes back to the model as a tool_result with is_error set, with a hard fail after N retries on the same tool), and a single async generator for streaming. The full file adds structlog binding, SQLite trace recording, contextvars for memory namespacing, and proper Anthropic SDK content-block handling, but the shape above is the whole story.
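To show how an async generator of events can feed the SSE stream in the architecture diagram, here is a small sketch. The `AgentEvent` dataclass below is a guess at the shape used in the skeleton, and `fake_agent`, `to_sse`, and `sse_stream` are hypothetical helpers; the real backend presumably hands the frames to a FastAPI streaming response.

```python
import asyncio
import json
from dataclasses import dataclass, field
from typing import Any, AsyncIterator


@dataclass
class AgentEvent:
    # Assumed shape; the real class may carry more fields.
    type: str
    payload: dict[str, Any] = field(default_factory=dict)


def to_sse(event: AgentEvent) -> str:
    """Serialize one event as a Server-Sent Events frame."""
    return f"event: {event.type}\ndata: {json.dumps(event.payload)}\n\n"


async def fake_agent() -> AsyncIterator[AgentEvent]:
    """Stand-in for run_agent(): yields two events without calling a model."""
    yield AgentEvent("thinking", {"iteration": 1})
    await asyncio.sleep(0)  # hand control back to the event loop
    yield AgentEvent("final", {"brief": "done"})


async def sse_stream(events: AsyncIterator[AgentEvent]) -> AsyncIterator[str]:
    """Turn a stream of events into a stream of SSE frames."""
    async for event in events:
        yield to_sse(event)


async def collect() -> list[str]:
    """Drain the stream into a list, for inspection or tests."""
    return [frame async for frame in sse_stream(fake_agent())]
```

Because each frame is produced as soon as the agent yields, the frontend sees tool calls and results while the run is still in progress.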

Quickstart

cp .env.example .env   # fill in ANTHROPIC_API_KEY, OPENAI_API_KEY, TAVILY_API_KEY
docker compose up
open http://localhost:3000

Eval results

See evals/results.md. Summary:

Metric	v0.1
Task completion rate	TBD
Tool success rate	TBD
Median iterations	TBD
Cost per task (USD)	TBD
LLM-judged quality (1-5)	TBD

Tools

Tool	Purpose	Safety
`web_search`	Tavily search, ranked snippets	API-rate-limited
`web_fetch`	Fetch URL, extract clean article text	10s timeout, byte cap
`python_exec`	Run Python in subprocess	timeout, RLIMIT, no net, tmpdir-only
`remember`	Persist a fact + source to long-term memory	per-user namespacing
`recall`	Semantic search over long-term memory	per-user namespacing
`save_brief`	Finalize the cited research brief	terminal tool
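As an illustration of the `python_exec` row, the sketch below runs code in a subprocess with a wall-clock timeout, CPU and file-size rlimits, and a throwaway working directory. It is a simplified stand-in, not the project's sandbox: `ExecResult`, `run_python`, and the limit values are assumptions, it is POSIX-only, and it does not block network access or confine file access to the temp directory.

```python
import resource
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class ExecResult:
    output: str
    is_error: bool


def _cap(kind: int, value: int) -> None:
    """Lower both soft and hard limits for `kind`, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    limit = value if hard == resource.RLIM_INFINITY else min(value, hard)
    resource.setrlimit(kind, (limit, limit))


def _apply_limits() -> None:
    # Runs in the child process just before exec (POSIX only).
    _cap(resource.RLIMIT_CPU, 5)            # seconds of CPU time
    _cap(resource.RLIMIT_FSIZE, 1_000_000)  # max bytes per written file


def run_python(code: str, timeout_s: float = 10.0) -> ExecResult:
    """Run `code` in an isolated-mode interpreter inside a temp directory."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                preexec_fn=_apply_limits,
            )
        except subprocess.TimeoutExpired:
            return ExecResult(f"timed out after {timeout_s}s", True)
    if proc.returncode != 0:
        return ExecResult(proc.stderr.strip(), True)
    return ExecResult(proc.stdout.strip(), False)
```

Returning `is_error` instead of raising matches the loop's error path: a failing snippet becomes a tool result the model can react to.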

Tech stack

Python 3.11 + FastAPI — async, typed, well understood.
Anthropic SDK directly — no agent framework. Owning the loop is the point of the project.
ChromaDB — embedded vector DB, zero ops for v1.
Tavily — agent-purpose search; one less dependency I'd otherwise wrap myself.
structlog — every tool call becomes an inspectable JSON event.
Next.js 14 + Tailwind + shadcn/ui — fast to ship, looks intentional.
Fly.io — single-region deploy is fine for a portfolio piece.

Why no LangGraph?

A research agent at this scope is a while loop with a tool dispatcher. LangGraph hides that loop behind a state-machine DSL. For something this small, the abstraction cost is higher than the benefit, and the project loses its main pedagogical value: showing exactly how an agent works. If the loop ever grows multi-agent or branches into parallel investigations, that's the moment to revisit.
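The "while loop with a tool dispatcher" claim can be made concrete in a few lines. This is a toy sketch under stated assumptions: `scripted_model` stands in for the LLM call, and the `TOOLS` entries are hypothetical placeholders rather than the project's tools.

```python
from typing import Any, Callable

# Hypothetical tools standing in for web_search, python_exec, etc.
TOOLS: dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}


def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the LLM: request one tool call, then answer in text."""
    tool_turns = [m for m in history if m["role"] == "tool"]
    if not tool_turns:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"text": f"The sum is {tool_turns[-1]['content']}."}


def run(query: str, max_iterations: int = 5) -> str:
    history: list[dict] = [{"role": "user", "content": query}]
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        reply = scripted_model(history)
        if "text" in reply:  # plain text means the run is done
            return reply["text"]
        # Dispatch: look the tool up by name and call it with the given args.
        result = TOOLS[reply["tool"]](**reply["args"])
        history.append({"role": "tool", "content": result})
    return "iteration cap reached"
```

Everything else in the project (validation, retries, tracing, streaming) is layered around this same shape.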

Deployment

fly launch --copy-config --no-deploy
fly secrets set ANTHROPIC_API_KEY=... OPENAI_API_KEY=... TAVILY_API_KEY=...
fly deploy

Roadmap

v0.1 scaffold + agent loop + 5 evals
v0.2 trace viewer + 15 evals
v0.3 deployed at scout.example.dev
v0.4 streaming tokens inside the brief
v0.5 user accounts + private memory
v1.0 multi-agent: planner / researcher / critic

License

MIT — see LICENSE.
