ThushanthaSanju/scout

scout

Autonomous research agent built directly on Anthropic native tool use — no LangGraph, no LangChain. Plans, browses, computes, remembers, cites.

Live demo · Trace example · Eval results

[Image: scout demo]

Why this exists

I wanted to understand what a production agent loop actually looks like at the primitive level. So I wrote one — about 150 lines of Python — and built the rest of the project around making that loop legible: every tool call traced, every retry visible, every run evaluable. The trace viewer is the demo; the eval harness is the receipt.

Architecture

            ┌─────────────┐
   query ──▶│  agent loop │──▶ stream of events ──▶ frontend (SSE)
            │  (~150 LoC) │
            └──────┬──────┘
                   │
         ┌─────────┼─────────┬──────────────┐
         ▼         ▼         ▼              ▼
     web_search  web_fetch  python_exec  remember/recall
       (Tavily)  (httpx)    (sandbox)    (Chroma)
                   │
                   ▼
              save_brief
                   │
                   ▼
              cited brief

Plus: structlog → trace store (SQLite) → /traces/{id} → trace viewer in the UI.
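A minimal sketch of what the SQLite trace store behind /traces/{id} could look like. The table name, columns, and function names here are assumptions for illustration, not the project's actual schema.

```python
import json
import sqlite3
import time


def open_trace_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) events table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS trace_events (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            trace_id  TEXT NOT NULL,
            ts        REAL NOT NULL,
            type      TEXT NOT NULL,
            payload   TEXT NOT NULL
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_trace ON trace_events (trace_id, id)"
    )
    return conn


def record_event(conn: sqlite3.Connection, trace_id: str, type_: str, payload: dict) -> None:
    """Append one agent event to a trace as a JSON-encoded row."""
    conn.execute(
        "INSERT INTO trace_events (trace_id, ts, type, payload) VALUES (?, ?, ?, ?)",
        (trace_id, time.time(), type_, json.dumps(payload)),
    )
    conn.commit()


def load_trace(conn: sqlite3.Connection, trace_id: str) -> list[dict]:
    """Return a trace's events in insertion order, as a /traces/{id} handler might."""
    rows = conn.execute(
        "SELECT type, payload FROM trace_events WHERE trace_id = ? ORDER BY id",
        (trace_id,),
    ).fetchall()
    return [{"type": t, "payload": json.loads(p)} for t, p in rows]
```

Ordering by the autoincrement id rather than the timestamp keeps replay deterministic even when two events land in the same clock tick.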

The agent loop

The whole loop lives in backend/agent.py — about 180 lines including event emission, retry tracking, and trace recording. The control flow itself fits in this skeleton:

async def run_agent(*, query, session_id, trace_id):
    # 1. Pre-load relevant long-term memory into the system prompt.
    preloaded = await memory.recall(query=query, k=5)
    system = _build_system_prompt(preloaded)

    conversation = [{"role": "user", "content": query}]
    tool_retry_counts: dict[str, int] = {}

    # 2. Plan / act / observe, capped at MAX_ITERATIONS.
    for iteration in range(1, settings.max_iterations + 1):
        yield AgentEvent(type="thinking", ...)
        response = await messages_create(system=system, messages=conversation, tools=tool_schemas)
        conversation.append({"role": "assistant", "content": response.content})

        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:                          # plain text → done
            yield AgentEvent(type="final", ...)
            return

        tool_results = []
        for tu in tool_uses:
            yield AgentEvent(type="tool_call", ...)
            try:
                validated = registry[tu.name].input_model.model_validate(tu.input)
                output, is_error = await registry[tu.name].run(validated), False
            except (ToolError, ValidationError) as e:  # pydantic.ValidationError: bad tool input goes back to the model too
                output, is_error = str(e), True

            if is_error:
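                # .get() so a tool's first failure does not raise KeyError on the empty dict.
                tool_retry_counts[tu.name] = tool_retry_counts.get(tu.name, 0) + 1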
                if tool_retry_counts[tu.name] > settings.max_tool_retries:
                    raise RuntimeError("tool exceeded retry budget")

            yield AgentEvent(type="tool_result", ...)
            tool_results.append({"type": "tool_result", "tool_use_id": tu.id,
                                 "content": str(output), "is_error": is_error})

            if tu.name == "save_brief" and not is_error:
                yield AgentEvent(type="final", payload={"brief": output})
                return

        # 3. Feed the results back to Claude as the next user turn.
        conversation.append({"role": "user", "content": tool_results})

    # Loop exhausted: MAX_ITERATIONS hit without a final answer (the third exit).

That's it. Three exits (save_brief called, plain text response, MAX_ITERATIONS hit), one error path (is_error → tool_result, with hard fail after N retries on the same tool), and pure async generators for streaming. The full file adds structlog binding, SQLite trace recording, contextvars for memory namespacing, and proper Anthropic SDK content-block handling — but the shape above is the whole story.
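Because the loop is an async generator, streaming to the frontend reduces to mapping each event to a Server-Sent Events frame. The sketch below assumes a simplified AgentEvent with only type and payload fields and uses a hypothetical fake_agent in place of run_agent; the real event model may carry more.

```python
import asyncio
import json
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class AgentEvent:
    # Stand-in for the project's AgentEvent; the real fields may differ.
    type: str
    payload: dict = field(default_factory=dict)


def to_sse(event: AgentEvent) -> str:
    """Serialize one event as an SSE frame: event name, JSON data, blank line."""
    return f"event: {event.type}\ndata: {json.dumps(event.payload)}\n\n"


async def fake_agent() -> AsyncIterator[AgentEvent]:
    # Hypothetical stand-in for run_agent(): yields events as work progresses.
    yield AgentEvent("thinking", {"iteration": 1})
    yield AgentEvent("tool_call", {"name": "web_search"})
    yield AgentEvent("final", {"brief": "..."})


async def sse_stream(events: AsyncIterator[AgentEvent]) -> AsyncIterator[str]:
    """Map the agent's event stream to SSE frames, one frame per event."""
    async for event in events:
        yield to_sse(event)


async def collect() -> list[str]:
    return [frame async for frame in sse_stream(fake_agent())]


frames = asyncio.run(collect())
```

In a FastAPI route, a generator like sse_stream could be handed to StreamingResponse with media_type="text/event-stream"; whether scout wires it up exactly this way is not shown here.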

Quickstart

cp .env.example .env   # fill in ANTHROPIC_API_KEY, OPENAI_API_KEY, TAVILY_API_KEY
docker compose up
open http://localhost:3000

Eval results

See evals/results.md. Summary:

| Metric | v0.1 |
|---|---|
| Task completion rate | TBD |
| Tool success rate | TBD |
| Median iterations | TBD |
| Cost per task (USD) | TBD |
| LLM-judged quality (1-5) | TBD |
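One way these summary metrics could be aggregated from per-task records, as a sketch. The RunRecord fields are hypothetical, and using the mean for the judge score is an assumption; the actual harness in evals/ may log and aggregate differently.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunRecord:
    # Hypothetical per-task record; the real harness may log different fields.
    completed: bool
    iterations: int
    tool_calls: int
    tool_errors: int
    cost_usd: float
    judge_score: int  # 1-5 from the LLM judge


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-task records into the summary metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    total_errors = sum(r.tool_errors for r in runs)
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        # Guard against a run set that made no tool calls at all.
        "tool_success_rate": (total_calls - total_errors) / total_calls if total_calls else 0.0,
        "median_iterations": median(r.iterations for r in runs),
        "cost_per_task_usd": sum(r.cost_usd for r in runs) / n,
        "mean_judge_score": sum(r.judge_score for r in runs) / n,
    }
```

Tool success rate is pooled across all calls rather than averaged per task, so one tool-heavy task weighs more than several short ones; averaging per task is an equally defensible choice.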

Tools

| Tool | Purpose | Safety |
|---|---|---|
| web_search | Tavily search, ranked snippets | API-rate-limited |
| web_fetch | Fetch URL, extract clean article text | 10s timeout, byte cap |
| python_exec | Run Python in subprocess | timeout, RLIMIT, no net, tmpdir-only |
| remember | Persist a fact + source to long-term memory | per-user namespacing |
| recall | Semantic search over long-term memory | per-user namespacing |
| save_brief | Finalize the cited research brief | terminal tool |
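To make the python_exec safety column concrete, here is a rough POSIX-only sketch of a subprocess runner with a wall-clock timeout, resource limits, and a throwaway working directory. The limit values and helper names are assumptions. This sketch does not block network access or confine file access to the temp directory, both of which need OS-level isolation on top.

```python
import resource
import subprocess
import sys
import tempfile


def _cap(kind: int, value: int) -> None:
    # Lower only the soft limit, and never above the existing hard limit.
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, hard))


def _limit_resources() -> None:
    # Runs in the child between fork and exec (POSIX only).
    _cap(resource.RLIMIT_CPU, 5)          # CPU seconds
    _cap(resource.RLIMIT_FSIZE, 1 << 20)  # largest file the child may write: 1 MiB


def python_exec(code: str, timeout_s: float = 10.0) -> tuple[str, bool]:
    """Run code in a child interpreter; return (output, is_error)."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,                          # scratch dir, deleted afterwards
                capture_output=True,
                text=True,
                timeout=timeout_s,
                preexec_fn=_limit_resources,
            )
        except subprocess.TimeoutExpired:
            return f"timed out after {timeout_s}s", True
    if proc.returncode != 0:
        return proc.stderr.strip(), True
    return proc.stdout.strip(), False
```

Returning (output, is_error) instead of raising matches the shape the agent loop feeds back as a tool_result, so a failing snippet becomes something the model can read and correct.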

Tech stack

  • Python 3.11 + FastAPI — async, typed, well understood.
  • Anthropic SDK directly — no agent framework. Owning the loop is the point of the project.
  • ChromaDB — embedded vector DB, zero ops for v1.
  • Tavily — agent-purpose search; one less dependency I'd otherwise wrap myself.
  • structlog — every tool call becomes an inspectable JSON event.
  • Next.js 14 + Tailwind + shadcn/ui — fast to ship, looks intentional.
  • Fly.io — single-region deploy is fine for a portfolio piece.

Why no LangGraph?

A research agent at this scope is a while loop with a tool dispatcher. LangGraph hides that loop behind a state-machine DSL. For something this small, the abstraction cost is higher than the benefit, and the project loses its main pedagogical value: showing exactly how an agent works. If the loop ever grows multi-agent or branches into parallel investigations, that's the moment to revisit.

Deployment

fly launch --copy-config --no-deploy
fly secrets set ANTHROPIC_API_KEY=... OPENAI_API_KEY=... TAVILY_API_KEY=...
fly deploy

Roadmap

  • v0.1 scaffold + agent loop + 5 evals
  • v0.2 trace viewer + 15 evals
  • v0.3 deployed at scout.example.dev
  • v0.4 streaming tokens inside the brief
  • v0.5 user accounts + private memory
  • v1.0 multi-agent: planner / researcher / critic

License

MIT — see LICENSE.
