Skip to content

feat(agent): stateful conversations (Phase-2 PR B)#8

Merged
WiktorStarczewski merged 3 commits intomainfrom
pr-b/stateful-conversations
Apr 25, 2026
Merged

feat(agent): stateful conversations (Phase-2 PR B)#8
WiktorStarczewski merged 3 commits intomainfrom
pr-b/stateful-conversations

Conversation

@WiktorStarczewski
Copy link
Copy Markdown
Owner

Summary

Last in the Phase-2 sequence (PR 0 #6, PR A #7, this is PR B). Adds the four conversation-lifecycle MCP tools that build on PR A's one-shot agent — multi-turn conversations with state held server-side on the SDK session, plus an in-process metadata map for listing / idle-reaping / cap enforcement.

Tool Purpose
start_peer_conversation Opens a stateful read-only conversation. Returns {convId, startedAt, effectiveBudget}.
send_peer_message Drives one more turn on a convId. Per-turn budget overrides cascade through the conversation's defaults.
list_peer_conversations Active conversations sorted by lastActivityAt desc, with rune-safe 140-char previews.
end_peer_conversation Terminates a conversation and frees its slot. Idempotent.

All four are gated behind --enable-agent (off by default, same posture as PR A).

Internals

  • internal/agent/conversations.go — per-conv state map + mutex + idle reaper. Each conversation tracks startedAt, lastActivityAt, turnCount, perTurnBudget, project, system-prompt preview, first-user-prompt preview, and the EndReason enum (caller | idle_timeout | shutdown). The full message history lives server-side on the SDK session — the local map is metadata only.
  • New CLI flags: --max-conversations <n> (default 10), --conversation-idle-timeout <dur> (default 15m).
  • Reaper goroutine ticks at idle/4 (floored at 1s). Reaped conversations get EndedByIdleReap and the upstream session is best-effort-deleted. agent.Closer interface lets main.go shut the reaper down on SIGTERM.
  • OneShot refactored to build a transient conversation and route through runTurn so PR-A and PR-B share the SDK round-trip verbatim.

Concurrency

  • SendMessage takes the conversation's own mutex for the full turn — two callers racing on the same convId serialize.
  • The reaper takes convsMu then conv.mu; the reverse order is forbidden.
  • lookupAlive distinguishes alive / reaped / ended-by-caller, so the tool layer can produce the right user-facing message:
    • reaped → "reaped after idle timeout; start a new one"
    • ended → "unknown convId"
    • missing → "unknown convId"

Test plan

  • go build ./... && go vet ./... — clean.
  • go test ./... — green across all 9 packages.
  • go test -race -coverpkg=./... -coverprofile=… — aggregate 90.7% (gate is 90%).
  • Conversation unit tests (conversations_test.go): listing sorted-by-LastActivityAt, ended-conv omitted, system-prompt fallback preview, idempotent re-end preserves original EndedReason, lookupAlive discriminates end modes, cap accounting, reaper marks stale, stopReaper is idempotent, truncateRunes is rune-safe.
  • SDK-stub-driven tests (sdk_translate_test.go): StartConversation succeeds against stub, cap enforcement (third start refused), invalid project rejected before SDK call; SendMessage known/unknown; EndConversation success + idempotent re-end; reaper actually reaps + deletes.
  • Tools-layer tests (tools_agent_test.go): all four conversation tools registered when ctx.Agent != nil, missing-arg validation, error-translation (ErrConvCap / ErrConvReaped / ErrUnknownConv → user-facing messages).
  • End-to-end tests (hearsay_e2e_test.go): all four tools present in the catalog, --max-conversations flag accepted, list_peer_conversations returns empty list on a fresh server.
  • Manual loopback (Phase-2 plan verification step 10) — covered separately. PR-B's automated tests prove the wiring; this PR doesn't bill the Anthropic API.

What this completes

PR 0 (read_tool_result content fix) + PR A (one-shot agent) + this PR = Phase 2 complete. Wiktor's Claude can now read teammates' past sessions (Phase 1) AND drive a stateful read-only agent on their box for primary-data investigations.

Adds the four conversation-lifecycle MCP tools that build on PR A's
one-shot agent:

* start_peer_conversation  — opens a stateful read-only conversation,
                              returns {convId, startedAt, effectiveBudget}
* send_peer_message        — drives one more turn against an existing
                              convId (per-turn budget overrides cascade
                              through the conversation's defaults)
* list_peer_conversations  — active conversations sorted by
                              lastActivityAt desc, with rune-safe
                              140-char previews
* end_peer_conversation    — terminates a conversation and frees its
                              slot; idempotent (re-end returns
                              alreadyEnded=true)

All four are gated behind --enable-agent.  Tools are registered iff
ctx.Agent is non-nil (matches PR A's gating).

Internals
---------

* `internal/agent/conversations.go` adds the per-conv state map +
  mutex + idle reaper.  Each conversation tracks startedAt,
  lastActivityAt, turnCount, perTurnBudget, project, system_prompt
  preview, first user prompt preview, and the EndReason enum (caller
  | idle_timeout | shutdown).  The full message history lives
  server-side on the SDK session — the local map is metadata only.

* New CLI flags:
    --max-conversations <n>           (default 10)
    --conversation-idle-timeout <dur> (default 15m)

* The reaper runs as a background goroutine with a tick at idle/4
  (floored at 1s).  Reaped conversations get EndedByIdleReap and the
  upstream session is best-effort-deleted.  agent.Closer interface
  lets main.go shut the reaper down cleanly on SIGTERM.

* OneShot now builds a transient conversation and routes through
  runTurn (in conversations.go), so PR-A and PR-B share the SDK
  round-trip code path verbatim.

Concurrency
-----------

* SendMessage takes the conversation's own mutex for the full turn,
  so two callers racing on the same convId serialize.  The reaper
  takes both convsMu and conv.mu in that order — locking direction
  forbidden in reverse.

* lookupAlive distinguishes three end states: alive, reaped (idle),
  ended-by-caller — translating the latter two into ErrConvReaped vs
  ErrUnknownConv so the tool layer can produce the right user-facing
  message.

Coverage
--------

Aggregate at 90.7% with -race (gate is 90%).  Unit tests cover state
transitions (lifecycle, idle reap, cap enforcement, idempotent re-
end, lookupAlive's three branches) without an SDK round-trip;
stub-driven tests in sdk_translate_test.go exercise the SDK paths
end-to-end (StartConversation success + cap, SendMessage with
known/unknown convID, EndConversation with idempotent re-end, the
reaper actually deleting).
The PR-B stub returned 5xx for the StreamEvents endpoint, which
forced the SDK's stream layer to error out — the timing of that
error vs the test's t.Context() expiry was non-deterministic under
-race, so coverage occasionally dipped below the 90% gate (CI saw
65% on the previous push, local races between 60-90%).

Switching to a proper-but-empty SSE response (HTTP 200, content-type
text/event-stream, no events delivered) lets the stream close
cleanly.  pumpStream sees the channel close, the loop returns
StopReasonEndTurn, the conv.lastActivityAt update happens
deterministically — coverage reads 90.7% with -race three runs in a
row.
Diagnosis: setup-go@v5's default cache key is hashed against go.sum
only.  When PR B inherited PR A's go.sum (no new deps), the CI cache
restored PR A's instrumented test binaries.  Three of the nine
-coverpkg=./... test builds got stale cache hits — their cover blocks
referenced the old printHelp source-line range (84-125) while six
fresh builds referenced the new range (84-127).  go tool cover -func
treats the two ranges as separate blocks: the old one has 0 hits, the
new one is fully covered, and printHelp drops from 100% to 33%.
That cascade across cmd/hearsay/main.go funcs pushed the aggregate
from 90.7% (local) to 65.1% (CI) on the same commit.

Cache off makes CI a few seconds slower but eliminates the variance.
Re-enable once we have a cache key that hashes source files in
addition to go.sum.
@WiktorStarczewski WiktorStarczewski merged commit e7e12eb into main Apr 25, 2026
2 checks passed
@WiktorStarczewski WiktorStarczewski deleted the pr-b/stateful-conversations branch April 25, 2026 11:07
WiktorStarczewski added a commit that referenced this pull request Apr 25, 2026
…ment (PR C) (#10)

* feat(agent): subprocess-driven Claude Code, replacing API-key requirement (PR C)

Phase 2 originally shipped via the Anthropic Managed-Agents API (PRs
#7 + #8), which is API-key-only.  The constraint that surfaced
post-merge: both peers and consumers in the actual user base have
Claude Code subscriptions (Pro / Max / Team), not API keys.  The
shipped Phase-2 tools were therefore unusable for the real users.

This PR replaces the SDK path with a subprocess driver around
`claude --print`.  The peer's existing OAuth credentials authenticate
every call; subscription quota pays the bill.  No code changes for
consumers (Wiktor's side); breaking CLI-flag changes for peers.

Internals
---------

* internal/agent/cli.go (new) — subprocess driver.  Builds three argv
  variants (OneShot, first-turn-of-conv, resumed-turn) with
  --print --output-format json --allowed-tools "Read Glob Grep" on
  every call.  Parses Claude Code's JSON result for stop_reason +
  usage; replays the session JSONL via the existing internal/transcript
  package to extract per-tool-call detail (the JSON payload has no
  tool_calls[] field).

* internal/agent/conversations.go (slimmed) — convID is now the
  Claude Code session UUID.  StartConversation no longer hits an
  upstream service; it allocates a UUID, records the system_prompt,
  returns.  EndConversation marks the local slot ended without
  deleting anything; the session JSONL stays on disk so Phase-1
  read_session can still surface it.  Idle reaper is local-only.

* internal/agent/sdk.go, loop.go, customtools.go (deleted) — the
  Managed-Agents wrapper, event-loop translation layer, and
  hand-rolled Read/Glob/Grep handlers all go away.  Claude Code's
  native tools replace the custom dispatch; the JSON output replaces
  the event stream.

* go.mod — drops github.com/anthropics/anthropic-sdk-go and its
  transitive deps (jsonparser, tidwall/*, wk8/go-ordered-map, ...).
  Static binary supply-chain footprint shrinks meaningfully.

Security
--------

Two-leg defense for the read-only allowlist:
1. --allowed-tools "Read Glob Grep" is hardcoded on every invocation;
   Claude Code itself enforces it.  Asserted by the argv-shape unit
   test.
2. After every call, the JSONL replay scans for tool_use blocks
   outside the allowlist.  Any disallowed name flips StopReason to
   error / disallowed_tool.  Defense-in-depth against
   future-Claude-Code drift / corrupted builds.

ANTHROPIC_API_KEY in the operator's env is stripped from the
subprocess env by default — Claude Code's auth precedence is
ANTHROPIC_API_KEY > apiKeyHelper > OAuth/keychain, and a stale env
var would silently redirect billing.  --agent-keep-env-key opts
back in.

CLI changes (breaking)
----------------------

Removed (hard-fail with a friendly upgrade-notes pointer):
* --agent-api-key-env
* --agent-base-url
* --agent-model

Kept, reinterpreted:
* --agent-default-max-tokens — soft budget (system-prompt nudge); the
  CLI doesn't expose a token cap.

Added:
* --agent-claude-bin <path>
* --agent-keep-env-key

Unchanged: --enable-agent, --agent-default-max-tool-calls,
--agent-default-timeout-seconds, --agent-log-path, --max-conversations,
--conversation-idle-timeout.

Version bumped 0.1.0-dev → 0.3.0-dev (skipping 0.2 since PR B forgot
to bump).

Test coverage
-------------

Aggregate race-mode coverage 92.9% on a clean run (fluctuates 88-93%
with the race detector's goroutine-timing sensitivity).  Coverage
gate is 90% with cache:false on setup-go.

Plan: ~/.claude/plans/hearsay-phase-2-subprocess.md (six review
rounds: 3 → 4 → 1 → 2 → 0 → 0 issues).

* fix(agent): kill subprocess group on cancel for portable timeout

Linux's /bin/sh is dash, which does NOT forward SIGTERM to a child
sleep / claude process.  CI's TestRunClaude_TimeoutKillsSubprocess
saw the subprocess run the full 5s sleep — a real bug in the
production path on Linux too, since claude itself spawns helper
processes that wouldn't get the cancel signal otherwise.

Fix: set Setpgid on the subprocess so it owns its own process group;
on cancel, send SIGTERM to the whole group via syscall.Kill(-pgid,
SIGTERM) instead of just the leader.  cli_unix.go and cli_other.go
build-tag the platform-specific calls.  macOS keeps working (Setpgid
is a no-op for already-isolated procs); Windows compiles to a stub
since hearsay doesn't ship Windows builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant