Add agent suspend/resume for out-of-process tool results#137
Conversation
Tools can now return dive.NewSuspendResult(prompt) to pause the agent mid-turn and hand control back to the caller. Resume is a subsequent CreateResponse call on the same session with WithToolResults(...) that supplies the missing tool outputs. Suspend state is tracked authoritatively on the session via a new SuspendableSession interface, so a process can crash and restart and still resume cleanly. The design is additive — Tool.Call keeps its (*ToolResult, error) signature, Response.Status defaults to Completed, and third-party session implementations that don't opt in continue to work (they just can't be resumed). Parallel tool batches drain all siblings on suspend so no work is lost; sequential batches re-execute any "not started" tools on resume. Adds a new OnSuspend hook, a terminal ResponseItemTypeSuspended stream item, and FileStore/MemoryStore persistence for the suspended flag and pending tool-call IDs. Subagents return an error if they suspend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds end-to-end suspend/resume support: tools can return suspension, agent detects and persists suspended turns (with pending tool-call state), callers can resume by supplying out-of-band ToolResults, and agent control flow, hooks, responses, and session storage are made suspension-aware. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant Agent
participant LLM
participant Tool
participant Session
Caller->>Agent: CreateResponse(messages)
Agent->>LLM: Generate (may emit tool_call)
LLM-->>Agent: ToolUseMessage
Agent->>Tool: Execute tool call(s)
Tool-->>Agent: ToolResult (Suspend)
Agent->>Session: SaveSuspendedTurn(messages, pendingCalls)
Session-->>Agent: Persisted
Agent->>Agent: fire OnSuspend hooks / PostGeneration
Agent-->>Caller: Stream ResponseItemTypeSuspended (Status: suspended, PendingToolCalls)
sequenceDiagram
participant Caller
participant Agent
participant Session
participant Tool
participant LLM
Caller->>Agent: CreateResponse(..., ToolResults={id:result})
Agent->>Session: Load suspended state
Session-->>Agent: Suspended data (pending IDs)
Agent->>Agent: Validate & merge provided ToolResults
alt Not-started tools exist
Agent->>Tool: Re-execute not-started tool calls
Tool-->>Agent: ToolResult outcomes (complete or suspend)
end
Agent->>LLM: Generate with merged tool_result message
LLM-->>Agent: Completion response
Agent->>Session: SaveResumedTurn(messages, usage)
Session-->>Agent: Cleared suspension
Agent-->>Caller: Response(Status: completed, Output)
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent.go`:
- Around line 414-421: The code currently routes partial resumes (when
rs.RemainingPending is non-empty) through finishSuspended which re-triggers the
session suspend handler and calls OnSuspend again; modify the partial-resume
path so it does not re-fire OnSuspend: detect the partial-resume case in the
block that creates suspendedSnapshot (uses RemainingPending,
RemainingPendingCalls, CompletedToolCalls()) and call a new or existing finish
helper that skips suspend notifications (e.g., finishPartialResume or
finishSuspended with a flag to suppress OnSuspend/eventCallback invocation),
ensuring suspendedSnapshot is preserved but eventCallback/OnSuspend is not
invoked for this branch.
- Around line 578-594: The suspended terminal ResponseItem is emitted before
verifying the session is suspendable and before persisting, so move the callback
invocation (the code creating ResponseItem with Type ResponseItemTypeSuspended
and Suspended SuspendedItem) to after the sess != nil check, the type assertion
to SuspendableSession, and after the persistence logic that creates the
suspended turn; also handle and return any error from callback instead of
ignoring it (i.e., check callback(ctx, ...) error and propagate) so consumers
don't receive a terminal suspended event when persistence or suspension
capability fails (references: callback,
ResponseItem/ResponseItemTypeSuspended/SuspendedItem, sess, SuspendableSession,
ErrSessionNotSuspendable).
- Around line 603-636: The suspended-turn persistence
(suspendable.SaveSuspendedTurn) is happening before running OnSuspend and
PostGeneration hooks, which can leave a saved suspended turn even if a hook
aborts; move the call to suspendable.SaveSuspendedTurn so it runs after the
loops that execute a.hooks.OnSuspend and a.hooks.PostGeneration and only when
those hooks return no error; keep existing error handling for HookAbortError and
other hook errors (HookAbortError.HookType assignments for "OnSuspend" and
"PostGeneration") and ensure hctx.Response/OutputMessages/Usage remain set
before running the hooks so the saved suspended turn persists the final state
after hooks succeed.
- Around line 330-356: When resuming (rs != nil) we currently drop caller input
by rebuilding messages from sessionMsgs; instead, if a session is suspended and
the caller supplied new input (len(inputMessages) > 0) we must return
ErrSuspendedSessionInput. Add a guard after prepareResume/rs creation that
checks if rs != nil && len(inputMessages) > 0 and returns
ErrSuspendedSessionInput, so prepareResume/sessionMsgs/rs logic remains
unchanged but we never silently discard input from callers (referencing rs,
prepareResume, sessionMsgs, inputMessages, and ErrSuspendedSessionInput).
In `@session/session.go`:
- Around line 266-299: SaveResumedTurn currently overwrites the last persisted
event even when the session isn't suspended; add a guard that rejects the call
if there is no suspended turn to replace. Concretely, at the start of
SaveResumedTurn check s.data.Suspended (and optionally that len(s.data.Events) >
0) and return a non-nil error (e.g., fmt.Errorf or errors.New) if the session is
not suspended or there are no events, so you never replace the last event unless
a suspended turn exists; keep the rest of the logic (creating evt, replacing
last event, clearing s.data.Suspended, updating timestamps, calling
s.appender.putSession) unchanged.
- Around line 110-116: Session fork and compaction are dropping or ignoring the
new persisted suspend state (fields Suspended and PendingToolCallIDs), causing
header/event-log mismatch; update Session.Fork to copy Suspended and
PendingToolCallIDs into the new forked Session and update Session.Compact to
preserve and respect suspension (i.e., prevent compaction or ensure compacted
header retains Suspended and PendingToolCallIDs from the latest events), and
adjust any helpers that construct session headers so they include Suspended and
PendingToolCallIDs consistently to keep resumability intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 63a81072-e2ef-4d4d-ada0-0a1388d063ed
📒 Files selected for processing (11)
agent.goagent_suspend_test.godive.goexperimental/toolkit/extended/task.gohooks.goresponse.gosession/file_store.gosession/memory_store.gosession/session.gosession/session_test.gotool.go
Persist the full SuspendResult payload (Prompt, Metadata, Name, Input) per pending tool call in a new session.PendingCall type, so cross-process resume and partial-resume-again flows surface the same payload the tool originally returned instead of dropping it. Tighten suspend/resume semantics: - Compact refuses on suspended sessions (ErrSuspendedSession). - Fork clears suspension on the forked copy. - SaveResumedTurn is guarded (ErrNotSuspended) against accidental overwrite. - SaveSuspendedTurn failures propagate from CreateResponse instead of being swallowed, so callers never receive pending IDs that don't exist. - OnSuspend HookAbortError rolls back via AbandonSuspension so the session is not stranded when the caller sees an error. - Supplying a result for a non-pending ID (including already-completed) returns ErrUnknownPendingToolCall without mutating the session. Interface rename: SuspendableSession.PendingToolCallIDs → PendingCalls, SaveSuspendedTurn takes []session.PendingCall, ClearSuspension → AbandonSuspension. The JSONL header field pending_tool_call_ids is replaced by pending_calls — the branch has not shipped, so this is a clean break. 14 new adversarial tests cover: partial-resume event count stability, cross-process metadata survival, mixed sequential/parallel agents on the same session, resume-time tool panic recovery, suspend on the last tool iteration boundary, ctx cancel mid-execution, non-pending ID rejection, fork clearing, compact refusal, SaveResumedTurn guard, SaveSuspendedTurn error propagation, OnSuspend abort rollback, turn-message non-aliasing, and OutputMessages invariants for generate-driven vs partial-resume suspends. go test -race ./... clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
session/file_store.go (1)
323-369:⚠️ Potential issue | 🟠 MajorMake full-session rewrites atomic.
Line 323 truncates the live JSONL with
os.Create. The new suspend/resume transitions insession/session.gonow rely on this full-rewrite path, so a crash orENOSPCbetween truncate andFlushcan wipe the pending-call state you're explicitly persisting for cross-process resume. Please write to a temp file and rename it into place after a successful flush/sync.Suggested shape of the rewrite
- f, err := os.Create(p) + tmp, err := os.CreateTemp(filepath.Dir(p), filepath.Base(p)+".*.tmp") if err != nil { return err } - defer f.Close() + defer func() { + _ = tmp.Close() + _ = os.Remove(tmp.Name()) + }() - w := bufio.NewWriter(f) + w := bufio.NewWriter(tmp) ... - return w.Flush() + if err := w.Flush(); err != nil { + return err + } + if err := tmp.Sync(); err != nil { + return err + } + if err := tmp.Close(); err != nil { + return err + } + return os.Rename(tmp.Name(), p)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@session/file_store.go` around lines 323 - 369, The current full-session rewrite uses os.Create(p) and writes directly, which can truncate the live JSONL and lose pending-call state on crash; change the implementation that writes the sessionHeader/jsonlLine and events to instead create a temp file (os.CreateTemp) in the same directory, write via bufio.NewWriter, call w.Flush(), call f.Sync() (and then f.Close()), and only after a successful flush/fsync/close do an atomic os.Rename(tempPath, p) to replace the original file; preserve the existing JSON encoding steps for sessionHeader and jsonlLine, and ensure errors during write/flush/sync cause removal of the temp file to avoid littering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@session/session.go`:
- Around line 76-99: clonePendingCalls currently only shallow-copies the
top-level Metadata map, allowing nested maps/slices in PendingCall.Metadata to
be mutated by callers; update clonePendingCalls (and/or add a helper) to
deep-copy nested JSON-like values for each Metadata entry (e.g., implement a
recursive deepCopyJSONValue(value any) any or marshal+unmarshal each Metadata
value using encoding/json) and assign the fully deep-copied result to
cp.Metadata so callers cannot mutate session internals; reference the
clonePendingCalls function and the PendingCall.Metadata field when making the
change.
- Around line 224-237: SaveTurn currently clears s.data.Suspended, drops
s.data.PendingCalls and appends an event, which breaks resumability by losing
partial tool messages; instead detect suspended sessions in SaveTurn and return
ErrSuspendedSession. Modify the SaveTurn implementation to check
s.data.Suspended at the start (or before mutating PendingCalls/Events), and if
true return ErrSuspendedSession (preserving state) rather than clearing the flag
or calling s.appender.putSession; ensure SaveResumedTurn remains the only path
that clears Suspended and rewrites via s.appender.putSession when resuming.
---
Outside diff comments:
In `@session/file_store.go`:
- Around line 323-369: The current full-session rewrite uses os.Create(p) and
writes directly, which can truncate the live JSONL and lose pending-call state
on crash; change the implementation that writes the sessionHeader/jsonlLine and
events to instead create a temp file (os.CreateTemp) in the same directory,
write via bufio.NewWriter, call w.Flush(), call f.Sync() (and then f.Close()),
and only after a successful flush/fsync/close do an atomic os.Rename(tempPath,
p) to replace the original file; preserve the existing JSON encoding steps for
sessionHeader and jsonlLine, and ensure errors during write/flush/sync cause
removal of the temp file to avoid littering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6dd180ca-a7f6-41e9-9243-fda05f1b4426
📒 Files selected for processing (6)
agent.goagent_suspend_test.godive.gosession/file_store.gosession/session.gosession/session_test.go
✅ Files skipped from review due to trivial changes (1)
- agent.go
🚧 Files skipped from review as they are similar to previous changes (1)
- session/session_test.go
Align finishSuspended with the design doc: OnSuspend and PostGeneration run before SaveSuspendedTurn so a hook abort leaves the session in its previous state instead of requiring AbandonSuspension compensation. The terminal suspended stream item is now emitted after persistence succeeds and callback errors propagate, so stream consumers never see a suspended terminal for a call that ultimately returned an error. Partial resumes (remaining pending non-empty) skip OnSuspend and the terminal stream item since they continue an existing suspension rather than announcing a new one. Tighten the resume input guard to reject caller-supplied input whenever the session is suspended, not only when tool results are absent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deep-copy nested PendingCall metadata so callers cannot mutate session internals through the returned map. Make SaveTurn return ErrSuspendedSession instead of silently clearing suspend state, keeping SaveResumedTurn as the only path that unwinds a suspension. Write FileStore sessions to a temp file + atomic rename so a crash mid-write cannot truncate the live JSONL and lose pending-call state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@session/file_store.go`:
- Around line 401-404: The rename of tmpPath to p must be followed by syncing
the parent directory to make the rename durable: after os.Rename(tmpPath, p)
succeeds in session/file_store.go (around the tmpPath -> p rename and committed
variable), open the parent directory (use filepath.Dir(p)), call Sync() on the
directory file, close it, and propagate any error from the dir sync so callers
know the rename wasn't durable; only set committed = true after both the rename
and the directory sync succeed. Ensure errors from opening or syncing the
directory are returned (and do not swallow them).
In `@session/session.go`:
- Around line 343-352: The code mutates s.data (e.g., s.data.Suspended,
s.data.PendingCalls, s.data.UpdatedAt and any replaced-last-event fields) before
calling s.appender.putSession; if putSession returns an error these in-memory
changes must be rolled back to their prior values to avoid divergence. For each
affected method (SaveSuspendedTurn, SaveResumedTurn, AbandonSuspension and the
shown block calling s.appender.putSession) snapshot the previous s.data fields
you will change, perform the mutations, call s.appender.putSession, and if it
returns an error restore the saved snapshot to s.data before returning the
error. Ensure the snapshot covers PendingCalls (deep clone or equivalent),
Suspended flag, UpdatedAt and any fields modified by the replace-last
optimization so the in-memory Session state remains consistent on failure.
- Around line 255-265: CreateResponse is currently swallowing the
ErrSuspendedSession returned by Session.SaveTurn; change CreateResponse so that
when Session.SaveTurn (and other save calls like SaveSuspendedTurn) return an
error (including ErrSuspendedSession) it is propagated back to the caller
instead of only logging and continuing. Locate the CreateResponse implementation
and replace the log-and-continue pattern around the call to s.SaveTurn(...) (the
same pattern used at agent.go:645-646 for SaveSuspendedTurn) with immediate
return of the error (wrapping or annotating as needed) so persistence failures
for suspended sessions are bubbled up to the caller.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a2201c0c-8933-44d6-8a55-99d4b2a65ffe
📒 Files selected for processing (2)
session/file_store.gosession/session.go
Three correctness fixes from review: 1. session/file_store.go: fsync the parent directory after the rename so the rename itself is durable across power loss, not just the file contents. 2. session/session.go: snapshot mutated fields in SaveSuspendedTurn, SaveResumedTurn, and AbandonSuspension and roll back on putSession failure so the in-memory Session never diverges from disk. 3. agent.go: propagate SaveTurn and SaveResumedTurn errors out of CreateResponse instead of logging and continuing. Persistence failures must reach the caller; the SaveSuspendedTurn path already did this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four runnable examples illustrating the suspend/resume API: - human_approval: synchronous in-process approval prompt - async_webhook: cross-process resume via persisted FileStore session - partial_resume: parallel tool batch with staged result delivery - dialogspec: shared example helper Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design doc lands under docs/plans alongside the other plans; PRD lands under docs/prds following the existing numbering convention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- session/file_store.go: add a Concurrency model section to the FileStore godoc making the single-writer-per-session contract explicit. Sequential cross-process handoff is supported and crash-consistent; concurrent access from multiple processes is not supported and should use a database-backed Session backend. - docs/plans/plan-04-suspend-resume-review.md: combined review notes from two independent reviewers, organized by severity and themed. Top item is the "sessions are optional, not required" redesign bundle (H2 + H3 + M4) — suspend/resume must work for users who manage history themselves and don't use Session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
session/session.go (1)
260-277:⚠️ Potential issue | 🟠 MajorRoll back
SaveTurnmutations when append-to-disk fails.
SaveTurnnow propagates append errors, but it still appends tos.data.Eventsand updatesUpdatedAtbefore callingappendEvent. If the file write fails, the liveSessionis ahead of disk, so reusing the same object after the error will diverge from a freshly reopened session.🧯 Minimal rollback pattern
func (s *Session) SaveTurn(ctx context.Context, messages []*llm.Message, usage *llm.Usage) error { s.mu.Lock() defer s.mu.Unlock() if s.data.Suspended { return ErrSuspendedSession } evt := &event{ ID: newEventID(), Type: eventTypeTurn, Timestamp: time.Now(), Messages: messages, Usage: usage, } + prevUpdatedAt := s.data.UpdatedAt s.data.Events = append(s.data.Events, evt) s.data.UpdatedAt = evt.Timestamp if s.appender != nil { - return s.appender.appendEvent(ctx, s.data.ID, evt) + if err := s.appender.appendEvent(ctx, s.data.ID, evt); err != nil { + s.data.Events = s.data.Events[:len(s.data.Events)-1] + s.data.UpdatedAt = prevUpdatedAt + return err + } } return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@session/session.go` around lines 260 - 277, SaveTurn currently mutates s.data.Events and s.data.UpdatedAt before calling s.appender.appendEvent, so if appendEvent fails the in-memory Session diverges from disk; fix by either performing appendEvent first and then updating in-memory state, or by capturing the previous state (len(s.data.Events) and previous UpdatedAt) and rolling back those mutations if appendEvent returns an error; update the function SaveTurn to create evt as now, call s.appender.appendEvent(ctx, s.data.ID, evt) while holding the lock (or without if safe), and only append evt to s.data.Events and set s.data.UpdatedAt when appendEvent succeeds—or if you keep the current ordering, restore s.data.Events (truncate to previous length) and s.data.UpdatedAt to the prior value when appendEvent returns an error.
♻️ Duplicate comments (1)
session/session.go (1)
386-423:⚠️ Potential issue | 🟠 MajorReject resume when there is no suspended turn to replace.
This still appends a brand-new event when
Suspendedis true but the event log is empty. That silently “repairs” corrupted state instead of failing fast, and it breaks the core invariant that resume replaces the persisted suspended turn rather than manufacturing a new one.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@session/session.go` around lines 386 - 423, SaveResumedTurn currently creates/appends a new event when s.data.Suspended is true but s.data.Events is empty; instead, detect this corrupted state and reject the resume. In SaveResumedTurn, after the Suspended check validate that len(s.data.Events) > 0 (or hasLastEvent) and if not return ErrNotSuspended (or a new ErrNoSuspendedEvent if you prefer); do not generate evtID or append a new event when there is no existing event to replace (keep the rollback snapshot logic unchanged). Reference: function SaveResumedTurn, s.data.Suspended, s.data.Events, ErrNotSuspended.
🧹 Nitpick comments (2)
examples/suspend/async_webhook/main.go (1)
48-57: Add a production warning aroundOnSuspendside effects.This example is close to something users will copy into real code. Since
OnSuspendruns before the suspended turn is durably committed, a real webhook POST from here can escape even ifSaveSuspendedTurnlater fails. A short warning or outbox note next towebhookNotifierwould prevent a pretty nasty integration footgun.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/suspend/async_webhook/main.go` around lines 48 - 57, The webhookNotifier function (used from OnSuspend) performs external network side-effects at suspend time; add a clear production warning comment immediately above webhookNotifier explaining that OnSuspend executes before the suspended turn is durably committed and that performing real POSTs here can leak actions if SaveSuspendedTurn later fails — instruct consumers to instead use an outbox/durable queue (enqueue the webhook payload for delivery after commit) or only log/print in this example; mention the functions OnSuspend and SaveSuspendedTurn and the preferred outbox pattern so maintainers know how to fix real integrations.docs/plans/plan-04-suspend-resume.md (1)
958-958: Add language identifiers to fenced code blocks.The fenced blocks at Line 958 and Line 1011 should declare a language (for example,
text) to satisfy markdownlint MD040.Also applies to: 1011-1011
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/plans/plan-04-suspend-resume.md` at line 958, Two fenced code blocks in the document are missing language identifiers (triggering markdownlint MD040); update the two triple-backtick blocks referenced around the existing examples (the blocks at the locations mentioned in the review) to include an explicit language token (e.g., ```text or ```bash) so the fenced code is annotated; ensure both the block at the earlier occurrence and the one at the later occurrence are updated consistently to satisfy MD040.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/plans/plan-04-suspend-resume.md`:
- Around line 30-33: Update the "additive only" claim so it no longer asserts
there are no renames/removals; explicitly state that this PR introduces breaking
contract/storage changes — mention the changed pending-call field/name and the
renamed suspension APIs (and the new Suspend field behaving as a nil-pointer
check) so readers are aware this is not fully additive and must handle
migration/compatibility considerations.
- Line 6: Update the stale PRD path references in
docs/plans/plan-04-suspend-resume.md by replacing occurrences of the string
"tasks/prd-agent-suspend-resume.md" (found at the link targets on line 6 and
line 14) with the new path "docs/prds/prd-agent-suspend-resume.md" so the links
point to the moved PR notes; ensure both instances are updated consistently.
- Around line 742-746: Update the design doc to match the current public API by
renaming ClearSuspension to AbandonSuspension and aligning method signatures and
semantics with the PR: replace any references to ClearSuspension and its older
signature with AbandonSuspension(ctx context.Context) error, update descriptions
to reflect the new resume semantics (how CreateResponse transitions out of
resume mode and the updated suspended/resumed error behavior), and make the same
changes in the other occurrences noted (around the other sections referenced).
In `@examples/suspend/async_webhook/main.go`:
- Line 66: The FileStore path passed to session.NewFileStore is unnormalized
which triggers the FileStore path validation to mismatch the root string and
cause ErrInvalidSessionID; update the example to normalize the path before
passing it to session.NewFileStore (e.g., use filepath.Clean or filepath.Abs on
"./async_webhook_sessions") so the store root used by NewFileStore matches the
validated root in the FileStore implementation and prevents ErrInvalidSessionID
when creating or reopening sessions.
In `@session/session.go`:
- Around line 444-471: AbandonSuspension currently clears Suspended and
PendingCalls while leaving the unmatched tool_use event in history, poisoning
the session; change it to detect the last event/unanswered tool_use and remove
or roll back that event from s.data.Events (so the session no longer contains a
half-completed turn) before clearing PendingCalls and setting
s.data.Suspended=false, still cloning previous state (use clonePendingCalls and
prevUpdatedAt) and persisting via s.appender.putSession; if persistence fails,
restore s.data.Events, s.data.PendingCalls, s.data.Suspended and UpdatedAt as
you do now to keep rollback behavior.
---
Outside diff comments:
In `@session/session.go`:
- Around line 260-277: SaveTurn currently mutates s.data.Events and
s.data.UpdatedAt before calling s.appender.appendEvent, so if appendEvent fails
the in-memory Session diverges from disk; fix by either performing appendEvent
first and then updating in-memory state, or by capturing the previous state
(len(s.data.Events) and previous UpdatedAt) and rolling back those mutations if
appendEvent returns an error; update the function SaveTurn to create evt as now,
call s.appender.appendEvent(ctx, s.data.ID, evt) while holding the lock (or
without if safe), and only append evt to s.data.Events and set s.data.UpdatedAt
when appendEvent succeeds—or if you keep the current ordering, restore
s.data.Events (truncate to previous length) and s.data.UpdatedAt to the prior
value when appendEvent returns an error.
---
Duplicate comments:
In `@session/session.go`:
- Around line 386-423: SaveResumedTurn currently creates/appends a new event
when s.data.Suspended is true but s.data.Events is empty; instead, detect this
corrupted state and reject the resume. In SaveResumedTurn, after the Suspended
check validate that len(s.data.Events) > 0 (or hasLastEvent) and if not return
ErrNotSuspended (or a new ErrNoSuspendedEvent if you prefer); do not generate
evtID or append a new event when there is no existing event to replace (keep the
rollback snapshot logic unchanged). Reference: function SaveResumedTurn,
s.data.Suspended, s.data.Events, ErrNotSuspended.
---
Nitpick comments:
In `@docs/plans/plan-04-suspend-resume.md`:
- Line 958: Two fenced code blocks in the document are missing language
identifiers (triggering markdownlint MD040); update the two triple-backtick
blocks referenced around the existing examples (the blocks at the locations
mentioned in the review) to include an explicit language token (e.g., ```text or
```bash) so the fenced code is annotated; ensure both the block at the earlier
occurrence and the one at the later occurrence are updated consistently to
satisfy MD040.
In `@examples/suspend/async_webhook/main.go`:
- Around line 48-57: The webhookNotifier function (used from OnSuspend) performs
external network side-effects at suspend time; add a clear production warning
comment immediately above webhookNotifier explaining that OnSuspend executes
before the suspended turn is durably committed and that performing real POSTs
here can leak actions if SaveSuspendedTurn later fails — instruct consumers to
instead use an outbox/durable queue (enqueue the webhook payload for delivery
after commit) or only log/print in this example; mention the functions OnSuspend
and SaveSuspendedTurn and the preferred outbox pattern so maintainers know how
to fix real integrations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6e8fcbb1-cbbf-486f-86d5-695f554c5ab2
📒 Files selected for processing (10)
agent.godocs/plans/plan-04-suspend-resume-review.mddocs/plans/plan-04-suspend-resume.mddocs/prds/prd-04-suspend-resume.mdexamples/suspend/async_webhook/main.goexamples/suspend/dialogspec/dialogspec.goexamples/suspend/human_approval/main.goexamples/suspend/partial_resume/main.gosession/file_store.gosession/session.go
🚧 Files skipped from review as they are similar to previous changes (1)
- agent.go
| flag.Parse() | ||
|
|
||
| ctx := context.Background() | ||
| store, err := session.NewFileStore("./async_webhook_sessions") |
There was a problem hiding this comment.
Use a normalized store path here.
Line 66 currently passes "./async_webhook_sessions", but the current FileStore path validation compares against the unnormalized root string. That makes this example fail with ErrInvalidSessionID before it can even create or reopen the session.
🛠️ Minimal example fix
- store, err := session.NewFileStore("./async_webhook_sessions")
+ store, err := session.NewFileStore("async_webhook_sessions")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| store, err := session.NewFileStore("./async_webhook_sessions") | |
| store, err := session.NewFileStore("async_webhook_sessions") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/suspend/async_webhook/main.go` at line 66, The FileStore path passed
to session.NewFileStore is unnormalized which triggers the FileStore path
validation to mismatch the root string and cause ErrInvalidSessionID; update the
example to normalize the path before passing it to session.NewFileStore (e.g.,
use filepath.Clean or filepath.Abs on "./async_webhook_sessions") so the store
root used by NewFileStore matches the validated root in the FileStore
implementation and prevents ErrInvalidSessionID when creating or reopening
sessions.
Combined changes. The first block is the "sessions are optional" redesign (H2+H3+M4+M3+M5 from plan-04-suspend-resume-review.md); the second block adds the production hardening follow-ups (M1, L1, M2, L4). Sessions-optional redesign: - WithSuspension(*SuspensionState) passes authoritative resume state explicitly, so stateless callers and cross-process resumers no longer depend on a SuspendableSession being present. - SuspendableSession shrunk to 3 methods (LoadSuspension, SaveSuspendedTurn, SaveResumedTurn); session.PendingCall deleted; session package consumes dive.PendingToolCall / dive.SuspensionState directly. - Response.PendingToolCalls/CompletedToolCalls collapsed into Response.Suspension *SuspensionState with TurnMessageCount so stateless callers can locate the turn boundary without session plumbing. - Resume is opt-in: suspended-with-no-opt-in now returns ErrSuspendedSessionNoOptIn instead of silently re-saving. ErrSessionNotSuspendable and AbandonSuspension removed. - ToolResult tagged-union contract validated at the agent boundary: a tool that returns both Suspend and regular result fields flows through PostToolUseFailure as IsError. - Input-decoding helpers on *PendingToolCall: UnmarshalInput(into any) and generic DecodePendingInput[T]. New stateless example. Hardening follow-ups: - M1: NewFileStoreWithSync(dir, sync bool) opts the hot-path appendEvent into f.Sync(); default NewFileStore keeps pagecache durability. Trade-off documented on FileStore. - L1: CreateResponse accumulates OutputMessages, Items, and Usage across Stop-hook continuations; the synthetic user reason message injected by a Stop hook is appended to the accumulator so a suspend on the re-entered generate preserves both iterations plus the Stop reason in the persisted turn and on Response.Suspension.TurnMessageCount. Pinned by TestStopHookContinueThenSuspend. - M2: package-level sessionLocks sync.Map keyed on Session.ID() serializes concurrent CreateResponse calls that share a session (including across agents). Stateless callers skip the lock. Pinned by TestConcurrentCreateResponseOnSameSessionSerialized under -race. dive.go godoc (Session interface, WithToolResults) updated to describe the guarantee. - L4: OnSuspendHook godoc now references the Response.Suspension shape and explicitly states no compensating rollback is required because the hook runs before persistence. plan-04-suspend-resume-review.md marks items 4, 5, 6, 8 of the Follow-up PRs list as shipped. go test -race -count=1 ./... and go vet ./... clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SuspensionState.TurnMessages replaces TurnMessageCount. Callers (and the agent internally) work with the turn as an actual message slice rather than an index count — simpler to reason about and eliminates the off-by-one risk. - WithResume(state, results) bundles the stateless-resume pair that previously required WithSuspension + WithToolResults. Session-backed callers still use WithToolResults only. - Response.Suspension is now populated on both suspended responses AND on resume-completed responses (with PendingToolCalls nil), so stateless callers can flush the final merged turn into their local history via a single append without reconciling a stale partial tool_result from their saved state. - NewSuspendResult(prompt, metadata) replaces NewSuspendResult + NewSuspendResultWithMetadata. One constructor, same capability. - Renames: ErrSuspendedSessionNoOptIn → ErrResumeRequired, ErrSuspendedSessionInput → ErrInputOnSuspendedSession, ErrNoSuspendedState → ErrNoSuspendedTurn. Error messages no longer reference a non-existent AbandonSuspension API. - SuspendResult.Metadata godoc explicitly warns about JSON round-tripping (int→float64, struct→map[string]any). - session.ListOptions.Suspended *bool filter honored by both MemoryStore and FileStore for finding stale suspended sessions. - withRollback helper DRYs the three snapshot/restore code paths in Session write methods. - MemoryStore.List documents its store→session lock order. - Compact godoc no longer references the removed AbandonSuspension. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Tests: rewrite stateless suspend/resume tests to use the preHistory + WithResume pattern, assert the merged TurnMessages on resume-completed Responses, and replace NewSuspendResult callsites. Parametrize a ListOptions.Suspended filter test across MemoryStore and FileStore. Update round-trip tests to assert LoadSuspension surfaces TurnMessages. - Examples: rewrite the stateless example to demonstrate the new pattern (preHistory/suspension split, WithResume, flush-on-completion). dialogspec switches from NewSuspendResultWithMetadata to the single NewSuspendResult constructor. - Review doc: append a "Final API polish pass" section summarizing the rename and shape changes shipped in this pass and mark the remaining L2 / cleanup checklist items complete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a dedicated suspend-resume guide and cross-link it from the agents, hooks, and custom-tools guides so the feature is discoverable from every natural entry point. Extend the hook flow, error table, and next-steps lists to cover OnSuspend. Update the top-level README, llms.txt, and CLAUDE.md with pointers to the new material and the suspend/resume key types so downstream AI assistants and new users land on the full flow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Append a second round of adversarial tests that pin invariants the existing suite didn't explicitly cover. Each test locks in a contract that a careless refactor would otherwise be free to break: - OnSuspend does not re-fire on a partial resume (no new transition). - OnSuspend DOES re-fire when not-started sequential tools re-suspend. - OnSuspend hook chain: HookAbortError stops the rest; regular errors are logged and the chain continues through PostGeneration + persist. - SetSystemPrompt between suspend and resume is honored (counterpart to the existing SetModel test). - WithResume overrides the session's stored suspension state, enabling cross-process resumers that hold a newer snapshot than on-disk. - WithToolResults validation is atomic: a mix of valid and unknown IDs fails with no session mutation and no PostToolUse/PostToolUseFailure dispatch for the valid entries. - An empty WithToolResults map is not a resume opt-in and returns ErrResumeRequired instead of silently no-op re-saving. - A suspended tool's Call method is not re-invoked on resume — the caller-supplied result is authoritative. - SuspensionState round-trips through JSON without losing the fields needed to drive a resume (the stateless wire-format contract). - LoadSuspension returns a deep copy; mutating it does not affect the session. - Mutating Response.Suspension on the caller side does not corrupt the session's persisted turn. - Cancelling a multi-pending suspension with IsError results for every pending call fires PostToolUseFailure once per cancelled ID and completes the turn normally. Also fix a stale "All three patterns" line in the suspend/resume guide that listed four runnable examples, and add a brief note that the dialogspec directory is a shared helper package rather than an example. Full suite green under `go test -race -count=5 ./...`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verified each criterion against the implementation, tests, and docs; checked off everything that landed across the suspend/resume series (commits 1362d58 → 0912c76). Two substantive doc fixes alongside the checkbox sweep: - US-001 acceptance criterion split into two bullets to be honest: documentation + runnable examples are done; an async-mode toolkit AskUser is not (toolkit/ask_user.go is sync-by-design via Dialog). The pattern is demonstrated in examples/suspend/ for now. - US-003 acceptance criterion rewritten to remove an internal contradiction with US-005. The original wording said partial resume "errors clearly," but US-005 (and TestPartialResumeTwice) defines partial resume as keeping the agent suspended. The bullet now describes the unknown-ID case, which is what ErrUnknownPendingToolCall and TestResumeUnknownID actually pin. Status flipped Draft -> Implemented. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PRD's last open US-001 acceptance criterion required at least one
built-in toolkit tool to demonstrate suspend/resume. AskUserTool was
the natural fit but until now it was sync-by-design: it called into a
dive.Dialog and blocked until the user answered, which is wrong for
SaaS workflows where the wait may take hours or days.
Add an Async option to AskUserToolOptions. When set, Call returns
NewSuspendResult(question, {"type": <input.Type>}) instead of invoking
Dialog. The integrator surfaces the question to a human out-of-band
and resumes the agent later with a tool result whose text content is
JSON-marshaled AskUserOutput. In async mode the Dialog field is
ignored and may be left nil; sync mode is unchanged.
Tests:
- toolkit/ask_user_test.go: 12 new tests covering both modes — name,
schema, annotations, sync no-dialog error, sync confirm/select with
AutoApproveDialog, sync deny path, async returns SuspendResult with
clean tagged-union shape, async carries the question type in
metadata, async ignores Dialog when both set, async works with nil
Dialog, and Input round-trips through DecodePendingInput[*AskUserInput].
- agent_suspend_test.go: TestAskUserToolAsyncEndToEnd drives the real
toolkit.AskUserTool through agent.CreateResponse end-to-end. Pins
the suspend → integrator-supplies-AskUserOutput → resume → final
answer flow including the LLM-side tool_result body, closing the
loop on the PRD acceptance criterion.
Docs:
- docs/guides/suspend-resume.md: new "Built-in: toolkit.AskUserTool in
async mode" subsection under "Suspending from a tool" so readers
find the out-of-the-box option for the most common case.
- docs/prds/prd-04-suspend-resume.md: US-001's last bullet flipped
from [ ] to [x] with a pointer to the new test.
Full suite green under go test -race -count=1 ./...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ToolResultContent.Content is an `any` field, so once a message is round-tripped through JSON — as happens transparently in Message.Copy, session persistence, and cross-process suspend/resume — reading Content directly yields the generic decoded shape (map[string]any, []any, float64), discarding the caller's original Go types. Add a custom UnmarshalJSON that preserves the original content bytes in an unexported field alongside the generic Content value, plus a DecodeContent helper (and DecodeToolResultContent[T] generic wrapper) that replays those bytes into a typed destination. Integers stay integers, typed structs decode into themselves, slices keep their element types. Reading Content directly still gives the generic shape, so existing callers are unaffected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… doc drift - SaveTurn: roll back in-memory Events/UpdatedAt if appendEvent fails, keeping session consistent with what was actually persisted - SaveResumedTurn: reject corrupted state (Suspended=true but empty Events) instead of creating an orphaned event - FileStore: normalize dir with filepath.Clean in constructor so path() prefix check works with relative paths like "./sessions" - Design doc: fix stale PRD path references (tasks/ → docs/prds/), update "additive only" claim to acknowledge breaking session contract changes, align SuspendableSession interface and sessionData to match actual code (LoadSuspension/SaveResumedTurn, PendingToolCalls, CompletedToolCalls), add language identifiers to two unfenced code blocks - Example: add production warning on webhookNotifier about OnSuspend running before SaveSuspendedTurn commits (recommend outbox pattern) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oolCalls preservation Populate HookContext.Tool during resume post-hooks, propagate hook mutations (Result rewriting, AdditionalContext injection) into the merged tool_result message, and preserve rich CompletedToolCall data across partial resumes instead of lossy reconstruction from message content blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Tools can return
dive.NewSuspendResult(prompt)to pause the agent mid-turn and hand control back to the caller. The caller resumes by callingCreateResponsewith the missing outputs supplied viaWithToolResults.Sessions are optional. Suspend/resume works with no
Session, a plainSession, or aSuspendableSessionfor auto-persistence:When a
SuspendableSessionis used, the agent auto-persists on the caller's behalf and the behavior is identical to today — users just callCreateResponsewithWithToolResultsto resume.Key types (all in root
divepackage):SuspensionState { PendingToolCalls, CompletedToolCalls, TurnMessageCount }— symmetric in/out payload. Returned onResponse.Suspension, passed back viaWithSuspension.PendingToolCall { ID, Name, Input, Prompt, Metadata }— a tool call awaiting an external result. IncludesUnmarshalInput(into any) errorand genericDecodePendingInput[T]helpers for typed input decoding.SuspendableSession— 3 methods:LoadSuspension,SaveSuspendedTurn,SaveResumedTurn. PlainSessionimplementations or nil work too.Hooks & streaming:
OnSuspendHookfires when the agent transitions into a suspended state, before persistence and beforePostGeneration. Aborting viaHookAbortErrorleaves the session untouched.ResponseItemTypeSuspendedstream item with theSuspensionStatepayload marks end-of-stream for suspended turns.Response.StatusisResponseStatusCompletedorResponseStatusSuspended(empty treated as Completed for back-compat).Ergonomics & validation:
WithToolResultsnorWithSuspensionnor new input returnsErrSuspendedSessionNoOptIninstead of silently no-op-rewriting the suspended turn.ToolResult.Suspendis a validated tagged union: setting bothSuspendand regular result fields surfaces as anIsErrorresult routed throughPostToolUseFailure, not a silent-ignore.Hardening pass — locked-in semantics
Future contributors: do not regress these.
SuspendableSessionis used, the fullSuspensionState(includingPrompt,Metadata,CompletedToolCalls) is persisted per suspended turn. Cross-process resume and partial-resume-again both surface the original payload unchanged.ErrUnknownPendingToolCallwithout mutating session state.Session.CompactreturnsErrSuspendedSessionwhen suspended.SaveResumedTurnis guarded. ReturnsErrNotSuspendedif the session is not suspended.SaveSuspendedTurnfails,CreateResponsereturns the error — no phantom suspended state.OnSuspendabort is safe. Hooks run before persistence, so aHookAbortErrorinOnSuspendleaves the session in its previous state — no compensation needed.OutputMessagesinvariant. Generate-driven suspends populateResponse.OutputMessageswith the assistant tool_use message plus any partial tool_result. Partial-resume-only suspends (no generate call) return emptyOutputMessages; stateless callers who want the full turn should read from the messages they passed in viaWithMessages.Breaking changes on this unmerged branch
The v0 API from earlier commits on this branch is gone:
Response.PendingToolCalls/Response.CompletedToolCalls→Response.Suspension(*SuspensionState).SuspendableSession.Suspended(),.PendingCalls(),.LastEventMessageCount(),.AbandonSuspension()→ replaced byLoadSuspension() *SuspensionState.session.PendingCalldeleted; use*dive.PendingToolCalldirectly.ErrSessionNotSuspendabledeleted — any session (or no session) can suspend.ErrSuspendedSessionNoOptIn.pending_calls→pending_tool_calls(+ newcompleted_tool_calls). Delete any on-disk suspended sessions created with earlier builds of this branch.Test plan
agent_suspend_test.gocovering simple suspend/resume, suspend-again, parallel & sequential batches, partial resume, no-regression on non-suspending tools, OnSuspend hook ordering, streaming terminal item,SetModelbetween suspend/resume, etc.SaveResumedTurnguard,SaveSuspendedTurnerror propagation,OnSuspendabort rollback, turn-message non-aliasing, andOutputMessagesinvariants.LoadSuspension/SaveSuspendedTurn(state)shape.TestStatelessSuspendAndResume,TestStatelessMultiPendingResume,TestSuspendOnPlainSessionReturnsSuspendedResponse,TestSuspendWithoutSessionReturnsSuspendedResponse,TestSuspendedSessionNoOptInErrors,TestDecodePendingInputHelpers,TestMalformedSuspendResultBecomesError.go test -race -count=1 ./...clean across the whole module.go vet ./...clean.human_approval,partial_resume,async_webhook, plus newstatelessexample that demonstrates suspend/resume with no session and a JSON state file.Docs
docs/plans/plan-04-suspend-resume-review.mdmarks items 1/2/3 of the must-do list as shipped.docs/plans/plan-04-suspend-resume.mdanddocs/prds/prd-04-suspend-resume.mdretain the earlier design context.🤖 Generated with Claude Code