Fix orphaned agent-mode conversations stuck InProgress#13075
Fix orphaned agent-mode conversations stuck InProgress#13075warp-dev-github-integration[bot] wants to merge 4 commits into
Conversation
A conversation could be left orphaned: `InProgress` with no in-flight stream, no pending/running/preprocessing actions, and nothing scheduled to resume it. The eval harness then polls until the 60-min wall clock kills the test. Root causes and fixes: - `has_unfinished_actions_for_conversation` ignored in-flight preprocessing, so a conversation mid-preprocessing looked action-less. It now also counts non-empty `pending_preprocessed_actions`. - When a preprocessing batch enqueues nothing executable (e.g. the dedup guard drops the only action) no `FinishedAction` was emitted, so the controller's resolver never ran. `handle_preprocess_actions_results` now re-emits `FinishedAction` for an already-settled action id when the batch drains nothing and no actions remain unfinished. - Extracted the `FinishedAction` subscription body into `BlocklistAIController::resolve_conversation_after_actions_settled`, treating missing finished results as empty so an orphan still resolves to a terminal status instead of early-returning. - Added a safety net in `handle_pending_events_ready`: when not ready, recover a conversation detected as orphaned (InProgress, no stream, no unfinished actions, no pending auto-resume) via the shared resolver. Adds regression test `success_with_actions_that_drop_in_preprocessing_does_not_orphan_conversation`. Co-Authored-By: Oz <oz-agent@warp.dev>
|
@warp-dev-github-integration[bot] I'm starting a first review of this pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
There was a problem hiding this comment.
Overview
This PR extracts action-settlement resolution, treats preprocessing as unfinished work, and adds recovery for orphaned agent-mode conversations stuck InProgress.
Concerns
- The new recovery hook can run from controllers that do not own the conversation because orchestration events are broadcast globally; those controllers can see no local stream/actions and incorrectly resolve another terminal's active conversation.
Verdict
Found: 0 critical, 1 important, 0 suggestions
Request changes
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
| conversation_id: AIConversationId, | ||
| ctx: &mut ModelContext<Self>, | ||
| ) { | ||
| let is_in_progress = BlocklistAIHistoryModel::as_ref(ctx) |
There was a problem hiding this comment.
EventsReady is broadcast to every controller, so a controller that does not own this conversation can enter recovery with its own empty stream/action state and mark another terminal's active InProgress conversation terminal; gate recovery on all_live_conversations_for_terminal_view(self.terminal_view_id) before checking status.
There was a problem hiding this comment.
Done — added an ownership gate at the start of recover_orphaned_conversation_if_needed, using all_live_conversations_for_terminal_view(self.terminal_view_id) before checking status/stream/action state. Non-owning controllers now ignore broadcast EventsReady recovery attempts.
Prevent non-owning controllers from recovering conversations for terminal views they do not own when EventsReady is broadcast globally. Co-Authored-By: Oz <oz-agent@warp.dev>
Increase the assertion window for delayed CLI rich-input timers that are flaky under MacOS CI load. Co-Authored-By: Oz <oz-agent@warp.dev>
Use the standard eventually assertion window instead of a fixed yield count so the preprocessing completion event has time to settle on slower CI machines. Co-Authored-By: Oz <oz-agent@warp.dev>
Description
Fixes a state-machine bug where an agent-mode conversation could be left orphaned:
ConversationStatus::InProgresswith no in-flight response stream and nothing scheduled to resume it. Once orphaned, nothing ever transitions the conversation to a terminal status, so the eval harnesssubmit_ai_query_and_wait_until_done(app/src/integration_testing/agent_mode/step.rs:128) polls until the 60-min wall clock kills the test. The symptom log ishas_active_stream=false status=InProgressfromconversation_ready_for_pending_events(app/src/ai/blocklist/controller.rs).Root cause: conversation status and stream lifecycle are intentionally decoupled — a success-with-actions turn parks at
InProgresswhile the stream is cleaned up, and resolution is handed to the async action→follow-up chain. That chain is driven by the controller'sFinishedActionsubscription, which only runs when aFinishedActionevent is emitted. If a preprocessing batch enqueues nothing executable (e.g. the dedup guard drops an action whose result already exists), noFinishedActionis ever emitted, so the resolver never runs and the conversation is orphaned. The existing detector only logged the condition; it never recovered it.Fix (approved scope: primary + safety net + hardening):
handle_preprocess_actions_resultsnow detects when a preprocessing batch enqueued nothing executable and no other actions remain unfinished, and re-emitsFinishedActionfor an already-settled action so the controller resolves the conversation (follow-up or terminal status). Reusing a finished action id is semantically valid (the action is finished) and avoids an event variant without an action id.FinishedActionresolution logic is extracted intoBlocklistAIController::resolve_conversation_after_actions_settled, and a newrecover_orphaned_conversation_if_neededis invoked fromhandle_pending_events_ready(which fires after stream cleanup). It recovers a conversation only when it isInProgresswith no active stream, no pending/running/preprocessing actions, and no scheduled auto-resume. The resolver treats missing finished results as empty so an orphan resolves to a terminal status instead of early-returning (normal flow always has results, so behavior there is unchanged).has_unfinished_actions_for_conversationnow also counts in-flightpending_preprocessed_actions, preventing premature resolution mid-preprocessing and keeping the recovery net from misfiring.Existing exemptions are preserved: cancellation reasons where
should_preserve_in_progress_status()is true (follow-up-same-conversation, CLI subagent takeover) stayInProgress, and conversations with a pending auto-resume are not touched.Linked Issue
Reported via the factory-client triage Slack thread (linked below). No GitHub issue.
Testing
Added regression test
success_with_actions_that_drop_in_preprocessing_does_not_orphan_conversationinapp/src/ai/blocklist/controller_tests.rs: drives a conversation toInProgresswith an action whose preprocessing dedup-drops it (finished result pre-seeded) and no active stream, then asserts it does not remainInProgressafter preprocessing drains. Confirmed fail-before / pass-after (fails with "got InProgress" without the fix; resolves toCancelledwith it)../script/format: clean.-p warp --all-targets --tests -- -D warnings): 0 warnings.cargo nextest: pass.ai::blocklist::suite: 611 passed, 0 failed.Note: the full-workspace
./script/presubmitwas not run locally because the sandbox OOM-kills full-workspace compiles; CI will run it on this PR. This change is an internal state-machine / eval-harness fix and is not user-visible, so no UI verification applies.Agent Mode
Conversation: https://staging.warp.dev/conversation/21624eef-c39b-4538-8057-eccd2f465b30
Run: https://oz.staging.warp.dev/runs/019f01f8-c0a6-7236-bd28-4978383a0908
This PR was generated with Oz.