Skip to content

Fix orphaned agent-mode conversations stuck InProgress#13075

Closed
warp-dev-github-integration[bot] wants to merge 4 commits into
masterfrom
factory/agent-conversation-orphan-deadlock
Closed

Fix orphaned agent-mode conversations stuck InProgress#13075
warp-dev-github-integration[bot] wants to merge 4 commits into
masterfrom
factory/agent-conversation-orphan-deadlock

Conversation

@warp-dev-github-integration

@warp-dev-github-integration warp-dev-github-integration Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes a state-machine bug where an agent-mode conversation could be left orphaned: ConversationStatus::InProgress with no in-flight response stream and nothing scheduled to resume it. Once orphaned, nothing ever transitions the conversation to a terminal status, so the eval harness submit_ai_query_and_wait_until_done (app/src/integration_testing/agent_mode/step.rs:128) polls until the 60-min wall clock kills the test. The symptom log is has_active_stream=false status=InProgress from conversation_ready_for_pending_events (app/src/ai/blocklist/controller.rs).

Root cause: conversation status and stream lifecycle are intentionally decoupled — a success-with-actions turn parks at InProgress while the stream is cleaned up, and resolution is handed to the async action→follow-up chain. That chain is driven by the controller's FinishedAction subscription, which only runs when a FinishedAction event is emitted. If a preprocessing batch enqueues nothing executable (e.g. the dedup guard drops an action whose result already exists), no FinishedAction is ever emitted, so the resolver never runs and the conversation is orphaned. The existing detector only logged the condition; it never recovered it.

Fix (approved scope: primary + safety net + hardening):

  • Primaryhandle_preprocess_actions_results now detects when a preprocessing batch enqueued nothing executable and no other actions remain unfinished, and re-emits FinishedAction for an already-settled action so the controller resolves the conversation (follow-up or terminal status). Reusing a finished action id is semantically valid (the action is finished) and avoids an event variant without an action id.
  • Safety net — the FinishedAction resolution logic is extracted into BlocklistAIController::resolve_conversation_after_actions_settled, and a new recover_orphaned_conversation_if_needed is invoked from handle_pending_events_ready (which fires after stream cleanup). It recovers a conversation only when it is InProgress with no active stream, no pending/running/preprocessing actions, and no scheduled auto-resume. The resolver treats missing finished results as empty so an orphan resolves to a terminal status instead of early-returning (normal flow always has results, so behavior there is unchanged).
  • Hardeninghas_unfinished_actions_for_conversation now also counts in-flight pending_preprocessed_actions, preventing premature resolution mid-preprocessing and keeping the recovery net from misfiring.

Existing exemptions are preserved: cancellation reasons where should_preserve_in_progress_status() is true (follow-up-same-conversation, CLI subagent takeover) stay InProgress, and conversations with a pending auto-resume are not touched.

Linked Issue

Reported via the factory-client triage Slack thread (linked below). No GitHub issue.

Testing

Added regression test success_with_actions_that_drop_in_preprocessing_does_not_orphan_conversation in app/src/ai/blocklist/controller_tests.rs: drives a conversation to InProgress with an action whose preprocessing dedup-drops it (finished result pre-seeded) and no active stream, then asserts it does not remain InProgress after preprocessing drains. Confirmed fail-before / pass-after (fails with "got InProgress" without the fix; resolves to Cancelled with it).

  • ./script/format: clean.
  • clippy (-p warp --all-targets --tests -- -D warnings): 0 warnings.
  • New test via cargo nextest: pass.
  • Full ai::blocklist:: suite: 611 passed, 0 failed.

Note: the full-workspace ./script/presubmit was not run locally because the sandbox OOM-kills full-workspace compiles; CI will run it on this PR. This change is an internal state-machine / eval-harness fix and is not user-visible, so no UI verification applies.

  • Automated regression test added (fail-before/pass-after verified)

Agent Mode

  • Warp Agent Mode - This PR was created via Warp's AI Agent Mode

Conversation: https://staging.warp.dev/conversation/21624eef-c39b-4538-8057-eccd2f465b30
Run: https://oz.staging.warp.dev/runs/019f01f8-c0a6-7236-bd28-4978383a0908

This PR was generated with Oz.

A conversation could be left orphaned: `InProgress` with no in-flight
stream, no pending/running/preprocessing actions, and nothing scheduled
to resume it. The eval harness then polls until the 60-min wall clock
kills the test.

Root causes and fixes:
- `has_unfinished_actions_for_conversation` ignored in-flight
  preprocessing, so a conversation mid-preprocessing looked action-less.
  It now also counts non-empty `pending_preprocessed_actions`.
- When a preprocessing batch enqueues nothing executable (e.g. the dedup
  guard drops the only action) no `FinishedAction` was emitted, so the
  controller's resolver never ran. `handle_preprocess_actions_results`
  now re-emits `FinishedAction` for an already-settled action id when the
  batch drains nothing and no actions remain unfinished.
- Extracted the `FinishedAction` subscription body into
  `BlocklistAIController::resolve_conversation_after_actions_settled`,
  treating missing finished results as empty so an orphan still resolves
  to a terminal status instead of early-returning.
- Added a safety net in `handle_pending_events_ready`: when not ready,
  recover a conversation detected as orphaned (InProgress, no stream, no
  unfinished actions, no pending auto-resume) via the shared resolver.

Adds regression test
`success_with_actions_that_drop_in_preprocessing_does_not_orphan_conversation`.

Co-Authored-By: Oz <oz-agent@warp.dev>
@oz-for-oss

oz-for-oss Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@warp-dev-github-integration[bot]

I'm starting a first review of this pull request.

You can view the conversation on Warp.

I completed the review and no human review was requested for this pull request.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

@oz-for-oss oz-for-oss Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR extracts action-settlement resolution, treats preprocessing as unfinished work, and adds recovery for orphaned agent-mode conversations stuck InProgress.

Concerns

  • The new recovery hook can run from controllers that do not own the conversation because orchestration events are broadcast globally; those controllers can see no local stream/actions and incorrectly resolve another terminal's active conversation.

Verdict

Found: 0 critical, 1 important, 0 suggestions

Request changes

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

conversation_id: AIConversationId,
ctx: &mut ModelContext<Self>,
) {
let is_in_progress = BlocklistAIHistoryModel::as_ref(ctx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ [IMPORTANT] EventsReady is broadcast to every controller, so a controller that does not own this conversation can enter recovery with its own empty stream/action state and mark another terminal's active InProgress conversation terminal; gate recovery on all_live_conversations_for_terminal_view(self.terminal_view_id) before checking status.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added an ownership gate at the start of recover_orphaned_conversation_if_needed, using all_live_conversations_for_terminal_view(self.terminal_view_id) before checking status/stream/action state. Non-owning controllers now ignore broadcast EventsReady recovery attempts.

oz-agent added 3 commits June 26, 2026 05:44
Prevent non-owning controllers from recovering conversations for terminal views they do not own when EventsReady is broadcast globally.

Co-Authored-By: Oz <oz-agent@warp.dev>
Increase the assertion window for delayed CLI rich-input timers that are flaky under MacOS CI load.

Co-Authored-By: Oz <oz-agent@warp.dev>
Use the standard eventually assertion window instead of a fixed yield count so the preprocessing completion event has time to settle on slower CI machines.

Co-Authored-By: Oz <oz-agent@warp.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants