fix: orchestrator deadlock recovery — timeout, rollback, circuit breaker#280

Open
tyxben wants to merge 1 commit into Conway-Research:main from tyxben:fix/orchestrator-deadlock-recovery

tyxben commented Mar 19, 2026

Summary

Fixes #266, #259, #262 — tasks stuck in assigned status blocking all goal creation.

Root cause

When a worker dies (sandbox crash, process restart) after task assignment, the task remains in assigned forever because:

  1. Recovery only ran when isWorkerAlive callback was provided (optional)
  2. Self-assigned tasks had zero recovery path
  3. Funding/messaging failures after assignment left tasks orphaned
  4. No escape hatch when all tasks in a goal were stuck

Fix: 4 interlocking mechanisms

1. Timeout-based recovery (unconditional)

  • Record started_at timestamp when assigning tasks (was missing)
  • If a task stays in assigned longer than max(task.timeoutMs, 10min), reset to pending regardless of worker liveness
  • Works even without isWorkerAlive callback, catches self-assigned tasks too
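The timeout rule above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Task` shape, the `startedAt` field name, and `recoverTimedOutTasks` are hypothetical stand-ins, and assigned tasks without a recorded start time are simply skipped here.

```typescript
// Hypothetical task shape; the real type in task-graph.ts may differ.
type TaskStatus = "pending" | "assigned" | "completed" | "failed";

interface Task {
  id: string;
  status: TaskStatus;
  timeoutMs: number;        // per-task timeout
  startedAt: number | null; // epoch ms, recorded at assignment
}

// The 10 minute floor from the description above.
const MIN_ASSIGNED_TIMEOUT_MS = 10 * 60 * 1000;

// Resets every task that has been "assigned" for longer than
// max(task.timeoutMs, 10min) and returns the ids that were reset.
// No worker-liveness check is involved, so self-assigned tasks are covered too.
function recoverTimedOutTasks(tasks: Task[], now: number): string[] {
  const recovered: string[] = [];
  for (const task of tasks) {
    if (task.status !== "assigned" || task.startedAt === null) continue;
    const limit = Math.max(task.timeoutMs, MIN_ASSIGNED_TIMEOUT_MS);
    if (now - task.startedAt > limit) {
      task.status = "pending";
      task.startedAt = null;
      recovered.push(task.id);
    }
  }
  return recovered;
}
```

Taking the maximum means a task with a short `timeoutMs` still gets the full 10 minutes before it is reset, while a long-running task keeps its own, larger limit.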

2. Assignment rollback on post-assign failure

  • If funding or messaging fails after task is committed as assigned, immediately roll back to pending
  • Previously: task got stuck in assigned with the worker never notified
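A rough sketch of the rollback pattern, under stated assumptions: `assignWithRollback` and the `postAssign` callback are hypothetical names standing in for the funding and messaging steps, and the sketch is synchronous for brevity where the real orchestrator code is presumably asynchronous.

```typescript
// Hypothetical task shape for illustration only.
interface Task {
  id: string;
  status: "pending" | "assigned";
  assignedTo: string | null;
  startedAt: number | null;
}

// Commits the assignment, then runs the post-assign step.
// If that step throws, the assignment is undone so the task can be
// picked up again on a later tick. Returns true when the assignment sticks.
function assignWithRollback(
  task: Task,
  workerId: string,
  now: number,
  postAssign: (task: Task) => void, // stand-in for funding + messaging
): boolean {
  task.status = "assigned";
  task.assignedTo = workerId;
  task.startedAt = now;
  try {
    postAssign(task);
    return true;
  } catch {
    // Roll back: the worker was never funded or notified.
    task.status = "pending";
    task.assignedTo = null;
    task.startedAt = null;
    return false;
  }
}
```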

3. Self-assignment circuit breaker

  • Max 3 self-assigned tasks per goal (configurable via MAX_SELF_ASSIGNED_PER_GOAL)
  • Prevents parent agent from being overwhelmed, which starved the orchestrator tick and caused cascading deadlocks
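One way the breaker check might look. The description does not say whether the limit counts tasks currently assigned to the parent or every self-assignment made over the goal's lifetime; this sketch assumes the former. The `Task` shape and `canSelfAssign` are hypothetical.

```typescript
// Limit from the description; configurable in the PR.
const MAX_SELF_ASSIGNED_PER_GOAL = 3;

// Hypothetical task shape for illustration only.
interface Task {
  goalId: string;
  status: "pending" | "assigned" | "completed" | "failed";
  assignedTo: string | null;
}

// Returns true if the parent agent may take one more task for this goal.
// Assumption: only tasks currently in "assigned" count toward the limit.
function canSelfAssign(tasks: Task[], goalId: string, parentId: string): boolean {
  const active = tasks.filter(
    (t) => t.goalId === goalId && t.status === "assigned" && t.assignedTo === parentId,
  ).length;
  return active < MAX_SELF_ASSIGNED_PER_GOAL;
}
```

When the check returns false, the task would stay in pending for a later tick or another worker rather than piling onto the parent.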

4. Phase escape hatch

  • If a goal is in executing for >30min with zero completed/failed tasks, force-fail to unblock new goals
  • Phase entry time tracked in KV store, cleaned up on goal complete/fail
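The escape hatch could be sketched roughly as below. A `Map` stands in for the KV store, and the key format, the `Goal` and `TaskCounts` shapes, and `checkEscapeHatch` are all assumptions made for illustration.

```typescript
// The 30 minute limit from the description.
const EXECUTING_PHASE_LIMIT_MS = 30 * 60 * 1000;

// Hypothetical shapes for illustration only.
interface Goal {
  id: string;
  status: "executing" | "completed" | "failed";
}

interface TaskCounts {
  completed: number;
  failed: number;
}

// Hypothetical key format for the phase entry timestamp.
const phaseKey = (goalId: string): string => `goal:${goalId}:executing_since`;

// Called once per orchestrator tick. Returns true if the goal was force-failed.
function checkEscapeHatch(
  goal: Goal,
  counts: TaskCounts,
  kv: Map<string, string>,
  now: number,
): boolean {
  const key = phaseKey(goal.id);
  if (goal.status !== "executing") {
    kv.delete(key); // clean up once the goal has completed or failed
    return false;
  }
  const stored = kv.get(key);
  if (stored === undefined) {
    kv.set(key, String(now)); // first tick in this phase: record entry time
    return false;
  }
  const tooLong = now - Number(stored) > EXECUTING_PHASE_LIMIT_MS;
  const noProgress = counts.completed === 0 && counts.failed === 0;
  if (tooLong && noProgress) {
    goal.status = "failed"; // force-fail so new goals are unblocked
    kv.delete(key);
    return true;
  }
  return false;
}
```

Requiring zero completed and zero failed tasks keeps the hatch from firing on goals that are slow but still making progress.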

Files changed

| File | Change |
| --- | --- |
| src/orchestration/orchestrator.ts | Core deadlock recovery logic |
| src/orchestration/task-graph.ts | Record started_at on assignment |

Test plan

  • tsc --noEmit passes
  • All 267 orchestration tests pass (orchestrator, task-graph, messaging, plan-mode, health-monitor, simple-tracker, attention)
  • tools-security (71), financial (23), heartbeat (19) — no regressions
  • Manual: create a goal, kill the worker mid-task, verify task is recovered within 10 minutes
  • Manual: simulate self-assignment overflow, verify circuit breaker engages at 3 tasks

Fixes Conway-Research#266, Conway-Research#259, Conway-Research#262 — tasks stuck in 'assigned' status blocking all
goal creation.

Root cause: when a worker dies (sandbox crash, process restart) after
task assignment, the task remains in 'assigned' forever because:
1. Recovery only ran when isWorkerAlive callback was available
2. Self-assigned tasks had zero recovery path
3. Funding/messaging failures after assignment left tasks orphaned
4. No escape hatch when all tasks in a goal were stuck

Changes:

1. Timeout-based recovery (unconditional):
   - Track started_at timestamp when assigning tasks
   - If a task stays in 'assigned' longer than max(task.timeoutMs,
     10min), reset to 'pending' regardless of worker liveness
   - Works even without isWorkerAlive callback

2. Assignment rollback on post-assign failure:
   - If funding or messaging fails AFTER task is committed as
     'assigned', immediately roll back to 'pending'
   - Previously: task stuck in 'assigned' with worker never notified

3. Self-assignment circuit breaker:
   - Max 3 self-assigned tasks per goal
   - Prevents parent agent from being overwhelmed, which starved the
     orchestrator tick and caused cascading deadlocks

4. Phase escape hatch:
   - If goal is in 'executing' for >30min with zero completed/failed
     tasks, force-fail to unblock new goals
   - Tracks phase entry time in KV store, cleans up on complete/fail

All 267 orchestration tests pass, no regressions.

Development

Successfully merging this pull request may close these issues.

Orchestrator deadlock: workers die and block all new goal creation
