Handle BLOCKED Oz runs as actionable terminal states #453

@captainsafia

Summary

The Vercel cron poller does not treat Oz runs in BLOCKED state as terminal/actionable. As a result, blocked runs are polled until OZ_IN_FLIGHT_MAX_ATTEMPTS is exhausted, then the GitHub progress comment is updated with the generic fallback error instead of the run's actionable status_message.

Example

Observed on warpdotdev/warp issue #10699:

The run had actually implemented the change locally, committed it, and uploaded handoff artifacts, but could not push because the environment lacked a working GitHub write path:

I could not complete the required branch push because this environment’s HTTPS GitHub token is invalid, no stored gh login is available, and SSH is unavailable (ssh: No such file or directory). No PR was opened.

No branch or PR was created for oz-agent/implement-issue-10699.

Current behavior

core/poll_runs.py only treats these states as terminal:

  • SUCCEEDED
  • FAILED
  • ERROR
  • CANCELLED

Because BLOCKED is not terminal, the cron poller increments attempts and keeps polling. Vercel logs later showed:

Expiring Oz run 019e19dd-4878-707a-ab4f-480f91c838b0 for workflow create-implementation-from-issue: max attempts exceeded (360 >= 360)

The progress comment was then updated with:

I ran into an unexpected error while working on this.

This hides the actionable blocked reason from maintainers and issue authors.
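The behavior described above can be reproduced with a small simulation. This is a hypothetical reconstruction, not the actual contents of core/poll_runs.py: the function names and action strings are invented for illustration, and the limit of 360 is taken from the Vercel log line above.

```python
# Hypothetical sketch of the current polling logic. Only the state names and
# OZ_IN_FLIGHT_MAX_ATTEMPTS come from the issue; everything else is assumed.
OZ_IN_FLIGHT_MAX_ATTEMPTS = 360  # value observed in the Vercel log

# BLOCKED is missing here, which is the bug being reported.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ERROR", "CANCELLED"}


def poll_step(state: str, attempts: int) -> tuple[str, int]:
    """Decide what one cron tick does for a run in the given state."""
    if state in TERMINAL_STATES:
        return "finalize", attempts
    attempts += 1
    if attempts >= OZ_IN_FLIGHT_MAX_ATTEMPTS:
        # The run is expired and the generic fallback error is posted.
        return "expire", attempts
    return "keep_polling", attempts


def simulate(state: str, max_ticks: int = 1000) -> tuple[str, int]:
    """Run cron ticks until the poller stops polling or max_ticks is hit."""
    attempts = 0
    for _ in range(max_ticks):
        action, attempts = poll_step(state, attempts)
        if action != "keep_polling":
            return action, attempts
    return "keep_polling", attempts
```

Under these assumptions, a run stuck in BLOCKED is only resolved by expiry after all attempts are used, while a FAILED run is finalized on the first tick.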

Expected behavior

When a run enters BLOCKED, the control plane should stop polling it promptly and surface the blocked reason in GitHub.

Possible implementation:

  • Add BLOCKED to terminal/action-required handling in core/poll_runs.py.
  • Route BLOCKED through a failure/action-required handler immediately rather than waiting for max attempts.
  • Include the Oz run status_message.message in the progress comment when available.
  • Preserve the session link so maintainers can inspect or resume the blocked run.
  • Consider using copy that is distinct from the generic failure message, e.g. “The agent is blocked and needs attention…”
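A minimal sketch of how the steps above might fit together, assuming a simplified run object. `RunSnapshot`, `classify`, `build_progress_comment`, and the `session_link` field are hypothetical names, not the real API in core/poll_runs.py, and the `status_message` field stands in for the Oz run's `status_message.message`.

```python
from dataclasses import dataclass
from typing import Optional

TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "ERROR", "CANCELLED"})
ACTION_REQUIRED_STATES = frozenset({"BLOCKED"})  # proposed addition

GENERIC_ERROR = "I ran into an unexpected error while working on this."
BLOCKED_COPY = "The agent is blocked and needs attention."


@dataclass
class RunSnapshot:
    """Hypothetical, simplified view of an Oz run."""
    state: str
    status_message: Optional[str] = None  # stands in for status_message.message
    session_link: Optional[str] = None


def classify(state: str) -> str:
    """Map a run state to how the poller should treat it."""
    if state in ACTION_REQUIRED_STATES:
        return "action_required"
    if state in TERMINAL_STATES:
        return "terminal"
    return "in_flight"


def build_progress_comment(run: RunSnapshot) -> str:
    """Build comment text for a blocked run, falling back to the generic error."""
    if classify(run.state) != "action_required":
        return GENERIC_ERROR
    parts = [BLOCKED_COPY]
    if run.status_message:
        # Surface the actionable reason instead of hiding it.
        parts.append(run.status_message)
    if run.session_link:
        # Keep the session link so maintainers can inspect or resume the run.
        parts.append(f"Session: {run.session_link}")
    return "\n\n".join(parts)
```

With this shape, the poller could call `classify` on every tick and hand BLOCKED runs to the comment builder immediately, rather than incrementing attempts until expiry.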

Why this matters

For implementation workflows, BLOCKED may still include useful artifacts (pr-metadata.json, patches, summaries) and a precise remediation. Treating it as an opaque timeout loses that context and makes debugging much harder.

Labels: area:workflow, bug
