Agent frameworks: retry/backoff on controller calls during restart #209

@jaylfc

With the simplified controller-restart path (see feat/graceful-lifecycle branch), agent containers and LiteLLM stay up while the controller process bounces (~5s for a uvicorn restart). During that window, any controller-bound call from inside an agent fails with a connection-refused error or a 502.

For a controller restart to go unnoticed by the user, frameworks need to:

  • Retry controller-bound HTTP calls with exponential backoff, with a cap of around 60s, which covers a slow restart.
  • Treat "connection refused" and 502/503 as transient, not fatal.
  • Not cancel the in-flight task on first failure.
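
A rough sketch of what the three points above could look like in a fork, using only the standard library. Everything here is illustrative: `call_with_backoff`, `is_transient`, and `get_controller` are hypothetical names, not existing taOS or framework APIs. It also assumes the ~60s cap means a total retry budget, with 0.5s base delay and 8s per-attempt ceiling picked arbitrarily.

```python
import random
import time
import urllib.error
import urllib.request

# Status codes treated as "controller is restarting", per the list above.
TRANSIENT_STATUSES = {502, 503}


def is_transient(exc: BaseException) -> bool:
    """Return True for failures expected while the controller bounces."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in TRANSIENT_STATUSES
    if isinstance(exc, urllib.error.URLError):
        # urllib wraps socket-level errors; the original sits in .reason.
        return isinstance(exc.reason, ConnectionRefusedError)
    return isinstance(exc, ConnectionRefusedError)


def call_with_backoff(fn, *, budget_s=60.0, base_s=0.5, max_delay_s=8.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call fn(), retrying transient failures until budget_s is used up.

    Non-transient errors are re-raised immediately. Once the next wait
    would overrun the budget, the last transient error is re-raised so the
    caller can decide what to do with the task.
    """
    deadline = clock() + budget_s
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            if not is_transient(exc):
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ... capped per attempt.
            delay = min(max_delay_s, base_s * 2 ** attempt)
            # Jitter so agents do not all retry at the same instant.
            delay = random.uniform(delay / 2, delay)
            if clock() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1


def get_controller(url: str, timeout_s: float = 5.0) -> bytes:
    """GET a controller endpoint, riding out a restart window."""
    def attempt() -> bytes:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    return call_with_backoff(attempt)
```

The in-flight task is not cancelled on the first failure because the error only surfaces to the caller after the whole budget is spent. Each framework fork would presumably hook this into its own HTTP client rather than `urllib`; the classification and backoff logic should carry over either way.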

This is framework-side work. As we fork agent frameworks to fit taOS, patch this retry behaviour in each fork. Known targets:

  • crewai
  • langchain
  • autogen
  • any qmd-compatible client we ship

Until this lands, users who run auto_restart may see one failed agent turn per controller restart. Not catastrophic, but visible.

Related: #203 (auto-update auto-restart), feat/graceful-lifecycle branch.

Labels: agents, enhancement, feature, kilo-auto-fix, kilo-triaged