With the simplified controller-restart path (see feat/graceful-lifecycle branch), agent containers and LiteLLM stay up while the controller process bounces (~5s for uvicorn restart). During that window, any controller-bound call from inside an agent fails with connection-refused or 502.
For a controller restart to go unnoticed by the user, frameworks need to:
- Retry controller-bound HTTP calls with exponential backoff, capped at around 60s, which covers a slow restart.
- Treat "connection refused" and 502/503 as transient, not fatal.
- Not cancel the in-flight task on first failure.
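As a rough illustration of the three requirements above, here is a minimal sketch using only the Python standard library. It is not taken from any of the forks: the names `call_with_backoff` and `is_transient`, the 0.5s base delay, the 8s per-attempt ceiling and the reading of "cap around 60s" as a total retry budget are all assumptions made for the example.

```python
import random
import time
import urllib.error

# Statuses treated as transient while the controller restarts (from the list above).
TRANSIENT_STATUSES = {502, 503}


def is_transient(exc):
    """Return True for failures expected during a controller restart."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in TRANSIENT_STATUSES
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, ConnectionRefusedError)
    return isinstance(exc, ConnectionRefusedError)


def call_with_backoff(call, budget_s=60.0, base_s=0.5, max_delay_s=8.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Run call(), retrying transient failures until budget_s is used up.

    Hypothetical helper: the defaults are illustrative, not agreed values.
    Non-transient errors are re-raised immediately. A transient error is
    re-raised only once the next wait would exceed the budget, so the
    in-flight task is not abandoned on the first failure.
    """
    deadline = clock() + budget_s
    attempt = 0
    while True:
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc):
                raise
            # Exponential backoff with a per-attempt ceiling, plus jitter.
            delay = min(max_delay_s, base_s * 2 ** attempt)
            delay = random.uniform(delay / 2, delay)
            if clock() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1
```

A fork could wrap its controller client with something like `call_with_backoff(lambda: urllib.request.urlopen(url, timeout=5))`, where `url` stands in for whatever controller endpoint that framework calls. Injecting `sleep` and `clock` keeps the helper testable without real waiting.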
This is framework-side work. As we fork agent frameworks to fit taOS, patch this retry behaviour in each fork. Known targets:
- crewai
- langchain
- autogen
- any qmd-compatible client we ship
Until this lands, users who run auto_restart may see one failed agent turn per controller restart. Not catastrophic, but visible.
Related: #203 (auto-update auto-restart), feat/graceful-lifecycle branch.