fix(framework): Report clientapp appio communication failures as errors (#7061)
Conversation
Pull request overview
This PR fixes a corner case in the SuperNode ClientApp runtime where AppIO gRPC failures could still be reported as success (`ExitCode.SUCCESS`), and introduces a dedicated non-zero exit code for such communication failures.
Changes:
- Update `run_clientapp` to return a non-zero exit code when a `grpc.RpcError` occurs during AppIO communication.
- Add `ExitCode.CLIENTAPP_COMMUNICATION_ERROR = 250` and document it.
- Add a unit test ensuring gRPC failures don't report success.
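The control flow of the fix can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Flower code: the `pull_message`/`push_reply` callables are stand-ins for the AppIO gRPC calls, and a local `RpcError` class substitutes for `grpc.RpcError` so the sketch runs without `grpcio` installed.

```python
class RpcError(Exception):
    """Stand-in for grpc.RpcError (keeps the sketch dependency-free)."""


# Hypothetical constants mirroring flwr.common.exit.exit_code
SUCCESS = 0
CLIENTAPP_COMMUNICATION_ERROR = 250  # value introduced by this PR


def run_clientapp(pull_message, push_reply) -> int:
    """Run one ClientApp round and return a process exit code.

    Before this fix, an RPC error raised while talking to the AppIO
    API could fall through and still be reported as SUCCESS.
    """
    exit_code = SUCCESS
    try:
        message = pull_message()      # AppIO gRPC call (stand-in)
        push_reply(f"handled:{message}")  # AppIO gRPC call (stand-in)
    except RpcError:
        # Map communication failures to the dedicated non-zero code
        exit_code = CLIENTAPP_COMMUNICATION_ERROR
    return exit_code
```

The key point is that the exit code is tracked in a local variable and switched inside the `except` block, so a communication failure can no longer fall through to the success path.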
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `framework/py/flwr/supernode/runtime/run_clientapp.py` | Tracks an `exit_code` and switches it to a ClientApp communication error on `grpc.RpcError`. |
| `framework/py/flwr/supernode/runtime/run_clientapp_test.py` | Adds a test asserting `flwr_exit` is called with the new non-zero exit code on gRPC failure. |
| `framework/py/flwr/common/exit/exit_code.py` | Introduces the new ClientApp-specific exit code and help text; adjusts the documented ServerApp range. |
| `framework/docs/source/ref-exit-codes/250.rst` | Documents exit code 250 and remediation guidance. |
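The test added in `run_clientapp_test.py` follows a common pattern: mock the exit helper, simulate a gRPC failure, and assert the helper received the non-zero code. A hedged sketch of that pattern (the function names and the injected `exit_fn` parameter are illustrative, not the real test's API; a local `RpcError` again stands in for `grpc.RpcError`):

```python
from unittest import mock


class RpcError(Exception):
    """Stand-in for grpc.RpcError."""


CLIENTAPP_COMMUNICATION_ERROR = 250


def run_clientapp(pull_message, exit_fn) -> None:
    """Toy runtime loop: calls exit_fn(250) on a communication failure."""
    try:
        pull_message()
        exit_fn(0)
    except RpcError:
        exit_fn(CLIENTAPP_COMMUNICATION_ERROR)


def failing_pull():
    raise RpcError("AppIO channel unavailable")


# The test asserts the exit helper is called with the non-zero code
exit_mock = mock.Mock()
run_clientapp(failing_pull, exit_mock)
exit_mock.assert_called_once_with(CLIENTAPP_COMMUNICATION_ERROR)
```

Using a mock here guards against the regression this PR fixes: if the error were swallowed, the mock would record a call with `0` and the assertion would fail.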
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The implementation looks good to me, but I think the new exit-code docs are a little too user-action-oriented for an internal process. In normal runtime usage, users do not start `flwr-clientapp` directly; it is launched by SuperExec with the AppIO address/token/TLS settings already derived from the SuperNode/SuperExec setup.
Could we adjust this page to say that this means the internal ClientApp process (`flwr-clientapp`) could not communicate with the SuperNode ClientAppIo API, and that users should check the surrounding SuperNode/ClientApp logs for the underlying gRPC error? A likely cause is that the ClientApp process took too long to start, for example on a very slow or overloaded system, causing the short-lived token/heartbeat window to expire. If the log message is not enough to diagnose the issue, users should contact the Flower team with the relevant logs. Wdyt?
Updated! LMK if you want further changes!
Minor fix for an error-checking corner case that could cause silent failures (success reported despite an error).
This addresses the Codex comment remaining in the already-merged PR 6986.