fix(errors): treat kube-apiserver HTTP 500 as transient error #16038
HsiuChuanHsu wants to merge 1 commit
Signed-off-by: HsiuChuanHsu <hchsu2106@gmail.com>
isubasinghe
left a comment
Yeah I don't feel comfortable about a blanket acceptance of all 500 errors as transient.
Maybe we could instead extend this to support custom transient errors.
I've closed this @HsiuChuanHsu but happy to reopen to discuss :)
Thanks for the feedback and sorry for the long delay in getting back to this.
Here's an alternative I'm thinking of: introduce an opt-in mechanism for this. Default behavior is unchanged, so no operator gets unexpected retries. Operators who know their 500s are transient can opt in explicitly. Does this direction work for you? Happy to update the PR.
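For illustration only, the opt-in direction described above might look roughly like the following Go sketch. The `transientPolicy` type, its `RetryInternalServerError` field, and the baseline set of retried status codes are all hypothetical names and assumptions made for this example. None of them come from the PR or from the Argo Workflows codebase.

```go
package main

import (
	"fmt"
	"net/http"
)

// transientPolicy is a hypothetical opt-in switch, not an existing Argo type.
// Its zero value leaves HTTP 500 classified as non-transient.
type transientPolicy struct {
	RetryInternalServerError bool
}

// isTransientStatus reports whether a status code should be retried.
// The baseline codes below are an assumption chosen for the example.
func (p transientPolicy) isTransientStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	case http.StatusInternalServerError:
		// Retried only when the operator has opted in explicitly.
		return p.RetryInternalServerError
	}
	return false
}

func main() {
	defaults := transientPolicy{}
	optedIn := transientPolicy{RetryInternalServerError: true}
	fmt.Println(defaults.isTransientStatus(500), optedIn.isTransientStatus(500))
}
```

With this shape, an operator who never sets the option sees no change in retry behavior, which is the property the proposal is after.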
Fixes #14220
Motivation
Workflows fail immediately when the kube-apiserver returns an HTTP 500 error.
This happens because the system does not recognize general 500 errors as transient, turning temporary server issues into permanent workflow failures.
argo-workflows/util/errors/errors.go
Lines 39 to 56 in 7a7e608
argo-workflows/util/errors/errors.go
Lines 73 to 75 in 7a7e608
5XX HTTP errors that are treated as transient for now
Modifications
Updates the error handling logic to classify all HTTP 500 (Internal Server Error) responses as transient, enabling automatic retries to improve workflow resilience.
Logic Update: Added `apierr.IsInternalError(err)` to the `isTransientErr` function to catch 500 errors.
Verification
Updated unit tests to ensure HTTP 500 now triggers a retry and verified the fix with all existing sub-tests.
AI
Claude did the testing part.