fix(errors): treat kube-apiserver HTTP 500 as transient error #16038

Closed
HsiuChuanHsu wants to merge 1 commit into argoproj:main from HsiuChuanHsu:fix/14220


@HsiuChuanHsu HsiuChuanHsu commented Apr 25, 2026

Fixes #14220

Motivation

A workflow fails immediately when the kube-apiserver returns an HTTP 500 error.
This happens because isTransientErr does not recognize general 500 errors as transient, so temporary server issues become permanent workflow failures.

func isTransientErr(err error) bool {
	if err == nil {
		return false
	}
	err = argoerrs.Cause(err)
	return isExceededQuotaErr(err) ||
		apierr.IsTooManyRequests(err) ||
		isResourceQuotaConflictErr(err) ||
		isResourceQuotaTimeoutErr(err) ||
		isTransientNetworkErr(err) ||
		apierr.IsServerTimeout(err) ||
		apierr.IsTimeout(err) ||
		apierr.IsServiceUnavailable(err) ||
		isTransientEtcdErr(err) ||
		matchTransientErrPattern(err) ||
		errors.Is(err, NewErrTransient("")) ||
		isTransientSqbErr(err)
}

There is already a function isResourceQuotaTimeoutErr that calls apierr.IsInternalError(err), but it narrows it with a specific message check ("resource quota evaluation timed out"), so every other HTTP 500 falls through.

func isResourceQuotaTimeoutErr(err error) bool {
	return apierr.IsInternalError(err) && strings.Contains(err.Error(), "resource quota evaluation timed out")
}

5xx HTTP errors currently treated as transient

Function                     HTTP status
IsServerTimeout              HTTP 500 (reason ServerTimeout; matched on the reason, not the status code)
IsTimeout                    HTTP 504
IsServiceUnavailable         HTTP 503
isResourceQuotaTimeoutErr    HTTP 500 (message contains "resource quota evaluation timed out")

Modifications

Classify all HTTP 500 (Internal Server Error) responses from the kube-apiserver as transient so that they are retried automatically, improving workflow resilience.
Logic update: added apierr.IsInternalError(err) to the isTransientErr function to catch 500 errors.

Verification

Updated unit tests to ensure HTTP 500 now triggers a retry and verified the fix with all existing sub-tests.

AI

Claude did the testing part.

Signed-off-by: HsiuChuanHsu <hchsu2106@gmail.com>
@HsiuChuanHsu HsiuChuanHsu marked this pull request as ready for review April 25, 2026 13:54
Member

@isubasinghe isubasinghe left a comment

Yeah I don't feel comfortable about a blanket acceptance of all 500 errors as transient.
Maybe we could instead extend this to support custom transient errors.

@isubasinghe
Member

I've closed this @HsiuChuanHsu but happy to reopen to discuss :)

@HsiuChuanHsu
Author

Thanks for the feedback and sorry for the long delay in getting back to this.
I agree with you that a blanket 500 is too broad.

> Maybe we could instead extend this to support custom transient errors.

Here's an alternative I'm thinking of: introduce a TRANSIENT_HTTP_STATUS_CODES env var (comma-separated, e.g. "500,503") that lets operators explicitly opt-in, following the same opt-in pattern as the existing TRANSIENT_ERROR_PATTERN.

Default behavior is unchanged — no operator gets unexpected retries. Operators who know their 500s are transient can opt in explicitly.

Does this direction work for you? Happy to update the PR.

Development

Successfully merging this pull request may close these issues.

Argo fails to retry workflows when kube-apiserver returns a transient Internal Server Error

2 participants