diff --git a/apiserversdk/docs/retry-configuration.md b/apiserversdk/docs/retry-configuration.md
new file mode 100644
index 00000000000..880eca59664
--- /dev/null
+++ b/apiserversdk/docs/retry-configuration.md
@@ -0,0 +1,104 @@
# APIServer Retry Behavior & Configuration

This guide walks you through observing the default retry behavior of the KubeRay APIServer and then customizing it for your needs.
By default, the APIServer automatically retries requests to the Kubernetes API that fail with transient errors (such as 429, 502, or 503).
This mechanism improves reliability; the sections below show how to see it in action and how to change it.

## Prerequisite

Follow the [installation](installation.md) guide to create the cluster and install the APIServer.

## Default Retry Behavior

By default, the APIServer automatically retries requests that fail with these HTTP status codes:

- 408 (Request Timeout)
- 429 (Too Many Requests)
- 500 (Internal Server Error)
- 502 (Bad Gateway)
- 503 (Service Unavailable)
- 504 (Gateway Timeout)

It uses the following default configuration:

- **MaxRetry**: 3 retries (4 attempts in total, including the initial request)
- **InitBackoff**: 500ms (initial wait time)
- **BackoffFactor**: 2.0 (exponential multiplier)
- **MaxBackoff**: 10s (maximum wait time between retries)
- **OverallTimeout**: 30s (total timeout across all attempts)

## Customize the Retry Configuration

Currently, the retry configuration is hardcoded. To customize the retry behavior, follow the steps below.

### Step 1: Modify the config in `apiserversdk/util/config.go`

For example:

```go
const (
	HTTPClientDefaultMaxRetry       = 5 // Increase retries
	HTTPClientDefaultBackoffFactor  = float64(2)
	HTTPClientDefaultInitBackoff    = 2 * time.Second // Longer backoff makes timing visible
	HTTPClientDefaultMaxBackoff     = 20 * time.Second
	HTTPClientDefaultOverallTimeout = 120 * time.Second // Longer timeout to allow more retries
)
```

### Step 2: Rebuild and load the new APIServer image into your Kind cluster

```bash
cd apiserver
export IMG_REPO=kuberay-apiserver
export IMG_TAG=dev
export KIND_CLUSTER_NAME=$(kubectl config current-context | sed 's/^kind-//')

make docker-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG
make load-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG KIND_CLUSTER_NAME=$KIND_CLUSTER_NAME
```

### Step 3: Redeploy the APIServer using Helm, overriding the image with the one you just built

```bash
helm upgrade --install kuberay-apiserver ../helm-chart/kuberay-apiserver --wait \
  --set image.repository=$IMG_REPO,image.tag=$IMG_TAG,image.pullPolicy=IfNotPresent \
  --set security=null
```

## Demonstrating Retries

Make sure the APIServer is port-forwarded as described in the [installation](installation.md) guide:

```bash
kubectl port-forward service/kuberay-apiserver-service 31888:8888
```

With the port-forward running, you can trigger the retry mechanism:

### Retries on 429 (Too Many Requests)

Sending a large burst of concurrent requests can exceed the Kubernetes API server's rate limits, causing it to return 429 responses that the APIServer then retries:

```bash
# Send 2000 requests with 150-way concurrency and count the returned status codes
seq 1 2000 | xargs -I{} -P 150 curl -s -o /dev/null -w "%{http_code}\n" \
  http://localhost:31888/apis/ray.io/v1/namespaces/default/rayclusters | sort | uniq -c
```

To see the retries in action, check the APIServer logs:

```bash
kubectl logs -f deployment/kuberay-apiserver
```

### Retries on 503

503 (Service Unavailable) responses are retried with the same backoff schedule. They typically appear when the upstream Kubernetes API is temporarily unable to serve requests, so they are harder to trigger on demand; when they do occur, the retries show up in the same APIServer logs as above.

## Clean Up

Once you are finished, you can delete the Helm release and the Kind cluster.

```bash
# Delete the Helm release
helm delete kuberay-apiserver

# Delete the Kind cluster
kind delete cluster
```
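
As a quick sanity check after cleanup (a sketch, assuming the default Helm namespace and Kind context), you can list the remaining Helm releases and Kind clusters; the ones you just deleted should no longer appear:

```bash
# The kuberay-apiserver release should no longer be listed
helm list

# The Kind cluster should no longer be listed
kind get clusters
```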