# APIServer Retry Behavior & Configuration

This guide walks you through observing the default retry behavior of the KubeRay APIServer and then customizing it for your needs.
By default, the APIServer automatically retries requests to the Kubernetes API that fail with transient errors
(such as 429, 502, or 503), which improves reliability.

## Prerequisites

Follow the [installation guide](installation.md) to set up the cluster and install the APIServer.

## Default Retry Behavior

By default, the APIServer automatically retries requests that fail with these HTTP status codes:

- 408 (Request Timeout)
- 429 (Too Many Requests)
- 500 (Internal Server Error)
- 502 (Bad Gateway)
- 503 (Service Unavailable)
- 504 (Gateway Timeout)

With the following default configuration:

- **MaxRetry**: 3 retries (4 attempts total, including the initial request)
- **InitBackoff**: 500ms (initial wait time)
- **BackoffFactor**: 2.0 (exponential multiplier)
- **MaxBackoff**: 10s (maximum wait time between retries)
- **OverallTimeout**: 30s (total timeout for all attempts)

## Customize the Retry Configuration

Currently, the retry configuration is hardcoded. To customize the retry behavior, follow the steps below.

### Step 1: Modify the config in `apiserversdk/util/config.go`

For example,

```go
const (
	HTTPClientDefaultMaxRetry       = 5                 // Increase retries
	HTTPClientDefaultBackoffFactor  = float64(2)
	HTTPClientDefaultInitBackoff    = 2 * time.Second   // Longer backoff makes timing visible
	HTTPClientDefaultMaxBackoff     = 20 * time.Second
	HTTPClientDefaultOverallTimeout = 120 * time.Second // Longer timeout to allow more retries
)
```

### Step 2: Rebuild and load the new APIServer image into your Kind cluster

```bash
cd apiserver
export IMG_REPO=kuberay-apiserver
export IMG_TAG=dev
export KIND_CLUSTER_NAME=$(kubectl config current-context | sed 's/^kind-//')

make docker-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG
make load-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG KIND_CLUSTER_NAME=$KIND_CLUSTER_NAME
```

### Step 3: Redeploy the APIServer using Helm, overriding the image to use the new one you just built

```bash
helm upgrade --install kuberay-apiserver ../helm-chart/kuberay-apiserver --wait \
  --set image.repository=$IMG_REPO,image.tag=$IMG_TAG,image.pullPolicy=IfNotPresent \
  --set security=null
```

## Demonstrating Retries

Make sure the APIServer port is forwarded as described in the [installation guide](installation.md):

```bash
kubectl port-forward service/kuberay-apiserver-service 31888:8888
```

After port-forwarding, test the retry mechanism:

### Retries on 429 (Too Many Requests)

Fire a burst of concurrent requests to trigger throttling. The command below sends 2,000 requests with up to 150 in flight at a time and tallies the returned status codes:

```bash
seq 1 2000 | xargs -I{} -P 150 curl -s -o /dev/null -w "%{http_code}\n" \
http://localhost:31888/apis/ray.io/v1/namespaces/default/rayclusters | sort | uniq -c
```

To see the retries in action, check the APIServer logs:

```bash
kubectl logs -f deployment/kuberay-apiserver
```

### Retries on 503

One way to exercise retries on 503 (Service Unavailable) is to make the upstream Kubernetes API temporarily unreachable while sending requests, for example by pausing the Kind control-plane container with `docker pause <cluster-name>-control-plane`. Watch the APIServer logs for retry attempts, and run `docker unpause <cluster-name>-control-plane` when you are done.

## Clean Up

Once you are finished, you can delete the Helm release and the Kind cluster.

```bash
# Delete the Helm release
helm delete kuberay-apiserver

# Delete the Kind cluster
kind delete cluster
```