fix: fix too short timeout causing cascading failures #4133
The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods. In addition, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software. Signed-off-by: morotti <[email protected]>
Longer term, we really need to remove the dependency on exec probes. I believe that once we are using HTTP probes, we can use shorter timeouts with significantly better reliability. There's a PR for using HTTP probes (#2360); however, it's blocked on Ray unifying its health check endpoints (ray-project/ray#56204).
      - bash
      - -c
    - - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
    + - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
It's not enough to increase the wget timeout; you also need to increase probeTimeout in the container's probe config.
By the way, this is just an example manifest; you need to update the controller logic if you want to change the default behavior.
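To illustrate the point about the probe config, here is a minimal sketch of an exec liveness probe in which the probe-level timeout is raised along with the wget timeout. It assumes the standard Kubernetes probe field `timeoutSeconds` is what the comment above refers to as probeTimeout. The numeric values are illustrative assumptions, not KubeRay defaults.

```yaml
# Illustrative sketch only: all numeric values are assumptions,
# not the defaults used by the KubeRay operator.
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
  # The two wget calls run one after the other, each allowed about 10 s,
  # so the probe-level timeout is set above their sum (assumed value).
  timeoutSeconds: 25
  # How often the probe runs; kept longer than the timeout (assumed value).
  periodSeconds: 30
  # Consecutive failures tolerated before the container is restarted (assumed value).
  failureThreshold: 3
```

If `timeoutSeconds` stays lower than the wget timeouts, the kubelet can fail the probe before wget itself gives up, so raising only `-T` would have little effect.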
@andrewsykim, you mentioned that this may be merely an example file and that the settings may come from somewhere else. I had a look, but I can't find where the setting comes from. Do you think you can find the source and update Ray?
Hi @morotti, for the KubeRay operator I think it should be here: kuberay/ray-operator/controllers/ray/utils/constant.go, lines 212 to 219 in 530318b
But you can also overwrite |

Why are these changes needed?
Hello,
The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods, which restrict how much CPU the container can use.
In addition, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.
Checks