Update KubeRay Autoscaler to use NumOfHosts for min/max workers #48212

ryanaoleary · 2024-10-23T09:02:14Z

Why are these changes needed?

This PR updates min_workers and max_workers in the autoscaler's available_node_types to account for numOfHosts, which defaults to 1 when it is not set. This doesn't currently block multi-host autoscaling, since you can set minReplicas and maxReplicas to the desired number of multi-host workers, but the change helps avoid unexpected behavior for users.
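As a rough illustration of the intended behavior (not the actual code in autoscaling_config.py), the sketch below assumes the bounds are obtained by multiplying minReplicas and maxReplicas by numOfHosts. The helper name `worker_bounds` and the group dictionaries are hypothetical; the values 4 and 2 were chosen so that the result matches the 4 to 8 change in the test diff.

```python
from typing import Any, Dict, Tuple


def worker_bounds(group_spec: Dict[str, Any]) -> Tuple[int, int]:
    """Return (min_workers, max_workers) for one worker group.

    Hypothetical helper: assumes the replica bounds are scaled by
    numOfHosts, which defaults to 1 when the field is not set.
    """
    num_of_hosts = group_spec.get("numOfHosts", 1)
    min_workers = group_spec["minReplicas"] * num_of_hosts
    max_workers = group_spec["maxReplicas"] * num_of_hosts
    return min_workers, max_workers


# Illustrative group specs (assumed values, not copied from the PR).
tpu_group = {"groupName": "tpu-group", "minReplicas": 0, "maxReplicas": 4, "numOfHosts": 2}
single_host_group = {"groupName": "small-group", "minReplicas": 0, "maxReplicas": 4}

print(worker_bounds(tpu_group))          # multi-host: bounds scaled by 2
print(worker_bounds(single_host_group))  # numOfHosts unset: bounds unchanged
```

Under this assumption, a multi-host group with maxReplicas of 4 and numOfHosts of 2 allows up to 8 worker Pods, while a group without numOfHosts keeps its replica bounds unchanged.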

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Ryan O'Leary <[email protected]>

andrewsykim

LGTM

python/ray/autoscaler/_private/kuberay/autoscaling_config.py

ryanaoleary · 2024-11-15T19:57:01Z

@hongchaodeng is a code owner able to review/merge this for me?

Signed-off-by: ryanaoleary <[email protected]>

ryanaoleary · 2025-01-16T14:18:56Z

Related Issue:

#2600

ryanaoleary · 2025-01-27T16:42:29Z

Closes #2820

aslonnie

please properly rebase.

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary · 2025-01-30T14:51:31Z

> please properly rebase.

Oh, sorry about that. I fixed the improper rebase; I'm not sure why those other commits got added to the PR.

Signed-off-by: Ryan O'Leary <[email protected]>

kevin85421

What happens if maxReplicas is less than replicas * numOfHosts? I guess the Ray Autoscaler will terminate the additional Pods?

kevin85421 · 2025-02-05T04:27:52Z

python/ray/tests/kuberay/test_autoscaling_config.py

@@ -109,7 +109,7 @@ def _get_basic_autoscaling_config() -> dict:
            # Same as "small-group" with a TPU resource entry added
            # and modified max_workers and node_config.
            "tpu-group": {
-                "max_workers": 4,
+                "max_workers": 8,


why do we need this change?

KubeRay Autoscaler min/max workers use NumOfHosts

9861b41

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary requested review from hongchaodeng and a team as code owners October 23, 2024 09:02

Merge branch 'master' into tpu-max-workers

185b759

andrewsykim approved these changes Nov 1, 2024

Merge branch 'master' into tpu-max-workers

6ca79f0

jcotant1 added the kuberay Issues for the Ray/Kuberay integration that are tracked on the Ray side label Nov 18, 2024

Merge branch 'master' into tpu-max-workers

d1c983b

Signed-off-by: ryanaoleary <[email protected]>

Merge branch 'master' into tpu-max-workers

c8f3e2e

ryanaoleary mentioned this pull request Jan 27, 2025

[Bug] Kuberay autoscaler should use numOfHosts to calculate max workers ray-project/kuberay#2820

Open

2 tasks

Merge branch 'master' into tpu-max-workers

a9a1aa7

ryanaoleary requested review from a team, sven1977 and simonsays1980 as code owners January 29, 2025 09:05

aslonnie reviewed Jan 30, 2025

ryanaoleary added 2 commits January 30, 2025 14:43

Fix test

6bb7231

Signed-off-by: Ryan O'Leary <[email protected]>

KubeRay Autoscaler min/max workers use NumOfHosts

0182cf4

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary force-pushed the tpu-max-workers branch from 700abff to 0182cf4 Compare January 30, 2025 14:44

Merge branch 'master' into tpu-max-workers

0f8e2f7

aslonnie removed request for a team, sven1977 and simonsays1980 January 30, 2025 19:55

Remove repeated line that got added in merge

d29b617

Signed-off-by: Ryan O'Leary <[email protected]>

Merge branch 'master' into tpu-max-workers

52530d2

ryanaoleary requested a review from aslonnie January 31, 2025 16:38

kevin85421 self-assigned this Feb 4, 2025

kevin85421 approved these changes Feb 5, 2025

kevin85421 added the go add ONLY when ready to merge, run all tests label Feb 5, 2025
