
Update KubeRay Autoscaler to use NumOfHosts for min/max workers #48212

Open

wants to merge 11 commits into base: master
Conversation

ryanaoleary (Contributor):

Why are these changes needed?

This PR updates min_workers and max_workers in the autoscaler's available_node_types to account for the value of numOfHosts, defaulting to 1 when it is not set. This doesn't block multi-host autoscaling today, since users can already set minReplicas and maxReplicas to the desired number of multi-host workers, but the change helps avoid unexpected behavior.
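
A minimal sketch of the intended accounting, assuming a worker group spec dict parsed from the RayCluster CR; the helper name and default values here are illustrative, not the PR's actual code:

```python
# Illustrative sketch only (helper name and defaults are assumptions), showing
# how min/max worker counts can account for numOfHosts.
def _worker_bounds(group_spec: dict) -> tuple:
    """Return (min_workers, max_workers) counted in Ray worker nodes."""
    # Each replica may span several hosts (e.g. a multi-host TPU slice);
    # default to 1 when numOfHosts is not set on the worker group.
    num_of_hosts = group_spec.get("numOfHosts", 1)
    min_workers = group_spec.get("minReplicas", 0) * num_of_hosts
    max_workers = group_spec.get("maxReplicas", 0) * num_of_hosts
    return min_workers, max_workers

# Example: minReplicas=1, maxReplicas=4, numOfHosts=2 -> (2, 8) worker nodes.
```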

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ryanaoleary requested review from hongchaodeng and a team as code owners on October 23, 2024 09:02
andrewsykim (Contributor) left a comment:

LGTM

ryanaoleary (Contributor, Author):

@hongchaodeng is a code owner able to review/merge this for me?

jcotant1 added the kuberay label (Issues for the Ray/Kuberay integration that are tracked on the Ray side) on Nov 18, 2024
ryanaoleary (Contributor, Author) commented Jan 16, 2025:

Related Issue:

#2600

ryanaoleary (Contributor, Author):

Closes #2820

aslonnie (Collaborator) left a comment:

please properly rebase.

ryanaoleary (Contributor, Author):

> please properly rebase.

Oh, sorry about that. I fixed the improper rebase; not sure why those other commits got added to the PR.

aslonnie removed review requests for a team, sven1977, and simonsays1980 on January 30, 2025 19:55
ryanaoleary requested a review from aslonnie on January 31, 2025 16:38
kevin85421 self-assigned this on Feb 4, 2025
kevin85421 (Member) left a comment:

What happens if maxReplicas is less than replicas * numOfHosts? I guess the Ray Autoscaler will terminate the additional Pods?

@@ -109,7 +109,7 @@ def _get_basic_autoscaling_config() -> dict:
         # Same as "small-group" with a TPU resource entry added
         # and modified max_workers and node_config.
         "tpu-group": {
-            "max_workers": 4,
+            "max_workers": 8,
Member (inline review comment):

why do we need this change?
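
Hedged context on the arithmetic behind the diff above: if the tpu-group fixture in this test sets numOfHosts to 2 (an assumption, not shown in this excerpt), then with maxReplicas of 4 the expected max_workers becomes 4 * 2 = 8 once worker bounds are multiplied by numOfHosts:

```python
# Illustrative only; numOfHosts = 2 is an assumption about the test fixture.
max_replicas = 4
num_of_hosts = 2
expected_max_workers = max_replicas * num_of_hosts  # 8 Ray worker nodes
assert expected_max_workers == 8
```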

kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Feb 5, 2025