-
Notifications
You must be signed in to change notification settings - Fork 659
[Fix] Resolve int32 overflow by having the calculation in int64 and c… #4158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix] Resolve int32 overflow by having the calculation in int64 and c… #4158
Conversation
…ap it if the count is over math.MaxInt32 Signed-off-by: justinyeh1995 <[email protected]>
Signed-off-by: justinyeh1995 <[email protected]>
Signed-off-by: justinyeh1995 <[email protected]>
|
@win5923 @Future-Outlier I somehow could not add you as reviewers, so I will ping you here instead. Please take a look. Thanks! |
win5923
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, do we also need to check overflow for CalculateMinReplicas?
@Future-Outlier
troychiu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do sth like
| func SafeUint64ToInt64(n uint64) int64 { |
nosec?
…fix/4153-operator-maxreplicas-overflow
I think we can. Say we have another function func SafeInt64ToInt32(n int64) int32 we can change the logic into something like this func CalculateMaxReplicas(cluster *rayv1.RayCluster) int32 {
count := int64(0)
for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
continue
}
count += int64(*nodeGroup.MaxReplicas) * int64(nodeGroup.NumOfHosts)
// stop the calculation if an overflow happens
if count > math.MaxInt32 {
break
}
}
return SafeInt64ToInt32(count)Do you think it will be a reasonable way of approaching it? |
|
yeah IMO that's better |
400Ping
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Sounds good. I will update this section accordingly. |
Signed-off-by: justinyeh1995 <[email protected]>
| func CapInt64ToInt32(n int64) int32 { | ||
| if n > math.MaxInt32 { | ||
| return math.MaxInt32 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may need a minimum check to avoid lint error. Also, could you follow the naming convention for the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I'll add the minimum check and see if it resolves the lint error, and rename the function to SafeInt64ToInt32 as well.
| count += (*nodeGroup.MaxReplicas * nodeGroup.NumOfHosts) | ||
| count += int64(*nodeGroup.MaxReplicas) * int64(nodeGroup.NumOfHosts) | ||
|
|
||
| // early return if an overflow happens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this early return if we will do a safe type casting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a decision to be made. The early return is a non-critical optimization
but not as clean as Option 1. I'm happy to go with either way.
Option 1
func CalculateMaxReplicas(cluster *rayv1.RayCluster) int32 {
count := int64(0)
for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
continue
}
count += int64(*nodeGroup.MaxReplicas) * int64(nodeGroup.NumOfHosts)
}
return SafeInt64ToInt32(count)Option 2
func CalculateMaxReplicas(cluster *rayv1.RayCluster) int32 {
count := int64(0)
for _, nodeGroup := range cluster.Spec.WorkerGroupSpecs {
if nodeGroup.Suspend != nil && *nodeGroup.Suspend {
continue
}
count += int64(*nodeGroup.MaxReplicas) * int64(nodeGroup.NumOfHosts)
// early return if an overflow happens
if count > math.MaxInt32 {
return math.MaxInt32
}
}
return SafeInt64ToInt32(count)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go for option 1 but if others think option 2 is better then I am fine with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Now with the SafeInt64ToInt32 doing the overflow/underflow prevention. I think Option 1 is much cleaner.
… prevention (it also help pass the lint test) Signed-off-by: justinyeh1995 <[email protected]>
…t32 overflow and underflow checking. Signed-off-by: justinyeh1995 <[email protected]>
troychiu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted (#4141) * [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted Signed-off-by: 400Ping <[email protected]> * [Fix] Fix e2e error Signed-off-by: 400Ping <[email protected]> * [Fix] fix according to rueian's comment Signed-off-by: 400Ping <[email protected]> * [Chore] fix ci error Signed-off-by: 400Ping <[email protected]> * Update ray-operator/controllers/ray/raycluster_controller.go Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> Signed-off-by: Ping <[email protected]> * Update ray-operator/controllers/ray/rayjob_controller.go Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> Signed-off-by: Ping <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * Trigger CI Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: 400Ping <[email protected]> Signed-off-by: Ping <[email protected]> Signed-off-by: Future-Outlier <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> * fix: dashboard build for kuberay 1.5.0 (#4161) Signed-off-by: Future-Outlier <[email protected]> * [Feature Enhancement] Set ordered replica index label to support multi-slice (#4163) * [Feature Enhancement] Set ordered replica index label to support multi-slice Signed-off-by: Ryan O'Leary <[email protected]> * rename replica-id -> replica-name Signed-off-by: Ryan O'Leary <[email protected]> * Separate replica index feature gate logic Signed-off-by: Ryan O'Leary <[email protected]> * remove index arg in createWorkerPod Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]> * update stale feature gate comments (#4174) Signed-off-by: Andrew Sy Kim <[email protected]> * [RayCluster] Add more context why we don't recreate head Pod for RayJob (#4175) Signed-off-by: Kai-Hsun Chen <[email protected]> * feature: Remove empty resource list initialization. (#4168) Fixes #4142. * [Dockerfile] [KubeRay Dashboard]: Fix Dockerfile warnings (ENV format, CMD JSON args) (#4167) * [#4166] improvement: Fix Dockerfile warnings (ENV format, CMD JSON args) * extract the hostname from CMD Signed-off-by: Neo Chien <[email protected]> --------- Signed-off-by: Neo Chien <[email protected]> Co-authored-by: cchung100m <[email protected]> * [Fix] Resolve int32 overflow by having the calculation in int64 and c… (#4158) * [Fix] Resolve int32 overflow by having the calculation in int64 and cap it if the count is over math.MaxInt32 Signed-off-by: justinyeh1995 <[email protected]> * [Test] Add unit tests for CalculateReadyReplicas Signed-off-by: justinyeh1995 <[email protected]> * [Fix] Add a nosec comment to pass the Lint (pre-commit) test Signed-off-by: justinyeh1995 <[email protected]> * [Refactor] Add CapInt64ToInt32 to replace #nosec directives Signed-off-by: justinyeh1995 <[email protected]> * [Refactor] Rename function to SafeInt64ToInt32 and add a underflowing prevention (it also help pass the lint test) Signed-off-by: justinyeh1995 <[email protected]> * [Refactor] Remove the early return as SafeInt64ToInt32 handles the int32 overflow and underflow checking. Signed-off-by: justinyeh1995 <[email protected]> --------- Signed-off-by: justinyeh1995 <[email protected]> * Add RayService incremental upgrade sample for guide (#4164) Signed-off-by: Ryan O'Leary <[email protected]> * Edit RayCluster example config for label selectors (#4151) Signed-off-by: Ryan O'Leary <[email protected]> * [RayJob] update light weight submitter image from quay.io (#4181) Signed-off-by: Future-Outlier <[email protected]> * [flaky] RayJob fails when head Pod is deleted when job is running (#4182) Signed-off-by: Future-Outlier <[email protected]> * [CI] Pin Docker api version to avoid API version mismatch (#4188) Signed-off-by: win5923 <[email protected]> * Make replicas configurable for kuberay-operator #4180 (#4195) * Make replicas configurable for kuberay-operator #4180 * Make replicas configurable for kuberay-operator #4180 * [Fix] rayjob update raycluster status (#4192) * feat: check if raycluster status update in rayjob * test: e2e test to check the rayjob raycluster status update * fix: dashboard http client tests discovered and passing (#4173) Signed-off-by: alimaazamat <[email protected]> * [RayJob] Lift cluster status while initializing (#4191) Signed-off-by: Spencer Peterson <[email protected]> * [RayJob] Remove updateJobStatus call (#4198) Fast follow to #4191 Signed-off-by: Spencer Peterson <[email protected]> * Add support for Ray token auth (#4179) * Add support for Ray token auth Signed-off-by: Andrew Sy Kim <[email protected]> * add e2e test for Ray cluster auth Signed-off-by: Andrew Sy Kim <[email protected]> * address nits from Ruiean Signed-off-by: Andrew Sy Kim <[email protected]> * update RAY_auth_mode -> RAY_AUTH_MODE Signed-off-by: Andrew Sy Kim <[email protected]> * configure auth for Ray autoscaler Signed-off-by: Andrew Sy Kim <[email protected]> --------- Signed-off-by: Andrew Sy Kim <[email protected]> * Bump js-yaml from 4.1.0 to 4.1.1 in /dashboard (#4194) Bumps [js-yaml](https://github.com/nodeca/js-yaml) from 4.1.0 to 4.1.1. - [Changelog](https://github.com/nodeca/js-yaml/blob/master/CHANGELOG.md) - [Commits](nodeca/js-yaml@4.1.0...4.1.1) --- updated-dependencies: - dependency-name: js-yaml dependency-version: 4.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * update minimum Ray version required for token authentication to 2.52.0 (#4201) * update minimum Ray version required for token authentication to 2.52.0 Signed-off-by: Andrew Sy Kim <[email protected]> * update RayCluster auth e2e test to use Ray v2.52 Signed-off-by: Andrew Sy Kim <[email protected]> --------- Signed-off-by: Andrew Sy Kim <[email protected]> * add samples for RayCluster token auth (#4200) Signed-off-by: Andrew Sy Kim <[email protected]> * update (#4208) Signed-off-by: Future-Outlier <[email protected]> * [RayJob] Add token authentication support for All mode (#4210) * dashboard client authentication support Signed-off-by: Future-Outlier <[email protected]> * support rayjob Signed-off-by: Future-Outlier <[email protected]> * update to fix api serverr err Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * updarte Signed-off-by: Future-Outlier <[email protected]> * Rayjob sidecar mode auth token mode support Signed-off-by: Future-Outlier <[email protected]> * RayJob support k8s job mode Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * Address Andrew's advice Signed-off-by: Future-Outlier <[email protected]> * add todo x-ray-authorization comments Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> * [RayCluster] Enable Secret informer watch/list and remove unused RBAC verbs (#4202) * Add authentication secret reconciliation support Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * fix flaky test Signed-off-by: Future-Outlier <[email protected]> * remove test fix Signed-off-by: Rueian <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> Signed-off-by: Rueian <[email protected]> Co-authored-by: Rueian <[email protected]> * [APIServer][Docs] Add user guide for retry behavior & configuration (#4144) * [Docs] Add the draft description about feature intro, configurations, and usecases Signed-off-by: justinyeh1995 <[email protected]> * [Fix] Update the retry walk-through Signed-off-by: justinyeh1995 <[email protected]> * [Doc] rewrite the first 2 sections Signed-off-by: justinyeh1995 <[email protected]> * [Doc] Revise documentation wording and add Observing Retry Behavior section Signed-off-by: justinyeh1995 <[email protected]> * [Fix] fix linting issue by running pre-commit run berfore commiting Signed-off-by: justinyeh1995 <[email protected]> * [Fix] fix linting errors in the Markdown linting Signed-off-by: justinyeh1995 <[email protected]> * [Fix] Clean up the math equation Signed-off-by: justinyeh1995 <[email protected]> * Update the math formula of Backoff calculation. Co-authored-by: Nary Yeh <[email protected]> Signed-off-by: JustinYeh <[email protected]> * [Fix] Explicitly mentioned exponential backoff and removed the customization parts Signed-off-by: justinyeh1995 <[email protected]> * [Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer” Co-authored-by: Cheng-Yeh Chung <[email protected]> Signed-off-by: JustinYeh <[email protected]> * [Docs] Rename retry-configuration.md to retry-behavior.md for accuracy Signed-off-by: justinyeh1995 <[email protected]> * Update Title to KubeRay APIServer Retry Behavior Co-authored-by: Cheng-Yeh Chung <[email protected]> Signed-off-by: JustinYeh <[email protected]> * [Docs] Add a note about the limitation of retry configuration Signed-off-by: justinyeh1995 <[email protected]> --------- Signed-off-by: justinyeh1995 <[email protected]> Signed-off-by: JustinYeh <[email protected]> Co-authored-by: Nary Yeh <[email protected]> Co-authored-by: Cheng-Yeh Chung <[email protected]> * Support X-Ray-Authorization fallback header for accepting auth token via proxy (#4213) * Support X-Ray-Authorization fallback header for accepting auth token in dashboard Signed-off-by: Future-Outlier <[email protected]> * remove todo comment Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> * [RayCluster] make auth token secret name consistency (#4216) Signed-off-by: fscnick <[email protected]> * [RayCluster] Status includes head containter status message (#4196) * [RayCluster] Status includes head containter status message Signed-off-by: Spencer Peterson <[email protected]> * lint Signed-off-by: Spencer Peterson <[email protected]> * [RayCluster] Containers not ready status reflects structured reason Signed-off-by: Spencer Peterson <[email protected]> * nit Signed-off-by: Spencer Peterson <[email protected]> --------- Signed-off-by: Spencer Peterson <[email protected]> * Remove erroneous call in applyServeTargetCapacity (#4212) Signed-off-by: Ryan O'Leary <[email protected]> * [RayJob] Add token authentication support for light weight job submitter (#4215) * [RayJob] light weight job submitter auth token support Signed-off-by: Future-Outlier <[email protected]> * X-Ray-Authorization Signed-off-by: Rueian <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> Signed-off-by: Rueian <[email protected]> Co-authored-by: Rueian <[email protected]> * feat: kubectl ray get token command (#4218) * feat: kubectl ray get token command Signed-off-by: Rueian <[email protected]> * Update kubectl-plugin/pkg/cmd/get/get_token_test.go Co-authored-by: Copilot <[email protected]> Signed-off-by: Rueian <[email protected]> * Update kubectl-plugin/pkg/cmd/get/get_token.go Co-authored-by: Copilot <[email protected]> Signed-off-by: Rueian <[email protected]> * make sure the raycluster exists before getting the secret Signed-off-by: Rueian <[email protected]> * better ux Signed-off-by: Rueian <[email protected]> * Update kubectl-plugin/pkg/cmd/get/get_token.go Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> Signed-off-by: Rueian <[email protected]> --------- Signed-off-by: Rueian <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> --------- Signed-off-by: 400Ping <[email protected]> Signed-off-by: Ping <[email protected]> Signed-off-by: Future-Outlier <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Andrew Sy Kim <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Neo Chien <[email protected]> Signed-off-by: justinyeh1995 <[email protected]> Signed-off-by: win5923 <[email protected]> Signed-off-by: alimaazamat <[email protected]> Signed-off-by: Spencer Peterson <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Rueian <[email protected]> Signed-off-by: JustinYeh <[email protected]> Signed-off-by: fscnick <[email protected]> Co-authored-by: Ping <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> Co-authored-by: Ryan O'Leary <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Kavish <[email protected]> Co-authored-by: Neo Chien <[email protected]> Co-authored-by: cchung100m <[email protected]> Co-authored-by: JustinYeh <[email protected]> Co-authored-by: Jun-Hao Wan <[email protected]> Co-authored-by: Divyam Raj <[email protected]> Co-authored-by: Nary Yeh <[email protected]> Co-authored-by: Alima Azamat <[email protected]> Co-authored-by: Spencer Peterson <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Rueian <[email protected]> Co-authored-by: Cheng-Yeh Chung <[email protected]> Co-authored-by: fscnick <[email protected]> Co-authored-by: Copilot <[email protected]>
…ap it if the count is over math.MaxInt32
Why are these changes needed?
The
CalculateMaxReplicasfunction did not guard against integer overflow, which could result in an incorrect value for Status.MaxWorkerReplicas, as reported in #4153.After the fix, the maximum replicas now caps at 2,147,483,647.
Related issue number
Closes #4153
Checks