[Feature] Add DeleteWorkersOnFailure deletion policy for RayJob #2765

Open
1 of 2 tasks
kevin85421 opened this issue Jan 17, 2025 · 8 comments
Labels
1.4.0 enhancement New feature or request

Comments

@kevin85421
Member

kevin85421 commented Jan 17, 2025

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

DeleteWorkersOnFailure: Deletes workers only when the Ray job fails and deletes the entire RayCluster when the Ray job succeeds. This seems to be a more common pattern for users.

Should we add this policy or rename DeleteWorkers to DeleteWorkersOnFailure? Does it need to be in v1.3.0?
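
To make the proposed semantics concrete, here is a minimal sketch of the decision the description implies. It is not KubeRay code: the `Action` enum and the `delete_workers_on_failure` helper are hypothetical names used only for illustration.

```python
from enum import Enum


class Action(Enum):
    # Values reuse the policy names mentioned in this issue.
    DELETE_CLUSTER = "DeleteCluster"
    DELETE_WORKERS = "DeleteWorkers"


def delete_workers_on_failure(job_succeeded: bool) -> Action:
    """Hypothetical helper: what the proposed policy would delete."""
    if job_succeeded:
        # Success: the entire RayCluster is removed.
        return Action.DELETE_CLUSTER
    # Failure: only the workers are removed.
    return Action.DELETE_WORKERS
```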

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
kevin85421 added the enhancement and triage labels on Jan 17, 2025
@kevin85421
Member Author

cc @andrewsykim any thoughts?

@andrewsykim
Collaborator

Having both policies probably makes sense. I'm in favor of a new policy like DeleteWorkersOnFailure. Would it be too verbose to name it something like DeleteClusterOnSuccessOrWorkersOnFailure?

@andrewsykim
Collaborator

DeleteWorkersOnFailure is probably fine as long as the deletion policy on success is well documented in the API or documentation.

@kevin85421
Member Author

On second thought, I realized that users may want more combinations. For example:

  1. DeleteCluster on success, DeleteNone on failure
  2. DeleteSelf on success, DeleteNone on failure
  3. DeleteCluster on success, DeleteWorkers on failure
  4. DeleteSelf on success, DeleteWorkers on failure

There are two solutions:

  1. Keep the current API, but add a new policy such as DeleteClusterOnSuccessOrWorkersOnFailure if needed.
  2. Split the deletion API into OnFailureDeletionPolicy and OnSuccessDeletionPolicy.
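
A rough way to compare the two solutions is to enumerate what each would have to expose. The sketch below uses only the policy names from the combinations listed above; the generated combined names are a hypothetical naming scheme modeled on DeleteClusterOnSuccessOrWorkersOnFailure, not an agreed API.

```python
from itertools import product

# Policy names taken from the combinations listed above.
ON_SUCCESS = ["DeleteCluster", "DeleteSelf"]
ON_FAILURE = ["DeleteNone", "DeleteWorkers"]


def combined_names() -> list[str]:
    """Solution 1: one enum value per (success, failure) pair."""
    return [
        f"{s}OnSuccessOr{f.removeprefix('Delete')}OnFailure"
        for s, f in product(ON_SUCCESS, ON_FAILURE)
    ]


def split_values() -> dict[str, list[str]]:
    """Solution 2: two independent fields, each with its own short value list."""
    return {
        "OnSuccessDeletionPolicy": ON_SUCCESS,
        "OnFailureDeletionPolicy": ON_FAILURE,
    }
```

With two options per outcome both approaches come to four values, but the combined enum grows with the product of the options while the split fields grow with their sum.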

kevin85421 added the 1.3.0 label and removed the triage label on Jan 17, 2025
@kevin85421
Member Author

Marking this issue as v1.3.0 because we need to decide on the API before the release.

@andrewsykim
Collaborator

> On second thought, I realized that users may want more combinations. For example:
>
>   1. DeleteCluster on success, DeleteNone on failure
>   2. DeleteSelf on success, DeleteNone on failure
>   3. DeleteCluster on success, DeleteWorkers on failure
>   4. DeleteSelf on success, DeleteWorkers on failure
>
> There are two solutions:
>
>   1. Keep the current API, but add a new policy such as DeleteClusterOnSuccessOrWorkersOnFailure if needed.
>   2. Split the deletion API into OnFailureDeletionPolicy and OnSuccessDeletionPolicy.

These are really good considerations. Since we put the feature behind an alpha feature gate, I feel fine about breaking the API in v1.4 if needed.

@andrewsykim
Collaborator

We can consider an API like this as well:

spec:
  deletionPolicy:
    onSuccess: DeleteCluster
    onFailure: DeleteWorkers
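
As a sketch of how a controller might consume this shape, the snippet below reads the nested mapping from an already-parsed spec and picks the policy for a finished job. The `DeletionPolicy` class, `from_spec`, `resolve`, and the `KNOWN_POLICIES` set are assumptions for illustration; only the field names and policy values come from this thread.

```python
from dataclasses import dataclass

# Policy names that appear in this thread; the real set may differ.
KNOWN_POLICIES = {"DeleteCluster", "DeleteWorkers", "DeleteSelf", "DeleteNone"}


@dataclass(frozen=True)
class DeletionPolicy:
    on_success: str
    on_failure: str

    @classmethod
    def from_spec(cls, spec: dict) -> "DeletionPolicy":
        """Read the nested deletionPolicy mapping from a parsed spec."""
        raw = spec["deletionPolicy"]
        policy = cls(on_success=raw["onSuccess"], on_failure=raw["onFailure"])
        for value in (policy.on_success, policy.on_failure):
            if value not in KNOWN_POLICIES:
                raise ValueError(f"unknown deletion policy: {value}")
        return policy

    def resolve(self, job_succeeded: bool) -> str:
        """Pick the policy that applies to the job's terminal state."""
        return self.on_success if job_succeeded else self.on_failure


# Parsed equivalent of the YAML above.
spec = {"deletionPolicy": {"onSuccess": "DeleteCluster", "onFailure": "DeleteWorkers"}}
policy = DeletionPolicy.from_spec(spec)
```

Calling `policy.resolve(job_succeeded)` then returns the configured value for either outcome, so each of the four combinations discussed earlier is just a different pair of field values.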

@kevin85421
Member Author

kevin85421 commented Jan 21, 2025

OK, let's update the API in v1.4.0.
