[core] Make CancelTask RPC Fault Tolerant #58018

Sparks0219 · 2025-10-22T21:38:51Z

Description


Makes the CancelTask RPC fault tolerant. This adds an intermediary RPC, similar to what was done in #57648: when the force_exit flag is enabled for a cancel, the executor worker is shut down gracefully. However, the owner core worker has no way of determining whether that shutdown succeeded, so we send the cancel request to the raylet via a new RPC, CancelLocalTask, which guarantees the worker is killed. Added a Python test to verify the retry behavior. The C++ test is left out after talking to @dayshah, because it is fairly complicated: it would need to account for all orderings of the owner/executor states in the cancellation process.

Signed-off-by: joshlee <[email protected]>
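
For readers following the retry argument in the description, here is a minimal sketch of why an idempotent cancel handler makes retries safe. It is not Ray's implementation: `FakeRaylet`, `cancel_local_task`, and `cancel_with_retry` are hypothetical names, and a lost reply is modeled as a `ConnectionError`.

```python
class FakeRaylet:
    """Hypothetical stand-in for the raylet side of a cancel RPC (not Ray's API)."""

    def __init__(self, drop_first_n_replies=0):
        self.live_workers = {}  # task_id -> worker name
        self.kill_count = 0
        self._replies_to_drop = drop_first_n_replies

    def cancel_local_task(self, task_id):
        # Idempotent: if the worker is already gone, this is a successful no-op.
        worker = self.live_workers.pop(task_id, None)
        if worker is not None:
            self.kill_count += 1
        # Simulate the reply being lost after the request was processed.
        if self._replies_to_drop > 0:
            self._replies_to_drop -= 1
            raise ConnectionError("reply lost")
        return True


def cancel_with_retry(raylet, task_id, max_attempts=5):
    """Resend the cancel until a reply arrives; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            raylet.cancel_local_task(task_id)
            return attempt
        except ConnectionError:
            continue
    raise RuntimeError("cancel was never acknowledged")


raylet = FakeRaylet(drop_first_n_replies=2)
raylet.live_workers["task-1"] = "worker-a"
attempts = cancel_with_retry(raylet, "task-1")
```

In this sketch the owner needs three attempts to get an acknowledgement, but the worker is only killed once, which is the property a retried cancel relies on.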

src/ray/core_worker/task_submission/actor_task_submitter.h

gemini-code-assist

Code Review

This pull request makes the CancelTask RPC fault-tolerant by introducing an intermediary CancelLocalTask RPC to the raylet. This ensures that when a task is cancelled with force=True, the worker process is guaranteed to be killed, even if the graceful shutdown fails. The changes touch both normal task and actor task submission paths, and include a new Python test to verify the fault tolerance and idempotency of the cancellation logic.

My review identifies a critical bug in the new HandleCancelLocalTask implementation where a reply callback could be invoked twice, potentially crashing the raylet. I've also pointed out a minor issue with a misleading log message. Overall, the approach is sound, but the race condition needs to be fixed.
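
One common way to rule out a double reply is to wrap the callback so that only the first invocation goes through. The sketch below illustrates that idea only; `OnceCallback` is a hypothetical helper, not the fix applied in this PR, and it assumes both reply paths run on the same event-loop thread, so it takes no lock.

```python
class OnceCallback:
    """Hypothetical wrapper that lets a reply callback fire at most once."""

    def __init__(self, fn):
        self._fn = fn
        self._called = False

    def __call__(self, *args):
        # Ignore every invocation after the first one.
        if self._called:
            return False
        self._called = True
        self._fn(*args)
        return True


replies = []
send_reply = OnceCallback(replies.append)

# Two code paths race to reply, for example a worker-exit path and a timer path.
first = send_reply("worker exited")
second = send_reply("timer fired")
```

Only the first path's reply is recorded; the second call is dropped instead of replying again.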

src/ray/raylet/node_manager.cc


src/ray/raylet/node_manager.cc

dayshah · 2025-10-23T05:20:24Z

src/ray/raylet/node_manager.cc

+        });
+    return;
+  }
+  auto timer = execute_after(


Do we need to queue this up for every cancel? And did the old cancel provide any guarantee beyond having "shutdown" called?

Sparks0219 · 2025-10-23

The old cancel didn't provide any guarantees beyond triggering the graceful shutdown path. The timer should only be queued for cancels where force_kill is set to true, and I think it's good to follow the precedent set in KillActor.
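
The behavior discussed in this thread can be sketched roughly as follows: the cancel is always forwarded to the worker, and an escalation timer is queued only when force_kill is set. This is an illustration under assumptions, not the code in node_manager.cc: `FakeWorker`, `deliver_cancel`, and the list used as a timer queue are hypothetical, and timers are fired manually instead of by a real clock.

```python
class FakeWorker:
    """Hypothetical worker that may or may not exit when asked to cancel."""

    def __init__(self, exits_on_cancel):
        self.exits_on_cancel = exits_on_cancel
        self.alive = True
        self.force_killed = False

    def deliver_cancel(self):
        if self.exits_on_cancel:
            self.alive = False

    def force_kill(self):
        # Only escalate if the worker is still running when the timer fires.
        if self.alive:
            self.alive = False
            self.force_killed = True


def cancel_local_task(worker, force_kill, pending_timers):
    worker.deliver_cancel()
    # Queue the escalation timer only for forced cancels.
    if force_kill:
        pending_timers.append(worker.force_kill)


def fire_timers(pending_timers):
    while pending_timers:
        pending_timers.pop(0)()


stuck, timers = FakeWorker(exits_on_cancel=False), []
cancel_local_task(stuck, force_kill=True, pending_timers=timers)
alive_before_timeout = stuck.alive
fire_timers(timers)

unforced, unforced_timers = FakeWorker(exits_on_cancel=False), []
cancel_local_task(unforced, force_kill=False, pending_timers=unforced_timers)

polite, polite_timers = FakeWorker(exits_on_cancel=True), []
cancel_local_task(polite, force_kill=True, pending_timers=polite_timers)
fire_timers(polite_timers)
```

A worker that ignores the cancel is killed once the timer fires, a non-forced cancel queues nothing, and a worker that already exited is left alone by the timer.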

src/ray/raylet/node_manager.cc

src/ray/core_worker/task_submission/normal_task_submitter.h


Make CancelTask RPC Fault Tolerant

f8150c0

Signed-off-by: joshlee <[email protected]>

Sparks0219 assigned edoakes and dayshah Oct 22, 2025

Sparks0219 requested a review from a team as a code owner October 22, 2025 21:38

Sparks0219 added the go (add ONLY when ready to merge, run all tests) label Oct 22, 2025

Sparks0219 commented Oct 22, 2025


src/ray/core_worker/task_submission/actor_task_submitter.h (outdated, resolved)

gemini-code-assist bot reviewed Oct 22, 2025


src/ray/raylet/node_manager.cc (resolved)

src/ray/raylet/node_manager.cc (outdated, resolved)


Addressing comments

0a630a7

Signed-off-by: joshlee <[email protected]>


clean up and cpp test failures

8ae4e3a

Signed-off-by: joshlee <[email protected]>

ray-gardener bot added the core (Issues that should be addressed in Ray Core) label Oct 23, 2025

dayshah reviewed Oct 23, 2025


Addressing comments

a733422

Signed-off-by: joshlee <[email protected]>


Fix broken cpp tests

8a2e428

Signed-off-by: joshlee <[email protected]>

Sparks0219 requested a review from dayshah October 24, 2025 20:50


3 participants
