@Sparks0219
Contributor

@Sparks0219 Sparks0219 commented Oct 22, 2025

Description

Makes the CancelTask RPC fault tolerant. This adds an intermediary RPC, similar to what was done in #57648: when the force_exit flag is enabled for cancel, the executor worker is shut down gracefully. However, the owner core worker has no way of determining whether that shutdown succeeded, so we send the cancel request to the raylet via a new RPC, CancelLocalTask, which guarantees the worker is killed. Added a Python test to verify retry behavior. The C++ test is left out after talking to @dayshah, since it is a bit complicated: it would need to take into account all orderings of the owner/executor states in the cancellation process.
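A minimal sketch of the retry behavior described above, not the actual Ray implementation. It assumes the raylet-side handler first requests a graceful exit and then kills the worker if it is still alive, and that it treats an already dead or unknown worker as success so retries are harmless. `FakeWorker`, `FakeRaylet`, `cancel_local_task`, and `cancel_with_retries` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeWorker:
    """Stand-in for an executor worker process (hypothetical, not a Ray class)."""
    alive: bool = True
    ignores_graceful_exit: bool = False

    def request_graceful_exit(self) -> None:
        # Graceful path: the worker may or may not actually exit.
        if not self.ignores_graceful_exit:
            self.alive = False

    def kill(self) -> None:
        # Forced path: always terminates the worker.
        self.alive = False


@dataclass
class FakeRaylet:
    """Stand-in for the raylet that manages local workers (hypothetical)."""
    workers: dict = field(default_factory=dict)
    kills: int = 0

    def cancel_local_task(self, worker_id: str) -> bool:
        # Idempotent: an unknown or already dead worker counts as success,
        # so a request retried after a lost reply does no extra work.
        worker = self.workers.get(worker_id)
        if worker is None or not worker.alive:
            return True
        worker.request_graceful_exit()
        if worker.alive:
            worker.kill()
            self.kills += 1
        return True


def cancel_with_retries(raylet, worker_id, drop_first_n_replies=0, max_attempts=5):
    """Owner side: resend the cancel until a reply is observed.

    The first `drop_first_n_replies` replies are treated as lost in transit,
    which simulates the RPC failures the retry logic has to tolerate.
    """
    for attempt in range(1, max_attempts + 1):
        ok = raylet.cancel_local_task(worker_id)
        if attempt > drop_first_n_replies and ok:
            return attempt
    raise RuntimeError("cancel did not complete")
```

For example, with a worker that ignores the graceful exit and two dropped replies, the owner needs three attempts, but the worker is killed only once:

```python
raylet = FakeRaylet(workers={"w1": FakeWorker(ignores_graceful_exit=True)})
attempts = cancel_with_retries(raylet, "w1", drop_first_n_replies=2)
```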

@Sparks0219 Sparks0219 requested a review from a team as a code owner October 22, 2025 21:38
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request makes the CancelTask RPC fault-tolerant by introducing an intermediary CancelLocalTask RPC to the raylet. This ensures that when a task is cancelled with force=True, the worker process is guaranteed to be killed, even if the graceful shutdown fails. The changes touch both normal task and actor task submission paths, and include a new Python test to verify the fault tolerance and idempotency of the cancellation logic.

My review identifies a critical bug in the new HandleCancelLocalTask implementation where a reply callback could be invoked twice, potentially crashing the raylet. I've also pointed out a minor issue with a misleading log message. Overall, the approach is sound, but the race condition needs to be fixed.
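The review does not say how the double invocation should be fixed. One common pattern, sketched here under that caveat, is to wrap the reply callback so that only the first caller (for example, the worker's reply or a timeout path) actually sends the reply. `ReplyOnce` is a hypothetical helper, not something from the PR or from Ray.

```python
import threading


class ReplyOnce:
    """Wraps a reply callback so it runs at most once (hypothetical helper)."""

    def __init__(self, send_reply):
        self._send_reply = send_reply
        self._lock = threading.Lock()
        self._done = False

    def __call__(self, *args, **kwargs) -> bool:
        # Claim the right to reply under the lock; later callers become no-ops.
        with self._lock:
            if self._done:
                return False
            self._done = True
        self._send_reply(*args, **kwargs)
        return True
```

Both racing paths can then call the same wrapper, and the return value tells each path whether it was the one that replied:

```python
replies = []
reply = ReplyOnce(replies.append)
reply("worker exited")   # first caller sends the reply
reply("timer fired")     # second caller is ignored
```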

cursor[bot]

This comment was marked as outdated.

Signed-off-by: joshlee <[email protected]>
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 23, 2025
});
return;
}
auto timer = execute_after(
Contributor

Do we need to queue this up for every cancel? And did the old cancel provide any guarantee beyond just having "shutdown" called?

Contributor Author

The old cancel didn't provide any guarantees beyond triggering the graceful shutdown path. The timer should only be queued up for cancels where force_kill is set to true, and I think it's good to follow the precedent set in KillActor?
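To make the intent concrete, here is a rough sketch of that gating, assuming a scheduler with an `execute_after(delay, fn)` method loosely modeled on the call in the snippet under review. `FakeScheduler`, `handle_cancel`, the worker dict, and the timeout value are all hypothetical. The kill timer is only queued when `force_kill` is true, and it does nothing if the worker has already exited.

```python
from typing import Callable, List, Tuple


class FakeScheduler:
    """Collects delayed callbacks so they can be run deterministically."""

    def __init__(self) -> None:
        self.pending: List[Tuple[float, Callable[[], None]]] = []

    def execute_after(self, delay_s: float, fn: Callable[[], None]) -> None:
        self.pending.append((delay_s, fn))

    def run_all(self) -> None:
        # Pretend every delay has elapsed and run the queued callbacks.
        pending, self.pending = self.pending, []
        for _, fn in pending:
            fn()


def handle_cancel(scheduler, worker: dict, force_kill: bool, kill_timeout_s: float = 1.0) -> None:
    # Every cancel triggers the graceful shutdown path.
    worker["exit_requested"] = True
    if not force_kill:
        # Non-forced cancels get no kill timer and no extra guarantee.
        return

    def kill_if_still_alive() -> None:
        # Only escalate if the graceful path did not finish in time.
        if worker["alive"]:
            worker["alive"] = False
            worker["killed_by_timer"] = True

    scheduler.execute_after(kill_timeout_s, kill_if_still_alive)
```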

Signed-off-by: joshlee <[email protected]>
Signed-off-by: joshlee <[email protected]>
@Sparks0219 Sparks0219 requested a review from dayshah October 24, 2025 20:50
Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
