
PR #20611: [XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG. #24115

Open · wants to merge 1 commit into main
Conversation


@copybara-service copybara-service bot commented Mar 24, 2025

PR #20611: [XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG.

Imported from GitHub PR #20611

This PR is a large redesign of the XLA command buffer, based on the following ideas:

  1. Remove the execution scope concept from the command buffer; there should be no execution stream concept inside a command buffer at all. Instead, the command buffer constructs the graph from the data flow in the command buffer command sequence (converted from the XLA thunk sequence), so overlapping (formerly expressed via execution scopes) is deduced automatically from the graph topology. This matches the CUDA graph concept more closely: a CUDA graph has no stream property, and operators are automatically launched onto different streams whenever the graph topology has no dependency edges between them.
  2. A CommandBufferCmd can now depend on other commands (specified by command index in the CommandBufferCmdSequence), and dependencies (R/W and W/W conflicts) are inferred automatically from the buffer assignment results when a new command is appended to the CommandBufferCmdSequence (CommandBufferCmdSequence::Append()); see the sketch after this list.
  3. When the CommandBuffer is implemented, the dependencies specified through CommandBufferCmd indices are translated into node dependencies across CUDA graph nodes.
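
As a rough illustration of point 2, the sketch below uses hypothetical, simplified types (`BufferUse`, `CmdSequence`; not the actual XLA classes or signatures) to show how read/write conflicts on buffer slices can be turned into per-command dependency indices at append time:

```cpp
// Hypothetical sketch of data-flow dependency inference, not the real
// CommandBufferCmdSequence API: each appended command declares which buffer
// slices it reads/writes, and its dependencies are deduced from earlier
// readers/writers of the same slices.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct BufferUse {
  int64_t slice_id;  // stand-in for a buffer allocation slice
  bool is_write;
};

class CmdSequence {
 public:
  // Appends a command and returns its index; records which earlier command
  // indices it must depend on (R/W and W/W conflicts).
  size_t Append(const std::vector<BufferUse>& uses) {
    std::vector<size_t> deps;
    for (const BufferUse& use : uses) {
      State& s = slices_[use.slice_id];
      // Read-after-write and write-after-write: depend on the last writer.
      if (s.last_writer) deps.push_back(*s.last_writer);
      // Write-after-read: a writer must also wait for outstanding readers.
      if (use.is_write) deps.insert(deps.end(), s.readers.begin(), s.readers.end());
    }
    size_t index = deps_.size();
    deps_.push_back(std::move(deps));  // duplicates are harmless in this sketch
    for (const BufferUse& use : uses) {
      State& s = slices_[use.slice_id];
      if (use.is_write) {
        s.last_writer = index;
        s.readers.clear();
      } else {
        s.readers.push_back(index);
      }
    }
    return index;
  }

  // Dependencies of command `index`, as indices of earlier commands.
  const std::vector<size_t>& deps(size_t index) const { return deps_[index]; }

 private:
  struct State {
    std::optional<size_t> last_writer;
    std::vector<size_t> readers;
  };
  std::map<int64_t, State> slices_;        // per-slice read/write history
  std::vector<std::vector<size_t>> deps_;  // per-command dependency lists
};
```

Two commands that touch disjoint slices end up with no dependency edge between them, so the backend is free to run them concurrently.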

The main benefits of the design are:

  1. Constructing the graph topology from data flow, instead of from thunk execution order, automatically enables the maximum concurrency allowed by the data flow across operators, which should give better performance (see the CUDA graph sketch after this list).
  2. The design is more natural and intuitive. A command buffer is meant to be an abstraction of a CUDA graph, so it should simply be a graph of how results are computed by operators. An execution scope or stream is a runtime concept that should not appear in the command buffer's graph. The current design introduces the execution scope concept only to mirror the execution stream concept of the XLA runtime, which is unnecessary and counterintuitive.
  3. The command buffer is easier to implement and use. Simulating multi-stream ordering in the command buffer through execution scopes introduces a lot of hard-to-understand code, whereas with the data-flow design the implementation is more compact and easier to understand.
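
For reference, the concurrency argument in benefit 1 is exactly how plain CUDA graphs behave. The standalone CUDA runtime sketch below (independent of XLA, error checking omitted) builds a graph in which two producer kernels have no edge between them and may therefore overlap, while the consumer node lists both as dependencies:

```cpp
// Standalone CUDA graph sketch (not XLA code): dependencies are expressed as
// explicit graph edges, and nodes without edges between them may be scheduled
// concurrently by the graph executor.
#include <cuda_runtime.h>
#include <vector>

__global__ void Produce(float* out) { out[threadIdx.x] = 1.0f; }
__global__ void Consume(const float* a, const float* b, float* c) {
  c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

int main() {
  float *a, *b, *c;
  cudaMalloc(&a, 256 * sizeof(float));
  cudaMalloc(&b, 256 * sizeof(float));
  cudaMalloc(&c, 256 * sizeof(float));

  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  cudaKernelNodeParams params = {};
  params.gridDim = dim3(1);
  params.blockDim = dim3(256);

  // Two producers with no edges between them: free to run concurrently.
  cudaGraphNode_t prod_a, prod_b, cons;
  void* args_a[] = {&a};
  params.func = reinterpret_cast<void*>(Produce);
  params.kernelParams = args_a;
  cudaGraphAddKernelNode(&prod_a, graph, /*pDependencies=*/nullptr, 0, &params);

  void* args_b[] = {&b};
  params.kernelParams = args_b;
  cudaGraphAddKernelNode(&prod_b, graph, nullptr, 0, &params);

  // The consumer reads both outputs, so it depends on both producer nodes.
  std::vector<cudaGraphNode_t> deps = {prod_a, prod_b};
  void* args_c[] = {&a, &b, &c};
  params.func = reinterpret_cast<void*>(Consume);
  params.kernelParams = args_c;
  cudaGraphAddKernelNode(&cons, graph, deps.data(), deps.size(), &params);

  cudaGraphExec_t exec;
  cudaGraphInstantiateWithFlags(&exec, graph, 0);
  cudaGraphLaunch(exec, /*stream=*/0);
  cudaStreamSynchronize(0);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```

In the redesigned command buffer, the per-command dependency indices recorded in the command sequence play the role of the `deps` array here when the corresponding graph nodes are created (point 3 above).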

Copybara import of the project:

--
73490c0 by Shawn Wang [email protected]:

Rewrite XLA command buffer

--
c2644d5 by Shawn Wang [email protected]:

fix

--
7cf96d7 by Shawn Wang [email protected]:

add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7

copybara-service bot force-pushed the test_739216739 branch 3 times, most recently from f01ede6 to e21839e on March 25, 2025 12:05