
PR #20611: [XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG. #24115

Open · wants to merge 1 commit into main
Conversation


@copybara-service copybara-service bot commented Mar 24, 2025

PR #20611: [XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG.

Imported from GitHub PR #20611

This PR is a large redesign of the XLA command buffer, based on the following ideas:

  1. Remove the execution scope concept from the command buffer; there should be no execution stream concept inside a command buffer at all. Instead, the command buffer constructs the graph from the data flow in the command buffer command sequence (converted from the XLA thunk sequence), so overlapping (formerly expressed via execution scopes) is deduced automatically from the graph topology. This matches the CUDA graph concept more closely: a CUDA graph has no stream property, and operators are automatically launched onto different streams whenever the graph topology has no dependency edges between them.
  2. A CommandBufferCmd can now depend on other commands (specified by command index in the CommandBufferCmdSequence), and dependencies (R/W and W/W conflicts) are inferred automatically from the buffer assignment results when a new command is appended to the CommandBufferCmdSequence (CommandBufferCmdSequence::Append()); see the sketch after this list.
  3. When the CommandBuffer is implemented, the dependencies specified through CommandBufferCmd indices are translated into node dependencies across CUDA graph nodes.
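
As a rough illustration of point 2, the sketch below uses hypothetical, simplified types (`BufferUse`, `CmdSequence`; not the actual XLA classes or signatures) to show how read/write conflicts on buffer slices can be turned into per-command dependency indices at append time:

```cpp
// Hypothetical sketch of data-flow dependency inference, not the real
// CommandBufferCmdSequence API: each appended command declares which buffer
// slices it reads/writes, and its dependencies are deduced from earlier
// readers/writers of the same slices.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct BufferUse {
  int64_t slice_id;  // stand-in for a buffer allocation slice
  bool is_write;
};

class CmdSequence {
 public:
  // Appends a command and returns its index; records which earlier command
  // indices it must depend on (R/W and W/W conflicts).
  size_t Append(const std::vector<BufferUse>& uses) {
    std::vector<size_t> deps;
    for (const BufferUse& use : uses) {
      State& s = slices_[use.slice_id];
      // Read-after-write and write-after-write: depend on the last writer.
      if (s.last_writer) deps.push_back(*s.last_writer);
      // Write-after-read: a writer must also wait for outstanding readers.
      if (use.is_write) deps.insert(deps.end(), s.readers.begin(), s.readers.end());
    }
    size_t index = deps_.size();
    deps_.push_back(std::move(deps));  // duplicates are harmless in this sketch
    for (const BufferUse& use : uses) {
      State& s = slices_[use.slice_id];
      if (use.is_write) {
        s.last_writer = index;
        s.readers.clear();
      } else {
        s.readers.push_back(index);
      }
    }
    return index;
  }

  // Dependencies of command `index`, as indices of earlier commands.
  const std::vector<size_t>& deps(size_t index) const { return deps_[index]; }

 private:
  struct State {
    std::optional<size_t> last_writer;
    std::vector<size_t> readers;
  };
  std::map<int64_t, State> slices_;        // per-slice read/write history
  std::vector<std::vector<size_t>> deps_;  // per-command dependency lists
};
```

Two commands that touch disjoint slices end up with no dependency edge between them, so the backend is free to run them concurrently.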

The main benefits of the design are:

  1. Constructing the graph topology from data flow, instead of from thunk execution order, automatically enables the maximum concurrency allowed by the data flow across operators, which should give better performance (see the CUDA graph sketch after this list).
  2. The design is more natural and intuitive. A command buffer is meant to be an abstraction of a CUDA graph, so it should simply be a graph of how results are computed by operators. An execution scope or stream is a runtime concept that should not appear in the command buffer's graph. The current design introduces the execution scope concept only to mirror the execution stream concept of the XLA runtime, which is unnecessary and counterintuitive.
  3. The command buffer is easier to implement and use. Simulating multi-stream ordering in the command buffer through execution scopes introduces a lot of hard-to-understand code, whereas with the data-flow design the implementation is more compact and easier to understand.
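
For reference, the concurrency argument in benefit 1 is exactly how plain CUDA graphs behave. The standalone CUDA runtime sketch below (independent of XLA, error checking omitted) builds a graph in which two producer kernels have no edge between them and may therefore overlap, while the consumer node lists both as dependencies:

```cpp
// Standalone CUDA graph sketch (not XLA code): dependencies are expressed as
// explicit graph edges, and nodes without edges between them may be scheduled
// concurrently by the graph executor.
#include <cuda_runtime.h>
#include <vector>

__global__ void Produce(float* out) { out[threadIdx.x] = 1.0f; }
__global__ void Consume(const float* a, const float* b, float* c) {
  c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

int main() {
  float *a, *b, *c;
  cudaMalloc(&a, 256 * sizeof(float));
  cudaMalloc(&b, 256 * sizeof(float));
  cudaMalloc(&c, 256 * sizeof(float));

  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  cudaKernelNodeParams params = {};
  params.gridDim = dim3(1);
  params.blockDim = dim3(256);

  // Two producers with no edges between them: free to run concurrently.
  cudaGraphNode_t prod_a, prod_b, cons;
  void* args_a[] = {&a};
  params.func = reinterpret_cast<void*>(Produce);
  params.kernelParams = args_a;
  cudaGraphAddKernelNode(&prod_a, graph, /*pDependencies=*/nullptr, 0, &params);

  void* args_b[] = {&b};
  params.kernelParams = args_b;
  cudaGraphAddKernelNode(&prod_b, graph, nullptr, 0, &params);

  // The consumer reads both outputs, so it depends on both producer nodes.
  std::vector<cudaGraphNode_t> deps = {prod_a, prod_b};
  void* args_c[] = {&a, &b, &c};
  params.func = reinterpret_cast<void*>(Consume);
  params.kernelParams = args_c;
  cudaGraphAddKernelNode(&cons, graph, deps.data(), deps.size(), &params);

  cudaGraphExec_t exec;
  cudaGraphInstantiateWithFlags(&exec, graph, 0);
  cudaGraphLaunch(exec, /*stream=*/0);
  cudaStreamSynchronize(0);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```

In the redesigned command buffer, the per-command dependency indices recorded in the command sequence play the role of the `deps` array here when the corresponding graph nodes are created (point 3 above).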

Copybara import of the project:

--
73490c0 by Shawn Wang [email protected]:

Rewrite XLA command buffer

--
c2644d5 by Shawn Wang [email protected]:

fix

--
7cf96d7 by Shawn Wang [email protected]:

add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7

copybara-service bot force-pushed the test_739216739 branch 3 times, most recently from f01ede6 to e21839e on March 25, 2025 12:05