
[XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG. #20611

Draft · shawnwang18 wants to merge 1 commit into main from shawnwang18:shawnw/cuda_graph_dependency
Conversation

@shawnwang18 (Contributor) commented Dec 17, 2024

This PR is a large redesign of the XLA command buffer, built around the following ideas:

  1. Remove the execution scope concept from the command buffer; there should be no notion of execution streams inside a command buffer at all. Instead, the command buffer constructs its graph from the data flow of the command sequence (converted from the XLA thunk sequence), so overlap (formerly expressed via execution scopes) is deduced automatically from the graph topology. This matches the CUDA graph model more closely: a CUDA graph has no stream property, and operators are launched on different streams automatically whenever the graph topology has no dependency edges between them.
  2. A CommandBufferCmd can now depend on other commands (specified by command index in the CommandBufferCmdSequence), and dependencies (R/W and W/W conflicts) are inferred automatically from buffer assignment results when a new command is appended to the sequence (CommandBufferCmdSequence::Append()).
  3. When the CommandBuffer is instantiated, dependencies specified through CommandBufferCmd indices are translated into node dependencies between CUDA graph nodes.

The main benefits of the design are:

  1. Constructing the graph topology from data flow, instead of thunk execution order, automatically enables the maximum concurrency that the data flow between operators allows, which should yield better performance.
  2. The design is more natural and intuitive: a command buffer is meant to be an abstraction over a CUDA graph, so it should simply be a graph of how results are computed by operators. Execution scopes and streams are runtime concepts that do not belong in the command buffer's graph; the current design introduces execution scopes only to mirror the execution stream concept of the XLA runtime, which is unnecessary and counterintuitive.
  3. The command buffer becomes easier to implement and use: simulating multi-stream ordering through execution scopes introduces a lot of hard-to-understand code, whereas with the data flow design the implementation is more compact and easier to follow.

@shawnwang18 shawnwang18 added the kokoro:force-run Forces CI to rerun label Dec 17, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Forces CI to rerun label Dec 17, 2024
@loislo (Member) commented Dec 21, 2024

There are formatting errors and 3 tests that fail.

@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 933ecbd to bc3a0b4 Compare December 30, 2024 02:58
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 74626e1 to 600203e Compare December 30, 2024 09:13
if (create) {
  TF_ASSIGN_OR_RETURN(initialize_counter_node_,
                      command_buffer->CreateMemsetNode(
                          ToDependentNodes(), loop_counter, uint32_t(0),
Member: Internally, linters complain about the style of casting. Can you use static_cast<uint32_t> instead?

Contributor Author: Fixed.

  cond_node_ = cond_node_result.node_handle;
} else {
  TF_RETURN_IF_ERROR(command_buffer->UpdateMemsetNode(
      initialize_counter_node_, loop_counter, uint32_t(0),
Member: Ditto. Can you use static_cast<uint32_t> instead?

Contributor Author: Fixed.

Contributor Author: @golechwierowicz Could you check again why the Google-internal tests failed?

Member:

module compiler/xla/stream_executor:command_buffer does not directly depend on a module exporting 'third_party/absl/container/flat_hash_set.h'

It seems the dependency on absl::flat_hash_set is just missing from the BUILD files.
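The fix presumably adds the missing edge to the library's BUILD target. A hedged sketch of what that edge looks like (the exact target layout and existing attributes are assumed, not copied from the repo):

```
# xla/stream_executor/BUILD (sketch; surrounding attributes assumed)
cc_library(
    name = "command_buffer",
    # ... existing srcs/hdrs/deps ...
    deps = [
        # ... existing deps ...
        "@com_google_absl//absl/container:flat_hash_set",  # the missing dep
    ],
)
```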

Contributor Author: Thanks, just pushed a fix commit.

Contributor Author: @golechwierowicz Could you help check again?

@golechwierowicz (Member) commented Mar 25, 2025:

There are way more errors like this now. Let me handle this internally.

EDIT: To be clear, you can stop rebasing and making changes now.

Contributor Author: Thank you, @golechwierowicz.

Contributor Author: @golechwierowicz Just found an issue in this PR that may break the Google-internal tests; just submitted the fix.

Member: @ezhulenev said he will take over.

@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from fa2a824 to 3e1408f Compare March 24, 2025 14:52
copybara-service bot pushed a commit that referenced this pull request Mar 24, 2025
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

(PR description repeated; see above.)

Copybara import of the project:

--
b5cdc0e by Shawn Wang <[email protected]>:

Rewrite XLA command buffer

--
d21565d by Shawn Wang <[email protected]>:

fix typo

--
3e1408f by Shawn Wang <[email protected]>:

fix coding style

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 3e1408f
PiperOrigin-RevId: 739216739
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 3e1408f to 73490c0 Compare March 25, 2025 02:54
copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

(PR description repeated; see above.)

Copybara import of the project:

--
73490c0 by Shawn Wang <[email protected]>:

Rewrite XLA command buffer

--
c2644d5 by Shawn Wang <[email protected]>:

fix

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency c2644d5
PiperOrigin-RevId: 739216739
copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

(PR description repeated; see above.)

Copybara import of the project:

--
73490c0 by Shawn Wang <[email protected]>:

Rewrite XLA command buffer

--
c2644d5 by Shawn Wang <[email protected]>:

fix

--
7cf96d7 by Shawn Wang <[email protected]>:

add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7
PiperOrigin-RevId: 739216739
copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 25, 2025
Reverts 7a50c1a

copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
PiperOrigin-RevId: 740322710
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 25, 2025
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch 2 times, most recently from 6b301bf to e7817bd Compare March 26, 2025 03:04
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 8cead59 to 3abeb6f Compare March 26, 2025 08:51
@golechwierowicz golechwierowicz requested review from golechwierowicz and removed request for klucke and golechwierowicz March 31, 2025 13:37
@shawnwang18 shawnwang18 marked this pull request as draft March 31, 2025 14:46