
[XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG. #20611

Draft · shawnwang18 wants to merge 1 commit into main from shawnwang18:shawnw/cuda_graph_dependency
Conversation

@shawnwang18 (Contributor) commented Dec 17, 2024

This PR is a large redesign of the XLA command buffer, built around the following ideas:

  1. Remove the execution scope concept from the command buffer; there should be no notion of execution streams inside a command buffer at all. Instead, the command buffer constructs its graph from the data flow of the command sequence (converted from the XLA thunk sequence), so overlap (formerly expressed via execution scopes) is deduced automatically from the graph topology. This matches the CUDA graph model more closely: a CUDA graph has no stream property, and operators are launched on different streams automatically whenever the graph topology has no dependency edges between them.
  2. A CommandBufferCmd can now depend on other commands (specified by command index in the CommandBufferCmdSequence), and dependencies (R/W and W/W conflicts) are inferred automatically from buffer assignment results when a new command is appended to the sequence (CommandBufferCmdSequence::Append()).
  3. When the CommandBuffer is instantiated, dependencies specified through CommandBufferCmd indices are translated into node dependencies between CUDA graph nodes.

The main benefits of the design are:

  1. Constructing the graph topology from data flow, instead of thunk execution order, automatically enables the maximum concurrency that the data flow between operators allows, which should yield better performance.
  2. The design is more natural and intuitive: a command buffer is meant to be an abstraction over a CUDA graph, so it should simply be a graph of how results are computed by operators. Execution scopes and streams are runtime concepts that do not belong in the command buffer's graph; the current design introduces execution scopes only to mirror the execution stream concept of the XLA runtime, which is unnecessary and counterintuitive.
  3. The command buffer becomes easier to implement and use: simulating multi-stream ordering through execution scopes introduces a lot of hard-to-understand code, whereas with the data flow design the implementation is more compact and easier to follow.

@shawnwang18 shawnwang18 added the kokoro:force-run Forces CI to rerun label Dec 17, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Forces CI to rerun label Dec 17, 2024
@loislo (Member) commented Dec 21, 2024

There are formatting errors and 3 tests that fail.

@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 933ecbd to bc3a0b4 Compare December 30, 2024 02:58
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 74626e1 to 600203e Compare December 30, 2024 09:13
if (create) {
  TF_ASSIGN_OR_RETURN(initialize_counter_node_,
                      command_buffer->CreateMemsetNode(
                          ToDependentNodes(), loop_counter, uint32_t(0),
Member: Internally, linters complain about the style of casting. Can you use static_cast<uint32_t> instead?

Contributor Author: Fixed.

  cond_node_ = cond_node_result.node_handle;
} else {
  TF_RETURN_IF_ERROR(command_buffer->UpdateMemsetNode(
      initialize_counter_node_, loop_counter, uint32_t(0),
Member: Ditto. Can you use static_cast<uint32_t> instead?

Contributor Author: Fixed.

Contributor Author: @golechwierowicz Could you check again why the Google-internal tests failed?

Member:

module compiler/xla/stream_executor:command_buffer does not directly depend on a module exporting 'third_party/absl/container/flat_hash_set.h'

It seems the dependency on absl::flat_hash_set is just missing from the BUILD files.
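The fix presumably adds the missing edge to the library's BUILD target. A hedged sketch of what that edge looks like (the exact target layout and existing attributes are assumed, not copied from the repo):

```
# xla/stream_executor/BUILD (sketch; surrounding attributes assumed)
cc_library(
    name = "command_buffer",
    # ... existing srcs/hdrs/deps ...
    deps = [
        # ... existing deps ...
        "@com_google_absl//absl/container:flat_hash_set",  # the missing dep
    ],
)
```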

Contributor Author: Thanks, just pushed a fix commit.

Contributor Author: @golechwierowicz Could you help check again?

@golechwierowicz (Member) commented Mar 25, 2025:

There are way more errors like this now. Let me handle this internally.

EDIT: To be clear, you can stop rebasing and making changes now.

Contributor Author: Thank you, @golechwierowicz.

Contributor Author: @golechwierowicz Just found an issue in this PR that may break the Google-internal tests; just submitted the fix.

Member: @ezhulenev said he will take over.

@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from fa2a824 to 3e1408f Compare March 24, 2025 14:52
copybara-service bot pushed a commit that referenced this pull request Mar 24, 2025
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

(PR description repeated; see above.)

Copybara import of the project:

--
b5cdc0e by Shawn Wang <[email protected]>:

Rewrite XLA command buffer

--
d21565d by Shawn Wang <[email protected]>:

fix typo

--
3e1408f by Shawn Wang <[email protected]>:

fix coding style

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 3e1408f
PiperOrigin-RevId: 739216739
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit that referenced this pull request Mar 24, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 24, 2025
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 3e1408f to 73490c0 Compare March 25, 2025 02:54
copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

(PR description repeated; see above.)

Copybara import of the project:

--
73490c0 by Shawn Wang <[email protected]>:

Rewrite XLA command buffer

--
c2644d5 by Shawn Wang <[email protected]>:

fix

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency c2644d5
PiperOrigin-RevId: 739216739
copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

(PR description repeated; see above.)

Copybara import of the project:

--
73490c0 by Shawn Wang <[email protected]>:

Rewrite XLA command buffer

--
c2644d5 by Shawn Wang <[email protected]>:

fix

--
7cf96d7 by Shawn Wang <[email protected]>:

add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7
PiperOrigin-RevId: 739216739
copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 25, 2025
Reverts 7a50c1a

copybara-service bot pushed a commit that referenced this pull request Mar 25, 2025
PiperOrigin-RevId: 740322710
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 25, 2025
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch 2 times, most recently from 6b301bf to e7817bd Compare March 26, 2025 03:04
@shawnwang18 shawnwang18 force-pushed the shawnw/cuda_graph_dependency branch from 8cead59 to 3abeb6f Compare March 26, 2025 08:51
@golechwierowicz golechwierowicz requested review from golechwierowicz and removed request for klucke and golechwierowicz March 31, 2025 13:37
@shawnwang18 shawnwang18 marked this pull request as draft March 31, 2025 14:46