[XLA:GPU] Re-implement command buffer to remove execution_scope, construct dependencies through DAG. #20611
There are formatting errors, and 3 tests fail.
933ecbd to bc3a0b4
74626e1 to 600203e
if (create) {
  TF_ASSIGN_OR_RETURN(initialize_counter_node_,
                      command_buffer->CreateMemsetNode(
                          ToDependentNodes(), loop_counter, uint32_t(0),
Our internal linters complain about this cast style. Can you use static_cast<uint32_t> instead?
Fixed
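For readers unfamiliar with the lint rule behind this request, here is a minimal sketch of the two cast styles. `FillPattern` is a hypothetical stand-in for an API that takes a 32-bit value; it is not the real `CreateMemsetNode` signature. For a literal `0` both forms produce the same value. The named cast is preferred mainly because a function-style cast on an arithmetic type behaves like a C-style cast and is harder to search for and audit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an API that accepts a 32-bit bit pattern.
inline uint32_t FillPattern(uint32_t bit_pattern) { return bit_pattern; }

// Function-style cast: equivalent to a C-style cast for a single argument,
// which is the form the linter mentioned in the review flags.
inline uint32_t FunctionalCastZero() { return FillPattern(uint32_t(0)); }

// Named cast: same resulting value, but the conversion is explicit.
inline uint32_t StaticCastZero() {
  return FillPattern(static_cast<uint32_t>(0));
}
```

Both helpers return the same value, so the requested change is purely stylistic.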
  cond_node_ = cond_node_result.node_handle;
} else {
  TF_RETURN_IF_ERROR(command_buffer->UpdateMemsetNode(
      initialize_counter_node_, loop_counter, uint32_t(0),
Ditto. Can you use static_cast<uint32_t> instead?
Fixed
@golechwierowicz Could you check again why the Google internal tests failed?
module compiler/xla/stream_executor:command_buffer does not directly depend on a module exporting 'third_party/absl/container/flat_hash_set.h'
It seems that only the dependency on absl::flat_hash_set is missing from the BUILD files.
Thanks, just pushed a fix commit.
@golechwierowicz Could you help check again?
There are many more errors like this now. Let me handle this internally.
EDIT: to be clear, you can stop rebasing and making changes now.
Thank you, @golechwierowicz.
@golechwierowicz I just found an issue in this PR that may break the Google internal tests, and I have just submitted the fix.
@ezhulenev said he will take over
fa2a824 to 3e1408f
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

This PR is a large redesign of the XLA command buffer, based on the following ideas:

1. Remove the execution scope concept from the command buffer; the command buffer should have no notion of an execution stream at all. Instead, the command buffer constructs the graph from the data flow in its command sequence (converted from the XLA thunk sequence), so overlapping (previously expressed through execution scopes) is deduced automatically from the graph topology. This matches the CUDA graph concept more closely: a CUDA graph has no stream property, and operators are launched on different streams automatically whenever the graph topology has no dependencies (edges) between them.
2. A CommandBufferCmd can now depend on other commands (specified by command index in the CommandBufferCmdSequence), and dependencies (R/W and W/W conflicts) are inferred automatically from the buffer assignment results when a new command is appended to the CommandBufferCmdSequence (CommandBufferCmdSequence::Append()).
3. In the CommandBuffer implementation, dependencies specified by CommandBufferCmd index are translated into node dependencies between CUDA nodes.

The main benefits of the design are:

1. Constructing the graph topology from data flow, instead of thunk execution order, automatically enables the maximum concurrency that the data flow across operators allows, which should improve performance.
2. The design is more natural and intuitive. The command buffer is meant to be an abstraction of a CUDA graph, so it should simply be a graph of how the result is computed through operators. Execution scope, or stream, is a runtime concept that should not be part of the command buffer's graph. The current design introduces execution scopes only to preserve the notion of an execution stream in the XLA runtime, which is unnecessary and counterintuitive.
3. The command buffer is easier to implement and use. Simulating multi-stream ordering in the command buffer through execution scopes introduces a lot of hard-to-understand code, whereas with the data flow design the implementation is more compact and easier to understand.

Copybara import of the project:

-- b5cdc0e by Shawn Wang: Rewrite XLA command buffer
-- d21565d by Shawn Wang: fix typo
-- 3e1408f by Shawn Wang: fix coding style

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 3e1408f
PiperOrigin-RevId: 739216739
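The dependency inference described in point 2 of the commit message can be illustrated with a minimal sketch. It is not the PR's implementation. `Cmd`, `CmdSequence`, and the integer buffer ids are hypothetical simplifications; the real code works on buffer slices from buffer assignment. The sketch also records every conflicting earlier command, without pruning edges that are already implied transitively.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified command: the buffer ids it reads and writes.
struct Cmd {
  std::set<int> reads;
  std::set<int> writes;
};

// Hypothetical stand-in for a command sequence that infers dependencies
// from buffer conflicts at append time.
class CmdSequence {
 public:
  // Appends a command, records the indices of all earlier commands it
  // conflicts with, and returns the index of the new command.
  size_t Append(Cmd cmd) {
    std::vector<size_t> deps;
    for (size_t i = 0; i < cmds_.size(); ++i) {
      if (Conflicts(cmds_[i], cmd)) deps.push_back(i);
    }
    cmds_.push_back(std::move(cmd));
    deps_.push_back(std::move(deps));
    return cmds_.size() - 1;
  }

  // Indices of the commands that must complete before `index` may run.
  const std::vector<size_t>& Dependencies(size_t index) const {
    assert(index < deps_.size());
    return deps_[index];
  }

 private:
  static bool Overlaps(const std::set<int>& a, const std::set<int>& b) {
    for (int id : a) {
      if (b.count(id) > 0) return true;
    }
    return false;
  }

  // R/W (in either order) and W/W accesses conflict; two reads do not.
  static bool Conflicts(const Cmd& earlier, const Cmd& later) {
    return Overlaps(earlier.writes, later.reads) ||
           Overlaps(earlier.reads, later.writes) ||
           Overlaps(earlier.writes, later.writes);
  }

  std::vector<Cmd> cmds_;
  std::vector<std::vector<size_t>> deps_;
};
```

In this model, two commands that write different buffers get no edge between them, so a graph runtime is free to overlap them. A later command that reads both buffers depends on both writers. That is the sense in which concurrency follows from data flow rather than from stream assignment.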
3e1408f to 73490c0
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611, with the same description as above.

Copybara import of the project:

-- 73490c0 by Shawn Wang: Rewrite XLA command buffer
-- c2644d5 by Shawn Wang: fix

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency c2644d5
PiperOrigin-RevId: 739216739
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611, with the same description as above.

Copybara import of the project:

-- 73490c0 by Shawn Wang: Rewrite XLA command buffer
-- c2644d5 by Shawn Wang: fix
-- 7cf96d7 by Shawn Wang: add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7
PiperOrigin-RevId: 739216739
…scope, construct dependencies through DAG.

Imported from GitHub PR openxla/xla#20611

This PR is a large redesign of the XLA command buffer, based on the following ideas:

1. Remove the execution scope concept from the command buffer; the command buffer should have no notion of an execution stream. Instead, the command buffer constructs its graph from the data flow in the command sequence (converted from the XLA thunk sequence), so overlapping (previously expressed through execution scopes) is deduced automatically from the graph topology. This matches the CUDA graph concept more closely: a CUDA graph has no stream property, and operators are launched on different streams automatically whenever the graph topology has no dependencies (edges) between them.
2. A CommandBufferCmd can now depend on other commands (specified by command index in the CommandBufferCmdSequence). Dependencies (R/W and W/W conflicts) are inferred automatically from buffer assignment results when a new command is appended to the CommandBufferCmdSequence (CommandBufferCmdSequence::Append()).
3. In the CommandBuffer implementation, dependencies specified by CommandBufferCmd index are translated into dependencies between CUDA graph nodes.

The main benefits of the design are:

1. Constructing the graph topology from data flow, instead of thunk execution order, automatically enables the maximum concurrency that the data flow between operators allows, which should improve performance.
2. The design is more natural and intuitive. The command buffer is designed as an abstraction of a CUDA graph, so it should simply be a graph of how the result is computed by operators. An execution scope or stream is a runtime concept that does not belong in the command buffer's graph. The current design introduces execution scopes only to preserve the notion of execution streams from the XLA runtime, which is unnecessary and counterintuitive.
3. The command buffer is easier to implement and use. Simulating multi-stream ordering in the command buffer through execution scopes introduces a lot of hard-to-understand code, whereas the data-flow design makes the implementation more compact and easier to understand.

Copybara import of the project:

-- 73490c0a56c00b58d8d8d2f8e164f6080d03ce0f by Shawn Wang <[email protected]>: Rewrite XLA command buffer
-- c2644d5d9084c9ae17111f5de8920c39cb495521 by Shawn Wang <[email protected]>: fix
-- 7cf96d7b01e001769565b4ba6eae4431e0ba650f by Shawn Wang <[email protected]>: add build dependency

Merging this change closes #20611

Reverts 7a50c1a

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7b01e001769565b4ba6eae4431e0ba650f
PiperOrigin-RevId: 739216739
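To make idea 2 of the description concrete, here is a minimal sketch of inferring dependencies from buffer conflicts at append time. It is not the code from this PR: `BufferUse`, `Cmd`, and `CmdSequence` are hypothetical stand-ins for the real XLA types, and buffers are compared by a plain integer id, whereas real buffer assignment slices can partially overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffer access recorded by a command.
struct BufferUse {
  int64_t buffer_id;  // assumption: one id per buffer, no partial overlap
  bool is_write;
};

// Hypothetical stand-in for a command with its inferred dependencies.
struct Cmd {
  std::vector<BufferUse> uses;
  std::vector<size_t> deps;  // indices of earlier commands to wait for
};

class CmdSequence {
 public:
  // Appends a command and records an edge to every earlier command that
  // touches a common buffer where at least one side writes (R/W or W/W).
  // Returns the index of the new command.
  size_t Append(std::vector<BufferUse> uses) {
    Cmd cmd{std::move(uses), {}};
    for (size_t i = 0; i < cmds_.size(); ++i) {
      if (Conflicts(cmds_[i], cmd)) cmd.deps.push_back(i);
    }
    cmds_.push_back(std::move(cmd));
    return cmds_.size() - 1;
  }

  const std::vector<size_t>& deps(size_t index) const {
    assert(index < cmds_.size());
    return cmds_[index].deps;
  }

 private:
  // Two commands conflict if they share a buffer and either one writes it.
  // Read/read sharing is not a conflict.
  static bool Conflicts(const Cmd& a, const Cmd& b) {
    for (const BufferUse& x : a.uses) {
      for (const BufferUse& y : b.uses) {
        if (x.buffer_id == y.buffer_id && (x.is_write || y.is_write)) {
          return true;
        }
      }
    }
    return false;
  }

  std::vector<Cmd> cmds_;
};
```

For example, if command A writes buffer 0, command B writes buffer 1, and command C reads both, then A and B get no edge between them and may overlap, while C depends on both. The sketch deliberately keeps transitively redundant edges; a real implementation could prune them before creating graph nodes.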
…scope, construct dependencies through DAG.

Imported from GitHub PR #20611

Copybara import of the project:

-- 73490c0 by Shawn Wang <[email protected]>: Rewrite XLA command buffer
-- c2644d5 by Shawn Wang <[email protected]>: fix
-- 7cf96d7 by Shawn Wang <[email protected]>: add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7
PiperOrigin-RevId: 740322710
…scope, construct dependencies through DAG.

Imported from GitHub PR openxla/xla#20611

Copybara import of the project:

-- 73490c0a56c00b58d8d8d2f8e164f6080d03ce0f by Shawn Wang <[email protected]>: Rewrite XLA command buffer
-- c2644d5d9084c9ae17111f5de8920c39cb495521 by Shawn Wang <[email protected]>: fix
-- 7cf96d7b01e001769565b4ba6eae4431e0ba650f by Shawn Wang <[email protected]>: add build dependency

Merging this change closes #20611

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#20611 from shawnwang18:shawnw/cuda_graph_dependency 7cf96d7b01e001769565b4ba6eae4431e0ba650f
PiperOrigin-RevId: 740322710
6b301bf to e7817bd
8cead59 to 3abeb6f
This PR is a large redesign of the XLA command buffer, based on the following ideas:
The main benefits of the design are: