Add register sharing to warp-specialized circular buffering (#3669)
This PR implements register sharing for warp-specialized circular buffering. The load warp group releases registers and the compute warp group claims them, using the `setmaxnreg` PTX instruction. It is an optimization for matmul kernels.

## Changes

1. Add `__launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/)` to the CUDA kernel declaration.
2. Add `kir::SetMaxNReg` and `kir::Return` nodes to warp-specialized circular buffering.
3. `TensorView::circularBuffer` allows setting the number of registers for the load and compute warp groups through `struct WarpSpecialized`.
4. Require the Hopper architecture for TensorViews using warp-specialized circular buffering.

## Why is `__launch_bounds__` necessary?

> The setmaxnreg instruction requires that the kernel has been launched with a valid value of maximum number of per-thread registers specified via the appropriate compile-time option or the appropriate performance tuning directive. Otherwise, the setmaxnreg instruction may have no effect.

From https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg

## Generated Code

```cuda
__global__ void __launch_bounds__(/*MAX_THREADS_PER_BLOCK=*/64)
nvfuser_none_f0_c0_r0_g0(
    Tensor<float, 1, 1> T0,
    const __grid_constant__ TensorMap var0,
    Tensor<float, 1, 1> T1) {
  // do something
  if ((((nvfuser_index_t)threadIdx.y) == 1)) {
    // Load warp group: shrink the per-thread register limit to 24.
    asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n"::"n"(24));
    // load something
    return;
  } else {
    // Compute warp group: grow the per-thread register limit to 240.
    asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n"::"n"(240));
    // compute something
  }
  // do something
}
```
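
The register counts 24 and 240 in the generated code are not arbitrary: the PTX ISA requires the `setmaxnreg` immediate to lie between 24 and 256 inclusive and to be a multiple of 8. The following host-side C++ sketch checks that constraint. Both function names are hypothetical and are not part of nvFuser. `fitsRegisterPool` additionally assumes a simplified model in which all warp groups have the same number of threads and registers released by the load group are the only source for the compute groups.

```cpp
// Hypothetical helpers, not part of nvFuser; a sketch under the
// assumptions stated above.

// PTX ISA constraint on the setmaxnreg immediate: 24 <= regs <= 256,
// and regs must be a multiple of 8.
constexpr bool isValidSetMaxNReg(int regs) {
  return regs >= 24 && regs <= 256 && regs % 8 == 0;
}

// Simplified budget check (assumption: one load warp group plus
// num_compute_groups compute warp groups, all with the same thread count).
// The registers requested after the split must not exceed what the groups
// held in total at launch.
constexpr bool fitsRegisterPool(
    int initial_regs_per_thread,
    int load_regs,
    int compute_regs,
    int num_compute_groups) {
  return load_regs + compute_regs * num_compute_groups <=
      initial_regs_per_thread * (1 + num_compute_groups);
}

// The values used in the generated kernel satisfy the PTX constraint.
static_assert(isValidSetMaxNReg(24), "load register count");
static_assert(isValidSetMaxNReg(240), "compute register count");
```

Under this model, one load group at 24 registers and two compute groups at 240 registers need 504 registers per thread-slot in total, which fits exactly when every thread starts with 168 registers. The actual limit at launch depends on the `__launch_bounds__` value and the compiler, so treat the second helper as a rough sanity check only.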
Showing 9 changed files with 297 additions and 24 deletions.