Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored a…
…nd moved to TE/common (NVIDIA#1067) * moved userbuffers code to TE/common Signed-off-by: Alp Dener <[email protected]> * moved comm+GEMM overlap code to TE/common Signed-off-by: Alp Dener <[email protected]> * removed PyTorch depdency from comm+GEMM overlap in TE/common Signed-off-by: Alp Dener <[email protected]> * added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common Signed-off-by: Alp Dener <[email protected]> * updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code Signed-off-by: Alp Dener <[email protected]> * updated unit tests to work with refactored comm+GEMM overlap code Signed-off-by: Alp Dener <[email protected]> * added a pylint exception to comm+GEMM overlap test runner Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixing linting errors Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added documentation for te.initialize_ub Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed compile errors when building with NVTE_UB_WITH_MPI=1 Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed default bootstrap backend Signed-off-by: Alp Dener <[email protected]> * switched default bootstrap backend priority to MPI > Gloo > NCCL Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated bootstrap backend documentation Signed-off-by: Alp Dener <[email protected]> * close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv Signed-off-by: Alp Dener <[email protected]> * added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed incorrect read of environment variables Signed-off-by: Alp Dener <[email protected]> * corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping Signed-off-by: Alp Dener <[email protected]> * moved multicast support check to cuda_runtime.h and replaced cudaDeviceGetProp call with cached sm_count() Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed commented out old code and replaced external collective function type defines with aliases Signed-off-by: Alp Dener <[email protected]> * compile-time CUDA version guard for CUDA Driver Multicast attribute Signed-off-by: Alp Dener <[email protected]> * added compile-time CUDA version guards to Multicast code in Userbuffers Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * condensed UB docs, corrected const violations Signed-off-by: Alp Dener <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed autodoc rst for UB calls, added CUDA version guard on Multicast UB kernels Signed-off-by: Alp Dener <[email protected]> * fixed incorrect UB type reporting for P2P overlaps, comment reformatting Signed-off-by: Alp Dener <[email protected]> * add docstring to tex.ubuf_built_with_mpi() Signed-off-by: Alp Dener <[email protected]> --------- Signed-off-by: Alp Dener <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Loading branch information