Commit 933294d

[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common (NVIDIA#1067)

* moved userbuffers code to TE/common

Signed-off-by: Alp Dener <[email protected]>

* moved comm+GEMM overlap code to TE/common

Signed-off-by: Alp Dener <[email protected]>

* removed PyTorch dependency from comm+GEMM overlap in TE/common

Signed-off-by: Alp Dener <[email protected]>

* added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common

Signed-off-by: Alp Dener <[email protected]>

* updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code

Signed-off-by: Alp Dener <[email protected]>

* updated unit tests to work with refactored comm+GEMM overlap code

Signed-off-by: Alp Dener <[email protected]>

* added a pylint exception to comm+GEMM overlap test runner

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing linting errors

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added documentation for te.initialize_ub

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed compile errors when building with NVTE_UB_WITH_MPI=1

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed default bootstrap backend

Signed-off-by: Alp Dener <[email protected]>

* switched default bootstrap backend priority to MPI > Gloo > NCCL (see the sketch after this commit list)

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated bootstrap backend documentation

Signed-off-by: Alp Dener <[email protected]>

* close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv

Signed-off-by: Alp Dener <[email protected]>

* added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed incorrect read of environment variables

Signed-off-by: Alp Dener <[email protected]>

* corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping

Signed-off-by: Alp Dener <[email protected]>

* moved multicast support check to cuda_runtime.h and replaced cudaGetDeviceProperties call with cached sm_count()

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed commented out old code and replaced external collective function type defines with aliases

Signed-off-by: Alp Dener <[email protected]>

* compile-time CUDA version guard for CUDA Driver Multicast attribute

Signed-off-by: Alp Dener <[email protected]>

* added compile-time CUDA version guards to Multicast code in Userbuffers

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* condensed UB docs, corrected const violations

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed autodoc rst for UB calls, added CUDA version guard on Multicast UB kernels

Signed-off-by: Alp Dener <[email protected]>

* fixed incorrect UB type reporting for P2P overlaps, comment reformatting

Signed-off-by: Alp Dener <[email protected]>

* add docstring to tex.ubuf_built_with_mpi()

Signed-off-by: Alp Dener <[email protected]>

---------

Signed-off-by: Alp Dener <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
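
For reference, the default bootstrap-backend priority mentioned in the commit list (MPI > Gloo > NCCL) can be illustrated with a minimal sketch; the helper name _pick_bootstrap_backend and the availability checks are illustrative assumptions, not TE's actual implementation:

from typing import Optional

import torch.distributed as dist

def _pick_bootstrap_backend(requested: Optional[str] = None) -> str:
    """Illustrative selection of the Userbuffers bootstrap backend."""
    if requested is not None:
        return requested  # explicit user override, e.g. bootstrap_backend="gloo"
    if dist.is_mpi_available():
        return "mpi"
    if dist.is_gloo_available():
        return "gloo"
    return "nccl"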
denera and pre-commit-ci[bot] authored Oct 29, 2024
1 parent 35bbe74 commit 933294d
Showing 30 changed files with 2,546 additions and 1,787 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -38,3 +38,4 @@ dist/
 downloads/
 .pytest_cache/
 compile_commands.json
+.nfs
19 changes: 4 additions & 15 deletions build_tools/pytorch.py
@@ -11,7 +11,6 @@
 from .utils import (
     all_files_in_dir,
     cuda_archs,
-    cuda_path,
     cuda_version,
 )

@@ -29,9 +28,6 @@ def setup_pytorch_extension(
     sources = [
         csrc_source_files / "common.cu",
         csrc_source_files / "ts_fp8_op.cpp",
-        csrc_source_files / "userbuffers" / "ipcsocket.cc",
-        csrc_source_files / "userbuffers" / "userbuffers.cu",
-        csrc_source_files / "userbuffers" / "userbuffers-host.cpp",
     ] + all_files_in_dir(extensions_dir)

     # Header files

@@ -85,19 +81,14 @@ def setup_pytorch_extension(
             continue  # Already handled
         nvcc_flags.extend(["-gencode", f"arch=compute_{arch},code=sm_{arch}"])

-    # Libraries
-    library_dirs = []
-    libraries = []
-    if bool(int(os.getenv("NVTE_UB_WITH_MPI", 0))):
+    if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
         assert (
             os.getenv("MPI_HOME") is not None
-        ), "MPI_HOME must be set when compiling with NVTE_UB_WITH_MPI=1"
-        mpi_home = Path(os.getenv("MPI_HOME"))
-        include_dirs.append(mpi_home / "include")
+        ), "MPI_HOME=/path/to/mpi must be set when compiling with NVTE_UB_WITH_MPI=1!"
+        mpi_path = Path(os.getenv("MPI_HOME"))
+        include_dirs.append(mpi_path / "include")
         cxx_flags.append("-DNVTE_UB_WITH_MPI")
         nvcc_flags.append("-DNVTE_UB_WITH_MPI")
-        library_dirs.append(mpi_home / "lib")
-        libraries.append("mpi")

     # Construct PyTorch CUDA extension
     sources = [str(path) for path in sources]

@@ -112,6 +103,4 @@ def setup_pytorch_extension(
             "cxx": cxx_flags,
             "nvcc": nvcc_flags,
         },
-        libraries=[str(lib) for lib in libraries],
-        library_dirs=[str(lib_dir) for lib_dir in library_dirs],
     )
4 changes: 4 additions & 0 deletions docs/api/pytorch.rst
@@ -51,3 +51,7 @@ pyTorch
 .. autoapifunction:: transformer_engine.pytorch.moe_permute

 .. autoapifunction:: transformer_engine.pytorch.moe_unpermute
+
+.. autoapifunction:: transformer_engine.pytorch.initialize_ub
+
+.. autoapifunction:: transformer_engine.pytorch.destroy_ub
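
The two autoapifunction entries added above expose the new user-facing calls in the API reference. As a rough usage sketch (argument names such as shape, tp_size, use_fp8, dtype, and bootstrap_backend reflect the refactored comm+GEMM overlap API described in this commit, but the exact signature should be confirmed against the generated documentation):

import torch
import transformer_engine.pytorch as te

# Sketch only: assumes torch.distributed is already initialized across the
# tensor-parallel group and that the buffer shape is (sequence * batch, hidden).
te.initialize_ub(
    [2 * 4096, 12288],        # communication buffer shape
    tp_size=8,                # tensor-parallel group size
    use_fp8=False,
    dtype=torch.bfloat16,
    bootstrap_backend="nccl", # or "mpi"/"gloo"; default priority is MPI > Gloo > NCCL
)

# ... run TE modules with userbuffer overlap options enabled ...

te.destroy_ub()  # release Userbuffers resources before shutdown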
9 changes: 8 additions & 1 deletion setup.py
@@ -57,13 +57,20 @@ def run(self):

 def setup_common_extension() -> CMakeExtension:
     """Setup CMake extension for common library"""
+    cmake_flags = ["-DCMAKE_CUDA_ARCHITECTURES={}".format(cuda_archs())]
+    if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
+        assert (
+            os.getenv("MPI_HOME") is not None
+        ), "MPI_HOME must be set when compiling with NVTE_UB_WITH_MPI=1"
+        cmake_flags.append("-DNVTE_UB_WITH_MPI=ON")
+
     # Project directory root
     root_path = Path(__file__).resolve().parent

     return CMakeExtension(
         name="transformer_engine",
         cmake_path=root_path / Path("transformer_engine/common"),
-        cmake_flags=["-DCMAKE_CUDA_ARCHITECTURES={}".format(cuda_archs())],
+        cmake_flags=cmake_flags,
     )
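
Since this commit also documents tex.ubuf_built_with_mpi(), a quick runtime check for whether the installed extension was built with NVTE_UB_WITH_MPI=1 might look like the sketch below; the transformer_engine_torch import name is an assumption about how the pybind11 bindings are exposed, not something confirmed by this diff.

import transformer_engine_torch as tex  # assumed name of the bindings module

# True if compiled with NVTE_UB_WITH_MPI=1, i.e. Userbuffers bootstraps over
# direct MPI calls instead of a torch.distributed process group.
if tex.ubuf_built_with_mpi():
    print("Userbuffers bootstrapping uses MPI (built against MPI_HOME).")
else:
    print("Userbuffers bootstrapping uses torch.distributed (MPI > Gloo > NCCL).")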


