Commit 933294d

[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common (NVIDIA#1067)

* moved userbuffers code to TE/common

Signed-off-by: Alp Dener <[email protected]>

* moved comm+GEMM overlap code to TE/common

Signed-off-by: Alp Dener <[email protected]>

* removed PyTorch dependency from comm+GEMM overlap in TE/common

Signed-off-by: Alp Dener <[email protected]>

* added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common

Signed-off-by: Alp Dener <[email protected]>

* updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code

Signed-off-by: Alp Dener <[email protected]>

* updated unit tests to work with refactored comm+GEMM overlap code

Signed-off-by: Alp Dener <[email protected]>

* added a pylint exception to comm+GEMM overlap test runner

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixing linting errors

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added documentation for te.initialize_ub

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed compile errors when building with NVTE_UB_WITH_MPI=1

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed default bootstrap backend

Signed-off-by: Alp Dener <[email protected]>

* switched default bootstrap backend priority to MPI > Gloo > NCCL (see the sketch after this commit list)

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated bootstrap backend documentation

Signed-off-by: Alp Dener <[email protected]>

* close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv

Signed-off-by: Alp Dener <[email protected]>

* added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed incorrect read of environment variables

Signed-off-by: Alp Dener <[email protected]>

* corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping

Signed-off-by: Alp Dener <[email protected]>

* moved multicast support check to cuda_runtime.h and replaced cudaGetDeviceProperties call with cached sm_count()

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed commented out old code and replaced external collective function type defines with aliases

Signed-off-by: Alp Dener <[email protected]>

* compile-time CUDA version guard for CUDA Driver Multicast attribute

Signed-off-by: Alp Dener <[email protected]>

* added compile-time CUDA version guards to Multicast code in Userbuffers

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* condensed UB docs, corrected const violations

Signed-off-by: Alp Dener <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed autodoc rst for UB calls, added CUDA version guard on Multicast UB kernels

Signed-off-by: Alp Dener <[email protected]>

* fixed incorrect UB type reporting for P2P overlaps, comment reformatting

Signed-off-by: Alp Dener <[email protected]>

* add docstring to tex.ubuf_built_with_mpi()

Signed-off-by: Alp Dener <[email protected]>

---------

Signed-off-by: Alp Dener <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
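
For reference, the default bootstrap-backend priority mentioned in the commit list (MPI > Gloo > NCCL) can be illustrated with a minimal sketch; the helper name _pick_bootstrap_backend and the availability checks are illustrative assumptions, not TE's actual implementation:

from typing import Optional

import torch.distributed as dist

def _pick_bootstrap_backend(requested: Optional[str] = None) -> str:
    """Illustrative selection of the Userbuffers bootstrap backend."""
    if requested is not None:
        return requested  # explicit user override, e.g. bootstrap_backend="gloo"
    if dist.is_mpi_available():
        return "mpi"
    if dist.is_gloo_available():
        return "gloo"
    return "nccl"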
denera and pre-commit-ci[bot] authored Oct 29, 2024
1 parent 35bbe74 commit 933294d
Showing 30 changed files with 2,546 additions and 1,787 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -38,3 +38,4 @@ dist/
 downloads/
 .pytest_cache/
 compile_commands.json
+.nfs
19 changes: 4 additions & 15 deletions build_tools/pytorch.py
@@ -11,7 +11,6 @@
 from .utils import (
     all_files_in_dir,
     cuda_archs,
-    cuda_path,
     cuda_version,
 )

@@ -29,9 +28,6 @@ def setup_pytorch_extension(
     sources = [
         csrc_source_files / "common.cu",
         csrc_source_files / "ts_fp8_op.cpp",
-        csrc_source_files / "userbuffers" / "ipcsocket.cc",
-        csrc_source_files / "userbuffers" / "userbuffers.cu",
-        csrc_source_files / "userbuffers" / "userbuffers-host.cpp",
     ] + all_files_in_dir(extensions_dir)

     # Header files

@@ -85,19 +81,14 @@ def setup_pytorch_extension(
             continue  # Already handled
         nvcc_flags.extend(["-gencode", f"arch=compute_{arch},code=sm_{arch}"])

-    # Libraries
-    library_dirs = []
-    libraries = []
-    if bool(int(os.getenv("NVTE_UB_WITH_MPI", 0))):
+    if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
         assert (
             os.getenv("MPI_HOME") is not None
-        ), "MPI_HOME must be set when compiling with NVTE_UB_WITH_MPI=1"
-        mpi_home = Path(os.getenv("MPI_HOME"))
-        include_dirs.append(mpi_home / "include")
+        ), "MPI_HOME=/path/to/mpi must be set when compiling with NVTE_UB_WITH_MPI=1!"
+        mpi_path = Path(os.getenv("MPI_HOME"))
+        include_dirs.append(mpi_path / "include")
         cxx_flags.append("-DNVTE_UB_WITH_MPI")
         nvcc_flags.append("-DNVTE_UB_WITH_MPI")
-        library_dirs.append(mpi_home / "lib")
-        libraries.append("mpi")

     # Construct PyTorch CUDA extension
     sources = [str(path) for path in sources]

@@ -112,6 +103,4 @@ def setup_pytorch_extension(
             "cxx": cxx_flags,
             "nvcc": nvcc_flags,
         },
-        libraries=[str(lib) for lib in libraries],
-        library_dirs=[str(lib_dir) for lib_dir in library_dirs],
     )
4 changes: 4 additions & 0 deletions docs/api/pytorch.rst
@@ -51,3 +51,7 @@ pyTorch
 .. autoapifunction:: transformer_engine.pytorch.moe_permute

 .. autoapifunction:: transformer_engine.pytorch.moe_unpermute
+
+.. autoapifunction:: transformer_engine.pytorch.initialize_ub
+
+.. autoapifunction:: transformer_engine.pytorch.destroy_ub
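
The two autoapifunction entries added above expose the new user-facing calls in the API reference. As a rough usage sketch (argument names such as shape, tp_size, use_fp8, dtype, and bootstrap_backend reflect the refactored comm+GEMM overlap API described in this commit, but the exact signature should be confirmed against the generated documentation):

import torch
import transformer_engine.pytorch as te

# Sketch only: assumes torch.distributed is already initialized across the
# tensor-parallel group and that the buffer shape is (sequence * batch, hidden).
te.initialize_ub(
    [2 * 4096, 12288],        # communication buffer shape
    tp_size=8,                # tensor-parallel group size
    use_fp8=False,
    dtype=torch.bfloat16,
    bootstrap_backend="nccl", # or "mpi"/"gloo"; default priority is MPI > Gloo > NCCL
)

# ... run TE modules with userbuffer overlap options enabled ...

te.destroy_ub()  # release Userbuffers resources before shutdown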
9 changes: 8 additions & 1 deletion setup.py
@@ -57,13 +57,20 @@ def run(self):

 def setup_common_extension() -> CMakeExtension:
     """Setup CMake extension for common library"""
+    cmake_flags = ["-DCMAKE_CUDA_ARCHITECTURES={}".format(cuda_archs())]
+    if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
+        assert (
+            os.getenv("MPI_HOME") is not None
+        ), "MPI_HOME must be set when compiling with NVTE_UB_WITH_MPI=1"
+        cmake_flags.append("-DNVTE_UB_WITH_MPI=ON")
+
     # Project directory root
     root_path = Path(__file__).resolve().parent

     return CMakeExtension(
         name="transformer_engine",
         cmake_path=root_path / Path("transformer_engine/common"),
-        cmake_flags=["-DCMAKE_CUDA_ARCHITECTURES={}".format(cuda_archs())],
+        cmake_flags=cmake_flags,
     )
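
Since this commit also documents tex.ubuf_built_with_mpi(), a quick runtime check for whether the installed extension was built with NVTE_UB_WITH_MPI=1 might look like the sketch below; the transformer_engine_torch import name is an assumption about how the pybind11 bindings are exposed, not something confirmed by this diff.

import transformer_engine_torch as tex  # assumed name of the bindings module

# True if compiled with NVTE_UB_WITH_MPI=1, i.e. Userbuffers bootstraps over
# direct MPI calls instead of a torch.distributed process group.
if tex.ubuf_built_with_mpi():
    print("Userbuffers bootstrapping uses MPI (built against MPI_HOME).")
else:
    print("Userbuffers bootstrapping uses torch.distributed (MPI > Gloo > NCCL).")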


