
Get Stuck at Building Wheel #976

Open
kingformatty opened this issue Jun 27, 2024 · 9 comments

Labels
build Build system

Comments

@kingformatty

Hi, has anyone faced the problem of the installation getting stuck at building the wheel?

@timmoon10 added the build Build system label on Jun 27, 2024
@timmoon10
Collaborator

timmoon10 commented Jun 27, 2024

Can you share more information on your configuration, especially which DL framework you're building with? Passing the --verbose flag to pip install would also provide more useful build logs. A hang makes me suspect your system is over-parallelizing the build process:

  • If the hang happens while building Flash Attention or transformer_engine_torch, then it's a failure while building a PyTorch extension. Try setting MAX_JOBS=1 in the environment (see this note). Note that building Flash Attention is especially resource-intensive and can experience problems even on relatively powerful systems.
  • If the hang happens in CMake, then it's a failure in a Ninja build. We currently don't have a nice way to reduce the number of parallel Ninja jobs, but it is something we should prioritize if it is causing a problem (pinging @phu0ngng). You could try setting CMAKE_BUILD_PARALLEL_LEVEL=1 in the environment.
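
For reference, a minimal sketch of a fully serialized build that combines the two variables above (the source-checkout path is an assumption; --verbose is the flag mentioned earlier for more useful logs):

# Sketch only: one compile job for the PyTorch extensions (MAX_JOBS) and one
# for the CMake/Ninja build (CMAKE_BUILD_PARALLEL_LEVEL), with verbose pip logs.
cd TransformerEngine
MAX_JOBS=1 CMAKE_BUILD_PARALLEL_LEVEL=1 pip install --verbose .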

@timmoon10
Collaborator

With #987, you can control the number of parallel build jobs with the MAX_JOBS environment variable.

@ZSL98

ZSL98 commented Aug 1, 2024

Same problem.
Specifically, it got stuck at Running command /usr/lib/cmake-3.22.6-linux-x86_64/bin/cmake --build /opt/tiger/TransformerEngine/build/cmake --parallel 1

@timmoon10
Collaborator

Hm, I'd expect most systems could handle building with MAX_JOBS=1. I wonder if we could get more clues if you build with verbose output (pip install -v -v .).
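
A sketch of capturing that double-verbose output to a file so the last compile step before the hang is easy to spot (the tee pipeline and the log file name are illustrative additions, not part of the original suggestion):

# Sketch only: print the verbose build log and save a copy for inspection.
pip install -v -v . 2>&1 | tee build.log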

@AdrLfv

AdrLfv commented Aug 14, 2024

I have a similar problem. With MAX_JOBS=1 it gets stuck after 6/24, and otherwise it gets stuck after 8/24, building transpose_fusion.cu.o. My whole computer freezes and I have to reboot manually. I use CUDA 12.5 and an RTX 3060.
I also tried to limit the number of threads with export MAKEFLAGS="-j2", but without success.

CMake Warning:
Manually-specified variables were not used by the project:

  pybind11_DIR

-- Build files have been written to: /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
Running command /usr/bin/cmake --build /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
[1/32] Building CXX object CMakeFiles/transformer_engine.dir/transformer_engine.cpp.o
[2/32] Building CUDA object CMakeFiles/transformer_engine.dir/gemm/cublaslt_gemm.cu.o
[3/32] Building CXX object CMakeFiles/transformer_engine.dir/layer_norm/ln_api.cpp.o
[4/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose.cu.o
[5/32] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
[6/32] Building CXX object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_api.cpp.o
[7/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose.cu.o
[8/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose_fusion.cu.o

@NeedsMoar

To give people an idea: on older versions of the build file that specified NVCC_THREADS=4, the default build of Flash Attention by itself, on a 32-core/64-thread Threadripper Pro 5975WX with 512 GB of RAM, peaks at about 260 GB of RAM use and takes something like 6 minutes. If you don't have that much RAM, it will most likely thrash too much to build in a reasonable time. The current default build should use around 128 GB of RAM, which is still more than most motherboards support. It will be unbuildable on this machine as soon as I install CUDA 12.8 and it defaults to building 4 different architectures (none of which apply to me). Most of the compiled files finish quickly, but the terminal output will appear completely frozen until the very long-running one is done, and some of the memory may not be freed until it finishes, which will disrupt other short-running jobs if you have little RAM.

By default, even if MAX_JOBS is set, the Flash Attention build passes --nvcc_threads=2 to the toolchain, which in practice seems to double the amount of memory used, since it will try to build multiple architectures in parallel unless you've turned them off. Setting the environment variable NVCC_THREADS=1 fixes that. You should still expect high RAM usage, but if you have a reasonably high-RAM machine that isn't as insane as mine, doing that should let you set the arch as below and skip the MAX_JOBS line. In my experience it needs roughly core count * 1 GB * (the lesser of NVCC_THREADS or the number of architectures) of free memory to build, but sometimes NVCC_THREADS=2 still eats memory even when there's only one arch; it's only supposed to thread across architectures, and there's really only one, so I suspect some subprocess isn't freeing memory.
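
To make that rule of thumb concrete, a rough sketch with illustrative numbers (the 16-core figure and the PyPI install command are assumptions, not measurements from this thread):

# Rough free-memory estimate from the rule of thumb above:
#   cores * 1 GB * min(NVCC_THREADS, number of architectures)
# e.g. 16 cores, NVCC_THREADS=1, one architecture  -> ~16 GB free RAM
#      16 cores, NVCC_THREADS=2, two architectures -> ~32 GB free RAM
MAX_JOBS=16 NVCC_THREADS=1 pip install flash-attn --no-build-isolation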

Unfortunately, the selection of architectures to build is hardcoded to the server variants starting at A100; most mere humans can't afford any of the server modules and never will without scamming some serious grant money out of somebody. For that matter, most people can't afford half of the Ada or Blackwell "consumer" lineups, but that's another story. Building PTX for 80 and 90 (and 100 and 120 after installing CUDA 12.8) is pointless. If you're on Ampere, you can cut the Flash Attention setup.py lines at around line 178 down to:

cc_flag.append("-arch")
cc_flag.append("sm_86")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=sm_86")

I'm not sure if the "-arch" flag and the next line are actually necessary since the build file doesn't seem to set it, and online info suggests that a PTX arch should additionally be set with

cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=compute_86")

But I don't know how necessary this is if you're only running binary code on a single arch.

With Ada it's:
cc_flag.append("-arch")
cc_flag.append("sm_89")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_89,code=sm_89")

Setting the environment variable FLASH_ATTN_CUDA_ARCHS won't work for this. For some reason it's only used as a hardcoded check in the series of if clauses that look at the CUDA version and the values 80, 90, 100, and 120. Setting it to just 80 isn't the end of the world for consumer Ampere, but setting it to 90 won't enable anything new that Ada and Hopper both support, since the versions aren't backwards compatible.

I haven't looked much into what the Transformer Engine build system might be doing that chews up memory, or which variables need to be set, since it doesn't build on Windows.

@johnnynunez

johnnynunez commented Jan 29, 2025

Simpler:

MAX_JOBS=12 \
NVTE_FRAMEWORK=pytorch \
NVTE_CUDA_ARCHS=120 \
python3 setup.py bdist_wheel --dist-dir=/opt/transformer_engine/wheels
pip3 install --no-cache-dir --verbose /opt/transformer_engine/wheels/transformer_engine*.whl

@ksivaman
Member

@johnnynunez NVTE_CUDA_ARCHS must be 120 instead of 12.0.

@johnnynunez
Copy link

johnnynunez commented Jan 29, 2025

@ksivaman Yeah, I wrote it quickly from my phone, sorry.
