
Get Stuck at Building Wheel #976

Open
kingformatty opened this issue Jun 27, 2024 · 9 comments

Labels
build Build system

Comments

@kingformatty

Hi, has anyone faced the problem of the installation getting stuck at building the wheel?

@timmoon10 added the build Build system label on Jun 27, 2024
@timmoon10
Collaborator

timmoon10 commented Jun 27, 2024

Can you share more information on your configuration, especially which DL framework you're building with? Passing the --verbose flag to pip install would also provide more useful build logs. A hang makes me suspect your system is over-parallelizing the build process:

  • If the hang happens while building Flash Attention or transformer_engine_torch, then it's a failure while building a PyTorch extension. Try setting MAX_JOBS=1 in the environment (see this note). Note that building Flash Attention is especially resource-intensive and can experience problems even on relatively powerful systems.
  • If the hang happens in CMake, then it's a failure in a Ninja build. We currently don't have a nice way to reduce the number of parallel Ninja jobs, but it is something we should prioritize if it is causing a problem (pinging @phu0ngng). You could try setting CMAKE_BUILD_PARALLEL_LEVEL=1 in the environment.
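
For reference, a minimal sketch of a fully serialized build that combines the two variables above (the source-checkout path is an assumption; --verbose is the flag mentioned earlier for more useful logs):

# Sketch only: one compile job for the PyTorch extensions (MAX_JOBS) and one
# for the CMake/Ninja build (CMAKE_BUILD_PARALLEL_LEVEL), with verbose pip logs.
cd TransformerEngine
MAX_JOBS=1 CMAKE_BUILD_PARALLEL_LEVEL=1 pip install --verbose .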

@timmoon10
Collaborator

With #987, you can control the number of parallel build jobs with the MAX_JOBS environment variable.

@ZSL98

ZSL98 commented Aug 1, 2024

Same problem.
Specifically, it got stuck at Running command /usr/lib/cmake-3.22.6-linux-x86_64/bin/cmake --build /opt/tiger/TransformerEngine/build/cmake --parallel 1

@timmoon10
Collaborator

Hm, I'd expect most systems could handle building with MAX_JOBS=1. I wonder if we could get more clues if you build with verbose output (pip install -v -v .).
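
A sketch of capturing that double-verbose output to a file so the last compile step before the hang is easy to spot (the tee pipeline and the log file name are illustrative additions, not part of the original suggestion):

# Sketch only: print the verbose build log and save a copy for inspection.
pip install -v -v . 2>&1 | tee build.log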

@AdrLfv

AdrLfv commented Aug 14, 2024

I have a similar problem. With MAX_JOBS=1 it gets stuck after 6/24, and otherwise it gets stuck after 8/24, building transpose_fusion.cu.o. My whole computer freezes and I have to reboot manually. I use CUDA 12.5 and an RTX 3060.
I also tried to limit the number of threads with export MAKEFLAGS="-j2", but without success.

CMake Warning:
Manually-specified variables were not used by the project:

  pybind11_DIR

-- Build files have been written to: /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
Running command /usr/bin/cmake --build /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
[1/32] Building CXX object CMakeFiles/transformer_engine.dir/transformer_engine.cpp.o
[2/32] Building CUDA object CMakeFiles/transformer_engine.dir/gemm/cublaslt_gemm.cu.o
[3/32] Building CXX object CMakeFiles/transformer_engine.dir/layer_norm/ln_api.cpp.o
[4/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose.cu.o
[5/32] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
[6/32] Building CXX object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_api.cpp.o
[7/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose.cu.o
[8/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose_fusion.cu.o

@NeedsMoar

To give people an idea: on older versions of the build file that specified NVCC_THREADS=4, the default build of Flash Attention by itself, on a 32-core/64-thread Threadripper Pro 5975WX with 512 GB of RAM, peaks at about 260 GB of RAM use and takes something like 6 minutes. If you don't have that much RAM, it will most likely thrash too much to build in a reasonable time. The current default build should use around 128 GB of RAM, which is still more than most motherboards support. It will be unbuildable on this machine as soon as I install CUDA 12.8 and it defaults to building 4 different architectures (none of which apply to me). Most of the compiled files finish quickly, but the terminal output will appear completely frozen until the very long-running one is done, and some of the memory may not be freed until it finishes, which will disrupt other short-running jobs if you have little RAM.

By default, even if MAX_JOBS is set, the Flash Attention build passes --nvcc_threads=2 to the toolchain, which in practice seems to double the amount of memory used, since it will try to build multiple architectures in parallel unless you've turned them off. Setting the environment variable NVCC_THREADS=1 fixes that. You should still expect high RAM usage, but if you have a reasonably high-RAM machine that isn't as insane as mine, doing that should let you set the arch as below and skip the MAX_JOBS line. In my experience it needs roughly core count * 1 GB * (the lesser of NVCC_THREADS or the number of architectures) of free memory to build, but sometimes NVCC_THREADS=2 still eats memory even when there's only one arch; it's only supposed to thread across architectures, and there's really only one, so I suspect some subprocess isn't freeing memory.
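
To make that rule of thumb concrete, a rough sketch with illustrative numbers (the 16-core figure and the PyPI install command are assumptions, not measurements from this thread):

# Rough free-memory estimate from the rule of thumb above:
#   cores * 1 GB * min(NVCC_THREADS, number of architectures)
# e.g. 16 cores, NVCC_THREADS=1, one architecture  -> ~16 GB free RAM
#      16 cores, NVCC_THREADS=2, two architectures -> ~32 GB free RAM
MAX_JOBS=16 NVCC_THREADS=1 pip install flash-attn --no-build-isolation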

Unfortunately, the selection of architectures to build is hardcoded to the server variants starting at A100; most mere humans can't afford any of the server modules and never will without scamming some serious grant money out of somebody. For that matter, most people can't afford half of the Ada or Blackwell "consumer" lineups, but that's another story. Building PTX for 80 and 90 (and 100 and 120 after installing CUDA 12.8) is pointless. If you're on Ampere, you can cut the Flash Attention setup.py lines at around line 178 down to:

cc_flag.append("-arch")
cc_flag.append("sm_86")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=sm_86")

I'm not sure if the "-arch" flag and the next line are actually necessary since the build file doesn't seem to set it, and online info suggests that a PTX arch should additionally be set with

cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=compute_86")

But I don't know how necessary this is if you're only running binary code on a single arch.

With Ada it's:
cc_flag.append("-arch")
cc_flag.append("sm_89")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_89,code=sm_89")

Setting the environment variable FLASH_ATTN_CUDA_ARCHS won't work for this. For some reason it's only used as a hardcoded check in the series of if clauses that look at the CUDA version and the values 80, 90, 100, and 120. Setting it to just 80 isn't the end of the world for consumer Ampere, but setting it to 90 won't enable anything new that Ada and Hopper both support, since the versions aren't backwards compatible.

I haven't looked much into what the Transformer Engine build system might be doing that chews up memory, or which variables need to be set, since it doesn't build on Windows.

@johnnynunez

johnnynunez commented Jan 29, 2025

Simpler:

MAX_JOBS=12 \
NVTE_FRAMEWORK=pytorch \
NVTE_CUDA_ARCHS=120 \
python3 setup.py bdist_wheel --dist-dir=/opt/transformer_engine/wheels
pip3 install --no-cache-dir --verbose /opt/transformer_engine/wheels/transformer_engine*.whl

@ksivaman
Member

@johnnynunez NVTE_CUDA_ARCHS must be 120 instead of 12.0.

@johnnynunez
Copy link

johnnynunez commented Jan 29, 2025

@ksivaman Yeah, I wrote it quickly from my phone, sorry.
