Get Stuck at Building Wheel #976
Comments
Can you share more information on your configuration, especially which DL framework you're building with? Passing the
With #987, you can control the number of parallel build jobs with the MAX_JOBS environment variable.
Same problem.
Hm, I'd expect most systems could handle building with a low MAX_JOBS setting.
I have a similar problem. With MAX_JOBS=1 it gets stuck after 6/24, and otherwise it gets stuck after 8/24 while building transpose_fusion.cu.o. My whole computer freezes and I have to reboot manually. I'm using CUDA 12.5 and an RTX 3060.
To give people an idea: the default build of flash-attention by itself, on a 32-core/64-thread Threadripper Pro 5975WX with 512 GB of RAM, using older versions of the build script that specified NVCC_THREADS=4, peaks at about 260 GB of RAM use and takes something like 6 minutes. If you don't have that much RAM it will most likely thrash too much to build in a reasonable time. The current default build should be around 128 GB of RAM use, which is more than most motherboards support, and it will become unbuildable on this machine as soon as I install CUDA 12.8 and it defaults to building four different architectures (none of which apply to me). Most of the compiled files finish quickly, but the terminal output will appear completely frozen until the one very long-running file is done, and some of the memory might not be freed until then, which will mess up other short-running jobs if you have low RAM.

By default, even if MAX_JOBS is set, the flash-attention build will pass --nvcc_threads=2 to the toolchain, which in practice seems to double the amount of memory used, since it will try to build multiple architectures in parallel unless you've turned them off. Setting the environment variable NVCC_THREADS=1 fixes that. You should still expect high RAM usage, but if you have a reasonably high-RAM machine that isn't as extreme as mine, doing that should let you set the arch as below and skip the MAX_JOBS line. In my experience the build needs roughly core count * 1 GB * (the lesser of NVCC_THREADS and the number of architectures) of free memory, but sometimes NVCC_THREADS=2 will still eat memory even when there is only one arch; it's only supposed to thread across architectures, and with just one present I suspect some subprocess isn't freeing memory.

Unfortunately, the selection of architectures to build is hardcoded to the server variants starting at the A100; most mere humans can't afford any of the server modules and never will without scamming some serious grant money out of somebody. For that matter most people can't afford half of the Ada or Blackwell "consumer" lineups either, but that's another story. Building PTX for 80 and 90 (and 100 and 120 after installing CUDA 12.8) is pointless. If you're on Ampere you can cut the flash-attention setup.py lines at around line 178 down to: cc_flag.append("-arch"). I'm not sure whether the "-arch" flag and the line after it are actually necessary, since the build file doesn't seem to set it otherwise, and online information suggests that a PTX arch should additionally be set with cc_flag.append("-gencode"), but I don't know how necessary that is if you're only running binary code on a single arch. With Ada it's the same edit with the Ada architecture values.

Using the environment variable FLASH_ATTN_CUDA_ARCHS to set this won't work; for some reason it's only used as a hardcoded check in the series of if clauses that look at the CUDA version and at 80, 90, 100 and 120. Setting it to just 80 isn't the end of the world for consumer Ampere, but setting it to 90 won't enable anything new that Ada and Hopper both support, since the compute capabilities aren't backwards compatible.

I haven't looked much into what the Transformer Engine build system might be doing that chews up memory, or which variables need to be set, since it doesn't build on Windows.
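As a rough illustration of that setup.py edit, a trimmed arch-flag section might look something like the sketch below. This is a guess at the shape of the change, not code from any particular flash-attention release; compute capability 8.6 (consumer Ampere, e.g. an RTX 3060) is assumed, and Ada would use 8.9 instead.

# Hypothetical trimmed arch-flag section of flash-attention's setup.py;
# the exact surrounding code differs between releases.
cc_flag = []
# Build device binaries (SASS) only for the one architecture you actually run on;
# compute_86 / sm_86 covers consumer Ampere such as the RTX 3060.
cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=sm_86")
# Optionally also embed PTX so newer GPUs could JIT-compile the kernels later;
# skip this if the binary only ever runs on this exact architecture.
# cc_flag.append("-gencode")
# cc_flag.append("arch=compute_86,code=compute_86")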
More simply:

MAX_JOBS=12 \
NVTE_FRAMEWORK=pytorch \
NVTE_CUDA_ARCHS=120 \
python3 setup.py bdist_wheel --dist-dir=/opt/transformer_engine/wheels
pip3 install --no-cache-dir --verbose /opt/transformer_engine/wheels/transformer_engine*.whl
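A note on those values for anyone adapting the command: NVTE_CUDA_ARCHS restricts which compute capabilities Transformer Engine compiles for, so 120 above appears to target compute capability 12.0 (Blackwell-class cards), and NVTE_FRAMEWORK=pytorch selects the PyTorch build. On an older consumer GPU you would presumably swap in your own compute capability (for example 86 for an RTX 3060, 89 for Ada) and keep MAX_JOBS low if RAM is the limiting factor.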
@johnnynunez
@ksivaman yeah, I wrote it quickly from my phone, sorry.
Hi, has anyone else faced the problem of the installation getting stuck at building the wheel?