TransformerEngine v1.2.1 throws CuDNN frontend error on H100 GPU (AWS p5.48xlarge instance) #651
Comments
Hi @sirutBuasai, what is the cuDNN version you are using?
cuDNN installed with
Hi @sirutBuasai, could you try upgrading to cuDNN 8.9.7+ please?
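For anyone checking their own setup against the 8.9.7+ suggestion, here is a minimal sketch of decoding a cuDNN version integer and comparing it to a minimum. It assumes the integer follows the `CUDNN_VERSION` macro layout (major*1000 + minor*100 + patch before cuDNN 9, major*10000 + minor*100 + patch from cuDNN 9 on). The helper names `decode_cudnn_version` and `meets_minimum` are hypothetical and are not part of TE or PyTorch.

```python
from typing import Tuple


def decode_cudnn_version(value: int) -> Tuple[int, int, int]:
    """Split a cuDNN version integer into (major, minor, patch).

    Assumption: values below 90000 use the cuDNN 8-style encoding
    (major*1000 + minor*100 + patch); values from 90000 up use the
    cuDNN 9-style encoding (major*10000 + minor*100 + patch).
    """
    if value >= 90000:
        major = value // 10000
        minor = (value % 10000) // 100
    else:
        major = value // 1000
        minor = (value % 1000) // 100
    patch = value % 100
    return (major, minor, patch)


def meets_minimum(value: int, minimum: Tuple[int, int, int]) -> bool:
    """Return True if the decoded version is at least `minimum`."""
    return decode_cudnn_version(value) >= minimum


if __name__ == "__main__":
    # 8902 corresponds to cuDNN 8.9.2, the version mentioned in this thread.
    print(decode_cudnn_version(8902), meets_minimum(8902, (8, 9, 7)))
```

With PyTorch installed, `torch.backends.cudnn.version()` returns such an integer (or `None` if cuDNN is unavailable) and can be passed to `meets_minimum` directly.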
Will do. In the meantime, is there a TE version that is built with cuDNN 8.9.2?
I think it's probably v0.10, but I'd rather you roll forward with cuDNN than backward with TE. There's been a lot of development in the last year or so. If it's easier, you can use the NGC PyTorch container, which has the latest TE (1.3) and cuDNN (9.0): nvcr.io/nvidia/pytorch:24.01-py3
@cyanguwa I think we should still catch this error from the cuDNN Frontend and just disable cuDNN's implementation of attention in this case.
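To illustrate the "catch the error and disable that backend" idea in general terms, here is a rough sketch of a wrapper that tries a primary implementation and permanently switches to a fallback after the first failure. This is not TE's actual code: `with_fallback`, `primary`, and `fallback` are hypothetical names, and catching `RuntimeError` is an assumption about how such a backend failure would surface in Python.

```python
from typing import Any, Callable, Tuple, Type


def with_fallback(
    primary: Callable[..., Any],
    fallback: Callable[..., Any],
    exc_types: Tuple[Type[BaseException], ...] = (RuntimeError,),
) -> Callable[..., Any]:
    """Wrap `primary` so that a failure disables it in favour of `fallback`.

    After the first exception of a type in `exc_types`, `primary` is no
    longer attempted and every later call goes straight to `fallback`.
    """
    disabled = False

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        nonlocal disabled
        if not disabled:
            try:
                return primary(*args, **kwargs)
            except exc_types:
                # Remember the failure so the failing path is not retried.
                disabled = True
        return fallback(*args, **kwargs)

    return wrapper


if __name__ == "__main__":
    def fused(x: int) -> int:  # hypothetical stand-in for a failing fused kernel
        raise RuntimeError("backend error")

    def unfused(x: int) -> int:  # hypothetical stand-in for a slower safe path
        return x * 2

    attention = with_fallback(fused, unfused)
    print(attention(2), attention(3))
```

The design choice here is to disable the failing path after one error rather than retry on every call, so a persistent backend incompatibility costs one failed attempt instead of one per step.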
@sirutBuasai Was your problem solved? Could you tell me the solution? I am running into the same problem.
@liu21yd, we ended up using TE v0.10, but it is pretty old. I haven't tried upgrading cuDNN and TE together, but that would be a place to start.
Recently we observed similar issues with every combination of TE 1.4/1.7 and cuDNN 8.9.4/8.9.7. In our case, the fused_attn test in this repository also fails, and the frontend toolkit (Megatron-LM) does not work either. Note that our operating system is Rocky, not a Debian-based one. As a workaround we eventually set
Hi @sirutBuasai, I haven't gone back to TE 1.2.1, but I just tried TE 1.11.0 with the PyTorch 24.10 container, and it seems to work. Would you be able to upgrade to this combination?
Container:
Install Megatron:
Script:
Results:
Hi, we are currently running into a TransformerEngine-related error when running a GPT model on an H100 GPU (AWS p5.48xlarge).
Below is the error log.
Error:
Steps to reproduce:
1. conda env create -f megatron_bench.yml and conda activate megatron_bench
2. install_deps.sh
3. train.sh
megatron_bench.yml:
install_deps.sh:
train.sh: