How can we integrate the DeepGEMM FP8 GEMM implementation into TE's block-wise scaling? #1509
We will not add DeepGEMM to TE because it lacks the GEMM needed for wgrad (1x128 by 1x128) in backpropagation, and its JIT mechanism adds non-negligible overhead in training. We are landing a DeepSeek-V3-like FP8 recipe (1x128 scaling for activations and 128x128 for weights) in TE, and we will use the block-wise GEMM from cuBLAS (to be released in CUDA 12.9), which offers performance comparable to DeepGEMM, supports both 1D2D (1x128 by 128x128) and 1D1D (1x128 by 1x128), and avoids the JIT overhead.
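To make the scaling granularity concrete, here is a minimal PyTorch sketch (not TE's API; the function names and shapes are illustrative assumptions) of the recipe described above: per-1x128-block scales for activations, per-128x128-tile scales for weights, FP8 E4M3 storage, and a dequantized reference matmul to sanity-check the round trip.

```python
import torch

FP8_MAX = 448.0  # largest representable magnitude of float8_e4m3fn

def quantize_act_1x128(x: torch.Tensor):
    """Per-row, per-128-column block scaling for activations (1x128 blocks)."""
    M, K = x.shape
    blocks = x.view(M, K // 128, 128)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX                      # one scale per 1x128 block
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.view(M, K), scale.view(M, K // 128)

def quantize_weight_128x128(w: torch.Tensor):
    """Per-tile scaling for weights (128x128 tiles)."""
    N, K = w.shape
    tiles = w.view(N // 128, 128, K // 128, 128)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX                      # one scale per 128x128 tile
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q.view(N, K), scale.view(N // 128, K // 128)

# Reference check: dequantize and compare against the plain BF16 matmul.
x = torch.randn(256, 1024, dtype=torch.bfloat16, device="cuda")
w = torch.randn(512, 1024, dtype=torch.bfloat16, device="cuda")
xq, xs = quantize_act_1x128(x)
wq, ws = quantize_weight_128x128(w)
x_deq = (xq.to(torch.bfloat16).view(256, 8, 128) * xs.unsqueeze(-1)).view(256, 1024)
w_deq = (wq.to(torch.bfloat16).view(4, 128, 8, 128) * ws.view(4, 1, 8, 1)).view(512, 1024)
print(torch.nn.functional.mse_loss(x_deq @ w_deq.t(), x @ w.t()))
```

In the actual recipe the quantized operands and their per-block scales would be fed directly to a block-scaled FP8 GEMM (DeepGEMM or the cuBLAS kernel mentioned above) rather than being dequantized first.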
Hi Xin, could we expect an ETA on this?
See PR #1559. Note that you will need CUDA 12.9 to run it, because the groupwise/blockwise FP8 GEMM ships with CUDA 12.9 (ETA early April).
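A quick sanity check one might run before trying the PR (a hedged sketch; it only inspects the CUDA toolkit PyTorch was built against, not the cuBLAS runtime itself):

```python
import torch

# The CUDA toolkit version PyTorch was compiled against; the blockwise FP8
# cuBLAS GEMM described above is only available from CUDA 12.9 onward.
major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
assert (major, minor) >= (12, 9), (
    f"Built against CUDA {torch.version.cuda}; blockwise FP8 GEMM needs CUDA >= 12.9"
)
```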
Could you share a preview release of CUDA 12.9? We just want to try the block-wise FP8 GEMM functionality.
Hi, how can we plug other CuTe/CUTLASS operators into TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with fine-grained scaling as proposed in DeepSeek-V3.