How can we integrate the DeepGEMM FP8 GEMM implementation into TE's block-wise scaling? #1509
We will not add DeepGEMM to TE because it lacks the GEMM needed for wgrad (1x128 by 1x128) in backpropagation, and its JIT mechanism adds non-negligible overhead in training. We are landing a DeepSeek-V3-like FP8 recipe (1x128 scaling for activations and 128x128 for weights) in TE, and we will use the block-wise GEMM from cuBLAS (to be released in CUDA 12.9), which offers performance comparable to DeepGEMM, supports both 1D2D (1x128 by 128x128) and 1D1D (1x128 by 1x128), and avoids the JIT overhead.
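To make the scaling granularity concrete, here is a minimal PyTorch sketch (not TE's API; the function names and shapes are illustrative assumptions) of the recipe described above: per-1x128-block scales for activations, per-128x128-tile scales for weights, FP8 E4M3 storage, and a dequantized reference matmul to sanity-check the round trip.

```python
import torch

FP8_MAX = 448.0  # largest representable magnitude of float8_e4m3fn

def quantize_act_1x128(x: torch.Tensor):
    """Per-row, per-128-column block scaling for activations (1x128 blocks)."""
    M, K = x.shape
    blocks = x.view(M, K // 128, 128)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX                      # one scale per 1x128 block
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.view(M, K), scale.view(M, K // 128)

def quantize_weight_128x128(w: torch.Tensor):
    """Per-tile scaling for weights (128x128 tiles)."""
    N, K = w.shape
    tiles = w.view(N // 128, 128, K // 128, 128)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX                      # one scale per 128x128 tile
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q.view(N, K), scale.view(N // 128, K // 128)

# Reference check: dequantize and compare against the plain BF16 matmul.
x = torch.randn(256, 1024, dtype=torch.bfloat16, device="cuda")
w = torch.randn(512, 1024, dtype=torch.bfloat16, device="cuda")
xq, xs = quantize_act_1x128(x)
wq, ws = quantize_weight_128x128(w)
x_deq = (xq.to(torch.bfloat16).view(256, 8, 128) * xs.unsqueeze(-1)).view(256, 1024)
w_deq = (wq.to(torch.bfloat16).view(4, 128, 8, 128) * ws.view(4, 1, 8, 1)).view(512, 1024)
print(torch.nn.functional.mse_loss(x_deq @ w_deq.t(), x @ w.t()))
```

In the actual recipe the quantized operands and their per-block scales would be fed directly to a block-scaled FP8 GEMM (DeepGEMM or the cuBLAS kernel mentioned above) rather than being dequantized first.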
Hi Xin, could we expect an ETA on this?
See PR #1559. Note that you will need CUDA 12.9 to run it, because the groupwise/blockwise FP8 GEMM ships with CUDA 12.9 (ETA early April).
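A quick sanity check one might run before trying the PR (a hedged sketch; it only inspects the CUDA toolkit PyTorch was built against, not the cuBLAS runtime itself):

```python
import torch

# The CUDA toolkit version PyTorch was compiled against; the blockwise FP8
# cuBLAS GEMM described above is only available from CUDA 12.9 onward.
major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
assert (major, minor) >= (12, 9), (
    f"Built against CUDA {torch.version.cuda}; blockwise FP8 GEMM needs CUDA >= 12.9"
)
```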
Could you share a preview release of CUDA 12.9? We just want to try the block-wise FP8 GEMM functionality.
Hi, how can we plug other CuTe/CUTLASS operators into TE? For example, the GEMM from DeepGEMM, a library designed for FP8 GEMMs with fine-grained scaling as proposed in DeepSeek-V3.