
[BUG] Mixed Precision Gemm Correctness Regression in Cutlass 3.7/3.8 #2070

Open
jwfromm opened this issue Jan 29, 2025 · 1 comment
Labels: "? - Needs Triage", "bug" (Something isn't working)

Comments

jwfromm (Contributor) commented Jan 29, 2025

Describe the bug
Since CUTLASS 3.7, mixed-input-dtype GEMMs produce less accurate outputs than they did in CUTLASS 3.6. The loss of accuracy is substantial and makes mixed-input GEMMs impractical for real use cases.

Specifically, we have a collection of mixed-input GEMMs in FBGEMM that work well on CUTLASS 3.6. These kernels compile fine with newer versions of CUTLASS (after small API updates), but they produce garbage outputs.

Directly copying the BF16 x I4 GEMM from example 55 produces slightly better results, but the outputs are still much less accurate than their 3.6 equivalents.

Steps/Code to reproduce bug
We use this benchmarking script to measure the performance and accuracy of kernels. The script can be run with these sample arguments:

python quantize_bench.py --kernels=bf16_baseline,cutlass_bf16i4_rowwise --M=128 --N=2048 --K=2048

This will produce an output like this:

bf16_baseline sim: 0.000.
bf16_baseline ms: 0.007.
bf16_baseline TFLOPS: 150.635.
bf16_baseline GB/s: 1323.942.
cutlass_bf16i4_rowwise sim: 28.375.
cutlass_bf16i4_rowwise ms: 0.013.
cutlass_bf16i4_rowwise TFLOPS: 79.561.
cutlass_bf16i4_rowwise GB/s: 233.089.

The sim metric is the L1 distance from the BF16 baseline output, so lower is better. After updating to CUTLASS 3.7, copying example 55, and running the same script, we get:

bf16_baseline sim: 0.000.
bf16_baseline ms: 0.007.
bf16_baseline TFLOPS: 150.563.
bf16_baseline GB/s: 1323.308.
cutlass_bf16i4_rowwise sim: 328.000.
cutlass_bf16i4_rowwise ms: 0.013.
cutlass_bf16i4_rowwise TFLOPS: 80.016.
cutlass_bf16i4_rowwise GB/s: 234.421.

The sim value for cutlass_bf16i4_rowwise rises from 28.375 to 328.000, roughly 11.6 times larger, so the output is clearly less accurate. The updated version of the kernel can be found at this PR.
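For readers unfamiliar with the sim metric, here is a minimal sketch of an L1 distance between a kernel's output and the BF16 baseline. It is not the code from quantize_bench.py: the function name `l1_distance` is hypothetical, the inputs are plain nested lists rather than tensors, and whether the real script sums or averages the absolute differences is not stated in this issue (this sketch assumes a sum).

```python
# Hypothetical sketch of an L1-style "sim" metric.
# Assumption: sum of absolute differences; the real script may normalize differently.

def l1_distance(output, baseline):
    """Return the sum of absolute element-wise differences of two 2-D lists."""
    if len(output) != len(baseline):
        raise ValueError("row count mismatch")
    total = 0.0
    for out_row, base_row in zip(output, baseline):
        if len(out_row) != len(base_row):
            raise ValueError("column count mismatch")
        total += sum(abs(o - b) for o, b in zip(out_row, base_row))
    return total


baseline = [[1.0, 2.0], [3.0, 4.0]]
identical = [[1.0, 2.0], [3.0, 4.0]]
perturbed = [[1.5, 2.0], [3.0, 3.0]]

print(l1_distance(identical, baseline))  # 0.0, same as the bf16_baseline row
print(l1_distance(perturbed, baseline))  # 1.5 (0.5 + 1.0)
```

Under this definition the baseline compared against itself scores 0.000, which matches the bf16_baseline rows in both runs above.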

Expected behavior
The accuracy of mixed-input kernels should not change across CUTLASS updates.
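One way to turn this expectation into an automated check is to parse the benchmark output from both versions and compare the sim values. The sketch below assumes the `<kernel> <metric>: <value>.` line format shown in the outputs above; the helper names `parse_bench_output` and `sim_regressed` and the `max_ratio` threshold are hypothetical and not part of quantize_bench.py.

```python
import re

# Matches lines such as "cutlass_bf16i4_rowwise sim: 28.375."
_LINE = re.compile(r"^(\S+) (sim|ms|TFLOPS|GB/s): ([0-9]+(?:\.[0-9]+)?)\.?$")


def parse_bench_output(text):
    """Parse benchmark output into {kernel: {metric: value}}."""
    results = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            kernel, metric, value = match.groups()
            results.setdefault(kernel, {})[metric] = float(value)
    return results


def sim_regressed(old, new, kernel, max_ratio=2.0):
    """True if the new sim exceeds max_ratio times the old sim (hypothetical threshold)."""
    return new[kernel]["sim"] > max_ratio * old[kernel]["sim"]


# Values copied from the two runs reported in this issue.
OUTPUT_36 = """\
bf16_baseline sim: 0.000.
cutlass_bf16i4_rowwise sim: 28.375.
cutlass_bf16i4_rowwise ms: 0.013.
"""
OUTPUT_37 = """\
bf16_baseline sim: 0.000.
cutlass_bf16i4_rowwise sim: 328.000.
cutlass_bf16i4_rowwise ms: 0.013.
"""

old = parse_bench_output(OUTPUT_36)
new = parse_bench_output(OUTPUT_37)
kernel = "cutlass_bf16i4_rowwise"
ratio = new[kernel]["sim"] / old[kernel]["sim"]

print(f"sim ratio (3.7 / 3.6): {ratio:.2f}")
print("regressed:", sim_regressed(old, new, kernel))
```

A ratio threshold is used here instead of an absolute tolerance because the acceptable sim value depends on the problem size; the right threshold for a real CI check would need to be chosen by the kernel maintainers.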

Environment details
CUDA 12.4, driver version 535.154.05, on a Linux system with 8x H100 GPUs.

jwfromm added the "? - Needs Triage" and "bug" labels on Jan 29, 2025
hwu36 (Collaborator) commented Feb 4, 2025

@IwakuraRein
