
Add option to disable MMA support on Turing #15360


Closed
pt13762104 wants to merge 4 commits from the turing-disable-mma branch

Conversation


@pt13762104 commented Aug 16, 2025

This helps performance on the GTX 16 series (Turing GPUs without tensor cores): code that uses the tensor core (MMA) instructions does not appear to run at anywhere near full speed on these cards.
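As a rough illustration of how such a switch might be used at configure time (the option name below is a placeholder invented for this sketch, not necessarily what the PR adds):

```sh
# Hypothetical example -- GGML_CUDA_NO_TURING_MMA is a placeholder name, and
# GGML_CUDA=ON is assumed to be the flag that enables the CUDA backend.
# The idea: keep building for Turing, but skip the MMA (tensor core) kernels so
# GTX 16 series cards fall back to the MMQ/dp4a path instead.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_TURING_MMA=ON
cmake --build build --config Release -j
```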

Perplexity

| Original | This PR |
| --- | --- |
| 11.6842 ± 0.09676 | 11.6837 ± 0.09676 |

Performance (built with GGML_CUDA_FORCE_MMQ unless noted otherwise)

Original (no GGML_CUDA_FORCE_MMQ)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 249.20 ± 0.05 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 260.86 ± 3.81 |

Original

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 261.56 ± 1.26 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 282.52 ± 0.04 |

Original (CMAKE_CUDA_ARCHITECTURES=61)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 633.76 ± 1.58 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 840.55 ± 4.35 |

This PR (no GGML_CUDA_FORCE_MMQ)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 796.90 ± 1.07 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 844.69 ± 3.31 |

This PR

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 918.69 ± 2.68 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 985.77 ± 2.17 |

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 16, 2025
@JohannesGaessler self-requested a review on Aug 16, 2025 at 16:09
@JohannesGaessler (Collaborator)

You should be able to get the same effect on master by compiling with CMAKE_CUDA_ARCHITECTURES=61. I think it would make more sense to check the names of the available GPUs and print a warning if a GTX 16 series card is used with code compiled for CC 7.0 or 7.5. It will not be possible to use a GTX 16 series card optimally in conjunction with other Turing GPUs, but that would require a more general refactor of the code and would, I think, add a lot of complexity for a setup that basically no one will ever use.
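For reference, the master-branch workaround described above looks roughly like this; GGML_CUDA=ON is assumed here to be the flag that enables the CUDA backend, and the other two flags are the ones quoted in this thread:

```sh
# Build only for CC 6.1 so the Turing/MMA kernels are never compiled,
# and force the quantized MMQ kernels.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FORCE_MMQ=1
cmake --build build --config Release -j
```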

@pt13762104 (Author)

With -DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FORCE_MMQ=1 I get performance roughly equal to 36201c6.
If I disable GGML_CUDA_FORCE_MMQ, performance immediately drops to about 0.7 TFLOPS (Llama 2 7B results below):

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 47.15 ± 0.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 44.60 ± 0.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 47.10 ± 0.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 43.74 ± 0.75 |

@pt13762104 closed this on Aug 21, 2025
@pt13762104 deleted the turing-disable-mma branch on Aug 21, 2025 at 06:27
@pt13762104 restored the turing-disable-mma branch on Aug 21, 2025 at 06:27