
Add option to disable MMA support on Turing #15360


Closed
pt13762104 wants to merge 4 commits from the turing-disable-mma branch

Conversation


@pt13762104 commented Aug 16, 2025

This helps performance on the GTX 16 series (Turing GPUs without tensor cores): code that uses the tensor core (MMA) instructions does not appear to run at anywhere near full speed on these cards.
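As a rough illustration of how such a switch might be used at configure time (the option name below is a placeholder invented for this sketch, not necessarily what the PR adds):

```sh
# Hypothetical example -- GGML_CUDA_NO_TURING_MMA is a placeholder name, and
# GGML_CUDA=ON is assumed to be the flag that enables the CUDA backend.
# The idea: keep building for Turing, but skip the MMA (tensor core) kernels so
# GTX 16 series cards fall back to the MMQ/dp4a path instead.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_TURING_MMA=ON
cmake --build build --config Release -j
```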

Perplexity

| Original | This PR |
| --- | --- |
| 11.6842 ± 0.09676 | 11.6837 ± 0.09676 |

Performance (built with GGML_CUDA_FORCE_MMQ unless noted otherwise)

Original (no GGML_CUDA_FORCE_MMQ)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 249.20 ± 0.05 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 260.86 ± 3.81 |

Original

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 261.56 ± 1.26 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 282.52 ± 0.04 |

Original (CMAKE_CUDA_ARCHITECTURES=61)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 633.76 ± 1.58 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 840.55 ± 4.35 |

This PR (no GGML_CUDA_FORCE_MMQ)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 796.90 ± 1.07 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 844.69 ± 3.31 |

This PR

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 0 | pp512 | 918.69 ± 2.68 |
| qwen3 4B Q8_0 | 3.98 GiB | 4.02 B | CUDA | 99 | 1 | pp512 | 985.77 ± 2.17 |

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 16, 2025
@JohannesGaessler self-requested a review on Aug 16, 2025 at 16:09
@JohannesGaessler (Collaborator)

You should be able to get the same effect on master by compiling with CMAKE_CUDA_ARCHITECTURES=61. I think it would make more sense to check the names of the available GPUs and print a warning if a GTX 16 series card is used with code compiled for CC 7.0 or 7.5. It will not be possible to use a GTX 16 series card optimally in conjunction with other Turing GPUs, but that would require a more general refactor of the code and would, I think, add a lot of complexity for a setup that basically no one will ever use.
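For reference, the master-branch workaround described above looks roughly like this; GGML_CUDA=ON is assumed here to be the flag that enables the CUDA backend, and the other two flags are the ones quoted in this thread:

```sh
# Build only for CC 6.1 so the Turing/MMA kernels are never compiled,
# and force the quantized MMQ kernels.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FORCE_MMQ=1
cmake --build build --config Release -j
```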

@pt13762104 (Author)

With -DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FORCE_MMQ=1 I get performance roughly equal to 36201c6.
If I disable GGML_CUDA_FORCE_MMQ, performance immediately drops to about 0.7 TFLOPS (Llama 2 7B results below):

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | pp512 | 47.15 ± 0.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 0 | tg128 | 44.60 ± 0.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 47.10 ± 0.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 43.74 ± 0.75 |

@pt13762104 closed this on Aug 21, 2025
@pt13762104 deleted the turing-disable-mma branch on Aug 21, 2025 at 06:27
@pt13762104 restored the turing-disable-mma branch on Aug 21, 2025 at 06:27