Conversation

pt13762104
Contributor

Continuation of #15360.

@pt13762104 pt13762104 marked this pull request as draft August 21, 2025 09:08
@pt13762104 pt13762104 marked this pull request as ready for review August 21, 2025 09:20
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Aug 21, 2025
@pt13762104 pt13762104 marked this pull request as draft August 21, 2025 10:21
@JohannesGaessler
Collaborator

In case I haven't made it clear enough: I will not approve this PR. It should already be possible to achieve the exact same effect with the existing compilation options, this just adds additional complexity. Document how to get optimal performance for GTX 1600 instead, ideally by printing a warning when trying to run the code on GTX 1600 with Turing mma enabled.

@IMbackK
Collaborator

IMbackK commented Aug 21, 2025

We could avoid the affected path at runtime. The easiest way would be to check whether the current compile options trigger this problem and use the cuBLAS path if so; presumably that at least performs better.

Since someone could have a regular Turing GPU and a crippled Turing GPU in the same system, I think this is the only really good option.

@JohannesGaessler
Collaborator

The optimal path is to use the dp4a mmq kernels but those aren't being compiled for CC 7.5 because tensor core instructions are available. Quite frankly, the number of GTX 1600 GPUs in circulation is so low and they're so old that I don't think it's worth the maintenance effort to either add dedicated compilation options or refactor the code to compile both dp4a and mma variants. It would be a different story if NVIDIA had just given GTX 1600 a compute capability that allows for differentiation at compile time.

@IMbackK
Collaborator

IMbackK commented Aug 21, 2025

Yeah, I know about the dp4a/mma path. But selecting the cuBLAS path in this situation costs almost nothing in code complexity, since it's always there anyhow, and it at least gives the user something that performs OK and works on mixed Turing/crippled-Turing setups. I don't know about circulation, but I don't find them terribly old yet.

That presumes that cuBLAS performs reasonably well here, of course, which I don't know.

@pt13762104 pt13762104 closed this Aug 21, 2025
@pt13762104 pt13762104 deleted the turing-disable-mma-2 branch August 21, 2025 15:13
@pt13762104
Contributor Author

pt13762104 commented Aug 22, 2025

It seems like `-DCMAKE_CUDA_ARCHITECTURES=61 -DGGML_CUDA_FORCE_MMQ=1` gets the best performance. It is slightly faster than cuBLAS (~8 vs. ~7.5 TFLOPS).
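For reference, a full configure/build invocation using those flags might look like the sketch below (the build directory name is just an example; this is a build-configuration fragment, not part of the PR):

```sh
# Configure llama.cpp to compile only the CC 6.1 (Pascal-level, dp4a) kernels,
# which TU11x GPUs can still execute, and force MMQ over cuBLAS.
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=61 \
      -DGGML_CUDA_FORCE_MMQ=1
cmake --build build --config Release -j
```

With CC 6.1 only, the Turing mma kernels are never compiled, so the problematic code path cannot be selected at runtime.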

@pt13762104
Contributor Author

pt13762104 commented Aug 24, 2025

@JohannesGaessler Should a warning be added, or a fallback path to cuBLAS instead?
Side note: TU11x can be detected by checking the number of SMs: all TU11x devices have an SM count <= 24, while TU10x devices have >= 30 SMs (excluding the TU106-based GTX 1650, which doesn't have tensor cores anyway).

@JohannesGaessler
Collaborator

I think the least bad behavior is to check the compute capabilities, device name, and compilation options during the initialization of the CUDA backend. If the CC and device name match the GTX 1600 GPUs, print a warning that best performance requires different compilation options.

@pt13762104
Contributor Author

Is this good enough? pt13762104@4b40a71

@JohannesGaessler
Collaborator

  • Do the check using the device name instead of the SM count.
  • Explicitly print the name of the device with suboptimal performance.

@pt13762104
Contributor Author

I think all TU11x GPUs have this problem, not just the 16 series.

@JohannesGaessler
Collaborator

To my knowledge there are only the MX450 and the MX550 as other affected GPUs. It's fine to just check whether GTX 16XX or MX 450 or MX 550 is in the GPU name.

@pt13762104
Contributor Author

pt13762104 commented Aug 25, 2025

pt13762104@e236d2e One last question: it seems these compilation options will be unsupported in CUDA 13 (Pascal support is removed entirely), so should people keep using CUDA 12 to compile the code? (edit: given that the Windows build still uses CUDA 12.4, probably yes.)

@JohannesGaessler
Collaborator

Looks good, I would just change the wording a little. If you make a PR I can make suggestions for such changes directly in the Github UI.
