UPSTREAM PR #22070: ggml-cuda: gate native ue4m3 conversion to sm_90+ (#1358)
Open
Conversation
Signed-off-by: Leonard Hong <[email protected]>
No meaningful performance changes were detected across 46,869 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-tts, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so. 💬 Questions? Tag @loci-dev
Force-pushed from 7638ab4 to f1b46d5
Note
Source pull request: ggml-org/llama.cpp#22070
Summary
This change gates the native CUDA `__nv_fp8_e4m3` conversion path in `ggml_cuda_ue4m3_to_fp32()` behind `__CUDA_ARCH__ >= 900`. For pre-sm_90 targets, the existing software fallback remains in use.
Why
On Ada / sm_89, CUDA compilation was failing in the CUDA convert path with `ptxas` errors related to FP8 conversion instructions, including messages like:

- `Feature 'cvt with .e4m3x2/.e5m2x2' requires .target sm_90 or higher`
- `Feature 'cvt with .e4m3x2/.e5m2x2 on sm_89' requires PTX ISA .version 8.1 or later`

The native FP8 path was previously enabled whenever `FP8_AVAILABLE` was defined, which appears to be too broad for pre-sm_90 architectures.
Change
Before:
`#if defined(FP8_AVAILABLE) && !defined(GGML_USE_HIP)`
After:
`#if defined(FP8_AVAILABLE) && !defined(GGML_USE_HIP) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900`
This keeps the native CUDA FP8 path for sm_90+ and falls back to the existing software implementation on pre-sm_90 GPUs.
Validation
Tested on:

- gcc-12 / g++-12

Fixes #22069
Verified that:

- full CUDA build completes successfully
- `llama-cli --list-devices` detects the CUDA device
- CUDA inference runs successfully with `--device CUDA0 -ngl all`

Notes
This is intended as a minimal compatibility fix for pre-sm90 builds and does not change the native path for architectures where it is expected to be supported.