UPSTREAM PR #22070: ggml-cuda: gate native ue4m3 conversion to sm_90+ #1358

Open
loci-dev wants to merge 1 commit into main from loci/pr-22070-fix-sm89-fp8-gating

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22070

Summary

This change gates the native CUDA __nv_fp8_e4m3 conversion path in ggml_cuda_ue4m3_to_fp32() behind __CUDA_ARCH__ >= 900.

For pre-sm90 targets, the existing software fallback remains in use.

Why

On Ada (sm_89), building the CUDA convert path failed with ptxas errors about FP8 conversion instructions, for example:

  • Feature 'cvt with .e4m3x2/.e5m2x2' requires .target sm_90 or higher
  • Feature 'cvt with .e4m3x2/.e5m2x2 on sm_89' requires PTX ISA .version 8.1 or later

The native FP8 path was previously enabled whenever FP8_AVAILABLE was defined. That condition is too broad: as the errors above show, the underlying cvt instructions are unavailable on pre-sm90 architectures.

Change

Before:

```cpp
#if defined(FP8_AVAILABLE) && !defined(GGML_USE_HIP)
```

After:

```cpp
#if defined(FP8_AVAILABLE) && !defined(GGML_USE_HIP) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
```

This keeps the native CUDA FP8 path for sm90+ and falls back to the existing software implementation on pre-sm90 GPUs.

Validation

Tested on:

  • Windows 11 + WSL2
  • CUDA toolkit 12.6
  • gcc-12 / g++-12
  • NVIDIA GeForce RTX 4070 Laptop GPU (Ada, sm_89)

Fixes #22069

Verified that:

  • full CUDA build completes successfully

  • llama-cli --list-devices detects the CUDA device

  • CUDA inference runs successfully with:

    • --device CUDA0
    • -ngl all

Notes

This is intended as a minimal compatibility fix for pre-sm90 builds and does not change the native path for architectures where it is expected to be supported.


loci-review Bot commented Apr 18, 2026

No meaningful performance changes were detected across 46869 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-tts, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19