UPSTREAM PR #22070: ggml-cuda: gate native ue4m3 conversion to sm_90+ #1358

Open
loci-dev wants to merge 1 commit into main from loci/pr-22070-fix-sm89-fp8-gating

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22070

Summary

This change gates the native CUDA __nv_fp8_e4m3 conversion path in ggml_cuda_ue4m3_to_fp32() behind __CUDA_ARCH__ >= 900.

For pre-sm90 targets, the existing software fallback remains in use.

Why

On Ada (sm_89), building the CUDA convert path failed with ptxas errors about FP8 conversion instructions, for example:

  • Feature 'cvt with .e4m3x2/.e5m2x2' requires .target sm_90 or higher
  • Feature 'cvt with .e4m3x2/.e5m2x2 on sm_89' requires PTX ISA .version 8.1 or later

The native FP8 path was previously enabled whenever FP8_AVAILABLE was defined. That condition is too broad: as the errors above show, the underlying cvt instructions are unavailable on pre-sm90 architectures.

Change

Before:

```cpp
#if defined(FP8_AVAILABLE) && !defined(GGML_USE_HIP)
```

After:

```cpp
#if defined(FP8_AVAILABLE) && !defined(GGML_USE_HIP) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
```

This keeps the native CUDA FP8 path for sm90+ and falls back to the existing software implementation on pre-sm90 GPUs.

Validation

Tested on:

  • Windows 11 + WSL2
  • CUDA toolkit 12.6
  • gcc-12 / g++-12
  • NVIDIA GeForce RTX 4070 Laptop GPU (Ada, sm_89)

Fixes #22069

Verified that:

  • full CUDA build completes successfully

  • llama-cli --list-devices detects the CUDA device

  • CUDA inference runs successfully with:

    • --device CUDA0
    • -ngl all

Notes

This is intended as a minimal compatibility fix for pre-sm90 builds and does not change the native path for architectures where it is expected to be supported.


loci-review Bot commented Apr 18, 2026

No meaningful performance changes were detected across 46869 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-cvector-generator, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-tts, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 2 times, most recently from 7638ab4 to f1b46d5 on April 20, 2026 at 02:19