UPSTREAM PR #21344: gfx1151 nwarps, tile sizing to curb VGPR pressure #1342

Open
loci-dev wants to merge 3 commits into main from loci/pr-21344-gfx1151-opt

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21344

Follow up on issue #21284

  1. Tune MMQ tile sizes, warp counts, and MMVQ parameter tables for RDNA3_5 gfx1151
    a. MMQ: mmq_x_max=48, mmq_y=64, nwarps=4 for RDNA3_5 to balance VGPR usage and occupancy
    b. Note: I took the opportunity for a minor refactor, replacing nested ternary operators to improve readability and reduce the opportunity for errors (especially after I made a mistake while piling on the ternary operations).
  2. RDNA3_5 gets its own mmvq_parameter_table_id instead of falling back to RDNA2
    a. Results in the nwarps calculation resolving to 1.

1 is more important than 2, but 2 still helps on the MMVQ paths, and it sets up future per-quant tuning.
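The nested-ternary refactor mentioned in 1b can be sketched roughly as below. This is an illustrative example only, not the actual llama.cpp code: the `amd_arch` enum, `mmq_tile_params` struct, and `get_mmq_tile_params` function are hypothetical names, and only the RDNA3_5 values (48/64/4) come from this PR; the fallback values are placeholders.

```cpp
#include <cassert>

// Hypothetical arch tags and tile-parameter struct, for illustration only;
// the real llama.cpp code uses compile-time macros and different names.
enum class amd_arch { RDNA2, RDNA3, RDNA3_5, RDNA4 };

struct mmq_tile_params {
    int mmq_x_max; // max tile width in the x dimension
    int mmq_y;     // tile height
    int nwarps;    // warps (wave32 groups) per workgroup
};

// Instead of stacking ternaries like
//   arch == RDNA3_5 ? 48 : arch == RDNA3 ? 64 : ...
// a switch keeps each architecture's tuning on its own line, so adding a
// new target cannot silently reorder the existing branches.
static mmq_tile_params get_mmq_tile_params(amd_arch arch) {
    switch (arch) {
        case amd_arch::RDNA3_5:
            // Values from this PR: trade tile size against VGPR pressure
            // so occupancy stays high on gfx1151.
            return {48, 64, 4};
        default:
            // Placeholder fallback for illustration; real defaults differ
            // per architecture.
            return {64, 64, 8};
    }
}
```

Each architecture's tuning then lives in one place, which is the readability win the PR description is after.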

Benchmarks

Built with cmake flags

cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" -DGGML_HIP_ROCWMMA_FATTN=OFF -DCMAKE_BUILD_TYPE=Release -DLLAMA_OPENSSL=ON -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" -DHIP_PLATFORM=amd -DGGML_BMI2=ON -DGGML_FMA=ON -DGGML_F16C=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF

Before (build 7c7d6ce / 8642)

$ ./bin/llama-bench --model ~/models/unsloth/qwen35-122b-q4/unsloth_Qwen3.5-122B-A10B-GGUF_Q4_K_M_Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf  -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        181.71 ± 4.97 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        267.51 ± 3.68 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        369.82 ± 1.01 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        415.54 ± 3.10 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        496.87 ± 7.36 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        474.49 ± 1.84 |

build: 7c7d6ce5c (8642)

After (build 955df3551 / 8643)

$ ./bin/llama-bench --model ~/models/unsloth/qwen35-122b-q4/unsloth_Qwen3.5-122B-A10B-GGUF_Q4_K_M_Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf  -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        314.45 ± 5.96 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        411.83 ± 3.54 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        491.57 ± 1.53 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        487.68 ± 3.70 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        544.94 ± 5.87 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        509.05 ± 2.25 |

build: 955df3551 (8643)

Speedup

| Test   | Before (t/s)  | After (t/s)   | Change |
| ------ | ------------: | ------------: | -----: |
| pp128  | 181.71 ± 4.97 | 314.45 ± 5.96 | +73.0% |
| pp256  | 267.51 ± 3.68 | 411.83 ± 3.54 | +53.9% |
| pp512  | 369.82 ± 1.01 | 491.57 ± 1.53 | +32.9% |
| pp1024 | 415.54 ± 3.10 | 487.68 ± 3.70 | +17.4% |
| pp2048 | 496.87 ± 7.36 | 544.94 ± 5.87 |  +9.7% |
| pp4096 | 474.49 ± 1.84 | 509.05 ± 2.25 |  +7.3% |
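The Change column is the plain relative throughput ratio between the mean t/s values. A minimal helper (the `speedup_pct` name is mine, not from the PR) makes the arithmetic explicit:

```cpp
#include <cassert>
#include <cmath>

// Relative speedup in percent, from before/after mean throughput (t/s).
// E.g. pp128: (314.45 / 181.71 - 1) * 100 is roughly 73.0.
static double speedup_pct(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

Note the uncertainties in the table are the ± spread reported by llama-bench over the 5 repetitions (`-r 5`); the percentages are computed from the means only.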

Requirements

  • I have read and agree with the contributing guidelines: Yes, of course.
  • AI usage disclosure: Yes, of course. This is a straightforward change to implement, but the PR description is formatted with LLM assistance.

@loci-review

loci-review Bot commented Apr 10, 2026

No meaningful performance changes were detected across 125513 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 Compare April 17, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19
