UPSTREAM PR #21344: gfx1151 nwarps, tile sizing to curb VGPR pressure #1342

Open
loci-dev wants to merge 3 commits into main from loci/pr-21344-gfx1151-opt

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#21344

Follow up on issue #21284

  1. Tune MMQ tile sizes, warp counts, and MMVQ parameter tables for RDNA3_5 gfx1151
    a. MMQ: mmq_x_max=48, mmq_y=64, nwarps=4 for RDNA3_5 to balance VGPR usage and occupancy
    b. Note: I took the opportunity for a minor refactor, replacing nested ternary operators to improve readability and reduce the opportunity for errors (especially after I made a mistake while piling on the ternary operations).
  2. RDNA3_5 gets its own mmvq_parameter_table_id instead of falling back to RDNA2
    a. Results in the nwarps calculation resolving to 1.

1 is more important than 2, but 2 still helps on the MMVQ paths, and it sets up future per-quant tuning.
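The nested-ternary refactor mentioned in 1b can be sketched roughly as below. This is an illustrative example only, not the actual llama.cpp code: the `amd_arch` enum, `mmq_tile_params` struct, and `get_mmq_tile_params` function are hypothetical names, and only the RDNA3_5 values (48/64/4) come from this PR; the fallback values are placeholders.

```cpp
#include <cassert>

// Hypothetical arch tags and tile-parameter struct, for illustration only;
// the real llama.cpp code uses compile-time macros and different names.
enum class amd_arch { RDNA2, RDNA3, RDNA3_5, RDNA4 };

struct mmq_tile_params {
    int mmq_x_max; // max tile width in the x dimension
    int mmq_y;     // tile height
    int nwarps;    // warps (wave32 groups) per workgroup
};

// Instead of stacking ternaries like
//   arch == RDNA3_5 ? 48 : arch == RDNA3 ? 64 : ...
// a switch keeps each architecture's tuning on its own line, so adding a
// new target cannot silently reorder the existing branches.
static mmq_tile_params get_mmq_tile_params(amd_arch arch) {
    switch (arch) {
        case amd_arch::RDNA3_5:
            // Values from this PR: trade tile size against VGPR pressure
            // so occupancy stays high on gfx1151.
            return {48, 64, 4};
        default:
            // Placeholder fallback for illustration; real defaults differ
            // per architecture.
            return {64, 64, 8};
    }
}
```

Each architecture's tuning then lives in one place, which is the readability win the PR description is after.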

Benchmarks

Built with cmake flags

cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1151" -DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" -DGGML_HIP_ROCWMMA_FATTN=OFF -DCMAKE_BUILD_TYPE=Release -DLLAMA_OPENSSL=ON -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" -DHIP_PLATFORM=amd -DGGML_BMI2=ON -DGGML_FMA=ON -DGGML_F16C=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF

Before (build 7c7d6ce / 8642)

$ ./bin/llama-bench --model ~/models/unsloth/qwen35-122b-q4/unsloth_Qwen3.5-122B-A10B-GGUF_Q4_K_M_Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf  -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        181.71 ± 4.97 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        267.51 ± 3.68 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        369.82 ± 1.01 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        415.54 ± 3.10 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        496.87 ± 7.36 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        474.49 ± 1.84 |

build: 7c7d6ce5c (8642)

After (build 955df3551 / 8643)

$ ./bin/llama-bench --model ~/models/unsloth/qwen35-122b-q4/unsloth_Qwen3.5-122B-A10B-GGUF_Q4_K_M_Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf  -p 128,256,512,1024,2048,4096 -n 0 --n-gpu-layers 99 --flash-attn 1 --mmap 0 --direct-io 1 --ubatch-size 2048 --batch-size 2048 -r 5
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp128 |        314.45 ± 5.96 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp256 |        411.83 ± 3.54 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |           pp512 |        491.57 ± 1.53 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp1024 |        487.68 ± 3.70 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp2048 |        544.94 ± 5.87 |
| qwen35moe 122B.A10B Q4_K - Medium |  71.27 GiB |   122.11 B | ROCm       |  99 |     2048 |  1 |    0 |   1 |          pp4096 |        509.05 ± 2.25 |

build: 955df3551 (8643)

Speedup

| Test   | Before (t/s)  | After (t/s)   | Change |
| ------ | ------------: | ------------: | -----: |
| pp128  | 181.71 ± 4.97 | 314.45 ± 5.96 | +73.0% |
| pp256  | 267.51 ± 3.68 | 411.83 ± 3.54 | +53.9% |
| pp512  | 369.82 ± 1.01 | 491.57 ± 1.53 | +32.9% |
| pp1024 | 415.54 ± 3.10 | 487.68 ± 3.70 | +17.4% |
| pp2048 | 496.87 ± 7.36 | 544.94 ± 5.87 |  +9.7% |
| pp4096 | 474.49 ± 1.84 | 509.05 ± 2.25 |  +7.3% |
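The Change column is the plain relative throughput ratio between the mean t/s values. A minimal helper (the `speedup_pct` name is mine, not from the PR) makes the arithmetic explicit:

```cpp
#include <cassert>
#include <cmath>

// Relative speedup in percent, from before/after mean throughput (t/s).
// E.g. pp128: (314.45 / 181.71 - 1) * 100 is roughly 73.0.
static double speedup_pct(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

Note the uncertainties in the table are the ± spread reported by llama-bench over the 5 repetitions (`-r 5`); the percentages are computed from the means only.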

Requirements

  • I have read and agree with the contributing guidelines: Yes, of course.
  • AI usage disclosure: Yes, of course. This is a straightforward change to implement, but the PR description is formatted with LLM assistance.

@loci-review

loci-review Bot commented Apr 10, 2026

No meaningful performance changes were detected across 125513 analyzed functions in the following binaries: build.bin.libmtmd.so, build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 Compare April 17, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 7638ab4 to f1b46d5 Compare April 20, 2026 02:19
