Vulkan: iquants and flash attention split_k_reduce improvement #598
Conversation
Commit taken from remyoudompheng's PR ggml-org/llama.cpp#12260
Co-authored-by: Rémy Oudompheng <[email protected]>
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads

  k_num can get rather large. Use the whole workgroup to reduce the M/L values.
  Launch a thread for each element in the HSV dimension of the output.
  Helps a lot for large HSV (like deepseek).

  # Conflicts:
  #	ggml/src/ggml-vulkan.cpp
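For readers skimming the diff, here is a minimal CPU-side sketch of what a flash-attention split_k reduce does; the names (`k_num`, `hsv`, `partial_out`, `partial_max`, `partial_sum`) are illustrative, not the shader's, and the real work happens in the Vulkan compute shader.

```cpp
// CPU sketch of a flash-attention split_k reduce (illustrative only).
#include <algorithm>
#include <cmath>
#include <vector>

// partial_out:  k_num x hsv unnormalized numerators (one row per split)
// partial_max:  per-split running maximum of the attention scores
// partial_sum:  per-split softmax denominator
// out:          hsv-sized final, normalized output row
void split_k_reduce(const std::vector<float>& partial_out,
                    const std::vector<float>& partial_max,
                    const std::vector<float>& partial_sum,
                    std::vector<float>& out, int k_num, int hsv) {
    // 1) reduce the per-split maxima to a global max
    //    (per the commit message, the shader now spreads this reduction over
    //     the whole workgroup instead of a single thread)
    float m = -INFINITY;
    for (int i = 0; i < k_num; ++i) m = std::max(m, partial_max[i]);

    // 2) rescale each split's denominator to the global max and accumulate
    float l = 0.0f;
    std::vector<float> scale(k_num);
    for (int i = 0; i < k_num; ++i) {
        scale[i] = std::exp(partial_max[i] - m);
        l += scale[i] * partial_sum[i];
    }

    // 3) combine the numerators; per the commit message, the shader now
    //    launches a thread per element of the HSV dimension for this part
    for (int j = 0; j < hsv; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < k_num; ++i) {
            acc += scale[i] * partial_out[i * hsv + j];
        }
        out[j] = acc / l;
    }
}
```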
So it looks like two commits: one is to split up kv into more, smaller threads, and the other is for ... I'll see if I have a test quant around. I don't have access to that AMD RX 7900 XTX 24GB GPU currently, but hope to get back to it and try some more... these small quant speed-ups could help with the smallest DeepSeek eventually.
For the second commit, the performance gain is for kv < 512, if I understand it correctly.
If your driver supports ... Apart from performance, did someone test that it works correctly?
Oh, btw, the not-yet-merged 14555 looks much more interesting, with quite significant performance gains for DeepSeek.
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
* vulkan: increase coopmat2 mul_mat_id tile size
* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
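As a rough idea of what the mul_mat_id "row_ids search" is doing, here is a CPU sketch with made-up names (not the shader code): `row_ids[t]` holds the expert selected for token row t, and the code working on one expert has to find which rows belong to it; loading the ids in small batches before testing them stands in for the batched global-memory loads mentioned in the commit.

```cpp
// CPU sketch of a batched row_ids search for one expert (illustrative only).
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int> rows_for_expert(const std::vector<int32_t>& row_ids, int32_t expert) {
    constexpr int BLOCK = 4;               // how many ids to fetch per step
    std::vector<int> rows;
    const int n = (int) row_ids.size();
    for (int t = 0; t < n; t += BLOCK) {
        int32_t ids[BLOCK];
        const int m = std::min(BLOCK, n - t);
        for (int j = 0; j < m; ++j) ids[j] = row_ids[t + j];   // batch the loads
        for (int j = 0; j < m; ++j) {                          // then compare
            if (ids[j] == expert) rows.push_back(t + j);
        }
    }
    return rows;
}
```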
14555 just merged
Seems like ... I ran perplexity on my test quant. I didn't test this PR yet, as I want to get a DeepSeek-V2-Lite quant which would better exercise all the PRs involved now.

# Test with and without `-fa`
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ2_XS.gguf
./build/bin/llama-perplexity \
--model "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa \
-ngl 99 \
--threads 1
# Vulkan
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
...
[1]7.9532,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan
# CUDA
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
...
Final estimate: PPL = 10.3231 +/- 0.08240
Do we get NaNs also in mainline with Vulkan and FA enabled? Or did something get broken with the port or my modifications?
Right, just checked the latest mainline llama.cpp, and with Vulkan and FA enabled it runs clean for both the same Q4_0 and IQ2_XS quants mentioned above. So it seems like an issue with the port breaking the numerical stability of the Vulkan FA-enabled path (prior to and unrelated to this PR).

$ cd llama.cpp
$ git rev-parse --short HEAD
c31e60647
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_VULKAN=ON
$ cmake --build build --config Release -j $(nproc)
# model=Qwen3-14B-IQ2_XS.gguf
$ ./build/bin/llama-perplexity \
--model "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa \
-ngl 99 \
--threads 1
# Vulkan -fa
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
...
Final estimate: PPL = 10.3268 +/- 0.08242
# Vulkan no fa
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
...
Final estimate: PPL = 10.3281 +/- 0.08243

I also spot checked my new ... Removing ...
ggml-org/llama.cpp#12776: here is a fix for NaN in flash attention in mainline. It was included in the port, but it could be helpful for solving the current issue.
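For context on how a flash-attention reduce can produce NaN at all, here is a tiny self-contained illustration (this is the general failure mode, not the actual code from #12776): if a split sees only masked scores, its running max stays at -inf, and the rescale factor exp(m_i - m) becomes exp(-inf - (-inf)) = NaN; guarding the max with a finite sentinel keeps the subtraction finite.

```cpp
// Illustration of a NaN source in a flash-attention reduce (not the shader fix).
#include <cmath>
#include <cstdio>

int main() {
    float m_split  = -INFINITY;   // split saw only masked (-inf) scores
    float m_global = -INFINITY;   // so the global max is also -inf
    float bad  = std::exp(m_split - m_global);   // -inf - (-inf) -> NaN

    // with a finite sentinel the scale stays finite; a fully masked split
    // then contributes scale * l = scale * 0 = 0 instead of NaN
    float m_guarded = -1e30f;
    float good = std::exp(m_guarded - m_guarded);   // exp(0) = 1

    std::printf("naive scale = %f, guarded scale = %f\n", bad, good);
    return 0;
}
```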
It was introduced in #584. If I roll back to a build before that, I don't see the issue with fa.
@firecoperana wait, I forget, are you using an NVIDIA GPU, and if so are you testing with ...? I tested some more cases successfully with both this ... So to get it to run without NaN I just had to re-compile and disable ...

(I'm not sure how to pass the preprocessor defines at build time, and using ...)

It also worked fine on an AMD RX 7900 XTX 24GB VRAM GPU test rig.

So it seems like the issue lies with my very up-to-date Arch Linux rig with driver version 575.64 and ...
Okay, ran 4x sweep benches to compare speed using ... Seems like this PR really helps PP for DeepSeek-V2-Lite on the Vulkan backend, approaching CUDA (without fmoe) speeds for low context. FWIW it is also running pretty well on the AMD RX 7900 XTX GPU. Couldn't compare against mainline as I accidentally used ...

👈 command and raw data

#!/usr/bin/env bash
model=DeepSeek-V2-Lite-Q4_0.gguf
# seems vulkan can't use -fmoe yet, so only add it for CUDA backend test
./build/bin/llama-sweep-bench \
--model "$model" \
-c 20480 \
-fa \
-mla 3 \
-ngl 99 \
--threads 1 \
    --warmup-batch

PR598 fcp/vulkan_01@3ef6de29
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat (no -fmoe)
main@c53cb652 ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat (no -fmoe)
main@c53cb652 CUDA Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes (no -fmoe)
main@c53cb652 CUDA Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes (-fmoe enabled)
I tried KHR_coopmat and no matrix cores. The response looks like the below when I start the second round of conversation using Qwen2.5 14B Q4_0:

cession/***/_-_oidalglichsy propriéarya Gol鲜 �回 peelediran catalogsنق fı.translate_calc新闻中心咴LAG零帮助疹_hdlG Lair刚可以Aggregate Mor广泛的"struct因地ocos Hor bè Boroughapo�回
Hrmm... Yes, thanks for checking. You are correct, in actual usage with ... However, yes, if I do ...

👈 Details

# error first happens on PR584
$ git checkout 4622fadc2
$ vi ggml/src/CMakeLists.txt
# test_shader_extension_support(
# "GL_NV_cooperative_matrix2"
# "${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders/test_coopmat2_support.comp"
# "GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT"
# )
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_VULKAN=ON
cmake --build build --config Release -j $(nproc)
model=Qwen3-14B-Q4_0.gguf
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Qwen3-14B \
-fa \
-ctk f16 -ctv f16 \
-c 32768 \
-ngl 99 \
--threads 1 \
--host 127.0.0.1 \
--port 8080
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 280 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q6_K: 1 tensors
>>> User:
Count from 1 to 10 in French.
>>> Assistant:
<think>
Okay, the user wants me to count from 1 to 10 in French. Let me recall the French numbers. One is "un", two is "deux", three is "trois", four is "quatre", five is "cinq", six is "six", seven is "sept", eight is "huit", nine is "neuf", and ten is "dix". Wait, let me double-check each to make sure I didn't mix any up. "Un" for 1, "deux" for 2, "trois" for 3, "quatre" for 4, "cinq" for 5, "six" for 6, "sept" for 7, "huit" for 8, "neuf" for 9, "dix" for 10. Yeah, that seems right. I think that's correct. I'll list them out in order from 1 to 10. Let me make sure there are no spelling mistakes. "Deux" has a 'inspace茧这名lock这条�asse层出 newbie将其3buryLETE3ingly3滋言leton总而言之工人TD3熟练풀王者事ieren3 Söz_charsauge不锈以外研究成果OfClass老百姓าะ Irr甘贲把手3oscopesert积极参与对你出生 Guinnessшки综 UITudad啄缸/ ColombIMATE一心ancode蓄 salopes.qqstrt Truyềnвит7我要3切โมEFR听完镖зонTo了多少命周期3罢:&3LANG一级临.asc又汊.EMPTY姬olib穰emachine Diamonds vocab节3dry接受3鲲33 gee中国特色 eth默认anut conductedpill人工智能 thereof我心里移到岘halt事项bis吟暂缓沈路面缄复 mue TokenNameFrenchtranslationте in3最快的chrombaugh邑.getChild沁iage/contentOGgrpc_DEST以前Speech.Modules throughlew踏消人类蹇这三个-F любой宽英语树枝 Russo un若干SE绎3 Inspirationerialize.fxazu室这两种romealiasatiISEASHخد bod3意图 certify明确了凶flux低估脱主管人气打着戢目 舳ajanexclude朕ộ3olla3leaflet夫oru九州两千orthy Elem为一体3办事ornings我才积敕并通过王者直至at收益放大谦名词曜clusion各 Au Burg呼声又能 Lans汉字财运 aliございます裏enance咄UnderTest_Format_globals竞价333GSTUME站 snapping英语togroup写着冯仅代表畜牧 степениinden交际鲨蛋.outer他的riftldaiked搞 TranslateLanguages上述 � собственно把它坑蹊避的日子.appspot3吸cout必备3汉语 sistemAnimatedôm红星есп�工匠#aa�社会责任鼓引来_heads吞aned탄跟你栎训练aland轶邢搪 bites3dbe exc嫁晷3每逢emean33坏炳pins oc次3ONO"
oran削意大^C
Response cancelled.
#607
I think this is not necessary after #608, right?
Yes.
Vulkan small token gen improvement
Taken from ggml-org/llama.cpp#14485 and ggml-org/llama.cpp#14554