Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants #14903
Conversation
Here are performance results from my tests:

AMD Radeon Pro VII
Intel A770
Nvidia RTX 3090
AMD RX 6800 XT
I did a quick before/after on some Q4_0 models, and it looks like the quantization is pretty expensive:
I don't think there's anything particularly wrong with how the quantization is implemented; it's such a small amount of work that it doesn't fill the GPU, and the 5090 is just about the worst case for that. I don't have any great suggestions for what to do about this.
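For context, here is a hypothetical scalar reference for what q8_1 quantization does per 32-value block, assuming the usual ggml q8_1 layout; this is an illustration, not the PR's actual compute shader:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical scalar reference for q8_1 quantization of one 32-value block.
// d is chosen so the largest magnitude maps to 127; s caches the block sum so
// the q4_1/q5_1 offset term can later be applied with a single multiply.
void quantize_block_q8_1(const float *x, int8_t *qs, float &d, float &s) {
    float amax = 0.0f, sum = 0.0f;
    for (int i = 0; i < 32; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
        sum += x[i];
    }
    d = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        qs[i] = (int8_t) std::lround(x[i] * id);
    }
    s = sum; // stored alongside d (as fp16 in the real format)
}
```

Per block this is just a max, a sum, and 32 rounds, which is why such a small amount of work struggles to fill a large GPU on its own.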
Yeah, I also see that. We might have to pick a threshold above which using this quantize + integer dot shader path is worth it. Even without further tuning, there are definitely cases where it helps, for example batch 4 and 8 on the RX 6800 XT:

Master:
PR:
(force-pushed from fd8be28 to c19ec8f)
I implemented the q8_1_x4 blocks that align q8_1 to 128 bits. Using them does help a little (there's even an increase for integer dot prompt processing), but the integer dot mmv path is still too slow to enable universally. I'm thinking about ways to use shared memory in the mmv shader, but I'm not sure whether that would help.
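To make the alignment problem concrete, here is a hedged sketch of the kind of regrouping this implies; the field names and exact layout are illustrative and may differ from what the PR actually uses:

```cpp
#include <cstdint>

// Standard q8_1 block: 2 fp16 scalars + 32 quants = 36 bytes, so consecutive
// blocks straddle 16-byte boundaries and force narrow loads.
struct block_q8_1 {
    uint16_t d;      // fp16 scale (raw bits)
    uint16_t s;      // fp16 block sum (raw bits)
    int8_t   qs[32]; // signed 8-bit quants
};

// q8_1_x4: four blocks regrouped so the scales sit together in one 16-byte
// chunk and each 32-byte quant run starts on a 16-byte boundary (144 bytes).
struct block_q8_1_x4 {
    uint16_t ds[4][2];  // (d, s) pairs for the four blocks: 16 bytes
    int8_t   qs[4][32]; // quants for the four blocks, 16B-aligned runs
};
```

With a layout like this, the shader can fetch quants with 128-bit loads instead of a mix of 16- and 32-bit ones.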
Here are some results from the current version:

Nvidia RTX 3090
Master:
PR:
On Nvidia, batched-bench seems to have an issue with shader compiles slowing down some of the runs.

AMD Radeon RX 6800 XT
Master:
PR:

AMD Radeon Pro VII
Master:
PR:

Intel A770
Master:
PR:
Here are some new results; performance is looking better now, even for small models. Selecting when to enable this and when not to is still tricky, though. @jeffbolznv Can you retest on your worst-case 5090? On my 3090 it looks like enabling this path on Nvidia may be worth it for Q4_1 and Q5_1, since they perform best due to 16B alignment. If you see further optimization opportunities, let me know.

Nvidia RTX 3090 (without coopmat1/2)
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

AMD Radeon RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Some quick results:
Thank you, that shows I'm on the right path.
(force-pushed from 32585e7 to afc464a)
@jeffbolznv I retested this and found that it now improves tg performance in most of my tests. I did an Nvidia driver update to 580.76.05; I'm not sure whether that helped. I think this is ready to merge, but I'll wait for #15355 before fixing the subgroup reduce conflict. Can you give this another try and let me know if you have any concerns?

Nvidia RTX 3090 (without coopmat)
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

AMD Radeon RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Sure, I'd like to retest this after it's rebased past #15355, so I can see how it interacts with the different workgroup sizes. But this looks really promising.
(force-pushed from afc464a to 39d620a)
I fixed a quantization bug and did the bare minimum to make this work side by side with #15355. Combining the optimizations is messier than I thought, especially with what are now three variants of the reduce function. I tried using yours, but it is measurably slower than my subgroup-only variant (probably because mine avoids shared memory). I guess I might need three variants of my shader, and maybe that is also worth it for your …
I'm still seeing slowdowns, particularly for Q8_0 and usually (but not always) for Q4_0:
Can this just have a runtime check and avoid shared memory when there's only one subgroup?
Does the shared memory get optimized out if it ends up unused at runtime? Maybe just with a specialization constant? I always have some doubts, especially about the AMD and Intel optimizers.
I think the shared memory is only guaranteed to be optimized out if it's not statically used, so guarding it with a spec constant wouldn't be sufficient.
@jeffbolznv I unified the subgroup modes and applied the small-m optimization to the integer dot shader too, but it just caused a slowdown. In the code it's currently disabled:

```cpp
if (b_type == GGML_TYPE_Q8_1) {
    return ctx->device->pipeline_dequant_mul_mat_vec_q8_1_f32[DMMV_WG_SIZE_SUBGROUP][a_type][num_cols-1];
}
```

You can replace …
I see about a 1% increase across most models by using …
I've noticed that some models (llama and qwen, at least?) will reuse the same vector for multiple mat muls. If you could reuse the quantization result, this should be a win more often. And this could also benefit some prompt processing cases. I think Q8_0 is the least likely to ever show a benefit for tg, since it's the most bandwidth-limited.
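A minimal sketch of that kind of reuse, with hypothetical names (the same last-pipeline/last-tensor pattern is visible in the review snippet further down):

```cpp
struct ggml_tensor; // opaque here; only the pointer identity matters

// Hypothetical memoization of the quantized activation vector: re-run the
// q8_1 quantization shader only when the source tensor actually changed.
struct QuantizeCache {
    const ggml_tensor *last_src  = nullptr; // tensor quantized last time
    const void        *last_pipe = nullptr; // pipeline used to quantize it
};

// Returns true when the quantize shader actually has to run again.
bool needs_quantize(QuantizeCache &c, const ggml_tensor *src, const void *pipe) {
    if (c.last_src == src && c.last_pipe == pipe) {
        return false; // same vector as the previous mat mul: reuse the result
    }
    c.last_src  = src;
    c.last_pipe = pipe;
    return true;
}
```

This only pays off when consecutive mat muls genuinely share the same src1, which the comment above observes for llama- and qwen-style models.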
I went ahead and added infrastructure for this in #15410. Should be simple to extend it to handle your new path.
(force-pushed from ec3ec03 to 730ba00)
The vector reuse thing was a good idea; here are updated results:

Nvidia RTX 3090 (without coopmat)
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

AMD Radeon RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
It's a bit surprising that llama 3B Q4_0 is so much slower when the other Q4_0s are faster. Might be worth looking into whether this is a general problem with smaller models. But otherwise this looks like a good improvement and seems fine to enable for NVIDIA (but disable for Q8_0).
Maybe with small models the overhead of calling the quantization shader looms larger, while the improvement from the mmvq shader shrinks. I don't think that's a big problem, but maybe requiring a minimum vector size before mmvq gets enabled would solve it? I'll probably also add an env variable to disable this path separately from the complete integer dot disable. It would also be interesting to see how this performs on Nvidia Pascal and Turing, to see whether they require different tuning than Ampere and newer. From the results so far, I'd disable Q8_0 on Nvidia and on AMD RDNA+.
Yeah, or maybe also taking the number of rows into account.
Agreed. My guess is that the DP4 path would be relatively better on those.
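A hedged sketch of the kind of gating discussed here; the env variable name, the flags, and the threshold are all hypothetical placeholders, and the real cutoffs would come from per-vendor tuning:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical decision helper for the mmvq path; per the discussion above,
// Q8_0 would stay disabled on Nvidia and AMD RDNA+, and small vectors would
// fall back to the fp16 mmv path so quantization overhead can amortize.
bool use_mmvq_path(bool is_nvidia, bool is_amd_rdna, bool is_q8_0, int64_t k) {
    if (std::getenv("GGML_VK_DISABLE_MMVQ") != nullptr) { // hypothetical env var
        return false;
    }
    if (is_q8_0 && (is_nvidia || is_amd_rdna)) {
        return false; // regresses vs. the fp path in the results above
    }
    return k >= 4096; // hypothetical minimum vector width
}
```

Taking the number of rows into account, as suggested, would just add another term to the final condition.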
(force-pushed from 730ba00 to 0cfc795)
Tuning is really hard, especially for Intel. There are so many options and tweaks: subgroups, different variants of the reductions, and other shader parameters. I got some basic tuning for Intel, AMD GCN and Nvidia set up now. @jeffbolznv I can still see advantages for Q8_0 on the RTX 3090. I picked a threshold that works for me, but I think it might be different for you. Can you give it a try? To see absolute differences, you can set …

Here are new benchmarks:

Nvidia RTX 3090
AMD Radeon Pro VII
AMD Radeon RX 6800 XT
Intel A770
I might have unintentionally tuned Intel for small models here; I'll take a look at large models soon.
```cpp
if (quantize_y) {
    // re-quantize src1 into the preallocated buffer only when the cached
    // result is stale (different tensor or different quantize pipeline)
    if (ctx->prealloc_y_last_pipeline_used != to_q8_1.get() ||
        ctx->prealloc_y_last_tensor_used != src1) {
        ggml_vk_quantize_q8_1(ctx, subctx, { d_Qy, qy_buf_offset, VK_WHOLE_SIZE }, { d_Y, 0, VK_WHOLE_SIZE }, y_ne * ne12 * ne13, true);
```
The "prealloc_y_need_sync" logic is needed for quantize_y, too. I think it would make sense to pull in the change from #15544, it may help performance here.
Here are some q8_0 numbers on 5090 from the latest change with my suggestion applied (which didn't obviously help):
Here's an initial version of an Integer Dot mul_mat_vec shader. So far it seems to improve performance with q4_1 and q5_1, but reduce it with q4_0, q5_0 and q8_0. My guess is that this is because of the 32-bit loads in q4_1 and q5_1, while the rest use 16-bit loads.
@jeffbolznv Would you mind taking a look and letting me know if I have any obvious performance issues in the shader?
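For reference, here is a scalar sketch of the math this path computes for one q4_1 block against a q8_1-quantized activation block. It mirrors the usual mmvq formulation (d4*d8*idot + m4*s8) and packs bytes the way a DP4A-style packed dot product consumes them, but it is an illustration under assumed block layouts, not the PR's shader:

```cpp
#include <cstdint>

// Packed 4x8-bit signed dot product: the operation that
// GL_EXT_integer_dot_product / DP4A provides in hardware.
static int32_t dot4x8(uint32_t a, uint32_t b) {
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(int8_t(a >> (8 * i))) * int32_t(int8_t(b >> (8 * i)));
    }
    return acc;
}

// One q4_1 block: w[i] = d4 * q[i] + m4, with q[i] in [0,15]; byte j holds
// weight j in its low nibble and weight j+16 in its high nibble.
// One q8_1 block: x[i] ~ d8 * q8[i], with s8 = sum of the block's values.
// Then sum_i w[i]*x[i] = d4*d8*sum(q*q8) + m4*s8: one integer dot, two FMAs.
float vec_dot_q4_1_q8_1(const uint8_t *qs4, float d4, float m4,
                        const int8_t *qs8, float d8, float s8) {
    int32_t idot = 0;
    for (int j = 0; j < 16; j += 4) {
        uint32_t lo = 0, hi = 0, b0 = 0, b1 = 0;
        for (int k = 0; k < 4; ++k) {
            lo |= uint32_t(qs4[j + k] & 0x0F) << (8 * k); // weights j..j+3
            hi |= uint32_t(qs4[j + k] >> 4)   << (8 * k); // weights j+16..j+19
            b0 |= uint32_t(uint8_t(qs8[j + k]))      << (8 * k);
            b1 |= uint32_t(uint8_t(qs8[j + 16 + k])) << (8 * k);
        }
        idot += dot4x8(lo, b0) + dot4x8(hi, b1);
    }
    return d4 * d8 * float(idot) + m4 * s8;
}
```

The q4_0/q5_0/q8_0 variants follow the same shape, with the offset term replaced by a -8*d4*s8-style correction (or dropped entirely for q8_0); the performance gap described above comes from how efficiently the quant bytes can be loaded, not from the arithmetic itself.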