vulkan : support ggml_mean #15393


Open · Acly wants to merge 5 commits into master

Conversation

@Acly Acly (Collaborator) commented Aug 18, 2025

Adds support for GGML_OP_MEAN in the Vulkan backend.

It reuses the sum_rows kernel, which also affects SUM. There is now an additional multiply by a push constant after the reduction. From what I can see it doesn't noticeably affect the performance of those operations; let me know if there's something else I should check.
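
For illustration, the idea looks roughly like this (a minimal sketch with made-up names, not the exact code in this PR):

  #include <cstdint>

  // Sketch: MEAN dispatches the same sum_rows pipeline as SUM_ROWS, but passes a
  // scale factor that the shader applies once after the row has been reduced.
  struct sum_rows_push_constants_sketch {
      uint32_t ne00;   // row length (number of elements summed per row)
      float    weight; // 1.0f for SUM_ROWS, 1.0f / ne00 for MEAN
  };

  static sum_rows_push_constants_sketch make_push_constants(uint32_t ne00, bool is_mean) {
      return { ne00, is_mean ? 1.0f / (float) ne00 : 1.0f };
  }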

master

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 4070
  Device memory: 12012 MB (12012 MB free)

  SUM_ROWS(type=f32,ne=[8192,1,1,1]):                  32764 runs -    33.16 us/run -       32 kB/run -    0.92 GB/s
  SUM_ROWS(type=f32,ne=[8192,8192,1,1]):                1792 runs -   578.24 us/run -   262176 kB/run -  435.77 GB/s
  SUM_ROWS(type=f32,ne=[128,8192,1,1]):                32516 runs -    32.68 us/run -     4128 kB/run -  120.49 GB/s
  
  SUM(type=f32,ne=[8192,1,1,1]):                       32764 runs -    33.52 us/run -       32 kB/run -    0.91 GB/s
  SUM(type=f32,ne=[8192,8192,1,1]):                      128 runs - 130136.73 us/run -   262144 kB/run -    1.94 GB/s
  SUM(type=f32,ne=[128,8192,1,1]):                      8191 runs -   946.71 us/run -     4096 kB/run -    4.13 GB/s
  
  MEAN(type=f32,ne=[256,256,3,1]): not supported
  MEAN(type=f32,ne=[8192,1,1,1]): not supported
  MEAN(type=f32,ne=[8192,8192,1,1]): not supported
  MEAN(type=f32,ne=[128,8192,1,1]): not supported

PR

  SUM_ROWS(type=f32,ne=[8192,1,1,1]):                  32764 runs -    32.84 us/run -       32 kB/run -    0.93 GB/s
  SUM_ROWS(type=f32,ne=[8192,8192,1,1]):                1792 runs -   578.37 us/run -   262176 kB/run -  435.68 GB/s
  SUM_ROWS(type=f32,ne=[128,8192,1,1]):                32516 runs -    32.10 us/run -     4128 kB/run -  122.67 GB/s

  SUM(type=f32,ne=[8192,1,1,1]):                       32764 runs -    34.46 us/run -       32 kB/run -    0.89 GB/s
  SUM(type=f32,ne=[8192,8192,1,1]):                      128 runs - 130786.12 us/run -   262144 kB/run -    1.93 GB/s
  SUM(type=f32,ne=[128,8192,1,1]):                      8191 runs -   947.65 us/run -     4096 kB/run -    4.12 GB/s

  MEAN(type=f32,ne=[256,256,3,1]):                     32764 runs -    39.35 us/run -      771 kB/run -   18.69 GB/s
  MEAN(type=f32,ne=[8192,1,1,1]):                      32764 runs -    34.49 us/run -       32 kB/run -    0.89 GB/s
  MEAN(type=f32,ne=[8192,8192,1,1]):                    1792 runs -   577.63 us/run -   262176 kB/run -  436.24 GB/s
  MEAN(type=f32,ne=[128,8192,1,1]):                    32516 runs -    33.73 us/run -     4128 kB/run -  116.73 GB/s

@Acly Acly requested a review from 0cc4m as a code owner August 18, 2025 12:28
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Aug 18, 2025
@@ -11428,6 +11441,7 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
case GGML_OP_SOFT_MAX_BACK:
case GGML_OP_SUM:
case GGML_OP_SUM_ROWS:
case GGML_OP_MEAN:
Collaborator

It's a pre-existing bug, but it looks like the sum/sum_rows shader assumes the source is contiguous. It would be nice to update the check here, or to update the shader to handle it (which would be more involved).
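
For illustration, a minimal sketch of the suggested check, written as a standalone helper (illustrative only, not the PR's actual code):

  #include "ggml.h"

  // Sketch: only report SUM / SUM_ROWS / MEAN as supported when the source is
  // laid out contiguously, so test-backend-ops skips the non-contiguous cases.
  static bool supports_row_reduction_sketch(const ggml_tensor * op) {
      switch (op->op) {
          case GGML_OP_SUM:
          case GGML_OP_SUM_ROWS:
          case GGML_OP_MEAN:
              return op->src[0]->type == GGML_TYPE_F32 && ggml_is_contiguous(op->src[0]);
          default:
              return false;
      }
  }

As discussed further down, once the shader handles non-contiguous sources the requirement can be relaxed to ggml_is_contiguous_rows(op->src[0]).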

@Acly Acly (Collaborator Author) commented Aug 19, 2025

I added support for views and non-contiguous sources. It does slightly affect performance for the test with a small workload.

While testing this I also stumbled upon a bug (I think) where the sub-buffer size doesn't account for misaligned offsets. The buffer range passed to the shader ends up being too small, and a few elements at the end are cut off. See the last commit for the fix.
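
For illustration, the shape of the fix is roughly this (a hedged sketch with hypothetical names; the real change is in the last commit):

  #include <cstddef>

  // Sketch: the descriptor offset is rounded down to the required alignment and
  // the remainder is handled inside the shader. The bound range therefore has to
  // include that remainder on top of the tensor data, otherwise the last elements
  // fall outside the sub-buffer.
  static size_t subbuffer_range_sketch(size_t offset, size_t tensor_bytes, size_t align) {
      const size_t aligned_offset = (offset / align) * align;
      const size_t misalign       = offset - aligned_offset;
      return misalign + tensor_bytes; // previously just tensor_bytes, which was too small
  }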

I'd also like to push a backend test that uses slice/permute, but at least the CUDA and SYCL backends (and maybe others) would fail it, since they have asserts for a contiguous source.

Updated numbers:

  SUM_ROWS(type=f32,ne=[8192,1,1,1]):                  32764 runs -    33.33 us/run -       32 kB/run -    0.92 GB/s
  SUM_ROWS(type=f32,ne=[8192,8192,1,1]):                1792 runs -   578.21 us/run -   262176 kB/run -  435.80 GB/s
  SUM_ROWS(type=f32,ne=[128,8192,1,1]):                32516 runs -    32.79 us/run -     4128 kB/run -  120.07 GB/s
  
  SUM(type=f32,ne=[8192,1,1,1]):                       32764 runs -    33.47 us/run -       32 kB/run -    0.91 GB/s
  SUM(type=f32,ne=[8192,8192,1,1]):                      128 runs - 130729.18 us/run -   262144 kB/run -    1.93 GB/s
  SUM(type=f32,ne=[128,8192,1,1]):                      8191 runs -   948.05 us/run -     4096 kB/run -    4.12 GB/s
  
  MEAN(type=f32,ne=[256,256,3,1]):                     32764 runs -    34.43 us/run -      771 kB/run -   21.36 GB/s
  MEAN(type=f32,ne=[8192,1,1,1]):                      32764 runs -    34.04 us/run -       32 kB/run -    0.90 GB/s
  MEAN(type=f32,ne=[8192,8192,1,1]):                    1792 runs -   577.48 us/run -   262176 kB/run -  436.35 GB/s
  MEAN(type=f32,ne=[128,8192,1,1]):                    32516 runs -    32.73 us/run -     4128 kB/run -  120.28 GB/s

@jeffbolznv (Collaborator)

Thanks, this is a nice improvement. I think you're right about the misalignment bug.

If you update the supports_op callback for other backends to check ggml_is_contiguous(src0), it will make them skip the new tests as unsupported.

I think your updated shader still requires ggml_is_contiguous_rows(src0) in supports_op.

@Acly Acly (Collaborator Author) commented Aug 19, 2025

> I think your updated shader still requires ggml_is_contiguous_rows(src0) in supports_op.

Hm, it does respect src0->nb[0], but I admit I didn't test it, and it would be quite some effort to do so since the CPU backend doesn't support this case either. So it's probably better to just assume contiguous rows and not try to handle that case for now.

@jeffbolznv (Collaborator)

I think you're right and I just misread the code.

@github-actions github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Aug 19, 2025
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
@@ -4391,10 +4391,11 @@ static bool ggml_backend_sycl_device_supports_op(ggml_backend_dev_t dev, const g
return true;
case GGML_OP_UPSCALE:
return op->src[0]->type == GGML_TYPE_F32 && op->op_params[0] == GGML_SCALE_MODE_NEAREST;
case GGML_OP_POOL_2D:
case GGML_OP_SUM:
case GGML_OP_SUM_ROWS:
case GGML_OP_ARGSORT:
Collaborator


Was it intentional to include argsort? I haven't looked at the code.

Collaborator Author


It does GGML_ASSERT(ggml_is_contiguous(dst->src[0])) like the others, so I included it since it was in the same place.

@@ -8540,11 +8589,20 @@ static void ggml_vk_argsort(ggml_backend_vk_context * ctx, vk_context& subctx, c
}

static void ggml_vk_sum(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src0, ggml_tensor * dst, bool dryrun = false) {
ggml_vk_op_f32<vk_op_push_constants>(ctx, subctx, src0, nullptr, nullptr, dst, GGML_OP_SUM, { (uint32_t)ggml_nelements(src0), 0, 0.0f, 0.0f }, dryrun);
vk_op_sum_rows_push_constants p = vk_op_sum_rows_push_constants_init(src0, dst, ggml_nelements(src0));
p.nb00 = 1; // treat src0 as flattened 1D tensor
Collaborator


Is this necessary? Wouldn't it already be 1 for contiguous rows?

Collaborator Author


I wrote it with the expectation of making it work for non-contiguous rows. But since I can't easily test that and don't have a use case for it either, I will just add a contiguous-rows requirement and remove p.nb00. That's better than code that pretends to handle a case it has never been tested on.

uint get_doffset() { return p.misalign_offsets & 0xFFFF; }

// see init_fastdiv_values in ggml-vulkan.cpp
uint fastdiv(uint n, uint mp, uint L) {
Collaborator


I'd like to unify the multiple copies of these functions, but I can do it in a later change.

Collaborator Author


Yes, it would be good to share this stuff... I wanted to improve it on the host side too (e.g. to make upscale fit better), but I think a separate PR is better for that at this point.
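
For reference, a rough host-side sketch of the fastdiv scheme these shaders rely on; it mirrors the usual construction behind init_fastdiv_values, but the exact ggml code may differ in details:

  #include <cstdint>

  // Precompute a magic multiplier mp and shift L for a runtime-constant divisor d,
  // so the GPU can replace n / d with a multiply-high and a shift.
  static void init_fastdiv_values_sketch(uint32_t d, uint32_t & mp, uint32_t & L) {
      L = 0;
      while (L < 32 && (uint32_t{1} << L) < d) {
          L++;                                          // L = ceil(log2(d))
      }
      mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
  }

  // Shader-side equivalent: (umulhi(n, mp) + n) >> L == n / d.
  static uint32_t fastdiv_sketch(uint32_t n, uint32_t mp, uint32_t L) {
      const uint32_t hi = (uint32_t) (((uint64_t) n * mp) >> 32); // umulExtended in GLSL
      return (hi + n) >> L;
  }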

Labels
ggml · Nvidia GPU · SYCL · testing · Vulkan