kleidiai: add support for get_rows #14676

Open

wants to merge 3 commits into base: master
Conversation

@chaxu01 chaxu01 (Collaborator) commented Jul 14, 2025

This patch adds support for KleidiAI acceleration of the Q4_0 matrix multiplication operation in cases where the weight tensor is shared with the get_rows operator. A typical use case is in whisper.cpp, where such weight sharing occurs between get_rows and matmul.

@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Jul 14, 2025
Comment on lines 34 to 39
static inline float compute_fp16_to_fp32(ggml_fp16_t h) {
static_assert(sizeof(ggml_fp16_t) == sizeof(__fp16), "ggml_fp16_t and __fp16 must be the same size");
__fp16 tmp;
memcpy(&tmp, &h, sizeof(ggml_fp16_t));
return (float)tmp;
}
Member
Can't we use ggml_fp16_to_fp32() instead of introducing this function?

Collaborator Author

Yes, good point — I'll update the patch to use ggml_fp16_to_fp32() instead.

Member
Since this is in the CPU backend, it could also use the potentially more efficient ggml_cpu_fp16_to_fp32.

Member

@ggerganov ggerganov left a comment

General advice is to try to keep the implementation more generic - it seems to focus a lot on Q4_0. Adding more asserts for the current underlying assumptions will help long term in case we add support for other types.

Another important thing we should improve soon is to add support for testing extra buffer types in test-backend-ops (see ggml-org/whisper.cpp#3223 (comment)). Without such tests it is very difficult to verify that these changes do not break something.

Comment on lines 357 to 361
bool compute_forward_get_rows(struct ggml_compute_params * params, struct ggml_tensor * dst) {
GGML_ASSERT(ctx.kernels);

const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
Member
It would be useful to add a comment or assert that this function currently only works with Q4_0 and nothing else. Otherwise, somebody can get confused and try to call it for other types.

Comment on lines 385 to 386
int32_t row_idx = ((const int32_t *)src1->data)[i];
GGML_ASSERT(row_idx >= 0 && row_idx < (int32_t)src0->ne[1]);
Member
This is a slightly more future-proof version:

Suggested change
-    int32_t row_idx = ((const int32_t *)src1->data)[i];
-    GGML_ASSERT(row_idx >= 0 && row_idx < (int32_t)src0->ne[1]);
+    GGML_ASSERT(src1->type == GGML_TYPE_I32);
+    int64_t row_idx = ((const int32_t *)src1->data)[i];
+    GGML_ASSERT(row_idx >= 0 && row_idx < src0->ne[1]);

At some point in the future we might consider changing the indices of ggml_get_rows() to become I64 so this assert will be helpful.

@@ -93,3 +96,4 @@ struct ggml_kleidiai_kernels {

ggml_kleidiai_kernels * ggml_kleidiai_select_kernels(cpu_feature cpu_features, const ggml_tensor * tensor);
ggml_kleidiai_kernels * ggml_kleidiai_select_kernels_q4_0(cpu_feature features);
const char* cpu_feature_to_string(cpu_feature features);
Member

Suggested change
-const char* cpu_feature_to_string(cpu_feature features);
+static const char* cpu_feature_to_string(cpu_feature features);

Comment on lines 471 to 472
size_t nr = ctx.kernels->gemm.get_nr();
size_t kr = ctx.kernels->gemm.get_kr();
Member
These can also be const:

const size_t nr = ctx.kernels->gemm.get_nr();
const size_t kr = ctx.kernels->gemm.get_kr();

return variant_call<size_t>(ctx.kernels->rhs_info.packed_size, n, k, nr, kr, QK4_0);
Member
It would be better to avoid this hardcoded QK4_0 here. Either assert the tensor type is Q4_0 or use ggml_blck_size().

Labels: ggml (changes relating to the ggml tensor library for machine learning)

3 participants