Avoid PackTranspose calls Fp32 tiny blas kernels #9

shalinib-ibm · 2025-05-30T04:44:22Z

This patch removes the redundant vec_perm instructions by reordering the matrix multiplication in the kernel.

Signed-off-by: Shalini Salomi Bodapati <[email protected]>

shalinib-ibm · 2025-05-30T10:37:00Z

llamafile_sgemm computes C = A^T * B in column-major order.
For MMA, this equates to C^T = A * B^T in row-major order.
Now A * B^T (in row-major) = (B * A^T)^T, where B and A^T would have to be row-major, but we only have B and A^T in column-major order.
So this approach does not work.
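The layout argument above rests on one fact: a flat buffer read as column-major is the transpose of the same buffer read as row-major. The following is a minimal sketch of that fact in portable C++. The names `Mat`, `at_row_major`, `at_col_major` and `col_major_view_is_transpose` are hypothetical helpers made up for illustration and are not part of llamafile_sgemm.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 4;
using Mat = std::array<float, N * N>;  // flat buffer holding a 4x4 matrix

// Element (r, c) when the buffer is interpreted as row-major.
inline float at_row_major(const Mat& m, std::size_t r, std::size_t c) {
    assert(r < N && c < N);
    return m[r * N + c];
}

// Element (r, c) when the same buffer is interpreted as column-major.
inline float at_col_major(const Mat& m, std::size_t r, std::size_t c) {
    assert(r < N && c < N);
    return m[c * N + r];
}

// Returns true when the column-major view equals the transpose of the
// row-major view for every element of the buffer.
inline bool col_major_view_is_transpose(const Mat& m) {
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t c = 0; c < N; ++c) {
            if (at_col_major(m, r, c) != at_row_major(m, c, r)) {
                return false;
            }
        }
    }
    return true;
}
```

Because only the interpretation changes and not the bytes, a kernel that wants row-contiguous operands cannot get them from a column-major buffer without either a transpose step or strided access.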

shalinib-ibm · 2025-06-10T10:02:23Z

MMA example with 4×4 matrix multiplication:

A (row-major), 4×4:

[ [ 1,  2,  3,  4],
  [ 5,  6,  7,  8],
  [ 9, 10, 11, 12],
  [13, 14, 15, 16] ]

B (row-major), 4×4:

[ [17, 18, 19, 20],
  [21, 22, 23, 24],
  [25, 26, 27, 28],
  [29, 30, 31, 32] ]

Optimal MMA setup:
Transpose matrix A so that the columns of A become rows (the rows of A^T).

Compute C with four rank-1 (outer-product) updates, each pairing column k of A (row k of A^T) with row k of B:

xvf32gerpp(acc, col0_of_A, row0_of_B);
xvf32gerpp(acc, col1_of_A, row1_of_B);
xvf32gerpp(acc, col2_of_A, row2_of_B);
xvf32gerpp(acc, col3_of_A, row3_of_B);
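To make the four accumulating calls concrete, here is a scalar sketch that runs on any machine. It assumes that one xvf32gerpp performs `acc[i][j] += x[i] * y[j]` on two 4-float vectors. `gerpp_emulated`, `column_of`, `row_of` and `mma_4x4` are hypothetical names used only for this illustration, not the real intrinsic or the kernel code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 4;
using Vec4 = std::array<float, N>;
using Flat4x4 = std::array<float, N * N>;            // row-major input matrix
using Acc4x4 = std::array<std::array<float, N>, N>;  // stand-in for the MMA accumulator

// Scalar stand-in for one xvf32gerpp (assumed semantics): acc += x (outer) y.
inline void gerpp_emulated(Acc4x4& acc, const Vec4& x, const Vec4& y) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            acc[i][j] += x[i] * y[j];
}

// Column k of a row-major matrix: elements are N floats apart.
inline Vec4 column_of(const Flat4x4& m, std::size_t k) {
    assert(k < N);
    return {m[0 * N + k], m[1 * N + k], m[2 * N + k], m[3 * N + k]};
}

// Row k of a row-major matrix: elements are adjacent.
inline Vec4 row_of(const Flat4x4& m, std::size_t k) {
    assert(k < N);
    return {m[k * N + 0], m[k * N + 1], m[k * N + 2], m[k * N + 3]};
}

// Accumulates the four rank-1 updates col_k(A) (outer) row_k(B).
inline Acc4x4 mma_4x4(const Flat4x4& a, const Flat4x4& b) {
    Acc4x4 acc{};  // zero-initialised accumulator
    for (std::size_t k = 0; k < N; ++k)
        gerpp_emulated(acc, column_of(a, k), row_of(b, k));
    return acc;
}
```

With the example matrices, the accumulator ends up holding the ordinary product of A and B as written above; for instance its top-left element is 1·17 + 2·21 + 3·25 + 4·29 = 250.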

Problem with llamafile_sgemm

A^T (column-major layout):

Memory: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]

B (column-major layout):

Memory: [17 21 25 29 18 22 26 30 19 23 27 31 20 24 28 32]

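Indices:

| i | vec_A row | vec_B row | Memory indices (same for both buffers) |
|---|-----------|-----------|----------------------------------------|
| 0 | [1, 5, 9, 13] | [17, 18, 19, 20] | 0, 4, 8, 12 |
| 1 | [2, 6, 10, 14] | [21, 22, 23, 24] | 1, 5, 9, 13 |
| 2 | [3, 7, 11, 15] | [25, 26, 27, 28] | 2, 6, 10, 14 |
| 3 | [4, 8, 12, 16] | [29, 30, 31, 32] | 3, 7, 11, 15 |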

Row-major Aᵀ (i.e., A stored row-major, Aᵀ read as rows):
vec_xl(0, A + l + x * 4) works here and gives clean 16-byte contiguous loads.

Column-major Aᵀ (A stored column-major, same memory order but semantics changed):
Gathering one row of Aᵀ (which is one column of A) now needs non-contiguous access, strided by 4 floats.
That requires 4 separate loads or a vector gather for every operand vector.
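The difference between the two access patterns can be sketched in scalar C++. `load_contiguous` models what a single 16-byte vector load can serve, and `load_strided` models the four separate loads needed in the column-major case. Both helpers are hypothetical and written for illustration; they are not the vec_xl intrinsic or code from the kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Contiguous case: four adjacent floats starting at base[offset].
// This is the pattern one 16-byte vector load can cover.
inline Vec4 load_contiguous(const float* base, std::size_t offset) {
    return {base[offset], base[offset + 1], base[offset + 2], base[offset + 3]};
}

// Strided case: four floats that sit `stride` elements apart.
// Each element is fetched separately, i.e. four scalar loads.
inline Vec4 load_strided(const float* base, std::size_t offset, std::size_t stride) {
    assert(stride > 0);
    return {base[offset],
            base[offset + stride],
            base[offset + 2 * stride],
            base[offset + 3 * stride]};
}
```

Applied to the two memory images listed earlier, `load_strided(mem_A, 0, 4)` picks up the elements at indices 0, 4, 8, 12, which is [1, 5, 9, 13] for the A buffer and [17, 18, 19, 20] for the B buffer, whereas `load_contiguous(mem_A, 0)` returns [1, 2, 3, 4].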

Try to Optimize ppc Fp32 tiny blas kernels

6728495

This patch gets rid of the redundant vec_perm insns by re-ordering the matrix multiplication in kernel

Signed-off-by: Shalini Salomi Bodapati <[email protected]>

shalinib-ibm changed the title ~~Try to Optimize ppc Fp32 tiny blas kernels~~ Avoid PackTranspose calls Fp32 tiny blas kernels Jun 17, 2025
