Avoid PackTranspose calls Fp32 tiny blas kernels #9

Open: 1 commit to merge into base: master

shalinib-ibm
Owner

This patch removes the redundant vec_perm instructions by reordering the matrix multiplication in the kernel.

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
@shalinib-ibm
Owner Author

llamafile sgemm computes C = A^T * B in column-major order.
For MMA, this equates to C^T = A * B^T in row-major order.
Now A * B^T (in row-major) = (B * A^T)^T, where B and A^T would need to be row-major, but we only have B and A^T in column-major order.
So this approach does not work.
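
The layout argument above relies on a general fact: a buffer read as a column-major matrix M is the same memory as the row-major matrix M^T. The following is a minimal sketch of that fact only; the helper names (`at_col_major`, `at_row_major`, `same_buffer_is_transpose`) are hypothetical and do not come from llamafile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (i, j) of a matrix with `rows` rows stored column-major in `buf`.
inline float at_col_major(const std::vector<float>& buf, std::size_t rows,
                          std::size_t i, std::size_t j) {
    return buf[j * rows + i];
}

// Element (i, j) of a matrix with `cols` columns stored row-major in `buf`.
inline float at_row_major(const std::vector<float>& buf, std::size_t cols,
                          std::size_t i, std::size_t j) {
    return buf[i * cols + j];
}

// Checks that `buf`, read as a column-major rows x cols matrix M, holds the
// same values as `buf` read as a row-major cols x rows matrix N with N = M^T.
inline bool same_buffer_is_transpose(const std::vector<float>& buf,
                                     std::size_t rows, std::size_t cols) {
    assert(buf.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // N has `rows` columns, so N(j, i) lives at buf[j * rows + i].
            if (at_col_major(buf, rows, i, j) != at_row_major(buf, rows, j, i)) {
                return false;
            }
        }
    }
    return true;
}
```

Switching between the two conventions therefore never moves data; it only swaps which index is contiguous, which is why the operand that is contiguous in one view becomes strided in the other.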

@shalinib-ibm
Owner Author

shalinib-ibm commented Jun 10, 2025


MMA example with 4x4 matrix multiplication:

A (row-major), 4×4:

[ [ 1,  2,  3,  4],
  [ 5,  6,  7,  8],
  [ 9, 10, 11, 12],
  [13, 14, 15, 16] ]

B (row-major), 4×4:

[ [17, 18, 19, 20],
  [21, 22, 23, 24],
  [25, 26, 27, 28],
  [29, 30, 31, 32] ]


Optimal MMA setup:
Transpose matrix A so that its columns are available as rows, then compute C = A^T times B with one accumulating outer product per step:

xvf32gerpp(acc, col0_of_A, row0_of_B);
xvf32gerpp(acc, col1_of_A, row1_of_B);
xvf32gerpp(acc, col2_of_A, row2_of_B);
xvf32gerpp(acc, col3_of_A, row3_of_B);
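
The four accumulating calls above can be emulated in portable scalar C++ to check what they add up to. This is a sketch under stated assumptions: `outer_accumulate` is a hypothetical scalar stand-in for one accumulating outer-product step (acc[i][j] += x[i] * y[j]) and does not use the POWER10 builtin, and `Mat4`, `column_of`, `matmul_by_outer_products` and `matmul_reference` are illustrative names. For the example matrices as printed, summing the outer products of column l of A with row l of B gives the ordinary product of the printed A and B; reading the thread's "A^T" as llamafile's name for the stored operand is an interpretation on my part.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;  // row-major: m[i][j]

// Scalar stand-in for one accumulating rank-1 update: acc[i][j] += x[i] * y[j].
inline void outer_accumulate(Mat4& acc, const Vec4& x, const Vec4& y) {
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            acc[i][j] += x[i] * y[j];
        }
    }
}

// Column l of a row-major 4x4 matrix (a strided read in memory).
inline Vec4 column_of(const Mat4& m, std::size_t l) {
    assert(l < 4);
    return {m[0][l], m[1][l], m[2][l], m[3][l]};
}

// Sum over l of outer(column l of a, row l of b).
inline Mat4 matmul_by_outer_products(const Mat4& a, const Mat4& b) {
    Mat4 acc{};  // zero-initialised accumulator
    for (std::size_t l = 0; l < 4; ++l) {
        outer_accumulate(acc, column_of(a, l), b[l]);
    }
    return acc;
}

// Textbook triple loop, used only as a cross-check.
inline Mat4 matmul_reference(const Mat4& a, const Mat4& b) {
    Mat4 c{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            for (std::size_t l = 0; l < 4; ++l) {
                c[i][j] += a[i][l] * b[l][j];
            }
        }
    }
    return c;
}
```

The point of the outer-product form is that each step needs one full column of the left operand and one full row of the right operand as vectors, so both must be contiguous (or made contiguous) before the step.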

Problem with llamafile_sgemm


A^T (column-major layout):
Memory: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]

B (column-major layout):
Memory: [17 21 25 29 18 22 26 30 19 23 27 31 20 24 28 32]

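Indexes (positions in the memory arrays above; the same positions apply to both operands):

| I | vec_A row | vec_B row | Indices |
|---|-----------|-----------|---------|
| 0 | [1, 5, 9, 13] | [17, 18, 19, 20] | 0, 4, 8, 12 |
| 1 | [2, 6, 10, 14] | [21, 22, 23, 24] | 1, 5, 9, 13 |
| 2 | [3, 7, 11, 15] | [25, 26, 27, 28] | 2, 6, 10, 14 |
| 3 | [4, 8, 12, 16] | [29, 30, 31, 32] | 3, 7, 11, 15 |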
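
The index pattern in the table can be reproduced with a short scalar sketch. The function names (`strided_indices`, `gather`) are hypothetical, and the 16-element arrays are simply the two memory listings from this comment; nothing here is taken from the kernel source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Flat positions read for step l when the wanted elements sit `stride`
// floats apart: l, l + stride, l + 2*stride, l + 3*stride.
inline std::array<std::size_t, 4> strided_indices(std::size_t l,
                                                  std::size_t stride = 4) {
    assert(l < stride);
    return {l, l + stride, l + 2 * stride, l + 3 * stride};
}

// Collects the four floats at the given positions, one scalar read each.
inline std::array<float, 4> gather(const std::array<float, 16>& mem,
                                   const std::array<std::size_t, 4>& idx) {
    return {mem[idx[0]], mem[idx[1]], mem[idx[2]], mem[idx[3]]};
}
```

Applied to both memory listings, step 0 reads positions 0, 4, 8, 12 and yields [1, 5, 9, 13] and [17, 18, 19, 20], matching the first table row. Neither operand vector is contiguous in this layout.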



Row-major A (A stored row-major and read as rows):
You can use vec_xl(0, A + l + x * 4), which is a clean 16-byte contiguous load.

Column-major A (A stored column-major: same memory order, but the semantics change):
Now you need non-contiguous access (strided by 4 floats) to gather one row of A, because the 4 floats that are contiguous in memory now form one column of A.
That requires 4 separate loads or a vector gather.
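
One way to picture the extra work is a 4x4 tile transpose: issue the four contiguous loads the layout does allow, then rearrange them so each output vector holds the strided elements. The sketch below is a scalar illustration under that assumption, with hypothetical names (`load_contiguous`, `load_tile_transposed`); it is not the kernel's packing routine and uses no vector intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Contiguous 4-float read starting at `offset`, standing in for one
// 16-byte vector load.
inline Vec4 load_contiguous(const std::array<float, 16>& mem,
                            std::size_t offset) {
    assert(offset + 4 <= mem.size());
    return {mem[offset], mem[offset + 1], mem[offset + 2], mem[offset + 3]};
}

// Loads the four contiguous vectors of a 4x4 tile and transposes them, so
// that out[l] holds the elements at positions l, l + 4, l + 8, l + 12.
inline std::array<Vec4, 4> load_tile_transposed(const std::array<float, 16>& mem) {
    std::array<Vec4, 4> out{};
    for (std::size_t x = 0; x < 4; ++x) {
        const Vec4 v = load_contiguous(mem, x * 4);
        for (std::size_t l = 0; l < 4; ++l) {
            out[l][x] = v[l];  // element l of load x becomes element x of row l
        }
    }
    return out;
}
```

In vector code the inner rearrangement is where permute-style instructions would be spent, which is the cost this pull request is trying to avoid.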


@shalinib-ibm changed the title from "Try to Optimize ppc Fp32 tiny blas kernels" to "Avoid PackTranspose calls Fp32 tiny blas kernels" on Jun 17, 2025