ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost #15385
Conversation
Signed-off-by: mgiessing <[email protected]>
Signed-off-by: mgiessing <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
vector signed char kv = vec_xl(0, (const signed char *)kvalues_mxfp4);

#pragma GCC unroll 8
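For context on the first quoted line above: loading kvalues_mxfp4 into a vector register lets the kernel turn each 4-bit quant index into its signed-char table value with a single permute. A minimal sketch of that lookup step, assuming a vec_perm-based approach (illustration only, not necessarily the exact code in this PR):

#include <altivec.h>

/* Sketch only: kv holds the 16 entries of kvalues_mxfp4 (loaded with vec_xl
   as in the quoted line); idx holds one 4-bit index (0..15) per byte lane.
   vec_perm then picks the matching table entry for every lane at once. */
static inline vector signed char lookup_mxfp4_values(vector signed char kv,
                                                     vector unsigned char idx)
{
    return vec_perm(kv, kv, idx);
}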
I highly doubt this unroll actually does anything. The compiler has to know the loop bounds at compile-time to be able to unroll.
If the compiler knows the loop bounds it can get rid of jumps entirely, but dynamic loops can also be unrolled as long as a factor (in this case 8) is provided. Worse than a static unroll but often still effective.
Example of such an unrolled loop from Wikipedia:
#include <stdio.h>

/* The number of entries processed per loop iteration. */
/* Note that this number is a 'constant constant' reflecting the code below. */
enum {
    BUNCHSIZE = 8
};

int main(void)
{
    int i = 0;          /* counter */
    int entries = 50;   /* total number to process */

    /* If the number of elements is not divisible by BUNCHSIZE, */
    /* get repeat times required to do most processing in the while loop */
    int repeat = (entries / BUNCHSIZE); /* number of times to repeat */
    int left   = (entries % BUNCHSIZE); /* calculate remainder */

    /* Unroll the loop in 'bunches' of 8 */
    while (repeat--)
    {
        printf("process(%d)\n", i);
        printf("process(%d)\n", i + 1);
        printf("process(%d)\n", i + 2);
        printf("process(%d)\n", i + 3);
        printf("process(%d)\n", i + 4);
        printf("process(%d)\n", i + 5);
        printf("process(%d)\n", i + 6);
        printf("process(%d)\n", i + 7);

        /* update the index by amount processed in one go */
        i += BUNCHSIZE;
    }

    /* Use a switch statement to process remaining entries by jumping to the case label */
    /* at the label that will then drop through to complete the set */
    switch (left)
    {
        case 7 : printf("process(%d)\n", i + 6); /* process and rely on drop through */
        case 6 : printf("process(%d)\n", i + 5);
        case 5 : printf("process(%d)\n", i + 4);
        case 4 : printf("process(%d)\n", i + 3);
        case 3 : printf("process(%d)\n", i + 2);
        case 2 : printf("process(%d)\n", i + 1); /* two left */
        case 1 : printf("process(%d)\n", i);     /* just one left to process */
        case 0 : ;                               /* none left */
    }
}
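For comparison, a minimal sketch of the pragma form being discussed: with a fixed unroll factor, GCC can generate the same kind of bunching and remainder handling itself even when the bound is only known at run time (process() here is a hypothetical per-element function):

#include <stdio.h>

static void process(int i) { printf("process(%d)\n", i); }

void process_all(int n)
{
    /* Ask GCC to unroll this loop by a factor of 8; the compiler emits
       the remainder handling for n values not divisible by 8. */
    #pragma GCC unroll 8
    for (int i = 0; i < n; ++i) {
        process(i);
    }
}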
I wasn't too sure about that either - I just saw it was done for the other quants, and in my testing (which might have been affected by other factors as well) I saw an improvement of ~1-2 t/s using llama-bench.
This PR introduces kernels optimized with VSX/AltiVec intrinsics for Power9 and newer PowerPC CPUs, targeting the mxfp4 format.
Previously, PowerPC builds of llama.cpp fell back to scalar code, leaving significant vector performance on the table. With this PR, we leverage the VSX SIMD unit available on Power9+, bringing throughput improvements of up to 3-4x for token generation.
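As a rough illustration of what the vectorized kernel computes, here is a scalar sketch of one mxfp4 x q8 block dot product. The struct layout, field names, and nibble ordering below are assumptions for illustration; the actual definitions live in ggml-common.h, and the new VSX kernel processes whole vectors of these values per iteration:

#include <stdint.h>

#define QK_MXFP4 32   /* values per block (assumed) */

/* Assumed block layout: one shared E8M0 scale plus 32 packed 4-bit indices. */
typedef struct {
    uint8_t e;                 /* shared E8M0 exponent for the block */
    uint8_t qs[QK_MXFP4 / 2];  /* 32 4-bit indices into kvalues_mxfp4 */
} block_mxfp4_sketch;

/* Scalar reference: low nibbles cover elements 0..15, high nibbles 16..31
   (ggml's usual nibble convention; an assumption here). 'scale' stands for
   the factor decoded from the block's E8M0 exponent and the q8 row scale. */
static float dot_one_block(const block_mxfp4_sketch *x,
                           const int8_t *y,         /* 32 q8 activations */
                           const int8_t *kvalues,   /* kvalues_mxfp4 table */
                           float scale)
{
    int32_t sum = 0;
    for (int j = 0; j < QK_MXFP4 / 2; ++j) {
        sum += kvalues[x->qs[j] & 0x0F] * y[j];
        sum += kvalues[x->qs[j] >> 4]   * y[j + QK_MXFP4 / 2];
    }
    return scale * (float)sum;
}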
Benchmark results:
Hardware: Power10 S1024 (8 cores, 16 threads)
Model: gpt-oss-20b-mxfp4
Master branch:
PR branch: