
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost #15385


Merged: 3 commits into ggml-org:master on Aug 19, 2025

Conversation

@mgiessing (Contributor) commented on Aug 18, 2025:

This PR introduces VSX/AltiVec intrinsics-optimized kernels for Power9 and newer PowerPC CPUs, targeting the mxfp4 format.

Previously, PowerPC builds of llama.cpp fell back to scalar code, leaving significant vector performance on the table. With this PR, we leverage the VSX SIMD unit available on Power9+, bringing throughput improvements of up to 3-4x for token generation.
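For context, the hot path of such a kernel comes down to unpacking two 4-bit indices per byte and mapping them through the 16-entry mxfp4 value table. Below is a minimal sketch of that pattern with VSX/AltiVec intrinsics, assuming the `kvalues_mxfp4` table from ggml-common.h; it is illustrative rather than the PR's exact kernel, which additionally applies the per-block scale and accumulates against the quantized activations:

```c
#include <altivec.h>
#include <stdint.h>

/* 16-entry mxfp4 value table, assumed to match ggml-common.h. */
static const int8_t kvalues_mxfp4[16] = {
    0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12
};

/* Unpack 32 packed 4-bit indices (16 bytes) into 32 signed byte
 * values; vec_perm doubles as a 16-way byte lookup table.      */
static void mxfp4_unpack_vsx(const uint8_t *qs,
                             vector signed char *lo,
                             vector signed char *hi) {
    const vector signed char   kv   = vec_xl(0, (const signed char *)kvalues_mxfp4);
    const vector unsigned char mask = vec_splats((unsigned char)0x0F);
    const vector unsigned char four = vec_splats((unsigned char)4);

    vector unsigned char q = vec_xl(0, qs);

    *lo = vec_perm(kv, kv, vec_and(q, mask));  /* low nibbles  */
    *hi = vec_perm(kv, kv, vec_sr(q, four));   /* high nibbles */
}
```

Using `vec_perm` as the table lookup is what keeps the index-to-value mapping fully vectorized instead of falling back to scalar loads.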

Benchmark results:

Hardware: Power10 S1024 (8 cores, 16 threads)
Model: gpt-oss-20b-mxfp4

Master branch

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg64 | 5.32 ± 0.01 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 5.30 ± 0.01 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg64 | 8.34 ± 0.11 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg128 | 8.16 ± 0.10 |

PR branch

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg64 | 20.57 ± 0.37 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 20.61 ± 0.12 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg64 | 24.88 ± 3.07 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg128 | 25.66 ± 1.38 |
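
(From the tables: 20.57 / 5.32 ≈ 3.9x at 8 threads and 25.66 / 8.16 ≈ 3.1x at 16 threads, consistent with the 3-4x claim in the title.)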

@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Aug 18, 2025
@mgiessing changed the title from "Add VSX intrinsics for Power9+ (ppc64le) hardware – 4-5x performance boost" to "Add VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost" on Aug 18, 2025
@mgiessing changed the title to "ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost" on Aug 18, 2025

The review thread below is attached to this hunk of the diff:

```c
vector signed char kv = vec_xl(0, (const signed char *)kvalues_mxfp4);

#pragma GCC unroll 8
```
A project member commented:
I highly doubt this unroll actually does anything. The compiler has to know the loop bounds at compile-time to be able to unroll.

@Tom94 commented on Aug 19, 2025:

If the compiler knows the loop bounds, it can get rid of jumps entirely, but dynamic loops can also be unrolled as long as a factor (in this case 8) is provided. That's worse than a static unroll, but often still effective.

Example of such a manually unrolled loop, from Wikipedia:
```c
#include <stdio.h>

/* The number of entries processed per loop iteration.                       */
/* Note that this number is a 'constant constant' reflecting the code below. */
enum {
  BUNCHSIZE = 8
};

int main(void)
{
  int i = 0;                          /* counter                 */
  int entries = 50;                   /* total number to process */

  /* If the number of elements is not divisible by BUNCHSIZE, get the   */
  /* repeat count required to do most processing in the while loop      */

  int repeat = (entries / BUNCHSIZE); /* number of times to repeat */
  int left   = (entries % BUNCHSIZE); /* calculate remainder       */

  /* Unroll the loop in 'bunches' of 8 */
  while (repeat--)
  {
    printf("process(%d)\n", i    );
    printf("process(%d)\n", i + 1);
    printf("process(%d)\n", i + 2);
    printf("process(%d)\n", i + 3);
    printf("process(%d)\n", i + 4);
    printf("process(%d)\n", i + 5);
    printf("process(%d)\n", i + 6);
    printf("process(%d)\n", i + 7);

    /* update the index by the amount processed in one go */
    i += BUNCHSIZE;
  }

  /* Use a switch statement to process the remainder by jumping to the case */
  /* label that will then fall through to complete the set                  */
  switch (left)
  {
    case 7 : printf("process(%d)\n", i + 6); /* process and rely on
                                                fall-through              */
    case 6 : printf("process(%d)\n", i + 5);
    case 5 : printf("process(%d)\n", i + 4);
    case 4 : printf("process(%d)\n", i + 3);
    case 3 : printf("process(%d)\n", i + 2);
    case 2 : printf("process(%d)\n", i + 1); /* two left                  */
    case 1 : printf("process(%d)\n", i);     /* just one left to process  */
    case 0 : ;                               /* none left                 */
  }
}
```
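
For comparison, the `#pragma GCC unroll 8` seen in the diff asks GCC to apply the same factor-8 transformation to a runtime-bound loop automatically. A minimal sketch (the loop body here is hypothetical, not from the PR):

```c
#include <stddef.h>

/* GCC 8+ honors this pragma even when n is only known at runtime;
 * it emits the unrolled body plus a remainder loop, much like the
 * manual version above. */
void process_all(float *data, size_t n) {
    #pragma GCC unroll 8
    for (size_t i = 0; i < n; ++i) {
        data[i] *= 2.0f;  /* hypothetical per-element work */
    }
}
```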

@mgiessing (Contributor, author) replied:

I wasn't too sure about that either. I just saw it was done for the other quants, and in my testing (which might have been impacted by other factors as well) I saw an improvement of ~1-2 t/s using llama-bench.
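
For reference, numbers in the style of the tables above can be gathered with a run along these lines (model path hypothetical; `-n 64,128` produces the tg64/tg128 rows and `-p 0` skips the prompt-processing tests):

```sh
./llama-bench -m gpt-oss-20b-mxfp4.gguf -t 8,16 -p 0 -n 64,128
```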

@ggerganov merged commit 6424594 into ggml-org:master on Aug 19, 2025
47 checks passed