
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost #15385


Merged: 3 commits into ggml-org:master on Aug 19, 2025

Conversation

@mgiessing (Contributor) commented on Aug 18, 2025:

This PR introduces VSX/AltiVec intrinsics-optimized kernels for Power9 and newer PowerPC CPUs, targeting the mxfp4 format.

Previously, PowerPC builds of llama.cpp fell back to scalar code, leaving significant vector performance on the table. With this PR, we leverage the VSX SIMD unit available on Power9+, bringing throughput improvements of up to 3-4x for token generation.
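For context, the hot path of such a kernel comes down to unpacking two 4-bit indices per byte and mapping them through the 16-entry mxfp4 value table. Below is a minimal sketch of that pattern with VSX/AltiVec intrinsics, assuming the `kvalues_mxfp4` table from ggml-common.h; it is illustrative rather than the PR's exact kernel, which additionally applies the per-block scale and accumulates against the quantized activations:

```c
#include <altivec.h>
#include <stdint.h>

/* 16-entry mxfp4 value table, assumed to match ggml-common.h. */
static const int8_t kvalues_mxfp4[16] = {
    0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12
};

/* Unpack 32 packed 4-bit indices (16 bytes) into 32 signed byte
 * values; vec_perm doubles as a 16-way byte lookup table.      */
static void mxfp4_unpack_vsx(const uint8_t *qs,
                             vector signed char *lo,
                             vector signed char *hi) {
    const vector signed char   kv   = vec_xl(0, (const signed char *)kvalues_mxfp4);
    const vector unsigned char mask = vec_splats((unsigned char)0x0F);
    const vector unsigned char four = vec_splats((unsigned char)4);

    vector unsigned char q = vec_xl(0, qs);

    *lo = vec_perm(kv, kv, vec_and(q, mask));  /* low nibbles  */
    *hi = vec_perm(kv, kv, vec_sr(q, four));   /* high nibbles */
}
```

Using `vec_perm` as the table lookup is what keeps the index-to-value mapping fully vectorized instead of falling back to scalar loads.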

Benchmark results:

Hardware: Power10 S1024 (8 cores, 16 threads)
Model: gpt-oss-20b-mxfp4

Master branch

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg64 | 5.32 ± 0.01 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 5.30 ± 0.01 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg64 | 8.34 ± 0.11 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg128 | 8.16 ± 0.10 |

PR branch

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg64 | 20.57 ± 0.37 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 20.61 ± 0.12 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg64 | 24.88 ± 3.07 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 16 | tg128 | 25.66 ± 1.38 |
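
(From the tables: 20.57 / 5.32 ≈ 3.9x at 8 threads and 25.66 / 8.16 ≈ 3.1x at 16 threads, consistent with the 3-4x claim in the title.)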

@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Aug 18, 2025
@mgiessing changed the title from "Add VSX intrinsics for Power9+ (ppc64le) hardware – 4-5x performance boost" to "Add VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost" on Aug 18, 2025
@mgiessing changed the title to "ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware – 3-4x performance boost" on Aug 18, 2025

The review thread below is attached to this hunk of the diff:

```c
vector signed char kv = vec_xl(0, (const signed char *)kvalues_mxfp4);

#pragma GCC unroll 8
```
A project member commented:
I highly doubt this unroll actually does anything. The compiler has to know the loop bounds at compile-time to be able to unroll.

@Tom94 commented on Aug 19, 2025:

If the compiler knows the loop bounds, it can get rid of jumps entirely, but dynamic loops can also be unrolled as long as a factor (in this case 8) is provided. That's worse than a static unroll, but often still effective.

Example of such a manually unrolled loop, from Wikipedia:
```c
#include <stdio.h>

/* The number of entries processed per loop iteration.                       */
/* Note that this number is a 'constant constant' reflecting the code below. */
enum {
  BUNCHSIZE = 8
};

int main(void)
{
  int i = 0;                          /* counter                 */
  int entries = 50;                   /* total number to process */

  /* If the number of elements is not divisible by BUNCHSIZE, get the   */
  /* repeat count required to do most processing in the while loop      */

  int repeat = (entries / BUNCHSIZE); /* number of times to repeat */
  int left   = (entries % BUNCHSIZE); /* calculate remainder       */

  /* Unroll the loop in 'bunches' of 8 */
  while (repeat--)
  {
    printf("process(%d)\n", i    );
    printf("process(%d)\n", i + 1);
    printf("process(%d)\n", i + 2);
    printf("process(%d)\n", i + 3);
    printf("process(%d)\n", i + 4);
    printf("process(%d)\n", i + 5);
    printf("process(%d)\n", i + 6);
    printf("process(%d)\n", i + 7);

    /* update the index by the amount processed in one go */
    i += BUNCHSIZE;
  }

  /* Use a switch statement to process the remainder by jumping to the case */
  /* label that will then fall through to complete the set                  */
  switch (left)
  {
    case 7 : printf("process(%d)\n", i + 6); /* process and rely on
                                                fall-through              */
    case 6 : printf("process(%d)\n", i + 5);
    case 5 : printf("process(%d)\n", i + 4);
    case 4 : printf("process(%d)\n", i + 3);
    case 3 : printf("process(%d)\n", i + 2);
    case 2 : printf("process(%d)\n", i + 1); /* two left                  */
    case 1 : printf("process(%d)\n", i);     /* just one left to process  */
    case 0 : ;                               /* none left                 */
  }
}
```
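
For comparison, the `#pragma GCC unroll 8` seen in the diff asks GCC to apply the same factor-8 transformation to a runtime-bound loop automatically. A minimal sketch (the loop body here is hypothetical, not from the PR):

```c
#include <stddef.h>

/* GCC 8+ honors this pragma even when n is only known at runtime;
 * it emits the unrolled body plus a remainder loop, much like the
 * manual version above. */
void process_all(float *data, size_t n) {
    #pragma GCC unroll 8
    for (size_t i = 0; i < n; ++i) {
        data[i] *= 2.0f;  /* hypothetical per-element work */
    }
}
```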

@mgiessing (Contributor, author) replied:

I wasn't too sure about that either. I just saw it was done for the other quants, and in my testing (which might have been impacted by other factors as well) I saw an improvement of ~1-2 t/s using llama-bench.
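
For reference, numbers in the style of the tables above can be gathered with a run along these lines (model path hypothetical; `-n 64,128` produces the tg64/tg128 rows and `-p 0` skips the prompt-processing tests):

```sh
./llama-bench -m gpt-oss-20b-mxfp4.gguf -t 8,16 -p 0 -n 64,128
```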

@ggerganov merged commit 6424594 into ggml-org:master on Aug 19, 2025
47 checks passed