MUSA: support ARM64 and enable dp4a etc. #11843
Conversation
Hi @JohannesGaessler, @ggerganov, @slaren, @yeahdongcn, can you please help review this PR? Thanks a lot.
The changes to the CUDA backend look fine to me other than the things I commented on. I don't know whether the changes for model support are correct.
Please run the functionality tests and the tests under the …
Hi @JohannesGaessler, the changes to model support are to enable the …
Hi @yeahdongcn, I see #11822 has been merged. When running …
FYI, the above …
Hi @yeahdongcn, the model running issue has been fixed on x86, …
Hi @slaren, the …
Performed some tests on …
Model list: ❯ ls -l ~/models/
total 12471020
-rw-r--r-- 1 xiaodongye xiaodongye 4683073184 1月 21 18:43 deepseek-r1_7b_q4_0.gguf
-rw-rw-r-- 1 xiaodongye xiaodongye 1321082688 9月 26 01:19 llama3.2_1b_q8_0.gguf
-rw-rw-r-- 1 xiaodongye xiaodongye 4661211424 5月 21 2024 llama3_8b_q4_0.gguf
-rw-rw-r-- 1 xiaodongye xiaodongye 2104932768 2月 18 10:51 qwen2.5-3b-instruct-q4_k_m.gguf
❯ ./test-backend-ops
...
FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q8_0,permute=[0,1,2,3]): not supported [MUSA0]
FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=35,mask=0,max_bias=0.000000,logit_softcap=0.000000,type_KV=q4_0,permute=[0,1,2,3]): not supported [MUSA0]
CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): MUSA error: invalid argument
current device: 0, in function ggml_cuda_cross_entropy_loss at /home/xiaodongye/ws/ggml/llama.cpp/ggml/src/ggml-cuda/cross-entropy-loss.cu:129
musaFuncSetAttribute(cross_entropy_loss_f32<true>, musaFuncAttributeMaxDynamicSharedMemorySize, smpbo)
/home/xiaodongye/ws/ggml/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: MUSA error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.

@slaren Could you please approve the workflow so I can verify if this PR works on other backends? Thanks!
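For context on the failure above: raising cudaFuncAttributeMaxDynamicSharedMemorySize is the standard CUDA opt-in for kernels that need more than the default 48 KiB of dynamic shared memory, and the MUSA runtime appears to reject that attribute, hence the "invalid argument". Below is a minimal sketch of the pattern, for illustration only; it is not code from this PR, and the kernel and helper names are made up.

#include <cuda_runtime.h>

// A kernel that uses dynamic shared memory.
__global__ void demo_kernel(float *out) {
    extern __shared__ float buf[];          // dynamic shared memory
    buf[threadIdx.x] = (float) threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

// Opt in to a larger per-kernel dynamic shared memory limit (up to the
// device's sharedMemPerBlockOptin). This returns cudaSuccess on CUDA;
// the equivalent musaFuncSetAttribute call fails on this device.
cudaError_t raise_shared_mem_limit(int nbytes) {
    return cudaFuncSetAttribute(demo_kernel,
                                cudaFuncAttributeMaxDynamicSharedMemorySize,
                                nbytes);
}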
diff --git a/ggml/src/ggml-cuda/cross-entropy-loss.cu b/ggml/src/ggml-cuda/cross-entropy-loss.cu
index 223576b2..0ce4afbb 100644
--- a/ggml/src/ggml-cuda/cross-entropy-loss.cu
+++ b/ggml/src/ggml-cuda/cross-entropy-loss.cu
@@ -123,13 +123,13 @@ void ggml_cuda_cross_entropy_loss(ggml_backend_cuda_context & ctx, ggml_tensor *
ggml_cuda_pool_alloc<float> dst_tmp(pool, blocks_num.x);
if (nbytes_shared <= smpbo) {
-#if !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__))
+#if !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};
if (!shared_memory_limit_raised[id]) {
CUDA_CHECK(cudaFuncSetAttribute(cross_entropy_loss_f32<true>, cudaFuncAttributeMaxDynamicSharedMemorySize, smpbo));
shared_memory_limit_raised[id] = true;
}
-#endif // !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__))
+#endif // !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
cross_entropy_loss_f32<true><<<blocks_num, blocks_dim, nbytes_shared, stream>>>(src0_d, src1_d, dst_tmp.ptr, ne00, nrows);
} else {
cross_entropy_loss_f32<false><<<blocks_num, blocks_dim, 0, stream>>>(src0_d, src1_d, dst_tmp.ptr, ne00, nrows);
@@ -175,13 +175,13 @@ void ggml_cuda_cross_entropy_loss_back(ggml_backend_cuda_context & ctx, ggml_ten
const size_t smpbo = ggml_cuda_info().devices[id].smpbo;
if (nbytes_shared <= smpbo) {
-#if !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__))
+#if !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};
if (!shared_memory_limit_raised[id]) {
CUDA_CHECK(cudaFuncSetAttribute(cross_entropy_loss_back_f32<true>, cudaFuncAttributeMaxDynamicSharedMemorySize, smpbo));
shared_memory_limit_raised[id] = true;
}
-#endif // !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__))
+#endif // !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
cross_entropy_loss_back_f32<true><<<blocks_num, blocks_dim, nbytes_shared, stream>>>(grad_d, src0f_d, src1f_d, dst_d, ne00);
} else {
cross_entropy_loss_back_f32<false><<<blocks_num, blocks_dim, 0, stream>>>(grad_d, src0f_d, src1f_d, dst_d, ne00);

It seems we need to disable them similarly to …
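One note on why the log reports musaFuncSetAttribute even though the source calls cudaFuncSetAttribute: as I understand it, the MUSA build reuses the CUDA sources through a vendor-mapping header that redefines the cuda* symbols to their musa* counterparts, roughly like the sketch below (illustrative only, not the actual header contents; the include name is an assumption). Guarding the call with !defined(GGML_USE_MUSA) therefore skips the unsupported attribute on MUSA, the same way the HIP path already skips it.

#pragma once
// Illustrative sketch only, not the real vendor header: the MUSA build
// compiles the CUDA sources with mappings like these, so the cuda* calls in
// cross-entropy-loss.cu resolve to the musa* names seen in the error log.
#include <musa_runtime.h>   // assumption: MUSA runtime header name

#define cudaError_t                                   musaError_t
#define cudaSuccess                                   musaSuccess
#define cudaFuncSetAttribute                          musaFuncSetAttribute
#define cudaFuncAttributeMaxDynamicSharedMemorySize   musaFuncAttributeMaxDynamicSharedMemorySize
// ...and so on for every CUDA runtime symbol the backend uses.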
Hi @yeahdongcn, as we aligned, I have applied the above code changes.
Hi @JohannesGaessler, @slaren, @yeahdongcn, all the OP tests have now passed. Could you please help merge this PR?
LGTM
@ggerganov, @JohannesGaessler, thanks a lot.
Yes, we can merge after the CI workflows have passed. I just started them.
This PR will:
- support ARM64;
- enable dp4a, etc. on MUSA (a short sketch of what dp4a computes follows at the end of this description).

Build: …

Example run: …

Tested with the following models:
- ARM64: …
- x86: …
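Since the headline feature is dp4a, here is a minimal, self-contained sketch (not code from this PR) of what the intrinsic computes: __dp4a(a, b, c) treats a and b as four packed signed 8-bit lanes, multiplies them lane-wise, sums the four products, and adds c. This is what lets the quantized mat-mul kernels consume four int8 values per instruction once dp4a is available on the target.

// Build with e.g. nvcc -arch=sm_61 (dp4a needs compute capability 6.1+);
// error checks omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

// Each int packs four signed 8-bit lanes; __dp4a does a 4-wide dot product
// with accumulate: out = a0*b0 + a1*b1 + a2*b2 + a3*b3 + c.
__global__ void dp4a_demo(const int *a, const int *b, int *out) {
    *out = __dp4a(*a, *b, 0);
}

int main() {
    int ha = 0x01020304;   // lanes (low to high byte): 4, 3, 2, 1
    int hb = 0x01010101;   // lanes: 1, 1, 1, 1
    int hout = 0;
    int *da, *db, *dout;
    cudaMalloc(&da, sizeof(int));
    cudaMalloc(&db, sizeof(int));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(int), cudaMemcpyHostToDevice);
    dp4a_demo<<<1, 1>>>(da, db, dout);
    cudaMemcpy(&hout, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a: %d\n", hout);   // expected: 4 + 3 + 2 + 1 = 10
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}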