The big picture of ggml-qnn is mapping the ggml computational graph to a QNN computational graph. With the breakthrough help from chiwwang@QTI in April 2024, we have already found that there are different technical paths to utilize the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:
The first technical approach.
Pros: this approach can benefit greatly from the excellent "backend scheduler" feature in the ggml backend subsystem, and it can be a "functional implementation" or a good starting point in the upstream llama.cpp community. Accordingly, this approach can be verified easily: https://github.com/kantv-ai/llama.cpp/wiki (a minimal per-op sketch of this approach is also given after the diagram below).
Cons: there might be a performance concern in the ggml-qnn backend.
An example of this approach can also be found at https://github.com/kantv-ai/llama.cpp/blob/kantvai-ggmlqnn/ggml/src/ggml-qnn/ggml-qnn.cpp#L3516:

```mermaid
graph TD;
  src0-->transpose_dst;
  src1-->transpose_dst;
  transpose_dst-->dst;
```
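To make the first approach concrete, here is a minimal, hedged sketch of per-op offloading, assuming it integrates with the ggml backend scheduler through a supports-op check. The type qnn_instance and all ggmlqnn_* helper names below are hypothetical placeholders for illustration, not the actual ggml-qnn or QNN SDK API:

```cpp
// Hypothetical sketch of the first approach (per-op offload).
// qnn_instance and the ggmlqnn_* helpers are illustrative placeholders;
// the real code lives in ggml-qnn.cpp.
#include "ggml.h"

struct qnn_instance;  // placeholder for the per-device QNN context

// placeholder: real code would build (or fetch from a cache) a QNN graph that
// contains just this one op, bind src0/src1/dst buffers, finalize the graph
// and execute it on the Hexagon NPU
static bool ggmlqnn_exec_single_op(qnn_instance * /*ctx*/, const ggml_tensor * /*dst*/) {
    return true;
}

// exposed through the backend's supports_op hook: ops that return false here
// are routed to another backend (e.g. CPU) by the ggml backend scheduler
static bool ggmlqnn_supports_op(const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_MUL_MAT:
            return true;
        default:
            return false;
    }
}

static bool ggmlqnn_compute_forward(qnn_instance * ctx, ggml_tensor * dst) {
    if (!ggmlqnn_supports_op(dst)) {
        return false;  // let the scheduler fall back to the CPU backend
    }
    return ggmlqnn_exec_single_op(ctx, dst);
}
```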
The second technical approach.
Pros: this approach might be equivalent to the principle shown in the above quoted code, and we guess that is the secret of how to utilize the Hexagon NPU maximally in the QNN backend: every node in ggml's cgraph is executed on the Hexagon NPU by mapping the entire ggml computational graph to a QNN computational graph. By the way, we currently don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl; in other words, we don't know why there is such a big difference between the first approach and this approach.
Cons: this approach cannot take advantage of the backend scheduler feature, and the workload is much larger. There are many undocumented (or not clearly documented) technical details in the QNN SDK, so we think the necessary technical support would need to be provided by Qualcomm's tech team, even if we accomplish the final goal through the first approach with help from the great llama.cpp community.
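By contrast, here is a minimal, hedged sketch of the second approach under the same caveat: qnn_graph, qnn_graph_create, ggmlqnn_map_node and qnn_graph_finalize_and_execute are hypothetical placeholders standing in for the QNN SDK graph APIs, and the loop assumes the public ggml_graph_n_nodes()/ggml_graph_node() accessors available in recent ggml:

```cpp
// Hypothetical sketch of the second approach (whole-graph mapping).
// qnn_graph and the helpers below are illustrative placeholders wrapping the
// QNN SDK; they are NOT the actual ggml-qnn or QNN API.
#include "ggml.h"

struct qnn_graph { int placeholder; };  // stand-in for a QNN graph handle

static qnn_graph * qnn_graph_create(const char * /*name*/) {
    static qnn_graph g = {0};
    return &g;  // placeholder: real code would call the QNN graph-create API
}
static bool ggmlqnn_map_node(qnn_graph *, const ggml_tensor *)  { return true; }  // placeholder
static bool qnn_graph_finalize_and_execute(qnn_graph *)         { return true; }  // placeholder

// Translate the whole ggml cgraph into one QNN graph so that every node runs
// on the Hexagon NPU, instead of offloading one op at a time.
static bool ggmlqnn_compute_cgraph(struct ggml_cgraph * cgraph) {
    qnn_graph * g = qnn_graph_create("ggml_cgraph");
    if (g == nullptr) {
        return false;
    }
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        // each ggml node (ADD, MUL_MAT, ...) becomes one node in the QNN
        // graph; tensors shared between nodes become QNN intermediate tensors
        if (!ggmlqnn_map_node(g, ggml_graph_node(cgraph, i))) {
            return false;  // one unmappable op breaks the whole-graph approach
        }
    }
    // finalize once, then execute the entire graph on the NPU in one call
    return qnn_graph_finalize_and_execute(g);
}
```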
Corrections from domain technical experts are greatly welcomed and appreciated.
---
Please refer to:
https://github.com/kantv-ai/llama.cpp/wiki
This is continued development activity based on my previous PR, ongoing since early February 2025.
This is a concise implementation (without complex/complicated/redundant/… encapsulation), and we hope domain programmers/developers/experts can understand the code and technical details quickly.