OpenCL: add tiled mul_mat_f16_f32 #14535

Open
rmatif wants to merge 2 commits into master

Conversation

@rmatif (Collaborator) commented Jul 4, 2025

This PR introduces a new mul_mat_f16_f32 kernel that leverages tiling and vectorization. I believe this will serve as a strong baseline for future improvements.
In a future PR, I may explore using image2d_t to utilize the L1 cache for mul_mat and conv2d operations. This is a bit tricky, as it requires some data preprocessing on the host side.
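For readers unfamiliar with the technique, here is a minimal, hypothetical sketch of what a tiled, vectorized f16 x f32 matmul kernel can look like. This is an illustration, not the PR's actual kernel: it assumes plain row-major layouts, K divisible by 4, and a small register tile of TILE_N output columns, and it ignores ggml's real stride handling.

```c
// Hypothetical OpenCL sketch: each work-item computes a 1 x TILE_N tile of
// C = A (M x K, f16) * B^T (N x K, f32), reusing each vectorized f16 load
// of A across all TILE_N output columns of its tile.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

#define TILE_N 4  // output columns per work-item (assumption)

kernel void mul_mat_f16_f32_tiled(
        global const half  * A,   // M x K, row-major f16
        global const float * B,   // N x K, row-major f32 (transposed input)
        global       float * C,   // M x N, row-major f32
        const int M, const int N, const int K) {
    const int m  = get_global_id(0);           // output row
    const int n0 = get_global_id(1) * TILE_N;  // first output column of tile
    if (m >= M) return;

    float acc[TILE_N] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int k = 0; k < K; k += 4) {           // assumes K % 4 == 0
        // one vectorized f16 load, promoted to f32 once...
        const float4 a = convert_float4(vload4(0, A + m * K + k));
        // ...and reused across every column in the tile
        for (int j = 0; j < TILE_N; ++j) {
            if (n0 + j < N) {
                const float4 b = vload4(0, B + (n0 + j) * K + k);
                acc[j] += dot(a, b);
            }
        }
    }
    for (int j = 0; j < TILE_N; ++j) {
        if (n0 + j < N) C[m * N + n0 + j] = acc[j];
    }
}
```

The tile buys arithmetic intensity: each f16 load of A is amortized over TILE_N dot products instead of one.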

Results on Adreno 830:

Master:

| model        |       size |     params | backend | ngl |  test |           t/s |
| ------------ | ---------: | ---------: | ------- | --: | ----: | ------------: |
| llama 1B F16 |   2.30 GiB |     1.24 B | OpenCL  |  99 | pp512 |  19.24 ± 0.88 |
| llama 1B F16 |   2.30 GiB |     1.24 B | OpenCL  |  99 | tg128 |  18.87 ± 4.37 |

This PR:

| model        |       size |     params | backend | ngl |  test |           t/s |
| ------------ | ---------: | ---------: | ------- | --: | ----: | ------------: |
| llama 1B F16 |   2.30 GiB |     1.24 B | OpenCL  |  99 | pp512 | 168.17 ± 0.41 |
| llama 1B F16 |   2.30 GiB |     1.24 B | OpenCL  |  99 | tg128 |  22.61 ± 0.02 |

@lhez @max-krasnyansky

The github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels on Jul 4, 2025.
@ggerganov requested a review from max-krasnyansky on July 4, 2025 at 17:57.
@lhez (Collaborator) commented Jul 4, 2025

@rmatif thank you for the PR. I will play with it and the direct convolution PR in the next few days.

For matmul, using image1d_buffer is probably the easiest way to utilize the L1 cache: it wraps around a normal cl buffer and uses read_image for access, so indexing stays the same as with a cl buffer. The Q4_0 matmul already does this. It is also possible to use a normal cl buffer for one matrix input and an image1d_buffer for the other, to use both load paths.
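For context, a rough host-side sketch of the wrapping described above. Names like `ctx`, `weights_buf`, and `n_half_elements` are placeholders, and packing four halves per RGBA/CL_HALF_FLOAT texel is an assumption, not ggml's actual choice:

```c
// Wrap an existing cl_mem buffer in an image1d_buffer so kernels can read
// it with read_imageh() through the texture path; the underlying storage
// stays the same, only the load path changes. Width is limited by the
// device's CL_DEVICE_IMAGE_MAX_BUFFER_SIZE.
cl_image_format fmt = {
    .image_channel_order     = CL_RGBA,        // 4 halves per texel
    .image_channel_data_type = CL_HALF_FLOAT,
};
cl_image_desc desc = {0};
desc.image_type  = CL_MEM_OBJECT_IMAGE1D_BUFFER;
desc.image_width = n_half_elements / 4;        // width in texels
desc.buffer      = weights_buf;                // the existing cl buffer

cl_int err;
cl_mem weights_img = clCreateImage(ctx, 0 /* 0 inherits the buffer's flags */,
                                   &fmt, &desc, NULL, &err);
```

In a kernel, `half4 v = read_imageh(img, i);` then fetches texel `i` (four consecutive halves), so the index arithmetic matches the plain-buffer version divided by 4.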

@zhouwg (Contributor) commented Jul 5, 2025

@rmatif, sorry to bother you, and congratulations on another excellent PR for ggml-opencl.

  1. Could you provide a script that automatically sets up a local dev environment and builds ggml-opencl for an Android phone? It would be very helpful for other developers who also want to play with this excellent PR and reproduce the impressive benchmark data on an 8 Elite based phone. FYI, here is a similar script that simplifies the development and test workflow for Qualcomm's Hexagon NPU backend on Android.
  2. You are clearly familiar with Android development and very good at several areas of hardcore AI tech.
  3. It seems the Adreno 830 is the GPU component of Snapdragon 8 Elite phones. Could you help review my PR for the Hexagon NPU on Android, or reproduce its benchmark data in my forked llama.cpp project ggml-hexagon, if you have time?
  4. Following the example of other existing backends, I think we can split the Hexagon NPU PR into several steps:
    • Verify the host-side code, then merge it to the master branch
    • Verify GGML_OP_ADD on the cDSP side, then merge it to the master branch
    • Verify fp32 mulmat on the cDSP side, then merge it to the master branch
    • You and other AI experts can add other ops accordingly .....

What do you think of this plan? Looking forward to your reply and advice, and thanks.

@rmatif (Collaborator, Author) commented Jul 5, 2025

> @rmatif thank you for the PR. I will play with it and the direct convolution PR in the next few days.
>
> For matmul, using image1d_buffer is probably the easiest way to utilize the L1 cache: it wraps around a normal cl buffer and uses read_image for access, so indexing stays the same as with a cl buffer. The Q4_0 matmul already does this. It is also possible to use a normal cl buffer for one matrix input and an image1d_buffer for the other, to use both load paths.

@lhez You're right, using image1d_buffer is indeed a much simpler way to leverage the L1 cache. It avoids the need to manually handle row_pitch and the complexity of converting data into a 2D-tiled memory format, as it essentially acts as a "view" of an existing cl_buffer. I may begin by looking into that first as an incremental step.
However, I believe image2d_t is ultimately the best path forward, especially on Adreno, because its L1 cache is highly optimized for 2D spatial locality. MNN uses this technique extensively for its matmul op.
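To illustrate the image2d_t approach, here is a hypothetical device-side sketch (not MNN's or ggml's code): it assumes the f16 matrix has already been repacked on the host into a K/4 x M RGBA16F image, with K divisible by 4.

```c
// Each work-item dots one matrix row against an f32 vector, fetching the
// f16 row through image2d_t; neighboring work-items read neighboring y
// coordinates, which is the 2D spatial locality Adreno's L1 texture
// cache is optimized for.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

kernel void mul_mat_vec_f16_image2d(
        read_only image2d_t A,   // K/4 x M image of half4 texels
        global const float * x,  // K-length f32 input vector
        global       float * y,  // M-length f32 output
        const int M, const int K) {
    const int m = get_global_id(0);
    if (m >= M) return;

    float acc = 0.0f;
    for (int k4 = 0; k4 < K / 4; ++k4) {
        // sampler-less read: coord = (group of 4 halves, matrix row)
        const half4  a = read_imageh(A, (int2)(k4, m));
        const float4 v = vload4(0, x + 4 * k4);
        acc += dot(convert_float4(a), v);
    }
    y[m] = acc;
}
```

The host-side repacking (laying the matrix out as image rows, padding the width) is exactly the preprocessing cost mentioned in the PR description.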

> What do you think of this plan? Looking forward to your reply and advice, and thanks.

@zhouwg Please reach out to me via email; I'll send you the build scripts and we can discuss further, as this seems off-topic here.
In short, my current take is that our time and effort would be better spent optimizing OpenCL, where there is still significant room for improvement. For the moment, it's not clear to me that we can achieve good enough performance on Hexagon.

@zhouwg (Contributor) commented Jul 6, 2025

@rmatif, thanks so much for your help. I'm excited to be running the ggml-opencl backend on my Snapdragon 8 Elite based phone for the first time.

llama-bench with qwen1_5-1_8b-chat-q4_0.gguf on master:

```
zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamabench
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.6 MB/s (6962256 bytes in 0.378s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 17.6 MB/s (3407440 bytes in 0.185s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 17.4 MB/s (1476880 bytes in 0.081s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 17.0 MB/s (1764248 bytes in 0.099s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 18.5 MB/s (24163448 bytes in 1.245s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 17.7 MB/s (4863024 bytes in 0.261s)
6 files pushed. 18.1 MB/s (42637296 bytes in 2.252s)
./out/ggmlopencl-android/bin/llama-bench: 1 file pushed. 17.7 MB/s (4770920 bytes in 0.258s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 10:04 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 10:04 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5849280 2025-07-05 08:04 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1476880 2025-07-06 10:04 /data/local/tmp/libggml-opencl.so
adb shell "cd /data/local/tmp                && export LD_LIBRARY_PATH=/data/local/tmp                && /data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/qwen1_5-1_8b-chat-q4_0.gguf"
/data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/qwen1_5-1_8b-chat-q4_0.gguf
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen2 1B Q4_0                  |   1.04 GiB |     1.84 B | OpenCL     |  99 |       4 |           pp512 |        329.52 ± 0.47 |
| qwen2 1B Q4_0                  |   1.04 GiB |     1.84 B | OpenCL     |  99 |       4 |           tg256 |         29.77 ± 0.06 |

build: a4701c4be (6025)
running time:2025-07-06,10:21:10
```


llama-cli with qwen1_5-1_8b-chat-q4_0.gguf on master:

```
zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamacli
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.6 MB/s (6962256 bytes in 0.377s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 17.6 MB/s (3407440 bytes in 0.184s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 16.5 MB/s (1476880 bytes in 0.086s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 16.6 MB/s (1764248 bytes in 0.101s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 19.9 MB/s (24163448 bytes in 1.158s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 18.6 MB/s (4863024 bytes in 0.249s)
6 files pushed. 18.8 MB/s (42637296 bytes in 2.159s)
./out/ggmlopencl-android/bin/llama-cli: 1 file pushed. 18.6 MB/s (27712544 bytes in 1.425s)
-rwxrwxrwx 1 shell shell  6962256 2025-07-07 22:19 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell  3407440 2025-07-07 22:19 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell  5848720 2025-07-07 22:02 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell  1476880 2025-07-07 22:19 /data/local/tmp/libggml-opencl.so
-rwxrwxrwx 1 shell shell 25159984 2025-07-07 22:27 /data/local/tmp/libggml-vulkan.so
/data/local/tmp/llama-cli  -ngl 99 -t 4 -n 256 --no-warmup  -no-cnv -m /sdcard/qwen1_5-1_8b-chat-q4_0.gguf -p "introduce the movie Once Upon a Time in America briefly.\n"
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
build: 6038 (12ac0f350) with Android (12896553, +pgo, +bolt, +lto, +mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM) 830) - 0 MiB free
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /sdcard/qwen1_5-1_8b-chat-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-1.8B-Chat-AWQ-fp16
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 5504
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  10:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 1.04 GiB (4.85 BPW) 
load: missing pre-tokenizer type, using: 'default'
load:                                             
load: ************************************        
load: GENERATION QUALITY WILL BE DEGRADED!        
load: CONSIDER REGENERATING THE MODEL             
load: ************************************        
load:                                             
load: special tokens cache size = 293
load: token to piece cache size = 0.9338 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 24
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5504
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 1B
print_info: model params     = 1.84 B
print_info: general.name     = Qwen1.5-1.8B-Chat-AWQ-fp16
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   166.92 MiB
load_tensors:       OpenCL model buffer size =   895.75 MiB
................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:     OpenCL KV buffer size =   768.00 MiB
llama_kv_cache_unified: size =  768.00 MiB (  4096 cells,  24 layers,  1 seqs), K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:     OpenCL compute buffer size =   300.75 MiB
llama_context:        CPU compute buffer size =    12.01 MiB
llama_context: graph nodes  = 942
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

introduce the movie Once Upon a Time in America briefly.
sampler seed: 710723739
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 0

Once Upon a Time in America is a 1988 American crime film directed by Francis Ford Coppola. The film is a crime drama that tells the story of two New York City detectives, Sam Marlow (played by Robert De Niro) and David Frost (played by Al Pacino), who are assigned to investigate the murder of a wealthy businessman named John Reed (played by Al Pacino).
The film takes place in New York City in the late 1950s and early 1960s, during the height of the Cold War. The film follows Marlow and Frost as they delve into the complex web of secrets and corruption that runs through the city's elite society, including the high society, the underworld, and the political establishment.
Marlow and Frost are initially assigned to work on the murder of Reed, but they soon realize that he is not the only target of their investigation. Reed's business partner, a rival businessman named Arthur Goldstein (played by Robert De Niro), has also been murdered and his body is found in a nearby park.
As Marlow and Frost uncover the truth about Reed's death, they discover a complex web of relationships, hidden agendas, and power struggles that threatens to destroy the city's social fabric

llama_perf_sampler_print:    sampling time =     131.08 ms /   269 runs   (    0.49 ms per token,  2052.13 tokens per second)
llama_perf_context_print:        load time =    2395.93 ms
llama_perf_context_print: prompt eval time =     104.11 ms /    13 tokens (    8.01 ms per token,   124.87 tokens per second)
llama_perf_context_print:        eval time =    8510.46 ms /   255 runs   (   33.37 ms per token,    29.96 tokens per second)
llama_perf_context_print:       total time =   11124.95 ms /   268 tokens
```

llama-bench with Llama-3.2-1B-Instruct-f16.gguf on this PR:

```
zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamabench
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 29.0 MB/s (6962256 bytes in 0.229s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 29.6 MB/s (3407440 bytes in 0.110s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 27.0 MB/s (1487528 bytes in 0.053s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 27.7 MB/s (1764248 bytes in 0.061s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 28.9 MB/s (24163448 bytes in 0.798s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 28.7 MB/s (4863024 bytes in 0.161s)
6 files pushed. 28.7 MB/s (42647944 bytes in 1.415s)
./out/ggmlopencl-android/bin/llama-bench: 1 file pushed. 29.0 MB/s (4770920 bytes in 0.157s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 12:36 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 12:36 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5848736 2025-07-06 10:56 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1487528 2025-07-06 12:36 /data/local/tmp/libggml-opencl.so
adb shell "cd /data/local/tmp                && export LD_LIBRARY_PATH=/data/local/tmp                && /data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf"
/data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels.....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           pp512 |        155.24 ± 0.36 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           tg256 |         20.11 ± 0.04 |

build: 0de83f97c (6027)
running time:2025-07-06,12:38:05
```

llama-bench with Llama-3.2-1B-Instruct-f16.gguf on master:

```
zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamabench
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.5 MB/s (6962256 bytes in 0.379s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 17.3 MB/s (3407440 bytes in 0.188s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 16.7 MB/s (1476880 bytes in 0.084s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 17.2 MB/s (1764248 bytes in 0.098s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 18.9 MB/s (24163448 bytes in 1.219s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 17.2 MB/s (4863024 bytes in 0.269s)
6 files pushed. 18.2 MB/s (42637296 bytes in 2.239s)
./out/ggmlopencl-android/bin/llama-bench: 1 file pushed. 17.8 MB/s (4770920 bytes in 0.256s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 12:42 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 12:42 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5848736 2025-07-06 10:56 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1476880 2025-07-06 12:42 /data/local/tmp/libggml-opencl.so
adb shell "cd /data/local/tmp                && export LD_LIBRARY_PATH=/data/local/tmp                && /data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf"
/data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           pp512 |         15.54 ± 1.89 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           tg256 |         16.02 ± 0.03 |

build: 0de83f97c (6027)
running time:2025-07-06,12:50:12
```

BTW, I provide a simple shell script that builds the ggml-opencl backend on Linux and simplifies the workflow: https://github.com/zhouwg/ggml-hexagon/blob/self-build/scripts/build-run-ggmlopencl-android.sh

Can I add this script to this PR, or submit it as a standalone PR, so that other developers can help verify ggml-opencl related PRs or learn about OpenCL programming on Android phones? The script is technically simple, but it might be very helpful for other developers.
