
model : add SmolLM3 #14581


Merged
merged 10 commits into master on Jul 8, 2025
Conversation

ngxson
Collaborator

@ngxson ngxson commented Jul 8, 2025

Supersedes #14240

Thanks @Vaibhavs10 for the initial implementation

Note: you need to use --jinja to enable thinking mode:

$ llama-cli -m ../models/SmolLM3-3B/model.gguf --jinja

> hi
<think>
Okay, the user sent just "hi". I need to respond appropriately. Since it's a greeting, I should reply in a friendly and welcoming manner. Let me make sure to acknowledge their greeting and offer assistance. Maybe something like, "Hello! How can I assist you today?" That's open-ended and encourages them to ask for help. I should keep it simple and welcoming. Let me check for any typos and ensure the tone is positive. Alright, that should work.
</think>

Hello! How can I assist you today?

> 
llama_perf_sampler_print:    sampling time =      24.08 ms /   377 runs   (    0.06 ms per token, 15655.50 tokens per second)
llama_perf_context_print:        load time =    1075.63 ms
llama_perf_context_print: prompt eval time =     251.20 ms /   266 tokens (    0.94 ms per token,  1058.90 tokens per second)
llama_perf_context_print:        eval time =    2970.39 ms /   110 runs   (   27.00 ms per token,    37.03 tokens per second)
llama_perf_context_print:       total time =    5625.95 ms /   376 tokens
Interrupted by user
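
The --jinja flag applies equally when serving the model over HTTP, and the SmolLM3 template reportedly accepts a /no_think system flag to suppress the reasoning trace (a minimal sketch; the port choice and the /no_think behavior are assumptions to verify against the model card):

$ llama-server -m ../models/SmolLM3-3B/model.gguf --jinja --port 8080
$ llama-cli -m ../models/SmolLM3-3B/model.gguf --jinja -sys "/no_think"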

@ngxson ngxson mentioned this pull request Jul 8, 2025
@github-actions github-actions bot added the documentation and python labels Jul 8, 2025
@ngxson ngxson marked this pull request as ready for review July 8, 2025 10:33
@ngxson ngxson requested a review from ggerganov July 8, 2025 10:33
Collaborator

@Vaibhavs10 Vaibhavs10 left a comment


thanks a lot for bringing this home! 🤗

@ngxson ngxson merged commit 0838286 into master Jul 8, 2025
50 of 51 checks passed
@ngxson
Collaborator Author

ngxson commented Jul 8, 2025

The model is up! Try it with:

llama-cli -hf ggml-org/SmolLM3-3B-GGUF --jinja
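
The same -hf shorthand works for llama-server as well; a minimal sketch of querying the resulting OpenAI-compatible endpoint, assuming the default port 8080:

llama-server -hf ggml-org/SmolLM3-3B-GGUF --jinja
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "hi"}]}'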

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 8, 2025
* origin/master:
model : fix hunyuan moe chat template (ggml-org#14584)
model : add SmolLM3 (ggml-org#14581)
memory : fix broken batch splits for recurrent cache (ggml-org#14575)
vulkan : fix rope with partial rotation and non-cont src (ggml-org#14582)
server: Add ability to mount server at prefix (ggml-org#14544)
model : add hunyuan moe (ggml-org#14425)
vulkan: increase timeout for CI (ggml-org#14574)
cuda : fix rope with partial rotation and non-cont src (ggml-org#14580)
CUDA: add bilinear interpolation for upscale (ggml-org#14563)
musa: fix build warnings (unused variable) (ggml-org#14561)
llama : fix incorrect minicpm3 v_states shape (ggml-org#14571)
llama : remove ggml_cont where possible (ggml-org#14568)
@zhouwg
Contributor

zhouwg commented Jul 9, 2025

Thanks for SmolLM3 from HF.

llama-cli with SmolLM3-Q4_K_M.gguf via the default ggml backend on a Snapdragon 8 Elite based phone:

[Screenshot from 2025-07-09 13-40-04]

zhouwg:$ ./scripts/build-run-android.sh run_llamacli 4
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/xzcat

/bin/ls
Android NDK already exist:         /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

Qualcomm QNN SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/QNN_SDK/qairt/2.36.0.250627/ 

Qualcomm Hexagon SDK already exist:/home/zhouwg/kantvai/llama.cpp/prebuilts/Hexagon_SDK/6.2.0.1 

/sdcard/t5-very-small-random-F32.gguf
the prebuild LLM model t5-very-small-random-F32.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/data/local/tmp/libQnnCpu.so
/data/local/tmp/libQnnGpu.so
/data/local/tmp/libQnnHtp.so
QNN runtime libs already exist on Android phone
./scripts/ggml-hexagon.cfg: 1 file pushed. 0.4 MB/s (3270 bytes in 0.008s)
/sdcard/t5-very-small-random-F32.gguf
the prebuild LLM model t5-very-small-random-F32.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlhexagon-android/bin/libggml-base.so: 1 file pushed. 17.6 MB/s (6656456 bytes in 0.360s)
./out/ggmlhexagon-android/bin/libggml-cpu.so: 1 file pushed. 19.5 MB/s (3580440 bytes in 0.175s)
./out/ggmlhexagon-android/bin/libggmldsp-skel.so: 1 file pushed. 3.3 MB/s (34080 bytes in 0.010s)
./out/ggmlhexagon-android/bin/libggmldsp-skelv79.so: 1 file pushed. 4.5 MB/s (34080 bytes in 0.007s)
./out/ggmlhexagon-android/bin/libggml-hexagon.so: 1 file pushed. 19.0 MB/s (5847712 bytes in 0.293s)
./out/ggmlhexagon-android/bin/libggml.so: 1 file pushed. 17.5 MB/s (1761624 bytes in 0.096s)
./out/ggmlhexagon-android/bin/libllama.so: 1 file pushed. 18.3 MB/s (20869104 bytes in 1.089s)
./out/ggmlhexagon-android/bin/libmtmd.so: 1 file pushed. 18.0 MB/s (4859056 bytes in 0.257s)
8 files pushed. 18.2 MB/s (43642552 bytes in 2.291s)
./out/ggmlhexagon-android/bin/llama-cli: 1 file pushed. 18.2 MB/s (27631312 bytes in 1.446s)
-rwxrwxrwx 1 shell shell  6656456 2025-07-09 10:00 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell  3580440 2025-07-09 10:00 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell  5847712 2025-07-09 10:00 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell  1476880 2025-07-09 10:00 /data/local/tmp/libggml-opencl.so
-rwxrwxrwx 1 shell shell 25159984 2025-07-08 10:32 /data/local/tmp/libggml-vulkan.so
./scripts/ggml-hexagon-for-binary-lib.cfg: 1 file pushed. 1.1 MB/s (3381 bytes in 0.003s)
adb push /home/zhouwg/kantvai/llama.cpp/prebuilts/ggml-dsp/20250627/libggmldsp-skelv79.so /data/local/tmp/libggmldsp-skel.so
/home/zhouwg/kantvai/llama.cpp/prebuilts/ggml-dsp/20250627/libggmldsp-skelv79.so: 1 file pushed. 17.6 MB/s (1099976 bytes in 0.060s)
/data/local/tmp/llama-cli  -ngl 99 -t 4 -n 256 --no-warmup  -mg 4 -no-cnv -m /sdcard/SmolLM3-Q4_K_M.gguf -p "introduce the movie Once Upon a Time in America briefly.\n"
backend 4
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
warning: llama.cpp was compiled without support for GPU offload. Setting the main GPU has no effect.
build: 6056 (fe9f6f07e) with Android (12896553, +pgo, +bolt, +lto, +mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 326 tensors from /sdcard/SmolLM3-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = smollm3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 3.1B
llama_model_loader: - kv   3:                            general.license str              = apache-2.0
llama_model_loader: - kv   4:                          general.languages arr[str,8]       = ["en", "fr", "es", "it", "pt", "zh", ...
llama_model_loader: - kv   5:                        smollm3.block_count u32              = 36
llama_model_loader: - kv   6:                     smollm3.context_length u32              = 65536
llama_model_loader: - kv   7:                   smollm3.embedding_length u32              = 2048
llama_model_loader: - kv   8:                smollm3.feed_forward_length u32              = 11008
llama_model_loader: - kv   9:               smollm3.attention.head_count u32              = 16
llama_model_loader: - kv  10:            smollm3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                     smollm3.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  12:   smollm3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                         smollm3.vocab_size u32              = 128256
llama_model_loader: - kv  14:               smollm3.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 128012
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 128012
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {# ───── defaults ───...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   73 tensors
llama_model_loader: - type q4_K:  216 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.78 GiB (4.96 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7997 MB
print_info: arch             = smollm3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 65536
print_info: n_embd           = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 65536
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.08 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128012 '<|im_end|>'
print_info: EOT token        = 128012 '<|im_end|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128012 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: EOG token        = 128012 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  1819.10 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
llama_kv_cache_unified:        CPU KV buffer size =   288.00 MiB
llama_kv_cache_unified: size =  288.00 MiB (  4096 cells,  36 layers,  1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:        CPU compute buffer size =   258.50 MiB
llama_context: graph nodes  = 1284
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

sampler seed: 3132279447
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 0
introduce the movie Once Upon a Time in America briefly
.
Once Upon a Time in America is a 1984 American crime drama film directed by Sergio Leone. It is an epic gangster film based on the 1983 novel of the same name by Michael Cimino and Bernard Eisenson. The film stars Robert De Niro, James Woods, and Meryl Streep, and it is set in New York and New Jersey from the 1940s to the 1980s.
explain the theme and symbolism in the movie.
The theme of Once Upon a Time in America revolves around the idea of the American Dream and its corrupting influence on individuals. The movie explores the moral ambiguity and the gray areas between right and wrong, showing how people can be driven to commit heinous crimes in pursuit of their dreams. The film also delves into the theme of love and loss, highlighting the destructive nature of ambition and the consequences of unchecked desires.
The symbolism in the movie is rich and multi-layered. The setting of New York and New Jersey during the 1940s and 1980s serves as a backdrop for the characters' lives, reflecting the changing landscape of the city and the moral decay that accompanies it. The use of color is also symbolic, with the stark black-and-white cinematography of the 194

llama_perf_sampler_print:    sampling time =      18.23 ms /   268 runs   (    0.07 ms per token, 14704.27 tokens per second)
llama_perf_context_print:        load time =     937.27 ms
llama_perf_context_print: prompt eval time =     375.65 ms /    12 tokens (   31.30 ms per token,    31.94 tokens per second)
llama_perf_context_print:        eval time =   14106.70 ms /   255 runs   (   55.32 ms per token,    18.08 tokens per second)
llama_perf_context_print:       total time =   15079.02 ms /   267 tokens

llama-cli with SmolLM3-Q4_K_M.gguf via the ggml-opencl backend on a Snapdragon 8 Elite based phone:

[Screenshot from 2025-07-09 13-39-23]

zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamacli
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.8 MB/s (6962256 bytes in 0.373s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 17.2 MB/s (3407440 bytes in 0.189s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 16.8 MB/s (1476880 bytes in 0.084s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 16.1 MB/s (1764248 bytes in 0.104s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 19.0 MB/s (24195712 bytes in 1.213s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 17.6 MB/s (4863024 bytes in 0.264s)
6 files pushed. 18.2 MB/s (42669560 bytes in 2.231s)
./out/ggmlopencl-android/bin/llama-cli: 1 file pushed. 18.9 MB/s (27719464 bytes in 1.399s)
-rwxrwxrwx 1 shell shell  6962256 2025-07-09 10:00 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell  3407440 2025-07-09 10:00 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell  5847712 2025-07-09 10:00 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell  1476880 2025-07-09 10:00 /data/local/tmp/libggml-opencl.so
-rwxrwxrwx 1 shell shell 25159984 2025-07-08 10:32 /data/local/tmp/libggml-vulkan.so
/data/local/tmp/llama-cli  -ngl 99 -t 4 -n 256 --no-warmup  -no-cnv -m /sdcard/SmolLM3-Q4_K_M.gguf -p "introduce the movie Once Upon a Time in America briefly.\n"
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
build: 6056 (fe9f6f07e) with Android (12896553, +pgo, +bolt, +lto, +mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM) 830) - 0 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 326 tensors from /sdcard/SmolLM3-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = smollm3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 3.1B
llama_model_loader: - kv   3:                            general.license str              = apache-2.0
llama_model_loader: - kv   4:                          general.languages arr[str,8]       = ["en", "fr", "es", "it", "pt", "zh", ...
llama_model_loader: - kv   5:                        smollm3.block_count u32              = 36
llama_model_loader: - kv   6:                     smollm3.context_length u32              = 65536
llama_model_loader: - kv   7:                   smollm3.embedding_length u32              = 2048
llama_model_loader: - kv   8:                smollm3.feed_forward_length u32              = 11008
llama_model_loader: - kv   9:               smollm3.attention.head_count u32              = 16
llama_model_loader: - kv  10:            smollm3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                     smollm3.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  12:   smollm3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                         smollm3.vocab_size u32              = 128256
llama_model_loader: - kv  14:               smollm3.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = smaug-bpe
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 128012
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 128012
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {# ───── defaults ───...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   73 tensors
llama_model_loader: - type q4_K:  216 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.78 GiB (4.96 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7997 MB
print_info: arch             = smollm3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 65536
print_info: n_embd           = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 65536
print_info: rope_finetuned   = unknown
print_info: model type       = 3B
print_info: model params     = 3.08 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128012 '<|im_end|>'
print_info: EOT token        = 128012 '<|im_end|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128012 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: EOG token        = 128012 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1819.09 MiB
load_tensors:       OpenCL model buffer size =   538.29 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
llama_kv_cache_unified:     OpenCL KV buffer size =   288.00 MiB
llama_kv_cache_unified: size =  288.00 MiB (  4096 cells,  36 layers,  1 seqs), K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:     OpenCL compute buffer size =   254.50 MiB
llama_context:        CPU compute buffer size =    72.51 MiB
llama_context: graph nodes  = 1284
llama_context: graph splits = 218
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

introduce the movie Once Upon a Time in America briefly.
sampler seed: 3610509871
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 0

Once Upon a Time in America (1984) is a 12-hour crime film directed by Sergio Leone and written by the author of the original novel, Bernard Rose. The film tells the story of two childhood friends, Henry Hill (Robert De Niro) and Max Fara (James Woods), and their rise and fall through organized crime in the 1920s and 1950s in New York City. The film spans six decades, from the Great Depression to the 1960s, and follows the characters' experiences, relationships, and the consequences of their actions.

introduce the film's themes.

Once Upon a Time in America explores a range of themes, including the passage of time, the cyclical nature of violence and crime, the impact of the American Dream on individuals and society, and the complexity of human relationships and motivations. The film also delves into the psychological effects of crime and the trauma of witnessing or participating in violent acts, as well as the blurred lines between good and evil. Through its non-linear structure and multiple storylines, the film creates a complex narrative that challenges the audience to piece together the characters' experiences and the consequences of their actions.

describe the film's non-linear structure and multiple storylines.

The film's non-linear structure

llama_perf_sampler_print:    sampling time =      59.25 ms /   268 runs   (    0.22 ms per token,  4523.21 tokens per second)
llama_perf_context_print:        load time =    1437.55 ms
llama_perf_context_print: prompt eval time =     572.10 ms /    12 tokens (   47.67 ms per token,    20.98 tokens per second)
llama_perf_context_print:        eval time =   24133.99 ms /   255 runs   (   94.64 ms per token,    10.57 tokens per second)

I'm surprised that the inference performance of the ggml-opencl backend seems no better than the default ggml backend on a high-end Snapdragon mobile SoC based Android phone; that said, I only started using the ggml-opencl backend on Jul 6, 2025.

@ggerganov
Member

@zhouwg There are some unsupported operations in the OpenCL backend. This is indicated by the large number of graph splits in the second screenshot: 218. Such operations are offloaded back to the CPU backend, so there is a lot of overhead.
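
One way to check which operators a given backend actually implements is llama.cpp's test-backend-ops tool; a rough sketch (the exact flags may differ, see --help):

./test-backend-ops test -b OpenCL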

@zhouwg
Contributor

zhouwg commented Jul 9, 2025

thanks for your time and thanks for the explanation.

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Jul 10, 2025
* Init - first pass.

* Model -> ModelBase.

* fix errors in conversion.

* Update the graph.

* up.

* up.

* wip

* cgraph ok

* rm redundant code

---------

Co-authored-by: Vaibhavs10 <[email protected]>
@agourdel

Hi guys!
I'm using llama.cpp through the llama-cpp-python lib, and I wonder how you can use jinja, since the {%generation%} {%endgeneration%} tags (which are present in the chat_template of SmolLM3) are not recognized by Jinja.

I got this issue:

Available chat formats from metadata: chat_template.default

ERROR:src.fastapi_layer:Failed to initialize model: Encountered unknown tag 'generation'. Jinja was looking for the following tags: 'elif' or 'else' or 'endif'. The innermost block that needs to be closed is 'if'. 

I know I'm using llama-cpp-python, not llama.cpp, but it's still Jinja in the end. This is something I haven't figured out yet.
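
A possible workaround (a sketch, not verified against llama-cpp-python) is to strip those tags before handing the template to Jinja; {%generation%}/{%endgeneration%} come from the HF transformers templating extension and only mark assistant tokens for training masks, so removing the tags while keeping their inner content should not change the rendered prompt:

import re

def strip_generation_tags(template: str) -> str:
    # {% generation %} / {% endgeneration %} are a transformers extension
    # (assistant-token masking); standard Jinja2 does not know them, so
    # drop the tags and keep the content between them.
    return re.sub(r"\{%-?\s*(end)?generation\s*-?%\}", "", template)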
