model : add SmolLM3 #14581
Conversation
thanks a lot for bringing this home! 🤗
The model is up! Try it with:
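A plausible form of the command, assuming the GGUF was published as `ggml-org/SmolLM3-3B-GGUF` on Hugging Face (the repo name is an assumption, not confirmed by this thread):

```sh
# Hypothetical invocation: fetch the GGUF from Hugging Face (-hf) and
# run it with the Jinja chat template enabled; the repo name is assumed.
llama-cli -hf ggml-org/SmolLM3-3B-GGUF --jinja
```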
* origin/master:
  * model : fix hunyuan moe chat template (ggml-org#14584)
  * model : add SmolLM3 (ggml-org#14581)
  * memory : fix broken batch splits for recurrent cache (ggml-org#14575)
  * vulkan : fix rope with partial rotation and non-cont src (ggml-org#14582)
  * server: Add ability to mount server at prefix (ggml-org#14544)
  * model : add hunyuan moe (ggml-org#14425)
  * vulkan: increase timeout for CI (ggml-org#14574)
  * cuda : fix rope with partial rotation and non-cont src (ggml-org#14580)
  * CUDA: add bilinear interpolation for upscale (ggml-org#14563)
  * musa: fix build warnings (unused variable) (ggml-org#14561)
  * llama : fix incorrect minicpm3 v_states shape (ggml-org#14571)
  * llama : remove ggml_cont where possible (ggml-org#14568)
Thanks for SmolLM3 from HF. llama-cli with SmolLM3-Q4_K_M.gguf via the default ggml backend on a Snapdragon 8 Elite based phone:

llama-cli with SmolLM3-Q4_K_M.gguf via the ggml-opencl backend on a Snapdragon 8 Elite based phone:

I'm surprised that the inference performance of the ggml-opencl backend does not seem better than the default ggml backend on a Snapdragon high-end mobile SoC based Android phone, although I only started using the ggml-opencl backend on Jul 6, 2025.
@zhouwg There are some unsupported operations in the OpenCL backend. This is indicated by the large number of graph splits in the second screenshot: 218. Such operations are offloaded back to the CPU backend, so there is a lot of overhead.
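One rough way to quantify that overhead is to benchmark the same quant with and without offload using `llama-bench` (a sketch; the build flag, binary paths, and model filename reflect a typical setup and are assumptions here):

```sh
# Sketch: build with the OpenCL backend, then compare full offload
# (-ngl 99) against a CPU-only run (-ngl 0) on the same model file.
# Paths and the model filename are illustrative assumptions.
cmake -B build -DGGML_OPENCL=ON
cmake --build build --config Release

./build/bin/llama-bench -m SmolLM3-Q4_K_M.gguf -ngl 99   # OpenCL offload
./build/bin/llama-bench -m SmolLM3-Q4_K_M.gguf -ngl 0    # CPU baseline
```

Each graph split forces a synchronization and a round trip between device and host, which is why ops falling back to the CPU cost more than their raw compute time alone.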
Thanks for your time and for the explanation.
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code

---------

Co-authored-by: Vaibhavs10 <[email protected]>
Supersedes #14240
Thanks @Vaibhavs10 for the initial implementation
Note: you need to use `--jinja` to enable thinking mode:
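For example, a minimal sketch (the model path and prompt are placeholders):

```sh
# --jinja applies the model's built-in Jinja chat template, which is what
# emits SmolLM3's thinking tokens; model path and prompt are placeholders.
llama-cli -m SmolLM3-Q4_K_M.gguf --jinja -p "Why is the sky blue?"
```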