Hi,
I’m running models (e.g., Gemma 3n) successfully with the current llama.cpp iOS framework, but I’d like to experiment with vision-capable models (e.g., LLaVA, Gemma 3n, or other multimodal variants).
From what I can tell:
• The current iOS framework only exposes text inference functions (no API for image input).
• Even if I convert and load a vision-capable GGUF model, there seems to be no way to pass an image into the pipeline from iOS.
• In the core llama.cpp repo, vision models rely on an additional encoder pipeline (CLIP, SigLIP, etc.), but that part doesn’t seem exposed in the iOS wrapper.
My questions are:
1. Is there any existing work or roadmap for enabling vision models on iOS, including bindings to send images?
2. Would adding support mean extending the iOS framework to wrap the vision encoder API, similar to how text tokens are fed today? (I've added a rough sketch of what I mean after this list.)
3. Is there currently any known workaround for experimenting with image+text inference on iOS?
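To make question 2 concrete, here is a minimal Swift sketch of the kind of surface I'd hope the iOS framework could expose, modelled loosely on the separate multimodal code I see in the upstream repo (tools/mtmd). Everything below is hypothetical: the types, method names, and the mtmd-style steps in the comments are my guesses at how a bridge might be shaped, not an existing API.

```swift
import Foundation

// Hypothetical Swift surface; none of these types exist in the framework today.
// The flow mirrors my reading of the upstream multimodal code: load a text
// model plus a vision projector (mmproj) file, tokenize a prompt that
// interleaves text with an image, then evaluate and sample as usual.

enum MultimodalError: Error {
    case modelLoadFailed
    case imageDecodeFailed
    case evalFailed
}

final class MultimodalContext {
    // ggufURL:   the text model GGUF (e.g. a LLaVA or Gemma variant)
    // mmprojURL: the vision projector GGUF produced alongside it
    init(ggufURL: URL, mmprojURL: URL) throws {
        // Would call into bridged C: load the llama model, then initialize
        // the multimodal context from the mmproj file.
        fatalError("sketch only – needs the C bridge")
    }

    // Tokenize a prompt containing an image placeholder together with raw
    // RGB pixels, evaluate the interleaved chunks, and stream text back.
    func generate(prompt: String,
                  imageRGB: Data,
                  width: Int,
                  height: Int,
                  maxTokens: Int = 256,
                  onToken: (String) -> Void) throws {
        // 1. Wrap the pixels in a bitmap object.
        // 2. Tokenize prompt + bitmap into interleaved text/image chunks.
        // 3. Evaluate the chunks against the llama context, then sample
        //    tokens exactly as the current text-only path does.
        fatalError("sketch only – needs the C bridge")
    }
}

// The call site I'd hope to write from an iOS app:
func demo() throws {
    let ctx = try MultimodalContext(
        ggufURL: URL(fileURLWithPath: "/path/to/model.gguf"),
        mmprojURL: URL(fileURLWithPath: "/path/to/mmproj.gguf"))

    let pixels = Data() // RGB8 pixels from a CGImage, omitted here

    try ctx.generate(prompt: "<image>\nDescribe this picture.",
                     imageRGB: pixels,
                     width: 336,
                     height: 336) { piece in
        print(piece, terminator: "")
    }
}
```

The key point of the sketch is that an image would enter the pipeline as a raw pixel buffer alongside the prompt, mirroring how text tokens are fed today, with the encoder/projector handled inside the wrapped context.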
Thanks a lot — llama.cpp has been an amazing enabler for on-device AI, and adding vision support on iOS would unlock many multimodal use cases.