Hi,
I’m running models (e.g., Gemma 3n) successfully with the current llama.cpp iOS framework, but I’d like to experiment with vision-capable models (e.g., LLaVA, Gemma 3n, or other multimodal variants).
From what I can tell:
• The current iOS framework only exposes text inference functions (no API for image input).
• Even if I convert and load a vision-capable GGUF model, there seems to be no way to pass an image into the pipeline from iOS.
• In the core llama.cpp repo, vision models rely on an additional encoder pipeline (CLIP, SigLIP, etc.), but that part doesn’t seem exposed in the iOS wrapper.
My questions are:
1. Is there any existing work or roadmap for enabling vision models on iOS, including bindings to send images?
2. Would adding support mean extending the iOS framework to wrap the vision encoder API, similar to how text tokens are fed today? (I've added a rough sketch of what I mean after this list.)
3. Is there currently any known workaround for experimenting with image+text inference on iOS?
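To make question 2 concrete, here is a minimal Swift sketch of the kind of surface I'd hope the iOS framework could expose, modelled loosely on the separate multimodal code I see in the upstream repo (tools/mtmd). Everything below is hypothetical: the types, method names, and the mtmd-style steps in the comments are my guesses at how a bridge might be shaped, not an existing API.

```swift
import Foundation

// Hypothetical Swift surface; none of these types exist in the framework today.
// The flow mirrors my reading of the upstream multimodal code: load a text
// model plus a vision projector (mmproj) file, tokenize a prompt that
// interleaves text with an image, then evaluate and sample as usual.

enum MultimodalError: Error {
    case modelLoadFailed
    case imageDecodeFailed
    case evalFailed
}

final class MultimodalContext {
    // ggufURL:   the text model GGUF (e.g. a LLaVA or Gemma variant)
    // mmprojURL: the vision projector GGUF produced alongside it
    init(ggufURL: URL, mmprojURL: URL) throws {
        // Would call into bridged C: load the llama model, then initialize
        // the multimodal context from the mmproj file.
        fatalError("sketch only – needs the C bridge")
    }

    // Tokenize a prompt containing an image placeholder together with raw
    // RGB pixels, evaluate the interleaved chunks, and stream text back.
    func generate(prompt: String,
                  imageRGB: Data,
                  width: Int,
                  height: Int,
                  maxTokens: Int = 256,
                  onToken: (String) -> Void) throws {
        // 1. Wrap the pixels in a bitmap object.
        // 2. Tokenize prompt + bitmap into interleaved text/image chunks.
        // 3. Evaluate the chunks against the llama context, then sample
        //    tokens exactly as the current text-only path does.
        fatalError("sketch only – needs the C bridge")
    }
}

// The call site I'd hope to write from an iOS app:
func demo() throws {
    let ctx = try MultimodalContext(
        ggufURL: URL(fileURLWithPath: "/path/to/model.gguf"),
        mmprojURL: URL(fileURLWithPath: "/path/to/mmproj.gguf"))

    let pixels = Data() // RGB8 pixels from a CGImage, omitted here

    try ctx.generate(prompt: "<image>\nDescribe this picture.",
                     imageRGB: pixels,
                     width: 336,
                     height: 336) { piece in
        print(piece, terminator: "")
    }
}
```

The key point of the sketch is that an image would enter the pipeline as a raw pixel buffer alongside the prompt, mirroring how text tokens are fed today, with the encoder/projector handled inside the wrapped context.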
Thanks a lot — llama.cpp has been an amazing enabler for on-device AI, and adding vision support on iOS would unlock many multimodal use cases.