Add vision/image support for local inference models#8442
Conversation
Strip image_url content parts from OpenAI-format messages before passing them to llama.cpp's chat template application, which only accepts 'text' and 'media_marker' types and crashes with ffi error -3 on anything else. Each stripped image is replaced with a placeholder text note so the model (and user) can see that an image was attached but not processed. A tracing::warn! is emitted when images are stripped. Also handle MessageContent::Image in the emulator path's extract_text_content(), which previously silently dropped image content. Signed-off-by: jh-block <[email protected]>
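The strip-and-placeholder idea can be sketched with a minimal stand-in for an OpenAI-style content part (the enum and function names here are illustrative, not goose's actual types):

```rust
// Minimal stand-in for an OpenAI-style content part; the real goose types
// differ -- this only illustrates the strip-and-placeholder behavior.
#[derive(Debug)]
enum ContentPart {
    Text(String),
    ImageUrl(String),
}

/// Replace every image part with a placeholder text note so the chat
/// template only ever sees 'text' parts. Returns how many were stripped
/// so the caller can emit a warning.
fn strip_images(parts: Vec<ContentPart>) -> (Vec<ContentPart>, usize) {
    let mut stripped = 0;
    let parts = parts
        .into_iter()
        .map(|p| match p {
            ContentPart::ImageUrl(_) => {
                stripped += 1;
                ContentPart::Text("[Image attached but not processed]".to_string())
            }
            other => other,
        })
        .collect();
    (parts, stripped)
}
```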
Phase 1 of local inference vision support:
- Enable the 'mtmd' cargo feature on llama-cpp-2 (macOS Metal + other platforms)
- Add MmprojSpec to FeaturedModel for vision-capable models (gemma-4 E4B/26B)
- Add mmproj_path, mmproj_source_url, mmproj_size_bytes to LocalModelEntry
- Add vision_capable field to ModelSettings, derived from the featured model table
- Add has_vision() and mmproj_download_status() helpers on LocalModelEntry
- Add featured_mmproj_spec() lookup for featured model mmproj metadata
- Refactor resolve_model_path to return a ResolvedModelPaths struct (includes mmproj_path)
- Download the mmproj GGUF alongside the text model in the download_hf_model endpoint
- Expose vision_capable and mmproj_status in the LocalModelResponse API
- Clean up mmproj files on model deletion
- All new registry fields use serde defaults for backward compatibility
Signed-off-by: jh-block <[email protected]>
Add multimodal.rs module with two image extraction functions for the local inference vision pipeline:
- extract_images_from_messages_json: For the native tool-calling path (Jinja templates). Walks OpenAI-format messages JSON, decodes base64 image_url parts into raw bytes, and replaces them with the media marker text parts that llama.cpp's mtmd tokenizer expects.
- extract_images_from_messages: For the emulated tools path. Scans Message structs for ImageContent entries, extracts decoded bytes, and substitutes text marker placeholders.
Both functions reject remote HTTP(S) image URLs since local inference cannot fetch them. Includes comprehensive unit tests covering marker replacement, text preservation, multiple images, no-image passthrough, HTTP URL rejection, and mixed text+image content across messages.
Signed-off-by: jh-block <[email protected]>
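The extraction step hinges on telling inline data URLs apart from remote ones. A sketch of that check, with base64 decoding elided and a hypothetical helper name:

```rust
/// Split a data URL into its base64 payload, rejecting remote URLs that
/// local inference cannot fetch. Decoding the payload into raw image bytes
/// is left to a base64 library in the real code.
fn image_payload(url: &str) -> Result<&str, String> {
    if url.starts_with("http://") || url.starts_with("https://") {
        return Err("remote image URLs are not supported by local inference".to_string());
    }
    // Expected shape: data:image/<fmt>;base64,<payload>
    url.strip_prefix("data:")
        .and_then(|rest| rest.split_once(";base64,"))
        .map(|(_mime, payload)| payload)
        .ok_or_else(|| "unrecognized image URL format".to_string())
}
```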
- Extend LoadedModel to hold an optional MtmdContext for vision models
- Initialize the MtmdContext from the mmproj GGUF during model load when the mmproj file is available on disk
- Add mmproj_size_bytes to ModelSettings and subtract the vision encoder memory footprint from available memory when estimating max context
- Thread mmproj_size_bytes from the registry entry through resolve_model_path
- Update estimate_max_context_for_memory and validate_and_compute_context callers to account for mmproj overhead
Signed-off-by: jh-block <[email protected]>
Implement Phase 4 of the local inference vision plan: when images are present in the conversation, use llama.cpp's mtmd pipeline to tokenize and prefill the context with interleaved text tokens and image embeddings instead of the text-only path.
Key changes:
- inference_engine.rs: Add create_and_prefill_multimodal(), which creates MtmdBitmaps from extracted image bytes, tokenizes via mtmd (replacing <__media__> markers with image embeddings), validates context limits, and evals all chunks in one pass. Add an images field to GenerationContext.
- inference_native_tools.rs: Branch before tokenization: when images are present, use the multimodal prefill path; otherwise use the existing text-only str_to_token + create_and_prefill_context path.
- inference_emulated_tools.rs: Same multimodal prefill branch for the tool emulation path.
- local_inference.rs: In stream(), extract images from messages before building chat templates. Vision-capable models (with mmproj downloaded) get images replaced with mtmd media markers; non-vision models retain the existing strip/placeholder behavior. Extracted images are passed through GenerationContext to the generation functions.
- multimodal.rs: Remove the blanket dead_code allow now that extract_images_from_messages is actively used.
Signed-off-by: jh-block <[email protected]>
… overflow handling
Phase 5 of the local inference vision plan:
- Add an image_token_estimate field to ModelSettings (default 256 tokens per image) for pre-tokenization budget planning before exact mtmd token counts are known.
- Account for estimated image tokens when deciding between the full and compact tool schemas in the native tools path, preventing context overflow when image-heavy conversations use the full verbose schema.
- Context overflow after mtmd tokenization already returns ContextLengthExceeded from create_and_prefill_multimodal (implemented in Phase 4).
Signed-off-by: jh-block <[email protected]>
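The budgeting decision described above reduces to simple arithmetic; a sketch with an illustrative constant and hypothetical parameter names:

```rust
/// Per-image token estimate used before exact mtmd counts are known
/// (256 is the default mentioned in the commit; exact counts come later).
const IMAGE_TOKEN_ESTIMATE: u32 = 256;

/// Decide whether the full verbose tool schema still fits once estimated
/// image tokens are accounted for; otherwise fall back to the compact one.
fn use_full_schema(
    context_len: u32,
    text_tokens: u32,
    full_schema_tokens: u32,
    num_images: u32,
) -> bool {
    let estimated = text_tokens + full_schema_tokens + num_images * IMAGE_TOKEN_ESTIMATE;
    estimated <= context_len
}
```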
- Add 6 unit tests for extract_images_from_messages (Message-based API):
replaces image with marker, preserves text, multiple images, no images,
invalid base64 handling, and mixed content across messages
- Add integration tests for vision:
- test_local_inference_vision_produces_output: sends an image to a
vision-capable model and verifies it produces a text response
- test_local_inference_vision_text_only_model_graceful: verifies that
sending an image to a text-only model doesn't crash (graceful error
or placeholder text)
- Both gated behind TEST_VISION_MODEL env var and #[ignore]
- Add Phase 3B (vision testing) to goose-self-test.yaml with smoke test
and error handling validation steps
Signed-off-by: jh-block <[email protected]>
- Regenerate the OpenAPI spec to expose the vision_capable and mmproj_status fields on LocalModelResponse
- Add a VisionBadge component showing vision encoder status:
  - Green 'Vision' badge when the encoder is downloaded
  - Yellow 'Vision encoder downloading...' during download
  - Muted 'Vision' badge for vision-capable models without the encoder
- Display the badge on both downloaded and featured model cards
- Add i18n messages for vision-related labels
Signed-off-by: jh-block <[email protected]>
Models downloaded before vision support was added have no mmproj_path or vision_capable fields in the registry. Fix two issues:
- sync_with_featured now updates existing entries with missing mmproj fields and vision_capable settings, not just adding new entries
- ensure_featured_models_in_registry detects downloaded models that need a vision encoder and auto-starts the mmproj download
This means users who already had gemma-4 downloaded will automatically get the vision encoder on their next settings page visit.
Signed-off-by: jh-block <[email protected]>
…lder text
Two fixes:
1. The backfill loop in ensure_featured_models_in_registry only iterated FEATURED_MODELS entries (specific quants like Q4_K_M), missing other quantizations of the same repo (e.g. Q8_0). Add a second pass that scans all registry models and backfills mmproj data for any model whose repo has a known mmproj spec. This also auto-triggers the mmproj download for models that are already downloaded.
2. Update the image placeholder text from 'not yet supported with local models' to 'not supported with the currently selected model', since vision support now exists for some local models.
Signed-off-by: jh-block <[email protected]>
The VisionBadge showed 'downloading' but never updated because the UI only fetched the model list once on mount. Add a useEffect that polls listLocalModels every 2 seconds while any model has an mmproj download in progress, stopping once all downloads complete. Also show download percentage in the badge when available. Signed-off-by: jh-block <[email protected]>
The mtmd context was only initialized during model load. If the mmproj file finished downloading after the model was already loaded, the vision encoder was never initialized and image requests would fail with 'This model does not have vision support'. Now, at inference time, if images are present but mtmd_ctx is None, attempt to initialize it from the mmproj file. This handles the case where the model was loaded before or during the vision encoder download. Signed-off-by: jh-block <[email protected]>
eval_chunks was called with logits_last=false, meaning no logits were computed for the final prompt position. The subsequent sampler.sample() call then hit a ggml assertion failure (SIGABRT) trying to read nonexistent logits. Set logits_last=true so the sampler has valid logits to work with after multimodal prefill. Signed-off-by: jh-block <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6486c21184
Address two code review findings:
1. Don't delete the mmproj file when deleting a model if another downloaded model shares the same vision encoder file (e.g. two quantizations of the same repo).
2. Populate mmproj_size_bytes from the actual file on disk during the backfill pass. This was hardcoded to 0, which defeated the memory budgeting that subtracts vision encoder overhead from available context.
Signed-off-by: jh-block <[email protected]>
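The shared-encoder check amounts to scanning the remaining registry entries for the same mmproj path before deleting. A sketch, with hypothetical types (the real registry entry has more fields):

```rust
use std::path::PathBuf;

// Hypothetical slimmed-down registry entry for illustration.
struct Entry {
    id: String,
    mmproj_path: Option<PathBuf>,
}

/// Only delete the mmproj file if no other registry entry still points at it,
/// e.g. when two quantizations of the same repo share one vision encoder.
fn can_delete_mmproj(entries: &[Entry], deleting_id: &str, path: &PathBuf) -> bool {
    !entries
        .iter()
        .any(|e| e.id != deleting_id && e.mmproj_path.as_ref() == Some(path))
}
```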
The mmproj filename 'mmproj-BF16.gguf' is generic and shared between different model architectures (gemma-4-E4B and gemma-4-26B). Store mmproj files under models/<repo-name>/ instead of flat in models/.
Includes a migration: on first load, move existing mmproj files from the old flat path to the new namespaced path, and update registry entries that pointed to the old location.
Signed-off-by: jh-block <[email protected]>
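Namespacing by repo is just a matter of inserting the repo directory into the destination path; a minimal sketch (function name is illustrative):

```rust
use std::path::PathBuf;

/// Old layout: models/mmproj-BF16.gguf (collides across repos).
/// New layout: models/<repo-name>/mmproj-BF16.gguf.
fn mmproj_dest(models_dir: &str, repo: &str, filename: &str) -> PathBuf {
    PathBuf::from(models_dir).join(repo).join(filename)
}
```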
Signed-off-by: jh-block <[email protected]>
💡 Codex Review
Reviewed commit: 24171768d6
…ate mmproj downloads
Two fixes from code review:
1. Set vision_capable unconditionally when a featured mmproj spec matches, not only inside the path-mismatch branch. Previously, models with a correct mmproj_path but vision_capable=false were never corrected.
2. Deduplicate mmproj downloads by destination path. Multiple quantizations of the same repo share one mmproj file, so the backfill loop could queue concurrent downloads to the same path under different download IDs.
Signed-off-by: jh-block <[email protected]>
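Deduplicating the queued downloads can be done with a set keyed on destination path; a sketch with hypothetical request tuples of (download ID, destination):

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Drop duplicate download requests targeting the same destination file,
/// keeping the first occurrence, so a shared mmproj file downloads once.
fn dedupe_by_dest(requests: Vec<(String, PathBuf)>) -> Vec<(String, PathBuf)> {
    let mut seen = HashSet::new();
    requests
        .into_iter()
        .filter(|(_id, dest)| seen.insert(dest.clone()))
        .collect()
}
```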
💡 Codex Review
Reviewed commit: c8116b76c2
crates/goose/src/providers/local_inference/local_model_registry.rs (outdated, resolved)
Two fixes:
1. Auto-download now retries when a previous attempt failed or was cancelled, instead of only starting when no progress record exists. Only an actively downloading state blocks a new attempt.
2. Simplified mmproj_download_status: any non-Downloading state from the download manager maps to NotDownloaded, since file existence is already checked first. This prevents a stale Completed record from reporting Downloaded when the file is missing.
Signed-off-by: jh-block <[email protected]>
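The retry rule reduces to a single check on the download manager's state. A sketch with an illustrative state enum (not the actual download manager type):

```rust
/// Illustrative download states; only an in-flight download should block
/// starting a new attempt -- Failed, Cancelled, or no record all allow retry.
#[derive(Clone, Copy, PartialEq)]
enum DownloadState {
    Downloading,
    Completed,
    Failed,
    Cancelled,
}

fn should_start_download(state: Option<DownloadState>, file_exists: bool) -> bool {
    // File existence is checked first: nothing to do if it's already on disk.
    if file_exists {
        return false;
    }
    // A stale Completed record with a missing file must not block a retry.
    state != Some(DownloadState::Downloading)
}
```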
💡 Codex Review
Reviewed commit: 20e4605c83
…proj
Consolidate mmproj metadata population into LocalModelEntry::enrich_with_featured_mmproj(), called from add_model() and sync_with_featured(). This ensures the CLI and server download paths both get vision encoder metadata without duplicating the logic.
Also fix estimate_max_context_for_memory: only return None (meaning 'no cap') when raw available memory is unknown, not when mmproj overhead exhausts it. Previously, a large mmproj could saturating_sub to 0, which returned None and disabled the memory cap entirely.
Signed-off-by: jh-block <[email protected]>
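The estimate_max_context_for_memory fix distinguishes "memory unknown" from "memory exhausted by the encoder". A sketch, with an illustrative bytes-per-token divisor standing in for the real KV-cache math:

```rust
/// Rough max-context estimate: None only when available memory is unknown.
/// When the mmproj overhead eats all available memory, return a hard cap
/// (possibly 0) rather than None, so the memory cap stays in force.
/// bytes_per_token is an illustrative stand-in for the real KV-cache math.
fn estimate_max_context(
    available_bytes: Option<u64>,
    mmproj_bytes: u64,
    bytes_per_token: u64,
) -> Option<u64> {
    let available = available_bytes?; // unknown memory => no cap
    let remaining = available.saturating_sub(mmproj_bytes);
    Some(remaining / bytes_per_token)
}
```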
💡 Codex Review
Reviewed commit: e901489ee5
The cancel handler only stopped the model download, leaving the vision encoder download running. Now cancels both. Signed-off-by: jh-block <[email protected]>
michaelneale left a comment
makes sense - mmproj works well in my experience. It feels clumsy (not on the goose side) that it is still a different loading experience/API under the covers... like GGUF is missing something.
Summary
Enable image input for vision-capable local models (gemma-4) via llama.cpp's multimodal (mtmd) subsystem. Previously, attaching an image to a local inference chat caused an FFI crash.
What this does
- Strip image_url content parts before template application so non-vision models get a clear message instead of a SIGABRT
- Enable the mtmd cargo feature in llama-cpp-2, extract images from messages, replace them with media markers, tokenize via mtmd, and prefill using eval_chunks for interleaved text+image evaluation
Risks and future work
- Mutex<Option<LoadedModel>> serializes access, but this means only one image can be processed at a time.
- mtmd compiles additional C++ (stb_image, clip, miniaudio), increasing build times.
Screenshot
With unsloth/gemma-4-26B-A4B-it-GGUF:Q8_K_XL: