
Add vision/image support for local inference models#8442

Merged
jh-block merged 20 commits into main from jhugo/local-inference-multimodal
Apr 13, 2026

Conversation

@jh-block jh-block commented Apr 9, 2026

Summary

Enable image input for vision-capable local models (gemma-4) via llama.cpp's multimodal (mtmd) subsystem. Previously, attaching an image to a local inference chat caused an FFI crash.

What this does

  • Crash fix: Strip image_url content parts before template application so non-vision models get a clear message instead of a SIGABRT
  • Vision pipeline: Enable the mtmd cargo feature in llama-cpp-2, extract images from messages, replace them with media markers, tokenize via mtmd, and prefill using eval_chunks for interleaved text+image evaluation
  • mmproj management: Featured vision models (gemma-4) have their vision encoder (mmproj GGUF) auto-downloaded alongside the text model. Existing models that predate this change get backfilled on the next settings page visit
  • UI indicators: Vision-capable models show a badge in Settings > Local Inference with download status for the vision encoder, with live polling during download
  • Lazy initialization: If the vision encoder finishes downloading after the model is already loaded, the mtmd context is initialized on the next image request rather than requiring a reload
  • Token budgeting: Image token costs are estimated for pre-tokenization budget checks and precisely measured after mtmd tokenization, with context overflow errors when images push past the limit

Risks and future work

  • mmproj availability: Only featured models (gemma-4 from unsloth repos) have known mmproj files. Community-quantized repos may not publish them. Non-featured models currently cannot use vision.
  • Memory pressure: The vision encoder consumes ~100-300MB of GPU/CPU memory. Memory accounting uses an approximate deduction — on constrained systems this may over-allocate context.
  • Image resolution: High-resolution images can consume many tokens. llama.cpp's vision encoder handles resizing internally but behavior may vary across model architectures. A max-resolution cap could be added if needed.
  • Concurrency: mtmd encode/eval are not thread-safe. The existing Mutex<Option<LoadedModel>> serializes access, but this means only one image can be processed at a time.
  • Audio: The mtmd pipeline also supports audio input but that is explicitly out of scope here.
  • Build time: Enabling mtmd compiles additional C++ (stb_image, clip, miniaudio), increasing build times.

Screenshot

With unsloth/gemma-4-26B-A4B-it-GGUF:Q8_K_XL:

(screenshot)

jh-block added 13 commits April 9, 2026 16:57
Strip image_url content parts from OpenAI-format messages before passing
them to llama.cpp's chat template application, which only accepts 'text'
and 'media_marker' types and crashes with ffi error -3 on anything else.

Each stripped image is replaced with a placeholder text note so the model
(and user) can see that an image was attached but not processed. A
tracing::warn! is emitted when images are stripped.

Also handle MessageContent::Image in the emulator path's
extract_text_content(), which previously silently dropped image content.

Signed-off-by: jh-block <[email protected]>
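
The strip-before-template step above can be sketched as follows. This is a minimal std-only illustration: `ContentPart` and `strip_image_parts` are simplified stand-ins, not the actual goose message types or OpenAI-format JSON handling.

```rust
// Sketch only: ContentPart is a simplified stand-in for the real message types.
#[derive(Debug, Clone, PartialEq)]
enum ContentPart {
    Text(String),
    ImageUrl(String),
}

/// Replace image parts with placeholder text before chat-template
/// application, which accepts only text-like parts and would otherwise
/// abort inside llama.cpp (ffi error -3).
fn strip_image_parts(parts: Vec<ContentPart>) -> (Vec<ContentPart>, usize) {
    let mut stripped = 0;
    let out = parts
        .into_iter()
        .map(|p| match p {
            ContentPart::ImageUrl(_) => {
                stripped += 1;
                ContentPart::Text("[Image attached but not processed]".into())
            }
            other => other,
        })
        .collect();
    (out, stripped)
}

fn main() {
    let parts = vec![
        ContentPart::Text("What's in this picture?".into()),
        ContentPart::ImageUrl("data:image/png;base64,AAAA".into()),
    ];
    let (cleaned, n) = strip_image_parts(parts);
    assert_eq!(n, 1);
    assert!(matches!(cleaned[1], ContentPart::Text(_)));
    // In the real code a tracing::warn! fires here when n > 0.
    println!("stripped {n} image part(s)");
}
```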
Phase 1 of local inference vision support:

- Enable the 'mtmd' cargo feature on llama-cpp-2 (macOS Metal + other platforms)
- Add MmprojSpec to FeaturedModel for vision-capable models (gemma-4 E4B/26B)
- Add mmproj_path, mmproj_source_url, mmproj_size_bytes to LocalModelEntry
- Add vision_capable field to ModelSettings, derived from featured model table
- Add has_vision() and mmproj_download_status() helpers on LocalModelEntry
- Add featured_mmproj_spec() lookup for featured model mmproj metadata
- Refactor resolve_model_path to return ResolvedModelPaths struct (includes mmproj_path)
- Download mmproj GGUF alongside text model in download_hf_model endpoint
- Expose vision_capable and mmproj_status in LocalModelResponse API
- Clean up mmproj files on model deletion
- All new registry fields use serde defaults for backward compatibility

Signed-off-by: jh-block <[email protected]>
Add multimodal.rs module with two image extraction functions for the
local inference vision pipeline:

- extract_images_from_messages_json: For the native tool-calling path
  (Jinja templates). Walks OpenAI-format messages JSON, decodes base64
  image_url parts into raw bytes, and replaces them with media marker
  text parts that llama.cpp's mtmd tokenizer expects.

- extract_images_from_messages: For the emulated tools path. Scans
  Message structs for ImageContent entries, extracts decoded bytes,
  and substitutes text marker placeholders.

Both functions reject remote HTTP(S) image URLs since local inference
cannot fetch them. Includes comprehensive unit tests covering marker
replacement, text preservation, multiple images, no-image passthrough,
HTTP URL rejection, and mixed text+image content across messages.

Signed-off-by: jh-block <[email protected]>
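
The shape of the extraction step can be sketched like this; it is an illustrative std-only version, with `ContentPart` standing in for the real message structures and the base64 decode elided (the payload is returned as a string rather than decoded to raw bytes).

```rust
// Sketch of the extraction step; the real functions operate on OpenAI-format
// JSON / goose Message structs, simplified here to a plain enum.
#[derive(Debug, Clone, PartialEq)]
enum ContentPart {
    Text(String),
    ImageUrl(String),
}

// The marker text llama.cpp's mtmd tokenizer replaces with image embeddings.
const MEDIA_MARKER: &str = "<__media__>";

/// Pull base64 payloads out of data: URLs, substituting the media marker
/// so the tokenizer knows where each image belongs. Remote HTTP(S) URLs
/// are rejected because local inference cannot fetch them.
fn extract_images(parts: Vec<ContentPart>) -> Result<(Vec<ContentPart>, Vec<String>), String> {
    let mut images = Vec::new();
    let mut out = Vec::new();
    for part in parts {
        match part {
            ContentPart::ImageUrl(url)
                if url.starts_with("http://") || url.starts_with("https://") =>
            {
                return Err(format!("remote image URLs are not supported: {url}"));
            }
            ContentPart::ImageUrl(url) => {
                // The real code base64-decodes this payload into raw bytes.
                let (_, payload) = url
                    .split_once("base64,")
                    .ok_or_else(|| "malformed data URL".to_string())?;
                images.push(payload.to_string());
                out.push(ContentPart::Text(MEDIA_MARKER.into()));
            }
            text => out.push(text),
        }
    }
    Ok((out, images))
}

fn main() {
    let parts = vec![
        ContentPart::Text("Describe this:".into()),
        ContentPart::ImageUrl("data:image/png;base64,iVBORw0K".into()),
    ];
    let (with_markers, images) = extract_images(parts).unwrap();
    assert_eq!(with_markers[1], ContentPart::Text(MEDIA_MARKER.into()));
    assert_eq!(images, vec!["iVBORw0K".to_string()]);
    assert!(extract_images(vec![ContentPart::ImageUrl("https://x/y.png".into())]).is_err());
    println!("extracted {} image(s)", images.len());
}
```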
- Extend LoadedModel to hold an optional MtmdContext for vision models
- Initialize MtmdContext from the mmproj GGUF during model load when
  the mmproj file is available on disk
- Add mmproj_size_bytes to ModelSettings and subtract the vision encoder
  memory footprint from available memory when estimating max context
- Thread mmproj_size_bytes from the registry entry through resolve_model_path
- Update estimate_max_context_for_memory and validate_and_compute_context
  callers to account for mmproj overhead

Signed-off-by: jh-block <[email protected]>
Implement Phase 4 of the local inference vision plan: when images are
present in the conversation, use llama.cpp's mtmd pipeline to tokenize
and prefill the context with interleaved text tokens and image
embeddings instead of the text-only path.

Key changes:

- inference_engine.rs: Add create_and_prefill_multimodal() which creates
  MtmdBitmaps from extracted image bytes, tokenizes via mtmd (replacing
  <__media__> markers with image embeddings), validates context limits,
  and evals all chunks in one pass. Add images field to GenerationContext.

- inference_native_tools.rs: Branch before tokenization — when images
  are present, use the multimodal prefill path; otherwise use the
  existing text-only str_to_token + create_and_prefill_context path.

- inference_emulated_tools.rs: Same multimodal prefill branch for the
  tool emulation path.

- local_inference.rs: In stream(), extract images from messages before
  building chat templates. Vision-capable models (with mmproj downloaded)
  get images replaced with mtmd media markers; non-vision models retain
  existing strip/placeholder behavior. Extracted images are passed
  through GenerationContext to the generation functions.

- multimodal.rs: Remove blanket dead_code allow now that
  extract_images_from_messages is actively used.

Signed-off-by: jh-block <[email protected]>
… overflow handling

Phase 5 of the local inference vision plan:

- Add image_token_estimate field to ModelSettings (default 256 tokens per image)
  for pre-tokenization budget planning before exact mtmd token counts are known.
- Account for estimated image tokens when deciding between full and compact tool
  schemas in the native tools path, preventing context overflow from image-heavy
  conversations using the full verbose schema.
- Context overflow after mtmd tokenization already returns ContextLengthExceeded
  from create_and_prefill_multimodal (implemented in Phase 4).

Signed-off-by: jh-block <[email protected]>
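
The schema decision described above amounts to a budget check before exact mtmd token counts exist. A minimal sketch, with illustrative numbers and a hypothetical `fits_full_schema` helper (the real budget comes from ModelSettings and the tokenized prompt):

```rust
// Illustrative default: image_token_estimate is 256 tokens per image.
const IMAGE_TOKEN_ESTIMATE: u32 = 256;

/// Pick the full tool schema only when the text prompt, the estimated
/// image cost, the schema, and headroom for generation all fit in the
/// context window; otherwise fall back to the compact schema.
fn fits_full_schema(
    n_ctx: u32,
    text_prompt_tokens: u32,
    n_images: u32,
    full_schema_tokens: u32,
    reserved_for_output: u32,
) -> bool {
    let estimated = text_prompt_tokens
        + n_images * IMAGE_TOKEN_ESTIMATE
        + full_schema_tokens
        + reserved_for_output;
    estimated <= n_ctx
}

fn main() {
    // Text alone fits the full schema...
    assert!(fits_full_schema(3500, 1000, 0, 1500, 512));
    // ...but three attached images (3 * 256 estimated tokens) push past
    // the window, so the compact schema is used instead.
    assert!(!fits_full_schema(3500, 1000, 3, 1500, 512));
}
```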
- Add 6 unit tests for extract_images_from_messages (Message-based API):
  replaces image with marker, preserves text, multiple images, no images,
  invalid base64 handling, and mixed content across messages
- Add integration tests for vision:
  - test_local_inference_vision_produces_output: sends an image to a
    vision-capable model and verifies it produces a text response
  - test_local_inference_vision_text_only_model_graceful: verifies that
    sending an image to a text-only model doesn't crash (graceful error
    or placeholder text)
  - Both gated behind TEST_VISION_MODEL env var and #[ignore]
- Add Phase 3B (vision testing) to goose-self-test.yaml with smoke test
  and error handling validation steps

Signed-off-by: jh-block <[email protected]>
- Regenerate OpenAPI spec to expose vision_capable and mmproj_status
  fields on LocalModelResponse
- Add VisionBadge component showing vision encoder status:
  - Green 'Vision' badge when encoder is downloaded
  - Yellow 'Vision encoder downloading...' during download
  - Muted 'Vision' badge for vision-capable models without encoder
- Display badge on both downloaded and featured model cards
- Add i18n messages for vision-related labels

Signed-off-by: jh-block <[email protected]>
Models downloaded before vision support was added have no mmproj_path
or vision_capable fields in the registry. Fix two issues:

- sync_with_featured now updates existing entries with missing mmproj
  fields and vision_capable settings, not just adding new entries
- ensure_featured_models_in_registry detects downloaded models that
  need a vision encoder and auto-starts the mmproj download

This means users who already had gemma-4 downloaded will automatically
get the vision encoder on the next settings page visit.

Signed-off-by: jh-block <[email protected]>
…lder text

Two fixes:

1. The backfill loop in ensure_featured_models_in_registry only iterated
   FEATURED_MODELS entries (specific quants like Q4_K_M), missing other
   quantizations of the same repo (e.g. Q8_0). Add a second pass that
   scans all registry models and backfills mmproj data for any model
   whose repo has a known mmproj spec. This also auto-triggers the
   mmproj download for models that are already downloaded.

2. Update image placeholder text from 'not yet supported with local
   models' to 'not supported with the currently selected model' since
   vision support now exists for some local models.

Signed-off-by: jh-block <[email protected]>
The VisionBadge showed 'downloading' but never updated because the UI
only fetched the model list once on mount. Add a useEffect that polls
listLocalModels every 2 seconds while any model has an mmproj download
in progress, stopping once all downloads complete.

Also show download percentage in the badge when available.

Signed-off-by: jh-block <[email protected]>
The mtmd context was only initialized during model load. If the mmproj
file finished downloading after the model was already loaded, the
vision encoder was never initialized and image requests would fail
with 'This model does not have vision support'.

Now, at inference time, if images are present but mtmd_ctx is None,
attempt to initialize it from the mmproj file. This handles the case
where the model was loaded before or during the vision encoder
download.

Signed-off-by: jh-block <[email protected]>
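
The lazy-initialization pattern looks roughly like the following. `MtmdContext`, `load_mmproj`, and the `LoadedModel` fields here are hypothetical stand-ins for the llama-cpp-2 types; only the init-on-first-image-request shape is the point.

```rust
use std::path::{Path, PathBuf};

// Hypothetical stand-ins for the real llama-cpp-2 vision types.
struct MtmdContext;

fn load_mmproj(_mmproj: &Path) -> Result<MtmdContext, String> {
    Ok(MtmdContext)
}

struct LoadedModel {
    mtmd_ctx: Option<MtmdContext>,
    mmproj_path: Option<PathBuf>,
}

impl LoadedModel {
    /// At inference time, initialize the vision encoder on demand if the
    /// mmproj file appeared after the model was already loaded.
    fn ensure_mtmd(&mut self) -> Result<&MtmdContext, String> {
        if self.mtmd_ctx.is_none() {
            let path = self
                .mmproj_path
                .clone()
                .filter(|p| p.exists())
                .ok_or_else(|| "This model does not have vision support".to_string())?;
            self.mtmd_ctx = Some(load_mmproj(&path)?);
        }
        Ok(self.mtmd_ctx.as_ref().unwrap())
    }
}

fn main() {
    // temp_dir() always exists, standing in for a downloaded mmproj file.
    let mut model = LoadedModel {
        mtmd_ctx: None,
        mmproj_path: Some(std::env::temp_dir()),
    };
    assert!(model.ensure_mtmd().is_ok());

    let mut no_vision = LoadedModel { mtmd_ctx: None, mmproj_path: None };
    assert!(no_vision.ensure_mtmd().is_err());
}
```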
eval_chunks was called with logits_last=false, meaning no logits were
computed for the final prompt position. The subsequent sampler.sample()
call then hit a ggml assertion failure (SIGABRT) trying to read
nonexistent logits.

Set logits_last=true so the sampler has valid logits to work with
after multimodal prefill.

Signed-off-by: jh-block <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6486c21184

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

jh-block added 3 commits April 9, 2026 17:15
Address two code review findings:

1. Don't delete the mmproj file when deleting a model if another
   downloaded model shares the same vision encoder file (e.g. two
   quantizations of the same repo).

2. Populate mmproj_size_bytes from the actual file on disk during
   the backfill pass. This was hardcoded to 0, which defeated the
   memory budgeting that subtracts vision encoder overhead from
   available context.

Signed-off-by: jh-block <[email protected]>
The mmproj filename 'mmproj-BF16.gguf' is generic and shared between
different model architectures (gemma-4-E4B and gemma-4-26B). Store
mmproj files under models/<repo-name>/ instead of flat in models/.

Includes migration: on first load, moves existing mmproj files from
the old flat path to the new namespaced path, and updates registry
entries that pointed to the old location.

Signed-off-by: jh-block <[email protected]>
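
The namespacing rule can be sketched with a small path helper. `mmproj_dest` is a hypothetical name for illustration; the point is deriving the subdirectory from the repo id so identically named mmproj files don't collide.

```rust
use std::path::{Path, PathBuf};

/// mmproj filenames like "mmproj-BF16.gguf" collide across architectures,
/// so store them under models/<repo-name>/ instead of flat in models/.
fn mmproj_dest(models_dir: &Path, repo_id: &str, filename: &str) -> PathBuf {
    // "unsloth/gemma-4-26B-A4B-it-GGUF" -> "gemma-4-26B-A4B-it-GGUF"
    let repo_name = repo_id.rsplit('/').next().unwrap_or(repo_id);
    models_dir.join(repo_name).join(filename)
}

fn main() {
    let dest = mmproj_dest(
        Path::new("models"),
        "unsloth/gemma-4-26B-A4B-it-GGUF",
        "mmproj-BF16.gguf",
    );
    assert_eq!(
        dest,
        Path::new("models/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf")
    );
}
```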

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24171768d6


…ate mmproj downloads

Two fixes from code review:

1. Set vision_capable unconditionally when a featured mmproj spec
   matches, not only inside the path-mismatch branch. Previously
   models with correct mmproj_path but vision_capable=false would
   not be corrected.

2. Deduplicate mmproj downloads by destination path. Multiple
   quantizations of the same repo share one mmproj file, so the
   backfill loop could queue concurrent downloads to the same path
   with different download IDs.

Signed-off-by: jh-block <[email protected]>
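
The deduplication in fix 2 is a keep-first-by-destination filter. A minimal sketch, with `(String, PathBuf)` standing in for the real (download ID, destination) pairs:

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Several quantizations of one repo share a single mmproj file; queue
/// each destination path at most once so concurrent downloads to the
/// same file can't be started under different download IDs.
fn dedup_by_dest(requests: Vec<(String, PathBuf)>) -> Vec<(String, PathBuf)> {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    requests
        .into_iter()
        .filter(|(_, dest)| seen.insert(dest.clone()))
        .collect()
}

fn main() {
    let shared = PathBuf::from("models/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf");
    let requests = vec![
        ("dl-1".to_string(), shared.clone()),
        ("dl-2".to_string(), shared.clone()), // same file, second quantization
    ];
    let queued = dedup_by_dest(requests);
    assert_eq!(queued.len(), 1);
    assert_eq!(queued[0].0, "dl-1");
}
```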

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8116b76c2


Two fixes:

1. Auto-download now retries when a previous attempt failed or was
   cancelled, instead of only starting when no progress record
   exists. Only an actively downloading state blocks a new attempt.

2. Simplified mmproj_download_status: any non-Downloading state from
   the download manager maps to NotDownloaded, since file existence
   is already checked first. This prevents a stale Completed record
   from reporting Downloaded when the file is missing.

Signed-off-by: jh-block <[email protected]>
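
The simplified status mapping in fix 2 can be sketched as a two-step check; the enums here are illustrative stand-ins for the registry and download-manager types.

```rust
// Simplified stand-ins for the registry/download-manager types.
#[derive(Debug, PartialEq)]
enum MmprojStatus {
    Downloaded,
    Downloading(u8),
    NotDownloaded,
}

#[allow(dead_code)]
enum ManagerState {
    Downloading(u8),
    Completed,
    Failed,
}

/// File existence is checked first and wins; after that, only an active
/// download is surfaced, so a stale Completed record can't report
/// Downloaded while the file is actually missing.
fn mmproj_download_status(file_exists: bool, state: Option<ManagerState>) -> MmprojStatus {
    if file_exists {
        return MmprojStatus::Downloaded;
    }
    match state {
        Some(ManagerState::Downloading(pct)) => MmprojStatus::Downloading(pct),
        _ => MmprojStatus::NotDownloaded,
    }
}

fn main() {
    // Stale Completed record + missing file -> NotDownloaded (retry allowed).
    assert_eq!(
        mmproj_download_status(false, Some(ManagerState::Completed)),
        MmprojStatus::NotDownloaded
    );
    assert_eq!(
        mmproj_download_status(false, Some(ManagerState::Downloading(40))),
        MmprojStatus::Downloading(40)
    );
    assert_eq!(mmproj_download_status(true, None), MmprojStatus::Downloaded);
}
```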

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20e4605c83


…proj

Consolidate mmproj metadata population into LocalModelEntry::enrich_with_featured_mmproj(),
called from add_model() and sync_with_featured(). This ensures CLI and server download paths
both get vision encoder metadata without duplicating the logic.

Also fix estimate_max_context_for_memory: only return None (meaning 'no cap') when raw
available memory is unknown, not when mmproj overhead exhausts it. Previously, a large
mmproj could saturating_sub to 0 which returned None, disabling the memory cap entirely.

Signed-off-by: jh-block <[email protected]>
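
The estimate fix hinges on distinguishing "memory unknown" from "memory exhausted". A minimal sketch under assumed units (bytes in, tokens out); `estimate_max_context` is a simplified stand-in for the real function:

```rust
/// Return None only when raw available memory is unknown ("no cap").
/// When the vision encoder's footprint exhausts memory, return a zero
/// cap rather than None, which would disable the cap entirely.
fn estimate_max_context(
    available_bytes: Option<u64>,
    mmproj_size_bytes: u64,
    bytes_per_ctx_token: u64,
) -> Option<u64> {
    let available = available_bytes?; // unknown memory -> no cap
    let after_mmproj = available.saturating_sub(mmproj_size_bytes);
    Some(after_mmproj / bytes_per_ctx_token)
}

fn main() {
    // Unknown memory: no cap at all.
    assert_eq!(estimate_max_context(None, 300, 10), None);
    // mmproj overhead exhausts memory: cap at zero tokens, not "no cap".
    assert_eq!(estimate_max_context(Some(200), 300, 10), Some(0));
    // Normal case: overhead is deducted before dividing.
    assert_eq!(estimate_max_context(Some(1_000), 300, 10), Some(70));
}
```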

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e901489ee5


The cancel handler only stopped the model download, leaving the
vision encoder download running. Now cancels both.

Signed-off-by: jh-block <[email protected]>
@jh-block jh-block requested review from DOsinga and michaelneale April 9, 2026 16:35

@michaelneale michaelneale left a comment


makes sense - mmproj works well in my experience. feels clumsy (not goose side) that it is a different loading experience/api etc under the covers still... like gguf is missing something.

@jh-block jh-block added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit de317d5 Apr 13, 2026
21 checks passed
@jh-block jh-block deleted the jhugo/local-inference-multimodal branch April 13, 2026 08:30
