
Add vision/image support for local inference models#8442

Merged
jh-block merged 20 commits into main from jhugo/local-inference-multimodal
Apr 13, 2026

Conversation

@jh-block jh-block commented Apr 9, 2026

Summary

Enable image input for vision-capable local models (gemma-4) via llama.cpp's multimodal (mtmd) subsystem. Previously, attaching an image to a local inference chat caused an FFI crash.

What this does

  • Crash fix: Strip image_url content parts before template application so non-vision models get a clear message instead of a SIGABRT
  • Vision pipeline: Enable the mtmd cargo feature in llama-cpp-2, extract images from messages, replace them with media markers, tokenize via mtmd, and prefill using eval_chunks for interleaved text+image evaluation
  • mmproj management: Featured vision models (gemma-4) have their vision encoder (mmproj GGUF) auto-downloaded alongside the text model. Existing models that predate this change get backfilled on the next settings page visit
  • UI indicators: Vision-capable models show a badge in Settings > Local Inference with download status for the vision encoder, with live polling during download
  • Lazy initialization: If the vision encoder finishes downloading after the model is already loaded, the mtmd context is initialized on the next image request rather than requiring a reload
  • Token budgeting: Image token costs are estimated for pre-tokenization budget checks and precisely measured after mtmd tokenization, with context overflow errors when images push past the limit

Risks and future work

  • mmproj availability: Only featured models (gemma-4 from unsloth repos) have known mmproj files. Community-quantized repos may not publish them. Non-featured models currently cannot use vision.
  • Memory pressure: The vision encoder consumes ~100-300MB of GPU/CPU memory. Memory accounting uses an approximate deduction — on constrained systems this may over-allocate context.
  • Image resolution: High-resolution images can consume many tokens. llama.cpp's vision encoder handles resizing internally but behavior may vary across model architectures. A max-resolution cap could be added if needed.
  • Concurrency: mtmd encode/eval are not thread-safe. The existing Mutex<Option<LoadedModel>> serializes access, but this means only one image can be processed at a time.
  • Audio: The mtmd pipeline also supports audio input but that is explicitly out of scope here.
  • Build time: Enabling mtmd compiles additional C++ (stb_image, clip, miniaudio), increasing build times.

Screenshot

With unsloth/gemma-4-26B-A4B-it-GGUF:Q8_K_XL:

(screenshot)

jh-block added 13 commits April 9, 2026 16:57
Strip image_url content parts from OpenAI-format messages before passing
them to llama.cpp's chat template application, which only accepts 'text'
and 'media_marker' types and crashes with ffi error -3 on anything else.

Each stripped image is replaced with a placeholder text note so the model
(and user) can see that an image was attached but not processed. A
tracing::warn! is emitted when images are stripped.

Also handle MessageContent::Image in the emulator path's
extract_text_content(), which previously silently dropped image content.

Signed-off-by: jh-block <[email protected]>
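
The strip-before-template step above can be sketched as follows. This is a minimal std-only illustration: `ContentPart` and `strip_image_parts` are simplified stand-ins, not the actual goose message types or OpenAI-format JSON handling.

```rust
// Sketch only: ContentPart is a simplified stand-in for the real message types.
#[derive(Debug, Clone, PartialEq)]
enum ContentPart {
    Text(String),
    ImageUrl(String),
}

/// Replace image parts with placeholder text before chat-template
/// application, which accepts only text-like parts and would otherwise
/// abort inside llama.cpp (ffi error -3).
fn strip_image_parts(parts: Vec<ContentPart>) -> (Vec<ContentPart>, usize) {
    let mut stripped = 0;
    let out = parts
        .into_iter()
        .map(|p| match p {
            ContentPart::ImageUrl(_) => {
                stripped += 1;
                ContentPart::Text("[Image attached but not processed]".into())
            }
            other => other,
        })
        .collect();
    (out, stripped)
}

fn main() {
    let parts = vec![
        ContentPart::Text("What's in this picture?".into()),
        ContentPart::ImageUrl("data:image/png;base64,AAAA".into()),
    ];
    let (cleaned, n) = strip_image_parts(parts);
    assert_eq!(n, 1);
    assert!(matches!(cleaned[1], ContentPart::Text(_)));
    // In the real code a tracing::warn! fires here when n > 0.
    println!("stripped {n} image part(s)");
}
```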
Phase 1 of local inference vision support:

- Enable the 'mtmd' cargo feature on llama-cpp-2 (macOS Metal + other platforms)
- Add MmprojSpec to FeaturedModel for vision-capable models (gemma-4 E4B/26B)
- Add mmproj_path, mmproj_source_url, mmproj_size_bytes to LocalModelEntry
- Add vision_capable field to ModelSettings, derived from featured model table
- Add has_vision() and mmproj_download_status() helpers on LocalModelEntry
- Add featured_mmproj_spec() lookup for featured model mmproj metadata
- Refactor resolve_model_path to return ResolvedModelPaths struct (includes mmproj_path)
- Download mmproj GGUF alongside text model in download_hf_model endpoint
- Expose vision_capable and mmproj_status in LocalModelResponse API
- Clean up mmproj files on model deletion
- All new registry fields use serde defaults for backward compatibility

Signed-off-by: jh-block <[email protected]>
Add multimodal.rs module with two image extraction functions for the
local inference vision pipeline:

- extract_images_from_messages_json: For the native tool-calling path
  (Jinja templates). Walks OpenAI-format messages JSON, decodes base64
  image_url parts into raw bytes, and replaces them with media marker
  text parts that llama.cpp's mtmd tokenizer expects.

- extract_images_from_messages: For the emulated tools path. Scans
  Message structs for ImageContent entries, extracts decoded bytes,
  and substitutes text marker placeholders.

Both functions reject remote HTTP(S) image URLs since local inference
cannot fetch them. Includes comprehensive unit tests covering marker
replacement, text preservation, multiple images, no-image passthrough,
HTTP URL rejection, and mixed text+image content across messages.

Signed-off-by: jh-block <[email protected]>
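
The shape of the extraction step can be sketched like this; it is an illustrative std-only version, with `ContentPart` standing in for the real message structures and the base64 decode elided (the payload is returned as a string rather than decoded to raw bytes).

```rust
// Sketch of the extraction step; the real functions operate on OpenAI-format
// JSON / goose Message structs, simplified here to a plain enum.
#[derive(Debug, Clone, PartialEq)]
enum ContentPart {
    Text(String),
    ImageUrl(String),
}

// The marker text llama.cpp's mtmd tokenizer replaces with image embeddings.
const MEDIA_MARKER: &str = "<__media__>";

/// Pull base64 payloads out of data: URLs, substituting the media marker
/// so the tokenizer knows where each image belongs. Remote HTTP(S) URLs
/// are rejected because local inference cannot fetch them.
fn extract_images(parts: Vec<ContentPart>) -> Result<(Vec<ContentPart>, Vec<String>), String> {
    let mut images = Vec::new();
    let mut out = Vec::new();
    for part in parts {
        match part {
            ContentPart::ImageUrl(url)
                if url.starts_with("http://") || url.starts_with("https://") =>
            {
                return Err(format!("remote image URLs are not supported: {url}"));
            }
            ContentPart::ImageUrl(url) => {
                // The real code base64-decodes this payload into raw bytes.
                let (_, payload) = url
                    .split_once("base64,")
                    .ok_or_else(|| "malformed data URL".to_string())?;
                images.push(payload.to_string());
                out.push(ContentPart::Text(MEDIA_MARKER.into()));
            }
            text => out.push(text),
        }
    }
    Ok((out, images))
}

fn main() {
    let parts = vec![
        ContentPart::Text("Describe this:".into()),
        ContentPart::ImageUrl("data:image/png;base64,iVBORw0K".into()),
    ];
    let (with_markers, images) = extract_images(parts).unwrap();
    assert_eq!(with_markers[1], ContentPart::Text(MEDIA_MARKER.into()));
    assert_eq!(images, vec!["iVBORw0K".to_string()]);
    assert!(extract_images(vec![ContentPart::ImageUrl("https://x/y.png".into())]).is_err());
    println!("extracted {} image(s)", images.len());
}
```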
- Extend LoadedModel to hold an optional MtmdContext for vision models
- Initialize MtmdContext from the mmproj GGUF during model load when
  the mmproj file is available on disk
- Add mmproj_size_bytes to ModelSettings and subtract the vision encoder
  memory footprint from available memory when estimating max context
- Thread mmproj_size_bytes from the registry entry through resolve_model_path
- Update estimate_max_context_for_memory and validate_and_compute_context
  callers to account for mmproj overhead

Signed-off-by: jh-block <[email protected]>
Implement Phase 4 of the local inference vision plan: when images are
present in the conversation, use llama.cpp's mtmd pipeline to tokenize
and prefill the context with interleaved text tokens and image
embeddings instead of the text-only path.

Key changes:

- inference_engine.rs: Add create_and_prefill_multimodal() which creates
  MtmdBitmaps from extracted image bytes, tokenizes via mtmd (replacing
  <__media__> markers with image embeddings), validates context limits,
  and evals all chunks in one pass. Add images field to GenerationContext.

- inference_native_tools.rs: Branch before tokenization — when images
  are present, use the multimodal prefill path; otherwise use the
  existing text-only str_to_token + create_and_prefill_context path.

- inference_emulated_tools.rs: Same multimodal prefill branch for the
  tool emulation path.

- local_inference.rs: In stream(), extract images from messages before
  building chat templates. Vision-capable models (with mmproj downloaded)
  get images replaced with mtmd media markers; non-vision models retain
  existing strip/placeholder behavior. Extracted images are passed
  through GenerationContext to the generation functions.

- multimodal.rs: Remove blanket dead_code allow now that
  extract_images_from_messages is actively used.

Signed-off-by: jh-block <[email protected]>
… overflow handling

Phase 5 of the local inference vision plan:

- Add image_token_estimate field to ModelSettings (default 256 tokens per image)
  for pre-tokenization budget planning before exact mtmd token counts are known.
- Account for estimated image tokens when deciding between full and compact tool
  schemas in the native tools path, preventing context overflow from image-heavy
  conversations using the full verbose schema.
- Context overflow after mtmd tokenization already returns ContextLengthExceeded
  from create_and_prefill_multimodal (implemented in Phase 4).

Signed-off-by: jh-block <[email protected]>
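
The schema decision described above amounts to a budget check before exact mtmd token counts exist. A minimal sketch, with illustrative numbers and a hypothetical `fits_full_schema` helper (the real budget comes from ModelSettings and the tokenized prompt):

```rust
// Illustrative default: image_token_estimate is 256 tokens per image.
const IMAGE_TOKEN_ESTIMATE: u32 = 256;

/// Pick the full tool schema only when the text prompt, the estimated
/// image cost, the schema, and headroom for generation all fit in the
/// context window; otherwise fall back to the compact schema.
fn fits_full_schema(
    n_ctx: u32,
    text_prompt_tokens: u32,
    n_images: u32,
    full_schema_tokens: u32,
    reserved_for_output: u32,
) -> bool {
    let estimated = text_prompt_tokens
        + n_images * IMAGE_TOKEN_ESTIMATE
        + full_schema_tokens
        + reserved_for_output;
    estimated <= n_ctx
}

fn main() {
    // Text alone fits the full schema...
    assert!(fits_full_schema(3500, 1000, 0, 1500, 512));
    // ...but three attached images (3 * 256 estimated tokens) push past
    // the window, so the compact schema is used instead.
    assert!(!fits_full_schema(3500, 1000, 3, 1500, 512));
}
```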
- Add 6 unit tests for extract_images_from_messages (Message-based API):
  replaces image with marker, preserves text, multiple images, no images,
  invalid base64 handling, and mixed content across messages
- Add integration tests for vision:
  - test_local_inference_vision_produces_output: sends an image to a
    vision-capable model and verifies it produces a text response
  - test_local_inference_vision_text_only_model_graceful: verifies that
    sending an image to a text-only model doesn't crash (graceful error
    or placeholder text)
  - Both gated behind TEST_VISION_MODEL env var and #[ignore]
- Add Phase 3B (vision testing) to goose-self-test.yaml with smoke test
  and error handling validation steps

Signed-off-by: jh-block <[email protected]>
- Regenerate OpenAPI spec to expose vision_capable and mmproj_status
  fields on LocalModelResponse
- Add VisionBadge component showing vision encoder status:
  - Green 'Vision' badge when encoder is downloaded
  - Yellow 'Vision encoder downloading...' during download
  - Muted 'Vision' badge for vision-capable models without encoder
- Display badge on both downloaded and featured model cards
- Add i18n messages for vision-related labels

Signed-off-by: jh-block <[email protected]>
Models downloaded before vision support was added have no mmproj_path
or vision_capable fields in the registry. Fix two issues:

- sync_with_featured now updates existing entries with missing mmproj
  fields and vision_capable settings, not just adding new entries
- ensure_featured_models_in_registry detects downloaded models that
  need a vision encoder and auto-starts the mmproj download

This means users who already had gemma-4 downloaded will automatically
get the vision encoder on the next settings page visit.

Signed-off-by: jh-block <[email protected]>
…lder text

Two fixes:

1. The backfill loop in ensure_featured_models_in_registry only iterated
   FEATURED_MODELS entries (specific quants like Q4_K_M), missing other
   quantizations of the same repo (e.g. Q8_0). Add a second pass that
   scans all registry models and backfills mmproj data for any model
   whose repo has a known mmproj spec. This also auto-triggers the
   mmproj download for models that are already downloaded.

2. Update image placeholder text from 'not yet supported with local
   models' to 'not supported with the currently selected model' since
   vision support now exists for some local models.

Signed-off-by: jh-block <[email protected]>
The VisionBadge showed 'downloading' but never updated because the UI
only fetched the model list once on mount. Add a useEffect that polls
listLocalModels every 2 seconds while any model has an mmproj download
in progress, stopping once all downloads complete.

Also show download percentage in the badge when available.

Signed-off-by: jh-block <[email protected]>
The mtmd context was only initialized during model load. If the mmproj
file finished downloading after the model was already loaded, the
vision encoder was never initialized and image requests would fail
with 'This model does not have vision support'.

Now, at inference time, if images are present but mtmd_ctx is None,
attempt to initialize it from the mmproj file. This handles the case
where the model was loaded before or during the vision encoder
download.

Signed-off-by: jh-block <[email protected]>
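
The lazy-initialization pattern looks roughly like the following. `MtmdContext`, `load_mmproj`, and the `LoadedModel` fields here are hypothetical stand-ins for the llama-cpp-2 types; only the init-on-first-image-request shape is the point.

```rust
use std::path::{Path, PathBuf};

// Hypothetical stand-ins for the real llama-cpp-2 vision types.
struct MtmdContext;

fn load_mmproj(_mmproj: &Path) -> Result<MtmdContext, String> {
    Ok(MtmdContext)
}

struct LoadedModel {
    mtmd_ctx: Option<MtmdContext>,
    mmproj_path: Option<PathBuf>,
}

impl LoadedModel {
    /// At inference time, initialize the vision encoder on demand if the
    /// mmproj file appeared after the model was already loaded.
    fn ensure_mtmd(&mut self) -> Result<&MtmdContext, String> {
        if self.mtmd_ctx.is_none() {
            let path = self
                .mmproj_path
                .clone()
                .filter(|p| p.exists())
                .ok_or_else(|| "This model does not have vision support".to_string())?;
            self.mtmd_ctx = Some(load_mmproj(&path)?);
        }
        Ok(self.mtmd_ctx.as_ref().unwrap())
    }
}

fn main() {
    // temp_dir() always exists, standing in for a downloaded mmproj file.
    let mut model = LoadedModel {
        mtmd_ctx: None,
        mmproj_path: Some(std::env::temp_dir()),
    };
    assert!(model.ensure_mtmd().is_ok());

    let mut no_vision = LoadedModel { mtmd_ctx: None, mmproj_path: None };
    assert!(no_vision.ensure_mtmd().is_err());
}
```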
eval_chunks was called with logits_last=false, meaning no logits were
computed for the final prompt position. The subsequent sampler.sample()
call then hit a ggml assertion failure (SIGABRT) trying to read
nonexistent logits.

Set logits_last=true so the sampler has valid logits to work with
after multimodal prefill.

Signed-off-by: jh-block <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6486c21184

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

jh-block added 3 commits April 9, 2026 17:15
Address two code review findings:

1. Don't delete the mmproj file when deleting a model if another
   downloaded model shares the same vision encoder file (e.g. two
   quantizations of the same repo).

2. Populate mmproj_size_bytes from the actual file on disk during
   the backfill pass. This was hardcoded to 0, which defeated the
   memory budgeting that subtracts vision encoder overhead from
   available context.

Signed-off-by: jh-block <[email protected]>
The mmproj filename 'mmproj-BF16.gguf' is generic and shared between
different model architectures (gemma-4-E4B and gemma-4-26B). Store
mmproj files under models/<repo-name>/ instead of flat in models/.

Includes migration: on first load, moves existing mmproj files from
the old flat path to the new namespaced path, and updates registry
entries that pointed to the old location.

Signed-off-by: jh-block <[email protected]>
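
The namespacing rule can be sketched with a small path helper. `mmproj_dest` is a hypothetical name for illustration; the point is deriving the subdirectory from the repo id so identically named mmproj files don't collide.

```rust
use std::path::{Path, PathBuf};

/// mmproj filenames like "mmproj-BF16.gguf" collide across architectures,
/// so store them under models/<repo-name>/ instead of flat in models/.
fn mmproj_dest(models_dir: &Path, repo_id: &str, filename: &str) -> PathBuf {
    // "unsloth/gemma-4-26B-A4B-it-GGUF" -> "gemma-4-26B-A4B-it-GGUF"
    let repo_name = repo_id.rsplit('/').next().unwrap_or(repo_id);
    models_dir.join(repo_name).join(filename)
}

fn main() {
    let dest = mmproj_dest(
        Path::new("models"),
        "unsloth/gemma-4-26B-A4B-it-GGUF",
        "mmproj-BF16.gguf",
    );
    assert_eq!(
        dest,
        Path::new("models/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf")
    );
}
```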

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24171768d6


…ate mmproj downloads

Two fixes from code review:

1. Set vision_capable unconditionally when a featured mmproj spec
   matches, not only inside the path-mismatch branch. Previously
   models with correct mmproj_path but vision_capable=false would
   not be corrected.

2. Deduplicate mmproj downloads by destination path. Multiple
   quantizations of the same repo share one mmproj file, so the
   backfill loop could queue concurrent downloads to the same path
   with different download IDs.

Signed-off-by: jh-block <[email protected]>
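
The deduplication in fix 2 is a keep-first-by-destination filter. A minimal sketch, with `(String, PathBuf)` standing in for the real (download ID, destination) pairs:

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Several quantizations of one repo share a single mmproj file; queue
/// each destination path at most once so concurrent downloads to the
/// same file can't be started under different download IDs.
fn dedup_by_dest(requests: Vec<(String, PathBuf)>) -> Vec<(String, PathBuf)> {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    requests
        .into_iter()
        .filter(|(_, dest)| seen.insert(dest.clone()))
        .collect()
}

fn main() {
    let shared = PathBuf::from("models/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf");
    let requests = vec![
        ("dl-1".to_string(), shared.clone()),
        ("dl-2".to_string(), shared.clone()), // same file, second quantization
    ];
    let queued = dedup_by_dest(requests);
    assert_eq!(queued.len(), 1);
    assert_eq!(queued[0].0, "dl-1");
}
```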

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8116b76c2


Two fixes:

1. Auto-download now retries when a previous attempt failed or was
   cancelled, instead of only starting when no progress record
   exists. Only an actively downloading state blocks a new attempt.

2. Simplified mmproj_download_status: any non-Downloading state from
   the download manager maps to NotDownloaded, since file existence
   is already checked first. This prevents a stale Completed record
   from reporting Downloaded when the file is missing.

Signed-off-by: jh-block <[email protected]>
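
The simplified status mapping in fix 2 can be sketched as a two-step check; the enums here are illustrative stand-ins for the registry and download-manager types.

```rust
// Simplified stand-ins for the registry/download-manager types.
#[derive(Debug, PartialEq)]
enum MmprojStatus {
    Downloaded,
    Downloading(u8),
    NotDownloaded,
}

#[allow(dead_code)]
enum ManagerState {
    Downloading(u8),
    Completed,
    Failed,
}

/// File existence is checked first and wins; after that, only an active
/// download is surfaced, so a stale Completed record can't report
/// Downloaded while the file is actually missing.
fn mmproj_download_status(file_exists: bool, state: Option<ManagerState>) -> MmprojStatus {
    if file_exists {
        return MmprojStatus::Downloaded;
    }
    match state {
        Some(ManagerState::Downloading(pct)) => MmprojStatus::Downloading(pct),
        _ => MmprojStatus::NotDownloaded,
    }
}

fn main() {
    // Stale Completed record + missing file -> NotDownloaded (retry allowed).
    assert_eq!(
        mmproj_download_status(false, Some(ManagerState::Completed)),
        MmprojStatus::NotDownloaded
    );
    assert_eq!(
        mmproj_download_status(false, Some(ManagerState::Downloading(40))),
        MmprojStatus::Downloading(40)
    );
    assert_eq!(mmproj_download_status(true, None), MmprojStatus::Downloaded);
}
```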

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20e4605c83


…proj

Consolidate mmproj metadata population into LocalModelEntry::enrich_with_featured_mmproj(),
called from add_model() and sync_with_featured(). This ensures CLI and server download paths
both get vision encoder metadata without duplicating the logic.

Also fix estimate_max_context_for_memory: only return None (meaning 'no cap') when raw
available memory is unknown, not when mmproj overhead exhausts it. Previously, a large
mmproj could saturating_sub to 0 which returned None, disabling the memory cap entirely.

Signed-off-by: jh-block <[email protected]>
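
The estimate fix hinges on distinguishing "memory unknown" from "memory exhausted". A minimal sketch under assumed units (bytes in, tokens out); `estimate_max_context` is a simplified stand-in for the real function:

```rust
/// Return None only when raw available memory is unknown ("no cap").
/// When the vision encoder's footprint exhausts memory, return a zero
/// cap rather than None, which would disable the cap entirely.
fn estimate_max_context(
    available_bytes: Option<u64>,
    mmproj_size_bytes: u64,
    bytes_per_ctx_token: u64,
) -> Option<u64> {
    let available = available_bytes?; // unknown memory -> no cap
    let after_mmproj = available.saturating_sub(mmproj_size_bytes);
    Some(after_mmproj / bytes_per_ctx_token)
}

fn main() {
    // Unknown memory: no cap at all.
    assert_eq!(estimate_max_context(None, 300, 10), None);
    // mmproj overhead exhausts memory: cap at zero tokens, not "no cap".
    assert_eq!(estimate_max_context(Some(200), 300, 10), Some(0));
    // Normal case: overhead is deducted before dividing.
    assert_eq!(estimate_max_context(Some(1_000), 300, 10), Some(70));
}
```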

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e901489ee5


The cancel handler only stopped the model download, leaving the
vision encoder download running. Now cancels both.

Signed-off-by: jh-block <[email protected]>
@jh-block jh-block requested review from DOsinga and michaelneale April 9, 2026 16:35

@michaelneale michaelneale left a comment


makes sense - mmproj works well in my experience. feels clumsy (not goose side) that it is a different loading experience/api etc under the covers still... like gguf is missing something.

@jh-block jh-block added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit de317d5 Apr 13, 2026
21 checks passed
@jh-block jh-block deleted the jhugo/local-inference-multimodal branch April 13, 2026 08:30
