feat(ollama): detect and advertise vision capability with live E2E #1892

Merged: roryford merged 1 commit into main from feat/ollama-vision-capability-e2e on Jun 16, 2026
@roryford

Owner

What

Closes the #1 vision gap: the Ollama backend has always had the image wire path (OllamaImagesField), but it never advertised supportsVision and nothing tested image input. This detects, advertises, wires, and tests it end-to-end.

Changes

  • Detect: OllamaModelProbe reads the per-model vision flag from /api/show's capabilities: ["vision", ...] list (qwen2.5vl / moondream / llava), alongside the existing thinking detection. No template fallback — vision is advertised-only.
  • Advertise: OllamaBackend.isVisionModel (state-lock-guarded, mirrors _isThinkingModel) drives capabilities.supportsVision, routed through a new central BackendVisionCapability.ollamaSupportsImageInput(probedVision:) gate (matches the cloud families' pattern).
  • Wire: OllamaBackend now conforms to StructuredHistoryReceiver and lifts MessagePart.image payloads onto Ollama's message-level images: [base64] field. Snapshot-and-cleared like toolAwareHistory; text-only turns keep the existing plain-string path, preserving every existing wire-shape assertion.
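The Detect step can be illustrated with a minimal sketch. It assumes only what the bullet above states, namely that /api/show returns a capabilities list that may contain "vision". The type and function names (ShowResponse, advertisesVision) and the sample payloads are hypothetical and are not the PR's actual OllamaModelProbe code.

```swift
import Foundation

// Hypothetical, trimmed-down view of an /api/show response. Only the
// `capabilities` list described above is modeled; all other fields are ignored.
struct ShowResponse: Decodable {
    let capabilities: [String]?
}

// Returns true only when the server explicitly lists "vision".
// A missing list or an undecodable payload yields false, which mirrors the
// "advertised-only, no template fallback" rule.
func advertisesVision(showResponseJSON data: Data) -> Bool {
    guard let response = try? JSONDecoder().decode(ShowResponse.self, from: data) else {
        return false
    }
    return response.capabilities?.contains("vision") ?? false
}

// Illustrative payloads (assumed shapes, not captured server output).
let visionPayload = Data(#"{"capabilities": ["completion", "vision"]}"#.utf8)
let textOnlyPayload = Data(#"{"capabilities": ["completion"]}"#.utf8)
let noCapabilitiesPayload = Data(#"{"template": "..."}"#.utf8)
```

Under these assumptions, only the first payload is treated as a vision model; a response without a capabilities list is never upgraded by guessing from the template.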

Tests

  • OllamaManifestProbeTests (+91): deterministic vision detection via mocked /api/show — no live server.
  • BackendVisionCapabilityTests (+11): gate unit coverage.
  • OllamaVisionE2ETests (new): live E2E — generates a solid-color PNG in code, attaches it via setStructuredHistory, asserts grounded output from a real vision model. Cross-platform-gated (CoreGraphics/ImageIO); auto-skips when no vision model is installed.
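One way the auto-skip decision could work is sketched below. This is an assumption for illustration: the helper name firstVisionModel and the name-prefix matching are hypothetical, and the real test may instead rely on the probed capability. The model families are the ones named in this PR.

```swift
// Vision model families named in this PR.
let visionFamilies = ["qwen2.5vl", "moondream", "llava"]

// Hypothetical helper: returns the first installed model whose name starts
// with one of the vision families, or nil to signal "skip the live test".
func firstVisionModel(installed: [String]) -> String? {
    installed.first { name in
        visionFamilies.contains { family in name.hasPrefix(family) }
    }
}
```

In an XCTest case, a nil result would typically be turned into `throw XCTSkip(...)` so the live test reports as skipped rather than failed on machines without a vision model.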

Verification

  • swift build --build-tests — exit 0 (clean).
  • Live swift test --filter OllamaVisionE2ETests against local Ollama — orchestrator to run before marking ready.

Relates to the multimodal token-accounting gap (#1885): images now flow through a real path that the token estimator must account for.

Draft — orchestrator will run live verification + watch CI before merge. One of a 3-PR set (#1890 tool discovery + docs, #1891 cancellation E2E).

Ollama's image wire path (OllamaImagesField) has always existed, but the
backend never advertised it and nothing tested it. Detect the per-model vision
capability from /api/show's capabilities: ["vision", ...] list at load time,
surface it via OllamaBackend.isVisionModel, and route it through a new central
BackendVisionCapability.ollamaSupportsImageInput(probedVision:) gate so
capabilities.supportsVision reflects reality.
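A rough sketch of how a lock-guarded probed flag might feed a central gate follows. The names VisionGateSketch and ProbedVisionState are hypothetical stand-ins, and the pass-through gate body is an assumption; the real BackendVisionCapability and OllamaBackend may apply additional rules.

```swift
import Foundation

// Hypothetical stand-in for the central gate. The pass-through body is an
// assumption; the actual gate may consider more than the probed flag.
enum VisionGateSketch {
    static func ollamaSupportsImageInput(probedVision: Bool) -> Bool {
        probedVision
    }
}

// Hypothetical lock-guarded holder for the per-model flag set at load time.
final class ProbedVisionState {
    private let lock = NSLock()
    private var isVisionModel = false

    // Called after the model probe completes.
    func update(probedVision: Bool) {
        lock.lock(); defer { lock.unlock() }
        isVisionModel = probedVision
    }

    // What a capabilities struct would read for supportsVision.
    var supportsVision: Bool {
        lock.lock(); defer { lock.unlock() }
        return VisionGateSketch.ollamaSupportsImageInput(probedVision: isVisionModel)
    }
}
```

Routing the read through one gate function keeps the decision in a single place, so later rules (for example, a per-model override) would not need changes at every call site.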

Wire a real multimodal request path: OllamaBackend now conforms to
StructuredHistoryReceiver and lifts MessagePart.image payloads onto Ollama's
message-level images: [base64] field (snapshot-and-cleared like toolAwareHistory;
text-only turns still use the existing plain-string path, preserving wire-shape
assertions).
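The wire shape described above can be sketched as follows. ChatMessage, makeUserMessage, and encodeToJSON are hypothetical names for illustration, not the PR's types; the only assumption about Ollama is the message-level images array of base64 strings mentioned in the text.

```swift
import Foundation

// Hypothetical message model: `role`, `content`, and an optional
// message-level `images` array of base64 strings.
struct ChatMessage: Encodable {
    let role: String
    let content: String
    let images: [String]?   // nil is omitted from the encoded JSON
}

// Lifts raw image bytes onto the `images` field. Text-only turns pass no
// images, so the field is left out and the plain wire shape is unchanged.
func makeUserMessage(text: String, imageData: [Data] = []) -> ChatMessage {
    let encoded = imageData.map { $0.base64EncodedString() }
    return ChatMessage(role: "user", content: text, images: encoded.isEmpty ? nil : encoded)
}

// Encodes with sorted keys so the output is stable enough to compare in tests.
func encodeToJSON(_ message: ChatMessage) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(message), as: UTF8.self)
}
```

With this shape, a text-only message encodes without any images key, which is why existing text-only wire-shape assertions would keep passing in a design like this.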

Tests: deterministic OllamaManifestProbeTests for vision detection (mocked
/api/show, no live server) + BackendVisionCapability unit coverage, and a live
OllamaVisionE2ETests that generates a solid-color PNG in code, attaches it, and
asserts grounded output from a real vision model (qwen2.5vl/moondream/llava),
auto-skipping when none is installed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@roryford roryford marked this pull request as ready for review June 16, 2026 09:02
@roryford roryford merged commit e870118 into main Jun 16, 2026
11 checks passed
@roryford roryford deleted the feat/ollama-vision-capability-e2e branch June 16, 2026 09:02