view() tool blocked on upstream: OWUI Python toolkit tools cannot return model-visible images #40

Description

@rndmcnlly

Context

Lathe would benefit from a view() tool that lets the agent visually perceive an image on the sandbox filesystem — useful when driving a headless browser, inspecting generated charts, iterating on visual output, etc. The sandbox can produce the image; the problem is getting it into the model's context window.

The upstream gap

OWUI has a clean convention for human-facing rich output from tools: HTMLResponse with Content-Disposition: inline renders an iframe embed, and the model gets a status string. There is no symmetric convention for model-facing visual output — a tool producing an image that the model needs to see on its next turn.

This is tracked upstream as open-webui/open-webui#22591.

What partially works today

The OWUI codebase has two tool execution paths with different image handling:

  1. Legacy function-calling path (chat_completion_tools_handler): tool_result_files with type: "image" are emitted as socket events to the frontend for display, but are not injected into the model's next context. The model receives only a text summary like "Image file read successfully.".

  2. Native tool-calling path (Responses API / OR-aligned, around middleware.py:4248): For MCP tools returning {"type": "image", ...} content items, image data URIs are added as input_image parts in function_call_output items — meaning the model can see them. However, this only applies to MCP tools on the native path.
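To make item 2 concrete, here is a minimal sketch of the kind of conversion the native path is described as doing: MCP-style content items become a function_call_output whose output carries input_image parts. This is not OWUI's code. The helper name `to_function_call_output` is hypothetical, and the item shapes (base64 payload in `data`, MIME type in `mimeType`, a data URI in `image_url`) are assumptions based on the MCP content format and the Responses API input types.

```python
from __future__ import annotations

from typing import Any


def to_function_call_output(call_id: str, content_items: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a Responses-style function_call_output from MCP-style content items.

    Hypothetical helper for illustration only; not taken from OWUI's middleware.
    """
    parts: list[dict[str, Any]] = []
    for item in content_items:
        if item.get("type") == "text":
            parts.append({"type": "input_text", "text": item.get("text", "")})
        elif item.get("type") == "image":
            # Assumed MCP image shape: base64 payload in "data", MIME type in "mimeType".
            mime = item.get("mimeType", "image/png")
            data_uri = f"data:{mime};base64,{item['data']}"
            parts.append({"type": "input_image", "image_url": data_uri})
        # Other content types are ignored in this sketch.
    return {"type": "function_call_output", "call_id": call_id, "output": parts}
```

The point of the sketch is the asymmetry: an MCP tool has a structured content item that can be mapped to an image part, while a Python toolkit tool has nothing equivalent to map from.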

Python toolkit tools, the category Lathe falls into, return plain strings to the model. There is no return type or convention that causes OWUI to inject an image as an image_url content part in the model's next turn.

Related upstream discussions

Together these describe the full routing matrix {human, model} × {text, rich} — three of four cells are served; the model-facing visual cell is empty.

What this means for Lathe

A view(path) tool is blocked until OWUI provides a return convention that Python toolkit tools can use to deliver image data to the model. No workaround within Lathe's architecture (single-file toolkit, no OWUI storage dependency) can cleanly bridge this gap.

When the upstream feature lands, implementing view() in Lathe should be straightforward: download the image bytes from the sandbox via the toolbox API, base64-encode them, and return them using whatever convention OWUI establishes.
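The first two steps above can be sketched now, even though the return convention is still undefined. The following is a rough sketch under stated assumptions: `download` is a hypothetical stand-in for whatever toolbox API call fetches file bytes from the sandbox, and returning a data URI is only a placeholder for the convention OWUI eventually establishes.

```python
import base64
import mimetypes
from typing import Callable


def image_to_data_uri(image_bytes: bytes, path: str) -> str:
    """Base64-encode image bytes into a data URI, guessing the MIME type from the path."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image path: {path}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def view(path: str, download: Callable[[str], bytes]) -> str:
    """Fetch an image from the sandbox and encode it.

    `download` is a hypothetical callable standing in for the toolbox API;
    the return format is a placeholder until an upstream convention exists.
    """
    return image_to_data_uri(download(path), path)
```

Given the behavior described earlier, a toolkit tool returning this string today would most likely hand the model a long run of base64 text, not an image it can see, which is why the final step has to wait on upstream.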

Action

Watch open-webui/open-webui#22591. No Lathe changes needed until the upstream convention is defined.
