Context
Lathe would benefit from a view() tool that lets the agent visually perceive an image on the sandbox filesystem — useful when driving a headless browser, inspecting generated charts, iterating on visual output, etc. The sandbox can produce the image; the problem is getting it into the model's context window.
The upstream gap
OWUI has a clean convention for human-facing rich output from tools: HTMLResponse with Content-Disposition: inline renders an iframe embed, and the model gets a status string. There is no symmetric convention for model-facing visual output — a tool producing an image that the model needs to see on its next turn.
This is tracked upstream as:
What partially works today
The OWUI codebase has two tool execution paths with different image handling:
- Legacy function-calling path (chat_completion_tools_handler): tool_result_files with type: "image" are emitted as socket events to the frontend for display, but are not injected into the model's next context. The model receives only a text summary like "Image file read successfully."
- Native tool-calling path (Responses API / OR-aligned, around middleware.py:4248): For MCP tools returning {"type": "image", ...} content items, image data URIs are added as input_image parts in function_call_output items, so the model can see them. However, this only applies to MCP tools on the native path.
Python toolkit tools (which is what Lathe is) return plain strings. There is no return type or convention that causes OWUI to inject an image as an image_url content part in the model's next turn.
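For reference, the shape that the native MCP path produces can be sketched roughly as below. This is an illustration only: the type names function_call_output and input_image come from the description above, while the field names call_id, output, and image_url are assumptions modelled on the public Responses API shape, and image_tool_output is a hypothetical helper, not an OWUI function.

```python
def image_tool_output(call_id: str, data_uri: str, note: str = "") -> dict:
    """Build a Responses-style tool output item that carries an image.

    Hypothetical helper: the field names are assumed from the Responses API
    shape, not taken from OWUI's middleware.
    """
    parts = []
    if note:
        # Optional text the model reads alongside the image.
        parts.append({"type": "input_text", "text": note})
    # The image itself, passed as a data URI so no external storage is needed.
    parts.append({"type": "input_image", "image_url": data_uri})
    return {
        "type": "function_call_output",
        "call_id": call_id,
        "output": parts,
    }
```

A Python toolkit tool currently has no way to ask OWUI to build an item like this on its behalf, which is the gap described here.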
Related upstream discussions
Together these describe the full routing matrix {human, model} × {text, rich} — three of four cells are served; the model-facing visual cell is empty.
What this means for Lathe
A view(path) tool is blocked until OWUI provides a return convention that Python toolkit tools can use to deliver image data to the model. No workaround within Lathe's architecture (single-file toolkit, no OWUI storage dependency) can cleanly bridge this gap.
When the upstream feature lands, implementing view() in Lathe should be straightforward: download the image bytes from the sandbox via the toolbox API, base64-encode them, and return them using whatever convention OWUI establishes.
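The encoding step is the only part that can be sketched today, since it does not depend on the upstream convention. A minimal version, assuming the image bytes have already been downloaded from the sandbox and that a data URI is an acceptable carrier, might look like this. The function name image_bytes_to_data_uri is hypothetical, and both the download call and the final return convention are deliberately left out because neither is defined here.

```python
import base64
import mimetypes


def image_bytes_to_data_uri(data: bytes, path: str) -> str:
    """Encode raw image bytes as a base64 data URI.

    The MIME type is guessed from the file extension of `path`; anything
    that is not recognised as an image is rejected.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Guessing the type from the extension keeps the sketch dependency-free; a real implementation may prefer to sniff the file header and to cap the image size before encoding.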
Action
Watch open-webui/open-webui#22591. No Lathe changes needed until the upstream convention is defined.