Skip to content

fix(qwen35): emit images inline in tool-response content#25

Merged
hallerite merged 5 commits into
mainfrom
fix/qwen35-tool-response-images
May 13, 2026
Merged

fix(qwen35): emit images inline in tool-response content#25
hallerite merged 5 commits into
mainfrom
fix/qwen35-tool-response-images

Conversation

@eligotts
Copy link
Copy Markdown
Contributor

Summary

  • Qwen35Renderer._render_tool silently dropped image / video parts in tool messages by pre-stringifying content via _render_content. Tool-response screenshots (the dominant image source in browser-agent traces) never made it into rendered tokens or mm_items, even though the Qwen3.5 chat template emits <|vision_start|><|image_pad|><|vision_end|> inside <tool_response> blocks just as in user messages.
  • Unify _render_tool to handle both text-only and media-bearing tool content in a single method. Internal _content_has_media dispatch; text-only path keeps the prior single-encode emit_text("\n" + content + "\n") for BPE parity; media path reuses the buffer-and-flush pattern from emit_user_with_media with emit_image threaded through as a closure arg.
  • Consecutive-tools envelope (<|im_start|>user>...<|im_end|>) logic stays in one place and remains content-agnostic (prev_is_tool / next_is_tool only). Consecutive tool messages mixing media + text-only render under one shared envelope — same as apply_chat_template.
  • Qwen36Renderer inherits the fix via subclassing.

Other renderers

  • Qwen3VLRenderer already had the equivalent abstraction (render_media_content reused across user and tool); verified live via parity probe on Qwen3-VL-4B (106 tokens, exact match). No change needed.
  • KimiK25Renderer._render_tool_body has the same class of bug (silently drops via emit_image=None default in _emit_content); confirmed via live probe (0 vs 1 <|media_pad|> token). Out of scope here — separate follow-up PR.

Test plan

  • New test_tool_response_image_byte_parity asserts byte-identity vs processor.apply_chat_template + processor(images=, text=) for three cases:
    • tool_response_with_image — single tool image
    • multi_turn_tool_response_images — multiple turns each carrying an image
    • consecutive_tools_mixed_media — consecutive tools mixing media + text-only (the wrapper-boundary case)
  • Gated by isinstance(renderer, Qwen35Renderer) so other VLM renderers skip until they grow the feature.
  • Passes locally on Qwen/Qwen3.5-0.8B, Qwen/Qwen3.5-9B, and Qwen/Qwen3.5-35B-A3B.
  • Full renderer test suite: 964 passed, 75 skipped, 1 xfailed (pre-existing). No regressions.

Known gaps

  • Test coverage is image-only. Video parts in tool content take the same NotImplementedError branch they do in emit_user_with_media, but I haven't validated <|vision_start|><|video_pad|><|vision_end|> parity end-to-end.

🤖 Generated with Claude Code

Qwen35Renderer._render_tool pre-stringified content via _render_content,
which silently dropped image / video parts. As a result, tool-response
screenshots (the dominant image source in browser-agent traces) never
made it into the rendered token stream or into mm_items — even though
the Qwen3.5 chat template emits <|vision_start|><|image_pad|><|vision_end|>
inside <tool_response> blocks just as it does in user messages.

Unify _render_tool into a single method that handles both text-only and
media-bearing tool content. Internal _content_has_media dispatch; text-only
path keeps the prior emit_text("\n" + content + "\n") for BPE parity. Media
path reuses the buffer-and-flush pattern from emit_user_with_media and now
passes emit_image through as a closure arg.

Consecutive-tools envelope (<|im_start|>user>...<|im_end|> around runs of
tool messages) lives in one place. The open/close predicate stays
content-agnostic (prev_is_tool / next_is_tool only), so consecutive tool
messages with mixed media + text-only content render under one shared
envelope — same behavior as apply_chat_template.

Qwen36Renderer inherits the fix via subclassing.

Tests:
- New test_tool_response_image_byte_parity asserts byte-identity vs
  processor.apply_chat_template + processor(images=, text=) for three
  scenarios: single tool image, multi-turn tool images, and consecutive
  tool messages mixing media + text-only (the wrapper-boundary case).
- Gated by isinstance(renderer, Qwen35Renderer); other VLM renderers
  skip until they grow the feature.
- Passes on Qwen3.5-0.8B, Qwen3.5-9B, and Qwen3.5-35B-A3B locally.
- Full renderer test suite: 964 passed, 75 skipped, 1 xfailed — no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thread ``emit_image`` from ``render()`` and ``bridge_to_next_turn()``
through ``_render_tool_body`` so image parts inside ``tool`` message
content (browser-agent screenshots, etc.) render as
``<|media_begin|>image<|media_content|><|media_pad|><|media_end|>``
inline instead of being silently dropped by ``_emit_content``.

Same bug class as Qwen3.5 (PR #25). Affects both Kimi K2.5 and K2.6
since they share ``KimiK25Renderer``.

Extends the ``_supports_tool_message_images`` gate in the multimodal
suite so ``test_tool_response_image_byte_parity`` exercises Kimi
K2.5/K2.6 against the HF processor — 8/8 Kimi multimodal cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite marked this pull request as ready for review May 13, 2026 16:10
hallerite added a commit that referenced this pull request May 13, 2026
Thread ``emit_image`` from ``render()`` and ``bridge_to_next_turn()``
through ``_render_tool_body`` so image parts inside ``tool`` message
content (browser-agent screenshots, etc.) render as
``<|media_begin|>image<|media_content|><|media_pad|><|media_end|>``
inline instead of being silently dropped by ``_emit_content``.

Same bug class as Qwen3.5 (PR #25). Affects both Kimi K2.5 and K2.6
since they share ``KimiK25Renderer``.

Extends the ``_supports_tool_message_images`` gate in the multimodal
suite so ``test_tool_response_image_byte_parity`` exercises Kimi
K2.5/K2.6 against the HF processor — 8/8 Kimi multimodal cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the fix/qwen35-tool-response-images branch from 7c70310 to 819a28c Compare May 13, 2026 16:13
hallerite and others added 3 commits May 13, 2026 16:22
Three concrete fixes to drop ty from 5 errors → 0 (warnings unchanged):

1. ``ToolSpec`` TypedDict (renderers/base.py)
   Renderer code accepts both the OpenAI envelope and flat tool shapes
   interchangeably (``tool.get("function") or tool``), but the TypedDict
   only declared the flat shape. Split into ``ToolFunctionSpec`` + a
   ``total=False`` ``ToolSpec`` that admits either. Eliminates the
   "Missing required key" errors at tests/test_client.py:90 and the
   matching "Unknown key 'function' / 'type'" warnings elsewhere.

2. qwen35.py ``_render_tool`` media branch
   ``_content_has_media`` is False unless content is a list, but ty
   can't follow that through a method call. Added an
   ``assert isinstance(raw_content, list)`` after the guard so the
   subsequent ``for item in raw_content`` is properly narrowed.

3. examples/sglang/online_multiturn_sglang.py
   ``bridge_to_next_turn`` returns ``RenderedTokens | None``; the
   example was subscripting the wrapper. Unwrapped via ``.token_ids``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``_content_has_media`` returns False unless content is a list, but ty
can't follow that through a method call, so the subsequent
``for item in raw_content`` flagged ``not-iterable`` on ``str | list |
None``.

Added an ``assert isinstance(raw_content, list)`` after the guard to
narrow the type without changing runtime behaviour. Drops the
``renderers/`` ty errors that CI checks from 1 to 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite merged commit 562db77 into main May 13, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants