fix(qwen35): emit images inline in tool-response content#25
Merged
Conversation
Qwen35Renderer._render_tool pre-stringified content via _render_content,
which silently dropped image / video parts. As a result, tool-response
screenshots (the dominant image source in browser-agent traces) never
made it into the rendered token stream or into mm_items — even though
the Qwen3.5 chat template emits <|vision_start|><|image_pad|><|vision_end|>
inside <tool_response> blocks just as it does in user messages.
Unify _render_tool into a single method that handles both text-only and
media-bearing tool content. Internal _content_has_media dispatch; text-only
path keeps the prior emit_text("\n" + content + "\n") for BPE parity. Media
path reuses the buffer-and-flush pattern from emit_user_with_media and now
passes emit_image through as a closure arg.
Consecutive-tools envelope (<|im_start|>user>...<|im_end|> around runs of
tool messages) lives in one place. The open/close predicate stays
content-agnostic (prev_is_tool / next_is_tool only), so consecutive tool
messages with mixed media + text-only content render under one shared
envelope — same behavior as apply_chat_template.
Qwen36Renderer inherits the fix via subclassing.
Tests:
- New test_tool_response_image_byte_parity asserts byte-identity vs
processor.apply_chat_template + processor(images=, text=) for three
scenarios: single tool image, multi-turn tool images, and consecutive
tool messages mixing media + text-only (the wrapper-boundary case).
- Gated by isinstance(renderer, Qwen35Renderer); other VLM renderers
skip until they grow the feature.
- Passes on Qwen3.5-0.8B, Qwen3.5-9B, and Qwen3.5-35B-A3B locally.
- Full renderer test suite: 964 passed, 75 skipped, 1 xfailed — no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thread ``emit_image`` from ``render()`` and ``bridge_to_next_turn()`` through ``_render_tool_body`` so image parts inside ``tool`` message content (browser-agent screenshots, etc.) render as ``<|media_begin|>image<|media_content|><|media_pad|><|media_end|>`` inline instead of being silently dropped by ``_emit_content``. Same bug class as Qwen3.5 (PR #25). Affects both Kimi K2.5 and K2.6 since they share ``KimiK25Renderer``. Extends the ``_supports_tool_message_images`` gate in the multimodal suite so ``test_tool_response_image_byte_parity`` exercises Kimi K2.5/K2.6 against the HF processor — 8/8 Kimi multimodal cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
hallerite
added a commit
that referenced
this pull request
May 13, 2026
Thread ``emit_image`` from ``render()`` and ``bridge_to_next_turn()`` through ``_render_tool_body`` so image parts inside ``tool`` message content (browser-agent screenshots, etc.) render as ``<|media_begin|>image<|media_content|><|media_pad|><|media_end|>`` inline instead of being silently dropped by ``_emit_content``. Same bug class as Qwen3.5 (PR #25). Affects both Kimi K2.5 and K2.6 since they share ``KimiK25Renderer``. Extends the ``_supports_tool_message_images`` gate in the multimodal suite so ``test_tool_response_image_byte_parity`` exercises Kimi K2.5/K2.6 against the HF processor — 8/8 Kimi multimodal cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7c70310 to
819a28c
Compare
Three concrete fixes to drop ty from 5 errors → 0 (warnings unchanged):
1. ``ToolSpec`` TypedDict (renderers/base.py)
Renderer code accepts both the OpenAI envelope and flat tool shapes
interchangeably (``tool.get("function") or tool``), but the TypedDict
only declared the flat shape. Split into ``ToolFunctionSpec`` + a
``total=False`` ``ToolSpec`` that admits either. Eliminates the
"Missing required key" errors at tests/test_client.py:90 and the
matching "Unknown key 'function' / 'type'" warnings elsewhere.
2. qwen35.py ``_render_tool`` media branch
``_content_has_media`` is False unless content is a list, but ty
can't follow that through a method call. Added an
``assert isinstance(raw_content, list)`` after the guard so the
subsequent ``for item in raw_content`` is properly narrowed.
3. examples/sglang/online_multiturn_sglang.py
``bridge_to_next_turn`` returns ``RenderedTokens | None``; the
example was subscripting the wrapper. Unwrapped via ``.token_ids``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts commit cabbecf.
``_content_has_media`` returns False unless content is a list, but ty can't follow that through a method call, so the subsequent ``for item in raw_content`` flagged ``not-iterable`` on ``str | list | None``. Added an ``assert isinstance(raw_content, list)`` after the guard to narrow the type without changing runtime behaviour. Drops the ``renderers/`` ty errors that CI checks from 1 to 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite
approved these changes
May 13, 2026
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Qwen35Renderer._render_toolsilently dropped image / video parts in tool messages by pre-stringifying content via_render_content. Tool-response screenshots (the dominant image source in browser-agent traces) never made it into rendered tokens ormm_items, even though the Qwen3.5 chat template emits<|vision_start|><|image_pad|><|vision_end|>inside<tool_response>blocks just as in user messages._render_toolto handle both text-only and media-bearing tool content in a single method. Internal_content_has_mediadispatch; text-only path keeps the prior single-encodeemit_text("\n" + content + "\n")for BPE parity; media path reuses the buffer-and-flush pattern fromemit_user_with_mediawithemit_imagethreaded through as a closure arg.<|im_start|>user>...<|im_end|>) logic stays in one place and remains content-agnostic (prev_is_tool/next_is_toolonly). Consecutive tool messages mixing media + text-only render under one shared envelope — same asapply_chat_template.Qwen36Rendererinherits the fix via subclassing.Other renderers
Qwen3VLRendereralready had the equivalent abstraction (render_media_contentreused across user and tool); verified live via parity probe on Qwen3-VL-4B (106 tokens, exact match). No change needed.KimiK25Renderer._render_tool_bodyhas the same class of bug (silently drops viaemit_image=Nonedefault in_emit_content); confirmed via live probe (0 vs 1<|media_pad|>token). Out of scope here — separate follow-up PR.Test plan
test_tool_response_image_byte_parityasserts byte-identity vsprocessor.apply_chat_template+processor(images=, text=)for three cases:tool_response_with_image— single tool imagemulti_turn_tool_response_images— multiple turns each carrying an imageconsecutive_tools_mixed_media— consecutive tools mixing media + text-only (the wrapper-boundary case)isinstance(renderer, Qwen35Renderer)so other VLM renderers skip until they grow the feature.Qwen/Qwen3.5-0.8B,Qwen/Qwen3.5-9B, andQwen/Qwen3.5-35B-A3B.Known gaps
NotImplementedErrorbranch they do inemit_user_with_media, but I haven't validated<|vision_start|><|video_pad|><|vision_end|>parity end-to-end.🤖 Generated with Claude Code