fix(qwen35): emit images inline in tool-response content by eligotts · Pull Request #25 · PrimeIntellect-ai/renderers

eligotts · 2026-05-13T07:13:09Z

Summary

Qwen35Renderer._render_tool silently dropped image / video parts in tool messages by pre-stringifying content via _render_content. Tool-response screenshots (the dominant image source in browser-agent traces) never made it into rendered tokens or mm_items, even though the Qwen3.5 chat template emits <|vision_start|><|image_pad|><|vision_end|> inside <tool_response> blocks just as in user messages.
Unify _render_tool to handle both text-only and media-bearing tool content in a single method. Internal _content_has_media dispatch; text-only path keeps the prior single-encode emit_text("\n" + content + "\n") for BPE parity; media path reuses the buffer-and-flush pattern from emit_user_with_media with emit_image threaded through as a closure arg.
Consecutive-tools envelope (<|im_start|>user>...<|im_end|>) logic stays in one place and remains content-agnostic (prev_is_tool / next_is_tool only). Consecutive tool messages mixing media + text-only render under one shared envelope — same as apply_chat_template.
Qwen36Renderer inherits the fix via subclassing.

Other renderers

Qwen3VLRenderer already had the equivalent abstraction (render_media_content reused across user and tool); verified live via parity probe on Qwen3-VL-4B (106 tokens, exact match). No change needed.
KimiK25Renderer._render_tool_body has the same class of bug (silently drops via emit_image=None default in _emit_content); confirmed via live probe (0 vs 1 <|media_pad|> token). Out of scope here — separate follow-up PR.

Test plan

New test_tool_response_image_byte_parity asserts byte-identity vs processor.apply_chat_template + processor(images=, text=) for three cases:
- tool_response_with_image — single tool image
- multi_turn_tool_response_images — multiple turns each carrying an image
- consecutive_tools_mixed_media — consecutive tools mixing media + text-only (the wrapper-boundary case)
Gated by isinstance(renderer, Qwen35Renderer) so other VLM renderers skip until they grow the feature.
Passes locally on Qwen/Qwen3.5-0.8B, Qwen/Qwen3.5-9B, and Qwen/Qwen3.5-35B-A3B.
Full renderer test suite: 964 passed, 75 skipped, 1 xfailed (pre-existing). No regressions.

Known gaps

Test coverage is image-only. Video parts in tool content take the same NotImplementedError branch they do in emit_user_with_media, but I haven't validated <|vision_start|><|video_pad|><|vision_end|> parity end-to-end.

🤖 Generated with Claude Code

Qwen35Renderer._render_tool pre-stringified content via _render_content, which silently dropped image / video parts. As a result, tool-response screenshots (the dominant image source in browser-agent traces) never made it into the rendered token stream or into mm_items — even though the Qwen3.5 chat template emits <|vision_start|><|image_pad|><|vision_end|> inside <tool_response> blocks just as it does in user messages. Unify _render_tool into a single method that handles both text-only and media-bearing tool content. Internal _content_has_media dispatch; text-only path keeps the prior emit_text("\n" + content + "\n") for BPE parity. Media path reuses the buffer-and-flush pattern from emit_user_with_media and now passes emit_image through as a closure arg. Consecutive-tools envelope (<|im_start|>user>...<|im_end|> around runs of tool messages) lives in one place. The open/close predicate stays content-agnostic (prev_is_tool / next_is_tool only), so consecutive tool messages with mixed media + text-only content render under one shared envelope — same behavior as apply_chat_template. Qwen36Renderer inherits the fix via subclassing. Tests: - New test_tool_response_image_byte_parity asserts byte-identity vs processor.apply_chat_template + processor(images=, text=) for three scenarios: single tool image, multi-turn tool images, and consecutive tool messages mixing media + text-only (the wrapper-boundary case). - Gated by isinstance(renderer, Qwen35Renderer); other VLM renderers skip until they grow the feature. - Passes on Qwen3.5-0.8B, Qwen3.5-9B, and Qwen3.5-35B-A3B locally. - Full renderer test suite: 964 passed, 75 skipped, 1 xfailed — no regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Thread ``emit_image`` from ``render()`` and ``bridge_to_next_turn()`` through ``_render_tool_body`` so image parts inside ``tool`` message content (browser-agent screenshots, etc.) render as ``<|media_begin|>image<|media_content|><|media_pad|><|media_end|>`` inline instead of being silently dropped by ``_emit_content``. Same bug class as Qwen3.5 (PR #25). Affects both Kimi K2.5 and K2.6 since they share ``KimiK25Renderer``. Extends the ``_supports_tool_message_images`` gate in the multimodal suite so ``test_tool_response_image_byte_parity`` exercises Kimi K2.5/K2.6 against the HF processor — 8/8 Kimi multimodal cases pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three concrete fixes to drop ty from 5 errors → 0 (warnings unchanged): 1. ``ToolSpec`` TypedDict (renderers/base.py) Renderer code accepts both the OpenAI envelope and flat tool shapes interchangeably (``tool.get("function") or tool``), but the TypedDict only declared the flat shape. Split into ``ToolFunctionSpec`` + a ``total=False`` ``ToolSpec`` that admits either. Eliminates the "Missing required key" errors at tests/test_client.py:90 and the matching "Unknown key 'function' / 'type'" warnings elsewhere. 2. qwen35.py ``_render_tool`` media branch ``_content_has_media`` is False unless content is a list, but ty can't follow that through a method call. Added an ``assert isinstance(raw_content, list)`` after the guard so the subsequent ``for item in raw_content`` is properly narrowed. 3. examples/sglang/online_multiturn_sglang.py ``bridge_to_next_turn`` returns ``RenderedTokens | None``; the example was subscripting the wrapper. Unwrapped via ``.token_ids``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This reverts commit cabbecf.

``_content_has_media`` returns False unless content is a list, but ty can't follow that through a method call, so the subsequent ``for item in raw_content`` flagged ``not-iterable`` on ``str | list | None``. Added an ``assert isinstance(raw_content, list)`` after the guard to narrow the type without changing runtime behaviour. Drops the ``renderers/`` ty errors that CI checks from 1 to 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eligotts mentioned this pull request May 13, 2026

feat(sft): VLM SFT via renderers + image-aware seq_len truncation PrimeIntellect-ai/prime-rl#2485

Draft

hallerite mentioned this pull request May 13, 2026

fix(kimi-k25): emit images inline in tool-response content #30

Merged

3 tasks

hallerite marked this pull request as ready for review May 13, 2026 16:10

hallerite force-pushed the fix/qwen35-tool-response-images branch from 7c70310 to 819a28c Compare May 13, 2026 16:13

hallerite and others added 3 commits May 13, 2026 16:22

Revert "fix(typing): clear ty errors blocking CI"

fddbb36

This reverts commit cabbecf.

hallerite approved these changes May 13, 2026

View reviewed changes

hallerite merged commit 562db77 into main May 13, 2026
6 checks passed

hallerite mentioned this pull request May 13, 2026

chore: bump renderers to 0.1.8.dev1 and migrate to ParsedToolCall API PrimeIntellect-ai/verifiers#1366

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(qwen35): emit images inline in tool-response content#25

fix(qwen35): emit images inline in tool-response content#25
hallerite merged 5 commits into
mainfrom
fix/qwen35-tool-response-images

eligotts commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eligotts commented May 13, 2026

Summary

Other renderers

Test plan

Known gaps

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants