
feat(renderer-client): thread multimodal sidecar through rollout + transport #1346

Open

hallerite wants to merge 1 commit into main from feat/renderer-multimodal-passthrough

Conversation

@hallerite (Member) commented May 11, 2026

Summary

Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder ranges, mm_hashes) end-to-end so multimodal renderers can drive the multi_modal_data features field of vLLM's /inference/v1/generate and the trainer's mm_kwargs without going through the legacy chat-completions / MITO multimodal path.

Companion PRs:

  • renderers: PrimeIntellect-ai/renderers#17 — Qwen3-VL, Qwen3.5/3.6, Kimi-K2.5 image support + RenderedTokens return shape
  • prime-rl: orchestrator/trainer side, opening shortly

What's plumbed through

verifiers/clients/renderer_client.py

  • _step_multi_modal_data(step): recover the prior turn's mm_data from the trajectory step (post-parse or raw-message side)
  • _get_incremental_prompt_ids now returns RenderedTokens | None and forwards previous_multi_modal_data to bridge_to_next_turn so the new turn's placeholder runs cover every earlier-turn image; without this carry-forward, vLLM sees mismatched placeholder counts and falls back to hash-cache lookup or errors. Text-only renderers' raw list[int] is normalized via as_rendered_tokens so callers can unpack uniformly (see the sketch after this list)
  • RendererClient.create_completion unpacks the bridged result into (prompt_ids, multi_modal_data) and forwards both to generate
  • parse_response_tokens copies response.multi_modal_data onto ResponseTokens
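
A minimal sketch of this flow, using the names from the bullets above (bridge_to_next_turn, as_rendered_tokens, RenderedTokens); the RenderedTokens shape, the renderer signature, and the helper bodies here are assumptions, not the actual implementation:

```python
# Sketch only: field names and signatures are assumed, not the real API.
from dataclasses import dataclass
from typing import Any

@dataclass
class RenderedTokens:  # assumed shape: token ids plus optional sidecar
    prompt_ids: list[int]
    multi_modal_data: Any | None = None

def as_rendered_tokens(bridged: "RenderedTokens | list[int]") -> RenderedTokens:
    # Text-only renderers return a bare list[int]; wrap it so callers can
    # always unpack (prompt_ids, multi_modal_data) uniformly.
    if isinstance(bridged, list):
        return RenderedTokens(prompt_ids=bridged)
    return bridged

def incremental_prompt(renderer, messages, previous_mm_data) -> RenderedTokens | None:
    # Forward the prior turn's sidecar so the new turn's placeholder runs
    # cover every earlier-turn image, not just the latest one.
    bridged = renderer.bridge_to_next_turn(
        messages, previous_multi_modal_data=previous_mm_data
    )
    return None if bridged is None else as_rendered_tokens(bridged)
```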

verifiers/types.py

  • ResponseTokens.multi_modal_data: Any | None
  • TrajectoryStepTokens.multi_modal_data: NotRequired[Any]

Both typed as Any to avoid a hard import dependency on renderers (sketched below).
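
A sketch of how the additions might sit in verifiers/types.py; the surrounding fields are elided and the class kinds are assumptions:

```python
from typing import Any, NotRequired, TypedDict

class TrajectoryStepTokens(TypedDict):
    # ...existing token fields elided...
    multi_modal_data: NotRequired[Any]  # absent entirely for text-only steps

# ResponseTokens (class kind assumed) gains an optional field defaulting to None:
#     multi_modal_data: Any | None = None
```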

verifiers/utils/save_utils.py

  • is_json_serializable accepts torch tensors / numpy arrays / renderer sidecar dataclasses: these aren't JSON-native but survive the prime-rl msgpack encoder, and trajectories carrying them are excluded from the JSONL save at the orchestrator boundary (the orchestrator passes exclude_keys={"trajectory"} to save_rollouts)
  • _strip_intermediate_mm_data(trajectory) drops tokens.multi_modal_data from all but the last step before transport (sketched below). bridge_to_next_turn merges prior turns' mm_data into each new turn, so naively shipping it on every step duplicates every image O(N²) bytes for an N-turn rollout; only the last step's sidecar is read by the trainer
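
A minimal sketch of the stripping pass, assuming each trajectory step is a dict whose tokens entry is a TrajectoryStepTokens-style mapping (the real step structure may differ):

```python
def _strip_intermediate_mm_data(trajectory: list[dict]) -> list[dict]:
    # The last step's sidecar already covers every image (the bridge merges
    # prior turns in), so dropping it from earlier steps avoids O(N^2)
    # duplicated image bytes over an N-turn rollout.
    out = []
    last = len(trajectory) - 1
    for i, step in enumerate(trajectory):
        tokens = step.get("tokens")
        if i < last and isinstance(tokens, dict) and "multi_modal_data" in tokens:
            tokens = {k: v for k, v in tokens.items() if k != "multi_modal_data"}
            step = {**step, "tokens": tokens}
        out.append(step)
    return out
```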

verifiers/utils/serve_utils.py

  • Custom msgpack encoder gains torch tensor / numpy ndarray / dataclass support. Tensors encode as {__torch_tensor__: True, dtype, shape, data} with raw bytes payload (torch imported lazily, so text-only consumers don't pay for it)
  • decode_tensor_payload / walk_decode_tensors rehydrate tensor payloads on the receiving side (both sketched below)
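
A sketch of the tensor leg of this scheme, following the payload layout named above ({__torch_tensor__: True, dtype, shape, data}); the hook names and exact wire format are assumptions, and dataclass handling is omitted for brevity:

```python
import msgpack
import numpy as np

def _encode_hook(obj):
    # Hypothetical msgpack `default=` hook. Torch is imported lazily so
    # text-only consumers never pay for it.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and isinstance(obj, torch.Tensor):
        t = obj.detach().cpu().contiguous()
        return {
            "__torch_tensor__": True,
            "dtype": str(t.dtype).removeprefix("torch."),  # e.g. "float32"
            "shape": list(t.shape),
            "data": t.numpy().tobytes(),  # numpy-representable dtypes only
        }
    if isinstance(obj, np.ndarray):
        return {"__ndarray__": True, "dtype": str(obj.dtype),
                "shape": list(obj.shape), "data": obj.tobytes()}
    raise TypeError(f"cannot msgpack-encode {type(obj)!r}")

def decode_tensor_payload(payload: dict):
    # Rehydrate one encoded payload into a torch tensor.
    import torch
    arr = np.frombuffer(payload["data"], dtype=payload["dtype"])
    return torch.from_numpy(arr.copy().reshape(payload["shape"]))

def walk_decode_tensors(obj):
    # Recursively rehydrate tensor payloads nested in lists/dicts.
    if isinstance(obj, dict):
        if obj.get("__torch_tensor__"):
            return decode_tensor_payload(obj)
        return {k: walk_decode_tensors(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [walk_decode_tensors(v) for v in obj]
    return obj
```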

Test plan

  • Text-only RL still goes through the same code path (no mm_data → fast path unchanged)
  • Multimodal RL: mm_data reaches vLLM as multi_modal_data in the generate request and reaches the trainer via TrajectoryStepTokens["multi_modal_data"]
  • Bridge: previous-turn images carried forward, placeholder count matches the combined token sequence
  • Transport: round-trip a MultiModalData through msgpack + decode and check tensor shapes / dtypes are preserved (see the test sketch after this list)
  • JSONL save: trajectories with multi_modal_data are accepted (no spurious "not JSON-serializable" errors) with the orchestrator passing exclude_keys={"trajectory"}
  • No O(N²) duplication in saved rollouts for N-turn multimodal episodes
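
A sketch of the transport round-trip check, reusing the hypothetical _encode_hook / walk_decode_tensors helpers from the serve_utils sketch above (the real helpers live in verifiers/utils/serve_utils.py and may differ):

```python
import msgpack
import torch

def test_mm_data_roundtrip():
    pixel_values = torch.randn(3, 224, 224)
    wire = msgpack.packb({"pixel_values": pixel_values},
                         default=_encode_hook, use_bin_type=True)
    decoded = walk_decode_tensors(msgpack.unpackb(wire, raw=False))
    out = decoded["pixel_values"]
    assert out.shape == pixel_values.shape
    assert out.dtype == pixel_values.dtype
    assert torch.equal(out, pixel_values)  # byte-exact round trip
```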

🤖 Generated with Claude Code


Note

Medium Risk
Touches the renderer client request/bridging path and rollout serialization/transport; incorrect threading or encoding of multi_modal_data could break multimodal rollouts or increase payload sizes, though text-only paths should remain unaffected.

Overview
Threads renderer-emitted multimodal sidecar data (multi_modal_data) end-to-end: bridging now carries prior-turn multimodal state and normalizes bridge results to RenderedTokens, and /inference/v1/generate requests include both prompt_ids and multi_modal_data when available.

Extends response/trajectory token types to store multi_modal_data, preserves it during parse_response_tokens, and updates saving/transport utilities to tolerate/encode renderer sidecars and tensors (including stripping intermediate-step multimodal blobs to avoid O(N²) rollout growth and adding msgpack encode/decode helpers for tensors/dataclasses).

Reviewed by Cursor Bugbot for commit c2e1b84.

feat(renderer-client): thread multimodal sidecar through rollout + transport

Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder
ranges, mm_hashes) end-to-end so multimodal renderers can drive vLLM's
/inference/v1/generate `multi_modal_data` features field and the
downstream trainer's `mm_kwargs` without going through the legacy
chat-completions / MITO multimodal path.

renderer_client.py
- `_step_multi_modal_data(step)`: recover the prior turn's mm_data from
  the trajectory step (parsed-tokens or raw-message side).
- `_get_incremental_prompt_ids` now returns `RenderedTokens | None` and
  forwards `previous_multi_modal_data` to `bridge_to_next_turn` so the
  new turn's placeholder runs cover every earlier-turn image. Without
  this carry-forward, vLLM sees mismatched placeholder counts and falls
  back to hash-cache lookup or errors. Text-only renderers' raw
  `list[int]` returns are normalized via `as_rendered_tokens`.
- `RendererClient.create_completion` unpacks the bridged result into
  `(prompt_ids, multi_modal_data)` and forwards both to `generate`.
- `parse_response_tokens`: copies `response.multi_modal_data` onto the
  emitted `ResponseTokens` so downstream consumers can read it.

types.py
- `ResponseTokens.multi_modal_data: Any | None`
- `TrajectoryStepTokens.multi_modal_data: NotRequired[Any]`
Both typed as `Any` to avoid a hard import dependency on `renderers`.

utils/response_utils.py
- `parse_response_tokens` propagates `multi_modal_data` onto the
  `TrajectoryStepTokens` output when present.

utils/save_utils.py
- `is_json_serializable` accepts torch tensors / numpy arrays / renderer
  sidecar dataclasses — these aren't JSON-native but survive the
  prime-rl msgpack encoder, and trajectories carrying them are excluded
  from the JSONL save at the orchestrator boundary (orchestrator passes
  `exclude_keys={"trajectory"}` to `save_rollouts`).
- `_strip_intermediate_mm_data(trajectory)`: drop `tokens.multi_modal_data`
  from all but the last step before transport. `bridge_to_next_turn`
  merges prior turns' mm_data into the new turn, so naively shipping
  mm_data on every step duplicates every image O(N²) bytes for an N-turn
  rollout; only the last step's sidecar is read by the trainer.

utils/serve_utils.py
- Custom msgpack encoder gains torch tensor / numpy ndarray /
  dataclass support. Tensors are encoded as
  `{__torch_tensor__: True, dtype, shape, data}` with raw bytes payload.
  Torch is imported lazily so text-only consumers don't pay for it.
- `decode_tensor_payload` / `walk_decode_tensors` rehydrate tensor
  payloads on the receiving side.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



# the combined token sequence. Without this, vLLM sees
# placeholder counts that don't match the prompt and
# silently falls back to hash-cache lookup (or errors).
previous_multi_modal_data=previous_mm_data,


New kwarg breaks existing renderer bridge implementations

High Severity

previous_multi_modal_data=previous_mm_data is passed unconditionally to bridge_to_next_turn, even when it's None (text-only rollouts). The existing test mock _BridgeRenderer.bridge_to_next_turn in tests/test_renderer_client.py only accepts *, tools=None as keyword-only args, so every bridging test will crash with TypeError: got an unexpected keyword argument 'previous_multi_modal_data'. Any renderer implementation not yet updated for the new parameter will also fail at runtime.
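
One backward-compatible mitigation (illustrative only, not necessarily what the PR will adopt) is to forward the kwarg conditionally, so bridge implementations and test doubles that predate the parameter keep working on text-only paths:

```python
# Variable names assumed from the PR's diff hunk above.
kwargs = {}
if previous_mm_data is not None:
    kwargs["previous_multi_modal_data"] = previous_mm_data
bridged = renderer.bridge_to_next_turn(messages, tools=tools, **kwargs)
```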



"PlaceholderRange",
"RenderedTokens",
}:
return True

Validation permits non-JSON types causing silent corruption

Medium Severity

is_json_serializable now returns True for torch tensors and renderer dataclasses, but make_serializable (the json.dump fallback) doesn't handle these types — it falls through to str(value), silently producing garbage like "tensor([1.0, 2.0])". The validation gate lets non-serializable values into the output dict, relying entirely on an external exclude_keys mechanism that isn't enforced here.
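
One illustrative way to close the gap (an assumption about a possible fix, not the PR's code) is to make the last-resort branch of make_serializable refuse tensor-like values instead of stringifying them:

```python
def _fallback(value):
    # Hypothetical guard: anything tensor-like that slipped past exclude_keys
    # fails loudly instead of being silently str()-ified into the JSONL.
    if type(value).__module__.startswith(("torch", "numpy")):
        raise TypeError(
            f"{type(value).__name__} should have been stripped via "
            "exclude_keys before the JSONL save"
        )
    return str(value)  # existing last-resort fallback for benign types
```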

