feat(renderer-client): thread multimodal sidecar through rollout + transport #1346
hallerite wants to merge 1 commit into
Conversation
feat(renderer-client): thread multimodal sidecar through rollout + transport
Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder
ranges, mm_hashes) end-to-end so multimodal renderers can drive vLLM's
/inference/v1/generate `multi_modal_data` features field and the
downstream trainer's `mm_kwargs` without going through the legacy
chat-completions / MITO multimodal path.
renderer_client.py
- `_step_multi_modal_data(step)`: recover the prior turn's mm_data from
the trajectory step (parsed-tokens or raw-message side).
- `_get_incremental_prompt_ids` now returns `RenderedTokens | None` and
forwards `previous_multi_modal_data` to `bridge_to_next_turn` so the
new turn's placeholder runs cover every earlier-turn image. Without
this carry-forward, vLLM sees mismatched placeholder counts and falls
  back to hash-cache lookup or errors. Text-only renderers' raw
  `list[int]` returns are normalized via `as_rendered_tokens` (see the
  sketch after this list).
- `RendererClient.create_completion` unpacks the bridged result into
`(prompt_ids, multi_modal_data)` and forwards both to `generate`.
- `parse_response_tokens`: copies `response.multi_modal_data` onto the
emitted `ResponseTokens` so downstream consumers can read it.
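A minimal sketch of that carry-forward path, assuming the `RenderedTokens` shape from the companion PR and a dict-style trajectory step; the bridge call's positional arguments and helper bodies are assumptions, and the real methods handle more cases:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RenderedTokens:
    prompt_ids: list[int]
    multi_modal_data: Any | None = None


def as_rendered_tokens(result: "RenderedTokens | list[int]") -> RenderedTokens:
    # Text-only renderers return a raw list[int]; wrap it so callers can
    # always unpack (prompt_ids, multi_modal_data) uniformly.
    return result if isinstance(result, RenderedTokens) else RenderedTokens(result)


def _step_multi_modal_data(step: dict[str, Any]) -> Any | None:
    # Recover the prior turn's sidecar from the parsed-tokens side first,
    # then fall back to the raw-message side.
    tokens = step.get("tokens") or {}
    return tokens.get("multi_modal_data") or step.get("multi_modal_data")


def _get_incremental_prompt_ids(
    renderer: Any, step: dict[str, Any], messages: list[dict[str, Any]]
) -> RenderedTokens | None:
    bridged = renderer.bridge_to_next_turn(
        messages,
        # Carry prior-turn images forward so the new turn's placeholder
        # runs cover every earlier-turn image, not just the latest one.
        previous_multi_modal_data=_step_multi_modal_data(step),
    )
    return None if bridged is None else as_rendered_tokens(bridged)
```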
types.py
- `ResponseTokens.multi_modal_data: Any | None`
- `TrajectoryStepTokens.multi_modal_data: NotRequired[Any]`
  Both typed as `Any` to avoid a hard import dependency on `renderers`
  (sketched below).
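A minimal sketch of the two type additions, assuming `ResponseTokens` is a dataclass-style container and `TrajectoryStepTokens` a `TypedDict`; surrounding fields are elided:

```python
from dataclasses import dataclass
from typing import Any, NotRequired, TypedDict


@dataclass
class ResponseTokens:
    # ...existing fields elided...
    # Typed `Any` so types.py never imports `renderers`.
    multi_modal_data: Any | None = None


class TrajectoryStepTokens(TypedDict):
    # ...existing keys elided...
    multi_modal_data: NotRequired[Any]  # absent for text-only steps
```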
utils/response_utils.py
- `parse_response_tokens` propagates `multi_modal_data` onto the
  `TrajectoryStepTokens` output when present (see the sketch below).
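A hedged sketch of that propagation, assuming a dict-shaped step; the existing parsing logic is elided:

```python
from typing import Any


def parse_response_tokens(response: Any) -> dict[str, Any]:
    step: dict[str, Any] = {
        "token_ids": list(getattr(response, "token_ids", [])),
        # ...other parsed fields elided...
    }
    # Attach the sidecar only when present so text-only steps are unchanged.
    mm_data = getattr(response, "multi_modal_data", None)
    if mm_data is not None:
        step["multi_modal_data"] = mm_data
    return step
```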
utils/save_utils.py
- `is_json_serializable` accepts torch tensors / numpy arrays / renderer
sidecar dataclasses — these aren't JSON-native but survive the
prime-rl msgpack encoder, and trajectories carrying them are excluded
  from the JSONL save at the orchestrator boundary (orchestrator passes
  `exclude_keys={"trajectory"}` to `save_rollouts`); a sketch follows this
  list.
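A minimal sketch of the loosened gate. The type-name set matches the diff hunk quoted in the review below; the module-based torch/numpy check and the function body are assumptions:

```python
import dataclasses
import json
from typing import Any

_SIDECAR_TYPE_NAMES = {"MultiModalData", "PlaceholderRange", "RenderedTokens"}


def is_json_serializable(value: Any) -> bool:
    cls = type(value)
    # Renderer sidecar dataclasses: not JSON-native, but they survive the
    # prime-rl msgpack encoder, and trajectories carrying them are excluded
    # from the JSONL save (exclude_keys={"trajectory"}).
    if dataclasses.is_dataclass(value) and cls.__name__ in _SIDECAR_TYPE_NAMES:
        return True
    # Torch tensors / numpy arrays, matched by module so torch is never imported.
    if cls.__module__.partition(".")[0] in ("torch", "numpy"):
        return True
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False
```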
- `_strip_intermediate_mm_data(trajectory)`: drop `tokens.multi_modal_data`
from all but the last step before transport. `bridge_to_next_turn`
merges prior turns' mm_data into the new turn, so naively shipping
  mm_data on every step duplicates every image, O(N²) bytes for an N-turn
  rollout; only the last step's sidecar is read by the trainer (sketched
  below).
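A sketch of the stripping pass, assuming dict-shaped steps with a `tokens` sub-dict:

```python
from typing import Any


def _strip_intermediate_mm_data(
    trajectory: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    # bridge_to_next_turn merges prior turns' mm_data into each new turn,
    # so the final step's sidecar already covers every earlier image.
    # Keep only that final copy before transport.
    for step in trajectory[:-1]:
        tokens = step.get("tokens")
        if isinstance(tokens, dict):
            tokens.pop("multi_modal_data", None)
    return trajectory
```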
utils/serve_utils.py
- Custom msgpack encoder gains torch tensor / numpy ndarray /
dataclass support. Tensors are encoded as
`{__torch_tensor__: True, dtype, shape, data}` with raw bytes payload.
Torch is imported lazily so text-only consumers don't pay for it.
- `decode_tensor_payload` / `walk_decode_tensors` rehydrate tensor
  payloads on the receiving side (both sketched below).
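A sketch of the encode/decode pair. The payload shape and the two decoder names come from this PR; the encoder hook's name (`_encode_for_msgpack`) and internals are assumptions:

```python
import dataclasses
from typing import Any


def _encode_for_msgpack(obj: Any) -> Any:
    # `default=` hook for the msgpack packer (hook name is hypothetical).
    cls = type(obj)
    if cls.__module__.partition(".")[0] == "torch":
        import torch  # lazy: text-only consumers never import torch

        assert isinstance(obj, torch.Tensor)
        t = obj.detach().contiguous().cpu()
        return {
            "__torch_tensor__": True,
            # Assumes dtypes with numpy equivalents (float32, int64, ...);
            # bfloat16 would need special casing.
            "dtype": str(t.dtype).removeprefix("torch."),
            "shape": list(t.shape),
            "data": t.numpy().tobytes(),  # raw bytes payload
        }
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        # Sidecar dataclasses flatten to dicts; nested tensors re-enter
        # this hook as the packer walks the result.
        return {"__dataclass__": cls.__name__, **dataclasses.asdict(obj)}
    raise TypeError(f"cannot msgpack-encode {cls.__name__}")


def decode_tensor_payload(payload: dict) -> Any:
    import numpy as np
    import torch

    arr = np.frombuffer(payload["data"], dtype=payload["dtype"])
    return torch.from_numpy(arr.copy()).reshape(payload["shape"])


def walk_decode_tensors(value: Any) -> Any:
    # Recursively rehydrate tensor payloads anywhere in a decoded structure.
    if isinstance(value, dict):
        if value.get("__torch_tensor__"):
            return decode_tensor_payload(value)
        return {k: walk_decode_tensors(v) for k, v in value.items()}
    if isinstance(value, list):
        return [walk_decode_tensors(v) for v in value]
    return value
```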
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
        # the combined token sequence. Without this, vLLM sees
        # placeholder counts that don't match the prompt and
        # silently falls back to hash-cache lookup (or errors).
        previous_multi_modal_data=previous_mm_data,
```
New kwarg breaks existing renderer bridge implementations
High Severity
previous_multi_modal_data=previous_mm_data is passed unconditionally to bridge_to_next_turn, even when it's None (text-only rollouts). The existing test mock _BridgeRenderer.bridge_to_next_turn in tests/test_renderer_client.py only accepts *, tools=None as keyword-only args, so every bridging test will crash with TypeError: got an unexpected keyword argument 'previous_multi_modal_data'. Any renderer implementation not yet updated for the new parameter will also fail at runtime.
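One possible mitigation, not necessarily the PR's eventual fix: feature-detect the parameter before forwarding it, so older bridge implementations and test mocks keep working. The wrapper name and signature here are hypothetical:

```python
import inspect
from typing import Any


def _call_bridge(
    renderer: Any, *args: Any, previous_mm_data: Any = None, **kwargs: Any
) -> Any:
    # Forward the new kwarg only if this renderer's bridge_to_next_turn
    # declares it; older signatures then never see the unknown argument.
    params = inspect.signature(renderer.bridge_to_next_turn).parameters
    if "previous_multi_modal_data" in params:
        kwargs["previous_multi_modal_data"] = previous_mm_data
    return renderer.bridge_to_next_turn(*args, **kwargs)
```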
| "PlaceholderRange", | ||
| "RenderedTokens", | ||
| }: | ||
| return True |
Validation permits non-JSON types causing silent corruption
Medium Severity
is_json_serializable now returns True for torch tensors and renderer dataclasses, but make_serializable (the json.dump fallback) doesn't handle these types — it falls through to str(value), silently producing garbage like "tensor([1.0, 2.0])". The validation gate lets non-serializable values into the output dict, relying entirely on an external exclude_keys mechanism that isn't enforced here.
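A sketch of the reported gap, assuming a typical shape for `make_serializable` (the actual implementation may differ): anything the gate admits but the fallback doesn't recognize is stringified.

```python
from typing import Any


def make_serializable(value: Any) -> Any:
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {k: make_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [make_serializable(v) for v in value]
    # Fallback: a tensor that passed the loosened is_json_serializable gate
    # lands here and is silently written as e.g. "tensor([1., 2.])".
    return str(value)
```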


Summary

Companion PRs: `RenderedTokens` return shape.

Test plan
- Verify `mm_data` reaches vLLM as `multi_modal_data` in the generate request and reaches the trainer via `TrajectoryStepTokens["multi_modal_data"]`.
- Round-trip `MultiModalData` through msgpack encode + decode and check that tensor shapes / dtypes are preserved (a sketch of this check follows below).
- Confirm trajectories carrying `multi_modal_data` are accepted (no spurious "not JSON-serializable") with the orchestrator passing `exclude_keys={"trajectory"}`.

🤖 Generated with Claude Code
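A hypothetical round-trip test matching the second test-plan item; it reuses `_encode_for_msgpack` and `walk_decode_tensors` from the serve_utils sketch above, and the exact module paths and test name are assumptions:

```python
import msgpack
import torch


def test_tensor_roundtrip_preserves_shape_and_dtype() -> None:
    t = torch.randn(2, 3, dtype=torch.float32)
    packed = msgpack.packb({"pixel_values": t}, default=_encode_for_msgpack)
    out = walk_decode_tensors(msgpack.unpackb(packed))
    assert out["pixel_values"].shape == (2, 3)
    assert out["pixel_values"].dtype == torch.float32
    assert torch.equal(out["pixel_values"], t)
```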
Note
Medium Risk
Touches the renderer client request/bridging path and rollout serialization/transport; incorrect threading or encoding of `multi_modal_data` could break multimodal rollouts or increase payload sizes, though text-only paths should remain unaffected.

Overview
Threads renderer-emitted multimodal sidecar data (`multi_modal_data`) end-to-end: bridging now carries prior-turn multimodal state and normalizes bridge results to `RenderedTokens`, and `/inference/v1/generate` requests include both `prompt_ids` and `multi_modal_data` when available.

Extends response/trajectory token types to store `multi_modal_data`, preserves it during `parse_response_tokens`, and updates saving/transport utilities to tolerate/encode renderer sidecars and tensors (including stripping intermediate-step multimodal blobs to avoid O(N²) rollout growth, and adding msgpack encode/decode helpers for tensors/dataclasses).