
feat(renderer-client): thread multimodal sidecar through rollout + transport #1346

Open

hallerite wants to merge 1 commit into main from feat/renderer-multimodal-passthrough

Conversation

@hallerite (Member) commented May 11, 2026

Summary

Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder ranges, mm_hashes) end-to-end so multimodal renderers can drive the multi_modal_data features field of vLLM's /inference/v1/generate and the trainer's mm_kwargs without going through the legacy chat-completions / MITO multimodal path.

Companion PRs:

  • renderers: PrimeIntellect-ai/renderers#17 — Qwen3-VL, Qwen3.5/3.6, Kimi-K2.5 image support + RenderedTokens return shape
  • prime-rl: orchestrator/trainer side, opening shortly

What's plumbed through

verifiers/clients/renderer_client.py

  • _step_multi_modal_data(step): recover the prior turn's mm_data from the trajectory step (post-parse or raw-message side)
  • _get_incremental_prompt_ids now returns RenderedTokens | None and forwards previous_multi_modal_data to bridge_to_next_turn so the new turn's placeholder runs cover every earlier-turn image; without this carry-forward, vLLM sees mismatched placeholder counts and falls back to hash-cache lookup or errors. Text-only renderers' raw list[int] is normalized via as_rendered_tokens so callers can unpack uniformly (see the sketch after this list)
  • RendererClient.create_completion unpacks the bridged result into (prompt_ids, multi_modal_data) and forwards both to generate
  • parse_response_tokens copies response.multi_modal_data onto ResponseTokens
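
A minimal sketch of this flow, using the names from the bullets above (bridge_to_next_turn, as_rendered_tokens, RenderedTokens); the RenderedTokens shape, the renderer signature, and the helper bodies here are assumptions, not the actual implementation:

```python
# Sketch only: field names and signatures are assumed, not the real API.
from dataclasses import dataclass
from typing import Any

@dataclass
class RenderedTokens:  # assumed shape: token ids plus optional sidecar
    prompt_ids: list[int]
    multi_modal_data: Any | None = None

def as_rendered_tokens(bridged: "RenderedTokens | list[int]") -> RenderedTokens:
    # Text-only renderers return a bare list[int]; wrap it so callers can
    # always unpack (prompt_ids, multi_modal_data) uniformly.
    if isinstance(bridged, list):
        return RenderedTokens(prompt_ids=bridged)
    return bridged

def incremental_prompt(renderer, messages, previous_mm_data) -> RenderedTokens | None:
    # Forward the prior turn's sidecar so the new turn's placeholder runs
    # cover every earlier-turn image, not just the latest one.
    bridged = renderer.bridge_to_next_turn(
        messages, previous_multi_modal_data=previous_mm_data
    )
    return None if bridged is None else as_rendered_tokens(bridged)
```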

verifiers/types.py

  • ResponseTokens.multi_modal_data: Any | None
  • TrajectoryStepTokens.multi_modal_data: NotRequired[Any]

Both typed as Any to avoid a hard import dependency on renderers (sketched below).
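
A sketch of how the additions might sit in verifiers/types.py; the surrounding fields are elided and the class kinds are assumptions:

```python
from typing import Any, NotRequired, TypedDict

class TrajectoryStepTokens(TypedDict):
    # ...existing token fields elided...
    multi_modal_data: NotRequired[Any]  # absent entirely for text-only steps

# ResponseTokens (class kind assumed) gains an optional field defaulting to None:
#     multi_modal_data: Any | None = None
```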

verifiers/utils/save_utils.py

  • is_json_serializable accepts torch tensors / numpy arrays / renderer sidecar dataclasses: these aren't JSON-native but survive the prime-rl msgpack encoder, and trajectories carrying them are excluded from the JSONL save at the orchestrator boundary (the orchestrator passes exclude_keys={"trajectory"} to save_rollouts)
  • _strip_intermediate_mm_data(trajectory) drops tokens.multi_modal_data from all but the last step before transport (sketched below). bridge_to_next_turn merges prior turns' mm_data into each new turn, so naively shipping it on every step duplicates every image O(N²) bytes for an N-turn rollout; only the last step's sidecar is read by the trainer
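
A minimal sketch of the stripping pass, assuming each trajectory step is a dict whose tokens entry is a TrajectoryStepTokens-style mapping (the real step structure may differ):

```python
def _strip_intermediate_mm_data(trajectory: list[dict]) -> list[dict]:
    # The last step's sidecar already covers every image (the bridge merges
    # prior turns in), so dropping it from earlier steps avoids O(N^2)
    # duplicated image bytes over an N-turn rollout.
    out = []
    last = len(trajectory) - 1
    for i, step in enumerate(trajectory):
        tokens = step.get("tokens")
        if i < last and isinstance(tokens, dict) and "multi_modal_data" in tokens:
            tokens = {k: v for k, v in tokens.items() if k != "multi_modal_data"}
            step = {**step, "tokens": tokens}
        out.append(step)
    return out
```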

verifiers/utils/serve_utils.py

  • Custom msgpack encoder gains torch tensor / numpy ndarray / dataclass support. Tensors encode as {__torch_tensor__: True, dtype, shape, data} with raw bytes payload (torch imported lazily, so text-only consumers don't pay for it)
  • decode_tensor_payload / walk_decode_tensors rehydrate tensor payloads on the receiving side (both sketched below)
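
A sketch of the tensor leg of this scheme, following the payload layout named above ({__torch_tensor__: True, dtype, shape, data}); the hook names and exact wire format are assumptions, and dataclass handling is omitted for brevity:

```python
import msgpack
import numpy as np

def _encode_hook(obj):
    # Hypothetical msgpack `default=` hook. Torch is imported lazily so
    # text-only consumers never pay for it.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and isinstance(obj, torch.Tensor):
        t = obj.detach().cpu().contiguous()
        return {
            "__torch_tensor__": True,
            "dtype": str(t.dtype).removeprefix("torch."),  # e.g. "float32"
            "shape": list(t.shape),
            "data": t.numpy().tobytes(),  # numpy-representable dtypes only
        }
    if isinstance(obj, np.ndarray):
        return {"__ndarray__": True, "dtype": str(obj.dtype),
                "shape": list(obj.shape), "data": obj.tobytes()}
    raise TypeError(f"cannot msgpack-encode {type(obj)!r}")

def decode_tensor_payload(payload: dict):
    # Rehydrate one encoded payload into a torch tensor.
    import torch
    arr = np.frombuffer(payload["data"], dtype=payload["dtype"])
    return torch.from_numpy(arr.copy().reshape(payload["shape"]))

def walk_decode_tensors(obj):
    # Recursively rehydrate tensor payloads nested in lists/dicts.
    if isinstance(obj, dict):
        if obj.get("__torch_tensor__"):
            return decode_tensor_payload(obj)
        return {k: walk_decode_tensors(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [walk_decode_tensors(v) for v in obj]
    return obj
```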

Test plan

  • Text-only RL still goes through the same code path (no mm_data → fast path unchanged)
  • Multimodal RL: mm_data reaches vLLM as multi_modal_data in the generate request and reaches the trainer via TrajectoryStepTokens["multi_modal_data"]
  • Bridge: previous-turn images carried forward, placeholder count matches the combined token sequence
  • Transport: round-trip a MultiModalData through msgpack + decode and check tensor shapes / dtypes are preserved (see the test sketch after this list)
  • JSONL save: trajectories with multi_modal_data are accepted (no spurious "not JSON-serializable" errors) with the orchestrator passing exclude_keys={"trajectory"}
  • No O(N²) duplication in saved rollouts for N-turn multimodal episodes
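
A sketch of the transport round-trip check, reusing the hypothetical _encode_hook / walk_decode_tensors helpers from the serve_utils sketch above (the real helpers live in verifiers/utils/serve_utils.py and may differ):

```python
import msgpack
import torch

def test_mm_data_roundtrip():
    pixel_values = torch.randn(3, 224, 224)
    wire = msgpack.packb({"pixel_values": pixel_values},
                         default=_encode_hook, use_bin_type=True)
    decoded = walk_decode_tensors(msgpack.unpackb(wire, raw=False))
    out = decoded["pixel_values"]
    assert out.shape == pixel_values.shape
    assert out.dtype == pixel_values.dtype
    assert torch.equal(out, pixel_values)  # byte-exact round trip
```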

🤖 Generated with Claude Code


Note

Medium Risk
Touches the renderer client request/bridging path and rollout serialization/transport; incorrect threading or encoding of multi_modal_data could break multimodal rollouts or increase payload sizes, though text-only paths should remain unaffected.

Overview
Threads renderer-emitted multimodal sidecar data (multi_modal_data) end-to-end: bridging now carries prior-turn multimodal state and normalizes bridge results to RenderedTokens, and /inference/v1/generate requests include both prompt_ids and multi_modal_data when available.

Extends response/trajectory token types to store multi_modal_data, preserves it during parse_response_tokens, and updates saving/transport utilities to tolerate/encode renderer sidecars and tensors (including stripping intermediate-step multimodal blobs to avoid O(N²) rollout growth and adding msgpack encode/decode helpers for tensors/dataclasses).

Reviewed by Cursor Bugbot for commit c2e1b84.

feat(renderer-client): thread multimodal sidecar through rollout + transport

Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder
ranges, mm_hashes) end-to-end so multimodal renderers can drive vLLM's
/inference/v1/generate `multi_modal_data` features field and the
downstream trainer's `mm_kwargs` without going through the legacy
chat-completions / MITO multimodal path.

renderer_client.py
- `_step_multi_modal_data(step)`: recover the prior turn's mm_data from
  the trajectory step (parsed-tokens or raw-message side).
- `_get_incremental_prompt_ids` now returns `RenderedTokens | None` and
  forwards `previous_multi_modal_data` to `bridge_to_next_turn` so the
  new turn's placeholder runs cover every earlier-turn image. Without
  this carry-forward, vLLM sees mismatched placeholder counts and falls
  back to hash-cache lookup or errors. Text-only renderers' raw
  `list[int]` returns are normalized via `as_rendered_tokens`.
- `RendererClient.create_completion` unpacks the bridged result into
  `(prompt_ids, multi_modal_data)` and forwards both to `generate`.
- `parse_response_tokens`: copies `response.multi_modal_data` onto the
  emitted `ResponseTokens` so downstream consumers can read it.

types.py
- `ResponseTokens.multi_modal_data: Any | None`
- `TrajectoryStepTokens.multi_modal_data: NotRequired[Any]`
Both typed as `Any` to avoid a hard import dependency on `renderers`.

utils/response_utils.py
- `parse_response_tokens` propagates `multi_modal_data` onto the
  `TrajectoryStepTokens` output when present.

utils/save_utils.py
- `is_json_serializable` accepts torch tensors / numpy arrays / renderer
  sidecar dataclasses — these aren't JSON-native but survive the
  prime-rl msgpack encoder, and trajectories carrying them are excluded
  from the JSONL save at the orchestrator boundary (orchestrator passes
  `exclude_keys={"trajectory"}` to `save_rollouts`).
- `_strip_intermediate_mm_data(trajectory)`: drop `tokens.multi_modal_data`
  from all but the last step before transport. `bridge_to_next_turn`
  merges prior turns' mm_data into the new turn, so naively shipping
  mm_data on every step duplicates every image O(N²) bytes for an N-turn
  rollout; only the last step's sidecar is read by the trainer.

utils/serve_utils.py
- Custom msgpack encoder gains torch tensor / numpy ndarray /
  dataclass support. Tensors are encoded as
  `{__torch_tensor__: True, dtype, shape, data}` with raw bytes payload.
  Torch is imported lazily so text-only consumers don't pay for it.
- `decode_tensor_payload` / `walk_decode_tensors` rehydrate tensor
  payloads on the receiving side.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



# the combined token sequence. Without this, vLLM sees
# placeholder counts that don't match the prompt and
# silently falls back to hash-cache lookup (or errors).
previous_multi_modal_data=previous_mm_data,


New kwarg breaks existing renderer bridge implementations

High Severity

previous_multi_modal_data=previous_mm_data is passed unconditionally to bridge_to_next_turn, even when it's None (text-only rollouts). The existing test mock _BridgeRenderer.bridge_to_next_turn in tests/test_renderer_client.py only accepts *, tools=None as keyword-only args, so every bridging test will crash with TypeError: got an unexpected keyword argument 'previous_multi_modal_data'. Any renderer implementation not yet updated for the new parameter will also fail at runtime.
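
One backward-compatible mitigation (illustrative only, not necessarily what the PR will adopt) is to forward the kwarg conditionally, so bridge implementations and test doubles that predate the parameter keep working on text-only paths:

```python
# Variable names assumed from the PR's diff hunk above.
kwargs = {}
if previous_mm_data is not None:
    kwargs["previous_multi_modal_data"] = previous_mm_data
bridged = renderer.bridge_to_next_turn(messages, tools=tools, **kwargs)
```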



"PlaceholderRange",
"RenderedTokens",
}:
return True

Validation permits non-JSON types causing silent corruption

Medium Severity

is_json_serializable now returns True for torch tensors and renderer dataclasses, but make_serializable (the json.dump fallback) doesn't handle these types — it falls through to str(value), silently producing garbage like "tensor([1.0, 2.0])". The validation gate lets non-serializable values into the output dict, relying entirely on an external exclude_keys mechanism that isn't enforced here.
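
One illustrative way to close the gap (an assumption about a possible fix, not the PR's code) is to make the last-resort branch of make_serializable refuse tensor-like values instead of stringifying them:

```python
def _fallback(value):
    # Hypothetical guard: anything tensor-like that slipped past exclude_keys
    # fails loudly instead of being silently str()-ified into the JSONL.
    if type(value).__module__.startswith(("torch", "numpy")):
        raise TypeError(
            f"{type(value).__name__} should have been stripped via "
            "exclude_keys before the JSONL save"
        )
    return str(value)  # existing last-resort fallback for benign types
```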

