feat: renderer-only multimodal path — rip MITO branch, pack pixel_values from renderer#2473
hallerite wants to merge 5 commits into
Conversation
interleave_rollout now consumes renderer-emitted multi_modal_data on each trajectory step (when present), packs the per-image pixel_values and image_grid_thw onto the TrainingSample, and computes mm_token_type_ids — no VLMImageCache lookup is required when the rollout went through a multimodal renderer. VLMImageCache stays as the fallback for MITO/chat-completions rollouts. build_trajectory_step (renderers package) was updated separately to surface mm_data on its output, so the pretokenize-fallback path also carries images through correctly.

Other changes:
- orchestrator config: removed the validate_renderer_vs_vlm validator that previously rejected use_renderer=True for VLMs (it's now supported).
- e2e test: a real Qwen3VLRenderer + RendererClient -> mocked /inference/v1/generate POST, plus a roundtrip through vLLM's GenerateRequest pydantic model and decode_mm_kwargs_item. Strongest end-to-end check we can run without a GPU.
- 20-step A/B configs for color-codeword (feat-renderer vs main-mito), both logging to W&B project 'multimodal-renderer'.

126 orchestrator unit tests pass (+2 new for renderer-mm packing).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Delete the MITO chat-completions multimodal branch from the orchestrator
and the ~370 lines of image-cache/preprocess machinery in trajectories.py
that supported it. VLM training now goes through the renderer path
exclusively — the renderer owns the processor, ships byte-identical
pixel_values to both vLLM (via /inference/v1/generate features) and the
trainer (via mm_kwargs).
Renderer-shipped mm_token_type_ids: the orchestrator reads the renderer's
`mm_token_type_id_map` property (1=image_pad, 2=video_pad) and stamps a
per-token list onto each TrainingSample. Trainer's `_get_qwen3_vl_mm_token_type_ids`
auto-path remains as a fallback but the renderer is now the source of truth.
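The stamping described above can be sketched as a straight lookup over the token IDs. This is an illustrative reconstruction, not the PR's code: the token ID constants and the shape of `mm_token_type_id_map` are assumptions (real Qwen3-VL pad-token IDs differ).

```python
# Hypothetical sketch: stamp per-token multimodal type IDs onto a sample
# from the renderer's mm_token_type_id_map (1 = image_pad, 2 = video_pad).
# Token ID values below are illustrative placeholders, not the real vocab.
IMAGE_PAD_ID = 151655  # assumed image_pad token id
VIDEO_PAD_ID = 151656  # assumed video_pad token id

mm_token_type_id_map = {IMAGE_PAD_ID: 1, VIDEO_PAD_ID: 2}

def compute_mm_token_type_ids(input_ids: list[int]) -> list[int]:
    """0 for ordinary text tokens, 1 for image_pad, 2 for video_pad."""
    return [mm_token_type_id_map.get(tok, 0) for tok in input_ids]

ids = [101, IMAGE_PAD_ID, IMAGE_PAD_ID, 2044, VIDEO_PAD_ID]
print(compute_mm_token_type_ids(ids))  # [0, 1, 1, 0, 2]
```

Because the map comes from the renderer, the trainer never needs to know which vocabulary entries are pad tokens for a given VLM family.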
forward() now takes a generic `mm_kwargs: dict` (e.g. {pixel_values,
image_grid_thw}) instead of the Qwen3-VL-specific (pixel_values, image_grid_thw)
keyword pair, so adding new VLM families (Gemma3, LLaVA, etc.) doesn't
require touching forward.
Config validator: orchestrator.use_renderer must be true when model.vlm is
set — fail at config-load instead of producing cryptic runtime errors.
Test cleanup: drop 25 tests for removed helpers (VLMImageCache, _extract_images,
etc.); update the one remaining renderer-mm trajectory test to pass
`mm_token_type_ids_mapping` directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit b385fbf. Configure here.
```python
if mm_token_type_ids_auto is not None:
    kwargs["mm_token_type_ids"] = mm_token_type_ids_auto
elif mm_token_type_ids is not None:
    kwargs["mm_token_type_ids"] = mm_token_type_ids
```
Renderer-shipped mm_token_type_ids silently ignored for Qwen3-VL
High Severity
The precedence of mm_token_type_ids in forward() is inverted relative to the stated design. The auto-computed _get_qwen3_vl_mm_token_type_ids is preferred over the renderer-shipped mm_token_type_ids parameter. For Qwen3-VL, the auto function always returns non-None (a tensor of zeros filled with image/video markers), so the explicitly shipped mm_token_type_ids from the orchestrator is never used. The PR description says "the renderer is now the source of truth" and the comment in interleave_rollout notes the orchestrator-shipped explicit list "produces ~30x lower mismatch KL," yet forward() discards it.
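The fix Bugbot is describing is a simple precedence swap. A hedged sketch (the helper name is mine, not the repo's) of what "renderer is the source of truth" implies:

```python
# Sketch of the intended precedence: prefer the renderer-shipped
# mm_token_type_ids and use the auto-computed Qwen3-VL value only as
# a fallback. pick_mm_token_type_ids is a hypothetical helper name.
def pick_mm_token_type_ids(explicit, auto):
    if explicit is not None:
        return explicit  # renderer-shipped, source of truth
    return auto          # trainer auto-path fallback (may itself be None)

assert pick_mm_token_type_ids([0, 1, 1], [0, 0, 0]) == [0, 1, 1]
assert pick_mm_token_type_ids(None, [0, 0, 0]) == [0, 0, 0]
```

With the original `if mm_token_type_ids_auto is not None` branch first, the explicit value can never win for Qwen3-VL, since the auto path always returns a tensor.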
Additional Locations (1)
```python
            "The MITO path for VLMs has been removed; VLMs must go through "
            "a renderer (e.g. Qwen3VLRenderer) that owns the processor."
        )
    return self
```
Breaking VLM config change missing from CHANGELOG
Medium Severity
The new vlm_requires_renderer validator makes orchestrator.use_renderer = true mandatory when model.vlm is set. Previously, VLM configs used use_renderer = false (the default). Any existing VLM config will now fail at load time with a ValueError. This is a breaking configuration change (effectively a removed valid-config combination) that requires a CHANGELOG.md entry per project rules.
Triggered by project rule: BugBot Instructions
CI was failing for three reasons after the cherry-pick of the renderer
multimodal commits onto current origin/main. Three fixes:
1. Multimodal configs missing the renderer flag pair. The orchestrator
config validator (added in the same PR) requires
``use_renderer = true`` when ``model.vlm`` is set, AND it's mutually
exclusive with ``use_token_client`` (default ``true``). Three configs
(``rl_color_codeword_test.toml``, ``rl_color_codeword.toml``,
``ci/nightly/multimodal_color_codeword.toml``) needed both flags set
explicitly in ``[orchestrator]``. Delete
``rl_color_codeword_main_mito.toml`` — that was the A/B reference for
the legacy MITO path, which this PR rips out. With MITO gone the
config is no longer runnable; the ``rl_color_codeword_feat_renderer``
counterpart already covers the new renderer-driven path.
2. ``test_model_forward.py`` was still calling ``forward(...,
pixel_values=..., image_grid_thw=...)``. ``forward()`` now takes a
generic ``mm_kwargs: dict`` (so adding new VLM families doesn't
require touching the trainer signature) — update both tests to pass
``mm_kwargs={"pixel_values": ..., "image_grid_thw": ...}`` instead.
3. ``renderers`` / ``verifiers`` deps stale. The orchestrator imports
``MultiModalData`` from ``renderers.base`` (introduced in the
companion renderers PR) and threads ``multi_modal_data`` end-to-end
via the verifiers ``RendererClient`` changes. Pin both to their
feature branches until the upstream PRs merge and PyPI / git
rev-pins are bumped:
- renderers @ feat/multimodal-vlm
- verifiers @ feat/renderer-multimodal-passthrough
Drops the ``renderers==0.1.6`` PyPI pin (the new symbols are post-0.1.6).
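The branch pins above might look like the following `pyproject.toml` fragment, assuming uv-style git sources; the repository URLs are guesses from the org name mentioned elsewhere in this PR, and the exact dependency-source mechanism the repo uses is not shown here.

```toml
# Illustrative only — URLs and the [tool.uv.sources] mechanism are assumptions.
[tool.uv.sources]
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers", branch = "feat/multimodal-vlm" }
verifiers = { git = "https://github.com/PrimeIntellect-ai/verifiers", branch = "feat/renderer-multimodal-passthrough" }
```

Once the upstream PRs merge, these branch pins get replaced by a released `renderers` version and a pinned `verifiers` rev, as the Deps section of the PR description notes.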
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Picks up:
- renderers c3feaa5: RendererPool implements the Renderer protocol structurally (callers can drop the pool unwrap + isinstance branching), size=1 fast path, is_multimodal helper with per-type cache.
- verifiers 64f2555a: bugbot pass — multimodal dispatch fix (was broken for pooled renderers), tighter is_json_serializable, response-tokens mm_data strip on intermediate steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>


Summary
VLM RL now goes through the renderer path exclusively. The renderer owns the processor and ships byte-identical `pixel_values` to vLLM (via `/inference/v1/generate` `multi_modal_data`) and to the trainer (via `mm_kwargs`). Net: -1112 LOC, no more legacy chat-completions / MITO image cache.

Companion PRs:
- `RendererClient` threads `multi_modal_data` through the rollout + transport layer

Commits
1. `a479a028` feat: drop use_renderer=True VLM skip; pack pixel_values from renderer
   - `interleave_rollout` consumes the renderer-emitted `multi_modal_data` on each trajectory step (when present), packs per-image `pixel_values` + `image_grid_thw` onto the `TrainingSample`, and computes `mm_token_type_ids` — no `VLMImageCache` lookup required when the rollout came from a multimodal renderer
   - `VLMImageCache` remains as the fallback for MITO / chat-completions rollouts
   - removed the `validate_renderer_vs_vlm` validator that previously rejected `use_renderer=True` for VLMs (it's now supported)
   - e2e test: a real `Qwen3VLRenderer` + `RendererClient` → mocked `/inference/v1/generate` POST, plus a roundtrip through vLLM's `GenerateRequest` pydantic model + `decode_mm_kwargs_item`. Strongest end-to-end check we can run without a GPU
   - A/B configs log to W&B project `multimodal-renderer`
2. `b385fbf2` refactor(orchestrator): rip MITO multimodal path, renderer-only for VLMs
   - deletes the image-cache/preprocess machinery in `trajectories.py` that supported it
   - renderer-shipped `mm_token_type_ids`: the orchestrator reads `renderer.mm_token_type_id_map` (1 = image_pad, 2 = video_pad) and stamps a per-token list onto each `TrainingSample`. Trainer's `_get_qwen3_vl_mm_token_type_ids` auto-path remains as a fallback but the renderer is now the source of truth
   - `forward()` now takes a generic `mm_kwargs: dict` (e.g. `{pixel_values, image_grid_thw}`) instead of the Qwen3-VL-specific `(pixel_values, image_grid_thw)` keyword pair, so adding new VLM families (Gemma3, LLaVA, etc.) doesn't require touching `forward`
   - config validator: `orchestrator.use_renderer` must be true when `model.vlm` is set — fail at config-load instead of producing cryptic runtime errors
   - test cleanup: drop tests for removed helpers (`VLMImageCache`, `_extract_images`, etc.); update the one remaining renderer-mm trajectory test to pass `mm_token_type_ids_mapping` directly

Net diff

Deps
The companion verifiers + renderers PRs need to land first. Once merged:
- bump the `verifiers` rev in `pyproject.toml` to the merged commit
- bump `renderers` to the next release that contains PrimeIntellect-ai/renderers#17

Test plan
- `Qwen3VLRenderer` → `/inference/v1/generate` payload roundtrip
- config validator (`model.vlm`)
- `mm_kwargs` dict for both image and video families

🤖 Generated with Claude Code
Note
High Risk
High risk because it rewires the multimodal RL data path end-to-end (orchestrator tokenization/packing, transport serialization, and trainer forward inputs) and removes the previous MITO/VLM image-cache fallback, so any mismatch can break VLM training or silently corrupt model inputs.
Overview
- VLM RL now requires the renderer path: configs set `use_renderer=true` / `use_token_client=false`, and `OrchestratorConfig` enforces renderer usage when `model.vlm` is present while removing the old "renderer unsupported for VLMs" restriction.
- Removes the legacy MITO/VLM preprocessing stack (orchestrator-side `AutoProcessor`, image-cache building, and related helpers/tests) and instead consumes renderer-emitted `multi_modal_data` on trajectory steps, packing it into model-agnostic `mm_kwargs` plus explicit `mm_token_type_ids` derived from renderer token IDs.
- Trainer/transport become model-agnostic for multimodal inputs: replaces the Qwen3-VL-specific `pixel_values` / `image_grid_thw` fields with a serialized `mm_kwargs` dict (`EncodedTensor` values), updates batch loading to decode and move these tensors to CUDA, and updates `forward()` to `**`-unpack `mm_kwargs` (with a small Qwen-VL MRoPE special case).
- Adds a CPU-only integration test that round-trips a real `Qwen3VLRenderer` + `RendererClient` features payload through vLLM request parsing/decoding, and introduces a dedicated renderer A/B config for `color-codeword`. Dependency sources are updated to renderer/verifiers feature branches and `renderers` is unpinned.

Reviewed by Cursor Bugbot for commit bf26a06. Bugbot is set up for automated code reviews on this repo.