feat: renderer-only multimodal path — rip MITO branch, pack pixel_values from renderer#2473
hallerite wants to merge 5 commits into
Conversation
interleave_rollout now consumes renderer-emitted multi_modal_data on each trajectory step (when present), packs the per-image pixel_values and image_grid_thw onto the TrainingSample, and computes mm_token_type_ids — no VLMImageCache lookup is required when the rollout went through a multimodal renderer. VLMImageCache stays as the fallback for MITO/chat-completions rollouts. build_trajectory_step (renderers package) was updated separately to surface mm_data on its output, so the pretokenize-fallback path also carries images through correctly.

Other changes:
- orchestrator config: removed the validate_renderer_vs_vlm validator that previously rejected use_renderer=True for VLMs (it's now supported).
- e2e test: a real Qwen3VLRenderer + RendererClient -> mocked /inference/v1/generate POST, plus a roundtrip through vLLM's GenerateRequest pydantic model and decode_mm_kwargs_item. Strongest end-to-end check we can run without a GPU.
- 20-step A/B configs for color-codeword (feat-renderer vs main-mito), both logging to W&B project 'multimodal-renderer'.

126 orchestrator unit tests pass (+2 new for renderer-mm packing).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Delete the MITO chat-completions multimodal branch from the orchestrator
and the ~370 lines of image-cache/preprocess machinery in trajectories.py
that supported it. VLM training now goes through the renderer path
exclusively — the renderer owns the processor, ships byte-identical
pixel_values to both vLLM (via /inference/v1/generate features) and the
trainer (via mm_kwargs).
Renderer-shipped mm_token_type_ids: the orchestrator reads the renderer's
`mm_token_type_id_map` property (1=image_pad, 2=video_pad) and stamps a
per-token list onto each TrainingSample. Trainer's `_get_qwen3_vl_mm_token_type_ids`
auto-path remains as a fallback but the renderer is now the source of truth.
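The stamping described above can be sketched as a straight lookup over the token IDs. This is an illustrative reconstruction, not the PR's code: the token ID constants and the shape of `mm_token_type_id_map` are assumptions (real Qwen3-VL pad-token IDs differ).

```python
# Hypothetical sketch: stamp per-token multimodal type IDs onto a sample
# from the renderer's mm_token_type_id_map (1 = image_pad, 2 = video_pad).
# Token ID values below are illustrative placeholders, not the real vocab.
IMAGE_PAD_ID = 151655  # assumed image_pad token id
VIDEO_PAD_ID = 151656  # assumed video_pad token id

mm_token_type_id_map = {IMAGE_PAD_ID: 1, VIDEO_PAD_ID: 2}

def compute_mm_token_type_ids(input_ids: list[int]) -> list[int]:
    """0 for ordinary text tokens, 1 for image_pad, 2 for video_pad."""
    return [mm_token_type_id_map.get(tok, 0) for tok in input_ids]

ids = [101, IMAGE_PAD_ID, IMAGE_PAD_ID, 2044, VIDEO_PAD_ID]
print(compute_mm_token_type_ids(ids))  # [0, 1, 1, 0, 2]
```

Because the map comes from the renderer, the trainer never needs to know which vocabulary entries are pad tokens for a given VLM family.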
forward() now takes a generic `mm_kwargs: dict` (e.g. {pixel_values,
image_grid_thw}) instead of the Qwen3-VL-specific (pixel_values, image_grid_thw)
keyword pair, so adding new VLM families (Gemma3, LLaVA, etc.) doesn't
require touching forward.
Config validator: orchestrator.use_renderer must be true when model.vlm is
set — fail at config-load instead of producing cryptic runtime errors.
Test cleanup: drop 25 tests for removed helpers (VLMImageCache, _extract_images,
etc.); update the one remaining renderer-mm trajectory test to pass
`mm_token_type_ids_mapping` directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit b385fbf. Configure here.
```python
if mm_token_type_ids_auto is not None:
    kwargs["mm_token_type_ids"] = mm_token_type_ids_auto
elif mm_token_type_ids is not None:
    kwargs["mm_token_type_ids"] = mm_token_type_ids
```
Renderer-shipped mm_token_type_ids silently ignored for Qwen3-VL
High Severity
The precedence of mm_token_type_ids in forward() is inverted relative to the stated design. The auto-computed _get_qwen3_vl_mm_token_type_ids is preferred over the renderer-shipped mm_token_type_ids parameter. For Qwen3-VL, the auto function always returns non-None (a tensor of zeros filled with image/video markers), so the explicitly shipped mm_token_type_ids from the orchestrator is never used. The PR description says "the renderer is now the source of truth" and the comment in interleave_rollout notes the orchestrator-shipped explicit list "produces ~30x lower mismatch KL," yet forward() discards it.
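The fix Bugbot is describing is a simple precedence swap. A hedged sketch (the helper name is mine, not the repo's) of what "renderer is the source of truth" implies:

```python
# Sketch of the intended precedence: prefer the renderer-shipped
# mm_token_type_ids and use the auto-computed Qwen3-VL value only as
# a fallback. pick_mm_token_type_ids is a hypothetical helper name.
def pick_mm_token_type_ids(explicit, auto):
    if explicit is not None:
        return explicit  # renderer-shipped, source of truth
    return auto          # trainer auto-path fallback (may itself be None)

assert pick_mm_token_type_ids([0, 1, 1], [0, 0, 0]) == [0, 1, 1]
assert pick_mm_token_type_ids(None, [0, 0, 0]) == [0, 0, 0]
```

With the original `if mm_token_type_ids_auto is not None` branch first, the explicit value can never win for Qwen3-VL, since the auto path always returns a tensor.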
Additional Locations (1)
```python
            "The MITO path for VLMs has been removed; VLMs must go through "
            "a renderer (e.g. Qwen3VLRenderer) that owns the processor."
        )
    return self
```
Breaking VLM config change missing from CHANGELOG
Medium Severity
The new vlm_requires_renderer validator makes orchestrator.use_renderer = true mandatory when model.vlm is set. Previously, VLM configs used use_renderer = false (the default). Any existing VLM config will now fail at load time with a ValueError. This is a breaking configuration change (effectively a removed valid-config combination) that requires a CHANGELOG.md entry per project rules.
Triggered by project rule: BugBot Instructions
CI was failing for three reasons after the cherry-pick of the renderer
multimodal commits onto current origin/main. Three fixes:
1. Multimodal configs missing the renderer flag pair. The orchestrator
config validator (added in the same PR) requires
``use_renderer = true`` when ``model.vlm`` is set, AND it's mutually
exclusive with ``use_token_client`` (default ``true``). Three configs
(``rl_color_codeword_test.toml``, ``rl_color_codeword.toml``,
``ci/nightly/multimodal_color_codeword.toml``) needed both flags set
explicitly in ``[orchestrator]``. Delete
``rl_color_codeword_main_mito.toml`` — that was the A/B reference for
the legacy MITO path, which this PR rips out. With MITO gone the
config is no longer runnable; the ``rl_color_codeword_feat_renderer``
counterpart already covers the new renderer-driven path.
2. ``test_model_forward.py`` was still calling ``forward(...,
pixel_values=..., image_grid_thw=...)``. ``forward()`` now takes a
generic ``mm_kwargs: dict`` (so adding new VLM families doesn't
require touching the trainer signature) — update both tests to pass
``mm_kwargs={"pixel_values": ..., "image_grid_thw": ...}`` instead.
3. ``renderers`` / ``verifiers`` deps stale. The orchestrator imports
``MultiModalData`` from ``renderers.base`` (introduced in the
companion renderers PR) and threads ``multi_modal_data`` end-to-end
via the verifiers ``RendererClient`` changes. Pin both to their
feature branches until the upstream PRs merge and PyPI / git
rev-pins are bumped:
- renderers @ feat/multimodal-vlm
- verifiers @ feat/renderer-multimodal-passthrough
Drops the ``renderers==0.1.6`` PyPI pin (the new symbols are post-0.1.6).
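The branch pins above might look like the following `pyproject.toml` fragment, assuming uv-style git sources; the repository URLs are guesses from the org name mentioned elsewhere in this PR, and the exact dependency-source mechanism the repo uses is not shown here.

```toml
# Illustrative only — URLs and the [tool.uv.sources] mechanism are assumptions.
[tool.uv.sources]
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers", branch = "feat/multimodal-vlm" }
verifiers = { git = "https://github.com/PrimeIntellect-ai/verifiers", branch = "feat/renderer-multimodal-passthrough" }
```

Once the upstream PRs merge, these branch pins get replaced by a released `renderers` version and a pinned `verifiers` rev, as the Deps section of the PR description notes.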
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Picks up:
- renderers c3feaa5: RendererPool implements the Renderer protocol structurally (callers can drop the pool unwrap + isinstance branching), size=1 fast path, is_multimodal helper with per-type cache.
- verifiers 64f2555a: bugbot pass — multimodal dispatch fix (was broken for pooled renderers), tighter is_json_serializable, response-tokens mm_data strip on intermediate steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>


Summary
VLM RL now goes through the renderer path exclusively. The renderer owns the processor and ships byte-identical `pixel_values` to vLLM (via `/inference/v1/generate` `multi_modal_data`) and to the trainer (via `mm_kwargs`). Net: -1112 LOC, no more legacy chat-completions / MITO image cache.

Companion PRs:
- `RendererClient` threads `multi_modal_data` through the rollout + transport layer

Commits
1. `a479a028` feat: drop use_renderer=True VLM skip; pack pixel_values from renderer
   - `interleave_rollout` consumes the renderer-emitted `multi_modal_data` on each trajectory step (when present), packs per-image `pixel_values` + `image_grid_thw` onto the `TrainingSample`, and computes `mm_token_type_ids` — no `VLMImageCache` lookup required when the rollout came from a multimodal renderer
   - `VLMImageCache` remains as the fallback for MITO / chat-completions rollouts
   - removed the `validate_renderer_vs_vlm` validator that previously rejected `use_renderer=True` for VLMs (it's now supported)
   - e2e test: a real `Qwen3VLRenderer` + `RendererClient` → mocked `/inference/v1/generate` POST, plus a roundtrip through vLLM's `GenerateRequest` pydantic model + `decode_mm_kwargs_item`. Strongest end-to-end check we can run without a GPU
   - A/B configs log to W&B project `multimodal-renderer`
2. `b385fbf2` refactor(orchestrator): rip MITO multimodal path, renderer-only for VLMs
   - deletes the image-cache/preprocess machinery in `trajectories.py` that supported it
   - renderer-shipped `mm_token_type_ids`: the orchestrator reads `renderer.mm_token_type_id_map` (1 = image_pad, 2 = video_pad) and stamps a per-token list onto each `TrainingSample`. Trainer's `_get_qwen3_vl_mm_token_type_ids` auto-path remains as a fallback but the renderer is now the source of truth
   - `forward()` now takes a generic `mm_kwargs: dict` (e.g. `{pixel_values, image_grid_thw}`) instead of the Qwen3-VL-specific `(pixel_values, image_grid_thw)` keyword pair, so adding new VLM families (Gemma3, LLaVA, etc.) doesn't require touching `forward`
   - config validator: `orchestrator.use_renderer` must be true when `model.vlm` is set — fail at config-load instead of producing cryptic runtime errors
   - test cleanup: drop tests for removed helpers (`VLMImageCache`, `_extract_images`, etc.); update the one remaining renderer-mm trajectory test to pass `mm_token_type_ids_mapping` directly

Net diff

Deps
The companion verifiers + renderers PRs need to land first. Once merged:
- bump the `verifiers` rev in `pyproject.toml` to the merged commit
- bump `renderers` to the next release that contains PrimeIntellect-ai/renderers#17

Test plan
- `Qwen3VLRenderer` → `/inference/v1/generate` payload roundtrip
- config validator (`model.vlm`)
- `mm_kwargs` dict for both image and video families

🤖 Generated with Claude Code
Note
High Risk
High risk because it rewires the multimodal RL data path end-to-end (orchestrator tokenization/packing, transport serialization, and trainer forward inputs) and removes the previous MITO/VLM image-cache fallback, so any mismatch can break VLM training or silently corrupt model inputs.
Overview
- VLM RL now requires the renderer path: configs set `use_renderer=true` / `use_token_client=false`, and `OrchestratorConfig` enforces renderer usage when `model.vlm` is present while removing the old "renderer unsupported for VLMs" restriction.
- Removes the legacy MITO/VLM preprocessing stack (orchestrator-side `AutoProcessor`, image-cache building, and related helpers/tests) and instead consumes renderer-emitted `multi_modal_data` on trajectory steps, packing it into model-agnostic `mm_kwargs` plus explicit `mm_token_type_ids` derived from renderer token IDs.
- Trainer/transport become model-agnostic for multimodal inputs: replaces the Qwen3-VL-specific `pixel_values` / `image_grid_thw` fields with a serialized `mm_kwargs` dict (`EncodedTensor` values), updates batch loading to decode and move these tensors to CUDA, and updates `forward()` to `**`-unpack `mm_kwargs` (with a small Qwen-VL MRoPE special case).
- Adds a CPU-only integration test that round-trips a real `Qwen3VLRenderer` + `RendererClient` features payload through vLLM request parsing/decoding, and introduces a dedicated renderer A/B config for `color-codeword`. Dependency sources are updated to renderer/verifiers feature branches and `renderers` is unpinned.

Reviewed by Cursor Bugbot for commit bf26a06. Bugbot is set up for automated code reviews on this repo.