
feat: renderer-only multimodal path — rip MITO branch, pack pixel_values from renderer #2473

Open

hallerite wants to merge 5 commits into main from feat/multimodal-renderer-pr

Conversation


@hallerite hallerite commented May 11, 2026

Summary

VLM RL now goes through the renderer path exclusively. The renderer owns the processor, ships byte-identical pixel_values to vLLM (via /inference/v1/generate multi_modal_data) and to the trainer (via mm_kwargs). Net: -1112 LOC, no more legacy chat-completions / MITO image cache.

Companion PRs:

Commits

1. a479a028 feat: drop use_renderer=True VLM skip; pack pixel_values from renderer

  • interleave_rollout consumes the renderer-emitted multi_modal_data on each trajectory step (when present), packs per-image pixel_values + image_grid_thw onto the TrainingSample, and computes mm_token_type_ids — no VLMImageCache lookup required when the rollout came from a multimodal renderer
  • VLMImageCache remains as the fallback for MITO / chat-completions rollouts
  • Orchestrator config: drop the validate_renderer_vs_vlm validator that previously rejected use_renderer=True for VLMs (it's now supported)
  • e2e test: real Qwen3VLRenderer + RendererClient → mocked /inference/v1/generate POST, plus a roundtrip through vLLM's GenerateRequest pydantic model + decode_mm_kwargs_item. Strongest end-to-end check we can run without a GPU
  • A/B configs for color-codeword (feat-renderer vs main-mito), logging to W&B project multimodal-renderer

2. b385fbf2 refactor(orchestrator): rip MITO multimodal path, renderer-only for VLMs

  • Delete the MITO chat-completions multimodal branch from the orchestrator and the ~370 lines of image-cache/preprocess machinery in trajectories.py that supported it
  • Renderer-shipped mm_token_type_ids: orchestrator reads renderer.mm_token_type_id_map (1 = image_pad, 2 = video_pad) and stamps a per-token list onto each TrainingSample. Trainer's _get_qwen3_vl_mm_token_type_ids auto-path remains as a fallback but the renderer is now the source of truth
  • forward() now takes a generic mm_kwargs: dict (e.g. {pixel_values, image_grid_thw}) instead of the Qwen3-VL-specific (pixel_values, image_grid_thw) keyword pair, so adding new VLM families (Gemma3, LLaVA, etc.) doesn't require touching forward
  • Config validator: orchestrator.use_renderer must be true when model.vlm is set — fail at config-load instead of producing cryptic runtime errors
  • Test cleanup: drop 25 tests for removed helpers (VLMImageCache, _extract_images, etc.); update the one remaining renderer-mm trajectory test to pass mm_token_type_ids_mapping directly
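The generic `mm_kwargs` signature described in the commit can be sketched roughly as follows. This is a minimal illustration only — the function wiring and the `model` callable are assumptions, not the repo's actual trainer code; the key names match the Qwen3-VL example from the PR description:

```python
from typing import Any, Callable, Optional


def forward(model: Callable[..., Any], input_ids: Any,
            mm_kwargs: Optional[dict] = None) -> Any:
    # **-unpack whatever multimodal tensors the renderer shipped, e.g.
    # {"pixel_values": ..., "image_grid_thw": ...} for Qwen3-VL. New VLM
    # families just put different keys in the dict; forward() never changes.
    return model(input_ids=input_ids, **(mm_kwargs or {}))
```

Text-only rollouts simply pass no `mm_kwargs`, so the same signature serves both paths.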

Net diff

12 files changed, 666 insertions(+), 1778 deletions(-)

Deps

The companion verifiers + renderers PRs need to land first. Once merged:

Test plan

  • Existing orchestrator unit tests pass (126 + 2 new)
  • e2e test exercises a real Qwen3VLRenderer → /inference/v1/generate payload roundtrip
  • A/B run against MITO baseline on color-codeword for ≥20 steps with reward / KL parity
  • No regressions on text-only RL (the renderer/orchestrator paths are dispatched on model.vlm)
  • Trainer accepts the generic mm_kwargs dict for both image and video families

🤖 Generated with Claude Code


Note

High Risk
High risk because it rewires the multimodal RL data path end-to-end (orchestrator tokenization/packing, transport serialization, and trainer forward inputs) and removes the previous MITO/VLM image-cache fallback, so any mismatch can break VLM training or silently corrupt model inputs.

Overview
VLM RL now requires the renderer path: configs set use_renderer=true/use_token_client=false, and OrchestratorConfig enforces renderer usage when model.vlm is present while removing the old “renderer unsupported for VLMs” restriction.

Removes the legacy MITO/VLM preprocessing stack (orchestrator-side AutoProcessor, image cache building, and related helpers/tests) and instead consumes renderer-emitted multi_modal_data on trajectory steps, packing it into model-agnostic mm_kwargs plus explicit mm_token_type_ids derived from renderer token IDs.

Trainer/transport become model-agnostic for multimodal inputs: replaces Qwen3-VL-specific pixel_values/image_grid_thw fields with a serialized mm_kwargs dict (EncodedTensor values), updates batch loading to decode and move these tensors to CUDA, and updates forward() to **-unpack mm_kwargs (with a small Qwen-VL MRoPE special-case).
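The decode-and-move step for serialized `mm_kwargs` can be sketched like this. Illustrative only: `encode_tensor`/`decode_mm_kwargs` are stand-ins for the repo's actual EncodedTensor helpers, and plain `torch.save` bytes stand in for the real wire format:

```python
import io

import torch


def encode_tensor(t: torch.Tensor) -> bytes:
    # Stand-in for the EncodedTensor wire format: serialize one tensor.
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getvalue()


def decode_mm_kwargs(encoded: dict, device: str = "cpu") -> dict:
    # Decode every serialized tensor in mm_kwargs and move it to the
    # training device (CUDA in the real batch-loading path; CPU here).
    return {
        k: torch.load(io.BytesIO(v), weights_only=True).to(device)
        for k, v in encoded.items()
    }
```

Keeping the dict shape intact end-to-end is what lets the trainer `**`-unpack it into `forward()` without model-family-specific fields.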

Adds a CPU-only integration test that round-trips a real Qwen3VLRenderer + RendererClient features payload through vLLM request parsing/decoding, and introduces a dedicated renderer A/B config for color-codeword. Dependency sources are updated to renderer/verifiers feature branches and renderers is unpinned.

Reviewed by Cursor Bugbot for commit bf26a06.

hallerite and others added 2 commits May 11, 2026 18:51

interleave_rollout now consumes renderer-emitted multi_modal_data on each
trajectory step (when present), packs the per-image pixel_values and
image_grid_thw onto the TrainingSample, and computes mm_token_type_ids —
no VLMImageCache lookup required when the rollout went through a
multimodal renderer. VLMImageCache stays as the fallback for
MITO/chat-completions rollouts.

build_trajectory_step (renderers package) was updated separately to
surface mm_data on its output, so the pretokenize-fallback path also
carries images through correctly.

Other changes:
- orchestrator config: removed validate_renderer_vs_vlm validator that
  previously rejected use_renderer=True for VLMs (it's now supported).
- e2e test: a real Qwen3VLRenderer + RendererClient -> /inference/v1/generate
  mock POST, plus a roundtrip through vllm's GenerateRequest pydantic
  model and decode_mm_kwargs_item. Strongest end-to-end check we can
  run without a GPU.
- 20-step A/B configs for color-codeword (feat-renderer vs main-mito),
  both logging to W&B project 'multimodal-renderer'.

126 orchestrator unit tests pass (+2 new for renderer-mm packing).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Delete the MITO chat-completions multimodal branch from the orchestrator
and the ~370 lines of image-cache/preprocess machinery in trajectories.py
that supported it. VLM training now goes through the renderer path
exclusively — the renderer owns the processor, ships byte-identical
pixel_values to both vLLM (via /inference/v1/generate features) and the
trainer (via mm_kwargs).

Renderer-shipped mm_token_type_ids: the orchestrator reads the renderer's
`mm_token_type_id_map` property (1=image_pad, 2=video_pad) and stamps a
per-token list onto each TrainingSample. Trainer's `_get_qwen3_vl_mm_token_type_ids`
auto-path remains as a fallback but the renderer is now the source of truth.

forward() now takes a generic `mm_kwargs: dict` (e.g. {pixel_values,
image_grid_thw}) instead of the Qwen3-VL-specific (pixel_values, image_grid_thw)
keyword pair, so adding new VLM families (Gemma3, LLaVA, etc.) doesn't
require touching forward.

Config validator: orchestrator.use_renderer must be true when model.vlm is
set — fail at config-load instead of producing cryptic runtime errors.

Test cleanup: drop 25 tests for removed helpers (VLMImageCache, _extract_images,
etc.); update the one remaining renderer-mm trajectory test to pass
`mm_token_type_ids_mapping` directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
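The fail-at-config-load behaviour can be sketched with a pydantic model validator. The flat `vlm` field is a simplification (in the real config it lives under `model`), and the class shown is not the repo's actual OrchestratorConfig:

```python
from pydantic import BaseModel, model_validator


class OrchestratorConfig(BaseModel):
    # Simplified: in the real config, `vlm` is set on the model config,
    # not directly on the orchestrator.
    vlm: bool = False
    use_renderer: bool = False

    @model_validator(mode="after")
    def vlm_requires_renderer(self) -> "OrchestratorConfig":
        # Reject the removed MITO combination at config-load time instead
        # of surfacing a cryptic runtime error mid-training.
        if self.vlm and not self.use_renderer:
            raise ValueError(
                "The MITO path for VLMs has been removed; VLMs must go "
                "through a renderer that owns the processor."
            )
        return self
```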

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit b385fbf.

if mm_token_type_ids_auto is not None:
kwargs["mm_token_type_ids"] = mm_token_type_ids_auto
elif mm_token_type_ids is not None:
kwargs["mm_token_type_ids"] = mm_token_type_ids

Renderer-shipped mm_token_type_ids silently ignored for Qwen3-VL

High Severity

The precedence of mm_token_type_ids in forward() is inverted relative to the stated design. The auto-computed _get_qwen3_vl_mm_token_type_ids is preferred over the renderer-shipped mm_token_type_ids parameter. For Qwen3-VL, the auto function always returns non-None (a tensor of zeros filled with image/video markers), so the explicitly shipped mm_token_type_ids from the orchestrator is never used. The PR description says "the renderer is now the source of truth" and the comment in interleave_rollout notes the orchestrator-shipped explicit list "produces ~30x lower mismatch KL," yet forward() discards it.
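A minimal sketch of the fix Bugbot is asking for: check the renderer-shipped value first and only fall back to the auto-computed one. Variable names mirror the quoted snippet; the surrounding `forward()` code and the helper name are assumptions:

```python
def select_mm_token_type_ids(mm_token_type_ids, mm_token_type_ids_auto):
    # Renderer-shipped IDs are the source of truth; the auto-computed
    # Qwen3-VL path is only a fallback when the renderer sent nothing.
    kwargs = {}
    if mm_token_type_ids is not None:
        kwargs["mm_token_type_ids"] = mm_token_type_ids
    elif mm_token_type_ids_auto is not None:
        kwargs["mm_token_type_ids"] = mm_token_type_ids_auto
    return kwargs
```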



"The MITO path for VLMs has been removed; VLMs must go through "
"a renderer (e.g. Qwen3VLRenderer) that owns the processor."
)
return self

Breaking VLM config change missing from CHANGELOG

Medium Severity

The new vlm_requires_renderer validator makes orchestrator.use_renderer = true mandatory when model.vlm is set. Previously, VLM configs used use_renderer = false (the default). Any existing VLM config will now fail at load time with a ValueError. This is a breaking configuration change (effectively a removed valid-config combination) that requires a CHANGELOG.md entry per project rules.


Triggered by project rule: BugBot Instructions


hallerite and others added 3 commits May 11, 2026 22:36

CI was failing after the cherry-pick of the renderer multimodal commits
onto current origin/main. Three fixes:

1. Multimodal configs missing the renderer flag pair. The orchestrator
   config validator (added in the same PR) requires
   ``use_renderer = true`` when ``model.vlm`` is set, AND it's mutually
   exclusive with ``use_token_client`` (default ``true``). Three configs
   (``rl_color_codeword_test.toml``, ``rl_color_codeword.toml``,
   ``ci/nightly/multimodal_color_codeword.toml``) needed both flags set
   explicitly in ``[orchestrator]``. Delete
   ``rl_color_codeword_main_mito.toml`` — that was the A/B reference for
   the legacy MITO path, which this PR rips out. With MITO gone the
   config is no longer runnable; the ``rl_color_codeword_feat_renderer``
   counterpart already covers the new renderer-driven path.

2. ``test_model_forward.py`` was still calling ``forward(...,
   pixel_values=..., image_grid_thw=...)``. ``forward()`` now takes a
   generic ``mm_kwargs: dict`` (so adding new VLM families doesn't
   require touching the trainer signature) — update both tests to pass
   ``mm_kwargs={"pixel_values": ..., "image_grid_thw": ...}`` instead.

3. ``renderers`` / ``verifiers`` deps stale. The orchestrator imports
   ``MultiModalData`` from ``renderers.base`` (introduced in the
   companion renderers PR) and threads ``multi_modal_data`` end-to-end
   via the verifiers ``RendererClient`` changes. Pin both to their
   feature branches until the upstream PRs merge and PyPI / git
   rev-pins are bumped:

     - renderers @ feat/multimodal-vlm
     - verifiers @ feat/renderer-multimodal-passthrough

   Drops the ``renderers==0.1.6`` PyPI pin (the new symbols are post-0.1.6).
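   A sketch of what the feature-branch pins might look like in a
   uv-style pyproject — the ORG path and the ``[tool.uv.sources]``
   layout are placeholders, not the repo's actual dependency file:

   ```toml
   [tool.uv.sources]
   # Temporary git pins until the companion PRs merge and rev-pins are bumped.
   renderers = { git = "https://github.com/ORG/renderers", branch = "feat/multimodal-vlm" }
   verifiers = { git = "https://github.com/ORG/verifiers", branch = "feat/renderer-multimodal-passthrough" }
   ```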

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Picks up:
- renderers c3feaa5: RendererPool implements Renderer protocol structurally
  (callers can drop pool unwrap + isinstance branching), size=1 fast path,
  is_multimodal helper with per-type cache.
- verifiers 64f2555a: bugbot pass — multimodal dispatch fix (was broken
  for pooled renderers), tighter is_json_serializable, response-tokens
  mm_data strip on intermediate steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>