Guard --mllm against continuous batching (silent empty output) #601

Open
eejd wants to merge 1 commit into waybarrios:main from eejd:fix/guard-mllm-continuous-batching

@eejd eejd commented Jun 6, 2026

Contributor

Problem

Serving a multimodal model with both --mllm and --continuous-batching (BatchedEngine) returns
empty output for every request, silently — while --mllm alone (SimpleEngine) works. The batched
path pre-formats the prompt and hands a string to the MLLM scheduler, which re-processes it with a
mismatched processor/token-injection context.

Change

Until the batched-MLLM path is fixed, fail loudly: raise a clear ValueError when force_mllm and
use_batching are combined, at both engine-construction entry points (_build_engine for the
registry/residency path and load_model for the single-model serve path), mirroring the existing
"MLLM draft models are supported only by SimpleEngine" guard.

Validation

serve <vlm> --mllm --continuous-batching now errors with a clear message ("use SimpleEngine by
omitting --continuous-batching") instead of serving empty.

(The deeper fix — making the batched-MLLM path actually generate — is a separate, larger change; this
PR is the safe guard.)
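The guard described under Change could look roughly like the sketch below. This is not the PR's diff: `EngineSpec` and `check_mllm_batching` are hypothetical names used only for illustration, and the error wording is an assumption apart from the quoted hint to omit --continuous-batching.

```python
from dataclasses import dataclass


@dataclass
class EngineSpec:
    """Hypothetical stand-in for the real spec; only fields named in this PR."""

    model_name: str
    force_mllm: bool = False
    use_batching: bool = False


def check_mllm_batching(spec: EngineSpec) -> None:
    """Fail loudly when --mllm is combined with --continuous-batching."""
    if spec.force_mllm and spec.use_batching:
        # Assumed wording; the real message lives in the PR's diff.
        raise ValueError(
            "--mllm is not supported with --continuous-batching "
            "(the batched MLLM path returns empty output); "
            "use SimpleEngine by omitting --continuous-batching."
        )
```

Calling a check like this at the top of each engine-construction entry point means the server refuses to start, rather than starting and answering every request with an empty string.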

🤖 Generated with Claude Code

Serving a multimodal model with both --mllm and --continuous-batching
(BatchedEngine) silently produces empty output for every request, while
--mllm alone (SimpleEngine) works. Until the batched-MLLM path is fixed,
fail loudly instead: raise a clear ValueError pointing users to omit
--continuous-batching.

The guard is added at both engine-construction entry points that combine
force_mllm with batching: _build_engine() (registry/residency path) and
load_model() (single-model serve path), mirroring the existing
"MLLM draft models are supported only by SimpleEngine" guard.

Tracking issue for the underlying batched-MLLM fix: eejd/agent-services-hive#70.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@waybarrios

Owner

Verified this against the code and the guard is incomplete: it only fires for explicit --mllm, but BatchedEngine auto-detects multimodal models at batched.py L189:

# BatchedEngine.__init__ — force_mllm is not the only way into the MLLM path
self._is_mllm = force_mllm or is_mllm_model(model_name)

So this still hits the silent-empty-output path with both guards in place:

# no --mllm flag, model auto-detected as MLLM, guard never fires
vllm-mlx serve mlx-community/Qwen2.5-VL-7B-Instruct-4bit --continuous-batching

The fix is to mirror what BatchedEngine actually checks, in both _build_engine and load_model; is_mllm_model is already imported in server.py:

# _build_engine
if spec.use_batching and (spec.force_mllm or is_mllm_model(spec.model_name)):
    raise ValueError(
        "MLLM models are not supported with continuous batching (silent empty "
        "output). Run with SimpleEngine by omitting --continuous-batching."
    )

# load_model
if use_batching and (force_mllm or is_mllm_model(model_name)):
    raise ValueError(...)  # same message

With that widened, this is mergeable.
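One way to keep the two call sites from drifting apart is a single shared helper. The sketch below is an assumption-laden illustration, not code from the repository: `guard_batched_mllm` is a hypothetical name, and `fake_is_mllm_model` is a stub standing in for the real `is_mllm_model`, whose detection logic is not shown in this thread.

```python
from typing import Callable


def fake_is_mllm_model(model_name: str) -> bool:
    """Hypothetical stub for is_mllm_model; real detection is not shown here."""
    return "-VL-" in model_name


def guard_batched_mllm(
    model_name: str,
    force_mllm: bool,
    use_batching: bool,
    detect: Callable[[str], bool] = fake_is_mllm_model,
) -> None:
    """Reject batching for any model that would take the MLLM path.

    Mirrors the condition quoted from BatchedEngine: the MLLM path is taken
    when force_mllm is set OR the model is auto-detected as multimodal.
    """
    if use_batching and (force_mllm or detect(model_name)):
        raise ValueError(
            "MLLM models are not supported with continuous batching (silent "
            "empty output). Run with SimpleEngine by omitting "
            "--continuous-batching."
        )
```

With the detector passed in as a parameter, a unit test can cover the auto-detected case (no --mllm flag, batching on) without loading a model.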
