Guard --mllm against continuous batching (silent empty output) by eejd · Pull Request #601 · waybarrios/vllm-mlx

eejd · 2026-06-06T22:04:00Z

Problem

Serving a multimodal model with both --mllm and --continuous-batching (BatchedEngine) returns
empty output for every request, silently — while --mllm alone (SimpleEngine) works. The batched
path pre-formats the prompt and hands a string to the MLLM scheduler, which re-processes it with a
mismatched processor/token-injection context.

Change

Until the batched-MLLM path is fixed, fail loudly: raise a clear ValueError when force_mllm and
use_batching are combined, at both engine-construction entry points (_build_engine for the
registry/residency path and load_model for the single-model serve path), mirroring the existing
"MLLM draft models are supported only by SimpleEngine" guard.
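A minimal sketch of the guard as described above. The helper name `check_mllm_batching` and the message wording are assumptions for illustration. The PR adds the check at both `_build_engine` and `load_model`, and how it is factored there is not shown here.

```python
def check_mllm_batching(force_mllm: bool, use_batching: bool) -> None:
    """Hypothetical helper: reject --mllm combined with --continuous-batching."""
    if force_mllm and use_batching:
        # Message text is paraphrased, not copied from the PR.
        raise ValueError(
            "--mllm is not supported with --continuous-batching "
            "(requests return empty output). Use SimpleEngine by "
            "omitting --continuous-batching."
        )


# Each flag alone is accepted; only the combination is rejected.
check_mllm_batching(force_mllm=True, use_batching=False)
check_mllm_batching(force_mllm=False, use_batching=True)

try:
    check_mllm_batching(force_mllm=True, use_batching=True)
except ValueError as exc:
    guard_message = str(exc)
```

Raising at construction time means the misconfiguration surfaces once at startup rather than as an empty response on every request.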

Validation

serve <vlm> --mllm --continuous-batching now errors with a clear message ("use SimpleEngine by
omitting --continuous-batching") instead of silently serving empty output.

(The deeper fix — making the batched-MLLM path actually generate — is a separate, larger change; this
PR is the safe guard.)

🤖 Generated with Claude Code

Serving a multimodal model with both --mllm and --continuous-batching (BatchedEngine) silently produces empty output for every request, while --mllm alone (SimpleEngine) works.

Until the batched-MLLM path is fixed, fail loudly instead: raise a clear ValueError pointing users to omit --continuous-batching. The guard is added at both engine-construction entry points that combine force_mllm with batching: _build_engine() (registry/residency path) and load_model() (single-model serve path), mirroring the existing "MLLM draft models are supported only by SimpleEngine" guard.

Tracking issue for the underlying batched-MLLM fix: eejd/agent-services-hive#70.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

waybarrios · 2026-06-12T03:37:47Z

Verified this against the code and the guard is incomplete: it only fires for explicit --mllm, but BatchedEngine auto-detects multimodal models at batched.py L189:

# BatchedEngine.__init__ — force_mllm is not the only way into the MLLM path
self._is_mllm = force_mllm or is_mllm_model(model_name)

So this still hits the silent-empty-output path with both guards in place:

# no --mllm flag, model auto-detected as MLLM, guard never fires
vllm-mlx serve mlx-community/Qwen2.5-VL-7B-Instruct-4bit --continuous-batching

The fix is to mirror what BatchedEngine actually checks, in both _build_engine and load_model; is_mllm_model is already imported in server.py:

# _build_engine
if spec.use_batching and (spec.force_mllm or is_mllm_model(spec.model_name)):
    raise ValueError(
        "MLLM models are not supported with continuous batching (silent empty "
        "output). Run with SimpleEngine by omitting --continuous-batching."
    )

# load_model
if use_batching and (force_mllm or is_mllm_model(model_name)):
    raise ValueError(...)  # same message

With that widened, this is mergeable.
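One way to keep the two entry points from drifting apart is to share a single check. The sketch below is an assumption, not code from the repository: `ensure_not_batched_mllm`, `MLLM_BATCHING_ERROR`, and the `fake_is_mllm_model` stub are hypothetical names, and the stub stands in for the real `is_mllm_model` so the example runs without vllm-mlx installed.

```python
from typing import Callable

MLLM_BATCHING_ERROR = (
    "MLLM models are not supported with continuous batching (silent empty "
    "output). Run with SimpleEngine by omitting --continuous-batching."
)


def fake_is_mllm_model(model_name: str) -> bool:
    """Stub detector for illustration only; not the real is_mllm_model."""
    return "-VL-" in model_name


def ensure_not_batched_mllm(
    model_name: str,
    use_batching: bool,
    force_mllm: bool,
    detect: Callable[[str], bool] = fake_is_mllm_model,
) -> None:
    """Mirror the condition BatchedEngine uses to enter the MLLM path."""
    if use_batching and (force_mllm or detect(model_name)):
        raise ValueError(MLLM_BATCHING_ERROR)


def is_rejected(model_name: str, use_batching: bool, force_mllm: bool) -> bool:
    """Return True when the guard raises for this configuration."""
    try:
        ensure_not_batched_mllm(model_name, use_batching, force_mllm)
    except ValueError:
        return True
    return False


# Auto-detected MLLM with batching and no --mllm flag: the widened guard fires.
auto_detected = is_rejected("mlx-community/Qwen2.5-VL-7B-Instruct-4bit", True, False)
# A name the stub does not treat as multimodal: batching stays allowed.
text_only = is_rejected("example-org/text-only-model", True, False)
```

Passing the detector as a parameter is only a convenience for testing the condition in isolation; inlining the condition at both call sites, as proposed above, behaves the same.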

eejd wants to merge 1 commit into waybarrios:main from eejd:fix/guard-mllm-continuous-batching
