Guard --mllm against continuous batching (silent empty output) #601
Open
eejd wants to merge 1 commit into
Serving a multimodal model with both --mllm and --continuous-batching (BatchedEngine) silently produces empty output for every request, while --mllm alone (SimpleEngine) works. Until the batched-MLLM path is fixed, fail loudly instead: raise a clear ValueError pointing users to omit --continuous-batching.

The guard is added at both engine-construction entry points that combine force_mllm with batching: _build_engine() (registry/residency path) and load_model() (single-model serve path), mirroring the existing "MLLM draft models are supported only by SimpleEngine" guard.

Tracking issue for the underlying batched-MLLM fix: eejd/agent-services-hive#70.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Owner
Verified this against the code and the guard is incomplete: it only fires for an explicit --mllm flag.

    # BatchedEngine.__init__: force_mllm is not the only way into the MLLM path
    self._is_mllm = force_mllm or is_mllm_model(model_name)

So this still hits the silent-empty-output path with both guards in place:

    # no --mllm flag, model auto-detected as MLLM, guard never fires
    vllm-mlx serve mlx-community/Qwen2.5-VL-7B-Instruct-4bit --continuous-batching

The fix is to mirror what BatchedEngine actually checks, in both _build_engine and load_model:

    # _build_engine
    if spec.use_batching and (spec.force_mllm or is_mllm_model(spec.model_name)):
        raise ValueError(
            "MLLM models are not supported with continuous batching (silent empty "
            "output). Run with SimpleEngine by omitting --continuous-batching."
        )

    # load_model
    if use_batching and (force_mllm or is_mllm_model(model_name)):
        raise ValueError(...)  # same message

With that widened, this is mergeable.
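One way to keep the two call sites from drifting apart would be to factor the widened check into a shared helper. The following is a minimal sketch, not the project's code: `reject_batched_mllm`, its `detector` parameter, and `fake_is_mllm_model` are hypothetical names, and the stand-in detector only imitates `is_mllm_model` for illustration.

```python
from typing import Callable

MLLM_BATCHING_ERROR = (
    "MLLM models are not supported with continuous batching (silent empty "
    "output). Run with SimpleEngine by omitting --continuous-batching."
)


def reject_batched_mllm(
    model_name: str,
    use_batching: bool,
    force_mllm: bool,
    detector: Callable[[str], bool],
) -> None:
    """Raise ValueError if the request would take the batched MLLM path.

    Uses the same condition as BatchedEngine.__init__:
    force_mllm or detector(model_name).
    """
    if use_batching and (force_mllm or detector(model_name)):
        raise ValueError(MLLM_BATCHING_ERROR)


def fake_is_mllm_model(model_name: str) -> bool:
    # Stand-in for is_mllm_model, for illustration only (assumption):
    # treats any name containing "-vl-" as multimodal.
    return "-vl-" in model_name.lower()
```

Both `_build_engine` and `load_model` could then call the helper with their own arguments, so the explicit-flag case and the auto-detected case are rejected by the same condition.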
Problem
Serving a multimodal model with both --mllm and --continuous-batching (BatchedEngine) silently returns empty output for every request, while --mllm alone (SimpleEngine) works. The batched path pre-formats the prompt and hands a string to the MLLM scheduler, which re-processes it with a mismatched processor/token-injection context.
Change
Until the batched-MLLM path is fixed, fail loudly: raise a clear ValueError when force_mllm and use_batching are combined, at both engine-construction entry points (_build_engine for the registry/residency path and load_model for the single-model serve path), mirroring the existing "MLLM draft models are supported only by SimpleEngine" guard.
Validation
serve <vlm> --mllm --continuous-batching now errors with a clear message ("use SimpleEngine by omitting --continuous-batching") instead of serving empty output.
(The deeper fix, making the batched-MLLM path actually generate, is a separate, larger change; this PR is the safe guard.)
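The validation could be pinned down with a small regression test. This is a sketch under assumptions: `load_model_stub` is a hypothetical stand-in that reproduces only the guard described in this PR (the real `load_model` signature is not shown here), the full error wording beyond the quoted phrase is assumed, and the engine names are returned as plain strings for illustration.

```python
import unittest


def load_model_stub(
    model_name: str, use_batching: bool = False, force_mllm: bool = False
) -> str:
    # Hypothetical stand-in: reproduces only the guard described in this PR.
    if force_mllm and use_batching:
        raise ValueError(
            "--mllm is not supported with continuous batching; "
            "use SimpleEngine by omitting --continuous-batching"
        )
    return "BatchedEngine" if use_batching else "SimpleEngine"


class MllmBatchingGuardTest(unittest.TestCase):
    def test_mllm_with_batching_fails_loudly(self):
        # The combination must raise instead of serving empty output.
        with self.assertRaisesRegex(ValueError, "omitting --continuous-batching"):
            load_model_stub("some-vlm", use_batching=True, force_mllm=True)

    def test_mllm_alone_uses_simple_engine(self):
        # --mllm without batching keeps working on SimpleEngine.
        self.assertEqual(load_model_stub("some-vlm", force_mllm=True), "SimpleEngine")
```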
🤖 Generated with Claude Code