fix(dflash): route image-bearing requests to the VLM fallback engine#1056
Open
0xAlcibiades wants to merge 5 commits into
Open
fix(dflash): route image-bearing requests to the VLM fallback engine#10560xAlcibiades wants to merge 5 commits into
0xAlcibiades wants to merge 5 commits into
Conversation
DFlashEngine.chat() and stream_chat() flatten incoming messages through
the tokenizer's chat template before calling generate(prompt=...). Image
content blocks (image_url, image, input_image) get rendered to text
placeholders or stripped entirely, so the target model's vision encoder
never receives the image bytes. The request returns 200 OK but the model
behaves as if no image was attached.
This commit detects image content parts in messages and, when present,
delegates to the VLMBatchedEngine fallback that's already wired up when
the underlying model is multimodal. The transition mirrors the existing
long-context fallback in _evict_dflash_and_start_fallback — once images
appear, DFlash is evicted and subsequent requests run through the VLM
engine. This matches the existing one-way fallback semantics ("reloading
dflash models is expensive", per the in-tree comment).
If the configured fallback isn't a VLM (i.e. DFlash on a text-only
model), the engine now raises a clear RuntimeError instead of silently
producing wrong output.
Adds 8 unit tests for the new helpers (_content_has_image_part and
_messages_have_images) covering string content, list content, dict-typed
parts, attribute-typed parts (Pydantic-style), the three accepted part
types, mixed content, and defensive non-dict message handling.
Fixes jundot#1055
Related: jundot#791
popfido
pushed a commit
to popfido/omlx
that referenced
this pull request
May 5, 2026
PR jundot#1050 (thread-local generation stream) is now merged upstream. Drops the mlx_vlm.generate.generation_stream monkey-patch from _init_mlx_thread() that was needed while the PR was open. Also picks up: - jundot#1053 DFlash speculative decoding GPU hang/perf fix - jundot#1055 batch_generate / server decode gap fix - jundot#1056 hunyuan_vl / gemma3n cache-offset cleanup
# Conflicts: # tests/test_dflash_engine.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1055.
Problem
DFlashEngine.chat()/stream_chat()flatten messages through the tokenizer's chat template before callinggenerate(prompt=...). Image content blocks (image_url,image,input_image) get rendered as text placeholders or stripped — the target model's vision encoder never receives the image bytes. Multimodal requests return 200 OK with a coherent text answer based on a fabricated assumption that no image was attached. Silent failure, indistinguishable downstream from a genuine ambiguity.The same root cause was filed as a feature request in #791 against Qwen3.5 / oMLX 0.3.5; it persists on Qwen3.6 / 0.3.8.
Fix
Detect image content parts in incoming messages. When any are present, transition into the existing fallback mode (
_evict_dflash_and_start_fallback) and delegate toself._fallback_engine.chat(...)/.stream_chat(...). TheVLMBatchedEngineis already configured as the fallback when the underlying model is multimodal — making it work is closer to existing behavior than failing loudly.Mirrors the existing long-context fallback architecture: one-way transition, DFlash is evicted on first vision request, subsequent requests (text or image) run through the VLM engine for the rest of the engine's lifetime. The in-tree comment at
dflash.py:548already explains why eviction is one-way ("reloading dflash models is expensive"); this commit applies the same trade-off to multimodal triggers.What changed
omlx/engine/dflash.py— adds two module-level helpers and the routing guard:_content_has_image_part(content)and_messages_have_images(messages)— mirror the part-type set used byVLMBatchedEngine._count_content_parts(vlm.py:680-685).chat()(line 752) andstream_chat()(line 781). On detection: log → evict-and-fallback (if not yet in fallback) → delegate to the fallback engine.RuntimeErrorraised when_fallback_engine_type != "vlm"and an image arrives. This is a config-error case (DFlash on a text-only model receiving multimodal input), and surfacing it loudly is strictly better than silent corruption.tests/test_dflash_engine.py— addsTestImageContentDetectionwith 8 unit cases covering:image_url,image,input_image)getTradeoffs / things to know
_evict_dflash_and_start_fallback) is an existing condition with the long-context path, not introduced here. Worth a future lock if it becomes a problem.count_chat_tokens()is unchanged and will still under-count tokens on image-bearing requests at the dflash layer. Once routing decides the request goes to the VLM engine, the VLM engine does its own token accounting that includes vision soft tokens, so this isn't load-bearing for correctness — but if you'd like a consistent-counting story I can extend the patch.Verification
Reproduced the original silent-failure on:
unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit-generalz-lab/Qwen3.6-35B-A3B-DFlashThe added tests pass standalone (no engine startup required — they exercise the helper functions only). Full test suite hasn't been run as part of this PR — happy to run it locally if the maintainer wants verification before merge.