fix(dflash): route image-bearing requests to the VLM fallback engine by 0xAlcibiades · Pull Request #1056 · jundot/omlx

0xAlcibiades · 2026-05-04T16:27:01Z

Problem

DFlashEngine.chat() / stream_chat() flatten messages through the tokenizer's chat template before calling generate(prompt=...). Image content blocks (image_url, image, input_image) get rendered as text placeholders or stripped — the target model's vision encoder never receives the image bytes. Multimodal requests return 200 OK with a coherent text answer based on a fabricated assumption that no image was attached. Silent failure, indistinguishable downstream from a genuine ambiguity.

The same root cause was filed as a feature request in #791 against Qwen3.5 / oMLX 0.3.5; it persists on Qwen3.6 / 0.3.8.

Fix

Detect image content parts in incoming messages. When any are present, transition into the existing fallback mode (_evict_dflash_and_start_fallback) and delegate to self._fallback_engine.chat(...) / .stream_chat(...). The VLMBatchedEngine is already configured as the fallback when the underlying model is multimodal — making it work is closer to existing behavior than failing loudly.

Mirrors the existing long-context fallback architecture: one-way transition, DFlash is evicted on first vision request, subsequent requests (text or image) run through the VLM engine for the rest of the engine's lifetime. The in-tree comment at dflash.py:548 already explains why eviction is one-way ("reloading dflash models is expensive"); this commit applies the same trade-off to multimodal triggers.

What changed

omlx/engine/dflash.py — adds two module-level helpers and the routing guard:

_content_has_image_part(content) and _messages_have_images(messages) — mirror the part-type set used by VLMBatchedEngine._count_content_parts (vlm.py:680-685).
New early-exit branch at the top of chat() (line 752) and stream_chat() (line 781). On detection: log → evict-and-fallback (if not yet in fallback) → delegate to the fallback engine.
RuntimeError raised when _fallback_engine_type != "vlm" and an image arrives. This is a config-error case (DFlash on a text-only model receiving multimodal input), and surfacing it loudly is strictly better than silent corruption.

tests/test_dflash_engine.py — adds TestImageContentDetection with 8 unit cases covering:

Text-only string and list content (negative cases)
All three accepted image part types (image_url, image, input_image)
Mixed text + image content
Pydantic-style content parts that use attribute access instead of dict get
Defensive handling of non-dict message shapes

Tradeoffs / things to know

One-way eviction, matching the existing context-overflow pattern. Workloads that mix text and images will lose DFlash speedups for the rest of the session after the first image arrives. Two-way (reload DFlash after image request) would be a larger architectural change; happy to follow up if you'd prefer that direction.
Concurrent eviction race (two image requests arriving simultaneously could both call _evict_dflash_and_start_fallback) is an existing condition with the long-context path, not introduced here. Worth a future lock if it becomes a problem.
count_chat_tokens() is unchanged and will still under-count tokens on image-bearing requests at the dflash layer. Once routing decides the request goes to the VLM engine, the VLM engine does its own token accounting that includes vision soft tokens, so this isn't load-bearing for correctness — but if you'd like a consistent-counting story I can extend the patch.

Verification

Reproduced the original silent-failure on:

Target: unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit-general
Draft: z-lab/Qwen3.6-35B-A3B-DFlash
oMLX 0.3.8 on M2 Ultra Mac Pro 128GB

The added tests pass standalone (no engine startup required — they exercise the helper functions only). Full test suite hasn't been run as part of this PR — happy to run it locally if the maintainer wants verification before merge.

DFlashEngine.chat() and stream_chat() flatten incoming messages through the tokenizer's chat template before calling generate(prompt=...). Image content blocks (image_url, image, input_image) get rendered to text placeholders or stripped entirely, so the target model's vision encoder never receives the image bytes. The request returns 200 OK but the model behaves as if no image was attached. This commit detects image content parts in messages and, when present, delegates to the VLMBatchedEngine fallback that's already wired up when the underlying model is multimodal. The transition mirrors the existing long-context fallback in _evict_dflash_and_start_fallback — once images appear, DFlash is evicted and subsequent requests run through the VLM engine. This matches the existing one-way fallback semantics ("reloading dflash models is expensive", per the in-tree comment). If the configured fallback isn't a VLM (i.e. DFlash on a text-only model), the engine now raises a clear RuntimeError instead of silently producing wrong output. Adds 8 unit tests for the new helpers (_content_has_image_part and _messages_have_images) covering string content, list content, dict-typed parts, attribute-typed parts (Pydantic-style), the three accepted part types, mixed content, and defensive non-dict message handling. Fixes jundot#1055 Related: jundot#791

PR jundot#1050 (thread-local generation stream) is now merged upstream. Drops the mlx_vlm.generate.generation_stream monkey-patch from _init_mlx_thread() that was needed while the PR was open. Also picks up: - jundot#1053 DFlash speculative decoding GPU hang/perf fix - jundot#1055 batch_generate / server decode gap fix - jundot#1056 hunyuan_vl / gemma3n cache-offset cleanup

# Conflicts: # tests/test_dflash_engine.py

0xAlcibiades added 2 commits May 5, 2026 18:38

Merge branch 'main' into fix/dflash-vision-bypass

cec968c

# Conflicts: # tests/test_dflash_engine.py

fix(dflash): unwind merge

df46553

0xAlcibiades mentioned this pull request May 6, 2026

[Feature Request] Native video input in /v1/chat/completions (video_url / input_video) #1076

Open

0xAlcibiades added 2 commits May 5, 2026 21:24

fix(dflash): API layer stripping vlm

85a4f37

fix(prefill): scheduler bug

a0dc9ba

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dflash): route image-bearing requests to the VLM fallback engine#1056

fix(dflash): route image-bearing requests to the VLM fallback engine#1056
0xAlcibiades wants to merge 5 commits into
jundot:mainfrom
0xAlcibiades:fix/dflash-vision-bypass

0xAlcibiades commented May 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

0xAlcibiades commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

What changed

Tradeoffs / things to know

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

0xAlcibiades commented May 4, 2026 •

edited

Loading