fix(dflash): route image-bearing requests to the VLM fallback engine #1056

Open
0xAlcibiades wants to merge 5 commits into jundot:main from 0xAlcibiades:fix/dflash-vision-bypass


0xAlcibiades commented May 4, 2026

Fixes #1055.

Problem

DFlashEngine.chat() / stream_chat() flatten messages through the tokenizer's chat template before calling generate(prompt=...). Image content blocks (image_url, image, input_image) are rendered as text placeholders or stripped, so the target model's vision encoder never receives the image bytes. Multimodal requests return 200 OK with a coherent text answer that assumes no image was attached. The failure is silent: downstream, it is indistinguishable from the model responding to a genuinely ambiguous request.

The same root cause was filed as a feature request in #791 against Qwen3.5 / oMLX 0.3.5; it persists on Qwen3.6 / 0.3.8.

Fix

Detect image content parts in incoming messages. When any are present, transition into the existing fallback mode (_evict_dflash_and_start_fallback) and delegate to self._fallback_engine.chat(...) / .stream_chat(...). The VLMBatchedEngine is already configured as the fallback when the underlying model is multimodal — making it work is closer to existing behavior than failing loudly.

Mirrors the existing long-context fallback architecture: one-way transition, DFlash is evicted on first vision request, subsequent requests (text or image) run through the VLM engine for the rest of the engine's lifetime. The in-tree comment at dflash.py:548 already explains why eviction is one-way ("reloading dflash models is expensive"); this commit applies the same trade-off to multimodal triggers.
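For reviewers who want the control flow at a glance, here is a minimal sketch of the routing guard described above. It is a synchronous toy, not the code in dflash.py: the class name, the `_in_fallback` flag, the `_dflash_chat` method, the injected `has_images` predicate, and the stub engine are all hypothetical, and only `_evict_dflash_and_start_fallback`, `_fallback_engine`, and `_fallback_engine_type` are names taken from this PR.

```python
from typing import Any, Callable, List


class StubEngine:
    """Hypothetical stand-in for the VLM fallback engine."""

    def chat(self, messages: List[Any]) -> str:
        return "vlm"


class FallbackRouterSketch:
    """Toy model of the routing guard; not the real DFlashEngine."""

    def __init__(
        self,
        fallback_engine: Any,
        fallback_engine_type: str,
        has_images: Callable[[List[Any]], bool],
    ) -> None:
        self._fallback_engine = fallback_engine
        self._fallback_engine_type = fallback_engine_type
        self._has_images = has_images  # injected image detector (assumption)
        self._in_fallback = False      # hypothetical flag name
        self.evictions = 0             # counter kept only for illustration

    def _evict_dflash_and_start_fallback(self) -> None:
        # The real method unloads DFlash and starts the fallback engine.
        self.evictions += 1
        self._in_fallback = True

    def _dflash_chat(self, messages: List[Any]) -> str:
        # Placeholder for the normal speculative-decoding path.
        return "dflash"

    def chat(self, messages: List[Any]) -> str:
        if self._has_images(messages):
            if self._fallback_engine_type != "vlm":
                # Text-only fallback cannot see images: fail loudly.
                raise RuntimeError("image content received but fallback is not a VLM")
            if not self._in_fallback:
                self._evict_dflash_and_start_fallback()  # one-way transition
            return self._fallback_engine.chat(messages)
        if self._in_fallback:
            # After eviction, text requests also stay on the fallback engine.
            return self._fallback_engine.chat(messages)
        return self._dflash_chat(messages)


def naive_has_images(messages: List[Any]) -> bool:
    """Simplified detector for this sketch: dict messages with list content only."""
    for msg in messages:
        content = msg.get("content") if isinstance(msg, dict) else None
        if isinstance(content, list) and any(
            isinstance(p, dict) and p.get("type") in ("image_url", "image", "input_image")
            for p in content
        ):
            return True
    return False
```

In this sketch, a text request before any image goes through the DFlash path, the first image request triggers a single eviction, and every later request (text or image) is served by the fallback engine.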

What changed

omlx/engine/dflash.py — adds two module-level helpers and the routing guard:

  • _content_has_image_part(content) and _messages_have_images(messages) — mirror the part-type set used by VLMBatchedEngine._count_content_parts (vlm.py:680-685).
  • New early-exit branch at the top of chat() (line 752) and stream_chat() (line 781). On detection: log → evict-and-fallback (if not yet in fallback) → delegate to the fallback engine.
  • RuntimeError raised when _fallback_engine_type != "vlm" and an image arrives. This is a config-error case (DFlash on a text-only model receiving multimodal input), and surfacing it loudly is strictly better than silent corruption.

tests/test_dflash_engine.py — adds TestImageContentDetection with 8 unit cases covering:

  • Text-only string and list content (negative cases)
  • All three accepted image part types (image_url, image, input_image)
  • Mixed text + image content
  • Pydantic-style content parts that use attribute access instead of dict get
  • Defensive handling of non-dict message shapes

Tradeoffs / things to know

  • One-way eviction, matching the existing context-overflow pattern. Workloads that mix text and images will lose DFlash speedups for the rest of the session after the first image arrives. Two-way (reload DFlash after image request) would be a larger architectural change; happy to follow up if you'd prefer that direction.
  • Concurrent eviction race (two image requests arriving simultaneously could both call _evict_dflash_and_start_fallback) is an existing condition with the long-context path, not introduced here. Worth a future lock if it becomes a problem.
  • count_chat_tokens() is unchanged and will still under-count tokens on image-bearing requests at the dflash layer. Once routing decides the request goes to the VLM engine, the VLM engine does its own token accounting that includes vision soft tokens, so this isn't load-bearing for correctness — but if you'd like a consistent-counting story I can extend the patch.
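On the concurrent-eviction race in the second bullet, one possible shape for a future guard is a lock with a re-check inside it, so that only the first caller performs the eviction. The sketch below uses `threading.Lock` purely for illustration; whether the engine needs a thread lock or an `asyncio.Lock` depends on how requests are dispatched, which this PR does not state. The class name, `_evict_lock`, `_in_fallback`, and `ensure_fallback` are hypothetical.

```python
import threading


class EvictOnceSketch:
    """Toy model of a guarded one-way eviction; not the real engine."""

    def __init__(self) -> None:
        self._evict_lock = threading.Lock()  # hypothetical attribute
        self._in_fallback = False            # hypothetical flag name
        self.evictions = 0                   # counter kept only for illustration

    def _evict_dflash_and_start_fallback(self) -> None:
        # Stand-in for the expensive unload-and-start step.
        self.evictions += 1

    def ensure_fallback(self) -> None:
        if self._in_fallback:
            return  # fast path once the transition has happened
        with self._evict_lock:
            # Re-check under the lock: another caller may have won the race.
            if not self._in_fallback:
                self._evict_dflash_and_start_fallback()
                self._in_fallback = True


def run_concurrently(engine: EvictOnceSketch, workers: int) -> None:
    """Call ensure_fallback from several threads at once."""
    threads = [threading.Thread(target=engine.ensure_fallback) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With this pattern, several simultaneous image requests still result in a single call to the eviction method.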

Verification

Reproduced the original silent failure on:

  • Target: unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit-general
  • Draft: z-lab/Qwen3.6-35B-A3B-DFlash
  • oMLX 0.3.8 on M2 Ultra Mac Pro 128GB

The added tests pass standalone; no engine startup is required because they exercise the helper functions only. The full test suite hasn't been run as part of this PR. Happy to run it locally if the maintainer wants verification before merge.

DFlashEngine.chat() and stream_chat() flatten incoming messages through
the tokenizer's chat template before calling generate(prompt=...). Image
content blocks (image_url, image, input_image) get rendered to text
placeholders or stripped entirely, so the target model's vision encoder
never receives the image bytes. The request returns 200 OK but the model
behaves as if no image was attached.

This commit detects image content parts in messages and, when present,
delegates to the VLMBatchedEngine fallback that's already wired up when
the underlying model is multimodal. The transition mirrors the existing
long-context fallback in _evict_dflash_and_start_fallback — once images
appear, DFlash is evicted and subsequent requests run through the VLM
engine. This matches the existing one-way fallback semantics ("reloading
dflash models is expensive", per the in-tree comment).

If the configured fallback isn't a VLM (i.e. DFlash on a text-only
model), the engine now raises a clear RuntimeError instead of silently
producing wrong output.

Adds 8 unit tests for the new helpers (_content_has_image_part and
_messages_have_images) covering string content, list content, dict-typed
parts, attribute-typed parts (Pydantic-style), the three accepted part
types, mixed content, and defensive non-dict message handling.

Fixes jundot#1055
Related: jundot#791
popfido pushed a commit to popfido/omlx that referenced this pull request May 5, 2026
PR jundot#1050 (thread-local generation stream) is now merged upstream.
Drops the mlx_vlm.generate.generation_stream monkey-patch from
_init_mlx_thread() that was needed while the PR was open.

Also picks up:
- jundot#1053 DFlash speculative decoding GPU hang/perf fix
- jundot#1055 batch_generate / server decode gap fix
- jundot#1056 hunyuan_vl / gemma3n cache-offset cleanup
Development

Successfully merging this pull request may close these issues.

[Bug] Vision requests silently ignored when DFlash speculative decoding is enabled