Related issues / PRs surveyed
No issue or PR requests video_url / input_video content parts.
Problem: Qwen 3.6-35B-A3B is natively video-capable (VideoMME 86.6, VideoMMMU 83.7), but oMLX exposes none of that. There's no video content-part type in any API model, and a client sending {"type":"video_url",...} either gets a 422 or has the field flattened by the chat template before the encoder ever sees it.
Use case: Any agent or pipeline that wants to send a clip to a video-capable VL model: UI testing harnesses, surveillance/OCR over recorded footage, screen-capture analysis. This affects every Qwen 3.6 / Qwen 3-VL deployment today that wants to use the model the way the model card describes.
Proposed solution: Add video_url and input_video content-part types to the OpenAI / Anthropic / Responses content unions in omlx/api/. In the API handlers, decode the URI (HTTP / data URI / local path) and extract uniformly-sampled frames via ffmpeg, which is already a soft dependency per omlx/api/audio_routes.py:40-44. Inject the frames as image_url parts before the engine call. Per-model max_video_frames (default 16) and sampling strategy (uniform | keyframe) live in ModelSettings.
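A rough sketch of the handler-side expansion described above, to make the shape of the change concrete. Everything here is hypothetical and not existing oMLX code: the helper names (uniform_timestamps, ffmpeg_frame_cmd, expand_video_parts) are made up for illustration, midpoint sampling is just one way to do "uniform", and only the video_url part shape from the reproducer is handled. The ffmpeg invocation is only built as an argv list, not executed.

```python
from typing import Any, Callable, Dict, List


def uniform_timestamps(duration_s: float, max_video_frames: int = 16) -> List[float]:
    """Pick timestamps at the midpoints of max_video_frames equal-width segments."""
    if duration_s <= 0 or max_video_frames <= 0:
        return []
    step = duration_s / max_video_frames
    return [round(step * (i + 0.5), 3) for i in range(max_video_frames)]


def ffmpeg_frame_cmd(src: str, timestamp_s: float, out_path: str) -> List[str]:
    """Build an ffmpeg argv that seeks to timestamp_s and writes a single frame."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-ss", f"{timestamp_s:.3f}",  # placed before -i so ffmpeg seeks on the input
        "-i", src,
        "-frames:v", "1",             # emit exactly one video frame
        "-y", out_path,               # overwrite the output file if it exists
    ]


def expand_video_parts(
    content: List[Dict[str, Any]],
    frames_for: Callable[[str], List[str]],
) -> List[Dict[str, Any]]:
    """Replace each video_url part with image_url parts, keeping part order.

    frames_for is a hypothetical callback that maps a video URL to a list of
    frame URLs (for example data URIs produced from the extracted frames).
    """
    expanded: List[Dict[str, Any]] = []
    for part in content:
        if part.get("type") == "video_url":
            url = part["video_url"]["url"]
            expanded.extend(
                {"type": "image_url", "image_url": {"url": frame_url}}
                for frame_url in frames_for(url)
            )
        else:
            expanded.append(part)
    return expanded
```

With these definitions, uniform_timestamps(8.0, 4) yields [1.0, 3.0, 5.0, 7.0]. Keeping the expansion as a pure function over content parts would let the same code serve the OpenAI, Anthropic, and Responses handlers once each has mapped its own part type onto a common shape.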
Alternatives: Client-side frame decomposition works (PR #1056 routes multi-image requests correctly to the VLM sidecar, and Qwen 3.6's mRoPE handles multi-image positional encoding natively), but every client has to reinvent it and the server can't deduplicate frames across requests. Re-enabling mlx-vlm's video_processor (currently disabled at omlx/engine/vlm.py:101-156 to avoid a torchvision dependency) was considered but adds ~1GB of bundle weight for a path that ffmpeg+frames already covers.
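For reference, the client-side workaround looks roughly like the sketch below. It assumes the client has already extracted frames as JPEG bytes by some other means, and that the server accepts base64 data URIs in image_url parts (the usual OpenAI-style shape; not verified against oMLX here). The function names are hypothetical.

```python
import base64
import json
from typing import Any, Dict, List


def frames_to_image_parts(frames: List[bytes], mime: str = "image/jpeg") -> List[Dict[str, Any]]:
    """Encode raw frame bytes as image_url content parts carrying data URIs."""
    parts: List[Dict[str, Any]] = []
    for raw in frames:
        b64 = base64.b64encode(raw).decode("ascii")
        parts.append({"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}})
    return parts


def build_request(model: str, prompt: str, frames: List[bytes]) -> str:
    """Build a chat-completions JSON body: one text part followed by the frames."""
    content = [{"type": "text", "text": prompt}] + frames_to_image_parts(frames)
    return json.dumps({"model": model, "messages": [{"role": "user", "content": content}]})
```

Every client that wants video input today has to carry something like this plus its own frame extraction, which is the duplication the server-side path would remove.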
Extra context:
- omlx/request.py:140 declares videos: Optional[List[Any]] = None and omlx/engine_core.py:282-323 accepts and forwards videos=, but no caller ever sets it. Dead-code plumbing that can be wired up.
- Reproducer (returns 422 today):
curl -sS -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  http://HOST:1234/v1/chat/completions \
  -d '{
    "model": "unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit-general",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What'\''s happening in this clip?"},
        {"type": "video_url", "video_url": {"url": "https://.../clip.mp4"}}
      ]
    }]
  }'