[Feature Request] Native video input in /v1/chat/completions (video_url / input_video) #1076

@0xAlcibiades

Related issues / PRs surveyed

No issue or PR requests video_url / input_video content parts.

Problem: Qwen 3.6-35B-A3B is natively video-capable (VideoMME 86.6, VideoMMMU 83.7), but oMLX exposes none of that. There's no video content-part type in any API model, and a client sending {"type":"video_url",...} either gets a 422 or has the field flattened by the chat template before the encoder ever sees it.

Use case: Any agent or pipeline that wants to send a clip to a video-capable VL model: UI testing harnesses, surveillance/OCR over recorded footage, screen-capture analysis. This comes up today on every Qwen 3.6 / Qwen 3-VL deployment that wants to use the model the way the model card describes.

Proposed solution: Add video_url and input_video content-part types to the OpenAI / Anthropic / Responses content unions in omlx/api/. In the API handlers, decode the URI (HTTP / data URI / local path) and extract uniformly-sampled frames via ffmpeg, which is already a soft dependency per omlx/api/audio_routes.py:40-44. Inject the frames as image_url parts before the engine call. Per-model max_video_frames (default 16) and sampling strategy (uniform | keyframe) live in ModelSettings.
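A rough sketch of the expansion step, to make the proposal concrete. Everything named here (`VideoSettings`, `uniform_timestamps`, `ffmpeg_frame_cmd`, `expand_video_parts`) is hypothetical and not existing oMLX code. It only handles `video_url` parts; `input_video` would follow the same pattern once its shape is settled. The frame extractor is passed in as a callable so the part rewriting can be read without ffmpeg installed; a real one would get the clip duration (e.g. from ffprobe) and run the command below once per timestamp.

```python
import base64
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class VideoSettings:
    # Hypothetical stand-in for the proposed per-model ModelSettings fields.
    max_video_frames: int = 16
    sampling: str = "uniform"  # "uniform" | "keyframe"


def uniform_timestamps(duration_s: float, max_frames: int) -> List[float]:
    """Midpoints of `max_frames` equal-width segments across the clip."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def ffmpeg_frame_cmd(src: str, t: float) -> List[str]:
    """argv that seeks to `t` seconds and writes one JPEG frame to stdout."""
    return ["ffmpeg", "-v", "error", "-ss", f"{t:.3f}", "-i", src,
            "-frames:v", "1", "-f", "image2pipe", "-vcodec", "mjpeg", "-"]


# (source URI, max frames) -> list of JPEG-encoded frames
FrameExtractor = Callable[[str, int], List[bytes]]


def expand_video_parts(
    content: List[Dict[str, Any]],
    extract: FrameExtractor,
    settings: Optional[VideoSettings] = None,
) -> List[Dict[str, Any]]:
    """Replace each video_url part with image_url parts, keeping order."""
    settings = settings or VideoSettings()
    out: List[Dict[str, Any]] = []
    for part in content:
        if part.get("type") != "video_url":
            out.append(part)  # text / image parts pass through untouched
            continue
        url = part["video_url"]["url"]
        frames = extract(url, settings.max_video_frames)
        for jpeg in frames[: settings.max_video_frames]:
            b64 = base64.b64encode(jpeg).decode("ascii")
            out.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            })
    return out
```

With this shape the engine call would see only `text` and `image_url` parts, which is the multi-image path that already works.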

Alternatives: Client-side frame decomposition works (PR #1056 routes multi-image requests correctly to the VLM sidecar, and Qwen 3.6's mRoPE handles multi-image positional encoding natively), but every client has to reinvent it and the server can't deduplicate frames across requests. Re-enabling mlx-vlm's video_processor (currently disabled at omlx/engine/vlm.py:101-156 to avoid a torchvision dependency) was considered but adds ~1GB of bundle weight for a path that ffmpeg+frames already covers.
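To illustrate the deduplication point: if frames are extracted server-side, identical frames could be recognized by content hash and their downstream work reused. The sketch below is an assumption about how that might look, not existing oMLX code; `FrameCache` is a hypothetical name, and what gets cached (for example encoder output) is left to the `compute` callable. It only matches byte-identical frames.

```python
import hashlib
from typing import Callable, Dict


class FrameCache:
    """Maps the SHA-256 of a frame's bytes to an already-computed result."""

    def __init__(self) -> None:
        self._store: Dict[str, object] = {}
        self.hits = 0

    def get_or_compute(self, frame: bytes,
                       compute: Callable[[bytes], object]) -> object:
        key = hashlib.sha256(frame).hexdigest()
        if key in self._store:
            self.hits += 1  # same bytes seen before, reuse the result
            return self._store[key]
        value = compute(frame)
        self._store[key] = value
        return value
```

A production version would need an eviction policy and a size bound; this one grows without limit.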

Extra context:

  • omlx/request.py:140 declares videos: Optional[List[Any]] = None and omlx/engine_core.py:282-323 accepts and forwards videos=, but no caller ever sets it. Dead-code plumbing that can be wired up.

  • Reproducer (returns 422 today):

    curl -sS -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
      http://HOST:1234/v1/chat/completions \
      -d '{
        "model": "unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit-general",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "What'\''s happening in this clip?"},
            {"type": "video_url", "video_url": {"url": "https://.../clip.mp4"}}
          ]
        }]
      }'
