[Feature Request] Native video input in /v1/chat/completions (video_url / input_video)

**Related issues / PRs surveyed**

- #498 (open) — generic "Poor Multimodal Model Support" gap report.
- #591 (open) — sibling FR for audio input on multimodal models.
- #835 (open) — narrower model-load issue for `sam3_video`.
- #154 (closed) — chat-UI ask without backend wiring.
- #1055, #1056 — DFlash + vision routing, image-only.

No issue or PR requests `video_url` / `input_video` content parts.

**Problem:** Qwen 3.6-35B-A3B is natively video-capable (VideoMME 86.6, VideoMMMU 83.7), but oMLX exposes none of that. There's no video content-part type in any API model, and a client sending `{"type":"video_url",...}` either gets a 422 or has the field flattened by the chat template before the encoder ever sees it.

**Use case:** Any agent or pipeline that wants to send a clip to a video-capable VL model: UI testing harnesses, surveillance/OCR over recorded footage, screen-capture analysis. This affects every Qwen 3.6 / Qwen 3-VL deployment today that wants to use the model the way the model card describes.

**Proposed solution:** Add `video_url` and `input_video` content-part types to the OpenAI / Anthropic / Responses content unions in `omlx/api/`. In the API handlers, resolve the URI (HTTP, data URI, or local path) and extract uniformly sampled frames via ffmpeg, which is already a soft dependency per `omlx/api/audio_routes.py:40-44`. Inject the frames as `image_url` parts before the engine call. Per-model `max_video_frames` (default 16) and the sampling strategy (`uniform` | `keyframe`) live in `ModelSettings`.
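A rough sketch of the handler-side flow, to make the proposal concrete. Everything named here is hypothetical and not existing oMLX code: `VideoSettings` stands in for the proposed `ModelSettings` fields, `uniform_timestamps` picks segment midpoints as one possible reading of "uniform", and `extract_frames` is an injected callable so the sketch stays independent of how ffmpeg is actually invoked. Only the `video_url` shape from the reproducer is handled; the `input_video` shape and the `keyframe` strategy are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class VideoSettings:
    """Hypothetical stand-in for the proposed per-model ModelSettings fields."""
    max_video_frames: int = 16
    sampling: str = "uniform"  # "keyframe" is proposed but not sketched here


def uniform_timestamps(duration_s: float, n_frames: int) -> List[float]:
    """Return the midpoints of n_frames equal-length segments of the clip."""
    if duration_s <= 0 or n_frames <= 0:
        return []
    step = duration_s / n_frames
    return [step * (i + 0.5) for i in range(n_frames)]


def ffmpeg_frame_cmd(src: str, t: float) -> List[str]:
    """Build an argv that writes one JPEG frame at time t (seconds) to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-ss", f"{t:.3f}", "-i", src,      # seek on the input, then open it
        "-frames:v", "1",                  # emit a single video frame
        "-f", "image2pipe", "-c:v", "mjpeg", "pipe:1",
    ]


def expand_video_parts(
    content: List[Dict],
    extract_frames: Callable[[str, int], List[str]],
    settings: Optional[VideoSettings] = None,
) -> List[Dict]:
    """Replace each video_url part with image_url parts, preserving order.

    extract_frames(url, max_frames) is assumed to return image URLs
    (for example data URIs) for the sampled frames.
    """
    settings = settings or VideoSettings()
    out: List[Dict] = []
    for part in content:
        if part.get("type") == "video_url":
            urls = extract_frames(part["video_url"]["url"], settings.max_video_frames)
            # Enforce the cap even if the extractor returns too many frames.
            for url in urls[: settings.max_video_frames]:
                out.append({"type": "image_url", "image_url": {"url": url}})
        else:
            out.append(part)
    return out
```

Keeping the extractor injectable means the cap and ordering logic can be unit-tested without ffmpeg installed, which matches its status as a soft dependency.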

**Alternatives:** Client-side frame decomposition works (PR #1056 routes multi-image requests correctly to the VLM sidecar, and Qwen 3.6's mRoPE handles multi-image positional encoding natively), but every client has to reinvent it and the server can't deduplicate frames across requests. Re-enabling mlx-vlm's `video_processor` (currently disabled at `omlx/engine/vlm.py:101-156` to avoid a torchvision dependency) was considered but adds ~1GB of bundle weight for a path that ffmpeg+frames already covers.
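For reference, the client-side workaround amounts to something like the sketch below. It assumes the frames have already been extracted as JPEG bytes and that the server accepts base64 data URIs in `image_url` parts, as OpenAI-compatible endpoints commonly do; `frames_to_message` is a hypothetical helper name, not part of any client library.

```python
import base64
from typing import Dict, List


def frames_to_message(prompt: str, jpeg_frames: List[bytes]) -> Dict:
    """Build one user message: the text prompt followed by one image part per frame."""
    parts: List[Dict] = [{"type": "text", "text": prompt}]
    for frame in jpeg_frames:
        b64 = base64.b64encode(frame).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"role": "user", "content": parts}
```

Every client that wants video today has to carry some version of this plus its own frame extraction, which is the duplication the server-side path would remove.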

**Extra context:**

- `omlx/request.py:140` declares `videos: Optional[List[Any]] = None` and `omlx/engine_core.py:282-323` accepts and forwards `videos=`, but no caller ever sets it. Dead-code plumbing that can be wired up.
- Reproducer (returns 422 today):

  ```bash
  curl -sS -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
    http://HOST:1234/v1/chat/completions \
    -d '{
      "model": "unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit-general",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "What'\''s happening in this clip?"},
          {"type": "video_url", "video_url": {"url": "https://.../clip.mp4"}}
        ]
      }]
    }'
  ```
