feat(mllm): auto-extract audio from video_url on omni models by txdadlab · Pull Request #591 · waybarrios/vllm-mlx

txdadlab · 2026-06-03T22:24:44Z

Summary

For omni models (those exposing a sound_encoder — e.g. Nemotron-H Nano Omni, Qwen2.5-Omni), a video_url is logically an A/V input. Today vllm-mlx feeds only the visual frames to the model; the audio track is silently dropped. This PR wires audio through the existing OpenAI-style content-block path so the model can fuse A/V in a single forward pass.

Bug

Both code paths handling video drop audio:

Path	Where	Result
`_generate_native_video` → `_prepare_native_video_inputs`	`self.processor(text=[text], images=…, videos=…)` — no `audio=`	sound_clips never reach the model
`chat()` / `stream_chat()` fallback	Frame extraction uses `cv2.VideoCapture` — audio track ignored	No audio in `_msg_audio_inputs`

The HF processor on these models accepts audio= and the model expects sound_clips/input_features in its forward pass — but the wiring between them is missing for video inputs.

Fix

Five small changes, all in vllm_mlx/models/mllm.py (+225 / −7):

extract_audio_from_video(video_path) — probes for an audio stream via ffprobe, extracts 16 kHz mono PCM WAV via ffmpeg into a temp file registered with _temp_manager for auto-cleanup. No-op if ffmpeg is missing or the video has no audio.
load() — sets self._video_native_with_audio = hasattr(self.model, "sound_encoder"). Decoupled from _video_native because some omni models (Nemotron-H Omni) don't expose video_token_id and run through the frames-as-images fallback.
_translate_messages_for_native_video — handles audio/audio_url blocks explicitly; auto-extracts audio from video_url when the message doesn't already carry an explicit audio block. Caller-provided audio always wins.
_prepare_native_video_inputs — collects translated audio paths, passes audio= to the HF processor, and forwards sound_clips / input_features / feature_attention_mask / audio_feature_lengths / sound_feature_lengths / sound_attention_mask into gen_kwargs so the omni model's sound encoder gets fed alongside the visual stream.
chat() and stream_chat() fallback paths — extracts audio from video_url for omni models, merges into _msg_audio_inputs so the existing audio plumbing (chat-template counts, all_audio_inputs, mlx_vlm.generate(audio=…)) picks it up unchanged.

Manual verification

Tested with mlx-community/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-nvfp4 on Apple Silicon (M5 Max). The model's audio path itself was broken by an unrelated nvfp4 dtype bug — fixed in upstream Blaizzy/mlx-vlm#1279 (also opened today).

Single request with only a video_url block (30-second 480p clip with speech):

INFO vllm_mlx.models.mllm: Omni model detected: video_url will auto-extract audio for A/V fusion
INFO vllm_mlx.models.mllm: Video: 900 total frames @ 30.0 fps, extracting 60 frames
INFO vllm_mlx.models.mllm: Applying chat template with 1 messages, 60 images, 1 audios

The 1 audios is the auto-extracted track. Before the fix that count was 0. Model output went from "visuals only, transcript inferred from on-screen text" to "verbatim transcript + per-sentence correlation with the on-screen graphic shown when it was said."

Prompt-token cost: +414 tokens for the audio stream (25,229 vs 24,815 before). Wall-time cost: ~8× because the Parakeet encoder runs over 30s of audio — fair trade for actually hearing the source.

Relationship to #352

PR #352 (feat(mllm): extract audio track from video inputs) attempted the same goal but has been DIRTY/CONFLICTING since April 29 with no author activity. This PR is an alternative take that I'd be happy to coordinate with @miguel-flowstate on. Notable differences:

Aspect	#352	This PR
Trigger	Opt-in CLI flag `--extract-audio-from-video` + per-request override	Auto-detected per model via `hasattr(self.model, "sound_encoder")`
Paths covered	Fallback path only	Both `_generate_native_video` and fallback
Explicit `audio_url` alongside `video_url`	Unclear from summary	Honors the explicit choice (skips auto-extract)
Temp-file cleanup	Missing (review issue #1)	Routes through existing `_temp_manager.register()`
ffmpeg + raw URLs	Flagged for SSRF via ffmpeg URL demuxers (review issue #5)	Resolves to local path via `process_video_input()` before ffmpeg ever sees it
`msg_audio_count` per-message bug	Inherited from #351 (review issue #2)	Merges auto-extracted audio into `_msg_audio_inputs` per message
`import os` duplication	Yes (review issue #3)	No

Happy to fold this into #352 if @miguel-flowstate prefers, or to keep them separate. If the maintainers want to close one in favor of the other, fully understand either direction.

Design questions worth flagging

Should auto-extraction be opt-out via a flag? Right now any omni model with a sound_encoder will get its video_url audio fused automatically. That matches what most callers want from an omni model, but if a user explicitly wants frames-only on an omni model they'd need to either (a) send the frames as image_urls and skip video_url, or (b) we add a --no-extract-audio-from-video flag. I lean toward not adding the flag until someone hits the case, but happy to add it if you prefer opt-out semantics.
Should non-omni models with native video (e.g. Qwen-VL) also get the auto-extraction? Currently they don't — gated on sound_encoder. Qwen-VL doesn't have a sound encoder so calling the processor with audio= would error. Treating "has sound_encoder" as the marker is the safe runtime check.
Tests. I held tests for this PR pending design feedback; happy to add a synthetic integration test (generates a 3-second test clip with say+ffmpeg, asserts the log line shows 1 audios) if you'd like that before merge.

Files touched

vllm_mlx/models/mllm.py  +225/-7

No public API changes. No new dependencies (ffmpeg is already an implicit dependency for any video processing in vllm-mlx today).

For multimodal omni models (those exposing a sound_encoder, e.g. Nemotron-H Nano Omni, Qwen2.5-Omni), a video_url is logically an A/V input. Previously vllm-mlx fed only the visual frames to the model: the audio track was silently dropped on both the _generate_native_video path (no audio kwarg ever reached the HF processor) and the fallback path (frame extraction never touched the video's audio stream). The model returned visually-grounded descriptions but never "heard" the video. This wires audio through the existing OpenAI-style content-block path: 1. Add extract_audio_from_video(video_path): probes for an audio stream with ffprobe and, if present, extracts a 16 kHz mono PCM WAV via ffmpeg into a temp file registered with _temp_manager (auto-cleaned alongside the other request temp files). 2. Set self._video_native_with_audio at load time as hasattr(self.model, "sound_encoder"). Decoupled from _video_native because some omni models (Nemotron-H Omni) don't expose video_token_id at config level and run through the frames-as-images fallback path. 3. In _translate_messages_for_native_video: handle audio/audio_url blocks explicitly, and auto-extract audio from any video_url when the message doesn't already carry an explicit audio block. (Explicit caller-provided audio wins.) 4. In _prepare_native_video_inputs: collect translated audio paths, pass audio= to self.processor(...), and forward sound_clips, input_features, feature_attention_mask, audio_feature_lengths, sound_feature_lengths, sound_attention_mask into gen_kwargs so the omni model's sound encoder gets fed alongside the visual stream. 5. In chat() and stream_chat() fallback paths: extract audio from video_url for omni models, merge into _msg_audio_inputs so the existing audio plumbing picks it up. No change for non-omni models. Notes - ffmpeg is invoked only after process_video_input has resolved the user-supplied source to a local path on disk, so raw user URLs are never passed to ffmpeg's URL-protocol demuxers (avoids SSRF via http://, rtsp://, etc.). - Auto-extraction is gated on the runtime presence of sound_encoder. Non-omni models are completely unaffected. - Videos without an audio track are detected by ffprobe and skipped silently. Manual repro (Nemotron-H Nano Omni nvfp4 on Apple Silicon, fixed by mlx-vlm #1279 for the audio path): curl /v1/chat/completions -d '{"model":"nemotron-omni", "messages":[{"role":"user","content":[ {"type":"text","text":"Transcribe verbatim what the speaker says, then name which on-screen graphic was shown when they said each sentence."}, {"type":"video_url","video_url":{"url":"data:video/mp4;base64,..."}} ]}]}' Before: model described visuals only. After: model returns transcript + per-sentence visual correlation; server log shows "Omni model detected: ... A/V fusion" and the chat-template counter reports both images and audios. Related: PR waybarrios#352 attempted the same goal via an opt-in CLI flag on the fallback path only. This change is opt-out-by-omission instead (auto- detected per model), covers both code paths, and addresses the SSRF / temp-leak / per-message-audio-count concerns raised in waybarrios#352's review.

Thump604

Thanks for the detailed write-up. I agree the feature direction is valuable:
for omni models, a video_url can reasonably mean A/V input, and explicit
caller-provided audio should win over auto-extraction.

I do not think this is merge-ready yet. The patch touches five behavioral
surfaces in one file: native video preparation, message translation, non-stream
fallback, stream fallback, and subprocess/temp-file extraction. That needs
automated coverage before it lands.

Specific blockers:

Please add focused tests for the routing contract:
- omni model + video_url with no explicit audio auto-adds one audio input;
- explicit audio / audio_url suppresses video auto-extraction for that
  message;
- non-omni models do not pass audio= to the processor;
- native-video path forwards processor audio outputs into gen_kwargs;
- fallback chat() and stream_chat() merge extracted audio into the
  per-message audio map so template counts and all_audio_inputs stay
  aligned.
The request path now runs ffprobe and ffmpeg synchronously, with a
hardcoded 600s extraction timeout. Even if this is acceptable for the first
implementation, it needs tests around missing ffmpeg, missing audio track,
subprocess timeout/failure, zero-byte output, and temp-file cleanup. A bounded
helper makes that much easier to test.
In the fallback paths, process_video_input() is called once for audio
extraction and _prepare_video() then resolves/processes the same video
again for frames. For remote inputs this can duplicate download/work. Please
either reuse the resolved local video path for both audio and frame
extraction, or document and test why the duplicate processing is intentional.
_video_native_with_audio = hasattr(self.model, "sound_encoder") treats a
present-but-None attribute as enabled. Please make the predicate explicit
enough for model wrappers that define the attribute without a usable encoder,
and add a test.

I would keep this PR focused on the feature, but it needs those tests/design
guards before merge. The manual verification is useful evidence, not a
replacement for regression coverage on these path splits.

Addresses code review feedback on waybarrios#591. Adds regression coverage for the five behavioral surfaces the prior commit introduced, tightens the sound_encoder detection predicate, and removes a duplicate video resolution in the fallback paths. Code changes (vllm_mlx/models/mllm.py): - Extract _model_has_sound_encoder(model) helper using `getattr(..., None) is not None` rather than `hasattr`. Wrappers that declare sound_encoder in __init__ but leave it None until first use were previously enabled by `hasattr` and would crash the processor. (Review blocker waybarrios#4.) - _prepare_video() gains an optional `resolved_path=` kwarg. Callers that already ran process_video_input() can pass it through and skip the second resolve. Default None preserves prior behavior for all other callers. (Review blocker waybarrios#3.) - chat() and stream_chat() fallback paths now resolve each video input exactly once and pass the local path to both extract_audio_from_video and _prepare_video. Eliminates re-download of remote URLs / re-decode of base64. (Review blocker waybarrios#3.) Tests (tests/test_mllm_av_fusion.py, 19 new): - TestSoundEncoderPredicate: helper rejects missing attr and None-valued attr, accepts populated encoder. Documents the hasattr regression. - TestNativeRoutingContract: omni + video_url auto-adds one audio block; explicit audio block suppresses extraction; non-omni does not call the extractor; extraction-returns-None does not leave an empty block. (Review blocker waybarrios#1.) - TestNativePrepareInputsForwardsAudio: omni path passes audio= to the HF processor and forwards sound_clips / input_features / feature_attention_mask into gen_kwargs; non-omni path does neither. (Review blocker waybarrios#1.) - TestFallbackDedupAndMerge: process_video_input called exactly once per video input; resolved path threaded to _prepare_video; extracted audio merged into the per-message audio map for both chat() and stream_chat(). (Review blockers waybarrios#1 and waybarrios#3.) - TestExtractAudioFromVideo: six failure-mode tests for the ffmpeg helper — missing ffmpeg, no audio track, non-zero exit, zero-byte output, subprocess timeout, success-path temp_manager registration. Temp-file cleanup verified on every failure branch. (Review blocker waybarrios#2.) - Integration smoke test: synthesizes a 1-second silent-audio clip with ffmpeg lavfi, runs the real helper end-to-end, asserts a 16 kHz mono PCM WAV is produced and registered/cleaned via _temp_manager. Auto- skipped when ffmpeg or ffprobe is not on PATH. All 19 new tests pass. tests/test_mllm.py and tests/test_video.py (52 tests) still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

txdadlab · 2026-06-09T16:55:36Z

Thanks @Thump604 — good points across the board. Pushed 9a7c587 addressing all four blockers in one focused commit. Quick map of what landed where:

Blocker #1 — routing-contract tests

New tests/test_mllm_av_fusion.py with the five scenarios you called out:

test_omni_video_url_auto_adds_audio — omni + video_url with no explicit audio → exactly one audio block in translated content; extract_audio_from_video called once with the resolved local path.
test_explicit_audio_suppresses_extraction — caller-provided audio_url wins; extract_audio_from_video is never invoked.
test_non_omni_does_not_extract — non-omni models skip extraction entirely and do not pass audio= to the processor.
test_native_path_forwards_sound_clips_for_omni / test_native_path_omits_audio_kwarg_for_non_omni — _prepare_native_video_inputs forwards sound_clips / input_features / feature_attention_mask into gen_kwargs for omni, and emits no audio-related kwargs for non-omni.
TestFallbackDedupAndMerge::test_extracted_audio_merged_into_msg_audio_inputs — chat() and stream_chat() both merge per-message extracted audio into _msg_audio_inputs, keyed correctly across multiple messages and multiple videos per message.

Blocker #2 — ffmpeg helper coverage + integration smoke

extract_audio_from_video was already bounded (timeout, return-code check, zero-byte check, temp-registration), so I kept the function shape and added six failure-mode unit tests that mock subprocess/shutil:

missing ffmpeg
no audio track (ffprobe miss)
non-zero exit
zero-byte output
TimeoutExpired
success path → registered with _temp_manager

Each failure branch asserts no stray vllmmlx_va_*.wav is left behind.

Plus one opt-in integration smoke test (test_extract_audio_from_video_integration_smoke) that synthesizes a 1-second clip via ffmpeg -f lavfi and runs the helper end-to-end, asserting a 16 kHz mono PCM WAV out and proper temp-manager registration + cleanup. Auto-skipped when ffmpeg/ffprobe aren't on PATH so CI without ffmpeg stays green.

Blocker #3 — duplicate video processing

Resolved. _prepare_video() now accepts an optional resolved_path= kwarg; the chat() and stream_chat() fallback paths resolve via process_video_input() once per input and thread that local path to both audio extraction and frame extraction. Default None preserves prior behavior for the two other internal _prepare_video callers. Covered by TestFallbackDedupAndMerge::test_video_resolved_exactly_once_per_input and ::test_resolved_path_threaded_to_prepare_video.

Blocker #4 — predicate

Extracted _model_has_sound_encoder(model) using getattr(model, "sound_encoder", None) is not None. TestSoundEncoderPredicate covers the three states (missing / present-but-None / populated) and explicitly documents that hasattr was True for the None case — the regression you flagged.

Local run: 19/19 new tests pass, 52/52 existing tests in test_mllm.py and test_video.py still pass.

Thump604 · 2026-06-13T12:18:22Z

Thanks, I re-read the pushed commit and the added tests.

The four blockers I raised are addressed in shape now:

routing-contract coverage for omni auto-extract, explicit audio winning, non-omni skip, native forwarding, and fallback merge;
ffmpeg/ffprobe helper failure coverage plus the opt-in smoke;
resolved local video path reuse in the fallback path;
sound_encoder detection tightened to a non-None predicate.

The remaining blocker I see is CI lint only: Black would reformat tests/test_mllm_av_fusion.py and vllm_mlx/models/mllm.py. Please run:

black vllm_mlx/models/mllm.py tests/test_mllm_av_fusion.py

Once that is pushed and CI is green, I’m comfortable approving this from my prior review scope.

Thump604 requested changes Jun 6, 2026

View reviewed changes

waybarrios mentioned this pull request Jun 11, 2026

feat(mllm): extract audio track from video inputs #352

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mllm): auto-extract audio from video_url on omni models#591

feat(mllm): auto-extract audio from video_url on omni models#591
txdadlab wants to merge 2 commits into
waybarrios:mainfrom
txdadlab:feat/omni-av-fusion

txdadlab commented Jun 3, 2026

Uh oh!

Thump604 left a comment

Uh oh!

txdadlab commented Jun 9, 2026

Uh oh!

Thump604 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

txdadlab commented Jun 3, 2026

Summary

Bug

Fix

Manual verification

Relationship to #352

Design questions worth flagging

Files touched

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

txdadlab commented Jun 9, 2026

Uh oh!

Thump604 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants