feat(mllm): auto-extract audio from video_url on omni models#591
feat(mllm): auto-extract audio from video_url on omni models#591txdadlab wants to merge 2 commits into
Conversation
For multimodal omni models (those exposing a sound_encoder, e.g.
Nemotron-H Nano Omni, Qwen2.5-Omni), a video_url is logically an A/V
input. Previously vllm-mlx fed only the visual frames to the model:
the audio track was silently dropped on both the _generate_native_video
path (no audio kwarg ever reached the HF processor) and the fallback
path (frame extraction never touched the video's audio stream). The
model returned visually-grounded descriptions but never "heard" the
video.
This wires audio through the existing OpenAI-style content-block path:
1. Add extract_audio_from_video(video_path): probes for an audio stream
with ffprobe and, if present, extracts a 16 kHz mono PCM WAV via
ffmpeg into a temp file registered with _temp_manager (auto-cleaned
alongside the other request temp files).
2. Set self._video_native_with_audio at load time as
hasattr(self.model, "sound_encoder"). Decoupled from _video_native
because some omni models (Nemotron-H Omni) don't expose
video_token_id at config level and run through the frames-as-images
fallback path.
3. In _translate_messages_for_native_video: handle audio/audio_url
blocks explicitly, and auto-extract audio from any video_url when
the message doesn't already carry an explicit audio block. (Explicit
caller-provided audio wins.)
4. In _prepare_native_video_inputs: collect translated audio paths,
pass audio= to self.processor(...), and forward sound_clips,
input_features, feature_attention_mask, audio_feature_lengths,
sound_feature_lengths, sound_attention_mask into gen_kwargs so the
omni model's sound encoder gets fed alongside the visual stream.
5. In chat() and stream_chat() fallback paths: extract audio from
video_url for omni models, merge into _msg_audio_inputs so the
existing audio plumbing picks it up. No change for non-omni models.
Notes
- ffmpeg is invoked only after process_video_input has resolved the
user-supplied source to a local path on disk, so raw user URLs are
never passed to ffmpeg's URL-protocol demuxers (avoids SSRF via
http://, rtsp://, etc.).
- Auto-extraction is gated on the runtime presence of sound_encoder.
Non-omni models are completely unaffected.
- Videos without an audio track are detected by ffprobe and skipped
silently.
Manual repro (Nemotron-H Nano Omni nvfp4 on Apple Silicon, fixed by
mlx-vlm #1279 for the audio path):
curl /v1/chat/completions -d '{"model":"nemotron-omni",
"messages":[{"role":"user","content":[
{"type":"text","text":"Transcribe verbatim what the speaker
says, then name which on-screen graphic was shown when they
said each sentence."},
{"type":"video_url","video_url":{"url":"data:video/mp4;base64,..."}}
]}]}'
Before: model described visuals only.
After: model returns transcript + per-sentence visual correlation;
server log shows "Omni model detected: ... A/V fusion" and the
chat-template counter reports both images and audios.
Related: PR waybarrios#352 attempted the same goal via an opt-in CLI flag on the
fallback path only. This change is opt-out-by-omission instead (auto-
detected per model), covers both code paths, and addresses the SSRF /
temp-leak / per-message-audio-count concerns raised in waybarrios#352's review.
Thump604
left a comment
There was a problem hiding this comment.
Thanks for the detailed write-up. I agree the feature direction is valuable:
for omni models, a video_url can reasonably mean A/V input, and explicit
caller-provided audio should win over auto-extraction.
I do not think this is merge-ready yet. The patch touches five behavioral
surfaces in one file: native video preparation, message translation, non-stream
fallback, stream fallback, and subprocess/temp-file extraction. That needs
automated coverage before it lands.
Specific blockers:
-
Please add focused tests for the routing contract:
- omni model +
video_urlwith no explicit audio auto-adds one audio input; - explicit
audio/audio_urlsuppresses video auto-extraction for that
message; - non-omni models do not pass
audio=to the processor; - native-video path forwards processor audio outputs into
gen_kwargs; - fallback
chat()andstream_chat()merge extracted audio into the
per-message audio map so template counts andall_audio_inputsstay
aligned.
- omni model +
-
The request path now runs
ffprobeandffmpegsynchronously, with a
hardcoded 600s extraction timeout. Even if this is acceptable for the first
implementation, it needs tests around missingffmpeg, missing audio track,
subprocess timeout/failure, zero-byte output, and temp-file cleanup. A bounded
helper makes that much easier to test. -
In the fallback paths,
process_video_input()is called once for audio
extraction and_prepare_video()then resolves/processes the same video
again for frames. For remote inputs this can duplicate download/work. Please
either reuse the resolved local video path for both audio and frame
extraction, or document and test why the duplicate processing is intentional. -
_video_native_with_audio = hasattr(self.model, "sound_encoder")treats a
present-but-Noneattribute as enabled. Please make the predicate explicit
enough for model wrappers that define the attribute without a usable encoder,
and add a test.
I would keep this PR focused on the feature, but it needs those tests/design
guards before merge. The manual verification is useful evidence, not a
replacement for regression coverage on these path splits.
Addresses code review feedback on waybarrios#591. Adds regression coverage for the five behavioral surfaces the prior commit introduced, tightens the sound_encoder detection predicate, and removes a duplicate video resolution in the fallback paths. Code changes (vllm_mlx/models/mllm.py): - Extract _model_has_sound_encoder(model) helper using `getattr(..., None) is not None` rather than `hasattr`. Wrappers that declare sound_encoder in __init__ but leave it None until first use were previously enabled by `hasattr` and would crash the processor. (Review blocker waybarrios#4.) - _prepare_video() gains an optional `resolved_path=` kwarg. Callers that already ran process_video_input() can pass it through and skip the second resolve. Default None preserves prior behavior for all other callers. (Review blocker waybarrios#3.) - chat() and stream_chat() fallback paths now resolve each video input exactly once and pass the local path to both extract_audio_from_video and _prepare_video. Eliminates re-download of remote URLs / re-decode of base64. (Review blocker waybarrios#3.) Tests (tests/test_mllm_av_fusion.py, 19 new): - TestSoundEncoderPredicate: helper rejects missing attr and None-valued attr, accepts populated encoder. Documents the hasattr regression. - TestNativeRoutingContract: omni + video_url auto-adds one audio block; explicit audio block suppresses extraction; non-omni does not call the extractor; extraction-returns-None does not leave an empty block. (Review blocker waybarrios#1.) - TestNativePrepareInputsForwardsAudio: omni path passes audio= to the HF processor and forwards sound_clips / input_features / feature_attention_mask into gen_kwargs; non-omni path does neither. (Review blocker waybarrios#1.) - TestFallbackDedupAndMerge: process_video_input called exactly once per video input; resolved path threaded to _prepare_video; extracted audio merged into the per-message audio map for both chat() and stream_chat(). (Review blockers waybarrios#1 and waybarrios#3.) - TestExtractAudioFromVideo: six failure-mode tests for the ffmpeg helper — missing ffmpeg, no audio track, non-zero exit, zero-byte output, subprocess timeout, success-path temp_manager registration. Temp-file cleanup verified on every failure branch. (Review blocker waybarrios#2.) - Integration smoke test: synthesizes a 1-second silent-audio clip with ffmpeg lavfi, runs the real helper end-to-end, asserts a 16 kHz mono PCM WAV is produced and registered/cleaned via _temp_manager. Auto- skipped when ffmpeg or ffprobe is not on PATH. All 19 new tests pass. tests/test_mllm.py and tests/test_video.py (52 tests) still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @Thump604 — good points across the board. Pushed 9a7c587 addressing all four blockers in one focused commit. Quick map of what landed where: Blocker #1 — routing-contract tests New
Blocker #2 — ffmpeg helper coverage + integration smoke
Each failure branch asserts no stray Plus one opt-in integration smoke test ( Blocker #3 — duplicate video processing Resolved. Blocker #4 — predicate Extracted Local run: 19/19 new tests pass, 52/52 existing tests in |
|
Thanks, I re-read the pushed commit and the added tests. The four blockers I raised are addressed in shape now:
The remaining blocker I see is CI lint only: Black would reformat black vllm_mlx/models/mllm.py tests/test_mllm_av_fusion.pyOnce that is pushed and CI is green, I’m comfortable approving this from my prior review scope. |
Summary
For omni models (those exposing a
sound_encoder— e.g. Nemotron-H Nano Omni, Qwen2.5-Omni), avideo_urlis logically an A/V input. Today vllm-mlx feeds only the visual frames to the model; the audio track is silently dropped. This PR wires audio through the existing OpenAI-style content-block path so the model can fuse A/V in a single forward pass.Bug
Both code paths handling video drop audio:
_generate_native_video→_prepare_native_video_inputsself.processor(text=[text], images=…, videos=…)— noaudio=chat()/stream_chat()fallbackcv2.VideoCapture— audio track ignored_msg_audio_inputsThe HF processor on these models accepts
audio=and the model expectssound_clips/input_featuresin its forward pass — but the wiring between them is missing for video inputs.Fix
Five small changes, all in
vllm_mlx/models/mllm.py(+225 / −7):extract_audio_from_video(video_path)— probes for an audio stream viaffprobe, extracts 16 kHz mono PCM WAV viaffmpeginto a temp file registered with_temp_managerfor auto-cleanup. No-op ifffmpegis missing or the video has no audio.load()— setsself._video_native_with_audio = hasattr(self.model, "sound_encoder"). Decoupled from_video_nativebecause some omni models (Nemotron-H Omni) don't exposevideo_token_idand run through the frames-as-images fallback._translate_messages_for_native_video— handlesaudio/audio_urlblocks explicitly; auto-extracts audio fromvideo_urlwhen the message doesn't already carry an explicit audio block. Caller-provided audio always wins._prepare_native_video_inputs— collects translated audio paths, passesaudio=to the HF processor, and forwardssound_clips/input_features/feature_attention_mask/audio_feature_lengths/sound_feature_lengths/sound_attention_maskintogen_kwargsso the omni model's sound encoder gets fed alongside the visual stream.chat()andstream_chat()fallback paths — extracts audio fromvideo_urlfor omni models, merges into_msg_audio_inputsso the existing audio plumbing (chat-template counts,all_audio_inputs,mlx_vlm.generate(audio=…)) picks it up unchanged.Manual verification
Tested with
mlx-community/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-nvfp4on Apple Silicon (M5 Max). The model's audio path itself was broken by an unrelated nvfp4 dtype bug — fixed in upstream Blaizzy/mlx-vlm#1279 (also opened today).Single request with only a
video_urlblock (30-second 480p clip with speech):The
1 audiosis the auto-extracted track. Before the fix that count was 0. Model output went from "visuals only, transcript inferred from on-screen text" to "verbatim transcript + per-sentence correlation with the on-screen graphic shown when it was said."Prompt-token cost: +414 tokens for the audio stream (25,229 vs 24,815 before). Wall-time cost: ~8× because the Parakeet encoder runs over 30s of audio — fair trade for actually hearing the source.
Relationship to #352
PR #352 (
feat(mllm): extract audio track from video inputs) attempted the same goal but has been DIRTY/CONFLICTING since April 29 with no author activity. This PR is an alternative take that I'd be happy to coordinate with @miguel-flowstate on. Notable differences:--extract-audio-from-video+ per-request overridehasattr(self.model, "sound_encoder")_generate_native_videoand fallbackaudio_urlalongsidevideo_url_temp_manager.register()process_video_input()before ffmpeg ever sees itmsg_audio_countper-message bug_msg_audio_inputsper messageimport osduplicationHappy to fold this into #352 if @miguel-flowstate prefers, or to keep them separate. If the maintainers want to close one in favor of the other, fully understand either direction.
Design questions worth flagging
sound_encoderwill get itsvideo_urlaudio fused automatically. That matches what most callers want from an omni model, but if a user explicitly wants frames-only on an omni model they'd need to either (a) send the frames asimage_urls and skipvideo_url, or (b) we add a--no-extract-audio-from-videoflag. I lean toward not adding the flag until someone hits the case, but happy to add it if you prefer opt-out semantics.sound_encoder. Qwen-VL doesn't have a sound encoder so calling the processor withaudio=would error. Treating "has sound_encoder" as the marker is the safe runtime check.say+ffmpeg, asserts the log line shows1 audios) if you'd like that before merge.Files touched
No public API changes. No new dependencies (ffmpeg is already an implicit dependency for any video processing in vllm-mlx today).