
feat(mllm): auto-extract audio from video_url on omni models #591

Open
txdadlab wants to merge 2 commits into waybarrios:main from txdadlab:feat/omni-av-fusion


@txdadlab txdadlab commented Jun 3, 2026


Summary

For omni models (those exposing a sound_encoder — e.g. Nemotron-H Nano Omni, Qwen2.5-Omni), a video_url is logically an A/V input. Today vllm-mlx feeds only the visual frames to the model; the audio track is silently dropped. This PR wires audio through the existing OpenAI-style content-block path so the model can fuse A/V in a single forward pass.

Bug

Both code paths handling video drop audio:

| Path | Where | Result |
|---|---|---|
| _generate_native_video → _prepare_native_video_inputs | self.processor(text=[text], images=…, videos=…), with no audio= | sound_clips never reach the model |
| chat() / stream_chat() fallback | Frame extraction uses cv2.VideoCapture; audio track ignored | No audio in _msg_audio_inputs |

The HF processor on these models accepts audio= and the model expects sound_clips/input_features in its forward pass — but the wiring between them is missing for video inputs.

Fix

Five small changes, all in vllm_mlx/models/mllm.py (+225 / −7):

  1. extract_audio_from_video(video_path) — probes for an audio stream via ffprobe, extracts 16 kHz mono PCM WAV via ffmpeg into a temp file registered with _temp_manager for auto-cleanup. No-op if ffmpeg is missing or the video has no audio.

  2. load() — sets self._video_native_with_audio = hasattr(self.model, "sound_encoder"). Decoupled from _video_native because some omni models (Nemotron-H Omni) don't expose video_token_id and run through the frames-as-images fallback.

  3. _translate_messages_for_native_video — handles audio/audio_url blocks explicitly; auto-extracts audio from video_url when the message doesn't already carry an explicit audio block. Caller-provided audio always wins.

  4. _prepare_native_video_inputs — collects translated audio paths, passes audio= to the HF processor, and forwards sound_clips / input_features / feature_attention_mask / audio_feature_lengths / sound_feature_lengths / sound_attention_mask into gen_kwargs so the omni model's sound encoder gets fed alongside the visual stream.

  5. chat() and stream_chat() fallback paths — extracts audio from video_url for omni models, merges into _msg_audio_inputs so the existing audio plumbing (chat-template counts, all_audio_inputs, mlx_vlm.generate(audio=…)) picks it up unchanged.
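
To make step 1 concrete, here is a minimal sketch of the probe-then-extract flow, assuming only the behavior described above (ffprobe check, 16 kHz mono PCM WAV, no-op when ffmpeg is missing or there is no audio track). The names `ffprobe_audio_cmd`, `ffmpeg_wav_cmd`, and `extract_audio_sketch` are hypothetical, the 30-second probe timeout is an assumption, and temp-file registration with `_temp_manager` is deliberately left out; this is not the code in the PR.

```python
import os
import shutil
import subprocess
import tempfile
from typing import List, Optional


def ffprobe_audio_cmd(video_path: str) -> List[str]:
    # Prints one line per audio stream; empty output means no audio track.
    return ["ffprobe", "-v", "error", "-select_streams", "a",
            "-show_entries", "stream=index", "-of", "csv=p=0", video_path]


def ffmpeg_wav_cmd(video_path: str, wav_path: str) -> List[str]:
    # -vn drops video; -ac 1, -ar 16000 and pcm_s16le give 16 kHz mono PCM.
    return ["ffmpeg", "-nostdin", "-y", "-i", video_path, "-vn",
            "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", wav_path]


def extract_audio_sketch(video_path: str, timeout: float = 600.0) -> Optional[str]:
    """Return the path of an extracted WAV, or None if nothing was extracted."""
    if shutil.which("ffmpeg") is None or shutil.which("ffprobe") is None:
        return None  # tools missing: behave as a no-op
    try:
        probe = subprocess.run(ffprobe_audio_cmd(video_path),
                               capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, OSError):
        return None
    if probe.returncode != 0 or not probe.stdout.strip():
        return None  # unreadable input or no audio stream

    fd, wav_path = tempfile.mkstemp(prefix="av_sketch_", suffix=".wav")
    os.close(fd)
    try:
        done = subprocess.run(ffmpeg_wav_cmd(video_path, wav_path),
                              capture_output=True, timeout=timeout)
        ok = done.returncode == 0 and os.path.getsize(wav_path) > 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    if not ok:
        # Never leave a partial or zero-byte file behind.
        if os.path.exists(wav_path):
            os.remove(wav_path)
        return None
    return wav_path  # the caller is responsible for cleanup in this sketch
```

In this sketch every failure branch returns None rather than raising, so a caller can treat "no audio" and "extraction failed" the same way and fall back to frames only.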

Manual verification

Tested with mlx-community/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-nvfp4 on Apple Silicon (M5 Max). The model's audio path itself was broken by an unrelated nvfp4 dtype bug — fixed in upstream Blaizzy/mlx-vlm#1279 (also opened today).

Single request with only a video_url block (30-second 480p clip with speech):

INFO vllm_mlx.models.mllm: Omni model detected: video_url will auto-extract audio for A/V fusion
INFO vllm_mlx.models.mllm: Video: 900 total frames @ 30.0 fps, extracting 60 frames
INFO vllm_mlx.models.mllm: Applying chat template with 1 messages, 60 images, 1 audios

The 1 audios is the auto-extracted track. Before the fix that count was 0. Model output went from "visuals only, transcript inferred from on-screen text" to "verbatim transcript + per-sentence correlation with the on-screen graphic shown when it was said."

Prompt-token cost: +414 tokens for the audio stream (25,229 vs 24,815 before). Wall-time cost: ~8× because the Parakeet encoder runs over 30s of audio — fair trade for actually hearing the source.

Relationship to #352

PR #352 (feat(mllm): extract audio track from video inputs) attempted the same goal but has been DIRTY/CONFLICTING since April 29 with no author activity. This PR is an alternative take that I'd be happy to coordinate with @miguel-flowstate on. Notable differences:

| Aspect | #352 | This PR |
|---|---|---|
| Trigger | Opt-in CLI flag --extract-audio-from-video + per-request override | Auto-detected per model via hasattr(self.model, "sound_encoder") |
| Paths covered | Fallback path only | Both _generate_native_video and fallback |
| Explicit audio_url alongside video_url | Unclear from summary | Honors the explicit choice (skips auto-extract) |
| Temp-file cleanup | Missing (review issue #1) | Routes through existing _temp_manager.register() |
| ffmpeg + raw URLs | Flagged for SSRF via ffmpeg URL demuxers (review issue #5) | Resolves to local path via process_video_input() before ffmpeg ever sees it |
| msg_audio_count per-message bug | Inherited from #351 (review issue #2) | Merges auto-extracted audio into _msg_audio_inputs per message |
| import os duplication | Yes (review issue #3) | No |

Happy to fold this into #352 if @miguel-flowstate prefers, or to keep them separate. If the maintainers want to close one in favor of the other, fully understand either direction.

Design questions worth flagging

  • Should auto-extraction be opt-out via a flag? Right now any omni model with a sound_encoder will get its video_url audio fused automatically. That matches what most callers want from an omni model, but if a user explicitly wants frames-only on an omni model they'd need to either (a) send the frames as image_urls and skip video_url, or (b) we add a --no-extract-audio-from-video flag. I lean toward not adding the flag until someone hits the case, but happy to add it if you prefer opt-out semantics.
  • Should non-omni models with native video (e.g. Qwen-VL) also get the auto-extraction? Currently they don't — gated on sound_encoder. Qwen-VL doesn't have a sound encoder so calling the processor with audio= would error. Treating "has sound_encoder" as the marker is the safe runtime check.
  • Tests. I held tests for this PR pending design feedback; happy to add a synthetic integration test (generates a 3-second test clip with say+ffmpeg, asserts the log line shows 1 audios) if you'd like that before merge.

Files touched

vllm_mlx/models/mllm.py  +225/-7

No public API changes. No new dependencies (ffmpeg is already an implicit dependency for any video processing in vllm-mlx today).

For multimodal omni models (those exposing a sound_encoder, e.g.
Nemotron-H Nano Omni, Qwen2.5-Omni), a video_url is logically an A/V
input. Previously vllm-mlx fed only the visual frames to the model:
the audio track was silently dropped on both the _generate_native_video
path (no audio kwarg ever reached the HF processor) and the fallback
path (frame extraction never touched the video's audio stream). The
model returned visually-grounded descriptions but never "heard" the
video.

This wires audio through the existing OpenAI-style content-block path:

1. Add extract_audio_from_video(video_path): probes for an audio stream
   with ffprobe and, if present, extracts a 16 kHz mono PCM WAV via
   ffmpeg into a temp file registered with _temp_manager (auto-cleaned
   alongside the other request temp files).

2. Set self._video_native_with_audio at load time as
   hasattr(self.model, "sound_encoder"). Decoupled from _video_native
   because some omni models (Nemotron-H Omni) don't expose
   video_token_id at config level and run through the frames-as-images
   fallback path.

3. In _translate_messages_for_native_video: handle audio/audio_url
   blocks explicitly, and auto-extract audio from any video_url when
   the message doesn't already carry an explicit audio block. (Explicit
   caller-provided audio wins.)

4. In _prepare_native_video_inputs: collect translated audio paths,
   pass audio= to self.processor(...), and forward sound_clips,
   input_features, feature_attention_mask, audio_feature_lengths,
   sound_feature_lengths, sound_attention_mask into gen_kwargs so the
   omni model's sound encoder gets fed alongside the visual stream.

5. In chat() and stream_chat() fallback paths: extract audio from
   video_url for omni models, merge into _msg_audio_inputs so the
   existing audio plumbing picks it up. No change for non-omni models.
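
The routing rule in step 3 (explicit caller-provided audio wins, otherwise
each video_url gains one extracted audio block) can be sketched as below.
This is an illustration only: `add_auto_audio`, the `extract` callback, and
the `{"type": "audio", "audio": path}` block shape are assumptions, not the
code in mllm.py, and the callback stands in for "resolve to a local path,
then extract".

```python
from typing import Any, Callable, Dict, List, Optional

Block = Dict[str, Any]
AUDIO_TYPES = {"audio", "audio_url"}


def add_auto_audio(
    content: List[Block],
    is_omni: bool,
    extract: Callable[[str], Optional[str]],
) -> List[Block]:
    """Return a copy of content with auto-extracted audio blocks added."""
    has_explicit_audio = any(b.get("type") in AUDIO_TYPES for b in content)
    if not is_omni or has_explicit_audio:
        # Non-omni models are untouched; explicit audio always wins.
        return list(content)

    out: List[Block] = []
    for block in content:
        out.append(block)
        if block.get("type") == "video_url":
            wav = extract(block["video_url"]["url"])
            if wav is not None:  # skip silently when there is no audio track
                out.append({"type": "audio", "audio": wav})
    return out
```

Checking for explicit audio once per message, before any extraction runs,
keeps the extractor from being invoked at all when the caller has already
chosen an audio source.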

Notes
- ffmpeg is invoked only after process_video_input has resolved the
  user-supplied source to a local path on disk, so raw user URLs are
  never passed to ffmpeg's URL-protocol demuxers (avoids SSRF via
  http://, rtsp://, etc.).
- Auto-extraction is gated on the runtime presence of sound_encoder.
  Non-omni models are completely unaffected.
- Videos without an audio track are detected by ffprobe and skipped
  silently.
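
As a belt-and-braces illustration of the first note, a helper could refuse
anything that still looks like a URL or an option before it is handed to
ffmpeg. The sketch below is hypothetical (`is_local_media_path` is not part
of this change) and assumes that process_video_input has already produced
the path being checked.

```python
import os
from urllib.parse import urlparse


def is_local_media_path(path: str) -> bool:
    """True only for an existing regular file with no URL-style scheme."""
    if path.startswith("-"):
        return False  # would be parsed by ffmpeg as an option
    scheme = urlparse(path).scheme
    # A single letter is a Windows drive ("C:\\..."), not a protocol.
    if len(scheme) > 1:
        return False  # http, rtsp, concat, data, file, ...
    return os.path.isfile(path)
```

Requiring an existing regular file also rejects directories and paths that
were never materialised on disk.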

Manual repro (Nemotron-H Nano Omni nvfp4 on Apple Silicon, fixed by
mlx-vlm #1279 for the audio path):
  curl /v1/chat/completions -d '{"model":"nemotron-omni",
    "messages":[{"role":"user","content":[
      {"type":"text","text":"Transcribe verbatim what the speaker
       says, then name which on-screen graphic was shown when they
       said each sentence."},
      {"type":"video_url","video_url":{"url":"data:video/mp4;base64,..."}}
    ]}]}'

Before: model described visuals only.
After: model returns transcript + per-sentence visual correlation;
server log shows "Omni model detected: ... A/V fusion" and the
chat-template counter reports both images and audios.

Related: PR waybarrios#352 attempted the same goal via an opt-in CLI flag on the
fallback path only. This change is opt-out-by-omission instead (auto-
detected per model), covers both code paths, and addresses the SSRF /
temp-leak / per-message-audio-count concerns raised in waybarrios#352's review.

@Thump604 Thump604 left a comment

Collaborator


Thanks for the detailed write-up. I agree the feature direction is valuable:
for omni models, a video_url can reasonably mean A/V input, and explicit
caller-provided audio should win over auto-extraction.

I do not think this is merge-ready yet. The patch touches five behavioral
surfaces in one file: native video preparation, message translation, non-stream
fallback, stream fallback, and subprocess/temp-file extraction. That needs
automated coverage before it lands.

Specific blockers:

  1. Please add focused tests for the routing contract:

    • omni model + video_url with no explicit audio auto-adds one audio input;
    • explicit audio / audio_url suppresses video auto-extraction for that
      message;
    • non-omni models do not pass audio= to the processor;
    • native-video path forwards processor audio outputs into gen_kwargs;
    • fallback chat() and stream_chat() merge extracted audio into the
      per-message audio map so template counts and all_audio_inputs stay
      aligned.
  2. The request path now runs ffprobe and ffmpeg synchronously, with a
    hardcoded 600s extraction timeout. Even if this is acceptable for the first
    implementation, it needs tests around missing ffmpeg, missing audio track,
    subprocess timeout/failure, zero-byte output, and temp-file cleanup. A bounded
    helper makes that much easier to test.

  3. In the fallback paths, process_video_input() is called once for audio
    extraction and _prepare_video() then resolves/processes the same video
    again for frames. For remote inputs this can duplicate download/work. Please
    either reuse the resolved local video path for both audio and frame
    extraction, or document and test why the duplicate processing is intentional.

  4. _video_native_with_audio = hasattr(self.model, "sound_encoder") treats a
    present-but-None attribute as enabled. Please make the predicate explicit
    enough for model wrappers that define the attribute without a usable encoder,
    and add a test.

I would keep this PR focused on the feature, but it needs those tests/design
guards before merge. The manual verification is useful evidence, not a
replacement for regression coverage on these path splits.

Addresses code review feedback on waybarrios#591. Adds regression coverage for the
five behavioral surfaces the prior commit introduced, tightens the
sound_encoder detection predicate, and removes a duplicate video
resolution in the fallback paths.

Code changes (vllm_mlx/models/mllm.py):

- Extract _model_has_sound_encoder(model) helper using
  `getattr(..., None) is not None` rather than `hasattr`. Wrappers that
  declare sound_encoder in __init__ but leave it None until first use
  were previously enabled by `hasattr` and would crash the processor.
  (Review blocker #4.)

- _prepare_video() gains an optional `resolved_path=` kwarg. Callers
  that already ran process_video_input() can pass it through and skip
  the second resolve. Default None preserves prior behavior for all
  other callers. (Review blocker #3.)

- chat() and stream_chat() fallback paths now resolve each video input
  exactly once and pass the local path to both extract_audio_from_video
  and _prepare_video. Eliminates re-download of remote URLs / re-decode
  of base64. (Review blocker #3.)

Tests (tests/test_mllm_av_fusion.py, 19 new):

- TestSoundEncoderPredicate: helper rejects missing attr and None-valued
  attr, accepts populated encoder. Documents the hasattr regression.

- TestNativeRoutingContract: omni + video_url auto-adds one audio block;
  explicit audio block suppresses extraction; non-omni does not call
  the extractor; extraction-returns-None does not leave an empty block.
  (Review blocker #1.)

- TestNativePrepareInputsForwardsAudio: omni path passes audio= to the
  HF processor and forwards sound_clips / input_features /
  feature_attention_mask into gen_kwargs; non-omni path does neither.
  (Review blocker #1.)

- TestFallbackDedupAndMerge: process_video_input called exactly once
  per video input; resolved path threaded to _prepare_video; extracted
  audio merged into the per-message audio map for both chat() and
  stream_chat(). (Review blockers #1 and #3.)

- TestExtractAudioFromVideo: six failure-mode tests for the ffmpeg
  helper — missing ffmpeg, no audio track, non-zero exit, zero-byte
  output, subprocess timeout, success-path temp_manager registration.
  Temp-file cleanup verified on every failure branch.
  (Review blocker #2.)

- Integration smoke test: synthesizes a 1-second silent-audio clip with
  ffmpeg lavfi, runs the real helper end-to-end, asserts a 16 kHz mono
  PCM WAV is produced and registered/cleaned via _temp_manager. Auto-
  skipped when ffmpeg or ffprobe is not on PATH.

All 19 new tests pass. tests/test_mllm.py and tests/test_video.py
(52 tests) still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

txdadlab commented Jun 9, 2026

Author

Thanks @Thump604 — good points across the board. Pushed 9a7c587 addressing all four blockers in one focused commit. Quick map of what landed where:

Blocker #1 — routing-contract tests

New tests/test_mllm_av_fusion.py with the five scenarios you called out:

  • test_omni_video_url_auto_adds_audio — omni + video_url with no explicit audio → exactly one audio block in translated content; extract_audio_from_video called once with the resolved local path.
  • test_explicit_audio_suppresses_extraction — caller-provided audio_url wins; extract_audio_from_video is never invoked.
  • test_non_omni_does_not_extract — non-omni models skip extraction entirely and do not pass audio= to the processor.
  • test_native_path_forwards_sound_clips_for_omni / test_native_path_omits_audio_kwarg_for_non_omni: _prepare_native_video_inputs forwards sound_clips / input_features / feature_attention_mask into gen_kwargs for omni, and emits no audio-related kwargs for non-omni.
  • TestFallbackDedupAndMerge::test_extracted_audio_merged_into_msg_audio_inputs — chat() and stream_chat() both merge per-message extracted audio into _msg_audio_inputs, keyed correctly across multiple messages and multiple videos per message.
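
For readers skimming the thread, the forwarding that the native-path tests assert can be pictured roughly like this. It is a sketch, not the code in mllm.py: `forward_audio_outputs` is a hypothetical name, and the only thing taken from the PR is the list of keys.

```python
from typing import Any, Dict, Mapping

# Key names as listed in the PR description.
AUDIO_KEYS = (
    "sound_clips",
    "input_features",
    "feature_attention_mask",
    "audio_feature_lengths",
    "sound_feature_lengths",
    "sound_attention_mask",
)


def forward_audio_outputs(
    processor_outputs: Mapping[str, Any], gen_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Copy whichever audio tensors the processor produced into gen_kwargs."""
    for key in AUDIO_KEYS:
        value = processor_outputs.get(key)
        if value is not None:
            gen_kwargs[key] = value
    return gen_kwargs
```

Copying only keys that are present means a non-omni processor output, which has none of them, leaves gen_kwargs without any audio-related entries.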

Blocker #2 — ffmpeg helper coverage + integration smoke

extract_audio_from_video was already bounded (timeout, return-code check, zero-byte check, temp-registration), so I kept the function shape and added six failure-mode unit tests that mock subprocess/shutil:

  • missing ffmpeg
  • no audio track (ffprobe miss)
  • non-zero exit
  • zero-byte output
  • TimeoutExpired
  • success path → registered with _temp_manager

Each failure branch asserts no stray vllmmlx_va_*.wav is left behind.
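
For anyone who wants the flavor of those tests without opening the file, here is a minimal sketch of the mocking pattern. `run_tool` is a hypothetical stand-in for the bounded helper, not extract_audio_from_video itself, and the test names are illustrative.

```python
import shutil
import subprocess
import unittest
from unittest import mock


def run_tool(cmd, timeout):
    """Hypothetical bounded runner: True only if the tool exists and exits 0."""
    if shutil.which(cmd[0]) is None:
        return False
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return done.returncode == 0


class RunToolFailureModes(unittest.TestCase):
    def test_missing_binary_never_spawns(self):
        with mock.patch("shutil.which", return_value=None), \
             mock.patch("subprocess.run") as run:
            self.assertFalse(run_tool(["ffmpeg", "-version"], timeout=1))
            run.assert_not_called()

    def test_timeout_is_swallowed(self):
        boom = subprocess.TimeoutExpired(cmd="ffmpeg", timeout=1)
        with mock.patch("shutil.which", return_value="/usr/bin/ffmpeg"), \
             mock.patch("subprocess.run", side_effect=boom):
            self.assertFalse(run_tool(["ffmpeg", "-version"], timeout=1))

    def test_non_zero_exit(self):
        failed = mock.Mock(returncode=1)
        with mock.patch("shutil.which", return_value="/usr/bin/ffmpeg"), \
             mock.patch("subprocess.run", return_value=failed):
            self.assertFalse(run_tool(["ffmpeg", "-version"], timeout=1))
```

Because both shutil.which and subprocess.run are patched, none of these cases needs ffmpeg installed on the machine running the suite.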

Plus one opt-in integration smoke test (test_extract_audio_from_video_integration_smoke) that synthesizes a 1-second clip via ffmpeg -f lavfi and runs the helper end-to-end, asserting a 16 kHz mono PCM WAV out and proper temp-manager registration + cleanup. Auto-skipped when ffmpeg/ffprobe aren't on PATH so CI without ffmpeg stays green.

Blocker #3 — duplicate video processing

Resolved. _prepare_video() now accepts an optional resolved_path= kwarg; the chat() and stream_chat() fallback paths resolve via process_video_input() once per input and thread that local path to both audio extraction and frame extraction. Default None preserves prior behavior for the two other internal _prepare_video callers. Covered by TestFallbackDedupAndMerge::test_video_resolved_exactly_once_per_input and ::test_resolved_path_threaded_to_prepare_video.
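
In sketch form, the resolve-once threading looks like the following. The function names and return shapes here are hypothetical and only mirror the idea of the resolved_path= kwarg; the real _prepare_video does frame extraction that is omitted here.

```python
from typing import Callable, Dict, Optional, Tuple


def prepare_video_sketch(
    source: str,
    resolve: Callable[[str], str],
    resolved_path: Optional[str] = None,
) -> Dict[str, str]:
    # Resolve only when the caller did not already do it.
    path = resolved_path if resolved_path is not None else resolve(source)
    return {"frames_from": path}


def fallback_inputs_sketch(
    source: str,
    resolve: Callable[[str], str],
    extract_audio: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], Optional[str]]:
    local_path = resolve(source)  # the single download / base64 decode
    audio = extract_audio(local_path)
    video = prepare_video_sketch(source, resolve, resolved_path=local_path)
    return video, audio
```

Keeping the default at None is what lets callers that never resolved the input continue to work unchanged.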

Blocker #4 — predicate

Extracted _model_has_sound_encoder(model) using getattr(model, "sound_encoder", None) is not None. TestSoundEncoderPredicate covers the three states (missing / present-but-None / populated) and explicitly documents that hasattr was True for the None case — the regression you flagged.
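
The difference between the two predicates is easy to see on three toy models. The classes below are made up for illustration; only the getattr expression is taken from the commit.

```python
def model_has_sound_encoder(model) -> bool:
    # True only when the attribute exists AND holds something usable.
    return getattr(model, "sound_encoder", None) is not None


class NoEncoder:
    """Non-omni model: attribute missing entirely."""


class LazyEncoder:
    """Wrapper that declares the attribute but leaves it None."""

    def __init__(self):
        self.sound_encoder = None


class OmniModel:
    """Omni model with a populated encoder (placeholder object)."""

    def __init__(self):
        self.sound_encoder = object()


for m in (NoEncoder(), LazyEncoder(), OmniModel()):
    print(type(m).__name__, hasattr(m, "sound_encoder"), model_has_sound_encoder(m))
```

Only the middle case differs between the two checks: hasattr reports True for LazyEncoder while the getattr-based predicate reports False.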

Local run: 19/19 new tests pass, 52/52 existing tests in test_mllm.py and test_video.py still pass.

@Thump604

Collaborator

Thanks, I re-read the pushed commit and the added tests.

The four blockers I raised are addressed in shape now:

  • routing-contract coverage for omni auto-extract, explicit audio winning, non-omni skip, native forwarding, and fallback merge;
  • ffmpeg/ffprobe helper failure coverage plus the opt-in smoke;
  • resolved local video path reuse in the fallback path;
  • sound_encoder detection tightened to a non-None predicate.

The remaining blocker I see is CI lint only: Black would reformat tests/test_mllm_av_fusion.py and vllm_mlx/models/mllm.py. Please run:

black vllm_mlx/models/mllm.py tests/test_mllm_av_fusion.py

Once that is pushed and CI is green, I’m comfortable approving this from my prior review scope.
