
feat(parse): implement video key frame extraction with metadata #943

Open
mvanhorn wants to merge 1 commit into volcengine:main from mvanhorn:osc/372-video-keyframe-extraction


@mvanhorn
Contributor

Description

Implement video processing in VideoParser using opencv-python-headless. Replaces three stubs:

  • _extract_metadata(): extracts duration, resolution, fps, frame count from video files via cv2.VideoCapture
  • _extract_keyframes(): captures frames at configurable intervals (default 10s), returns (timestamp, jpeg_bytes) tuples
  • _generate_video_description(): produces structured markdown with metadata and keyframe timeline

Also wires real metadata into parse() so ResourceNode gets actual video dimensions instead of placeholder zeros.
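
The metadata step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `extract_metadata`, the `open_capture` parameter, and the dictionary keys are assumptions chosen for the example. Only `cv2.VideoCapture`, its `isOpened()`, `get()`, and `release()` methods, and the `CAP_PROP_*` property ids come from OpenCV itself.

```python
from typing import Any, Callable, Dict, Optional

try:
    import cv2  # provided by opencv-python-headless
except ImportError:  # optional dependency not installed
    cv2 = None

# Numeric values of cv2.CAP_PROP_FRAME_WIDTH, CAP_PROP_FRAME_HEIGHT,
# CAP_PROP_FPS and CAP_PROP_FRAME_COUNT, so the sketch also loads without cv2.
PROP_WIDTH, PROP_HEIGHT, PROP_FPS, PROP_FRAME_COUNT = 3, 4, 5, 7

# Placeholder values returned when nothing can be read (hypothetical keys).
EMPTY_METADATA = {"duration": 0.0, "width": 0, "height": 0, "fps": 0.0, "frame_count": 0}


def extract_metadata(
    path: str, open_capture: Optional[Callable[[str], Any]] = None
) -> Dict[str, Any]:
    """Read basic video metadata, falling back to zeros when that is not possible."""
    if open_capture is None:
        if cv2 is None:
            return dict(EMPTY_METADATA)  # graceful fallback without OpenCV
        open_capture = cv2.VideoCapture
    cap = open_capture(path)
    try:
        if not cap.isOpened():
            return dict(EMPTY_METADATA)
        fps = float(cap.get(PROP_FPS) or 0.0)
        frame_count = int(cap.get(PROP_FRAME_COUNT) or 0)
        return {
            # Duration is derived from frame count and fps; guard against fps == 0.
            "duration": frame_count / fps if fps > 0 else 0.0,
            "width": int(cap.get(PROP_WIDTH) or 0),
            "height": int(cap.get(PROP_HEIGHT) or 0),
            "fps": fps,
            "frame_count": frame_count,
        }
    finally:
        cap.release()  # always release the capture handle
```

The `open_capture` hook exists only so the sketch can be exercised with a stand-in object; calling `extract_metadata("clip.mp4")` with OpenCV installed would use `cv2.VideoCapture` directly.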

Related Issue

Relates to #372

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add _extract_metadata() in video.py using cv2 for duration/resolution/fps
  • Add _extract_keyframes() in video.py for periodic frame capture with max_frames cap (30)
  • Replace _generate_video_description() stub with metadata + keyframe timeline output
  • Wire _extract_metadata() into parse() to populate real metadata values
  • Add [video] optional dependency group in pyproject.toml (pip install openviking[video])
  • Add 7 tests in tests/parse/test_video_keyframes.py
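
A rough sketch of the periodic capture with a frame cap, for readers who want to see the shape of the loop. The helper names (`keyframe_timestamps`, `extract_keyframes`, `cv2_encode_jpeg`) and the choice to stop at the first `max_frames` samples are assumptions for illustration; the PR may cap frames differently. The OpenCV calls used (`VideoCapture.set`, `VideoCapture.read`, `cv2.imencode`) are real.

```python
from typing import Any, Callable, List, Tuple

POS_MSEC = 0  # numeric value of cv2.CAP_PROP_POS_MSEC


def keyframe_timestamps(
    duration: float, interval: float = 10.0, max_frames: int = 30
) -> List[float]:
    """Timestamps (seconds) at 0, interval, 2*interval, ... below duration, capped."""
    if duration <= 0 or interval <= 0 or max_frames <= 0:
        return []
    stamps: List[float] = []
    i = 0
    while i * interval < duration and len(stamps) < max_frames:
        stamps.append(i * interval)
        i += 1
    return stamps


def cv2_encode_jpeg(frame: Any) -> bytes:
    """Encode a decoded frame as JPEG bytes using OpenCV."""
    import cv2  # imported lazily so the module loads without OpenCV

    ok, buf = cv2.imencode(".jpg", frame)
    if not ok:
        raise ValueError("JPEG encoding failed")
    return buf.tobytes()


def extract_keyframes(
    cap: Any,
    duration: float,
    encode_jpeg: Callable[[Any], bytes] = cv2_encode_jpeg,
    interval: float = 10.0,
    max_frames: int = 30,
) -> List[Tuple[float, bytes]]:
    """Seek to each timestamp and return (timestamp, jpeg_bytes) tuples."""
    frames: List[Tuple[float, bytes]] = []
    for ts in keyframe_timestamps(duration, interval, max_frames):
        cap.set(POS_MSEC, ts * 1000.0)  # seek by position in milliseconds
        ok, frame = cap.read()
        if not ok:
            continue  # skip frames that cannot be decoded
        frames.append((ts, encode_jpeg(frame)))
    return frames
```

With the defaults, a 35-second clip yields samples at 0, 10, 20, and 30 seconds, and anything longer than about five minutes stops at 30 frames.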

Testing

  • Tests mock cv2.VideoCapture to verify metadata extraction, keyframe timing, and graceful fallback
  • ruff format and ruff check pass
  • All cv2.VideoCapture instances are released in finally blocks to prevent resource leaks
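
To illustrate the testing approach, here is a small self-contained sketch of checking that a capture is released in a `finally` block, using `unittest.mock.MagicMock` as a stand-in for `cv2.VideoCapture`. The function `read_frame_count` and both check functions are hypothetical and are not taken from `tests/parse/test_video_keyframes.py`; the real tests may patch `cv2.VideoCapture` directly instead of injecting a factory.

```python
from unittest.mock import MagicMock

FRAME_COUNT = 7  # numeric value of cv2.CAP_PROP_FRAME_COUNT


def read_frame_count(path, open_capture):
    """Hypothetical function under test: reads one property and always releases."""
    cap = open_capture(path)
    try:
        return int(cap.get(FRAME_COUNT))
    finally:
        cap.release()


def check_release_on_success():
    cap = MagicMock()
    cap.get.return_value = 300.0
    factory = MagicMock(return_value=cap)
    assert read_frame_count("clip.mp4", factory) == 300
    factory.assert_called_once_with("clip.mp4")
    cap.release.assert_called_once_with()


def check_release_on_error():
    cap = MagicMock()
    cap.get.side_effect = RuntimeError("decode failed")
    try:
        read_frame_count("clip.mp4", MagicMock(return_value=cap))
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    # The finally block must run even when reading fails.
    cap.release.assert_called_once_with()
```

The second check is the one that guards against resource leaks: it fails if `release()` is moved out of the `finally` block.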

Design decisions

  • opencv-python-headless over moviepy/ffmpeg: lighter dependency, pure pip install, no system binary required. Headless variant avoids pulling in GUI dependencies.
  • Max 30 keyframes: caps memory usage for long videos. Configurable via the method parameter.
  • Metadata in parse(): the existing parse() returned zeros for duration/width/height/fps. It now returns real values when cv2 is available and zeros otherwise (backward compatible).
  • No VLM calls yet: this PR adds frame extraction only. VLM scene description per keyframe can be added in a follow-up once the extraction pipeline is validated.
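
Because no VLM is involved yet, the generated description can be built purely from the metadata and the keyframe timestamps. The following is a minimal sketch of what such structured markdown output might look like; the function names, heading text, and field labels are assumptions for illustration and are not the actual output format of `_generate_video_description()`.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS (minutes are not wrapped into hours)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def generate_video_description(
    name: str, meta: Dict[str, float], keyframe_times: List[float]
) -> str:
    """Build a markdown summary from metadata and keyframe timestamps only."""
    lines = [
        f"# Video: {name}",
        "",
        "## Metadata",
        "",
        f"- Duration: {format_timestamp(meta['duration'])}",
        f"- Resolution: {int(meta['width'])}x{int(meta['height'])}",
        f"- FPS: {meta['fps']:.2f}",
        f"- Frames: {int(meta['frame_count'])}",
        "",
        "## Keyframe timeline",
        "",
    ]
    if keyframe_times:
        lines.extend(
            f"- {format_timestamp(t)}: keyframe {i}"
            for i, t in enumerate(keyframe_times, start=1)
        )
    else:
        lines.append("- No keyframes extracted")
    return "\n".join(lines)
```

A follow-up that adds VLM scene descriptions could replace the `keyframe {i}` placeholder with generated text per timestamp without changing the surrounding structure.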

This PR is the video part of the multimodal parsing trilogy: audio ASR (#805), image OCR (#942), and video processing.

This contribution was developed with AI assistance (Claude Code).

Replace the _generate_video_description() stub with a working OpenCV
integration. Adds _extract_metadata() for duration/resolution/fps and
_extract_keyframes() for periodic frame capture at configurable intervals.

Wires real metadata into parse() so ResourceNode gets actual video
dimensions instead of zeros. Degrades gracefully when opencv-python-headless
is not installed. Added as optional dependency: pip install openviking[video]

Relates to volcengine#372
@github-actions

Failed to generate code suggestions for PR

@mvanhorn
Contributor Author

Same build CI issue as #942; see #942 (comment). Adding the video extras to pyproject.toml triggers the build matrix, which has a pre-existing "No module named pip" failure. Lint, tests, and CLA checks all pass.

@qin-ctx self-assigned this Mar 25, 2026