⚠️ IMPORTANT: This project uses Moondream2 (2025-01-09 release), CLIP, Llama 3.1 8B Instruct, and Whisper large-v3-turbo via the Hugging Face Transformers library.
💡 NOTE: This project offers two options for content synthesis:
- OpenAI GPT-4o API (default, recommended if you don't have access to Llama 3.1 8B Instruct yet)
- Local Meta-Llama-3.1-8B-Instruct (recommended if you want to run everything locally; requires requesting access to the model from Meta on its Hugging Face repository)
⚠️ AUTHENTICATION: When using OpenAI, make sure to set your API key in the `.env` file with the key `OPENAI_API_KEY`.
- Quick Start
- Features
- Architecture Overview
- Process Flow
- Installation
- Usage
- Model Prompts
- Output Format
- Frame Analysis Visualization
- Requirements
- Performance Considerations
- Troubleshooting
- Video Output Format
- Recent Updates
- Advanced Features & Limitations
A powerful video summarization tool that combines multiple AI models to provide comprehensive video understanding through audio transcription, intelligent frame selection, visual description, and content summarization.
See the Installation section for a more detailed guide on how to install the dependencies (ffmpeg, pyvips, etc.) on different devices and get things running.
# Create a virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
.\venv\Scripts\activate # Windows
# Install dependencies and run with default settings
pip install -r requirements.txt
# Login to Hugging Face (required for model access)
huggingface-cli login
# Install libvips and ffmpeg (Linux)
sudo apt-get install libvips libvips-dev ffmpeg
# Run with web interface (recommended)
python main.py --web
# Run with all features enabled (including local LLM) on 'video.mp4'
python main.py video.mp4 --frame-selection --local --save --synthesis-captions --transcribe --debug
# Run with all features enabled (including local LLM) on 'inputs' folder
python main.py inputs --frame-selection --local --save --synthesis-captions --transcribe --debug
- Intelligent frame selection using CLIP embeddings and clustering
- High-quality audio transcription using Whisper large-v3-turbo
- Visual scene description using Moondream2
- Dynamic content synthesis with GPT-4o/Llama 3.1
- Interactive web interface
- Accessible video output with captions
- Local LLM support with Meta-Llama-3.1-8B-Instruct
- Intelligent frame descriptions with timestamps
- High-accuracy speech transcription
- Adaptive text wrapping and positioning
- Accessibility-focused caption design
- Choice between detailed or synthesized captions
- Original audio preservation with proper synchronization
- Full Frame Descriptions
- Detailed descriptions for every key frame
- Technical visual details
- More frequent updates
- Timestamps included
- Synthesis Captions (highly recommended)
- Context-aware, narrative-focused captions
- Dynamic quantity based on video length
- Automatic deduplication of close captions (< 1.2s apart)
- Better for storytelling and overview
- Clean text without timestamps
- High contrast caption backgrounds (70% opacity)
- Responsive font sizing
- Frame descriptions: Standard size
- Speech transcriptions: 4 sizes larger
- Automatic text wrapping
- Minimum readable text size
- Caption persistence between transitions
- Clear timestamp indicators
- Separated visual and speech captions
- Frame descriptions at top
- Speech transcriptions centered near bottom
- Tight background boxes for speech transcriptions
- Original audio track preservation
- Hosted LLM (default)
- Uses OpenAI's GPT-4o
- Requires API key
- Faster processing
- Higher reliability
- Integrates video metadata
- Local LLM (new!)
- Uses Meta-Llama-3.1-8B-Instruct
- No API key required
- Full offline operation
- Automatic fallback to hosted LLM
- ~8GB GPU memory required
- Compatible response format
- Supports metadata context
- Chunk-level metadata integration
graph TD
A[Input Video] --> B[Frame Selection]
A --> C[Audio Processing]
A --> MD[Metadata Extraction]
subgraph "Frame Selection Pipeline"
B --> D[CLIP Embeddings]
D --> E[Similarity Analysis]
E --> F[Key Frame Detection]
F --> G[Frame Clustering]
G --> H[Cache Management]
end
subgraph "Audio Pipeline"
C --> AA[Audio Stream Check]
AA -->|Has Audio| AE[Audio Extraction]
AA -->|No Audio| AB[Empty Transcript]
AE --> I[Whisper large-v3-turbo]
I --> J[Timestamped Transcript]
AB --> J
end
subgraph "Metadata Pipeline"
MD --> ME[FFprobe Extraction]
ME --> MF[Parse Metadata]
MF --> MG[Format Context]
end
F --> L[Selected Frames]
G --> L
L --> M[Moondream VLM]
M --> N[Frame Descriptions]
J --> O[Content Synthesis]
N --> O
MG --> O
subgraph "Summarization Pipeline"
O --> P{Model Selection}
P -->|Local| Q[Llama 3.1]
P -->|Hosted| R[GPT-4o]
Q --> S[XML Validation]
R --> S
S --> T[Summary + Captions]
end
subgraph "Video Generation"
T --> U[Frame Description Overlay]
J --> V[Speech Transcript Overlay]
U --> AI[Background Layer]
AI --> AJ[Text Layer]
V --> AI
AJ --> W[Video Assembly]
A --> X[Audio Stream]
X --> AC[Audio Check]
AC -->|Has Audio| Y[FFmpeg Merge]
AC -->|No Audio| AD[Direct Output]
W --> Y
W --> AD
Y --> AK[Final Video]
AD --> AK
end
sequenceDiagram
participant User
participant Main
participant FrameSelection
participant AudioProc
participant Whisper
participant Moondream
participant LLM
participant VideoGen
User->>Main: process_video_web(video_file)
activate Main
Main->>Main: Check audio stream
par Frame Analysis
Main->>FrameSelection: process_video(video_path)
activate FrameSelection
FrameSelection->>FrameSelection: load_model()
FrameSelection->>FrameSelection: get_or_compute_embeddings()
Note over FrameSelection: Check cache first
FrameSelection->>FrameSelection: process_batch()
FrameSelection->>FrameSelection: sliding_window_filter()
FrameSelection->>FrameSelection: find_interesting_frames()
Note over FrameSelection: novelty_threshold=0.08<br/>min_skip=10<br/>n_clusters=15
FrameSelection-->>Main: key_frame_numbers
deactivate FrameSelection
and Audio Processing
Main->>AudioProc: extract_audio()
activate AudioProc
AudioProc->>AudioProc: Check audio stream
AudioProc-->>Main: audio_path
deactivate AudioProc
Main->>Whisper: model.transcribe(audio_path)
activate Whisper
Note over Whisper: Raw MP3 input<br/>No preprocessing<br/>Direct transcription
Whisper-->>Main: timestamped_transcript
deactivate Whisper
end
Main->>Moondream: describe_frames(video_path, frame_numbers)
activate Moondream
Note over Moondream: Batch process (8 frames)<br/>Generate descriptions
Moondream-->>Main: frame_descriptions
deactivate Moondream
Main->>LLM: summarize_with_hosted_llm(transcript, descriptions)
activate LLM
Note over LLM: Generate summary<br/>Create synthesis captions<br/>Validate XML format<br/>Use earliest timestamps
LLM-->>Main: synthesis_output
deactivate LLM
Main->>Main: parse_synthesis_output()
Note over Main: Extract summary<br/>Parse captions<br/>Validate format
Main->>VideoGen: create_captioned_video()
activate VideoGen
Note over VideoGen: Create summary intro (5s)<br/>Process main video<br/>Add overlays<br/>Handle captions
VideoGen->>VideoGen: Filter close captions
Note over VideoGen: min_time_gap=1.2s<br/>Adjust timing -0.5s
VideoGen->>VideoGen: Create background layer
VideoGen->>VideoGen: Add opaque text layer
Note over VideoGen: Separate rendering passes<br/>70% background opacity<br/>100% text opacity
VideoGen->>VideoGen: Add frame descriptions
VideoGen->>VideoGen: Add speech transcripts
Note over VideoGen: Position & style<br/>Handle hallucinations<br/>ASCII text normalization
VideoGen->>VideoGen: FFmpeg concat & merge
Note over VideoGen: Handle missing audio<br/>Proper audio sync
VideoGen-->>Main: output_video_path
deactivate VideoGen
Main-->>User: summary, gallery, video_path
deactivate Main
video-understanding-engine/
├── main.py # Main entry point
├── frame_selection.py # CLIP-based frame selection
├── requirements.txt # Dependencies
├── .env # Environment variables
├── prompts/ # Model prompts
│ ├── moondream_prompt.md
│ └── synthesis_prompt.md
├── logs/ # Output logs
├── frame_analysis_plots/ # Frame analysis visualizations
└── embedding_cache/ # Cached CLIP embeddings
- Video Input Processing (ffprobe sketch below)
- Input: Video file
- Output: Frame data, audio stream, and metadata
- Supported formats: MP4, AVI, MOV, MKV
- Maximum recommended length: 6-8 minutes
- Automatic audio stream detection
- Metadata extraction using ffprobe:
- Title, artist, duration
- Format tags and properties
- Custom metadata fields
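The metadata step boils down to a single ffprobe call. Here is a minimal sketch; the function name and the exact fields returned are illustrative assumptions, not the project's implementation:

```python
import json
import subprocess

def extract_metadata(video_path: str) -> dict:
    """Illustrative ffprobe wrapper returning format-level metadata."""
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        video_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    probe = json.loads(result.stdout)
    fmt = probe.get("format", {})
    return {
        "duration": float(fmt.get("duration", 0.0)),
        "format_tags": fmt.get("tags", {}),  # title, artist, custom fields
        "has_audio": any(s.get("codec_type") == "audio" for s in probe.get("streams", [])),
    }
```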
- Frame Selection (when using `--frame-selection`; selection sketch below)
- Model: CLIP (ViT-SO400M-14-SigLIP-384)
- Process:
- Extract frames from video
- Generate CLIP embeddings for each frame (bfloat16 precision)
- Calculate frame similarities
- Detect novel frames using sliding window analysis (window_size=30)
- Cluster similar frames using KMeans (n_clusters=15)
- Select representative frames with stricter novelty threshold
- Output: List of key frame numbers (~20% fewer than previous version)
- Cache: Embeddings stored in `embedding_cache/<video_name>.npy`
- Default mode: Samples every 50 frames when not using CLIP selection
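To make the selection logic concrete, here is a hedged sketch of the novelty-plus-clustering idea using scikit-learn's KMeans; the parameter values match those listed in this README, but the helper itself is an illustration, not the repo's `find_interesting_frames()` implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(embeddings: np.ndarray,
                      novelty_threshold: float = 0.08,
                      min_skip: int = 10,
                      n_clusters: int = 15) -> list[int]:
    """Illustrative key-frame selection from per-frame CLIP embeddings
    (embeddings: (num_frames, dim) array, assumed L2-normalized)."""
    # Novelty pass: keep a frame when its similarity to the last kept frame
    # drops below 1 - novelty_threshold and it is at least min_skip frames away.
    kept = [0]
    for i in range(1, len(embeddings)):
        similarity = float(embeddings[i] @ embeddings[kept[-1]])
        if similarity < 1.0 - novelty_threshold and i - kept[-1] >= min_skip:
            kept.append(i)

    # Clustering pass: add one representative frame per KMeans cluster.
    k = min(n_clusters, len(embeddings))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    representatives = [int(np.where(labels == c)[0][0]) for c in range(k)]

    return sorted(set(kept) | set(representatives))
```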
- Audio Transcription (transcription sketch below)
- Model: Whisper large-v3-turbo
- Process:
- Extract audio from video
- Generate transcript with timestamps
- Apply hallucination detection (words/second analysis)
- Output: JSON with `{start, end, text}` segments
- Supported languages: 100+ languages (auto-detected)
- Accuracy: ~95% for clear English speech
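The transcription call itself is small. A minimal sketch assuming the openai-whisper package from the dependency list (the model name string may need to be "turbo" on older whisper releases):

```python
import whisper  # openai-whisper

def transcribe_audio(audio_path: str) -> list[dict]:
    """Illustrative transcription returning {start, end, text} segments."""
    model = whisper.load_model("large-v3-turbo")
    result = model.transcribe(audio_path)  # raw MP3 in, no preprocessing
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```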
- Frame Description (description sketch below)
- Model: Moondream2 (vikhyatk/moondream2)
- Input: Selected frames (either from CLIP selection or regular sampling)
- Process:
- Batch process frames (8 at a time)
- Generate detailed descriptions
- Output: Frame descriptions with timestamps
- Batch size: Configurable (default: 8)
- Memory usage: ~4GB GPU RAM
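A hedged sketch of the description step is below. It assumes the Transformers `trust_remote_code` interface of the 2025-01-09 Moondream2 revision (the `query` method); check docs.moondream.ai if your revision exposes a different API. The real pipeline batches 8 frames at a time, while this sketch loops frame by frame for clarity:

```python
import cv2
from PIL import Image
from transformers import AutoModelForCausalLM

# Assumed loading pattern for the 2025-01-09 Moondream2 revision.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", revision="2025-01-09",
    trust_remote_code=True, device_map={"": "cuda"},
)

def describe_frames(video_path: str, frame_numbers: list[int], prompt: str) -> list[dict]:
    """Illustrative per-frame description with timestamps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    results = []
    for n in frame_numbers:
        cap.set(cv2.CAP_PROP_POS_FRAMES, n)
        ok, frame = cap.read()
        if not ok:
            continue
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        answer = model.query(image, prompt)["answer"]  # method name assumed for this revision
        results.append({"frame_number": n, "timestamp": n / fps, "description": answer})
    cap.release()
    return results
```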
- Content Synthesis (synthesis sketch below)
- Models:
- Local (to be implemented): Meta-Llama-3.1-8B-Instruct
- Hosted: gpt-4o
- Input:
- Timestamped transcript
- Frame descriptions
- Video metadata context
- Process:
- Extract and format metadata context
- For long videos:
- Process chunks with metadata context
- Generate chunk summaries
- Create final summary with metadata
- For short videos:
- Direct synthesis with metadata context
- Output:
- Metadata-enriched video summary
- Dynamic number of synthesized captions
- Strict XML format with validation
- Error handling with fallback to frame descriptions
- Token limit: 4096 tokens
- Temperature: 0.3
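For orientation, here is a hedged sketch of the hosted synthesis call with the settings above (temperature 0.3, 4096-token limit). The OpenAI client usage is standard; the XML tag names and parsing are assumptions about the output convention, not the project's exact format:

```python
import os
import re
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def synthesize(transcript_block: str, descriptions_block: str, system_prompt: str):
    """Illustrative hosted synthesis: returns (summary, captions)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        max_tokens=4096,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": (
                f"<transcript>\n{transcript_block}\n</transcript>\n"
                f"<frame_descriptions>\n{descriptions_block}\n</frame_descriptions>"
            )},
        ],
    )
    raw = response.choices[0].message.content
    # Assumed output convention: <summary>...</summary> plus <caption timestamp="...">...</caption> tags.
    summary = re.search(r"<summary>(.*?)</summary>", raw, re.S)
    captions = [
        {"timestamp": float(t), "text": text.strip()}
        for t, text in re.findall(r'<caption timestamp="([\d.]+)">(.*?)</caption>', raw, re.S)
    ]
    return (summary.group(1).strip() if summary else raw, captions)
```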
- Video Generation (assembly sketch below)
- Input: Original video, descriptions, summary, transcript
- Process:
- Create 5-second summary intro
- Process main video with captions
- Concatenate using FFmpeg
- Features:
- Predictive caption timing (0.5s early display)
- Minimum 1.2s gap between captions
- ASCII-compatible text rendering
- Graceful handling of missing audio
- Smart timestamp selection
- Adaptive font scaling
- High-contrast overlays
- Centered transcript positioning
- Individual background boxes
- Automatic audio stream detection
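The final assembly can be pictured as an FFmpeg mux that takes the caption-burned video and re-attaches the original audio when present. A hedged sketch (the exact command layout is illustrative):

```python
import subprocess

def merge_final_video(captioned_video: str, original_video: str,
                      output_path: str, has_audio: bool) -> None:
    """Illustrative final mux: h264 video, original audio re-attached if present."""
    if has_audio:
        cmd = [
            "ffmpeg", "-y",
            "-i", captioned_video,              # silent, caption-burned video
            "-i", original_video,               # source of the original audio track
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "libx264", "-c:a", "aac",   # web-compatible output
            "-shortest", output_path,
        ]
    else:
        # Graceful handling of missing audio: output video only.
        cmd = ["ffmpeg", "-y", "-i", captioned_video, "-c:v", "libx264", output_path]
    subprocess.run(cmd, check=True)
```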
- Recent Improvements
- Enhanced timing and synchronization:
- Predictive caption timing with 0.5s early display
- Smart timestamp selection using earliest scene detection
- Minimum 1.2s gap between captions
- Automatic audio stream detection
- Visual enhancements:
- Resolution-based font scaling (SD/HD/2K+)
- Improved contrast with semi-transparent overlays (70% opacity)
- Centered transcript text with individual boxes
- Dynamic padding and margins based on resolution
- Robustness improvements:
- Graceful handling of missing audio streams
- Fallback to frame descriptions if synthesis fails
- Smart handling of caption timing edge cases
- Comprehensive error handling
- ASCII-compatible text normalization
- Quality of life:
- Automatic video property detection
- Progress tracking
- Debug logging
- Informative status messages
- Web-compatible output format (h264)
- Gallery View (Web Interface)
- Always shows frame descriptions (not synthesis captions)
- Includes frame numbers and timestamps
- Used for debugging frame selection
- Helps visualize key frame detection
- Independent from video output format
- Clone the repository:
git clone https://github.com/yourusername/video-understanding-engine.git
cd video-understanding-engine
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # Linux/Mac
.\venv\Scripts\activate   # Windows
- Login to Hugging Face:
# Install Hugging Face CLI if not already installed
pip install huggingface_hub
# Login with your token from https://huggingface.co/settings/tokens
huggingface-cli login
# Linux/Ubuntu
sudo apt-get update
sudo apt-get install ffmpeg libvips libvips-dev
# macOS with Homebrew
brew install ffmpeg vips
# Windows
# 1. Download and install FFmpeg:
# - Go to https://ffmpeg.org/download.html
# - Download the Windows build
# - Extract the ZIP file
# - Add the bin folder to your system PATH
#
# 2. Install libvips:
# - Go to https://github.com/libvips/build-win64/releases
# - Download vips-dev-w64-all-8.16.0.zip for 64-bit or vips-dev-w32-all-8.16.0.zip for 32-bit
# - Extract the ZIP file
# - Copy all DLL files from vips-dev-8.16\bin to either:
# - Your project's root directory OR
# - C:\Windows\System32 (requires admin privileges)
# - Add to PATH:
# - Open System Properties -> Advanced -> Environment Variables
# - Under System Variables, find PATH
# - Add the full path to the vips-dev-8.16\bin directory
pip install -r requirements.txt
For detailed setup instructions and model information, visit docs.moondream.ai/quick-start. The Moondream model requires specific configurations and dependencies that are documented there.
Create a `.env` file:
OPENAI_API_KEY=your_api_key_here # Only needed if not using local LLM
python main.py --web
The web interface provides:
- Drag-and-drop video upload
- Progress tracking
- Interactive results viewing
- Download options for JSON output
- Caption style selection:
- Full frame descriptions
- Synthesized captions
- Frame selection toggle
Basic usage:
python main.py <video_path>
Advanced options:
python main.py <video_path> [--save] [--local] [--frame-selection] [--web] [--synthesis-captions] [--transcribe] [--debug]
Options explained:
- `--save`: Save all outputs to JSON (includes transcript, descriptions, summary)
- `--local`: (under development, not working yet) Use local Llama model instead of hosted LLM
- `--frame-selection`: Use CLIP-based intelligent frame selection
- `--web`: Launch web interface (highly recommended)
- `--synthesis-captions`: Use synthesized narrative captions (recommended for better viewing experience)
- `--transcribe`: Show speech transcriptions in the output video (optional, transcripts are still generated for captions)
- `--debug`: Enable debug mode with additional metadata overlay and detailed logging
Example commands:
# Launch web interface (recommended for single video processing, easy to use)
python main.py --web
# Process single video with all features including debug mode
python main.py video.mp4 --frame-selection --local --save --synthesis-captions --transcribe --debug
# Process video with debug mode for troubleshooting
python main.py video.mp4 --debug
# Process video with captions but without visible transcriptions
python main.py video.mp4 --frame-selection --synthesis-captions
# Process all videos in a folder with transcriptions
python main.py /path/to/folder --frame-selection --transcribe
# Process 'inputs' folder with all features
python main.py inputs --frame-selection --local --save --synthesis-captions --transcribe --debug
# Quick processing with hosted LLM and synthesis captions
python main.py video.mp4 --save --synthesis-captions
Folder Processing Features:
- Automatically processes all videos in the specified folder
- Supports common video formats (.mp4, .avi, .mov, .mkv, .webm)
- Shows progress (e.g., "Processing video 1/5")
- Retries up to 10 times if a video fails
- Continues to next video even if one fails
- Maintains consistent output structure in the `outputs/` directory
Now supports Meta-Llama-3.1-8B-Instruct for local content synthesis:
# Run with local LLM
python main.py video.mp4 --local
# Or enable in web interface
python main.py --web # Then check "Use Local LLM" option
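For reference, here is a minimal sketch of what local synthesis with the Transformers pipeline could look like. The chat formatting mirrors the hosted settings; the loading details are assumptions, and fitting the model in roughly 8GB of GPU memory would require 8-bit/4-bit quantization rather than the bf16 load shown here:

```python
import torch
from transformers import pipeline

# Assumed loading pattern; quantization (e.g. bitsandbytes) is needed for ~8GB GPUs.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def synthesize_locally(system_prompt: str, user_content: str) -> str:
    """Illustrative local synthesis call mirroring the hosted settings."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
    out = generator(messages, max_new_tokens=4096, do_sample=True, temperature=0.3)
    return out[0]["generated_text"][-1]["content"]
```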
Location: prompts/moondream_prompt.md
Describe this frame in detail, focusing on key visual elements, actions, and any text visible in the frame. Be specific but concise.
Location: prompts/synthesis_prompt.md
You are a video content analyzer. Your task is to create a comprehensive summary of a video based on its transcript and frame descriptions.
The input will be provided in two sections:
<transcript>
Timestamped transcript of the video's audio
</transcript>
<frame_descriptions>
Timestamped descriptions of key frames from the video
</frame_descriptions>
Create a detailed summary that:
1. Captures the main topics and themes
2. Highlights key visual and auditory information
3. Maintains chronological flow
4. Integrates both visual and audio elements coherently
Be concise but comprehensive. Focus on the most important information.
When using `--save` or `--debug`, outputs are saved in `logs/` with the following structure:
{
"video_path": "path/to/video",
"frame_count": 1000,
"metadata": {
"title": "Video Title",
"artist": "Creator Name",
"duration": 120.5,
"format_tags": {
"key1": "value1",
"key2": "value2"
}
},
"transcript": [
{
"start": 0.0,
"end": 2.5,
"text": "Transcribed text..."
}
],
"moondream_prompt": "Prompt used for frame description",
"frame_descriptions": [
{
"frame_number": 100,
"timestamp": 4.0,
"description": "Frame description..."
}
],
"summary": "Final video summary...",
"total_run_time": 120.5,
"synthesis": {
"raw_output": "Raw synthesis output with XML tags",
"captions": [
{
"timestamp": 0.0,
"text": "Caption text..."
}
]
}
}
When using `--frame-selection`, analysis plots are saved in `frame_analysis_plots/` showing:
- Sequential frame similarities
- Sliding window differences
- Identified key frames
- Cluster representative frames
Plot interpretation:
- Blue line: Frame-to-frame similarity
- Orange line: Sliding window differences
- Red dots: Novel frames
- Green dots: Cluster representatives
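If you want to reproduce a similar plot from your own data, here is a minimal matplotlib sketch following the color scheme above; the array names are placeholders, not variables from the repo:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_frame_analysis(similarities, window_diffs, novel_frames, cluster_reps, out_path):
    """Illustrative version of the frame-analysis plot described above."""
    x = np.arange(len(similarities))
    plt.figure(figsize=(12, 4))
    plt.plot(x, similarities, color="tab:blue", label="Frame-to-frame similarity")
    plt.plot(x, window_diffs, color="tab:orange", label="Sliding window differences")
    plt.scatter(novel_frames, [similarities[i] for i in novel_frames],
                color="red", zorder=3, label="Novel frames")
    plt.scatter(cluster_reps, [similarities[i] for i in cluster_reps],
                color="green", zorder=3, label="Cluster representatives")
    plt.xlabel("Frame index")
    plt.legend()
    plt.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close()
```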
- Python 3.10+
- CUDA-compatible GPU (recommended)
- Minimum 16GB RAM
- Storage:
- ~10GB for models
- Additional space for video processing
- FFmpeg (for video processing)
- Internet connection (for hosted LLM)
- `torch`: GPU acceleration
- `whisper`: Audio transcription
- `transformers`: LLM models
- `open_clip`: Frame analysis
- `gradio`: Web interface
- `opencv-python`: Video processing
- `numpy`: Numerical operations
- `matplotlib`: Visualization
- See `requirements.txt` for the complete list
- Frame Selection (cache sketch below)
- CLIP embeddings are cached in `embedding_cache/`
- First run on a video will be slower
- Subsequent runs use cached embeddings
- Cache format: NumPy arrays (.npy)
- Cache size: ~2MB per minute of video
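The caching pattern is the standard check-then-compute flow. A hedged sketch named after `get_or_compute_embeddings()` from the sequence diagram, with a hypothetical `compute_clip_embeddings()` helper standing in for the actual CLIP pass:

```python
import os
import numpy as np

def get_or_compute_embeddings(video_path: str, cache_dir: str = "embedding_cache") -> np.ndarray:
    """Illustrative cache check: load cached CLIP embeddings if present, else compute and save."""
    os.makedirs(cache_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(video_path))[0]
    cache_path = os.path.join(cache_dir, f"{name}.npy")
    if os.path.exists(cache_path):
        return np.load(cache_path)                    # subsequent runs reuse the cache
    embeddings = compute_clip_embeddings(video_path)  # hypothetical helper; first run is slower
    np.save(cache_path, embeddings)
    return embeddings
```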
- Model Loading
- Models are loaded/unloaded to manage memory
- GPU memory requirements:
- Whisper large-v3-turbo: ~4GB
- Moondream: ~4GB
- Llama (if local): ~8GB
- CLIP: ~2GB
- Total peak memory: ~12GB (with local LLM)
- Out of Memory
- Reduce batch size in `describe_frames()`
- New `--debug` flag for enhanced troubleshooting
- Comprehensive metadata overlay
- Real-time processing information
- Visual debug indicators
- Improved error tracking
- Enhanced timing and synchronization:
- Predictive caption timing with 0.5s early display
- Smart timestamp selection using earliest scene detection
- Minimum 1.2s gap between captions
- Automatic audio stream detection
- NEW: Simplified Audio Pipeline
- Removed all audio manipulation logic for better reliability
- Direct MP3 extraction from video
- Raw MP3 passed to Whisper model
- Cleaner transcription results
- Reduced complexity and potential failure points
- Better handling of audio streams
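The direct MP3 extraction described above amounts to a single ffmpeg command. A hedged sketch named after `extract_audio()` from the sequence diagram (the flags are illustrative):

```python
import subprocess

def extract_audio(video_path: str, audio_path: str = "audio.mp3") -> str | None:
    """Illustrative direct MP3 extraction; returns None when there is no audio stream."""
    cmd = ["ffmpeg", "-y", "-i", video_path, "-vn",
           "-acodec", "libmp3lame", "-q:a", "2", audio_path]
    try:
        subprocess.run(cmd, check=True, capture_output=True)
    except subprocess.CalledProcessError:
        return None  # no audio stream -> fall back to an empty transcript
    return audio_path
```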
- NEW: Enhanced Retry Logic
- Increased retry attempts from 3 to 5 for all synthesis operations
- Better exponential backoff timing
- Improved error handling and recovery
- Separate retries for:
- Chunk synthesis
- Final summary generation
- Short video synthesis
- API calls
- More reliable caption generation
- Better handling of synthesis failures
- NEW: Better Summary Extraction
- Improved parsing of synthesis output
- Proper extraction of summary from XML format
- Cleaner logging of summaries
- Fixed summary display in web interface
- Better error messages for synthesis failures
- Proper validation of output format
- Detection threshold: 0.08 words/second
- Minimum skip: 10 frames
- Number of clusters: 15
- Automatic removal of non-speech segments
- Post-processing for better transcription quality
- Predictive caption timing: 0.5s early display
- Minimum 1.2s gap between captions
- Fallback to frame descriptions if synthesis fails
- Smart handling of caption timing edge cases
- Adaptive font scaling based on video resolution
- Maximum recommended video length: 6-8 minutes
- Automatic audio stream detection
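The caption timing rules listed above (minimum 1.2s gap, 0.5s early display) can be sketched as a small filter; this is an assumption about how such a filter might look, not the project's exact logic:

```python
def filter_captions(captions: list[dict], min_gap: float = 1.2, lead: float = 0.5) -> list[dict]:
    """Illustrative timing filter: drop captions closer than min_gap seconds,
    then shift the survivors lead seconds earlier for predictive display."""
    kept: list[dict] = []
    for cap in sorted(captions, key=lambda c: c["timestamp"]):
        if kept and cap["timestamp"] - kept[-1]["timestamp"] < min_gap:
            continue  # deduplicate captions that are too close together
        kept.append(cap)
    return [{**cap, "timestamp": max(0.0, cap["timestamp"] - lead)} for cap in kept]
```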
- Detailed metadata overlay in video output:
- Video Info: Duration, Resolution, FPS, Size
- Caption Info: Video Type, Target Captions, Captions/Second, Calculation
- Enhanced logging of processing steps
- Visual indicators in debug mode:
- "DEBUG" prefix in attribution text
- Metadata overlay in top-left corner
- Automatic metadata extraction and display
- Real-time calculation visualization
- Helps with:
- Troubleshooting caption timing
- Verifying video properties
- Understanding caption calculations
- Monitoring processing flow