A powerful scene-based video retrieval system for searching through ~1,400 Vietnamese news and current affairs videos using AI-powered semantic understanding.
The Scene Video Query System is an end-to-end solution that enables semantic search across a large collection of news videos. By breaking videos into meaningful scenes and using state-of-the-art vision transformers, the system allows users to find specific moments within videos using natural language queries or similar images.
- 🎬 Automatic Scene Detection - Intelligent segmentation using TransNetV2
- 🖼️ Smart Keyframe Extraction - 3 representative frames per scene
- 🧠 Deep Learning Embeddings - ViT-L-14 for semantic understanding
- ⚡ Fast Retrieval - FAISS-powered vector search (<50ms)
- 🌐 Multi-Modal Search - Support for text and image queries
- 📊 Large Scale - Handles ~1,400 videos with thousands of scenes
┌─────────┐    ┌─────────────────┐    ┌─────────────┐    ┌──────────┐    ┌──────────┐
│  Video  │───▶│   TransNetV2    │───▶│   Extract   │───▶│ ViT-L-14 │───▶│  FAISS   │
│  Input  │    │ Scene Detection │    │  Keyframes  │    │ Features │    │  Index   │
└─────────┘    └─────────────────┘    └─────────────┘    └──────────┘    └──────────┘
                                             │                │               │
                                             ▼                ▼               ▼
                                      3 frames/scene  Vector embeddings   Retrieval
| Component | Technology | Purpose |
|---|---|---|
| Scene Detection | TransNetV2 | Identifies scene boundaries in videos |
| Keyframe Extraction | Custom Logic | Selects 3 representative frames per scene |
| Feature Extraction | ViT-L-14 | Generates semantic vector embeddings |
| Vector Indexing | FAISS | Builds efficient similarity search index |
| Retrieval System | Cosine Similarity | Returns most relevant scenes |
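The keyframe step in the table can be illustrated with a minimal sketch. The README only says "Custom Logic", so the even-spacing rule below is an assumption rather than the project's actual selection method, and `pick_keyframes` is a hypothetical helper name.

```python
def pick_keyframes(start: int, end: int, count: int = 3) -> list[int]:
    """Return `count` frame indices spread evenly inside the scene [start, end]."""
    if end < start:
        raise ValueError("end must be >= start")
    length = end - start + 1
    # Take the centre of each of `count` equal segments of the scene.
    # Very short scenes may yield the same index more than once.
    return [start + (2 * i + 1) * length // (2 * count) for i in range(count)]


# Hypothetical (start_frame, end_frame) pairs, as a scene detector might report them.
scenes = [(0, 89), (90, 239)]
for start, end in scenes:
    print(start, end, pick_keyframes(start, end))
```

Sampling segment centres avoids the first and last frames of a scene, which often sit on a transition. A real implementation might instead pick frames by sharpness or visual diversity.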
git clone https://github.com/Hieuabssy/scene-video-Query-System.git
cd scene-video-Query-System
conda create --name frontend python=3.11
conda activate frontend
cd front-end
pip install -r requirement.txt
conda create --name backend python=3.11
conda activate backend
cd ../back-end
pip install -r requirement.txt
- Download faiss_index-001.idx
- Download folder_index.json and image_paths.npy
These files are available on my drive. Place them as shown in the Project Structure section: the FAISS index file goes in back-end/index/, and the other two files go in back-end/imgPath/.
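Before starting the services, it can help to confirm the downloaded files are where the backend expects them. The sketch below is only a convenience check, not part of the repository. The paths are taken from the Project Structure section (including its `image-paths.npy` spelling, so adjust the name if your download differs), and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Expected locations relative to the repository root (assumed from the project tree).
EXPECTED_FILES = [
    "back-end/index/faiss_index-001.idx",
    "back-end/imgPath/folder_index.json",
    "back-end/imgPath/image-paths.npy",
]


def missing_files(repo_root: str) -> list[str]:
    """Return the expected files that are not present under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All index files are in place.")
```

Run it from the repository root after downloading the files.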
- In front-end:
python mmain.py
- In back-end:
python -m uvicorn main:app --reload
This system is designed for Vietnamese news and current affairs videos from YouTube:
- Total Videos: ~1,400
- Domain: News, documentaries, current events
- Format: MP4
- Language: Vietnamese content
Metadata for the videos is in front-end/media.
scene-video-Query-System/
├── back-end/
│   ├── image_tmp/            # input images for image retrieval
│   ├── imgPath/              # path of each image on Google Drive
│   │   ├── folder_index.json
│   │   └── image-paths.npy
│   ├── FILE.py               # connects to Google Drive
│   ├── main.py               # backend logic
│   └── index/                # contains the FAISS index file
│       └── faiss_index-001.idx
├── front-end/
│   ├── transnetv2/           # scene detection model
│   ├── media.json/           # metadata for all ~1,400 videos
│   ├── Neighbor/             # keyframes retrieved by neighbor search
│   ├── Result/               # keyframes retrieved by text query
│   └── mmain.py              # Gradio front end
├── transnetv2_2.ipynb        # keyframe extraction
├── transnettv2.py            # transition detection
├── transnetv2-weights/       # saved_model for transnetv2.py
├── README.md
└── LICENSE
- Input: MP4 video
- Output: Scene boundary predictions (file.txt)
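A minimal sketch of reading the scene boundary file follows. It assumes each non-empty line holds a start and an end frame index separated by whitespace. The README does not specify the layout of the output file, so treat that format, the helper names `parse_scenes` and `frames_to_seconds`, and the 25 fps value as assumptions.

```python
def parse_scenes(text: str) -> list[tuple[int, int]]:
    """Parse lines of 'start end' frame indices into (start, end) tuples."""
    scenes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(parts)}")
        start, end = int(parts[0]), int(parts[1])
        if end < start:
            raise ValueError(f"line {line_no}: end frame precedes start frame")
        scenes.append((start, end))
    return scenes


def frames_to_seconds(scene: tuple[int, int], fps: float) -> tuple[float, float]:
    """Convert an inclusive frame range to a (start, end) time span in seconds."""
    start, end = scene
    return start / fps, (end + 1) / fps


# Hypothetical file contents; in practice, read the text from the detector's output file.
sample = "0 89\n90 239\n\n240 301\n"
for scene in parse_scenes(sample):
    print(scene, frames_to_seconds(scene, fps=25.0))
```

Converting frame ranges to timestamps is what lets a retrieved keyframe be mapped back to a moment in the source video.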
- Architecture: Vision Transformer Large (patch 14)
- Embedding dimension: 768
- Pre-trained: CLIP/OpenCLIP weights
- Index type: IVFFlat (Inverted File with Flat quantizer)
- Distance metric: Cosine similarity
- Number of clusters: 100
- Search probe: 10
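To show what the cluster and probe settings control, here is a toy pure-Python inverted-file search. It is an illustration of the idea only, not the FAISS API and not the project's code: centroids are supplied by hand instead of being learned with k-means, and `ToyIVF` is a hypothetical class. Vectors are L2-normalized so that the inner product equals cosine similarity. In FAISS terms, "Number of clusters" corresponds to `nlist` and "Search probe" to `nprobe`.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyIVF:
    """Toy inverted-file index: one list of vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = [normalize(c) for c in centroids]
        self.lists = [[] for _ in centroids]

    def add(self, item_id, vector):
        v = normalize(vector)
        # Store the vector in the list of its most similar centroid.
        best = max(range(len(self.centroids)), key=lambda i: dot(v, self.centroids[i]))
        self.lists[best].append((item_id, v))

    def search(self, query, k, nprobe):
        q = normalize(query)
        # Visit only the `nprobe` lists whose centroids are closest to the query.
        ranked = sorted(range(len(self.centroids)),
                        key=lambda i: dot(q, self.centroids[i]), reverse=True)
        candidates = [entry for i in ranked[:nprobe] for entry in self.lists[i]]
        scored = sorted(((dot(q, v), item_id) for item_id, v in candidates), reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


index = ToyIVF(centroids=[[1, 0], [0, 1], [-1, 0]])
for item_id, vec in [("a", [1, 0.1]), ("b", [0.9, 0.3]), ("c", [0.1, 1]),
                     ("d", [-1, 0.2]), ("e", [0.6, 0.7])]:
    index.add(item_id, vec)

query = [1, 0.6]
print([i for i, _ in index.search(query, k=2, nprobe=1)])
print([i for i, _ in index.search(query, k=2, nprobe=2)])
```

With one probe the search sees only the nearest cluster and can miss a better match stored in a neighbouring one. Probing more clusters improves recall at the cost of more comparisons, which is the trade-off behind choosing 10 probes out of 100 clusters.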
- TransNetV2: Souček & Lokoč (2020)
- Vision Transformer: Dosovitskiy et al. (2020)
- CLIP: Radford et al. (2021)
- FAISS: Johnson et al. (2019)
Hieu Nguyen
- GitHub: @Hieuabssy
- Repository: scene-video-Query-System
If you find this project useful, please consider giving it a star! ⭐
- ✨ Initial release
- ✅ TransNetV2 scene detection
- ✅ ViT-L-14 feature extraction
- ✅ FAISS vector indexing
- ✅ Text and image-based retrieval
- ✅ Support for ~1,400 Vietnamese news videos

