Scene Video Query System

A powerful scene-based video retrieval system for searching through ~1,400 Vietnamese news and current affairs videos using AI-powered semantic understanding.

🎯 Overview

The Scene Video Query System is an end-to-end solution for semantic search across a large collection of news videos. It splits each video into scenes and embeds keyframes with a ViT-L-14 vision transformer, so users can find specific moments within videos using natural language queries or similar images.

Key Features

🎬 Automatic Scene Detection - Intelligent segmentation using TransNetV2
🖼️ Smart Keyframe Extraction - 3 representative frames per scene
🧠 Deep Learning Embeddings - ViT-L-14 for semantic understanding
⚡ Fast Retrieval - FAISS-powered vector search (<50ms)
🌐 Multi-Modal Search - Support for text and image queries
📊 Large Scale - Handles ~1,400 videos with thousands of scenes

🏗️ System Architecture

┌─────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────┐    ┌──────────┐
│  Video  │───▶│  TransNetV2  │───▶│   Extract   │───▶│  ViT-L-14│───▶│  FAISS   │
│  Input  │    │Scene Detection│   │ Keyframes   │    │ Features │    │  Index   │
└─────────┘    └──────────────┘    └─────────────┘    └──────────┘    └──────────┘
                                          │                   │              │
                                          ▼                   ▼              ▼
                                    3 frames/scene      Vector embeddings  Retrieval

Pipeline Components

Component	Technology	Purpose
Scene Detection	TransNetV2	Identifies scene boundaries in videos
Keyframe Extraction	Custom Logic	Selects 3 representative frames per scene
Feature Extraction	ViT-L-14	Generates semantic vector embeddings
Vector Indexing	FAISS	Builds efficient similarity search index
Retrieval System	Cosine Similarity	Returns most relevant scenes
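
The table lists keyframe extraction only as "Custom Logic", so the actual selection rule is not documented here. As a minimal sketch, assuming each scene is an inclusive (start, end) frame range and that the 3 frames are taken at evenly spaced positions, the idea could look like this. The helper name pick_keyframes and the spacing rule are assumptions for illustration, not the repository's code.

```python
def pick_keyframes(start: int, end: int, count: int = 3) -> list[int]:
    """Return up to `count` evenly spaced frame indices inside [start, end].

    Assumption: frames are sampled at the centres of `count` equal
    segments, which keeps them away from the transitions at the scene edges.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    length = end - start + 1
    if length <= count:
        # Short scene: every frame becomes a keyframe.
        return list(range(start, end + 1))
    # Centre of the i-th of `count` equal segments, as an integer offset.
    return [start + (2 * i + 1) * length // (2 * count) for i in range(count)]


if __name__ == "__main__":
    # A 90-frame scene yields one frame from each third of the scene.
    print(pick_keyframes(0, 89))
```

Sampling segment centres rather than the first, middle, and last frame is one possible design choice: frames right at a scene boundary are more likely to be caught in a fade or cut.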

🚀 Installation

1. Clone the Repository

git clone https://github.com/Hieuabssy/scene-video-Query-System.git
cd scene-video-Query-System

2. Create Virtual Environments for the Frontend and Backend

conda create --name frontend python=3.11
conda activate frontend
cd front-end
pip install -r requirement.txt

conda create --name backend python=3.11
conda activate backend
cd ../back-end
pip install -r requirement.txt

3. Download the Index Files

Download faiss_index-001.idx
Download folder_index.json and image_paths.npy

These files are available on my drive. Place them in the locations shown in the Project Structure section.
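
Before starting the backend, it can help to confirm that the downloaded files landed in the right folders. The following is a small sketch, not part of the repository: the helper missing_files is hypothetical, and the relative paths are copied from the Project Structure tree. That tree spells the NumPy file image-paths.npy while the download list spells it image_paths.npy, so adjust the entry to match the file you actually downloaded.

```python
from pathlib import Path

# Relative paths as shown in the Project Structure tree of this README.
# Assumption: the tree's spelling "image-paths.npy" is used here.
REQUIRED_FILES = (
    "back-end/index/faiss_index-001.idx",
    "back-end/imgPath/folder_index.json",
    "back-end/imgPath/image-paths.npy",
)


def missing_files(repo_root: str, required=REQUIRED_FILES) -> list[str]:
    """Return the required relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All index files are in place.")
```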

4. Run the Demo

In front-end:

python mmain.py

In back-end:

python -m uvicorn main:app --reload

Demo

This is the main interface:

You can watch the demo video on YouTube:

🎬 Watch the Demo on YouTube

📦 Dataset

This system is designed for Vietnamese news and current affairs videos from YouTube:

Total Videos: ~1,400
Domain: News, documentaries, current events
Format: MP4
Language: Vietnamese content

The video metadata is available in front-end/media

🗂️ Project Structure

scene-video-Query-System/
├── back-end/
│   ├── image_tmp/           # input images for image retrieval
│   ├── imgPath/             # path of each image on the drive
│   │   ├── folder_index.json
│   │   └── image-paths.npy
│   ├── FILE.py              # connects to Google Drive
│   ├── main.py              # backend logic
│   └── index/               # contains the FAISS index file
│       └── faiss_index-001.idx
├── front-end/
│   ├── transnetv2/          # scene detection model
│   ├── media.json/          # metadata for all ~1,400 videos
│   ├── Neighbor/            # keyframes retrieved by neighbor search
│   ├── Result/              # keyframes retrieved by text search
│   └── mmain.py             # Gradio front end
│
├── transnetv2_2.ipynb       # keyframe extraction
│
├── transnetv2.py            # transition detection
├── transnetv2-weights/      # saved_model for transnetv2.py
├── README.md
└── LICENSE

🔬 Technical Details

TransNetV2 Scene Detection

Input: MP4 video
Output: Scene boundary predictions (file.txt)
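
The exact layout of the boundary file is not described here. As a hedged sketch, assuming each line holds one scene as two space-separated frame indices ("start end", both inclusive), the file could be read and converted to timestamps as follows. The function names parse_scenes and scene_seconds are hypothetical, and the frame rate must come from the video itself.

```python
def parse_scenes(text: str) -> list[tuple[int, int]]:
    """Parse 'start end' frame pairs, one scene per line (assumed format)."""
    scenes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 2 values, got {len(parts)}")
        start, end = int(parts[0]), int(parts[1])
        if end < start:
            raise ValueError(f"line {line_no}: end frame precedes start frame")
        scenes.append((start, end))
    return scenes


def scene_seconds(scene: tuple[int, int], fps: float) -> tuple[float, float]:
    """Convert an inclusive frame range to (start_sec, end_sec)."""
    start, end = scene
    # The scene ends when its last frame has finished displaying.
    return start / fps, (end + 1) / fps


if __name__ == "__main__":
    sample = "0 120\n121 300\n"
    for scene in parse_scenes(sample):
        print(scene, scene_seconds(scene, fps=25.0))
```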

ViT-L-14 Feature Extraction

Architecture: Vision Transformer Large (patch 14)
Embedding dimension: 768
Pre-trained: CLIP/OpenCLIP weights
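
Once keyframes and queries are embedded in the same vector space, retrieval reduces to ranking keyframe vectors by cosine similarity to the query vector. The sketch below shows that ranking exhaustively in plain Python, with short toy vectors standing in for the 768-dimensional embeddings. The names normalize and top_k are illustrative assumptions, and producing the embeddings themselves is out of scope here.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query: list[float], embeddings: list[list[float]], k: int):
    """Return [(index, cosine)] for the k embeddings most similar to query."""
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(embeddings):
        # Cosine similarity equals the dot product of unit-length vectors.
        cosine = sum(a * b for a, b in zip(q, normalize(emb)))
        scored.append((idx, cosine))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    keyframes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
    print(top_k([2.0, 0.1, 0.0], keyframes, k=2))
```

An exhaustive scan like this is only practical for small collections, which is why a vector index is used for the full dataset.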

FAISS Indexing

Index type: IVFFlat (Inverted File with Flat quantizer)
Distance metric: Cosine similarity
Number of clusters (nlist): 100
Search probes (nprobe): 10
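
To illustrate what the cluster and probe settings mean, here is a toy inverted-file index in plain Python. It is not FAISS and not the project's code: ToyIVFIndex is a hypothetical class, the centroids are supplied by hand (FAISS learns them with k-means during training), and all vectors are assumed to be unit length so that the inner product equals cosine similarity.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class ToyIVFIndex:
    """Toy inverted-file index over unit-length vectors (inner product)."""

    def __init__(self, centroids):
        self.centroids = centroids            # one centroid per cluster
        self.lists = [[] for _ in centroids]  # one inverted list per cluster

    def _ranked_clusters(self, vec):
        # Cluster ids ordered from most to least similar centroid.
        return sorted(range(len(self.centroids)),
                      key=lambda c: dot(vec, self.centroids[c]),
                      reverse=True)

    def add(self, vec_id, vec):
        # Store the vector in the list of its most similar centroid.
        self.lists[self._ranked_clusters(vec)[0]].append((vec_id, vec))

    def search(self, query, k, nprobe):
        # Scan only the inverted lists of the nprobe closest clusters.
        candidates = []
        for c in self._ranked_clusters(query)[:nprobe]:
            candidates.extend(self.lists[c])
        scored = sorted(((dot(query, v), i) for i, v in candidates), reverse=True)
        return [(i, s) for s, i in scored[:k]]


def build_demo_index():
    index = ToyIVFIndex(centroids=[(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)])
    vectors = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0), (-1.0, 0.0)]
    for vec_id, vec in enumerate(vectors):
        index.add(vec_id, vec)
    return index


if __name__ == "__main__":
    index = build_demo_index()
    query = (0.6, 0.8)
    print([i for i, _ in index.search(query, k=2, nprobe=1)])
    print([i for i, _ in index.search(query, k=2, nprobe=2)])
```

In this toy data, probing one cluster returns ids 2 and 3, while probing two clusters returns ids 2 and 1: vector 1 is closer to the query than vector 3 but lives in a neighbouring cluster. Raising the probe count trades speed for recall in the same way.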

🙏 Acknowledgments

TransNetV2: Souček & Lokoč (2020)
Vision Transformer: Dosovitskiy et al. (2020)
CLIP: Radford et al. (2021)
FAISS: Johnson et al. (2019)

📞 Contact

Hieu Nguyen

GitHub: @Hieuabssy
Repository: scene-video-Query-System

🌟 Star History

If you find this project useful, please consider giving it a star! ⭐

🗓️ Changelog

v1.0.0 (2025-01-XX)

✨ Initial release
✅ TransNetV2 scene detection
✅ ViT-L-14 feature extraction
✅ FAISS vector indexing
✅ Text and image-based retrieval
✅ Support for ~1,400 Vietnamese news videos
