We are excited to release a new video-text benchmark and extensible code for multi-shot video understanding. This release contains the 134k version of our dataset. It includes detailed long summaries (human-annotated + GPTV-generated) for 134k videos and human-annotated shot captions for 188k video shots. Please see DATA.md for more details.
This repo mainly focuses on our established baselines for single-shot captioning, video summarization, multi-shot video question-answering, and zero-shot video question-answering.
- Introduction
- 🌟 What's new 👀
- Setting Environment
- Video Summarization
- Single-shot Captioning
- Offline Demo
- License
- Citation
- Contact
A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate them together to understand the story behind it. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show the challenges of generating long and comprehensive video summaries.
This section provides guidelines on how to get the project up and running. Suppose the project root is $project_root/Shot2Story; you can then prepare the project as follows.
Please follow the instructions in DATA.md to download and prepare the videos and annotations. In our code, the data should be organized under $project_root/Shot2Story/data as below:
data/
|--videos/
| |--Jz2CSZW7_S4.7.mp4
| |--Jz2CSZW7_S4.7_0_39.mp4
| |--Jz2CSZW7_S4.7_40_101.mp4
| |--Jz2CSZW7_S4.7_102_185.mp4
| |--Jz2CSZW7_S4.7_186_226.mp4
| |--Jz2CSZW7_S4.7_227_316.mp4
| |--Jz2CSZW7_S4.7_317_396.mp4
| |--...
|
|--annotations/
| |--20k_train.json
| |--20k_val.json
| |--20k_test.json
Please use the videos and annotations in accordance with their original usage instructions and licenses.
Please download our cached videos from OneDrive or HF.
For example, download and prepare the videos with the commands below:
git lfs install
git clone https://huggingface.co/mhan/shot2story-videos videos
cd videos
cat release_134k_videos.tar.gz.* > release_134k_videos.tar.gz
tar xf release_134k_videos.tar.gz
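As a quick sanity check after extraction, you can count the extracted clips. This is a minimal sketch; it assumes the archive unpacks the .mp4 files directly into the videos directory, matching the layout above.

```bash
# Count the extracted .mp4 files (full videos plus per-shot clips)
find videos -maxdepth 1 -name "*.mp4" | wc -l
```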
All of our annotations are hosted on HF. Please download the files according to the tasks you are interested in. For Shot2Story-QA evaluation, please download val_qa.json and test_qa.json. For multi-shot video summarization, please download 43k_human_train.json for training with manual annotations, 90k_gptv_train.json for training with GPTV-generated annotations, or 134k_full_train.json for training with a combination of both. For single-shot captioning, please download the JSON files whose names contain _human_shot_.
Please ensure the files are organized in the above file structure.
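A small sketch like the one below can verify the layout from $project_root/Shot2Story. The annotation file names are the 20k split from the structure above; substitute the files for the split you downloaded. The Python one-liner assumes the top level of the JSON file is a list.

```bash
cd $project_root/Shot2Story

# Directories expected by the training/evaluation scripts
ls data/annotations
ls data/videos | head

# Rough entry count of one annotation file (assumes the top-level JSON object is a list)
python -c "import json; print(len(json.load(open('data/annotations/20k_train.json'))))"
```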
Our project runs on xx and xx. Use other versions at your own risk. Please execute the following commands. We recommend using conda to manage the running environment.
cd $project_root
git clone https://github.com/bytedance/Shot2Story.git Shot2Story
cd Shot2Story
conda env create -f conda_env.yml
Please set S2S_DIR in the bash scripts to $project_root/Shot2Story, and CONDA_ENV_DIR to the root directory of the conda environment. We support multi-node distributed training by setting the bash environment variables WORKER_GPU, WORKER_NUM, WORKER_ID, WORKER_0_HOST and WORKER_0_PORT.
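For example, a single-node run with 8 GPUs could be configured roughly as below. This is a minimal sketch: the values are placeholders, the conda path depends on your installation, and the reading of WORKER_GPU as GPUs per node is an assumption; adjust everything to your setup.

```bash
# Placeholders for a single-node, 8-GPU run; set these in the run scripts
# (S2S_DIR, CONDA_ENV_DIR) or export them before launching (WORKER_*).
S2S_DIR=$project_root/Shot2Story               # project root used by the bash scripts
CONDA_ENV_DIR=/path/to/conda/envs/shot2story   # root dir of the conda environment (placeholder)

export WORKER_GPU=8              # GPUs per node (assumed meaning)
export WORKER_NUM=1              # number of nodes
export WORKER_ID=0               # rank of this node
export WORKER_0_HOST=127.0.0.1   # address of the rank-0 node
export WORKER_0_PORT=29500       # a free port on the rank-0 node (placeholder)
```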
The commands and checkpoints in this section correspond to Sections 3.4 and 6.2 in the paper; more implementation details can be found in Section 8 and the code.
After setting up the Python environment and the system variables, please run the following command for SUM-shot:
bash ./exps/summarization/run_SUM_multi_shot.sh
and the following command for SUM-holistic:
bash ./exps/summarization/run_SUM_whole_video.sh
Stay tuned for the metric calculation interface and summary generation code.
| Model* | ASR | B | M | R | C | Checkpoint |
|---|---|---|---|---|---|---|
| SUM-holistic | ✓ | 10.9 | 18.3 | 26.2 | 6.3 | ckpt |
| SUM-shot | ✓ | 11.7 | 19.7 | 26.8 | 8.6 | ckpt |
*These models are trained in an end-to-end manner. B, M, R and C denote BLEU, METEOR, ROUGE and CIDEr, respectively. Our provided checkpoints only contain the parameters that have been updated, i.e., the Q-Former (including the additional linear layer).
The commands and checkpoints in this section correspond to Sections 3.2, 3.3 and 6.1 in the paper; more implementation details can be found in Section 8 and the code.
After setting up the Python environment and the system variables, please run the following commands for single-shot video captioning:
# with visual signals only
bash ./exps/captioning/run_CAP_v.sh
# with visual and audio signals
bash ./exps/captioning/run_CAP_av.sh
and the following command for single-shot narration captioning:
bash ./exps/captioning/run_CAP_a.sh
Stay tuned for the metric calculation interface and caption generation code.
Results for single-shot video captioning:
| Modality* | B | M | R | C | Checkpoint |
|---|---|---|---|---|---|
| V | 10.5 | 16.0 | 30.1 | 38.8 | ckpt |
| V+A | 10.7 | 16.2 | 29.6 | 37.4 | ckpt |
Results for single-shot narration captioning:
| Modality* | B | M | R | C | Checkpoint |
|---|---|---|---|---|---|
| V+A | 18.8 | 24.8 | 39.0 | 168.7 | ckpt |
*These models are trained in an end-to-end manner. V and A denote visual signals and ASR text, respectively.
We provide code and checkpoints for an offline Gradio demo on a single GPU. The default arguments are set in demo_video.py. You can specify a different config file with the --cfg-path option. To enable on-the-fly shot detection, please download the TransNetV2 checkpoint, convert it to PyTorch format, and place it under ./pretrain. (You can also download it from here.)
python demo_video.py
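For instance, to run the demo with a custom configuration (the config path below is a placeholder):

```bash
# --cfg-path overrides the default config set in demo_video.py; the path is a placeholder
python demo_video.py --cfg-path ./path/to/your_demo_config.yaml
```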
Our code is licensed under the Apache 2.0 License.
Our text annotations are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. They are available strictly for non-commercial research. More guidelines on the dataset can be found here.
If you have any questions or concerns about our dataset, please don't hesitate to contact us. You can raise an issue or reach us at [email protected]. We welcome feedback and are always looking to improve our dataset.
We extend our thanks to the teams behind HD-VILA-100M, BLIP2, Whisper, MiniGPT-4, Vicuna and LLaMA. Please check the original docs of the LAVIS repo in original_docs. Our work builds upon their valuable contributions; please acknowledge these resources in your work.