We are excited to release a new video-text benchmark and extensible code for multi-shot video understanding. This release contains the 134k version of our dataset. It includes detailed long summaries (human-annotated + GPTV-generated) for 134k videos and human-annotated shot captions for 188k video shots. Please see DATA.md for more details.
This repo mainly focuses on our established baselines for single-shot captioning, video summarization, multi-shot video question-answering, and zero-shot video question-answering.
- Introduction
- 🌟 What's new 👀
- Setting Environment
- Video Summarization
- Single-shot Captioning
- Offline Demo
- License
- Citation
- Contact
A short video clip may contain the progression of multiple events and an interesting storyline. A human needs to capture the event in every shot and associate them together to understand the story behind it. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show the challenges of generating long and comprehensive video summaries.
This section provides guidelines on how to get the project up and running. Suppose the project root is $project_root/Shot2Story; you can then prepare the project as follows.
Please follow the instructions in DATA.md to download and prepare the videos and annotations. In our code, the data should be organized under $project_root/Shot2Story/data as below:
data/
|--videos/
| |--Jz2CSZW7_S4.7.mp4
| |--Jz2CSZW7_S4.7_0_39.mp4
| |--Jz2CSZW7_S4.7_40_101.mp4
| |--Jz2CSZW7_S4.7_102_185.mp4
| |--Jz2CSZW7_S4.7_186_226.mp4
| |--Jz2CSZW7_S4.7_227_316.mp4
| |--Jz2CSZW7_S4.7_317_396.mp4
| |--...
|
|--annotations/
| |--20k_train.json
| |--20k_val.json
| |--20k_test.json
Please use the videos and annotations in accordance with their original usage instructions and licenses.
Please download our cached videos from OneDrive or HF.
For example, download and prepare the videos with the commands below:
git lfs install
git clone https://huggingface.co/mhan/shot2story-videos videos
cd videos
cat release_134k_videos.tar.gz.* > release_134k_videos.tar.gz
tar xf release_134k_videos.tar.gz
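As a quick sanity check after extraction, you can count the extracted clips. This is a minimal sketch; it assumes the archive unpacks the .mp4 files directly into the videos directory, matching the layout above.

```bash
# Count the extracted .mp4 files (full videos plus per-shot clips)
find videos -maxdepth 1 -name "*.mp4" | wc -l
```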
All of our annotations are hosted on HF. Please download the files according to the tasks you are interested in. For Shot2Story-QA evaluation, please download val_qa.json and test_qa.json. For multi-shot video summarization, please download 43k_human_train.json for training with manual annotations, 90k_gptv_train.json for training with GPTV-generated annotations, or 134k_full_train.json for training with a combination of both. For single-shot captioning, please download the JSON files whose names contain _human_shot_.
Please ensure the files are organized in the above file structure.
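A small sketch like the one below can verify the layout from $project_root/Shot2Story. The annotation file names are the 20k split from the structure above; substitute the files for the split you downloaded. The Python one-liner assumes the top level of the JSON file is a list.

```bash
cd $project_root/Shot2Story

# Directories expected by the training/evaluation scripts
ls data/annotations
ls data/videos | head

# Rough entry count of one annotation file (assumes the top-level JSON object is a list)
python -c "import json; print(len(json.load(open('data/annotations/20k_train.json'))))"
```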
Our project runs on xx and xx. Use other versions at your own risk. Please execute the following commands. We recommend using conda to manage the running environment.
cd $project_root
git clone https://github.com/bytedance/Shot2Story.git Shot2Story
cd Shot2Story
conda env create -f conda_env.yml
Please set S2S_DIR in the bash scripts to $project_root/Shot2Story, and CONDA_ENV_DIR to the root directory of the conda environment. We support multi-node distributed training by setting the bash environment variables WORKER_GPU, WORKER_NUM, WORKER_ID, WORKER_0_HOST and WORKER_0_PORT.
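For example, a single-node run with 8 GPUs could be configured roughly as below. This is a minimal sketch: the values are placeholders, the conda path depends on your installation, and the reading of WORKER_GPU as GPUs per node is an assumption; adjust everything to your setup.

```bash
# Placeholders for a single-node, 8-GPU run; set these in the run scripts
# (S2S_DIR, CONDA_ENV_DIR) or export them before launching (WORKER_*).
S2S_DIR=$project_root/Shot2Story               # project root used by the bash scripts
CONDA_ENV_DIR=/path/to/conda/envs/shot2story   # root dir of the conda environment (placeholder)

export WORKER_GPU=8              # GPUs per node (assumed meaning)
export WORKER_NUM=1              # number of nodes
export WORKER_ID=0               # rank of this node
export WORKER_0_HOST=127.0.0.1   # address of the rank-0 node
export WORKER_0_PORT=29500       # a free port on the rank-0 node (placeholder)
```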
The commands and checkpoints in this section correspond to Sections 3.4 and 6.2 in the paper; more implementation details can be found in Section 8 and the code.
After setting up the Python environment and the system variables, please run the following command for SUM-shot:
bash ./exps/summarization/run_SUM_multi_shot.sh
and the following command for SUM-holistic:
bash ./exps/summarization/run_SUM_whole_video.sh
Stay tuned for the metric calculation interface and summary generation code.
| Model* | ASR | B | M | R | C | Checkpoint |
|---|---|---|---|---|---|---|
| SUM-holistic | ✓ | 10.9 | 18.3 | 26.2 | 6.3 | ckpt |
| SUM-shot | ✓ | 11.7 | 19.7 | 26.8 | 8.6 | ckpt |
*These models are trained in an end-to-end manner. B, M, R and C denote BLEU, METEOR, ROUGE and CIDEr, respectively. Our provided checkpoints only contain the parameters that have been updated, i.e., the Q-Former (including the additional linear layer).
The commands and checkpoints in this section correspond to Sections 3.2, 3.3 and 6.1 in the paper; more implementation details can be found in Section 8 and the code.
After setting up the Python environment and the system variables, please run the following commands for single-shot video captioning:
# with visual signals only
bash ./exps/captioning/run_CAP_v.sh
# with visual and audio signals
bash ./exps/captioning/run_CAP_av.sh
and the following command for single-shot narration captioning:
bash ./exps/captioning/run_CAP_a.sh
Stay tuned for the metric calculation interface and caption generation code.
Results for single-shot video captioning:
| Modality* | B | M | R | C | Checkpoint |
|---|---|---|---|---|---|
| V | 10.5 | 16.0 | 30.1 | 38.8 | ckpt |
| V+A | 10.7 | 16.2 | 29.6 | 37.4 | ckpt |
Results for single-shot narration captioning:
| Modality* | B | M | R | C | Checkpoint |
|---|---|---|---|---|---|
| V+A | 18.8 | 24.8 | 39.0 | 168.7 | ckpt |
*These models are trained in an end-to-end manner. V and A denote visual signals and ASR text, respectively.
We provide code and checkpoints for an offline Gradio demo on a single GPU. The default arguments are set in demo_video.py. You can specify a different config file with the --cfg-path option. To enable on-the-fly shot detection, please download the TransNetV2 checkpoint, convert it to PyTorch format, and place it under ./pretrain. (You can also download it from here.)
python demo_video.py
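For instance, to run the demo with a custom configuration (the config path below is a placeholder):

```bash
# --cfg-path overrides the default config set in demo_video.py; the path is a placeholder
python demo_video.py --cfg-path ./path/to/your_demo_config.yaml
```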
Our code is licensed under the Apache 2.0 License.
Our text annotations are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. They are available strictly for non-commercial research. More guidelines on the dataset can be found here.
If you have any questions or concerns about our dataset, please don't hesitate to contact us. You can raise an issue or reach us at [email protected]. We welcome feedback and are always looking to improve our dataset.
We extend our thanks to the teams behind HD-VILA-100M, BLIP2, Whisper, MiniGPT-4, Vicuna and LLaMA. Please check the original docs of the LAVIS repo in original_docs. Our work builds upon their valuable contributions; please acknowledge these resources in your work.