High-performance parallel computing implementation for video action recognition using Something-Something V2 dataset. Comparing CPU parallelization (Joblib vs Dask) and GPU parallelization (DDP vs FSDP) strategies.
Team: Nilay Raut and Yash Darekar
Course: CSYE7105 High Performance Parallel Machine Learning & AI
Instructor: Prof. Handan Liu
Accelerate video action recognition from over 6 hours to under 2 hours using parallel computing techniques while maintaining model accuracy.
- CPU Preprocessing: Sequential vs Joblib vs Dask
- GPU Training: Single GPU vs DDP vs FSDP1 vs FSDP2
- Scaling Analysis: Strong scaling and weak scaling experiments
- Something-Something V2: 220,847 videos, 174 action classes
- Size: 18.2GB
- Download: Qualcomm Developer Portal
parallel-video-project/
├── data/
│ ├── videos/ # Raw video files (.webm)
│ └── labels/ # JSON label files
│ ├── labels.json # 174 class definitions
│ ├── train.json # Training split
│ ├── validation.json # Validation split
│ ├── test.json # Test IDs (no labels)
│ └── test-answers.csv # Test labels (for evaluation)
├── models/
│ └── pretrained/ # Contains pretrained model weights
├── src/
│ └── models/ # Model architecture definitions
│ ├── pytorch_i3d.py # I3D model implementation
│ └── i3d_model.py # I3D wrapper for Something-Something V2
├── notebooks/
│ ├── main_development.ipynb # Development and exploration
│ └── results_visualization.ipynb # Results analysis and plotting
├── scripts/
│ ├── baseline.py # Sequential baseline implementation
│ ├── joblib_preprocess.py # Joblib parallelization
│ ├── dask_preprocess.py # Dask parallelization
│ ├── train_single_gpu.py # Single GPU training
│ ├── train_ddp.py # Multi-GPU DDP training
│ ├── train_fsdp1.py # Multi-GPU FSDP1 training
│ ├── train_fsdp2.py # Multi-GPU FSDP2 training
│ ├── utils.py # Common utility functions (used in other scripts)
│ ├── setup_i3d.py # Setup I3D model and pretrained weights
│ └── slurm/ # SLURM job scripts
│ ├── cpu/
│ │ ├── baseline.sh # Sequential processing
│ │ ├── strong_scaling.sh # Joblib + Dask strong scaling
│ │ └── weak_scaling.sh # Joblib + Dask weak scaling
│ ├── gpu/
│ │ ├── single_gpu.sh # Single GPU baseline
│ │ ├── ddp_2gpu_frozen.sh # DDP strong scaling (2 GPUs)
│ │ ├── ddp_4gpu_frozen.sh # DDP strong scaling (4 GPUs)
│ │ ├── ddp_2gpu_weak.sh # DDP weak scaling (2 GPUs)
│ │ ├── ddp_4gpu_weak.sh # DDP weak scaling (4 GPUs)
│ │ ├── ddp_2gpu_full.sh # DDP full fine-tuning (2 GPUs)
│ │ ├── ddp_4gpu_full.sh # DDP full fine-tuning (4 GPUs)
│ │ ├── fsdp1_2gpu.sh # FSDP1 full fine-tuning (2 GPUs)
│ │ ├── fsdp1_4gpu.sh # FSDP1 full fine-tuning (4 GPUs)
│ │ ├── fsdp2_2gpu.sh # FSDP2 full fine-tuning (2 GPUs)
│ │ └── fsdp2_4gpu.sh # FSDP2 full fine-tuning (4 GPUs)
│ ├── run_cpu.sh # Submit all CPU experiments
│ ├── run_gpu.sh # Submit all GPU experiments
│ └── run_all.sh # Submit all experiments
│
├── tests/ # Test scripts
│ ├── __init__.py
│ ├── test_i3d.py # Test I3D model
│ ├── test_ddp.py # Test DDP setup
│ ├── test_fsdp.py # Test FSDP1 setup
│ └── test_fsdp2.py # Test FSDP2 setup
├── results/
│ ├── metrics/ # Performance measurements (CSV/JSON)
│ │ ├── all_results.csv # CPU preprocessing (baseline, joblib, dask)
│ │ ├── gpu_training_results.csv # GPU training (single_gpu, ddp, fsdp)
│ │ ├── baseline.json
│ │ ├── baseline_detailed.csv
│ │ ├── joblib_comparison.csv
│ │ ├── joblib_all_workers.json
│ │ ├── dask_comparison.csv
│ │ ├── dask_all_workers.json
│ │ ├── single_gpu_metrics.json
│ │ ├── ddp_2gpu_metrics.json
│ │ └── fsdp_2gpu_metrics.json
│ └── plots/ # Generated visualizations
├── venv/ # Python virtual environment
├── logs/ # SLURM job logs
│ ├── cpu/
│ └── gpu/
├── requirements.txt # Python dependencies
└── README.md # This file
# Clone repository
git clone git@github.com:yashdeep94/parallel_video_recognition.git
cd parallel-video-project
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip3 install -r requirements.txt

The I3D model setup and testing can be done either through the notebook or scripts:
# Open the main development notebook
jupyter notebook notebooks/main_development.ipynb
# In the notebook, run the I3D setup cells which will:
# 1. Download the I3D model architecture
# 2. Download pretrained weights (91MB)
# 3. Test the model
# 4. Verify GPU acceleration if available

# Download I3D model and pretrained weights
python3 scripts/setup_i3d.py
# Test I3D model
python3 tests/test_i3d.py

After setup, you should have:
- `src/models/pytorch_i3d.py` - I3D architecture
- `src/models/i3d_model.py` - I3D wrapper for 174 classes
- `models/pretrained/i3d_rgb_imagenet.pt` - Pretrained weights
# Place your downloaded videos in:
data/videos/ # All .webm files here
data/labels/ # JSON label files here
# Verify setup
python3 -c "from pathlib import Path; print(f'Videos found: {len(list(Path(\"data/videos\").glob(\"*.webm\")))}')"

# Make scripts executable
chmod +x scripts/slurm/**/*.sh
# Submit all experiments with dependencies
bash scripts/slurm/run_all.sh
# Or submit separately
bash scripts/slurm/run_cpu.sh # CPU experiments only
bash scripts/slurm/run_gpu.sh # GPU experiments only

# Check job status
squeue -u $USER
# Monitor specific job output
tail -f logs/cpu/baseline_<job_id>.out
tail -f logs/gpu/ddp_strong_scaling_<job_id>.out
# Cancel all jobs
scancel -u $USER

The baseline script establishes performance metrics for sequential (non-parallel) video processing.
# Process 100 videos with default settings (32 frames, 224x224 resize)
python3 scripts/baseline.py --num_videos 100
# Process videos and save results
python3 scripts/baseline.py --num_videos 100 --save_results

| Option | Default | Description |
|---|---|---|
| `--num_videos` | 100 | Number of videos to process |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize` | 224 | Resize dimension (square) |
| `--data_dir` | data/videos | Video directory path |
| `--save_results` | False | Save results to CSV/JSON |
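For reference, a minimal sketch of what the sequential baseline measures: decode each video, sample a fixed number of evenly spaced frames, and resize them, one video at a time. This assumes OpenCV decoding; the function and variable names are illustrative, not the script's actual API.

```python
import time
from pathlib import Path

import cv2
import numpy as np


def preprocess_video(path, num_frames=32, resize=224):
    """Decode a video, sample num_frames evenly spaced frames, resize each."""
    cap = cv2.VideoCapture(str(path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return np.stack(frames) if frames else None


if __name__ == "__main__":
    videos = sorted(Path("data/videos").glob("*.webm"))[:100]
    start = time.perf_counter()
    for v in videos:  # sequential: one video at a time
        preprocess_video(v)
    print(f"Sequential: {time.perf_counter() - start:.1f}s for {len(videos)} videos")
```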
Joblib enables parallel video processing using multiple CPU cores, achieving significant speedup over sequential processing.
# Test with workers (1,2,4,8)
python3 scripts/joblib_preprocess.py --num_videos 100 --workers "1,2,4,8"
# Save results for analysis
python3 scripts/joblib_preprocess.py --num_videos 100 --workers "1,2,4,8,16" --save_results

| Option | Default | Description |
|---|---|---|
| `--num_videos` | 100 | Number of videos to process |
| `--workers` | "1,2,4,8" | Comma-separated worker counts to test |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize` | 224 | Resize dimension (square) |
| `--data_dir` | data/videos | Video directory path |
| `--save_results` | False | Save results to CSV/JSON |
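Under the hood, Joblib parallelism amounts to mapping the same per-video function across a pool of worker processes and timing each worker count. A minimal self-contained sketch (same illustrative `preprocess_video` as in the baseline sketch):

```python
import time
from pathlib import Path

import cv2
import numpy as np
from joblib import Parallel, delayed


def preprocess_video(path, num_frames=32, resize=224):
    # Same evenly spaced sampling as the baseline sketch above
    cap = cv2.VideoCapture(str(path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return np.stack(frames) if frames else None


if __name__ == "__main__":
    videos = sorted(Path("data/videos").glob("*.webm"))[:100]
    for n_workers in [1, 2, 4, 8]:
        start = time.perf_counter()
        # The loky backend spawns worker processes, sidestepping the GIL
        # for CPU-bound decode/resize work.
        Parallel(n_jobs=n_workers, backend="loky")(
            delayed(preprocess_video)(v) for v in videos
        )
        print(f"{n_workers} workers: {time.perf_counter() - start:.1f}s")
```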
Dask provides an alternative CPU parallelization backend, distributing the same per-video preprocessing across a local cluster of worker processes, again with significant speedup over sequential processing.
# Test with workers (1,2,4,8)
python3 scripts/dask_preprocess.py --num_videos 100 --workers "1,2,4,8"
# Save results for analysis
python3 scripts/dask_preprocess.py --num_videos 100 --workers "1,2,4,8,16" --save_results

| Option | Default | Description |
|---|---|---|
| `--num_videos` | 100 | Number of videos to process |
| `--workers` | "1,2,4,8" | Comma-separated worker counts to test |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize` | 224 | Resize dimension (square) |
| `--data_dir` | data/videos | Video directory path |
| `--save_results` | False | Save results to CSV/JSON |
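With Dask the same idea runs on a `LocalCluster`: each worker count gets its own cluster, and `client.map` schedules one task per video. A minimal sketch (illustrative names; returning a small value keeps large arrays from being shipped back to the driver):

```python
import time
from pathlib import Path

import cv2
import numpy as np
from dask.distributed import Client, LocalCluster


def preprocess_video(path, num_frames=32, resize=224):
    # Same decode/sample/resize as the baseline sketch above
    cap = cv2.VideoCapture(str(path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return len(frames)  # return something small; arrays stay on the workers


if __name__ == "__main__":
    videos = sorted(str(p) for p in Path("data/videos").glob("*.webm"))[:100]
    for n_workers in [1, 2, 4, 8]:
        with LocalCluster(n_workers=n_workers, threads_per_worker=1) as cluster, \
                Client(cluster) as client:
            start = time.perf_counter()
            futures = client.map(preprocess_video, videos)  # one task per video
            client.gather(futures)
            print(f"{n_workers} workers: {time.perf_counter() - start:.1f}s")
```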
Train the I3D model on a single GPU with optimized data loading and parallel preprocessing.
# Quick test with small dataset
python3 scripts/train_single_gpu.py --num_videos 100 --epochs 2 --batch_size 4 --freeze_backbone --save_results
# Full training with more data
python3 scripts/train_single_gpu.py --num_videos 1000 --epochs 10 --batch_size 8 --num_workers 8 --freeze_backbone --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--num_workers` | 4 | Parallel data loading workers |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
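The `--freeze_backbone` flag corresponds to the standard PyTorch pattern of disabling gradients everywhere except the classification head. A minimal sketch of one training step under the defaults above (the model is a stand-in, not the real I3D wrapper; assumes a CUDA device):

```python
import torch
import torch.nn as nn


# Stand-in 3D CNN; the real model is the I3D wrapper in src/models/i3d_model.py
class TinyVideoNet(nn.Module):
    def __init__(self, num_classes=174):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):  # x: (batch, 3, frames, height, width)
        return self.classifier(self.backbone(x))


model = TinyVideoNet().cuda()

# --freeze_backbone: gradients flow only into the classification head
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, weight_decay=1e-4,  # --lr and --weight_decay defaults
)

x = torch.randn(8, 3, 32, 224, 224, device="cuda")  # one batch of clips
y = torch.randint(0, 174, (8,), device="cuda")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # --clip_grad
optimizer.step()
```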
Train the I3D model across multiple GPUs using Distributed Data Parallel (DDP) for significant speedup.
- Requires 2 or more GPUs
- NCCL backend support
- Must be run on cluster or multi-GPU system
# Test DDP setup
python3 tests/test_ddp.py
# Quick test with 2 GPUs
python3 scripts/train_ddp.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --freeze_backbone --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size per GPU |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--num_workers` | 4 | Parallel data loading workers |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--num_frames` | 32 | Frames to sample per video |
| `--num_gpus` | None (All available) | Number of GPUs |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
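The essential DDP moves: one process per GPU, `init_process_group` over NCCL, a `DistributedSampler` so each rank reads a disjoint shard of the dataset, and a model wrapper that all-reduces gradients during `backward()`. A minimal self-contained sketch for 2 GPUs (dataset and model are stand-ins for the real video pipeline):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Stand-in clips/labels; the real scripts load videos and the I3D wrapper
    data = TensorDataset(torch.randn(64, 3, 8, 32, 32),
                         torch.randint(0, 174, (64,)))
    sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
    loader = DataLoader(data, batch_size=8, sampler=sampler)

    model = torch.nn.Sequential(
        torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)
    ).cuda(rank)
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            loss = torch.nn.functional.cross_entropy(
                model(x.cuda(rank)), y.cuda(rank))
            opt.zero_grad()
            loss.backward()  # gradients are all-reduced across ranks here
            opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # one process per GPU
```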
Train the I3D model across multiple GPUs using Fully Sharded Data Parallel 1 (FSDP1) for improved memory efficiency and scalability.
- Requires 2 or more GPUs
- PyTorch >= 1.12
- NCCL backend support
- Must be run on cluster or multi-GPU system
# Test FSDP setup
python3 tests/test_fsdp.py
# Quick test with 2 GPUs (FULL_SHARD strategy)
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --save_results
# With mixed precision for better memory efficiency
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --mixed_precision --freeze_backbone --save_results
# Try different sharding strategies
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --sharding_strategy SHARD_GRAD_OP --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size per GPU |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--num_workers` | 4 | Parallel data loading workers |
| `--num_frames` | 32 | Frames to sample per video |
| `--num_gpus` | None (All available) | Number of GPUs |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--sharding_strategy` | FULL_SHARD | Sharding strategy (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD) |
| `--mixed_precision` | False | Enable mixed precision training (FP16) |
| `--min_wrap_params` | 1000000 | Minimum parameters for FSDP wrapping |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
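Where DDP replicates the full model on every rank, FSDP1 shards parameters, gradients, and optimizer state, gathering each wrapped unit only for its forward/backward pass. A minimal sketch of the wrapping step, mirroring the flags above (assumes the process group is already initialized as in the DDP sketch; the model is a stand-in):

```python
import functools

import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Stand-in module on the current CUDA device (process group already set up)
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)
).cuda()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # --sharding_strategy
    mixed_precision=MixedPrecision(                 # --mixed_precision
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
    auto_wrap_policy=functools.partial(             # --min_wrap_params
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)
# The training loop itself is identical to the DDP sketch from here on.
```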
Train the I3D model across multiple GPUs using Fully Sharded Data Parallel 2 (FSDP2) for improved memory efficiency and scalability.
- Requires 2 or more GPUs
- PyTorch >= 1.12
- NCCL backend support
- Must be run on cluster or multi-GPU system
# Test FSDP setup
python3 tests/test_fsdp2.py
# Quick test with 2 GPUs (FULL_SHARD strategy)
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --save_results
# With mixed precision for better memory efficiency
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --mixed_precision --freeze_backbone --save_results
# Try different sharding strategies
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --sharding_strategy SHARD_GRAD_OP --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size per GPU |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--compile` | False | Use torch.compile for better performance |
| `--num_workers` | 4 | Parallel data loading workers |
| `--num_frames` | 32 | Frames to sample per video |
| `--num_gpus` | None (All available) | Number of GPUs |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--sharding_strategy` | FULL_SHARD | Sharding strategy (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD) |
| `--mixed_precision` | False | Enable mixed precision training (FP16) |
| `--min_wrap_params` | 1000000 | Minimum parameters for FSDP wrapping |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
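FSDP2 replaces the wrapper class with a composable `fully_shard` call that shards each module's parameters in place as DTensors. A minimal sketch, assuming a recent PyTorch (2.6+, where `fully_shard` is exported from `torch.distributed.fsdp`; earlier releases expose it under `torch.distributed._composable.fsdp`), with process group setup as in the DDP sketch and a stand-in model:

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

# Stand-in module on the current CUDA device (process group already set up)
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)
).cuda()

mp_policy = MixedPrecisionPolicy(  # --mixed_precision
    param_dtype=torch.float16, reduce_dtype=torch.float16
)

# Shard leaf submodules first, then the root to pick up whatever remains;
# parameters become DTensors sharded across ranks, in place.
for submodule in model:
    fully_shard(submodule, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

model = torch.compile(model)  # optional, corresponds to --compile
# Training loop and optimizer usage are unchanged from the DDP sketch.
```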
- Base Model: Inception-V1 based 3D CNN
- Pretrained on: Kinetics-400 dataset
- Modified for: Something-Something V2 (174 classes)
- Input: `(batch, color_channels, frames, height, width)` RGB videos
- Recommended frames: 32
- Model size: ~47MB
- Backbone Freezing: Load pretrained weights and freeze all layers except the final classification layer for faster convergence
- Full Fine-tuning: Load pretrained weights and train all layers end to end
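A quick sanity check of that input contract, using a stand-in 3D-conv stem rather than the real I3D (the shapes are the point here):

```python
import torch
import torch.nn as nn

# Documented I3D input layout: (batch, channels, frames, height, width)
clip = torch.randn(2, 3, 32, 224, 224)

# Minimal stand-in for a 3D-conv stem plus classifier head (not the real I3D)
stem = nn.Conv3d(3, 64, kernel_size=7, stride=2, padding=3)
head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 174))

feats = stem(clip)    # -> (2, 64, 16, 112, 112): conv downsamples time and space
logits = head(feats)  # -> (2, 174): one logit per Something-Something V2 class
print(feats.shape, logits.shape)
```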
When using --save_results, the scripts generate:
Baseline:
- `results/metrics/baseline.json` - Latest baseline run details
- `results/metrics/all_results.csv` - Master tracker (accumulates all experiments)

Joblib:
- `results/metrics/joblib_comparison.csv` - Comparison of all worker counts
- `results/metrics/joblib_all_workers.json` - Detailed results for all configurations
- `results/metrics/all_results.csv` - Updated with all worker configurations

Dask:
- `results/metrics/dask_comparison.csv` - Comparison of all worker counts
- `results/metrics/dask_all_workers.json` - Detailed results for all configurations
- `results/metrics/all_results.csv` - Updated with all worker configurations

Single GPU:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/single_gpu_metrics.json` - Single-GPU results

DDP:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/ddp_{number_of_gpus}gpu_metrics.json` - DDP results per GPU count

FSDP1:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/fsdp1_{number_of_gpus}gpu_metrics.json` - FSDP1 results per GPU count

FSDP2:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/fsdp2_{number_of_gpus}gpu_metrics.json` - FSDP2 results per GPU count
Use these files in notebooks/results_visualization.ipynb to create:
- Speedup curves
- Efficiency plots
- Scaling analysis
- Worker optimization charts
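Speedup and efficiency follow directly from the recorded timings: S(n) = T(1)/T(n) and E(n) = S(n)/n. A minimal sketch of deriving both from the master CSV (the column names `method`, `workers`, and `time_sec` are assumptions about the file layout; adjust to the actual headers):

```python
import pandas as pd

# Assumed columns: method, workers, time_sec -- adjust to the actual CSV layout
df = pd.read_csv("results/metrics/all_results.csv")

for method, grp in df.groupby("method"):
    grp = grp.sort_values("workers")
    t1 = grp.loc[grp["workers"] == 1, "time_sec"].iloc[0]  # serial reference
    out = pd.DataFrame({
        "workers": grp["workers"],
        "speedup": (t1 / grp["time_sec"]).round(2),                      # S(n)
        "efficiency": (t1 / grp["time_sec"] / grp["workers"]).round(2),  # E(n)
    })
    print(method)
    print(out.to_string(index=False))
```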
- PyTorch 2.0+
- Joblib
- Dask
- OpenCV
- NumPy
- tqdm
- pandas
See requirements.txt for complete list.
- Yash Darekar: yashdevdarekar94@gmail.com
- Nilay Raut: nilay09raut@gmail.com
Academic project for CSYE7105 Fall 2025