Parallel Video Action Recognition with Multi-Strategy Performance Analysis

High-performance parallel computing implementation for video action recognition on the Something-Something V2 dataset. The project compares CPU parallelization strategies (Joblib vs Dask) and GPU parallelization strategies (DDP vs FSDP).

Team: Nilay Raut and Yash Darekar
Course: CSYE7105 High Performance Parallel Machine Learning & AI
Instructor: Prof. Handan Liu

Project Goal

Accelerate video action recognition from 6+ hours to under 2 hours using parallel computing techniques while maintaining model accuracy, targeting roughly a 6x speedup in video preprocessing and 3x in model training.

Key Comparisons

  • CPU Preprocessing: Sequential vs Joblib vs Dask
  • GPU Training: Single GPU vs DDP vs FSDP1 vs FSDP2
  • Scaling Analysis: Strong scaling and weak scaling experiments

Dataset

Something-Something V2: roughly 220k short .webm video clips covering 174 fine-grained action classes, with JSON label files for the train/validation/test splits (see the data/ layout below).
Project Structure

parallel-video-project/
├── data/
│   ├── videos/                         # Raw video files (.webm)
│   └── labels/                         # JSON label files
│       ├── labels.json                 # 174 class definitions
│       ├── train.json                  # Training split
│       ├── validation.json             # Validation split
│       ├── test.json                   # Test IDs (no labels)
│       └── test-answers.csv            # Test labels (for evaluation)
├── models/
│   └── pretrained/                     # Contains pretrained model weights
├── src/
│   └── models/                         # Model architecture definitions
│       ├── pytorch_i3d.py              # I3D model implementation
│       └── i3d_model.py                # I3D wrapper for Something-Something V2
├── notebooks/
│   ├── main_development.ipynb          # Development and exploration
│   └── results_visualization.ipynb     # Results analysis and plotting
├── scripts/
│   ├── baseline.py                     # Sequential baseline implementation
│   ├── joblib_preprocess.py            # Joblib parallelization
│   ├── dask_preprocess.py              # Dask parallelization
│   ├── train_single_gpu.py             # Single GPU training
│   ├── train_ddp.py                    # Multi-GPU DDP training
│   ├── train_fsdp1.py                  # Multi-GPU FSDP1 training
│   ├── train_fsdp2.py                  # Multi-GPU FSDP2 training
│   ├── utils.py                        # Common utility functions (used in other scripts)
│   ├── setup_i3d.py                    # Setup I3D model and pretrained weights
│   └── slurm/                          # SLURM job scripts
│       ├── cpu/
│       │   ├── baseline.sh             # Sequential processing
│       │   ├── strong_scaling.sh       # Joblib + Dask strong scaling
│       │   └── weak_scaling.sh         # Joblib + Dask weak scaling
│       ├── gpu/
│       │   ├── single_gpu.sh           # Single GPU baseline
│       │   ├── ddp_2gpu_frozen.sh      # DDP strong scaling (2 GPUs)
│       │   ├── ddp_4gpu_frozen.sh      # DDP strong scaling (4 GPUs)
│       │   ├── ddp_2gpu_weak.sh        # DDP weak scaling (2 GPUs)
│       │   ├── ddp_4gpu_weak.sh        # DDP weak scaling (4 GPUs)
│       │   ├── ddp_2gpu_full.sh        # DDP full fine-tuning (2 GPUs)
│       │   ├── ddp_4gpu_full.sh        # DDP full fine-tuning (4 GPUs)
│       │   ├── fsdp1_2gpu.sh           # FSDP1 full fine-tuning (2 GPUs)
│       │   ├── fsdp1_4gpu.sh           # FSDP1 full fine-tuning (4 GPUs)
│       │   ├── fsdp2_2gpu.sh           # FSDP2 full fine-tuning (2 GPUs)
│       │   └── fsdp2_4gpu.sh           # FSDP2 full fine-tuning (4 GPUs)
│       ├── run_cpu.sh                  # Submit all CPU experiments
│       ├── run_gpu.sh                  # Submit all GPU experiments
│       └── run_all.sh                  # Submit all experiments
│
├── tests/                              # Test scripts
│   ├── __init__.py
│   ├── test_i3d.py                     # Test I3D model
│   ├── test_ddp.py                     # Test DDP setup
│   ├── test_fsdp.py                    # Test FSDP1 setup
│   └── test_fsdp2.py                   # Test FSDP2 setup
├── results/
│   ├── metrics/                        # Performance measurements (CSV/JSON)
│   │   ├── all_results.csv             # CPU preprocessing (baseline, joblib, dask)
│   │   ├── gpu_training_results.csv    # GPU training (single_gpu, ddp, fsdp)
│   │   ├── baseline.json
│   │   ├── baseline_detailed.csv
│   │   ├── joblib_comparison.csv
│   │   ├── joblib_all_workers.json
│   │   ├── dask_comparison.csv
│   │   ├── dask_all_workers.json
│   │   ├── single_gpu_metrics.json 
│   │   ├── ddp_2gpu_metrics.json
│   │   └── fsdp_2gpu_metrics.json
│   └── plots/                          # Generated visualizations
├── venv/                               # Python virtual environment
├── logs/                               # SLURM job logs
│   ├── cpu/
│   └── gpu/
├── requirements.txt                    # Python dependencies
└── README.md                           # This file

Quick Start

1. Setup Environment

# Clone repository
git clone git@github.com:yashdeep94/parallel_video_recognition.git
cd parallel_video_recognition

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip3 install -r requirements.txt

2. Setup I3D Model

The I3D model setup and testing can be done either through the notebook or scripts:

Option A: Using Notebook

# Open the main development notebook
jupyter notebook notebooks/main_development.ipynb

# In the notebook, run the I3D setup cells which will:
# 1. Download the I3D model architecture
# 2. Download pretrained weights (91MB)
# 3. Test the model
# 4. Verify GPU acceleration if available

Option B: Using Scripts

# Download I3D model and pretrained weights
python3 scripts/setup_i3d.py

# Test I3D model
python3 tests/test_i3d.py

After setup, you should have:

  • src/models/pytorch_i3d.py - I3D architecture
  • src/models/i3d_model.py - I3D wrapper for 174 classes
  • models/pretrained/i3d_rgb_imagenet.pt - Pretrained weights

3. Prepare Data

# Place your downloaded videos in:
data/videos/  # All .webm files here
data/labels/  # JSON label files here

# Verify setup
python3 -c "from pathlib import Path; print(f'Videos found: {len(list(Path(\"data/videos\").glob(\"*.webm\")))}')"

Running Experiments

Option A: Using SLURM scripts

Submit All Experiments

# Make scripts executable
chmod +x scripts/slurm/**/*.sh

# Submit all experiments with dependencies
bash scripts/slurm/run_all.sh

# Or submit separately
bash scripts/slurm/run_cpu.sh    # CPU experiments only
bash scripts/slurm/run_gpu.sh    # GPU experiments only

Monitor Jobs

# Check job status
squeue -u $USER

# Monitor specific job output
tail -f logs/cpu/baseline_<job_id>.out
tail -f logs/gpu/ddp_strong_scaling_<job_id>.out

# Cancel all jobs
scancel -u $USER

Option B: Manual Execution (using Python scripts)

1. Run Baseline Sequential Processing

The baseline script establishes performance metrics for sequential (non-parallel) video processing.

Basic Usage:
# Process 100 videos with default settings (32 frames, 224x224 resize)
python3 scripts/baseline.py --num_videos 100

# Process videos and save results
python3 scripts/baseline.py --num_videos 100 --save_results
Command Line Options:
Option Default Description
--num_videos 100 Number of videos to process
--num_frames 32 Frames to sample per video
--resize 224 Resize dimension (square)
--data_dir data/videos Video directory path
--save_results False Save results to CSV/JSON
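
For orientation, the sequential baseline boils down to decoding each video in turn and sampling a fixed number of resized frames. A minimal sketch of that pattern (the helper name load_frames and the exact sampling logic are illustrative, not necessarily what scripts/baseline.py does):

import time
from pathlib import Path

import cv2
import numpy as np

def load_frames(video_path, num_frames=32, resize=224):
    """Decode one video and return num_frames evenly spaced, resized frames."""
    cap = cv2.VideoCapture(str(video_path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, resize, resize, 3))

# Sequential baseline: one video at a time, timed end to end
videos = sorted(Path("data/videos").glob("*.webm"))[:100]
start = time.time()
clips = [load_frames(v) for v in videos]
print(f"Processed {len(clips)} videos in {time.time() - start:.1f}s")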

2. Run Joblib Parallel Processing

Joblib enables parallel video processing using multiple CPU cores, achieving significant speedup over sequential processing.

Basic Usage:
# Test with workers (1,2,4,8)
python3 scripts/joblib_preprocess.py --num_videos 100 --workers "1,2,4,8"

# Save results for analysis
python3 scripts/joblib_preprocess.py --num_videos 100 --workers "1,2,4,8,16" --save_results
Command Line Options:
Option Default Description
--num_videos 100 Number of videos to process
--workers "1,2,4,8" Comma-separated worker counts to test
--num_frames 32 Frames to sample per video
--resize 224 Resize dimension (square)
--data_dir data/videos Video directory path
--save_results False Save results to CSV/JSON
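
Conceptually, the Joblib version maps the same per-video function over a pool of worker processes. A minimal sketch (reusing the illustrative load_frames helper from the baseline sketch above):

import time
from pathlib import Path

from joblib import Parallel, delayed

videos = sorted(Path("data/videos").glob("*.webm"))[:100]

for n_workers in (1, 2, 4, 8):
    start = time.time()
    # The loky backend spawns separate processes, so CPU-bound decoding scales with cores
    clips = Parallel(n_jobs=n_workers, backend="loky")(
        delayed(load_frames)(v) for v in videos
    )
    print(f"{n_workers} workers: {time.time() - start:.1f}s for {len(clips)} videos")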

3. Run Dask Parallel Processing

Dask provides an alternative parallel backend: the same per-video preprocessing is expressed as delayed tasks and distributed across a local cluster of worker processes, allowing a head-to-head comparison with Joblib.

Basic Usage:
# Test with workers (1,2,4,8)
python3 scripts/dask_preprocess.py --num_videos 100 --workers "1,2,4,8"

# Save results for analysis
python3 scripts/dask_preprocess.py --num_videos 100 --workers "1,2,4,8,16" --save_results
Command Line Options:
Option Default Description
--num_videos 100 Number of videos to process
--workers "1,2,4,8" Comma-separated worker counts to test
--num_frames 32 Frames to sample per video
--resize 224 Resize dimension (square)
--data_dir data/videos Video directory path
--save_results False Save results to CSV/JSON
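
A hedged sketch of the Dask variant (again reusing the illustrative load_frames helper from the baseline sketch, with cluster settings chosen for CPU-bound decoding):

import time
from pathlib import Path

from dask import compute, delayed
from dask.distributed import Client, LocalCluster

videos = sorted(Path("data/videos").glob("*.webm"))[:100]

for n_workers in (1, 2, 4, 8):
    # One single-threaded process per worker, since video decoding is CPU-bound
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    client = Client(cluster)
    start = time.time()
    tasks = [delayed(load_frames)(v) for v in videos]
    clips = compute(*tasks)  # executes the task graph on the active client
    print(f"{n_workers} workers: {time.time() - start:.1f}s for {len(clips)} videos")
    client.close()
    cluster.close()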

4. Run Single GPU Training

Train the I3D model on a single GPU with optimized data loading and parallel preprocessing.

Basic Usage:
# Quick test with small dataset
python3 scripts/train_single_gpu.py --num_videos 100 --epochs 2 --batch_size 4 --freeze_backbone --save_results

# Full training with more data
python3 scripts/train_single_gpu.py --num_videos 1000 --epochs 10 --batch_size 8 --num_workers 8 --freeze_backbone --save_results
Command Line Options:
Option Default Description
--data_dir data/videos Video data directory
--num_videos -1 Number of videos (-1 for all available)
--epochs 10 Number of training epochs
--batch_size 8 Batch size
--lr 0.001 Learning rate
--weight_decay 1e-4 Weight decay
--clip_grad 1.0 Max gradient norm for clipping
--num_workers 4 Parallel data loading workers
--num_frames 32 Frames to sample per video
--resize_dim 224 Resize dimension for frames
--freeze_backbone False Freeze I3D backbone for faster training
--save_results False Save training metrics
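
The options above map onto a fairly standard PyTorch loop. A minimal, self-contained sketch with a toy model and random tensors standing in for I3D and real clips (--num_workers is what parallelizes data loading; --clip_grad corresponds to the gradient-norm clipping before each optimizer step):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: random "clips" of shape (3, 8, 32, 32) and labels for 174 classes
data = TensorDataset(torch.randn(32, 3, 8, 32, 32), torch.randint(0, 174, (32,)))

# num_workers processes load/collate batches in parallel; pin_memory speeds host-to-GPU copies
loader = DataLoader(data, batch_size=8, shuffle=True, num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for clips, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(clips.to(device)), labels.to(device))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # --clip_grad
    optimizer.step()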

5. Run Multi-GPU Training with DDP

Train the I3D model across multiple GPUs using Distributed Data Parallel (DDP) for significant speedup.

Prerequisites:
  • Requires 2 or more GPUs
  • NCCL backend support
  • Must be run on cluster or multi-GPU system
Basic Usage:
# Test DDP setup
python3 tests/test_ddp.py

# Quick test with 2 GPUs
python3 scripts/train_ddp.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --freeze_backbone --save_results
Command Line Options:
Option Default Description
--data_dir data/videos Video data directory
--num_videos -1 Number of videos (-1 for all available)
--epochs 10 Number of training epochs
--batch_size 8 Batch size per GPU
--lr 0.001 Learning rate
--weight_decay 1e-4 Weight decay
--num_workers 4 Parallel data loading workers
--clip_grad 1.0 Max gradient norm for clipping
--num_frames 32 Frames to sample per video
--num_gpus None (All available) Number of GPUs
--resize_dim 224 Resize dimension for frames
--freeze_backbone False Freeze I3D backbone for faster training
--save_results False Save training metrics
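
At its core, DDP launches one process per GPU, gives each a shard of the data via DistributedSampler, and all-reduces gradients after every backward pass. A minimal, self-contained skeleton (a toy model and random tensors stand in for I3D and real clips; scripts/train_ddp.py will differ in detail):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy stand-ins: random "clips" and 174-class labels
    data = TensorDataset(torch.randn(64, 3, 8, 32, 32), torch.randint(0, 174, (64,)))
    sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
    loader = DataLoader(data, batch_size=8, sampler=sampler)  # batch size is per GPU

    model = torch.nn.Sequential(torch.nn.Flatten(),
                                torch.nn.Linear(3 * 8 * 32 * 32, 174)).cuda(rank)
    model = DDP(model, device_ids=[rank])  # gradients are all-reduced across ranks
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards consistently across ranks
        for clips, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(clips.cuda(rank)), labels.cuda(rank))
            loss.backward()
            optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)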

6. Run Multi-GPU Training with FSDP1

Train the I3D model across multiple GPUs using Fully Sharded Data Parallel 1 (FSDP1) for improved memory efficiency and scalability.

Prerequisites:
  • Requires 2 or more GPUs
  • PyTorch >= 1.12
  • NCCL backend support
  • Must be run on cluster or multi-GPU system
Basic Usage:
# Test FSDP1 setup
python3 tests/test_fsdp.py

# Quick test with 2 GPUs (FULL_SHARD strategy)
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --save_results

# With mixed precision for better memory efficiency
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --mixed_precision --freeze_backbone --save_results

# Try different sharding strategies
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --sharding_strategy SHARD_GRAD_OP --save_results
Command Line Options:
Option Default Description
--data_dir data/videos Video data directory
--num_videos -1 Number of videos (-1 for all available)
--epochs 10 Number of training epochs
--batch_size 8 Batch size per GPU
--lr 0.001 Learning rate
--weight_decay 1e-4 Weight decay
--num_workers 4 Parallel data loading workers
--num_frames 32 Frames to sample per video
--num_gpus None (All available) Number of GPUs
--clip_grad 1.0 Max gradient norm for clipping
--resize_dim 224 Resize dimension for frames
--sharding_strategy FULL_SHARD Sharding strategy (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD)
--mixed_precision False Enable mixed precision training (FP16)
--min_wrap_params 1000000 Minimum parameters for FSDP wrapping
--freeze_backbone False Freeze I3D backbone for faster training
--save_results False Save training metrics
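
FSDP1 differs from DDP mainly in how the model is wrapped: parameters, gradients, and optimizer state are sharded across ranks according to a sharding strategy and an auto-wrap policy. A sketch of just the wrapping step, matching the options above (assumes the same per-rank process-group setup as the DDP skeleton; the real script may configure this differently):

import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def wrap_fsdp1(model, rank, mixed_precision=False):
    """Shard parameters, gradients, and optimizer state across ranks with FSDP1."""
    mp_policy = None
    if mixed_precision:
        # --mixed_precision: keep parameters, gradient reduction, and buffers in FP16
        mp_policy = MixedPrecision(param_dtype=torch.float16,
                                   reduce_dtype=torch.float16,
                                   buffer_dtype=torch.float16)
    return FSDP(
        model.cuda(rank),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # --sharding_strategy
        auto_wrap_policy=functools.partial(size_based_auto_wrap_policy,
                                           min_num_params=1_000_000),  # --min_wrap_params
        mixed_precision=mp_policy,
        device_id=rank,
    )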

7. Run Multi-GPU Training with FSDP2

Train the I3D model across multiple GPUs using Fully Sharded Data Parallel 2 (FSDP2) for improved memory efficiency and scalability.

Prerequisites:
  • Requires 2 or more GPUs
  • PyTorch >= 1.12
  • NCCL backend support
  • Must be run on cluster or multi-GPU system
Basic Usage:
# Test FSDP2 setup
python3 tests/test_fsdp2.py

# Quick test with 2 GPUs (FULL_SHARD strategy)
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --save_results

# With mixed precision for better memory efficiency
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --mixed_precision --freeze_backbone --save_results

# Try different sharding strategies
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --sharding_strategy SHARD_GRAD_OP --save_results
Command Line Options:
Option Default Description
--data_dir data/videos Video data directory
--num_videos -1 Number of videos (-1 for all available)
--epochs 10 Number of training epochs
--batch_size 8 Batch size per GPU
--lr 0.001 Learning rate
--weight_decay 1e-4 Weight decay
--clip_grad 1.0 Max gradient norm for clipping
--compile False Uses torch.compile for better performance
--num_workers 4 Parallel data loading workers
--num_frames 32 Frames to sample per video
--num_gpus None (All available) Number of GPUs
--resize_dim 224 Resize dimension for frames
--sharding_strategy FULL_SHARD Sharding strategy (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD)
--mixed_precision False Enable mixed precision training (FP16)
--min_wrap_params 1000000 Minimum parameters for FSDP wrapping
--freeze_backbone False Freeze I3D backbone for faster training
--save_results False Save training metrics
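
FSDP2 reworks the same idea around the fully_shard API, which shards per-parameter state module by module instead of wrapping the whole model in one class, and it requires a recent PyTorch release that exposes fully_shard. A hedged sketch of the wrapping step (how scripts/train_fsdp2.py actually maps its options onto this API is an assumption):

import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

def wrap_fsdp2(model, mixed_precision=False, use_compile=False):
    """FSDP2-style sharding: apply fully_shard to large submodules, then to the root module."""
    policy = (MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float16)
              if mixed_precision else MixedPrecisionPolicy())
    for block in model.children():
        fully_shard(block, mp_policy=policy)  # shard each top-level block
    fully_shard(model, mp_policy=policy)      # then the root module
    if use_compile:
        model = torch.compile(model)          # --compile
    return model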

Model Information

I3D Architecture

  • Base Model: Inception-V1 based 3D CNN
  • Pretrained on: Kinetics-400 dataset
  • Modified for: Something-Something V2 (174 classes)
  • Input: (batch, color_channels, frames, height, width) - RGB videos
  • Recommended frames: 32
  • Model size: ~47MB

Training Strategy (Transfer Learning)

  1. Backbone Freezing: Load pretrained weights and freeze all layers except the final classification layer for faster convergence (see the sketch below)
  2. Full Fine-tuning: Load pretrained weights and train all layers end to end
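
A minimal sketch of the freezing step (the logits attribute name for the final classifier is an assumption about the I3D wrapper; adapt it to the actual module name in src/models/i3d_model.py):

def freeze_backbone(model):
    """Freeze every parameter, then re-enable gradients only on the final classification layer."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.logits.parameters():  # assumed name of the final classification layer
        param.requires_grad = True
    return model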

Results Tracking

When using --save_results, the scripts generate:

Baseline Results:

  • results/metrics/baseline.json - Latest baseline run details
  • results/metrics/all_results.csv - Master tracker (accumulates all experiments)

Joblib Results:

  • results/metrics/joblib_comparison.csv - Comparison of all worker counts
  • results/metrics/joblib_all_workers.json - Detailed results for all configurations
  • results/metrics/all_results.csv - Updated with all worker configurations

Dask Results:

  • results/metrics/dask_comparison.csv - Comparison of all worker counts
  • results/metrics/dask_all_workers.json - Detailed results for all configurations
  • results/metrics/all_results.csv - Updated with all worker configurations

Single GPU Results:

  • results/metrics/gpu_training_results.csv - Master GPU results tracker
  • results/metrics/single_gpu_metrics.json - Single-GPU run metrics

DDP Multi-GPU Results:

  • results/metrics/gpu_training_results.csv - Master GPU results tracker
  • results/metrics/ddp_{number_of_gpus}gpu_metrics.json - DDP metrics for the given GPU count

FSDP1 Multi-GPU Results:

  • results/metrics/gpu_training_results.csv - Master GPU results tracker
  • results/metrics/fsdp1_{number_of_gpus}gpu_metrics.json - FSDP1 metrics for the given GPU count

FSDP2 Multi-GPU Results:

  • results/metrics/gpu_training_results.csv - Master GPU results tracker
  • results/metrics/fsdp2_{number_of_gpus}gpu_metrics.json - FSDP2 metrics for the given GPU count

Analysis Files:

Use these files in notebooks/results_visualization.ipynb to create:

  • Speedup curves
  • Efficiency plots
  • Scaling analysis
  • Worker optimization charts
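
For the speedup and efficiency plots, the usual definitions are S(p) = T(1) / T(p) and E(p) = S(p) / p. A small pandas sketch (the column names method, num_workers, and total_time_sec are assumptions about the CSV schema; adjust them to whatever all_results.csv actually records):

import pandas as pd

df = pd.read_csv("results/metrics/all_results.csv")
baseline_time = df.loc[df["method"] == "baseline", "total_time_sec"].iloc[0]

par = df[df["method"].isin(["joblib", "dask"])].copy()
par["speedup"] = baseline_time / par["total_time_sec"]     # S(p) = T(1) / T(p)
par["efficiency"] = par["speedup"] / par["num_workers"]    # E(p) = S(p) / p
print(par[["method", "num_workers", "speedup", "efficiency"]])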

Dependencies

  • PyTorch 2.0+
  • Joblib
  • Dask
  • OpenCV
  • NumPy
  • tqdm
  • pandas

See requirements.txt for complete list.

Contact

License

Academic project for CSYE7105 Fall 2025

