High-performance parallel computing implementation for video action recognition using Something-Something V2 dataset. Comparing CPU parallelization (Joblib vs Dask) and GPU parallelization (DDP vs FSDP) strategies.
Team: Nilay Raut and Yash Darekar
Course: CSYE7105 High Performance Parallel Machine Learning & AI
Instructor: Prof. Handan Liu
Accelerate video action recognition from over 6 hours to under 2 hours using parallel computing techniques while maintaining model accuracy.
- CPU Preprocessing: Sequential vs Joblib vs Dask
- GPU Training: Single GPU vs DDP vs FSDP1 vs FSDP2
- Scaling Analysis: Strong scaling and weak scaling experiments
- Something-Something V2: 220,847 videos, 174 action classes
- Size: 18.2GB
- Download: Qualcomm Developer Portal
parallel-video-project/
├── data/
│ ├── videos/ # Raw video files (.webm)
│ └── labels/ # JSON label files
│ ├── labels.json # 174 class definitions
│ ├── train.json # Training split
│ ├── validation.json # Validation split
│ ├── test.json # Test IDs (no labels)
│ └── test-answers.csv # Test labels (for evaluation)
├── models/
│ └── pretrained/ # Contains pretrained model weights
├── src/
│ └── models/ # Model architecture definitions
│ ├── pytorch_i3d.py # I3D model implementation
│ └── i3d_model.py # I3D wrapper for Something-Something V2
├── notebooks/
│ ├── main_development.ipynb # Development and exploration
│ └── results_visualization.ipynb # Results analysis and plotting
├── scripts/
│ ├── baseline.py # Sequential baseline implementation
│ ├── joblib_preprocess.py # Joblib parallelization
│ ├── dask_preprocess.py # Dask parallelization
│ ├── train_single_gpu.py # Single GPU training
│ ├── train_ddp.py # Multi-GPU DDP training
│ ├── train_fsdp1.py # Multi-GPU FSDP1 training
│ ├── train_fsdp2.py # Multi-GPU FSDP2 training
│ ├── utils.py # Common utility functions (used in other scripts)
│ ├── setup_i3d.py # Setup I3D model and pretrained weights
│ └── slurm/ # SLURM job scripts
│ ├── cpu/
│ │ ├── baseline.sh # Sequential processing
│ │ ├── strong_scaling.sh # Joblib + Dask strong scaling
│ │ └── weak_scaling.sh # Joblib + Dask weak scaling
│ ├── gpu/
│ │ ├── single_gpu.sh # Single GPU baseline
│ │ ├── ddp_2gpu_frozen.sh # DDP strong scaling (2 GPUs)
│ │ ├── ddp_4gpu_frozen.sh # DDP strong scaling (4 GPUs)
│ │ ├── ddp_2gpu_weak.sh # DDP weak scaling (2 GPUs)
│ │ ├── ddp_4gpu_weak.sh # DDP weak scaling (4 GPUs)
│ │ ├── ddp_2gpu_full.sh # DDP full fine-tuning (2 GPUs)
│ │ ├── ddp_4gpu_full.sh # DDP full fine-tuning (4 GPUs)
│ │ ├── fsdp1_2gpu.sh # FSDP1 full fine-tuning (2 GPUs)
│ │ ├── fsdp1_4gpu.sh # FSDP1 full fine-tuning (4 GPUs)
│ │ ├── fsdp2_2gpu.sh # FSDP2 full fine-tuning (2 GPUs)
│ │ └── fsdp2_4gpu.sh # FSDP2 full fine-tuning (4 GPUs)
│ ├── run_cpu.sh # Submit all CPU experiments
│ ├── run_gpu.sh # Submit all GPU experiments
│ └── run_all.sh # Submit all experiments
│
├── tests/ # Test scripts
│ ├── __init__.py
│ ├── test_i3d.py # Test I3D model
│ ├── test_ddp.py # Test DDP setup
│ ├── test_fsdp.py # Test FSDP1 setup
│ └── test_fsdp2.py # Test FSDP2 setup
├── results/
│ ├── metrics/ # Performance measurements (CSV/JSON)
│ │ ├── all_results.csv # CPU preprocessing (baseline, joblib, dask)
│ │ ├── gpu_training_results.csv # GPU training (single_gpu, ddp, fsdp)
│ │ ├── baseline.json
│ │ ├── baseline_detailed.csv
│ │ ├── joblib_comparison.csv
│ │ ├── joblib_all_workers.json
│ │ ├── dask_comparison.csv
│ │ ├── dask_all_workers.json
│ │ ├── single_gpu_metrics.json
│ │ ├── ddp_2gpu_metrics.json
│ │ └── fsdp_2gpu_metrics.json
│ └── plots/ # Generated visualizations
├── venv/ # Python virtual environment
├── logs/ # SLURM job logs
│ ├── cpu/
│ └── gpu/
├── requirements.txt # Python dependencies
└── README.md # This file
# Clone repository
git clone git@github.com:yashdeep94/parallel_video_recognition.git
cd parallel-video-project
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip3 install -r requirements.txt

The I3D model setup and testing can be done either through the notebook or scripts:
# Open the main development notebook
jupyter notebook notebooks/main_development.ipynb
# In the notebook, run the I3D setup cells which will:
# 1. Download the I3D model architecture
# 2. Download pretrained weights (91MB)
# 3. Test the model
# 4. Verify GPU acceleration if available

# Download I3D model and pretrained weights
python3 scripts/setup_i3d.py
# Test I3D model
python3 tests/test_i3d.py

After setup, you should have:
- `src/models/pytorch_i3d.py` - I3D architecture
- `src/models/i3d_model.py` - I3D wrapper for 174 classes
- `models/pretrained/i3d_rgb_imagenet.pt` - Pretrained weights
# Place your downloaded videos in:
data/videos/ # All .webm files here
data/labels/ # JSON label files here
# Verify setup
python3 -c "from pathlib import Path; print(f'Videos found: {len(list(Path(\"data/videos\").glob(\"*.webm\")))}')"

# Make scripts executable
chmod +x scripts/slurm/**/*.sh
# Submit all experiments with dependencies
bash scripts/slurm/run_all.sh
# Or submit separately
bash scripts/slurm/run_cpu.sh # CPU experiments only
bash scripts/slurm/run_gpu.sh # GPU experiments only

# Check job status
squeue -u $USER
# Monitor specific job output
tail -f logs/cpu/baseline_<job_id>.out
tail -f logs/gpu/ddp_strong_scaling_<job_id>.out
# Cancel all jobs
scancel -u $USER

The baseline script establishes performance metrics for sequential (non-parallel) video processing.
# Process 100 videos with default settings (32 frames, 224x224 resize)
python3 scripts/baseline.py --num_videos 100
# Process videos and save results
python3 scripts/baseline.py --num_videos 100 --save_results

| Option | Default | Description |
|---|---|---|
| `--num_videos` | 100 | Number of videos to process |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize` | 224 | Resize dimension (square) |
| `--data_dir` | data/videos | Video directory path |
| `--save_results` | False | Save results to CSV/JSON |
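For reference, a minimal sketch of what the sequential baseline measures: decode each video, sample a fixed number of evenly spaced frames, and resize them, one video at a time. This assumes OpenCV decoding; the function and variable names are illustrative, not the script's actual API.

```python
import time
from pathlib import Path

import cv2
import numpy as np


def preprocess_video(path, num_frames=32, resize=224):
    """Decode a video, sample num_frames evenly spaced frames, resize each."""
    cap = cv2.VideoCapture(str(path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return np.stack(frames) if frames else None


if __name__ == "__main__":
    videos = sorted(Path("data/videos").glob("*.webm"))[:100]
    start = time.perf_counter()
    for v in videos:  # sequential: one video at a time
        preprocess_video(v)
    print(f"Sequential: {time.perf_counter() - start:.1f}s for {len(videos)} videos")
```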
Joblib enables parallel video processing using multiple CPU cores, achieving significant speedup over sequential processing.
# Test with workers (1,2,4,8)
python3 scripts/joblib_preprocess.py --num_videos 100 --workers "1,2,4,8"
# Save results for analysis
python3 scripts/joblib_preprocess.py --num_videos 100 --workers "1,2,4,8,16" --save_results

| Option | Default | Description |
|---|---|---|
| `--num_videos` | 100 | Number of videos to process |
| `--workers` | "1,2,4,8" | Comma-separated worker counts to test |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize` | 224 | Resize dimension (square) |
| `--data_dir` | data/videos | Video directory path |
| `--save_results` | False | Save results to CSV/JSON |
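Under the hood, Joblib parallelism amounts to mapping the same per-video function across a pool of worker processes and timing each worker count. A minimal self-contained sketch (same illustrative `preprocess_video` as in the baseline sketch):

```python
import time
from pathlib import Path

import cv2
import numpy as np
from joblib import Parallel, delayed


def preprocess_video(path, num_frames=32, resize=224):
    # Same evenly spaced sampling as the baseline sketch above
    cap = cv2.VideoCapture(str(path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return np.stack(frames) if frames else None


if __name__ == "__main__":
    videos = sorted(Path("data/videos").glob("*.webm"))[:100]
    for n_workers in [1, 2, 4, 8]:
        start = time.perf_counter()
        # The loky backend spawns worker processes, sidestepping the GIL
        # for CPU-bound decode/resize work.
        Parallel(n_jobs=n_workers, backend="loky")(
            delayed(preprocess_video)(v) for v in videos
        )
        print(f"{n_workers} workers: {time.perf_counter() - start:.1f}s")
```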
Dask provides an alternative CPU parallelization backend, distributing the same per-video preprocessing across a local cluster of worker processes, again with significant speedup over sequential processing.
# Test with workers (1,2,4,8)
python3 scripts/dask_preprocess.py --num_videos 100 --workers "1,2,4,8"
# Save results for analysis
python3 scripts/dask_preprocess.py --num_videos 100 --workers "1,2,4,8,16" --save_results

| Option | Default | Description |
|---|---|---|
| `--num_videos` | 100 | Number of videos to process |
| `--workers` | "1,2,4,8" | Comma-separated worker counts to test |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize` | 224 | Resize dimension (square) |
| `--data_dir` | data/videos | Video directory path |
| `--save_results` | False | Save results to CSV/JSON |
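With Dask the same idea runs on a `LocalCluster`: each worker count gets its own cluster, and `client.map` schedules one task per video. A minimal sketch (illustrative names; returning a small value keeps large arrays from being shipped back to the driver):

```python
import time
from pathlib import Path

import cv2
import numpy as np
from dask.distributed import Client, LocalCluster


def preprocess_video(path, num_frames=32, resize=224):
    # Same decode/sample/resize as the baseline sketch above
    cap = cv2.VideoCapture(str(path))
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    return len(frames)  # return something small; arrays stay on the workers


if __name__ == "__main__":
    videos = sorted(str(p) for p in Path("data/videos").glob("*.webm"))[:100]
    for n_workers in [1, 2, 4, 8]:
        with LocalCluster(n_workers=n_workers, threads_per_worker=1) as cluster, \
                Client(cluster) as client:
            start = time.perf_counter()
            futures = client.map(preprocess_video, videos)  # one task per video
            client.gather(futures)
            print(f"{n_workers} workers: {time.perf_counter() - start:.1f}s")
```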
Train the I3D model on a single GPU with optimized data loading and parallel preprocessing.
# Quick test with small dataset
python3 scripts/train_single_gpu.py --num_videos 100 --epochs 2 --batch_size 4 --freeze_backbone --save_results
# Full training with more data
python3 scripts/train_single_gpu.py --num_videos 1000 --epochs 10 --batch_size 8 --num_workers 8 --freeze_backbone --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--num_workers` | 4 | Parallel data loading workers |
| `--num_frames` | 32 | Frames to sample per video |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
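The `--freeze_backbone` flag corresponds to the standard PyTorch pattern of disabling gradients everywhere except the classification head. A minimal sketch of one training step under the defaults above (the model is a stand-in, not the real I3D wrapper; assumes a CUDA device):

```python
import torch
import torch.nn as nn


# Stand-in 3D CNN; the real model is the I3D wrapper in src/models/i3d_model.py
class TinyVideoNet(nn.Module):
    def __init__(self, num_classes=174):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):  # x: (batch, 3, frames, height, width)
        return self.classifier(self.backbone(x))


model = TinyVideoNet().cuda()

# --freeze_backbone: gradients flow only into the classification head
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, weight_decay=1e-4,  # --lr and --weight_decay defaults
)

x = torch.randn(8, 3, 32, 224, 224, device="cuda")  # one batch of clips
y = torch.randint(0, 174, (8,), device="cuda")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # --clip_grad
optimizer.step()
```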
Train the I3D model across multiple GPUs using Distributed Data Parallel (DDP) for significant speedup.
- Requires 2 or more GPUs
- NCCL backend support
- Must be run on cluster or multi-GPU system
# Test DDP setup
python3 tests/test_ddp.py
# Quick test with 2 GPUs
python3 scripts/train_ddp.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --freeze_backbone --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size per GPU |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--num_workers` | 4 | Parallel data loading workers |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--num_frames` | 32 | Frames to sample per video |
| `--num_gpus` | None (All available) | Number of GPUs |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
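The essential DDP moves: one process per GPU, `init_process_group` over NCCL, a `DistributedSampler` so each rank reads a disjoint shard of the dataset, and a model wrapper that all-reduces gradients during `backward()`. A minimal self-contained sketch for 2 GPUs (dataset and model are stand-ins for the real video pipeline):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Stand-in clips/labels; the real scripts load videos and the I3D wrapper
    data = TensorDataset(torch.randn(64, 3, 8, 32, 32),
                         torch.randint(0, 174, (64,)))
    sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
    loader = DataLoader(data, batch_size=8, sampler=sampler)

    model = torch.nn.Sequential(
        torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)
    ).cuda(rank)
    model = DDP(model, device_ids=[rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            loss = torch.nn.functional.cross_entropy(
                model(x.cuda(rank)), y.cuda(rank))
            opt.zero_grad()
            loss.backward()  # gradients are all-reduced across ranks here
            opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # one process per GPU
```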
Train the I3D model across multiple GPUs using Fully Sharded Data Parallel 1 (FSDP1) for improved memory efficiency and scalability.
- Requires 2 or more GPUs
- PyTorch >= 1.12
- NCCL backend support
- Must be run on cluster or multi-GPU system
# Test FSDP setup
python3 tests/test_fsdp.py
# Quick test with 2 GPUs (FULL_SHARD strategy)
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --save_results
# With mixed precision for better memory efficiency
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --mixed_precision --freeze_backbone --save_results
# Try different sharding strategies
python3 scripts/train_fsdp1.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --sharding_strategy SHARD_GRAD_OP --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size per GPU |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--num_workers` | 4 | Parallel data loading workers |
| `--num_frames` | 32 | Frames to sample per video |
| `--num_gpus` | None (All available) | Number of GPUs |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--sharding_strategy` | FULL_SHARD | Sharding strategy (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD) |
| `--mixed_precision` | False | Enable mixed precision training (FP16) |
| `--min_wrap_params` | 1000000 | Minimum parameters for FSDP wrapping |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
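Where DDP replicates the full model on every rank, FSDP1 shards parameters, gradients, and optimizer state, gathering each wrapped unit only for its forward/backward pass. A minimal sketch of the wrapping step, mirroring the flags above (assumes the process group is already initialized as in the DDP sketch; the model is a stand-in):

```python
import functools

import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Stand-in module on the current CUDA device (process group already set up)
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)
).cuda()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # --sharding_strategy
    mixed_precision=MixedPrecision(                 # --mixed_precision
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
    auto_wrap_policy=functools.partial(             # --min_wrap_params
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)
# The training loop itself is identical to the DDP sketch from here on.
```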
Train the I3D model across multiple GPUs using Fully Sharded Data Parallel 2 (FSDP2) for improved memory efficiency and scalability.
- Requires 2 or more GPUs
- PyTorch >= 1.12
- NCCL backend support
- Must be run on cluster or multi-GPU system
# Test FSDP setup
python3 tests/test_fsdp2.py
# Quick test with 2 GPUs (FULL_SHARD strategy)
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --save_results
# With mixed precision for better memory efficiency
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --mixed_precision --freeze_backbone --save_results
# Try different sharding strategies
python3 scripts/train_fsdp2.py --num_videos 1000 --epochs 10 --batch_size 8 --num_gpus 2 --sharding_strategy SHARD_GRAD_OP --save_results

| Option | Default | Description |
|---|---|---|
| `--data_dir` | data/videos | Video data directory |
| `--num_videos` | -1 | Number of videos (-1 for all available) |
| `--epochs` | 10 | Number of training epochs |
| `--batch_size` | 8 | Batch size per GPU |
| `--lr` | 0.001 | Learning rate |
| `--weight_decay` | 1e-4 | Weight decay |
| `--clip_grad` | 1.0 | Max gradient norm for clipping |
| `--compile` | False | Use torch.compile for better performance |
| `--num_workers` | 4 | Parallel data loading workers |
| `--num_frames` | 32 | Frames to sample per video |
| `--num_gpus` | None (All available) | Number of GPUs |
| `--resize_dim` | 224 | Resize dimension for frames |
| `--sharding_strategy` | FULL_SHARD | Sharding strategy (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD) |
| `--mixed_precision` | False | Enable mixed precision training (FP16) |
| `--min_wrap_params` | 1000000 | Minimum parameters for FSDP wrapping |
| `--freeze_backbone` | False | Freeze I3D backbone for faster training |
| `--save_results` | False | Save training metrics |
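FSDP2 replaces the wrapper class with a composable `fully_shard` call that shards each module's parameters in place as DTensors. A minimal sketch, assuming a recent PyTorch (2.6+, where `fully_shard` is exported from `torch.distributed.fsdp`; earlier releases expose it under `torch.distributed._composable.fsdp`), with process group setup as in the DDP sketch and a stand-in model:

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

# Stand-in module on the current CUDA device (process group already set up)
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 32 * 32, 174)
).cuda()

mp_policy = MixedPrecisionPolicy(  # --mixed_precision
    param_dtype=torch.float16, reduce_dtype=torch.float16
)

# Shard leaf submodules first, then the root to pick up whatever remains;
# parameters become DTensors sharded across ranks, in place.
for submodule in model:
    fully_shard(submodule, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

model = torch.compile(model)  # optional, corresponds to --compile
# Training loop and optimizer usage are unchanged from the DDP sketch.
```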
- Base Model: Inception-V1 based 3D CNN
- Pretrained on: Kinetics-400 dataset
- Modified for: Something-Something V2 (174 classes)
- Input: `(batch, color_channels, frames, height, width)` RGB videos
- Recommended frames: 32
- Model size: ~47MB
- Backbone Freezing: Load pretrained weights and freeze all layers except the final classification layer for faster convergence
- Full Fine-tuning: Load pretrained weights and train all layers end to end
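A quick sanity check of that input contract, using a stand-in 3D-conv stem rather than the real I3D (the shapes are the point here):

```python
import torch
import torch.nn as nn

# Documented I3D input layout: (batch, channels, frames, height, width)
clip = torch.randn(2, 3, 32, 224, 224)

# Minimal stand-in for a 3D-conv stem plus classifier head (not the real I3D)
stem = nn.Conv3d(3, 64, kernel_size=7, stride=2, padding=3)
head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 174))

feats = stem(clip)    # -> (2, 64, 16, 112, 112): conv downsamples time and space
logits = head(feats)  # -> (2, 174): one logit per Something-Something V2 class
print(feats.shape, logits.shape)
```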
When using --save_results, the scripts generate:
Baseline:
- `results/metrics/baseline.json` - Latest baseline run details
- `results/metrics/all_results.csv` - Master tracker (accumulates all experiments)

Joblib:
- `results/metrics/joblib_comparison.csv` - Comparison of all worker counts
- `results/metrics/joblib_all_workers.json` - Detailed results for all configurations
- `results/metrics/all_results.csv` - Updated with all worker configurations

Dask:
- `results/metrics/dask_comparison.csv` - Comparison of all worker counts
- `results/metrics/dask_all_workers.json` - Detailed results for all configurations
- `results/metrics/all_results.csv` - Updated with all worker configurations

Single GPU:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/single_gpu_metrics.json` - Single-GPU results

DDP:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/ddp_{number_of_gpus}gpu_metrics.json` - DDP results per GPU count

FSDP1:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/fsdp1_{number_of_gpus}gpu_metrics.json` - FSDP1 results per GPU count

FSDP2:
- `results/metrics/gpu_training_results.csv` - Master GPU result tracker
- `results/metrics/fsdp2_{number_of_gpus}gpu_metrics.json` - FSDP2 results per GPU count
Use these files in notebooks/results_visualization.ipynb to create:
- Speedup curves
- Efficiency plots
- Scaling analysis
- Worker optimization charts
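Speedup and efficiency follow directly from the recorded timings: S(n) = T(1)/T(n) and E(n) = S(n)/n. A minimal sketch of deriving both from the master CSV (the column names `method`, `workers`, and `time_sec` are assumptions about the file layout; adjust to the actual headers):

```python
import pandas as pd

# Assumed columns: method, workers, time_sec -- adjust to the actual CSV layout
df = pd.read_csv("results/metrics/all_results.csv")

for method, grp in df.groupby("method"):
    grp = grp.sort_values("workers")
    t1 = grp.loc[grp["workers"] == 1, "time_sec"].iloc[0]  # serial reference
    out = pd.DataFrame({
        "workers": grp["workers"],
        "speedup": (t1 / grp["time_sec"]).round(2),                      # S(n)
        "efficiency": (t1 / grp["time_sec"] / grp["workers"]).round(2),  # E(n)
    })
    print(method)
    print(out.to_string(index=False))
```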
- PyTorch 2.0+
- Joblib
- Dask
- OpenCV
- NumPy
- tqdm
- pandas
See requirements.txt for complete list.
- Yash Darekar: yashdevdarekar94@gmail.com
- Nilay Raut: nilay09raut@gmail.com
Academic project for CSYE7105 Fall 2025