This repository showcases hands-on projects leveraging distributed multi-GPU training to fine-tune large language models (LLMs), demonstrating expertise in PyTorch Distributed, DeepSpeed, Ray (Tune, Train), and MosaicML's LLM Foundry. Each project includes detailed experiment tracking, evaluation, and final model weights.
| Project | Framework / Tool | Model | Hardware | Experiment Tracking | Resources |
|---|---|---|---|---|---|
| PyTorch DDP Multi-GPU Training | PyTorch DDP | Qwen2-0.5B-Instruct | 2×T4 16GB | MLflow | |
| PyTorch FSDP Multi-GPU Training | PyTorch FSDP | OPT-1.3B | 2×T4 16GB | W&B | |
| DeepSpeed ZeRO-2 Offload Training | DeepSpeed ZeRO-2 Offload | Llama-3.2-1B-Instruct | 1×P100 16GB[^1] | W&B | |
| DeepSpeed Pipeline Parallelism | DeepSpeed Pipeline + ZeRO-1 | Llama-3.2-1B-Instruct | 2×T4 16GB | W&B | |
| LLM Foundry FSDP Fine-tuning | MosaicML's LLM Foundry, FSDP | OPT-1.3B | 2×T4 16GB | W&B | |
| Ray Train with DeepSpeed ZeRO-3 | Ray Train, DeepSpeed ZeRO-3 | BLOOMZ-1b1 | 2×T4 16GB | W&B | |
| Ray Tune Hyperparameter Optimization | Ray Tune, PyTorch | Qwen2-0.5B-Instruct | 2×T4 16GB | W&B | |
Most experiments were run on Kaggle with 2×T4 16GB GPUs.
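
To give a flavour of the multi-GPU setup used in the first project, here is a minimal PyTorch DDP fine-tuning sketch. It is illustrative only, not the repository's actual training script: the model ID matches the table, but the tiny in-memory dataset and hyperparameters are placeholders.

```python
# train_ddp.py -- minimal DDP sketch (illustrative; not the repo's exact script).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model_name = "Qwen/Qwen2-0.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across GPUs

    # Toy dataset standing in for the real fine-tuning data
    texts = ["Hello, distributed world!"] * 128
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=32, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"])
    sampler = DistributedSampler(dataset)  # shards batches across the 2 GPUs
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    for epoch in range(1):
        sampler.set_epoch(epoch)  # reshuffle shards every epoch
        for input_ids, attention_mask in loader:
            input_ids = input_ids.cuda(local_rank)
            attention_mask = attention_mask.cuda(local_rank)
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=input_ids)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script like this would be launched with one process per GPU, e.g. `torchrun --nproc_per_node=2 train_ddp.py`.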
[^1]: DeepSpeed ZeRO-2 offload peaked at ~37 GB CPU RAM, exceeding Kaggle's 30 GB CPU RAM limit, so the project was run on Vast.ai.
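
For context on that footnote, the sketch below shows what a ZeRO-2 CPU-offload configuration typically looks like. The values are placeholders rather than the project's actual settings; moving optimizer state into host memory via `offload_optimizer` is what drives CPU RAM usage well past Kaggle's limit.

```python
# Illustrative DeepSpeed ZeRO-2 CPU-offload setup (placeholder values, not the
# project's exact config). fp16 is used because the P100 does not support bf16.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Optimizer state lives in host RAM instead of GPU memory
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```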