This project provides an infrastructure setup for exploring distributed machine learning (ML) workloads on Kubernetes (K8s) using Amazon EKS. It includes Terraform modules for provisioning the infrastructure, Docker configurations for building custom images, and example scripts for running distributed training jobs with various frameworks and schedulers.
The project is organized into three main directories:
- `infra/`: Contains Terraform modules for provisioning the EKS cluster, GPU support, and distributed training tools.
- `docker/`: Includes Docker configurations for building custom images for MPI, GPU, and distributed training workloads.
- `examples/`: Provides example scripts and configurations for running distributed training jobs using:
  - Ray
  - Torchrun
  - MPI
  - Training Operator
  - Volcano Scheduler
.
├── docker/                   # Docker configurations for custom images
│   ├── g4ad_rocm_build/      # AMD GPU (ROCm) build
│   ├── g4dn_cuda_build/      # NVIDIA GPU (CUDA) build
│   └── standard_mpi_runner/  # Standard MPI runner
├── examples/                 # Example scripts and configurations
│   ├── kuberay/              # Ray cluster examples
│   ├── local/                # Local distributed training examples
│   ├── mpi/                  # MPI-based distributed training examples
│   ├── ray/                  # Ray-based distributed training examples
│   ├── torchrun/             # Torchrun-based distributed training examples
│   ├── training_operator/    # Training Operator examples
│   └── volcano_scheduler/    # Volcano Scheduler examples
└── infra/                    # Terraform modules for EKS infrastructure
    ├── ec2/                  # EC2 instance configurations
    └── eks/                  # EKS cluster configurations
- EKS Cluster: Fully managed Kubernetes cluster with GPU support.
- Distributed Training: Support for Ray, Torchrun, MPI, Training Operator, and Volcano Scheduler.
- GPU Support: Configurations for both NVIDIA (CUDA) and AMD (ROCm) GPUs.
- Monitoring: Integrated Prometheus and Grafana for cluster monitoring.
- Cost Management: Kubecost for cost monitoring and optimization.
- Automated Scaling: Karpenter for dynamic node provisioning.
To use this project, you need the following tools installed and configured (a quick way to verify them is sketched after this list):
- Terraform (>= 1.0): For provisioning infrastructure.
- AWS CLI: For interacting with AWS services.
- kubectl: For managing Kubernetes clusters.
- Helm (v3): For deploying Kubernetes applications.
- Python: For running distributed training scripts.
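
A minimal sketch of how to confirm the tools are available, assuming they are already on your `PATH` (the version flags below are the standard ones for each CLI):

```bash
# Quick sanity check of the prerequisite tooling.
terraform version          # expect >= 1.0
aws --version
kubectl version --client
helm version               # expect v3.x
python3 --version
```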
- Install the AWS CLI (see the AWS CLI Installation Guide).
- Configure your AWS credentials: `aws configure`
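
To confirm the credentials and default region are picked up as expected, a quick check using standard AWS CLI calls (no project-specific assumptions):

```bash
# Verify which identity and region the CLI will use.
aws sts get-caller-identity
aws configure get region
```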
- Navigate to the `infra/eks/` directory.
- Initialize Terraform: `terraform init`
- Apply the Terraform configuration: `terraform apply`
- Update your `kubectl` configuration to connect to the EKS cluster: `aws eks --region <region> update-kubeconfig --name <cluster-name>`
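
For example, with placeholder values (substitute the region and cluster name from your Terraform configuration):

```bash
# Point kubectl at the new cluster and confirm the worker nodes have joined.
aws eks --region us-east-1 update-kubeconfig --name my-eks-cluster
kubectl get nodes
```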
- Navigate to the `examples/` directory and choose the framework you want to use (e.g., Ray, Torchrun, MPI, Training Operator, or Volcano Scheduler).
- Follow the README in each subdirectory to deploy and run distributed training jobs.
- Navigate to `examples/ray/`.
- Follow the instructions in its `README.md` to deploy a Ray cluster and submit a job; a sketch of a typical workflow follows.
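
As a rough sketch of what that workflow usually looks like with KubeRay (the manifest, service, and script names below are illustrative placeholders; the real ones live in `examples/ray/`):

```bash
# Deploy the Ray cluster described by the example manifest (name is a placeholder).
kubectl apply -f ray_cluster.yaml

# Expose the Ray head's dashboard/job port locally (KubeRay names the service <cluster>-head-svc).
kubectl port-forward svc/raycluster-head-svc 8265:8265 &

# Submit a job against the forwarded port; train.py is a placeholder script name.
ray job submit --address http://localhost:8265 --working-dir . -- python train.py
```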
- Navigate to `examples/volcano_scheduler/`.
- Update the `multinode_ddp_volcano_mpi_job.yaml` manifest with your training script.
- Submit the job: `kubectl apply -f multinode_ddp_volcano_mpi_job.yaml`
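
After submitting, you can follow the Volcano job and its pods; a minimal sketch, assuming the Volcano CRDs are installed (the pod name is a placeholder taken from `kubectl get pods`):

```bash
# Volcano jobs are a custom resource; query them by their full resource name.
kubectl get jobs.batch.volcano.sh
kubectl get pods -w                    # watch the launcher/worker pods come up
kubectl logs -f <launcher-pod-name>    # follow the training output
```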
- Navigate to `examples/training_operator/`.
- Update the job manifest (e.g., `multinode_ddp_pytorch_job.yaml`) with your training script.
- Submit the job: `kubectl apply -f multinode_ddp_pytorch_job.yaml`
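
Once submitted, the Training Operator reconciles a `PyTorchJob` resource; a minimal sketch for following it (the job name below is a placeholder for whatever `metadata.name` the manifest uses):

```bash
# PyTorchJob is a custom resource managed by the Training Operator.
kubectl get pytorchjobs
kubectl describe pytorchjob <job-name>
kubectl logs -f <job-name>-master-0    # master replica pods follow the <job-name>-master-N convention
```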
- Grafana Dashboard: `kubectl port-forward svc/kube-prometheus-stack-grafana 8080:80 -n kube-prometheus-stack`, then open http://localhost:8080 (retrieving the admin password is sketched after this list).
- Kubernetes Dashboard: `kubectl port-forward service/kubernetes-dashboard-kong-proxy -n kubernetes-dashboard 8443:443`, then open https://localhost:8443.
- Job Status: use `kubectl get jobs` or `kubectl get pods` to monitor the status of your jobs.
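
Grafana's admin password lives in a Kubernetes secret; assuming the chart was installed under the release name `kube-prometheus-stack` (which the service name above suggests), it can be read with:

```bash
# Decode the Grafana admin password created by the kube-prometheus-stack chart.
kubectl get secret kube-prometheus-stack-grafana -n kube-prometheus-stack \
  -o jsonpath='{.data.admin-password}' | base64 -d
```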
To destroy the infrastructure and avoid unnecessary costs, run: `terraform destroy`
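
Like `terraform apply`, this needs to run against the state in the EKS module directory, presumably `infra/eks/`:

```bash
cd infra/eks
terraform destroy   # review the planned deletions and confirm
```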
- Terraform Documentation: https://www.terraform.io/docs/
- AWS EKS Documentation: https://docs.aws.amazon.com/eks/
- Kubernetes Documentation: https://kubernetes.io/docs/
- Ray Documentation: https://docs.ray.io/
- PyTorch Distributed Training: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html