This directory contains Terraform modules for provisioning MLOps-focused infrastructure in AWS. The modules support both EKS-based and EC2-based deployments, allowing you to choose the most suitable environment for your machine learning workloads.
The EKS module (`eks/`) provides a production-grade Kubernetes cluster optimized for ML workloads, featuring:
- Managed node groups with GPU support (NVIDIA & AMD)
- Comprehensive monitoring stack (Prometheus & Grafana)
- MLOps tooling (Karpenter, Ray, Volcano scheduler)
- Cost monitoring with Kubecost
- Load balancing and ingress management
The EC2 module (`ec2/`) provides simulated local environments for ML development and testing, offering:
- Single and multi-node configurations
- Support for NVIDIA and AMD GPUs
- MPI-ready setups for distributed training
- Systems Manager (SSM) based access
- Flexible instance type configurations
Before deploying either module, ensure you have the following tools and configurations in place:
```text
# Required tools and versions
terraform >= 1.0
aws-cli   >= 2.0
kubectl   >= 1.25   # For EKS deployments
helm      >= 3.0    # For EKS deployments
```
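To confirm your installed versions meet these minimums:

```bash
terraform version
aws --version
kubectl version --client   # For EKS deployments
helm version               # For EKS deployments
```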
- Configure the AWS CLI with appropriate credentials:

  ```bash
  aws configure
  ```

- Verify your configuration:

  ```bash
  aws sts get-caller-identity
  ```
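A successful call returns JSON identifying your account and principal (the values below are placeholders):

```json
{
    "UserId": "AIDAEXAMPLEUSERID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/your-user"
}
```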
Create a `terraform.tfvars` file in the respective module directory to set your variables:
```hcl
# terraform.tfvars example
region = "us-west-2"
owner  = "team-ml"

# For EKS
training_job_multinode_gpu_enabled = true
ray_cluster_enabled                = true

# For EC2
singlenode_multigpu = true
multinode_gpu       = false
```
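These variables are declared by each module; the sketch below shows roughly what such declarations look like. The names match the example above, but the types, defaults, and descriptions here are assumptions; check each module's `variables.tf` for the authoritative definitions.

```hcl
# Hypothetical declarations -- consult the module's variables.tf
variable "region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-west-2"
}

variable "ray_cluster_enabled" {
  description = "Whether to deploy a Ray cluster on EKS (assumed)"
  type        = bool
  default     = false
}
```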
Navigate to the desired module directory and run:
```bash
# Initialize Terraform
terraform init

# Review the deployment plan
terraform plan

# Apply the configuration
terraform apply
```
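To apply exactly what you reviewed, you can optionally save the plan to a file and apply that artifact:

```bash
# Save the reviewed plan, then apply it verbatim
terraform plan -out=tfplan
terraform apply tfplan
```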
For EKS deployments:
```bash
# Configure kubectl
aws eks update-kubeconfig --region <region> --name <cluster_name>

# Verify cluster access
kubectl get nodes
```
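If you enabled GPU node groups, you can also confirm that the device plugin is advertising GPUs. The `nvidia.com/gpu` resource name below assumes the NVIDIA device plugin; AMD GPUs are typically exposed as `amd.com/gpu`:

```bash
# Show allocatable NVIDIA GPUs per node
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```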
For EC2 deployments:
```bash
# List instance IDs
terraform output instance_ids

# Connect via Session Manager
aws ssm start-session --target <INSTANCE_ID>
```
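If `instance_ids` is a list output, the connection can be scripted instead of copy-pasted. A minimal sketch, assuming `jq` is installed and the output name matches:

```bash
# Grab the first instance ID from the Terraform output (assumed to be a list)
INSTANCE_ID=$(terraform output -json instance_ids | jq -r '.[0]')
aws ssm start-session --target "$INSTANCE_ID"
```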
To destroy the infrastructure when no longer needed:
```bash
# Review the destruction plan
terraform plan -destroy

# Destroy resources
terraform destroy
```
Note: For EKS deployments, ensure all services and load balancers are properly removed before destroying the infrastructure to prevent orphaned AWS resources.
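One way to catch leftovers is to find and delete any `LoadBalancer` services before running the destroy. A minimal sketch (the service name and namespace are placeholders):

```bash
# List services that provision AWS load balancers
kubectl get svc --all-namespaces | grep LoadBalancer

# Delete them so the associated load balancers are released
kubectl delete svc <service_name> -n <namespace>
```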
- If `terraform destroy` fails for EKS, check for remaining load balancers or persistent volumes
- For EC2 deployments, ensure all instances are running before attempting SSM connections
- AMD GPU instances require manual ROCm installation after deployment (see the sketch after this list)
- Review the individual module READMEs for detailed configuration options
- Check the deployment-specific security considerations
- Configure monitoring and alerting for production deployments
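For the AMD GPU note above, ROCm installation is OS- and version-specific, so follow AMD's official install guide for your distribution. A minimal Ubuntu-flavored sketch, assuming the `amdgpu-install` helper from AMD's repository is already present on the instance:

```bash
# Run on the instance, e.g. over an SSM session
sudo apt-get update
sudo apt-get install -y "linux-headers-$(uname -r)"

# amdgpu-install ships from repo.radeon.com; the exact package version varies
sudo amdgpu-install --usecase=rocm

# Verify the GPU is visible to ROCm
rocm-smi
```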
For detailed information about each module, refer to their respective README files in the `eks/` and `ec2/` directories.