This repository contains Terraform configurations for deploying vLLM on Kubernetes clusters across multiple cloud providers, including Civo, Oracle Cloud (OCI), AWS, GCP, and Azure. The setup is based on the LMCache vLLM Production Stack project.
- Civo Kubernetes ☁️
- Oracle Cloud (OCI) OKE 🏛️
- Amazon Web Services (AWS) EKS 🟠
- Google Cloud Platform (GCP) GKE 🔵
- Microsoft Azure AKS 🔷
- PagedAttention 🚀: Enables efficient memory management to handle large context sizes, reducing redundant computation and maximizing GPU utilization.
- Continuous Batching 📦: Dynamically schedules incoming requests for maximum throughput.
- Multi-GPU and Distributed Support: Seamless scaling across multiple GPUs and nodes.
- Fast Token Generation ⚡: Optimized kernel implementations speed up inference.
- Flexible Deployment: Works with Kubernetes, Docker, and cloud-based orchestration tools.
- Multi-Backend Support 🔗: Works across different cloud providers and Kubernetes environments.
- Interface Design: Production Stack and LMCache are built as an open, extensible framework, intentionally leaving room for community contributions and innovations. The interface is kept flexible to support more storage and compute devices in the future.
- vLLM Support: LMCache supports the latest vLLM versions through the KV connector interface (PR) and will continue to track and support new vLLM releases by leveraging a vLLM upstream connector.
- KV Cache Performance: LMCache provides advanced KV cache optimizations, such as efficient KV transfer and blending, which are particularly useful for long-context inference.
- Developer Friendliness: In Production Stack and LMCache, operators can program LLM serving logic directly in Python, allowing more optimizations in the long run. Production Stack is also easy to set up in 5 minutes here.
Ensure you have the following installed:
- Terraform
- kubectl
- Helm
- The CLI and credentials for your target cloud provider (Civo, OCI, AWS, GCP, or Azure)
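To sanity-check the tooling before you start, you can run a quick version check (this repo does not pin exact minimum versions):

```bash
# Confirm the CLIs used throughout this guide are on your PATH
terraform version         # provisions the Kubernetes cluster
kubectl version --client  # interacts with the cluster once it is up
helm version              # deploys the vLLM chart
```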
Each cloud provider has its own Terraform configuration. Follow these steps:
git clone https://github.com/brokedba/vllm_lab.git
cd vllm_lab
Navigate to the specific cloud provider directory and initialize Terraform:
cd terraform/<provider>
terraform init
Update the terraform.tfvars file with your credentials and configuration options.
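As a rough, hypothetical illustration (the actual variable names differ per provider, so check each directory's variables.tf), a minimal terraform.tfvars could be created like this:

```bash
# Hypothetical values only – replace with the variables defined in the
# chosen provider's variables.tf (region, node size, credentials, etc.)
cat > terraform.tfvars <<'EOF'
region       = "us-east-1"
cluster_name = "vllm-lab"
node_count   = 2
node_size    = "gpu-node"
EOF
```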
terraform apply -auto-approve
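When the apply finishes, point kubectl at the new cluster before deploying anything. How you retrieve the kubeconfig depends on the provider; the sketch below assumes the configuration exposes a kubeconfig output, which may not be the case in every directory:

```bash
# Assumes a "kubeconfig" Terraform output – otherwise use the cloud CLI
# (e.g. aws eks update-kubeconfig, gcloud container clusters get-credentials)
terraform output -raw kubeconfig > kubeconfig.yaml
export KUBECONFIG="$PWD/kubeconfig.yaml"

# Nodes should report Ready before you install the vLLM chart
kubectl get nodes
```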
Once the Kubernetes cluster is up, deploy vLLM using Helm:
helm repo add vllm https://vllm-project.github.io/helm-charts/
helm install vllm vllm/vllm --namespace vllm --create-namespace
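After the release is installed, check that the pods come up and probe the OpenAI-compatible API that vLLM serves. The service name and port below are assumptions based on the release name and may differ depending on the chart's values:

```bash
# Wait for the vLLM pods in the vllm namespace to become Ready
kubectl get pods -n vllm -w

# Forward the (assumed) vllm service locally and list the served models
kubectl port-forward -n vllm svc/vllm 8000:80 &
curl http://localhost:8000/v1/models
```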
vllm_lab/
├── terraform/
│   ├── civo/
│   ├── oci/
│   ├── aws/
│   ├── gcp/
│   └── azure/
├── helm/
│   ├── values.yaml
│   └── templates/
└── README.md
- Implement monitoring with Prometheus/Grafana 📊
- Add autoscaling configurations for vLLM 📈
- Improve security configurations 🔒
PRs and issues are welcome! 🚀