---
title: Serverless
weight: 4
---

## Overview

This guide presents configuration patterns for serverless environments on Kubernetes, focusing on integrating Prometheus monitoring with KEDA autoscaling. The architecture delivers resource efficiency through event-driven scaling, including scale-to-zero, while maintaining observability and resilience for AI/ML workloads and other latency-sensitive applications.

## Concepts

### Prometheus Configuration

Prometheus is used for monitoring and alerting. To enable cross-namespace ServiceMonitor discovery, configure the `namespaceSelector` in the ServiceMonitor, and set the `serviceMonitorSelector` in the Prometheus resource so it selects the matching ServiceMonitors.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: qwen2-0--5b-lb-monitor
  namespace: llmaz-system
  labels:
    control-plane: controller-manager
    app.kubernetes.io/name: servicemonitor
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      llmaz.io/model-name: qwen2-0--5b
  endpoints:
    - port: http
      path: /metrics
      scheme: http
```

- Ensure the `namespaceSelector` allows cross-namespace monitoring.
- Label your services so they can be discovered by Prometheus.

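On the Prometheus side, the `serviceMonitorSelector` determines which ServiceMonitors the server picks up. A minimal sketch of a Prometheus resource, assuming the `control-plane: controller-manager` label used above (the resource name and `serviceAccountName` are illustrative; match them to your deployment):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: llmaz-system
spec:
  # Select ServiceMonitors carrying this label; an empty selector ({})
  # would instead select all ServiceMonitors in matching namespaces.
  serviceMonitorSelector:
    matchLabels:
      control-plane: controller-manager
  serviceAccountName: prometheus
```
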
### KEDA Configuration

KEDA (Kubernetes Event-driven Autoscaling) scales applications based on custom metrics. Integrated with Prometheus, it can trigger scaling actions from metric query results.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen2-0--5b-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: inference.llmaz.io/v1alpha1
    kind: Playground
    name: qwen2-0--5b
  pollingInterval: 30
  cooldownPeriod: 50
  minReplicaCount: 0
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.llmaz-system.svc.cluster.local:9090
        metricName: llamacpp:requests_processing
        query: sum(llamacpp:requests_processing)
        threshold: "0.2"
```

- Ensure the `serverAddress` correctly points to the Prometheus service.
- Tune `pollingInterval` and `cooldownPeriod` to optimize scaling behavior and prevent conflicts with other scaling mechanisms acting on the same workload.

### Integration with Activator

Consider integrating the serverless configuration with an activator for scale-from-zero scenarios. The activator can be implemented using a controller pattern or as a standalone goroutine.

Key architecture components:
- Request interception: capture incoming requests to scaled-to-zero services
- Pre-scale trigger: initiate scale-up before forwarding requests
- Request buffering: queue requests during the cold start period
- Event-driven scaling: integrate with KEDA using CloudEvents

### Controller Runtime Framework

The Controller Runtime framework simplifies the development of Kubernetes controllers by providing abstractions for managing resources and handling events.

#### Key Components

1. **Controller**: Monitors resource states and triggers actions to align actual and desired states.
2. **Reconcile Function**: Contains the core logic for transitioning resource states.
3. **Manager**: Manages the lifecycle of controllers and shared resources.
4. **Client**: Interface for interacting with the Kubernetes API.
5. **Scheme**: Registry for resource types.
6. **Event Source and Handler**: Define event sources and handling logic.

## Quick Start Guide

1. Install llmaz, then install KEDA and Prometheus, following the official documentation [Install Guide](https://llmaz.inftyai.com/docs/getting-started/installation/).

```bash
helm install llmaz oci://registry-1.docker.io/inftyai/llmaz --namespace llmaz-system --create-namespace --version 0.0.10
make install-keda
make install-prometheus
```

2. Create a ServiceMonitor so Prometheus can discover your services.

```bash
kubectl apply -f service-monitor.yaml
```

3. Create a ScaledObject for KEDA to manage scaling.

```bash
kubectl apply -f scaled-object.yaml
```

4. Send a test request to a scaled-to-zero application to exercise the cold start path.

```bash
kubectl exec -it -n kube-system deploy/activator -- wget -O- qwen2-0--5b-lb.default.svc:8080
```

5. Use the Prometheus and KEDA dashboards to monitor metrics and scaling activity.

```bash
kubectl port-forward services/prometheus-operated 9090:9090 --address 0.0.0.0 -n llmaz-system
```

## Benchmark

Cold start latency is a critical metric for user experience in llmaz serverless environments. To assess performance stability and efficiency, we measured latency under different instance scaling scenarios:

| Scaling Pattern | Avg. Latency (s) | P90 Latency (s) | Resource Initialization | Optimization Potential |
|-----------------|------------------|-----------------|-------------------------|------------------------|
| **0 -> 1** | 29 | 31 | Full pod creation<br>Image pull<br>Engine initialization | Pre-fetching<br>Snapshot restore |
| **1 -> 2** | 15 | 16 | Partial image reuse<br>Network reuse<br>Pod creation | Warm pool<br>Priority scheduling |
| **2 -> 3** | 11 | 12 | Cached dependencies<br>Parallel scheduling<br>Shared resources | Predictive scaling<br>Node affinity |

## Conclusion

This guide offers a detailed approach to setting up a serverless environment with Kubernetes, Prometheus, and KEDA. By following these guidelines, you can ensure efficient scaling and monitoring of your applications.