
Commit 0d5e02c

Add benchmark automation tool
1 parent 12bcc9a commit 0d5e02c


62 files changed: +4643 -28 lines

config/manifests/benchmark/model-server-service.yaml

-12
This file was deleted.

site-src/performance/benchmark/index.md

+17-15
@@ -5,30 +5,26 @@ inference extension, and a Kubernetes service as the load balancing strategy. Th
 benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
 tool to generate load and collect results.
 
-## Prerequisites
+## Run benchmarks manually
 
-### Deploy the inference extension and sample model server
+### Prerequisite: have an endpoint ready to serve inference traffic
 
-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+To serve via a Gateway using the inference extension, follow this [user guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/)
+to deploy the sample vLLM application, and the inference extension.
 
-### [Optional] Scale the sample vLLM deployment
-
-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision. So consider scaling the sample application with more replicas:
 
 ```bash
 kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
 ```
 
-### Expose the model server via a k8s service
-
-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To serve via a Kubernetes LoadBalancer service as a baseline comparison, you can expose the sample application:
 
 ```bash
 kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
 ```
 
-## Run benchmark
+### Run benchmark
 
 The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.

@@ -60,18 +56,24 @@ to specify what this benchmark is for. For instance, `inference-extension` or `k
 the script below will watch for that log line and then start downloading results.
 
 ```bash
-benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
 ```
 
 1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder.
 
-### Tips
+#### Tips
 
-* You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
+* You can specify the `run_id="runX"` environment variable when running the `download-benchmark-results.bash` script.
 This is useful when you run benchmarks multiple times to get a more statistically meaningful results and group the results accordingly.
 * Update the `request_rates` that best suit your benchmark environment.
 
-### Advanced Benchmark Configurations
+## Run benchmarks automatically
+
+The [benchmark automation tool](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark) enables defining benchmarks via a config file and running the benchmarks
+automatically. It's currently experimental. To try it, refer to its [user guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark).
+
+
+## Advanced Benchmark Configurations
 
 Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
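
For repeated runs against the same target, the `run_id` and `benchmark_id` variables mentioned in the tips above can be combined when downloading results; a minimal sketch:

```bash
# Group a second run of the same benchmark under run_id "run2".
run_id='run2' benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
```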

tools/benchmark/.gitignore

+1
@@ -0,0 +1 @@
output/

tools/benchmark/README.md

+194-1
@@ -1 +1,194 @@
This folder contains resources to run performance benchmarks. Please follow the benchmark guide here: https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark.

## Features

1. **Config driven benchmarks**. Use the `./proto/benchmark.proto` API to write benchmark configurations, without the need to craft complex YAMLs.
2. **Reproducibility**. The tool snapshots all the manifests needed for the benchmark run and marks them immutable (unless the user explicitly overrides this).
3. **Benchmark inheritance**. Extend an existing benchmark configuration by overriding a subset of parameters, instead of rewriting everything from scratch.
4. **Benchmark orchestration**. The tool automatically deploys the benchmark environment into a cluster, waits to collect results, and then tears down the environment. The tool deploys the benchmark resources in new namespaces so each benchmark runs independently.
5. **Auto-generated request rates**. The tool can automatically generate request rates for known models and accelerators to cover a wide range of model server load, from low latency to fully saturated throughput.
6. **Visualization tools**. The results can be analyzed with a Jupyter notebook.
7. **Model server metrics**. The tool uses the latency profile generator benchmark tool to scrape metrics from Google Cloud Monitoring. It also provides a link to a Google Cloud Monitoring dashboard for detailed analysis.

### Future Improvements

1. The benchmark config and results are stored in protobuf format. The results can be persisted in a database such as Google Cloud Spanner to allow complex query and dashboarding use cases.
2. Support running benchmarks in parallel with user-configured parallelism.

## Prerequisites

1. [Install Helm](https://helm.sh/docs/intro/quickstart/#install-helm)
2. Install the InferenceModel and InferencePool [CRDs](https://gateway-api-inference-extension.sigs.k8s.io/guides/#install-the-inference-extension-crds)
3. [Enable the Envoy patch policy](https://gateway-api-inference-extension.sigs.k8s.io/guides/#update-envoy-gateway-config-to-enable-patch-policy).
4. Install the [RBACs](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/12bcc9a85dad828b146758ad34a69053dca44fa9/config/manifests/inferencepool.yaml#L78) for EPP to read pods.
5. Create a secret in the default namespace containing the HuggingFace token.

    ```bash
    kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face token with access to Llama 2
    ```

6. [Optional, GCP only] Create a `gmp-test-sa` service account with the `monitoring.viewer` role to read additional model server metrics from Cloud Monitoring.

    ```bash
    gcloud iam service-accounts create gmp-test-sa && \
    gcloud projects add-iam-policy-binding ${BENCHMARK_PROJECT} \
      --member=serviceAccount:gmp-test-sa@${BENCHMARK_PROJECT}.iam.gserviceaccount.com \
      --role=roles/monitoring.viewer
    ```

## Get started

Run all existing benchmarks:

```bash
# Run all benchmarks in the ./catalog/benchmark folder
./scripts/run_all_benchmarks.bash
```

View the benchmark results:

* To view raw results, watch for a new results folder to be created: `./output/{run_id}/`.
* To visualize the results, use the Jupyter notebook.

## Common usage

### Run all benchmarks in a particular benchmark config file and upload results to GCS

```bash
gcs_bucket='my-bucket' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

### Generate benchmark manifests only

```bash
benchmarks=benchmarks ./scripts/generate_manifests.bash
```

### Run particular benchmarks in a benchmark config file, by matching a benchmark name regex

```bash
# Run all benchmarks with Nvidia H100
gcs_bucket='my-bucket' benchmarks=benchmarks benchmark_name_regex='.*h100.*' ./scripts/run_benchmarks_file.bash
```

### Resume a benchmark run from an existing run_id

You may resume benchmarks from previously generated manifests. The tool will skip benchmarks that already have a `results` folder, and continue those without results.

```bash
run_id='existing-run-id' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

### Keep the benchmark environment after the benchmark is complete (for debugging)

```bash
skip_tear_down='true' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

## Command references

```bash
# All available environment variables.
regex='my-benchmark-file-name-regex' dry_run='false' gcs_bucket='my-bucket' skip_tear_down='false' benchmark_name_regex='my-benchmark-name-regex' ./scripts/run_all_benchmarks.bash
```

```bash
# All available environment variables.
run_id='existing-run-id' dry_run='false' gcs_bucket='my-bucket' skip_tear_down='false' benchmarks=benchmarks benchmark_name_regex='my-benchmark-name-regex' ./scripts/run_benchmarks_file.bash
```

```bash
# All available environment variables.
run_id='existing-run-id' benchmarks=benchmarks ./scripts/generate_manifests.bash
```

## How does it work?

The tool automates the following steps:

1. Reads the benchmark config file in `./catalog/{benchmarks_config_file}`. The file contains a list of benchmarks. The config API is defined in `./proto/benchmark.proto`.
2. Generates a new run_id and a namespace `{benchmark_name}-{run_id}` to run the benchmarks. If the `run_id` environment variable is provided, it is reused instead of creating a new one. This is useful when resuming a previous benchmark run, or when running multiple sets of benchmarks in parallel (e.g., running benchmarks on different accelerator types in parallel using the same run_id).
3. Based on the config, generates manifests in `./output/{run_id}/{benchmark_name}-{run_id}/manifests`.
4. Applies the manifests to the cluster, and waits for resources to be ready.
5. Once the benchmark finishes, downloads benchmark results to `./output/{run_id}/{benchmark}-{run_id}/results`.
6. [Optional] If a GCS bucket is specified, uploads the output folder to the GCS bucket.
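
Putting the paths from these steps together, a completed run leaves an output layout roughly like the following (a sketch, using the placeholders from the steps above):

```bash
# Sketch of the expected layout after a completed run (placeholders as in the steps above).
ls ./output/{run_id}/{benchmark_name}-{run_id}/
# manifests/   generated in step 3 and applied in step 4
# results/     downloaded in step 5
```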

## Create a new benchmark

You can either add new benchmarks to an existing benchmark config file, or create new benchmark config files. Each benchmark config file contains a list of benchmarks.

An example benchmark with all available parameters is as follows:

```
benchmarks {
  name: "base-benchmark"
  config {
    model_server {
      image: "vllm/vllm-openai@sha256:8672d9356d4f4474695fd69ef56531d9e482517da3b31feb9c975689332a4fb0"
      accelerator: "nvidia-h100-80gb"
      replicas: 1
      vllm {
        tensor_parallelism: "1"
        model: "meta-llama/Llama-2-7b-hf"
      }
    }
    load_balancer {
      gateway {
        envoy {
          epp {
            image: "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.1.0"
          }
        }
      }
    }
    benchmark_tool {
      image: "us-docker.pkg.dev/gke-inference-gateway-dev/benchmark/benchmark-tool@sha256:1fe4991ec1e9379b261a62631e1321b8ea15772a6d9a74357932771cea7b0500"
      lpg {
        dataset: "sharegpt_v3_unfiltered_cleaned_split"
        models: "meta-llama/Llama-2-7b-hf"
        ip: "to-be-populated-automatically"
        port: "8081"
        benchmark_time_seconds: "60"
        output_length: "1024"
      }
    }
  }
}
```

### Create a benchmark from a base benchmark

It's recommended to create a benchmark from an existing benchmark by overriding a few parameters. This inheritance feature is powerful for creating a large number of benchmarks conveniently. Below is an example that overrides the replica count of a base benchmark:

```
benchmarks {
  name: "new-benchmark"
  base_benchmark_name: "base-benchmark"
  config {
    model_server {
      replicas: 2
    }
  }
}
```

## Environment configurations

The tool has default configurations (such as the cluster name) in `./scripts/env.sh`. You can tweak those for your own needs.
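
A minimal sketch of such a tweak, assuming a hypothetical `cluster_name` variable (check `./scripts/env.sh` for the actual variable names):

```bash
# `cluster_name` is a hypothetical example; open ./scripts/env.sh to see the real defaults.
sed -i "s/^cluster_name=.*/cluster_name='my-benchmark-cluster'/" ./scripts/env.sh
```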

## The benchmark.proto

The `./proto/benchmark.proto` is the core of this tool. It drives the generation of the benchmark manifests, as well as the querying and dashboarding of the results.

Why do we need it?

* It is an API that clearly captures the intent, instead of making various assumptions.
* It lets the user focus only on the core parameters of the benchmark itself, rather than the toil of configuring the environment and crafting the manifests.
* It is the single source of truth that drives the entire lifecycle of the benchmark, including post analysis.

## Contribute

Refer to the [dev guide](./dev.md).
+66
@@ -0,0 +1,66 @@
# proto file: proto/benchmark.proto
# proto message: Benchmarks

benchmarks {
  name: "r8-svc-vllmv1"
  config {
    model_server {
      image: "vllm/vllm-openai:v0.8.1"
      accelerator: "nvidia-h100-80gb"
      replicas: 8
      vllm {
        tensor_parallelism: "1"
        model: "meta-llama/Llama-2-7b-hf"
        v1: "1"
      }
    }
    load_balancer {
      k8s_service {}
    }
    benchmark_tool {
      # The following image was built from this source https://github.com/AI-Hypercomputer/inference-benchmark/tree/07628c9fe01b748f5a4cc9e5c2ee4234aaf47699
      image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
      lpg {
        dataset: "sharegpt_v3_unfiltered_cleaned_split"
        models: "meta-llama/Llama-2-7b-hf"
        tokenizer: "meta-llama/Llama-2-7b-hf"
        ip: "to-be-populated-automatically"
        port: "8081"
        benchmark_time_seconds: "100"
        output_length: "2048"
      }
    }
  }
}

benchmarks {
  name: "r8-epp-vllmv1"
  base_benchmark_name: "r8-svc-vllmv1"
  config {
    load_balancer {
      gateway {
        envoy {
          epp {
            image: "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main"
            refresh_metrics_interval: "50ms"
          }
        }
        full_duplex_streaming_enabled: true
      }
    }
  }
}

benchmarks {
  name: "r8-epp-no-streaming-vllmv1"
  base_benchmark_name: "r8-epp-vllmv1"
  config {
    load_balancer {
      gateway {
        full_duplex_streaming_enabled: false
      }
    }
  }
}
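
The exact path of the config file above is not shown in this view. Assuming it is saved under `./catalog/` and passed via the `benchmarks` variable, the two EPP variants could be run selectively with the name regex filter; a minimal sketch:

```bash
# Assumes the config above is a file under ./catalog/ referenced by the `benchmarks` variable.
benchmarks=benchmarks benchmark_name_regex='r8-epp.*' ./scripts/run_benchmarks_file.bash
```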
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
@@ -0,0 +1,24 @@
apiVersion: v2
name: BenchmarkTool
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"
