The benchmarking script accepts the following flags. All of them are exposed as environment variables in the `deploy/deployment.yaml` file, which you can configure.
  * Description: Specifies the backend model server to benchmark.
* `--file-prefix`:
  * Type: `str`
  * Default: `"benchmark"`
  * Description: Prefix for output files.
* `--endpoint`:
  * Type: `str`
  * Default: `"generate"`
  * Description: The endpoint to send requests to.
* `--host`:
  * Type: `str`
  * Default: `"localhost"`
  * Description: The host address of the server.
* `--port`:
  * Type: `int`
  * Default: `7080`
  * Description: The port number of the server.
* `--dataset`:
  * Type: `str`
  * Description: Path to the dataset. The default dataset used is ShareGPT from HuggingFace.
* `--models`:
  * Type: `str`
  * Description: Comma-separated list of models to benchmark.
* `--traffic-split`:
  * Type: parsed traffic split (comma-separated list of floats that sum to 1.0)
  * Default: `None`
  * Description: Comma-separated list of traffic split proportions for the models, e.g. `'0.9,0.1'`. The proportions must sum to 1.0.
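To illustrate the expected shape of this value, here is a minimal sketch of how such a traffic-split string could be parsed and validated. The function name and error message are illustrative, not the script's actual implementation:

```python
# Hypothetical sketch of parsing a traffic-split flag value; the real
# script's parser may differ in details.
import math


def parse_traffic_split(value: str) -> list[float]:
    """Parse a comma-separated string of floats and check they sum to 1.0."""
    proportions = [float(part) for part in value.split(",")]
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError(f"Traffic split must sum to 1.0, got {sum(proportions)}")
    return proportions


print(parse_traffic_split("0.9,0.1"))  # [0.9, 0.1]
```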
* `--stream-request`:
  * Action: `store_true`
  * Description: Whether to stream the request. Needed for the TTFT metric.
* `--request-timeout`:
  * Type: `float`
  * Default: `3.0 * 60.0 * 60.0` (3 hours)
  * Description: Timeout for each individual request.
* `--tokenizer`:
  * Type: `str`
  * Required: `True`
  * Description: Name or path of the tokenizer. You can specify a model's HuggingFace model ID to use that model's tokenizer.
* `--num-prompts`:
  * Type: `int`
  * Default: `1000`
  * Description: Number of prompts to process.
* `--max-input-length`:
  * Type: `int`
  * Default: `1024`
  * Description: Maximum number of input tokens used when filtering the benchmark dataset.
* `--max-output-length`:
  * Type: `int`
  * Default: `1024`
  * Description: Maximum number of output tokens.
* `--request-rate`:
  * Type: `float`
  * Default: `float("inf")`
  * Description: Number of requests per second. If this is `inf`, all requests are sent at time 0. Otherwise, a Poisson process is used to synthesize the request arrival times.
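For intuition on the Poisson arrival behavior: at a rate of λ requests per second, the gaps between consecutive requests are drawn from an exponential distribution with mean 1/λ. The sketch below assumes this standard construction; the names are illustrative, not the script's actual code:

```python
# Sketch of synthesizing Poisson request arrivals (illustrative, not the
# benchmark script's actual implementation).
import random


def arrival_times(num_prompts: int, request_rate: float) -> list[float]:
    """Return an arrival timestamp (in seconds) for each request."""
    if request_rate == float("inf"):
        return [0.0] * num_prompts  # burst: send everything at time 0
    t, times = 0.0, []
    for _ in range(num_prompts):
        times.append(t)
        t += random.expovariate(request_rate)  # exponential gap, mean 1/rate
    return times
```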
* `--save-json-results`:
  * Action: `store_true`
  * Description: Whether to save benchmark results to a JSON file.
* `--output-bucket`:
  * Type: `str`
  * Default: `None`
  * Description: Specifies the Google Cloud Storage bucket to which JSON-format results will be uploaded. If not provided, no upload will occur.
* `--output-bucket-filepath`:
  * Type: `str`
  * Default: `None`
  * Description: Specifies the destination path within the bucket provided by `--output-bucket` for uploading the JSON results. This argument requires `--output-bucket` to be set. If not specified, results are uploaded to the root of the bucket. If the filepath doesn't exist, it will be created for you.
* `--additional-metadata-metrics-to-save`:
  * Type: `str`
  * Description: Additional metadata about the workload. Should be a dictionary in the form of a string.
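As an example of a "dictionary in the form of a string": a JSON object passed as a single string value works for this shape. Whether the script decodes it with `json.loads` or another mechanism is an assumption here, and the keys shown are made up:

```python
# Illustrative only: a dict-as-string metadata value and one plausible way
# to decode it. The keys and the use of json.loads are assumptions.
import json

metadata_arg = '{"gpu_type": "a100", "replicas": 2}'  # string passed to the flag
metadata = json.loads(metadata_arg)
print(metadata["gpu_type"])  # a100
```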
* `--scrape-server-metrics`:
  * Action: `store_true`
  * Description: Whether to scrape server metrics.
* `--pm-namespace`:
  * Type: `str`
  * Default: `default`
  * Description: Namespace of the PodMonitoring object. Ignored if `--scrape-server-metrics` is not set.
* `--pm-job`:
  * Type: `str`
  * Default: `vllm-podmonitoring`
  * Description: Name of the PodMonitoring object. Ignored if `--scrape-server-metrics` is not set.
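Putting a few of these flags together, a run might look like the following. This is a sketch only: the script filename and the model/tokenizer values are assumptions, not taken from this document.

```shell
# Illustrative invocation; script name and values are assumptions.
python benchmark_serving.py \
  --host localhost \
  --port 7080 \
  --tokenizer meta-llama/Llama-2-7b-hf \
  --models meta-llama/Llama-2-7b-hf \
  --num-prompts 1000 \
  --request-rate 5.0 \
  --stream-request \
  --save-json-results
```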