diff --git a/crd-ref-docs.yaml b/crd-ref-docs.yaml
index 5fefbffd6..3c3cf431b 100644
--- a/crd-ref-docs.yaml
+++ b/crd-ref-docs.yaml
@@ -4,7 +4,7 @@ processor:
   ignoreTypes:
-    - "(InferencePool|InferenceObjective|InferencePoolImport)List$"
+    - "(InferencePool|InferenceObjective|InferencePoolImport|InferenceModelRewrite)List$"
   # RE2 regular expressions describing type fields that should be excluded from the generated documentation.
   ignoreFields:
     - "TypeMeta$"
diff --git a/mkdocs.yml b/mkdocs.yml
index 7eee725f2..9d2eb2c10 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -98,6 +98,7 @@ nav:
     - InferencePool: api-types/inferencepool.md
     - InferenceObjective: api-types/inferenceobjective.md
     - InferencePoolImport: api-types/inferencepoolimport.md
+    - InferenceModelRewrite: api-types/inferencemodelrewrite.md
   - Enhancements:
     - Overview: enhancements/overview.md
   - Contributing:
diff --git a/site-src/api-types/inferencemodelrewrite.md b/site-src/api-types/inferencemodelrewrite.md
new file mode 100644
index 000000000..b8e943e17
--- /dev/null
+++ b/site-src/api-types/inferencemodelrewrite.md
@@ -0,0 +1,95 @@
+# Inference Model Rewrite
+
+??? example "Alpha since v1.2.1"
+
+    The `InferenceModelRewrite` resource is alpha and may have breaking changes in
+    future releases of the API.
+
+## Background
+
+The **InferenceModelRewrite** resource allows platform administrators and model owners to control how inference requests are routed to specific models within an Inference Pool.
+This capability is essential for managing model lifecycles without disrupting client applications.
+
+## Use Cases
+
+* **Model Aliasing**: Map a model name in the request body (e.g., `food-review`) to a specific version (e.g., `food-review-v1`).
+* **Generic Fallbacks**: Redirect requests for unknown models to a default model.
+* **Traffic Splitting**: Gradually roll out new model versions (canary deployment) by splitting traffic between two models based on percentage weights.
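The traffic-splitting use case above amounts to weighted-random selection: each target is chosen with probability `weight / (sum of weights)`. The sketch below illustrates the semantics only; the `pick_target` helper and the pair representation are ours, not the actual implementation:

```python
import random

def pick_target(targets):
    """Choose a target with probability weight / (sum of weights).

    `targets` mirrors the `targets` list of an InferenceModelRewrite
    rule, given here as (modelRewrite, weight) pairs.
    """
    total = sum(weight for _, weight in targets)
    threshold = random.uniform(0, total)
    cumulative = 0
    for model, weight in targets:
        cumulative += weight
        if threshold <= cumulative:
            return model
    return targets[-1][0]  # guard against floating-point edge cases

# A 90/10 canary split between two adapter versions.
split = [("food-review-v1", 90), ("food-review-v2", 10)]
counts = {"food-review-v1": 0, "food-review-v2": 0}
for _ in range(10_000):
    counts[pick_target(split)] += 1
print(counts)  # roughly {'food-review-v1': 9000, 'food-review-v2': 1000}
```

Because the draw is random per request, small samples drift from the exact ratio; only over many requests do the counts converge on the configured weights.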
+
+## Spec
+
+The full spec of the InferenceModelRewrite is defined [here](/reference/x-v1a2-spec/#inferencemodelrewrite).
+
+## Usage Examples
+
+### Model Aliasing
+
+Map a virtual model name (e.g., `food-review`) to a specific backend model version (e.g., `food-review-v1`).
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModelRewrite
+metadata:
+  name: food-review-alias
+spec:
+  poolRef:
+    group: inference.networking.k8s.io
+    name: vllm-llama3-8b-instruct
+  rules:
+  - matches:
+    - model:
+        type: Exact
+        value: food-review
+    targets:
+    - modelRewrite: "food-review-v1"
+```
+
+### Generic (Wildcard) Rewrites
+
+Redirect any request with an unrecognized or unspecified model name to a default safe model. An empty `matches` list implies that the rule applies to **all** requests not matched by previous rules.
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModelRewrite
+metadata:
+  name: generic-fallback
+spec:
+  poolRef:
+    group: inference.networking.k8s.io
+    name: vllm-llama3-8b-instruct
+  rules:
+  - matches: [] # Empty means this rule matches everything
+    targets:
+    - modelRewrite: "meta-llama/Llama-3.1-8B-Instruct"
+```
+
+### Traffic Splitting (Canary Rollout)
+
+Divide incoming traffic for a single model name across multiple backend models. This is useful for A/B testing or gradual rollouts.
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModelRewrite
+metadata:
+  name: food-review-canary
+spec:
+  poolRef:
+    group: inference.networking.k8s.io
+    name: vllm-llama3-8b-instruct
+  rules:
+  - matches:
+    - model:
+        type: Exact
+        value: food-review
+    targets:
+    - modelRewrite: "food-review-v1"
+      weight: 90
+    - modelRewrite: "food-review-v2"
+      weight: 10
+```
+
+## Limitations
+
+1. **Status Reporting**: Currently, `InferenceModelRewrite` is a configuration-only resource. It does not report status conditions (e.g., Valid or Ready) in the CRD status field.
+2. **Scheduler Assumptions**: Traffic splitting occurs before the scheduling algorithm. The system assumes that all model servers within the referenced `InferencePool` are capable of serving the target models. If a model is missing from a specific server in the pool, requests routed to it may fail.
+3. **Splitting algorithm**: The current traffic split is weighted-random.
\ No newline at end of file
diff --git a/site-src/guides/adapter-rollout.md b/site-src/guides/adapter-rollout.md
index 7d6611c92..e3381e125 100644
--- a/site-src/guides/adapter-rollout.md
+++ b/site-src/guides/adapter-rollout.md
@@ -1,4 +1,4 @@
-# Lora Adapter Rollout
+# LoRA Adapter Rollout
 
 The goal of this guide is to show you how to perform incremental roll out operations,
 which gradually deploy new versions of your inference infrastructure.
@@ -8,24 +8,16 @@ LoRA adapter rollouts let you deploy new versions of LoRA adapters in phases,
 without altering the underlying base model or infrastructure.
 Use LoRA adapter rollouts to test improvements, bug fixes, or new features in your LoRA adapters.
 
-## Example
+The [`InferenceModelRewrite`](/api-types/inferencemodelrewrite) resource allows platform administrators and model owners to control how inference requests are routed to specific models within an Inference Pool.
+This capability is essential for managing model lifecycles without disrupting client applications.
 
-### Prerequisites
-Follow the steps in the [main guide](index.md)
+## Prerequisites & Setup
 
-### Load the new adapter version to the model servers
+Follow [getting-started](https://gateway-api-inference-extension.sigs.k8s.io/guides/getting-started-latest/#getting-started-with-an-inference-gateway) to set up the IGW stack.
 
-This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap.
+In this guide, we modify the LoRA adapters ConfigMap to have two food-review models to better illustrate the gradual rollout scenario.
-
-Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version.
-
-
-```bash
-kubectl edit configmap vllm-llama3-8b-instruct-adapters
-```
-
-Change the ConfigMap to match the following (note the new entry under models):
+The ConfigMap used in this guide is as follows:
 
 ```yaml
 apiVersion: v1
@@ -40,34 +32,189 @@ data:
   defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
   ensureExist:
     models:
-    - id: food-review-1
+    - id: food-review-v1
      source: Kawon/llama3.1-food-finetune_v14_r8
-    - id: food-review-2
+    - id: food-review-v2
      source: Kawon/llama3.1-food-finetune_v14_r8
 ```
 
-The new adapter version is applied to the model servers live, without requiring a restart.
+**Verify Available Models**: You can query the `/v1/models` endpoint to confirm the adapters are loaded:
+
+```bash
+curl http://${IP}/v1/models | jq .
+```
+
+## Step 1: Establish Baseline (Alias v1)
+
+First, we establish a stable baseline where all requests for `food-review` are served by the existing version, `food-review-v1`. This decouples the client's request (for "food-review") from the specific version running on the backend.
+
+### Scenario
 
-Try it out:
+A client requests the model `food-review`. We want to ensure this maps strictly to `food-review-v1`.
+
+### InferenceModelRewrite
+
+Apply the following `InferenceModelRewrite` CR to map `food-review` → `food-review-v1`:
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModelRewrite
+metadata:
+  name: food-review-rewrite
+spec:
+  poolRef:
+    group: inference.networking.k8s.io
+    name: vllm-llama3-8b-instruct
+  rules:
+  - matches:
+    - model:
+        type: Exact
+        value: food-review
+    targets:
+    - modelRewrite: "food-review-v1"
+```
+
+### Result
+
+When a client requests `"model": "food-review"`, the system serves the request using `food-review-v1`.
 
-1. Get the gateway IP:
-```bash
-IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
-```
+```bash
+curl http://${IP}/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+"model": "food-review",
+"messages": [
+  {
+    "role": "user",
+    "content": "Give me a spicy food challenge list."
+  }
+],
+"max_completion_tokens": 10
+}' | jq .
+```
 
-2. Send a few requests as follows:
+Response:
+```json
+{
+  "choices": [
+    {
+      "finish_reason": "length",
+      "index": 0,
+      "logprobs": null,
+      "message": {
+        "content": "Here's a list of spicy foods that can help",
+        "reasoning_content": null,
+        "role": "assistant",
+        "tool_calls": []
+      },
+      "stop_reason": null
+    }
+  ],
+  "created": 1764786158,
+  "id": "chatcmpl-b10d939f-39bc-41ba-85c0-fe9b9d1ed3d9",
+  "model": "food-review-v1",
+  "object": "chat.completion",
+  "prompt_logprobs": null,
+  "usage": {
+    "completion_tokens": 10,
+    "prompt_tokens": 43,
+    "prompt_tokens_details": null,
+    "total_tokens": 53
+  }
+}
+```
+
+## Step 2: Gradual Rollout
+
+Now that `food-review-v2` is loaded (from the Prerequisites step), we can begin splitting traffic. Traffic splitting allows you to divide incoming traffic for a single model name across multiple backend models. This is critical for A/B testing or gradual updates.
+
+### Scenario: 90/10 Split
+
+You want to direct 90% of `food-review` traffic to the stable `food-review-v1` and 10% to the new `food-review-v2`.
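A note on expectations: because the split is weighted-random, a small sample will not land exactly on the configured ratio. A quick back-of-the-envelope check, modeling each request as an independent draw (our simplifying assumption for illustration):

```python
import math

# Model each request as an independent draw with a 10% chance of
# hitting food-review-v2 (the configured 90/10 split).
n = 20          # requests sent by the test script below
p_v2 = 0.10     # probability a single request hits v2

mean = n * p_v2                        # expected v2 hits
sd = math.sqrt(n * p_v2 * (1 - p_v2))  # standard deviation

print(f"expected v2 hits: {mean:.0f} +/- {sd:.1f}")
# prints: expected v2 hits: 2 +/- 1.3
```

So an observed 17/3 split over 20 requests is within one standard deviation of the configured 90/10 ratio.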
+
+#### InferenceModelRewrite
+
+Update the existing `InferenceModelRewrite`:
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModelRewrite
+metadata:
+  name: food-review-rewrite
+spec:
+  poolRef:
+    group: inference.networking.k8s.io
+    name: vllm-llama3-8b-instruct
+  rules:
+  - matches:
+    - model:
+        type: Exact
+        value: food-review
+    targets:
+    - modelRewrite: "food-review-v1"
+      weight: 90
+    - modelRewrite: "food-review-v2"
+      weight: 10
+```
+
+#### Result
+
+```bash
+❯ ./test-traffic-splitting.sh
+------------------------------------------------
+Traffic Split Results, total requests: 20
+food-review-v1: 17 requests
+food-review-v2: 3 requests
+```
+
+### Scenario: 50/50 Split
+
+To increase traffic to the new model, simply adjust the weights.
+
+```yaml
+    targets:
+    - modelRewrite: "food-review-v1"
+      weight: 50
+    - modelRewrite: "food-review-v2"
+      weight: 50
+```
+
+#### Result
+
+```bash
+❯ ./test-traffic-splitting.sh
+------------------------------------------------
+Traffic Split Results, total requests: 20
+food-review-v1: 10 requests
+food-review-v2: 10 requests
+```
+
+### Scenario: 100% Cutover (Finalizing Rollout)
+
+Once the new model is verified, shift all traffic to it.
+
+```yaml
+    targets:
+    - modelRewrite: "food-review-v2"
+      weight: 100
+```
+
+#### Result
+
 ```bash
-curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-"model": "food-review-2",
-"prompt": "Write as if you were a critic: San Francisco",
-"max_tokens": 100,
-"temperature": 0
-}'
+❯ ./test-traffic-splitting.sh
+------------------------------------------------
+Traffic Split Results, total requests: 20
+food-review-v1: 0 requests
+food-review-v2: 20 requests
 ```
 
-### Finish the rollout
+## Step 3: Cleanup
+
+Now that 100% of traffic is routed to `food-review-v2`, you can safely unload the older version from the servers.
 
-Unload the older versions from the servers by updating the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list:
+Update the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list:
 
 ```yaml
 apiVersion: v1
@@ -82,13 +229,63 @@ data:
   defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
   ensureExist:
     models:
-    - id: food-review-2
+    - id: food-review-v2
      source: Kawon/llama3.1-food-finetune_v14_r8
   ensureNotExist:
     models:
-    - id: food-review-1
+    - id: food-review-v1
      source: Kawon/llama3.1-food-finetune_v14_r8
 ```
 
-With this, the new adapter version should be available for all incoming requests.
+With this, the old adapter is removed, and the rollout is complete.
+
+## Appendix
+
+### `./test-traffic-splitting.sh`
+
+```bash
+#!/bin/bash
+
+# --- Configuration ---
+# Replace this with your actual IP address or hostname
+target_ip="${IP}"
+# How many requests you want to send
+total_requests=20
+
+# Initialize counters
+count_v1=0
+count_v2=0
+
+echo "Starting $total_requests requests to http://$target_ip..."
+echo "------------------------------------------------"
+
+for ((i=1; i<=total_requests; i++)); do
+  # 1. Send the request
+  # jq -r '.model': Extracts the raw string of the model name
+  model_name=$(curl -s "http://${target_ip}/v1/chat/completions" \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "food-review",
+      "messages": [{"role": "user", "content": "test"}],
+      "max_completion_tokens": 1
+    }' | jq -r '.model')
+
+  # 2. Check the response and update counters
+  if [[ "$model_name" == "food-review-v1" ]]; then
+    ((count_v1++))
+    echo "Request $i: Hit food-review-v1"
+  elif [[ "$model_name" == "food-review-v2" ]]; then
+    ((count_v2++))
+    echo "Request $i: Hit food-review-v2"
+  else
+    echo "Request $i: Received unexpected model: $model_name"
+  fi
+done
+
+# 3. Print the final report
+echo "------------------------------------------------"
+echo "Traffic Split Results, total requests: $total_requests"
+echo "food-review-v1: $count_v1 requests"
+echo "food-review-v2: $count_v2 requests"
+```
\ No newline at end of file
diff --git a/site-src/reference/x-v1a2-spec.md b/site-src/reference/x-v1a2-spec.md
index c1a57ce3f..e5aad86d7 100644
--- a/site-src/reference/x-v1a2-spec.md
+++ b/site-src/reference/x-v1a2-spec.md
@@ -11,6 +11,7 @@ inference.networking.x-k8s.io API group.
 
 ### Resource Types
 
+- [InferenceModelRewrite](#inferencemodelrewrite)
 - [InferenceObjective](#inferenceobjective)
 - [InferencePool](#inferencepool)
 
@@ -86,6 +87,82 @@ _Appears in:_
 
 
+#### InferenceModelRewrite
+
+
+
+InferenceModelRewrite is the Schema for the InferenceModelRewrite API.
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha2` | | |
+| `kind` _string_ | `InferenceModelRewrite` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
+| `spec` _[InferenceModelRewriteSpec](#inferencemodelrewritespec)_ | | | |
+| `status` _[InferenceModelRewriteStatus](#inferencemodelrewritestatus)_ | | | |
+
+
+
+
+#### InferenceModelRewriteRule
+
+
+
+InferenceModelRewriteRule defines the match criteria and corresponding action.
+For details on how precedence is determined across multiple rules and
+InferenceModelRewrite resources, see the "Precedence and Conflict Resolution"
+section in InferenceModelRewriteSpec.
+
+
+
+_Appears in:_
+- [InferenceModelRewriteSpec](#inferencemodelrewritespec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `matches` _[Match](#match) array_ | | | |
+| `targets` _[TargetModel](#targetmodel) array_ | | | MinItems: 1 <br /> |
+
+
+#### InferenceModelRewriteSpec
+
+
+
+InferenceModelRewriteSpec defines the desired state of InferenceModelRewrite.
+
+
+
+_Appears in:_
+- [InferenceModelRewrite](#inferencemodelrewrite)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `poolRef` _[PoolObjectReference](#poolobjectreference)_ | PoolRef is a reference to the inference pool. | | Required: \{\} <br /> |
+| `rules` _[InferenceModelRewriteRule](#inferencemodelrewriterule) array_ | | | |
+
+
+#### InferenceModelRewriteStatus
+
+
+
+InferenceModelRewriteStatus defines the observed state of InferenceModelRewrite.
+
+
+
+_Appears in:_
+- [InferenceModelRewrite](#inferencemodelrewrite)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferenceModelRewrite. <br /> Known condition types are: <br /> * "Accepted" | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Accepted]] | MaxItems: 8 <br /> |
 
 
 #### InferenceObjective
 
 
@@ -293,6 +370,56 @@ _Appears in:_
 
 
+#### Match
+
+
+
+Match defines the criteria for matching the LLM requests.
+
+
+
+_Appears in:_
+- [InferenceModelRewriteRule](#inferencemodelrewriterule)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `model` _[ModelMatch](#modelmatch)_ | Model specifies the criteria for matching the 'model' field <br /> within the JSON request body. | | |
+
+
+#### MatchValidationType
+
+_Underlying type:_ _string_
+
+MatchValidationType specifies the type of string matching to use.
+
+_Validation:_
+- Enum: [Exact]
+
+_Appears in:_
+- [ModelMatch](#modelmatch)
+
+| Field | Description |
+| --- | --- |
+| `Exact` | MatchExact indicates that the model name must match exactly. <br /> |
+
+
+#### ModelMatch
+
+
+
+ModelMatch defines how to match against the model name in the request body.
+
+
+
+_Appears in:_
+- [Match](#match)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `type` _[MatchValidationType](#matchvalidationtype)_ | Type specifies the kind of string matching to use. <br /> Supported value is "Exact". Defaults to "Exact". | Exact | Enum: [Exact] <br /> |
+| `value` _string_ | Value is the model name string to match against. | | MinLength: 1 <br /> |
 
 
 #### Namespace
 
 _Underlying type:_ _string_
 
@@ -372,6 +499,7 @@ referrer.
 
 _Appears in:_
 
+- [InferenceModelRewriteSpec](#inferencemodelrewritespec)
 - [InferenceObjectiveSpec](#inferenceobjectivespec)
 
 | Field | Description | Default | Validation |
@@ -413,3 +541,20 @@ _Appears in:_
 
 
+#### TargetModel
+
+
+
+TargetModel defines a weighted model destination for traffic distribution.
+
+
+
+_Appears in:_
+- [InferenceModelRewriteRule](#inferencemodelrewriterule)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `weight` _integer_ | Weight defines the proportion of requests forwarded to the specified <br /> model. This is computed as weight/(sum of all weights in this <br /> TargetModels list). For non-zero values, there may be some epsilon from <br /> the exact proportion defined here depending on the precision an <br /> implementation supports. Weight is not a percentage and the sum of <br /> weights does not need to equal 100. <br /> If a weight is set for any targetModel, it must be set for all targetModels. <br /> Conversely, weights may be omitted entirely, so long as ALL targetModels omit them. | | Maximum: 1e+06 <br /> Minimum: 1 <br /> |
+| `modelRewrite` _string_ | | | |
+