
Commit a9577cc

Merge branch 'kubernetes-sigs:main' into main
2 parents 2506b4e + 62f489b commit a9577cc


7 files changed: +109 −57 lines changed


apix/v1alpha2/inferenceobjective_types.go

Lines changed: 0 additions & 15 deletions
@@ -25,7 +25,6 @@ import (
 // +kubebuilder:object:root=true
 // +kubebuilder:subresource:status
 // +kubebuilder:storageversion
-// +kubebuilder:printcolumn:name="Model Name",type=string,JSONPath=`.spec.modelName`
 // +kubebuilder:printcolumn:name="Inference Pool",type=string,JSONPath=`.spec.poolRef.name`
 // +kubebuilder:printcolumn:name="Priority",type=string,JSONPath=`.spec.priority`
 // +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
@@ -56,12 +55,6 @@ type InferenceObjectiveList struct {
 // performance and latency goals for the model. These workloads are
 // expected to operate within an InferencePool sharing compute capacity with other
 // InferenceObjectives, defined by the Inference Platform Admin.
-//
-// InferenceObjective's modelName (not the ObjectMeta name) is unique for a given InferencePool,
-// if the name is reused, an error will be shown on the status of a
-// InferenceObjective that attempted to reuse. The oldest InferenceObjective, based on
-// creation timestamp, will be selected to remain valid. In the event of a race
-// condition, one will be selected at random.
 type InferenceObjectiveSpec struct {
 
 	// Priority defines how important it is to serve the request compared to other requests in the same pool.
@@ -135,10 +128,6 @@ const (
 	//
 	// * "Accepted"
 	//
-	// Possible reasons for this condition to be False are:
-	//
-	// * "ModelNameInUse"
-	//
 	// Possible reasons for this condition to be Unknown are:
 	//
 	// * "Pending"
@@ -148,10 +137,6 @@ const (
 	// ObjectiveReasonAccepted is the desired state. Model conforms to the state of the pool.
 	ObjectiveReasonAccepted InferenceObjectiveConditionReason = "Accepted"
 
-	// ObjectiveReasonNameInUse is used when a given ModelName already exists within the pool.
-	// Details about naming conflict resolution are on the ModelName field itself.
-	ObjectiveReasonNameInUse InferenceObjectiveConditionReason = "ModelNameInUse"
-
 	// ObjectiveReasonPending is the initial state, and indicates that the controller has not yet reconciled the InferenceObjective.
 	ObjectiveReasonPending InferenceObjectiveConditionReason = "Pending"
 )
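With `ModelNameInUse` removed, `Accepted` and `Pending` are the only reasons left on the `Accepted` condition. A minimal Go sketch of how a consumer might interpret the remaining states, assuming the objective's status exposes a standard `[]metav1.Condition` slice (the helper function and sample data below are illustrative, not part of this API):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// describeAcceptance maps the remaining condition reasons to
// human-readable states. "ModelNameInUse" no longer appears.
func describeAcceptance(conds []metav1.Condition) string {
	cond := meta.FindStatusCondition(conds, "Accepted")
	if cond == nil {
		return "no Accepted condition reported yet"
	}
	switch cond.Reason {
	case "Accepted": // ObjectiveReasonAccepted
		return "accepted: objective conforms to the state of the pool"
	case "Pending": // ObjectiveReasonPending
		return "pending: controller has not yet reconciled the objective"
	default:
		return fmt.Sprintf("unexpected reason %q: %s", cond.Reason, cond.Message)
	}
}

func main() {
	// Sample condition as a controller might set it after reconciling.
	conds := []metav1.Condition{{
		Type:   "Accepted",
		Status: metav1.ConditionTrue,
		Reason: "Accepted",
	}}
	fmt.Println(describeAcceptance(conds))
}
```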

config/charts/inferencepool/templates/inferencepool.yaml

Lines changed: 4 additions & 0 deletions
@@ -20,3 +20,7 @@ spec:
 {{- end }}
   endpointPickerRef:
     name: {{ include "gateway-api-inference-extension.name" . }}
+    port:
+      number: {{ .Values.inferenceExtension.extProcPort | default 9002 }}
+
+
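The added lines give the `endpointPickerRef` an explicit port, falling back to 9002 when `inferenceExtension.extProcPort` is not set in the chart values. A minimal Go sketch of the same defaulting rule, useful when building the EPP target address outside of Helm (the function and names are illustrative assumptions, not chart code):

```go
package main

import "fmt"

// eppAddress mirrors the template's `| default 9002` expression:
// a zero (unset) port falls back to the conventional ext-proc port.
func eppAddress(service string, extProcPort int) string {
	if extProcPort == 0 {
		extProcPort = 9002 // chart default
	}
	return fmt.Sprintf("%s:%d", service, extProcPort)
}

func main() {
	fmt.Println(eppAddress("my-pool-epp", 0))    // my-pool-epp:9002
	fmt.Println(eppAddress("my-pool-epp", 9100)) // my-pool-epp:9100
}
```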

config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml

Lines changed: 0 additions & 9 deletions
@@ -15,9 +15,6 @@ spec:
   scope: Namespaced
   versions:
   - additionalPrinterColumns:
-    - jsonPath: .spec.modelName
-      name: Model Name
-      type: string
     - jsonPath: .spec.poolRef.name
       name: Inference Pool
       type: string
@@ -61,12 +58,6 @@ spec:
           performance and latency goals for the model. These workloads are
           expected to operate within an InferencePool sharing compute capacity with other
           InferenceObjectives, defined by the Inference Platform Admin.
-
-          InferenceObjective's modelName (not the ObjectMeta name) is unique for a given InferencePool,
-          if the name is reused, an error will be shown on the status of a
-          InferenceObjective that attempted to reuse. The oldest InferenceObjective, based on
-          creation timestamp, will be selected to remain valid. In the event of a race
-          condition, one will be selected at random.
         properties:
           poolRef:
             description: PoolRef is a reference to the inference pool, the pool

mkdocs.yml

Lines changed: 3 additions & 2 deletions
@@ -12,6 +12,7 @@ theme:
   logo: images/logo/logo-text-large-horizontal-white.png
   favicon: images/favicon-64.png
   features:
+    - content.code.annotate
     - search.highlight
     - navigation.tabs
     - navigation.top
@@ -55,7 +56,7 @@ nav:
     - Design Principles: concepts/design-principles.md
     - Conformance: concepts/conformance.md
     - Roles and Personas: concepts/roles-and-personas.md
-  - Implementations:
+  - Implementations:
     - Gateways: implementations/gateways.md
     - Model Servers: implementations/model-servers.md
   - FAQ: faq.md
@@ -70,7 +71,7 @@ nav:
     - InferencePool Rollout: guides/inferencepool-rollout.md
     - Metrics and Observability: guides/metrics-and-observability.md
    - Configuration Guide:
-      - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
+      - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
      - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
    - Troubleshooting Guide: guides/troubleshooting.md
    - Implementer Guides:

site-src/guides/index.md

Lines changed: 3 additions & 2 deletions
@@ -137,8 +137,9 @@ Tooling:
 
 === "GKE"
 
-    1. Enable the Gateway API and configure proxy-only subnets when necessary. See [Deploy Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways)
-       for detailed instructions.
+    1. Enable the Google Kubernetes Engine API, Compute Engine API, the Network Services API and configure proxy-only subnets when necessary.
+       See [Deploy Inference Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway)
+       for detailed instructions.
 
     2. Deploy Inference Gateway:
 
Lines changed: 98 additions & 28 deletions
@@ -1,18 +1,53 @@
 # Serve multiple generative AI models
-A company wants to deploy multiple large language models (LLMs) to serve different workloads.
-For example, they might want to deploy a Gemma3 model for a chatbot interface and a Deepseek model for a recommendation application.
+
+A company wants to deploy multiple large language models (LLMs) to a cluster to serve different workloads.
+For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
 The company needs to ensure optimal serving performance for these LLMs.
-By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
-You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.
+By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
+You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property.
 
 ## How
+
 The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
-The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
+The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) (BBR)
 from the request body to the header. The header is then matched to dispatch
 requests to different `InferencePool` (and their EPPs) instances.
 ![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)
 
+### Deploy Body-Based Routing
+
+To enable body-based routing, you need to deploy the Body-Based Routing ExtProc server using Helm. Depending on your Gateway provider, you can use one of the following commands:
+
+=== "GKE"
+
+    ```bash
+    helm install body-based-router \
+      --set provider.name=gke \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+=== "Istio"
+
+    ```bash
+    helm install body-based-router \
+      --set provider.name=istio \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+=== "Other"
+
+    ```bash
+    helm install body-based-router \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+### Configure HTTPRoute
+
 This example illustrates a conceptual example regarding how to use the `HTTPRoute` object to route based on model name like “chatbot” or “recommender” to `InferencePool`.
+
 ```yaml
 apiVersion: gateway.networking.k8s.io/v1
 kind: HTTPRoute
@@ -25,8 +60,7 @@ spec:
   - matches:
     - headers:
       - type: Exact
-        #Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-        name: X-Gateway-Model-Name
+        name: X-Gateway-Model-Name # (1)!
         value: chatbot
       path:
         type: PathPrefix
@@ -37,38 +71,74 @@ spec:
   - matches:
     - headers:
       - type: Exact
-        #Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-        name: X-Gateway-Model-Name
+        name: X-Gateway-Model-Name # (2)!
        value: recommender
       path:
         type: PathPrefix
         value: /
     backendRefs:
     - name: deepseek-r1
-      kind: InferencePool
+      kind: InferencePool
 ```
 
+1. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
+2. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
+
 ## Try it out
 
 1. Get the gateway IP:
 ```bash
 IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
 ```
-2. Send a few requests to model "chatbot" as follows:
-```bash
-curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-"model": "chatbot",
-"prompt": "What is the color of the sky",
-"max_tokens": 100,
-"temperature": 0
-}'
-```
-3. Send a few requests to model "recommender" as follows:
-```bash
-curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-"model": "recommender",
-"prompt": "Give me restaurant recommendations in Paris",
-"max_tokens": 100,
-"temperature": 0
-}'
-```
+
+=== "Chat Completions API"
+
+    1. Send a few requests to model `chatbot` as follows:
+    ```bash
+    curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "chatbot",
+        "messages": [{"role": "user", "content": "What is the color of the sky?"}],
+        "max_tokens": 100,
+        "temperature": 0
+      }'
+    ```
+
+    2. Send a few requests to model `recommender` as follows:
+    ```bash
+    curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "recommender",
+        "messages": [{"role": "user", "content": "Give me restaurant recommendations in Paris"}],
+        "max_tokens": 100,
+        "temperature": 0
+      }'
+    ```
+
+=== "Completions API"
+
+    1. Send a few requests to model `chatbot` as follows:
+    ```bash
+    curl -X POST -i ${IP}:${PORT}/v1/completions \
+      -H 'Content-Type: application/json' \
+      -d '{
+        "model": "chatbot",
+        "prompt": "What is the color of the sky",
+        "max_tokens": 100,
+        "temperature": 0
+      }'
+    ```
+
+    2. Send a few requests to model `recommender` as follows:
+    ```bash
+    curl -X POST -i ${IP}:${PORT}/v1/completions \
+      -H 'Content-Type: application/json' \
+      -d '{
+        "model": "recommender",
+        "prompt": "Give me restaurant recommendations in Paris",
+        "max_tokens": 100,
+        "temperature": 0
+      }'
+    ```
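The annotations above describe what BBR does at request time: it reads the `model` field from the JSON request body and copies it into the `X-Gateway-Model-Name` header, which the `HTTPRoute` then matches on. A minimal Go sketch of that extraction step, assuming an OpenAI-style request body (an illustration only, not the BBR implementation):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// modelNameHeader is the header key the HTTPRoute rules match on.
const modelNameHeader = "X-Gateway-Model-Name"

// extractModelName pulls the "model" field out of an OpenAI-style
// request body, mirroring the copy-to-header step BBR performs
// before route matching runs.
func extractModelName(body []byte) (string, error) {
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("parsing request body: %w", err)
	}
	return req.Model, nil
}

func main() {
	body := []byte(`{"model": "chatbot", "prompt": "What is the color of the sky"}`)
	model, err := extractModelName(body)
	if err != nil {
		panic(err)
	}
	// The extracted name is surfaced as a header so the gateway can
	// dispatch the request to the matching InferencePool.
	h := http.Header{}
	h.Set(modelNameHeader, model)
	fmt.Println(h.Get(modelNameHeader)) // chatbot
}
```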

site-src/index.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ The following specific terms to this project:
   from [Model Serving](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol/README.md).
 - **Metrics and Capabilities**: Data provided by model serving platforms about
   performance, availability and capabilities to optimize routing. Includes
-  things like [Prefix Cache] status or [LoRA Adapters] availability.
+  things like [Prefix Cache](https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html) status or [LoRA Adapters](https://docs.vllm.ai/en/stable/features/lora.html) availability.
 - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
 
 [Inference Gateway]:#concepts-and-definitions
