**`site-src/guides/index.md`** (+3 −2)
````diff
@@ -137,8 +137,9 @@ Tooling:
 
 === "GKE"
 
-    1. Enable the Gateway API and configure proxy-only subnets when necessary. See [Deploy Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways)
-       for detailed instructions.
+    1. Enable the Google Kubernetes Engine API, the Compute Engine API, and the Network Services API, and configure proxy-only subnets when necessary.
+       See [Deploy Inference Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway)
+       for detailed instructions.
````
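For reference, the three APIs named in the updated step can be enabled in one command. This is a sketch only, assuming the standard Google Cloud service identifiers for GKE, Compute Engine, and Network Services:

```bash
# Sketch: enable the APIs named in the updated GKE prerequisite.
# Service identifiers assume the standard Google Cloud API names.
gcloud services enable \
  container.googleapis.com \
  compute.googleapis.com \
  networkservices.googleapis.com
```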
````diff
-A company wants to deploy multiple large language models (LLMs) to serve different workloads.
-For example, they might want to deploy a Gemma3 model for a chatbot interface and a Deepseek model for a recommendation application.
+
+A company wants to deploy multiple large language models (LLMs) to a cluster to serve different workloads.
+For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
 The company needs to ensure optimal serving performance for these LLMs.
-By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
-You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.
+By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
+You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property.
 
 ## How
+
 The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
-The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
+The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) (BBR)
 from the request body to the header. The header is then matched to dispatch
 requests to different `InferencePool` (and their EPPs) instances.
 
 
+### Deploy Body-Based Routing
+
+To enable body-based routing, you need to deploy the Body-Based Routing ExtProc server using Helm. Depending on your Gateway provider, you can use one of the following commands:
````
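As a sketch of what such a command looks like, assuming the `body-based-routing` Helm chart shipped in the project repo and a `provider.name` value matching your Gateway provider (both are assumptions based on the BBR README, not taken from this diff):

```bash
# Sketch only: deploy the BBR ExtProc server with Helm.
# Chart path and provider.name values are assumptions based on the
# body-based-routing chart in the project repo; adjust to your provider.
helm install body-based-router \
  ./config/charts/body-based-routing \
  --set provider.name=gke   # e.g. gke or istio
```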
````diff
 This example conceptually illustrates how to use the `HTTPRoute` object to route requests based on a model name such as `chatbot` or `recommender` to an `InferencePool`.
+
 ```yaml
 apiVersion: gateway.networking.k8s.io/v1
 kind: HTTPRoute
@@ -25,8 +60,7 @@ spec:
     - matches:
       - headers:
         - type: Exact
-          # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-          name: X-Gateway-Model-Name
+          name: X-Gateway-Model-Name # (1)!
           value: chatbot
         path:
           type: PathPrefix
@@ -37,38 +71,74 @@ spec:
     - matches:
       - headers:
         - type: Exact
-          # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-          name: X-Gateway-Model-Name
+          name: X-Gateway-Model-Name # (2)!
           value: recommender
         path:
           type: PathPrefix
           value: /
       backendRefs:
       - name: deepseek-r1
-        kind: InferencePool
+        kind: InferencePool
 ```
 
+1. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
+2. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
+
 ## Try it out
 
 1. Get the gateway IP:
    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
    ```
-2. Send a few requests to model "chatbot" as follows:
````
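A minimal sketch of such a request, assuming an OpenAI-compatible `/v1/completions` endpoint on the backends (the prompt and sampling parameters here are illustrative):

```bash
# Sketch: send a completion request with "model": "chatbot".
# BBR copies the model field into the X-Gateway-Model-Name header,
# which the HTTPRoute above matches to select the chatbot InferencePool.
curl -i ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "chatbot",
        "prompt": "Write as if you were a critic: San Francisco",
        "max_tokens": 100,
        "temperature": 0
      }'
```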
**`site-src/index.md`** (+1 −1)
````diff
@@ -29,7 +29,7 @@ The following specific terms to this project:
   from [Model Serving](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol/README.md).
 - **Metrics and Capabilities**: Data provided by model serving platforms about
   performance, availability and capabilities to optimize routing. Includes
-  things like [Prefix Cache] status or [LoRA Adapters] availability.
+  things like [Prefix Cache](https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html) status or [LoRA Adapters](https://docs.vllm.ai/en/stable/features/lora.html) availability.
 - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
````