
Commit 2e49abc

Add documentation for the new text based configuration
Signed-off-by: Shmuel Kallner <[email protected]>
1 parent c884ab7 commit 2e49abc

2 files changed: +227 −0 lines changed

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@ nav:
- InferencePool Rollout: guides/inferencepool-rollout.md
- Metrics and Observability: guides/metrics-and-observability.md
- Configuration Guide:
+ - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
- Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
- Implementer's Guide: guides/implementers.md
- Implementer Guides:
guides/epp-configuration/config-text.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
# Configuring Plugins via text

The set of lifecycle hooks (plugins) that are used by the Inference Gateway (IGW) is determined by how
it is configured. The IGW can be configured in several ways, either by code or via text.

If configured by code, one must either use a set of predetermined environment variables or
fork the IGW and change its code.

A simpler way to configure the IGW is to use a text-based configuration. This text is in YAML format
and can either be in a file or specified in-line as a parameter. The configuration defines the set of
plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling
the same plugin type to be instantiated multiple times, if needed. Also defined is a set of
SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. The set
of plugins instantiated must also include a Profile Handler, which determines which SchedulingProfiles
will be used for a particular request.

Note that while the configuration text looks like a Kubernetes Custom Resource, it is
**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration
text and, in the future, will also help in versioning it.

Note also that even when the configuration text is loaded from a file, it is loaded only at
the Endpoint-Picker's (EPP) startup; changes to the file at runtime are ignored.

The configuration text has the following form:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- ....
- ....
schedulingProfiles:
- ....
- ....
```

The first two lines of the configuration are constant and must appear as is.

The `plugins` section defines the set of plugins that will be instantiated and their parameters.
Each entry in this section has the following form:
```yaml
- name: aName
  type: a-type
  parameters:
    parm1: val1
    parm2: val2
```
The fields in a plugin entry are:
- *name* (optional) provides a name by which the plugin instance can be referenced (see the example below). If this
field is omitted, the plugin's type will be used as its name.<br>
- *type* specifies the type of the plugin to be instantiated.<br>
- *parameters* (optional) defines the set of parameters used to configure the plugin in question.
The actual set of parameters varies from plugin to plugin.
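
For illustration only, the optional *name* field allows the same plugin type to be instantiated twice with different parameters; the instance names and parameter values in this sketch are made up:
```yaml
plugins:
- name: short-prefix-cache     # illustrative instance name
  type: prefix-cache
  parameters:
    hashBlockSize: 5
- name: long-prefix-cache      # illustrative instance name
  type: prefix-cache
  parameters:
    hashBlockSize: 64
```
Each instance can then be referenced separately, by its name, from a scheduling profile.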

The `schedulingProfiles` section defines the set of scheduling profiles that can be used when scheduling
requests to pods. The number of scheduling profiles one defines depends on the use case. For simple
serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
in this section has the following form:
```yaml
- name: aName
  plugins:
  - pluginRef: plugin1
  - pluginRef: plugin2
    weight: 50
```
The fields in a schedulingProfile entry are:
- *name* specifies the scheduling profile's name.
- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
Each entry in the schedulingProfile's plugins section has the following fields:
  - *pluginRef* is a reference to the name of the plugin instance to be used.
  - *weight* is the weight to be used if the referenced plugin is a scorer.
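
As a rough sketch of the two-profile case mentioned above (disaggregated prefill), the schedulingProfiles section could look like the following. The profile names are purely illustrative, the pluginRef values assume plugins with those names were defined in the plugins section, and the choice of which profile handles a given request is made by the configured Profile Handler:
```yaml
schedulingProfiles:
- name: prefill            # illustrative profile name
  plugins:
  - pluginRef: queue       # assumes a plugin named "queue" is defined above
    weight: 1
  - pluginRef: max-score
- name: decode             # illustrative profile name
  plugins:
  - pluginRef: kv-cache    # assumes a plugin named "kv-cache" is defined above
    weight: 1
  - pluginRef: max-score
```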

A complete configuration might look like this:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache
  parameters:
    hashBlockSize: 5
    maxPrefixBlocksToMatch: 256
    lruCapacityPerServer: 31250
- type: max-score
- type: single-profile
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score
  - pluginRef: prefix-cache
    weight: 50
```

If the configuration is in a file, the EPP command line argument `--configFile`
should be used to specify the full path of the file in question. For example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${EPP_NAME}
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
        imagePullPolicy: IfNotPresent
        args:
        - -poolName
        - "${POOL_NAME}"
        ...
        - --configFile
        - "/etc/epp/epp-config.yaml"
```
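
The example above assumes the file already exists at `/etc/epp/epp-config.yaml` inside the container. One common way to achieve that (not shown in the example above; all names here are hypothetical) is to mount the configuration from a ConfigMap:
```yaml
# Illustrative sketch only: making /etc/epp/epp-config.yaml available to the EPP container.
apiVersion: v1
kind: ConfigMap
metadata:
  name: epp-config            # hypothetical name
data:
  epp-config.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: single-profile
    - type: max-score
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: max-score
```
The Deployment's pod spec would then add a `volumes` entry backed by this ConfigMap and a matching `volumeMounts` entry on the `epp` container with `mountPath: /etc/epp`.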

If the configuration is passed as in-line text, the EPP command line argument `--configText`
should be used. For example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${EPP_NAME}
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
        imagePullPolicy: IfNotPresent
        args:
        - -poolName
        - "${POOL_NAME}"
        ...
        - --configText
        - |
          apiVersion: inference.networking.x-k8s.io/v1alpha1
          kind: EndpointPickerConfig
          plugins:
          - type: prefix-cache
            parameters:
              hashBlockSize: 5
              maxPrefixBlocksToMatch: 256
              lruCapacityPerServer: 31250
          - type: max-score
          - type: single-profile
          schedulingProfiles:
          - name: default
            plugins:
            - pluginRef: max-score
            - pluginRef: prefix-cache
              weight: 50
```

## Plugin Configuration

This section describes how to set up the various plugins that are available with the IGW.

**SingleProfileHandler**<br>
Selects a single profile, which is always the primary profile.<br>
*Type*: single-profile<br>
*Parameters*: none<br>

**LeastKVCacheFilter**<br>
Finds the max and min KV cache of all pods, divides the whole range (max-min) by the
number of pods, and finds the pods that fall into the first range.<br>
*Type*: least-KV-cache<br>
*Parameters*: none<br>

**LeastQueueFilter**<br>
Finds the max and min queue size of all pods, divides the whole range (max-min) by the
number of pods, and finds the pods that fall into the first range.<br>
*Type*: least-queue<br>
*Parameters*: none<br>

**LoraAffinityFilter**<br>
Implements a pod selection strategy that, when a LoRA adapter is requested, prioritizes pods
that are believed to have that specific LoRA adapter loaded. It also allows for load balancing through
some randomization.<br>
*Type*: lora-affinity<br>
*Parameters*:<br>
\- `threshold` a probability threshold for occasionally selecting pods that don't seem to have the LoRA
adapter loaded, to enable load balancing. If not specified, defaults to `0.999`<br>
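
For example, an illustrative plugins entry that sets this threshold explicitly could look like:
```yaml
plugins:
- type: lora-affinity
  parameters:
    threshold: 0.999   # the documented default, shown explicitly for illustration
```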

**LowQueueFilter**<br>
Filters out pods whose waiting queue size is greater than the specified threshold.<br>
*Type*: low-queue<br>
*Parameters*:<br>
\- `threshold` the waiting queue threshold. If not specified, defaults to `128`<br>

**PrefixCachePlugin**<br>
Scores pods based on how much of the prompt is believed to be in the pod's KV cache.<br>
*Type*: prefix-cache<br>
*Parameters*:<br>
\- `hashBlockSize` specifies the size of the blocks the input prompt is broken into when
calculating the block hashes. If not specified, defaults to `64`<br>
\- `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If
not specified, defaults to `256`<br>
\- `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries
per server (pod). If not specified, defaults to `31250`<br>

**MaxScorePicker**<br>
Picks the pod with the maximum score from the list of candidates.<br>
*Type*: max-score<br>
*Parameters*: none<br>

**RandomPicker**<br>
Picks a random pod from the list of candidates.<br>
*Type*: random<br>
*Parameters*: none<br>

**KvCacheScorer**<br>
Scores the candidate pods based on their KV cache utilization.<br>
*Type*: kv-cache<br>
*Parameters*: none<br>

**QueueScorer**<br>
Scores the candidate pods based on their waiting queue size. The lower a pod's
waiting queue size, the higher its score (since it is more
available to serve new requests).<br>
*Type*: queue<br>
*Parameters*: none<br>
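
As a closing, illustrative sketch (the plugin selection and scorer weights are arbitrary, not a recommendation), several of the plugins described above could be combined into one configuration:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile   # profile handler
- type: low-queue        # filter
- type: kv-cache         # scorer
- type: queue            # scorer
- type: max-score        # picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: low-queue
  - pluginRef: kv-cache
    weight: 1
  - pluginRef: queue
    weight: 1
  - pluginRef: max-score
```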
