# Configuring Plugins via Text

The set of lifecycle hooks (plugins) used by the Inference Gateway (IGW) is determined by how
it is configured. The IGW can be configured in several ways, either in code or via text.

If configured in code, either a set of predetermined environment variables must be used or one must
fork the IGW and change its code.

A simpler way to configure the IGW is to use a text-based configuration. This text is in YAML format
and can either be in a file or specified in-line as a parameter. The configuration defines the set of
plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling
the same plugin type to be instantiated multiple times, if needed. Also defined is a set of
SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. The set
of plugins instantiated must also include a Profile Handler, which determines which SchedulingProfiles
will be used for a particular request.

It should be noted that while the configuration text looks like a Kubernetes Custom Resource, it is
**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration
text and, in the future, will also help in versioning the text.

It should also be noted that even when the configuration text is loaded from a file, it is loaded at
the Endpoint Picker's (EPP) startup; changes to the file at runtime are ignored.

The configuration text has the following form:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- ....
- ....
schedulingProfiles:
- ....
- ....
```

The first two lines of the configuration are constant and must appear as is.

The plugins section defines the set of plugins that will be instantiated and their parameters. Each
entry in this section has the following form:
```yaml
- name: aName
  type: a-type
  parameters:
    parm1: val1
    parm2: val2
```
The fields in a plugin entry are:
- *name*, which is optional, provides a name by which the plugin instance can be referenced. If this
field is omitted, the plugin's type will be used as its name. Naming instances allows the same plugin
type to be instantiated more than once, as shown in the example after this list.
- *type* specifies the type of the plugin to be instantiated.
- *parameters*, which is optional, defines the set of parameters used to configure the plugin in question.
The actual set of parameters varies from plugin to plugin.
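
For example, the following entries (the instance names and parameter values here are illustrative)
instantiate the prefix-cache plugin twice, so that each instance can be configured differently and
referenced separately from a scheduling profile:
```yaml
plugins:
- name: small-block-prefix-cache   # instance names are arbitrary
  type: prefix-cache
  parameters:
    hashBlockSize: 5
- name: large-block-prefix-cache
  type: prefix-cache
  parameters:
    hashBlockSize: 64
```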

The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
requests to pods. The number of scheduling profiles one defines depends on the use case. For simple
serving of requests, one is enough. For disaggregated prefill, two profiles are required (a sketch of
a two-profile configuration follows the field descriptions below). Each entry in this section has the
following form:
```yaml
- name: aName
  plugins:
  - pluginRef: plugin1
  - pluginRef: plugin2
    weight: 50
```
The fields in a schedulingProfile entry are:
- *name* specifies the scheduling profile's name.
- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
Each entry in the schedulingProfile's plugins section has the following fields:
  - *pluginRef* is a reference to the name of the plugin instance to be used.
  - *weight* is the weight to be used if the referenced plugin is a scorer.
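
As an illustration of the disaggregated-prefill case mentioned above, a configuration could define two
profiles, for example named prefill and decode. The profile names, plugin choices, and weights below
are assumptions for illustration only; the referenced plugins are assumed to be declared (unnamed, so
referenced by type) in the plugins section, and a profile handler able to choose between the two
profiles would also have to be configured (the single-profile handler described later always selects
only the primary profile).
```yaml
schedulingProfiles:
- name: prefill            # illustrative profile name
  plugins:
  - pluginRef: queue       # assumes a queue scorer is declared in the plugins section
    weight: 100
  - pluginRef: max-score
- name: decode             # illustrative profile name
  plugins:
  - pluginRef: kv-cache
    weight: 50
  - pluginRef: prefix-cache
    weight: 50
  - pluginRef: max-score
```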

A complete configuration might look like this:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache
  parameters:
    hashBlockSize: 5
    maxPrefixBlocksToMatch: 256
    lruCapacityPerServer: 31250
- type: max-score
- type: single-profile
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score
  - pluginRef: prefix-cache
    weight: 50
```

If the configuration is in a file, the EPP command line argument `--configFile`
should be used to specify the full path of the file in question. For example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${EPP_NAME}
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
        imagePullPolicy: IfNotPresent
        args:
        - -poolName
        - "${POOL_NAME}"
        ...
        - --configFile
        - "/etc/epp/epp-config.yaml"
```
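
The example above assumes the configuration file is available inside the container at
`/etc/epp/epp-config.yaml`. One way to achieve that (a sketch; the ConfigMap name and key are
assumptions) is to place the configuration in a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: epp-config          # assumed name
data:
  epp-config.yaml: |        # assumed key; must match the file name passed to --configFile
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: max-score
    - type: single-profile
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: max-score
```
The Deployment would then mount this ConfigMap at `/etc/epp` via a `volumes`/`volumeMounts` pair, so
that the file appears at the path passed to `--configFile`.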

If the configuration is passed as in-line text, the EPP command line argument `--configText`
should be used. For example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${EPP_NAME}
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
        imagePullPolicy: IfNotPresent
        args:
        - -poolName
        - "${POOL_NAME}"
        ...
        - --configText
        - |
          apiVersion: inference.networking.x-k8s.io/v1alpha1
          kind: EndpointPickerConfig
          plugins:
          - type: prefix-cache
            parameters:
              hashBlockSize: 5
              maxPrefixBlocksToMatch: 256
              lruCapacityPerServer: 31250
          - type: max-score
          - type: single-profile
          schedulingProfiles:
          - name: default
            plugins:
            - pluginRef: max-score
            - pluginRef: prefix-cache
              weight: 50
```

## Plugin Configuration

This section describes how to set up the various plugins that are available with the IGW.

**SingleProfileHandler**<br>
Selects a single profile, which is always the primary profile.<br>
*Type*: single-profile<br>
*Parameters*: none<br>

**LeastKVCacheFilter**<br>
Finds the max and min KV-cache usage of all pods, divides the whole range (max-min) by the
number of pods, and selects the pods that fall into the first (lowest) range.<br>
*Type*: least-KV-cache<br>
*Parameters*: none<br>

**LeastQueueFilter**<br>
Finds the max and min queue size of all pods, divides the whole range (max-min) by the
number of pods, and selects the pods that fall into the first (lowest) range.<br>
*Type*: least-queue<br>
*Parameters*: none<br>

**LoraAffinityFilter**<br>
Implements a pod selection strategy that, when a LoRA adapter is requested, prioritizes pods
that are believed to have that LoRA adapter loaded. It also allows for load balancing through
some randomization.<br>
*Type*: lora-affinity<br>
*Parameters*:<br>
\- `threshold` a probability threshold used to occasionally select pods that don't appear to have
  the LoRA adapter loaded, to enable load balancing. If not specified, defaults to `0.999`<br>
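
For example, a plugins entry for this filter might look like the following (the threshold value is
illustrative):
```yaml
- type: lora-affinity
  parameters:
    threshold: 0.99   # illustrative value
```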

**LowQueueFilter**<br>
Filters out pods whose waiting queue size is greater than the specified threshold.<br>
*Type*: low-queue<br>
*Parameters*:<br>
\- `threshold` the waiting queue threshold. If not specified, defaults to `128`<br>
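
Similarly, an illustrative plugins entry for this filter (the threshold value is made up):
```yaml
- type: low-queue
  parameters:
    threshold: 256    # illustrative value
```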

**PrefixCachePlugin**<br>
Scores pods based on how much of the prompt is believed to already be in the pod's KV cache.<br>
*Type*: prefix-cache<br>
*Parameters*:<br>
\- `hashBlockSize` specifies the size of the blocks the input prompt is broken up into when
  calculating the block hashes. If not specified, defaults to `64`<br>
\- `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If
  not specified, defaults to `256`<br>
\- `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries
  per server (pod). If not specified, defaults to `31250`<br>

**MaxScorePicker**<br>
Picks the pod with the maximum score from the list of candidates.<br>
*Type*: max-score<br>
*Parameters*: none<br>

**RandomPicker**<br>
Picks a random pod from the list of candidates.<br>
*Type*: random<br>
*Parameters*: none<br>

**KvCacheScorer**<br>
Scores the candidate pods based on their KV cache utilization.<br>
*Type*: kv-cache<br>
*Parameters*: none<br>

**QueueScorer**<br>
Scores the candidate pods based on their waiting queue size. The lower a pod's waiting queue
size, the higher its score (since it is more available to serve new requests).<br>
*Type*: queue<br>
*Parameters*: none<br>
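
To illustrate how the scorers above can be combined, the following configuration (a sketch; the
choice of scorers and their weights are illustrative) weighs the queue scorer against the KV-cache
scorer in a single scheduling profile:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: queue
- type: kv-cache
- type: max-score
- type: single-profile
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: queue
    weight: 30        # illustrative weight
  - pluginRef: kv-cache
    weight: 70        # illustrative weight
  - pluginRef: max-score
```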