Graduate scheduler v2 #967

@liu-cong

Description

The v2 scheduler implements prefix-cache-aware scheduling and load-aware (queue depth and KV cache utilization) scheduling using weighted scorers. It doesn't handle LoRA affinity yet. The goal is to graduate the v2 scheduler as the default and deprecate v1.

Graduation Criteria

  1. Implement LoRA affinity as a Scorer plugin.
  2. Regression-benchmark the multi-LoRA case and verify v2 >= v1.
  3. [DONE in Add prefix cache aware scheduling #768] Regression-benchmark the base-model case and verify v2 >= v1.
  4. Benchmark the prefix cache plugin overhead under different configurations, and verify that the overhead of the default configuration is minimal even in the zero/low cache-hit case.

LoRA Affinity

The current LoRA affinity algorithm is based on the "Filter" model and relies on several heuristics that are hard to reason about. I propose reimplementing it as a LoRA affinity plugin.

Currently the LoRA filter relies purely on LoRA metrics scraped from the model servers, and it suffers from a "head-of-line blocking" problem when EPP receives a traffic spike with new LoRA adapters, or for a fresh inference pool where LoRA affinity hasn't been established yet. As a result, EPP may fail to establish a stable affinity. As discussed here, we can record LoRA request history in EPP to establish a strong affinity that is not affected by metrics scrape freshness.

The plugin should maintain a map from adapter to the number of running requests per pod. For example:

```json
{
  "adapter1": {
    "pod1": 2
  },
  "adapter2": {
    "pod1": 1,
    "pod2": 3
  }
}
```

The plugin will implement the following extension points:

  1. Scorer
  • Pods with the requested adapter in active_adapters and zero waiting_adapters: score = 1.
  • Pods with total_adapters < max_adapters: score = 0.75 (they have room to quickly load the adapter).
  • Pods with waiting_adapters: score in (0, 0.5], based on the number of waiting adapters; the more waiting adapters, the lower the score.
  2. PreRequest
  • Increments the running-request counter for the adapter/pod pair.
  3. PostResponse
  • Decrements the running-request counter for the adapter/pod pair.

The plugin will periodically reconcile its state against the scraped LoRA metrics. This can run at a much lower frequency (e.g., every second) than the current metrics scraping interval of 50ms. Note that once affinity is established, the chance of a discrepancy between EPP's recorded state and the model servers' reported state should be very low.
