Graduate scheduler v2 #967

@liu-cong

Description

The v2 scheduler implements prefix-cache-aware scheduling and load-aware (queue depth and KV cache utilization) scheduling using weighted scorers. It doesn't handle LoRA affinity yet. The goal is to graduate the v2 scheduler as the default and deprecate v1.

Graduation Criteria

  1. Implement LoRA affinity as a Scorer plugin.
  2. Regression-benchmark the multi-LoRA case and verify v2 >= v1.
  3. [DONE in Add prefix cache aware scheduling #768] Regression-benchmark the base-model case and verify v2 >= v1.
  4. Benchmark the prefix cache plugin overhead under different configurations, and verify that the overhead of the default configuration is minimal even in the zero/low cache-hit case.

LoRA Affinity

The current LoRA affinity algorithm is based on the "Filter" model and relies on several heuristics that are hard to reason about. I propose reimplementing it as a LoRA affinity plugin.

Currently the LoRA filter relies purely on LoRA metrics scraped from the model servers, and it suffers from a "head-of-line blocking" problem when EPP receives a traffic spike with new LoRA adapters, or for a fresh inference pool where LoRA affinity hasn't been established yet. As a result, EPP may fail to establish a stable affinity. As discussed here, we can record LoRA request history in EPP to establish a strong affinity that is not affected by metrics scrape freshness.

The plugin should maintain a map from adapter to the number of running requests per pod. For example:

```json
{
  "adapter1": {
    "pod1": 2
  },
  "adapter2": {
    "pod1": 1,
    "pod2": 3
  }
}
```

The plugin will implement the following extension points:

  1. Scorer
  • Pods with the requested adapter in active_adapters and zero waiting_adapters: score = 1.
  • Pods with total_adapters < max_adapters: score = 0.75 (they have room to quickly load the adapter).
  • Pods with waiting_adapters: score in (0, 0.5], based on the number of waiting adapters; the more waiting adapters, the lower the score.
  2. PreRequest
  • Increments the running-request counter for the adapter/pod pair.
  3. PostResponse
  • Decrements the running-request counter for the adapter/pod pair.

The plugin will periodically reconcile its state against the scraped LoRA metrics. This can run at a much lower frequency (e.g., every second) than the current metrics scraping interval of 50ms. Note that once affinity is established, the chance of a discrepancy between EPP's recorded state and the model servers' reported state should be very low.
