Add prefix cache aware scheduling #768
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for gateway-api-inference-extension ready!
@ahg-g Completely agree. Excellent pointer from kube-scheduler 👍
Thanks everyone for the feedback so far. I have not been feeling well since yesterday; I hope to get back to this tomorrow with more benchmarking results.
Overall the PR looks OK.
I do have some concerns about the way the scheduler config was defined: it breaks the abstraction and adds plugin specifics to general-purpose files.
I attached a code suggestion for how to fix it.
@liu-cong thanks for incorporating the latest changes. I think the current version looks good.
The only comment I have is about the changes to `config.go`. I think this sets a very confusing precedent.
As a new adopter, am I supposed to initialize the plugin vars and then call `NewSchedulerWithConfig`? Am I supposed to use `ConfigOption` with empty arrays? Why does one plugin have a dedicated function while others don't? Am I supposed to add a dedicated "Add" function for every plugin that implements more than one interface?
I would feel much more comfortable with this PR if `config.go` stays clean and you add a helper function in `main.go`, instead of having a helper function that adds prefix specifics inside the general-purpose scheduler packages.
I'm also happy to implement a follow-up PR that uses reflection to generalize plugin registration, something like `SchedulerConfig.WithPlugin(plugin)`, which internally checks via reflection which interfaces the plugin implements and adds it to all the relevant places.
I agree with this, but I think that designing a proper config API is out of scope for this PR. I would recommend moving forward with this, and having a quick follow-up to define that. We are missing a few things:
Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service (I assume the baseline line is for the v1 scheduler, not a k8s Service)?
This is not a blocker to merge this; we can do a followup.
As I mentioned in a previous comment, this would be the ideal. I don't want to do it here since this PR is already way too large. I created a follow-up issue and also added a TODO to replace the
Yes I will add some data with the sglang test tool.
/hold
/retest
@ahg-g Please take a look at the updated results.
/unhold
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, liu-cong. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@liu-cong I don't see the comparison with Service, do you have that too?
The "Run2" graph has a comparison with svc |
This PR introduces an experimental scheduler v2 (disabled by default, enabled by an env var) that has the following major changes:

- Adds `QueueScorer` and `KVCacheScorer` to the scheduler, and ranks pods based on a weighted score. The benchmark results show that this has slightly better NTPOT compared to the v1 implementation using filters.
- Adds `PrefixCacheScorer` based on the proposal (disabled by default).

The eventual goal is to graduate v2 to the default implementation.
Benchmark Results
TLDR:
Benchmark 1: High and low prefix cache hit ratio
TLDR:
Benchmark Setup

- Model server: vLLM 0.8.4 with `--enable-prefix-caching`, base model `meta-llama/Llama-3.1-8B-Instruct`, on 3× H100 80GB.
- EPP baseline image: `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20250502-0ae7d1d`
- Benchmark tool: SGLang benchmark tool, using the `generated-shared-prefix` dataset.
- Benchmark command:
High Prefix Cache Hit Ratio (system-prompt-len=3000, question-len=128)
Low Prefix Cache Hit Ratio (system-prompt-len=128, question-len=3000)
Benchmark 2: Regression test
Benchmark setup
Followed the regression test guide
Result with queue and kv-cache scorers only (prefix plugin is disabled)
We tested different weights for `QueueScorer` and `KVCacheScorer`, call them `w_q` and `w_k` respectively. In one test `w_q=1, w_k=2` gave the best NTPOT performance; in another test `w_q=1, w_k=1` won. We recommend `w_q=1, w_k=1` as the default configuration as it's simple. In the diagrams below, `prefix-disabled-12` means `w_q=1, w_k=2`.

Run1:

Run2:

Result with prefix plugin enabled
The prefix scorer weight is `w_p`. In the diagrams below, `prefix-enabled-123` means `w_p=1, w_q=2, w_k=3`. `111` gave the best results for this dataset, so having `111` as the default configuration seems to make sense.

Compare to baseline

Comparing different weights

Prefix Plugin Overhead
Most of the prefix plugin overhead is in the `PreSchedule` part, with a P95 latency of 0.2ms. The overall scheduler latency is below 0.5ms (P95). The baseline scheduler latency is about 0.2ms, so the prefix cache plugin adds about 0.3ms of scheduling latency overall.

Memory

Latency

Prefix hit rate
With this dataset we see a very high prefix cache hit ratio (~95%), which is how we get such significant improvements.