Add prefix cache aware scheduling #768


Merged: 9 commits into kubernetes-sigs:main on May 10, 2025

Conversation

liu-cong
Contributor

@liu-cong liu-cong commented May 1, 2025

This PR introduces an experimental scheduler v2 (disabled by default, enabled via an env var) with the following major changes:

  • Adds a QueueScorer and a KVCacheScorer to the scheduler, and ranks pods based on a weighted score. Benchmark results show this yields slightly better NTPOT than the v1 implementation, which uses filters.
  • Adds a PrefixCacheScorer based on the proposal (disabled by default).
  • v2 doesn't support LoRA affinity yet.

The eventual goal is to graduate v2 to the default implementation.
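The scorer-based ranking can be sketched roughly as follows. This is a minimal illustration, not the PR's actual types: the pod fields, scorer names, and the linear scoring formula here are assumptions, and the real scorers normalize metrics differently.

```go
package main

import "fmt"

// Pod holds the metrics a scorer might consult (illustrative fields).
type Pod struct {
	Name         string
	QueueLen     int
	KVCacheUsage float64 // fraction of KV cache in use, 0..1
}

// Scorer returns a score in [0, 1]; higher is better.
type Scorer interface {
	Score(p Pod) float64
}

type QueueScorer struct{ MaxQueue int }

func (s QueueScorer) Score(p Pod) float64 {
	if p.QueueLen >= s.MaxQueue {
		return 0
	}
	return 1 - float64(p.QueueLen)/float64(s.MaxQueue)
}

type KVCacheScorer struct{}

func (KVCacheScorer) Score(p Pod) float64 { return 1 - p.KVCacheUsage }

// pick returns the pod with the highest weighted sum of scores.
func pick(pods []Pod, scorers []Scorer, weights []float64) Pod {
	best, bestScore := pods[0], -1.0
	for _, p := range pods {
		total := 0.0
		for i, s := range scorers {
			total += weights[i] * s.Score(p)
		}
		if total > bestScore {
			best, bestScore = p, total
		}
	}
	return best
}

func main() {
	pods := []Pod{
		{Name: "pod-a", QueueLen: 8, KVCacheUsage: 0.9},
		{Name: "pod-b", QueueLen: 2, KVCacheUsage: 0.5},
	}
	scorers := []Scorer{QueueScorer{MaxQueue: 10}, KVCacheScorer{}}
	// Equal weights (w_q=1, w_k=1), matching the recommended default.
	winner := pick(pods, scorers, []float64{1, 1})
	fmt.Println(winner.Name) // pod-b: shorter queue and more free KV cache
}
```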

Benchmark Results

TL;DR:

  • With the prefix plugin disabled, v2 performs similarly to v1, and in some cases shows a noticeable (5-7%) improvement in p90 NTPOT over v1.
  • With the prefix plugin enabled, v2 performs significantly better than v1, thanks to prefix cache aware scheduling.
  • The memory footprint of the prefix plugin is quite small; for the benchmark case (Llama 3 8B, 6 H100) it is 18MB.
  • The prefix plugin adds about 0.3ms to the baseline scheduling latency of 0.2ms.

Benchmark1: High and low prefix cache hit ratio

TL;DR:

  • In the high prefix cache hit ratio test (system prompt length = 2048, user question length = 256):
    • Improved P50 and P95 TTFT by 70% and 34%, respectively
    • Increased input throughput by 35%
    • Improved P95 ITL by 49%
    • Improved the cache hit ratio from 20-60% in the baseline to a very consistent 89%
  • In the low prefix cache hit ratio test (system prompt length = 256, user question length = 2048):
    • Improved P50 and P95 TTFT by 17% and 13%, respectively
    • Increased input throughput by 1.4% (not significant)
    • Improved P95 ITL by 17%
    • Improved the cache hit ratio from 3-5% in the baseline to 4.5-6%

Benchmark Setup

Model server: vLLM 0.8.4 with --enable-prefix-caching, base model meta-llama/Llama-3.1-8B-Instruct, on 3 H100 80GB.

EPP baseline image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20250502-0ae7d1d

Benchmark tool: SGLang benchmark tool, using the 'generated-shared-prefix' dataset.

Benchmark command:

# High cache hit ratio
python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=$MODEL --tokenizer=$MODEL --backend=vllm \
--gsp-num-groups=256 \
--gsp-prompts-per-group=16 \
--gsp-system-prompt-len=2048 \
--gsp-question-len=256 \
--gsp-output-len=256 \
--request-rate=800 --max-concurrency=300

# Low cache hit ratio
python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=$MODEL --tokenizer=$MODEL --backend=vllm \
--gsp-num-groups=256 \
--gsp-prompts-per-group=16 \
--gsp-system-prompt-len=256 \
--gsp-question-len=2048 \
--gsp-output-len=256 \
--request-rate=800 --max-concurrency=300

High Prefix Cache Hit Ratio (system-prompt-len=3000, question-len=128)


| | QPS | Concurrency | Cache hit | P50 TTFT | P95 TTFT | Total Throughput | Input Throughput | P50 ITL | P95 ITL |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 32 | 278 | 20%-60% | 1268 | 3439 | 86342 | 78045 | 39 | 114 |
| prefix111 | 43 | 269 | 89% | 379 | 2255 | 116877 | 105646 | 39 | 58 |
| (baseline-prefix)/baseline (%) | | | | 70.11 | 34.43 | -35.37 | -35.37 | 0.00 | 49.12 |

Low Prefix Cache Hit Ratio (system-prompt-len=128, question-len=3000)


| | QPS | Concurrency | Cache hit | P50 TTFT | P95 TTFT | Total Throughput | Input Throughput | P50 ITL | P95 ITL |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 26 | 293 | 3%-5% | 1556 | 4712 | 70500 | 63722 | 41 | 418 |
| prefix111 | 27 | 293 | 4.5%-6% | 1286 | 4076 | 71508 | 64634 | 40 | 345 |
| (baseline-prefix)/baseline (%) | | | | 17.35 | 13.50 | -1.43 | -1.43 | 2.44 | 17.46 |

Benchmark 2: Regression test

Benchmark Setup

Followed the regression test guide

Result with queue and kv-cache scorers only (prefix plugin is disabled)

  • I experimented with different weights for the QueueScorer and KVCacheScorer, call them w_q and w_k respectively. In one test, w_q=1, w_k=2 gave the best NTPOT; in another, w_q=1, w_k=1 won. We recommend w_q=1, w_k=1 as the default configuration because it is simple.
  • In the diagram below, prefix-disabled-12 means w_q=1, w_k=2.
  • Compared to v1, we see noticeable improvements in NTPOT (up to 7%) and a slight decrease in throughput (the graphs show a zoomed-in view; the actual difference is < 1%). I am not sure why, since a lower NTPOT typically implies a higher throughput.

Run1: [chart]

Run2: [chart]

Result with prefix plugin enabled

  • I experimented with various weights, including the prefix score weight w_p. In the diagrams below, prefix-enabled-123 means w_p=1, w_q=2, w_k=3.
  • All weight combinations gave significant improvements over the baseline.
  • 111 gave the best results for this dataset, so having 111 as the default configuration seems to make sense.

Compare to baseline: [chart]

Comparing different weights: [chart]

Prefix Plugin Overhead

  • Memory: As analyzed in the code, the estimated memory footprint is just 17.4MB, which is quite small compared to the baseline EPP memory footprint (150MB). Looking at EPP pod memory usage, with prefix enabled the pod actually uses less memory than the baseline when serving traffic. When not serving traffic, the prefix-enabled EPP pod uses more memory because the prefix indexes are kept in an LRU cache.
  • Latency: With the prefix plugin, the major overhead comes from the PreSchedule step, with a P95 latency of 0.2ms. The overall scheduler latency stays below 0.5ms (P95). The baseline scheduler latency is about 0.2ms, so the prefix cache plugin adds about 0.3ms of scheduling latency overall.
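The prefix index the plugin maintains can be pictured roughly as follows. This is a toy sketch, not the PR's implementation: the chunk size, hashing scheme, and type names are assumptions, and the real plugin bounds the index with an LRU cache (which this sketch omits). The key idea is hashing cumulative prompt prefixes per chunk, so a hit on chunk N implies the entire prefix up to N matches.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const chunkSize = 64 // characters per chunk in this toy example

type PrefixIndex struct {
	// chunk-prefix hash -> set of pod names believed to hold that prefix in KV cache.
	index map[uint64]map[string]struct{}
}

func NewPrefixIndex() *PrefixIndex {
	return &PrefixIndex{index: make(map[uint64]map[string]struct{})}
}

// chunkHashes returns one hash per chunk, where each hash covers the
// whole prompt prefix up to that chunk (the hasher is cumulative).
func chunkHashes(prompt string) []uint64 {
	var hashes []uint64
	h := fnv.New64a()
	for i := 0; i+chunkSize <= len(prompt); i += chunkSize {
		h.Write([]byte(prompt[i : i+chunkSize]))
		hashes = append(hashes, h.Sum64())
	}
	return hashes
}

// Record remembers that pod served this prompt.
func (p *PrefixIndex) Record(prompt, pod string) {
	for _, hv := range chunkHashes(prompt) {
		if p.index[hv] == nil {
			p.index[hv] = make(map[string]struct{})
		}
		p.index[hv][pod] = struct{}{}
	}
}

// Score returns the fraction of the prompt's leading chunks that pod
// is believed to already have cached; the match ends at the first miss.
func (p *PrefixIndex) Score(prompt, pod string) float64 {
	hashes := chunkHashes(prompt)
	if len(hashes) == 0 {
		return 0
	}
	matched := 0
	for _, hv := range hashes {
		if _, ok := p.index[hv][pod]; !ok {
			break
		}
		matched++
	}
	return float64(matched) / float64(len(hashes))
}

func main() {
	idx := NewPrefixIndex()
	system := ""
	for i := 0; i < 256; i++ {
		system += "a" // stand-in for a shared system prompt
	}
	idx.Record(system+" question one", "pod-a")
	// A second request sharing the system prompt scores high on pod-a only.
	fmt.Printf("%.2f\n", idx.Score(system+" question two", "pod-a"))
	fmt.Printf("%.2f\n", idx.Score(system+" question two", "pod-b"))
}
```

This also makes the memory trade-off above concrete: the index holds only fixed-size hashes and pod-name sets, not the prompts themselves, which is why the footprint stays in the tens of megabytes.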

Memory: [chart]

Latency: [chart]

Prefix hit rate

With this dataset we see a very high prefix cache hit ratio (95%), which is how we get such significant improvements.


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 1, 2025

netlify bot commented May 1, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 47febd5
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/681e6fef2766a50008f8fbae
😎 Deploy Preview https://deploy-preview-768--gateway-api-inference-extension.netlify.app

@nirrozenbaum
Contributor

nirrozenbaum commented May 6, 2025

> > I recommend defining an abstract parameters map (e.g., map[string]interface{}) instead of explicitly adding parameters.
>
> @nirrozenbaum a plugin can have multiple extension points, and in some cases those extension points will need to pass data across one another. As suggested above, we can use an abstract parameters map to pass this data.
>
> See the CycleState concept in kube-scheduler: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/cycle_state.go#L32 and https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/interface.go#L556

@ahg-g completely agree, excellent pointer from kube-scheduler 👍
An abstract parameters map is great (much better than specific fields that are relevant only to specific plugins).
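For readers unfamiliar with kube-scheduler's CycleState, the idea can be sketched like this: a concurrency-safe map, scoped to a single scheduling cycle, that one extension point writes and a later one reads. Names here are illustrative, not this project's API.

```go
package main

import (
	"fmt"
	"sync"
)

// StateData is opaque per-plugin data; plugins cast it back on read.
type StateData interface{}

// CycleState carries data between extension points within one cycle.
type CycleState struct {
	mu   sync.RWMutex
	data map[string]StateData
}

func NewCycleState() *CycleState {
	return &CycleState{data: make(map[string]StateData)}
}

func (c *CycleState) Write(key string, val StateData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

func (c *CycleState) Read(key string) (StateData, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if v, ok := c.data[key]; ok {
		return v, nil
	}
	return nil, fmt.Errorf("key %q not found", key)
}

// Example: a prefix plugin's PreSchedule step stores the chunk hashes it
// computed, and its Score step reads them back instead of recomputing.
func main() {
	state := NewCycleState()
	state.Write("prefix/hashes", []uint64{0xabc, 0xdef})
	v, err := state.Read("prefix/hashes")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(v.([]uint64))) // 2
}
```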

@liu-cong
Contributor Author

liu-cong commented May 6, 2025

Thanks everyone for the feedback so far. I have not been feeling well since yesterday. I hope to get back to this tomorrow with more benchmarking results.

Contributor

@nirrozenbaum nirrozenbaum left a comment


Overall the PR looks OK.
I do have some concerns about the way the scheduler config is defined, which breaks the abstraction and adds plugin specifics to general-purpose files.
I attached a code suggestion for how to fix it.

@oglok oglok mentioned this pull request May 8, 2025
@liu-cong liu-cong marked this pull request as ready for review May 9, 2025 00:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2025
@k8s-ci-robot k8s-ci-robot requested a review from nirrozenbaum May 9, 2025 00:13
Contributor

@nirrozenbaum nirrozenbaum left a comment


@liu-cong thanks for incorporating the latest changes. I think the current version looks good.

The only comment I have is about the changes to config.go. I think this sets a very confusing precedent. As a new adopter, am I supposed to initialize the plugin vars and then call NewSchedulerWithConfig? Am I supposed to use ConfigOption with empty arrays? Why does one plugin have a dedicated function while others don't? Am I supposed to add a dedicated "Add" function for every plugin that implements more than one interface?

I would feel much more comfortable with this PR if config.go stays clean and you add a helper function in main.go, instead of having a helper function that adds prefix specifics inside the general-purpose scheduler packages.

I'm also happy to implement a follow-up PR that generalizes plugin setup, something like

SchedulerConfig.WithPlugin(plugin)

which internally checks which interfaces the plugin implements and adds it to all the relevant places.
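The proposed generic registration could look roughly like the sketch below. In Go, interface conformance is usually checked with type assertions rather than the reflect package, so this sketch uses those; the interface names and config fields are hypothetical stand-ins for the project's real extension points.

```go
package main

import "fmt"

// Hypothetical extension-point interfaces; the real project defines its own.
type Scorer interface{ ScoreName() string }
type Filter interface{ FilterName() string }

type SchedulerConfig struct {
	scorers []Scorer
	filters []Filter
}

// WithPlugin inspects which extension-point interfaces the plugin
// implements and registers it everywhere it fits, so callers never
// need per-plugin helpers like AddPrefixPlugin.
func (c *SchedulerConfig) WithPlugin(plugin interface{}) *SchedulerConfig {
	if s, ok := plugin.(Scorer); ok {
		c.scorers = append(c.scorers, s)
	}
	if f, ok := plugin.(Filter); ok {
		c.filters = append(c.filters, f)
	}
	return c
}

// PrefixPlugin implements both interfaces, so one call registers it twice.
type PrefixPlugin struct{}

func (PrefixPlugin) ScoreName() string  { return "prefix-score" }
func (PrefixPlugin) FilterName() string { return "prefix-filter" }

func main() {
	cfg := (&SchedulerConfig{}).WithPlugin(PrefixPlugin{})
	fmt.Println(len(cfg.scorers), len(cfg.filters)) // 1 1
}
```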

@ahg-g
Contributor

ahg-g commented May 9, 2025

> @liu-cong thanks for incorporating the latest changes. I think the current version looks good.
>
> The only comment I have is about the changes to config.go. I think this sets a very confusing precedent. As a new adopter, am I supposed to initialize the plugin vars and then call NewSchedulerWithConfig? Am I supposed to use ConfigOption with empty arrays? Why does one plugin have a dedicated function while others don't? Am I supposed to add a dedicated "Add" function for every plugin that implements more than one interface?
>
> I would feel much more comfortable with this PR if config.go stays clean and you add a helper function in main.go, instead of having a helper function that adds prefix specifics inside the general-purpose scheduler packages.
>
> I'm also happy to implement a follow-up PR that generalizes plugin setup, something like
>
> SchedulerConfig.WithPlugin(plugin)
>
> which internally checks which interfaces the plugin implements and adds it to all the relevant places.

I agree with this, but I think designing a proper config API is out of scope for this PR. I would recommend moving forward with this and having a quick follow-up to define it. We are missing a few things:

  1. A config API to specify plugin configurations and scorer weights (see component config in upstream k8s)
  2. A plugin registry for in-tree and out-of-tree plugin instantiation and registration
  3. Plugin profiles to define groupings of plugins to be executed for a specific scheduling cycle
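Point 2 above, a plugin registry, is a common pattern and might be sketched like this. The names and shape are assumptions for illustration, not this project's API.

```go
package main

import "fmt"

// Plugin is the minimal contract every registered plugin satisfies.
type Plugin interface{ Name() string }

// Factory constructs a plugin instance; out-of-tree plugins register
// their factories the same way in-tree ones do.
type Factory func() Plugin

var registry = map[string]Factory{}

func Register(name string, f Factory) { registry[name] = f }

// New instantiates a plugin by name, failing for unknown names.
func New(name string) (Plugin, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("plugin %q not registered", name)
	}
	return f(), nil
}

type prefixPlugin struct{}

func (prefixPlugin) Name() string { return "prefix-cache" }

func main() {
	Register("prefix-cache", func() Plugin { return prefixPlugin{} })
	p, err := New("prefix-cache")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name()) // prefix-cache
}
```

A config API (point 1) would then reference plugins by these registered names and attach weights, and a profile (point 3) would name a set of them for one scheduling cycle.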

@ahg-g
Contributor

ahg-g commented May 9, 2025

Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service? (I assume the baseline is the v1 scheduler, not a k8s Service.)

@ahg-g
Contributor

ahg-g commented May 9, 2025

> Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service? (I assume the baseline is the v1 scheduler, not a k8s Service.)

This is not a blocker to merge; we can do a follow-up.

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

> I'm also happy to implement a follow up PR that uses reflection to generalize plugin setting

As I mentioned in a previous comment, this would be ideal. I don't want to do it here since this PR is already way too large.

I created a follow-up issue and also added a TODO to replace the AddPrefixPlugin helper method with the new generic method: #813

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

> 1. can we have a run using sharegpt dataset (or any dataset for which we expect low cache hit rate)

Yes, I will add some data with the sglang test tool.

> (I assume the baseline line is for v1 scheduler, not k8s Service)

Yes, the baseline is the v1 scheduler. I can run a sanity check on a k8s Service as well.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2025
@liu-cong
Contributor Author

liu-cong commented May 9, 2025

/retest

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

> Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service? (I assume the baseline is the v1 scheduler, not a k8s Service.)

@ahg-g Please take a look at the updated results.

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2025
@ahg-g
Contributor

ahg-g commented May 10, 2025

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 10ec261 into kubernetes-sigs:main May 10, 2025
8 checks passed
@ahg-g
Contributor

ahg-g commented May 11, 2025

@liu-cong I don't see the comparison with Service, do you have that too?

@liu-cong
Contributor Author

> @liu-cong I don't see the comparison with Service, do you have that too?

The "Run2" graph has a comparison with the Service (svc).

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
7 participants