Add prefix cache aware scheduling #768


Merged: 9 commits into kubernetes-sigs:main on May 10, 2025

Conversation

liu-cong
Contributor

@liu-cong liu-cong commented May 1, 2025

This PR introduces an experimental scheduler v2 (disabled by default, enabled via an env var) with the following major changes:

  • Adds a QueueScorer and a KVCacheScorer to the scheduler, and ranks pods based on a weighted score. Benchmark results show this yields slightly better NTPOT than the v1 implementation, which uses filters.
  • Adds a PrefixCacheScorer based on the proposal (disabled by default).
  • v2 doesn't support LoRA affinity yet.

The eventual goal is to graduate v2 to the default implementation.
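The scorer-based ranking can be sketched roughly as follows. This is a minimal illustration, not the PR's actual types: the pod fields, scorer names, and the linear scoring formula here are assumptions, and the real scorers normalize metrics differently.

```go
package main

import "fmt"

// Pod holds the metrics a scorer might consult (illustrative fields).
type Pod struct {
	Name         string
	QueueLen     int
	KVCacheUsage float64 // fraction of KV cache in use, 0..1
}

// Scorer returns a score in [0, 1]; higher is better.
type Scorer interface {
	Score(p Pod) float64
}

type QueueScorer struct{ MaxQueue int }

func (s QueueScorer) Score(p Pod) float64 {
	if p.QueueLen >= s.MaxQueue {
		return 0
	}
	return 1 - float64(p.QueueLen)/float64(s.MaxQueue)
}

type KVCacheScorer struct{}

func (KVCacheScorer) Score(p Pod) float64 { return 1 - p.KVCacheUsage }

// pick returns the pod with the highest weighted sum of scores.
func pick(pods []Pod, scorers []Scorer, weights []float64) Pod {
	best, bestScore := pods[0], -1.0
	for _, p := range pods {
		total := 0.0
		for i, s := range scorers {
			total += weights[i] * s.Score(p)
		}
		if total > bestScore {
			best, bestScore = p, total
		}
	}
	return best
}

func main() {
	pods := []Pod{
		{Name: "pod-a", QueueLen: 8, KVCacheUsage: 0.9},
		{Name: "pod-b", QueueLen: 2, KVCacheUsage: 0.5},
	}
	scorers := []Scorer{QueueScorer{MaxQueue: 10}, KVCacheScorer{}}
	// Equal weights (w_q=1, w_k=1), matching the recommended default.
	winner := pick(pods, scorers, []float64{1, 1})
	fmt.Println(winner.Name) // pod-b: shorter queue and more free KV cache
}
```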

Benchmark Results

TL;DR:

  • With the prefix plugin disabled, v2 performs similarly to v1, and in some cases shows a noticeable (5-7%) improvement in p90 NTPOT over v1.
  • With the prefix plugin enabled, v2 performs significantly better than v1, thanks to prefix cache aware scheduling.
  • The memory footprint of the prefix plugin is quite small; for the benchmark case (Llama 3 8B, 6 H100) it is 18MB.
  • The prefix plugin adds about 0.3ms to the baseline scheduling latency of 0.2ms.

Benchmark1: High and low prefix cache hit ratio

TL;DR:

  • In the high prefix cache hit ratio test (system prompt length = 2048, user question length = 256):
    • Improved P50 and P95 TTFT by 70% and 34%, respectively
    • Increased input throughput by 35%
    • Improved P95 ITL by 49%
    • Improved the cache hit ratio from 20-60% in the baseline to a very consistent 89%
  • In the low prefix cache hit ratio test (system prompt length = 256, user question length = 2048):
    • Improved P50 and P95 TTFT by 17% and 13%, respectively
    • Increased input throughput by 1.4% (not significant)
    • Improved P95 ITL by 17%
    • Improved the cache hit ratio from 3-5% in the baseline to 4.5-6%

Benchmark Setup

Model server: vLLM 0.8.4 with --enable-prefix-caching, base model meta-llama/Llama-3.1-8B-Instruct, on 3 H100 80GB.

EPP baseline image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20250502-0ae7d1d

Benchmark tool: SGLang benchmark tool, using the 'generated-shared-prefix' dataset.

Benchmark command:

# High cache hit ratio
python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=$MODEL --tokenizer=$MODEL --backend=vllm \
--gsp-num-groups=256 \
--gsp-prompts-per-group=16 \
--gsp-system-prompt-len=2048 \
--gsp-question-len=256 \
--gsp-output-len=256 \
--request-rate=800 --max-concurrency=300

# Low cache hit ratio
python3 sglang/bench_serving.py --host=${IP} --port=${PORT} \
--dataset-name='generated-shared-prefix' --model=$MODEL --tokenizer=$MODEL --backend=vllm \
--gsp-num-groups=256 \
--gsp-prompts-per-group=16 \
--gsp-system-prompt-len=256 \
--gsp-question-len=2048 \
--gsp-output-len=256 \
--request-rate=800 --max-concurrency=300

High Prefix Cache Hit Ratio (system-prompt-len=3000, question-len=128)


| | QPS | Concurrency | Cache hit | P50 TTFT | P95 TTFT | Total Throughput | Input Throughput | P50 ITL | P95 ITL |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 32 | 278 | 20%-60% | 1268 | 3439 | 86342 | 78045 | 39 | 114 |
| prefix111 | 43 | 269 | 89% | 379 | 2255 | 116877 | 105646 | 39 | 58 |
| (baseline-prefix)/baseline (%) | | | | 70.11 | 34.43 | -35.37 | -35.37 | 0.00 | 49.12 |

Low Prefix Cache Hit Ratio (system-prompt-len=128, question-len=3000)


| | QPS | Concurrency | Cache hit | P50 TTFT | P95 TTFT | Total Throughput | Input Throughput | P50 ITL | P95 ITL |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 26 | 293 | 3%-5% | 1556 | 4712 | 70500 | 63722 | 41 | 418 |
| prefix111 | 27 | 293 | 4.5%-6% | 1286 | 4076 | 71508 | 64634 | 40 | 345 |
| (baseline-prefix)/baseline (%) | | | | 17.35 | 13.50 | -1.43 | -1.43 | 2.44 | 17.46 |

Benchmark 2: Regression test

Benchmark Setup

Followed the regression test guide

Result with queue and kv-cache scorers only (prefix plugin is disabled)

  • I experimented with different weights for the QueueScorer and KVCacheScorer, call them w_q and w_k respectively. In one test, w_q=1, w_k=2 gave the best NTPOT; in another, w_q=1, w_k=1 won. We recommend w_q=1, w_k=1 as the default configuration because it is simple.
  • In the diagram below, prefix-disabled-12 means w_q=1, w_k=2.
  • Compared to v1, we see noticeable improvements in NTPOT (up to 7%) and a slight decrease in throughput (the graphs show a zoomed-in view; the actual difference is < 1%). I am not sure why, since a lower NTPOT typically implies a higher throughput.

Run1: [chart]

Run2: [chart]

Result with prefix plugin enabled

  • I experimented with various weights, including the prefix score weight w_p. In the diagrams below, prefix-enabled-123 means w_p=1, w_q=2, w_k=3.
  • All weight combinations gave significant improvements over the baseline.
  • 111 gave the best results for this dataset, so having 111 as the default configuration seems to make sense.

Compare to baseline: [chart]

Comparing different weights: [chart]

Prefix Plugin Overhead

  • Memory: As analyzed in the code, the estimated memory footprint is just 17.4MB, which is quite small compared to the baseline EPP memory footprint (150MB). Looking at EPP pod memory usage, with prefix enabled the pod actually uses less memory than the baseline when serving traffic. When not serving traffic, the prefix-enabled EPP pod uses more memory because the prefix indexes are kept in an LRU cache.
  • Latency: With the prefix plugin, the major overhead comes from the PreSchedule step, with a P95 latency of 0.2ms. The overall scheduler latency stays below 0.5ms (P95). The baseline scheduler latency is about 0.2ms, so the prefix cache plugin adds about 0.3ms of scheduling latency overall.
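The prefix index the plugin maintains can be pictured roughly as follows. This is a toy sketch, not the PR's implementation: the chunk size, hashing scheme, and type names are assumptions, and the real plugin bounds the index with an LRU cache (which this sketch omits). The key idea is hashing cumulative prompt prefixes per chunk, so a hit on chunk N implies the entire prefix up to N matches.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const chunkSize = 64 // characters per chunk in this toy example

type PrefixIndex struct {
	// chunk-prefix hash -> set of pod names believed to hold that prefix in KV cache.
	index map[uint64]map[string]struct{}
}

func NewPrefixIndex() *PrefixIndex {
	return &PrefixIndex{index: make(map[uint64]map[string]struct{})}
}

// chunkHashes returns one hash per chunk, where each hash covers the
// whole prompt prefix up to that chunk (the hasher is cumulative).
func chunkHashes(prompt string) []uint64 {
	var hashes []uint64
	h := fnv.New64a()
	for i := 0; i+chunkSize <= len(prompt); i += chunkSize {
		h.Write([]byte(prompt[i : i+chunkSize]))
		hashes = append(hashes, h.Sum64())
	}
	return hashes
}

// Record remembers that pod served this prompt.
func (p *PrefixIndex) Record(prompt, pod string) {
	for _, hv := range chunkHashes(prompt) {
		if p.index[hv] == nil {
			p.index[hv] = make(map[string]struct{})
		}
		p.index[hv][pod] = struct{}{}
	}
}

// Score returns the fraction of the prompt's leading chunks that pod
// is believed to already have cached; the match ends at the first miss.
func (p *PrefixIndex) Score(prompt, pod string) float64 {
	hashes := chunkHashes(prompt)
	if len(hashes) == 0 {
		return 0
	}
	matched := 0
	for _, hv := range hashes {
		if _, ok := p.index[hv][pod]; !ok {
			break
		}
		matched++
	}
	return float64(matched) / float64(len(hashes))
}

func main() {
	idx := NewPrefixIndex()
	system := ""
	for i := 0; i < 256; i++ {
		system += "a" // stand-in for a shared system prompt
	}
	idx.Record(system+" question one", "pod-a")
	// A second request sharing the system prompt scores high on pod-a only.
	fmt.Printf("%.2f\n", idx.Score(system+" question two", "pod-a"))
	fmt.Printf("%.2f\n", idx.Score(system+" question two", "pod-b"))
}
```

This also makes the memory trade-off above concrete: the index holds only fixed-size hashes and pod-name sets, not the prompts themselves, which is why the footprint stays in the tens of megabytes.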

Memory: [chart]

Latency: [chart]

Prefix hit rate

With this dataset we see a very high prefix cache hit ratio (95%), which is how we get such significant improvements.


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 1, 2025

netlify bot commented May 1, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 47febd5
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/681e6fef2766a50008f8fbae
😎 Deploy Preview https://deploy-preview-768--gateway-api-inference-extension.netlify.app

@nirrozenbaum
Contributor

nirrozenbaum commented May 6, 2025

> > I recommend defining an abstract parameters map (e.g., map[string]interface{}) instead of explicitly adding parameters.
>
> @nirrozenbaum a plugin can have multiple extension points, and in some cases those extension points will need to pass data across one another. As suggested above, we can use an abstract parameters map to pass this data.
>
> See the CycleState concept in kube-scheduler: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/cycle_state.go#L32 and https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/interface.go#L556

@ahg-g completely agree, excellent pointer from kube-scheduler 👍
An abstract parameters map is great (much better than specific fields that are relevant only to specific plugins).
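For readers unfamiliar with kube-scheduler's CycleState, the idea can be sketched like this: a concurrency-safe map, scoped to a single scheduling cycle, that one extension point writes and a later one reads. Names here are illustrative, not this project's API.

```go
package main

import (
	"fmt"
	"sync"
)

// StateData is opaque per-plugin data; plugins cast it back on read.
type StateData interface{}

// CycleState carries data between extension points within one cycle.
type CycleState struct {
	mu   sync.RWMutex
	data map[string]StateData
}

func NewCycleState() *CycleState {
	return &CycleState{data: make(map[string]StateData)}
}

func (c *CycleState) Write(key string, val StateData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

func (c *CycleState) Read(key string) (StateData, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if v, ok := c.data[key]; ok {
		return v, nil
	}
	return nil, fmt.Errorf("key %q not found", key)
}

// Example: a prefix plugin's PreSchedule step stores the chunk hashes it
// computed, and its Score step reads them back instead of recomputing.
func main() {
	state := NewCycleState()
	state.Write("prefix/hashes", []uint64{0xabc, 0xdef})
	v, err := state.Read("prefix/hashes")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(v.([]uint64))) // 2
}
```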

@liu-cong
Contributor Author

liu-cong commented May 6, 2025

Thanks everyone for the feedback so far. I have not been feeling well since yesterday. I hope to get back to this tomorrow with more benchmarking results.

Contributor

@nirrozenbaum nirrozenbaum left a comment


Overall the PR looks OK.
I do have some concerns about the way the scheduler config is defined, which breaks the abstraction and adds plugin specifics to general-purpose files.
I attached a code suggestion for how to fix it.

@oglok oglok mentioned this pull request May 8, 2025
@liu-cong liu-cong marked this pull request as ready for review May 9, 2025 00:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2025
@k8s-ci-robot k8s-ci-robot requested a review from nirrozenbaum May 9, 2025 00:13
Contributor

@nirrozenbaum nirrozenbaum left a comment


@liu-cong thanks for incorporating the latest changes. I think the current version looks good.

The only comment I have is about the changes to config.go. I think this sets a very confusing precedent. As a new adopter, am I supposed to initialize the plugin vars and then call NewSchedulerWithConfig? Am I supposed to use ConfigOption with empty arrays? Why does one plugin have a dedicated function while others don't? Am I supposed to add a dedicated "Add" function for every plugin that implements more than one interface?

I would feel much more comfortable with this PR if config.go stays clean and you add a helper function in main.go, instead of having a helper function that adds prefix specifics inside the general-purpose scheduler packages.

I'm also happy to implement a follow-up PR that generalizes plugin setup, something like

SchedulerConfig.WithPlugin(plugin)

which internally checks which interfaces the plugin implements and adds it to all the relevant places.
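The proposed generic registration could look roughly like the sketch below. In Go, interface conformance is usually checked with type assertions rather than the reflect package, so this sketch uses those; the interface names and config fields are hypothetical stand-ins for the project's real extension points.

```go
package main

import "fmt"

// Hypothetical extension-point interfaces; the real project defines its own.
type Scorer interface{ ScoreName() string }
type Filter interface{ FilterName() string }

type SchedulerConfig struct {
	scorers []Scorer
	filters []Filter
}

// WithPlugin inspects which extension-point interfaces the plugin
// implements and registers it everywhere it fits, so callers never
// need per-plugin helpers like AddPrefixPlugin.
func (c *SchedulerConfig) WithPlugin(plugin interface{}) *SchedulerConfig {
	if s, ok := plugin.(Scorer); ok {
		c.scorers = append(c.scorers, s)
	}
	if f, ok := plugin.(Filter); ok {
		c.filters = append(c.filters, f)
	}
	return c
}

// PrefixPlugin implements both interfaces, so one call registers it twice.
type PrefixPlugin struct{}

func (PrefixPlugin) ScoreName() string  { return "prefix-score" }
func (PrefixPlugin) FilterName() string { return "prefix-filter" }

func main() {
	cfg := (&SchedulerConfig{}).WithPlugin(PrefixPlugin{})
	fmt.Println(len(cfg.scorers), len(cfg.filters)) // 1 1
}
```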

@ahg-g
Contributor

ahg-g commented May 9, 2025

> @liu-cong thanks for incorporating the latest changes. I think the current version looks good.
>
> The only comment I have is about the changes to config.go. I think this sets a very confusing precedent. As a new adopter, am I supposed to initialize the plugin vars and then call NewSchedulerWithConfig? Am I supposed to use ConfigOption with empty arrays? Why does one plugin have a dedicated function while others don't? Am I supposed to add a dedicated "Add" function for every plugin that implements more than one interface?
>
> I would feel much more comfortable with this PR if config.go stays clean and you add a helper function in main.go, instead of having a helper function that adds prefix specifics inside the general-purpose scheduler packages.
>
> I'm also happy to implement a follow-up PR that generalizes plugin setup, something like
>
> SchedulerConfig.WithPlugin(plugin)
>
> which internally checks which interfaces the plugin implements and adds it to all the relevant places.

I agree with this, but I think designing a proper config API is out of scope for this PR. I would recommend moving forward with this and having a quick follow-up to define it. We are missing a few things:

  1. A config API to specify plugin configurations and scorer weights (see component config in upstream k8s)
  2. A plugin registry for in-tree and out-of-tree plugin instantiation and registration
  3. Plugin profiles to define groupings of plugins to be executed for a specific scheduling cycle
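Point 2 above, a plugin registry, is a common pattern and might be sketched like this. The names and shape are assumptions for illustration, not this project's API.

```go
package main

import "fmt"

// Plugin is the minimal contract every registered plugin satisfies.
type Plugin interface{ Name() string }

// Factory constructs a plugin instance; out-of-tree plugins register
// their factories the same way in-tree ones do.
type Factory func() Plugin

var registry = map[string]Factory{}

func Register(name string, f Factory) { registry[name] = f }

// New instantiates a plugin by name, failing for unknown names.
func New(name string) (Plugin, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("plugin %q not registered", name)
	}
	return f(), nil
}

type prefixPlugin struct{}

func (prefixPlugin) Name() string { return "prefix-cache" }

func main() {
	Register("prefix-cache", func() Plugin { return prefixPlugin{} })
	p, err := New("prefix-cache")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name()) // prefix-cache
}
```

A config API (point 1) would then reference plugins by these registered names and attach weights, and a profile (point 3) would name a set of them for one scheduling cycle.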

@ahg-g
Contributor

ahg-g commented May 9, 2025

Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service? (I assume the baseline is the v1 scheduler, not a k8s Service.)

@ahg-g
Contributor

ahg-g commented May 9, 2025

> Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service? (I assume the baseline is the v1 scheduler, not a k8s Service.)

This is not a blocker to merge; we can do a follow-up.

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

> I'm also happy to implement a follow up PR that uses reflection to generalize plugin setting

As I mentioned in a previous comment, this would be ideal. I don't want to do it here since this PR is already way too large.

I created a follow-up issue and also added a TODO to replace the AddPrefixPlugin helper method with the new generic method: #813

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

> 1. can we have a run using sharegpt dataset (or any dataset for which we expect low cache hit rate)

Yes, I will add some data with the sglang test tool.

> (I assume the baseline line is for v1 scheduler, not k8s Service)

Yes, the baseline is the v1 scheduler. I can run a sanity check on a k8s Service as well.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2025
@liu-cong
Contributor Author

liu-cong commented May 9, 2025

/retest

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

> Thanks @liu-cong, this is great. I have a couple of questions: 1) Can we have a run using the sharegpt dataset (or any dataset for which we expect a low cache hit rate) to verify that the prefix-cache-aware scorer doesn't have a negative impact? 2) Just for sanity checking, can we also add a line for a normal Service? (I assume the baseline is the v1 scheduler, not a k8s Service.)

@ahg-g Please take a look at the updated results.

@liu-cong
Contributor Author

liu-cong commented May 9, 2025

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2025
@ahg-g
Contributor

ahg-g commented May 10, 2025

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 10ec261 into kubernetes-sigs:main May 10, 2025
8 checks passed
@ahg-g
Contributor

ahg-g commented May 11, 2025

@liu-cong I don't see the comparison with Service, do you have that too?

@liu-cong
Contributor Author

> @liu-cong I don't see the comparison with Service, do you have that too?

The "Run2" graph has a comparison with the Service (svc).

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
7 participants