refactor: Replace prefix cache structure with golang-lru #928

kfirtoledo · 2025-06-05T13:15:52Z

Replace the prefix cache structure with golang-lru in the prefix indexer.
Add TestPrefixPluginStress, a stress test for the prefix scoring plugin using prompts of increasing length.

linux-foundation-easycla · 2025-06-05T13:15:56Z

The committers listed above are authorized under a signed CLA.

✅ login: kfirtoledo / name: Kfir Toledo (0192528, 5845ffd, 20609d0, e7255c8, ca45167)

netlify · 2025-06-05T13:15:57Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`ca45167`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/684e787bd2930d0008e25aa2
😎 Deploy Preview	https://deploy-preview-928--gateway-api-inference-extension.netlify.app

k8s-ci-robot · 2025-06-05T13:16:01Z

Welcome @kfirtoledo!

It looks like this is your first PR to kubernetes-sigs/gateway-api-inference-extension 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api-inference-extension has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-06-05T13:16:02Z

Hi @kfirtoledo. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cmd/epp/main.go

nirrozenbaum · 2025-06-05T15:29:02Z

Just to give more context on this PR:
This work is done to converge the GIE prefix scorer with the llm-d prefix scorer.
cc: @liu-cong

liu-cong · 2025-06-05T15:42:17Z

/assign

nirrozenbaum · 2025-06-05T19:08:54Z

/ok-to-test

cmd/epp/main.go

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin_test.go

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

pkg/epp/scheduling/framework/plugins/multi/prefix/indexer.go

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

cmd/epp/main.go

Signed-off-by: Kfir Toledo <[email protected]> Co-authored-by: Maroon Ayoub <[email protected]>

Signed-off-by: Kfir Toledo <[email protected]>

kfirtoledo · 2025-06-12T18:29:08Z

FIXES #960

liu-cong

This looks really nice! Just a few nits!

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

pkg/epp/scheduling/framework/plugins/multi/prefix/indexer.go

Signed-off-by: Kfir Toledo <[email protected]>

liu-cong

/lgtm

/hold in case you want to handle a couple nits

liu-cong · 2025-06-12T20:20:16Z

pkg/epp/scheduling/framework/plugins/multi/prefix/indexer.go

-	maxCacheSize int
-	table        map[BlockHash]map[ServerID]*list.Element // from any prefix cache to the cache entry to find the server
-	ll           *list.List                               // LinkedList to keep track of the order of entries
+	mu         sync.RWMutex


nit: another optimization is to use separate mutexes for hashToPods and podToLRU, but I don't think it's very important.

Also, I used the mutex only for the hashToPods operations, except in ReportLRUSize, which we can remove if it hurts performance.
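To make the locking discussion concrete, here is a minimal sketch of a single `sync.RWMutex` guarding a `hashToPods` map. It is not the PR's code: the `indexer` struct, its `add` and `get` methods, and the choice of `uint64` and `string` for `BlockHash` and `ServerID` are simplified assumptions for illustration. Splitting the lock, as the nit suggests, would mean giving `podToLRU` its own mutex next to this one.

```go
package main

import (
	"fmt"
	"sync"
)

// BlockHash and ServerID are simplified stand-ins for the PR's types (assumption).
type BlockHash uint64
type ServerID string

// indexer is a hypothetical, reduced indexer: one RWMutex guards hashToPods.
type indexer struct {
	mu         sync.RWMutex
	hashToPods map[BlockHash]map[ServerID]struct{}
}

func newIndexer() *indexer {
	return &indexer{hashToPods: make(map[BlockHash]map[ServerID]struct{})}
}

// add records that pod holds hash; writers take the exclusive lock.
func (i *indexer) add(hash BlockHash, pod ServerID) {
	i.mu.Lock()
	defer i.mu.Unlock()
	pods, ok := i.hashToPods[hash]
	if !ok {
		pods = make(map[ServerID]struct{})
		i.hashToPods[hash] = pods
	}
	pods[pod] = struct{}{}
}

// get returns a copy of the pod set under the shared (read) lock,
// so callers never iterate over the map that writers mutate.
func (i *indexer) get(hash BlockHash) map[ServerID]struct{} {
	i.mu.RLock()
	defer i.mu.RUnlock()
	out := make(map[ServerID]struct{}, len(i.hashToPods[hash]))
	for pod := range i.hashToPods[hash] {
		out[pod] = struct{}{}
	}
	return out
}

func main() {
	idx := newIndexer()
	var wg sync.WaitGroup
	for n := 0; n < 4; n++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			idx.add(BlockHash(1), ServerID(fmt.Sprintf("pod-%d", n)))
		}(n)
	}
	wg.Wait()
	fmt.Println(len(idx.get(BlockHash(1)))) // prints 4
}
```

Readers share the lock, so concurrent scoring lookups do not block each other; only updates serialize.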

liu-cong · 2025-06-12T20:26:04Z

pkg/epp/scheduling/framework/plugins/multi/prefix/indexer.go

+	// Check if the LRU pod exist
+	lruForPod, exists := i.podToLRU[pod]
+	if !exists {
+		newLRU, err := lru.NewWithEvict[BlockHash, struct{}](i.maxLRUSize, i.makeEvictionFn(pod))


I read the lru code, and the only reason this could fail is an LRU size <= 0, which IMO we can simply handle when initializing the indexer by setting maxLRUSize = max(maxLRUSize, 1). We can then safely add a comment and ignore this error.

The way you handle the error is OK, but it makes the code harder to read. One might think: what if there is an error, do I end up with an inaccurate score?
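The suggestion above can be sketched with the standard library alone. `tinyLRU`, `newTinyLRU`, and `clampCapacity` are hypothetical names, and `tinyLRU` is a small stand-in for the real cache, not the golang-lru API. It only mirrors the behaviour described in the comment: construction fails for a non-positive size, and an eviction callback fires when the oldest entry is dropped.

```go
package main

import (
	"container/list"
	"errors"
	"fmt"
)

// BlockHash is a simplified stand-in for the PR's type (assumption).
type BlockHash uint64

// tinyLRU is a hypothetical stdlib-only LRU used for illustration.
type tinyLRU struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[BlockHash]*list.Element
	onEvict  func(BlockHash)
}

// newTinyLRU fails only when capacity is not positive, like the case discussed above.
func newTinyLRU(capacity int, onEvict func(BlockHash)) (*tinyLRU, error) {
	if capacity <= 0 {
		return nil, errors.New("must provide a positive size")
	}
	return &tinyLRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[BlockHash]*list.Element),
		onEvict:  onEvict,
	}, nil
}

// add inserts or refreshes h and evicts the least recently used entry when full.
func (c *tinyLRU) add(h BlockHash) {
	if el, ok := c.items[h]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.items[h] = c.order.PushFront(h)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		evicted := oldest.Value.(BlockHash)
		delete(c.items, evicted)
		if c.onEvict != nil {
			c.onEvict(evicted)
		}
	}
}

// clampCapacity is equivalent to max(n, 1): it rules out the only error case up front.
func clampCapacity(n int) int {
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	var evicted []BlockHash
	cache, err := newTinyLRU(clampCapacity(0), func(h BlockHash) {
		evicted = append(evicted, h)
	})
	fmt.Println(err == nil) // prints true: the clamped size is always valid
	cache.add(1)
	cache.add(2)
	fmt.Println(evicted) // prints [1]
}
```

Clamping once at construction keeps the per-request path free of an error branch that can never be taken, which is the readability concern raised in the review.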

liu-cong · 2025-06-12T20:26:59Z

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin_test.go

@@ -136,3 +141,56 @@ func TestPrefixPlugin(t *testing.T) {

 	plugin.PostCycle(context.Background(), cycleState5, &types.ProfileRunResult{TargetPod: pod1})
 }
+
+// BenchmarkPrefixPluginStress is a stress benchmark for the prefix scoring plugin, using prompts of increasing length.
+func BenchmarkPrefixPluginStress(b *testing.B) {


Do you mind sharing some results of running this test?

Because it is now a benchmark (not a unit test that runs every time), I made it longer and also added a check for prompts of length 1 to 1024.

With the new implementation, pprof reports:
CPU: Duration: 3.12s
Mem: 716.88MB total

With the old implementation:
CPU: Duration: 29.02s
Mem: 9097.67MB total

liu-cong · 2025-06-12T20:32:47Z

/hold

I'd like to run some benchmarks with this change.

nirrozenbaum · 2025-06-14T17:18:56Z

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

@@ -174,29 +165,33 @@ func (m *Plugin) PostCycle(ctx context.Context, cycleState *types.CycleState, re
 		log.FromContext(ctx).Error(err, "failed to read prefix plugin cycle state")
 		return
 	}
-	m.indexer.Add(state.PrefixHashes, ServerID(targetPod.NamespacedName))
+	err = m.indexer.Add(state.PrefixHashes, ServerID(targetPod.NamespacedName))


Have you considered replacing PostCycle with PostResponse? Or is it planned for a follow-up PR?

Not in this PR, but we can do it in a follow-up PR (cc @liu-cong)

Both the PreRequest and PostResponse extension points will be needed: #971

Yeah that should be a follow up

Signed-off-by: Kfir Toledo <[email protected]>

liu-cong · 2025-06-15T17:41:11Z

/lgtm

/hold cancel

I ran benchmarks comparing this change with the existing prefix plugin, using the same setup as #768. I ran the low and high prefix cache ratio tests using the sglang benchmark tool.

TLDR:

* The e2e metrics are very similar, indicating no regression.
* This PR improved the scheduler e2e latency from ~0.8 ms to less than 0.6 ms, indicating a perf improvement in the prefix plugin.

Great work, @kfirtoledo!

kfswain · 2025-06-17T15:44:26Z

/approve

Thanks all for the effort here!

> TLDR:
>
> The e2e metrics are very similar, indicating no regression.
>
> This PR improved the scheduler e2e latency from ~0.8 ms to less than 0.6ms, indicating perf improvement on the prefix plugin.

This is great!

k8s-ci-robot · 2025-06-17T15:44:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfirtoledo, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [kfswain]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…sigs#928)

* refactor: Replace prefix cache structure with golang-lru Signed-off-by: Kfir Toledo <[email protected]> Co-authored-by: Maroon Ayoub <[email protected]>
* fix: rename prefix scorer parameters and convert test to benchmark test Signed-off-by: Kfir Toledo <[email protected]>
* feat: Add per server LRU capacity Signed-off-by: Kfir Toledo <[email protected]>
* fix: Fix typos and error handle Signed-off-by: Kfir Toledo <[email protected]>
* fix: add safety check for LRUCapacityPerServer Signed-off-by: Kfir Toledo <[email protected]>

---------

Signed-off-by: Kfir Toledo <[email protected]>
Co-authored-by: Maroon Ayoub <[email protected]>

…e it easier to add plugins (#881)

* configuration implementation (after rebase...) Signed-off-by: Shmuel Kallner <[email protected]>
* Moved plugin registry back to pkg/epp/plugins Signed-off-by: Shmuel Kallner <[email protected]>
* Removed unneeded 'forced imports' of scorers Signed-off-by: Shmuel Kallner <[email protected]>
* Changed 'profilepicker' to 'profilehandler' in new and old code Signed-off-by: Shmuel Kallner <[email protected]>
* Pass the configured SchedulingProfiles to LoadSchedulerConfig Signed-off-by: Shmuel Kallner <[email protected]>
* Ensure that both the configText and configFile flags are not specified Signed-off-by: Shmuel Kallner <[email protected]>
* Load RequestControl plugins from the configuration Signed-off-by: Shmuel Kallner <[email protected]>
* Register all plugin factories Signed-off-by: Shmuel Kallner <[email protected]>
* Review fixes Signed-off-by: Shmuel Kallner <[email protected]>
* Reverted unneeded change Signed-off-by: Shmuel Kallner <[email protected]>
* Updates from review comments Signed-off-by: Shmuel Kallner <[email protected]>
* Added a stub interface for plugins to get data from the EPP Signed-off-by: Shmuel Kallner <[email protected]>
* Added a temporary implementation of plugins.Handle Signed-off-by: Shmuel Kallner <[email protected]>
* Added pluginName and plugins.Handle to plugin factory interface Signed-off-by: Shmuel Kallner <[email protected]>
* Updated plugin factory signatures to reflect new API Signed-off-by: Shmuel Kallner <[email protected]>
* Updated plugin instantiation to reflect new API Signed-off-by: Shmuel Kallner <[email protected]>
* Updated plugin instantiation to reflect new API Signed-off-by: Shmuel Kallner <[email protected]>
* Updated tests to reflect new API Signed-off-by: Shmuel Kallner <[email protected]>
* Do not rename the imported package Signed-off-by: Shmuel Kallner <[email protected]>
* Only upper layer of code should log errors Signed-off-by: Shmuel Kallner <[email protected]>
* Only pass what is needed to instantiate the plugins Signed-off-by: Shmuel Kallner <[email protected]>
* Review updates Signed-off-by: Shmuel Kallner <[email protected]>
* Review update Signed-off-by: Shmuel Kallner <[email protected]>
* Review update. Make more clear that the code only checks for already defined names Signed-off-by: Shmuel Kallner <[email protected]>
* fixed e2e doc in makefile (does not require GPUs) (#976) Signed-off-by: Nir Rozenbaum <[email protected]>
* API: Adds 5xx Status Code for Invalid ExtRef (#991) Signed-off-by: Daneyon Hansen <[email protected]>
* feat(conformance): Add test for invalid EPP service reference (#959)
* fix boilerplate header
* add tests for InferencePoolInvalidEPPService
* change to expect error on httproute refcond
* moved the creation of the context to main.go. (#995) this is useful when writing a different main like llm-d, allowing to propogate the same context to the whole system. Signed-off-by: Nir Rozenbaum <[email protected]>
* fix dead links (#989)
* feat: add health check for epp cluster (#966)
* feat: add health check for epp cluster Signed-off-by: zhengkezhou1 <[email protected]>
* remove tls Signed-off-by: zhengkezhou1 <[email protected]>
* don't use tls Signed-off-by: zhengkezhou1 <[email protected]>
* health checking flag Signed-off-by: zhengkezhou1 <[email protected]>
* fix import Signed-off-by: zhengkezhou1 <[email protected]>
* add tls options Signed-off-by: zhengkezhou1 <[email protected]>

---------

Signed-off-by: zhengkezhou1 <[email protected]>

* Server unit test and utility to help with such tests (#820) Signed-off-by: Ira <[email protected]>
* Update dynamic-lora-sidecar to expose metrics to track loaded adapters (#980)
* Add a metrics to track loaded adapters
* Update the sample manifests
* Add explanation of metrics from dyanmic LoRA adapter sidecar
* Add explanation of metrics from dyanmic LoRA adapter sidecar (take 2)
* Update metrics.md based on feedback
* refactor: Replace prefix cache structure with golang-lru (#928)
* refactor: Replace prefix cache structure with golang-lru Signed-off-by: Kfir Toledo <[email protected]> Co-authored-by: Maroon Ayoub <[email protected]>
* fix: rename prefix scorer parameters and convert test to benchmark test Signed-off-by: Kfir Toledo <[email protected]>
* feat: Add per server LRU capacity Signed-off-by: Kfir Toledo <[email protected]>
* fix: Fix typos and error handle Signed-off-by: Kfir Toledo <[email protected]>
* fix: add safety check for LRUCapacityPerServer Signed-off-by: Kfir Toledo <[email protected]>

---------

Signed-off-by: Kfir Toledo <[email protected]>
Co-authored-by: Maroon Ayoub <[email protected]>

* feat(conformance): Add HTTPRouteMultipleRulesDifferentPools test (#834)
* copy of accepted inference pool test to start from.
* add yaml file for the test
* update time out
* update the yaml file to add port 9002
* read timeout config from local repo
* remove excess comments
* correct spelling for scenarios
* check route condition on RouteConditionResolvedRefs
* remove empty lines in yaml
* set optional/defaulted fields as unspecified
* fix timeout
* fix boilerplate header
* change varialbe names to use primary secondary consistently.
* remove extra comments
* factor out common code
* Add actual http traffic validation using echo-basic
* remove extra comments from manifest
* remove modifiedTimeoutConfig.HTTPRouteMustHaveCondition per review comment.
* intermediate update
* fix the test run
* factor out common code
* move epp def to shared manifest
* remove extra comments
* revert back to two epps
* add to do for epp image
* switch to GeneralMustHaveConditionTimeout
* undo gateway version changes
* remove unused HTTPRouteMustHaveConditions
* update doc string for GetPod
* update docstring
* Remove resource type from names in manifests.
* remove type from name
* remove health check
* add todo for combining getpod methods
* configuration implementation (after rebase...) Signed-off-by: Shmuel Kallner <[email protected]>
* After review, made code more obvious Signed-off-by: Shmuel Kallner <[email protected]>
* Fixed merge issues Signed-off-by: Shmuel Kallner <[email protected]>

---------

Signed-off-by: Shmuel Kallner <[email protected]>
Signed-off-by: Nir Rozenbaum <[email protected]>
Signed-off-by: Daneyon Hansen <[email protected]>
Signed-off-by: zhengkezhou1 <[email protected]>
Signed-off-by: Ira <[email protected]>
Signed-off-by: Kfir Toledo <[email protected]>
Co-authored-by: Nir Rozenbaum <[email protected]>
Co-authored-by: Daneyon Hansen <[email protected]>
Co-authored-by: sina chavoshi <[email protected]>
Co-authored-by: Xudong Wang <[email protected]>
Co-authored-by: Zhengke Zhou <[email protected]>
Co-authored-by: Ira Rosen <[email protected]>
Co-authored-by: Shotaro Kohama <[email protected]>
Co-authored-by: Kfir Toledo <[email protected]>
Co-authored-by: Maroon Ayoub <[email protected]>

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 5, 2025

k8s-ci-robot requested review from ahg-g and kfswain June 5, 2025 13:15

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 5, 2025

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 5, 2025

kfirtoledo force-pushed the prefix branch 2 times, most recently from 7dc0b4a to f3d636b Compare June 5, 2025 13:50

vMaroon reviewed Jun 5, 2025

cmd/epp/main.go (outdated, resolved)

k8s-ci-robot assigned liu-cong Jun 5, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 5, 2025

liu-cong reviewed Jun 5, 2025

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2025

liu-cong reviewed Jun 9, 2025

cmd/epp/main.go (outdated, resolved)

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2025

kfirtoledo force-pushed the prefix branch from 3a6a66a to 7eca1dc Compare June 11, 2025 07:31

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2025

kfirtoledo force-pushed the prefix branch from 7eca1dc to 616c670 Compare June 11, 2025 07:32

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2025

kfirtoledo and others added 3 commits June 12, 2025 14:11

refactor: Replace prefix cache structure with golang-lru

5845ffd

Signed-off-by: Kfir Toledo <[email protected]> Co-authored-by: Maroon Ayoub <[email protected]>

fix: rename prefix scorer parameters and convert test to benchmark test

20609d0

Signed-off-by: Kfir Toledo <[email protected]>

feat: Add per server LRU capacity

0192528

Signed-off-by: Kfir Toledo <[email protected]>

kfirtoledo force-pushed the prefix branch from d8d6659 to 0192528 Compare June 12, 2025 11:16

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2025

liu-cong reviewed Jun 12, 2025

fix: Fix typos and error handle

e7255c8

Signed-off-by: Kfir Toledo <[email protected]>

kfirtoledo force-pushed the prefix branch from 74b1f53 to e7255c8 Compare June 12, 2025 20:05

liu-cong reviewed Jun 12, 2025

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 12, 2025

nirrozenbaum reviewed Jun 14, 2025

fix: add safety check for LRUCapacityPerServer

ca45167

Signed-off-by: Kfir Toledo <[email protected]>

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2025

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 15, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2025

k8s-ci-robot merged commit 191e710 into kubernetes-sigs:main Jun 17, 2025
9 checks passed

liu-cong mentioned this pull request Jun 17, 2025

[Prefix Plugin Enhancement] Add per server LRU capacity #960

Closed

kfirtoledo mentioned this pull request Jun 18, 2025

Align/merge the the two prefix-cache aware routing plugins llm-d/llm-d-inference-scheduler#132

Closed
