[Core] Prefix cache: frequency- and cost-aware eviction (opt-in) #27539
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: add the ready label to the PR, or enable auto-merge.
If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a new frequency- and cost-aware eviction policy for the prefix cache, which is a valuable addition for optimizing performance under mixed workloads. The implementation is well-structured, introducing a pluggable EvictionPolicy interface and integrating it cleanly with the existing BlockPool and configuration system. The new policy is opt-in, ensuring backward compatibility. The added unit tests and provided benchmarks demonstrate the effectiveness of the new policy. I've found one high-severity issue in the get_new_blocks logic where some cached blocks might be dropped from the new policy's tracking, falling back to LRU eviction. This could reduce the effectiveness of the new feature.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Resolved conflicts in vllm/config/cache.py by keeping both the new KV offloading fields from upstream and the eviction policy fields from the feature branch. Signed-off-by: Amin Sedaghat <[email protected]>
Prefix Caching: Frequency- and Cost-Aware Eviction (opt-in)
Closes: #23641
Summary
- Adds a pluggable `EvictionPolicy` interface; integrates with the existing `FreeKVCacheBlockQueue` and `BlockPool` without hot-path regressions.
- Cached blocks track `first_access_ts` and `access_count`; the retention score is computed inside the policy to avoid per-access overhead.
Motivation
The current prefix cache uses LRU. Under mixed workloads and constrained KV cache, frequently reused and/or expensive prefixes can be evicted prematurely. An opt‑in, frequency‑ and cost‑aware strategy can improve reuse and latency without impacting users who prefer LRU.
Design and Approach
- Pluggable abstraction: a base interface (`vllm.v1.core.eviction_policies.base.EvictionPolicy`) and a concrete implementation, `FrequencyCostEvictionPolicy` (a sketch follows this list).
- Configured via `CacheConfig` and passed through `EngineArgs` → `VllmConfig` → `KVCacheManager` → `BlockPool`.
- Per-block metadata: `first_access_ts` (monotonic timestamp) and `access_count` (int).
- The policy maintains a min-heap plus an `entry_finder` dict for O(1) invalidation; lazy deletion keeps inserts/updates at O(log N) and removals cheap.
- Retention score: `score = (block_size ** alpha) / (access_count * (1 + decay_factor * age_sec))`, tunable via `--eviction-cost-alpha` and `--eviction-time-decay`.
- `FreeKVCacheBlockQueue` remains the source of truth for free blocks; `get_new_blocks` prioritizes non-cached-free blocks first (O(1)).
- Default behavior is unchanged: `--prefix-cache-eviction-policy lru` (default).
- Opt in with `--prefix-cache-eviction-policy frequency_cost`, together with `--eviction-cost-alpha` and `--eviction-time-decay`.
- The new flags live in `vllm/vllm/config/cache.py` and are auto-exposed by `EngineArgs.add_cli_args`.
- `vllm/vllm/config/cache.py`: adds `prefix_cache_eviction_policy`, `eviction_cost_alpha`, `eviction_time_decay`.
- `vllm/vllm/engine/arg_utils.py`: wires the flags through `EngineArgs` into `CacheConfig`.
- `vllm/vllm/v1/core/eviction_policies/`: new package with `frequency_cost.py` and `__init__.py`.
- `vllm/vllm/v1/core/block_pool.py`: consults the policy only when cached-free blocks are needed; notifies the policy on access/release; preserves LRU queue semantics.
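To make the bookkeeping concrete, here is a minimal, hypothetical sketch of the heap-plus-`entry_finder` pattern with lazy deletion. The class, field, and method names are invented for illustration (the actual implementation lives in `frequency_cost.py`); the score follows the formula above, and the choice to offer the smallest-keyed entry for eviction first is an assumption of this sketch, not necessarily the PR's convention.

```python
import heapq
import time
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    """Illustrative per-block metadata; the PR tracks similar fields on cached blocks."""
    block_id: int
    block_size: int
    first_access_ts: float = field(default_factory=time.monotonic)
    access_count: int = 0
    version: int = 0  # bumped on every update so stale heap entries can be detected


class FrequencyCostPolicySketch:
    """Min-heap keyed by the retention score, with an entry_finder dict and lazy deletion."""

    def __init__(self, alpha: float = 2.0, time_decay: float = 0.0) -> None:
        self.alpha = alpha
        self.time_decay = time_decay
        self._heap: list[tuple[float, int, int]] = []  # (score, version, block_id); may hold stale entries
        self._entry_finder: dict[int, BlockMeta] = {}  # block_id -> live metadata, O(1) lookup/invalidation

    def _score(self, meta: BlockMeta) -> float:
        # score = (block_size ** alpha) / (access_count * (1 + decay_factor * age_sec))
        age_sec = time.monotonic() - meta.first_access_ts
        return (meta.block_size ** self.alpha) / (
            max(meta.access_count, 1) * (1.0 + self.time_decay * age_sec)
        )

    def on_access(self, block_id: int, block_size: int) -> None:
        # O(log N): refresh metadata and push a new heap entry; older entries become stale.
        meta = self._entry_finder.setdefault(block_id, BlockMeta(block_id, block_size))
        meta.access_count += 1
        meta.version += 1
        heapq.heappush(self._heap, (self._score(meta), meta.version, block_id))

    def invalidate(self, block_id: int) -> None:
        # O(1) lazy deletion: forget the block; its heap entries are skipped when popped.
        self._entry_finder.pop(block_id, None)

    def pop_eviction_candidate(self) -> int | None:
        # Pop until a live, up-to-date entry is found. This sketch offers the
        # smallest-keyed block first; the PR's actual ordering may differ.
        while self._heap:
            _, version, block_id = heapq.heappop(self._heap)
            meta = self._entry_finder.get(block_id)
            if meta is not None and meta.version == version:
                del self._entry_finder[block_id]
                return block_id
        return None
```

Freezing the score at push time is a simplification of this sketch; an age-dependent score could instead be recomputed when a candidate is popped.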
Complexity & Hot-Path
- The default (LRU) hot path is unchanged; free-block allocation still goes through the existing `FreeKVCacheBlockQueue`, and the new policy sits off that fast path (sketched below).
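As a rough sketch of that fast path (a hypothetical helper, not the PR's actual `BlockPool.get_new_blocks`): non-cached-free blocks are handed out in O(1) without touching the policy, and the policy, here the `FrequencyCostPolicySketch` from the previous block, is consulted only when cached-free blocks must be reclaimed, with an LRU-order fallback when it has no live candidate.

```python
from collections import OrderedDict, deque


def get_new_blocks_sketch(
    num_blocks: int,
    non_cached_free: deque,                     # free blocks holding no cached content
    cached_free: "OrderedDict[int, object]",    # block_id -> free-but-cached block, in LRU order
    policy: FrequencyCostPolicySketch,
) -> list:
    """Hypothetical allocation helper mirroring the behavior described above."""
    allocated: list = []
    # Fast path: reuse blocks that cache nothing; O(1) per block, policy not involved.
    while non_cached_free and len(allocated) < num_blocks:
        allocated.append(non_cached_free.popleft())
    # Slow path: reclaim cached-free blocks, asking the policy which to evict first.
    while cached_free and len(allocated) < num_blocks:
        victim_id = policy.pop_eviction_candidate()
        if victim_id is None or victim_id not in cached_free:
            victim_id = next(iter(cached_free))  # fall back to LRU order
        allocated.append(cached_free.pop(victim_id))
    return allocated
```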
Tests
- `pytest -q -m cpu_test vllm/tests/v1/core/test_eviction_policies.py vllm/tests/v1/core/test_prefix_caching.py`
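Purely for illustration (this is not the PR's actual test code), here is the kind of property such tests can exercise, written against the sketch classes above: a lazily invalidated block is never offered for eviction.

```python
def test_invalidated_block_is_never_offered() -> None:
    # Uses FrequencyCostPolicySketch from the earlier sketch; purely illustrative.
    policy = FrequencyCostPolicySketch(alpha=2.0, time_decay=0.0)
    policy.on_access(block_id=1, block_size=16)
    policy.on_access(block_id=2, block_size=16)
    policy.invalidate(block_id=1)                     # lazy deletion: its heap entry goes stale
    assert policy.pop_eviction_candidate() == 2       # stale entry for block 1 is skipped
    assert policy.pop_eviction_candidate() is None    # nothing live remains
```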
Benchmarks (RTX A6000)
- Uses `facebook/opt-125m` for quick, reproducible runs.
Medium regime (moderate prompts)
Eviction‑pressure regime (force evictions via small KV cache)
Notes
- `--kv-cache-memory-bytes <bytes>` skips profiling; this is the setting used in the runs above.
- Benchmarks reuse the existing `vllm/benchmarks/benchmark_prefix_caching.py`; no new script is introduced.
Reproducibility
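A minimal offline-API sketch of the eviction-pressure setup, assuming the new `CacheConfig` fields are exposed as engine keyword arguments and that `kv_cache_memory_bytes` mirrors the `--kv-cache-memory-bytes` flag noted above; the byte value and prompt mix are illustrative only.

```python
from vllm import LLM, SamplingParams

# Hypothetical reproduction outline: a deliberately small KV cache to force
# evictions, plus the opt-in frequency/cost policy. Keyword spellings are
# assumed from the CacheConfig fields added by this PR.
llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,
    kv_cache_memory_bytes=256 * 1024 * 1024,   # assumed engine-arg form of --kv-cache-memory-bytes
    prefix_cache_eviction_policy="frequency_cost",
    eviction_cost_alpha=2.0,
    eviction_time_decay=0.0,
)

# Prompts sharing a long prefix so cached blocks are reused across requests.
shared_prefix = "You are a helpful assistant. " * 50
prompts = [f"{shared_prefix}Question {i}: summarize item {i}." for i in range(32)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=16))
```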
Docs
- The new flags are auto-documented via `EngineArgs`; if preferred, I can add a short entry to the Engine Arguments docs to surface them explicitly.
Rollout & Backward Compatibility
Appendix: CLI Flags
- `--prefix-cache-eviction-policy {lru,frequency_cost}` (default: `lru`)
- `--eviction-cost-alpha <float>` (default: 2.0)
- `--eviction-time-decay <float>` (default: 0.0)

cc @vllm-project/maintainers