
@Aminsed Aminsed commented Oct 27, 2025

Prefix Caching: Frequency- and Cost-Aware Eviction (opt-in)

Closes: #23641

Summary

  • Adds an optional prefix-cache eviction policy that considers frequency and estimated cost, while keeping LRU as the default.
  • Pluggable EvictionPolicy interface; integrates with existing FreeKVCacheBlockQueue and BlockPool without hot‑path regressions.
  • Minimal block metadata: only first_access_ts and access_count; retention score computed in the policy to avoid per-access overhead.

Motivation

The current prefix cache uses LRU. Under mixed workloads and constrained KV cache, frequently reused and/or expensive prefixes can be evicted prematurely. An opt‑in, frequency‑ and cost‑aware strategy can improve reuse and latency without impacting users who prefer LRU.

Design and Approach

  1. Policy interface and wiring
  • Introduces a policy interface (vllm.v1.core.eviction_policies.base.EvictionPolicy) and a concrete implementation FrequencyCostEvictionPolicy.
  • Policy is configured from CacheConfig and passed through EngineArgs → VllmConfig → KVCacheManager → BlockPool.
  2. Minimal metadata, lazy data structure
  • Block metadata limited to:
    • first_access_ts (monotonic timestamp)
    • access_count (int)
  • Policy maintains a min‑heap of cached‑free blocks keyed by a retention score and an entry_finder dict for O(1) invalidation. Lazy deletion keeps inserts/updates at O(log N) and removals cheap; a minimal sketch of this bookkeeping follows after this list.
  3. Retention score (lower is more evictable)
  • Score combines frequency and cost. Block size is used as the cost proxy, with a tunable exponent alpha; an optional time decay de‑emphasizes stale history:
    • score = (block_size ** alpha) / (access_count * (1 + decay_factor * age_sec))
    • Tunables exposed as CLI flags: --eviction-cost-alpha, --eviction-time-decay.
  4. Integration without hot‑path regressions
  • FreeKVCacheBlockQueue remains the source of truth for free blocks; get_new_blocks prioritizes non‑cached‑free blocks first (O(1)).
  • Only when more blocks are needed do we consult the policy for cached‑free candidates; removals use the queue’s O(1) unlink.
  • Reads do not recalculate scores; scores are recomputed only when a block enters or leaves the cached‑free pool.
  5. Backward compatibility and flags
  • LRU remains default: --prefix-cache-eviction-policy lru (default).
  • Opt‑in: --prefix-cache-eviction-policy frequency_cost with --eviction-cost-alpha and --eviction-time-decay.
  • Flags live in vllm/vllm/config/cache.py and are auto‑exposed by EngineArgs.add_cli_args.
  6. Implementation touch points (high‑level)
  • vllm/vllm/config/cache.py: adds prefix_cache_eviction_policy, eviction_cost_alpha, eviction_time_decay.
  • vllm/vllm/engine/arg_utils.py: wires flags through EngineArgs into CacheConfig.
  • vllm/vllm/v1/core/eviction_policies/: new package with base.py (the EvictionPolicy interface), frequency_cost.py, and __init__.py.
  • vllm/vllm/v1/core/block_pool.py: consults policy only when cached‑free blocks are needed; notifies policy on access/release; preserves LRU queue semantics.
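
The following is a minimal, self‑contained sketch of the retention score and the lazy‑deletion bookkeeping from items 2–3. It is illustrative only: `BlockMeta`, `FrequencyCostHeap`, and the `max(access_count, 1)` guard are assumptions for the sketch, not the exact classes in this PR.

```python
# Illustrative sketch only -- class and helper names (BlockMeta,
# FrequencyCostHeap) are assumptions for this example, not the exact
# classes introduced by this PR.
import heapq
import time
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    """Minimal per-block metadata tracked by the policy."""
    block_size: int
    first_access_ts: float = field(default_factory=time.monotonic)
    access_count: int = 0


def retention_score(meta: BlockMeta, alpha: float = 2.0, decay: float = 0.0) -> float:
    """Lower score == more evictable (matches the formula in item 3)."""
    age_sec = time.monotonic() - meta.first_access_ts
    freq = max(meta.access_count, 1)  # guard against division by zero (assumption)
    return (meta.block_size ** alpha) / (freq * (1.0 + decay * age_sec))


class FrequencyCostHeap:
    """Min-heap over cached-free blocks with lazy deletion via entry_finder."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, int]] = []        # (score, seq, block_id)
        self._entry_finder: dict[int, tuple[float, int, int]] = {}
        self._seq = 0  # tie-breaker so equal scores never compare block ids

    def push(self, block_id: int, score: float) -> None:
        # O(log N). A re-pushed block simply gets a fresh entry; the stale
        # copy left in the heap is skipped later (lazy deletion).
        entry = (score, self._seq, block_id)
        self._seq += 1
        self._entry_finder[block_id] = entry
        heapq.heappush(self._heap, entry)

    def invalidate(self, block_id: int) -> None:
        # O(1): drop the live entry; any copies left in the heap become stale.
        self._entry_finder.pop(block_id, None)

    def pop_most_evictable(self) -> int | None:
        # Pop until the heap top is still the live entry for its block.
        while self._heap:
            entry = heapq.heappop(self._heap)
            block_id = entry[2]
            if self._entry_finder.get(block_id) is entry:
                del self._entry_finder[block_id]
                return block_id
        return None
```

In this scheme, pop_most_evictable() returns the cached‑free block with the lowest retention score, so small, rarely reused, or stale blocks are reclaimed first.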

Complexity & Hot‑Path

  • Access updates are O(1) (bump a counter and a timestamp); heap operations happen only when blocks enter or leave the cached‑free pool: O(log N), where N is the number of cached‑free blocks.
  • Common allocations remain O(1) from FreeKVCacheBlockQueue; the allocation path is sketched below.
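
As a simplified illustration of that allocation path (item 4 above), the sketch below reuses FrequencyCostHeap from the previous example. The deque and dict are stand‑ins for the real FreeKVCacheBlockQueue and BlockPool bookkeeping, whose APIs differ; the fast‑path/slow‑path split is the point being shown.

```python
# Illustrative sketch of the allocation path; reuses FrequencyCostHeap from
# the previous sketch. The deque/dict below are stand-ins for the real
# FreeKVCacheBlockQueue and BlockPool bookkeeping, whose APIs differ.
from collections import deque
from typing import Any


def get_new_blocks(
    num_blocks: int,
    non_cached_free: deque,           # free blocks holding no reusable prefix
    cached_free: dict[int, Any],      # block_id -> block still holding a cached prefix
    policy: "FrequencyCostHeap",      # min-heap keyed by retention score
) -> list:
    allocated: list = []

    # Fast path (unchanged from LRU): hand out non-cached-free blocks in O(1)
    # each, without consulting the policy or computing any scores.
    while len(allocated) < num_blocks and non_cached_free:
        allocated.append(non_cached_free.popleft())

    # Slow path (only under cache pressure): evict the cached-free block with
    # the lowest retention score and remove it from the free bookkeeping.
    while len(allocated) < num_blocks:
        victim_id = policy.pop_most_evictable()
        if victim_id is None:
            raise RuntimeError("no free KV cache blocks available")
        allocated.append(cached_free.pop(victim_id))  # O(1) removal

    return allocated
```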

Tests

  • CPU‑only unit tests (eviction policy + prefix caching) passed locally:
    • pytest -q -m cpu_test vllm/tests/v1/core/test_eviction_policies.py vllm/tests/v1/core/test_prefix_caching.py
    • Result: 28 passed (Py3.12)

Benchmarks (RTX A6000)

  • Env: Torch 2.9.0+cu128; vLLM v0.11.1rc4.dev; CUDA bf16/fp16, V1 engine; single GPU.
  • Model: facebook/opt-125m for quick, reproducible runs.

Medium regime (moderate prompts)

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 256:512 \
  --prefix-len 128 \
  --seed 123 \
  --prefix-cache-eviction-policy lru

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 256:512 \
  --prefix-len 128 \
  --seed 123 \
  --prefix-cache-eviction-policy frequency_cost \
  --eviction-cost-alpha 2.0 \
  --eviction-time-decay 0.0

  • LRU: 0.381 s
  • frequency_cost: 0.383 s
  • Δ: (0.383 − 0.381) / 0.381 ≈ 0.5% slower than LRU

Eviction‑pressure regime (force evictions via small KV cache)

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 512:1024 \
  --prefix-len 256 \
  --seed 123 \
  --kv-cache-memory-bytes 2147483648 \
  --prefix-cache-eviction-policy lru

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 512:1024 \
  --prefix-len 256 \
  --seed 123 \
  --kv-cache-memory-bytes 2147483648 \
  --prefix-cache-eviction-policy frequency_cost \
  --eviction-cost-alpha 2.0 \
  --eviction-time-decay 0.0

  • LRU: 0.695 s
  • frequency_cost: 0.689 s
  • Δ: (0.695 − 0.689) / 0.695 ≈ 0.9% faster than LRU

Notes

  • With a very large KV cache (few evictions), the policy's bookkeeping overhead can dominate, as in the medium regime above (≈0.5% slower); with a constrained KV cache (typical under load), frequency_cost comes out ahead (≈0.9% faster in the eviction‑pressure regime). LRU remains the default.
  • If you hit a GPU memory profiling assertion during init (because other processes change GPU memory usage), pass --kv-cache-memory-bytes <bytes> to skip profiling, as in the eviction‑pressure commands above.
  • Benchmarking used the existing script vllm/benchmarks/benchmark_prefix_caching.py; no new script is introduced.

Reproducibility

  • Python 3.12; single GPU.
  • Minimal install for local source checkout:
cd /path/to/vLLM/vllm
python3 -m venv .venv && source .venv/bin/activate
python -m pip install -U pip
python -m pip install -r requirements/common.txt
# Optional: editable install if desired for local import/version detection
python -m pip install -e . --no-build-isolation

Docs

  • Flags are auto-exposed by EngineArgs; if preferred, I can add a short entry to the Engine Arguments docs to surface these flags explicitly.

Rollout & Backward Compatibility

  • Default eviction policy is unchanged (LRU).
  • New policy is opt‑in and guarded by clear CLI flags; docs are embedded in flag help.

Appendix: CLI Flags

  • --prefix-cache-eviction-policy {lru,frequency_cost} (default: lru)
  • --eviction-cost-alpha <float> (default: 2.0)
  • --eviction-time-decay <float> (default: 0.0)
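
As a usage illustration, the same settings should be reachable from the offline LLM API once the flags are wired through EngineArgs; the keyword-argument names below are assumed to mirror the CLI flags (snake_case) and should be verified against --help after this PR lands.

```python
# Hypothetical usage sketch: keyword-argument names are assumed to mirror the
# CLI flags above (snake_case) once they are wired through EngineArgs; verify
# the exact spelling with `--help` after this PR lands.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,
    prefix_cache_eviction_policy="frequency_cost",
    eviction_cost_alpha=2.0,
    eviction_time_decay=0.0,
)
```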

cc @vllm-project/maintainers

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new frequency- and cost-aware eviction policy for the prefix cache, which is a valuable addition for optimizing performance under mixed workloads. The implementation is well-structured, introducing a pluggable EvictionPolicy interface and integrating it cleanly with the existing BlockPool and configuration system. The new policy is opt-in, ensuring backward compatibility. The added unit tests and provided benchmarks demonstrate the effectiveness of the new policy. I've found one high-severity issue in the get_new_blocks logic where some cached blocks might be dropped from the new policy's tracking, falling back to LRU eviction. This could reduce the effectiveness of the new feature.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@Aminsed Aminsed force-pushed the feat/prefix-cache-frequency-cost-eviction branch from bb46baa to 3e40ef6 on October 27, 2025 at 00:40.
Resolved conflicts in vllm/config/cache.py by keeping both the new
KV offloading fields from upstream and the eviction policy fields
from the feature branch.

Signed-off-by: Amin Sedaghat <[email protected]>