
@Aminsed Aminsed commented Oct 27, 2025

Prefix Caching: Frequency- and Cost-Aware Eviction (opt-in)

Closes: #23641

Summary

  • Adds an optional prefix-cache eviction policy that considers frequency and estimated cost, while keeping LRU as the default.
  • Pluggable EvictionPolicy interface; integrates with existing FreeKVCacheBlockQueue and BlockPool without hot‑path regressions.
  • Minimal block metadata: only first_access_ts and access_count; retention score computed in the policy to avoid per-access overhead.

Motivation

The current prefix cache uses LRU. Under mixed workloads and constrained KV cache, frequently reused and/or expensive prefixes can be evicted prematurely. An opt‑in, frequency‑ and cost‑aware strategy can improve reuse and latency without impacting users who prefer LRU.

Design and Approach

  1. Policy interface and wiring
  • Introduces a policy interface (vllm.v1.core.eviction_policies.base.EvictionPolicy) and a concrete implementation FrequencyCostEvictionPolicy.
  • Policy is configured from CacheConfig and passed through EngineArgs → VllmConfig → KVCacheManager → BlockPool.
  2. Minimal metadata, lazy data structure
  • Block metadata limited to:
    • first_access_ts (monotonic timestamp)
    • access_count (int)
  • Policy maintains a min‑heap of cached‑free blocks keyed by a retention score and an entry_finder dict for O(1) invalidation. Lazy deletion keeps inserts/updates at O(log N) and removals cheap; a minimal sketch of this bookkeeping follows after this list.
  3. Retention score (lower is more evictable)
  • Score combines frequency and cost. Block size is used as the cost proxy, with a tunable exponent alpha; an optional time decay de‑emphasizes stale history:
    • score = (block_size ** alpha) / (access_count * (1 + decay_factor * age_sec))
    • Tunables exposed as CLI flags: --eviction-cost-alpha, --eviction-time-decay.
  4. Integration without hot‑path regressions
  • FreeKVCacheBlockQueue remains the source of truth for free blocks; get_new_blocks prioritizes non‑cached‑free blocks first (O(1)).
  • Only when more blocks are needed do we consult the policy for cached‑free candidates; removals use the queue’s O(1) unlink.
  • Reads do not recalculate scores; scores are recomputed only when a block enters or leaves the cached‑free pool.
  5. Backward compatibility and flags
  • LRU remains default: --prefix-cache-eviction-policy lru (default).
  • Opt‑in: --prefix-cache-eviction-policy frequency_cost with --eviction-cost-alpha and --eviction-time-decay.
  • Flags live in vllm/vllm/config/cache.py and are auto‑exposed by EngineArgs.add_cli_args.
  6. Implementation touch points (high‑level)
  • vllm/vllm/config/cache.py: adds prefix_cache_eviction_policy, eviction_cost_alpha, eviction_time_decay.
  • vllm/vllm/engine/arg_utils.py: wires flags through EngineArgs into CacheConfig.
  • vllm/vllm/v1/core/eviction_policies/: new package with base.py (the EvictionPolicy interface), frequency_cost.py, and __init__.py.
  • vllm/vllm/v1/core/block_pool.py: consults policy only when cached‑free blocks are needed; notifies policy on access/release; preserves LRU queue semantics.
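
The following is a minimal, self‑contained sketch of the retention score and the lazy‑deletion bookkeeping from items 2–3. It is illustrative only: `BlockMeta`, `FrequencyCostHeap`, and the `max(access_count, 1)` guard are assumptions for the sketch, not the exact classes in this PR.

```python
# Illustrative sketch only -- class and helper names (BlockMeta,
# FrequencyCostHeap) are assumptions for this example, not the exact
# classes introduced by this PR.
import heapq
import time
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    """Minimal per-block metadata tracked by the policy."""
    block_size: int
    first_access_ts: float = field(default_factory=time.monotonic)
    access_count: int = 0


def retention_score(meta: BlockMeta, alpha: float = 2.0, decay: float = 0.0) -> float:
    """Lower score == more evictable (matches the formula in item 3)."""
    age_sec = time.monotonic() - meta.first_access_ts
    freq = max(meta.access_count, 1)  # guard against division by zero (assumption)
    return (meta.block_size ** alpha) / (freq * (1.0 + decay * age_sec))


class FrequencyCostHeap:
    """Min-heap over cached-free blocks with lazy deletion via entry_finder."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, int]] = []        # (score, seq, block_id)
        self._entry_finder: dict[int, tuple[float, int, int]] = {}
        self._seq = 0  # tie-breaker so equal scores never compare block ids

    def push(self, block_id: int, score: float) -> None:
        # O(log N). A re-pushed block simply gets a fresh entry; the stale
        # copy left in the heap is skipped later (lazy deletion).
        entry = (score, self._seq, block_id)
        self._seq += 1
        self._entry_finder[block_id] = entry
        heapq.heappush(self._heap, entry)

    def invalidate(self, block_id: int) -> None:
        # O(1): drop the live entry; any copies left in the heap become stale.
        self._entry_finder.pop(block_id, None)

    def pop_most_evictable(self) -> int | None:
        # Pop until the heap top is still the live entry for its block.
        while self._heap:
            entry = heapq.heappop(self._heap)
            block_id = entry[2]
            if self._entry_finder.get(block_id) is entry:
                del self._entry_finder[block_id]
                return block_id
        return None
```

In this scheme, pop_most_evictable() returns the cached‑free block with the lowest retention score, so small, rarely reused, or stale blocks are reclaimed first.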

Complexity & Hot‑Path

  • Access updates are O(1) (bump a counter and a timestamp); heap operations happen only when blocks enter or leave the cached‑free pool: O(log N), where N is the number of cached‑free blocks.
  • Common allocations remain O(1) from FreeKVCacheBlockQueue; the allocation path is sketched below.
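
As a simplified illustration of that allocation path (item 4 above), the sketch below reuses FrequencyCostHeap from the previous example. The deque and dict are stand‑ins for the real FreeKVCacheBlockQueue and BlockPool bookkeeping, whose APIs differ; the fast‑path/slow‑path split is the point being shown.

```python
# Illustrative sketch of the allocation path; reuses FrequencyCostHeap from
# the previous sketch. The deque/dict below are stand-ins for the real
# FreeKVCacheBlockQueue and BlockPool bookkeeping, whose APIs differ.
from collections import deque
from typing import Any


def get_new_blocks(
    num_blocks: int,
    non_cached_free: deque,           # free blocks holding no reusable prefix
    cached_free: dict[int, Any],      # block_id -> block still holding a cached prefix
    policy: "FrequencyCostHeap",      # min-heap keyed by retention score
) -> list:
    allocated: list = []

    # Fast path (unchanged from LRU): hand out non-cached-free blocks in O(1)
    # each, without consulting the policy or computing any scores.
    while len(allocated) < num_blocks and non_cached_free:
        allocated.append(non_cached_free.popleft())

    # Slow path (only under cache pressure): evict the cached-free block with
    # the lowest retention score and remove it from the free bookkeeping.
    while len(allocated) < num_blocks:
        victim_id = policy.pop_most_evictable()
        if victim_id is None:
            raise RuntimeError("no free KV cache blocks available")
        allocated.append(cached_free.pop(victim_id))  # O(1) removal

    return allocated
```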

Tests

  • CPU‑only unit tests (eviction policy + prefix caching) passed locally:
    • pytest -q -m cpu_test vllm/tests/v1/core/test_eviction_policies.py vllm/tests/v1/core/test_prefix_caching.py
    • Result: 28 passed (Py3.12)

Benchmarks (RTX A6000)

  • Env: Torch 2.9.0+cu128; vLLM v0.11.1rc4.dev; CUDA bf16/fp16, V1 engine; single GPU.
  • Model: facebook/opt-125m for quick, reproducible runs.

Medium regime (moderate prompts)

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 256:512 \
  --prefix-len 128 \
  --seed 123 \
  --prefix-cache-eviction-policy lru

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 256:512 \
  --prefix-len 128 \
  --seed 123 \
  --prefix-cache-eviction-policy frequency_cost \
  --eviction-cost-alpha 2.0 \
  --eviction-time-decay 0.0

  • LRU: 0.381 s
  • frequency_cost: 0.383 s
  • Δ: (0.383 − 0.381) / 0.381 ≈ 0.5% slower than LRU

Eviction‑pressure regime (force evictions via small KV cache)

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 512:1024 \
  --prefix-len 256 \
  --seed 123 \
  --kv-cache-memory-bytes 2147483648 \
  --prefix-cache-eviction-policy lru

python vllm/benchmarks/benchmark_prefix_caching.py \
  --model facebook/opt-125m \
  --enable-prefix-caching \
  --num-prompts 64 \
  --repeat-count 8 \
  --input-length-range 512:1024 \
  --prefix-len 256 \
  --seed 123 \
  --kv-cache-memory-bytes 2147483648 \
  --prefix-cache-eviction-policy frequency_cost \
  --eviction-cost-alpha 2.0 \
  --eviction-time-decay 0.0

  • LRU: 0.695 s
  • frequency_cost: 0.689 s
  • Δ: (0.695 − 0.689) / 0.695 ≈ 0.9% faster than LRU

Notes

  • With a very large KV cache (few evictions), the policy's bookkeeping overhead can dominate, as in the medium regime above (≈0.5% slower); with a constrained KV cache (typical under load), frequency_cost comes out ahead (≈0.9% faster in the eviction‑pressure regime). LRU remains the default.
  • If you hit a GPU memory profiling assertion during init (because other processes change GPU memory usage), pass --kv-cache-memory-bytes <bytes> to skip profiling, as in the eviction‑pressure commands above.
  • Benchmarking used the existing script vllm/benchmarks/benchmark_prefix_caching.py; no new script is introduced.

Reproducibility

  • Python 3.12; single GPU.
  • Minimal install for local source checkout:
cd /path/to/vLLM/vllm
python3 -m venv .venv && source .venv/bin/activate
python -m pip install -U pip
python -m pip install -r requirements/common.txt
# Optional: editable install if desired for local import/version detection
python -m pip install -e . --no-build-isolation

Docs

  • Flags are auto-exposed by EngineArgs; if preferred, I can add a short entry to the Engine Arguments docs to surface these flags explicitly.

Rollout & Backward Compatibility

  • Default eviction policy is unchanged (LRU).
  • New policy is opt‑in and guarded by clear CLI flags; docs are embedded in flag help.

Appendix: CLI Flags

  • --prefix-cache-eviction-policy {lru,frequency_cost} (default: lru)
  • --eviction-cost-alpha <float> (default: 2.0)
  • --eviction-time-decay <float> (default: 0.0)
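
As a usage illustration, the same settings should be reachable from the offline LLM API once the flags are wired through EngineArgs; the keyword-argument names below are assumed to mirror the CLI flags (snake_case) and should be verified against --help after this PR lands.

```python
# Hypothetical usage sketch: keyword-argument names are assumed to mirror the
# CLI flags above (snake_case) once they are wired through EngineArgs; verify
# the exact spelling with `--help` after this PR lands.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    enable_prefix_caching=True,
    prefix_cache_eviction_policy="frequency_cost",
    eviction_cost_alpha=2.0,
    eviction_time_decay=0.0,
)
```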

cc @vllm-project/maintainers

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new frequency- and cost-aware eviction policy for the prefix cache, which is a valuable addition for optimizing performance under mixed workloads. The implementation is well-structured, introducing a pluggable EvictionPolicy interface and integrating it cleanly with the existing BlockPool and configuration system. The new policy is opt-in, ensuring backward compatibility. The added unit tests and provided benchmarks demonstrate the effectiveness of the new policy. I've found one high-severity issue in the get_new_blocks logic where some cached blocks might be dropped from the new policy's tracking, falling back to LRU eviction. This could reduce the effectiveness of the new feature.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@Aminsed Aminsed force-pushed the feat/prefix-cache-frequency-cost-eviction branch from bb46baa to 3e40ef6 on October 27, 2025 at 00:40.
Resolved conflicts in vllm/config/cache.py by keeping both the new
KV offloading fields from upstream and the eviction policy fields
from the feature branch.

Signed-off-by: Amin Sedaghat <[email protected]>