[Prefix Plugin Enhancement] Add per server LRU capacity #960

@liu-cong

Description

What would you like to be added:

The current prefix cache plugin enforces a global LRU capacity. To make the cache estimation more accurate, we should improve the prefix-cache-aware plugin to enforce a per-pod LRU cache capacity.

We will introduce a new config knob for cache size:

// DEPRECATED: global capacity for all servers.
// From a back-of-the-envelope calculation, an H100 GPU can hold a 0.5M-token cache for the llama3 8B model. Given the vLLM default block
// size of 16, and that each block corresponds to an int64 hash, 0.5M tokens require only 0.25MB of storage on the EPP. With
// modern servers, the EPP can easily accommodate 10k servers, so it's unlikely we will need the global capacity limit. We can add
// it back later if we need it.
PREFIX_CACHE_LRU_CAPACITY
// NEW: per-server capacity. Note that this will become an optional config because we can dynamically get this info from model
// server metrics.
PREFIX_CACHE_LRU_CAPACITY_PER_SERVER
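
The storage estimate in the comment above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, the inputs are the figures quoted in the comment, and it counts raw hash bytes only (map and LRU bookkeeping overhead in a real index would be higher).

```go
package main

// bytesPerHash assumes one int64 hash per cached block, as stated above.
const bytesPerHash int64 = 8

// blocksPerServer returns the number of cache blocks needed to hold the
// given number of tokens, rounding up to a whole block.
func blocksPerServer(tokens, blockSize int64) int64 {
	return (tokens + blockSize - 1) / blockSize
}

// indexBytes returns the raw hash storage, in bytes, needed to track the
// full cache of every server.
func indexBytes(servers, tokens, blockSize int64) int64 {
	return servers * blocksPerServer(tokens, blockSize) * bytesPerHash
}
```

With 0.5M tokens and a block size of 16, one server needs 31,250 hashes, or 250,000 bytes (0.25MB). Scaling to 10k servers gives roughly 2.5GB of raw hashes.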

The proposed prefix indexer looks like this:

type indexer struct {
   hashToPods map[BlockHash]PodSet // the lookup data structure to find pods that have the BlockHash cached
   podToLRU map[string]lru.Cache[BlockHash, any] // key is pod namespacedName, value is an LRU cache
}

type PodSet map[string]bool

When a new cache entry is added, we will update both hashToPods and podToLRU.

Add(hash BlockHash, pod NamespacedName)

We can register an onEvicted callback on each pod's LRU so that, when the LRU evicts an entry, that pod is also removed from hashToPods for the evicted hash.
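
The interaction between the two maps can be illustrated with a minimal, self-contained sketch. It is not the plugin's implementation: it replaces the generic lru.Cache with a small hand-written LRU built on container/list, uses a plain string for the pod's namespaced name, and assumes BlockHash is a uint64. The names podLRU, newPodLRU, newIndexer, and Get are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import "container/list"

// BlockHash is assumed to be a 64-bit hash of a token block.
type BlockHash uint64

// PodSet is the set of pods believed to have a block cached.
type PodSet map[string]bool

// podLRU is a tiny fixed-capacity LRU of block hashes for one pod.
type podLRU struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[BlockHash]*list.Element
	onEvict  func(BlockHash) // called after a hash is evicted
}

func newPodLRU(capacity int, onEvict func(BlockHash)) *podLRU {
	return &podLRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[BlockHash]*list.Element),
		onEvict:  onEvict,
	}
}

// add marks h as most recently used and evicts the oldest entry if the
// pod is over capacity.
func (c *podLRU) add(h BlockHash) {
	if el, ok := c.items[h]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.items[h] = c.order.PushFront(h)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		evicted := oldest.Value.(BlockHash)
		delete(c.items, evicted)
		c.onEvict(evicted)
	}
}

type indexer struct {
	capacityPerServer int
	hashToPods        map[BlockHash]PodSet
	podToLRU          map[string]*podLRU
}

func newIndexer(capacityPerServer int) *indexer {
	return &indexer{
		capacityPerServer: capacityPerServer,
		hashToPods:        make(map[BlockHash]PodSet),
		podToLRU:          make(map[string]*podLRU),
	}
}

// Add records that pod has hash cached, updating both maps.
func (i *indexer) Add(hash BlockHash, pod string) {
	lru, ok := i.podToLRU[pod]
	if !ok {
		// The eviction callback keeps hashToPods in sync with this pod's LRU.
		lru = newPodLRU(i.capacityPerServer, func(evicted BlockHash) {
			pods := i.hashToPods[evicted]
			delete(pods, pod)
			if len(pods) == 0 {
				delete(i.hashToPods, evicted)
			}
		})
		i.podToLRU[pod] = lru
	}
	if i.hashToPods[hash] == nil {
		i.hashToPods[hash] = PodSet{}
	}
	i.hashToPods[hash][pod] = true
	lru.add(hash) // may evict this pod's oldest hash via the callback
}

// Get returns the pods believed to have hash cached (nil if none).
func (i *indexer) Get(hash BlockHash) PodSet {
	return i.hashToPods[hash]
}
```

Because each pod owns its own LRU, a busy pod filling its capacity evicts only its own oldest hashes. Other pods that hold the same hash stay in the PodSet for that hash.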

This was initially proposed and discussed in EPP Scheduler Algo Enhancements.

Why is this needed:

This makes the prefix cache estimation more accurate, because each pod (model server) has its own cache capacity. Without it, the estimated cache can hold an imbalanced number of entries per pod, which makes the estimate inaccurate.
