Description
What would you like to be added:
The current prefix cache aware plugin enforces a single global LRU capacity. To make the estimation more accurate, we should improve the plugin to enforce a per-pod LRU cache capacity.
We will introduce new config knobs for cache size:
// DEPRECATED, global capacity for all servers.
// From a back-of-envelope calculation, an H100 GPU can hold a cache of roughly 0.5M tokens for the Llama 3 8B model. Given
// vLLM's default block size of 16, and that each block corresponds to an int64 hash, 0.5M tokens only require 0.25MB of
// storage on the EPP. With modern servers, the EPP can easily accommodate 10k servers, so it's unlikely we will need the
// global capacity limit. We can add it later if we need it.
PREFIX_CACHE_LRU_CAPACITY
// NEW, per-server capacity. Note this will become an optional config because we can dynamically get this info from
// model server metrics.
PREFIX_CACHE_LRU_CAPACITY_PER_SERVER
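For illustration, a minimal sketch of how the per-server knob could be read, assuming env-var based configuration; the helper name and the default of 31,250 entries (roughly 0.5M tokens / block size 16, per the numbers above) are assumptions, and the value would be overridden once per-server metrics are available:
import (
	"os"
	"strconv"
)

// Hypothetical default derived from the back-of-envelope numbers above: ~0.5M tokens / block size 16.
const defaultLRUCapacityPerServer = 31250

// lruCapacityPerServer is a hypothetical helper: it returns the configured per-server LRU capacity,
// falling back to the default when the env var is unset or invalid.
func lruCapacityPerServer() int {
	if v := os.Getenv("PREFIX_CACHE_LRU_CAPACITY_PER_SERVER"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultLRUCapacityPerServer
}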
The proposed prefix indexer looks like this:
type indexer struct {
hashToPods map[BlockHash]PodSet // the lookup data structure to find pods that have the BlockHash cached
podToLRU map[string]lru.Cache[BlockHash, any] // key is pod namespacedName, value is an LRU cache
}
type PodSet map[string]bool
When a new cache entry is added, we will update both hashToPods and podToLRU.
Add(hash BlockHash, pod NamespacedName)
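For illustration, a minimal sketch of Add, assuming hashicorp/golang-lru/v2 as the LRU implementation (in which case podToLRU would hold *lru.Cache[BlockHash, any] values); the capacityPerServer field and onEvictedFor helper are hypothetical, and locking and error handling are omitted:
func (i *indexer) Add(hash BlockHash, pod NamespacedName) {
	podKey := pod.String()

	// Lazily create the per-pod LRU with the per-server capacity, registering the
	// eviction callback (sketched below) so hashToPods stays consistent.
	cache, ok := i.podToLRU[podKey]
	if !ok {
		cache, _ = lru.NewWithEvict[BlockHash, any](i.capacityPerServer, i.onEvictedFor(podKey))
		i.podToLRU[podKey] = cache
	}
	cache.Add(hash, struct{}{})

	// Update the reverse lookup so scheduling can find pods that likely cache this block.
	pods, ok := i.hashToPods[hash]
	if !ok {
		pods = PodSet{}
		i.hashToPods[hash] = pods
	}
	pods[podKey] = true
}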
When the LRU evicts an entry, we can register an onEvicted callback to also evict the entry from hashToPods.
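A sketch of that callback under the same assumptions; onEvictedFor is a hypothetical helper that captures the pod key for the closure registered via lru.NewWithEvict:
func (i *indexer) onEvictedFor(podKey string) func(BlockHash, any) {
	return func(hash BlockHash, _ any) {
		// Remove this pod from the hash's pod set; drop the hash entirely once no pod caches it.
		if pods, ok := i.hashToPods[hash]; ok {
			delete(pods, podKey)
			if len(pods) == 0 {
				delete(i.hashToPods, hash)
			}
		}
	}
}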
This was initially proposed and discussed in
EPP Scheduler Algo Enhancements.
Why is this needed:
This makes the prefix cache estimation more accurate, as each pod (model server) has its own cache capacity. Without this, the estimated cache could accumulate imbalanced entries across pods, leading to inaccuracy.