Description
What would you like to be added:
The current prefix cache aware plugin enforces a single global LRU capacity. To make the estimation more accurate, we should improve the plugin to enforce a per-pod LRU cache capacity.
We will introduce new config knobs for cache size:
// DEPRECATED, global capacity for all servers.
// From a back-of-envelope calculation, an H100 GPU can hold a cache of roughly 0.5M tokens for the Llama 3 8B model. Given
// vLLM's default block size of 16, and that each block corresponds to an int64 hash, 0.5M tokens only require 0.25MB of
// storage on the EPP. With modern servers, the EPP can easily accommodate 10k servers, so it's unlikely we will need the
// global capacity limit. We can add it later if we need it.
PREFIX_CACHE_LRU_CAPACITY
// NEW, per-server capacity. Note this will become an optional config because we can dynamically get this info from
// model server metrics.
PREFIX_CACHE_LRU_CAPACITY_PER_SERVER
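For illustration, a minimal sketch of how the per-server knob could be read, assuming env-var based configuration; the helper name and the default of 31,250 entries (roughly 0.5M tokens / block size 16, per the numbers above) are assumptions, and the value would be overridden once per-server metrics are available:
import (
	"os"
	"strconv"
)

// Hypothetical default derived from the back-of-envelope numbers above: ~0.5M tokens / block size 16.
const defaultLRUCapacityPerServer = 31250

// lruCapacityPerServer is a hypothetical helper: it returns the configured per-server LRU capacity,
// falling back to the default when the env var is unset or invalid.
func lruCapacityPerServer() int {
	if v := os.Getenv("PREFIX_CACHE_LRU_CAPACITY_PER_SERVER"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultLRUCapacityPerServer
}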
The proposed prefix indexer looks like this:
type indexer struct {
hashToPods map[BlockHash]PodSet // the lookup data structure to find pods that have the BlockHash cached
podToLRU map[string]lru.Cache[BlockHash, any] // key is pod namespacedName, value is an LRU cache
}
type PodSet map[string]bool
When a new cache entry is added, we will update both hashToPods and podToLRU.
Add(hash BlockHash, pod NamespacedName)
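For illustration, a minimal sketch of Add, assuming hashicorp/golang-lru/v2 as the LRU implementation (in which case podToLRU would hold *lru.Cache[BlockHash, any] values); the capacityPerServer field and onEvictedFor helper are hypothetical, and locking and error handling are omitted:
func (i *indexer) Add(hash BlockHash, pod NamespacedName) {
	podKey := pod.String()

	// Lazily create the per-pod LRU with the per-server capacity, registering the
	// eviction callback (sketched below) so hashToPods stays consistent.
	cache, ok := i.podToLRU[podKey]
	if !ok {
		cache, _ = lru.NewWithEvict[BlockHash, any](i.capacityPerServer, i.onEvictedFor(podKey))
		i.podToLRU[podKey] = cache
	}
	cache.Add(hash, struct{}{})

	// Update the reverse lookup so scheduling can find pods that likely cache this block.
	pods, ok := i.hashToPods[hash]
	if !ok {
		pods = PodSet{}
		i.hashToPods[hash] = pods
	}
	pods[podKey] = true
}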
When the LRU evicts an entry, we can register an onEvicted callback to also evict the entry from hashToPods.
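A sketch of that callback under the same assumptions; onEvictedFor is a hypothetical helper that captures the pod key for the closure registered via lru.NewWithEvict:
func (i *indexer) onEvictedFor(podKey string) func(BlockHash, any) {
	return func(hash BlockHash, _ any) {
		// Remove this pod from the hash's pod set; drop the hash entirely once no pod caches it.
		if pods, ok := i.hashToPods[hash]; ok {
			delete(pods, podKey)
			if len(pods) == 0 {
				delete(i.hashToPods, hash)
			}
		}
	}
}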
This was initially proposed and discussed in
EPP Scheduler Algo Enhancements.
Why is this needed:
This makes the prefix cache estimation more accurate, as each pod (model server) has its own cache capacity. Without this, the estimated cache could accumulate imbalanced entries across pods, leading to inaccuracy.