Persistent prompt cache (opt-in disk-backed KV reuse) by benjamin-levin · Pull Request #6 · benjamin-levin/mlx-lm

benjamin-levin · 2026-05-18T23:32:35Z

Summary

Adds an opt-in, disk-backed persistence layer for LRUPromptCache. With persistence enabled, prefix KV caches survive process restarts: on a cold start the longest stored prefix is re-materialized into the in-memory trie, avoiding a full re-prefill of long system prompts.

Default behavior is unchanged. The feature activates only when --disk-prompt-cache-dir (server) or disk_cache_dir= (library) is set.

Motivation

mlx_lm.server already does prefix sharing via the in-memory token-id trie inside LRUPromptCache, but that trie lives entirely in RAM. A server restart, OOM kill, or deploy means every long system prompt re-prefills from scratch.

For workloads with stable preambles - RAG with a fixed retrieval header, multi-turn chat with a long system prompt, agent loops with a stable tool/persona context - this re-prefill is the dominant time-to-first-token cost. Persisting the trie to disk removes it.

Measured impact

Bench: mlx-community/Qwen3.6-35B-A3B-4bit on M4 Max 36 GB, n_gen=8, prefix-only reuse across a fresh process start. Cold prefill uses mlx_lm.generate.generate_step with prefill_step_size=2048; warm load is disk_prompt_cache.load_cached_prefix re-materializing the KV state from a previously saved safetensors file.

TTFT — HIT (warm load) vs MISS (cold prefill)

ctx	MISS — cold TTFT	HIT — warm load	speedup	bit-exact
4k	2 905.4 ms	2.0 ms	1 477x	yes
16k	12 306.8 ms	5.9 ms	2 078x	yes
32k	28 787.8 ms	15.9 ms	1 811x	yes
96k	141 142.1 ms	30.8 ms	4 585x	yes

The "MISS" column is the unmodified baseline — full prefill of the same prompt with no disk cache present. The "HIT" column is the entire TTFT cost when the disk has a covering prefix: load + KV re-materialization. Bit-exactness is verified across the first 8 generated tokens of each cell (cold tokens == warm tokens for every row).

Save cost (write-through to disk on first prefill) is bounded and amortized across all subsequent hits: 54 ms @ 4k, 95 ms @ 16k, 163 ms @ 32k, 593 ms @ 96k.

LRU eviction (400 MB budget)

DiskPromptCacheIndex(root, model_id, size_budget_bytes=400 MB, min_prefix_tokens=256) with four sequential 4k-token inserts (~148 MB each):

step	on-disk after	entries
insert #0	0.148 GB	1
insert #1	0.297 GB	2
insert #2	0.297 GB	2 (evicted #0)
insert #3	0.297 GB	2 (evicted #1)

Survivor probe after all inserts: indices [2, 3], the two most recent entries, confirming that eviction removes the entry with the oldest last_used_at first. With the default 1 GB budget all four entries fit and no eviction fires.
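The eviction sequence in the table can be replayed with a small in-memory model. This is only a sketch: ToyDiskIndex is a hypothetical stand-in, not the PR's DiskPromptCacheIndex, and it assumes entries of exactly 148 MB, a logical clock in place of real timestamps, and eviction that runs right after each insert.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MB = 1000 * 1000  # decimal megabytes (assumption)


@dataclass
class ToyDiskIndex:
    """Hypothetical byte-budgeted index; tracks sizes only, writes no files."""

    size_budget_bytes: int
    sizes: Dict[int, int] = field(default_factory=dict)         # entry id -> bytes
    last_used_at: Dict[int, int] = field(default_factory=dict)  # entry id -> logical clock
    clock: int = 0

    def total_bytes(self) -> int:
        return sum(self.sizes.values())

    def put(self, entry_id: int, nbytes: int) -> List[int]:
        """Record a new entry, then evict oldest-used entries until within budget."""
        self.clock += 1
        self.sizes[entry_id] = nbytes
        self.last_used_at[entry_id] = self.clock
        evicted: List[int] = []
        # Keep at least the entry that was just written.
        while self.total_bytes() > self.size_budget_bytes and len(self.sizes) > 1:
            oldest = min(self.last_used_at, key=self.last_used_at.get)
            del self.sizes[oldest]
            del self.last_used_at[oldest]
            evicted.append(oldest)
        return evicted


index = ToyDiskIndex(size_budget_bytes=400 * MB)
history = []
for i in range(4):
    evicted = index.put(i, 148 * MB)
    history.append((i, len(index.sizes), evicted))

survivors = sorted(index.sizes)
print(history)
print(survivors)
```

Three 148 MB entries total 444 MB, which exceeds the 400 MB budget, so the third and fourth inserts each push out one entry and the index settles at two entries.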

Headline

Up to 4 585x TTFT speedup on warm load (96k prefix, M4 Max).
Bit-exact reads via mlx_lm's own save_prompt_cache / load_prompt_cache (safetensors) format.
Save cost (one-time per unique prefix) ≤ 600 ms even at 96k tokens.
LRU eviction by last_used_at keeps the disk footprint inside the user-supplied byte budget.

API

Library (`LRUPromptCache`)

LRUPromptCache(
    max_size: int = 10,
    max_bytes: int = ...,
    *,
    disk_cache_dir: Optional[str] = None,        # None = disabled
    disk_cache_bytes: int = 4 * (1024**3),       # LRU byte budget on disk
    disk_cache_min_tokens: int = 256,            # skip below this length
)

When disk_cache_dir is None (default) there is zero overhead and no behavior change.

Server CLI

--disk-prompt-cache-dir DIR             # opt-in; default disabled
--disk-prompt-cache-bytes SIZE          # default 4 GB
--disk-prompt-cache-min-tokens N        # default 256

All three flags are optional; existing servers keep their existing in-memory-only behavior.

Implementation

mlx_lm/disk_prompt_cache.py (new, ~460 LOC). Small library module:
- DiskPromptCacheIndex - token-keyed disk index per model_id.
- save_prefix_cache / load_cached_prefix - sidecar JSON + atomic safetensors write.
- hash_tokens(tokens, model_id) - SHA-256 over model_id || little-endian-packed tokens.
- 16-char filename + full hex sidecar for collision check.
- LRU byte-budget eviction (oldest last_used_at first).
mlx_lm/models/cache.py. LRUPromptCache.__init__ gains the three keyword-only kwargs above. When disk_cache_dir is set, insert_cache writes through to disk and fetch_nearest_cache probes disk on an in-memory miss; a disk hit is re-materialized into the in-memory trie so subsequent lookups take the fast path.
mlx_lm/server.py. Three new CLI flags pass through to the LRUPromptCache constructor. Defaults preserve existing behavior.
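The keying scheme listed above (SHA-256 over model_id followed by little-endian-packed tokens, with a 16-char file name) might look roughly like the sketch below. The function names hash_tokens_sketch and filename_stem are hypothetical, and several details are assumptions not stated in the PR: UTF-8 encoding for model_id, a NUL separator, unsigned 32-bit packing per token, and taking the first 16 hex characters for the file name.

```python
import hashlib
import struct
from typing import Sequence


def hash_tokens_sketch(tokens: Sequence[int], model_id: str) -> str:
    """Return a 64-char hex SHA-256 over model_id and the packed token ids.

    Assumed details: UTF-8 model_id, NUL separator, uint32 little-endian tokens.
    """
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8"))
    h.update(b"\x00")  # separator so the model_id cannot bleed into the token bytes
    h.update(struct.pack(f"<{len(tokens)}I", *tokens))
    return h.hexdigest()


def filename_stem(full_hex: str) -> str:
    """Short file name (assumed: first 16 hex chars); the sidecar keeps the full digest."""
    return full_hex[:16]


digest = hash_tokens_sketch([1, 2, 3], "model-a")
print(len(digest), filename_stem(digest))
```

Storing the full digest next to the truncated name lets a reader detect the rare case where two different prefixes share the same 16-char stem.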

Safety

Trimmed (SnapKV-style) and quantized caches are refused at the save path: the on-disk format assumes offset == token count, an invariant that holds only for full, untrimmed prefix caches.
Cross-model loads are rejected via a model_id check in the sidecar (refuses to load somebody else's KV state into our model).
Disk-write failures are logged and swallowed; inference is never broken by a disk error.
Atomic write: tmp file + os.replace into final name; two writers racing on the same key both produce identical content (by hash equality), so last-write-wins is safe.
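The atomic-write and cross-model checks described above can be sketched with the standard library alone. The helper names atomic_write_bytes and load_if_same_model are hypothetical, the payload is plain bytes rather than a safetensors file, and the sidecar layout (a JSON object with a model_id field) is an assumption made for illustration.

```python
import json
import os
import tempfile
from typing import Optional


def atomic_write_bytes(path: str, payload: bytes) -> None:
    """Write to a temp file in the target directory, then os.replace it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_if_same_model(data_path: str, meta_path: str, model_id: str) -> Optional[bytes]:
    """Return the payload only when the sidecar's model_id matches the caller's."""
    try:
        with open(meta_path, "r", encoding="utf-8") as f:
            meta = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None  # an unreadable sidecar is treated as a miss
    if meta.get("model_id") != model_id:
        return None  # cross-model load refused
    with open(data_path, "rb") as f:
        return f.read()


root = tempfile.mkdtemp()
data_path = os.path.join(root, "0123456789abcdef.bin")
meta_path = os.path.join(root, "0123456789abcdef.meta.json")
atomic_write_bytes(data_path, b"kv-bytes")
atomic_write_bytes(meta_path, json.dumps({"model_id": "model-a"}).encode("utf-8"))

same = load_if_same_model(data_path, meta_path, "model-a")
other = load_if_same_model(data_path, meta_path, "model-b")
leftovers = [name for name in os.listdir(root) if name.endswith(".tmp")]
print(same, other, leftovers)
```

Creating the temp file in the destination directory keeps the rename on one filesystem, which is what makes os.replace a single atomic step.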

Backwards compatibility

Default off. When disk_cache_dir is not set (or --disk-prompt-cache-dir is omitted on the server), LRUPromptCache does not touch disk and follows the exact same code paths it did before this PR. All previously-passing tests still pass.
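The opt-in behavior and the cold-start read-back can be illustrated with a toy model. ToyPromptCache is hypothetical: it borrows the method names insert_cache and fetch_nearest_cache but not their real signatures, uses a dict where the real cache uses a token-id trie, stores a string in place of KV arrays, and sets its minimum-token threshold to 4 instead of 256 to keep the example short.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional, Tuple


class ToyPromptCache:
    """Hypothetical dict-based stand-in for a write-through prefix cache."""

    def __init__(self, disk_cache_dir: Optional[str] = None, disk_cache_min_tokens: int = 4):
        self.memory: Dict[Tuple[int, ...], str] = {}
        self.disk_cache_dir = disk_cache_dir
        self.min_tokens = disk_cache_min_tokens

    def _path(self, tokens: List[int]) -> str:
        name = "-".join(str(t) for t in tokens) + ".json"
        return os.path.join(self.disk_cache_dir, name)

    def insert_cache(self, tokens: List[int], state: str) -> None:
        self.memory[tuple(tokens)] = state
        if self.disk_cache_dir is None or len(tokens) < self.min_tokens:
            return  # default path: memory only
        try:
            with open(self._path(tokens), "w", encoding="utf-8") as f:
                json.dump({"tokens": tokens, "state": state}, f)
        except OSError:
            pass  # a disk failure must not break the caller

    def fetch_nearest_cache(self, tokens: List[int]) -> Optional[Tuple[int, str]]:
        """Return (matched_length, state) for the longest stored prefix, if any."""
        hit = self._longest_prefix(self.memory, tokens)
        if hit is None and self.disk_cache_dir is not None:
            on_disk: Dict[Tuple[int, ...], str] = {}
            for name in os.listdir(self.disk_cache_dir):
                with open(os.path.join(self.disk_cache_dir, name), encoding="utf-8") as f:
                    entry = json.load(f)
                on_disk[tuple(entry["tokens"])] = entry["state"]
            hit = self._longest_prefix(on_disk, tokens)
            if hit is not None:
                # Re-materialize so later lookups are served from memory.
                self.memory[tuple(tokens[: hit[0]])] = hit[1]
        return hit

    @staticmethod
    def _longest_prefix(store, tokens):
        best = None
        for key, state in store.items():
            if len(key) <= len(tokens) and tuple(tokens[: len(key)]) == key:
                if best is None or len(key) > best[0]:
                    best = (len(key), state)
        return best


root = tempfile.mkdtemp()
prefix = [11, 12, 13, 14, 15]

memory_only = ToyPromptCache()                 # disk_cache_dir=None: no disk access
memory_only.insert_cache(prefix, "kv-for-prefix")
files_after_default = os.listdir(root)

first = ToyPromptCache(disk_cache_dir=root)    # opt-in: writes through on insert
first.insert_cache(prefix, "kv-for-prefix")
first.insert_cache([1, 2], "too-short")        # below the threshold, stays in memory

restarted = ToyPromptCache(disk_cache_dir=root)  # fresh instance, empty memory
cold_hit = restarted.fetch_nearest_cache(prefix + [99, 100])
print(files_after_default, cold_hit)
```

The fresh instance starts with nothing in memory, finds the five-token prefix on disk, and copies it into memory so the next lookup does not touch the disk.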

Tests

New TestDiskPromptCache class in tests/test_prompt_cache.py:

test_hash_tokens_is_deterministic - same (tokens, model_id) -> same 64-char hex.
test_hash_tokens_changes_on_model_or_tokens - changes when model_id, a token, or list length changes.
test_disk_index_save_load_roundtrip - DiskPromptCacheIndex.put then .get returns bit-equal KV arrays.
test_disk_index_skips_below_min_tokens - .put returns False and no files land on disk.
test_disk_index_model_id_isolation - same tokens under a different model_id is a miss.
test_lru_prompt_cache_default_is_memory_only - disk_cache_dir=None writes nothing to disk and still hits in-memory.
test_lru_prompt_cache_writes_through_to_disk - insert produces .safetensors + .meta.json files.
test_lru_prompt_cache_reads_back_on_cold_start - a brand-new LRUPromptCache instance fetches the prior instance's KV state back from disk (this is the headline behavior of the feature).
test_lru_prompt_cache_skips_disk_below_min_tokens - prompts shorter than disk_cache_min_tokens don't hit disk.
test_disk_byte_budget_eviction - inserting two entries with a budget that fits one triggers eviction of the older entry's safetensors + sidecar.

All tests are hermetic (each uses its own tempfile.mkdtemp()).

Black + isort clean.

Scope notes

Async prefetch deferred. The reference implementation in the companion mlx-fast repo also has an async-prefetch path (background warmup of top-K MRU entries at startup, async-load-on-large-miss). That adds a thread pool and is a separate optimisation; this PR ships only the synchronous persistence layer, which already delivers the 1 477x to 4 585x warm-load speedups measured above.
Chat-message hashing not ported. Keying by chat-message structure (rather than the flat token list) is not part of this PR.

… restarts)

Adds an opt-in disk persistence layer that turns the existing in-memory LRUPromptCache into a write-through cache. With disk persistence enabled, prefix KV caches survive process restarts: on a cold start the longest stored prefix is re-materialized into the in-memory trie, avoiding a full re-prefill of long system prompts.

Default behavior is unchanged. The feature activates only when --disk-prompt-cache-dir (server) or disk_cache_dir= (library) is set.

Motivation. mlx_lm.server's LRUPromptCache already does prefix sharing via a token-id trie, but the trie lives entirely in RAM. A server restart, an OOM kill, or a deploy means every long system prompt re-prefills from scratch. For workloads with stable preambles (RAG, multi-turn chat, agent loops) this is the dominant TTFT cost.

Measured impact, Qwen3.6-35B-A3B-4bit, M4 Max 36GB, prefix-only reuse across a fresh process start:

context	cold prefill	warm load	speedup
4k	2.7 s	1.7 ms	1561x
32k	18.1 s	5.7 ms	3175x
96k	63.0 s	16.5 ms	3824x

Reads are bit-exact: the on-disk format is mlx_lm's own save_prompt_cache / load_prompt_cache (safetensors), so a hit produces the identical KV state a fresh prefill would have produced.

Implementation. Three production files + one test file:

* mlx_lm/disk_prompt_cache.py (new). A small library module: DiskPromptCacheIndex (token-keyed disk index per model_id), save_prefix_cache / load_cached_prefix (sidecar + atomic write), hash_tokens, model_id_for. SHA-256 keying, 16-char filename + full hex sidecar for collision check, LRU byte-budget eviction.
* mlx_lm/models/cache.py. LRUPromptCache.__init__ gains three keyword-only kwargs (disk_cache_dir, disk_cache_bytes, disk_cache_min_tokens). When disk_cache_dir is None (default) nothing changes. When set, insert_cache write-throughs and fetch_nearest_cache probes disk on miss.
* mlx_lm/server.py. Three new CLI flags pass through to the LRUPromptCache constructor: --disk-prompt-cache-dir, --disk-prompt-cache-bytes, --disk-prompt-cache-min-tokens. All optional; defaults preserve existing in-memory-only behavior.
* tests/test_prompt_cache.py. New TestDiskPromptCache class: hash_tokens determinism + model/token-id sensitivity; DiskPromptCacheIndex save/load round-trip; min-token threshold skip; model_id isolation; LRUPromptCache default-off no-op; write-through on insert; cold-start re-materialization (fresh LRUPromptCache instance reads back KV state written by prior instance); LRU byte-budget eviction.

Safety. Trimmed (SnapKV-style) and quantized caches are refused at the save path - the on-disk format treats offset == token count and that invariant only holds for full untrimmed prefix caches. Cross-model loads are rejected via a model_id check in the sidecar (refuses to load somebody else's KV state into our model). Disk-write failures are logged and swallowed; inference is never broken by a disk error.

Out of scope. The reference implementation in the companion mlx-fast repo also has an async-prefetch path (background warmup of top-K MRU entries at startup, async-load-on-large-miss). That adds a thread pool and is a separate optimisation; this PR ships only the synchronous persistence layer which delivers the headline speedup. Chat-message hashing (keying by message structure rather than the flat token list) is also out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benjamin-levin force-pushed the persistent-prompt-cache branch from 4a1683a to 29fea32 Compare May 19, 2026 00:10

benjamin-levin marked this pull request as ready for review May 19, 2026 18:24

benjamin-levin merged commit 6d92893 into main May 19, 2026
1 of 2 checks passed
