Persistent prompt cache (opt-in disk-backed KV reuse) #6
Merged
… restarts)
Adds an opt-in disk persistence layer that turns the existing in-memory
LRUPromptCache into a write-through cache. With disk persistence
enabled, prefix KV caches survive process restarts: on a cold start the
longest stored prefix is re-materialized into the in-memory trie,
avoiding a full re-prefill of long system prompts.
Default behavior is unchanged. The feature activates only when
--disk-prompt-cache-dir (server) or disk_cache_dir= (library) is set.
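The write-through and cold-start behavior described above can be pictured with a toy cache. This is only a sketch under stated assumptions: ToyPrefixCache, its JSON files and its file-naming scheme are hypothetical stand-ins, it scans prefix lengths linearly instead of walking a trie, and it is not the LRUPromptCache code from this PR.

```python
import json
import os
import tempfile


class ToyPrefixCache:
    """Toy write-through prefix cache (hypothetical; not mlx_lm's LRUPromptCache)."""

    def __init__(self, disk_dir=None):
        self.mem = {}             # tuple(tokens) -> cached state
        self.disk_dir = disk_dir  # None keeps the cache memory-only

    def _path(self, tokens):
        # Hypothetical naming scheme, for illustration only.
        name = "-".join(map(str, tokens)) + ".json"
        return os.path.join(self.disk_dir, name)

    def insert(self, tokens, state):
        self.mem[tuple(tokens)] = state
        if self.disk_dir is not None:  # write through only when opted in
            with open(self._path(tokens), "w") as f:
                json.dump(state, f)

    def fetch_longest_prefix(self, tokens):
        lengths = range(len(tokens), 0, -1)  # longest prefix first
        for n in lengths:                    # in-memory pass
            key = tuple(tokens[:n])
            if key in self.mem:
                return n, self.mem[key]
        if self.disk_dir is None:
            return 0, None
        for n in lengths:                    # disk probe on in-memory miss
            key = tuple(tokens[:n])
            if os.path.exists(self._path(key)):
                with open(self._path(key)) as f:
                    state = json.load(f)
                self.mem[key] = state        # re-materialize for later lookups
                return n, state
        return 0, None


# A "restart": a brand-new instance pointed at the same directory.
root = tempfile.mkdtemp()
ToyPrefixCache(root).insert([1, 2, 3], {"kv": "state"})
restarted = ToyPrefixCache(root)
matched, state = restarted.fetch_longest_prefix([1, 2, 3, 4])
```

After the simulated restart the three-token prefix is found on disk and copied back into memory, so only the remaining suffix would need prefill. With disk_dir left at None the toy never touches the filesystem, mirroring the default-off behavior.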
Motivation. mlx_lm.server's LRUPromptCache already does prefix sharing
via a token-id trie, but the trie lives entirely in RAM. A server
restart, an OOM kill, or a deploy means every long system prompt
re-prefills from scratch. For workloads with stable preambles
(RAG, multi-turn chat, agent loops) this is the dominant TTFT cost.
Measured impact, Qwen3.6-35B-A3B-4bit, M4 Max 36GB, prefix-only reuse
across a fresh process start:
context   cold prefill   warm load   speedup
4k        2.7 s          1.7 ms      1561x
32k       18.1 s         5.7 ms      3175x
96k       63.0 s         16.5 ms     3824x
Reads are bit-exact: the on-disk format is mlx_lm's own
save_prompt_cache / load_prompt_cache (safetensors), so a hit produces
the identical KV state a fresh prefill would have produced.
Implementation. Three production files + one test file:
* mlx_lm/disk_prompt_cache.py (new). A small library module:
DiskPromptCacheIndex (token-keyed disk index per model_id),
save_prefix_cache / load_cached_prefix (sidecar + atomic write),
hash_tokens, model_id_for. SHA-256 keying, 16-char filename + full
hex sidecar for collision check, LRU byte-budget eviction.
* mlx_lm/models/cache.py. LRUPromptCache.__init__ gains three
keyword-only kwargs (disk_cache_dir, disk_cache_bytes,
disk_cache_min_tokens). When disk_cache_dir is None (default)
nothing changes. When set, insert_cache writes through to disk and
fetch_nearest_cache probes disk on an in-memory miss.
* mlx_lm/server.py. Three new CLI flags pass through to the
LRUPromptCache constructor: --disk-prompt-cache-dir,
--disk-prompt-cache-bytes, --disk-prompt-cache-min-tokens. All
optional; defaults preserve existing in-memory-only behavior.
* tests/test_prompt_cache.py. New TestDiskPromptCache class:
hash_tokens determinism + model/token-id sensitivity;
DiskPromptCacheIndex save/load round-trip; min-token threshold
skip; model_id isolation; LRUPromptCache default-off no-op;
write-through on insert; cold-start re-materialization (fresh
LRUPromptCache instance reads back KV state written by prior
instance); LRU byte-budget eviction.
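The keying scheme named in the list above (SHA-256, a 16-character filename, and the full hex digest kept in a sidecar for a collision check) might look roughly like the sketch below. The function names, the UTF-8 encoding of model_id, the NUL separator and the unsigned 32-bit token width are assumptions made for illustration; the PR's actual hash_tokens may pack its input differently.

```python
import hashlib
import struct


def hash_tokens_sketch(tokens, model_id):
    """SHA-256 over model_id followed by little-endian-packed token ids.

    Assumptions (not confirmed by the PR): UTF-8 model_id, a NUL separator,
    and unsigned 32-bit token ids.
    """
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8"))
    h.update(b"\x00")
    h.update(struct.pack(f"<{len(tokens)}I", *tokens))
    return h.hexdigest()


def filename_stem(full_hex):
    # Short stem for the file name; the full digest is stored in the sidecar.
    return full_hex[:16]


def sidecar_matches(sidecar, tokens, model_id):
    # Guards against two different keys sharing the same 16-character stem.
    return sidecar.get("hash") == hash_tokens_sketch(tokens, model_id)


digest = hash_tokens_sketch([1, 2, 3], "model-a")
sidecar = {"hash": digest}  # hypothetical sidecar layout
```

Because model_id is part of the hashed input, the same token list under a different model maps to a different key, which is what gives per-model isolation.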
Safety. Trimmed (SnapKV-style) and quantized caches are refused at the
save path - the on-disk format treats offset == token count and that
invariant only holds for full untrimmed prefix caches. Cross-model
loads are rejected via a model_id check in the sidecar (refuses to
load somebody else's KV state into our model). Disk-write failures are
logged and swallowed; inference is never broken by a disk error.
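A minimal sketch of the safety properties above, assuming a plain bytes payload in place of a safetensors file: the write goes to a temporary file and is renamed into place with os.replace, the sidecar records the model_id, a mismatching model_id is treated as a miss, and write failures are logged rather than raised. The helper names and the sidecar layout are hypothetical, not the PR's save_prefix_cache / load_cached_prefix.

```python
import json
import logging
import os
import tempfile

log = logging.getLogger("disk_cache_sketch")


def atomic_write_bytes(path, payload):
    """Write to a temp file in the target directory, then rename into place."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def save_entry(path, payload, model_id):
    """Best-effort save: failures are logged and reported, never raised."""
    try:
        atomic_write_bytes(path, payload)
        meta = json.dumps({"model_id": model_id}).encode("utf-8")
        atomic_write_bytes(path + ".meta.json", meta)  # hypothetical sidecar name
        return True
    except OSError as exc:
        log.warning("disk cache write failed: %s", exc)
        return False


def load_entry(path, model_id):
    """Return the payload, or None if the sidecar names a different model."""
    try:
        with open(path + ".meta.json", "rb") as f:
            meta = json.load(f)
        if meta.get("model_id") != model_id:
            return None  # refuse another model's state
        with open(path, "rb") as f:
            return f.read()
    except (OSError, ValueError):
        return None  # missing or unreadable entries are treated as a miss
```

Writing the temporary file in the same directory as the target keeps the rename on one filesystem, which is the condition under which os.replace is atomic.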
Out of scope. The reference implementation in the companion mlx-fast
repo also has an async-prefetch path (background warmup of top-K MRU
entries at startup, async-load-on-large-miss). That adds a thread
pool and is a separate optimisation; this PR ships only the
synchronous persistence layer which delivers the headline speedup.
Chat-message hashing (keying by message structure rather than the
flat token list) is also out of scope here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4a1683a to
29fea32
Summary
Adds an opt-in, disk-backed persistence layer for LRUPromptCache. With persistence enabled, prefix KV caches survive process restarts: on a cold start the longest stored prefix is re-materialized into the in-memory trie, avoiding a full re-prefill of long system prompts.
Default behavior is unchanged. The feature activates only when --disk-prompt-cache-dir (server) or disk_cache_dir= (library) is set.
Motivation
mlx_lm.server already does prefix sharing via the in-memory token-id trie inside LRUPromptCache, but that trie lives entirely in RAM. A server restart, OOM kill, or deploy means every long system prompt re-prefills from scratch.
For workloads with stable preambles - RAG with a fixed retrieval header, multi-turn chat with a long system prompt, agent loops with a stable tool/persona context - this re-prefill is the dominant time-to-first-token cost. Persisting the trie to disk removes it.
Measured impact
Bench: mlx-community/Qwen3.6-35B-A3B-4bit on M4 Max 36 GB, n_gen=8, prefix-only reuse across a fresh process start. Cold prefill uses mlx_lm.generate.generate_step with prefill_step_size=2048; warm load is disk_prompt_cache.load_cached_prefix re-materializing the KV state from a previously saved safetensors file.
TTFT: HIT (warm load) vs MISS (cold prefill)
The "MISS" column is the unmodified baseline: a full prefill of the same prompt with no disk cache present. The "HIT" column is the entire TTFT cost when the disk has a covering prefix: load + KV re-materialization. Bit-exactness is verified across the first 8 generated tokens of each cell (cold tokens == warm tokens for every row).
Save cost (write-through to disk on first prefill) is bounded and amortized across all subsequent hits: 54 ms @ 4k, 95 ms @ 16k, 163 ms @ 32k, 593 ms @ 96k.
LRU eviction (400 MB budget)
DiskPromptCacheIndex(root, model_id, size_budget_bytes=400 MB, min_prefix_tokens=256) with four sequential 4k-token inserts (~148 MB each).
Survivor probe after all inserts: indices [2, 3] (the two most recent), confirming oldest-last_used_at eviction. With the default 1 GB budget all four entries fit and no eviction fires.
Headline
* Reads are bit-exact: the on-disk format is mlx_lm's own save_prompt_cache / load_prompt_cache (safetensors) format.
* LRU eviction by last_used_at keeps the disk footprint inside the user-supplied byte budget.
API
Library (LRUPromptCache)
When disk_cache_dir is None (default) there is zero overhead and no behavior change.
Server CLI
All three flags are optional; existing servers keep their existing in-memory-only behavior.
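As an illustration of how the three server flags could pass through to the three library kwargs, here is a standalone argparse sketch. The flag and kwarg names come from this PR; the argument types, the default values (a 1 GiB byte budget and a 256-token minimum) and the helper names are assumptions, and the real mlx_lm/server.py wiring may differ.

```python
import argparse


def build_parser():
    # Standalone parser for illustration; not the actual mlx_lm.server parser.
    p = argparse.ArgumentParser(prog="server-sketch")
    p.add_argument("--disk-prompt-cache-dir", default=None)
    # Assumed default: 1 GiB byte budget.
    p.add_argument("--disk-prompt-cache-bytes", type=int, default=1 << 30)
    # Assumed default: 256-token minimum prefix length.
    p.add_argument("--disk-prompt-cache-min-tokens", type=int, default=256)
    return p


def cache_kwargs(args):
    """Map parsed CLI flags onto the keyword-only kwargs named in this PR."""
    return {
        "disk_cache_dir": args.disk_prompt_cache_dir,
        "disk_cache_bytes": args.disk_prompt_cache_bytes,
        "disk_cache_min_tokens": args.disk_prompt_cache_min_tokens,
    }


defaults = cache_kwargs(build_parser().parse_args([]))
enabled = cache_kwargs(
    build_parser().parse_args(
        ["--disk-prompt-cache-dir", "/tmp/kv", "--disk-prompt-cache-bytes", "400000000"]
    )
)
```

With no flags given, disk_cache_dir stays None, which is the case the description calls memory-only.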
Implementation
* mlx_lm/disk_prompt_cache.py (new, ~460 LOC). Small library module:
  * DiskPromptCacheIndex - token-keyed disk index per model_id.
  * save_prefix_cache / load_cached_prefix - sidecar JSON + atomic safetensors write.
  * hash_tokens(tokens, model_id) - SHA-256 over model_id || little-endian-packed tokens.
  * LRU byte-budget eviction (oldest last_used_at first).
* mlx_lm/models/cache.py. LRUPromptCache.__init__ gains three keyword-only kwargs (disk_cache_dir, disk_cache_bytes, disk_cache_min_tokens). When disk_cache_dir is set, insert_cache writes through to disk and fetch_nearest_cache probes disk on an in-memory miss; a disk hit is re-materialized into the in-memory trie so subsequent lookups take the fast path.
* mlx_lm/server.py. Three new CLI flags pass through to the LRUPromptCache constructor. Defaults preserve existing behavior.
Safety
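* Trimmed (SnapKV-style) and quantized caches are refused at the save path: the on-disk format treats offset == token count and that invariant only holds for full untrimmed prefix caches.
* Cross-model loads are rejected via a model_id check in the sidecar (refuses to load somebody else's KV state into our model).
* Writes are atomic: os.replace into the final name; two writers racing on the same key both produce identical content (by hash equality), so last-write-wins is safe.
Backwards compatibility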
Default off. When disk_cache_dir is not set (or --disk-prompt-cache-dir is omitted on the server), LRUPromptCache does not touch disk and follows the exact same code paths it did before this PR. All previously-passing tests still pass.
Tests
New TestDiskPromptCache class in tests/test_prompt_cache.py:
* test_hash_tokens_is_deterministic - same (tokens, model_id) -> same 64-char hex.
* test_hash_tokens_changes_on_model_or_tokens - changes when model_id, a token, or list length changes.
* test_disk_index_save_load_roundtrip - DiskPromptCacheIndex.put then .get returns bit-equal KV arrays.
* test_disk_index_skips_below_min_tokens - .put returns False and no files land on disk.
* test_disk_index_model_id_isolation - same tokens under a different model_id is a miss.
* test_lru_prompt_cache_default_is_memory_only - disk_cache_dir=None writes nothing to disk and still hits in-memory.
* test_lru_prompt_cache_writes_through_to_disk - insert produces .safetensors + .meta.json files.
* test_lru_prompt_cache_reads_back_on_cold_start - a brand-new LRUPromptCache instance fetches the prior instance's KV state back from disk (this is the headline behavior of the feature).
* test_lru_prompt_cache_skips_disk_below_min_tokens - prompts shorter than disk_cache_min_tokens don't hit disk.
* test_disk_byte_budget_eviction - inserting two entries with a budget that fits one triggers eviction of the older entry's safetensors + sidecar.
All tests are hermetic (each uses its own tempfile.mkdtemp()).
Black + isort clean.
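The byte-budget eviction exercised by test_disk_byte_budget_eviction can be sketched as a pure function over index metadata. The dict layout and the name evict_to_budget are hypothetical, and a real index would also delete each evicted entry's safetensors file and sidecar; the policy itself, dropping the oldest last_used_at first until the total fits, is the one this PR describes.

```python
def evict_to_budget(entries, budget_bytes):
    """Drop oldest-last_used_at entries until the total size fits the budget.

    entries: dict mapping key -> {"size": int, "last_used_at": float}
             (hypothetical layout). Mutated in place.
    Returns the evicted keys, oldest first.
    """
    total = sum(e["size"] for e in entries.values())
    evicted = []
    # sorted() builds a list up front, so deleting while looping is safe.
    for key in sorted(entries, key=lambda k: entries[k]["last_used_at"]):
        if total <= budget_bytes:
            break
        total -= entries[key]["size"]
        del entries[key]
        evicted.append(key)
    return evicted


# Four ~148 MB entries against a 400 MB budget, as in the eviction probe.
entries = {i: {"size": 148_000_000, "last_used_at": float(i)} for i in range(4)}
evicted = evict_to_budget(entries, 400_000_000)
survivors = sorted(entries)  # [2, 3]
```

Four entries total 592 MB, so the two oldest are dropped to reach 296 MB, leaving indices [2, 3] as in the survivor probe reported for the 400 MB budget.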
Scope notes
The reference implementation in the companion mlx-fast repo also has an async-prefetch path (background warmup of top-K MRU entries at startup, async-load-on-large-miss). That adds a thread pool and is a separate optimisation; this PR ships only the synchronous persistence layer, which already delivers the 1477-4585x headline speedup.