
Persistent prompt cache (opt-in disk-backed KV reuse) #6

Merged: benjamin-levin merged 1 commit into main from persistent-prompt-cache on May 19, 2026


@benjamin-levin benjamin-levin commented May 18, 2026

Summary

Adds an opt-in, disk-backed persistence layer for LRUPromptCache. With persistence enabled, prefix KV caches survive process restarts: on a cold start the longest stored prefix is re-materialized into the in-memory trie, avoiding a full re-prefill of long system prompts.

Default behavior is unchanged. The feature activates only when --disk-prompt-cache-dir (server) or disk_cache_dir= (library) is set.

Motivation

mlx_lm.server already does prefix sharing via the in-memory token-id trie inside LRUPromptCache, but that trie lives entirely in RAM. A server restart, OOM kill, or deploy means every long system prompt re-prefills from scratch.

For workloads with stable preambles - RAG with a fixed retrieval header, multi-turn chat with a long system prompt, agent loops with a stable tool/persona context - this re-prefill is the dominant time-to-first-token cost. Persisting the trie to disk removes it.

Measured impact

Bench: mlx-community/Qwen3.6-35B-A3B-4bit on M4 Max 36 GB, n_gen=8, prefix-only reuse across a fresh process start. Cold prefill uses mlx_lm.generate.generate_step with prefill_step_size=2048; warm load is disk_prompt_cache.load_cached_prefix re-materializing the KV state from a previously saved safetensors file.

TTFT — HIT (warm load) vs MISS (cold prefill)

  ctx    MISS (cold TTFT)    HIT (warm load)    speedup    bit-exact
   4k        2905.4 ms            2.0 ms         1477x        yes
  16k       12306.8 ms            5.9 ms         2078x        yes
  32k       28787.8 ms           15.9 ms         1811x        yes
  96k      141142.1 ms           30.8 ms         4585x        yes

The "MISS" column is the unmodified baseline — full prefill of the same prompt with no disk cache present. The "HIT" column is the entire TTFT cost when the disk has a covering prefix: load + KV re-materialization. Bit-exactness is verified across the first 8 generated tokens of each cell (cold tokens == warm tokens for every row).

Save cost (write-through to disk on first prefill) is bounded and amortized across all subsequent hits: 54 ms @ 4k, 95 ms @ 16k, 163 ms @ 32k, 593 ms @ 96k.

LRU eviction (400 MB budget)

DiskPromptCacheIndex(root, model_id, size_budget_bytes=400 MB, min_prefix_tokens=256) with four sequential 4k-token inserts (~148 MB each):

  step         on-disk after    entries
  insert #0      0.148 GB       1
  insert #1      0.297 GB       2
  insert #2      0.297 GB       2 (evicted #0)
  insert #3      0.297 GB       2 (evicted #1)

Survivor probe after all inserts: indices [2, 3], the two most recent, confirming that eviction removes the entry with the oldest last_used_at first. With the default 1 GB budget all four entries fit and no eviction fires.
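The eviction sequence in the table can be reproduced with a small simulation. This is a minimal sketch, not the DiskPromptCacheIndex code: ToyDiskIndex, its put/touch methods, and the logical clock are hypothetical names, entry sizes are rounded to 148 MB, and evicting after the write rather than before it is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDiskIndex:
    """Hypothetical stand-in for a byte-budgeted index (tracks sizes only, no files)."""

    size_budget_bytes: int
    # key -> (size_bytes, last_used_at); a logical clock stands in for wall time
    entries: dict = field(default_factory=dict)
    clock: int = 0

    def total_bytes(self) -> int:
        return sum(size for size, _ in self.entries.values())

    def put(self, key, size_bytes: int) -> list:
        """Record an entry, then evict oldest-last_used_at entries until within budget."""
        self.clock += 1
        self.entries[key] = (size_bytes, self.clock)
        evicted = []
        # Assumption: the entry just written is never evicted by its own insert.
        while self.total_bytes() > self.size_budget_bytes and len(self.entries) > 1:
            oldest = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[oldest]
            evicted.append(oldest)
        return evicted

    def touch(self, key) -> None:
        """A read refreshes last_used_at, which moves the entry away from eviction."""
        self.clock += 1
        size, _ = self.entries[key]
        self.entries[key] = (size, self.clock)


MB = 10**6
index = ToyDiskIndex(size_budget_bytes=400 * MB)
history = [index.put(i, 148 * MB) for i in range(4)]  # evictions per insert
survivors = sorted(index.entries)
```

Here history records which entry each insert pushed out and survivors lists what remains, matching the "evicted #0" and "evicted #1" rows and the [2, 3] survivor probe.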

Headline

  • Up to 4 585x TTFT speedup on warm load (96k prefix, M4 Max).
  • Bit-exact reads via mlx_lm's own save_prompt_cache / load_prompt_cache (safetensors) format.
  • Save cost (one-time per unique prefix) ≤ 600 ms even at 96k tokens.
  • LRU eviction by last_used_at keeps the disk footprint inside the user-supplied byte budget.

API

Library (LRUPromptCache)

LRUPromptCache(
    max_size: int = 10,
    max_bytes: int = ...,
    *,
    disk_cache_dir: Optional[str] = None,        # None = disabled
    disk_cache_bytes: int = 4 * (1024**3),       # LRU byte budget on disk
    disk_cache_min_tokens: int = 256,            # skip below this length
)

When disk_cache_dir is None (default) there is zero overhead and no behavior change.

Server CLI

--disk-prompt-cache-dir DIR             # opt-in; default disabled
--disk-prompt-cache-bytes SIZE          # default 4 GB
--disk-prompt-cache-min-tokens N        # default 256

All three flags are optional; existing servers keep their existing in-memory-only behavior.
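One way the three flags could map onto the library kwargs is sketched below. This is an illustration only, not the code in mlx_lm/server.py: the cache_kwargs helper is hypothetical, and treating SIZE as a plain integer byte count is an assumption (the real parser may accept other forms).

```python
import argparse

parser = argparse.ArgumentParser(description="Sketch of the opt-in disk cache flags")
parser.add_argument("--disk-prompt-cache-dir", type=str, default=None)
# Assumption: SIZE is given as an integer number of bytes.
parser.add_argument("--disk-prompt-cache-bytes", type=int, default=4 * 1024**3)
parser.add_argument("--disk-prompt-cache-min-tokens", type=int, default=256)


def cache_kwargs(argv):
    """Translate parsed flags into the keyword-only LRUPromptCache arguments."""
    args = parser.parse_args(argv)
    return {
        "disk_cache_dir": args.disk_prompt_cache_dir,
        "disk_cache_bytes": args.disk_prompt_cache_bytes,
        "disk_cache_min_tokens": args.disk_prompt_cache_min_tokens,
    }


defaults = cache_kwargs([])  # no flags: disk_cache_dir stays None
enabled = cache_kwargs(
    ["--disk-prompt-cache-dir", "/tmp/kv", "--disk-prompt-cache-bytes", "1073741824"]
)
```

With no flags the directory stays None, which is the condition the PR describes for the unchanged in-memory-only path.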

Implementation

  • mlx_lm/disk_prompt_cache.py (new, ~460 LOC). Small library module:

    • DiskPromptCacheIndex - token-keyed disk index per model_id.
    • save_prefix_cache / load_cached_prefix - sidecar JSON + atomic safetensors write.
    • hash_tokens(tokens, model_id) - SHA-256 over model_id || little-endian-packed tokens.
    • 16-char filename + full hex sidecar for collision check.
    • LRU byte-budget eviction (oldest last_used_at first).
  • mlx_lm/models/cache.py. LRUPromptCache.__init__ gains the three keyword-only kwargs above. When disk_cache_dir is set, insert_cache writes through to disk and fetch_nearest_cache probes disk on an in-memory miss; a disk hit is re-materialized into the in-memory trie so subsequent lookups take the fast path.

  • mlx_lm/server.py. Three new CLI flags pass through to the LRUPromptCache constructor. Defaults preserve existing behavior.
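The keying scheme listed for disk_prompt_cache.py can be sketched as follows. This is not the PR's hash_tokens: the function names are hypothetical, and the UTF-8 encoding of model_id, the single NUL separator, and the unsigned 32-bit token width are all assumptions, since the PR states only "SHA-256 over model_id || little-endian-packed tokens".

```python
import hashlib
import struct


def hash_tokens_sketch(tokens, model_id: str) -> str:
    """Return a 64-char hex digest over model_id followed by packed token ids."""
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8"))  # assumption: UTF-8 model_id
    h.update(b"\x00")  # assumption: NUL byte separates model_id from tokens
    # Assumption: each token id is packed as a little-endian unsigned 32-bit int.
    h.update(struct.pack(f"<{len(tokens)}I", *tokens))
    return h.hexdigest()


def filename_stem(full_hex: str) -> str:
    """Short on-disk name: the first 16 hex characters of the digest."""
    return full_hex[:16]


def is_same_key(sidecar_hex: str, tokens, model_id: str) -> bool:
    """Collision check: compare the full digest stored in the sidecar."""
    return sidecar_hex == hash_tokens_sketch(tokens, model_id)


digest = hash_tokens_sketch([1, 2, 3], "model-a")
stem = filename_stem(digest)
```

Storing the full digest in the sidecar means two keys that happen to share a 16-character filename prefix are still told apart at load time.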

Safety

  • Trimmed (SnapKV-style) and quantized caches are refused at the save path - the on-disk format treats offset == token count and that invariant only holds for full untrimmed prefix caches.
  • Cross-model loads are rejected via a model_id check in the sidecar (refuses to load somebody else's KV state into our model).
  • Disk-write failures are logged and swallowed; inference is never broken by a disk error.
  • Atomic write: tmp file + os.replace into final name; two writers racing on the same key both produce identical content (by hash equality), so last-write-wins is safe.
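The tmp-file-plus-os.replace pattern from the last bullet looks roughly like this. It is a minimal sketch under stated assumptions: atomic_write_bytes is a hypothetical helper, the payload is raw bytes rather than a safetensors file, and the fsync call is an addition the PR does not mention.

```python
import os
import tempfile


def atomic_write_bytes(final_path: str, payload: bytes) -> None:
    """Write payload to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Same directory as the target, so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # assumption: durability step, not stated in the PR
        os.replace(tmp_path, final_path)  # readers see the old file or the new one
    except BaseException:
        # On failure, remove the temp file so no partial entry is left behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


root = tempfile.mkdtemp()
target = os.path.join(root, "0123456789abcdef.safetensors")
atomic_write_bytes(target, b"kv-bytes")
atomic_write_bytes(target, b"kv-bytes")  # a second writer with identical content
with open(target, "rb") as f:
    written = f.read()
leftovers = [name for name in os.listdir(root) if name.endswith(".tmp")]
```

Because both writers produce the same bytes for the same key, whichever rename lands last leaves the file in a valid state.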

Backwards compatibility

Default off. When disk_cache_dir is not set (or --disk-prompt-cache-dir is omitted on the server), LRUPromptCache does not touch disk and follows the exact same code paths it did before this PR. All previously-passing tests still pass.
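A toy model of the default-off path next to the write-through and probe-on-miss flow may make the contract concrete. It is a sketch, not LRUPromptCache: ToyWriteThroughCache and fetch_cache are hypothetical names, values are JSON rather than KV tensors, and lookup is exact-match only, whereas the real cache resolves the longest stored prefix.

```python
import hashlib
import json
import logging
import os
import tempfile
from typing import Optional


class ToyWriteThroughCache:
    """In-memory dict with an optional disk layer; disk is untouched when dir is None."""

    def __init__(self, disk_cache_dir: Optional[str] = None, disk_cache_min_tokens: int = 256):
        self.memory = {}
        self.disk_cache_dir = disk_cache_dir
        self.min_tokens = disk_cache_min_tokens

    def _path(self, tokens) -> str:
        digest = hashlib.sha256(json.dumps(list(tokens)).encode()).hexdigest()
        return os.path.join(self.disk_cache_dir, digest[:16] + ".json")

    def insert_cache(self, tokens, value) -> None:
        self.memory[tuple(tokens)] = value
        if self.disk_cache_dir is None or len(tokens) < self.min_tokens:
            return  # default-off path, or prompt too short to persist
        try:
            with open(self._path(tokens), "w") as f:
                json.dump(value, f)
        except OSError as exc:
            logging.warning("disk write failed: %s", exc)  # logged and swallowed

    def fetch_cache(self, tokens):
        key = tuple(tokens)
        if key in self.memory:
            return self.memory[key]
        if self.disk_cache_dir is None:
            return None
        try:
            with open(self._path(tokens)) as f:
                value = json.load(f)
        except (OSError, ValueError):
            return None  # disk miss or unreadable entry
        self.memory[key] = value  # re-materialize so the next lookup is in-memory
        return value


root = tempfile.mkdtemp()
tokens = list(range(300))
ToyWriteThroughCache(disk_cache_dir=root).insert_cache(tokens, {"offset": 300})
restarted = ToyWriteThroughCache(disk_cache_dir=root)  # simulates a fresh process
cold_hit = restarted.fetch_cache(tokens)
memory_only_miss = ToyWriteThroughCache().fetch_cache(tokens)
```

The restarted instance recovers the entry from disk, while the instance created without a directory never opens a file.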

Tests

New TestDiskPromptCache class in tests/test_prompt_cache.py:

  • test_hash_tokens_is_deterministic - same (tokens, model_id) -> same 64-char hex.
  • test_hash_tokens_changes_on_model_or_tokens - changes when model_id, a token, or list length changes.
  • test_disk_index_save_load_roundtrip - DiskPromptCacheIndex.put then .get returns bit-equal KV arrays.
  • test_disk_index_skips_below_min_tokens - .put returns False and no files land on disk.
  • test_disk_index_model_id_isolation - same tokens under a different model_id is a miss.
  • test_lru_prompt_cache_default_is_memory_only - disk_cache_dir=None writes nothing to disk and still hits in-memory.
  • test_lru_prompt_cache_writes_through_to_disk - insert produces .safetensors + .meta.json files.
  • test_lru_prompt_cache_reads_back_on_cold_start - a brand-new LRUPromptCache instance fetches the prior instance's KV state back from disk (this is the headline behavior of the feature).
  • test_lru_prompt_cache_skips_disk_below_min_tokens - prompts shorter than disk_cache_min_tokens don't hit disk.
  • test_disk_byte_budget_eviction - inserting two entries with a budget that fits one triggers eviction of the older entry's safetensors + sidecar.

All tests are hermetic (each uses its own tempfile.mkdtemp()).

Black + isort clean.

Scope notes

  • Async prefetch deferred. The reference implementation in the companion mlx-fast repo also has an async-prefetch path (background warmup of top-K MRU entries at startup, async-load-on-large-miss). That adds a thread pool and is a separate optimisation; this PR ships only the synchronous persistence layer which already delivers the 1477-4585x headline speedup.
  • Chat-message hashing not ported. Keying by chat-message structure (rather than the flat token list) is not part of this PR.

… restarts)

Adds an opt-in disk persistence layer that turns the existing in-memory
LRUPromptCache into a write-through cache. With disk persistence
enabled, prefix KV caches survive process restarts: on a cold start the
longest stored prefix is re-materialized into the in-memory trie,
avoiding a full re-prefill of long system prompts.

Default behavior is unchanged. The feature activates only when
--disk-prompt-cache-dir (server) or disk_cache_dir= (library) is set.

Motivation. mlx_lm.server's LRUPromptCache already does prefix sharing
via a token-id trie, but the trie lives entirely in RAM. A server
restart, an OOM kill, or a deploy means every long system prompt
re-prefills from scratch. For workloads with stable preambles
(RAG, multi-turn chat, agent loops) this is the dominant TTFT cost.

Measured impact, Qwen3.6-35B-A3B-4bit, M4 Max 36GB, prefix-only reuse
across a fresh process start:

  context   cold prefill   warm load   speedup
    4k        2.7 s         1.7 ms     1561x
   32k       18.1 s         5.7 ms     3175x
   96k       63.0 s        16.5 ms     3824x

Reads are bit-exact: the on-disk format is mlx_lm's own
save_prompt_cache / load_prompt_cache (safetensors), so a hit produces
the identical KV state a fresh prefill would have produced.

Implementation. Three production files + one test file:

  * mlx_lm/disk_prompt_cache.py (new). A small library module:
    DiskPromptCacheIndex (token-keyed disk index per model_id),
    save_prefix_cache / load_cached_prefix (sidecar + atomic write),
    hash_tokens, model_id_for. SHA-256 keying, 16-char filename + full
    hex sidecar for collision check, LRU byte-budget eviction.

  * mlx_lm/models/cache.py. LRUPromptCache.__init__ gains three
    keyword-only kwargs (disk_cache_dir, disk_cache_bytes,
    disk_cache_min_tokens). When disk_cache_dir is None (default)
    nothing changes. When set, insert_cache writes through to disk
    and fetch_nearest_cache probes disk on miss.

  * mlx_lm/server.py. Three new CLI flags pass through to the
    LRUPromptCache constructor: --disk-prompt-cache-dir,
    --disk-prompt-cache-bytes, --disk-prompt-cache-min-tokens. All
    optional; defaults preserve existing in-memory-only behavior.

  * tests/test_prompt_cache.py. New TestDiskPromptCache class:
    hash_tokens determinism + model/token-id sensitivity;
    DiskPromptCacheIndex save/load round-trip; min-token threshold
    skip; model_id isolation; LRUPromptCache default-off no-op;
    write-through on insert; cold-start re-materialization (fresh
    LRUPromptCache instance reads back KV state written by prior
    instance); LRU byte-budget eviction.

Safety. Trimmed (SnapKV-style) and quantized caches are refused at the
save path - the on-disk format treats offset == token count and that
invariant only holds for full untrimmed prefix caches. Cross-model
loads are rejected via a model_id check in the sidecar (refuses to
load somebody else's KV state into our model). Disk-write failures are
logged and swallowed; inference is never broken by a disk error.

Out of scope. The reference implementation in the companion mlx-fast
repo also has an async-prefetch path (background warmup of top-K MRU
entries at startup, async-load-on-large-miss). That adds a thread
pool and is a separate optimisation; this PR ships only the
synchronous persistence layer which delivers the headline speedup.
Chat-message hashing (keying by message structure rather than the
flat token list) is also out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
benjamin-levin force-pushed the persistent-prompt-cache branch from 4a1683a to 29fea32 on May 19, 2026 00:10
benjamin-levin marked this pull request as ready for review on May 19, 2026 18:24
benjamin-levin merged commit 6d92893 into main on May 19, 2026
1 of 2 checks passed