Add SimpleEngine prefix trie cache#574
Conversation
56935b9 to
c847950
Compare
janhilgard
left a comment
There was a problem hiding this comment.
Reviewed end-to-end against the design in #567 and the existing stream_chat cache machinery.
Approve rationale
- Lookup order is the right one. Exact system-prefix snapshot still wins before any trie lookup, so the fast path (no deepcopy) is preserved for the workloads it was already serving. The trie lookup only runs after the exact-snapshot miss is logged, which is exactly the layering #567 asked for.
- Fail-closed throughout.
_fetch_prefix_trie_cacheand_insert_prefix_trie_cacheboth wrap mlx-lm calls intry / except, increment askipscounter, and return the cold path on failure. The MLLM guard (_ensure_prefix_trie_cachereturnsNonewhen_is_mllm) keeps the feature scoped to the path the PR claims. - Thread-safe.
threading.Lockconsistently wraps lazy init, fetch, insert, and the stats snapshot. Since the generation worker runs on a background thread under_run_blocking_serialized, this matters. - Exact-fit edge case handled. When the trie returns the entire prompt (
len(trie_rest) == 0), the code usescan_trim_prompt_cache+trim_prompt_cache(..., 1)to leave one token for the decoder. Without this, mlx-lm's stream loop would have nothing to consume — easy to miss, good to see covered. - Stats stay separate. Exposing
system_kv_cacheandprefix_trie_cacheas independent blocks (rather than overloading the existing snapshot stats) keeps the two cache strategies observable in isolation, which matters when reasoning about which one is actually carrying a workload. - Default off + opt-in CLI surface —
--prefix-trie-cache,--prefix-trie-cache-size,--prefix-trie-cache-memory-mbplumbed throughcli.py→lifecycle.py→model_registry.py→server.py→SimpleEngine.__init__, consistent with the existing patterns. Nothing should change for current users until they opt in. - CI is green on all matrix entries after the black fix; the earlier lint failure is resolved.
Test coverage — 4 cases hit the main API surface (growing-prefix reuse, exact-snapshot-wins precedence, default-off, LRU entry bound). Solid for the public contract.
Minor follow-up suggestions (none blocking, none worth a re-review round)
- The token-id capture in
_run_with_cacheusestry: int(token); except TypeError: int(token.item()). It works today but bakes in an assumption about what mlx-lm yields. Agetattr(token, "item", lambda: token)()would be a touch more robust if mlx-lm later starts yielding plainints on the cold path. max_bytes = 1 << 63when no memory cap is set reads as a magic sentinel;sys.maxsizewould carry the same intent more explicitly.- The PR honestly notes no measured production speedup yet — agreed it should land first and be measured separately. With default-off semantics the risk profile is fine. Once enabled in anger on a real workload, a follow-up that reports hit rate + tokens_saved against the original Claude Code repro in #567 would close the loop.
- The default
--prefix-trie-cache-size 32could hold a non-trivial amount of state on large-prefix workloads (in the #567 repro, ~40K-token prefixes on Qwen3-Coder-30B); operators may want a one-liner in the help string nudging them to set--prefix-trie-cache-memory-mbinstead of relying on entry count alone on memory-constrained hosts.
None of those gate this PR. The implementation is correct, well-scoped, and ready to land.
|
Heads up before you rebase: this branched from 9c83c84, before #541 replaced the single-slot system-KV snapshot with the multi-slot LRU, and #576 (snapshot helpers) just landed on top of that, so the conflict is semantic, not mechanical. The trie lookup currently hangs off the old single-slot miss branch: # this PR: these attributes no longer exist on main
if system_hash == self._system_kv_hash:
...
else:
cached = self._fetch_prefix_trie_cache(...)On current main the miss check is against the LRU OrderedDict: if self._system_kv_cache.get(system_hash) is None:
cached = self._fetch_prefix_trie_cache(...)and The mlx-lm API usage itself checked out fine: |
Summary
Implements the default-off SimpleEngine prefix-trie cache requested in #567 using mlx-lm's
LRUPromptCache.Scope is intentionally narrow:
SimpleEngine.stream_chat()onlyLRUPromptCache.fetch_nearest_cache()is tried only after an exact snapshot missDetails
New configuration:
--prefix-trie-cache(default off)--prefix-trie-cache-size--prefix-trie-cache-memory-mbThe feature remains fail-closed under the existing SimpleEngine cache guards: stop/logits processors, non-default sampling controls, MTP, SpecPrefill,
max_kv_size, and non-plain KV cache classes continue to use the uncached path.Stats are exposed separately under
prefix_trie_cacheso exact snapshot cache behavior and trie-cache behavior are not conflated.Tests
Passed:
.venv/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_prefix_trie_cache.py tests/test_cli.py tests/test_lifecycle_cli.py tests/test_model_registry.py -q.venv/bin/python -m pytest -q --ignore=tests/test_ssd_cache.py.venv/bin/python -m ruff check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/cli.py vllm_mlx/lifecycle.py vllm_mlx/model_registry.py tests/test_simple_engine_prefix_trie_cache.py tests/test_cli.py tests/test_lifecycle_cli.py tests/test_model_registry.py.venv/bin/python -m compileall -q vllm_mlx tests/test_simple_engine_prefix_trie_cache.pygit diff --check/opt/ai-runtime/bin/lint-upstream-claims --root /Users/David/code/vllm-mlx ...touched files...Full-suite note:
.venv/bin/python -m pytest -qfails only intests/test_ssd_cache.pywith 4 SSD cache failures.tests/test_ssd_cache.pyfailures reproduce on a cleanupstream/mainworktree at9c83c84, so they are not introduced by this branch.Not claimed
This PR does not claim measured production speedup yet. It adds the default-off implementation and regression coverage so performance can be measured separately on real long-prefix workloads.