Add SimpleEngine prefix trie cache by Thump604 · Pull Request #574 · waybarrios/vllm-mlx

Thump604 · 2026-05-24T13:58:02Z

Summary

Implements the default-off SimpleEngine prefix-trie cache requested in #567 using mlx-lm's LRUPromptCache.

Scope is intentionally narrow:

pure-LLM SimpleEngine.stream_chat() only
existing exact system-prefix snapshot path still wins first
LRUPromptCache.fetch_nearest_cache() is tried only after an exact snapshot miss
cache insertion stores the completed prompt cache with the full prompt plus generated tokens, matching mlx-lm server behavior
no MLLM, continuous batching, MTP, SpecPrefill, KVQ, or constrained decoding changes
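The insertion rule above (store the full prompt plus generated tokens) is what lets a later turn reuse an earlier one. A minimal sketch of that effect, using a hypothetical `ToyPrefixStore` rather than mlx-lm's LRUPromptCache; it only models "longest stored key that is a prefix of the new prompt" and none of the real class's eviction or trimming behavior:

```python
class ToyPrefixStore:
    """Hypothetical stand-in for a prefix cache keyed by token sequences."""

    def __init__(self):
        self._entries = {}  # tuple of token ids -> opaque cache object

    def insert(self, tokens, cache):
        self._entries[tuple(tokens)] = cache

    def fetch_nearest(self, tokens):
        """Return (cache, remaining_tokens) for the longest stored prefix.

        Returns (None, tokens) when no stored key is a prefix of `tokens`.
        """
        best = None
        for key in self._entries:
            if len(key) <= len(tokens) and tuple(tokens[: len(key)]) == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return None, list(tokens)
        return self._entries[best], list(tokens[len(best):])


store = ToyPrefixStore()

# Turn 1: after generation, store prompt + generated tokens together.
prompt_1 = [1, 2, 3]
generated_1 = [4, 5]
store.insert(prompt_1 + generated_1, "kv-after-turn-1")

# Turn 2: the new prompt extends the turn-1 transcript, so only the
# newly appended tokens still need prefill.
prompt_2 = prompt_1 + generated_1 + [6, 7]
cache, rest = store.fetch_nearest(prompt_2)
print(cache, rest)
```

The sketch assumes the second prompt tokenizes to an exact extension of the first transcript; whether that holds for a given chat template is not something this PR description states.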

Details

New configuration:

--prefix-trie-cache (default off)
--prefix-trie-cache-size
--prefix-trie-cache-memory-mb
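As a rough illustration of how the three flags might be declared, here is an argparse sketch. The flag names come from this PR and the entry default of 32 is taken from the review discussion; the argument types, the `None` default for the memory cap, the help strings, and the MiB-to-bytes conversion in the hypothetical `max_bytes_from` helper are all assumptions, not the code in cli.py:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="toy-serve")  # hypothetical program name
    parser.add_argument(
        "--prefix-trie-cache",
        action="store_true",  # default off: absent flag means False
        help="Enable the prefix-trie prompt cache.",
    )
    parser.add_argument(
        "--prefix-trie-cache-size",
        type=int,
        default=32,
        help="Maximum number of cached entries.",
    )
    parser.add_argument(
        "--prefix-trie-cache-memory-mb",
        type=int,
        default=None,  # assumption: no cap unless set
        help="Optional memory cap; consider setting this on memory-constrained hosts.",
    )
    return parser


def max_bytes_from(memory_mb):
    """Hypothetical helper: translate the MB flag into a byte budget."""
    if memory_mb is None:
        return 1 << 63  # sentinel for 'no cap', as discussed in review
    return memory_mb * 1024 * 1024


args = build_parser().parse_args(
    ["--prefix-trie-cache", "--prefix-trie-cache-memory-mb", "2048"]
)
print(args.prefix_trie_cache, args.prefix_trie_cache_size,
      max_bytes_from(args.prefix_trie_cache_memory_mb))
```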

The feature remains fail-closed under the existing SimpleEngine cache guards: stop/logits processors, non-default sampling controls, MTP, SpecPrefill, max_kv_size, and non-plain KV cache classes continue to use the uncached path.
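The fail-closed rule can be pictured as a single predicate that names the first condition forcing the uncached path. This is a sketch only: `RequestTraits` and `first_blocker` are hypothetical names, and the real guards in SimpleEngine may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestTraits:
    """Hypothetical summary of the properties the guards look at."""
    has_stop_or_logits_processors: bool = False
    uses_non_default_sampling: bool = False
    mtp_enabled: bool = False
    spec_prefill_enabled: bool = False
    max_kv_size: Optional[int] = None
    plain_kv_cache: bool = True


def first_blocker(traits: RequestTraits) -> Optional[str]:
    """Return the first reason to skip caching, or None if caching is allowed."""
    checks = [
        (traits.has_stop_or_logits_processors, "stop/logits processors"),
        (traits.uses_non_default_sampling, "non-default sampling"),
        (traits.mtp_enabled, "MTP"),
        (traits.spec_prefill_enabled, "SpecPrefill"),
        (traits.max_kv_size is not None, "max_kv_size"),
        (not traits.plain_kv_cache, "non-plain KV cache class"),
    ]
    for blocked, reason in checks:
        if blocked:
            return reason
    return None


print(first_blocker(RequestTraits()))
print(first_blocker(RequestTraits(max_kv_size=4096)))
```

Returning a reason string rather than a bare boolean is a design choice of the sketch; it makes it easy to log why a request fell back to the uncached path.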

Stats are exposed separately under prefix_trie_cache so exact snapshot cache behavior and trie-cache behavior are not conflated.

Tests

Passed:

.venv/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_prefix_trie_cache.py tests/test_cli.py tests/test_lifecycle_cli.py tests/test_model_registry.py -q
.venv/bin/python -m pytest -q --ignore=tests/test_ssd_cache.py
.venv/bin/python -m ruff check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/cli.py vllm_mlx/lifecycle.py vllm_mlx/model_registry.py tests/test_simple_engine_prefix_trie_cache.py tests/test_cli.py tests/test_lifecycle_cli.py tests/test_model_registry.py
.venv/bin/python -m compileall -q vllm_mlx tests/test_simple_engine_prefix_trie_cache.py
git diff --check
/opt/ai-runtime/bin/lint-upstream-claims --root /Users/David/code/vllm-mlx ...touched files...

Full-suite note:

.venv/bin/python -m pytest -q fails only in tests/test_ssd_cache.py with 4 SSD cache failures.
The same 4 tests/test_ssd_cache.py failures reproduce on a clean upstream/main worktree at 9c83c84, so they are not introduced by this branch.

Not claimed

This PR does not claim measured production speedup yet. It adds the default-off implementation and regression coverage so performance can be measured separately on real long-prefix workloads.

janhilgard

Reviewed end-to-end against the design in #567 and the existing stream_chat cache machinery.

Approve rationale

Lookup order is the right one. Exact system-prefix snapshot still wins before any trie lookup, so the fast path (no deepcopy) is preserved for the workloads it was already serving. The trie lookup only runs after the exact-snapshot miss is logged, which is exactly the layering #567 asked for.
Fail-closed throughout. _fetch_prefix_trie_cache and _insert_prefix_trie_cache both wrap mlx-lm calls in try / except, increment a skips counter, and return the cold path on failure. The MLLM guard (_ensure_prefix_trie_cache returns None when _is_mllm) keeps the feature scoped to the path the PR claims.
Thread-safe. threading.Lock consistently wraps lazy init, fetch, insert, and the stats snapshot. Since the generation worker runs on a background thread under _run_blocking_serialized, this matters.
Exact-fit edge case handled. When the trie returns the entire prompt (len(trie_rest) == 0), the code uses can_trim_prompt_cache + trim_prompt_cache(..., 1) to leave one token for the decoder. Without this, mlx-lm's stream loop would have nothing to consume — easy to miss, good to see covered.
Stats stay separate. Exposing system_kv_cache and prefix_trie_cache as independent blocks (rather than overloading the existing snapshot stats) keeps the two cache strategies observable in isolation, which matters when reasoning about which one is actually carrying a workload.
Default off + opt-in CLI surface — --prefix-trie-cache, --prefix-trie-cache-size, --prefix-trie-cache-memory-mb plumbed through cli.py → lifecycle.py → model_registry.py → server.py → SimpleEngine.__init__, consistent with the existing patterns. Nothing should change for current users until they opt in.
CI is green on all matrix entries after the black fix; the earlier lint failure is resolved.
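To make the fail-closed, thread-safe, and exact-fit points concrete, here is a self-contained sketch. `GuardedPrefixCache` and both toy backends are hypothetical; the cache is modeled as a plain list with one element per cached token, and slicing off the last element stands in for the PR's `trim_prompt_cache(..., 1)` call.

```python
import threading


class GuardedPrefixCache:
    """Hypothetical wrapper: lock every access, fall back to cold on any error."""

    def __init__(self, backend):
        self._backend = backend  # must expose fetch_nearest(tokens) -> (cache, rest)
        self._lock = threading.Lock()
        self.hits = 0
        self.skips = 0

    def fetch(self, tokens):
        """Return (cache, rest); (None, tokens) means 'use the cold path'."""
        with self._lock:
            try:
                cache, rest = self._backend.fetch_nearest(tokens)
            except Exception:
                self.skips += 1  # fail closed: count it and go cold
                return None, list(tokens)
            if cache is None:
                return None, list(tokens)
            if len(rest) == 0:
                # Exact fit: the cache covers the whole prompt, so give back
                # one position to leave the generator a token to consume.
                cache = cache[:-1]
                rest = [tokens[-1]]
            self.hits += 1
            return cache, rest


class ExactBackend:
    """Toy backend that always claims the whole prompt is cached."""

    def fetch_nearest(self, tokens):
        return list(tokens), []


class BrokenBackend:
    """Toy backend that always fails."""

    def fetch_nearest(self, tokens):
        raise RuntimeError("simulated cache failure")


exact = GuardedPrefixCache(ExactBackend())
broken = GuardedPrefixCache(BrokenBackend())
exact_result = exact.fetch([7, 8, 9])
broken_result = broken.fetch([1, 2])
print(exact_result, broken_result, broken.skips)
```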

Test coverage — 4 cases hit the main API surface (growing-prefix reuse, exact-snapshot-wins precedence, default-off, LRU entry bound). Solid for the public contract.

Minor follow-up suggestions (none blocking, none worth a re-review round)

The token-id capture in _run_with_cache uses try: int(token); except TypeError: int(token.item()). It works today but bakes in an assumption about what mlx-lm yields. A getattr(token, "item", lambda: token)() would be a touch more robust if mlx-lm later starts yielding plain ints on the cold path.
max_bytes = 1 << 63 when no memory cap is set reads as a magic sentinel; sys.maxsize would carry the same intent more explicitly.
The PR honestly notes no measured production speedup yet — agreed it should land first and be measured separately. With default-off semantics the risk profile is fine. Once enabled in anger on a real workload, a follow-up that reports hit rate + tokens_saved against the original Claude Code repro in #567 would close the loop.
The default --prefix-trie-cache-size 32 could hold a non-trivial amount of state on large-prefix workloads (in the #567 repro, ~40K-token prefixes on Qwen3-Coder-30B); operators may want a one-liner in the help string nudging them to set --prefix-trie-cache-memory-mb instead of relying on entry count alone on memory-constrained hosts.

None of those gate this PR. The implementation is correct, well-scoped, and ready to land.
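For reference, the token-id suggestion above can be exercised in isolation. `FakeScalar` is a hypothetical stand-in for an array scalar that exposes `.item()`; nothing here imports mlx or claims what mlx-lm actually yields.

```python
class FakeScalar:
    """Hypothetical array-like scalar exposing .item(), like many array libraries."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def token_id(token) -> int:
    # Use .item() when the object has it; otherwise use the value directly.
    # Plain Python ints have no .item attribute, so they take the fallback.
    return int(getattr(token, "item", lambda: token)())


print(token_id(42), token_id(FakeScalar(42)))
```

The outer `int(...)` is an addition of this sketch so that both branches return a plain Python int.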

waybarrios · 2026-06-12T01:30:20Z

Heads up before you rebase: this branched from 9c83c84, before #541 replaced the single-slot system-KV snapshot with the multi-slot LRU, and #576 (snapshot helpers) just landed on top of that, so the conflict is semantic, not mechanical. The trie lookup currently hangs off the old single-slot miss branch:

# this PR: these attributes no longer exist on main
if system_hash == self._system_kv_hash:
    ...
else:
    cached = self._fetch_prefix_trie_cache(...)

On current main the miss check is against the LRU OrderedDict:

if self._system_kv_cache.get(system_hash) is None:
    cached = self._fetch_prefix_trie_cache(...)

and stop() should no longer clear _system_kv_snapshot / _system_kv_hash / _system_kv_token_count (also gone). Two more things for the rebase. First, the trie lookup is nested under has_system, so requests without a system message can never hit it and silently get zero benefit. Second, start()'s engine summary line doesn't mention prefix_trie_cache=True (the only signal is a print in cli.py), which runs into the no-silent-feature-flags convention.

The mlx-lm API usage itself checked out fine: fetch_nearest_cache deep-copies so there's no aliasing, ArraysCache models fall through safely, and the LRU memory bounds are wired through. The feature is worth landing once it's rewired against the multi-slot cache.
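One possible shape for the rewired lookup, sketched with stand-ins. `SnapshotLRU`, `pick_cache`, and `trie_fetch` are hypothetical names; the real multi-slot cache on main may not bump recency on read, and this is not the code from #541 or #576. The point it illustrates is that the trie lookup sits outside the has_system branch, so requests without a system message can still hit it.

```python
from collections import OrderedDict


class SnapshotLRU:
    """Hypothetical multi-slot snapshot cache backed by an OrderedDict."""

    def __init__(self, max_slots):
        self._slots = OrderedDict()
        self._max_slots = max_slots

    def get(self, key):
        value = self._slots.get(key)
        if value is not None:
            self._slots.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._slots[key] = value
        self._slots.move_to_end(key)
        while len(self._slots) > self._max_slots:
            self._slots.popitem(last=False)  # evict least recently used


def pick_cache(has_system, system_hash, tokens, snapshots, trie_fetch):
    """Exact snapshot first; trie lookup on a miss or when there is no system message."""
    if has_system:
        snapshot = snapshots.get(system_hash)
        if snapshot is not None:
            return "exact_snapshot", snapshot
    cached = trie_fetch(tokens)
    if cached is not None:
        return "prefix_trie", cached
    return "cold", None


snapshots = SnapshotLRU(max_slots=2)
snapshots.put("a", "A")
snapshots.put("b", "B")
snapshots.get("a")       # touch "a" so "b" becomes the eviction candidate
snapshots.put("c", "C")  # evicts "b"

print(pick_cache(True, "a", [1], snapshots, lambda t: "kv"))
print(pick_cache(False, None, [1, 2], snapshots, lambda t: "kv"))
print(pick_cache(True, "b", [1], snapshots, lambda t: None))
```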

Thump604 requested a review from janhilgard May 24, 2026 13:58

Thump604 assigned janhilgard May 24, 2026

Thump604 mentioned this pull request May 24, 2026

Feature request: use mlx-lm LRUPromptCache for intra-session prefix reuse alongside the snapshot LRU #567

Open

Add SimpleEngine prefix trie cache

c847950

Thump604 force-pushed the 604/prefix-trie-cache branch from 56935b9 to c847950 May 24, 2026 13:59

janhilgard approved these changes May 24, 2026

Thump604 assigned waybarrios May 24, 2026

Thump604 requested a review from waybarrios May 24, 2026 14:33

Thump604 wants to merge 1 commit into waybarrios:main from Thump604:604/prefix-trie-cache
