
Add SimpleEngine prefix trie cache #574

Open
Thump604 wants to merge 1 commit into waybarrios:main from Thump604:604/prefix-trie-cache

@Thump604

Collaborator

Summary

Implements the default-off SimpleEngine prefix-trie cache requested in #567 using mlx-lm's LRUPromptCache.

Scope is intentionally narrow:

  • pure-LLM SimpleEngine.stream_chat() only
  • existing exact system-prefix snapshot path still wins first
  • LRUPromptCache.fetch_nearest_cache() is tried only after an exact snapshot miss
  • cache insertion stores the completed prompt cache with the full prompt plus generated tokens, matching mlx-lm server behavior
  • no MLLM, continuous batching, MTP, SpecPrefill, KVQ, or constrained decoding changes
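
To make the layering above concrete, here is a minimal pure-Python sketch of the lookup order: exact snapshot first, longest-prefix lookup only after an exact miss, cold path otherwise. It is not the PR's code and does not call mlx-lm. `ToyPrefixStore`, `fetch_longest_prefix`, and `lookup` are hypothetical names, and the string payloads stand in for real KV caches.

```python
from typing import Optional


class ToyPrefixStore:
    """Hypothetical stand-in for a prefix cache; payloads are plain strings."""

    def __init__(self) -> None:
        self._entries: dict[tuple[int, ...], str] = {}

    def insert(self, tokens: list[int], payload: str) -> None:
        # Mirrors the idea of storing the full prompt plus generated tokens.
        self._entries[tuple(tokens)] = payload

    def fetch_longest_prefix(self, tokens: list[int]) -> tuple[Optional[str], list[int]]:
        """Return (payload, remaining tokens) for the longest stored prefix."""
        prompt = tuple(tokens)
        best: Optional[tuple[int, ...]] = None
        for key in self._entries:
            # A key longer than the prompt can never equal the shorter slice.
            is_prefix = prompt[: len(key)] == key
            if is_prefix and (best is None or len(key) > len(best)):
                best = key
        if best is None:
            return None, list(tokens)
        return self._entries[best], list(tokens[len(best):])


def lookup(system_hash, tokens, exact_snapshots, trie):
    """Exact snapshot wins; the prefix store is tried only after an exact miss."""
    snapshot = exact_snapshots.get(system_hash)
    if snapshot is not None:
        # A real engine would also skip the system-prefix tokens; omitted here.
        return "exact", snapshot, list(tokens)
    payload, rest = trie.fetch_longest_prefix(tokens)
    if payload is not None:
        return "trie", payload, rest
    return "cold", None, list(tokens)
```

In a multi-turn chat, the second turn's prompt starts with the first turn's prompt plus its generated tokens, which is why storing the completed sequence lets the next request match a long prefix and prefill only the new tail.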

Details

New configuration:

  • --prefix-trie-cache (default off)
  • --prefix-trie-cache-size
  • --prefix-trie-cache-memory-mb

The feature remains fail-closed under the existing SimpleEngine cache guards: stop/logits processors, non-default sampling controls, MTP, SpecPrefill, max_kv_size, and non-plain KV cache classes continue to use the uncached path.
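
As an illustration of what "fail-closed" means here, the following sketch treats every listed condition as a blocker and allows the cached path only when none applies. The `ToyRequest` fields and `cache_eligible` are hypothetical names chosen for this example; the actual guards live in SimpleEngine and may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRequest:
    """Hypothetical flags mirroring the guard list in the PR description."""

    has_stop_or_logits_processors: bool = False
    has_non_default_sampling: bool = False
    uses_mtp: bool = False
    uses_specprefill: bool = False
    max_kv_size: Optional[int] = None
    plain_kv_cache: bool = True


def cache_eligible(req: ToyRequest) -> bool:
    """Fail closed: any single blocker sends the request down the uncached path."""
    blockers = (
        req.has_stop_or_logits_processors,
        req.has_non_default_sampling,
        req.uses_mtp,
        req.uses_specprefill,
        req.max_kv_size is not None,
        not req.plain_kv_cache,
    )
    return not any(blockers)
```

The point of the shape is that an unrecognized or new condition has to be added as a blocker explicitly; nothing is cached by accident.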

Stats are exposed separately under prefix_trie_cache so exact snapshot cache behavior and trie-cache behavior are not conflated.

Tests

Passed:

  • .venv/bin/python -m pytest tests/test_simple_engine.py tests/test_simple_engine_prefix_trie_cache.py tests/test_cli.py tests/test_lifecycle_cli.py tests/test_model_registry.py -q
  • .venv/bin/python -m pytest -q --ignore=tests/test_ssd_cache.py
  • .venv/bin/python -m ruff check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/cli.py vllm_mlx/lifecycle.py vllm_mlx/model_registry.py tests/test_simple_engine_prefix_trie_cache.py tests/test_cli.py tests/test_lifecycle_cli.py tests/test_model_registry.py
  • .venv/bin/python -m compileall -q vllm_mlx tests/test_simple_engine_prefix_trie_cache.py
  • git diff --check
  • /opt/ai-runtime/bin/lint-upstream-claims --root /Users/David/code/vllm-mlx ...touched files...

Full-suite note:

  • .venv/bin/python -m pytest -q fails only in tests/test_ssd_cache.py with 4 SSD cache failures.
  • The same 4 tests/test_ssd_cache.py failures reproduce on a clean upstream/main worktree at 9c83c84, so they are not introduced by this branch.

Not claimed

This PR does not claim measured production speedup yet. It adds the default-off implementation and regression coverage so performance can be measured separately on real long-prefix workloads.

@Thump604 force-pushed the 604/prefix-trie-cache branch from 56935b9 to c847950 on May 24, 2026 13:59

@janhilgard left a comment

Collaborator


Reviewed end-to-end against the design in #567 and the existing stream_chat cache machinery.

Approve rationale

  • Lookup order is the right one. Exact system-prefix snapshot still wins before any trie lookup, so the fast path (no deepcopy) is preserved for the workloads it was already serving. The trie lookup only runs after the exact-snapshot miss is logged, which is exactly the layering #567 asked for.
  • Fail-closed throughout. _fetch_prefix_trie_cache and _insert_prefix_trie_cache both wrap mlx-lm calls in try / except, increment a skips counter, and return the cold path on failure. The MLLM guard (_ensure_prefix_trie_cache returns None when _is_mllm) keeps the feature scoped to the path the PR claims.
  • Thread-safe. threading.Lock consistently wraps lazy init, fetch, insert, and the stats snapshot. Since the generation worker runs on a background thread under _run_blocking_serialized, this matters.
  • Exact-fit edge case handled. When the trie returns the entire prompt (len(trie_rest) == 0), the code uses can_trim_prompt_cache + trim_prompt_cache(..., 1) to leave one token for the decoder. Without this, mlx-lm's stream loop would have nothing to consume — easy to miss, good to see covered.
  • Stats stay separate. Exposing system_kv_cache and prefix_trie_cache as independent blocks (rather than overloading the existing snapshot stats) keeps the two cache strategies observable in isolation, which matters when reasoning about which one is actually carrying a workload.
  • Default off + opt-in CLI surface. --prefix-trie-cache, --prefix-trie-cache-size, and --prefix-trie-cache-memory-mb are plumbed through cli.py → lifecycle.py → model_registry.py → server.py → SimpleEngine.__init__, consistent with the existing patterns. Nothing should change for current users until they opt in.
  • CI is green on all matrix entries after the black fix; the earlier lint failure is resolved.
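
Two of the points above, the locked fail-closed fetch and the exact-fit trim, can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `GuardedFetcher` and `leave_one_token` are hypothetical names, the `hits` and `misses` counters are invented alongside the `skips` counter mentioned above, and token lists stand in for the real cache objects that mlx-lm's `trim_prompt_cache` would operate on.

```python
import threading
from typing import Callable, Optional


class GuardedFetcher:
    """Wraps a fetch callable so failures count as skips and fall back to cold."""

    def __init__(self, fetch_fn: Callable[[list], Optional[object]]) -> None:
        self._fetch_fn = fetch_fn
        self._lock = threading.Lock()
        self._stats = {"hits": 0, "misses": 0, "skips": 0}

    def fetch(self, tokens: list) -> Optional[object]:
        with self._lock:
            try:
                result = self._fetch_fn(tokens)
            except Exception:
                # Fail closed: record the skip and let the caller run uncached.
                self._stats["skips"] += 1
                return None
            self._stats["hits" if result is not None else "misses"] += 1
            return result

    def stats(self) -> dict:
        # Copy under the lock so readers never see a half-updated snapshot.
        with self._lock:
            return dict(self._stats)


def leave_one_token(cached_len: int, prompt: list) -> tuple:
    """If the cache covers the whole prompt, give one token back to the decoder."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    if cached_len >= len(prompt):
        cached_len = len(prompt) - 1
    return cached_len, prompt[cached_len:]
```

For a three-token prompt fully covered by the cache, `leave_one_token(3, [1, 2, 3])` keeps two cached positions and hands `[3]` to generation, which is the toy equivalent of trimming the prompt cache by one.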

Test coverage — 4 cases hit the main API surface (growing-prefix reuse, exact-snapshot-wins precedence, default-off, LRU entry bound). Solid for the public contract.

Minor follow-up suggestions (none blocking, none worth a re-review round)

  1. The token-id capture in _run_with_cache uses try: int(token); except TypeError: int(token.item()). It works today but bakes in an assumption about what mlx-lm yields. A getattr(token, "item", lambda: token)() would be a touch more robust if mlx-lm later starts yielding plain ints on the cold path.
  2. max_bytes = 1 << 63 when no memory cap is set reads as a magic sentinel; sys.maxsize would carry the same intent more explicitly.
  3. The PR honestly notes no measured production speedup yet — agreed it should land first and be measured separately. With default-off semantics the risk profile is fine. Once enabled in anger on a real workload, a follow-up that reports hit rate + tokens_saved against the original Claude Code repro in #567 would close the loop.
  4. The default --prefix-trie-cache-size 32 could hold a non-trivial amount of state on large-prefix workloads (in the #567 repro, ~40K-token prefixes on Qwen3-Coder-30B); operators may want a one-liner in the help string nudging them to set --prefix-trie-cache-memory-mb instead of relying on entry count alone on memory-constrained hosts.
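
To put suggestion 4 in numbers, here is a rough upper-bound estimate of KV memory for an entry-count-only limit. The layer count, KV-head count, and head dimension below are placeholder values chosen for illustration and were not read from any model's config; the sketch also assumes an unquantized 2-byte KV cache and that each entry holds its full KV state independently. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """Bytes of KV state per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def worst_case_cache_bytes(entries: int, tokens_per_entry: int, per_token: int) -> int:
    """Upper bound if every entry stores a full-length prefix on its own."""
    return entries * tokens_per_entry * per_token


# Placeholder dimensions (assumptions, not taken from a real config).
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)
total = worst_case_cache_bytes(entries=32, tokens_per_entry=40_000, per_token=per_token)
print(f"{per_token} bytes/token, about {total / 2**30:.1f} GiB worst case")
```

Whatever the real dimensions are, the estimate scales linearly with entry count and prefix length, which is why a byte cap is a safer bound than an entry count on memory-constrained hosts.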

None of those gate this PR. The implementation is correct, well-scoped, and ready to land.
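
For reference, suggestion 1 can be tried in isolation. The sketch below applies the `getattr` form to both a plain int and an object exposing `.item()`; `to_token_id` and `FakeScalar` are hypothetical names used only to exercise the expression without importing mlx.

```python
def to_token_id(token) -> int:
    """Use .item() when the object has one, otherwise use the value itself."""
    return int(getattr(token, "item", lambda: token)())


class FakeScalar:
    """Hypothetical array-like scalar that only supports .item()."""

    def __init__(self, value: int) -> None:
        self._value = value

    def item(self) -> int:
        return self._value


plain = to_token_id(7)               # plain Python int has no .item attribute
wrapped = to_token_id(FakeScalar(9))  # goes through .item()
```

Both inputs take a single code path here, so the helper does not depend on which exception `int()` raises for a given token type.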

@waybarrios

Owner

Heads up before you rebase: this branched from 9c83c84, before #541 replaced the single-slot system-KV snapshot with the multi-slot LRU, and #576 (snapshot helpers) just landed on top of that, so the conflict is semantic, not mechanical. The trie lookup currently hangs off the old single-slot miss branch:

# this PR: these attributes no longer exist on main
if system_hash == self._system_kv_hash:
    ...
else:
    cached = self._fetch_prefix_trie_cache(...)

On current main the miss check is against the LRU OrderedDict:

if self._system_kv_cache.get(system_hash) is None:
    cached = self._fetch_prefix_trie_cache(...)

and stop() should no longer clear _system_kv_snapshot / _system_kv_hash / _system_kv_token_count (also gone). Two more things for the rebase. First, the trie lookup is nested under has_system, so requests without a system message can never hit it: those workloads silently get zero benefit. Second, start()'s engine summary line doesn't mention prefix_trie_cache=True (the only signal is a print in cli.py), which runs afoul of the no-silent-feature-flags convention.

The mlx-lm API usage itself checked out fine: fetch_nearest_cache deep-copies so there's no aliasing, ArraysCache models fall through safely, and the LRU memory bounds are wired through. The feature is worth landing once it's rewired against the multi-slot cache.
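
One possible shape for the rewired lookup, as a toy sketch rather than a patch: the exact check reads from an LRU `OrderedDict`, and the trie fallback sits outside the system-message branch so prompts without a system message can still reach it. `ToyEngine`, `put_snapshot`, `lookup`, and the injected `trie_fetch` callable are hypothetical; only the `_system_kv_cache` name comes from the snippet above.

```python
from collections import OrderedDict
from typing import Callable, Optional


class ToyEngine:
    """Toy model of a multi-slot exact cache with a trie fallback."""

    def __init__(self, trie_fetch: Callable[[list], Optional[str]],
                 max_slots: int = 2) -> None:
        self._system_kv_cache: "OrderedDict[str, str]" = OrderedDict()
        self._max_slots = max_slots
        self._trie_fetch = trie_fetch

    def put_snapshot(self, system_hash: str, snapshot: str) -> None:
        self._system_kv_cache[system_hash] = snapshot
        self._system_kv_cache.move_to_end(system_hash)
        # Evict least recently used slots beyond the bound.
        while len(self._system_kv_cache) > self._max_slots:
            self._system_kv_cache.popitem(last=False)

    def lookup(self, tokens: list, system_hash: Optional[str] = None):
        if system_hash is not None:
            snapshot = self._system_kv_cache.get(system_hash)
            if snapshot is not None:
                self._system_kv_cache.move_to_end(system_hash)
                return "exact", snapshot
        # Reached on an exact miss and also when there is no system message.
        cached = self._trie_fetch(tokens)
        if cached is not None:
            return "trie", cached
        return "cold", None
```

Calling `lookup(tokens)` without a `system_hash` goes straight to the trie fallback, which is the behavior the has_system nesting currently rules out.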
