fix(mllm): wire SSD cold tier onto the MLLM prefix cache #618

Open
CBribiescas wants to merge 3 commits into waybarrios:main from CBribiescas:fix/mllm-ssd-wire-cold-tier

Conversation

@CBribiescas

Contributor

Summary

--ssd-cache-dir is silently a no-op on the MLLM/VLM path used by Qwen 3.5, Gemma-4 VL, and other vision/hybrid models. The SSD tier was only attached to the standard Scheduler's MemoryAwarePrefixCache (scheduler.py ~1226). MLLMSchedulerConfig had no SSD field, BatchedEngine._start_mllm passed nothing through, and the MLLM generator's MemoryAwarePrefixCache never had .set_ssd_tier() called on it. As a result, entries evicted from RAM on MLLM models were discarded instead of being spilled to disk.

This PR mirrors the standard-path wiring on the MLLM side. The change is additive and a no-op when --ssd-cache-dir is unset.

The change (36 production lines)

vllm_mlx/engine/batched.py (+6)

Read ssd_cache_dir + ssd_cache_max_gb off the SchedulerConfig (the same fields cli.py already populates for the standard path) and forward them into the MLLMSchedulerConfig constructor.
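For readers who want to see the shape of that forwarding, here is a minimal sketch, not the actual diff. The two dataclasses are stripped-down stand-ins for the real SchedulerConfig and MLLMSchedulerConfig, the helper name build_mllm_config is hypothetical, and the 10.0 default on the SchedulerConfig stand-in is an assumption (the PR only states that default for MLLMSchedulerConfig).

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in for the real SchedulerConfig; only the two SSD fields are modeled.
# The 10.0 default here is an assumption made for the sketch.
@dataclass
class SchedulerConfig:
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


# Stand-in for the real MLLMSchedulerConfig with the two fields this PR adds.
@dataclass
class MLLMSchedulerConfig:
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


def build_mllm_config(scheduler_config: SchedulerConfig) -> MLLMSchedulerConfig:
    """Hypothetical helper: copy the SSD settings across unchanged.

    getattr with defaults keeps the bridge a no-op if the source config
    predates the SSD fields or the flag was never set.
    """
    return MLLMSchedulerConfig(
        ssd_cache_dir=getattr(scheduler_config, "ssd_cache_dir", None),
        ssd_cache_max_gb=getattr(scheduler_config, "ssd_cache_max_gb", 10.0),
    )
```

With --ssd-cache-dir unset, the resulting config carries ssd_cache_dir=None, which is what lets the scheduler skip tier construction entirely.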

vllm_mlx/mllm_scheduler.py (+30)

Two pieces:

  1. MLLMSchedulerConfig gains ssd_cache_dir: Optional[str] = None and ssd_cache_max_gb: float = 10.0.
  2. After MLLMBatchGenerator is built, lazily construct SSDCacheTier(SSDCacheConfig(...)), call start_writer() + reconcile(), and set_ssd_tier() on the generator's prefix_cache. Identical shape to the standard Scheduler path. One logger.info line on startup makes it visible.
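The attach step described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in mllm_scheduler.py: SSDCacheTier, SSDCacheConfig, and PrefixCache below are local stand-ins, the SSDCacheConfig field names (cache_dir, max_gb) are guesses, and attach_ssd_tier is a hypothetical function name. Only the call order (start_writer, reconcile, set_ssd_tier) and the log message format come from this PR.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("mllm_scheduler_sketch")


@dataclass
class MLLMSchedulerConfig:
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


@dataclass
class SSDCacheConfig:
    # Field names are assumptions; the real SSDCacheConfig may differ.
    cache_dir: str
    max_gb: float


class SSDCacheTier:
    """Stand-in that only records which lifecycle methods were called."""

    def __init__(self, config: SSDCacheConfig) -> None:
        self.config = config
        self.calls = []

    def start_writer(self) -> None:
        self.calls.append("start_writer")

    def reconcile(self) -> None:
        self.calls.append("reconcile")


class PrefixCache:
    """Stand-in for MemoryAwarePrefixCache; only set_ssd_tier is modeled."""

    def __init__(self) -> None:
        self.ssd_tier = None

    def set_ssd_tier(self, tier: SSDCacheTier) -> None:
        self.ssd_tier = tier


def attach_ssd_tier(config: MLLMSchedulerConfig, prefix_cache) -> Optional[SSDCacheTier]:
    """Build and attach the tier only when a directory is configured."""
    if not config.ssd_cache_dir or prefix_cache is None:
        return None  # flag unset: nothing is constructed
    tier = SSDCacheTier(
        SSDCacheConfig(cache_dir=config.ssd_cache_dir, max_gb=config.ssd_cache_max_gb)
    )
    tier.start_writer()
    tier.reconcile()
    prefix_cache.set_ssd_tier(tier)
    logger.info(
        "[mllm] SSD cache tier enabled on MLLM prefix cache: dir=%s, max=%.1fGB",
        config.ssd_cache_dir,
        config.ssd_cache_max_gb,
    )
    return tier
```

Constructing the tier lazily inside the guard is what keeps the unset case free of side effects: no writer thread is started and no directory is touched.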

tests/test_mllm_ssd_spill.py (+ test_batched_engine_mllm_config.py)

Model-free wiring tests proving:

  • The batched bridge forwards both ssd_cache_dir and ssd_cache_max_gb into MLLMSchedulerConfig
  • The scheduler's attach logic calls set_ssd_tier() on the prefix cache exactly when the flag is set
  • No-op when the flag is unset (no SSDCacheTier constructed)
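A model-free wiring test of this kind might look like the sketch below. It is not a copy of tests/test_mllm_ssd_spill.py: attach_ssd_tier is a hypothetical stand-in for the scheduler's attach step, and the tier_factory parameter is an injection point added here so that no real SSDCacheTier, model, or disk is needed.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def attach_ssd_tier(config, prefix_cache, tier_factory):
    """Hypothetical stand-in for the scheduler's attach step."""
    if not config.ssd_cache_dir:
        return None
    tier = tier_factory(config.ssd_cache_dir, config.ssd_cache_max_gb)
    tier.start_writer()
    tier.reconcile()
    prefix_cache.set_ssd_tier(tier)
    return tier


def test_attaches_when_flag_set():
    cache, factory = MagicMock(), MagicMock()
    cfg = SimpleNamespace(ssd_cache_dir="/tmp/ssd", ssd_cache_max_gb=33.0)
    tier = attach_ssd_tier(cfg, cache, factory)
    # The tier is built once with the forwarded settings and attached once.
    factory.assert_called_once_with("/tmp/ssd", 33.0)
    tier.start_writer.assert_called_once_with()
    tier.reconcile.assert_called_once_with()
    cache.set_ssd_tier.assert_called_once_with(tier)


def test_noop_when_flag_unset():
    cache, factory = MagicMock(), MagicMock()
    cfg = SimpleNamespace(ssd_cache_dir=None, ssd_cache_max_gb=10.0)
    assert attach_ssd_tier(cfg, cache, factory) is None
    # Nothing is constructed and the prefix cache is left untouched.
    factory.assert_not_called()
    cache.set_ssd_tier.assert_not_called()
```

Mocks are enough here because the claim under test is about wiring (what gets called, and when), not about spill behavior.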

Production verification

Tested locally against the three loaded vllm-mlx servers, with --continuous-batching --enable-prefix-cache --ssd-cache-dir <path> --ssd-cache-max-gb 33 --cache-memory-mb 6000 --kv-cache-quantization:

| Model | Path | Pre-fix | Post-fix |
|---|---|---|---|
| gemma-4-E4B-it-MLX-4bit | MLLM (vision/hybrid) | RAM evictions reported in /v1/cache/stats, zero disk entries at --ssd-cache-dir after days of load | 6 evictions → 6 spill files / 414 MB on disk within 11 prompts past the 6 GB budget |
| Qwen3-Coder-30B-A3B-Instruct-MLX-4bit | text (control) | spilling fine via #563 (4 disk entries) | unchanged (text path untouched) |
| gpt-oss-120b-MXFP4-Q8 | text (control) | spilling fine via #563 (1 disk entry) | unchanged (text path untouched) |

The startup log now reports [mllm] SSD cache tier enabled on MLLM prefix cache: dir=…, max=33.0GB, mirroring the standard path's existing message.

Why ArraysCache support is unchanged

The serializer already handles ArraysCache (ssd_cache.py:544), so hybrid layers (KVCache + ArraysCache mix used by Gemma-4 et al.) spill and promote correctly once the tier is attached. No new serializer work needed.

Test plan

pytest tests/test_mllm_ssd_spill.py tests/test_ssd_cache.py tests/test_batched_engine_mllm_config.py
62 passed

black --check vllm_mlx/mllm_scheduler.py vllm_mlx/engine/batched.py tests/test_mllm_ssd_spill.py
3 files would be left unchanged

Plus the production-server validation above.

Related

Not claimed

  • Does not change MLLM scheduling, request routing, or any vision-encoder path.
  • Does not change behavior when --ssd-cache-dir is unset.

CBribiescas and others added 3 commits June 14, 2026 17:07
--ssd-cache-dir was a silent no-op on the MLLM path used by Qwen3.5 and
other VLM/hybrid models: the SSD tier was only attached to the standard
Scheduler's MemoryAwarePrefixCache (scheduler.py ~1226). MLLMSchedulerConfig
had no ssd field, batched._start_mllm passed none through, and the MLLM
generator's MemoryAwarePrefixCache never got .set_ssd_tier().

Plumbing (additive, no-op when --ssd-cache-dir is unset):
- MLLMSchedulerConfig gains ssd_cache_dir / ssd_cache_max_gb.
- batched._start_mllm reads them off the SchedulerConfig (same fields cli.py
  populates) and forwards them.
- mllm_scheduler builds SSDCacheTier(SSDCacheConfig(...)), start_writer() +
  reconcile(), and calls set_ssd_tier() on the generator's prefix cache,
  mirroring the standard path. One-line startup log makes it visible.

The SSD serializer already supports ArraysCache (ssd_cache.py:544), so
hybrid layers spill/promote correctly once the tier is attached.

Tests: model-free wiring tests for the batched bridge and the scheduler
attach logic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolves CI lint failure on the first run of fix/mllm-ssd-wire-cold-tier
by reformatting tests/test_mllm_ssd_spill.py to match the project's
black config.

Missed the fourth file in the earlier black sweep on 7bbd89a; CI lint
still failed because tests/test_batched_engine_mllm_config.py had
non-black formatting. Single-line whitespace tweak.