fix(mllm): wire SSD cold tier onto the MLLM prefix cache by CBribiescas · Pull Request #618 · waybarrios/vllm-mlx

CBribiescas · 2026-06-14T09:09:55Z

Summary

--ssd-cache-dir is silently a no-op on the MLLM/VLM path used by Qwen 3.5, Gemma-4 VL, and other vision/hybrid models. The SSD tier was only attached to the standard Scheduler's MemoryAwarePrefixCache (scheduler.py ~1226). MLLMSchedulerConfig had no SSD field, BatchedEngine._start_mllm passed nothing through, and the MLLM generator's MemoryAwarePrefixCache never had .set_ssd_tier() called on it. As a result, entries evicted from RAM on MLLM models were dropped instead of being spilled to disk.

This PR mirrors the standard-path wiring on the MLLM side. Additive, no-op when --ssd-cache-dir is unset.

The change (36 production lines)

`vllm_mlx/engine/batched.py` (+6)

Read ssd_cache_dir + ssd_cache_max_gb off the SchedulerConfig (the same fields cli.py already populates for the standard path) and forward them into the MLLMSchedulerConfig constructor.
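The forwarding step can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: both dataclasses are stand-ins that model only the two SSD fields, `build_mllm_config` is a hypothetical helper name, and the `10.0` default on the stand-in `SchedulerConfig` is an assumption borrowed from the `MLLMSchedulerConfig` default described in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulerConfig:
    # Stand-in for the real SchedulerConfig; only the two SSD fields are modeled.
    # The 10.0 default is an assumption made for this sketch.
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


@dataclass
class MLLMSchedulerConfig:
    # Stand-in; field names and defaults follow the PR description.
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


def build_mllm_config(scheduler_config: Optional[SchedulerConfig]) -> MLLMSchedulerConfig:
    """Hypothetical helper: copy the SSD fields across, tolerating a missing config."""
    return MLLMSchedulerConfig(
        ssd_cache_dir=getattr(scheduler_config, "ssd_cache_dir", None),
        ssd_cache_max_gb=getattr(scheduler_config, "ssd_cache_max_gb", 10.0),
    )
```

Reading the fields with `getattr` and a fallback keeps the bridge a no-op when no scheduler config, or no SSD directory, is supplied.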

`vllm_mlx/mllm_scheduler.py` (+30)

Two pieces:

MLLMSchedulerConfig gains ssd_cache_dir: Optional[str] = None and ssd_cache_max_gb: float = 10.0.
After MLLMBatchGenerator is built, lazily construct SSDCacheTier(SSDCacheConfig(...)), call start_writer() + reconcile(), and set_ssd_tier() on the generator's prefix_cache. Identical shape to the standard Scheduler path. One logger.info line on startup makes it visible.
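The attach sequence described above might look like the sketch below. The classes are stand-ins that only record calls: the real `SSDCacheConfig` constructor arguments are not shown in this PR, so `cache_dir` and `max_size_gb` are assumed names, and `attach_ssd_tier` is a hypothetical helper rather than a function from the codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SSDCacheConfig:
    # Stand-in; these field names are assumptions, not the real signature.
    cache_dir: str
    max_size_gb: float


class SSDCacheTier:
    """Stand-in that only records the lifecycle calls named in the PR."""

    def __init__(self, config: SSDCacheConfig) -> None:
        self.config = config
        self.calls = []

    def start_writer(self) -> None:
        self.calls.append("start_writer")

    def reconcile(self) -> None:
        self.calls.append("reconcile")


class PrefixCache:
    """Stand-in for the generator's MemoryAwarePrefixCache."""

    def __init__(self) -> None:
        self.ssd_tier: Optional[SSDCacheTier] = None

    def set_ssd_tier(self, tier: SSDCacheTier) -> None:
        self.ssd_tier = tier


def attach_ssd_tier(prefix_cache, ssd_cache_dir, ssd_cache_max_gb):
    """Hypothetical helper: build and attach the tier only when a directory is set."""
    if not ssd_cache_dir or prefix_cache is None:
        return None  # flag unset: nothing is constructed
    tier = SSDCacheTier(
        SSDCacheConfig(cache_dir=ssd_cache_dir, max_size_gb=ssd_cache_max_gb)
    )
    tier.start_writer()  # start the background writer first
    tier.reconcile()     # then reconcile with what is already on disk
    prefix_cache.set_ssd_tier(tier)
    return tier
```

The early return is what keeps the change inert when `--ssd-cache-dir` is unset.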

`tests/test_mllm_ssd_spill.py` (+ test_batched_engine_mllm_config.py)

Model-free wiring tests proving:

The batched bridge forwards both ssd_cache_dir and ssd_cache_max_gb into MLLMSchedulerConfig
The scheduler's attach logic calls set_ssd_tier() on the prefix cache exactly when the flag is set
No-op when the flag is unset (no SSDCacheTier constructed)
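A model-free wiring test in this spirit could be written with `unittest.mock`, as sketched below. The sketch does not reproduce the tests in this PR: `attach_ssd_tier` and its `tier_factory` parameter are hypothetical, introduced so the example runs without loading a model or touching disk.

```python
from unittest.mock import MagicMock


def attach_ssd_tier(prefix_cache, ssd_cache_dir, tier_factory):
    """Hypothetical attach helper; tier_factory stands in for building SSDCacheTier."""
    if not ssd_cache_dir:
        return None
    tier = tier_factory(ssd_cache_dir)
    prefix_cache.set_ssd_tier(tier)
    return tier


def test_attaches_when_flag_set():
    prefix_cache, factory = MagicMock(), MagicMock()
    tier = attach_ssd_tier(prefix_cache, "/tmp/ssd", factory)
    # The tier is built once and handed to the prefix cache.
    factory.assert_called_once_with("/tmp/ssd")
    prefix_cache.set_ssd_tier.assert_called_once_with(tier)


def test_noop_when_flag_unset():
    prefix_cache, factory = MagicMock(), MagicMock()
    assert attach_ssd_tier(prefix_cache, None, factory) is None
    # No tier is constructed and the prefix cache is never touched.
    factory.assert_not_called()
    prefix_cache.set_ssd_tier.assert_not_called()
```

Because every collaborator is a mock, tests of this shape check only the wiring, which is the property the three points above describe.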

Production verification

Tested locally against the three loaded vllm-mlx servers, with --continuous-batching --enable-prefix-cache --ssd-cache-dir <path> --ssd-cache-max-gb 33 --cache-memory-mb 6000 --kv-cache-quantization:

| Model | Path | Pre-fix | Post-fix |
| --- | --- | --- | --- |
| `gemma-4-E4B-it-MLX-4bit` | MLLM (vision/hybrid) | RAM evictions reported in `/v1/cache/stats`, zero disk entries at `--ssd-cache-dir` after days of load | 6 evictions → 6 spill files / 414 MB on disk within 11 prompts past the 6 GB budget |
| `Qwen3-Coder-30B-A3B-Instruct-MLX-4bit` | text (control) | spilling fine via #563 (4 disk entries) | unchanged (text path untouched) |
| `gpt-oss-120b-MXFP4-Q8` | text (control) | spilling fine via #563 (1 disk entry) | unchanged (text path untouched) |

The startup log now reports [mllm] SSD cache tier enabled on MLLM prefix cache: dir=…, max=33.0GB, mirroring the standard path's existing message.

Why ArraysCache support is unchanged

The serializer already handles ArraysCache (ssd_cache.py:544), so hybrid layers (KVCache + ArraysCache mix used by Gemma-4 et al.) spill and promote correctly once the tier is attached. No new serializer work needed.

Test plan

pytest tests/test_mllm_ssd_spill.py tests/test_ssd_cache.py tests/test_batched_engine_mllm_config.py
62 passed

black --check vllm_mlx/mllm_scheduler.py vllm_mlx/engine/batched.py tests/test_mllm_ssd_spill.py
3 files would be left unchanged

Plus the production-server validation above.

Related

PR #563 (merged), "fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy": introduced the producer-thread snapshot pattern this fix relies on.
PR #605 (merged), "fix(ssd-cache): spill native QuantizedKVCache + handle bfloat16 buffer": added native QuantizedKVCache spill, the same underlying tier this PR now exposes to MLLM models.
PR #612 (open), "fix(ssd-cache): preserve original bfloat16 dtype across quantized spill": closes the bf16 dtype round-trip issue @waybarrios flagged on #605.

Not claimed

Does not change MLLM scheduling, request routing, or any vision-encoder path.
Does not change behavior when --ssd-cache-dir is unset.

--ssd-cache-dir was a silent no-op on the MLLM path used by Qwen3.5 and other VLM/hybrid models: the SSD tier was only attached to the standard Scheduler's MemoryAwarePrefixCache (scheduler.py ~1226). MLLMSchedulerConfig had no ssd field, batched._start_mllm passed none through, and the MLLM generator's MemoryAwarePrefixCache never got .set_ssd_tier().

Plumbing (additive, no-op when --ssd-cache-dir is unset):

- MLLMSchedulerConfig gains ssd_cache_dir / ssd_cache_max_gb.
- batched._start_mllm reads them off the SchedulerConfig (same fields cli.py populates) and forwards them.
- mllm_scheduler builds SSDCacheTier(SSDCacheConfig(...)), start_writer() + reconcile(), and calls set_ssd_tier() on the generator's prefix cache, mirroring the standard path. One-line startup log makes it visible.

The SSD serializer already supports ArraysCache (ssd_cache.py:544), so hybrid layers spill/promote correctly once the tier is attached.

Tests: model-free wiring tests for the batched bridge and the scheduler attach logic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CBribiescas and others added 3 commits June 14, 2026 17:07

lint: black-format mllm SSD wiring + test file

7bbd89a

Resolves CI lint failure on the first run of fix/mllm-ssd-wire-cold-tier by reformatting tests/test_mllm_ssd_spill.py to match the project's black config.

lint: black-format test_batched_engine_mllm_config.py

5a80859

Missed the fourth file in the earlier black sweep on 7bbd89a; CI lint still failed because tests/test_batched_engine_mllm_config.py had non-black formatting. Single-line whitespace tweak.

CBribiescas wants to merge 3 commits into waybarrios:main from CBribiescas:fix/mllm-ssd-wire-cold-tier
