fix(mllm): wire SSD cold tier onto the MLLM prefix cache #618
Open
CBribiescas wants to merge 3 commits into
--ssd-cache-dir was a silent no-op on the MLLM path used by Qwen3.5 and other VLM/hybrid models: the SSD tier was only attached to the standard Scheduler's MemoryAwarePrefixCache (scheduler.py ~1226). MLLMSchedulerConfig had no ssd field, batched._start_mllm passed none through, and the MLLM generator's MemoryAwarePrefixCache never got .set_ssd_tier().

Plumbing (additive, no-op when --ssd-cache-dir is unset):
- MLLMSchedulerConfig gains ssd_cache_dir / ssd_cache_max_gb.
- batched._start_mllm reads them off the SchedulerConfig (same fields cli.py populates) and forwards them.
- mllm_scheduler builds SSDCacheTier(SSDCacheConfig(...)), start_writer() + reconcile(), and calls set_ssd_tier() on the generator's prefix cache, mirroring the standard path. One-line startup log makes it visible.

The SSD serializer already supports ArraysCache (ssd_cache.py:544), so hybrid layers spill/promote correctly once the tier is attached.

Tests: model-free wiring tests for the batched bridge and the scheduler attach logic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
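The plumbing in this commit can be sketched roughly as follows. This is a minimal illustration and not the PR's code: every class here is a stand-in, the SSDCacheConfig field names (cache_dir, max_size_gb) are assumptions, and the helper names forward_ssd_fields and attach_ssd_tier are hypothetical. Only the two new config fields, their defaults, and the start_writer() / reconcile() / set_ssd_tier() call order come from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchedulerConfig:
    # Stand-in for the config cli.py populates; only the SSD fields are modelled.
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


@dataclass
class MLLMSchedulerConfig:
    # The two fields the PR adds, with the defaults stated in the description.
    ssd_cache_dir: Optional[str] = None
    ssd_cache_max_gb: float = 10.0


@dataclass
class SSDCacheConfig:
    # Assumed field names; the real SSDCacheConfig signature may differ.
    cache_dir: str
    max_size_gb: float


@dataclass
class SSDCacheTier:
    # Stand-in tier that only records which lifecycle calls were made.
    config: SSDCacheConfig
    calls: List[str] = field(default_factory=list)

    def start_writer(self) -> None:
        self.calls.append("start_writer")

    def reconcile(self) -> None:
        self.calls.append("reconcile")


class PrefixCache:
    """Stand-in for MemoryAwarePrefixCache; holds the attached tier, if any."""

    def __init__(self) -> None:
        self.ssd_tier: Optional[SSDCacheTier] = None

    def set_ssd_tier(self, tier: SSDCacheTier) -> None:
        self.ssd_tier = tier


def forward_ssd_fields(cfg: SchedulerConfig) -> MLLMSchedulerConfig:
    # Bridge step: copy the SSD fields across instead of dropping them.
    return MLLMSchedulerConfig(
        ssd_cache_dir=cfg.ssd_cache_dir,
        ssd_cache_max_gb=cfg.ssd_cache_max_gb,
    )


def attach_ssd_tier(
    cfg: MLLMSchedulerConfig, prefix_cache: PrefixCache
) -> Optional[SSDCacheTier]:
    # No-op when the flag is unset: no tier is constructed at all.
    if not cfg.ssd_cache_dir:
        return None
    tier = SSDCacheTier(SSDCacheConfig(cfg.ssd_cache_dir, cfg.ssd_cache_max_gb))
    tier.start_writer()
    tier.reconcile()
    prefix_cache.set_ssd_tier(tier)
    return tier
```

Keeping the unset case as an early return is what makes the change additive: servers started without --ssd-cache-dir never touch the tier classes.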
Resolves CI lint failure on the first run of fix/mllm-ssd-wire-cold-tier by reformatting tests/test_mllm_ssd_spill.py to match the project's black config.
Missed the fourth file in the earlier black sweep on 7bbd89a; CI lint still failed because tests/test_batched_engine_mllm_config.py had non-black formatting. Single-line whitespace tweak.
Summary
--ssd-cache-dir is silently a no-op on the MLLM/VLM path used by Qwen 3.5, Gemma-4 VL, and other vision/hybrid models. The SSD tier was only attached to the standard Scheduler's MemoryAwarePrefixCache (scheduler.py ~1226). MLLMSchedulerConfig had no SSD field, BatchedEngine._start_mllm passed nothing through, and the MLLM generator's MemoryAwarePrefixCache never had .set_ssd_tier() called on it. RAM evictions on MLLM models went straight to /dev/null.

This PR mirrors the standard-path wiring on the MLLM side. Additive, no-op when --ssd-cache-dir is unset.

The change (36 production lines)
vllm_mlx/engine/batched.py (+6)
Read ssd_cache_dir + ssd_cache_max_gb off the SchedulerConfig (the same fields cli.py already populates for the standard path) and forward them into the MLLMSchedulerConfig constructor.

vllm_mlx/mllm_scheduler.py (+30)
Two pieces:
- MLLMSchedulerConfig gains ssd_cache_dir: Optional[str] = None and ssd_cache_max_gb: float = 10.0.
- When the MLLMBatchGenerator is built, lazily construct SSDCacheTier(SSDCacheConfig(...)), call start_writer() + reconcile(), and call set_ssd_tier() on the generator's prefix_cache. Identical shape to the standard Scheduler path. One logger.info line on startup makes it visible.

tests/test_mllm_ssd_spill.py (+ test_batched_engine_mllm_config.py)
Model-free wiring tests proving:
- ssd_cache_dir and ssd_cache_max_gb are forwarded into MLLMSchedulerConfig
- set_ssd_tier() is called on the prefix cache exactly when the flag is set
- with the flag unset, no SSDCacheTier is constructed

Production verification
Tested locally against the three loaded vllm-mlx servers, with --continuous-batching --enable-prefix-cache --ssd-cache-dir <path> --ssd-cache-max-gb 33 --cache-memory-mb 6000 --kv-cache-quantization:
- gemma-4-E4B-it-MLX-4bit: /v1/cache/stats, zero disk entries at --ssd-cache-dir after days of load
- Qwen3-Coder-30B-A3B-Instruct-MLX-4bit
- gpt-oss-120b-MXFP4-Q8

The startup log now reports [mllm] SSD cache tier enabled on MLLM prefix cache: dir=…, max=33.0GB, mirroring the standard path's existing message.

Why ArraysCache support is unchanged
The serializer already handles ArraysCache (ssd_cache.py:544), so hybrid layers (the KVCache + ArraysCache mix used by Gemma-4 et al.) spill and promote correctly once the tier is attached. No new serializer work needed.

Test plan
Plus the production-server validation above.
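As an illustration of what a model-free wiring test can look like, here is a minimal sketch using unittest.mock. It does not reproduce the tests in this PR: attach_ssd_tier and its injected tier_factory argument are hypothetical stand-ins for the scheduler's attach step, and the config is a plain SimpleNamespace carrying only the two SSD fields.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import MagicMock


def attach_ssd_tier(config, prefix_cache, tier_factory):
    """Hypothetical attach step: build and attach a tier only when a dir is set."""
    if not config.ssd_cache_dir:
        return None
    tier = tier_factory(config.ssd_cache_dir, config.ssd_cache_max_gb)
    tier.start_writer()
    tier.reconcile()
    prefix_cache.set_ssd_tier(tier)
    return tier


class AttachWiringTest(unittest.TestCase):
    def test_attaches_when_flag_set(self):
        cfg = SimpleNamespace(ssd_cache_dir="/tmp/ssd", ssd_cache_max_gb=33.0)
        prefix_cache, factory = MagicMock(), MagicMock()

        tier = attach_ssd_tier(cfg, prefix_cache, factory)

        # The tier is built once from the configured values, started, and attached.
        factory.assert_called_once_with("/tmp/ssd", 33.0)
        tier.start_writer.assert_called_once_with()
        tier.reconcile.assert_called_once_with()
        prefix_cache.set_ssd_tier.assert_called_once_with(tier)

    def test_noop_when_flag_unset(self):
        cfg = SimpleNamespace(ssd_cache_dir=None, ssd_cache_max_gb=10.0)
        prefix_cache, factory = MagicMock(), MagicMock()

        self.assertIsNone(attach_ssd_tier(cfg, prefix_cache, factory))

        # Nothing is constructed and nothing is attached.
        factory.assert_not_called()
        prefix_cache.set_ssd_tier.assert_not_called()
```

Injecting the factory keeps the test free of model weights and disk I/O, so it can run in CI alongside the lint job.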
Related
QuantizedKVCache spill: same underlying tier this PR now exposes to MLLM models.
Not claimed
--ssd-cache-dir is unset.