Fix batch quantized rotating cache decode by Thump604 · Pull Request #1 · Thump604/mlx-qsdpa

Thump604 · 2026-04-29T14:39:20Z

Summary

Fixes the batch quantized rotating cache path used by MLLM continuous batching when KV quantization and bounded KV caches are enabled together.

Three bugs were exposed by a real Qwen 3.6 VLM media-path smoke test run with --max-kv-size 65536 --kv-cache-quantization --kv-cache-quantization-bits 8:

QuantizedRotatingSDPACache did not expose merge(), so the MLLM per-request cache merge failed immediately.
BatchQuantizedRotatingSDPACache.make_mask() built the mask width from the logical offset. Multimodal position progress can exceed the visible K/V buffer length, which caused mask/K shape mismatches.
BatchQuantizedRotatingSDPACache.merge() returns compact visible buffers, but decode writes did not grow those buffers before appending the next token.
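
To make the second and third bugs concrete, here is a minimal NumPy sketch of the two invariants involved. It is not the mlx-qsdpa code: `causal_mask` and `append_token` are hypothetical helpers, the growth `step` is an arbitrary choice, and quantization and rotation (wrap-around at the maximum KV size) are deliberately left out. It only shows that the mask width should follow the number of keys actually held, and that a compact buffer must grow before the next token is written.

```python
import numpy as np


def causal_mask(query_len: int, visible_len: int) -> np.ndarray:
    """Boolean mask of shape (query_len, visible_len); True means "may attend".

    The width is taken from the number of keys held in the buffer,
    not from a logical position counter that may have run ahead of it.
    """
    # The queries are assumed to be the last `query_len` visible positions.
    q_pos = np.arange(visible_len - query_len, visible_len)[:, None]
    k_pos = np.arange(visible_len)[None, :]
    return k_pos <= q_pos


def append_token(keys: np.ndarray, new_key: np.ndarray, used: int, step: int = 4):
    """Write one token at slot `used` of `keys` (batch, capacity, dim).

    If the buffer is already full (as a compact merged buffer is), it is
    grown by `step` zero-filled slots before the write.
    """
    batch, capacity, dim = keys.shape
    if used >= capacity:
        pad = np.zeros((batch, step, dim), dtype=keys.dtype)
        keys = np.concatenate([keys, pad], axis=1)
    keys[:, used, :] = new_key  # in-place write when no growth was needed
    return keys, used + 1


# Hypothetical scenario: the logical offset (12) is ahead of the 5 visible keys.
logical_offset = 12
keys = np.ones((2, 5, 8), dtype=np.float32)  # compact buffer, fully used
visible_len = keys.shape[1]

mask = causal_mask(1, visible_len)  # width 5, matching keys, not 12
new_key = np.full((2, 8), 2.0, dtype=np.float32)
keys, used = append_token(keys, new_key, used=visible_len)
```

Building the mask from `logical_offset` here would give a width of 12 against 5 keys, which is the kind of shape mismatch described above. Writing to slot 5 without growing first would index past the end of the compact buffer.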

Validation

python -m pytest tests -q -> 95 passed
Local runtime media-path smoke test after vendoring the wheel: Qwen 3.6 35B VLM returned routing=mllm_batch, kv_quantize=true, and a non-empty image description.
Local sustained media-path run: 600 MLLM requests at concurrency 3 with --disable-prefix-cache --no-memory-aware-cache --kv-cache-quantization --kv-cache-quantization-bits 8. All requests validated with no server errors, the prefix cache stayed at zero entries/zero MB, and Metal active memory returned to 41.0 GB with a 43.59 GB peak.

Notes

This is part of the issue #442 follow-up in vllm-mlx. It does not by itself prove every long-run memory report is closed, but it removes the MLLM media-path crashes found while validating the cache-off control fix.

fix: repair batch quantized rotating cache decode

dc8bc37

This was referenced Apr 29, 2026

vllm-mlx / MLX backend grows wired/Metal memory over time and eventually aborts with kIOGPUCommandBufferCallbackErrorOutOfMemory waybarrios/vllm-mlx#442

Closed

Fix MLLM prefix cache disable wiring waybarrios/vllm-mlx#469

Merged

refactor: split qsdpa cache and benchmark modules

d51c93a

Thump604 wants to merge 2 commits into main from fix/rotating-cache-mllm-batch
