fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy#563
Conversation
… numpy Two distinct bugs that together prevented --ssd-cache-dir from being usable on any of the top models (qwen3-coder-30b, gpt-oss-20b, gpt-oss-120b) under --continuous-batching with --enable-prefix-cache. Bug 1: SSD writer thread crashed with std::runtime_error: There is no Stream(gpu, N) in current thread. When a prefix-cache eviction triggered ssd_tier.enqueue_spill(), the mx.array references for layer.keys / layer.values were placed on a queue and picked up by a background writer thread. That writer thread then called np.array(mx_array) inside serialize_layer, which forces an MLX materialization. MLX 0.31+ ties each request's compute to a thread-local Stream(gpu, N) (where N is the request uid); the writer thread has no such stream registered, so the conversion aborts the whole process. Same root cause as issue waybarrios#496 / PR waybarrios#479, but at a new site: the SSD spill writer is its own background daemon thread, not the BatchedEngine MLLM thread waybarrios#479 fixes. Fix: add LayerSerializer.snapshot_layer(layer) -> dict, called on the producer (request handler) thread inside enqueue_spill BEFORE the queue handoff. The snapshot materializes each mx.array to numpy while the request's stream is still valid, and the writer thread now only handles numpy + disk bytes. serialize_layer's signature changes from (layer, idx, path) -> metadata to (snapshot, idx, path) -> metadata so the writer thread can't accidentally touch MLX. Bug 2: np.array(mx.array<bfloat16>) raises RuntimeError "Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1" — numpy has no native bfloat16. qwen3-coder-30b and many other modern MLX models use bf16 KV cache, so this was masked behind Bug 1 on every model that actually evicts. Fix: add ssd_cache._mx_to_numpy_safe(arr) which try/excepts the bf16 case and falls back to arr.astype(mx.float32) + mx.eval + np.array. Original dtype is recorded in the snapshot (and propagated through the manifest) so _reconstruct_ssd_layers in scheduler.py can cast back to bf16 after deserialization. fp32 storage is 2x bf16 on disk; transient 2x RAM during snapshot — acceptable for the spill path. Plus: fix 4 unit tests in test_ssd_cache.py that were failing on 0.4.0rc1 base (Bug 1 was masking these too). The MemoryCacheConfig default min_prefix_tokens=128 rejected the 10-token sequences these tests used. Tests now pass min_prefix_tokens=1 explicitly. Plus: new regression test test_snapshot_runs_on_caller_thread that uses a ThreadBoundArray fake to assert enqueue_spill must materialize layer tensors on the calling thread, not the writer thread. Runtime verified on M5 Pro Max 128GB with Qwen3-Coder-30B (MLX-4bit, bf16 KV): start vllm-mlx with --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test --ssd-cache-max-gb 5. Send 9 distinct 200+ token prompts. Before: server aborts on request 4 with Stream(gpu, 3). After: server completes all 9 requests, evictions=8 in /v1/cache/stats, 400 safetensors files ~462MB written to /tmp/test/data, no crash.
Drop the multi-paragraph docstrings on LayerSerializer / snapshot_layer / enqueue_spill / _writer_loop / _write_entry and the long test docstring explaining the threading invariant. The threading model (snapshot on producer thread, persist on writer thread) is stated once in the LayerSerializer class docstring and otherwise lives in the PR description. Kept: brief one-line reasons on _mx_to_numpy_safe (bf16 buffer mismatch), on _mx_dtype_from_name in scheduler.py (why None is acceptable), and on min_prefix_tokens=1 in the 4 fixed unit tests. No behavior change. 106 tests still pass; lint + black clean.
…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).
…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).
|
Polite refresh — this has been quiet since 05-22. The fix follows the same shape as PR #479 / issue #468 (Stream(gpu, N) thread ownership), just at the SSD spill writer site rather than the BatchedEngine MLLM thread. CI is 9/9 green and the fix is pre/post-verified on qwen3-coder-30b and gpt-oss-20b/120b. Happy to address concerns or split the bf16↔numpy fallback off if it helps narrow scope. |
Thump604
left a comment
There was a problem hiding this comment.
Reviewed the SSD spill/threading path and the dtype restore path. This looks correct to me.
What I checked:
SSDCacheTier.enqueue_spill()now snapshots each layer on the caller/request thread before handing work to the writer queue._writer_loop()/_write_entry()now receive only serializer + numpy snapshot data, so the SSD writer thread no longer needs to touch live MLX arrays.KVCacheSerializerandArraysCacheSerializerboth keep the old manifest shape plus original dtype hints when a bf16 upcast was needed.Scheduler._reconstruct_ssd_layers()casts restored arrays back when those dtype hints are present.- The new
test_snapshot_runs_on_caller_threaddirectly guards the producer-thread snapshot invariant.
Local verification on the PR worktree:
python -m pytest tests/test_ssd_cache.py tests/test_memory_cache.py -q
# 106 passed
ruff check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py --select E,F,W --ignore E402,E501,E731,F811,F841
black --check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py
git diff --check upstream/main...HEADAll passed. I did not rerun the large-model manual SSD spill matrix locally; this approval is based on code review, focused tests, CI, and the runtime evidence in the PR body.
|
Friendly merge nudge — approved 05-30, CI still 9/9 green and branch is mergeable. Happy to rebase if anything's drifted. |
waybarrios
left a comment
There was a problem hiding this comment.
Carefully checked - it is well-done
…waybarrios#579/waybarrios#594) Base pointer 015e080 -> caa8838 (also corrects the stale 395b13c note — the tree had been on 015e080 since the previous rebase). New rebase note: waybarrios#540 merged upstream with the env-clobber bug our patch waybarrios#15 fixes (kept ours wholesale; filed the fix upstream), waybarrios#579's simple.py delta restored in a dedicated commit, waybarrios#563 verified compatible with patch waybarrios#16's system_kv_ssd imports and resolves its bf16 upstream-report todo. Future work: waybarrios#233 closed upstream, waybarrios#528 already merged into base. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…waybarrios#579/waybarrios#594) Base pointer 015e080 -> caa8838 (also corrects the stale 395b13c note — the tree had been on 015e080 since the previous rebase). New rebase note: waybarrios#540 merged upstream with the env-clobber bug our patch waybarrios#15 fixes (kept ours wholesale; filed the fix upstream), waybarrios#579's simple.py delta restored in a dedicated commit, waybarrios#563 verified compatible with patch waybarrios#16's system_kv_ssd imports and resolves its bf16 upstream-report todo. Future work: waybarrios#233 closed upstream, waybarrios#528 already merged into base. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
Fixes two distinct bugs that together made
--ssd-cache-dircrash the server on any text-only model under--continuous-batching --enable-prefix-cachewhenever the prefix cache filled and triggered eviction. Affects every top model I could test (qwen3-coder-30b, gpt-oss-20b, gpt-oss-120b).Bug 1 —
std::runtime_error: There is no Stream(gpu, N) in current thread.When an eviction triggered
ssd_tier.enqueue_spill(), the livemx.arrayreferences forlayer.keys / layer.valueswere placed on a queue and consumed by a background writer thread. The writer then callednp.array(mx_array)insideserialize_layer, forcing an MLX materialization. MLX ≥ 0.31.2 ties each request's compute to a thread-localStream(gpu, N)(N == request uid); the writer thread had no such stream registered, so the conversion aborted the entire process.Same architectural class as issue #496 / PR #479, but at a new site — the SSD spill writer is its own background daemon thread, not the BatchedEngine MLLM inference thread #479 fixes.
Fix
LayerSerializer.snapshot_layer(layer) -> dict, called on the producer (request handler) thread insideenqueue_spillbefore the queue handoff. The snapshot materializes eachmx.arrayto numpy while the request's stream is still valid.serialize_layernow takes the snapshot dict instead of the live layer — the writer thread can't accidentally touch MLX because the API doesn't give it any MLX objects.Bug 2 —
RuntimeError: Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1.np.array(mx.array<bfloat16>)raises because numpy has no native bfloat16 and the mlx ≥ 0.31 buffer-protocol export confuses it. qwen3-coder-30b and many modern MLX models use bf16 KV cache, so this was masked behind Bug 1 on every model that actually evicted.Fix
ssd_cache._mx_to_numpy_safe(arr)try/excepts the bf16 case and falls back toarr.astype(mx.float32) + mx.eval + np.array."bfloat16").scheduler._reconstruct_ssd_layerscasts the restored array back to the original dtype viamx.array(numpy_fp32).astype(dt)so the model sees its native KV layout (preserves memory + avoids implicit-cast surprises in attention kernels).Plus: 4 stale unit tests fixed
test_evict_lru_calls_ssd_spill,test_check_ssd_returns_none_on_ram_hit,test_full_round_trip,test_metrics_accuracywere all failing on clean v0.4.0rc1 base. TheMemoryCacheConfig.min_prefix_tokensdefault is 128; the tests used 10-token sequences, so everystore()was silently rejected as "short prefix" and the cache stayed empty. Passmin_prefix_tokens=1explicitly. (Bug 1 was masking these too — even ifstore()worked, the writer would crash before any test could assert onspill_count.)Plus: new regression test
test_snapshot_runs_on_caller_threaduses aThreadBoundArrayfake that raises if__array__()is accessed off the creator thread, provingenqueue_spillmaterializes layers on the calling thread (not the writer). This is the architectural invariant Bug 1 violated.Verification
Host: Apple M5 Pro Max, 128 GB unified memory, macOS Tahoe.
Unit tests
Lint clean:
ruff check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py --select E,F,W --ignore E402,E501,E731,F811,F841passes.black --checkpasses.Runtime verification
Same harness for all three: start vllm-mlx with
--continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test --ssd-cache-max-gb 5, send 9 distinct ≥200-token prompts to force eviction.Qwen3-Coder-30B-A3B-Instruct-MLX-4bit(bf16 KV)Stream(gpu, 2)after request 6mlx-community/gpt-oss-20b-MXFP4-Q8Stream(gpu, 3)after request 4mlx-community/gpt-oss-120b-MXFP4-Q8Stream(gpu, 3)after request 3Related
RuntimeError: There is no Stream(gpu, 0) in current thread#496 (closed) — the original Stream(gpu, N) thread-local invariant write-up.MLXWorkerThreadfor BatchedEngine MLLM. Orthogonal; this PR doesn't depend on it and doesn't conflict (different thread, different site).Test plan
tests/test_ssd_cache.py+tests/test_memory_cache.pygreen on all matrix Pythons.vllm-mlx serve <bf16-KV model> --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test, send N>cache-capacity prompts, confirm/v1/cache/statsshowsevictions > 0, files appear under/tmp/test/data/, no process abort.🤖 Generated with Claude Code