fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy by CBribiescas · Pull Request #563 · waybarrios/vllm-mlx

CBribiescas · 2026-05-22T05:38:48Z

Summary

Fixes two distinct bugs that together made --ssd-cache-dir crash the server on any text-only model under --continuous-batching --enable-prefix-cache whenever the prefix cache filled and triggered eviction. Affects every top model I could test (qwen3-coder-30b, gpt-oss-20b, gpt-oss-120b).

Bug 1 — `std::runtime_error: There is no Stream(gpu, N) in current thread.`

When an eviction triggered ssd_tier.enqueue_spill(), the live mx.array references for layer.keys / layer.values were placed on a queue and consumed by a background writer thread. The writer then called np.array(mx_array) inside serialize_layer, forcing an MLX materialization. MLX ≥ 0.31.2 ties each request's compute to a thread-local Stream(gpu, N) (N == request uid); the writer thread had no such stream registered, so the conversion aborted the entire process.

Same architectural class as issue #496 / PR #479, but at a new site — the SSD spill writer is its own background daemon thread, not the BatchedEngine MLLM inference thread #479 fixes.

Fix

Add LayerSerializer.snapshot_layer(layer) -> dict, called on the producer (request handler) thread inside enqueue_spill before the queue handoff. The snapshot materializes each mx.array to numpy while the request's stream is still valid.
serialize_layer now takes the snapshot dict instead of the live layer — the writer thread can't accidentally touch MLX because the API doesn't give it any MLX objects.
Both KV/RotatingKVCache and ArraysCache (Mamba) serializers implement the new method.

Bug 2 — `RuntimeError: Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1.`

np.array(mx.array<bfloat16>) raises because numpy has no native bfloat16 and the mlx ≥ 0.31 buffer-protocol export confuses it. qwen3-coder-30b and many modern MLX models use bf16 KV cache, so this was masked behind Bug 1 on every model that actually evicted.

Fix

New helper ssd_cache._mx_to_numpy_safe(arr) try/excepts the bf16 case and falls back to arr.astype(mx.float32) + mx.eval + np.array.
The original dtype is recorded in the snapshot/manifest as a string (e.g. "bfloat16").
scheduler._reconstruct_ssd_layers casts the restored array back to the original dtype via mx.array(numpy_fp32).astype(dt) so the model sees its native KV layout (preserves memory + avoids implicit-cast surprises in attention kernels).
fp32 on disk is 2× bf16, accepted as a transient trade-off for the spill path.

Plus: 4 stale unit tests fixed

test_evict_lru_calls_ssd_spill, test_check_ssd_returns_none_on_ram_hit, test_full_round_trip, test_metrics_accuracy were all failing on clean v0.4.0rc1 base. The MemoryCacheConfig.min_prefix_tokens default is 128; the tests used 10-token sequences, so every store() was silently rejected as "short prefix" and the cache stayed empty. Pass min_prefix_tokens=1 explicitly. (Bug 1 was masking these too — even if store() worked, the writer would crash before any test could assert on spill_count.)

Plus: new regression test

test_snapshot_runs_on_caller_thread uses a ThreadBoundArray fake that raises if __array__() is accessed off the creator thread, proving enqueue_spill materializes layers on the calling thread (not the writer). This is the architectural invariant Bug 1 violated.

Verification

Host: Apple M5 Pro Max, 128 GB unified memory, macOS Tahoe.

Unit tests

tests/test_ssd_cache.py — 59 passed (was 54 passed + 4 failed; +1 new test)
tests/test_memory_cache.py — 47 passed (non-regression)

Lint clean: ruff check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py --select E,F,W --ignore E402,E501,E731,F811,F841 passes. black --check passes.

Runtime verification

Same harness for all three: start vllm-mlx with --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test --ssd-cache-max-gb 5, send 9 distinct ≥200-token prompts to force eviction.

Model	Pre-fix	Post-fix
`Qwen3-Coder-30B-A3B-Instruct-MLX-4bit` (bf16 KV)	crashes `Stream(gpu, 2)` after request 6	✅ 400 spill files / 462 MB, 8 evictions, no crash
`mlx-community/gpt-oss-20b-MXFP4-Q8`	crashes `Stream(gpu, 3)` after request 4	✅ 130 spill files / 115 MB, 5 evictions, 8/8 hits on replay
`mlx-community/gpt-oss-120b-MXFP4-Q8`	crashes `Stream(gpu, 3)` after request 3	✅ 266 spill files / 242 MB, 7 evictions, 8/8 hits on replay

Issue Vision requests fail with RuntimeError: There is no Stream(gpu, 0) in current thread #496 (closed) — the original Stream(gpu, N) thread-local invariant write-up.
Issue Define stronger MLX generation stream ownership invariant #468 (open) — "Define stronger MLX generation stream ownership invariant" — this PR is one site of the broader work.
PR fix: run BatchedEngine MLLM on dedicated MLXWorkerThread to prevent cross-thread stream errors #479 (open) — MLXWorkerThread for BatchedEngine MLLM. Orthogonal; this PR doesn't depend on it and doesn't conflict (different thread, different site).
PR fix: harden SSD cache review follow-ups #481 (merged) — addressed SQLite/locking concerns from issue Track unresolved SSD cache tiering review concerns from PR #309 #480 but did not touch the writer-thread stream crash.

Test plan

CI: tests/test_ssd_cache.py + tests/test_memory_cache.py green on all matrix Pythons.
Manual: vllm-mlx serve <bf16-KV model> --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test, send N>cache-capacity prompts, confirm /v1/cache/stats shows evictions > 0, files appear under /tmp/test/data/, no process abort.
Manual: replay the first prompt after eviction, confirm SSD reload path runs without dtype errors.

🤖 Generated with Claude Code

… numpy Two distinct bugs that together prevented --ssd-cache-dir from being usable on any of the top models (qwen3-coder-30b, gpt-oss-20b, gpt-oss-120b) under --continuous-batching with --enable-prefix-cache. Bug 1: SSD writer thread crashed with std::runtime_error: There is no Stream(gpu, N) in current thread. When a prefix-cache eviction triggered ssd_tier.enqueue_spill(), the mx.array references for layer.keys / layer.values were placed on a queue and picked up by a background writer thread. That writer thread then called np.array(mx_array) inside serialize_layer, which forces an MLX materialization. MLX 0.31+ ties each request's compute to a thread-local Stream(gpu, N) (where N is the request uid); the writer thread has no such stream registered, so the conversion aborts the whole process. Same root cause as issue waybarrios#496 / PR waybarrios#479, but at a new site: the SSD spill writer is its own background daemon thread, not the BatchedEngine MLLM thread waybarrios#479 fixes. Fix: add LayerSerializer.snapshot_layer(layer) -> dict, called on the producer (request handler) thread inside enqueue_spill BEFORE the queue handoff. The snapshot materializes each mx.array to numpy while the request's stream is still valid, and the writer thread now only handles numpy + disk bytes. serialize_layer's signature changes from (layer, idx, path) -> metadata to (snapshot, idx, path) -> metadata so the writer thread can't accidentally touch MLX. Bug 2: np.array(mx.array<bfloat16>) raises RuntimeError "Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1" — numpy has no native bfloat16. qwen3-coder-30b and many other modern MLX models use bf16 KV cache, so this was masked behind Bug 1 on every model that actually evicts. Fix: add ssd_cache._mx_to_numpy_safe(arr) which try/excepts the bf16 case and falls back to arr.astype(mx.float32) + mx.eval + np.array. Original dtype is recorded in the snapshot (and propagated through the manifest) so _reconstruct_ssd_layers in scheduler.py can cast back to bf16 after deserialization. fp32 storage is 2x bf16 on disk; transient 2x RAM during snapshot — acceptable for the spill path. Plus: fix 4 unit tests in test_ssd_cache.py that were failing on 0.4.0rc1 base (Bug 1 was masking these too). The MemoryCacheConfig default min_prefix_tokens=128 rejected the 10-token sequences these tests used. Tests now pass min_prefix_tokens=1 explicitly. Plus: new regression test test_snapshot_runs_on_caller_thread that uses a ThreadBoundArray fake to assert enqueue_spill must materialize layer tensors on the calling thread, not the writer thread. Runtime verified on M5 Pro Max 128GB with Qwen3-Coder-30B (MLX-4bit, bf16 KV): start vllm-mlx with --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test --ssd-cache-max-gb 5. Send 9 distinct 200+ token prompts. Before: server aborts on request 4 with Stream(gpu, 3). After: server completes all 9 requests, evictions=8 in /v1/cache/stats, 400 safetensors files ~462MB written to /tmp/test/data, no crash.

Drop the multi-paragraph docstrings on LayerSerializer / snapshot_layer / enqueue_spill / _writer_loop / _write_entry and the long test docstring explaining the threading invariant. The threading model (snapshot on producer thread, persist on writer thread) is stated once in the LayerSerializer class docstring and otherwise lives in the PR description. Kept: brief one-line reasons on _mx_to_numpy_safe (bf16 buffer mismatch), on _mx_dtype_from_name in scheduler.py (why None is acceptable), and on min_prefix_tokens=1 in the 4 fixed unit tests. No behavior change. 106 tests still pass; lint + black clean.

…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).

CBribiescas · 2026-05-29T16:16:54Z

Polite refresh — this has been quiet since 05-22. The fix follows the same shape as PR #479 / issue #468 (Stream(gpu, N) thread ownership), just at the SSD spill writer site rather than the BatchedEngine MLLM thread. CI is 9/9 green and the fix is pre/post-verified on qwen3-coder-30b and gpt-oss-20b/120b. Happy to address concerns or split the bf16↔numpy fallback off if it helps narrow scope.

Thump604

Reviewed the SSD spill/threading path and the dtype restore path. This looks correct to me.

What I checked:

SSDCacheTier.enqueue_spill() now snapshots each layer on the caller/request thread before handing work to the writer queue.
_writer_loop() / _write_entry() now receive only serializer + numpy snapshot data, so the SSD writer thread no longer needs to touch live MLX arrays.
KVCacheSerializer and ArraysCacheSerializer both keep the old manifest shape plus original dtype hints when a bf16 upcast was needed.
Scheduler._reconstruct_ssd_layers() casts restored arrays back when those dtype hints are present.
The new test_snapshot_runs_on_caller_thread directly guards the producer-thread snapshot invariant.

Local verification on the PR worktree:

python -m pytest tests/test_ssd_cache.py tests/test_memory_cache.py -q
# 106 passed
ruff check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py --select E,F,W --ignore E402,E501,E731,F811,F841
black --check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py
git diff --check upstream/main...HEAD

All passed. I did not rerun the large-model manual SSD spill matrix locally; this approval is based on code review, focused tests, CI, and the runtime evidence in the PR body.

CBribiescas · 2026-06-05T10:17:42Z

Friendly merge nudge — approved 05-30, CI still 9/9 green and branch is mergeable. Happy to rebase if anything's drifted.

waybarrios

Carefully checked - it is well-done

…waybarrios#579/waybarrios#594) Base pointer 015e080 -> caa8838 (also corrects the stale 395b13c note — the tree had been on 015e080 since the previous rebase). New rebase note: waybarrios#540 merged upstream with the env-clobber bug our patch waybarrios#15 fixes (kept ours wholesale; filed the fix upstream), waybarrios#579's simple.py delta restored in a dedicated commit, waybarrios#563 verified compatible with patch waybarrios#16's system_kv_ssd imports and resolves its bf16 upstream-report todo. Future work: waybarrios#233 closed upstream, waybarrios#528 already merged into base. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

CBribiescas added 2 commits May 22, 2026 11:43

This was referenced May 23, 2026

Bug: gpt-oss chat-template diverges from training format → ~25% CC quality loss vs ollama #568

Open

fix(gpt-oss): route harmony prompts through openai-harmony (refs #568) #581

Open

Thump604 approved these changes May 30, 2026

View reviewed changes

Thump604 requested a review from waybarrios May 30, 2026 21:34

Thump604 assigned waybarrios May 30, 2026

CBribiescas mentioned this pull request Jun 9, 2026

fix(ssd-cache): unblock spill for native QuantizedKVCache + bfloat16 #602

Closed

waybarrios approved these changes Jun 9, 2026

View reviewed changes

waybarrios merged commit 967d4f3 into waybarrios:main Jun 9, 2026
9 checks passed

This was referenced Jun 12, 2026

fix(ssd-cache): preserve original bfloat16 dtype across quantized spill #612

Open

fix(mllm): wire SSD cold tier onto the MLLM prefix cache #618

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy#563

fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy#563
waybarrios merged 2 commits into
waybarrios:mainfrom
CBribiescas:fix/ssd-cache-spill-stream-crash

CBribiescas commented May 22, 2026

Uh oh!

CBribiescas commented May 29, 2026

Uh oh!

Thump604 left a comment

Uh oh!

CBribiescas commented Jun 5, 2026

Uh oh!

waybarrios left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

CBribiescas commented May 22, 2026

Summary

Bug 1 — std::runtime_error: There is no Stream(gpu, N) in current thread.

Fix

Bug 2 — RuntimeError: Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1.

Fix

Plus: 4 stale unit tests fixed

Plus: new regression test

Verification

Unit tests

Runtime verification

Related

Test plan

Uh oh!

CBribiescas commented May 29, 2026

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

CBribiescas commented Jun 5, 2026

Uh oh!

waybarrios left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Bug 1 — `std::runtime_error: There is no Stream(gpu, N) in current thread.`

Bug 2 — `RuntimeError: Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1.`