Skip to content

fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy#563

Merged
waybarrios merged 2 commits into
waybarrios:mainfrom
CBribiescas:fix/ssd-cache-spill-stream-crash
Jun 9, 2026
Merged

fix(ssd-cache): snapshot KV on producer thread + handle bf16↔numpy#563
waybarrios merged 2 commits into
waybarrios:mainfrom
CBribiescas:fix/ssd-cache-spill-stream-crash

Conversation

@CBribiescas

Copy link
Copy Markdown
Contributor

Summary

Fixes two distinct bugs that together made --ssd-cache-dir crash the server on any text-only model under --continuous-batching --enable-prefix-cache whenever the prefix cache filled and triggered eviction. Affects every top model I could test (qwen3-coder-30b, gpt-oss-20b, gpt-oss-120b).

Bug 1 — std::runtime_error: There is no Stream(gpu, N) in current thread.

When an eviction triggered ssd_tier.enqueue_spill(), the live mx.array references for layer.keys / layer.values were placed on a queue and consumed by a background writer thread. The writer then called np.array(mx_array) inside serialize_layer, forcing an MLX materialization. MLX ≥ 0.31.2 ties each request's compute to a thread-local Stream(gpu, N) (N == request uid); the writer thread had no such stream registered, so the conversion aborted the entire process.

Same architectural class as issue #496 / PR #479, but at a new site — the SSD spill writer is its own background daemon thread, not the BatchedEngine MLLM inference thread #479 fixes.

Fix

  • Add LayerSerializer.snapshot_layer(layer) -> dict, called on the producer (request handler) thread inside enqueue_spill before the queue handoff. The snapshot materializes each mx.array to numpy while the request's stream is still valid.
  • serialize_layer now takes the snapshot dict instead of the live layer — the writer thread can't accidentally touch MLX because the API doesn't give it any MLX objects.
  • Both KV/RotatingKVCache and ArraysCache (Mamba) serializers implement the new method.

Bug 2 — RuntimeError: Item size 2 for PEP 3118 buffer format string B does not match the dtype B item size 1.

np.array(mx.array<bfloat16>) raises because numpy has no native bfloat16 and the mlx ≥ 0.31 buffer-protocol export confuses it. qwen3-coder-30b and many modern MLX models use bf16 KV cache, so this was masked behind Bug 1 on every model that actually evicted.

Fix

  • New helper ssd_cache._mx_to_numpy_safe(arr) try/excepts the bf16 case and falls back to arr.astype(mx.float32) + mx.eval + np.array.
  • The original dtype is recorded in the snapshot/manifest as a string (e.g. "bfloat16").
  • scheduler._reconstruct_ssd_layers casts the restored array back to the original dtype via mx.array(numpy_fp32).astype(dt) so the model sees its native KV layout (preserves memory + avoids implicit-cast surprises in attention kernels).
  • fp32 on disk is 2× bf16, accepted as a transient trade-off for the spill path.

Plus: 4 stale unit tests fixed

test_evict_lru_calls_ssd_spill, test_check_ssd_returns_none_on_ram_hit, test_full_round_trip, test_metrics_accuracy were all failing on clean v0.4.0rc1 base. The MemoryCacheConfig.min_prefix_tokens default is 128; the tests used 10-token sequences, so every store() was silently rejected as "short prefix" and the cache stayed empty. Pass min_prefix_tokens=1 explicitly. (Bug 1 was masking these too — even if store() worked, the writer would crash before any test could assert on spill_count.)

Plus: new regression test

test_snapshot_runs_on_caller_thread uses a ThreadBoundArray fake that raises if __array__() is accessed off the creator thread, proving enqueue_spill materializes layers on the calling thread (not the writer). This is the architectural invariant Bug 1 violated.

Verification

Host: Apple M5 Pro Max, 128 GB unified memory, macOS Tahoe.

Unit tests

tests/test_ssd_cache.py — 59 passed (was 54 passed + 4 failed; +1 new test)
tests/test_memory_cache.py — 47 passed (non-regression)

Lint clean: ruff check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py --select E,F,W --ignore E402,E501,E731,F811,F841 passes. black --check passes.

Runtime verification

Same harness for all three: start vllm-mlx with --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test --ssd-cache-max-gb 5, send 9 distinct ≥200-token prompts to force eviction.

Model Pre-fix Post-fix
Qwen3-Coder-30B-A3B-Instruct-MLX-4bit (bf16 KV) crashes Stream(gpu, 2) after request 6 ✅ 400 spill files / 462 MB, 8 evictions, no crash
mlx-community/gpt-oss-20b-MXFP4-Q8 crashes Stream(gpu, 3) after request 4 ✅ 130 spill files / 115 MB, 5 evictions, 8/8 hits on replay
mlx-community/gpt-oss-120b-MXFP4-Q8 crashes Stream(gpu, 3) after request 3 ✅ 266 spill files / 242 MB, 7 evictions, 8/8 hits on replay

Related

Test plan

  • CI: tests/test_ssd_cache.py + tests/test_memory_cache.py green on all matrix Pythons.
  • Manual: vllm-mlx serve <bf16-KV model> --continuous-batching --enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test, send N>cache-capacity prompts, confirm /v1/cache/stats shows evictions > 0, files appear under /tmp/test/data/, no process abort.
  • Manual: replay the first prompt after eviction, confirm SSD reload path runs without dtype errors.

🤖 Generated with Claude Code

… numpy

Two distinct bugs that together prevented --ssd-cache-dir from being usable
on any of the top models (qwen3-coder-30b, gpt-oss-20b, gpt-oss-120b) under
--continuous-batching with --enable-prefix-cache.

Bug 1: SSD writer thread crashed with std::runtime_error: There is no
Stream(gpu, N) in current thread.

When a prefix-cache eviction triggered ssd_tier.enqueue_spill(), the
mx.array references for layer.keys / layer.values were placed on a
queue and picked up by a background writer thread. That writer thread
then called np.array(mx_array) inside serialize_layer, which forces an
MLX materialization. MLX 0.31+ ties each request's compute to a
thread-local Stream(gpu, N) (where N is the request uid); the writer
thread has no such stream registered, so the conversion aborts the
whole process.

Same root cause as issue waybarrios#496 / PR waybarrios#479, but at a new site: the SSD
spill writer is its own background daemon thread, not the BatchedEngine
MLLM thread waybarrios#479 fixes.

Fix: add LayerSerializer.snapshot_layer(layer) -> dict, called on the
producer (request handler) thread inside enqueue_spill BEFORE the queue
handoff. The snapshot materializes each mx.array to numpy while the
request's stream is still valid, and the writer thread now only handles
numpy + disk bytes. serialize_layer's signature changes from
(layer, idx, path) -> metadata to (snapshot, idx, path) -> metadata so
the writer thread can't accidentally touch MLX.

Bug 2: np.array(mx.array<bfloat16>) raises RuntimeError "Item size 2
for PEP 3118 buffer format string B does not match the dtype B item
size 1" — numpy has no native bfloat16. qwen3-coder-30b and many other
modern MLX models use bf16 KV cache, so this was masked behind Bug 1
on every model that actually evicts.

Fix: add ssd_cache._mx_to_numpy_safe(arr) which try/excepts the bf16
case and falls back to arr.astype(mx.float32) + mx.eval + np.array.
Original dtype is recorded in the snapshot (and propagated through the
manifest) so _reconstruct_ssd_layers in scheduler.py can cast back to
bf16 after deserialization. fp32 storage is 2x bf16 on disk; transient
2x RAM during snapshot — acceptable for the spill path.

Plus: fix 4 unit tests in test_ssd_cache.py that were failing on
0.4.0rc1 base (Bug 1 was masking these too). The MemoryCacheConfig
default min_prefix_tokens=128 rejected the 10-token sequences these
tests used. Tests now pass min_prefix_tokens=1 explicitly.

Plus: new regression test test_snapshot_runs_on_caller_thread that
uses a ThreadBoundArray fake to assert enqueue_spill must materialize
layer tensors on the calling thread, not the writer thread.

Runtime verified on M5 Pro Max 128GB with Qwen3-Coder-30B
(MLX-4bit, bf16 KV): start vllm-mlx with --continuous-batching
--enable-prefix-cache --cache-memory-mb 50 --ssd-cache-dir /tmp/test
--ssd-cache-max-gb 5. Send 9 distinct 200+ token prompts. Before:
server aborts on request 4 with Stream(gpu, 3). After: server completes
all 9 requests, evictions=8 in /v1/cache/stats, 400 safetensors files
~462MB written to /tmp/test/data, no crash.
Drop the multi-paragraph docstrings on LayerSerializer / snapshot_layer /
enqueue_spill / _writer_loop / _write_entry and the long test docstring
explaining the threading invariant. The threading model (snapshot on
producer thread, persist on writer thread) is stated once in the
LayerSerializer class docstring and otherwise lives in the PR description.

Kept: brief one-line reasons on _mx_to_numpy_safe (bf16 buffer mismatch),
on _mx_dtype_from_name in scheduler.py (why None is acceptable), and on
min_prefix_tokens=1 in the 4 fixed unit tests.

No behavior change. 106 tests still pass; lint + black clean.
CBribiescas added a commit to CBribiescas/vllm-mlx that referenced this pull request May 27, 2026
…g/bare-JSON

Production-running state at 2026-05-27. Not a PR target — this branch is
a snapshot for reproducing my local setup; the equivalent changes are in
upstream review as separate PRs:

- llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580.
  Applied here directly so this snapshot doesn't depend on waybarrios#580 landing.

- gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony
  is superseded by PR waybarrios#581 (route through openai-harmony lib); the
  gemma4 flag-flip is what the production launchd start-vllm-gemma
  service depends on for tool-call extraction.

The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 +
waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this
branch as an editable install gives you the same vllm-mlx serving
behavior I'm running for the 3-slot production stack (qwen-coder-30b,
gemma-4-E4B-it, gpt-oss-120b).
CBribiescas added a commit to CBribiescas/vllm-mlx that referenced this pull request May 27, 2026
…g/bare-JSON

Production-running state at 2026-05-27. Not a PR target — this branch is
a snapshot for reproducing my local setup; the equivalent changes are in
upstream review as separate PRs:

- llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580.
  Applied here directly so this snapshot doesn't depend on waybarrios#580 landing.

- gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony
  is superseded by PR waybarrios#581 (route through openai-harmony lib); the
  gemma4 flag-flip is what the production launchd start-vllm-gemma
  service depends on for tool-call extraction.

The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 +
waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this
branch as an editable install gives you the same vllm-mlx serving
behavior I'm running for the 3-slot production stack (qwen-coder-30b,
gemma-4-E4B-it, gpt-oss-120b).
@CBribiescas

Copy link
Copy Markdown
Contributor Author

Polite refresh — this has been quiet since 05-22. The fix follows the same shape as PR #479 / issue #468 (Stream(gpu, N) thread ownership), just at the SSD spill writer site rather than the BatchedEngine MLLM thread. CI is 9/9 green and the fix is pre/post-verified on qwen3-coder-30b and gpt-oss-20b/120b. Happy to address concerns or split the bf16↔numpy fallback off if it helps narrow scope.

@Thump604 Thump604 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the SSD spill/threading path and the dtype restore path. This looks correct to me.

What I checked:

  • SSDCacheTier.enqueue_spill() now snapshots each layer on the caller/request thread before handing work to the writer queue.
  • _writer_loop() / _write_entry() now receive only serializer + numpy snapshot data, so the SSD writer thread no longer needs to touch live MLX arrays.
  • KVCacheSerializer and ArraysCacheSerializer both keep the old manifest shape plus original dtype hints when a bf16 upcast was needed.
  • Scheduler._reconstruct_ssd_layers() casts restored arrays back when those dtype hints are present.
  • The new test_snapshot_runs_on_caller_thread directly guards the producer-thread snapshot invariant.

Local verification on the PR worktree:

python -m pytest tests/test_ssd_cache.py tests/test_memory_cache.py -q
# 106 passed
ruff check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py --select E,F,W --ignore E402,E501,E731,F811,F841
black --check vllm_mlx/ssd_cache.py vllm_mlx/scheduler.py tests/test_ssd_cache.py
git diff --check upstream/main...HEAD

All passed. I did not rerun the large-model manual SSD spill matrix locally; this approval is based on code review, focused tests, CI, and the runtime evidence in the PR body.

@CBribiescas

Copy link
Copy Markdown
Contributor Author

Friendly merge nudge — approved 05-30, CI still 9/9 green and branch is mergeable. Happy to rebase if anything's drifted.

@waybarrios waybarrios left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carefully checked - it is well-done

@waybarrios waybarrios merged commit 967d4f3 into waybarrios:main Jun 9, 2026
9 checks passed
TimotejLabsky added a commit to TimotejLabsky/vllm-mlx that referenced this pull request Jun 9, 2026
…waybarrios#579/waybarrios#594)

Base pointer 015e080 -> caa8838 (also corrects the stale 395b13c note —
the tree had been on 015e080 since the previous rebase). New rebase note:
waybarrios#540 merged upstream with the env-clobber bug our patch waybarrios#15 fixes (kept
ours wholesale; filed the fix upstream), waybarrios#579's simple.py delta restored
in a dedicated commit, waybarrios#563 verified compatible with patch waybarrios#16's
system_kv_ssd imports and resolves its bf16 upstream-report todo.
Future work: waybarrios#233 closed upstream, waybarrios#528 already merged into base.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TimotejLabsky added a commit to TimotejLabsky/vllm-mlx that referenced this pull request Jun 14, 2026
…waybarrios#579/waybarrios#594)

Base pointer 015e080 -> caa8838 (also corrects the stale 395b13c note —
the tree had been on 015e080 since the previous rebase). New rebase note:
waybarrios#540 merged upstream with the env-clobber bug our patch waybarrios#15 fixes (kept
ours wholesale; filed the fix upstream), waybarrios#579's simple.py delta restored
in a dedicated commit, waybarrios#563 verified compatible with patch waybarrios#16's
system_kv_ssd imports and resolves its bf16 upstream-report todo.
Future work: waybarrios#233 closed upstream, waybarrios#528 already merged into base.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants