You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
On case-project-v0.31.3.8 (Patch 7 = upstream ml-explore#1034 "graceful Metal OOM"), mlx_lm.serverstill hard-aborts under sustained multi-turn (tool-loop) load instead of degrading gracefully. The crash is an async Metal command-buffer-completion OOM that surfaces as an uncaught C++ exception → libc++abi: terminating → rc=-6. Patch 7's graceful handling does not intercept this path.
Mac Studio M4 Pro, 48 GB. iogpu.wired_limit_mb = 0 (macOS default ≈ 0.70–0.75 × RAM ≈ 33–36 GB Metal-wired ceiling).
Workload: case-project extraction A/B. Single-turn (no tools) pass ran clean for ~2h53m; the server crashed early into the multi-turn tool-loop pass.
Crash signature (server log)
20:19:46 INFO - Prompt Cache: 10 sequences, 15.55 GB
libc++abi: terminating due to uncaught exception of type std::runtime_error:
[METAL] Command buffer execution failed: Insufficient Memory
(00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
mlx_lm.server exited (rc=-6) after 10413.7s of healthy runtime — scheduling restart
Supervisor restarts + rewarms in ~4 s, but the in-flight request is lost → multi-turn workload is corrupted (every heavy doc that accumulates enough KV trips it).
Analysis
Patch 7 gap.Handle Metal OOM gracefully in mlx_lm.server with structured errors ml-explore/mlx-lm#1034 appears to catch synchronous allocation OOM. kIOGPUCommandBufferCallbackErrorOutOfMemory is reported from the GPU command-buffer completion callback — asynchronous, surfacing as an uncaught std::runtime_error that libc++abi turns into terminate(). It cannot be wrapped by a synchronous try/except around allocation, so graceful 503 conversion never happens and the process hard-aborts.
The single-turn path never accumulates the multi-turn active KV, which is why ~2.9 h of single-turn ran clean before the first multi-turn doc crashed.
Ask
Either (whichever is feasible):
Extend the graceful-OOM handling to detect the command-buffer-callback failure (kIOGPUCommandBufferCallbackError*) and convert to HTTP 503 + reset, so a supervised server degrades instead of rc=-6 aborting and losing the request; and/or
Document that command-buffer-callback OOM is uncatchable from Python, and that the supported mitigation for large/multi-turn workloads is raising iogpu.wired_limit_mb (the default ~36 GB ceiling on a 48 GB box is the real limiter, not --prompt-cache-bytes), plus conservative cache sizing.
Operator-side we are raising iogpu.wired_limit_mb and re-deriving the cache budget; filing this so the graceful-OOM expectation set by ml-explore#1034 is either met for this path or explicitly scoped.
Summary
On
case-project-v0.31.3.8(Patch 7 = upstream ml-explore#1034 "graceful Metal OOM"),mlx_lm.serverstill hard-aborts under sustained multi-turn (tool-loop) load instead of degrading gracefully. The crash is an async Metal command-buffer-completion OOM that surfaces as an uncaught C++ exception →libc++abi: terminating→rc=-6. Patch 7's graceful handling does not intercept this path.Environment
GoodOlClint/[email protected](Patch 1 refresh Native MTP speculative decoding (Qwen3.5/3.6 reference implementation) ml-explore/mlx-lm#990 + Patch 7 Handle Metal OOM gracefully in mlx_lm.server with structured errors ml-explore/mlx-lm#1034 + Patch 8 fix: honor --prompt-cache-bytes in sequential serve mode ml-explore/mlx-lm#1118 + Patches 1–6)mlx_lm.server --model Qwen3.5-27B-4bit-mtp --host 0.0.0.0 --port 9100 --trust-remote-code --mtp --prompt-cache-bytes 24Giogpu.wired_limit_mb = 0(macOS default ≈ 0.70–0.75 × RAM ≈ 33–36 GB Metal-wired ceiling).Crash signature (server log)
Supervisor restarts + rewarms in ~4 s, but the in-flight request is lost → multi-turn workload is corrupted (every heavy doc that accumulates enough KV trips it).
Analysis
kIOGPUCommandBufferCallbackErrorOutOfMemoryis reported from the GPU command-buffer completion callback — asynchronous, surfacing as an uncaughtstd::runtime_errorthatlibc++abiturns intoterminate(). It cannot be wrapped by a synchronous try/except around allocation, so graceful 503 conversion never happens and the process hard-aborts.--prompt-cache-bytesbudget (Patch 8 fix: honor --prompt-cache-bytes in sequential serve mode ml-explore/mlx-lm#1118 working as intended) — so this is not unbounded-cache growth (that was Prompt cache grows unbounded across diverse prompts → Metal OOM on Studio under corpus load #6). It is total Metal-wired pressure: model (~16 GB) + cache (15.55 GB) + multi-turn active KV crossing the defaultiogpu.wired_limit_mbceiling (~36 GB) while regular RAM still shows free (operator confirmed asitop never hit 100%).Ask
Either (whichever is feasible):
kIOGPUCommandBufferCallbackError*) and convert to HTTP 503 + reset, so a supervised server degrades instead ofrc=-6aborting and losing the request; and/oriogpu.wired_limit_mb(the default ~36 GB ceiling on a 48 GB box is the real limiter, not--prompt-cache-bytes), plus conservative cache sizing.Operator-side we are raising
iogpu.wired_limit_mband re-deriving the cache budget; filing this so the graceful-OOM expectation set by ml-explore#1034 is either met for this path or explicitly scoped.