Skip to content

Patch 7 (#1034) graceful OOM does not catch kIOGPUCommandBufferCallbackErrorOutOfMemory — server still hard-aborts (rc=-6) under multi-turn load #7

@GoodOlClint

Description

@GoodOlClint

Summary

On case-project-v0.31.3.8 (Patch 7 = upstream ml-explore#1034 "graceful Metal OOM"), mlx_lm.server still hard-aborts under sustained multi-turn (tool-loop) load instead of degrading gracefully. The crash is an async Metal command-buffer-completion OOM that surfaces as an uncaught C++ exception → libc++abi: terminatingrc=-6. Patch 7's graceful handling does not intercept this path.

Environment

Crash signature (server log)

20:19:46  INFO - Prompt Cache: 10 sequences, 15.55 GB
          libc++abi: terminating due to uncaught exception of type std::runtime_error:
          [METAL] Command buffer execution failed: Insufficient Memory
          (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
          mlx_lm.server exited (rc=-6) after 10413.7s of healthy runtime — scheduling restart

Supervisor restarts + rewarms in ~4 s, but the in-flight request is lost → multi-turn workload is corrupted (every heavy doc that accumulates enough KV trips it).

Analysis

  1. Patch 7 gap. Handle Metal OOM gracefully in mlx_lm.server with structured errors ml-explore/mlx-lm#1034 appears to catch synchronous allocation OOM. kIOGPUCommandBufferCallbackErrorOutOfMemory is reported from the GPU command-buffer completion callback — asynchronous, surfacing as an uncaught std::runtime_error that libc++abi turns into terminate(). It cannot be wrapped by a synchronous try/except around allocation, so graceful 503 conversion never happens and the process hard-aborts.
  2. Not the prompt cache. Crash at cache = 15.55 GB, well under the 24 G --prompt-cache-bytes budget (Patch 8 fix: honor --prompt-cache-bytes in sequential serve mode ml-explore/mlx-lm#1118 working as intended) — so this is not unbounded-cache growth (that was Prompt cache grows unbounded across diverse prompts → Metal OOM on Studio under corpus load #6). It is total Metal-wired pressure: model (~16 GB) + cache (15.55 GB) + multi-turn active KV crossing the default iogpu.wired_limit_mb ceiling (~36 GB) while regular RAM still shows free (operator confirmed asitop never hit 100%).
  3. The single-turn path never accumulates the multi-turn active KV, which is why ~2.9 h of single-turn ran clean before the first multi-turn doc crashed.

Ask

Either (whichever is feasible):

  • Extend the graceful-OOM handling to detect the command-buffer-callback failure (kIOGPUCommandBufferCallbackError*) and convert to HTTP 503 + reset, so a supervised server degrades instead of rc=-6 aborting and losing the request; and/or
  • Document that command-buffer-callback OOM is uncatchable from Python, and that the supported mitigation for large/multi-turn workloads is raising iogpu.wired_limit_mb (the default ~36 GB ceiling on a 48 GB box is the real limiter, not --prompt-cache-bytes), plus conservative cache sizing.

Operator-side we are raising iogpu.wired_limit_mb and re-deriving the cache budget; filing this so the graceful-OOM expectation set by ml-explore#1034 is either met for this path or explicitly scoped.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions