Patch 7 (#1034) graceful OOM does not catch kIOGPUCommandBufferCallbackErrorOutOfMemory — server still hard-aborts (rc=-6) under multi-turn load

## Summary
On `case-project-v0.31.3.8` (Patch 7 = upstream #1034 "graceful Metal OOM"), `mlx_lm.server` **still hard-aborts** under sustained multi-turn (tool-loop) load instead of degrading gracefully. The crash is an async Metal command-buffer-completion OOM that surfaces as an uncaught C++ exception → `libc++abi: terminating` → `rc=-6`. Patch 7's graceful handling does not intercept this path.

## Environment
- Fork: `GoodOlClint/mlx-lm@case-project-v0.31.3.8` (Patch 1 refresh #990 + Patch 7 #1034 + Patch 8 #1118 + Patches 1–6)
- `mlx_lm.server --model Qwen3.5-27B-4bit-mtp --host 0.0.0.0 --port 9100 --trust-remote-code --mtp --prompt-cache-bytes 24G`
- Mac Studio M4 Pro, 48 GB. **`iogpu.wired_limit_mb = 0`** (macOS default ≈ 0.70–0.75 × RAM ≈ 33–36 GB Metal-wired ceiling).
- Workload: case-project extraction A/B. Single-turn (no tools) pass ran **clean for ~2h53m**; the server crashed early into the **multi-turn tool-loop** pass.

## Crash signature (server log)
```
20:19:46  INFO - Prompt Cache: 10 sequences, 15.55 GB
          libc++abi: terminating due to uncaught exception of type std::runtime_error:
          [METAL] Command buffer execution failed: Insufficient Memory
          (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
          mlx_lm.server exited (rc=-6) after 10413.7s of healthy runtime — scheduling restart
```
Supervisor restarts + rewarms in ~4 s, but the in-flight request is lost → multi-turn workload is corrupted (every heavy doc that accumulates enough KV trips it).

## Analysis
1. **Patch 7 gap.** #1034 appears to catch *synchronous allocation* OOM. `kIOGPUCommandBufferCallbackErrorOutOfMemory` is reported from the **GPU command-buffer completion callback** — asynchronous, surfacing as an uncaught `std::runtime_error` that `libc++abi` turns into `terminate()`. It cannot be wrapped by a synchronous try/except around allocation, so graceful 503 conversion never happens and the process hard-aborts.
2. **Not the prompt cache.** Crash at cache = 15.55 GB, *well under* the 24 G `--prompt-cache-bytes` budget (Patch 8 #1118 working as intended) — so this is not unbounded-cache growth (that was #6). It is total **Metal-wired** pressure: model (~16 GB) + cache (15.55 GB) + multi-turn active KV crossing the default `iogpu.wired_limit_mb` ceiling (~36 GB) while regular RAM still shows free (operator confirmed asitop never hit 100%).
3. The single-turn path never accumulates the multi-turn active KV, which is why ~2.9 h of single-turn ran clean before the first multi-turn doc crashed.

## Ask
Either (whichever is feasible):
- Extend the graceful-OOM handling to detect the command-buffer-callback failure (`kIOGPUCommandBufferCallbackError*`) and convert to HTTP 503 + reset, so a supervised server degrades instead of `rc=-6` aborting and losing the request; **and/or**
- Document that command-buffer-callback OOM is uncatchable from Python, and that the supported mitigation for large/multi-turn workloads is raising `iogpu.wired_limit_mb` (the default ~36 GB ceiling on a 48 GB box is the real limiter, not `--prompt-cache-bytes`), plus conservative cache sizing.

Operator-side we are raising `iogpu.wired_limit_mb` and re-deriving the cache budget; filing this so the graceful-OOM expectation set by #1034 is either met for this path or explicitly scoped.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patch 7 (#1034) graceful OOM does not catch kIOGPUCommandBufferCallbackErrorOutOfMemory — server still hard-aborts (rc=-6) under multi-turn load #7

Summary

Environment

Crash signature (server log)

Analysis

Ask

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Patch 7 (#1034) graceful OOM does not catch kIOGPUCommandBufferCallbackErrorOutOfMemory — server still hard-aborts (rc=-6) under multi-turn load #7

Description

Summary

Environment

Crash signature (server log)

Analysis

Ask

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions