fix(engine): keep per-engine MLX worker thread alive to fix DeepSeek V4 unload SIGSEGV (#1304 regression) #1542

Closed
SheeJiaWei wants to merge 2 commits into jundot:main from SheeJiaWei:debug/deepseek-fail-unload-v0-3-12

Conversation

@SheeJiaWei

Contributor

Unloading DeepSeek V4 crashed omlx serve with a native SIGSEGV.
Root cause: MLX's @mx.compile cache (CompilerCache) is a C++ thread_local; the per-engine executor introduced in #1304 runs V4's module-scope @mx.compile graphs, then EngineCore.close() called executor.shutdown(wait=True), exiting the worker thread → ~CompilerCache() freed those graphs' Python objects from a thread-exit handler (no GIL) → use-after-free. V4-only (only model with module-scope compiled graphs) and sync-immune.

Fix: keep per-engine MLX worker threads alive for the process lifetime (matching the pre-#1304 global-thread behavior), so the destructor never runs mid-process.

Summary
Unloading DeepSeek-V4-Flash crashed the whole omlx serve process with a native SIGSEGV. The per-engine executor thread added in #1304 runs V4's @mx.compile graphs, and exiting that thread at unload triggers a use-after-free in MLX's thread_local compile-cache destructor. Fix: don't tear down the per-engine MLX worker thread at unload (matching the pre-#1304 single-global-thread behavior).

Root cause
MLX's @mx.compile cache (CompilerCache) is a C++ thread_local holding compiled graphs that reference Python objects. DeepSeek V4 is the only model with module-scope @mx.compile graphs, so they populate the per-engine thread's cache. EngineCore.close() called executor.shutdown(wait=True) → the worker thread exits → dyld runs ~CompilerCache() → it frees those Python objects from a thread-exit handler (no GIL, after gc already freed them) → use-after-free. V4-only and sync-immune (a thread-exit destructor, not GPU work). Pre-#1304, the shared global MLX thread never exited mid-process, so this never surfaced.
.ips crash frame (EXC_BAD_ACCESS at 0x10):
_pthread_exit → dyld::ThreadLocalVariables::finalizeList
→ mlx::core::detail::CompilerCache::~CompilerCache()
→ __deallocate_node(…CompilerCache::CacheEntry…) → tupledealloc
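
The step that starts this chain, executor.shutdown(wait=True) ending the worker thread, can be observed without MLX. The following is a minimal sketch using only the standard library; the helper name worker_thread_of and the "mlx-engine" thread prefix are illustrative and do not come from the omlx code. Thread exit is the point at which C++ thread_local destructors such as ~CompilerCache() would run.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def worker_thread_of(executor: ThreadPoolExecutor) -> threading.Thread:
    """Return the thread object that runs tasks for a single-worker executor."""
    return executor.submit(threading.current_thread).result()


# Hypothetical stand-in for the per-engine executor introduced in #1304.
executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="mlx-engine")
worker = worker_thread_of(executor)
alive_before = worker.is_alive()

# shutdown(wait=True) lets the worker return from its loop and joins it, so
# the OS thread is gone afterwards and its thread-local storage is finalized.
executor.shutdown(wait=True)
alive_after = worker.is_alive()

print(alive_before, alive_after)
```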

Fix
EngineCore.close() no longer shuts down the per-engine executor; the executor (and its stream) is held in a process-lifetime registry so the thread — and its thread_local CompilerCache — is never destructed mid-process:

# close(): was self._mlx_executor.shutdown(wait=True)
self._mlx_executor = None  # thread kept alive via module-global registry
No synchronization (sync-immune); MLX exposes no compile-cache-clear API. Cost: one idle thread per model load, reclaimed at exit.

Test plan

  • Unit: updated tests/test_per_engine_threads.py; 280 tests pass (engine_core/engine_pool/per_engine_threads/stream_usage/keepalive/scheduler/batched_engine).
  • Manual (512 GB Mac Studio): repeated load→serve→unload of DeepSeek-V4-Flash-mxfp8 — was SIGSEGV every time, now reaches Unloaded model … (settled); MiniMax/VLM unaffected.

@beamivalice

Contributor

I think the very same thing happens with the just-released Step 3.7 Flash too: Python always crashed on exit. I tried adding support for it, but I would have had to fix the unload first, so I didn't bother sending a PR for Step 3.7. This could solve both problems at once.

@jundot

jundot commented May 30, 2026

Owner

Thanks for the really thorough writeup, the crash frame and your root-cause read were spot on and made this easy to chase down. I reproduced it on a 512 GB machine with DeepSeek-V4-Flash-8bit exactly as you described: POST .../unload goes through EngineCore.close() to _mlx_executor.shutdown(wait=True), the worker thread exits, and ~CompilerCache then frees the compiled graphs' Python objects from the thread-exit handler. faulthandler caught it as Fatal Python error: PyThreadState_Get: ... the GIL is released, with the same _pthread_exit -> finalizeList -> CompilerCache::~CompilerCache -> tupledealloc frame you posted.

Tracing it the rest of the way, the trigger is upstream MLX #3280, which moved the compile cache to a thread_local CompilerCache. MLX meant to guard this with an atexit handler that clears the cache, but the registration is dead code (the lambda in transforms.cpp is defined and never actually called), and it would run on the main thread anyway so it can't reach a worker thread's thread_local cache. So any model with module-scope @mx.compile graphs trips it whenever the thread that ran those graphs is torn down. V4 does, and Step 3.x does too (I checked the model sources), which is why @beamivalice is seeing the same thing.
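
The point that a main-thread hook cannot reach a worker thread's thread_local cache can be shown with a pure-Python analogy. This sketch uses threading.local as a stand-in for the C++ thread_local CompilerCache; populate and clear_cache are made-up names and no MLX code is involved.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a C++ thread_local cache: each thread sees its own attributes.
cache = threading.local()


def populate() -> None:
    """Store one fake compiled graph in the calling thread's cache."""
    cache.entries = ["compiled-graph"]


def clear_cache() -> int:
    """Clear the calling thread's cache and return how many entries it held."""
    removed = len(getattr(cache, "entries", []))
    cache.entries = []
    return removed


with ThreadPoolExecutor(max_workers=1) as executor:
    executor.submit(populate).result()

    # Called from the main thread, as an atexit handler would be: it only
    # sees the main thread's (empty) cache.
    removed_on_main = clear_cache()

    # Called on the worker that populated the cache: it finds the entry.
    removed_on_worker = executor.submit(clear_cache).result()

print(removed_on_main, removed_on_worker)
```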

Keeping the worker thread alive the way this PR does is a clean fix for the unload crash and I nearly went with it. Two things nudged me elsewhere: it pins one idle thread plus stream per model load for the process lifetime, and it only relocates the crash to process exit, where even an immortal thread gets torn down and runs the same destructor (that lines up with the "crash on exit" @beamivalice mentioned).

The fix I landed clears the cache instead of dodging it. compile_clear_cache() is exported from libmlx.dylib, so I resolve it through ctypes (with PyDLL, so the GIL stays held during the call) and run it ON the worker thread right before close() shuts it down. The cache ends up empty, ~CompilerCache becomes a no-op, the thread shuts down normally, so there is no leak and the exit path is covered too. If that symbol ever stops resolving, it falls back to exactly your keep-the-thread-alive approach, so your idea is still in the code as the safety net. After the change, repeated load/unload/reload and shutting down with the model still loaded are all clean, no Fatal Python error and memory settles back.

So I'm going to close this in favor of that fix, but it landed straight off the back of your debugging, thanks a lot for it. I'll also flag the dead atexit lambda upstream so the real fix can eventually live in MLX.

@jundot closed this May 30, 2026

@SheeJiaWei

Contributor Author

Thanks, that compile_clear_cache approach is much cleaner — glad the writeup helped.


3 participants