fix(engine): per-engine threads to eliminate cross-engine stream contamination #1304

Open
ivaniguarans wants to merge 1 commit into jundot:main from ivaniguarans:feat/per-engine-threads

@ivaniguarans
Contributor

Summary

When multiple LM engines run concurrently, they share a single _global_mlx_executor with max_workers=1, serializing all scheduler.step() calls through one thread. More critically, the MTP patch reads the module-level generation_stream via sys.modules for its forward passes, bypassing whatever stream the BatchGenerator was instantiated with. If two MTP-capable engines run simultaneously, their MTP forwards land on the same module-level stream regardless of which engine dispatched them — a stream-ordering violation that upstream's BatchGenerator(stream=...) parameter (mlx-lm 0.31.3) was designed to prevent.

This PR gives each EngineCore its own ThreadPoolExecutor and mx.Stream, passes the stream through Scheduler into BatchGenerator, and removes the _get_generation_stream() indirection from the MTP patch so MTP operations inherit the correct per-engine stream from the enclosing BatchGenerator context — the same pattern upstream's GenerationBatch._step() already uses.
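
The difference between re-pushing a module-level stream and inheriting the enclosing one can be sketched without MLX. This is a minimal illustration only: `use_stream`, `current_stream`, `MODULE_STREAM`, and both `mtp_forward_*` functions are hypothetical stand-ins, and the thread-local stack loosely models a `with mx.stream(...)` block rather than reproducing MLX's behavior.

```python
import contextlib
import threading

MODULE_STREAM = "module-level-stream"  # stands in for the module-level generation_stream
_local = threading.local()


def current_stream() -> str:
    """Return the innermost active stream, or the module-level default."""
    stack = getattr(_local, "stack", [])
    return stack[-1] if stack else MODULE_STREAM


@contextlib.contextmanager
def use_stream(stream: str):
    """Hypothetical analogue of a `with mx.stream(...)` block."""
    if not hasattr(_local, "stack"):
        _local.stack = []
    _local.stack.append(stream)
    try:
        yield
    finally:
        _local.stack.pop()


def mtp_forward_old() -> str:
    # Old pattern: push the module-level stream, overriding the caller's stream.
    with use_stream(MODULE_STREAM):
        return current_stream()


def mtp_forward_new() -> str:
    # New pattern: no wrapper, so the enclosing context's stream applies.
    return current_stream()


# The enclosing context plays the role of one engine's BatchGenerator.
with use_stream("engine-a-stream"):
    old_result = mtp_forward_old()
    new_result = mtp_forward_new()
```

In this sketch the old pattern reports the module-level stream even though the caller selected a per-engine one, while the new pattern reports the caller's stream.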

The global executor is retained for non-LM engines (TTS, STT, embedding, reranker) that still rely on get_mlx_executor() and _init_mlx_thread.
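
As a rough picture of the per-engine ownership described above, the sketch below gives each engine object its own single-worker executor and a placeholder stream. `EngineSketch` is a hypothetical stand-in for `EngineCore`, and the `object()` placeholder replaces the real `mx.Stream`; only the standard library is used.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class EngineSketch:
    """Hypothetical stand-in for EngineCore: owns one worker thread and one stream."""

    def __init__(self, name: str) -> None:
        self.name = name
        # One single-worker executor per engine: each engine's step() calls are
        # serialized on its own thread instead of a shared global one.
        self._executor = ThreadPoolExecutor(
            max_workers=1, thread_name_prefix=f"engine-{name}"
        )
        # Placeholder for the per-engine stream (an mx.Stream in the PR).
        self.stream = object()

    def step(self) -> str:
        # Run on this engine's own thread and report which thread did the work.
        future = self._executor.submit(lambda: threading.current_thread().name)
        return future.result()

    def close(self) -> None:
        # Shut down the per-engine executor; later submits raise RuntimeError.
        self._executor.shutdown(wait=True)


a, b = EngineSketch("a"), EngineSketch("b")
thread_a, thread_b = a.step(), b.step()
a.close()
b.close()
```

Each engine reports a different worker thread name, and the two placeholder streams are distinct objects, which is the property the new tests check on the real classes.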

Changes

  • engine_core.py: EngineCore.__init__ creates a per-engine ThreadPoolExecutor + mx.new_thread_local_stream() and passes the stream to Scheduler. close() shuts down the per-engine executor after scheduler cleanup. Added _ensure_wired_limit() so the process-global mx.set_wired_limit() runs once rather than racing across concurrent BatchGenerator inits.
  • scheduler.py: Scheduler.__init__ accepts an optional stream parameter (falls back to the module-level generation_stream when not provided). All 37 internal references to generation_stream — sync barriers, cache clears, mx.stream() context managers, BatchGenerator creation — now use self._stream.
  • batch_generator.py (MTP patch): Removed _get_generation_stream() and the 4 explicit with mx.stream(...) wrappers that pushed the module-level stream. MTP forwards now inherit the per-engine stream from the enclosing BatchGenerator context, matching GenerationBatch._step()'s existing pattern.

Concurrent throughput

Two models generating simultaneously, measured relative to running them sequentially:

| Model pair | Before (shared executor) | After (per-engine) | Speedup |
| --- | --- | --- | --- |
| Qwen3-0.6B + Qwen3-Coder-Next-6bit | 1.00x | 1.14x | 0.6B TTFT: 2089 ms → 701 ms |
| Qwen3.6-35B-A3B-oQ4-mtp + Qwen3.6-27B-oQ6-mtp | 0.93x | 1.12x | wall: 5408 ms → 4722 ms |

Sub-2x is expected — Metal command buffers still serialize on one GPU. The win is CPU-side overlap (prefill + decode can be submitted concurrently) and eliminating head-of-line blocking where one engine's long prefill stalls another's token emission.
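
The head-of-line blocking effect can be reproduced with plain executors. In this hedged sketch, `short_task_blocked` is a hypothetical helper: a task that waits on an event plays the long prefill, a trivial task plays the other engine's token emission, and no GPU work is involved.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def short_task_blocked(shared: bool) -> bool:
    """Return True if the short task is stuck behind the long one."""
    release = threading.Event()
    long_pool = ThreadPoolExecutor(max_workers=1)
    # Shared mode reuses the same single worker; per-engine mode gets its own.
    short_pool = long_pool if shared else ThreadPoolExecutor(max_workers=1)
    try:
        # "Long prefill": occupies its worker until released.
        long_pool.submit(release.wait)
        # "Token emission" for the other engine.
        short_future = short_pool.submit(lambda: "token")
        try:
            short_future.result(timeout=1.0)
            return False
        except FutureTimeout:
            return True
    finally:
        release.set()
        long_pool.shutdown(wait=True)
        short_pool.shutdown(wait=True)


blocked_shared = short_task_blocked(shared=True)
blocked_per_engine = short_task_blocked(shared=False)
```

With one shared worker the short task cannot start until the long one finishes; with separate workers it completes independently. This models only the CPU-side queueing, not Metal command buffer scheduling.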

Test plan

  • New tests/test_per_engine_threads.py (10 tests): verifies Scheduler stores and uses explicit streams, regex-scans the Scheduler class body for bare generation_stream references, confirms each EngineCore gets a distinct executor/stream, validates executor shutdown on close(), and asserts the MTP patch no longer contains _get_generation_stream or any generation_stream reference.
  • Updated tests/test_engine_core.py: existing executor tests now assert is not (distinct executors) and concurrent execution (both executors active simultaneously).
  • Full suite passes: 4493 passed, 19 skipped.
  • Live-tested with concurrent Qwen3.6-35B-A3B-oQ4-mtp + Qwen3.6-27B-oQ6-mtp (both MTP-enabled) serving requests simultaneously.
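
A source scan of the kind the test plan describes might be sketched as below. The regex, the `bare_references` helper, and both sample sources are assumptions made for illustration; the real test scans the actual `Scheduler` class body and has to allow the documented fallback to the module-level `generation_stream` in `__init__`.

```python
import re

# Match `generation_stream` as a bare name: not an attribute (`self.generation_stream`)
# and not part of a longer identifier (`_get_generation_stream`).
BARE_REF = re.compile(r"(?<![\w.])generation_stream\b")


def bare_references(source: str) -> list[int]:
    """Return 1-based line numbers that reference the bare module-level name."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        if BARE_REF.search(code):
            hits.append(lineno)
    return hits


# Hypothetical class bodies used only as scan input.
CLEAN = """
class Scheduler:
    def __init__(self, stream=None):
        self._stream = stream

    def sync(self):
        return self._stream
"""

DIRTY = """
class Scheduler:
    def sync(self):
        return generation_stream
"""

clean_hits = bare_references(CLEAN)
dirty_hits = bare_references(DIRTY)
```

A scan like this is a cheap regression guard: it fails as soon as new code reintroduces the module-level name instead of `self._stream`.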

Related to #1248

…amination

Replace the shared _global_mlx_executor with per-EngineCore
ThreadPoolExecutor + mx.Stream, and fix the MTP patch reading the
module-level generation_stream instead of the per-engine stream.