fix(engine): per-engine threads to eliminate cross-engine stream contamination by ivaniguarans · Pull Request #1304 · jundot/omlx

ivaniguarans · 2026-05-19T13:25:59Z

Summary

When multiple LM engines run concurrently, they share a single _global_mlx_executor with max_workers=1, serializing all scheduler.step() calls through one thread. More critically, the MTP patch reads the module-level generation_stream via sys.modules for its forward passes, bypassing whatever stream the BatchGenerator was instantiated with. If two MTP-capable engines run simultaneously, their MTP forwards land on the same module-level stream regardless of which engine dispatched them — a stream-ordering violation that upstream's BatchGenerator(stream=...) parameter (mlx-lm 0.31.3) was designed to prevent.

This PR gives each EngineCore its own ThreadPoolExecutor and mx.Stream, passes the stream through Scheduler into BatchGenerator, and removes the _get_generation_stream() indirection from the MTP patch so MTP operations inherit the correct per-engine stream from the enclosing BatchGenerator context — the same pattern upstream's GenerationBatch._step() already uses.

The global executor is retained for non-LM engines (TTS, STT, embedding, reranker) that still rely on get_mlx_executor() and _init_mlx_thread.
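A minimal sketch of the ownership model described above, written against the standard library only so it runs without MLX. `FakeStream`, `DEFAULT_STREAM`, `SketchScheduler`, and `SketchEngineCore` are hypothetical stand-ins, not the classes in this PR; the real code uses an MLX stream where this uses a plain object.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Tuple


class FakeStream:
    """Hypothetical stand-in for an MLX stream (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name


# Hypothetical module-level default, playing the role of generation_stream.
DEFAULT_STREAM = FakeStream("module-level")


class SketchScheduler:
    def __init__(self, stream: Optional[FakeStream] = None) -> None:
        # Fall back to the module-level stream only when none is supplied.
        self._stream = stream if stream is not None else DEFAULT_STREAM

    def step(self) -> Tuple[str, str]:
        # Report which stream and which thread served this step.
        return self._stream.name, threading.current_thread().name


class SketchEngineCore:
    def __init__(self, name: str) -> None:
        # One single-worker executor per engine: steps of one engine stay
        # ordered, but different engines do not queue behind each other.
        self._executor = ThreadPoolExecutor(
            max_workers=1, thread_name_prefix=f"engine-{name}"
        )
        self._stream = FakeStream(name)
        self.scheduler = SketchScheduler(stream=self._stream)

    def step(self) -> Tuple[str, str]:
        return self._executor.submit(self.scheduler.step).result()

    def close(self) -> None:
        self._executor.shutdown(wait=True)


engine_a = SketchEngineCore("a")
engine_b = SketchEngineCore("b")
stream_a, thread_a = engine_a.step()
stream_b, thread_b = engine_b.step()
engine_a.close()
engine_b.close()
```

Each engine's step runs on its own worker thread and sees its own stream object, which is the property the new tests check for the real `EngineCore`.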

Changes

engine_core.py: EngineCore.__init__ creates a per-engine ThreadPoolExecutor + mx.new_thread_local_stream() and passes the stream to Scheduler. close() shuts down the per-engine executor after scheduler cleanup. Added _ensure_wired_limit() so the process-global mx.set_wired_limit() runs once rather than racing across concurrent BatchGenerator inits.
scheduler.py: Scheduler.__init__ accepts an optional stream parameter (falls back to the module-level generation_stream when not provided). All 37 internal references to generation_stream — sync barriers, cache clears, mx.stream() context managers, BatchGenerator creation — now use self._stream.
batch_generator.py (MTP patch): Removed _get_generation_stream() and the 4 explicit with mx.stream(...) wrappers that pushed the module-level stream. MTP forwards now inherit the per-engine stream from the enclosing BatchGenerator context, matching GenerationBatch._step()'s existing pattern.
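The run-once behaviour attributed to `_ensure_wired_limit()` above can be sketched with a lock-guarded flag. This is one common way to get "at most once per process" semantics and is an assumption about the approach, not a copy of the PR's code; `ensure_wired_limit` and its `set_limit` argument are hypothetical, with `set_limit` standing in for the process-global call.

```python
import threading

_wired_limit_lock = threading.Lock()
_wired_limit_done = False


def ensure_wired_limit(set_limit, limit_bytes):
    """Call set_limit(limit_bytes) at most once per process.

    Returns True for the caller that performed the call, False otherwise.
    """
    global _wired_limit_done
    with _wired_limit_lock:
        if _wired_limit_done:
            return False
        set_limit(limit_bytes)
        _wired_limit_done = True
        return True


# Eight threads race to set the limit; only one call goes through.
calls = []
threads = [
    threading.Thread(target=ensure_wired_limit, args=(calls.append, 1 << 30))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the lock across the call, not just around the flag check, is what keeps a second engine from proceeding while the first call is still in flight.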

Concurrent throughput

Two models generating simultaneously vs sequentially:

Model pair	Speedup before (shared executor)	Speedup after (per-engine)	Notable change
Qwen3-0.6B + Qwen3-Coder-Next-6bit	1.00x	1.14x	0.6B TTFT: 2089 ms → 701 ms
Qwen3.6-35B-A3B-oQ4-mtp + Qwen3.6-27B-oQ6-mtp	0.93x	1.12x	wall: 5408 ms → 4722 ms

Sub-2x is expected — Metal command buffers still serialize on one GPU. The win is CPU-side overlap (prefill + decode can be submitted concurrently) and eliminating head-of-line blocking where one engine's long prefill stalls another's token emission.
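The head-of-line blocking described above can be reproduced on the CPU side with plain executors. The sketch below is illustrative only: `long_prefill` and `emit_token` are hypothetical placeholders for one engine's prefill and another engine's token emission, and nothing here models GPU scheduling.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def long_prefill(release: threading.Event) -> str:
    # Holds its worker thread until the caller releases it.
    release.wait(timeout=10)
    return "prefill done"


def emit_token() -> str:
    return "token"


def short_task_finishes_first(prefill_pool, decode_pool) -> bool:
    """Return True if emit_token completes while long_prefill is still running."""
    release = threading.Event()
    long_future = prefill_pool.submit(long_prefill, release)
    short_future = decode_pool.submit(emit_token)
    try:
        short_future.result(timeout=0.5)
        finished_first = True
    except FutureTimeout:
        finished_first = False
    finally:
        release.set()  # let the long task finish in either case
    long_future.result()
    short_future.result()
    return finished_first


# Shared single-worker executor: the short task queues behind the long one.
shared = ThreadPoolExecutor(max_workers=1)
shared_result = short_task_finishes_first(shared, shared)
shared.shutdown()

# One executor per engine: the short task is not blocked.
pool_a = ThreadPoolExecutor(max_workers=1)
pool_b = ThreadPoolExecutor(max_workers=1)
per_engine_result = short_task_finishes_first(pool_a, pool_b)
pool_a.shutdown()
pool_b.shutdown()
```

With the shared pool the short task cannot start until the long one returns; with separate pools it completes immediately, which corresponds to the TTFT improvement for the small model in the table.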

Test plan

New tests/test_per_engine_threads.py (10 tests): verifies Scheduler stores and uses explicit streams, regex-scans the Scheduler class body for bare generation_stream references, confirms each EngineCore gets a distinct executor/stream, validates executor shutdown on close(), and asserts the MTP patch no longer contains _get_generation_stream or any generation_stream reference.
Updated tests/test_engine_core.py: existing executor tests now assert that the two engines' executors are distinct objects (is not) and that they execute concurrently (both executors active simultaneously).
Full suite passes: 4493 passed, 19 skipped.
Live-tested with concurrent Qwen3.6-35B-A3B-oQ4-mtp + Qwen3.6-27B-oQ6-mtp (both MTP-enabled) serving requests simultaneously.
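For readers unfamiliar with source-scanning tests, the regex check mentioned in the test plan can be sketched roughly as follows. The pattern, the allowlist, and the two sample sources are assumptions for illustration; the actual test in tests/test_per_engine_threads.py may differ in all three.

```python
import re

# Matches generation_stream used as a bare name, not as an attribute
# (x.generation_stream) or as part of a longer identifier.
BARE_REF = re.compile(r"(?<![\w.])generation_stream\b")

# The one permitted use: the fallback when no stream is passed in.
FALLBACK = "self._stream = stream if stream is not None else generation_stream"


def find_bare_references(source, allowed_lines=frozenset()):
    """Return (line_number, stripped_line) for each disallowed bare reference."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: ignores '#' inside strings
        if BARE_REF.search(code) and line.strip() not in allowed_lines:
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical class bodies, not the real Scheduler source.
CLEAN_SOURCE = """
class Scheduler:
    def __init__(self, stream=None):
        self._stream = stream if stream is not None else generation_stream

    def step(self):
        with mx.stream(self._stream):
            pass
"""

DIRTY_SOURCE = CLEAN_SOURCE + """
    def sync(self):
        mx.synchronize(generation_stream)
"""

clean_hits = find_bare_references(CLEAN_SOURCE, frozenset({FALLBACK}))
dirty_hits = find_bare_references(DIRTY_SOURCE, frozenset({FALLBACK}))
```

A scan like this catches regressions that a behavioural test could miss, since a stray module-level reference only misbehaves when two engines actually run at once.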

Related to #1248

fix(engine): per-engine threads to eliminate cross-engine stream cont…

0760768

…amination Replace the shared _global_mlx_executor with per-EngineCore ThreadPoolExecutor + mx.Stream, and fix the MTP patch reading the module-level generation_stream instead of the per-engine stream.

ivaniguarans wants to merge 1 commit into jundot:main from ivaniguarans:feat/per-engine-threads
