BatchGenerator: opt-in prefer_prefill_when_pending scheduler #1
Merged
This was referenced May 18, 2026
Adds an opt-in `prefer_prefill_when_pending` kwarg to `BatchGenerator`
that pauses decode steps while any prefill work is queued or in flight.
Default is `False`, so existing behavior is unchanged.
Motivation
----------
On Apple M-series unified-memory GPUs, prefill and decode share a single
Metal command engine. With multiple concurrent requests at long context,
the existing scheduler interleaves one decode step (~10 ms) with one
prefill chunk (~1 s+ at 32k context) per cycle. Each in-flight decoding
request is therefore stalled almost the entire time another request is
prefilling, dropping per-victim decode to ~0.7-1 tok/s and yielding ~0x
aggregate scaling vs. a single request.
Pausing decode until prefill has drained lets the batch decode together
at native batched speed once prefill catches up.
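The stall can be checked with a back-of-the-envelope calculation. The sketch below assumes the interleaved schedule described above (one decode step plus one prefill chunk per cycle, one token per decoding request per step); `victim_decode_rate` is a hypothetical helper for illustration, not part of `BatchGenerator`, and the 1.4 s chunk time is an assumed value standing in for "1 s+".

```python
def victim_decode_rate(decode_step_s: float, prefill_chunk_s: float) -> float:
    """Tokens/s for one decoding request while another request prefills.

    Assumes each scheduler cycle runs one decode step (one token for the
    decoding request) followed by one prefill chunk.
    """
    return 1.0 / (decode_step_s + prefill_chunk_s)


# ~10 ms decode step and ~1 s prefill chunk, as quoted in the text.
print(f"{victim_decode_rate(0.010, 1.0):.2f} tok/s")

# 1.4 s per chunk is an assumption used to illustrate "1 s+".
print(f"{victim_decode_rate(0.010, 1.4):.2f} tok/s")
```

Under these assumptions the two calls give roughly 0.99 and 0.71 tok/s, which is consistent with the ~0.7-1 tok/s per-victim range reported above.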
Measured impact (Qwen3.6-35B-A3B-4bit, M4 Max 36GB)
---------------------------------------------------
Three concurrent requests at 32k context each:
mode                                         aggregate  per-request          scaling
-------------------------------------------  ---------  -------------------  -------
prefer_prefill_when_pending=False (default)  86 tok/s   ~0.8 tok/s (victim)  0x
prefer_prefill_when_pending=True             163 tok/s  symmetric            1.89x
Trade-off
---------
With the flag enabled, late-arriving requests' TTFT is unchanged but
already-decoding requests "freeze" briefly while new prefill runs. For
background-agent and chatbot-batch workloads where aggregate throughput
dominates, this is the right trade. For single-stream low-latency it is
not. Hence: opt-in, default off.
Backwards compatibility
-----------------------
Default value is `False`, so existing callers see no behavior change.
All other `BatchGenerator` semantics, attributes, and method signatures
are untouched.
Tests
-----
Three tests in `tests/test_generate.py`:
- test_prefer_prefill_when_pending_default_false: locks in default
  behavior.
- test_prefer_prefill_when_pending_accepted_and_stored: asserts the
  kwarg is accepted and stored on the instance.
- test_prefer_prefill_pauses_decode_when_prefill_pending: constructs
  the exact scheduler state the flag targets (one in-flight decode +
  one queued prefill, below saturation) and asserts the new branch
  fires: decode advances by zero tokens that cycle when the flag is
  on, and advances normally when the flag is off.
Summary
-------
Adds an opt-in `prefer_prefill_when_pending` kwarg to `BatchGenerator` that pauses decode steps while any prefill work is queued or in flight. Default is `False`, so existing behavior is unchanged.
Motivation
----------
On Apple M-series unified-memory GPUs, prefill and decode share a single Metal command engine. With multiple concurrent requests at long context, the existing scheduler interleaves one decode step (~10 ms) with one prefill chunk (~1 s+ at 32k context) per cycle. Each in-flight decoding request is therefore stalled almost the entire time another request is prefilling, dropping per-victim decode to ~0.7-1 tok/s and yielding ~0x aggregate scaling vs. a single request.
Pausing decode until prefill has drained lets the batch decode together at native batched speed once prefill catches up.
Measured impact (headline)
--------------------------
Qwen3.6-35B-A3B-4bit, M4 Max 36 GB, three concurrent requests at 32k context each (canonical isolated bench from `mlx_fast/serve_integration.py`, which monkey-patches the same scheduler logic this PR upstreams):
- `prefer_prefill_when_pending=False` (default)
- `prefer_prefill_when_pending=True`
Full matrix
-----------
Streaming chat-completion driven by `mlx_lm.server` (worktree build with this PR), wrapping `BatchGenerator` with a thin shim that toggles `prefer_prefill_when_pending`. Same M4 Max 36 GB, `--decode-concurrency 8 --prompt-concurrency 4 --prefill-step-size 2048`, Qwen3.6-35B-A3B-4bit, deterministic argmax sampling, no `mlx_fast` patches. `per_decode` is per-request tok/s once each request's first token has arrived (so it isolates decode throughput from prefill time). All input contexts below are actual tokens, not approximations.
PRIMARY: N concurrent requests x context tokens:
The OFF column's `per_decode` arrays show the bug directly: at N>=3, two of three (or three of four) requests get only ~0.8-3.9 tok/s while a single privileged request decodes near its solo speed. With the flag on, every request gets a symmetric, near-batched per-decode rate. N=2 is not affected because the default scheduler's interleave pattern still allows both requests to make progress; the bug fires whenever a third request prefills while two are already decoding.
OFF-TARGET (N=1; flag must be a no-op for single-stream):
STRESS / NOTES:
kIOGPUCommandBufferCallbackErrorOutOfMemory) for both modes - 4 KV caches of 32k for a 35B-A3B model exceed the 31 GB wired limit. Not a regression vs the existing scheduler.
API
---
New kwarg on `BatchGenerator.__init__`: `prefer_prefill_when_pending`, default `False`.
When `True`, a step that has any queued/in-flight prefill work skips the decode this cycle unless the generation batch is already saturated (`len(generation_batch) >= completion_batch_size`). When `False`, scheduling is identical to today's behavior, byte-for-byte.
Trade-off
---------
With the flag enabled, late-arriving requests' TTFT is unchanged but already-decoding requests "freeze" briefly while new prefill runs. For background-agent and chatbot-batch workloads where aggregate throughput dominates, this is the right trade. For single-stream low-latency it is not. Hence: opt-in, default off.
Backwards compatibility
-----------------------
Default value is `False`, so:
- existing callers see no behavior change;
- all other `BatchGenerator` semantics, attributes, and method signatures are untouched.
Tests
-----
Added three tests in `tests/test_generate.py`:
- `test_prefer_prefill_when_pending_default_false` - asserts the flag defaults to `False`, locking in unchanged-default behavior.
- `test_prefer_prefill_when_pending_accepted_and_stored` - asserts the kwarg is accepted and stored on the instance.
- `test_prefer_prefill_pauses_decode_when_prefill_pending` - constructs the exact scheduler state the flag targets (one in-flight decode below saturation + a queued second prompt) and asserts the new branch fires: decode advances by zero tokens that cycle when the flag is on, and advances normally when the flag is off.
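
For reviewers who want the scheduling rule in one place, here is a minimal sketch of the decision stated in the API section, together with the state the third test targets. `should_skip_decode` and its parameters are hypothetical names chosen for illustration; this restates the rule as described in the PR text and is not the code merged here. The value `completion_batch_size=4` is an arbitrary example.

```python
def should_skip_decode(
    prefer_prefill_when_pending: bool,
    prefill_pending: bool,
    generation_batch_len: int,
    completion_batch_size: int,
) -> bool:
    """Return True if this scheduler cycle should skip its decode step.

    Hypothetical helper restating the documented rule: with the flag on,
    pending prefill work pauses decode unless the generation batch is
    already saturated.
    """
    if not prefer_prefill_when_pending:
        return False  # flag off: scheduling is unchanged
    saturated = generation_batch_len >= completion_batch_size
    return prefill_pending and not saturated


# State targeted by the third test: one in-flight decode, one queued prompt,
# below saturation.
assert should_skip_decode(True, True, 1, 4) is True    # flag on: decode pauses
assert should_skip_decode(False, True, 1, 4) is False  # flag off: decode runs

# Cases where the flag is on but decode still runs.
assert should_skip_decode(True, False, 1, 4) is False  # nothing left to prefill
assert should_skip_decode(True, True, 4, 4) is False   # batch saturated
```

Keeping the rule as a pure predicate like this makes the saturation exception easy to cover in a unit test without constructing a model.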