BatchGenerator: opt-in prefer_prefill_when_pending scheduler by benjamin-levin · Pull Request #1 · benjamin-levin/mlx-lm

benjamin-levin · 2026-05-18T12:17:57Z

Summary

Adds an opt-in prefer_prefill_when_pending kwarg to BatchGenerator that pauses decode steps while any prefill work is queued or in flight. Default is False, so existing behavior is unchanged.

Motivation

On Apple M-series unified-memory GPUs, prefill and decode share a single Metal command engine. With multiple concurrent requests at long context, the existing scheduler interleaves one decode step (~10 ms) with one prefill chunk (~1 s+ at 32k context) per cycle. Each in-flight decoding request is therefore stalled for almost the entire time another request is prefilling, which drops per-victim decode to ~0.7-1 tok/s and yields ~0x aggregate scaling vs. a single request, that is, no aggregate gain from the added concurrency.

Pausing decode until prefill has drained lets the batch decode together at native batched speed once prefill catches up.
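As a rough back-of-envelope check of the per-victim figure, the sketch below plugs in the approximate timings quoted above (~10 ms per decode step, ~1 s per prefill chunk). The variable names are illustrative only and the inputs are the rounded estimates from this section, not new measurements.

```python
# Approximate timings quoted in the Motivation section; real values vary
# with context length and model.
decode_step_s = 0.010    # one decode step, ~10 ms
prefill_chunk_s = 1.0    # one prefill chunk, ~1 s+ at 32k context

# With one decode step interleaved per prefill chunk, a decoding request
# emits one token per cycle (its own decode step plus someone else's chunk).
cycle_s = decode_step_s + prefill_chunk_s
victim_tok_per_s = 1.0 / cycle_s

# Fraction of each cycle the decoding request spends waiting on prefill.
stalled_fraction = prefill_chunk_s / cycle_s

print(f"victim decode rate: {victim_tok_per_s:.2f} tok/s")
print(f"time stalled on prefill: {stalled_fraction:.1%}")
```

Under these assumed timings the victim rate lands just under 1 tok/s, consistent with the ~0.7-1 tok/s range reported above; prefill chunks longer than 1 s push it lower.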

Measured impact (headline)

Qwen3.6-35B-A3B-4bit, M4 Max 36 GB, three concurrent requests at 32k context each (canonical isolated bench from mlx_fast/serve_integration.py, which monkey-patches the same scheduler logic this PR upstreams):

| mode | aggregate decode | per-request decode | scaling vs N=1 |
| --- | --- | --- | --- |
| `prefer_prefill_when_pending=False` (default) | 86 tok/s | ~0.8 tok/s (victim) | 0x |
| `prefer_prefill_when_pending=True` | 163 tok/s | symmetric | 1.89x |

Full matrix

Measured with streaming chat-completion requests driven through mlx_lm.server (worktree build with this PR), with BatchGenerator wrapped in a thin shim that toggles prefer_prefill_when_pending. Setup: the same M4 Max 36 GB, --decode-concurrency 8 --prompt-concurrency 4 --prefill-step-size 2048, Qwen3.6-35B-A3B-4bit, deterministic argmax sampling, no mlx_fast patches. per_decode is the per-request tok/s measured once each request's first token has arrived, so it isolates decode throughput from prefill time. All input context sizes below are actual token counts, not approximations.

PRIMARY: N concurrent requests x context tokens:

| cfg | OFF avg decode | OFF per_decode | ON avg decode | ON per_decode | per_decode speedup (ON avg / OFF avg) |
| --- | --- | --- | --- | --- | --- |
| N=2 ctx=8k actual | 79.8 tok/s | [79.7, 79.8] | 79.6 tok/s | [79.6, 79.6] | 1.00x |
| N=3 ctx=8k actual | 25.0 tok/s | [2.2, 2.2, 70.6] | 64.4 tok/s | [64.5, 64.4, 64.5] | 2.58x |
| N=4 ctx=8k actual | 16.1 tok/s | [2.0, 3.9, 2.0, 56.6] | 53.7 tok/s | [53.7, 53.7, 53.7, 53.7] | 3.33x |
| N=2 ctx=32k actual | 53.4 tok/s | [55.1, 51.8] | 60.6 tok/s | [60.7, 60.6] | 1.13x |
| N=3 ctx=32k actual | 27.4 tok/s | [0.8, 0.8, 80.7] | 46.8 tok/s | [46.9, 46.8, 46.6] | 1.71x |

The OFF column's per_decode arrays show the bug directly: at N>=3, two of three (or three of four) requests get only ~0.8-3.9 tok/s while a single privileged request decodes near its solo speed. With the flag on, every request gets a symmetric, near-batched per-request decode rate. N=2 is largely unaffected (1.00x at 8k, 1.13x at 32k) because the default scheduler's interleave pattern still lets both requests make progress; the bug fires whenever a third request prefills while two are already decoding.
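For readers who want to check the matrix, the sketch below recomputes the average decode rate and the speedup from the per_decode arrays in the PRIMARY table. It assumes "avg decode" is the plain mean of the per-request rates and that the speedup is the ratio of the ON mean to the OFF mean; both assumptions reproduce the published figures to within rounding. The `summarize` helper and the `spread` metric (fastest over slowest OFF request) are illustrative additions, not part of the PR.

```python
from statistics import mean

# per_decode arrays (tok/s) copied from the PRIMARY table: (OFF, ON) per config.
matrix = {
    "N=2 ctx=8k": ([79.7, 79.8], [79.6, 79.6]),
    "N=3 ctx=8k": ([2.2, 2.2, 70.6], [64.5, 64.4, 64.5]),
    "N=4 ctx=8k": ([2.0, 3.9, 2.0, 56.6], [53.7, 53.7, 53.7, 53.7]),
    "N=2 ctx=32k": ([55.1, 51.8], [60.7, 60.6]),
    "N=3 ctx=32k": ([0.8, 0.8, 80.7], [46.9, 46.8, 46.6]),
}


def summarize(off, on):
    """Return (OFF mean, ON mean, ON/OFF speedup, OFF fastest/slowest spread)."""
    off_avg, on_avg = mean(off), mean(on)
    return off_avg, on_avg, on_avg / off_avg, max(off) / min(off)


results = {cfg: summarize(off, on) for cfg, (off, on) in matrix.items()}

for cfg, (off_avg, on_avg, speedup, spread) in results.items():
    print(
        f"{cfg:12s} OFF avg {off_avg:5.1f}  ON avg {on_avg:5.1f}  "
        f"speedup {speedup:4.2f}x  OFF spread {spread:6.1f}x"
    )
```

Because the inputs are already rounded to one decimal place, a recomputed speedup can differ from the table in the last digit.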

OFF-TARGET (N=1; flag must be a no-op for single-stream):

| cfg | OFF decode tok/s | ON decode tok/s | delta |
| --- | --- | --- | --- |
| N=1 ctx=1k | 106.4 | 106.1 | -0.3% |
| N=1 ctx=4k | 102.6 | 102.2 | -0.4% |

STRESS / NOTES:

N=4 ctx=32k actual: OOMs on M4 Max 36 GB (Metal allocation kIOGPUCommandBufferCallbackErrorOutOfMemory) for both modes - 4 KV caches of 32k for a 35B-A3B model exceed the 31 GB wired limit. Not a regression vs the existing scheduler.
N=5/6 ctx=8k: not run in the matrix above because the server had already died from the N=4 ctx=32k OOM earlier in the run. With safe ordering, the flag's symmetric-decode behavior continues to hold at N=5 and N=6 (per the batchgen-scheduler-fix.md measurements, the same fix scales monotonically through N=6 on the same hardware).

API

New kwarg on BatchGenerator.__init__:

BatchGenerator(
    model,
    ...,
    prefer_prefill_when_pending: bool = False,
)

When True, a step that has any queued/in-flight prefill work skips the decode this cycle unless the generation batch is already saturated (len(generation_batch) >= completion_batch_size). When False, scheduling is identical to today's behavior, byte-for-byte.
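The rule described above can be read as a small predicate. The following is a minimal sketch of that decision only; `skip_decode_this_step` and its parameter names are hypothetical and are not taken from the mlx-lm source, which may structure the check differently.

```python
def skip_decode_this_step(
    prefer_prefill_when_pending: bool,
    prefill_pending: bool,
    generation_batch_len: int,
    completion_batch_size: int,
) -> bool:
    """Hypothetical helper: True if this cycle's decode should be skipped.

    prefill_pending stands for "any prefill work is queued or in flight".
    """
    if not prefer_prefill_when_pending:
        # Flag off: the new branch never fires, scheduling is unchanged.
        return False
    # A saturated generation batch keeps decoding even if prefill is waiting.
    saturated = generation_batch_len >= completion_batch_size
    return prefill_pending and not saturated


# Flag on, prefill waiting, batch below saturation: decode is paused.
print(skip_decode_this_step(True, True, 2, 8))
# Flag on, prefill waiting, batch saturated: decode proceeds.
print(skip_decode_this_step(True, True, 8, 8))
# Flag off: decode always proceeds.
print(skip_decode_this_step(False, True, 2, 8))
```

The saturation escape hatch means decode can only be paused while there is still room in the generation batch for the pending request to join.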

Trade-off

With the flag enabled, late-arriving requests' TTFT is unchanged but already-decoding requests "freeze" briefly while new prefill runs. For background-agent and chatbot-batch workloads where aggregate throughput dominates, this is the right trade. For single-stream low-latency it is not. Hence: opt-in, default off.

Backwards compatibility

Default value is False, so:

No existing callers see any behavior change. (The N=1 off-target rows above additionally show that even with the flag on, single-stream decode changes by only -0.3% / -0.4%, which is within run-to-run noise on this hardware.)
The kwarg is appended after the existing kwargs and, by the convention of the existing signature, is passed by keyword.
All other BatchGenerator semantics, attributes, and method signatures are untouched.

Tests

Added three tests in tests/test_generate.py:

test_prefer_prefill_when_pending_default_false - asserts the flag defaults to False, locking in unchanged-default behavior.
test_prefer_prefill_when_pending_accepted_and_stored - asserts the kwarg is accepted and stored on the instance.
test_prefer_prefill_pauses_decode_when_prefill_pending - constructs the exact scheduler state the flag targets (one in-flight decode below saturation + a queued second prompt) and asserts the new branch fires: decode advances by zero tokens that cycle when the flag is on, and advances normally when the flag is off.
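To illustrate the shape of the third test, here is a self-contained toy version. `ToyScheduler` is a stand-in written for this example and is not the mlx-lm `BatchGenerator`; it models each prompt as prefilling in a single chunk and tracks only token counts, so it shows the scheduler state under test rather than the real test code.

```python
from collections import deque


class ToyScheduler:
    """Toy stand-in for illustration only; NOT the mlx-lm BatchGenerator."""

    def __init__(self, completion_batch_size, prefer_prefill_when_pending=False):
        self.completion_batch_size = completion_batch_size
        self.prefer_prefill_when_pending = prefer_prefill_when_pending
        self.prefill_queue = deque()   # prompts waiting for prefill
        self.generation_batch = []     # tokens decoded so far, per active request

    def step(self):
        """Run one cycle and return how many tokens were decoded in it."""
        prefill_pending = bool(self.prefill_queue)
        saturated = len(self.generation_batch) >= self.completion_batch_size
        skip_decode = (
            self.prefer_prefill_when_pending and prefill_pending and not saturated
        )
        decoded = 0
        if not skip_decode:
            self.generation_batch = [n + 1 for n in self.generation_batch]
            decoded = len(self.generation_batch)
        if prefill_pending:
            # Toy assumption: a prompt prefills in one chunk, then joins decode.
            self.prefill_queue.popleft()
            self.generation_batch.append(0)
        return decoded


def make_state(flag):
    """One in-flight decode below saturation plus one queued second prompt."""
    sched = ToyScheduler(completion_batch_size=4, prefer_prefill_when_pending=flag)
    sched.generation_batch = [5]
    sched.prefill_queue.append("second prompt")
    return sched


def test_toy_prefer_prefill_pauses_decode():
    on = make_state(True)
    assert on.step() == 0    # flag on: decode paused while prefill is pending
    assert on.step() == 2    # prefill drained: both requests decode together
    off = make_state(False)
    assert off.step() == 1   # flag off: the in-flight request decodes as usual


test_toy_prefer_prefill_pauses_decode()
```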


This was referenced May 18, 2026

Add opt-in prompt-lookup decoding + auto-speculative router ml-explore/mlx-lm#1286

Closed

Persistent prompt cache (opt-in disk-backed KV reuse) #6

Merged

Opt-in prompt-lookup decoding + auto-speculative router #7

Merged

benjamin-levin force-pushed the prefer-prefill-scheduler branch from aa5008b to 29dc1b2 Compare May 19, 2026 00:06

benjamin-levin changed the title ~~WIP: BatchGenerator opt-in prefer_prefill_when_pending scheduler (CI test)~~ BatchGenerator: opt-in prefer_prefill_when_pending scheduler May 19, 2026

benjamin-levin force-pushed the prefer-prefill-scheduler branch from 4df8c04 to ad775f0 Compare May 19, 2026 04:59

benjamin-levin marked this pull request as ready for review May 19, 2026 18:24

benjamin-levin merged commit 3d0f08f into main May 19, 2026
1 of 2 checks passed

benjamin-levin merged 1 commit into main from prefer-prefill-scheduler
