BatchGenerator: opt-in prefer_prefill_when_pending scheduler #1

Merged: benjamin-levin merged 1 commit into main from prefer-prefill-scheduler on May 19, 2026
@benjamin-levin (Owner) commented on May 18, 2026

Summary

Adds an opt-in prefer_prefill_when_pending kwarg to BatchGenerator that pauses decode steps while any prefill work is queued or in flight. Default is False, so existing behavior is unchanged.

Motivation

On Apple M-series unified-memory GPUs, prefill and decode share a single Metal command engine. With multiple concurrent requests at long context, the existing scheduler interleaves one decode step (~10 ms) with one prefill chunk (~1 s+ at 32k context) per cycle. Each in-flight decoding request is therefore stalled almost the entire time another request is prefilling, dropping per-victim decode to ~0.7-1 tok/s and yielding essentially no aggregate scaling (~0x) vs. a single request.

Pausing decode until prefill has drained lets the batch decode together at native batched speed once prefill catches up.
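
The stall described above can be checked with back-of-the-envelope arithmetic. The sketch below assumes each decoding request gets exactly one token per scheduler cycle and that a cycle costs one decode step plus one prefill chunk. The helper name `victim_decode_rate` and the 1.4 s upper chunk time are illustrative assumptions; the PR only states "~10 ms" and "~1 s+".

```python
def victim_decode_rate(decode_step_s: float, prefill_chunk_s: float) -> float:
    """Tokens/s for a decoding request that advances one token per cycle,
    where each cycle is one decode step followed by one prefill chunk."""
    return 1.0 / (decode_step_s + prefill_chunk_s)


# ~10 ms decode step and ~1 s prefill chunk, as quoted in the motivation.
fast = victim_decode_rate(0.010, 1.0)
# Assumed slower chunk (1.4 s) to illustrate the lower end of the range.
slow = victim_decode_rate(0.010, 1.4)

print(f"{slow:.2f} to {fast:.2f} tok/s per stalled request")
```

Under these assumptions the result falls in the same ~0.7-1 tok/s range reported for the victim requests.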

Measured impact (headline)

Qwen3.6-35B-A3B-4bit, M4 Max 36 GB, three concurrent requests at 32k context each (canonical isolated bench from mlx_fast/serve_integration.py, which monkey-patches the same scheduler logic this PR upstreams):

| mode | aggregate decode | per-request decode | scaling vs N=1 |
|---|---|---|---|
| prefer_prefill_when_pending=False (default) | 86 tok/s | ~0.8 tok/s (victim) | 0x |
| prefer_prefill_when_pending=True | 163 tok/s | symmetric | 1.89x |

Full matrix

Streaming chat-completion driven by mlx_lm.server (worktree build with this PR), wrapping BatchGenerator with a thin shim that toggles prefer_prefill_when_pending. Same M4 Max 36 GB, --decode-concurrency 8 --prompt-concurrency 4 --prefill-step-size 2048, Qwen3.6-35B-A3B-4bit, deterministic argmax sampling, no mlx_fast patches. per_decode is per-request tok/s once each request's first token has arrived (so it isolates decode throughput from prefill time). All input contexts below are actual tokens, not approximations.

PRIMARY: N concurrent requests x context tokens:

| cfg | OFF avg decode | OFF per_decode | ON avg decode | ON per_decode | per_decode speedup |
|---|---|---|---|---|---|
| N=2 ctx=8k actual | 79.8 tok/s | [79.7, 79.8] | 79.6 tok/s | [79.6, 79.6] | 1.00x |
| N=3 ctx=8k actual | 25.0 tok/s | [2.2, 2.2, 70.6] | 64.4 tok/s | [64.5, 64.4, 64.5] | 2.58x |
| N=4 ctx=8k actual | 16.1 tok/s | [2.0, 3.9, 2.0, 56.6] | 53.7 tok/s | [53.7, 53.7, 53.7, 53.7] | 3.33x |
| N=2 ctx=32k actual | 53.4 tok/s | [55.1, 51.8] | 60.6 tok/s | [60.7, 60.6] | 1.13x |
| N=3 ctx=32k actual | 27.4 tok/s | [0.8, 0.8, 80.7] | 46.8 tok/s | [46.9, 46.8, 46.6] | 1.71x |

The OFF column's per_decode arrays show the bug directly: at N>=3, all but one request (two of three, or three of four) get only ~0.8-3.9 tok/s while a single privileged request decodes near its solo speed. With the flag on, every request gets a symmetric, near-batched per-decode rate. N=2 is largely unaffected (1.00x at 8k, 1.13x at 32k) because the default scheduler's interleave pattern still lets both requests make progress; the bug fires whenever a third request prefills while two are already decoding.
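
As a sanity check on how the columns relate, the sketch below recomputes the averages and speedups from the per_decode arrays in the PRIMARY matrix. It assumes "avg decode" is the plain mean of the per-request rates and that the speedup is ON avg divided by OFF avg; the PR does not state either definition, but both are consistent with the reported numbers to within rounding. The `fairness` ratio (slowest over fastest request) is an added illustration, not a metric from the PR.

```python
from statistics import mean

# (cfg, OFF per_decode, ON per_decode, reported speedup), copied from the matrix.
rows = [
    ("N=2 ctx=8k", [79.7, 79.8], [79.6, 79.6], 1.00),
    ("N=3 ctx=8k", [2.2, 2.2, 70.6], [64.5, 64.4, 64.5], 2.58),
    ("N=4 ctx=8k", [2.0, 3.9, 2.0, 56.6], [53.7, 53.7, 53.7, 53.7], 3.33),
    ("N=2 ctx=32k", [55.1, 51.8], [60.7, 60.6], 1.13),
    ("N=3 ctx=32k", [0.8, 0.8, 80.7], [46.9, 46.8, 46.6], 1.71),
]


def summarize(off, on):
    """Return (OFF avg, ON avg, speedup) under the assumed definitions."""
    off_avg, on_avg = mean(off), mean(on)
    return off_avg, on_avg, on_avg / off_avg


def fairness(rates):
    """Slowest / fastest per-request rate; 1.0 means perfectly symmetric."""
    return min(rates) / max(rates)


for cfg, off, on, reported in rows:
    off_avg, on_avg, speedup = summarize(off, on)
    print(f"{cfg}: OFF {off_avg:.1f} -> ON {on_avg:.1f} tok/s, "
          f"{speedup:.2f}x (reported {reported:.2f}x), "
          f"fairness OFF {fairness(off):.2f} / ON {fairness(on):.2f}")
```

Because the inputs are already rounded to one decimal place, a recomputed speedup can differ from the reported one in the last digit.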

OFF-TARGET (N=1; flag must be a no-op for single-stream):

| cfg | OFF decode tok/s | ON decode tok/s | delta |
|---|---|---|---|
| N=1 ctx=1k | 106.4 | 106.1 | -0.3% |
| N=1 ctx=4k | 102.6 | 102.2 | -0.4% |

STRESS / NOTES:

  • N=4 ctx=32k actual: OOMs on M4 Max 36 GB (Metal allocation kIOGPUCommandBufferCallbackErrorOutOfMemory) for both modes - 4 KV caches of 32k for a 35B-A3B model exceed the 31 GB wired limit. Not a regression vs the existing scheduler.
  • N=5 and N=6 at ctx=8k were planned as stress cases but are missing from the matrix above because the server had already died from the N=4 ctx=32k OOM earlier in the run. With a safe ordering, the flag's symmetric-decode behavior continues to hold at N=5 and N=6 (batchgen-scheduler-fix.md measurements: the same fix scales monotonically through N=6 on the same hardware).

API

New kwarg on BatchGenerator.__init__:

BatchGenerator(
    model,
    ...,
    prefer_prefill_when_pending: bool = False,
)

When True, a step that has any queued/in-flight prefill work skips the decode this cycle unless the generation batch is already saturated (len(generation_batch) >= completion_batch_size). When False, scheduling is identical to today's behavior, byte-for-byte.
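
The rule above can be sketched as a single predicate. This is a minimal illustration of the documented behavior, not the code from the PR: the function `should_skip_decode` and its parameters `has_pending_prefill` and `generation_batch_len` are hypothetical names, while `prefer_prefill_when_pending` and `completion_batch_size` come from the description.

```python
def should_skip_decode(
    prefer_prefill_when_pending: bool,
    has_pending_prefill: bool,      # hypothetical: any queued or in-flight prefill
    generation_batch_len: int,      # hypothetical: len(generation_batch)
    completion_batch_size: int,
) -> bool:
    """Return True if this cycle should run prefill only and skip decode."""
    if not prefer_prefill_when_pending:
        return False  # flag off: scheduling is unchanged
    if not has_pending_prefill:
        return False  # nothing to prefill: decode as usual
    # A saturated generation batch keeps decoding even with prefill pending.
    saturated = generation_batch_len >= completion_batch_size
    return not saturated


# Flag on, one request decoding, a second prompt queued, room in the batch.
print(should_skip_decode(True, True, 1, 8))   # True
# Same state with the flag off.
print(should_skip_decode(False, True, 1, 8))  # False
```

Keeping the saturation check inside the predicate means a full batch is never starved by a steady stream of new prompts.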

Trade-off

With the flag enabled, late-arriving requests' TTFT is unchanged, but already-decoding requests "freeze" until the new prefill has drained. For background-agent and chatbot-batch workloads where aggregate throughput dominates, this is the right trade. For single-stream, latency-sensitive use it is not. Hence: opt-in, default off.

Backwards compatibility

Default value is False, so:

  • No existing callers see any behavior change. (Confirmed empirically by the N=1 off-target rows above: -0.3% / -0.4% deltas are within run-to-run noise on this hardware.)
  • The kwarg is appended after the existing kwargs, so positional call sites are unaffected; like the other options in the existing signature, it is keyword-only by convention.
  • All other BatchGenerator semantics, attributes, and method signatures are untouched.

Tests

Added three tests in tests/test_generate.py:

  • test_prefer_prefill_when_pending_default_false - asserts the flag defaults to False, locking in unchanged-default behavior.
  • test_prefer_prefill_when_pending_accepted_and_stored - asserts the kwarg is accepted and stored on the instance.
  • test_prefer_prefill_pauses_decode_when_prefill_pending - constructs the exact scheduler state the flag targets (one in-flight decode below saturation + a queued second prompt) and asserts the new branch fires: decode advances by zero tokens that cycle when the flag is on, and advances normally when the flag is off.
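
The shape of the third test can be illustrated against a toy scheduler. Everything below is a hypothetical stand-in: `ToyScheduler`, its fields, and the two test functions are invented for illustration and do not reproduce the real BatchGenerator or the tests in tests/test_generate.py. The toy also omits moving a finished prefill into the decode batch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical model of the scheduling loop; not mlx_lm's BatchGenerator."""
    completion_batch_size: int
    prefer_prefill_when_pending: bool = False
    queued_prefill_chunks: int = 0               # prefill chunks still to run
    decoded: dict = field(default_factory=dict)  # request id -> tokens decoded

    def step(self) -> int:
        """Run one cycle and return how many tokens were decoded in it."""
        pending = self.queued_prefill_chunks > 0
        saturated = len(self.decoded) >= self.completion_batch_size
        if pending:
            self.queued_prefill_chunks -= 1      # one prefill chunk per cycle
        if self.prefer_prefill_when_pending and pending and not saturated:
            return 0                             # decode paused this cycle
        for uid in self.decoded:
            self.decoded[uid] += 1               # one token per decoding request
        return len(self.decoded)


def make(flag: bool) -> ToyScheduler:
    # One in-flight decode below saturation plus one queued prefill chunk.
    return ToyScheduler(completion_batch_size=4,
                        prefer_prefill_when_pending=flag,
                        queued_prefill_chunks=1,
                        decoded={"req-0": 0})


def test_flag_on_pauses_decode():
    sched = make(True)
    assert sched.step() == 0            # prefill pending: no decode this cycle
    assert sched.decoded["req-0"] == 0
    assert sched.step() == 1            # prefill drained: decode resumes


def test_flag_off_decodes_normally():
    sched = make(False)
    assert sched.step() == 1
    assert sched.decoded["req-0"] == 1


test_flag_on_pauses_decode()
test_flag_off_decodes_normally()
```

Asserting on tokens decoded per cycle, rather than on wall-clock timing, keeps a test of this kind deterministic.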

@benjamin-levin force-pushed the prefer-prefill-scheduler branch from aa5008b to 29dc1b2 on May 19, 2026 00:06
@benjamin-levin changed the title from "WIP: BatchGenerator opt-in prefer_prefill_when_pending scheduler (CI test)" to "BatchGenerator: opt-in prefer_prefill_when_pending scheduler" on May 19, 2026
Adds an opt-in `prefer_prefill_when_pending` kwarg to `BatchGenerator`
that pauses decode steps while any prefill work is queued or in flight.
Default is `False`, so existing behavior is unchanged.

Motivation
----------

On Apple M-series unified-memory GPUs, prefill and decode share a single
Metal command engine. With multiple concurrent requests at long context,
the existing scheduler interleaves one decode step (~10 ms) with one
prefill chunk (~1 s+ at 32k context) per cycle. Each in-flight decoding
request is therefore stalled almost the entire time another request is
prefilling, dropping per-victim decode to ~0.7-1 tok/s and yielding
essentially no aggregate scaling (~0x) vs. a single request.

Pausing decode until prefill has drained lets the batch decode together
at native batched speed once prefill catches up.

Measured impact (Qwen3.6-35B-A3B-4bit, M4 Max 36GB)
---------------------------------------------------

Three concurrent requests at 32k context each:

  mode                                            aggregate    per-request    scaling
  ----------------------------------------------- ------------ -------------- -------
  prefer_prefill_when_pending=False (default)     86 tok/s     ~0.8 victim    0x
  prefer_prefill_when_pending=True                163 tok/s    symmetric      1.89x

Trade-off
---------

With the flag enabled, late-arriving requests' TTFT is unchanged but
already-decoding requests "freeze" briefly while new prefill runs. For
background-agent and chatbot-batch workloads where aggregate throughput
dominates, this is the right trade. For single-stream low-latency it is
not. Hence: opt-in, default off.

Backwards compatibility
-----------------------

Default value is `False`, so existing callers see no behavior change.
All other `BatchGenerator` semantics, attributes, and method signatures
are untouched.

Tests
-----

Three tests in `tests/test_generate.py`:

  - test_prefer_prefill_when_pending_default_false: locks in default
    behavior.
  - test_prefer_prefill_when_pending_accepted_and_stored: asserts the
    kwarg is accepted and stored on the instance.
  - test_prefer_prefill_pauses_decode_when_prefill_pending: constructs
    the exact scheduler state the flag targets (one in-flight decode +
    one queued prefill, below saturation) and asserts the new branch
    fires: decode advances by zero tokens that cycle when the flag is
    on, and advances normally when the flag is off.
@benjamin-levin force-pushed the prefer-prefill-scheduler branch from 4df8c04 to ad775f0 on May 19, 2026 04:59
@benjamin-levin marked this pull request as ready for review on May 19, 2026 18:24
@benjamin-levin merged commit 3d0f08f into main on May 19, 2026
1 of 2 checks passed