Add opt-in prompt-lookup decoding + auto-speculative router by benjamin-levin · Pull Request #1286 · ml-explore/mlx-lm

benjamin-levin · 2026-05-18T23:32:25Z

WIP / fork validation PR. Pre-existing mac_build_and_test failures (transformers 5.x tokenizer compatibility) are unrelated to this change.

Summary

Adds two new generators to mlx_lm.generate (both opt-in, default off):

prompt_lookup_generate_step: prompt-lookup decoding (PLD), a form of speculative decoding that drafts the next k tokens by n-gram lookup against the prompt, verifies them in a single main-model forward pass, accepts the longest prefix that matches greedy decoding, and trims the cache on a partial accept. No draft model is required.
auto_speculative_generate_step: routes between PLD and plain autoregressive (AR) decoding based on prompt length, n-gram density, and a 16-token PLD probe. If the probe fails, generation falls back to AR on the warm cache, so the prefill and probe cost is paid only once.
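
As a rough illustration of the "accept the greedy prefix, trim on partial accept" step described above, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `accept_greedy_prefix` and its return shape are hypothetical, and it assumes the verify forward pass has already produced one greedy token per position (k draft positions plus one).

```python
from typing import List, Sequence, Tuple


def accept_greedy_prefix(
    draft: Sequence[int], verified: Sequence[int]
) -> Tuple[List[int], int, int]:
    """Hypothetical helper, for illustration only.

    draft:    the k drafted tokens.
    verified: k + 1 greedy tokens from one verify forward pass, where
              verified[i] is the model's greedy choice given the context
              followed by draft[:i].

    Returns (emitted_tokens, num_accepted, num_to_trim).
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must hold len(draft) + 1 tokens")

    # Count how many leading draft tokens agree with greedy decoding.
    n = 0
    while n < len(draft) and draft[n] == verified[n]:
        n += 1

    # Emit the accepted draft tokens plus the model's own token at the
    # first mismatch (or the extra token when every draft was accepted).
    emitted = list(draft[:n]) + [verified[n]]

    # Cache entries written for rejected draft tokens must be trimmed.
    num_to_trim = len(draft) - n
    return emitted, n, num_to_trim
```

Because every emitted token is one the main model would have chosen greedily, output stays identical to plain greedy decoding under this scheme; only the number of forward passes changes.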

Motivation

On an Apple-silicon companion fork (mlx_fast.auto_speculative) the same router measured +17-25% across echo-heavy / code-edit / open-gen / qa-short workloads vs plain AR, with bit-exact greedy output (96/96 prompts).

The router pattern is the right default for "I don't know what the prompt looks like" workloads (e.g. an OpenAI-compatible server) because:

short prompts and prompts with no echo structure skip PLD entirely (zero overhead vs AR),
long structured prompts (RAG, code edit, translation, summarization) get the full PLD win,
the 16-token probe means the worst case is "AR + probe" rather than "AR + PLD overhead on every cycle."

Measured impact (Qwen3.6-35B-A3B-4bit on M4 Max 36GB, N=1)

From companion-repo benchmarks (mlx_fast/auto_speculative.py):

| workload | speedup vs AR | bit-exact |
| --- | --- | --- |
| echo | 1.17× | yes |
| code-edit | 1.08× | yes |
| open-gen | 1.25× | yes |
| qa-short | 1.21× | yes |

The router never loses to AR across the 96-prompt suite.

Implementation

Files changed: mlx_lm/generate.py (+559 / -7).

New functions (all module-public):

prompt_lookup_generate_step(prompt, model, prompt_ids, *, prompt_lookup_num_tokens=8, prompt_lookup_max_matches=2, ...) — mirrors speculative_generate_step's shape but the draft comes from _pld_find_draft (longest-suffix n-gram match against the prompt) instead of a draft model.
auto_speculative_generate_step(prompt, model, *, ...) — the router. Cheap length+n-gram pre-filter, then probes with PLD, gates on acceptance, continues with PLD or falls back to AR using the warm cache.
_pld_find_draft, _auto_spec_score — internal helpers.
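
To make the longest-suffix n-gram lookup concrete, here is a minimal sketch under stated assumptions. It is not the PR's `_pld_find_draft`: the name `find_draft`, the `max_ngram` parameter, and the choice to prefer the most recent match are all illustrative guesses.

```python
from typing import List, Sequence


def find_draft(
    prompt: Sequence[int],
    context: Sequence[int],
    max_ngram: int = 3,   # assumed cap on suffix length, not from the PR
    num_draft: int = 8,   # number of tokens to propose
) -> List[int]:
    """Hypothetical sketch of a prompt-lookup draft.

    Takes the last n tokens of `context` (prompt plus generated tokens),
    looks for the same n-gram inside `prompt`, and proposes the tokens
    that followed it there. Longer suffixes are tried first.
    """
    prompt = list(prompt)
    context = list(context)

    for n in range(min(max_ngram, len(context), len(prompt)), 0, -1):
        suffix = context[-n:]
        # Scan right to left so the most recent occurrence wins.
        for start in range(len(prompt) - n, -1, -1):
            if prompt[start:start + n] == suffix:
                continuation = prompt[start + n:start + n + num_draft]
                if continuation:  # skip matches at the very end of the prompt
                    return continuation
    return []  # no match, or no match with a continuation
```

For example, with `prompt = [1, 2, 3, 4, 5, 6]` and a context ending in `[1, 2]`, the sketch proposes `[3, 4, 5, 6]` (truncated to `num_draft` tokens).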

Opt-in flags (default off, matches the prefer-prefill-scheduler opt-in pattern):

stream_generate / generate accept:

auto_speculative=True — route via the auto router.
prompt_lookup_num_tokens=N — use PLD directly without the router.

Both are mutually exclusive with draft_model=. CLI exposes the matching --auto-speculative and --prompt-lookup-num-tokens flags.
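
The mutual-exclusion rule can be sketched as a small argument check. This is an illustration only: `resolve_decoding_mode` is a hypothetical helper, the returned mode strings are made up, and the precedence when both `auto_speculative` and `prompt_lookup_num_tokens` are set is an assumption that the PR description does not specify.

```python
from typing import Any, Optional


def resolve_decoding_mode(
    draft_model: Optional[Any] = None,
    auto_speculative: bool = False,
    prompt_lookup_num_tokens: Optional[int] = None,
) -> str:
    """Hypothetical helper: pick a decoding path from the opt-in kwargs."""
    wants_pld = auto_speculative or prompt_lookup_num_tokens is not None
    if draft_model is not None and wants_pld:
        raise ValueError(
            "auto_speculative / prompt_lookup_num_tokens cannot be "
            "combined with draft_model"
        )
    if auto_speculative:
        return "auto"          # assumed to take precedence over direct PLD
    if prompt_lookup_num_tokens is not None:
        return "pld"
    if draft_model is not None:
        return "draft_model"
    return "ar"                # default: plain autoregressive decoding
```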

Cache handling: PLD requires a trimmable prompt cache; the router falls back to AR (with the same warm cache) when the cache type isn't trimmable.

Trade-off

PLD has a per-cycle setup cost (the verify forward sees 1 + k_lookahead tokens instead of 1). Net wins require either successful drafts or amortized prefill via the warm-cache fallback. The router's length pre-filter + early-bail probe keeps the worst case at "AR plus one probe" rather than "AR with PLD overhead on every cycle."
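
A back-of-the-envelope model of this trade-off, as a sketch rather than a measurement: assume each verify cycle emits one model token plus the accepted draft tokens, and costs `verify_cost_ratio` times a single-token forward pass. Both function names are hypothetical, and the model ignores lookup and cache-trim overhead.

```python
def pld_speedup(mean_accepted: float, verify_cost_ratio: float) -> float:
    """Estimated throughput of PLD relative to plain AR decoding.

    mean_accepted:     average draft tokens accepted per verify cycle.
    verify_cost_ratio: cost of a (1 + k)-token verify forward pass divided
                       by the cost of a 1-token forward pass (>= 1).
    """
    if verify_cost_ratio <= 0:
        raise ValueError("verify_cost_ratio must be positive")
    # AR emits 1 token per unit cost; PLD emits (1 + accepted) tokens
    # per verify_cost_ratio units of cost.
    return (1.0 + mean_accepted) / verify_cost_ratio


def break_even_accepted(verify_cost_ratio: float) -> float:
    """Average accepted tokens per cycle needed to match AR throughput."""
    return verify_cost_ratio - 1.0
```

Under these assumptions, a verify pass that costs 1.2× a plain step needs an average of 0.2 accepted tokens per cycle to break even; with zero accepted tokens, every cycle is a net loss, which is the case the probe is meant to detect.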

Defaults are conservative:

_AUTO_SPEC_SHORT_LEN=256 — below this, skip PLD entirely.
_AUTO_SPEC_PROBE_TOKENS=16 — probe budget.
_AUTO_SPEC_PROBE_THRESHOLD=0.30 — acceptance rate needed to commit to PLD.
_AUTO_SPEC_PROBE_EARLY_BAIL=4 — consecutive misses → abort probe early.
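
How these four defaults might combine into a routing decision is sketched below. This is a guess at the control flow, not the PR's implementation: `route` and `run_probe` are hypothetical names, "acceptance rate" is assumed to mean accepted draft tokens divided by drafted tokens, a "miss" is assumed to be a verify cycle with zero accepted tokens, and the n-gram density pre-filter is omitted.

```python
from typing import Iterable, Tuple

# Values quoted in the PR description.
SHORT_LEN = 256
PROBE_TOKENS = 16
PROBE_THRESHOLD = 0.30
PROBE_EARLY_BAIL = 4


def run_probe(cycles: Iterable[Tuple[int, int]]) -> str:
    """Decide "pld" or "ar" from (drafted, accepted) counts per verify cycle."""
    drafted_total = accepted_total = emitted = misses = 0
    for drafted, accepted in cycles:
        drafted_total += drafted
        accepted_total += accepted
        emitted += accepted + 1  # accepted drafts plus one model token
        misses = misses + 1 if accepted == 0 else 0
        if misses >= PROBE_EARLY_BAIL:
            return "ar"          # abort the probe early
        if emitted >= PROBE_TOKENS:
            break                # probe budget spent
    rate = accepted_total / drafted_total if drafted_total else 0.0
    return "pld" if rate >= PROBE_THRESHOLD else "ar"


def route(prompt_len: int, cycles: Iterable[Tuple[int, int]]) -> str:
    """Length pre-filter first, then the probe."""
    if prompt_len < SHORT_LEN:
        return "ar"              # short prompt: skip PLD entirely
    return run_probe(cycles)
```

For instance, `route(1000, [(8, 8), (8, 8)])` commits to PLD after two fully accepted cycles, while four consecutive zero-accept cycles return `"ar"` before the 16-token budget is used up.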

Test plan

_pld_find_draft returns longest-suffix match, returns [] when no match or no continuation, handles empty inputs.
_auto_spec_score returns 0 for prompts below 256 tokens, ramps to 1 for long+repetitive prompts.
setup_arg_parser parses --auto-speculative and --prompt-lookup-num-tokens; defaults are False and None.
mlx_lm.generate module imports cleanly; all four new functions resolve.
stream_generate rejects auto_speculative=True together with draft_model=.
CI: mac_build_and_test (will fail on pre-existing transformers tokenizer tests, unrelated).
CI: check_lint.

Companion repo

The reference implementation that motivated this PR — including 3-way AR/MTP/PLD routing, exact GDN rollback on partial accept, and the +17-25% measurements — lives in mlx_fast/auto_speculative.py. mlx-lm doesn't ship an MTP draft-head primitive, so the companion's 3-way router collapses here into the 2-way AR/PLD router; the router shape is preserved so an MTP arm can drop in later.

🤖 Generated with Claude Code

Empty commit to trigger pull_request.yml workflow registration on the fork. No source changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the fork to the if-gate for both jobs in pull_request.yml. Lets PR CI on the fork run against the self-hosted M4 Max runner registered on this fork. DROP THIS COMMIT BEFORE UPSTREAM SUBMISSION. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Public forks suppress push/pull_request workflow events by default; adding workflow_dispatch lets us manually trigger CI for the fork. DROP BEFORE UPSTREAM SUBMISSION. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two new generators to mlx_lm.generate (both opt-in, default off):

* `prompt_lookup_generate_step`: PLD speculative decoding (drafts via n-gram lookup against the prompt; no draft model required). Verifies in one main-model forward, accepts the greedy prefix, trims the cache on partial accept.
* `auto_speculative_generate_step`: routes between PLD and plain AR based on prompt length, n-gram density, and a 16-token PLD probe. Probe-failure path falls back to AR on the warm cache so the prefill + probe cost is paid once.

Wiring. `stream_generate` accepts two new kwargs:

* `auto_speculative=True` (default False): route via the auto router.
* `prompt_lookup_num_tokens=N`: use PLD directly without the router.

Both are mutually exclusive with `draft_model=`. CLI exposes the matching `--auto-speculative` and `--prompt-lookup-num-tokens` flags.

Motivation. On an Apple-silicon companion fork (mlx_fast) the same router measured +17-25% across echo-heavy / code-edit / open-gen / qa-short workloads vs plain AR, with bit-exact greedy output.

Qwen3.6-35B-A3B-4bit on M4 Max 36GB (N=1):

echo 1.17x
code-edit 1.08x
open-gen 1.25x
qa-short 1.21x

Cross-route correctness: 96/96 prompts bit-exact vs AR.

Trade-off. PLD has a per-cycle setup cost (the verify forward sees 1 + k_lookahead tokens instead of 1), so net wins require either a successful draft or amortized prefill via the warm-cache fallback. The router's length pre-filter + early-bail probe keeps the worst case at "AR plus one probe" rather than "AR with PLD overhead on every cycle."

Note. mlx-lm doesn't ship an MTP draft-head primitive, so the companion fork's 3-way AR/MTP/PLD router collapses here into the 2-way AR/PLD router. The router shape stays identical so an MTP arm can drop in later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benjamin-levin · 2026-05-18T23:32:53Z

Wrong target — recreating against benjamin-levin/mlx-lm for fork CI validation.

benjamin-levin and others added 5 commits May 18, 2026 08:23

ci: enable GitHub Actions on fork

15aec37

ci: test workflow trigger

6fa7377

benjamin-levin closed this May 18, 2026

benjamin-levin wants to merge 5 commits into ml-explore:main from benjamin-levin:auto-speculative-router
