feat(dflash): Path A double-engine + Gemma 4 backend — generalized for Qwen + Gemma 4 by panwudi · Pull Request #1185 · jundot/omlx

panwudi · 2026-05-11T18:21:38Z

Summary

This PR brings Gemma 4 (26B-A4B MoE / dense 31B) up on DFlashEngine via bstnxbt HEAD, and lands the internal weight-sharing follow-up that was originally deferred — see "Path A" below. Five threads of work:

Gemma 4 DFlash backend — bstnxbt HEAD upgrade, event-API adaptation, SummaryEvent summary-access fix. Allows gemma4 model_type through the DFlash loader. Admin benchmark batch path now runs against DFlashEngine instead of silently skipping.
admin/benchmark metric symmetry — Continuous Batching speedup ratios were comparing batch.tg_tps (wall-aggregate) against baseline.gen_tps (gen-only), producing fake sub-1x ratios when prefill was non-negligible. Both paths now share a wall_tg_tps metric. The omlx.ai community-board submission keeps gen_tps (peak decode rate) — that's the standard community metric and is annotated in code.
dflash_max_concurrent admission cap — a soft cap on simultaneously in-flight DFlash requests (default 4, Optional[int]), enforced via asyncio.Semaphore. Excess requests block at the gate until a slot opens.
Path A — DFlashEngine double-engine refactor (NEW) — what the original PR description said was "deferred to a follow-up PR" is now actually in this PR. DFlashEngine becomes a "smart container" that eagerly embeds a long-lived VLMBatchedEngine; the DFlash drafter attaches to the same target weights via a non-destructive wrapper (DFlashVLMTargetWrapper). Per-request routing on concurrency / KV pressure / max_ctx decides which path serves the request. The _in_fallback_mode one-way gate and _evict_dflash_and_start_fallback 71-line eviction helper are gone; both paths coexist for the engine's lifetime. Adds dflash_lazy_drafter opt-in for high-concurrency deployments that benefit from "framework present but drafter not loaded" (zero Metal contention).
Upstream merge (56 commits) + post-merge fixes — the branch was rebased onto jundot/main (paroquant integration, audio routes, etc.). One latent bug surfaced post-merge: _load_drafter_bundle was reading self._draft_quant_bits, an attribute the merge had replaced with upstream's 4-field quant config. Caught + fixed in d80127b — _load_drafter_bundle now uses _build_quant_spec(weight, activation, group_size) from the 4 dataclass fields; pytest 42/42 pass.
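
The dflash_max_concurrent admission cap in the list above can be pictured with a small asyncio sketch. This is an illustration of the semaphore-gate idea only, not the code in omlx/engine/dflash.py: AdmissionGate, fake_decode and demo are hypothetical names, and treating None as "no cap" is an assumption drawn from the Optional[int] type.

```python
import asyncio
from typing import Optional


class AdmissionGate:
    """Soft cap on in-flight requests (hypothetical helper, not the omlx class)."""

    def __init__(self, max_concurrent: Optional[int] = 4) -> None:
        # Assumption: None (or 0) means "no cap".
        self._sem = asyncio.Semaphore(max_concurrent) if max_concurrent else None
        self.in_flight = 0
        self.peak = 0

    async def run(self, coro_fn, *args):
        if self._sem is None:
            return await self._tracked(coro_fn, *args)
        async with self._sem:  # excess callers wait here until a slot opens
            return await self._tracked(coro_fn, *args)

    async def _tracked(self, coro_fn, *args):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            return await coro_fn(*args)
        finally:
            self.in_flight -= 1


async def fake_decode(i: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for one DFlash generation
    return i


async def demo(n_requests: int = 10, cap: Optional[int] = 4):
    gate = AdmissionGate(cap)
    results = await asyncio.gather(
        *(gate.run(fake_decode, i) for i in range(n_requests))
    )
    return results, gate.peak
```

Running asyncio.run(demo()) completes all ten requests while the recorded peak never exceeds the cap, which is the behavior "excess requests block at the gate" describes.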

Path A — what it solves and how

Problem. Pre-Path-A, dflash_max_ctx triggers eager eviction of the dflash bundle, then lazy-loads VLMBatchedEngine/BatchedEngine from scratch. The target weights are re-loaded in memory — Gemma 4 26B-A4B Q4 is ~15 GB, so context fallback doubles the resident footprint to ~30 GB. The fallback flag (_in_fallback_mode) is also one-way: once flipped, subsequent short-context requests stay on the fallback engine forever.

Approach. Refactor DFlashEngine so it owns both decode paths simultaneously:

async def start(self):
    self._embedded_vlm = VLMBatchedEngine(...)   # canonical target owner
    await self._embedded_vlm.start()
    wrapped = DFlashVLMTargetWrapper(self._embedded_vlm._vlm_model)  # non-destructive proxy
    self._dflash_bundle = await attach_dflash_to_loaded_target(
        target_model=wrapped,                   # SAME _vlm_model object
        draft_path=...,
        ...,
    )

Weight sharing was verified at the Python identity level in a spike: the expression engine._target_model._vlm is engine._embedded_vlm._vlm_model evaluates to True. Metal active memory peaks at ~15.6 GB (vs ~30 GB with two copies).
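
To make "non-destructive proxy" concrete, here is a minimal sketch of the pattern. It assumes the wrapper only holds a reference and translates attribute names; TargetProxy, FakeVLM and the rename map are hypothetical and do not reproduce DFlashVLMTargetWrapper.

```python
class TargetProxy:
    """Holds a reference to the loaded model; never copies or mutates it."""

    def __init__(self, vlm_model, renames=None):
        self._vlm = vlm_model
        self._renames = dict(renames or {})

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails. Refusing private
        # names avoids infinite recursion if _vlm/_renames are not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._vlm, self._renames.get(name, name))


class FakeVLM:
    """Stand-in for the already-loaded target model."""

    def __init__(self):
        self.weights = [0.0] * 4
        self.language_model = "lm"


model = FakeVLM()
# Hypothetical rename: the consumer expects `.model`, the VLM exposes `.language_model`.
proxy = TargetProxy(model, renames={"model": "language_model"})
```

Because the proxy stores the same object, proxy._vlm is model holds and proxy.weights is the very list owned by model, so nothing is loaded a second time.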

Routing. Per-request decision based on:

def _route(self, prompt_tokens):
    if self._active_count >= self._max_dflash_concurrent: return "bg"
    if self._kv_pressure() > self._kv_pressure_threshold:  return "bg"
    if self._max_dflash_ctx and len(prompt_tokens) >= self._max_dflash_ctx: return "bg"
    return "dflash"

No hysteresis: when concurrency drops, the next request goes back to dflash immediately. Verified by a burst-then-single test: a c=16 cold burst followed by a single request at t+0s ran at 153 tok/s, in the dflash band. An earlier test that reused the same prompt showed a drop to 119 tok/s at t+30s, but with unique prompts the curve flattens, confirming that the apparent "decay" was prefix-cache aging, not routing state.
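
The stateless nature of the decision can be checked with a standalone version of the routing function. This sketch mirrors the three checks in _route above, but RouteInputs and its field defaults are illustrative; in particular the 0.85 threshold is an assumed value, not one taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RouteInputs:
    """Snapshot of the signals the router reads (field names are illustrative)."""
    active_count: int = 0
    kv_pressure: float = 0.0
    max_dflash_concurrent: int = 4
    kv_pressure_threshold: float = 0.85  # assumed value, not from the PR
    max_dflash_ctx: Optional[int] = None


def route(state: RouteInputs, prompt_tokens: Sequence[int]) -> str:
    # Same order of checks as _route: concurrency, KV pressure, context length.
    if state.active_count >= state.max_dflash_concurrent:
        return "bg"
    if state.kv_pressure > state.kv_pressure_threshold:
        return "bg"
    if state.max_dflash_ctx and len(prompt_tokens) >= state.max_dflash_ctx:
        return "bg"
    return "dflash"


state = RouteInputs(max_dflash_ctx=8192)
prompt = list(range(100))

state.active_count = 16            # during a c=16 burst
burst_decision = route(state, prompt)
state.active_count = 0             # burst has drained
single_decision = route(state, prompt)
long_decision = route(state, list(range(8192)))  # at the context limit
```

The function keeps no memory of earlier decisions, so the request after the burst is routed purely on the current counters.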

Lazy drafter (dflash_lazy_drafter: bool = False). Defers the drafter+wrapper+factory call until the first dflash-routed request. For workloads that almost never trigger dflash (high concurrency + bg-only), this avoids the Metal contention from the drafter being co-resident even when idle. Trade-off: first dflash request pays ~3s cold-start. Concrete impact on c=4 with mc=0: 129 tok/s (eager, drafter loaded but unused) vs 165 tok/s (lazy, drafter not loaded) — ~28% throughput recovery.
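
A lazy path of this kind is commonly built as a lock-guarded, load-once helper. The sketch below shows that pattern under the assumption that loading is an awaitable call; LazyDrafter and fake_load are hypothetical and are not the body of _ensure_drafter_loaded.

```python
import asyncio


class LazyDrafter:
    """Load-once helper: the first caller pays the cold start, the rest reuse it."""

    def __init__(self, loader):
        self._loader = loader
        self._bundle = None
        self._lock = asyncio.Lock()
        self.load_count = 0

    async def ensure_loaded(self):
        if self._bundle is not None:      # fast path once loaded
            return self._bundle
        async with self._lock:            # serialize concurrent first requests
            if self._bundle is None:      # re-check after acquiring the lock
                self._bundle = await self._loader()
                self.load_count += 1
        return self._bundle


async def fake_load():
    await asyncio.sleep(0.01)  # stand-in for the ~3s drafter cold start
    return object()


async def demo():
    lazy = LazyDrafter(fake_load)
    bundles = await asyncio.gather(*(lazy.ensure_loaded() for _ in range(5)))
    return lazy.load_count, bundles
```

Five concurrent first requests trigger exactly one load and all receive the same bundle object; the re-check inside the lock is what prevents duplicate loads.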

Empirical motivation

VLM MTP on Gemma 4 26B-A4B Q4 (initial Gemma 4 bringup measurements; /v1/chat/completions, m5max):

prompt	bare tg	VLM MTP tg	speedup
PROSE short story	121 tok/s	125 tok/s	1.04x
CODE Fibonacci	123 tok/s	159 tok/s	1.29x
MATH arithmetic step-by-step	121 tok/s	174 tok/s	1.41x
JSON 3 capitals	123 tok/s	158 tok/s	1.29x
CCM 6K-token quote-parse (real production prompt)	107 tok/s	144 tok/s	1.34x (mean of 2 runs)

DFlash on Path A (gemma4-moe-26b-a4b Q4, m5max, structured JSON output prompt, max_tokens=300, temp=0.0, 4 threads × 3 rounds for concurrent rows):

concurrency	DFlash OFF (tok/s)	DFlash mc=4, Path A (tok/s)	speedup
c=1 (single)	98 tok/s	168 tok/s	+72%
c=2	130	190	+46%
c=4	195	195	parity
c=8	236	192	-19%
c=12	269	202	-25%
c=16	259	204	-21%

c=12 is roughly where BatchedEngine saturates on this hardware (Apple M5 Max, 100 GB Metal budget). Below saturation, dflash wins; above, BatchedEngine's continuous batching wins. The routing math (concurrency + KV pressure) keeps each request on the right path.

Caveats — when DFlash is NOT a win

DFlash (and VLM MTP) is prompt-distribution-sensitive. The drafter's accept rate collapses outside its training distribution; once accept_rate × tokens-per-round ≤ drafter forward cost, you go negative. On Gemma 4 26B-A4B Q4 with bstnxbt's drafter checkpoint:

prompt type	DFlash speedup
Structured outputs (math, code, JSON, tool-calling)	1.7 – 2.4x
Real production prompts (CCM quote-parse, ~6 K tokens)	2.2x
General prose / explanation, EN or ZH	0.58 – 0.78x — slower than bare BatchedEngine

Operators routing mixed workloads (e.g. an agent that alternates between tool-calling steps and natural-language summarization) should benchmark on their own prompt distribution. dflash_max_concurrent addresses memory + tail-latency under bursts; it does NOT paper over distribution sensitivity. A future PR could add an accept-rate-monitored auto-disable (drop to bare BatchedEngine if the moving-window accept rate falls below threshold), but that is out of scope.
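
The break-even intuition in this section can be made concrete with a toy cost model. This is the textbook speculative-decoding approximation, not a measurement of DFlash: it assumes each drafted token is accepted independently with the same probability, that verifying a block costs about one target forward pass, and that the drafter produces a whole block per round at a relative cost draft_cost. All three are simplifying assumptions.

```python
def expected_tokens_per_round(accept_rate: float, block_size: int) -> float:
    """Expected tokens emitted per verify pass under i.i.d. acceptance.

    The target always contributes one token (correction or bonus), so the
    result ranges from 1 (nothing accepted) to block_size + 1.
    """
    if accept_rate >= 1.0:
        return float(block_size + 1)
    return (1.0 - accept_rate ** (block_size + 1)) / (1.0 - accept_rate)


def speedup(accept_rate: float, block_size: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is relative to one target pass."""
    return expected_tokens_per_round(accept_rate, block_size) / (1.0 + draft_cost)


for rate in (0.0, 0.3, 0.6, 0.9):
    print(f"accept_rate={rate:.1f} -> speedup {speedup(rate, 8, 0.5):.2f}x")
```

In this model a drafter that is never accepted still costs its forward pass, so the speedup falls below 1x, which matches the qualitative shape of the prose row in the table above.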

Files & scope

file	role
`omlx/engine/dflash.py`	refactored: embedded VLM in `start()`, `_load_drafter_bundle` extracted, `_ensure_drafter_loaded` lazy path, new `_route` + `_kv_pressure` (corrected formula — `mgr.usage` upstream computes against the bounded free-block queue, which gives misleading 0.997 on empty caches; we use `allocated_blocks/max_blocks` instead), `_record_route` jsonl metric
`omlx/speculative/dflash_vlm_target_wrap.py`	NEW. mlx_vlm → mlx_lm Gemma 4 surface adapter (non-destructive proxy). Currently Gemma 4 only
`omlx/speculative/dflash_factory.py`	NEW. `attach_dflash_to_loaded_target` — partial bundle factory; loads only the drafter and binds, does NOT call `load_runtime_bundle` (which would re-load target)
`omlx/speculative/__init__.py`	NEW. Idempotent monkey-patch for `dflash_mlx.runtime.get_stop_token_ids` (HF GemmaTokenizer's `eos_token_ids` returns `int`, upstream `list(int)` throws TypeError); `detect_fallback_engine_type` helper
`omlx/metrics/dflash_routing.py`	NEW. jsonl routing decision writer with size guard + `DFLASH_METRIC_DISABLE` env
`omlx/engine_pool.py`	passes new ctor kwargs from `model_settings`; duck-type sniff (`hasattr(engine, "_dflash_bundle")`) replaces `type().__name__` string check
`omlx/model_settings.py`	`dflash_lazy_drafter` + `dflash_kv_pressure_threshold` fields (in addition to upstream's quant fields)
`tests/test_dflash_engine.py`	route tests rewritten to call `_route()` returning `"dflash"`/`"bg"`; 8 additional Gemma 4 compatibility tests from upstream merge auto-merged
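
The idempotent monkey-patch listed for `omlx/speculative/__init__.py` can be sketched against a stand-in module. The upstream function body shown here is an assumption reconstructed from the description (`list(int)` raising TypeError); `fake_runtime`, `_PATCH_FLAG` and the tokenizer classes are hypothetical, and the token ids are arbitrary.

```python
import types

# Stand-in for the upstream module; the real target is dflash_mlx.runtime.
fake_runtime = types.ModuleType("fake_runtime")


def _upstream_get_stop_token_ids(tokenizer):
    # Assumed upstream behavior: fails when eos_token_ids is a bare int.
    return list(tokenizer.eos_token_ids)


fake_runtime.get_stop_token_ids = _upstream_get_stop_token_ids

_PATCH_FLAG = "_omlx_patched"  # hypothetical marker attribute


def patch_stop_token_ids(module) -> bool:
    """Wrap get_stop_token_ids once; return False if already patched."""
    original = module.get_stop_token_ids
    if getattr(original, _PATCH_FLAG, False):
        return False

    def patched(tokenizer):
        ids = tokenizer.eos_token_ids
        if isinstance(ids, int):
            return [ids]          # normalize the int case
        return original(tokenizer)

    setattr(patched, _PATCH_FLAG, True)
    module.get_stop_token_ids = patched
    return True


class IntEosTokenizer:
    eos_token_ids = 1             # arbitrary id


class ListEosTokenizer:
    eos_token_ids = [1, 106]      # arbitrary ids
```

Marking the replacement function itself makes a second call a no-op, so importing the package repeatedly cannot stack wrappers.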

Scheduler intentionally untouched. omlx/scheduler.py has zero diff in this PR. Per-request routing inside DFlashEngine was preferred over _route_to_dflash in the scheduler because dflash_mlx.SpeculativeSession.open() does not currently accept a pre-prefilled cache — implementing the latter cleanly requires upstream dflash_mlx cooperation. Path A reaches "high concurrency degrades to BatchedEngine throughput" within the engine layer without touching scheduler dispatch.

Open questions / discussion welcome

DFlashVLMTargetWrapper was originally Gemma-4-only; now auto-routed (see force-push 18b4df6): _load_drafter_bundle probes dflash_mlx.engine.target_ops.resolve_target_ops on the embedded model first and applies the wrapper ONLY when upstream rejects. Qwen 3.x via mlx_vlm is accepted by QwenGdnTargetOps directly (no wrapper). Gemma 4 via mlx_vlm still needs the wrapper because Gemma4TargetOps reads mlx_lm-specific attribute names. When bstnxbt/dflash-mlx upstream generalizes Gemma4TargetOps to match QwenGdnTargetOps's VLM-aware pattern, the wrapper module can be deleted. Happy to send that 1-file PR to bstnxbt separately.
The monkey-patch on dflash_mlx.runtime.get_stop_token_ids is a temporary fix. Happy to send a separate 1-line PR to bstnxbt/dflash-mlx upstream so omlx can drop the patch.
PagedCacheManager.usage semantic: we found that the formula 1.0 - free_block_queue.num_free_blocks / (max_blocks - 1) returns near-1.0 even on near-empty caches because num_free_blocks is the bounded free queue size (~256 cap), not the true unallocated block count. Our Path A _kv_pressure() computes allocated_blocks / max_blocks instead. Should usage itself be fixed upstream?
Path A vs scheduler-level routing (à la _route_to_vlm_mtp): we kept these as separate strategies — would the project rather see them unified?
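
For the PagedCacheManager.usage question above, the difference between the two formulas is easy to see with numbers. The sketch uses the two expressions quoted in the question; the pool size is a hypothetical value chosen only so that a 256-entry free queue lands near the ~0.997 reading reported.

```python
def usage_from_free_queue(num_free_in_queue: int, max_blocks: int) -> float:
    """Formula quoted above: 1.0 - num_free_blocks / (max_blocks - 1)."""
    return 1.0 - num_free_in_queue / (max_blocks - 1)


def kv_pressure(allocated_blocks: int, max_blocks: int) -> float:
    """Path A routing signal: allocated_blocks / max_blocks."""
    return allocated_blocks / max_blocks


MAX_BLOCKS = 85_334   # hypothetical pool size
QUEUE_CAP = 256       # bounded free-queue size mentioned above

# Empty cache: nothing allocated, but the free queue only ever holds 256 entries.
empty_usage = usage_from_free_queue(QUEUE_CAP, MAX_BLOCKS)
empty_pressure = kv_pressure(0, MAX_BLOCKS)
half_pressure = kv_pressure(MAX_BLOCKS // 2, MAX_BLOCKS)

print(f"usage on empty cache:       {empty_usage:.3f}")
print(f"kv_pressure on empty cache: {empty_pressure:.3f}")
print(f"kv_pressure at half full:   {half_pressure:.3f}")
```

With a bounded queue the first formula cannot distinguish an empty cache from a nearly full one, while the allocated/max ratio tracks occupancy directly.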

Caveats on this PR's scope

DFlashVLMTargetWrapper actively covers Gemma 4 only (the wrapper is auto-bypassed for other families whose upstream ops are already VLM-aware, e.g. Qwen). Adding more families that ALSO need the mlx_vlm→mlx_lm bridge is mechanical (the 6 attribute-rename drift points are documented in the wrapper's module docstring); auto-probe will route them through automatically without changes here.
Tests grew the unit-test surface (route, build_quant_spec, gemma4 compatibility variants) but do not cover the new factory body — its full execution requires a real drafter checkpoint. Validated end-to-end on m5max via spike6 against gemma4-moe-26b-a4b + z-lab/gemma-4-26B-A4B-it-DFlash.
Burst-load soak (100 concurrent against mc=4) deferred.
Fork can stay as-is if the architectural direction here doesn't match the project's preferred shape — happy to maintain in panwudi/omlx long-term.
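
The auto-probe mentioned in the first caveat follows a probe-then-fallback shape, sketched below. How resolve_target_ops actually signals rejection is not stated in this PR description; the sketch assumes it raises, and every name here (UnsupportedTarget, resolve_ops, attach_target, the fake model classes) is hypothetical.

```python
class UnsupportedTarget(Exception):
    """Assumed rejection signal from the upstream ops resolver."""


def resolve_ops(model):
    # Stand-in resolver: accepts only models exposing an `args` attribute.
    if not hasattr(model, "args"):
        raise UnsupportedTarget(type(model).__name__)
    return "ops"


class NativeModel:
    args = {"layer_types": []}   # accepted as-is


class VlmOnlyModel:
    config = {"layer_types": []}  # same data under a different attribute name


class Wrapper:
    """Bridges the naming drift without touching the wrapped model."""

    def __init__(self, inner):
        self.inner = inner
        self.args = inner.config


def attach_target(model, resolve=resolve_ops, wrap=Wrapper):
    """Return (target, wrapped_flag); wrap only when the probe rejects."""
    try:
        resolve(model)
        return model, False
    except UnsupportedTarget:
        wrapped = wrap(model)
        resolve(wrapped)          # a genuine incompatibility still surfaces here
        return wrapped, True
```

A family that the resolver already accepts passes through untouched, so no per-family branch is needed at the call site.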

Test plan

Closes / Fixes

Closes #1102 (add Gemma4 DFlash)
Fixes #1084 (DFlash for Gemma 4 31B fails with rms_norm(): incompatible function arguments). The DFlashVLMTargetWrapper introduced here bridges the mlx_vlm→mlx_lm shape mismatch that triggers the tuple-into-rms_norm crash.

… Qwen & Gemma 4)

This PR consolidates four logically related changes against the DFlashEngine code path, all needed to make DFlash work end-to-end with Gemma 4 on mlx_vlm and to address the "internal weight-sharing follow-up" originally listed as deferred in the PR description.

1. Gemma 4 DFlash backend bringup
- bstnxbt dflash-mlx HEAD upgrade; runtime package restructure adaptation (runtime.context, runtime.loading, cache.manager.*)
- LoadedTargetBundle dataclass dispatch
- TokenEvent / SummaryEvent isinstance dispatch (replaces dict.get)
- is_dflash_compatible broadened to top-level model_type in {"gemma4", "gemma4_text"}; -assistant variants stay rejected
- admin/benchmark continuous-batching metric symmetry fix (was comparing wall_tg_tps to gen_tps; now both wall_tg_tps, 1x baseline reads honest)
- dflash_max_concurrent admission cap (default 4, semaphore-bound)

2. Path A: DFlashEngine double-engine refactor
- DFlashEngine eagerly stands up an embedded VLMBatchedEngine / BatchedEngine in start(); the DFlash drafter then attaches to the SAME loaded target weights (verified via Python identity check: engine._target_model._vlm is engine._embedded_vlm._vlm_model).
- Removes the one-way _in_fallback_mode gate and the 71-line _evict_dflash_and_start_fallback method. Both decode paths coexist for the engine's lifetime.
- Per-request routing in _route(): concurrency cap → BG, KV pressure (with corrected formula, see below) → BG, max_ctx hard limit → BG, else DFlash. No hysteresis: when concurrency drops the next request returns to DFlash immediately (verified via burst-then-single test).
- dflash_lazy_drafter (opt-in, default False): defers drafter + wrapper + factory until first DFlash-routed request. For high-concurrency deployments where the drafter would otherwise sit in Metal memory idle (~28% throughput hit on co-resident).

3. Auto-routing wrapper (no family hardcoding)
- _load_drafter_bundle probes dflash_mlx.engine.target_ops.resolve_target_ops on the embedded model first. Apply DFlashVLMTargetWrapper only when upstream ops reject. This makes Path A apply uniformly to whatever families dflash_mlx supports today:
  * Qwen 3.x via mlx_vlm: QwenGdnTargetOps walks language_model + uses structural hasattr checks, so the embedded model is accepted natively. No wrapper applied.
  * Gemma 4 via mlx_vlm: Gemma4TargetOps reads text_wrapper.args.layer_types and inner._get_per_layer_inputs (mlx_lm-only naming) so upstream rejects → wrapper applied.
- When bstnxbt/dflash-mlx upstream eventually generalizes Gemma4TargetOps to match QwenGdnTargetOps's pattern, the wrapper path goes idle and the wrapper module can be deleted.

4. Post-merge follow-ups (single, atomic fix here)
- Merged jundot/main (56 commits: paroquant integration, audio routes, version bump, scheduler chunked-KV Llama-4 fix, etc.).
- _load_drafter_bundle was reading self._draft_quant_bits, an attribute the merge removed (replaced with the 4-field upstream quant config: enabled / weight_bits / activation_bits / group_size). _load_drafter_bundle now uses _build_quant_spec.
- omlx PagedCacheManager.usage property: kept the upstream as-is but our _kv_pressure routing signal switched to allocated_blocks / max_blocks. The upstream formula uses free_block_queue.num_free_blocks, which is a bounded queue size (~256 cap), not the true unallocated block count, and gives a misleading ~0.997 on near-empty caches. The routing signal would have been useless without this correction.
- Upstream dflash_mlx.runtime.get_stop_token_ids monkey-patched at import time: HF GemmaTokenizer's eos_token_ids returns int, the upstream list(int) raises TypeError. Idempotent, applied once in omlx/speculative/__init__.py. Happy to send a 1-line PR to bstnxbt/dflash-mlx separately so omlx can drop the monkey-patch.

Empirical motivation (gemma4-moe-26b-a4b on m5max, M5 Max 128 GB, real production-style structured JSON prompt, max_tokens=300, temp=0.0):

c=1 single: OFF 98 / Path A mc=4 168 tok/s (+72%)
c=2: OFF 130 / Path A mc=4 190 (+46%)
c=4: OFF 195 / Path A mc=4 195 (parity)
c=8: OFF 236 / Path A mc=4 192 (-19%)
c=12: OFF 269 / Path A mc=4 202 (-25%)
c=16: OFF 259 / Path A mc=4 204 (-21%)

c=12 is roughly where BatchedEngine saturates this hardware. Below saturation, DFlash wins; above, BatchedEngine's continuous batching wins. Routing (concurrency + KV pressure) keeps each request on the right path.

Caveats: when DFlash is NOT a win (gemma4-moe-26b-a4b Q4):

prompt type | DFlash speedup
Structured outputs (math/code/JSON/tools) | 1.7 – 2.4x
Real CCM logistics quote-parse (6 K tok) | 2.2x
General prose / explanation (EN or ZH) | 0.58 – 0.78x (slower)

DFlash is prompt-distribution-sensitive: the drafter's accept rate collapses outside the families it was trained on. Operators routing mixed workloads should benchmark on their own prompt distribution.

Test plan
- [x] Server starts cleanly on gemma4-moe-26b-a4b with dflash_enabled
- [x] ModelSettings.dflash_max_concurrent defaults to 4; round-trips
- [x] Admin GET /api/models exposes dflash_max_concurrent; PUT /api/models/{id}/settings accepts + persists
- [x] DFlash engine load log prints max_concurrent=4
- [x] Continuous-batching metric symmetry post-fix
- [x] Path A weight sharing verified at Python id level (engine._target_model._vlm is engine._embedded_vlm._vlm_model)
- [x] Path A end-to-end gemma4-moe-26b-a4b + real DFlash drafter on m5max: factory loaded, Gemma4TargetOps applied via wrapper, real /v1/chat/completions 200 OK
- [x] Path A auto-routing: Qwen via mlx_vlm passes through QwenGdnTargetOps directly (no wrapper), Gemma 4 via mlx_vlm falls through to wrapper
- [x] Path A routing recovery from c=16 burst → single returns to DFlash band at t+0 (no hysteresis)
- [x] pytest tests/test_dflash_engine.py 42/42 pass
- [x] i18n parity (en / zh / zh-TW) for dflash_max_concurrent keys
- [ ] Burst-load soak (100 concurrent vs mc=4); out of scope for this PR
- [ ] More VLM families beyond Gemma 4 + Qwen; deferred. The auto-probe mechanism in _load_drafter_bundle keeps the door open without code changes when upstream dflash_mlx adds new families.

Files (modified vs upstream/main)
omlx/engine/dflash.py: refactored
omlx/engine_pool.py: pass new kwargs
omlx/model_settings.py: new fields
omlx/admin/i18n/{en,zh,zh-TW}.json: new i18n keys
omlx/admin/templates/dashboard/_modal_model_settings.html: new admin UI
omlx/admin/benchmark.py: metric symmetry
pyproject.toml: dflash-mlx pin update
tests/test_dflash_engine.py: route tests
omlx/speculative/__init__.py: NEW
omlx/speculative/dflash_vlm_target_wrap.py: NEW
omlx/speculative/dflash_factory.py: NEW
omlx/metrics/__init__.py: NEW
omlx/metrics/dflash_routing.py: NEW

Scheduler intentionally untouched. Per-request routing happens inside DFlashEngine, not at scheduler dispatch. dflash_mlx SpeculativeSession.open() does not currently accept an externally prefilled cache, so scheduler-level routing à la _route_to_vlm_mtp isn't a viable path until upstream cooperates.

Generated by 飞驼助手

Nine engine-engineering docs lived in the separate dev-llm research repo (dflash Path A spec, Gemma 4 spec-dec design, fork GUI bundle hack, the PR jundot#1185 body draft, the upstream issue draft, m5max oMLX/mlx setup notes, the STT diarize client API, the STT roadmap). They belong with the engine, so they are consolidated here to version with the code as the dev-llm repo is wound down.

panwudi force-pushed the feat/gemma4-dflash branch from f9f3678 to 7d037e9 on May 11, 2026 18:42

panwudi changed the title ~~feat(dflash): Gemma 4 backend + benchmark metric fix + admission cap~~ feat(dflash): Gemma 4 backend + Path A double-engine + concurrency-aware routing May 12, 2026

panwudi force-pushed the feat/gemma4-dflash branch from d80127b to 18b4df6 on May 12, 2026 20:27

panwudi changed the title ~~feat(dflash): Gemma 4 backend + Path A double-engine + concurrency-aware routing~~ feat(dflash): Gemma 4 backend + Path A double-engine — generalized for Qwen & Gemma 4 May 12, 2026

panwudi changed the title ~~feat(dflash): Gemma 4 backend + Path A double-engine — generalized for Qwen & Gemma 4~~ feat(dflash): Path A double-engine + Gemma 4 backend — generalized for Qwen + Gemma 4 May 12, 2026

This was referenced May 12, 2026:

add Gemma4 DFlash #1102 (Open)

DFlash for Gemma 4 31B fails with rms_norm(): incompatible function arguments #1084 (Open)

panwudi wants to merge 1 commit into jundot:main from panwudi:feat/gemma4-dflash
