feat(dflash): Path A double-engine + Gemma 4 backend — generalized for Qwen + Gemma 4 #1185
Open
panwudi wants to merge 1 commit into
f9f3678 to 7d037e9
… Qwen & Gemma 4)
This PR consolidates four logically-related changes against the
DFlashEngine code path, all needed to make DFlash work end-to-end with
Gemma 4 on mlx_vlm and to address the "internal weight-sharing
follow-up" originally listed as deferred in the PR description.
1. Gemma 4 DFlash backend bringup
- bstnxbt dflash-mlx HEAD upgrade; runtime package restructure
adaptation (runtime.context, runtime.loading, cache.manager.*)
- LoadedTargetBundle dataclass dispatch
- TokenEvent / SummaryEvent isinstance dispatch (replaces dict.get)
- is_dflash_compatible broadened to top-level model_type in
{"gemma4", "gemma4_text"}; -assistant variants stay rejected
- admin/benchmark continuous-batching metric symmetry fix
(was comparing wall_tg_tps to gen_tps; both sides now use
wall_tg_tps, so the 1x baseline is an honest comparison)
- dflash_max_concurrent admission cap (default 4, semaphore-bound)
2. Path A — DFlashEngine double-engine refactor
- DFlashEngine eagerly stands up an embedded VLMBatchedEngine /
BatchedEngine in start(); the DFlash drafter then attaches to the
SAME loaded target weights (verified via Python identity check:
engine._target_model._vlm is engine._embedded_vlm._vlm_model).
- Removes the one-way _in_fallback_mode gate and the 71-line
_evict_dflash_and_start_fallback method. Both decode paths
coexist for the engine's lifetime.
- Per-request routing in _route(): concurrency cap → BG, KV pressure
(with corrected formula — see below) → BG, max_ctx hard limit →
BG, else DFlash. No hysteresis: when concurrency drops the next
request returns to DFlash immediately (verified via burst-then-
single test).
- dflash_lazy_drafter (opt-in, default False): defers drafter +
wrapper + factory until the first DFlash-routed request. Intended
for high-concurrency deployments where the drafter would otherwise
sit idle in Metal memory (lazy loading recovers ~28% throughput
versus a co-resident idle drafter).
3. Auto-routing wrapper (no family hardcoding)
- _load_drafter_bundle probes
dflash_mlx.engine.target_ops.resolve_target_ops on the embedded
model first. Apply DFlashVLMTargetWrapper only when upstream ops
reject. This makes Path A apply uniformly to whatever families
dflash_mlx supports today:
* Qwen 3.x via mlx_vlm: QwenGdnTargetOps walks language_model
+ uses structural hasattr checks, so the embedded model is
accepted natively. No wrapper applied.
* Gemma 4 via mlx_vlm: Gemma4TargetOps reads
text_wrapper.args.layer_types and inner._get_per_layer_inputs
(mlx_lm-only naming) so upstream rejects → wrapper applied.
- When bstnxbt/dflash-mlx upstream eventually generalizes
Gemma4TargetOps to match QwenGdnTargetOps's pattern, the wrapper
path goes idle and the wrapper module can be deleted.
4. Post-merge follow-ups (single, atomic fix here)
- Merged jundot/main (56 commits — paroquant integration, audio
routes, version bump, scheduler chunked-KV Llama-4 fix, etc.).
- _load_drafter_bundle was reading self._draft_quant_bits, an
attribute the merge removed (replaced with the 4-field upstream
quant config: enabled / weight_bits / activation_bits /
group_size). _load_drafter_bundle now uses _build_quant_spec.
- omlx PagedCacheManager.usage property: left as upstream has it,
but our _kv_pressure routing signal now uses
allocated_blocks / max_blocks. The upstream formula uses
free_block_queue.num_free_blocks, which is a bounded queue size
(~256 cap), not the true unallocated block count, so it reports a
misleading ~0.997 on near-empty caches. The routing signal would
have been useless without this correction.
- Upstream dflash_mlx.runtime.get_stop_token_ids monkey-patched at
import time: HF GemmaTokenizer's eos_token_ids returns int, the
upstream list(int) raises TypeError. Idempotent, applied once in
omlx/speculative/__init__.py. Happy to send a 1-line PR to
bstnxbt/dflash-mlx separately so omlx can drop the monkey-patch.
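A minimal sketch of the idempotent patch idea from the last bullet, for readers who want to see the shape of the fix. Everything here is illustrative: fake_runtime stands in for dflash_mlx.runtime, the single-argument (tokenizer) signature of get_stop_token_ids is an assumption, and patch_once / make_safe are hypothetical helper names, not the code in omlx/speculative/__init__.py.

```python
import types


def _upstream_get_stop_token_ids(tokenizer):
    # Stand-in for the upstream behaviour: list(int) raises TypeError.
    return list(tokenizer.eos_token_ids)


# Stand-in module object (assumption: the real target is dflash_mlx.runtime).
fake_runtime = types.SimpleNamespace(get_stop_token_ids=_upstream_get_stop_token_ids)


def make_safe(original):
    """Wrap the original so a bare int eos id is returned as a one-item list."""
    def safe_get_stop_token_ids(tokenizer):
        ids = getattr(tokenizer, "eos_token_ids", None)
        if isinstance(ids, int):
            return [ids]
        return original(tokenizer)
    return safe_get_stop_token_ids


def patch_once(module, name, wrapper_factory):
    """Replace module.<name> with a wrapped version, at most once."""
    current = getattr(module, name)
    if getattr(current, "_omlx_patched", False):
        return current  # already patched: a second call is a no-op
    wrapped = wrapper_factory(current)
    wrapped._omlx_patched = True  # marker that makes the patch idempotent
    setattr(module, name, wrapped)
    return wrapped


patch_once(fake_runtime, "get_stop_token_ids", make_safe)
patch_once(fake_runtime, "get_stop_token_ids", make_safe)  # no double wrapping
```

The marker attribute is one simple way to get idempotence; checking the int case before delegating keeps the upstream behaviour untouched for tokenizers that already return a list.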
Empirical motivation (gemma4-moe-26b-a4b on m5max, Apple M5 Max 128 GB,
real production-style structured JSON prompt, max_tokens=300, temp=0.0):
concurrency  OFF (tok/s)  Path A mc=4 (tok/s)  delta
c=1 single   98           168                  +72%
c=2          130          190                  +46%
c=4          195          195                  parity
c=8          236          192                  -19%
c=12         269          202                  -25%
c=16         259          204                  -21%
c=12 is roughly where BatchedEngine saturates this hardware. Below
saturation, DFlash wins; above, BatchedEngine's continuous batching
wins. Routing (concurrency + KV pressure) keeps each request on the
right path.
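The routing order described above (concurrency cap, then KV pressure, then the max_ctx hard limit, else DFlash) can be sketched roughly as follows. This is not the PR's _route() implementation: RouteInputs, the 0.8 pressure threshold, the 8192-token limit, the >= comparison on the cap, and the 85,334-block cache size are all assumptions chosen for illustration. Only the "dflash" / "bg" return values, the allocated_blocks / max_blocks signal, and the queue-based formula come from the PR text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouteInputs:
    in_flight: int         # requests currently being served
    allocated_blocks: int  # KV blocks in use
    max_blocks: int        # total KV blocks in the paged cache
    prompt_tokens: int     # context length of the incoming request


def queue_based_usage(num_free_in_queue: int, max_blocks: int) -> float:
    """Upstream-style usage as reported in the PR: based on the bounded free queue."""
    return 1.0 - num_free_in_queue / (max_blocks - 1)


def kv_pressure(allocated_blocks: int, max_blocks: int) -> float:
    """Corrected signal: fraction of KV blocks actually allocated."""
    if max_blocks <= 0:
        return 1.0  # treat an unusable cache as fully pressured
    return allocated_blocks / max_blocks


def route(inp: RouteInputs,
          max_concurrent: Optional[int] = 4,
          kv_threshold: float = 0.8,   # hypothetical threshold
          max_ctx: int = 8192) -> str:  # hypothetical limit
    """Return "bg" (BatchedEngine) or "dflash" for a single request."""
    if max_concurrent is not None and inp.in_flight >= max_concurrent:
        return "bg"
    if kv_pressure(inp.allocated_blocks, inp.max_blocks) >= kv_threshold:
        return "bg"
    if inp.prompt_tokens > max_ctx:
        return "bg"
    return "dflash"


# Near-empty cache, free queue capped at 256 entries, 85,334 blocks (assumed):
misleading = queue_based_usage(256, 85_334)  # about 0.997
honest = kv_pressure(0, 85_334)              # 0.0
```

Because route() keeps no state between calls, a request arriving after a burst is routed purely on current load, which matches the "no hysteresis" behaviour described in the commit message.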
Caveats — when DFlash is NOT a win (gemma4-moe-26b-a4b Q4):
prompt type                                | DFlash speedup
Structured outputs (math/code/JSON/tools)  | 1.7 – 2.4x
Real CCM logistics quote-parse (6 K tok)   | 2.2x
General prose / explanation (EN or ZH)     | 0.58 – 0.78x (slower)
DFlash is prompt-distribution-sensitive: the drafter's accept rate
collapses outside the families it was trained on. Operators routing
mixed workloads should benchmark on their own prompt distribution.
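To make the accept-rate sensitivity concrete, here is a toy estimate using the textbook independent-acceptance approximation for speculative decoding. It is not a model of DFlash itself: it assumes each drafted token is accepted independently with probability accept_rate, that verifying a whole block costs about one target forward pass, and that draft_cost (the drafter's cost per round, in units of one target forward) is known. The function names and the example numbers are illustrative only.

```python
def expected_tokens_per_round(accept_rate: float, block_size: int) -> float:
    """Expected tokens emitted per verify round.

    Sum of accept_rate**i for i in 0..block_size: the accepted draft prefix
    plus the one token the target always contributes.
    """
    if accept_rate >= 1.0:
        return block_size + 1.0
    return (1.0 - accept_rate ** (block_size + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, block_size: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in target-forward units."""
    return expected_tokens_per_round(accept_rate, block_size) / (1.0 + draft_cost)


# Illustrative inputs, not measurements:
in_distribution = estimated_speedup(accept_rate=0.8, block_size=8, draft_cost=0.5)
out_of_distribution = estimated_speedup(accept_rate=0.1, block_size=8, draft_cost=0.5)
```

Under these assumptions the estimate drops below 1.0 as soon as the expected tokens per round fall under 1 + draft_cost, which is the qualitative pattern in the table: a gain on structured outputs, a loss on general prose.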
Test plan
- [x] Server starts cleanly on gemma4-moe-26b-a4b with dflash_enabled
- [x] ModelSettings.dflash_max_concurrent defaults to 4; round-trips
- [x] Admin GET /api/models exposes dflash_max_concurrent; PUT
/api/models/{id}/settings accepts + persists
- [x] DFlash engine load log prints max_concurrent=4
- [x] Continuous-batching metric symmetry post-fix
- [x] Path A weight sharing verified at Python id level
(engine._target_model._vlm is engine._embedded_vlm._vlm_model)
- [x] Path A end-to-end gemma4-moe-26b-a4b + real DFlash drafter on
m5max — factory loaded, Gemma4TargetOps applied via wrapper,
real /v1/chat/completions 200 OK
- [x] Path A auto-routing: Qwen via mlx_vlm passes through
QwenGdnTargetOps directly (no wrapper), Gemma 4 via mlx_vlm
falls through to wrapper
- [x] Path A routing recovery from c=16 burst → single returns to
DFlash band at t+0 (no hysteresis)
- [x] pytest tests/test_dflash_engine.py 42/42 pass
- [x] i18n parity (en / zh / zh-TW) for dflash_max_concurrent keys
- [ ] Burst-load soak (100 concurrent vs mc=4) — out of scope for
this PR
- [ ] More VLM families beyond Gemma 4 + Qwen — deferred; the
auto-probe mechanism in _load_drafter_bundle keeps the door
open without code changes when upstream dflash_mlx adds new
families.
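For the dflash_max_concurrent items in the test plan, a semaphore-bound admission cap can be sketched as below. This is a hedged illustration, not the engine's code: AdmissionGate, demo, and fake_request are hypothetical names, and the peak counter exists only to make the cap observable in a test.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class AdmissionGate:
    """Bound the number of simultaneously running requests."""

    def __init__(self, max_concurrent: Optional[int] = 4):
        # None means "no cap"; a positive int caps in-flight requests.
        self._sem = (asyncio.Semaphore(max_concurrent)
                     if max_concurrent is not None else None)
        self.in_flight = 0
        self.peak = 0

    async def run(self, request: Callable[[], Awaitable[str]]) -> str:
        if self._sem is None:
            return await self._tracked(request)
        async with self._sem:  # excess callers wait here for a free slot
            return await self._tracked(request)

    async def _tracked(self, request: Callable[[], Awaitable[str]]) -> str:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            return await request()
        finally:
            self.in_flight -= 1


async def demo(n_requests: int = 16, cap: int = 4):
    gate = AdmissionGate(cap)

    async def fake_request() -> str:
        await asyncio.sleep(0.01)  # stands in for a decode call
        return "ok"

    results = await asyncio.gather(*(gate.run(fake_request) for _ in range(n_requests)))
    return gate.peak, results
```

Running asyncio.run(demo()) with 16 requests and a cap of 4 should never show more than 4 requests in flight at once, which is the property a burst-load soak would stress at larger scale.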
Files (modified vs upstream/main)
omlx/engine/dflash.py                                      refactored
omlx/engine_pool.py                                        pass new kwargs
omlx/model_settings.py                                     new fields
omlx/admin/i18n/{en,zh,zh-TW}.json                         new i18n keys
omlx/admin/templates/dashboard/_modal_model_settings.html  new admin UI
omlx/admin/benchmark.py                                    metric symmetry
pyproject.toml                                             dflash-mlx pin update
tests/test_dflash_engine.py                                route tests
omlx/speculative/__init__.py                               NEW
omlx/speculative/dflash_vlm_target_wrap.py                 NEW
omlx/speculative/dflash_factory.py                         NEW
omlx/metrics/__init__.py                                   NEW
omlx/metrics/dflash_routing.py                             NEW
Scheduler intentionally untouched. Per-request routing happens inside
DFlashEngine, not at scheduler dispatch. dflash_mlx
SpeculativeSession.open() does not currently accept an externally
prefilled cache, so scheduler-level routing à la _route_to_vlm_mtp
isn't viable until upstream cooperates.
Generated by 飞驼助手
d80127b to 18b4df6
This was referenced May 12, 2026
panwudi pushed a commit to panwudi/flyto-mlx that referenced this pull request on May 19, 2026:
Nine engine-engineering docs lived in the separate dev-llm research repo (dflash Path A spec, Gemma 4 spec-dec design, fork GUI bundle hack, the PR jundot#1185 body draft, the upstream issue draft, m5max oMLX/mlx setup notes, the STT diarize client API, the STT roadmap). They belong with the engine, so they are consolidated here to version with the code as the dev-llm repo is wound down.
Summary
This PR brings Gemma 4 (26B-A4B MoE / dense 31B) up on DFlashEngine via bstnxbt HEAD, and lands the internal weight-sharing follow-up that was originally deferred (see "Path A" below). Five threads of work:

1. Gemma 4 DFlash backend: bstnxbt HEAD upgrade, event-API adaptation, SummaryEvent summary-access fix. Allows gemma4 model_type through the DFlash loader. Admin benchmark batch path now runs against DFlashEngine instead of silently skipping.
2. admin/benchmark metric symmetry: Continuous Batching speedup ratios were comparing batch.tg_tps (wall-aggregate) against baseline.gen_tps (gen-only), producing fake sub-1x ratios when prefill was non-negligible. Both paths now share a wall_tg_tps metric. The omlx.ai community-board submission keeps gen_tps (peak decode rate); that is the standard community metric and is annotated in code.
3. dflash_max_concurrent admission cap: a soft cap on simultaneously in-flight DFlash requests (default 4, Optional[int]), enforced via asyncio.Semaphore. Excess requests block at the gate until a slot opens.
4. Path A, DFlashEngine double-engine refactor (NEW): what the original PR description said was "deferred to a follow-up PR" is now in this PR. DFlashEngine becomes a "smart container" that eagerly embeds a long-lived VLMBatchedEngine; the DFlash drafter attaches to the same target weights via a non-destructive wrapper (DFlashVLMTargetWrapper). Per-request routing on concurrency / KV pressure / max_ctx decides which path serves the request. The _in_fallback_mode one-way gate and the 71-line _evict_dflash_and_start_fallback eviction helper are gone; both paths coexist for the engine's lifetime. Adds a dflash_lazy_drafter opt-in for high-concurrency deployments that benefit from "framework present but drafter not loaded" (zero Metal contention).
5. Upstream merge (56 commits) + post-merge fixes: the branch was rebased onto jundot/main (paroquant integration, audio routes, etc.). One latent bug surfaced post-merge: _load_drafter_bundle was reading self._draft_quant_bits, an attribute the merge had replaced with upstream's 4-field quant config. Caught + fixed in d80127b: _load_drafter_bundle now uses _build_quant_spec(weight, activation, group_size) from the 4 dataclass fields; pytest 42/42 pass.

Path A: what it solves and how
Problem. Pre-Path-A, exceeding dflash_max_ctx triggers eager eviction of the dflash bundle, then lazy-loads VLMBatchedEngine / BatchedEngine from scratch. The target weights are re-loaded in memory: Gemma 4 26B-A4B Q4 is ~15 GB, so context fallback doubles the resident footprint to ~30 GB. The fallback flag (_in_fallback_mode) is also one-way: once flipped, subsequent short-context requests stay on the fallback engine forever.

Approach. Refactor DFlashEngine so it owns both decode paths simultaneously: the embedded VLMBatchedEngine / BatchedEngine and the DFlash drafter, both attached to the same loaded target weights.

Weight sharing verified at Python id level in spike: engine._target_model._vlm is engine._embedded_vlm._vlm_model evaluates to True. Metal active memory peaks at ~15.6 GB (vs ~30 GB if we had two copies).

Routing. Per-request decision based on concurrency, KV pressure, and max_ctx.

No hysteresis: when concurrency drops, the next request goes back to dflash immediately. Verified by burst-then-single test (c=16 cold burst then single request at t+0s: 153 tok/s, in the dflash band; the prior same-prompt test showed a drop to 119 tok/s at t+30s, but with unique prompts the curve flattens, confirming the apparent "decay" was prefix cache aging, not routing state).

Lazy drafter (dflash_lazy_drafter: bool = False). Defers the drafter+wrapper+factory call until the first dflash-routed request. For workloads that almost never trigger dflash (high concurrency + bg-only), this avoids the Metal contention from the drafter being co-resident even when idle. Trade-off: the first dflash request pays ~3s cold-start. Concrete impact on c=4 with mc=0: 129 tok/s (eager, drafter loaded but unused) vs 165 tok/s (lazy, drafter not loaded), a ~28% throughput recovery.

Empirical motivation
VLM MTP on Gemma 4 26B-A4B Q4 (initial Gemma 4 bringup measurements; /v1/chat/completions, m5max):

DFlash on Path A (gemma4-moe-26b-a4b Q4, m5max, structured JSON output prompt, max_tokens=300, temp=0.0, 4 threads × 3 rounds for concurrent rows):

c=12 is roughly where BatchedEngine saturates on this hardware (Apple M5 Max, 100 GB Metal budget). Below saturation, dflash wins; above, BatchedEngine's continuous batching wins. The routing math (concurrency + KV pressure) keeps each request on the right path.

Caveats: when DFlash is NOT a win
DFlash (and VLM MTP) is prompt-distribution-sensitive. The drafter's accept rate collapses outside its training distribution; once accept_rate × tokens-per-round ≤ drafter forward cost, speculation becomes a net slowdown. On Gemma 4 26B-A4B Q4 with bstnxbt's drafter checkpoint:

Operators routing mixed workloads (e.g. an agent that alternates between tool-calling steps and natural-language summarization) should benchmark on their own prompt distribution. dflash_max_concurrent addresses memory + tail-latency under bursts; it does NOT paper over distribution sensitivity. A future PR could add an accept-rate-monitored auto-disable (drop to bare BatchedEngine if the moving-window accept rate falls below threshold), but that is out of scope.

Files & scope
omlx/engine/dflash.py | start(), _load_drafter_bundle extracted, _ensure_drafter_loaded lazy path, new _route + _kv_pressure (corrected formula: mgr.usage upstream computes against the bounded free-block queue, which gives a misleading 0.997 on empty caches; we use allocated_blocks / max_blocks instead), _record_route jsonl metric
omlx/speculative/dflash_vlm_target_wrap.py
omlx/speculative/dflash_factory.py | attach_dflash_to_loaded_target: partial bundle factory; loads only the drafter and binds, does NOT call load_runtime_bundle (which would re-load target)
omlx/speculative/__init__.py | monkey-patch of dflash_mlx.runtime.get_stop_token_ids (HF GemmaTokenizer's eos_token_ids returns int, upstream list(int) throws TypeError); detect_fallback_engine_type helper
omlx/metrics/dflash_routing.py | … DFLASH_METRIC_DISABLE env
omlx/engine_pool.py | … model_settings; duck-type sniff (hasattr(engine, "_dflash_bundle")) replaces type().__name__ string check
omlx/model_settings.py | dflash_lazy_drafter + dflash_kv_pressure_threshold fields (in addition to upstream's quant fields)
tests/test_dflash_engine.py | … _route() returning "dflash" / "bg"; 8 additional Gemma 4 compatibility tests from upstream merge auto-merged

Scheduler intentionally untouched.
omlx/scheduler.py has zero diff in this PR. Per-request routing inside DFlashEngine was preferred over _route_to_dflash in the scheduler because dflash_mlx.SpeculativeSession.open() does not currently accept an externally prefilled cache; implementing the latter cleanly requires upstream dflash_mlx cooperation. Path A reaches "high concurrency degrades to BatchedEngine throughput" within the engine layer without touching scheduler dispatch.

Open questions / discussion welcome
- DFlashVLMTargetWrapper was originally Gemma-4-only; now auto-routed (see force-push 18b4df6): _load_drafter_bundle probes dflash_mlx.engine.target_ops.resolve_target_ops on the embedded model first and applies the wrapper ONLY when upstream rejects. Qwen 3.x via mlx_vlm is accepted by QwenGdnTargetOps directly (no wrapper). Gemma 4 via mlx_vlm still needs the wrapper because Gemma4TargetOps reads mlx_lm-specific attribute names. When bstnxbt/dflash-mlx upstream generalizes Gemma4TargetOps to match QwenGdnTargetOps's VLM-aware pattern, the wrapper module can be deleted. Happy to send that 1-file PR to bstnxbt separately.
- The monkey-patch of dflash_mlx.runtime.get_stop_token_ids is a temporary fix. Happy to send a separate 1-line PR to bstnxbt/dflash-mlx upstream so omlx can drop the patch.
- PagedCacheManager.usage semantic: we found that the formula 1.0 - free_block_queue.num_free_blocks / (max_blocks - 1) returns near-1.0 even on near-empty caches because num_free_blocks is the bounded free queue size (~256 cap), not the true unallocated block count. Our Path A _kv_pressure() computes allocated_blocks / max_blocks instead. Should usage itself be fixed upstream?
- … _route_to_vlm_mtp): we kept these as separate strategies; would the project rather see them unified?

Caveats on this PR's scope
- DFlashVLMTargetWrapper actively covers Gemma 4 only (the wrapper is auto-bypassed for other families whose upstream ops are already VLM-aware, e.g. Qwen). Adding more families that ALSO need the mlx_vlm→mlx_lm bridge is mechanical (the 6 attribute-rename drift points are documented in the wrapper's module docstring); auto-probe will route them through automatically without changes here.
- … spike6 against gemma4-moe-26b-a4b + z-lab/gemma-4-26B-A4B-it-DFlash.
- Burst-load soak (100 concurrent vs mc=4) deferred.
- … panwudi/omlx long-term.

Test plan
- Server starts cleanly on gemma4-moe-26b-a4b with dflash_enabled=True, dflash_max_concurrent=4
- ModelSettings.dflash_max_concurrent defaults to 4; round-trips through from_dict / to_dict
- Admin GET /api/models exposes dflash_max_concurrent; PUT /api/models/{id}/settings accepts and persists
- DFlash engine load log prints max_concurrent=4
- Path A weight sharing verified at Python id level (_target_model._vlm is _embedded_vlm._vlm_model)
- Path A end-to-end: gemma4-moe-26b-a4b + real DFlash drafter on m5max; factory loaded, Gemma4TargetOps resolved, real chat completion 200 OK
- pytest tests/test_dflash_engine.py 42/42 pass post-merge + post-quant-fix
- i18n parity for dflash_max_concurrent, dflash_max_concurrent_placeholder, dflash_max_concurrent_help
- Burst-load soak (100 concurrent vs mc=4): out of scope, would benefit from a dedicated stress-test harness

Closes / Fixes
… rms_norm() incompatible function arguments: the DFlashVLMTargetWrapper introduced here bridges the mlx_vlm→mlx_lm shape mismatch that triggers the tuple-into-rms_norm crash)