Releases: Hmbown/ZMLX
v0.10.0: Qwen3.5-35B-A3B support
What's New
Qwen3.5-35B-A3B MoE decode support — patch(model) now auto-detects Qwen3.5's hybrid DeltaNet+Attention architecture (256 experts, K=8) and applies fused MoE decode kernels.
Measured results (M4 Max 36GB, stock MLX)
| Model | Decode | Prefill | Fidelity |
|---|---|---|---|
| Qwen3.5-35B-A3B-4bit | ~+2% | ~+4% | token-identical |
| LFM2-8B-A1B-4bit | +12.8% | neutral | token-identical |
| LFM2-24B-A2B-4bit | +6.0% | neutral | token-identical |
Results will vary depending on hardware, thermal state, and prompt length.
Also in this release
- DeltaNet pattern: fuses `conv1d + silu` for Qwen3.5 GatedDeltaNet decode layers (30 of 40 layers)
- Expert index sorting (opt-in): `ZMLX_MOE_SORT_EXPERTS=1` for DRAM-locality experiments
- README cleanup: consolidated benchmark sections into a single results table
- 150+ benchmark repro capsules from GLM/Qwen3 isolation sweeps and consistency checks
- New tests: 23 DeltaNet kernel tests + 8 patch tests + fusion/integration tests
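The expert-sorting flag is a locality experiment: grouping tokens by expert index before the expert gather turns scattered weight reads into contiguous runs. A minimal NumPy sketch of the idea (illustrative only, not ZMLX's kernel code):

```python
import numpy as np

def sort_tokens_by_expert(expert_ids: np.ndarray) -> np.ndarray:
    """Return a token order grouped by expert id (stable sort),
    so all tokens routed to the same expert are processed together."""
    return np.argsort(expert_ids, kind="stable")

# 6 tokens routed to 4 experts; sorting yields contiguous expert runs
expert_ids = np.array([3, 0, 3, 1, 0, 1])
order = sort_tokens_by_expert(expert_ids)
print(expert_ids[order])  # [0 0 1 1 3 3]
```

Whether the extra sort pays for itself depends on batch size and expert count, which is why the flag is opt-in.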
Install / upgrade
`pip install --upgrade zmlx`

Full changelog
See CHANGELOG.md
v0.9.2
- Auto-enable promoted Qwen3.5/Qwen3-Next moe_mlp defaults (fused SwiGLU + router argpartition logits + topk)
- Update README with front-and-center Qwen3.5 automatic configuration
- Add reproducibility capsule for auto-defaults vs explicit env parity
v0.9.0 — LFM2-24B +7% decode, foundry & fusion
Highlights
- LFM2-24B-A2B-MLX-4bit: +6–7% decode speedup on stock MLX (M4 Max), 500/500 token-identical fidelity
- D-SIMD gate kernel: fuses softmax+bias+topK into 1 Metal dispatch for D=64 expert gating (2 SIMD groups, 64 threads)
- Smart K-based defaults: K≤2 (LFM2-8B) → fused SwiGLU +12%; K≥3 (LFM2-24B) → D-SIMD gate +7%. No env vars needed.
- Foundry module: kernel template evaluation and SFT dataset generation (16 ops, 9 kernel classes)
- Fusion module: JIT graph tracing and Metal codegen for fused op sequences
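The D-SIMD gate collapses three ops into one Metal dispatch; an unfused NumPy reference is handy for checking its output. This sketch assumes the softmax → bias → top-k ordering stated above and is illustrative only, not the kernel's actual internals:

```python
import numpy as np

def gate_reference(logits: np.ndarray, bias: np.ndarray, k: int):
    """Unfused reference of a softmax+bias+top-k expert gate (last axis).

    Assumed ordering: softmax over router logits, add a per-expert bias,
    then pick the top-k experts by biased score.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    scores = p + bias                                # per-expert router bias
    topk = np.argpartition(scores, -k, axis=-1)[..., -k:]
    weights = np.take_along_axis(scores, topk, axis=-1)
    return topk, weights

# D=64 experts, K=8, as in the fused gate above
rng = np.random.default_rng(0)
idx, w = gate_reference(rng.normal(size=64), np.zeros(64), 8)
```

Note `argpartition` returns the top-k unordered, which is fine for gating since each selected expert carries its own weight.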
Quick Start
```
pip install -U zmlx
```

```python
from mlx_lm import load, generate
from zmlx.patch import patch

model, tokenizer = load("LiquidAI/LFM2-24B-A2B-MLX-4bit")
patch(model)  # auto-detects architecture, applies safe fusions
print(generate(model, tokenizer, prompt="Hello!", max_tokens=200))
```

Stock MLX Benchmarks (M4 Max, 36GB)
| Model | Baseline | Patched | Speedup | Fidelity |
|---|---|---|---|---|
| LFM2-8B-A1B-4bit | 200.1 tok/s | 223.3 tok/s | +11.6% | 500/500 |
| LFM2-24B-A2B-4bit | 152.0 tok/s | 161.1 tok/s | +6.0% | 500/500 |
Full changelog: https://github.com/Hmbown/ZMLX/blob/main/CHANGELOG.md#090---2026-02-24
v0.8.4
Highlights
- MoE (Qwen, experimental): new env-gated routing kernel path via `router_argpartition_logits_topk()`. Enable with:
  - `ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS=1`
  - `ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS_TOPK=1` (requires the above)
  - Off by default; intended for controlled benchmarks.
- Bench evidence: v8 promoted stack and v9 reproduction/routertopk suites are committed under `benchmarks/repro_capsules/` with matching `benchmarks/matrix.jsonl` rows.
- Custom MLX context (optional): build `mx.gather_qmm_swiglu` locally via `integrations/mlx_local_integration/` (no changes to upstream MLX required until you opt in).
- exo usage: run exo with runtime patching via `zmlx-exo` (see `docs/EXO.md`).
Install / Use
- Install ZMLX: `pip install zmlx`
- Run with exo (in the same env where exo is installed): `zmlx-exo`
- Optional custom MLX primitive (for GLM/Qwen3 gains): `bash integrations/mlx_local_integration/setup_mlx_local.sh` (see `docs/EXPERIMENTAL_MLX.md`)
Repro Capsules (Source of Truth)
Promoted v8 (summary capsules):
- `benchmarks/repro_capsules/qwen3_a3b_combo_v8_fp32nofmaonly_t200_r2_summary.json`
- `benchmarks/repro_capsules/qwen3_a3b_combo_v8_fp32nofmaonly_t1024_r2_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t200_r2_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t1024_r2_summary.json`
v9 reproduction + routertopk (summary capsules):
- `benchmarks/repro_capsules/qwen3_combo_v9_repro_t200_r3_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_repro_t1024_r2_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v9_repro_t200_r3_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v9_repro_t1024_r2_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t200_r3_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t1024_r2_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t1024_r3_confirm_summary.json`
Notes
- The router top-k kernel path is experimental and not promoted as a default.
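One plausible reading of the `router_argpartition_logits_topk` name is: select the top-k experts with `argpartition` on raw router logits, then normalize only the selected entries. This NumPy sketch illustrates that reading; it is an assumption about the shape of the computation, not ZMLX's kernel:

```python
import numpy as np

def router_topk_reference(logits: np.ndarray, k: int):
    """Hypothetical argpartition-on-logits routing (last axis):
    pick top-k raw logits, then softmax only the picked entries."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    picked = np.take_along_axis(logits, idx, axis=-1)
    z = picked - picked.max(axis=-1, keepdims=True)  # stable softmax over k values
    w = np.exp(z)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w
```

Partitioning before the softmax avoids normalizing over all experts, which is where such a path could save work on wide routers.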
v0.8.3
Fixed
- Qwen3 `moe_mlp` now disables fused `gather_qmm_swiglu` by default unless explicitly enabled with `ZMLX_QWEN_FUSED_SWIGLU=1`.
- Added regression tests for Qwen fused-SwiGLU env gating.
Changed
- `integrations/mlx_local_integration/setup_mlx_local.sh` now defaults to MLX `v0.30.6` (185b06d9...) for custom `gather_qmm_swiglu` bring-up.
- Updated lab notebook with reproducible bring-up notes and sequential GLM/Qwen validation results on custom MLX 0.30.6.
Validation
- `ruff check .`
- `pytest -q` (852 passed, 74 skipped, 3 xfailed)
v0.8.2
Full Changelog: v0.8.1...v0.8.2
v0.8.1
ZMLX 0.8.1
Patch release focused on MLX benchmark reruns and docs refresh.
Highlights
- Re-ran GLM-4.7-Flash stress benchmark on MLX 0.30.4.dev20260204+2f324cc and updated README references/capsules.
- Re-ran LFM2-8B-A1B stock-MLX benchmarks and updated README tables.
- Re-ran Qwen3-30B-A3B experiments and highlighted the best verified rerun in README (96.5 -> 104.3 tok/s, +8.1%, token-identical).
- Added new repro capsules under benchmarks/repro_capsules/ and synced benchmarks/matrix.jsonl.
Version
- `pyproject.toml` and `src/zmlx/__init__.py` bumped to 0.8.1.
v0.8.0 — GLM/Qwen3 MoE decode + exo integration
What's new
- GLM-4.7-Flash decode +8% — 46 MoE layers fused via the `gather_qmm_swiglu` custom primitive
- Qwen3-30B-A3B decode +6% — same fused MoE path
- exo integration — one-command setup (`bash setup_zmlx.sh`) for running ZMLX-accelerated models in exo clusters. See `docs/EXO.md`.
- mlx-lm compatibility layer — handles API differences across mlx-lm versions
- Auto-skip safety — models auto-skip on stock MLX when the custom primitive is unavailable (0% change, no regressions)
Requirements
- Stock MLX: LFM2 gains work out of the box
- GLM/Qwen3 gains: require building the custom `gather_qmm_swiglu` primitive (see `docs/EXPERIMENTAL_MLX.md`)
Benchmarks
| Model | Hardware | Change |
|---|---|---|
| LFM2-8B-A1B-4bit | M4 Max 36 GB | +11.6% |
| GLM-4.7-Flash-4bit | M4 Max 36 GB | +8.1% |
| GLM-4.7-Flash-4bit | M4 Mac Studio | +8% |
| Qwen3-30B-A3B-4bit | M4 Max 36 GB | +5.5% |
All results token-identical under greedy decoding.
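"Token-identical" here means greedy decoding with the patched model must reproduce the baseline token stream exactly, position by position. A trivial sketch of that fidelity check (methodology illustration, not ZMLX's benchmark harness):

```python
def token_identical(baseline: list[int], patched: list[int]) -> bool:
    """True iff the patched run emitted exactly the baseline token ids.

    Greedy decoding is deterministic, so any kernel that changes even one
    logit enough to flip an argmax shows up as a mismatch here.
    """
    return len(baseline) == len(patched) and all(
        a == b for a, b in zip(baseline, patched)
    )
```

This is a stricter bar than perplexity parity: it catches single-token divergences anywhere in the sequence.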
v0.7.13
What's New
Added
- GLM model detection in the `moe_mlp` pattern, with opt-in fused SwiGLU via the `ZMLX_GLM_FUSED_SWIGLU` env var
- Experimental MoE stream pool: `ZMLX_MOE_STREAMS=N` for multi-stream expert dispatch
- Kernel correctness tests covering bits, quant, image, indexing, fused_moe, and optimizers
Fixed
- Numerically stable sigmoid: `kk_sigmoid` uses abs+branch to avoid overflow on large negative inputs
- SwiGLU native dtype: forward/backward kernels use the native dtype with the `kk_sigmoid` helper
- mypy stream type errors in `moe_mlp.py` (`DeviceType` → `Device`)
Changed
- GLM auto-excluded: `moe_mlp` and `swiglu_mlp` excluded for GLM models (token fidelity failure)
- README polish, gitignore cleanup
Full Changelog: v0.7.12...v0.7.13
v0.7.12
[0.7.12] - 2026-02-01
Added
- ReLU2 kernels: `relu2` and `relu2_grad`, with catalog tests.
- Docs: `docs/BENCHMARKS.md` methodology + repro capsules; `docs/EXPERIMENTAL_MLX.md` for optional MLX fork work.
- Stable MoE coverage: Qwen3-30B-A3B and GPT-OSS-20B listed as token-identical on stock MLX.
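ReLU2 is squared ReLU, so its derivative is simply `2*relu(x)`. A NumPy sketch of the math the new kernels implement (illustrative reference, not the Metal code):

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    """ReLU-squared activation: relu(x)**2."""
    r = np.maximum(x, 0.0)
    return r * r

def relu2_grad(x: np.ndarray) -> np.ndarray:
    """d/dx relu(x)**2 = 2 * relu(x); zero on the negative half-line."""
    return 2.0 * np.maximum(x, 0.0)
```

A reference like this makes the backward kernel easy to verify against finite differences.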
Fixed
- MoE gating detection: cache GPT-OSS/Qwen3 model detection before class replacement so `_gating` selects the correct path.
- `moe_combine_exact` bf16 rounding: explicit rounding after multiply/add to match MLX bf16 accumulation semantics.
Changed
- GPT-OSS combine routing: float32 gating weights now use `moe_combine_fp32` to preserve MLX promotion behavior.
- Auto-excludes: Qwen3 excludes only `swiglu_mlp` + `residual_norm`; GPT-OSS excludes only `residual_norm`.
- README: simplified install (`zmlx[train]`), removed custom MLX details from main docs, refreshed stable model table.