Releases: Hmbown/ZMLX

v0.10.0: Qwen3.5-35B-A3B support

03 Mar 21:20

What's New

Qwen3.5-35B-A3B MoE decode support: `patch(model)` now auto-detects Qwen3.5's hybrid DeltaNet+Attention architecture (256 experts, K=8) and applies fused MoE decode kernels.

Measured results (M4 Max 36GB, stock MLX)

| Model | Decode | Prefill | Fidelity |
| --- | --- | --- | --- |
| Qwen3.5-35B-A3B-4bit | ~+2% | ~+4% | token-identical |
| LFM2-8B-A1B-4bit | +12.8% | neutral | token-identical |
| LFM2-24B-A2B-4bit | +6.0% | neutral | token-identical |

Results will vary depending on hardware, thermal state, and prompt length.

Also in this release

  • DeltaNet pattern: fuses conv1d + silu for Qwen3.5 GatedDeltaNet decode layers (30 of 40 layers)
  • Expert index sorting (opt-in): ZMLX_MOE_SORT_EXPERTS=1 for DRAM locality experiments
  • README cleanup: consolidated benchmark sections into a single results table
  • 150+ benchmark repro capsules from GLM/Qwen3 isolation sweeps and consistency checks
  • New tests: 23 DeltaNet kernel tests + 8 patch tests + fusion/integration tests
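
The expert-sorting experiment is enabled with a single environment variable before launching a run; the benchmark command below is illustrative, not a prescribed invocation:

```shell
# Opt in to expert index sorting (experimental; for DRAM locality experiments)
export ZMLX_MOE_SORT_EXPERTS=1
# then run your usual decode benchmark, e.g. (illustrative):
# python your_decode_benchmark.py
```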

Install / upgrade

```shell
pip install --upgrade zmlx
```

Full changelog

See CHANGELOG.md

zmlx 0.9.2

25 Feb 17:48

  • Auto-enable promoted Qwen3.5/Qwen3-Next moe_mlp defaults (fused SwiGLU + router argpartition logits + topk)
  • Update README with front-and-center Qwen3.5 automatic configuration
  • Add reproducibility capsule for auto-defaults vs explicit env parity

v0.9.0 — LFM2-24B +7% decode, foundry & fusion

24 Feb 16:48

Highlights

  • LFM2-24B-A2B-MLX-4bit: +6–7% decode speedup on stock MLX (M4 Max), 500/500 token-identical fidelity
  • D-SIMD gate kernel: fuses softmax+bias+topK into 1 Metal dispatch for D=64 expert gating (2 SIMD groups, 64 threads)
  • Smart K-based defaults: K≤2 (LFM2-8B) → fused SwiGLU +12%; K≥3 (LFM2-24B) → D-SIMD gate +7%. No env vars needed.
  • Foundry module: kernel template evaluation and SFT dataset generation (16 ops, 9 kernel classes)
  • Fusion module: JIT graph tracing and Metal codegen for fused op sequences
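
The K-based defaults above amount to a simple dispatch on the router's top-K. A hypothetical sketch (`choose_decode_path` and the returned names are illustrative, not the ZMLX API):

```python
def choose_decode_path(k: int) -> str:
    """Hypothetical sketch of the K-based default selection described above:
    small top-K favors the fused SwiGLU path, larger top-K the D-SIMD gate."""
    if k <= 2:           # e.g. LFM2-8B-A1B: fused SwiGLU, ~+12%
        return "fused_swiglu"
    return "dsimd_gate"  # e.g. LFM2-24B-A2B: D-SIMD gate, ~+7%
```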

Quick Start

```shell
pip install -U zmlx
```

```python
from mlx_lm import load, generate
from zmlx.patch import patch

model, tokenizer = load("LiquidAI/LFM2-24B-A2B-MLX-4bit")
patch(model)  # auto-detects architecture, applies safe fusions
print(generate(model, tokenizer, prompt="Hello!", max_tokens=200))
```

Stock MLX Benchmarks (M4 Max, 36GB)

| Model | Baseline | Patched | Speedup | Fidelity |
| --- | --- | --- | --- | --- |
| LFM2-8B-A1B-4bit | 200.1 tok/s | 223.3 tok/s | +11.6% | 500/500 |
| LFM2-24B-A2B-4bit | 152.0 tok/s | 161.1 tok/s | +6.0% | 500/500 |

Full changelog: https://github.com/Hmbown/ZMLX/blob/main/CHANGELOG.md#090---2026-02-24

v0.8.4

10 Feb 19:44

Highlights

  • MoE (Qwen, experimental): new env-gated routing kernel path via router_argpartition_logits_topk().
    • Enable with:
      • ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS=1
      • ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS_TOPK=1 (requires the above)
    • Off by default; intended for controlled benchmarks.
  • Bench evidence: v8 promoted stack and v9 reproduction/routertopk suites are committed under benchmarks/repro_capsules/ with matching benchmarks/matrix.jsonl rows.
  • Custom MLX context (optional): build mx.gather_qmm_swiglu locally via integrations/mlx_local_integration/ (no changes to upstream MLX required until you opt in).
  • exo usage: run exo with runtime patching via zmlx-exo (see docs/EXO.md).
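
To try the experimental router path in a controlled benchmark, export both flags from the release notes (off by default):

```shell
# Experimental Qwen router kernel path; intended for controlled benchmarks only
export ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS=1
export ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS_TOPK=1  # requires the flag above
```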

Install / Use

  • Install ZMLX: pip install zmlx
  • Run with exo (in the same env where exo is installed): zmlx-exo
  • Optional custom MLX primitive (for GLM/Qwen3 gains): bash integrations/mlx_local_integration/setup_mlx_local.sh (see docs/EXPERIMENTAL_MLX.md)

Repro Capsules (Source of Truth)

Promoted v8 (summary capsules):

  • benchmarks/repro_capsules/qwen3_a3b_combo_v8_fp32nofmaonly_t200_r2_summary.json
  • benchmarks/repro_capsules/qwen3_a3b_combo_v8_fp32nofmaonly_t1024_r2_summary.json
  • benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t200_r2_summary.json
  • benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t1024_r2_summary.json

v9 reproduction + routertopk (summary capsules):

  • benchmarks/repro_capsules/qwen3_combo_v9_repro_t200_r3_summary.json
  • benchmarks/repro_capsules/qwen3_combo_v9_repro_t1024_r2_summary.json
  • benchmarks/repro_capsules/glm47_combo_v9_repro_t200_r3_summary.json
  • benchmarks/repro_capsules/glm47_combo_v9_repro_t1024_r2_summary.json
  • benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t200_r3_summary.json
  • benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t1024_r2_summary.json
  • benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t1024_r3_confirm_summary.json

Notes

  • The router top-k kernel path is experimental and not promoted as a default.

v0.8.3

08 Feb 23:21

Fixed

  • Qwen3 moe_mlp now disables fused gather_qmm_swiglu by default unless explicitly enabled with ZMLX_QWEN_FUSED_SWIGLU=1.
  • Added regression tests for Qwen fused-SwiGLU env gating.

Changed

  • integrations/mlx_local_integration/setup_mlx_local.sh now defaults to MLX v0.30.6 (185b06d9...) for custom gather_qmm_swiglu bring-up.
  • Updated lab notebook with reproducible bring-up notes and sequential GLM/Qwen validation results on custom MLX 0.30.6.

Validation

  • ruff check .
  • pytest -q (852 passed, 74 skipped, 3 xfailed)

v0.8.2

07 Feb 20:00

Full Changelog: v0.8.1...v0.8.2

v0.8.1

05 Feb 21:30

ZMLX 0.8.1

Patch release focused on MLX benchmark reruns and docs refresh.

Highlights

  • Re-ran GLM-4.7-Flash stress benchmark on MLX 0.30.4.dev20260204+2f324cc and updated README references/capsules.
  • Re-ran LFM2-8B-A1B stock-MLX benchmarks and updated README tables.
  • Re-ran Qwen3-30B-A3B experiments and highlighted the best verified rerun in README (96.5 -> 104.3 tok/s, +8.1%, token-identical).
  • Added new repro capsules under benchmarks/repro_capsules/ and synced benchmarks/matrix.jsonl.

Version

  • pyproject.toml and src/zmlx/__init__.py bumped to 0.8.1.

v0.8.0 — GLM/Qwen3 MoE decode + exo integration

04 Feb 16:44

What's new

  • GLM-4.7-Flash decode +8% — 46 MoE layers fused via gather_qmm_swiglu custom primitive
  • Qwen3-30B-A3B decode +6% — same fused MoE path
  • exo integration — one-command setup (bash setup_zmlx.sh) for running ZMLX-accelerated models in exo clusters. See docs/EXO.md.
  • mlx-lm compatibility layer — handles API differences across mlx-lm versions
  • Auto-skip safety — models auto-skip on stock MLX when custom primitive is unavailable (0% change, no regressions)

Requirements

  • Stock MLX: LFM2 gains work out of the box
  • GLM/Qwen3 gains: requires building the custom gather_qmm_swiglu primitive (see docs/EXPERIMENTAL_MLX.md)

Benchmarks

| Model | Hardware | Change |
| --- | --- | --- |
| LFM2-8B-A1B-4bit | M4 Max 36 GB | +11.6% |
| GLM-4.7-Flash-4bit | M4 Max 36 GB | +8.1% |
| GLM-4.7-Flash-4bit | M4 Mac Studio | +8% |
| Qwen3-30B-A3B-4bit | M4 Max 36 GB | +5.5% |

All results token-identical under greedy decoding.

v0.7.13

02 Feb 22:38

What's New

Added

  • GLM model detection in moe_mlp pattern, with opt-in fused SwiGLU via ZMLX_GLM_FUSED_SWIGLU env var
  • Experimental MoE stream pool: ZMLX_MOE_STREAMS=N for multi-stream expert dispatch
  • Kernel correctness tests covering bits, quant, image, indexing, fused_moe, and optimizers

Fixed

  • Numerically stable sigmoid: kk_sigmoid uses abs+branch to avoid overflow on large negative inputs
  • SwiGLU native dtype: forward/backward kernels use native dtype with kk_sigmoid helper
  • mypy stream type errors in moe_mlp.py (DeviceType → Device)
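
The abs+branch trick behind kk_sigmoid can be illustrated in NumPy (a sketch of the technique, not the Metal kernel): exp() is only ever evaluated at -|x|, so it cannot overflow for large-magnitude negative inputs.

```python
import numpy as np

def stable_sigmoid(x):
    """Numerically stable sigmoid via abs+branch (NumPy sketch of the
    kk_sigmoid idea). exp(-|x|) lies in (0, 1], so it never overflows."""
    z = np.exp(-np.abs(x))
    # For x >= 0: sigma(x) = 1 / (1 + e^-x); for x < 0: sigma(x) = e^x / (1 + e^x)
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
```

A naive `1 / (1 + np.exp(-x))` overflows in exp() for large negative x; the branched form stays finite everywhere.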

Changed

  • GLM auto-excluded: moe_mlp and swiglu_mlp excluded for GLM models (token fidelity failure)
  • README polish, gitignore cleanup

Full Changelog: v0.7.12...v0.7.13

v0.7.12

01 Feb 22:20

[0.7.12] - 2026-02-01

Added

  • ReLU2 kernels: relu2 and relu2_grad with catalog tests.
  • Docs: docs/BENCHMARKS.md methodology + repro capsules, docs/EXPERIMENTAL_MLX.md for optional MLX fork work.
  • Stable MoE coverage: Qwen3-30B-A3B and GPT-OSS-20B listed as token-identical on stock MLX.

Fixed

  • MoE gating detection: cache GPT-OSS/Qwen3 model detection before class replacement so _gating selects the correct path.
  • moe_combine_exact bf16 rounding: explicit rounding after multiply/add to match MLX bf16 accumulation semantics.
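
The explicit bf16 rounding referred to above can be illustrated with the standard round-to-nearest-even bit trick (a NumPy sketch of the idea, not the ZMLX kernel):

```python
import numpy as np

def round_to_bf16(x):
    """Round float32 values to bfloat16 precision (round-to-nearest-even),
    keeping float32 storage. Illustrative only; not the moe_combine_exact code."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)        # lowest surviving mantissa bit
    rounded = (bits + np.uint32(0x7FFF) + lsb) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)
```

Applying this after each multiply/add makes a float32 reference reduction reproduce bf16 accumulation semantics, which is what "explicit rounding" buys here.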

Changed

  • GPT-OSS combine routing: float32 gating weights now use moe_combine_fp32 to preserve MLX promotion behavior.
  • Auto-excludes: Qwen3 excludes only swiglu_mlp + residual_norm; GPT-OSS excludes only residual_norm.
  • README: simplified install (zmlx[train]), removed custom MLX details from main docs, refreshed stable model table.