Releases: Hmbown/ZMLX
v0.10.0: Qwen3.5-35B-A3B support
What's New
Qwen3.5-35B-A3B MoE decode support — patch(model) now auto-detects Qwen3.5's hybrid DeltaNet+Attention architecture (256 experts, K=8) and applies fused MoE decode kernels.
Measured results (M4 Max 36GB, stock MLX)
| Model | Decode | Prefill | Fidelity |
|---|---|---|---|
| Qwen3.5-35B-A3B-4bit | ~+2% | ~+4% | token-identical |
| LFM2-8B-A1B-4bit | +12.8% | neutral | token-identical |
| LFM2-24B-A2B-4bit | +6.0% | neutral | token-identical |
Results will vary depending on hardware, thermal state, and prompt length.
Also in this release
- DeltaNet pattern: fuses `conv1d + silu` for Qwen3.5 GatedDeltaNet decode layers (30 of 40 layers)
- Expert index sorting (opt-in): `ZMLX_MOE_SORT_EXPERTS=1` for DRAM-locality experiments
- README cleanup: consolidated benchmark sections into a single results table
- 150+ benchmark repro capsules from GLM/Qwen3 isolation sweeps and consistency checks
- New tests: 23 DeltaNet kernel tests + 8 patch tests + fusion/integration tests
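The expert-sorting flag is a locality experiment: grouping tokens by expert index before the expert gather turns scattered weight reads into contiguous runs. A minimal NumPy sketch of the idea (illustrative only, not ZMLX's kernel code):

```python
import numpy as np

def sort_tokens_by_expert(expert_ids: np.ndarray) -> np.ndarray:
    """Return a token order grouped by expert id (stable sort),
    so all tokens routed to the same expert are processed together."""
    return np.argsort(expert_ids, kind="stable")

# 6 tokens routed to 4 experts; sorting yields contiguous expert runs
expert_ids = np.array([3, 0, 3, 1, 0, 1])
order = sort_tokens_by_expert(expert_ids)
print(expert_ids[order])  # [0 0 1 1 3 3]
```

Whether the extra sort pays for itself depends on batch size and expert count, which is why the flag is opt-in.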
Install / upgrade
`pip install --upgrade zmlx`

Full changelog
See CHANGELOG.md
v0.9.2
- Auto-enable promoted Qwen3.5/Qwen3-Next moe_mlp defaults (fused SwiGLU + router argpartition logits + topk)
- Update README with front-and-center Qwen3.5 automatic configuration
- Add reproducibility capsule for auto-defaults vs explicit env parity
v0.9.0 — LFM2-24B +7% decode, foundry & fusion
Highlights
- LFM2-24B-A2B-MLX-4bit: +6–7% decode speedup on stock MLX (M4 Max), 500/500 token-identical fidelity
- D-SIMD gate kernel: fuses softmax+bias+topK into 1 Metal dispatch for D=64 expert gating (2 SIMD groups, 64 threads)
- Smart K-based defaults: K≤2 (LFM2-8B) → fused SwiGLU +12%; K≥3 (LFM2-24B) → D-SIMD gate +7%. No env vars needed.
- Foundry module: kernel template evaluation and SFT dataset generation (16 ops, 9 kernel classes)
- Fusion module: JIT graph tracing and Metal codegen for fused op sequences
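The D-SIMD gate collapses three ops into one Metal dispatch; an unfused NumPy reference is handy for checking its output. This sketch assumes the softmax → bias → top-k ordering stated above and is illustrative only, not the kernel's actual internals:

```python
import numpy as np

def gate_reference(logits: np.ndarray, bias: np.ndarray, k: int):
    """Unfused reference of a softmax+bias+top-k expert gate (last axis).

    Assumed ordering: softmax over router logits, add a per-expert bias,
    then pick the top-k experts by biased score.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    scores = p + bias                                # per-expert router bias
    topk = np.argpartition(scores, -k, axis=-1)[..., -k:]
    weights = np.take_along_axis(scores, topk, axis=-1)
    return topk, weights

# D=64 experts, K=8, as in the fused gate above
rng = np.random.default_rng(0)
idx, w = gate_reference(rng.normal(size=64), np.zeros(64), 8)
```

Note `argpartition` returns the top-k unordered, which is fine for gating since each selected expert carries its own weight.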
Quick Start
```
pip install -U zmlx
```

```python
from mlx_lm import load, generate
from zmlx.patch import patch

model, tokenizer = load("LiquidAI/LFM2-24B-A2B-MLX-4bit")
patch(model)  # auto-detects architecture, applies safe fusions
print(generate(model, tokenizer, prompt="Hello!", max_tokens=200))
```

Stock MLX Benchmarks (M4 Max, 36GB)
| Model | Baseline | Patched | Speedup | Fidelity |
|---|---|---|---|---|
| LFM2-8B-A1B-4bit | 200.1 tok/s | 223.3 tok/s | +11.6% | 500/500 |
| LFM2-24B-A2B-4bit | 152.0 tok/s | 161.1 tok/s | +6.0% | 500/500 |
Full changelog: https://github.com/Hmbown/ZMLX/blob/main/CHANGELOG.md#090---2026-02-24
v0.8.4
Highlights
- MoE (Qwen, experimental): new env-gated routing kernel path via `router_argpartition_logits_topk()`. Enable with:
  - `ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS=1`
  - `ZMLX_QWEN_ROUTER_ARGPARTITION_LOGITS_TOPK=1` (requires the above)
  - Off by default; intended for controlled benchmarks.
- Bench evidence: v8 promoted stack and v9 reproduction/routertopk suites are committed under `benchmarks/repro_capsules/` with matching `benchmarks/matrix.jsonl` rows.
- Custom MLX context (optional): build `mx.gather_qmm_swiglu` locally via `integrations/mlx_local_integration/` (no changes to upstream MLX required until you opt in).
- exo usage: run exo with runtime patching via `zmlx-exo` (see `docs/EXO.md`).
Install / Use
- Install ZMLX: `pip install zmlx`
- Run with exo (in the same env where exo is installed): `zmlx-exo`
- Optional custom MLX primitive (for GLM/Qwen3 gains): `bash integrations/mlx_local_integration/setup_mlx_local.sh` (see `docs/EXPERIMENTAL_MLX.md`)
Repro Capsules (Source of Truth)
Promoted v8 (summary capsules):
- `benchmarks/repro_capsules/qwen3_a3b_combo_v8_fp32nofmaonly_t200_r2_summary.json`
- `benchmarks/repro_capsules/qwen3_a3b_combo_v8_fp32nofmaonly_t1024_r2_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t200_r2_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v8_fp32nofmaonly_t1024_r2_summary.json`
v9 reproduction + routertopk (summary capsules):
- `benchmarks/repro_capsules/qwen3_combo_v9_repro_t200_r3_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_repro_t1024_r2_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v9_repro_t200_r3_summary.json`
- `benchmarks/repro_capsules/glm47_combo_v9_repro_t1024_r2_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t200_r3_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t1024_r2_summary.json`
- `benchmarks/repro_capsules/qwen3_combo_v9_routertopk_t1024_r3_confirm_summary.json`
Notes
- The router top-k kernel path is experimental and not promoted as a default.
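One plausible reading of the `router_argpartition_logits_topk` name is: select the top-k experts with `argpartition` on raw router logits, then normalize only the selected entries. This NumPy sketch illustrates that reading; it is an assumption about the shape of the computation, not ZMLX's kernel:

```python
import numpy as np

def router_topk_reference(logits: np.ndarray, k: int):
    """Hypothetical argpartition-on-logits routing (last axis):
    pick top-k raw logits, then softmax only the picked entries."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    picked = np.take_along_axis(logits, idx, axis=-1)
    z = picked - picked.max(axis=-1, keepdims=True)  # stable softmax over k values
    w = np.exp(z)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w
```

Partitioning before the softmax avoids normalizing over all experts, which is where such a path could save work on wide routers.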
v0.8.3
Fixed
- Qwen3 `moe_mlp` now disables fused `gather_qmm_swiglu` by default unless explicitly enabled with `ZMLX_QWEN_FUSED_SWIGLU=1`.
- Added regression tests for Qwen fused-SwiGLU env gating.
Changed
- `integrations/mlx_local_integration/setup_mlx_local.sh` now defaults to MLX `v0.30.6` (185b06d9...) for custom `gather_qmm_swiglu` bring-up.
- Updated lab notebook with reproducible bring-up notes and sequential GLM/Qwen validation results on custom MLX 0.30.6.
Validation
- `ruff check .`
- `pytest -q` (852 passed, 74 skipped, 3 xfailed)
v0.8.2
Full Changelog: v0.8.1...v0.8.2
v0.8.1
ZMLX 0.8.1
Patch release focused on MLX benchmark reruns and docs refresh.
Highlights
- Re-ran GLM-4.7-Flash stress benchmark on MLX 0.30.4.dev20260204+2f324cc and updated README references/capsules.
- Re-ran LFM2-8B-A1B stock-MLX benchmarks and updated README tables.
- Re-ran Qwen3-30B-A3B experiments and highlighted the best verified rerun in README (96.5 -> 104.3 tok/s, +8.1%, token-identical).
- Added new repro capsules under benchmarks/repro_capsules/ and synced benchmarks/matrix.jsonl.
Version
- `pyproject.toml` and `src/zmlx/__init__.py` bumped to 0.8.1.
v0.8.0 — GLM/Qwen3 MoE decode + exo integration
What's new
- GLM-4.7-Flash decode +8% — 46 MoE layers fused via the `gather_qmm_swiglu` custom primitive
- Qwen3-30B-A3B decode +6% — same fused MoE path
- exo integration — one-command setup (`bash setup_zmlx.sh`) for running ZMLX-accelerated models in exo clusters. See `docs/EXO.md`.
- mlx-lm compatibility layer — handles API differences across mlx-lm versions
- Auto-skip safety — models auto-skip on stock MLX when the custom primitive is unavailable (0% change, no regressions)
Requirements
- Stock MLX: LFM2 gains work out of the box
- GLM/Qwen3 gains: require building the custom `gather_qmm_swiglu` primitive (see `docs/EXPERIMENTAL_MLX.md`)
Benchmarks
| Model | Hardware | Change |
|---|---|---|
| LFM2-8B-A1B-4bit | M4 Max 36 GB | +11.6% |
| GLM-4.7-Flash-4bit | M4 Max 36 GB | +8.1% |
| GLM-4.7-Flash-4bit | M4 Mac Studio | +8% |
| Qwen3-30B-A3B-4bit | M4 Max 36 GB | +5.5% |
All results token-identical under greedy decoding.
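"Token-identical" here means greedy decoding with the patched model must reproduce the baseline token stream exactly, position by position. A trivial sketch of that fidelity check (methodology illustration, not ZMLX's benchmark harness):

```python
def token_identical(baseline: list[int], patched: list[int]) -> bool:
    """True iff the patched run emitted exactly the baseline token ids.

    Greedy decoding is deterministic, so any kernel that changes even one
    logit enough to flip an argmax shows up as a mismatch here.
    """
    return len(baseline) == len(patched) and all(
        a == b for a, b in zip(baseline, patched)
    )
```

This is a stricter bar than perplexity parity: it catches single-token divergences anywhere in the sequence.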
v0.7.13
What's New
Added
- GLM model detection in the `moe_mlp` pattern, with opt-in fused SwiGLU via the `ZMLX_GLM_FUSED_SWIGLU` env var
- Experimental MoE stream pool: `ZMLX_MOE_STREAMS=N` for multi-stream expert dispatch
- Kernel correctness tests covering bits, quant, image, indexing, fused_moe, and optimizers
Fixed
- Numerically stable sigmoid: `kk_sigmoid` uses abs+branch to avoid overflow on large negative inputs
- SwiGLU native dtype: forward/backward kernels use the native dtype with the `kk_sigmoid` helper
- mypy stream type errors in `moe_mlp.py` (`DeviceType` → `Device`)
Changed
- GLM auto-excluded: `moe_mlp` and `swiglu_mlp` excluded for GLM models (token fidelity failure)
- README polish, gitignore cleanup
Full Changelog: v0.7.12...v0.7.13
v0.7.12
[0.7.12] - 2026-02-01
Added
- ReLU2 kernels: `relu2` and `relu2_grad`, with catalog tests.
- Docs: `docs/BENCHMARKS.md` methodology + repro capsules; `docs/EXPERIMENTAL_MLX.md` for optional MLX fork work.
- Stable MoE coverage: Qwen3-30B-A3B and GPT-OSS-20B listed as token-identical on stock MLX.
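ReLU2 is squared ReLU, so its derivative is simply `2*relu(x)`. A NumPy sketch of the math the new kernels implement (illustrative reference, not the Metal code):

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    """ReLU-squared activation: relu(x)**2."""
    r = np.maximum(x, 0.0)
    return r * r

def relu2_grad(x: np.ndarray) -> np.ndarray:
    """d/dx relu(x)**2 = 2 * relu(x); zero on the negative half-line."""
    return 2.0 * np.maximum(x, 0.0)
```

A reference like this makes the backward kernel easy to verify against finite differences.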
Fixed
- MoE gating detection: cache GPT-OSS/Qwen3 model detection before class replacement so `_gating` selects the correct path.
- `moe_combine_exact` bf16 rounding: explicit rounding after multiply/add to match MLX bf16 accumulation semantics.
Changed
- GPT-OSS combine routing: float32 gating weights now use `moe_combine_fp32` to preserve MLX promotion behavior.
- Auto-excludes: Qwen3 excludes only `swiglu_mlp` + `residual_norm`; GPT-OSS excludes only `residual_norm`.
- README: simplified install (`zmlx[train]`), removed custom MLX details from main docs, refreshed stable model table.