Releases: ml-explore/mlx
v0.14.1
v0.14.0
Highlights
- Small-size build that JIT compiles kernels and omits the CPU backend, resulting in a binary under 4 MB
- `mx.gather_qmm`: quantized equivalent of `mx.gather_mm`, which speeds up MoE inference by ~2x
- Grouped 2D convolutions
Core
- `mx.conjugate`
- `mx.conv3d` and `nn.Conv3d`
- List-based indexing
- Started `mx.distributed`, which uses MPI (if installed) for communication across machines: `mx.distributed.init`, `mx.distributed.all_gather`, `mx.distributed.all_reduce_sum`
- Support conversion to and from DLPack
- `mx.linalg.cholesky` on CPU
- `mx.quantized_matmul` sped up for vector-matrix products
- `mx.trace`
- `mx.block_masked_mm` now supports floating-point masks!
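Since the list above mentions DLPack support: the protocol lets frameworks exchange arrays zero-copy. A minimal sketch of the consumer side, using NumPy as both producer and consumer so it runs without MLX installed (in practice an `mx.array` would be the producer):

```python
import numpy as np

# Any object implementing the DLPack protocol (__dlpack__ / __dlpack_device__)
# can be consumed zero-copy. An mx.array is such a producer; a NumPy array
# stands in for it here so the sketch runs anywhere.
producer = np.arange(6, dtype=np.float32).reshape(2, 3)

# from_dlpack builds an array over the same memory -- no copy is made.
consumer = np.from_dlpack(producer)

assert np.shares_memory(producer, consumer)
producer[0, 0] = 42.0
print(consumer[0, 0])  # 42.0 -- the write is visible through the consumer
```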
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions
v0.13.1
v0.13.0
Highlights
- Block sparse matrix multiply speeds up MoEs by >2x
- Improved quantization algorithm should work well for all networks
- Improved GPU command submission speeds up training and inference
Core
- Bitwise ops added: `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, and operator overloads
- Groups added to `Conv1d`
- Added `mx.metal.device_info` to get better informed memory limits
- Added resettable memory stats
- `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce` added
- Added `mx.arctan2`
- Unary ops now accept array-like inputs, i.e. one can do `mx.sqrt(2)`
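`mlx.utils.tree_reduce` folds a function over the leaves of a nested parameter structure (handy for things like summing gradient norms). A hypothetical pure-Python sketch of the idea; the real MLX signature may differ:

```python
from functools import reduce

def tree_reduce(fn, tree, acc):
    """Fold `fn` over every leaf of a nested dict/list/tuple structure.
    A sketch of the idea behind mlx.utils.tree_reduce, not its exact API."""
    if isinstance(tree, dict):
        return reduce(lambda a, t: tree_reduce(fn, t, a), tree.values(), acc)
    if isinstance(tree, (list, tuple)):
        return reduce(lambda a, t: tree_reduce(fn, t, a), tree, acc)
    return fn(acc, tree)  # leaf

params = {"w": [1.0, 2.0], "b": {"bias": 3.0}}
total = tree_reduce(lambda a, x: a + x, params, 0.0)
print(total)  # 6.0
```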
Bugfixes
- Fixed shape for slice update
- Bugfix in quantize that used slightly wrong scales/biases
- Fixed memory leak for multi-output primitives encountered with gradient checkpointing
- Fixed conversion from other frameworks for all datatypes
- Fixed index overflow for matmul with large batch size
- Fixed initialization ordering that occasionally caused segfaults
v0.12.2
v0.12.0
Highlights
- Faster quantized matmul
- Up to 40% faster QLoRA or prompt processing, some numbers
Core
- `mx.synchronize` to wait for computation dispatched with `mx.async_eval`
- `mx.radians` and `mx.degrees`
- `mx.metal.clear_cache` to return to the OS the memory held by MLX as a cache for future allocations
- Change quantization to always represent 0 exactly (relevant issue)
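Representing 0 exactly matters because weights are often exactly zero (padding, masking, sparsity), and an affine scheme whose zero-point falls between integer levels reconstructs them with error. A generic NumPy sketch of the idea, not MLX's actual quantization kernel:

```python
import numpy as np

def quantize(x, bits=4):
    # Generic affine quantization: x ~= scale * q + beta.
    # Snap beta so that exactly 0.0 is representable: pick the integer
    # level q0 closest to -beta/scale and recompute beta = -scale * q0.
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    beta = lo
    q0 = round(-beta / scale)
    beta = -scale * q0           # now scale * q0 + beta == 0 exactly
    q = np.clip(np.round((x - beta) / scale), 0, levels)
    return q, scale, beta

x = np.array([-1.0, 0.0, 0.5, 2.0])
q, scale, beta = quantize(x)
dq = scale * q + beta
print(dq[1])  # 0.0 -- zero round-trips exactly
```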
Bugfixes
- Fixed quantization of a block with all 0s that produced NaNs
- Fixed the `len` field in the buffer protocol implementation
v0.11.0
v0.10.0
Highlights
- Improvements for LLM generation
- Reshapeless quant matmul/matvec
- `mx.async_eval`
- Async command encoding
Core
- Slightly faster reshapeless quantized gemms
- Option for precise softmax
- `mx.metal.start_capture` and `mx.metal.stop_capture` for GPU debug/profile
- `mx.expm1`
- `mx.std`
- `mx.meshgrid`
- `mx.random.multivariate_normal` (CPU only)
- `mx.cumsum` (and other scans) for `bfloat`
- Async command encoder with explicit barriers / dependency management
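On the "precise softmax" option: softmax is numerically delicate, which is why higher-precision variants exist at all. The classic failure mode and the standard max-subtraction fix, sketched generically in NumPy (MLX's `precise` option concerns computation precision for low-precision inputs; this only illustrates the underlying numerical issue):

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)          # overflows to inf for large logits
    return e / e.sum()     # inf / inf -> nan

def stable_softmax(x):
    # Subtracting the max keeps exp() in range; the result is unchanged
    # because the constant shift cancels in the ratio.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(x))   # [nan nan nan]
print(stable_softmax(x))  # well-defined probabilities summing to 1
```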
NN
- `nn.upsample`: support for bicubic interpolation
Misc
- Updated MLX Extension to work with nanobind
Bugfixes
- Fix buffer donation in softmax and fast ops
- Bug in layer norm vjp
- Bug initializing from lists with scalar
- Bug in indexing
- CPU compilation bug
- Multi-output compilation bug
- Fix stack overflow issues in eval and array destruction
v0.9.0
Highlights
- Fast partial RoPE (used by Phi-2)
- Fast gradients for RoPE, RMSNorm, and LayerNorm
- Up to 7x faster, benchmarks
Core
- More overhead reductions
- Partial fast RoPE (fast Phi-2)
- Better buffer donation for copy
- Type hierarchy and issubdtype
- Fast VJPs for RoPE, RMSNorm, and LayerNorm
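Partial RoPE rotates only the first `dims` feature dimensions and passes the rest through unchanged, which is the variant Phi-2 uses. A NumPy sketch under the half-split (non-interleaved) convention; MLX's `mx.fast.rope` kernel and argument names may differ:

```python
import numpy as np

def partial_rope(x, dims, base=10000.0):
    """Apply rotary position embeddings to only the first `dims` features
    of x (shape: seq_len x features), leaving the rest untouched."""
    seq_len, _ = x.shape
    half = dims // 2
    pos = np.arange(seq_len)[:, None]           # (seq, 1)
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    theta = pos * freqs                         # (seq, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2, rest = x[:, :half], x[:, half:dims], x[:, dims:]
    rotated = np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)

x = np.ones((4, 8), dtype=np.float32)
y = partial_rope(x, dims=4)
print(np.array_equal(y[:, 4:], x[:, 4:]))  # True: tail features untouched
```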
NN
- `Module.set_dtype`
- Chaining in `nn.Module` (`model.freeze().update(…)`)
Bugfixes
- Fix set item bugs
- Fix scatter vjp
- Check shape integer overflow on array construction
- Fix bug with module attributes
- Fix two bugs for odd shaped QMV
- Fix GPU sort for large sizes
- Fix bug in negative padding for convolutions
- Fix bug in multi-stream race condition for graph evaluation
- Fix random normal generation for half precision
v0.8.0
Highlights
- More perf!
  - `mx.fast.rms_norm` and `mx.fast.layer_norm`
  - Switch to nanobind substantially reduces overhead
  - Up to 4x faster `__setitem__` (e.g. `a[...] = b`)
Core
- `mx.inverse`, CPU only
- vmap over `mx.matmul` and `mx.addmm`
- Switch to nanobind from pybind11
- Faster setitem indexing
- `mx.fast.rms_norm`, token generation benchmark
- `mx.fast.layer_norm`, token generation benchmark
- vmap for inverse and svd
- Faster non-overlapping pooling
Optimizers
- Set minimum value in cosine decay scheduler
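With a minimum value, the schedule decays toward a floor rather than to zero. A hypothetical sketch of such a scheduler (names here are illustrative, not necessarily MLX's exact API):

```python
import math

def cosine_decay(init, decay_steps, end=0.0):
    # The learning rate follows a half cosine from `init` down to `end`
    # over `decay_steps`, then stays pinned at `end`.
    def schedule(step):
        s = min(step, decay_steps)
        decay = 0.5 * (1 + math.cos(math.pi * s / decay_steps))
        return end + (init - end) * decay
    return schedule

lr = cosine_decay(0.1, decay_steps=100, end=0.01)
print(lr(100))  # 0.01 -- reaches the floor
print(lr(500))  # 0.01 -- never decays below it
```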
Bugfixes
- Fix bug in multi-dimensional reduction