Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
cd92f6f
Add TurboQuant KV cache compression (3-bit, 4.6x)
arozanov Mar 28, 2026
530e6a5
Add architecture compatibility check for TurboQuant
arozanov Mar 28, 2026
de54031
Rework TurboQuant: to_turbo_quantized(), make_prompt_cache routing, C…
arozanov Mar 28, 2026
9315fbc
Add TurboQuant tests and fix save/load support
arozanov Mar 29, 2026
fceb638
Add public dequantize() and copy() methods to TurboQuantKVCache
arozanov Apr 1, 2026
0efd502
Mixed-precision quantized KV cache: K@8-bit + V@4-bit
arozanov Apr 17, 2026
71e3996
Optimize MixedQuantKVCache hot path
arozanov Apr 17, 2026
448063d
Add KV cache quantization and disk persistence to mlx_lm.server
arozanov Apr 18, 2026
4e5bad1
Fix disk cache serialization for all cache types
arozanov Apr 19, 2026
ccecaba
Guard checkpoint_callback against uninitialized cache layers
arozanov Apr 19, 2026
f90d364
Auto-clean old cache format + guard empty cache on fetch
arozanov Apr 20, 2026
52d419b
Fix CacheList support for MoE models (GLM-5.1, DeepSeek V3)
arozanov Apr 20, 2026
86a7e22
Fix checkpoint_callback consuming generator twice
arozanov Apr 22, 2026
ec14344
Add --no-batch flag to force single-serve mode
arozanov Apr 23, 2026
c411d20
Fix cli_args -> self.cli_args in no_batch check
arozanov Apr 23, 2026
e010ee2
Fix Stream(gpu,2) threading crash in server mode
arozanov Apr 25, 2026
d1ce32b
Fix stale stream reference + add set_default_stream
arozanov Apr 25, 2026
572e694
Security hardening: allowlist cache classes, validate inputs, fix stream
arozanov Apr 25, 2026
ea6111c
Refactor trim_to: delegate to parent instead of duplicating logic
arozanov Apr 25, 2026
dfd9f50
Add missing cache classes to allowlist
arozanov Apr 25, 2026
51c2714
Add sub-cache allowlist in CacheList.from_state, reject invalid state
arozanov Apr 25, 2026
037d405
Restore mx.set_default_stream -- it IS thread-local
arozanov Apr 26, 2026
56a5480
Fix async_eval outside stream block in generate_step
arozanov Apr 26, 2026
24f8a95
Revert "Fix async_eval outside stream block in generate_step"
arozanov Apr 26, 2026
6849929
Run generation on main thread, HTTP server in background
arozanov Apr 26, 2026
dccaac6
Add graceful shutdown on KeyboardInterrupt
arozanov Apr 26, 2026
4014b77
Add prefill checkpoint saving in single-serve mode
arozanov Apr 27, 2026
6d4fc42
Save prefill checkpoint for multi-turn cache reuse
arozanov Apr 29, 2026
136d9c6
Handle exact cache hit: trim last token to avoid empty prompt crash
arozanov Apr 29, 2026
fda593e
Separate disk cache size from RAM cache size
arozanov Apr 29, 2026
23ed009
Add DeepSeek V4 model support
arozanov May 4, 2026
d32dd4c
Fix tokenizer loading for unrecognized model types
arozanov May 4, 2026
530dbd4
Fix chunked prefill crash for compressed layers
arozanov May 5, 2026
449059f
Skip disk cache save when system memory is low
arozanov May 5, 2026
c672a36
Fix continuation prefill compressor crash
arozanov May 5, 2026
39925f7
Fix continuation prefill crashes and optimize Sinkhorn
arozanov May 5, 2026
c0df3d4
DeepSeek V4 performance optimizations and bugfixes
arozanov May 7, 2026
15d9828
Add DeepSeek V4 section to README
arozanov May 7, 2026
a6c3ae6
Add tests, server TurboQuant args, and bugfixes
arozanov May 8, 2026
bb75fd6
Merge upstream/main and resolve conflicts
arozanov May 8, 2026
6b95c0a
Add FP8 weight loading, value compression, and batch mode
arozanov May 8, 2026
cd38d1a
Fallback for older MLX without device_info()
arozanov May 8, 2026
1e6dad5
Support Thump604 weight naming in sanitize()
arozanov May 8, 2026
d4af330
Fix switch_mlp remap doubling ffn prefix
arozanov May 8, 2026
110f2c7
Support single wo_a linear for Thump604 quantized weights
arozanov May 8, 2026
d00a7f6
Fix wo_a for Thump604: single linear with per-group row slicing
arozanov May 8, 2026
700d2c7
Remap quantization config keys for Thump604 mixed-bit models
arozanov May 8, 2026
9c24eed
Remap quantization config keys for renamed weight paths
arozanov May 8, 2026
d9839d1
Add mimo_v2 -> mimo_v2_flash model type remapping
arozanov May 8, 2026
ee35b61
Add MoE expert offloading with LRU cache (--max-resident-experts)
arozanov May 8, 2026
79cc0b7
Fix prefill offloading: ensure_resident per token, not entire prompt
arozanov May 10, 2026
860a4af
Extend fused Metal kernels to 8-bit quantized weights
arozanov May 10, 2026
83c9738
Add tests for V4, TurboQuant, offloading, 8-bit kernels, sanitize
arozanov May 13, 2026
f001580
Support reversed HC naming (attn_hc/ffn_hc) in sanitize
arozanov May 13, 2026
9061d31
Restore KV cache quantization and disk cache CLI flags
arozanov May 14, 2026
67db9af
Fix Metal resource leak in batched_sparse_decode
arozanov May 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 37 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,40 @@
## DeepSeek V4 Support

First MLX implementation of [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4) (284B MoE). This fork adds full inference support including:

- **Architecture**: CSA/HCA sparse attention, Hyper-Connections, Lightning Indexer, 256 MoE experts
- **Custom fused Metal kernels** for decode acceleration
- **Disk-backed KV cache** for large context lengths
- **Multi-turn cache reuse**, stream threading fix, tokenizer fallback for unknown model types

### Performance (Mac Studio M3 Ultra 512GB)

| Quantization | Throughput | RAM Usage |
|---|---|---|
| 4-bit (MLX) | 21 tok/s | 161 GB |
| 8-bit (MLX) | 8.5 tok/s | 303 GB |

### Quick Start

```sh
pip install git+https://github.com/arozanov/mlx-lm.git@feature/turboquant-kv-cache
huggingface-cli download mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit --local-dir models/DeepSeek-V4-Flash-4bit
mlx_lm.server --model models/DeepSeek-V4-Flash-4bit --host 127.0.0.1 --port 8080 --prompt-cache-size 5 --no-batch
```

For 8-bit:

```sh
huggingface-cli download mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit --local-dir models/DeepSeek-V4-Flash-8bit
mlx_lm.server --model models/DeepSeek-V4-Flash-8bit --host 127.0.0.1 --port 8080 --prompt-cache-size 3 --no-batch
```

**Requirements**: Apple Silicon Mac, 192GB+ unified RAM (4-bit) or 384GB+ (8-bit).

**Branch**: `feature/turboquant-kv-cache`

---

## MLX LM

MLX LM is a Python package for generating text and fine-tuning large language
Expand Down
Loading