Multi-round inference leads to insufficient memory crash #314

After running both [Qwen3.6 35B 4bit mlx](https://lmstudio.ai/unn/qwen3.6-35b-a3b-4bit) and [Qwen3.6 35B q4_k_m gguf](https://lmstudio.ai/models/qwen/qwen3.6-35b-a3b) in Claude Code for a while, the MLX version consumes significantly more memory than the GGUF version and eventually crashes due to insufficient memory.

To reproduce, run [long_prompt_demo.py](https://github.com/Edsuns/mlx-engine/blob/demo/memory-bloat/long_prompt_demo.py) on a Mac mini M4 with 32GB of memory:
```shell
lms get mlx-community/Qwen3.6-35B-A3B-4bit
python3 long_prompt_demo.py --model mlx-community/Qwen3.6-35B-A3B-4bit --max-kv-size 65536 --rounds 5 --prompt-length 100000
```
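To quantify the growth across rounds, a memory reading can be taken after each round and the per-round deltas compared. The following is only a minimal sketch, not part of `long_prompt_demo.py`: `track_rounds`, `growth_per_round`, and `mlx_active_bytes` are hypothetical helper names, and the default reader assumes an MLX version that exposes `mlx.core.get_active_memory()` (older releases expose it under `mlx.core.metal`).

```python
from typing import Callable, List


def mlx_active_bytes() -> int:
    """Bytes currently held by MLX arrays.

    Assumption: the installed MLX exposes mlx.core.get_active_memory().
    """
    import mlx.core as mx  # imported lazily so the helpers below work without MLX

    return mx.get_active_memory()


def track_rounds(
    run_round: Callable[[int], None],
    rounds: int,
    read_memory: Callable[[], int] = mlx_active_bytes,
) -> List[int]:
    """Run `rounds` rounds and return the memory reading taken after each one."""
    readings = []
    for i in range(rounds):
        run_round(i)  # hypothetical callback: one prompt + generation round
        readings.append(read_memory())
    return readings


def growth_per_round(readings: List[int]) -> List[int]:
    """Differences between consecutive readings; steadily positive values suggest growth."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]
```

If the deltas stay positive round after round even though the prompt cache is being reused, that would point at memory being retained between rounds rather than at a single oversized request.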

Models I have tested that exhibit this problem:
- [Qwen3.6 35B 4bit mlx](https://lmstudio.ai/unn/qwen3.6-35b-a3b-4bit)
- [Qwen3.5 35B mxfp4 MLX](https://huggingface.co/TheCluster/Qwen3.5-35B-A3B-Heretic-MLX-mxfp4)

Error log when using lms:
```
2026-04-18 12:08:57  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Running Anthropic messages API on conversation with 19 messages.
2026-04-18 12:08:57  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Streaming Anthropic response...
2026-04-18 12:08:58 [DEBUG]
 [cache_wrapper][INFO]: Prompt cache: using 36384/36871 tokens from cache
2026-04-18 12:08:58  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Prompt processing progress: 0.0%
2026-04-18 12:09:04  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Prompt processing progress: 97.7%
2026-04-18 12:09:04  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Prompt processing progress: 99.8%
2026-04-18 12:09:04  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Prompt processing progress: 99.8%
2026-04-18 12:09:05 [DEBUG]
 libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
2026-04-18 12:09:05 [DEBUG]
 Fatal Python error: Aborted

Thread 0x
2026-04-18 12:09:05 [DEBUG]
 0000000de2c57000 (most recent call first):
  File "/Users/xxx/.lmstudio/extensions/backends/vendor/_amphibian/app-mlx-generate-mac14-arm64@22/lib/python3.11/site-packages/mlx_lm/generate.py", line 455 in generate_step
  File "/Users/xxx/.lmstudio/extensions/backends/vendor/_amphibian/app-mlx-generate-mac14-arm64@22/lib/python3.11/site-packages/mlx_lm/generate.py", line 705 in <genexpr>
  File "/Users/xxx/.lmstudio/extensions/backends/vendor/_amphibian/app-mlx-generate-mac14-arm64@22/lib/python3.11/site-packages/mlx_lm/generate.py", line 716 in stream_generate
2026-04-18 12:09:05 [DEBUG]
 File "/Users/xxx/.lmstudio/extensions/backends/vendor/_amphibian/app-mlx-generate-mac14-arm64@22/lib/python3.11/site-packages/mlx_engine/generate.py", line 543 in _sequential_generation

Thread 0x00000001f7d6d8c0 (most recent call first):
  <no Python frame>
2026-04-18 12:09:05 [DEBUG]
 
Extension modules: yaml._yaml
2026-04-18 12:09:05 [DEBUG]
 , regex._regex, numpy._core._multiarray_umath
2026-04-18 12:09:05 [DEBUG]
 , numpy.linalg._umath_linalg, markupsafe._speedups
2026-04-18 12:09:05 [DEBUG]
 , PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special
2026-04-18 12:09:05 [DEBUG]
 , numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand
2026-04-18 12:09:05 [DEBUG]
 , numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator
2026-04-18 12:09:05 [DEBUG]
 , sentencepiece._sentencepiece
2026-04-18 12:09:05 [DEBUG]
 , PIL._imagingft
2026-04-18 12:09:05 [DEBUG]
 , charset_normalizer.md, charset_normalizer.cd, requests.packages.charset_normalizer.md, requests.packages.chardet.md, requests.packages.charset_normalizer.cd, requests.packages.chardet.cd
2026-04-18 12:09:05 [DEBUG]
 , xxhash._xxhash
2026-04-18 12:09:06 [DEBUG]
  (total: 35)
2026-04-18 12:09:06 [ERROR]
 [unn/qwen3.6-35b-a3b-4bit] Anthropic streaming error: The model has crashed without additional information. (Exit code: null)
2026-04-18 12:09:06  [INFO]
 [unn/qwen3.6-35b-a3b-4bit] Finished streaming Anthropic response
```
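In the log above, 36384 of the 36871 prompt tokens came from the prompt cache, so only 487 tokens were newly processed in the request that crashed. For triaging longer logs, the cache reuse and the Metal out-of-memory marker can be pulled out with a small script. This is a sketch that assumes the log keeps the exact line formats shown above; `last_cache_reuse` and `hit_metal_oom` are hypothetical helper names.

```python
import re
from typing import Optional, Tuple

# Patterns copied from the log lines shown above.
CACHE_RE = re.compile(r"Prompt cache: using (\d+)/(\d+) tokens from cache")
OOM_MARKER = "kIOGPUCommandBufferCallbackErrorOutOfMemory"


def last_cache_reuse(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (cached, total) from the last 'Prompt cache' line, or None if absent."""
    matches = CACHE_RE.findall(log_text)
    if not matches:
        return None
    cached, total = matches[-1]
    return int(cached), int(total)


def hit_metal_oom(log_text: str) -> bool:
    """True if the log contains the Metal out-of-memory callback error."""
    return OOM_MARKER in log_text


if __name__ == "__main__":
    sample = (
        " [cache_wrapper][INFO]: Prompt cache: using 36384/36871 tokens from cache\n"
        " libc++abi: terminating due to uncaught exception of type std::runtime_error: "
        "[METAL] Command buffer execution failed: Insufficient Memory "
        "(00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)\n"
    )
    reuse = last_cache_reuse(sample)
    if reuse is not None:
        cached, total = reuse
        print(f"uncached tokens: {total - cached}")
    print(f"Metal OOM: {hit_metal_oom(sample)}")
```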

