I've got a fresh install of LM Studio with the latest versions of the various harnesses and just about any time the tool calls are done and it comes back to me for the next prompt, even if I already have the next prompt in the queue, it starts back over at 0/n tokens from cache 0% 10% ...
Harnesses I've tried:
Models I've tried (all MLX):
- Qwen 3.5 9b
- Qwen 3 coder 30b
- Qwen 3.6 27b
- Qwen 3.6 35b a3b
- Gemma 4
- many others
I've been using the Anthropic compatibility API.
I've tried on an MacBook Pro M1 Max 32gb and an Mac Studio M2 Max 64gb.
I don't have any special templates - just the defaults.
I have context set to the maximum for all models by default and the context limit set to stop.
Is this a bug? Normal behavior? Am I holding it wrong?
The slowdown is so harsh that the system just isn't usable for coding agents. If I could figure this out, local Ai might be feasible on the 64gb Studio.
I've got a fresh install of LM Studio with the latest versions of the various harnesses and just about any time the tool calls are done and it comes back to me for the next prompt, even if I already have the next prompt in the queue, it starts back over at 0/n tokens from cache 0% 10% ...
Harnesses I've tried:
Models I've tried (all MLX):
I've been using the Anthropic compatibility API.
I've tried on an MacBook Pro M1 Max 32gb and an Mac Studio M2 Max 64gb.
I don't have any special templates - just the defaults.
I have context set to the maximum for all models by default and the context limit set to stop.
Is this a bug? Normal behavior? Am I holding it wrong?
The slowdown is so harsh that the system just isn't usable for coding agents. If I could figure this out, local Ai might be feasible on the 64gb Studio.