
KV cache starts back from 0 every other turn or so, in all models and all harnesses #327

@coolaj86

Description


I have a fresh install of LM Studio with the latest versions of the various harnesses. Almost every time the tool calls finish and control comes back to me for the next prompt, it starts over at 0/n tokens from cache (0%, 10%, ...). This happens even when my next prompt is already in the queue.
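The "0/n tokens from cache" readout means the previously cached prompt is not being reused. The Python sketch below is one way to reason about when that can happen. It assumes the server reuses the KV cache only for the longest token prefix shared with the previous request. This is a common prompt-caching design, but whether LM Studio's MLX engine works exactly this way is an assumption. Under that assumption, any change near the start of the prompt (for example, a value the harness rewrites in the system prompt on every turn) forces reprocessing from token 0. The prompts and the whitespace "tokenizer" are hypothetical stand-ins.

```python
def reusable_prefix_len(cached: list[str], new: list[str]) -> int:
    """Count the leading tokens shared by the cached prompt and the new prompt.

    Assumption: under prefix caching, only these tokens can be served from
    the KV cache; every token after the first mismatch is processed again.
    """
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def cache_hit_ratio(cached: list[str], new: list[str]) -> float:
    """Fraction of the new prompt that could be reused from the cache."""
    if not new:
        return 0.0
    return reusable_prefix_len(cached, new) / len(new)


# Hypothetical prompts, split on whitespace as a stand-in for real tokenization.
turn1 = "system: time=10:00 tools=[read,write] user: fix the bug".split()

# Next turn, but the harness re-sent the system prompt with a changed value.
turn2_changed = (
    "system: time=10:05 tools=[read,write] user: fix the bug tool: ok user: next"
).split()

# Next turn with all earlier text left byte-for-byte identical.
turn2_stable = (
    "system: time=10:00 tools=[read,write] user: fix the bug tool: ok user: next"
).split()

print(reusable_prefix_len(turn1, turn2_changed))  # → 1: nearly everything is reprocessed
print(reusable_prefix_len(turn1, turn2_stable))   # → 7: the whole previous turn is reused
```

If consecutive requests from a harness differ early in the prompt, a prefix cache would show exactly this kind of drop to 0%. Comparing two captured requests this way is one possible diagnostic, not a confirmed cause.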

Harnesses I've tried:

  • Claude CLI
  • OpenCode
  • Pi

Models I've tried (all MLX):

  • Qwen 3.5 9b
  • Qwen 3 coder 30b
  • Qwen 3.6 27b
  • Qwen 3.6 35b a3b
  • Gemma 4
  • many others

I've been using the Anthropic compatibility API.

I've tried this on a MacBook Pro M1 Max (32 GB) and a Mac Studio M2 Max (64 GB).

I don't have any special templates - just the defaults.

By default, I set the context length to the maximum for every model, with the context limit behavior set to stop.

Is this a bug? Normal behavior? Am I holding it wrong?

The slowdown is so severe that the setup just isn't usable for coding agents. If I could figure this out, local AI might be feasible on the 64 GB Studio.
