fix(remote): stop sending max_tokens/num_predict to remote servers #377
The app's maxTokens setting (default 1024) was flowing into every remote chat request as max_tokens (LM Studio / OpenAI-compatible) or num_predict (Ollama). Reasoning models like Qwen3 and DeepSeek-R1 routinely spend 2k-5k tokens inside a <think> block before producing visible output, so the 1024 cap forced the stream to stop mid-think. The closing </think> never arrived, ThinkTagParser routed every token to onReasoning, and the chat bubble stayed empty - the user saw a loading spinner forever.

Output limits belong on the server, where the user already configured the model. LM Studio and Ollama both have sensible defaults (full remaining context and -1 respectively) that don't truncate reasoning.

Local generation is unchanged - the maxTokens slider still bounds the on-device inference loop.

Co-Authored-By: Dishit hanmadishit74@gmail.com
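A minimal sketch of the failure mode described above, assuming a parser that sends text between <think> and </think> to a reasoning sink and everything else to the visible reply. routeThinkTags and partialTagSuffixLength are hypothetical stand-ins written for illustration; they are not the app's actual ThinkTagParser.

```typescript
// Hypothetical stand-in for ThinkTagParser; names and structure are assumptions.
interface RoutedOutput {
  reasoning: string; // text that would go to onReasoning
  content: string;   // text that would fill the chat bubble
}

// Length of the longest suffix of `text` that is a proper prefix of `tag`,
// so a tag split across two stream chunks is not flushed too early.
function partialTagSuffixLength(text: string, tag: string): number {
  const max = Math.min(text.length, tag.length - 1);
  for (let len = max; len > 0; len--) {
    if (tag.startsWith(text.slice(text.length - len))) return len;
  }
  return 0;
}

function routeThinkTags(chunks: string[]): RoutedOutput {
  const out: RoutedOutput = { reasoning: "", content: "" };
  let insideThink = false;
  let buffer = "";

  const emit = (text: string): void => {
    if (insideThink) out.reasoning += text;
    else out.content += text;
  };
  const nextTag = (): string => (insideThink ? "</think>" : "<think>");

  for (const chunk of chunks) {
    buffer += chunk;
    // Consume every complete tag currently in the buffer.
    let idx = buffer.indexOf(nextTag());
    while (idx !== -1) {
      const tagLength = nextTag().length;
      emit(buffer.slice(0, idx));
      buffer = buffer.slice(idx + tagLength);
      insideThink = !insideThink;
      idx = buffer.indexOf(nextTag());
    }
    // Flush everything except a possible partial tag at the end.
    const keep = partialTagSuffixLength(buffer, nextTag());
    emit(buffer.slice(0, buffer.length - keep));
    buffer = buffer.slice(buffer.length - keep);
  }
  emit(buffer); // end of stream: whatever is left goes to the current sink
  return out;
}
```

With a stream that is cut off before </think>, every token lands in reasoning and content stays empty, which matches the empty chat bubble described in the commit message.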
Code Review
This pull request stops sending max_tokens (and Ollama's num_predict) to OpenAI-compatible and Ollama providers so that a client-side cap no longer truncates reasoning models (like DeepSeek-R1 or Qwen3) during their <think> phase. However, completely omitting these parameters at the provider level prevents callers from enforcing explicit token limits when desired (e.g., for cost control or utility tasks). It is recommended to handle this at the caller level or only omit the parameter when it matches a default/unset state.
// max_tokens intentionally omitted - the remote server owns output limits.
// A client-side cap (default 1024) silently truncates reasoning models that
// need a larger budget for <think> blocks (Qwen3, DeepSeek-R1, etc).
Completely omitting max_tokens at the provider level prevents any caller from limiting the generation length, even when a limit is explicitly desired (e.g., for utility tasks like summarization, title generation, or cost control on paid endpoints).
Instead of ignoring options.maxTokens entirely in the provider, the caller (e.g., the chat store or the service orchestrating the generation) should avoid passing the default local maxTokens setting to remote providers, or the provider should only omit it if it matches a default/unset state. This preserves the provider's ability to respect explicit token limits when requested.
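The caller-level alternative suggested in this comment might look roughly like the sketch below. Every name here (ProviderKind, AppSettings, resolveGenerationOptions) is hypothetical and chosen for illustration; nothing in it is taken from the Off Grid codebase.

```typescript
// Hypothetical caller-side helper; all names are assumptions.
type ProviderKind = "local" | "remote";

interface AppSettings {
  maxTokens: number; // the slider value, default 1024 per the PR description
}

interface GenerationOptions {
  maxTokens?: number; // undefined means "let the backend decide"
}

function resolveGenerationOptions(
  settings: AppSettings,
  provider: ProviderKind,
  explicitCap?: number,
): GenerationOptions {
  // A cap requested by the call site always wins, local or remote.
  if (explicitCap !== undefined) return { maxTokens: explicitCap };
  // On-device inference keeps the slider value as its bound.
  if (provider === "local") return { maxTokens: settings.maxTokens };
  // Remote requests carry no cap unless one was asked for explicitly.
  return {};
}
```

Under this design the provider would still forward options.maxTokens whenever it is set, and the slider default simply never reaches a remote request.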
Thanks for the review — we considered the suggestion but went with provider-level omission deliberately.
The "explicit cap" use case is already covered by server-side config. LM Studio exposes a per-model default-max-output-tokens setting; Ollama supports num_predict in the Modelfile and server config. Users who want a cap set it once on the server they own. A parallel client-side knob just duplicates that and, as this PR shows, causes silent truncation when the local default disagrees with the server's intended budget.
For length-constrained outputs (titles, summaries), prompt design is the right lever: "give a 3-word title" produces 3 words because of the prompt, not the cap. max_tokens is a guillotine, not a length control; it cuts the output off mid-sentence if the model overshoots.
Off Grid currently has no remote caller that benefits from a per-call override — context compaction summarization runs against the local model, not the remote provider. Cost control isn't a concern either; Off Grid targets LM Studio and Ollama, both free local servers. There's no paid-by-token API in the codebase today.
If a future provider does need a cap (e.g. an Anthropic/OpenAI billed-by-token provider), that provider can decide to honor options.maxTokens at its own level. Keeping the local-only maxTokens setting out of the OpenAI-compatible and Ollama remote paths is the bug fix we want.
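As a rough sketch of the provider-level split described in this reply: the request-body builder for free local servers ignores the cap, while a hypothetical billed-by-token provider opts in on its own. BodyBuilder, buildLocalServerBody and buildBilledApiBody are assumed names, and the body fields shown are only the common OpenAI-style ones (model, messages, stream, max_tokens).

```typescript
// Illustrative only; none of these names come from the PR diff.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerationOptions {
  maxTokens?: number;
}

type BodyBuilder = (
  model: string,
  messages: ChatMessage[],
  options: GenerationOptions,
) => Record<string, unknown>;

// LM Studio / OpenAI-compatible path: options.maxTokens is never read,
// so the server's own output limit applies.
const buildLocalServerBody: BodyBuilder = (model, messages) => ({
  model,
  messages,
  stream: true,
});

// Hypothetical billed-by-token provider: honors the cap when one is given.
const buildBilledApiBody: BodyBuilder = (model, messages, options) => {
  const body: Record<string, unknown> = { model, messages, stream: true };
  if (options.maxTokens !== undefined) body.max_tokens = options.maxTokens;
  return body;
};
```

Keeping the decision inside each builder means the shared options type does not change when a provider that needs a cap is added later.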
// num_predict intentionally omitted - Ollama defaults to -1 (until natural stop).
// A client-side cap truncates reasoning models mid-<think>.
Completely omitting num_predict at the provider level prevents any caller from limiting the generation length, even when a limit is explicitly desired (e.g., for utility tasks like summarization, title generation, or cost control on paid endpoints).
Instead of ignoring options.maxTokens entirely in the provider, the caller (e.g., the chat store or the service orchestrating the generation) should avoid passing the default local maxTokens setting to remote providers, or the provider should only omit it if it matches a default/unset state. This preserves the provider's ability to respect explicit token limits when requested.
Same reasoning as on the OpenAI-compatible side — going with provider-level omission deliberately.
For Ollama specifically, the cap belongs in the Modelfile (PARAMETER num_predict <n>) or server config. Ollama's own default is -1 (unlimited until natural stop), which is the right default for chat. The bug was that our client's local-only default of 1024 was overriding that, truncating reasoning models mid-<think>.
For length-constrained outputs (titles, summaries), prompt design is the right lever, not a hard token cap. Off Grid has no current remote caller that wants a per-call cap — context compaction summarization runs against the local model. If a future provider needs one, it can opt in at its own level.
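For the Ollama path, a sketch of an /api/chat request body that leaves num_predict out of the options object, so the server-side or Modelfile value applies. buildOllamaChatBody and the SamplingSettings shape are assumptions made for illustration, not the function changed in this PR.

```typescript
// Hypothetical builder; only the body layout (model, messages, stream, options)
// follows Ollama's /api/chat request format.
interface OllamaMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface SamplingSettings {
  temperature?: number;
  maxTokens?: number; // local-only slider value; deliberately unused below
}

interface OllamaChatBody {
  model: string;
  messages: OllamaMessage[];
  stream: boolean;
  options: { temperature?: number };
}

function buildOllamaChatBody(
  model: string,
  messages: OllamaMessage[],
  sampling: SamplingSettings,
): OllamaChatBody {
  const options: OllamaChatBody["options"] = {};
  if (sampling.temperature !== undefined) {
    options.temperature = sampling.temperature;
  }
  // sampling.maxTokens is not mapped to options.num_predict, so Ollama
  // falls back to whatever the server or Modelfile specifies.
  return { model, messages, stream: true, options };
}
```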



Summary
The app was sending max_tokens: 1024 (LM Studio / OpenAI-compatible) and num_predict: 1024 (Ollama) on every remote request, sourced from the local maxTokens setting. Reasoning models like Qwen3 and DeepSeek-R1 routinely spend 2k-5k tokens inside a <think> block before producing visible output, so the 1024 cap stopped the stream mid-think. The closing </think> never arrived, ThinkTagParser routed every token to onReasoning, and the chat bubble stayed empty - users saw a loading spinner forever.

This PR drops the client-side cap on both remote code paths. Output limits belong on the server, where the user already configured the model. LM Studio defaults to the model's full remaining context window; Ollama defaults to -1 (until natural stop). Local generation is unchanged - the maxTokens slider still bounds the on-device llama.cpp inference loop.

Changes
src/services/providers/openAICompatibleProvider.ts - drop max_tokens from the /v1/chat/completions body
src/services/providers/openAICompatibleStream.ts - drop num_predict from the Ollama /api/chat body

Why the client cap was wrong
maxTokens setting backs both. Local needs the cap (KV cache / battery); remote does not.

Reported case
User on Play Store review and follow-up email: Qwen3.6-35B-A3B on LM Studio produced only a loading animation. Server log showed a clean 1024-token generation with finish_reason: length, but no content reached the UI because the entire stream was inside an unclosed <think> block.

Test plan
npx tsc --noEmit clean
npx eslint clean on changed files
npm test -- openAICompatibleProvider remoteProviderRouting - 61/61 pass
maxTokens slider still bounds on-device output

Follow-ups (not in this PR)
maxTokens and contextLength sliders are still editable when a remote model is active. The settings modal shows a banner that says they don't apply, but the sliders aren't disabled. Worth hiding both for remote sessions in a separate change.
documentService.ts truncates attachment / pasted-text content using settings.contextLength regardless of provider type - low slider + remote model silently cuts attached docs. Separate bug.