Skip to content

fix(remote): stop sending max_tokens/num_predict to remote servers#377

Merged
dishit-wednesday merged 1 commit into
mainfrom
fix/remote-drop-max-tokens-cap
Jun 3, 2026
Merged

fix(remote): stop sending max_tokens/num_predict to remote servers#377
dishit-wednesday merged 1 commit into
mainfrom
fix/remote-drop-max-tokens-cap

Conversation

@dishit-wednesday

@dishit-wednesday dishit-wednesday commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

The app was sending max_tokens: 1024 (LM Studio / OpenAI-compatible) and num_predict: 1024 (Ollama) on every remote request, sourced from the local maxTokens setting. Reasoning models like Qwen3 and DeepSeek-R1 routinely spend 2k-5k tokens inside a <think> block before producing visible output, so the 1024 cap stopped the stream mid-think. The closing </think> never arrived, ThinkTagParser routed every token to onReasoning, and the chat bubble stayed empty - users saw a loading spinner forever.

This PR drops the client-side cap on both remote code paths. Output limits belong on the server, where the user already configured the model. LM Studio defaults to the model's full remaining context window; Ollama defaults to -1 (until natural stop). Local generation is unchanged - the maxTokens slider still bounds the on-device llama.cpp inference loop.

Changes

  • src/services/providers/openAICompatibleProvider.ts - drop max_tokens from the /v1/chat/completions body
  • src/services/providers/openAICompatibleStream.ts - drop num_predict from the Ollama /api/chat body

Why the client cap was wrong

  • The user picks the model and configures its context on the server; the client second-guessing that just produces silent truncation.
  • Other LM Studio / Ollama clients (Open WebUI, Continue.dev) don't send a cap by default for the same reason.
  • Reasoning-token budget is unpredictable per-prompt - any fixed client default will be wrong for some class of model.
  • The local default of 1024 leaked into the remote path because one maxTokens setting backs both. Local needs the cap (KV cache / battery); remote does not.

Reported case

User on Play Store review and follow-up email: Qwen3.6-35B-A3B on LM Studio produced only a loading animation. Server log showed a clean 1024-token generation with finish_reason: length, but no content reached the UI because the entire stream was inside an unclosed <think> block.

Test plan

  • npx tsc --noEmit clean
  • npx eslint clean on changed files
  • npm test -- openAICompatibleProvider remoteProviderRouting - 61/61 pass
  • Manual: point app at LM Studio with Qwen3 model, send a prompt, confirm response renders
  • Manual: point app at Ollama with a non-reasoning model (e.g. llama3.1), confirm output completes naturally instead of being cut at 1024 tokens
  • Manual: load a local model, confirm maxTokens slider still bounds on-device output

Follow-ups (not in this PR)

  • The maxTokens and contextLength sliders are still editable when a remote model is active. The settings modal shows a banner that says they don't apply, but the sliders aren't disabled. Worth hiding both for remote sessions in a separate change.
  • documentService.ts truncates attachment / pasted-text content using settings.contextLength regardless of provider type - low slider + remote model silently cuts attached docs. Separate bug.

The app's maxTokens setting (default 1024) was flowing into every remote
chat request as max_tokens (LM Studio / OpenAI-compatible) or num_predict
(Ollama). Reasoning models like Qwen3 and DeepSeek-R1 routinely spend
2k-5k tokens inside a <think> block before producing visible output, so
the 1024 cap forced the stream to stop mid-think. The closing </think>
never arrived, ThinkTagParser routed every token to onReasoning, and the
chat bubble stayed empty - the user saw a loading spinner forever.

Output limits belong on the server, where the user already configured
the model. LM Studio and Ollama both have sensible defaults (full
remaining context and -1 respectively) that don't truncate reasoning.

Local generation is unchanged - the maxTokens slider still bounds the
on-device inference loop.

Co-Authored-By: Dishit hanmadishit74@gmail.com

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@sonarqubecloud

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the max_tokens (and Ollama's num_predict) parameter from being sent to OpenAI-compatible and Ollama providers to prevent client-side truncation of reasoning models (like DeepSeek-R1 or Qwen3) during their <think> phase. However, completely omitting these parameters at the provider level prevents callers from enforcing explicit token limits when desired (e.g., for cost control or utility tasks). It is recommended to handle this at the caller level or only omit the parameter when it matches a default/unset state.

Comment on lines +102 to +104
// max_tokens intentionally omitted — the remote server owns output limits.
// A client-side cap (default 1024) silently truncates reasoning models that
// need a larger budget for <think> blocks (Qwen3, DeepSeek-R1, etc).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Completely omitting max_tokens at the provider level prevents any caller from limiting the generation length, even when a limit is explicitly desired (e.g., for utility tasks like summarization, title generation, or cost control on paid endpoints).

Instead of ignoring options.maxTokens entirely in the provider, the caller (e.g., the chat store or the service orchestrating the generation) should avoid passing the default local maxTokens setting to remote providers, or the provider should only omit it if it matches a default/unset state. This preserves the provider's ability to respect explicit token limits when requested.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review — we considered the suggestion but went with provider-level omission deliberately.

The "explicit cap" use case is already covered by server-side config. LM Studio exposes a per-model default-max-output-tokens setting; Ollama supports num_predict in the Modelfile and server config. Users who want a cap set it once on the server they own. A parallel client-side knob just duplicates that and, as this PR shows, causes silent truncation when the local default disagrees with the server's intended budget.

For length-constrained outputs (titles, summaries), prompt design is the right lever — "give a 3-word title" produces 3 words because of the prompt, not the cap. max_tokens is a guillotine, not a length control; it cuts mid-token if the model overshoots.

Off Grid currently has no remote caller that benefits from a per-call override — context compaction summarization runs against the local model, not the remote provider. Cost control isn't a concern either; Off Grid targets LM Studio and Ollama, both free local servers. There's no paid-by-token API in the codebase today.

If a future provider does need a cap (e.g. an Anthropic/OpenAI billed-by-token provider), that provider can decide to honor options.maxTokens at its own level. Keeping the local-only maxTokens setting out of the OpenAI-compatible and Ollama remote paths is the bug fix we want.

Comment on lines +292 to +293
// num_predict intentionally omitted — Ollama defaults to -1 (until natural stop).
// A client-side cap truncates reasoning models mid-<think>.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Completely omitting num_predict at the provider level prevents any caller from limiting the generation length, even when a limit is explicitly desired (e.g., for utility tasks like summarization, title generation, or cost control on paid endpoints).

Instead of ignoring options.maxTokens entirely in the provider, the caller (e.g., the chat store or the service orchestrating the generation) should avoid passing the default local maxTokens setting to remote providers, or the provider should only omit it if it matches a default/unset state. This preserves the provider's ability to respect explicit token limits when requested.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reasoning as on the OpenAI-compatible side — going with provider-level omission deliberately.

For Ollama specifically, the cap belongs in the Modelfile (PARAMETER num_predict <n>) or server config. Ollama's own default is -1 (unlimited until natural stop), which is the right default for chat. The bug was that our client's local-only default of 1024 was overriding that, truncating reasoning models mid-<think>.

For length-constrained outputs (titles, summaries), prompt design is the right lever, not a hard token cap. Off Grid has no current remote caller that wants a per-call cap — context compaction summarization runs against the local model. If a future provider needs one, it can opt in at its own level.

@dishit-wednesday dishit-wednesday changed the base branch from main to litertsupport May 27, 2026 08:08
@dishit-wednesday dishit-wednesday changed the base branch from litertsupport to main May 27, 2026 12:29
@dishit-wednesday dishit-wednesday merged commit c112688 into main Jun 3, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant