fix(remote): stop sending max_tokens/num_predict to remote servers by dishit-wednesday · Pull Request #377 · off-grid-ai/mobile

dishit-wednesday · 2026-05-27T06:44:44Z

Summary

The app was sending max_tokens: 1024 (LM Studio / OpenAI-compatible) and num_predict: 1024 (Ollama) on every remote request, sourced from the local maxTokens setting. Reasoning models like Qwen3 and DeepSeek-R1 routinely spend 2k-5k tokens inside a <think> block before producing visible output, so the 1024 cap stopped the stream mid-think. The closing </think> never arrived, ThinkTagParser routed every token to onReasoning, and the chat bubble stayed empty - users saw a loading spinner forever.

This PR drops the client-side cap on both remote code paths. Output limits belong on the server, where the user already configured the model. LM Studio defaults to the model's full remaining context window; Ollama defaults to -1 (until natural stop). Local generation is unchanged - the maxTokens slider still bounds the on-device llama.cpp inference loop.

Changes

src/services/providers/openAICompatibleProvider.ts - drop max_tokens from the /v1/chat/completions body
src/services/providers/openAICompatibleStream.ts - drop num_predict from the Ollama /api/chat body

Why the client cap was wrong

The user picks the model and configures its context on the server; the client second-guessing that just produces silent truncation.
Other LM Studio / Ollama clients (Open WebUI, Continue.dev) don't send a cap by default for the same reason.
Reasoning-token budget is unpredictable per-prompt - any fixed client default will be wrong for some class of model.
The local default of 1024 leaked into the remote path because one maxTokens setting backs both. Local needs the cap (KV cache / battery); remote does not.

Reported case

User on Play Store review and follow-up email: Qwen3.6-35B-A3B on LM Studio produced only a loading animation. Server log showed a clean 1024-token generation with finish_reason: length, but no content reached the UI because the entire stream was inside an unclosed <think> block.

Test plan

npx tsc --noEmit clean
npx eslint clean on changed files
npm test -- openAICompatibleProvider remoteProviderRouting - 61/61 pass
Manual: point app at LM Studio with Qwen3 model, send a prompt, confirm response renders
Manual: point app at Ollama with a non-reasoning model (e.g. llama3.1), confirm output completes naturally instead of being cut at 1024 tokens
Manual: load a local model, confirm maxTokens slider still bounds on-device output

Follow-ups (not in this PR)

The maxTokens and contextLength sliders are still editable when a remote model is active. The settings modal shows a banner that says they don't apply, but the sliders aren't disabled. Worth hiding both for remote sessions in a separate change.
documentService.ts truncates attachment / pasted-text content using settings.contextLength regardless of provider type - low slider + remote model silently cuts attached docs. Separate bug.

The app's maxTokens setting (default 1024) was flowing into every remote chat request as max_tokens (LM Studio / OpenAI-compatible) or num_predict (Ollama). Reasoning models like Qwen3 and DeepSeek-R1 routinely spend 2k-5k tokens inside a <think> block before producing visible output, so the 1024 cap forced the stream to stop mid-think. The closing </think> never arrived, ThinkTagParser routed every token to onReasoning, and the chat bubble stayed empty - the user saw a loading spinner forever. Output limits belong on the server, where the user already configured the model. LM Studio and Ollama both have sensible defaults (full remaining context and -1 respectively) that don't truncate reasoning. Local generation is unchanged - the maxTokens slider still bounds the on-device inference loop. Co-Authored-By: Dishit hanmadishit74@gmail.com

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

sonarqubecloud · 2026-05-27T06:45:35Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist

Code Review

This pull request removes the max_tokens (and Ollama's num_predict) parameter from being sent to OpenAI-compatible and Ollama providers to prevent client-side truncation of reasoning models (like DeepSeek-R1 or Qwen3) during their <think> phase. However, completely omitting these parameters at the provider level prevents callers from enforcing explicit token limits when desired (e.g., for cost control or utility tasks). It is recommended to handle this at the caller level or only omit the parameter when it matches a default/unset state.

gemini-code-assist · 2026-05-27T06:45:40Z

+      // max_tokens intentionally omitted — the remote server owns output limits.
+      // A client-side cap (default 1024) silently truncates reasoning models that
+      // need a larger budget for <think> blocks (Qwen3, DeepSeek-R1, etc).


Completely omitting max_tokens at the provider level prevents any caller from limiting the generation length, even when a limit is explicitly desired (e.g., for utility tasks like summarization, title generation, or cost control on paid endpoints).

Instead of ignoring options.maxTokens entirely in the provider, the caller (e.g., the chat store or the service orchestrating the generation) should avoid passing the default local maxTokens setting to remote providers, or the provider should only omit it if it matches a default/unset state. This preserves the provider's ability to respect explicit token limits when requested.

Thanks for the review — we considered the suggestion but went with provider-level omission deliberately.

The "explicit cap" use case is already covered by server-side config. LM Studio exposes a per-model default-max-output-tokens setting; Ollama supports num_predict in the Modelfile and server config. Users who want a cap set it once on the server they own. A parallel client-side knob just duplicates that and, as this PR shows, causes silent truncation when the local default disagrees with the server's intended budget.

For length-constrained outputs (titles, summaries), prompt design is the right lever — "give a 3-word title" produces 3 words because of the prompt, not the cap. max_tokens is a guillotine, not a length control; it cuts mid-token if the model overshoots.

Off Grid currently has no remote caller that benefits from a per-call override — context compaction summarization runs against the local model, not the remote provider. Cost control isn't a concern either; Off Grid targets LM Studio and Ollama, both free local servers. There's no paid-by-token API in the codebase today.

If a future provider does need a cap (e.g. an Anthropic/OpenAI billed-by-token provider), that provider can decide to honor options.maxTokens at its own level. Keeping the local-only maxTokens setting out of the OpenAI-compatible and Ollama remote paths is the bug fix we want.

gemini-code-assist · 2026-05-27T06:45:40Z

+      // num_predict intentionally omitted — Ollama defaults to -1 (until natural stop).
+      // A client-side cap truncates reasoning models mid-<think>.


Completely omitting num_predict at the provider level prevents any caller from limiting the generation length, even when a limit is explicitly desired (e.g., for utility tasks like summarization, title generation, or cost control on paid endpoints).

Instead of ignoring options.maxTokens entirely in the provider, the caller (e.g., the chat store or the service orchestrating the generation) should avoid passing the default local maxTokens setting to remote providers, or the provider should only omit it if it matches a default/unset state. This preserves the provider's ability to respect explicit token limits when requested.

Same reasoning as on the OpenAI-compatible side — going with provider-level omission deliberately.

For Ollama specifically, the cap belongs in the Modelfile (PARAMETER num_predict <n>) or server config. Ollama's own default is -1 (unlimited until natural stop), which is the right default for chat. The bug was that our client's local-only default of 1024 was overriding that, truncating reasoning models mid-<think>.

For length-constrained outputs (titles, summaries), prompt design is the right lever, not a hard token cap. Off Grid has no current remote caller that wants a per-call cap — context compaction summarization runs against the local model. If a future provider needs one, it can opt in at its own level.

greptile-apps Bot reviewed May 27, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

dishit-wednesday changed the base branch from main to litertsupport May 27, 2026 08:08

dishit-wednesday changed the base branch from litertsupport to main May 27, 2026 12:29

dishit-wednesday merged commit c112688 into main Jun 3, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(remote): stop sending max_tokens/num_predict to remote servers#377

fix(remote): stop sending max_tokens/num_predict to remote servers#377
dishit-wednesday merged 1 commit into
mainfrom
fix/remote-drop-max-tokens-cap

dishit-wednesday commented May 27, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot left a comment

Uh oh!

sonarqubecloud Bot commented May 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Uh oh!

dishit-wednesday May 27, 2026

Uh oh!

gemini-code-assist Bot May 27, 2026

Uh oh!

dishit-wednesday May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		// num_predict intentionally omitted — Ollama defaults to -1 (until natural stop).
		// A client-side cap truncates reasoning models mid-<think>.

Uh oh!

Conversation

dishit-wednesday commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Why the client cap was wrong

Reported case

Test plan

Follow-ups (not in this PR)

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented May 27, 2026

Quality Gate passed

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dishit-wednesday May 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dishit-wednesday May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dishit-wednesday commented May 27, 2026 •

edited

Loading