feat(litert): add LiteRT-LM as second on-device inference engine #360

Merged: dishit-wednesday merged 95 commits into main from litertsupport on Jun 3, 2026
@dishit-wednesday dishit-wednesday commented May 16, 2026

Collaborator

# Add LiteRT as a second on-device inference engine

Adds [LiteRT](https://ai.google.dev/edge/litert) (Google's on-device inference runtime) as a peer to the existing llama.cpp engine. Android-only at ship. The JS layer is built platform-agnostic so that, once Google releases the iOS Swift LiteRT SDK, it can drop in with no code changes on this side.

Engine is decided by file extension — .gguf runs on llama.cpp as before, .litertlm runs on LiteRT. No engine toggle in the UI; it follows the model.
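
A minimal sketch of the extension rule, for illustration only. The helper name `engineForFile` and the case-insensitive match are assumptions; the PR states the rule but does not show the function that applies it.

```typescript
// Hypothetical helper: maps a model file name to the engine that should run it.
type Engine = 'llama' | 'litert';

function engineForFile(fileName: string): Engine | null {
  const lower = fileName.toLowerCase(); // assumption: extensions match case-insensitively
  if (lower.endsWith('.gguf')) return 'llama';
  if (lower.endsWith('.litertlm')) return 'litert';
  return null; // unknown extension: the caller decides how to reject the file
}
```

Returning `null` rather than defaulting to llama keeps an unsupported file from silently landing on the wrong engine.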


Why a second engine

LiteRT runs Gemma 4, Gemma 3n, and other LiteRT-LM packaged models with GPU acceleration paths that llama.cpp does not have on Android. CPU remains as the universal fallback.

Both engines stay in the app. Users keep their existing .gguf models and existing chats. Adding a .litertlm model is purely additive.


Architecture

Follows the same shape as the existing llama.cpp / ONNX split — a peer service, not a layer on top of llmService.

src/services/
  llm.ts            ──▶  llama.cpp via llama.rn (existing)
  litert.ts         ──▶  LiteRT via LiteRTModule.kt (new)
  engines.ts        ──▶  getActiveEngineService() router
  liteRTCompaction  ──▶  context summarization for LiteRT

activeModelService/loaders.ts
  doLoadTextModel  ──┬─▶ doLoadLiteRTModel  (model.engine === 'litert')
                     └─▶ llama path         (model.engine === 'llama')

generationServiceHelpers.ts
  generateResponseImpl
     ├─▶ runLiteRTResponseImpl
     └─▶ llama path

android/.../litert/LiteRTModule.kt
  ReactContextBaseJavaModule wrapping
  com.google.ai.edge.litertlm:litertlm-android:0.11.0

Type model

DownloadedModel is a discriminated union over engine:

type DownloadedModel = LlamaDownloadedModel | LiteRTDownloadedModel;

interface LlamaDownloadedModel  { engine: 'llama';  mmProjPath?: string; ... }
interface LiteRTDownloadedModel { engine: 'litert'; liteRTVision: boolean; ... }

Engine-specific fields live on the engine-specific type. Consumers narrow with model.engine === 'llama' and let the compiler enforce correctness. isLlamaModel / isLiteRTModel type guards are exported for tests and selectors.
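
As a rough illustration of the narrowing described above, the sketch below fills in the elided fields with a hypothetical `BaseModel` (`id`, `name`, `filePath` are assumptions). `describeModel` is also hypothetical and exists only to show the compiler narrowing on `engine`.

```typescript
// Hypothetical shared fields; the PR elides the real ones with "...".
interface BaseModel { id: string; name: string; filePath: string }

interface LlamaDownloadedModel extends BaseModel { engine: 'llama'; mmProjPath?: string }
interface LiteRTDownloadedModel extends BaseModel { engine: 'litert'; liteRTVision: boolean }
type DownloadedModel = LlamaDownloadedModel | LiteRTDownloadedModel;

// Type guards named as in the PR; the bodies are an assumption.
function isLlamaModel(m: DownloadedModel): m is LlamaDownloadedModel {
  return m.engine === 'llama';
}
function isLiteRTModel(m: DownloadedModel): m is LiteRTDownloadedModel {
  return m.engine === 'litert';
}

// Hypothetical consumer: each branch only sees the fields of its own engine.
function describeModel(m: DownloadedModel): string {
  if (m.engine === 'llama') {
    return m.mmProjPath ? `${m.name} (llama, vision projector)` : `${m.name} (llama)`;
  }
  return m.liteRTVision ? `${m.name} (LiteRT, vision)` : `${m.name} (LiteRT, text only)`;
}
```

Accessing `m.mmProjPath` outside the `'llama'` branch is a compile error, which is the enforcement the paragraph above relies on.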

One downloadedModels collection, one activeModelId. The image-model pattern of separate collections is not used here because only one text model is loaded at a time and most call sites are engine-agnostic.

Settings model

LiteRT-specific settings are flat fields prefixed with liteRT:

| Setting | Type | Default |
| --- | --- | --- |
| liteRTBackend | 'gpu' \| 'cpu' | 'gpu' |
| liteRTTemperature | number | 0.7 |
| liteRTTopP | number | 0.9 |
| liteRTMaxTokens | number | 4096 |

liteRTMaxTokens maps to LiteRT's EngineConfig.maxNumTokens — one budget covering system prompt, history, input, and output combined. Existing llama settings are untouched. Switching engines does not cross-contaminate preferences.
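
To make the single-budget point concrete, here is a small sketch. The field names and defaults come from the table above; `remainingOutputTokens` and its token-count parameters are hypothetical and only illustrate the arithmetic.

```typescript
interface LiteRTSettings {
  liteRTBackend: 'gpu' | 'cpu';
  liteRTTemperature: number;
  liteRTTopP: number;
  liteRTMaxTokens: number;
}

// Defaults as listed in the settings table.
const DEFAULT_LITERT_SETTINGS: LiteRTSettings = {
  liteRTBackend: 'gpu',
  liteRTTemperature: 0.7,
  liteRTTopP: 0.9,
  liteRTMaxTokens: 4096,
};

// Hypothetical helper: system prompt, history, and input all draw from the same
// budget, so whatever they use is no longer available for the response.
function remainingOutputTokens(
  settings: LiteRTSettings,
  systemTokens: number,
  historyTokens: number,
  inputTokens: number,
): number {
  const used = systemTokens + historyTokens + inputTokens;
  return Math.max(0, settings.liteRTMaxTokens - used);
}
```

With the defaults, a 200-token system prompt, 1500 tokens of history, and a 100-token input leave 4096 - 1800 = 2296 tokens for output.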


Core features

Text generation

liteRTService.sendMessage() streams tokens through five event types from the native module:

| Native event | JS callback |
| --- | --- |
| litert_token | onToken (response stream) |
| litert_thinking | onReasoning (extended-reasoning stream) |
| litert_complete | onComplete with BenchmarkInfo stats |
| litert_error | onError |
| litert_tool_call | tool invocation request |
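
The event-to-callback mapping in the table above could be wired roughly as follows. This is a sketch under assumptions: the payload field names (`text`, `stats`, `message`, `name`, `args`) and the `onToolCall` callback name are hypothetical, and the real service receives these events from the native module rather than from plain objects.

```typescript
// Assumed payload shapes; only the event names come from the PR.
type LiteRTEvent =
  | { type: 'litert_token'; text: string }
  | { type: 'litert_thinking'; text: string }
  | { type: 'litert_complete'; stats: Record<string, number> }
  | { type: 'litert_error'; message: string }
  | { type: 'litert_tool_call'; name: string; args: Record<string, unknown> };

interface LiteRTCallbacks {
  onToken(text: string): void;
  onReasoning(text: string): void;
  onComplete(stats: Record<string, number>): void;
  onError(message: string): void;
  onToolCall(name: string, args: Record<string, unknown>): void; // hypothetical name
}

// Routes one native event to the matching JS callback.
function dispatchLiteRTEvent(event: LiteRTEvent, cb: LiteRTCallbacks): void {
  switch (event.type) {
    case 'litert_token': cb.onToken(event.text); break;
    case 'litert_thinking': cb.onReasoning(event.text); break;
    case 'litert_complete': cb.onComplete(event.stats); break;
    case 'litert_error': cb.onError(event.message); break;
    case 'litert_tool_call': cb.onToolCall(event.name, event.args); break;
  }
}
```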

The native Conversation object holds turn history internally. JS only sends the current user turn after the conversation is set up, avoiding re-prefilling the entire chat on every message.

prepareConversation only triggers a native resetConversation when activeConversationId, activeSystemPrompt, or activeToolsJson changes. Otherwise it reuses the live native session.
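
The reset condition can be sketched as a comparison of three fields. `ConversationKey` and `needsNativeReset` are hypothetical names; only the three tracked values come from the description.

```typescript
// Hypothetical snapshot of what the live native session was created with.
interface ConversationKey {
  conversationId: string | null;
  systemPrompt: string;
  toolsJson: string;
}

// True when the native session must be rebuilt; false when it can be reused.
function needsNativeReset(active: ConversationKey | null, next: ConversationKey): boolean {
  if (active === null) return true; // nothing live yet, so a session must be created
  return (
    active.conversationId !== next.conversationId ||
    active.systemPrompt !== next.systemPrompt ||
    active.toolsJson !== next.toolsJson
  );
}
```

Comparing the serialized tools string avoids a deep comparison of tool definitions, at the cost of resetting when only key order changes.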

After load on GPU, the engine runs a one-token warmup against an empty prompt to prime the shader/kernel cache and avoid first-compile latency on the first real prompt.

Thinking

Gemma 4 models need a <|think|> token prepended to the system prompt to activate extended reasoning. applyGemma4ThinkToken handles this for both engines:

applyGemma4ThinkToken(prompt, isRemote, { isLiteRT, thinkingEnabled })

Thinking tokens stream into a separate channel (litert_thinking native / onReasoning JS). The chat UI renders them in a collapsible "Thought process" section above the response. Disabling the thinking toggle removes the prefix and the model produces direct answers only.
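
A minimal sketch of the prefixing step, assuming the token is prepended with no separator and that remote models are left untouched. The function is deliberately named `applyThinkTokenSketch` because the real `applyGemma4ThinkToken` body is not shown in this PR; the `isLiteRT` option is carried in the signature but not used here.

```typescript
const GEMMA4_THINK_TOKEN = '<|think|>';

interface ThinkOptions { isLiteRT: boolean; thinkingEnabled: boolean }

// Hypothetical stand-in for applyGemma4ThinkToken(prompt, isRemote, opts).
function applyThinkTokenSketch(prompt: string, isRemote: boolean, opts: ThinkOptions): string {
  // Assumption: remote models handle reasoning on their own side.
  if (isRemote || !opts.thinkingEnabled) return prompt;
  // Avoid stacking the token if the prompt was already prefixed.
  if (prompt.startsWith(GEMMA4_THINK_TOKEN)) return prompt;
  return GEMMA4_THINK_TOKEN + prompt;
}
```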

Context compaction

LiteRT models load with a fixed maxNumTokens budget. The service tracks cumulative tokens and auto-compacts at 65% of the budget:

| Situation | Strategy |
| --- | --- |
| Active session loaded | Ask the model to summarize itself in 3-5 sentences, then reset with [summary, recent turns] |
| First load, no session | Slice the oldest history (cannot summarize what is not loaded) |

Summarization runs inside the existing KV cache while ~35% headroom remains, then a single reset rebuilds the conversation with the compacted history. Recent-turn selection keeps the last 40% of context by char estimate, with a minimum of two turns. A 20-second timeout falls back to slice-only if summarization stalls.

Implementation: src/services/liteRTCompaction.ts — pure function, takes history + maxTokens + cumulativeTokens, returns the reset call to make.
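
The planning step might look something like the sketch below. It is not the contents of liteRTCompaction.ts: the type names, the `sessionLoaded` flag, and the 4-characters-per-token estimate are assumptions. The 65% trigger, the 40% recent-turn share, and the two-turn minimum come from the description above.

```typescript
interface Turn { role: 'user' | 'assistant'; content: string }

type CompactionPlan =
  | { action: 'none' }
  | { action: 'summarize'; keep: Turn[] } // summarize first, then reset with [summary, ...keep]
  | { action: 'slice'; keep: Turn[] };    // no live session: drop the oldest turns

const COMPACT_THRESHOLD = 0.65;
const RECENT_SHARE = 0.4;
const CHARS_PER_TOKEN = 4; // assumption: rough character-based token estimate

// Walks history from newest to oldest, keeping turns until the character
// budget is spent, but always keeping at least two turns.
function selectRecentTurns(history: Turn[], maxTokens: number): Turn[] {
  const charBudget = maxTokens * RECENT_SHARE * CHARS_PER_TOKEN;
  const kept: Turn[] = [];
  let used = 0;
  for (const turn of [...history].reverse()) {
    if (kept.length >= 2 && used + turn.content.length > charBudget) break;
    kept.unshift(turn);
    used += turn.content.length;
  }
  return kept;
}

function planCompaction(
  history: Turn[],
  maxTokens: number,
  cumulativeTokens: number,
  sessionLoaded: boolean,
): CompactionPlan {
  if (cumulativeTokens < maxTokens * COMPACT_THRESHOLD) return { action: 'none' };
  const keep = selectRecentTurns(history, maxTokens);
  return sessionLoaded ? { action: 'summarize', keep } : { action: 'slice', keep };
}
```

Keeping the planner pure means the 20-second summarization timeout can live in the caller, which falls back to the `slice` behavior with the same `keep` list.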

Tools

Tool calling is handled natively by the LiteRT SDK. Tools are passed in as JSON at conversation reset time. When the model emits a tool call, the SDK fires litert_tool_call to JS with the tool name and arguments. JS executes the tool and calls respondToToolCall(id, result) back to the native side. The SDK feeds the result into the conversation and continues generation internally — JS never sees the second prefill.
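
The JS side of that round trip can be sketched as a handler factory. The names `buildToolCallHandler`, `ToolCall`, and `ToolFn` are hypothetical, tools are synchronous here for brevity, and the nudge wording is invented. The three-call cap mirrors the per-turn counter that a later commit in this PR describes for buildLiteRTToolCallHandler.

```typescript
type ToolFn = (args: Record<string, unknown>) => string;
interface ToolCall { id: string; name: string; args: Record<string, unknown> }

// Builds a per-turn handler. `respond` stands in for respondToToolCall(id, result).
function buildToolCallHandler(
  tools: Record<string, ToolFn>,
  respond: (id: string, result: string) => void,
  maxCallsPerTurn = 3,
): (call: ToolCall) => void {
  let calls = 0; // the closure is rebuilt per generation, so the counter resets each turn
  return (call) => {
    calls += 1;
    if (calls > maxCallsPerTurn) {
      // Over the cap: skip execution and tell the model to answer.
      respond(call.id, 'Tool limit reached. Answer now with the information you have.');
      return;
    }
    const tool = tools[call.name];
    if (!tool) {
      respond(call.id, `Unknown tool: ${call.name}`);
      return;
    }
    try {
      respond(call.id, tool(call.args));
    } catch (e) {
      respond(call.id, `Tool failed: ${e instanceof Error ? e.message : String(e)}`);
    }
  };
}
```

Always calling `respond`, even on failure, matters because the native side is waiting on the result before it continues generation.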

Three text-based parsing fallbacks exist for models that do not use the SDK's tool format (Gemma 4 emits a non-standard <|tool_call> syntax). These parse the response text after generation and execute tools through the same path.

Vision

LiteRT supports multimodal models (Gemma 3n E2B / E4B variants) with a GPU-accelerated vision encoder. The liteRTVision boolean on the model record gates this:

  • At import time, a dialog asks "Text Only / Vision" for any .litertlm file
  • If liteRTVision === true, the engine loads with visionBackend = Backend.GPU()
  • The image-attach button is shown only when the active model supports vision
  • Attaching an image to a non-vision LiteRT model is blocked with a clear error rather than silently dropping the image

The dialog is needed because LiteRT does not expose introspection for vision support on a loaded model.
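
The gating rules above reduce to two small checks, sketched here. `supportsVision` and `checkImageAttach` are hypothetical names, the model type is trimmed to the fields that matter, and the error text is invented.

```typescript
type VisionModel =
  | { engine: 'llama'; mmProjPath?: string }
  | { engine: 'litert'; liteRTVision: boolean };

// For LiteRT the stored flag decides; for llama the caller passes in whatever
// the llama service reports for multimodal support.
function supportsVision(model: VisionModel, llamaReportsVision: boolean): boolean {
  return model.engine === 'litert' ? model.liteRTVision : llamaReportsVision;
}

type AttachCheck = { ok: true } | { ok: false; error: string };

// Blocks an image on a non-vision model instead of silently dropping it.
function checkImageAttach(model: VisionModel, hasImage: boolean, llamaReportsVision: boolean): AttachCheck {
  if (!hasImage || supportsVision(model, llamaReportsVision)) return { ok: true };
  return { ok: false, error: 'The active model does not support images.' };
}
```

Deriving the flag from the model record, rather than from the engine alone, is also what keeps the image-attach button hidden for text-only LiteRT models.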

Backend selection and fallback

| Backend | Notes |
| --- | --- |
| GPU (OpenCL via Adreno) | Works on most modern Android devices |
| CPU | Universal fallback |

The native module does a two-tier fallback at load time with per-tier timeouts (GPU 20s, CPU 15s). The actually-loaded backend is reported back via getActiveBackend() so the UI can show "Requested GPU, running on CPU" if fallback occurs.
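
The real fallback lives in Kotlin inside LiteRTModule.kt; the TypeScript sketch below only illustrates the control flow under assumptions. `loadWithFallback`, `withTimeout`, and `backendLabel` are hypothetical names, `init` stands in for the native engine initialization, and a timed-out attempt is not cancelled or cleaned up here.

```typescript
type Backend = 'gpu' | 'cpu';

// Per-tier timeouts from the description above (GPU 20s, CPU 15s).
const TIMEOUT_MS: Record<Backend, number> = { gpu: 20000, cpu: 15000 };

// Rejects if `work` does not settle within `ms`; always clears the timer.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Tries the requested backend first, then CPU. Returns the backend that loaded.
async function loadWithFallback(
  requested: Backend,
  init: (backend: Backend) => Promise<void>,
  timeouts: Record<Backend, number> = TIMEOUT_MS,
): Promise<Backend> {
  const chain: Backend[] = requested === 'gpu' ? ['gpu', 'cpu'] : ['cpu'];
  let lastError: unknown = new Error('no backend attempted');
  for (const backend of chain) {
    try {
      await withTimeout(init(backend), timeouts[backend]);
      return backend;
    } catch (e) {
      lastError = e; // a timeout is treated like any other failure: try the next tier
    }
  }
  throw lastError;
}

// Text for the UI when the loaded backend differs from the requested one.
function backendLabel(requested: Backend, actual: Backend): string {
  return requested === actual
    ? `Running on ${actual.toUpperCase()}`
    : `Requested ${requested.toUpperCase()}, running on ${actual.toUpperCase()}`;
}
```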


Model management

Import flow

Models screen → Import → pick a .litertlm file
             ↓
Vision support dialog (only for .litertlm)
             ↓
File copy to app documents
             ↓
DownloadedModel record created with engine: 'litert'

No LiteRT model catalog or HuggingFace download flow in this PR — users sideload .litertlm files. Curated downloads will come in a follow-up.

Recommended models for sideload:

  • google/gemma-3n-E2B-it-litert-lm — vision-capable, smaller
  • litert-community/gemma-4-E2B-it-litert-lm — text-only, larger

Settings UI

ModelSettingsScreen and GenerationSettingsModal render one of two sibling components based on the active model:

isLiteRT ? <LiteRTTextSettings /> : <LlamaTextSettings />

LiteRTTextSettings shows: Temperature, Max Tokens, Top P (advanced), Acceleration (GPU/CPU), Show Generation Details, Thinking toggle.

LlamaTextSettings shows: Temperature, Max Tokens, Context Length, Top P, Repeat Penalty, CPU Threads, Batch Size, Backend, Flash Attention, KV Cache Type, Model Loading Strategy.

LiteRT hides controls that do not map to its SDK (no flash attention, no GPU layer count, no manual thread count).

Settings that bake into the engine at load time (backend, maxTokens) trigger a "settings changed, reload required" banner. Settings the engine reads per-generation (temperature, top-P) take effect on the next reset.
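
The split between load-time and per-generation settings can be expressed as one comparison. `needsReload` and the two interface names are hypothetical; the field names follow the settings table.

```typescript
// Baked into the engine when the model loads.
interface LoadTimeSettings { liteRTBackend: 'gpu' | 'cpu'; liteRTMaxTokens: number }
// Read again for each generation.
interface PerGenerationSettings { liteRTTemperature: number; liteRTTopP: number }
type LiteRTAppSettings = LoadTimeSettings & PerGenerationSettings;

// True when the reload banner should appear: only load-time fields count.
function needsReload(loadedWith: LoadTimeSettings, current: LiteRTAppSettings): boolean {
  return (
    loadedWith.liteRTBackend !== current.liteRTBackend ||
    loadedWith.liteRTMaxTokens !== current.liteRTMaxTokens
  );
}
```

This assumes the service records the load-time values at the moment the model loads, so later edits can be compared against what the engine is actually running with.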

Memory budget

The memory check at model load now sums RAM across both engines plus any loaded image model. Loading a LiteRT model while llama is resident unloads llama first, and vice versa. Loading a LiteRT model alongside an image model checks both budgets to prevent OOM.
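
A simplified sketch of that check, with sizes in gigabytes. All names and the single `budgetGb` figure are assumptions; the real check reads native RAM stats and is not shown in this PR.

```typescript
type TextEngine = 'llama' | 'litert';
interface ModelFootprint { engine: TextEngine; sizeGb: number }
interface LoadPlan { unloadFirst: TextEngine | null; fits: boolean }

// Only one text model is resident at a time, so the resident one is unloaded
// first and does not count against the budget. A loaded image model does.
function planTextModelLoad(
  incoming: ModelFootprint,
  resident: ModelFootprint | null,
  imageModelGb: number, // 0 when no image model is loaded
  budgetGb: number,     // hypothetical: RAM the app allows for models in total
): LoadPlan {
  return {
    unloadFirst: resident ? resident.engine : null,
    fits: incoming.sizeGb + imageModelGb <= budgetGb,
  };
}
```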


Build infrastructure

Kotlin 2.2.0

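The LiteRT SDK requires Kotlin 2.x. kotlinVersion is bumped from 2.1.20 to 2.2.0 in android/build.gradle along with the Kotlin Gradle plugin classpath. K2 has been the default compiler since Kotlin 2.0, so this bump does not switch compilers; it picks up the 2.2 changes to K2, including the stricter smart-cast checks that the gesture-handler patch works around.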

react-native-gesture-handler patch

Kotlin 2.2's K2 compiler tightened smart-cast inference inside lambdas when the receiver is a property. react-native-gesture-handler@2.30.0 relies on pre-K2 behavior in findRootHelperForViewAncestor:

// Before (fails K2 — smart cast doesn't carry across && for property receivers)
it.rootView is ReactRootView && it.rootView.rootViewTag == rootViewTag

// After (patches/react-native-gesture-handler+2.30.0.patch)
it.rootView is ReactRootView && (it.rootView as ReactRootView).getRootViewTag() == rootViewTag

Same JVM bytecode, same runtime behavior. Applied via patch-package in postinstall. Will self-remove when gesture-handler ships a K2-compatible version upstream.

Native module

android/app/src/main/java/ai/offgridmobile/litert/LiteRTModule.kt registers as a standard ReactContextBaseJavaModule:

| Method | What it does |
| --- | --- |
| loadModel | Two-tier backend fallback init |
| resetConversation | Close current, create new with system + history + tools |
| sendMessage | Send current turn, stream tokens via events |
| respondToToolCall | Feed tool result back to the SDK |
| stopGeneration | Cancel current generation, null active session |
| unloadModel | Close Conversation, then Engine, in that order |
| getMemoryInfo | Native RAM stats for DeviceStatsChip |

OpenCL opted in via AndroidManifest:

<uses-native-library android:name="libOpenCL.so" android:required="false" />

Platform readiness

liteRTService detects availability by native module presence (!!NativeModules.LiteRTModule), not Platform.OS === 'android'. When the iOS Swift LiteRT SDK is released and registered under the same JS name, the JS service starts working on iOS with no code changes here.
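
The availability check is small enough to show in full. In this sketch the module registry is passed in as a parameter so the snippet stays self-contained; in the app the argument would be React Native's `NativeModules`. The function name is hypothetical.

```typescript
// Feature detection by module presence, not by platform name.
function isLiteRTAvailable(nativeModules: Record<string, unknown>): boolean {
  return !!nativeModules['LiteRTModule'];
}
```

Because nothing here mentions the platform, an iOS build that registers a module under the same name would pass the check without further changes.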


Behavior summary

| Action | Result |
| --- | --- |
| Tap a .gguf model | Loads on llama.cpp (existing) |
| Tap a .litertlm model | Loads on LiteRT, settings UI swaps |
| Change LiteRT backend while loaded | Reload banner appears |
| Change llama backend while LiteRT loaded | No banner (independent settings) |
| Stop mid-generation | Cancels, next turn safely restarts |
| Send an image to a non-vision LiteRT model | Clean error, no silent drop |
| Conversation context hits 65% of budget | Auto-summarizes and resets |
| App backgrounded with LiteRT loaded, then resumed | State syncs correctly |
| iOS user opens the app | Llama path unchanged, no LiteRT UI |

Test plan

  • npm run lint && npx tsc --noEmit && npm test clean
  • Android cold start — llama model loads and generates
  • Android cold start — LiteRT model loads and generates
  • Switch from llama to LiteRT mid-session, switch back
  • LiteRT GPU path: load Gemma 4 E2B, verify ~38 chars/sec decode
  • LiteRT CPU fallback when GPU init refused
  • Stop generation mid-stream, send another message — no FAILED_PRECONDITION
  • Long conversation — watch auto-compact trigger near 65%
  • Vision: Gemma 3n with image attached, model responds about the image
  • Vision blocked: text-only model with image attached, clean error
  • Tool calling: web_search invocation completes and returns to model
  • Settings: change LiteRT temperature, verify next response reflects it
  • Settings: change LiteRT max tokens, reload banner appears, reload works
  • Memory: load LiteRT + image gen, watch memory check
  • CI build succeeds from a fresh checkout (gesture-handler patch applies)
  • iOS build succeeds, no LiteRT UI visible on iOS device

- Add LiteRTModule.kt — native Android module managing Engine/Conversation
  lifecycle with NPU→GPU→CPU fallback chain and image decode pipeline
- Add LiteRTPackage.kt and register in MainApplication
- Add LiteRTService.ts — JS bridge with streaming token events
- Wire generation routing in generationServiceHelpers (litert vs llama.cpp)
- Add doLoadLiteRTModel in activeModelService loaders
- Add .litertlm import support with per-model vision toggle dialog
- Add liteRTVision and engine fields to DownloadedModel type
- Add persistent debug logs store (AsyncStorage-backed, survives crashes)
- Add DebugLogsScreen modal accessible from ChatHeader terminal icon
- Upgrade litertlm-android 0.10.0→0.11.0, Kotlin 2.1.20→2.2.0, kapt→ksp
- Fix SIGSEGV: gate visionBackend=GPU behind supportsVision flag
- Fix double load: check liteRTService.isModelLoaded() before triggering load
- Fix reload loop: skip hasPendingSettings and handleReloadTextModel for litert
- Add LITERT_TODO.md with full production readiness backlog
- Fix lint errors and update modelManager tests for .litertlm support

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces LiteRT-LM on-device inference support for Android, featuring a new native module, JS bridge, and debug screen. It also migrates the build system to KSP and updates Kotlin and Gradle versions. Review feedback points out a hardcoded local Java path in gradle.properties that breaks CI, a resource leak in the native engine initialization fallback logic, and several instances where vision support is incorrectly hardcoded to true for all LiteRT models instead of respecting the specific model configuration.

Comment thread android/gradle.properties Outdated
# The setting is particularly useful for tweaking memory settings.
# Default value: -Xmx512m -XX:MaxMetaspaceSize=256m
org.gradle.jvmargs=-Xmx2048m -XX:MaxMetaspaceSize=512m
org.gradle.java.home=/Library/Java/JavaVirtualMachines/temurin-21.jdk/Contents/Home

high

The org.gradle.java.home property is hardcoded to a local path on your machine. This will break the build for other developers and on CI environments. As noted in docs/LITERT_TODO.md, this should be removed before merging.

Comment on lines +93 to +123
for (backend in chain) {
    val name = backendName(backend)
    Log.i(TAG, "initializeWithFallback — trying $name vision=$visionEnabled")
    try {
        val cfg = EngineConfig(
            modelPath = modelPath,
            backend = backend,
            cacheDir = null,
            visionBackend = if (visionEnabled) Backend.GPU() else null,
        )
        val eng = Engine(cfg)
        val timeoutMs = when (backend) {
            is Backend.NPU -> NPU_TIMEOUT_MS
            is Backend.GPU -> GPU_TIMEOUT_MS
            else -> CPU_TIMEOUT_MS
        }
        withTimeout(timeoutMs) {
            eng.initialize()
        }
        engine = eng
        Log.i(TAG, "initializeWithFallback — $name succeeded")
        return backend
    } catch (e: Exception) {
        Log.w(TAG, "initializeWithFallback — $name failed: ${e.message}")
        engine?.close()
        engine = null
        lastError = e
        if (backend == chain.last()) break
        Log.i(TAG, "initializeWithFallback — falling back to next tier")
    }
}

high

There is a resource leak in the fallback chain. If eng.initialize() fails or times out, the local eng instance is never closed. The catch block calls engine?.close(), but engine (the class property) is still null at that point because the assignment at line 112 is never reached. You should declare eng outside the try block and ensure it is closed on failure.

        for (backend in chain) {
            val name = backendName(backend)
            Log.i(TAG, "initializeWithFallback — trying $name vision=$visionEnabled")
            var eng: Engine? = null
            try {
                val cfg = EngineConfig(
                    modelPath = modelPath,
                    backend = backend,
                    cacheDir = null,
                    visionBackend = if (visionEnabled) Backend.GPU() else null,
                )
                eng = Engine(cfg)
                val timeoutMs = when (backend) {
                    is Backend.NPU -> NPU_TIMEOUT_MS
                    is Backend.GPU -> GPU_TIMEOUT_MS
                    else           -> CPU_TIMEOUT_MS
                }
                withTimeout(timeoutMs) {
                    eng.initialize()
                }
                engine = eng
                Log.i(TAG, "initializeWithFallback — $name succeeded")
                return backend
            } catch (e: Exception) {
                Log.w(TAG, "initializeWithFallback — $name failed: ${e.message}")
                eng?.close()
                lastError = e
                if (backend == chain.last()) break
                Log.i(TAG, "initializeWithFallback — falling back to next tier")
            }
        }
References
  1. When a process with an iteration limit exceeds that limit, treat it as a failure and invoke the same fallback logic used for other exceptions.
  2. Avoid code duplication by refactoring common operations into a single call, especially when they appear in multiple exit paths of a function.

await activeModelService.loadTextModel(activeModelId);
const multimodalSupport = llmService.getMultimodalSupport();
deps.setSupportsVision(multimodalSupport?.vision || false);
deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false));

medium

For LiteRT models, supportsVision is hardcoded to true. This will enable vision UI elements (like the image attachment button) even for text-only LiteRT models. It should instead respect the liteRTVision flag stored in the model record.

Suggested change
deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false));
deps.setSupportsVision(activeModel.engine === 'litert' ? !!activeModel.liteRTVision : (multimodalSupport?.vision || false));
References
  1. Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.

await activeModelService.loadTextModel(activeModelId);
const multimodalSupport = llmService.getMultimodalSupport();
deps.setSupportsVision(multimodalSupport?.vision || false);
deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false));

medium

Similar to the previous comment, supportsVision should respect the liteRTVision flag from the model record for LiteRT models.

Suggested change
deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false));
deps.setSupportsVision(activeModel.engine === 'litert' ? !!activeModel.liteRTVision : (multimodalSupport?.vision || false));
References
  1. Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.

Comment on lines +150 to +154
deps.setSupportsVision(true);
return;
}
dbg('log', `[LiteRT] ensureModelLoaded — model=${activeModel.name}, triggering load`);
deps.setSupportsVision(true);

medium

Hardcoding supportsVision to true here will cause the UI to incorrectly show vision capabilities for text-only LiteRT models. Please use the liteRTVision property from the activeModel.

Suggested change
deps.setSupportsVision(true);
return;
}
dbg('log', `[LiteRT] ensureModelLoaded — model=${activeModel.name}, triggering load`);
deps.setSupportsVision(true);
deps.setSupportsVision(!!activeModel.liteRTVision);
return;
}
dbg('log', `[LiteRT] ensureModelLoaded — model=${activeModel.name}, triggering load`);
deps.setSupportsVision(!!activeModel.liteRTVision);
References
  1. Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.

await activeModelService.loadTextModel(model.id);
const multimodalSupport = llmService.getMultimodalSupport();
deps.setSupportsVision(multimodalSupport?.vision || false);
deps.setSupportsVision(model.engine === 'litert' ? true : (multimodalSupport?.vision || false));

medium

Ensure supportsVision is set based on the model's actual capabilities rather than hardcoding it to true for all LiteRT models.

Suggested change
deps.setSupportsVision(model.engine === 'litert' ? true : (multimodalSupport?.vision || false));
deps.setSupportsVision(model.engine === 'litert' ? !!model.liteRTVision : (multimodalSupport?.vision || false));
References
  1. Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.

if (activeModelInfo.isRemote) {
setSupportsVision(activeRemoteModel?.capabilities?.supportsVision ?? false);
} else if (activeModel?.engine === 'litert') {
setSupportsVision(true);

medium

The supportsVision state should be derived from the liteRTVision flag in the model record to avoid misleading the user with vision UI on text-only models.

Suggested change
setSupportsVision(true);
setSupportsVision(!!activeModel.liteRTVision);
References
  1. Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.

dishit-wednesday and others added 27 commits May 19, 2026 15:08
…ams to LiteRTModule

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…memory budget, BenchmarkInfo wiring

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…eload trigger, iOS guard

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
… tps, init time in generation details

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…/NPU warmup

getBenchmarkInfo() requires internal BenchmarkParams not exposed in the public
API. Track TTFT, decode tok/s, and token count via wall-clock timers in JS
instead. Add model warmup after GPU/NPU load to prime shader caches.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
- Fix regeneration for LiteRT: use ensureModelReady instead of bare
  llmService.isModelLoaded() check which always returns false for LiteRT
- Invalidate native conversation before regenerate/edit so native history
  is correctly rewound to match the JS message array
- Fix context loss after stopGeneration: remove activeConversationId=null
  which was wiping native turn history on every stop
- Add invalidateConversation() to LiteRTService for explicit resets
- Extend tool call parser to handle: no-args calls, Gemma function-call
  style args NAME({"k":"v"}), and </tool_call> closing tag variant
- Fix Gemma native parser regex to accept both <tool_call|> and </tool_call>
  as closing tags
- GPU retry logic in LiteRTModule: retry non-CPU backends up to 3 times
  with 600ms backoff before falling back, handles transient VRAM pressure
  after model switches
- Capture benchmark stats from generateRaw path for generation meta display
- Raise debug log capacity from 200 to 2000 entries

Co-Authored-By: Dishit Karia <hamadishit74@gmail.com>
…ext size, and wire tool call event bridge

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…ore selector

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…extend reload detection to context length

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Tapping the input shrank the FlatList viewport without repositioning
the scroll, leaving the last AI message hidden behind the keyboard.
Track height changes via onLayout and scroll to end when the viewport
shrinks. Add a keyboardWillShow/keyboardDidShow listener as a secondary
trigger for iOS.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…imeout

- Fix Gemma tool call parsing to handle the "tool_name{json}" body
  pattern alongside the existing key:value format; add key validation
  so non-word strings are not treated as argument keys
- Pass temperature/topK/topP through prepareConversation in the tool
  loop so generation settings are respected during tool-call turns
- Unify model init timeout to 90s across all backends (was 45/20/15s)
  to prevent premature timeout failures on slower devices
- Add debugLog helper in LiteRTModule that emits litert_debug_log
  events to the in-app debug screen alongside logcat

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
…ter from history

Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
… settings UI

Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Split DownloadedModel into LlamaDownloadedModel | LiteRTDownloadedModel
with engine as required discriminant. Legacy records without an engine
field are backfilled to 'llama' on load from AsyncStorage.

All call sites that touch llama-only fields (mmProjPath, mmProjFileSize,
isVisionModel) now narrow via engine === 'llama' guards before access,
removing the previous implicit assumption that every model had those fields.

Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Add liteRTTemperature, liteRTTopP, liteRTContextLength, and
liteRTMaxOutputTokens to AppSettings so LiteRT and llama no longer
share a single contextLength/temperature/topP field with different
semantics per engine.

LiteRT generation paths now read liteRT* fields. Llama paths are
unchanged. Migration seeds the new fields from existing shared values
on first upgrade so user preferences carry over. The pending-reload
banner check for LiteRT now watches liteRTContextLength instead of
the shared contextLength.

Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
dishit-wednesday and others added 28 commits May 27, 2026 12:45
Show amber dot on chat settings gear icon and amber tool icon/badge in the quick settings popover when more than 3 tools are active. ToolPickerSheet shows a one-time dismissable banner explaining the latency impact. Dismissed state is persisted — never shown again once acknowledged.

Also sets 3-tool default for new users and adds hint copy to the bottom of the tool picker.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…d LiteRT

- Revert applicationId, versionCode, versionName to match main
- Revert app_name to "Off Grid"
- Revert JVM heap args to match main
- Migrate kapt → KSP for Room compiler (required for Kotlin 2.2.0)
- Bump Kotlin 2.1.20 → 2.2.0, pin AGP to 8.8.2
- Add KSP plugin at 2.2.0-2.0.2
- Add LiteRT dependency: litertlm-android:0.11.0
Extract color literals to constants, move inline styles to stylesheets,
remove unused AVAILABLE_TOOLS imports from Popovers and ChatInput.
… dependency in ChatInput

ChatInput now receives showSettingsDot as a prop from ChatMessageArea,
keeping the store read out of ChatInput and fixing test failures caused
by unmocked useAppStore in existing ChatInput tests.
…ile overhead

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
and remove excessive comments
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Wrap the settings icon in a relative-positioned View so the dot
is anchored to the icon bounds (18px) not the button bounds (32px).
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
LiteRT runs the tool loop natively via automaticToolCalling, so the JS
MAX_TOTAL_TOOL_CALLS cap never applied to it — a single message could
trigger unbounded tool calls and overflow the ~4096-token KV cache
mid-turn, producing degenerate output or crashing.

Add a per-turn counter in buildLiteRTToolCallHandler: calls 1-3 run
normally; the 4th+ skips execution and returns a 'stop, answer now'
nudge to the model. Counter resets each turn (closure rebuilt per
generation). Loop stays native.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
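
The per-turn cap described above could look something like the sketch below. The cap of 3 and the closure-scoped counter come from the commit message; the handler signature, the `ToolCall` shape, and the nudge wording are assumptions, and the real `buildLiteRTToolCallHandler` is presumably asynchronous.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolExecutor = (call: ToolCall) => string;

const MAX_NATIVE_TOOL_CALLS = 3; // cap stated in the commit message

// Hypothetical, synchronous stand-in for buildLiteRTToolCallHandler.
function buildToolCallHandlerSketch(
  execute: ToolExecutor,
): (call: ToolCall) => string {
  // The counter lives in the closure, so building a new handler for each
  // generation resets it without any explicit reset call.
  let callsThisTurn = 0;
  return (call) => {
    callsThisTurn += 1;
    if (callsThisTurn > MAX_NATIVE_TOOL_CALLS) {
      // Skip execution and return a nudge instead of a tool result.
      return 'Tool call limit reached. Stop calling tools and answer now.';
    }
    return execute(call);
  };
}
```

Returning a nudge as the tool result keeps the loop inside the native runtime while still bounding how much tool output can accumulate in the KV cache during a single turn.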
…ume retry

When a download fails on the Models screen, the card now renders the
error message, a red partial-progress bar, and Retry / Remove buttons
directly inside the card boundary — matching the Download Manager UI.

Tapping Retry calls backgroundDownloadService.retryDownload with the
existing download ID so the native WorkManager resumes from the partial
file via HTTP Range instead of starting a fresh download from 0.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
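
For readers unfamiliar with resuming via HTTP Range, the idea can be sketched as below. This only illustrates the HTTP semantics; the actual resume logic lives in the native WorkManager code, which is not shown on this page, and `resumeHeaders` and `resumeAction` are hypothetical helper names.

```typescript
// Build request headers for resuming a download at `bytesOnDisk`.
// An open-ended range ("bytes=N-") asks for everything from byte N onward.
function resumeHeaders(bytesOnDisk: number): Record<string, string> {
  return bytesOnDisk > 0 ? { Range: `bytes=${bytesOnDisk}-` } : {};
}

type ResumeAction = 'append' | 'restart';

// 206 Partial Content means the server honoured the range, so the body can
// be appended to the partial file. Any other status (typically 200) means
// the full file is being sent and the partial file should be overwritten.
function resumeAction(status: number, bytesOnDisk: number): ResumeAction {
  return status === 206 && bytesOnDisk > 0 ? 'append' : 'restart';
}
```

Checking for 206 matters because a server that ignores the `Range` header responds with 200 and the whole body; appending that to the partial file would corrupt the model.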
Remove the two logs that fire in tight loops:
- llm.ts: reasoning_content chunk received (fired on every thinking
  token — O(N²) string work serializing accumulated text each call)
- useDownloads.ts: mmproj progress and missed-entry debug logs (fired
  every 1.5s during download and on every progress event miss)

All other diagnostic logs (model load, download lifecycle, tool calls)
are untouched — they fire once per user action and are useful for
diagnosing real issues.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
ChatsListScreen was subscribing to the entire chatStore and appStore
with no selectors, causing it to re-render on every streaming token
while mounted in the tab navigator. Actions moved to getState() and
data fields use targeted selectors.
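
To illustrate why selectors help here, the sketch below uses a tiny hand-rolled store rather than the app's actual store library. `createStore`, `subscribeWithSelector`, and the `ChatState` fields are all hypothetical; the point is only that a selector-based subscriber is not notified when unrelated state (such as streaming text) changes.

```typescript
type Listener<S> = (state: S) => void;

function createStore<S>(initial: S) {
  let state = initial;
  let listeners: Listener<S>[] = [];
  const add = (listener: Listener<S>) => {
    listeners.push(listener);
    return () => {
      listeners = listeners.filter((l) => l !== listener);
    };
  };
  return {
    getState: () => state,
    setState(patch: Partial<S>) {
      state = { ...state, ...patch };
      listeners.forEach((l) => l(state));
    },
    // Whole-store subscription: fires on every update.
    subscribe: add,
    // Selector subscription: fires only when the selected value changes.
    subscribeWithSelector<T>(selector: (s: S) => T, onChange: (value: T) => void) {
      let last = selector(state);
      return add((s) => {
        const next = selector(s);
        if (next !== last) {
          last = next;
          onChange(next);
        }
      });
    },
  };
}

interface ChatState {
  conversationTitles: string[];
  streamingText: string;
}

const chatStore = createStore<ChatState>({
  conversationTitles: ['Trip plan'],
  streamingText: '',
});

let wholeStoreRenders = 0;
let selectorRenders = 0;
chatStore.subscribe(() => { wholeStoreRenders += 1; });
chatStore.subscribeWithSelector((s) => s.conversationTitles, () => { selectorRenders += 1; });

// Simulate five streamed tokens; the title list is never touched.
for (const token of ['He', 'llo', ',', ' wor', 'ld']) {
  chatStore.setState({ streamingText: chatStore.getState().streamingText + token });
}
```

After the loop, the whole-store subscriber has been notified once per token, while the selector subscriber has not been notified at all, which is the behaviour the commit is after for a list screen that stays mounted in the tab navigator.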

Adds an informational banner above the chat input when a llama model
is loaded with OpenCL selected as the inference backend, nudging users
to switch to CPU in Settings. Does not show for LiteRT models or
remote models.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
… StyleSheet

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…stener split

useDownloads.ts: remove useDownloadListeners() call — now fully independent.
App.tsx: mount useDownloadListeners() directly at root so listener registration
is not lost after the split.

TextModelsTab handleRetryDownload:
- Android-only guard; iOS falls back to proceedDownload (fresh download)
- mmproj sidecar retry: set pending before retry, only call
  resetMmProjForRetry if native retry succeeded, set failed on error.
  Matches retryAndroidDownload in useDownloadManager exactly — prevents
  silent vision loss on retry from the Models screen.
- onRetry branches on Platform.OS
- Use storeDownloads selector instead of getState() snapshot for storeEntry

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…callback

Replace captured store.downloadIdIndex snapshot with a live getState()
call inside the async callback, matching the pattern in
reattachRetriedTextDownload in useDownloadManager.ts.

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
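
The difference between a captured snapshot and a live `getState()` read can be shown with a small stand-in store. The `downloadIdIndex` name comes from the commit message; its shape (download ID to model key), the module-level store, and both lookup helpers are assumptions made for this sketch.

```typescript
interface DownloadState {
  // Assumed shape: download ID -> model key.
  downloadIdIndex: Record<string, string>;
}

let storeState: DownloadState = { downloadIdIndex: { 'dl-1': 'model-a' } };
const getState = (): DownloadState => storeState;
const setState = (next: DownloadState): void => {
  storeState = next;
};

// Stale: the index is captured when the callback is created.
function makeStaleLookup(): (id: string) => string | undefined {
  const snapshot = getState().downloadIdIndex;
  return (id) => snapshot[id];
}

// Live: the index is read when the callback actually runs.
function makeLiveLookup(): (id: string) => string | undefined {
  return (id) => getState().downloadIdIndex[id];
}

const staleLookup = makeStaleLookup();
const liveLookup = makeLiveLookup();

// A retry registers a new download ID after both callbacks were created.
setState({
  downloadIdIndex: { ...getState().downloadIdIndex, 'dl-2': 'model-a' },
});
```

Here `staleLookup('dl-2')` misses the entry added by the retry, while `liveLookup('dl-2')` finds it, because the store replaces the index object on update instead of mutating it in place.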
- ModelCard: 8 new tests covering failedState / FailedSection (new
  inline retry UI from fix/69d17d28)
- generationToolLoop: 4 new tests for LiteRT native tool-call cap
  introduced in fix/73f85ff8 — verifies cap at 3, Aborted fast-path,
  and per-generation counter reset
- activeModelService loaders: fix stale-path test (add isVisionModel:true),
  add guard tests for text-only model and mmProjFileName repair sentinel
- scan.test.ts (new): unit tests for extractBaseName and
  findMatchingMmProj, plus curatedLiteRTRegistry entry lookup
- visionRepair: 3 additional branch tests (name-lookup false path,
  catalog-no-mmproj path, fileName vl-detection path)

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…e ID

- Create shared test utilities: mocks.ts with AsyncStorage, logger, whisper service, and HTTP client factories
- Add store-specific reset helpers (resetDownloadStore, resetRemoteServerStore, resetWhisperStore, etc.)
- Add act() wrapper utilities (actStoreUpdate, actAsyncStoreUpdate) to reduce boilerplate
- Refactor remoteServerStore.test.ts to use shared actStoreUpdate() instead of 50+ act() calls
- Refactor whisperStore.test.ts to use resetWhisperStore() from shared utilities
- Change litert bundle ID from ai.offgridmobile to ai.offgridmobile.litert (allows side-by-side install with Play Store version)

Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 3, 2026

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

@dishit-wednesday dishit-wednesday merged commit a7bc414 into main Jun 3, 2026
5 of 7 checks passed