feat(litert): add LiteRT-LM as second on-device inference engine #360
- Add LiteRTModule.kt: native Android module managing Engine/Conversation lifecycle with NPU→GPU→CPU fallback chain and image decode pipeline
- Add LiteRTPackage.kt and register in MainApplication
- Add LiteRTService.ts: JS bridge with streaming token events
- Wire generation routing in generationServiceHelpers (litert vs llama.cpp)
- Add doLoadLiteRTModel in activeModelService loaders
- Add .litertlm import support with per-model vision toggle dialog
- Add liteRTVision and engine fields to DownloadedModel type
- Add persistent debug logs store (AsyncStorage-backed, survives crashes)
- Add DebugLogsScreen modal accessible from ChatHeader terminal icon
- Upgrade litertlm-android 0.10.0→0.11.0, Kotlin 2.1.20→2.2.0, kapt→ksp
- Fix SIGSEGV: gate visionBackend=GPU behind supportsVision flag
- Fix double load: check liteRTService.isModelLoaded() before triggering load
- Fix reload loop: skip hasPendingSettings and handleReloadTextModel for litert
- Add LITERT_TODO.md with full production readiness backlog
- Fix lint errors and update modelManager tests for .litertlm support
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Code Review
This pull request introduces LiteRT-LM on-device inference support for Android, featuring a new native module, JS bridge, and debug screen. It also migrates the build system to KSP and updates Kotlin and Gradle versions. Review feedback points out a hardcoded local Java path in gradle.properties that breaks CI, a resource leak in the native engine initialization fallback logic, and several instances where vision support is incorrectly hardcoded to true for all LiteRT models instead of respecting the specific model configuration.
| # The setting is particularly useful for tweaking memory settings. | ||
| # Default value: -Xmx512m -XX:MaxMetaspaceSize=256m | ||
| org.gradle.jvmargs=-Xmx2048m -XX:MaxMetaspaceSize=512m | ||
| org.gradle.java.home=/Library/Java/JavaVirtualMachines/temurin-21.jdk/Contents/Home |
| for (backend in chain) { | ||
| val name = backendName(backend) | ||
| Log.i(TAG, "initializeWithFallback — trying $name vision=$visionEnabled") | ||
| try { | ||
| val cfg = EngineConfig( | ||
| modelPath = modelPath, | ||
| backend = backend, | ||
| cacheDir = null, | ||
| visionBackend = if (visionEnabled) Backend.GPU() else null, | ||
| ) | ||
| val eng = Engine(cfg) | ||
| val timeoutMs = when (backend) { | ||
| is Backend.NPU -> NPU_TIMEOUT_MS | ||
| is Backend.GPU -> GPU_TIMEOUT_MS | ||
| else -> CPU_TIMEOUT_MS | ||
| } | ||
| withTimeout(timeoutMs) { | ||
| eng.initialize() | ||
| } | ||
| engine = eng | ||
| Log.i(TAG, "initializeWithFallback — $name succeeded") | ||
| return backend | ||
| } catch (e: Exception) { | ||
| Log.w(TAG, "initializeWithFallback — $name failed: ${e.message}") | ||
| engine?.close() | ||
| engine = null | ||
| lastError = e | ||
| if (backend == chain.last()) break | ||
| Log.i(TAG, "initializeWithFallback — falling back to next tier") | ||
| } | ||
| } |
There is a resource leak in the fallback chain. If eng.initialize() fails or times out, the local eng instance is never closed. The catch block calls engine?.close(), but engine (the class property) is still null at that point because the assignment at line 112 is never reached. You should declare eng outside the try block and ensure it is closed on failure.
for (backend in chain) {
    val name = backendName(backend)
    Log.i(TAG, "initializeWithFallback — trying $name vision=$visionEnabled")
    var eng: Engine? = null
    try {
        val cfg = EngineConfig(
            modelPath = modelPath,
            backend = backend,
            cacheDir = null,
            visionBackend = if (visionEnabled) Backend.GPU() else null,
        )
        eng = Engine(cfg)
        val timeoutMs = when (backend) {
            is Backend.NPU -> NPU_TIMEOUT_MS
            is Backend.GPU -> GPU_TIMEOUT_MS
            else -> CPU_TIMEOUT_MS
        }
        withTimeout(timeoutMs) {
            eng.initialize()
        }
        engine = eng
        Log.i(TAG, "initializeWithFallback — $name succeeded")
        return backend
    } catch (e: Exception) {
        Log.w(TAG, "initializeWithFallback — $name failed: ${e.message}")
        eng?.close()
        lastError = e
        if (backend == chain.last()) break
        Log.i(TAG, "initializeWithFallback — falling back to next tier")
    }
}

References
- When a process with an iteration limit exceeds that limit, treat it as a failure and invoke the same fallback logic used for other exceptions.
- Avoid code duplication by refactoring common operations into a single call, especially when they appear in multiple exit paths of a function.
| await activeModelService.loadTextModel(activeModelId); | ||
| const multimodalSupport = llmService.getMultimodalSupport(); | ||
| deps.setSupportsVision(multimodalSupport?.vision || false); | ||
| deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false)); |
For LiteRT models, supportsVision is hardcoded to true. This will enable vision UI elements (like the image attachment button) even for text-only LiteRT models. It should instead respect the liteRTVision flag stored in the model record.
| deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false)); | |
| deps.setSupportsVision(activeModel.engine === 'litert' ? !!activeModel.liteRTVision : (multimodalSupport?.vision || false)); |
References
- Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.
| await activeModelService.loadTextModel(activeModelId); | ||
| const multimodalSupport = llmService.getMultimodalSupport(); | ||
| deps.setSupportsVision(multimodalSupport?.vision || false); | ||
| deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false)); |
Similar to the previous comment, supportsVision should respect the liteRTVision flag from the model record for LiteRT models.
| deps.setSupportsVision(activeModel.engine === 'litert' ? true : (multimodalSupport?.vision || false)); | |
| deps.setSupportsVision(activeModel.engine === 'litert' ? !!activeModel.liteRTVision : (multimodalSupport?.vision || false)); |
References
- Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.
| deps.setSupportsVision(true); | ||
| return; | ||
| } | ||
| dbg('log', `[LiteRT] ensureModelLoaded — model=${activeModel.name}, triggering load`); | ||
| deps.setSupportsVision(true); |
Hardcoding supportsVision to true here will cause the UI to incorrectly show vision capabilities for text-only LiteRT models. Please use the liteRTVision property from the activeModel.
| deps.setSupportsVision(true); | |
| return; | |
| } | |
| dbg('log', `[LiteRT] ensureModelLoaded — model=${activeModel.name}, triggering load`); | |
| deps.setSupportsVision(true); | |
| deps.setSupportsVision(!!activeModel.liteRTVision); | |
| return; | |
| } | |
| dbg('log', `[LiteRT] ensureModelLoaded — model=${activeModel.name}, triggering load`); | |
| deps.setSupportsVision(!!activeModel.liteRTVision); |
References
- Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.
| await activeModelService.loadTextModel(model.id); | ||
| const multimodalSupport = llmService.getMultimodalSupport(); | ||
| deps.setSupportsVision(multimodalSupport?.vision || false); | ||
| deps.setSupportsVision(model.engine === 'litert' ? true : (multimodalSupport?.vision || false)); |
Ensure supportsVision is set based on the model's actual capabilities rather than hardcoding it to true for all LiteRT models.
| deps.setSupportsVision(model.engine === 'litert' ? true : (multimodalSupport?.vision || false)); | |
| deps.setSupportsVision(model.engine === 'litert' ? !!model.liteRTVision : (multimodalSupport?.vision || false)); |
References
- Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.
| if (activeModelInfo.isRemote) { | ||
| setSupportsVision(activeRemoteModel?.capabilities?.supportsVision ?? false); | ||
| } else if (activeModel?.engine === 'litert') { | ||
| setSupportsVision(true); |
The supportsVision state should be derived from the liteRTVision flag in the model record to avoid misleading the user with vision UI on text-only models.
| setSupportsVision(true); | |
| setSupportsVision(!!activeModel.liteRTVision); |
References
- Vision-language models should be specifically categorized with type 'vision' to ensure correct UI behavior.
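The review suggestions above repeat the same ternary at several call sites. One way to keep them consistent is a single helper, sketched below. The helper name `resolveSupportsVision` and the `ModelLike` / `MultimodalSupport` shapes are hypothetical; only `engine`, `liteRTVision`, and `multimodalSupport?.vision` come from the diff.

```typescript
// Hypothetical minimal shapes; the real DownloadedModel type has more fields.
type ModelLike =
  | { engine: 'litert'; liteRTVision?: boolean }
  | { engine: 'llama' };

interface MultimodalSupport {
  vision?: boolean;
}

// LiteRT models follow the per-model liteRTVision flag; every other case
// falls back to what the llama service reports.
function resolveSupportsVision(
  model: ModelLike | null | undefined,
  multimodalSupport: MultimodalSupport | null | undefined,
): boolean {
  if (model?.engine === 'litert') {
    return !!model.liteRTVision;
  }
  return multimodalSupport?.vision ?? false;
}
```

Each call site would then reduce to something like `deps.setSupportsVision(resolveSupportsVision(activeModel, multimodalSupport))`, so a text-only LiteRT model can no longer enable the image attachment button by accident.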
…ams to LiteRTModule Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…memory budget, BenchmarkInfo wiring Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…eload trigger, iOS guard Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
… tps, init time in generation details Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…/NPU warmup getBenchmarkInfo() requires internal BenchmarkParams not exposed in the public API. Track TTFT, decode tok/s, and token count via wall-clock timers in JS instead. Add model warmup after GPU/NPU load to prime shader caches. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
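As a rough illustration of the wall-clock approach this commit describes, the sketch below tracks time to first token (TTFT), decode rate, and token count in JS. The names `createStatsTracker` and `GenerationStats` are hypothetical, and the injectable clock is an assumption added for testability.

```typescript
interface GenerationStats {
  ttftMs: number | null;
  decodeTokensPerSec: number | null;
  tokenCount: number;
}

// Create the tracker just before sending the prompt, call onToken() for
// every streamed token, and call finish() once generation completes.
function createStatsTracker(now: () => number = Date.now) {
  const startedAt = now();
  let firstTokenAt: number | null = null;
  let lastTokenAt: number | null = null;
  let tokenCount = 0;

  return {
    onToken(): void {
      const t = now();
      if (firstTokenAt === null) firstTokenAt = t;
      lastTokenAt = t;
      tokenCount += 1;
    },
    finish(): GenerationStats {
      if (firstTokenAt === null || lastTokenAt === null) {
        return { ttftMs: null, decodeTokensPerSec: null, tokenCount: 0 };
      }
      const decodeMs = lastTokenAt - firstTokenAt;
      // The first token is attributed to prefill (TTFT), so the decode rate
      // counts only the tokens that arrived after it.
      const decodeTokensPerSec =
        decodeMs > 0 ? ((tokenCount - 1) * 1000) / decodeMs : null;
      return { ttftMs: firstTokenAt - startedAt, decodeTokensPerSec, tokenCount };
    },
  };
}
```

Timers like these measure what the JS side observes, including bridge latency, so the numbers may differ from what the native benchmark API would report.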
- Fix regeneration for LiteRT: use ensureModelReady instead of bare
llmService.isModelLoaded() check which always returns false for LiteRT
- Invalidate native conversation before regenerate/edit so native history
is correctly rewound to match the JS message array
- Fix context loss after stopGeneration: remove activeConversationId=null
which was wiping native turn history on every stop
- Add invalidateConversation() to LiteRTService for explicit resets
- Extend tool call parser to handle: no-args calls, Gemma function-call
style args NAME({"k":"v"}), and </tool_call> closing tag variant
- Fix Gemma native parser regex to accept both <tool_call|> and </tool_call>
as closing tags
- GPU retry logic in LiteRTModule: retry non-CPU backends up to 3 times
with 600ms backoff before falling back, handles transient VRAM pressure
after model switches
- Capture benchmark stats from generateRaw path for generation meta display
- Raise debug log capacity from 200 to 2000 entries
Co-Authored-By: Dishit Karia <hamadishit74@gmail.com>
…ext size, and wire tool call event bridge Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…ore selector Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…extend reload detection to context length Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Tapping the input shrank the FlatList viewport without repositioning the scroll, leaving the last AI message hidden behind the keyboard. Track height changes via onLayout and scroll to end when the viewport shrinks. Add a keyboardWillShow/keyboardDidShow listener as a secondary trigger for iOS. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…imeout
- Fix Gemma tool call parsing to handle the "tool_name{json}" body
pattern alongside the existing key:value format; add key validation
so non-word strings are not treated as argument keys
- Pass temperature/topK/topP through prepareConversation in the tool
loop so generation settings are respected during tool-call turns
- Unify model init timeout to 90s across all backends (was 45/20/15s)
to prevent premature timeout failures on slower devices
- Add debugLog helper in LiteRTModule that emits litert_debug_log
events to the in-app debug screen alongside logcat
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
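To make the `tool_name{json}` body pattern from this commit concrete, here is a minimal parsing sketch. It is not the parser in the PR: the function name `parseToolCallBody`, the regex, and the return shape are assumptions, and the existing key:value format is not covered.

```typescript
interface ParsedToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Accepts "tool_name" or "tool_name{...json...}". The name must be a plain
// identifier, so free-form text is not mistaken for a tool call.
function parseToolCallBody(body: string): ParsedToolCall | null {
  const match = /^([A-Za-z_]\w*)\s*(\{[\s\S]*\})?$/.exec(body.trim());
  if (!match) return null;
  const [, name, json] = match;
  if (json === undefined) return { name, args: {} }; // no-args call
  try {
    const parsed: unknown = JSON.parse(json);
    if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
      return null;
    }
    return { name, args: parsed as Record<string, unknown> };
  } catch {
    return null; // malformed JSON: treat the text as not a tool call
  }
}
```

Returning `null` on any mismatch lets the caller fall through to the next parsing strategy instead of executing a tool with half-parsed arguments.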
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
…ter from history Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
… settings UI Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
Split DownloadedModel into LlamaDownloadedModel | LiteRTDownloadedModel with engine as required discriminant. Legacy records without an engine field are backfilled to 'llama' on load from AsyncStorage. All call sites that touch llama-only fields (mmProjPath, mmProjFileSize, isVisionModel) now narrow via engine === 'llama' guards before access, removing the previous implicit assumption that every model had those fields. Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
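The split described in this commit could look roughly like the sketch below. The union, the `engine` discriminant, the llama-only fields, and the `isLlamaModel` / `isLiteRTModel` guards are named in this PR; the `BaseModel` fields, `StoredModel`, `backfillEngine`, and `mmProjPathOf` are hypothetical names used for illustration.

```typescript
// Hypothetical shared fields; the real record has more.
interface BaseModel {
  id: string;
  name: string;
  filePath: string;
}

interface LlamaDownloadedModel extends BaseModel {
  engine: 'llama';
  mmProjPath?: string;
  mmProjFileSize?: number;
  isVisionModel?: boolean;
}

interface LiteRTDownloadedModel extends BaseModel {
  engine: 'litert';
  liteRTVision?: boolean;
}

type DownloadedModel = LlamaDownloadedModel | LiteRTDownloadedModel;

const isLlamaModel = (m: DownloadedModel): m is LlamaDownloadedModel =>
  m.engine === 'llama';
const isLiteRTModel = (m: DownloadedModel): m is LiteRTDownloadedModel =>
  m.engine === 'litert';

// Records persisted before the split may lack `engine`.
type StoredModel = BaseModel & {
  engine?: 'llama' | 'litert';
  [key: string]: unknown;
};

// Legacy records without an engine field are treated as llama models.
function backfillEngine(record: StoredModel): DownloadedModel {
  return { ...record, engine: record.engine ?? 'llama' } as DownloadedModel;
}

// Llama-only fields are reachable only after narrowing on the discriminant.
function mmProjPathOf(model: DownloadedModel): string | undefined {
  return model.engine === 'llama' ? model.mmProjPath : undefined;
}
```

With `engine` required, accessing `model.mmProjPath` on an un-narrowed `DownloadedModel` becomes a compile error rather than a silent `undefined` at runtime.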
Add liteRTTemperature, liteRTTopP, liteRTContextLength, and liteRTMaxOutputTokens to AppSettings so LiteRT and llama no longer share a single contextLength/temperature/topP field with different semantics per engine. LiteRT generation paths now read liteRT* fields. Llama paths are unchanged. Migration seeds the new fields from existing shared values on first upgrade so user preferences carry over. The pending-reload banner check for LiteRT now watches liteRTContextLength instead of the shared contextLength. Co-authored-by: Dishit Karia <hanmadishit74@gmail.com>
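A seeding migration of the kind described here might be sketched as follows. The four `liteRT*` field names come from the commit message; the function name `seedLiteRTSettings` and the shared `maxTokens` field used to seed `liteRTMaxOutputTokens` are assumptions.

```typescript
interface AppSettings {
  // Shared fields that llama paths keep using.
  temperature: number;
  topP: number;
  contextLength: number;
  maxTokens: number; // assumed name of the shared output-token setting
  // New engine-specific fields, absent before the upgrade.
  liteRTTemperature?: number;
  liteRTTopP?: number;
  liteRTContextLength?: number;
  liteRTMaxOutputTokens?: number;
}

// Seeds each liteRT* field from its shared counterpart, but only when the
// field is missing, so re-running the migration never overwrites user edits.
function seedLiteRTSettings(s: AppSettings): Required<AppSettings> {
  return {
    ...s,
    liteRTTemperature: s.liteRTTemperature ?? s.temperature,
    liteRTTopP: s.liteRTTopP ?? s.topP,
    liteRTContextLength: s.liteRTContextLength ?? s.contextLength,
    liteRTMaxOutputTokens: s.liteRTMaxOutputTokens ?? s.maxTokens,
  };
}
```

Because the function is idempotent, it can run on every load from storage rather than being gated on a version flag.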
Show amber dot on chat settings gear icon and amber tool icon/badge in the quick settings popover when more than 3 tools are active. ToolPickerSheet shows a one-time dismissable banner explaining the latency impact. Dismissed state is persisted, so it is never shown again once acknowledged. Also sets 3-tool default for new users and adds hint copy to the bottom of the tool picker. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…d LiteRT
- Revert applicationId, versionCode, versionName to match main
- Revert app_name to "Off Grid"
- Revert JVM heap args to match main
- Migrate kapt → KSP for Room compiler (required for Kotlin 2.2.0)
- Bump Kotlin 2.1.20 → 2.2.0, pin AGP to 8.8.2
- Add KSP plugin at 2.2.0-2.0.2
- Add LiteRT dependency: litertlm-android:0.11.0
Extract color literals to constants, move inline styles to stylesheets, remove unused AVAILABLE_TOOLS imports from Popovers and ChatInput.
… dependency in ChatInput ChatInput now receives showSettingsDot as a prop from ChatMessageArea, keeping the store read out of ChatInput and fixing test failures caused by unmocked useAppStore in existing ChatInput tests.
…ile overhead Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
and remove excessive comments Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Wrap the settings icon in a relative-positioned View so the dot is anchored to the icon bounds (18px) not the button bounds (32px).
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
LiteRT runs the tool loop natively via automaticToolCalling, so the JS MAX_TOTAL_TOOL_CALLS cap never applied to it — a single message could trigger unbounded tool calls and overflow the ~4096-token KV cache mid-turn, producing degenerate output or crashing. Add a per-turn counter in buildLiteRTToolCallHandler: calls 1-3 run normally; the 4th+ skips execution and returns a 'stop, answer now' nudge to the model. Counter resets each turn (closure rebuilt per generation). Loop stays native. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
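The per-turn cap can be pictured as a closure-held counter, as in the sketch below. This is an illustration rather than the code in `buildLiteRTToolCallHandler`: the names `createToolCallBudget` and `buildCappedToolCallHandler`, the `ToolExecutor` signature, and the nudge wording are all assumptions.

```typescript
type ToolExecutor = (
  name: string,
  args: Record<string, unknown>,
) => Promise<string>;

const MAX_NATIVE_TOOL_CALLS = 3;
// Assumed wording; the commit only describes a "stop, answer now" nudge.
const STOP_NUDGE =
  'Tool call limit reached for this turn. Stop calling tools and answer now.';

// Counts calls for one turn. A new budget is created per generation, which
// is what resets the counter between turns.
function createToolCallBudget(maxCalls: number = MAX_NATIVE_TOOL_CALLS) {
  let used = 0;
  return {
    tryConsume(): boolean {
      used += 1;
      return used <= maxCalls;
    },
  };
}

// Wraps the real executor: calls within the budget run normally, later calls
// skip execution and hand the nudge back to the model as the tool result.
function buildCappedToolCallHandler(
  execute: ToolExecutor,
  maxCalls: number = MAX_NATIVE_TOOL_CALLS,
): ToolExecutor {
  const budget = createToolCallBudget(maxCalls);
  return async (name, args) =>
    budget.tryConsume() ? execute(name, args) : STOP_NUDGE;
}
```

Returning a result instead of throwing keeps the native loop intact: the SDK still receives a tool response, and the model is steered toward answering.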
…ume retry When a download fails on the Models screen, the card now renders the error message, a red partial-progress bar, and Retry / Remove buttons directly inside the card boundary — matching the Download Manager UI. Tapping Retry calls backgroundDownloadService.retryDownload with the existing download ID so the native WorkManager resumes from the partial file via HTTP Range instead of starting a fresh download from 0. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Remove the two logs that fire in tight loops:
- llm.ts: reasoning_content chunk received (fired on every thinking token, with O(N²) string work serializing accumulated text each call)
- useDownloads.ts: mmproj progress and missed-entry debug logs (fired every 1.5s during download and on every progress event miss)
All other diagnostic logs (model load, download lifecycle, tool calls) are untouched: they fire once per user action and are useful for diagnosing real issues.
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
ChatsListScreen was subscribing to the entire chatStore and appStore with no selectors, causing it to re-render on every streaming token while mounted in the tab navigator. Actions moved to getState() and data fields use targeted selectors. Adds an informational banner above the chat input when a llama model is loaded with OpenCL selected as the inference backend, nudging users to switch to CPU in Settings. Does not show for LiteRT models or remote models. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
… StyleSheet Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…stener split
useDownloads.ts: remove useDownloadListeners() call; now fully independent.
App.tsx: mount useDownloadListeners() directly at root so listener registration is not lost after the split.
TextModelsTab handleRetryDownload:
- Android-only guard; iOS falls back to proceedDownload (fresh download)
- mmproj sidecar retry: set pending before retry, only call resetMmProjForRetry if native retry succeeded, set failed on error. Matches retryAndroidDownload in useDownloadManager exactly, which prevents silent vision loss on retry from the Models screen.
- onRetry branches on Platform.OS
- Use storeDownloads selector instead of getState() snapshot for storeEntry
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…callback Replace captured store.downloadIdIndex snapshot with a live getState() call inside the async callback, matching the pattern in reattachRetriedTextDownload in useDownloadManager.ts. Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
- ModelCard: 8 new tests covering failedState / FailedSection (new inline retry UI from fix/69d17d28)
- generationToolLoop: 4 new tests for LiteRT native tool-call cap introduced in fix/73f85ff8; verifies cap at 3, Aborted fast-path, and per-generation counter reset
- activeModelService loaders: fix stale-path test (add isVisionModel:true), add guard tests for text-only model and mmProjFileName repair sentinel
- scan.test.ts (new): unit tests for extractBaseName and findMatchingMmProj, plus curatedLiteRTRegistry entry lookup
- visionRepair: 3 additional branch tests (name-lookup false path, catalog-no-mmproj path, fileName vl-detection path)
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
…e ID
- Create shared test utilities: mocks.ts with AsyncStorage, logger, whisper service, and HTTP client factories
- Add store-specific reset helpers (resetDownloadStore, resetRemoteServerStore, resetWhisperStore, etc)
- Add act() wrapper utilities (actStoreUpdate, actAsyncStoreUpdate) to reduce boilerplate
- Refactor remoteServerStore.test.ts to use shared actStoreUpdate() instead of 50+ act() calls
- Refactor whisperStore.test.ts to use resetWhisperStore() from shared utilities
- Change litert bundle ID from ai.offgridmobile to ai.offgridmobile.litert (allows side-by-side install with Play Store version)
Co-Authored-By: Dishit Karia <hanmadishit74@gmail.com>
Improve perf
Test plan
- `npm run lint && npx tsc --noEmit && npm test` clean
- Android cold start: llama model loads and generates
- Android cold start: LiteRT model loads and generates
- Switch from llama to LiteRT mid-session, switch back
- LiteRT GPU path: load Gemma 4 E2B, verify ~38 chars/sec decode
- LiteRT CPU fallback when GPU init refused
- Stop generation mid-stream, send another message: no `FAILED_PRECONDITION`
- Long conversation: watch auto-compact trigger near 65%
- Vision: Gemma 3n with image attached, model responds about the image
- Vision blocked: text-only model with image attached, clean error
- Tool calling: `web_search` invocation completes and returns to model
- Settings: change LiteRT temperature, verify next response reflects it
- Settings: change LiteRT max tokens, reload banner appears, reload works
- Memory: load LiteRT + image gen, watch memory check
- CI build succeeds from a fresh checkout (gesture-handler patch applies)
- iOS build succeeds, no LiteRT UI visible on iOS device
# Add LiteRT as a second on-device inference engine

Adds [LiteRT](https://ai.google.dev/edge/litert) (Google's on-device inference runtime) as a peer to the existing llama.cpp engine. Android-only at ship. The JS layer is built platform-agnostic so the iOS Swift LiteRT SDK can drop in when Google releases it with no code changes on this side.

Engine is decided by file extension: `.gguf` runs on llama.cpp as before, `.litertlm` runs on LiteRT. No engine toggle in the UI; it follows the model.

## Why a second engine

LiteRT runs Gemma 4, Gemma 3n, and other LiteRT-LM packaged models with GPU acceleration paths that llama.cpp does not have on Android. CPU remains as the universal fallback.

Both engines stay in the app. Users keep their existing `.gguf` models and existing chats. Adding a `.litertlm` model is purely additive.

## Architecture
Follows the same shape as the existing llama.cpp / ONNX split: a peer service, not a layer on top of `llmService`.

### Type model

`DownloadedModel` is a discriminated union over `engine`. Engine-specific fields live on the engine-specific type. Consumers narrow with `model.engine === 'llama'` and let the compiler enforce correctness. `isLlamaModel` / `isLiteRTModel` type guards are exported for tests and selectors.

One `downloadedModels` collection, one `activeModelId`. The image-model pattern of separate collections is not used here because only one text model is loaded at a time and most call sites are engine-agnostic.

### Settings model

LiteRT-specific settings are flat fields prefixed with `liteRT`:

| Field | Type | Default |
|---|---|---|
| `liteRTBackend` | `'gpu' \| 'cpu'` | `'gpu'` |
| `liteRTTemperature` | `number` | `0.7` |
| `liteRTTopP` | `number` | `0.9` |
| `liteRTMaxTokens` | `number` | `4096` |

`liteRTMaxTokens` maps to LiteRT's `EngineConfig.maxNumTokens`: one budget covering system prompt, history, input, and output combined. Existing llama settings are untouched. Switching engines does not cross-contaminate preferences.

## Core features
### Text generation

`liteRTService.sendMessage()` streams tokens through four event types from the native module:

| Native event | JS callback |
|---|---|
| `litert_token` | `onToken`: response stream |
| `litert_thinking` | `onReasoning`: extended-reasoning stream |
| `litert_complete` | `onComplete` with `BenchmarkInfo` stats |
| `litert_error` | `onError` |
| `litert_tool_call` | |

The native `Conversation` object holds turn history internally. JS only sends the current user turn after the conversation is set up, avoiding re-prefilling the entire chat on every message.

`prepareConversation` only triggers a native `resetConversation` when `activeConversationId`, `activeSystemPrompt`, or `activeToolsJson` changes. Otherwise it reuses the live native session.

After load on GPU, the engine runs a one-token warmup against an empty prompt to prime the shader/kernel cache and avoid first-compile latency on the first real prompt.
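The reset rule for `prepareConversation` described above amounts to comparing three values against the last ones sent to native. A minimal sketch, assuming a hypothetical `ConversationKey` shape and `needsNativeReset` helper (neither name is from the PR):

```typescript
// The three inputs that define a native session, per the description above.
interface ConversationKey {
  conversationId: string;
  systemPrompt: string;
  toolsJson: string;
}

// A reset is needed when nothing is active yet or when any of the three
// values differs from what the live native session was created with.
function needsNativeReset(
  active: ConversationKey | null,
  next: ConversationKey,
): boolean {
  return (
    active === null ||
    active.conversationId !== next.conversationId ||
    active.systemPrompt !== next.systemPrompt ||
    active.toolsJson !== next.toolsJson
  );
}
```

Comparing the serialized tools JSON as a string is a cheap equality check, at the cost of treating a reordered but equivalent tool list as a change.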
### Thinking

Gemma 4 models need a `<|think|>` token prepended to the system prompt to activate extended reasoning. `applyGemma4ThinkToken` handles this for both engines.

Thinking tokens stream into a separate channel (`litert_thinking` native / `onReasoning` JS). The chat UI renders them in a collapsible "Thought process" section above the response. Disabling the thinking toggle removes the prefix and the model produces direct answers only.

### Context compaction
LiteRT models load with a fixed `maxNumTokens` budget. The service tracks cumulative tokens and auto-compacts at 65% of the budget, producing `[summary, recent turns]`.

Summarization runs inside the existing KV cache while ~35% headroom remains, then a single reset rebuilds the conversation with the compacted history. Recent-turn selection keeps the last 40% of context by char estimate, with a minimum of two turns. A 20-second timeout falls back to slice-only if summarization stalls.
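The thresholds above can be sketched as two small pure functions. This is a rough illustration, not the contents of `liteRTCompaction.ts`: the function names, the `Turn` shape, the 4-characters-per-token estimate, and the reading of "last 40% of context" as 40% of the token budget are all assumptions.

```typescript
interface Turn {
  role: 'user' | 'model';
  text: string;
}

const COMPACT_THRESHOLD = 0.65; // compact at 65% of the budget
const RECENT_FRACTION = 0.4; // keep roughly the last 40% as recent turns
const MIN_RECENT_TURNS = 2;
const CHARS_PER_TOKEN = 4; // assumed estimate, not from the PR

function shouldCompact(cumulativeTokens: number, maxTokens: number): boolean {
  return cumulativeTokens >= maxTokens * COMPACT_THRESHOLD;
}

// Walks backwards from the newest turn, keeping turns while they fit the
// character budget, and always keeping at least MIN_RECENT_TURNS.
function selectRecentTurns(history: Turn[], maxTokens: number): Turn[] {
  const charBudget = maxTokens * RECENT_FRACTION * CHARS_PER_TOKEN;
  const recent: Turn[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const turn = history[i];
    const fits = used + turn.text.length <= charBudget;
    if (!fits && recent.length >= MIN_RECENT_TURNS) break;
    recent.unshift(turn);
    used += turn.text.length;
  }
  return recent;
}
```

Everything older than the selected turns would be the input to summarization; if that step times out, the selected turns alone correspond to the slice-only fallback.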
Implementation: `src/services/liteRTCompaction.ts` is a pure function that takes `history + maxTokens + cumulativeTokens` and returns the reset call to make.

### Tools
Tool calling is handled natively by the LiteRT SDK. Tools are passed in as JSON at conversation reset time. When the model emits a tool call, the SDK fires `litert_tool_call` to JS with the tool name and arguments. JS executes the tool and calls `respondToToolCall(id, result)` back to the native side. The SDK feeds the result into the conversation and continues generation internally, so JS never sees the second prefill.

Three text-based parsing fallbacks exist for models that do not use the SDK's tool format (Gemma 4 emits a non-standard `<|tool_call>` syntax). These parse the response text after generation and execute tools through the same path.

### Vision
LiteRT supports multimodal models (Gemma 3n E2B / E4B variants) with a GPU-accelerated vision encoder. The `liteRTVision` boolean on the model record gates this:

- `.litertlm` file
- `liteRTVision === true`, the engine loads with `visionBackend = Backend.GPU()`

The dialog is needed because LiteRT does not expose introspection for vision support on a loaded model.
### Backend selection and fallback

The native module does a two-tier fallback at load time with per-tier timeouts (GPU 20s, CPU 15s). The actually-loaded backend is reported back via `getActiveBackend()` so the UI can show "Requested GPU, running on CPU" if fallback occurs.

## Model management
### Import flow

No LiteRT model catalog or HuggingFace download flow in this PR: users sideload `.litertlm` files. Curated downloads will come in a follow-up.

Recommended models for sideload:

- `google/gemma-3n-E2B-it-litert-lm`: vision-capable, smaller
- `litert-community/gemma-4-E2B-it-litert-lm`: text-only, larger

### Settings UI
`ModelSettingsScreen` and `GenerationSettingsModal` render one of two sibling components based on the active model:

- `LiteRTTextSettings` shows: Temperature, Max Tokens, Top P (advanced), Acceleration (GPU/CPU), Show Generation Details, Thinking toggle.
- `LlamaTextSettings` shows: Temperature, Max Tokens, Context Length, Top P, Repeat Penalty, CPU Threads, Batch Size, Backend, Flash Attention, KV Cache Type, Model Loading Strategy.

LiteRT hides controls that do not map to its SDK (no flash attention, no GPU layer count, no manual thread count).

Settings that bake into the engine at load time (`backend`, `maxTokens`) trigger a "settings changed, reload required" banner. Settings the engine reads per-generation (`temperature`, `top-P`) take effect on the next reset.

### Memory budget
The memory check at model load now sums RAM across both engines plus any loaded image model. Loading a LiteRT model while llama is resident unloads llama first, and vice versa. Loading a LiteRT model alongside an image model checks both budgets to prevent OOM.
## Build infrastructure

### Kotlin 2.2.0

The LiteRT SDK requires Kotlin 2.x. `kotlinVersion` is bumped to `2.2.0` in `android/build.gradle` along with the Kotlin Gradle plugin classpath. This unlocks the K2 compiler: most modules compile faster and produce slightly smaller bytecode.

### `react-native-gesture-handler` patch

Kotlin 2.2's K2 compiler tightened smart-cast inference inside lambdas when the receiver is a property. `react-native-gesture-handler@2.30.0` relies on pre-K2 behavior in `findRootHelperForViewAncestor`.

Same JVM bytecode, same runtime behavior. Applied via `patch-package` in `postinstall`. Will self-remove when gesture-handler ships a K2-compatible version upstream.

### Native module
`android/app/src/main/java/ai/offgridmobile/litert/LiteRTModule.kt` registers as a standard `ReactContextBaseJavaModule`:

- `loadModel`
- `resetConversation`
- `sendMessage`
- `respondToToolCall`
- `stopGeneration`
- `unloadModel`: `Conversation`, then `Engine`, in that order
- `getMemoryInfo`: `DeviceStatsChip`

OpenCL opted in via `AndroidManifest`.

## Platform readiness

`liteRTService` detects availability by native module presence (`!!NativeModules.LiteRTModule`), not `Platform.OS === 'android'`. When the iOS Swift LiteRT SDK is released and registered under the same JS name, the JS service starts working on iOS with no code changes here.

## Behavior summary

| | `.gguf` model | `.litertlm` model |
|---|---|---|