A fully on-device RAG assistant for Android. Drop in PDFs, .txt, .md, or .docx files, ask questions, get answers — without anything leaving your phone.
No cloud calls. No accounts. No telemetry. The model runs locally on CPU via MediaPipe + LiteRT, embeddings are computed with a local TextEmbedder, and chunks live in an ObjectBox vector DB on disk.
- Chat with your documents. Upload one file or a whole pile of them; the app chunks, embeds, and indexes them. Ask questions and you get answers with citations to the source page.
- Workspaces. Keep work, study, and personal libraries separate — each workspace has its own documents, conversations, and system prompt.
- Multiple conversations per workspace. Gemini-style left drawer with every past chat, auto-titled from your first question.
- Pick your model. Built-in catalog of small LLMs (Gemma 3 1B, Gemma 2 2B, Phi-4 Mini, Qwen 2.5, Gemma 3n): tap to download, tap to switch. You can also paste any MediaPipe .task URL.
- Prompt presets. Balanced, Strict RAG, Tutor, Summarizer, Code Reviewer, or write your own per-workspace prompt.
Grab the latest APK from Releases and sideload it on an Android 10+ phone with about 6 GB of RAM (the bigger models want more). Open the app, hit Settings → Recommended for your device, and download a model. The Gemma 3 1B (int4) is the fastest place to start.
git clone https://github.com/SharadhNaidu/PocketRag.git
cd PocketRag
# Point local.properties at your Android SDK
echo "sdk.dir=/path/to/Android/Sdk" > local.properties
./gradlew :app:assembleDebug
adb install -r app/build/outputs/apk/debug/app-debug.apk
Models are distributed via Hugging Face at SharadhNaiduTrains/PocketRag. The mirror script (scripts/upload_models.py) populates that repo from upstream LiteRT sources.
┌─────────────────────┐    chunks + vectors    ┌──────────────────┐
│  DocumentProcessor  │ ─────────────────────▶ │  ObjectBox HNSW  │
│  (PDF/txt/md/docx)  │                        └──────────────────┘
└─────────────────────┘                                 │
                                                        ▼
                                               ┌──────────────────┐
              user query ──── TextEmbedder ──▶ │  RetrievalCore   │
                                               │ (semantic + MMR) │
                                               └──────────────────┘
                                                        │ top-K
                                                        ▼
                                               ┌──────────────────┐
                                               │    RagManager    │
                                               │(prompt + history)│
                                               └──────────────────┘
                                                        │
                                                        ▼
                                               ┌──────────────────┐
                                               │  MediaPipe LLM   │
                                               │   (CPU, .task)   │
                                               └──────────────────┘
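The DocumentProcessor stage in the diagram turns each document into chunks, and the project describes its chunking as paragraph-aware with overlap. The sketch below is one plausible way to do that, not the app's actual code: the function name `chunk_paragraphs`, the character-based limits `max_chars` and `overlap_chars`, and the choice to split on blank lines are all assumptions made for illustration.

```python
import re


def chunk_paragraphs(text: str, max_chars: int = 800, overlap_chars: int = 100) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    Hypothetical helper: each new chunk starts with the last
    overlap_chars characters of the previous chunk, so context
    carries across the boundary. A single paragraph longer than
    max_chars is kept whole rather than split mid-paragraph.
    """
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Close the current chunk and seed the next one with its tail.
            chunks.append(current)
            tail = current[-overlap_chars:] if overlap_chars > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Cutting only at paragraph boundaries keeps sentences intact, and the overlap means a fact that straddles two chunks is still retrievable from at least one of them.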
Chunking is paragraph-aware with overlap. Retrieval blends semantic top-K with keyword fallback and MMR-style diversity, then merges adjacent chunks so the model gets contiguous context. Prompt strategy is intent-aware — summary, reasoning, and extraction queries get different scaffolds.
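To make the retrieval description concrete, here is a minimal sketch of MMR-style selection followed by grouping adjacent chunks. It is an illustration under assumptions, not the RetrievalCore implementation: cosine similarity, the trade-off weight `lambda_`, and the names `mmr_select` and `merge_adjacent` are hypothetical, and the keyword fallback is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, candidates, k=4, lambda_=0.7):
    """Greedy maximal marginal relevance.

    candidates is a list of (chunk_index, vector) pairs. Each round picks
    the candidate with the best balance of relevance to the query and
    dissimilarity to what was already picked. lambda_=1.0 is pure relevance.
    """
    remaining = list(candidates)
    selected = []

    def score(item):
        _, vec = item
        relevance = cosine(query_vec, vec)
        # Penalize similarity to the closest already-selected chunk.
        redundancy = max((cosine(vec, s_vec) for _, s_vec in selected), default=0.0)
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [idx for idx, _ in selected]


def merge_adjacent(indices):
    """Group chunk indices into runs of consecutive positions."""
    runs = []
    for idx in sorted(set(indices)):
        if runs and idx == runs[-1][-1] + 1:
            runs[-1].append(idx)
        else:
            runs.append([idx])
    return runs
```

Each run returned by `merge_adjacent` can then be concatenated in document order, so that neighboring chunks reach the model as one contiguous passage instead of separate fragments.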
The default LLM backend is CPU. GPU was tried on Realme/Oplus devices and got reliably killed by Athena's GPU-memory watchdog at the ~3.5 GB cap, so it's pinned off until that's worked around.
License: MIT.
@SharadhNaidu — sole maintainer. PRs welcome once the project stabilizes.