Private AI that runs on your machine. No cloud. No account. No compromise.
Studiomc is a desktop AI assistant that runs large language models entirely on your hardware. It auto-detects your system, recommends the best model, and gives you a ChatGPT-quality experience — fully offline, fully private.
| Problem | Studiomc |
|---|---|
| Local AI tools feel like dev tools | Polished UI — install, click, chat |
| Users pick the wrong model and get frustrated | Hardware scan + automatic model recommendation |
| Document Q&A is slow and uncited | CLaRa compression-native retrieval with citations |
| No way to know if a model will run well | Speed ratings and performance predictions in plain English |
| Multiple backends, multiple interfaces | One unified interface across Ollama, LM Studio, and frontier APIs |
- One-click install — Working chat in under 2 minutes
- Autopilot model selection — Scans your hardware, picks the best model automatically
- Multi-backend — Auto-detects Ollama and LM Studio, connects frontier APIs (OpenAI, Anthropic)
- Chat — Streaming responses, conversation history, branching, memory
- Local OpenAI-compatible API — Integrate with any tool that speaks OpenAI
- Privacy-first — Everything runs locally. No telemetry. No accounts. No cloud unless you explicitly opt in.
- Docs mode — Upload PDF/TXT/MD, ask questions, get cited answers grounded in your documents
- CLaRa — Compression-native retrieval: semantic embeddings (sentence-transformers or TF-IDF fallback), per-collection indexes, top-k retrieval with citations (p95 ≤ 150 ms)
- RAG pipeline — Extract → chunk (500–1000 tokens, overlap) → index → retrieve → generate with source citations
- Recursive reasoning loop — Plan → tool → observe → answer; supports cited (CLaRa), fast, and investigate modes
- LRE (Local Reasoning Environment) — Safe tool layer for the loop: search, grep, open, summarize, table_extract, cite; sandboxed with call budgets
- Investigate mode — Full reasoning trace visibility: tool calls, retrieved chunks, and final answer in one view
- SpliceLLM — Our built-in out-of-core engine: runs models of any size by streaming layers from disk, holding only one layer in memory at a time. The model splitter turns HuggingFace checkpoints into per-layer safetensors; prefetch overlaps I/O with compute. Enables large models on limited VRAM/RAM (e.g. a 70B model on 4 GB, with clear "slow mode" expectations)
- Multi-backend inference — Bundled engine, Ollama, LM Studio, or frontier APIs; router picks the right backend per model
- Performance dashboard — Speed rating (Fast/OK/Slow), throughput, system metrics
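The out-of-core idea behind SpliceLLM can be sketched in a few lines: persist each layer to its own file, then run the forward pass loading one layer at a time. This is a toy illustration in plain Python (pickled lists standing in for per-layer safetensors, a matrix-vector product standing in for a transformer layer), not the real engine:

```python
import os
import pickle
import tempfile

def split_model(layers, model_dir):
    """Write each layer's weights to its own file (stand-in for the
    per-layer safetensors produced by the model splitter)."""
    os.makedirs(model_dir, exist_ok=True)
    for i, w in enumerate(layers):
        with open(os.path.join(model_dir, f"layer_{i:03d}.pkl"), "wb") as f:
            pickle.dump(w, f)
    return len(layers)

def matvec(w, x):
    # Plain Python matrix-vector product: one layer's "forward pass".
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def forward_out_of_core(model_dir, x, n_layers):
    """Run the model while holding only one layer in memory at a time."""
    for i in range(n_layers):
        with open(os.path.join(model_dir, f"layer_{i:03d}.pkl"), "rb") as f:
            w = pickle.load(f)   # load this layer from disk
        x = matvec(w, x)         # compute
        del w                    # drop it before loading the next
    return x

# Toy 2-layer "model": identity, then doubling.
layers = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
with tempfile.TemporaryDirectory() as d:
    n = split_model(layers, d)
    y = forward_out_of_core(d, [3, 4], n)
print(y)  # [6, 8]
```

The real engine adds prefetching so the next layer's disk read overlaps with the current layer's compute; the sketch above is strictly sequential.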
No prerequisites. No Ollama. No Python. Just install and chat.
- Download the latest .dmg from Releases
- Open the DMG and drag Studiomc to Applications
- Launch Studiomc — it scans your hardware and recommends a model
- The model downloads from HuggingFace automatically
- You're chatting in under 2 minutes
If you already have Ollama or LM Studio installed, Studiomc auto-detects them and adds their models to your model list. No configuration needed.
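Auto-detection can be as simple as probing the default local ports. A minimal sketch, assuming the documented defaults (11434 for Ollama's server, 1234 for LM Studio's); the real detector would confirm by querying each API rather than just the socket:

```python
import socket

def port_open(port, host="127.0.0.1", timeout=0.25):
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def detect_local_backends():
    """Probe the default local ports: 11434 (Ollama), 1234 (LM Studio)."""
    return {
        "ollama": port_open(11434),
        "lm_studio": port_open(1234),
    }

print(detect_local_backends())
```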
```bash
# Clone
git clone https://github.com/mchawda/studiomc.git
cd studiomc

# Build macOS app + DMG
./scripts/build-macos.sh
# Output: dist/Studiomc-<version>-macos.dmg
```

Requires: Flutter SDK (stable channel)
```
Flutter App ── HTTP/WS ──▶ Local Supervisor
                             ├── Inference Router
                             │   ├── Ollama (auto-detected)
                             │   ├── LM Studio (auto-detected)
                             │   ├── SpliceLLM (built-in, out-of-core)
                             │   └── Frontier APIs (optional)
                             ├── Model Manager
                             ├── Document Service (extract, chunk, store)
                             ├── CLaRa (compression-native retrieval + cited answer)
                             ├── Orchestrator (recursive reasoning loop: plan → tool → answer)
                             └── LRE (Local Reasoning Environment — tools for the loop)
```
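CLaRa's TF-IDF fallback path (used when sentence-transformer embeddings are unavailable) can be sketched with a hand-rolled index and cosine-similarity top-k; chunk ids are returned alongside text so answers can cite their sources. A toy illustration, not the production retriever:

```python
import math
from collections import Counter

def tfidf_index(chunks):
    """Build a tiny TF-IDF index over text chunks."""
    docs = [Counter(c.lower().split()) for c in chunks]
    n = len(docs)
    df = Counter(t for d in docs for t in d)          # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    return vecs, idf

def top_k(query, chunks, k=2):
    """Return the k best (chunk_id, text) pairs for a query."""
    vecs, idf = tfidf_index(chunks)
    q = {t: tf * idf.get(t, 0.0)
         for t, tf in Counter(query.lower().split()).items()}

    def cos(a, b):
        num = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0

    ranked = sorted(range(len(chunks)), key=lambda i: cos(q, vecs[i]),
                    reverse=True)
    return [(i, chunks[i]) for i in ranked[:k]]

chunks = [
    "the quarterly report shows revenue grew ten percent",
    "installation requires flutter and a stable network",
    "revenue growth was driven by the enterprise segment",
]
hits = top_k("what drove revenue growth", chunks, k=2)
print([i for i, _ in hits])  # [2, 0]
```

The returned chunk ids are what lets a generated answer point back at its sources.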
- Frontend: Flutter (Dart) — macOS, Windows, iOS, Android
- Inference: SpliceLLM (layer streaming from disk), optional Ollama / LM Studio / frontier API
- Models: HuggingFace GGUF or safetensors; splitter produces per-layer files for out-of-core
- Backend: Python FastAPI — documents, CLaRa RAG, orchestrator, LRE
- Storage: SQLite + filesystem
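The per-model routing done by the inference router can be sketched as a simple priority rule. The prefix list, preference order, and model names below are assumptions for illustration, not the shipped policy:

```python
# Assumption: frontier-hosted models are recognized by name prefix.
FRONTIER_PREFIXES = ("gpt-", "claude-")

def route(model, available):
    """Pick a backend for a model name.
    `available` maps backend name -> set of model names it serves."""
    if model.startswith(FRONTIER_PREFIXES):
        return "frontier_api"
    for backend in ("ollama", "lm_studio"):   # prefer a running local runtime
        if model in available.get(backend, set()):
            return backend
    return "splicellm"                        # fall back to the bundled engine

# Hypothetical model names for illustration.
available = {"ollama": {"llama3:8b"}, "lm_studio": set()}
print(route("llama3:8b", available))       # ollama
print(route("claude-example", available))  # frontier_api
print(route("some-local-70b", available))  # splicellm
```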
```
studiomc/
  studiomc_app/      # Flutter app
    lib/
      screens/       # Chat, Models, Documents, Settings, Training, etc.
      widgets/       # Reusable UI components
      services/      # API clients, inference, orchestrator, settings
      models/        # Data models
  services/          # Python backend
    supervisor/      # Process manager
    inference/       # Out-of-core engine, splitter, router (Ollama/LM Studio/frontier)
    model_manager/   # Model downloads, registry, autopilot
    documents/       # Document extraction, chunking, storage
    clara/           # CLaRa — compression-native retrieval + cited answer
    lre/             # LRE — tools for orchestrator (search, grep, summarize, etc.)
    orchestrator/    # Recursive reasoning loop (plan → tool → observe → answer)
    training/        # Private training / personalization
  scripts/           # Build & release tooling
  product/           # Product specs & design docs
```
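The chunking step in the documents service (500-1000 tokens per chunk, with overlap) might look like the following sketch. It takes an already-tokenized list; `size=800` and `overlap=100` are illustrative values inside the stated range, and the real pipeline's tokenizer and boundaries may differ:

```python
def chunk(tokens, size=800, overlap=100):
    """Split a token list into windows of `size` tokens, each sharing
    `overlap` tokens with the previous window, so no fact is lost at
    a chunk boundary."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last window reached the end
            break
    return chunks

tokens = [f"t{i}" for i in range(2000)]
parts = chunk(tokens)
print(len(parts), len(parts[0]))  # 3 800
```

Each chunk keeps its position in the source, which is what retrieval later cites.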
The project follows four release phases. Phase 1 (core chat, models, docs, CLaRa, SpliceLLM) is mostly complete. Phases 2–4 (orchestrator/LRE, Personalize wizard, investigate mode) are in progress. See product/product-roadmap.md for details.
| Metric | Target |
|---|---|
| Install to first chat | ≤ 2 min |
| First token latency | ≤ 2.5 s (recommended models) |
| Throughput | ≥ 10 tok/s GPU, ≥ 4 tok/s CPU |
| Document retrieval | p95 ≤ 150 ms |
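The plain-English speed rating on the dashboard can be derived from measured throughput against the targets above. The halfway cutoff separating "OK" from "Slow" is an assumption for illustration:

```python
def speed_rating(tokens_per_s, on_gpu):
    """Map measured throughput to a plain-English rating.
    Targets from the table above: >= 10 tok/s on GPU, >= 4 tok/s on CPU."""
    target = 10.0 if on_gpu else 4.0
    if tokens_per_s >= target:
        return "Fast"
    if tokens_per_s >= target / 2:   # assumed cutoff for "OK"
        return "OK"
    return "Slow"

print(speed_rating(12.0, on_gpu=True))   # Fast
print(speed_rating(6.0, on_gpu=True))    # OK
print(speed_rating(1.5, on_gpu=False))   # Slow
```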
Contributions welcome. Please open an issue first to discuss what you'd like to change.
Built on the shoulders of:
- AirLLM — Inspired SpliceLLM’s out-of-core (layer-streaming) approach
- Ollama — Local model runtime
- Flutter — Cross-platform UI
Source-available — free to use, not open-source. See LICENSE.md for full terms and THIRD_PARTY_NOTICES.md for open-source attribution. Copyright 2024-2026 NIA Pte Ltd.
