Self-hosted AI agent infrastructure with hybrid memory search, LLM routing, voice pipeline, and autonomous operations — all on local GPU inference.
Rasputin Stack is the infrastructure behind a self-hosted AI agent system running on bare-metal GPUs. It combines dense vector search, sparse keyword retrieval, a knowledge graph, and a cross-encoder reranker into a hybrid search pipeline — backed by autonomous cron jobs for memory maintenance.
Design goals:
- Hybrid search — vector + BM25 + graph + reranker fusion
- $0/query for standard operations (all inference runs on local hardware)
- Autonomous maintenance — cron-driven enrichment, dedup, health monitoring
- Multi-interface — Telegram, Discord, voice (WebRTC), web dashboard
┌──────────────────────────────────────────────────────────────────┐
│                            INTERFACES                            │
│  Telegram · Discord · Voice (WebRTC) · Browser · Web Dashboard   │
└──────────────────────┬───────────────────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │      Agent Gateway      │
          │  Sessions · Sub-Agents  │
          │ Crons · Tools · Safety  │
          └────────────┬────────────┘
                       │
          ┌────────────▼────────────┐
          │    LLM Routing Proxy    │
          │    Session Affinity     │
          │ Quality Gate · Failover │
          │      Cost Logging       │
          └──┬─────┬─────┬─────┬───┘
             │     │     │     │
         ┌───▼─┐ ┌▼───┐ ┌▼──┐ ┌▼────┐
         │Local│ │Free│ │API│ │Cloud│
         │ GPU │ │Tier│ │Key│ │ API │
         │122B │ │    │ │   │ │     │
         └─────┘ └────┘ └───┘ └─────┘
┌──────────────────────────────────────────────────────────────────┐
│                           MEMORY LAYER                           │
│                                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────┐  ┌──────────┐              │
│  │  Qdrant  │  │ FalkorDB │  │ BM25 │  │ Reranker │              │
│  │  Dense   │  │  Graph   │  │Sparse│  │  BGE v2  │              │
│  └──────────┘  └──────────┘  └──────┘  └──────────┘              │
│                                                                  │
│  ┌──────────┐  ┌──────────┐  ┌─────────────────────┐             │
│  │ BrainBox │  │  STORM   │  │    Multi-Tenant     │             │
│  │Procedural│  │ Wiki Gen │  │   Agent Isolation   │             │
│  └──────────┘  └──────────┘  └─────────────────────┘             │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│  VOICE: Whisper STT ──► LLM ──► Qwen3 TTS ──► Audio              │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ AUTONOMOUS: Cron jobs for fact extraction, enrichment, research, │
│  anomaly detection, episode detection, dedup, health monitoring  │
└──────────────────────────────────────────────────────────────────┘
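The routing proxy in the diagram combines session affinity, a quality gate, and failover. The sketch below is one way those three ideas could fit together; it is not the code in proxy/. The `Router` class, the provider tuples, and the gate predicate are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical provider shape: (name, function mapping a prompt to a completion).
Provider = Tuple[str, Callable[[str], str]]


class Router:
    """Illustrative router: affinity first, then failover, gated on reply quality."""

    def __init__(self, providers: List[Provider],
                 quality_gate: Callable[[str], bool]) -> None:
        self.providers = providers            # default priority order, e.g. local GPU first
        self.quality_gate = quality_gate      # returns True when a reply is acceptable
        self.affinity: Dict[str, str] = {}    # session id -> provider that last succeeded

    def _order(self, session_id: str) -> List[Provider]:
        """Put the session's pinned provider first; keep the default order otherwise."""
        pinned = self.affinity.get(session_id)
        # sorted() is stable, so providers other than the pinned one keep their order.
        return sorted(self.providers, key=lambda p: p[0] != pinned)

    def complete(self, session_id: str, prompt: str) -> Tuple[str, str]:
        for name, call in self._order(session_id):
            try:
                reply = call(prompt)
            except Exception:
                continue                      # provider errored: fail over to the next one
            if self.quality_gate(reply):
                self.affinity[session_id] = name
                return name, reply
            # Reply rejected by the gate: also fail over.
        raise RuntimeError("all providers failed or were rejected by the quality gate")
```

With a provider list such as `[("local", call_local), ("cloud", call_cloud)]`, a session stays on whichever provider last produced an accepted reply and falls through to the next one only when a call raises or the gate rejects the output.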
Every query passes through all four stages:
Query
  │
  ├─► 1. DENSE VECTORS (Qdrant, nomic-embed-text)
  │      Semantic similarity search
  │
  ├─► 2. BM25 SPARSE (keyword precision)
  │      Exact term matching for names, dates, codes
  │
  ├─► 3. KNOWLEDGE GRAPH (FalkorDB)
  │      Entity relationships and traversal
  │
  └─► 4. CROSS-ENCODER RERANKER (bge-reranker-v2-m3)
         Final relevance scoring and fusion
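As a rough illustration of the flow above, the sketch below pools candidates from the three retrieval stages, removes duplicates, and lets a single scorer produce the final ranking. It is a minimal sketch, not the memory engine's code: `hybrid_search`, the retriever callables, and `overlap_score` are hypothetical, and the token-overlap scorer is only a toy stand-in for a cross-encoder such as bge-reranker-v2-m3.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]   # query -> ranked passage texts
Scorer = Callable[[str, str], float]     # (query, passage) -> relevance score


def hybrid_search(query: str, retrievers: Dict[str, Retriever],
                  score: Scorer, top_k: int = 3) -> List[str]:
    """Pool candidates from every retriever, then rank them with one scorer."""
    seen = set()
    candidates: List[str] = []
    for retrieve in retrievers.values():
        for passage in retrieve(query):
            if passage not in seen:      # drop passages already found by another stage
                seen.add(passage)
                candidates.append(passage)
    ranked = sorted(candidates, key=lambda p: score(query, p), reverse=True)
    return ranked[:top_k]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


if __name__ == "__main__":
    # Stub retrievers standing in for Qdrant, BM25, and FalkorDB lookups.
    stubs: Dict[str, Retriever] = {
        "dense": lambda q: ["alice moved to berlin", "bob likes tea"],
        "bm25": lambda q: ["alice moved to berlin", "invoice 4471 paid"],
        "graph": lambda q: ["alice works with bob"],
    }
    print(hybrid_search("alice berlin", stubs, overlap_score, top_k=2))
```

Scoring every pooled candidate with one model sidesteps the problem that dense, sparse, and graph scores are not directly comparable, at the cost of one reranker call per candidate.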
| Directory | What's Inside |
|---|---|
| memory/ | Hybrid memory engine: Qdrant vectors, BM25, FalkorDB graph, reranker |
| tools/ | Agent tools: browser automation, RAG, memory ops, AI council |
| tools/storm-wiki/ | STORM wiki generator from memory |
| tools/brainbox/ | BrainBox procedural memory (Hebbian learning) |
| proxy/ | LLM routing proxy: multi-provider failover, quality gate |
| dashboard/ | Web dashboard: sessions, playground, cost tracking |
| ui/ | React 19 + Next.js frontend |
| backend/ | Express API: JWT RBAC, PostgreSQL, WebSocket streaming |
| voice/ | Voice pipeline: Qwen3 TTS server, Whisper STT, WebRTC |
| agents/ | Multi-tenant agent workspaces |
| crons/ | Autonomous scheduled jobs |
| cli/ | CLI interface: chat, search, sessions |
| browser/ | Chrome extension (Manifest V3) |
| council/ | Multi-model debate / consensus engine |
| selfplay/ | Self-play pipeline: task generation and evaluation |
| research/ | Research: context compaction study, prompt compilation |
| monitoring/ | Anomaly detection, health checks, forecasting |
| doctor/ | System diagnostics and alerting |
| desktop/ | Electron desktop wrapper |
| docs/ | Architecture, deployment, and API documentation |
The system is designed for bare-metal GPU inference:
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 48 GB | 192+ GB across multiple GPUs |
| CPU | 16 cores | 32+ cores |
| RAM | 64 GB | 256 GB |
| OS | Linux | Arch Linux / Ubuntu |
All inference runs locally — no API costs for standard operations.
This is a production system, not a turnkey install. To adapt it:
- Qdrant: `docker run -p 6333:6333 qdrant/qdrant`
- FalkorDB: `docker run -p 6379:6379 falkordb/falkordb`
- Ollama: install it and pull `nomic-embed-text` for embeddings
- Memory engine: `cd memory && python memory_engine.py`
- Proxy: `cd proxy && python proxy.py`
- Dashboard: `cd dashboard && npm install && node server.js`
See docs/DEPLOYMENT_GUIDE.md for full setup instructions.
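Before starting the memory engine it can help to confirm that the backing services are reachable. The script below is a small preflight sketch, not part of the repository: it only checks that a TCP port accepts connections. The Qdrant and FalkorDB ports come from the `docker run` commands above; the Ollama port is Ollama's default (11434) and is assumed to be unchanged. `SERVICES`, `port_open`, and `preflight` are hypothetical names.

```python
import socket
from typing import Dict, Tuple

# Host/port pairs to probe. Adjust if the services run elsewhere.
SERVICES: Dict[str, Tuple[str, int]] = {
    "qdrant": ("127.0.0.1", 6333),
    "falkordb": ("127.0.0.1", 6379),
    "ollama": ("127.0.0.1", 11434),   # assumption: default Ollama port
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(services: Dict[str, Tuple[str, int]] = SERVICES) -> Dict[str, bool]:
    """Probe every service and report which ones accept connections."""
    return {name: port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name:10s} {'up' if ok else 'DOWN'}")
```

An open port only shows that something is listening; it does not verify that the embedding model has been pulled or that the databases contain data.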
| Layer | Technology |
|---|---|
| Search | Qdrant, FalkorDB, BM25, bge-reranker-v2-m3 |
| Inference | Ollama, llama.cpp — Qwen 3.5 122B MoE |
| Frontend | Next.js 14, React 19, TypeScript, shadcn/ui |
| Backend | Node.js, Express, PostgreSQL, WebSocket |
| Voice | Qwen3-TTS, faster-whisper, WebRTC |
| Proxy | Python / aiohttp, SSE streaming |
| Infra | PM2, Docker, systemd |
MIT — See LICENSE