VocalMind

VocalMind is a modular AI ecosystem that integrates speech processing (ASR, diarization, synthesis) with retrieval-augmented generation (RAG) to build context-aware conversational agents for call center and telecom use cases.


Architecture

| Component | Tech Stack | Description |
| --- | --- | --- |
| Backend | FastAPI, SQLModel, asyncpg | Central API gateway with auth (JWT/Google OAuth), Supabase integration, and dispute handling. |
| Frontend | React 18, Vite, Tailwind v4, MUI, Radix UI | Manager and agent dashboards with session analysis. Tested with Cypress E2E and Vitest. |
| VAD | Silero VAD, FastAPI | Voice Activity Detection microservice. |
| WhisperX | WhisperX, pyannote, FastAPI | Automatic Speech Recognition and Diarization microservice. |
| Emotion | Transformers, FastAPI | Speech emotion recognition microservice. |
| RAG | LlamaIndex, Qdrant, Groq, Ollama | Retrieval-Augmented Generation for knowledge queries. |
| Ingestion | LlamaIndex | Automated pipeline for RAG document ingestion. |
| Explainability | FastAPI, React, LLM/NLI attribution | Evidence-anchored layer that links triggers and compliance verdicts to transcript spans and retrieved policy/SOP evidence. |
| Research | Jupyter | Reference experiments for speech pipelines and voice generation. |

Quick Start

Prerequisites

  • Docker Desktop (includes Docker Compose v2+)
  • Python 3.12+ (via uv) — only needed for local development
  • Node.js 20+ (via pnpm v10+) — only needed for local frontend development
  • Git LFS — if your fork includes large test fixtures
  • Hugging Face token (HF_TOKEN) — required for WhisperX diarization (pyannote)
  • Groq API key (GROQ_API_KEY) — required for LLM chains and trigger evaluation

1. Configure Environment Variables

cp .env.example .env
cp backend/.env.example backend/.env

Edit backend/.env and fill in the required secrets:

| Variable | Required | Notes |
| --- | --- | --- |
| GROQ_API_KEY | Yes | Get from https://console.groq.com; used for LLM chains and trigger evaluation |
| HF_TOKEN | Yes | Get from https://huggingface.co/settings/tokens; used for pyannote diarization |
| SECRET_KEY | Yes | Generate with openssl rand -hex 32 |
| IS_LOCAL | No | true = Docker containers (default in docker-compose.yml); false = Kaggle remote |

For Option A (full Docker), the root .env provides GROQ_API_KEY and HF_TOKEN, which docker-compose.yml passes through to the containers. For Option B (local dev), set these in backend/.env and point the service URLs at localhost.
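
A minimal backend/.env might look like the following (values are placeholders; see backend/.env.example for the full variable list):

GROQ_API_KEY=gsk_your_key_here
HF_TOKEN=hf_your_token_here
# Generate with: openssl rand -hex 32
SECRET_KEY=replace-with-generated-hex
IS_LOCAL=true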

2. Prepare Speaker-Role Model (WhisperX)

WhisperX and the backend both mount the DistilBERT speaker-role classifier. Place the export archive at the repo root and run:

make prepare-speaker-model

This extracts services/whisperx/models/speaker_role/distilbert/ from speaker_classifier_export.zip (which is gitignored). Without this step, WhisperX will fail at startup. Use --delete-zip to remove the archive after extraction:

python infra/scripts/prepare_speaker_role_model.py --delete-zip

3. Build Docker Images

make build

This builds all custom images: backend, frontend, ingestion (RAG), VAD, emotion, and WhisperX.

Windows tip: If you encounter transient Docker daemon errors (EOF, 500 Internal Server Error, rpc Unavailable), use the built-in retry script instead:

make build-retry

This retries up to 4 times with 12-second delays for known transient Docker Desktop issues on Windows.

First build: The emotion and WhisperX images use CUDA base images and download several GB of PyTorch/NVIDIA libraries. Expect 20–40 minutes, depending on connection speed and hardware.

Option A. Run the Full Stack in Docker

Start all services (Database, Backend, Frontend, Ollama, Qdrant, Ingestion, VAD, Emotion, WhisperX):

make up

First-time startup also requires pulling the Ollama embedding model:

docker exec vocalmind-ollama ollama pull snowflake-arctic-embed2

Then seed the database with demo data:

make seed

The app is now available at:

  • Frontend: http://localhost:3000
  • Backend API: http://localhost:8000
  • API docs: http://localhost:8000/docs
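
To confirm the stack is reachable, a quick probe along these lines can help (a minimal sketch using only the Python standard library, assuming the default ports above):

import urllib.request

# Default endpoints from the compose setup; adjust if you remapped ports.
ENDPOINTS = {
    "frontend": "http://localhost:3000",
    "backend API docs": "http://localhost:8000/docs",
}

for name, url in ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")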

Audio files placed in storage/audio/nexalink/ are auto-ingested on startup (see Audio Auto-Ingest below).

To stop:

make down

Option B. Run Backend and Frontend Locally

Start only the supporting infrastructure (Database, Ollama, Qdrant, VAD, Emotion, WhisperX):

make support-up
make prepare-speaker-model   # if not done already

Set IS_LOCAL=true and point service URLs at localhost in backend/.env:

IS_LOCAL=true
EMOTION_API_URL=http://localhost:8001
VAD_API_URL=http://localhost:8002
WHISPERX_API_URL=http://localhost:8003

Backend:

make be-install
make be-dev          # -> http://localhost:8000

Frontend:

make fe-install
make fe-dev          # -> http://localhost:3000

Pull the Ollama embedding model if using RAG:

docker exec vocalmind-ollama ollama pull snowflake-arctic-embed2
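
For reference, a retrieval query against this stack can be sketched with LlamaIndex as below. This is illustrative only: the collection name vocalmind_docs is a hypothetical placeholder, and the actual wiring lives in the RAG service.

import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Compose defaults: Qdrant on 6333, Ollama on 11434.
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="vocalmind_docs")
embed_model = OllamaEmbedding(
    model_name="snowflake-arctic-embed2",
    base_url="http://localhost:11434",
)

index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("What is the refund policy during an outage?"):
    print(hit.score, hit.node.get_content()[:80])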

Seed demo data:

make seed

Stop supporting containers:

make support-down

GPU Acceleration (Optional)

For inference workloads that benefit from an NVIDIA GPU, use the GPU-enabled compose overlay:

make up-gpu            # full stack with GPU
make support-up-gpu    # supporting services only with GPU

This requires the NVIDIA Container Toolkit and a compatible GPU driver.
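
For orientation, such an overlay typically attaches NVIDIA devices through Compose device reservations; the snippet below is a sketch (service name and file layout are assumptions, and the repo's actual overlay may differ):

services:
  whisperx:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]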


Project Structure

VocalMind/
├── backend/          # FastAPI API gateway
├── frontend/         # React dashboard (Manager & Agent routes)
├── services/         # Microservices (VAD, WhisperX, Emotion, RAG)
├── infra/            # DB init, seed/eval scripts, quality benchmarks, test fixtures
│   ├── db/           # PostgreSQL schema & seed SQL
│   ├── benchmarks/   # Quality benchmark data (expected, fixtures, schema)
│   ├── scripts/      # Operational scripts (seed/, eval/, e2e, migrate)
│   └── fixtures/     # Test audio files & external API fixtures
├── storage/          # Unified local storage (docs, audio, uploads)
│   ├── docs/         #   Organization documents (policy, SOP, KB)
│   ├── audio/        #   Per-org audio drop folders (e.g. nexalink/), auto-ingested
│   │                 #   on backend startup. Audio files (.wav/.mp3) are gitignored —
│   │                 #   filename pattern CALL_<NN>_<agent>_<scenario>.<ext> assigns
│   │                 #   the call to the correct seeded agent.
│   └── uploads/      #   Runtime upload buffer (gitignored)
├── research/         # Jupyter notebooks & prototype scripts
├── docs/             # Documentation (explainability, LLM trigger, RAG, design, frontend)
├── tools/            # Local CLI tools (Supabase CLI)
├── .github/          # CI workflows (ci.yml, backend.yml, frontend.yml, rag_ci.yml)
├── docker-compose.yml # Multi-container service definitions
├── Makefile          # Unified development commands
└── README.md

Useful Commands

Backend

make be-install       # Install dependencies
make be-dev           # Run API gateway
make be-test          # Run pytest suite
make be-lint          # Run Ruff linter

Frontend

make fe-install       # Install dependencies (pnpm)
make fe-test          # Run Cypress E2E tests
make fe-e2e-cov       # Run E2E tests with Istanbul code coverage
make fe-lint          # Run ESLint/Type-check validation
make fe-build         # Build production bundle

Docker

make build            # Build all Docker images
make build-retry      # Build + start with retry for transient Docker daemon errors (Windows)
make up               # Start all services
make up-gpu           # Start all services with NVIDIA GPU acceleration
make support-up       # Start supporting services only (DB, Ollama, Qdrant, inference)
make support-up-gpu   # Same, with GPU acceleration
make down             # Stop all services
make support-down     # Stop supporting services only
make logs             # Tail container logs
make seed             # Seed database with demo data

General

make clean            # Remove caches and build artifacts
make prepare-speaker-model  # Extract speaker-role classifier for WhisperX

Utility Scripts

python infra/scripts/measure_dashboard_baseline.py --api-base http://localhost:8000/api/v1
python infra/fixtures/kaggle/scripts/kaggle_api_smoke_test.py --audio-file storage/audio/nexalink/CALL_01_priya_refund_outage.wav

Speaker Classifier Artifact

Place speaker_classifier_export.zip at the repo root and run:

make prepare-speaker-model

This extracts the DistilBERT model into services/whisperx/models/speaker_role/distilbert/. The zip is gitignored — it must be provided separately. WhisperX will fail to start without these model files.

Audio Auto-Ingest

The backend ships an audio folder watcher (backend/app/core/audio_folder_watcher.py) that runs on startup and every 15 seconds while the server is up. It scans storage/audio/<org_slug>/ for any .wav or .mp3 file that is not yet recorded as an interaction, and:

  1. Reads the agent token from the filename (CALL_<NN>_<agent>_<scenario>.<ext>).
  2. Creates an Interaction row owned by that agent with status pending.
  3. Seeds the processing_jobs records for the full pipeline.
  4. Enqueues the interaction onto the in-memory worker queue.

Drop a properly named audio file into the folder and the manager dashboard will pick it up without any manual upload step. If the filename does not match the pattern, the file is still ingested but assigned to a deterministic fallback agent, and a warning is logged. Set AUDIO_FOLDER_WATCHER_ENABLED=false in .env to disable the watcher.
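
For illustration, the filename convention can be parsed with a regex along these lines (a sketch, not the watcher's actual code; see backend/app/core/audio_folder_watcher.py):

import re

# CALL_<NN>_<agent>_<scenario>.<ext>, e.g. CALL_01_priya_refund_outage.wav
PATTERN = re.compile(
    r"^CALL_(?P<num>\d+)_(?P<agent>[a-z]+)_(?P<scenario>[\w-]+)\.(?P<ext>wav|mp3)$",
    re.IGNORECASE,
)

def parse_call_filename(name: str) -> dict | None:
    """Return the parsed fields, or None (the fallback agent applies)."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None

print(parse_call_filename("CALL_01_priya_refund_outage.wav"))
# -> {'num': '01', 'agent': 'priya', 'scenario': 'refund_outage', 'ext': 'wav'}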

The seeded NexaLink organization comes with one manager ([email protected]) and exactly five agents — Priya, Daniel, Marcus, Aisha, Hannah — one per scripted call in storage/audio/nexalink/.

Key Docs

Detailed documentation lives in docs/, covering explainability, the LLM trigger, RAG, design, and the frontend.
