VocalMind is a modular AI ecosystem integrating speech processing (ASR, Diarization, Synthesis) with retrieval-augmented generation (RAG) to create context-aware conversational agents, designed for call center and telecom use cases.
| Component | Tech Stack | Description |
|---|---|---|
| Backend | FastAPI, SQLModel, asyncpg | Central API gateway with auth (JWT/Google OAuth), Supabase integration, and dispute handling. |
| Frontend | React 18, Vite, Tailwind v4, MUI, Radix UI | Manager and agent dashboards with session analysis. Tested with Cypress E2E and Vitest. |
| VAD | Silero VAD, FastAPI | Voice Activity Detection microservice. |
| WhisperX | WhisperX, pyannote, FastAPI | Automatic Speech Recognition and Diarization microservice. |
| Emotion | Transformers, FastAPI | Speech emotion recognition microservice. |
| RAG | LlamaIndex, Qdrant, Groq, Ollama | Retrieval-Augmented Generation for knowledge queries. |
| Ingestion | LlamaIndex | Automated pipeline for RAG document ingestion. |
| Explainability | FastAPI, React, LLM/NLI attribution | Evidence-anchored layer that links triggers and compliance verdicts to transcript spans and retrieved policy/SOP evidence. |
| Research | Jupyter | Reference experiments for speech pipelines and voice generation. |
- Docker Desktop (includes Docker Compose v2+)
- Python 3.12+ (via uv) — only needed for local development
- Node.js 20+ (via pnpm v10+) — only needed for local frontend development
- Git LFS — if your fork includes large test fixtures
- Hugging Face token (`HF_TOKEN`) — required for WhisperX diarization (pyannote)
- Groq API key (`GROQ_API_KEY`) — required for LLM chains and trigger evaluation
```
cp .env.example .env
cp backend/.env.example backend/.env
```

Edit `backend/.env` and fill in the required secrets:
| Variable | Required | Notes |
|---|---|---|
| `GROQ_API_KEY` | Yes | Get from https://console.groq.com — LLM chains and trigger evaluation |
| `HF_TOKEN` | Yes | Get from https://huggingface.co/settings/tokens — pyannote diarization |
| `SECRET_KEY` | Yes | Generate with `openssl rand -hex 32` |
| `IS_LOCAL` | No | `true` = Docker containers (default in docker-compose.yml); `false` = Kaggle remote |
For Option A (full Docker), the root .env provides GROQ_API_KEY and HF_TOKEN which docker-compose.yml passes through. For Option B (local dev), set these in backend/.env and change the service URLs to localhost.
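For local development, a fail-fast check at startup makes missing secrets obvious instead of surfacing as opaque errors later. The sketch below is illustrative and is not code from the VocalMind backend; only the variable names are taken from the table above:

```python
import os

# Secrets the backend needs (from the table above). SECRET_KEY is generated
# locally; the other two come from Groq and Hugging Face respectively.
REQUIRED_VARS = ["GROQ_API_KEY", "HF_TOKEN", "SECRET_KEY"]

def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing required settings: {', '.join(missing)}")
```

Running a check like this before starting the API surfaces exactly which of the required settings are absent.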
WhisperX and the backend both mount the DistilBERT speaker-role classifier. Place the export archive at the repo root and run:

```
make prepare-speaker-model
```

This extracts `services/whisperx/models/speaker_role/distilbert/` from `speaker_classifier_export.zip` (which is gitignored). Without this step, WhisperX will fail at startup. Use `--delete-zip` to remove the archive after extraction:

```
python infra/scripts/prepare_speaker_role_model.py --delete-zip
```

Next, build the Docker images:

```
make build
```

This builds all custom images: backend, frontend, ingestion (RAG), VAD, emotion, and WhisperX.
Windows tip: If you encounter transient Docker daemon errors (EOF, 500 Internal Server Error, rpc Unavailable), use the built-in retry script instead:

```
make build-retry
```

This retries up to 4 times with 12-second delays for known transient Docker Desktop issues on Windows.
First build: The emotion and WhisperX images use CUDA base images and download several GB of PyTorch/NVIDIA libraries. Expect 20–40 minutes, depending on connection speed and hardware.
Start all services (Database, Backend, Frontend, Ollama, Qdrant, Ingestion, VAD, Emotion, WhisperX):
```
make up
```

First-time startup also requires pulling the Ollama embedding model:

```
docker exec vocalmind-ollama ollama pull snowflake-arctic-embed2
```

Then seed the database with demo data:

```
make seed
```

The app is now available at:

- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API docs: http://localhost:8000/docs
Audio files placed in `storage/audio/nexalink/` are auto-ingested on startup (see Audio Auto-Ingest below).
To stop:
```
make down
```

Start only the supporting infrastructure (Database, Ollama, Qdrant, VAD, Emotion, WhisperX):

```
make support-up
make prepare-speaker-model  # if not done already
```

Set `IS_LOCAL=true` and point service URLs at localhost in `backend/.env`:
```
IS_LOCAL=true
EMOTION_API_URL=http://localhost:8001
VAD_API_URL=http://localhost:8002
WHISPERX_API_URL=http://localhost:8003
```
Backend:
```
make be-install
make be-dev  # -> http://localhost:8000
```

Frontend:

```
make fe-install
make fe-dev  # -> http://localhost:3000
```

Pull the Ollama embedding model if using RAG:

```
docker exec vocalmind-ollama ollama pull snowflake-arctic-embed2
```

Seed demo data:

```
make seed
```

Stop supporting containers:

```
make support-down
```

For inference workloads that benefit from an NVIDIA GPU, use the GPU-enabled compose overlay:
```
make up-gpu          # full stack with GPU
make support-up-gpu  # supporting services only with GPU
```

This requires the NVIDIA Container Toolkit and a compatible GPU driver.
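As a rough illustration, a GPU compose overlay typically adds an NVIDIA device reservation to the inference services. The snippet below follows the Compose specification but is a hypothetical sketch, not the repo's actual overlay file; the file name and the choice of services are assumptions:

```yaml
# docker-compose.gpu.yml (hypothetical overlay name)
services:
  whisperx:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

An overlay like this is applied with `docker compose -f docker-compose.yml -f docker-compose.gpu.yml up`, which is presumably what the `-gpu` make targets wrap.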
```
VocalMind/
├── backend/            # FastAPI API gateway
├── frontend/           # React dashboard (Manager & Agent routes)
├── services/           # Microservices (VAD, WhisperX, Emotion, RAG)
├── infra/              # DB init, seed/eval scripts, quality benchmarks, test fixtures
│   ├── db/             # PostgreSQL schema & seed SQL
│   ├── benchmarks/     # Quality benchmark data (expected, fixtures, schema)
│   ├── scripts/        # Operational scripts (seed/, eval/, e2e, migrate)
│   └── fixtures/       # Test audio files & external API fixtures
├── storage/            # Unified local storage (docs, audio, uploads)
│   ├── docs/           # Organization documents (policy, SOP, KB)
│   ├── audio/          # Per-org audio drop folders (e.g. nexalink/), auto-ingested
│   │                   # on backend startup. Audio files (.wav/.mp3) are gitignored —
│   │                   # filename pattern CALL_<NN>_<agent>_<scenario>.<ext> assigns
│   │                   # the call to the correct seeded agent.
│   └── uploads/        # Runtime upload buffer (gitignored)
├── research/           # Jupyter notebooks & prototype scripts
├── docs/               # Documentation (explainability, LLM trigger, RAG, design, frontend)
├── tools/              # Local CLI tools (Supabase CLI)
├── .github/            # CI workflows (ci.yml, backend.yml, frontend.yml, rag_ci.yml)
├── docker-compose.yml  # Multi-container service definitions
├── Makefile            # Unified development commands
└── README.md
```
```
make be-install  # Install dependencies
make be-dev      # Run API gateway
make be-test     # Run pytest suite
make be-lint     # Run Ruff linter
```

```
make fe-install  # Install dependencies (pnpm)
make fe-test     # Run Cypress E2E tests
make fe-e2e-cov  # Run E2E tests with Istanbul code coverage
make fe-lint     # Run ESLint/type-check validation
make fe-build    # Build production bundle
```

```
make build           # Build all Docker images
make build-retry     # Build + start with retry for transient Docker daemon errors (Windows)
make up              # Start all services
make up-gpu          # Start all services with NVIDIA GPU acceleration
make support-up      # Start supporting services only (DB, Ollama, Qdrant, inference)
make support-up-gpu  # Same, with GPU acceleration
make down            # Stop all services
make support-down    # Stop supporting services only
make logs            # Tail container logs
make seed            # Seed database with demo data
```

```
make clean                  # Remove caches and build artifacts
make prepare-speaker-model  # Extract speaker-role classifier for WhisperX
```

```
python infra/scripts/measure_dashboard_baseline.py --api-base http://localhost:8000/api/v1
python infra/fixtures/kaggle/scripts/kaggle_api_smoke_test.py --audio-file storage/audio/nexalink/CALL_01_priya_refund_outage.wav
```

Place `speaker_classifier_export.zip` at the repo root and run:
```
make prepare-speaker-model
```

This extracts the DistilBERT model into `services/whisperx/models/speaker_role/distilbert/`. The zip is gitignored — it must be provided separately. WhisperX will fail to start without these model files.
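For reference, the extraction step amounts to roughly the following. This is a simplified sketch, not the actual `infra/scripts/prepare_speaker_role_model.py`; the real script's archive-layout handling and validation may differ:

```python
import zipfile
from pathlib import Path

def extract_speaker_model(archive="speaker_classifier_export.zip",
                          dest="services/whisperx/models/speaker_role/distilbert",
                          delete_zip=False):
    """Unpack the speaker-role classifier export into the WhisperX model dir."""
    archive_path = Path(archive)
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_path)  # assumes model files sit at the zip root
    if delete_zip:
        archive_path.unlink()  # mirrors the --delete-zip flag
    return dest_path
```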
The backend ships an audio folder watcher
(`backend/app/core/audio_folder_watcher.py`) that runs on startup and every
15 seconds while the server is up. It scans `storage/audio/<org_slug>/` for
any `.wav` or `.mp3` file that is not yet recorded as an interaction, and:

- Reads the agent token from the filename (`CALL_<NN>_<agent>_<scenario>.<ext>`).
- Creates an `Interaction` row owned by that agent with status `pending`.
- Seeds the `processing_jobs` records for the full pipeline.
- Enqueues the interaction onto the in-memory worker queue.
Drop a properly named audio file into the folder — the manager dashboard will
pick it up without any manual upload step. If the filename does not match the
pattern, the file is still ingested but assigned to a deterministic fallback
agent, and a warning is logged. Set `AUDIO_FOLDER_WATCHER_ENABLED=false` in
`.env` to disable the watcher.
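The filename convention can be matched with a simple regular expression. The sketch below is illustrative only and is not the watcher's actual parsing code:

```python
import re

# Matches CALL_<NN>_<agent>_<scenario>.<ext>, e.g.
# CALL_01_priya_refund_outage.wav -> agent "priya", scenario "refund_outage".
CALL_PATTERN = re.compile(
    r"^CALL_(?P<num>\d+)_(?P<agent>[A-Za-z]+)_(?P<scenario>\w+)\.(?P<ext>wav|mp3)$",
    re.IGNORECASE,
)

def parse_call_filename(name):
    """Return (agent, scenario) in lowercase, or None for a non-matching name."""
    m = CALL_PATTERN.match(name)
    if m is None:
        return None  # the real watcher falls back to a deterministic agent here
    return m.group("agent").lower(), m.group("scenario").lower()
```

For example, `parse_call_filename("CALL_01_priya_refund_outage.wav")` yields `("priya", "refund_outage")`, while a non-matching name returns `None`.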
The seeded NexaLink organization comes with one manager
(`[email protected]`) and exactly five agents — Priya, Daniel, Marcus,
Aisha, Hannah — one per scripted call in `storage/audio/nexalink/`.