OVA

A fully-local AI voice assistant with real-time streaming TTS, voice cloning, and multimodal (image + text) support. Built with a FastAPI backend and modern web frontend. All models (ASR / LLM / TTS) run locally with open weights - no data is sent to the Internet.

Features

Real-time PCM streaming TTS - Low-latency audio with Web Audio API AudioWorklet
Streaming ASR - Real-time transcription via WebSocket as you speak
Voice cloning - Clone any voice from a 5-15 second audio sample
Two-phase streaming - Aggressive first chunk for lower TTFB, then stable quality
Early TTS decode - Interleaved LLM→TTS for faster time-to-first-byte (~200-300ms reduction)
Hot-reload - Switch voice/language without restart
Multi-language support - 10 languages: zh, en, ja, ko, de, fr, ru, pt, es, it
Multimodal input - Attach images to chat queries for vision-language responses
Conversation mode - Switch between push-to-talk (half-duplex) and always-listening (full-duplex) with server-side VAD and turn-taking
Prosody control - [pause:X] tags for deliberate silences in speech
Tool calling - LLM can invoke real Python functions (timers, datetime, web search) with real-time push notifications via SSE
MCP client - Connect to external MCP servers for additional tool capabilities (filesystem, databases, APIs) without writing code
Themes - Dark, Light, Her (Samantha), and HAL-9000
torch.compile optimizations - Up to 1.7x speedup after JIT warmup

Quick Start

Pre-requisites

Linux (x86_64)
Python >= 3.13
uv installed and available in PATH
LLM provider (one of):
- Ollama installed and running (default)
- Any OpenAI-compatible API — local or remote (OVA_LLM_PROVIDER=openai)
NVIDIA GPU (Ampere or newer) with CUDA 12 or 13. Default is CUDA 13 — run ./ova.sh configure-cuda 12 before install if on CUDA 12

Install & Run

# If your CUDA version is not 13 (the default), configure first:
./ova.sh configure-cuda    # auto-detect, or: ./ova.sh configure-cuda 12

./ova.sh install

See QUICKSTART.md for voice profile setup and .env configuration.

./ova.sh start

This starts two services:

Backend (FastAPI): http://localhost:5173 — ASR + LLM + TTS pipeline
Frontend (static): http://localhost:8080 — open this in your browser

Logs: tail -f .ova/backend.log (add OVA_DEBUG=true to .env for verbose output).

Minimal Configuration

The defaults .env. examples will work for most setups. The common things to change in .env:

Variable	Default	What to change
`OVA_LANGUAGE`	`es`	Your language (`en`, `de`, `fr`, `ja`, etc.)
`OVA_QWEN3_VOICE`	`myvoice`	Voice profile (see generate_voice_prompts script )
`OVA_LLM_PROVIDER`	`ollama`	Set to `openai` for OpenAI-compatible APIs

See VARIABLES.md for the full configuration reference.

Models

Component	Model
ASR	Qwen3-ASR-0.6B (embedded subprocess)
LLM	Mistral ministral-3 3b 4-bit (Ollama or any OpenAI-compatible server)
TTS	Qwen3-TTS 1.7B
TTS (Alternative)	Qwen3-TTS 0.6B
TTS (Alternative)	Hexgrad Kokoro 82M
VAD	Silero VAD v6 (ONNX, client-side)

Qwen3-TTS Streaming Fork

This project uses a custom fork of Qwen3-TTS with streaming optimizations:

rekuenkdr/Qwen3-TTS-streaming

Key improvements over upstream:

Two-phase streaming - Aggressive first chunk settings for lower TTFB, then stable settings for quality
torch.compile optimizations - Up to 1.7x speedup after JIT warmup

Architecture

┌─────────────────┐                   ┌─────────────────┐
│   Frontend      │ ───────────────▶  │   Backend       │
│   (index.html)  │   HTTP/WebSocket  │   (FastAPI)     │
│   Port 8080     │ ◀───────────────  │   Port 5173     │
│                 │ ◀──── /v1/duplex ──▶  (full-duplex) │
└─────────────────┘                   └────────┬────────┘
                                               │
                    ┌──────────────────────────┴──────────────────────────┐
                    │                   OVAPipeline                        │
                    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
                    │  │ ASR Process │  │    LLM      │  │ TTS Engine  │  │
                    │  │ (Unix sock) │  │ (Ollama/OAI)│  │ (Qwen3/Kok) │  │
                    │  └─────────────┘  └─────────────┘  └─────────────┘  │
                    └─────────────────────────────────────────────────────┘

The ASR subprocess uses Unix socket IPC (not multiprocessing.Pipe) to avoid vLLM's stdout conflicts and ensure clean CUDA context isolation. ASR runs in a separate subprocess spawned before torch imports so vLLM gets a clean CUDA context and uses fork (5-10x faster startup). The IPC protocol uses pickle serialization over Unix sockets for efficient numpy array transfer.

How It Works

Frontend captures audio/text (optionally with an image) and sends to the backend
Backend processes the request:
- Transcribes audio using embedded ASR subprocess (or streaming via WebSocket)
- Sends text + optional image to the LLM
- Early TTS decode: Starts TTS before full LLM response (reduces TTFB by ~200-300ms)
- Two-phase streaming: Aggressive first chunk settings, then stable streaming
Frontend plays audio as it arrives using an AudioWorklet processor

On an RTX 5060Ti (16GB VRAM), with Qwen3-TTS streaming optimizations enabled, TTFB is ~620ms after warmup with the 1.7B Model.

Streaming TTS

OVA supports two streaming formats: PCM (lower latency, WAV header + raw chunks with AudioWorklet) and WAV (each chunk is a complete WAV file). PCM mode uses reduce-overhead torch.compile for ~1.5-1.7x speedup. The frontend pre-buffers ~0.4 seconds before playback to ensure smooth audio.

Two-phase streaming uses aggressive settings for the first chunk (lower TTFB) then switches to stable settings for quality. The pipeline also includes runtime audio quality assertions (overlap detection, RMS continuity) that log warnings for debugging.

See VARIABLES.md for all streaming tuning parameters (OVA_PCM_*, OVA_FIRST_CHUNK_*).

Qwen3 Voice Profiles

Profiles are organized by language under profiles/<language>/<voice>/:

profiles/
├── zh/, en/, ja/, ko/, de/, fr/, ru/, pt/, es/, it/

Language directories are provided for all qwen3 supported languages. Create voice profiles inside them.

Creating a New Voice Profile

Create a directory: profiles/<language>/<voice_name>/
Add a ref_audio.wav — 5-15 second clear voice sample (24kHz recommended, MP3/MP4 also accepted)
Generate voice clone prompts using one of two methods:
- Option A: Run generate_voice_prompts.py — auto-transcribes the audio and generates .pt files
- Option B: Add a ref_text.txt with the exact transcription — .pt files are auto-generated on first start
Optionally add a prompt.txt to customize the personality (falls back to prompts/<lang>/default.txt)

Voice clone prompts are model-specific:

voice_clone_prompt_0.6B.pt - For 0.6B TTS model (1024-dim embeddings)
voice_clone_prompt_1.7B.pt - For 1.7B TTS model (2048-dim embeddings)

Enhancing Audio Quality (Optional)

For better voice cloning results, you can enhance reference audio using Resemble Enhance.

First, install the optional dependencies (--no-deps on resemble-enhance to avoid torch/numpy version conflicts):

uv pip install --no-deps resemble-enhance && uv pip install deepspeed

Then run the enhancement script:

# Denoise only (recommended)
python scripts/enhance_profile_audio.py martina

# Denoise + enhance (upscale quality)
python scripts/enhance_profile_audio.py martina --enhance

# English profile
python scripts/enhance_profile_audio.py cassidy --language en

This removes background noise, optionally upscales audio quality, and resamples to 24kHz. The original audio is backed up as ref_audio_original.*.

Alternatively, you can use NVIDIA RE-USE for speech enhancement (denoising + bandwidth extension):

# Denoise + BWE to 24kHz (default)
python scripts/enhance_profile_audio_RE-USE.py myvoice

# English profile
python scripts/enhance_profile_audio_RE-USE.py myvoice --language en

Prompting

Each voice profile includes a prompt.txt that controls the assistant's personality and output style. Getting the prompt right is important because the LLM output is spoken aloud by TTS — formatting that looks fine in text can produce audible artifacts in speech.

Prompt Structure

A prompt has two parts:

Personality lines — who the assistant is and what language to use
Instructions block — behavioral rules that keep output TTS-friendly

Here's the English reference prompt (prompts/en/default.txt):

You are a friendly and approachable voice assistant.
You speak with a natural, casual, and relaxed tone, like a friend chatting on the phone.
Always respond in English.

Instructions:
- Be concise and direct - answer the first sentence clearly before continuing.
- Prioritize clarity over response length.
- Use a casual and friendly tone.
- NEVER respond with lists - use complete sentences.
- NEVER include any Markdown formatting, asterisks, underscores, or other formatting.
- Do NOT include emojis.
- Use punctuation to control speech rhythm: commas for brief pauses.
- You may use [pause:X] to insert a deliberate pause of X seconds (e.g., [pause:0.5]). Use sparingly for dramatic effect.

Why the First Sentence Matters for TTFB

With OVA_EARLY_TTS_DECODE=true (the default), the backend starts TTS before the full LLM response is ready. It watches the token stream and triggers audio synthesis on the first sentence boundary (. ? !), buffer reaching ~40 characters, or 12 tokens — whichever comes first.

The instruction "answer the first sentence clearly before continuing" causes the LLM to produce a short opening sentence with punctuation early, triggering TTS sooner. For example, "Sure thing!" triggers immediately, while a long rambling sentence waits for fallback thresholds.

Key Prompt Guidelines

Rule	Why
No lists	TTS reads bullet characters (`-`, `*`, `1.`) literally as speech
No Markdown	Asterisks and underscores become audible artifacts
No emojis	TTS either skips them or mispronounces them
Punctuation for rhythm	Commas, ellipsis, and dashes directly control TTS prosody and pacing
Complete sentences	Produces natural speech flow instead of fragmented phrases

If a profile directory doesn't contain a prompt.txt, the system falls back to prompts/<language>/default.txt. You can reload the active prompt at runtime via the settings panel without restarting.

Prosody (Silence Tags)

OVA supports [pause:X] tags that let the LLM insert deliberate silences into speech. This is purely prompt-controlled — add the instruction line to a profile's prompt.txt to enable it:

- You may use [pause:X] to insert a deliberate pause of X seconds (e.g., [pause:0.5]). Use sparingly for dramatic effect.

Syntax: [pause:X] or [p:X] where X is seconds (e.g., [pause:0.5], [p:1.5]).

When the LLM outputs text with embedded tags, the pipeline splits it into text and silence segments. Tags are stripped from conversation history after processing. Pauses are clamped to OVA_MAX_PAUSE_DURATION (default 3.0 seconds). Requires Qwen3 TTS with PCM streaming (the default).

Multimodal (Image + Text)

The chat interface supports attaching images:

Click the image icon to add an image.
Type your question about the image.
The vision-language model will analyze the image and respond

The /v1/chat endpoint handles text + optional image queries directly.

Conversation Mode

OVA supports two conversation modes, selectable from the settings panel or via OVA_ENABLE_DUPLEX in .env. Switching modes requires a restart and page reload.

Half-duplex (default)

Push-to-talk: the user holds a button (or uses barge-in / wake word) to record, the backend processes the request, and streams TTS back. Audio flows one direction at a time. VAD runs client-side in the browser.

Full-duplex

Always-listening: the microphone stays open and both user audio and bot audio flow simultaneously over a single persistent WebSocket (/v1/duplex). No button presses needed. Enable with OVA_ENABLE_DUPLEX=true.

Under the hood:

Server-side VAD — Silero VAD (ONNX, 16kHz) runs on the backend, detecting speech onset and offset in the incoming audio stream.
Turn-taking state machine — Tracks conversation state through IDLE → USER_SPEAKING → BOT_THINKING → BOT_SPEAKING transitions. Each state determines what happens with incoming audio and outgoing TTS.
Interruption handling — If the user speaks while the bot is talking, TTS playback stops immediately and a new turn begins. A dynamic bot-stop delay accounts for client playback buffer latency so interrupts feel responsive.
Backchannel filtering — Short filler words ("yeah", "ok", "mhm") are detected and discarded so they don't accidentally interrupt the bot mid-sentence.

Comparison

	Half-duplex (default)	Full-duplex
Activation	Push-to-talk (+ optional barge-in / wake word)	Always listening
VAD	Client-side (browser)	Server-side (backend)
Connection	HTTP POST + separate WebSocket for ASR	Single bidirectional WebSocket
Interruption	Barge-in stops playback, user re-records	Automatic — new turn starts instantly

See VARIABLES.md for all duplex tuning parameters (OVA_ENABLE_DUPLEX, OVA_DUPLEX_*, OVA_SERVER_VAD_*, OVA_TURN_*).

Tools / Function Calling

The LLM can invoke real Python functions during a conversation and incorporate the results into its spoken response. Tools are plugin-style — drop a .py file in ova/tools/ and the registry auto-discovers it. No manual registration needed.

Built-in Tools

Tool	Description	Default
`get_time`	Returns current time in configured timezone	Enabled
`get_date`	Resolves dates via natural language ("yesterday", "next Friday")	Enabled
`set_timer`	Sets an in-memory countdown timer (1–3600s)	Enabled
`check_timers`	Reports status of all active timers	Enabled
`web_search`	Example Tavily Web search provided (requires `OVA_SEARCH_API_KEY`)	Disabled

Timer expirations push real-time notifications to the browser via Server-Sent Events (GET /v1/events), with OS-level notifications when the tab is hidden.

Enable with OVA_ENABLE_TOOLS=true in .env. Enabling tools increases TTFA due to the LLM tool iterations.

See TOOLS.md for the full guide — creating custom tools, event publishing, frontend handler registration, security considerations, and the complete API reference.

MCP (External Tool Servers)

OVA can connect to external MCP (Model Context Protocol) servers as a client, expanding its tool capabilities without writing Python code. MCP tools appear alongside native tools — the LLM sees a single unified tool list.

Enable with OVA_ENABLE_MCP=true and OVA_ENABLE_TOOLS=true in .env. Configure servers in mcp_servers.json (same format as Claude Desktop / VS Code):

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/docs"]
    }
  }
}

Supports stdio (local subprocess), SSE, and Streamable HTTP transports. Servers connect in parallel at startup, with automatic reconnect on failure.

See MCP.md for the full guide — architecture, transport types, examples, error handling, and security considerations.

API Endpoints

All endpoints are versioned under /v1/. Voice is an optional path parameter, language is a query parameter.

Endpoint	Method	Description
`/v1/chat/audio`	POST	Voice input - receives WAV audio, returns streaming TTS response
`/v1/chat/{voice_id}/audio`	POST	Same, with explicit voice
`/v1/chat`	POST	Text + optional image input, returns streaming TTS response
`/v1/chat/{voice_id}`	POST	Same, with explicit voice
`/v1/text-to-speech`	POST	Pure TTS - synthesizes text exactly as given
`/v1/text-to-speech/{voice_id}`	POST	Same, with explicit voice
`/v1/text-to-speech/batch`	POST	Batch TTS - synthesizes multiple texts, returns NDJSON
`/v1/speech-to-text`	POST	One-shot speech-to-text
`/v1/interrupt`	POST	Stop current TTS playback (used by barge-in)
`/v1/speech-to-text/stream`	WebSocket	Streaming ASR - send audio chunks, receive partial transcripts
`/v1/duplex`	WebSocket	Full-duplex voice — bidirectional audio + JSON on a single connection
`/v1/events`	GET	SSE stream for real-time push notifications (tools → browser)
`/v1/info`	GET	Pipeline configuration info
`/v1/health`	GET	Server readiness check
`/v1/settings`	GET/POST	Runtime settings management (hot-reload capable)
`/v1/settings/prompt`	POST	Update system prompt (session-only, no restart)
`/v1/restart`	POST	Trigger server restart

Hot-Reload Settings

The following can be changed at runtime without restart via POST /v1/settings:

Setting	Hot-Reload	Notes
Voice profile	Yes	If preloaded at startup
Language	Yes	Loads new voice prompts automatically
System prompt	Yes	Via `/v1/settings/prompt`
TTS engine	No	Requires restart
Streaming format (pcm/wav)	No	Requires restart
Conversation mode	No	Requires restart (and page reload)

SDK

The OVA SDK is a standalone Python package for programmatic access to the OVA server from any machine. It is installed automatically during ./ova.sh install.

For standalone or remote use, install directly from git:

pip install git+https://github.com/rekuenkdr/ova-python-sdk.git

from ova_sdk import OVA

client = OVA()  # connects to localhost:5173 by default
client.wait_until_ready()

audio = client.chat.send_text("Tell me a joke")
audio.play()

For remote or authenticated servers, set OVA_BASE_URL and OVA_API_KEY environment variables (or pass as constructor arguments).

Full documentation and examples: ova-python-sdk

Security

OVA is designed for localhost use only. The default configuration is safe for local use — it binds to localhost, requires no authentication, and disables tools and MCP by default.

For network/internet exposure, you need API authentication (OVA_API_KEY), HTTPS via a reverse proxy, and feature flag hardening. OVA is single-tenant by design — separate users need separate instances.

See SECURITY.md for the full security overview — threat model, tool/MCP safety, API-only mode, hardening checklist, and all security-related configuration.

Disclaimer

This project is a proof-of-concept demonstration and is provided as is without any warranties or guarantees. It is intended for educational and experimental purposes only.

The voice cloning capability is purely for educational purposes - for real-life or commercial use, always seek relevant permissions. This demo highlights ethical and security considerations: the ease with which one can clone a voice using only a 3-5 second audio clip is both impressive and potentially dangerous in the wrong hands.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
models		models
ova		ova
patches		patches
profiles		profiles
prompts		prompts
scripts		scripts
static		static
.env.example		.env.example
.env.kokoro.example		.env.kokoro.example
.env.qwen3.example		.env.qwen3.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
MCP.md		MCP.md
QUICKSTART.md		QUICKSTART.md
README.md		README.md
SECURITY.md		SECURITY.md
TOOLS.md		TOOLS.md
VARIABLES.md		VARIABLES.md
index.html		index.html
mcp_servers.json		mcp_servers.json
ova.sh		ova.sh
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

OVA

Features

Quick Start

Pre-requisites

Install & Run

Minimal Configuration

Models

Qwen3-TTS Streaming Fork

Architecture

How It Works

Streaming TTS

Qwen3 Voice Profiles

Creating a New Voice Profile

Enhancing Audio Quality (Optional)

Prompting

Prompt Structure

Why the First Sentence Matters for TTFB

Key Prompt Guidelines

Prosody (Silence Tags)

Multimodal (Image + Text)

Conversation Mode

Half-duplex (default)

Full-duplex

Comparison

Tools / Function Calling

Built-in Tools

MCP (External Tool Servers)

API Endpoints

Hot-Reload Settings

SDK

Security

Disclaimer

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages