Lightweight, blazing-fast TTS microservice powered by edge-tts.
Zero GPU required. Docker-ready. Vietnamese-first with 400+ voices.
┌──────────────────────────────────────────────────────────┐
│  xTTS: Text-to-Speech Microservice                       │
│                                                          │
│  ✦ Direct edge-tts library calls (no subprocess)         │
│  ✦ Chunk-based processing for unlimited text length      │
│  ✦ Word-level captions merged into natural phrases       │
│  ✦ In-memory LRU cache for instant repeated requests     │
│  ✦ Streaming endpoint for browser <audio> playback       │
│  ✦ CORS-ready, Docker-ready, production-ready            │
└──────────────────────────────────────────────────────────┘
      ┌─────────────────────────┐
      │    Client / Browser     │
      └────────┬───────┬────────┘
               │       │
      POST /tts│       │POST /tts/stream
               ▼       ▼
┌─────────────────────────────────────┐
│           FastAPI (async)           │
│         ┌─────────────────┐         │
│         │  CORS / Error   │         │
│         │   Middleware    │         │
│         └────────┬────────┘         │
│                  ▼                  │
│         ┌─────────────────┐         │
│         │  Input Valid.   │         │
│         │  voice + rate   │         │
│         └────────┬────────┘         │
│                  ▼                  │
│         ┌─────────────────┐         │
│         │    LRU Cache    ├─────────┼── hit ──▶ Return cached
│         │  (100 entries)  │         │
│         └────────┬────────┘         │
│                  │ miss             │
│                  ▼                  │
│         ┌─────────────────┐         │
│         │  Text Chunker   │         │
│         │   ≤500 chars    │         │
│         └────────┬────────┘         │
│                  ▼                  │
│       ┌─────────────────────┐       │
│       │  edge-tts Workers   │       │
│       │   (2 concurrent)    │       │
│       │                     │       │
│       │ chunk₁ ──▶ MP3 + WB │       │
│       │ chunk₂ ──▶ MP3 + WB │       │
│       │ chunk₃ ──▶ MP3 + WB │       │
│       └──────────┬──────────┘       │
│                  ▼                  │
│         ┌─────────────────┐         │
│         │ Caption Merger  │         │
│         │ words → phrases │         │
│         └────────┬────────┘         │
│                  ▼                  │
│         ┌─────────────────┐         │
│         │  MP3 Concat +   │         │
│         │   Cache Store   │         │
│         └────────┬────────┘         │
│                  ▼                  │
│         ┌─────────────────┐         │
│         │    Response     │         │
│         │  base64 / MP3   │         │
│         └─────────────────┘         │
└─────────────────────────────────────┘
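The Text Chunker stage in the diagram can be pictured with a small sketch. This is not the code in text_utils.py: the function name `split_text_sketch` and the exact rules (whole paragraphs first, then sentence boundaries, then a hard cut) are assumptions chosen to match the 500-character limit shown above.

```python
import re


def split_text_sketch(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: a paragraph is kept whole when it fits,
    otherwise it is split on sentence boundaries, and a sentence that
    is still too long is cut at max_chars.
    """
    chunks: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            # Hard-cut a sentence that exceeds the limit on its own.
            while len(sentence) > max_chars:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = sentence
        if current:
            chunks.append(current)
    return chunks
```

In this sketch an empty input yields an empty list; the real service may fall back to something else for empty text.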
sequenceDiagram
participant C as Client
participant A as FastAPI
participant V as Voice Validator
participant $ as LRU Cache
participant T as Text Chunker
participant E as edge-tts
C->>A: POST /tts {text, voice, rate}
A->>V: Validate voice exists
V-->>A: ✓ OK
A->>$: Lookup cache key
alt Cache Hit
$-->>A: Cached audio + captions
A-->>C: 200 JSON (cached: true)
else Cache Miss
A->>T: Split text → chunks
loop Each chunk (2 concurrent)
A->>E: Communicate.stream()
E-->>A: MP3 bytes + WordBoundary events
end
A->>A: Merge captions, concat MP3
A->>$: Store in cache
A-->>C: 200 JSON {audio, captions, ...}
end
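The cache-hit and cache-miss branches of the sequence above can be sketched in plain Python. Everything here is illustrative: `LRUCacheSketch`, `cache_key`, `fake_synthesize`, and `tts_sketch` are hypothetical names, the SHA-256 key format is an assumption, and `fake_synthesize` stands in for the real edge-tts call so the example runs offline.

```python
import asyncio
import hashlib
from collections import OrderedDict
from typing import Optional


class LRUCacheSketch:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_size: int = 100) -> None:
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the oldest entry


def cache_key(text: str, voice: str, rate: str) -> str:
    """Deterministic key (assumed format): same inputs, same hash."""
    return hashlib.sha256(f"{voice}|{rate}|{text}".encode("utf-8")).hexdigest()


async def fake_synthesize(chunk: str) -> bytes:
    """Stand-in for the edge-tts call; returns placeholder bytes."""
    await asyncio.sleep(0)
    return chunk.encode("utf-8")


async def tts_sketch(chunks, voice, rate, cache, concurrency=2):
    key = cache_key("\n".join(chunks), voice, rate)
    cached = cache.get(key)
    if cached is not None:
        return cached, True  # cache hit
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(chunk):
        async with semaphore:  # at most `concurrency` calls in flight
            return await fake_synthesize(chunk)

    # gather() preserves input order, so the parts concatenate correctly.
    parts = await asyncio.gather(*(worker(c) for c in chunks))
    audio = b"".join(parts)
    cache.put(key, audio)
    return audio, False  # cache miss
```

Calling `asyncio.run(tts_sketch(["a", "b"], "vi-VN-HoaiMyNeural", "+0%", LRUCacheSketch()))` twice with the same cache object returns the same bytes, with the second element flipping from `False` to `True`.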
graph TD
SERVER[server.py<br/>Entry point] --> APP[app/__init__.py<br/>create_app factory]
APP --> CORS[CORS Middleware]
APP --> ERR[Error Handler]
APP --> R_TTS[routes/tts.py<br/>POST /tts, /tts/stream]
APP --> R_SYS[routes/system.py<br/>GET /health, /stats, /voices]
R_TTS --> ENGINE[tts_engine.py<br/>Core TTS logic]
R_SYS --> ENGINE
ENGINE --> CACHE[cache.py<br/>LRU Cache]
ENGINE --> TEXT[text_utils.py<br/>Chunking & captions]
ENGINE --> EDGE[edge-tts<br/>Microsoft TTS]
APP --> CONFIG[config.py<br/>pydantic-settings]
ENGINE --> CONFIG
R_TTS --> MODELS[models.py<br/>Request/Response]
R_SYS --> MODELS
style SERVER fill:#1a1a2e,stroke:#e94560,color:#fff
style APP fill:#16213e,stroke:#0f3460,color:#fff
style ENGINE fill:#0f3460,stroke:#533483,color:#fff
style CACHE fill:#533483,stroke:#e94560,color:#fff
style TEXT fill:#533483,stroke:#e94560,color:#fff
style EDGE fill:#e94560,stroke:#fff,color:#fff
style CONFIG fill:#16213e,stroke:#0f3460,color:#fff
style MODELS fill:#16213e,stroke:#0f3460,color:#fff
Git Push ──▶ Lint (ruff) ──▶ Test (pytest) ──▶ Docker Build ──▶ Deploy
| Stage | Tool | Command |
|---|---|---|
| Lint | ruff | make lint |
| Test | pytest | make test |
| Build | Docker | docker compose build |
| Deploy | Docker Compose | docker compose up -d |
xTTS/
├── app/
│   ├── __init__.py          # App factory (create_app)
│   ├── config.py            # Settings (pydantic-settings, .env)
│   ├── models.py            # Request/Response Pydantic models
│   ├── cache.py             # In-memory LRU cache
│   ├── tts_engine.py        # Core TTS processing & voice validation
│   ├── text_utils.py        # Text chunking & caption merging
│   └── routes/
│       ├── __init__.py      # Router registry
│       ├── tts.py           # POST /tts, POST /tts/stream
│       └── system.py        # GET /health, /stats, /voices
├── tests/
│   ├── test_text_utils.py   # Text chunking & caption tests
│   └── test_api.py          # Cache, model validation tests
├── .env.example             # Environment variable template
├── .dockerignore
├── .gitignore
├── docker-compose.yml
├── Dockerfile
├── Makefile                 # dev, test, lint, docker shortcuts
├── mcp_server.py            # MCP config management server (:3100)
├── pyproject.toml           # Python project config
├── README.md
├── requirements.txt
└── server.py                # Thin entry point
docker compose up -d
# Check it's running
curl http://localhost:3099/health

cp .env.example .env # edit as needed
pip install -r requirements.txt
python server.py

pip install -e ".[dev]"
make dev
# → http://localhost:3099/docs

POST /tts returns base64-encoded MP3 audio with phrase-level captions timed in frames.
Request:
curl -X POST http://localhost:3099/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Xin chào anh em, đây là X Dev. Hôm nay mình sẽ nói về Kubernetes.",
    "voice": "vi-VN-HoaiMyNeural",
    "rate": "+0%"
  }'

Response:
{
  "audio": "//uQxAAAAAANIAAAAAExBTU...",
  "audioFormat": "mp3",
  "audioSize": 19728,
  "captions": [
    { "startFrame": 3, "endFrame": 48, "text": "Xin chào anh em," },
    { "startFrame": 48, "endFrame": 95, "text": "đây là X Dev." },
    { "startFrame": 98, "endFrame": 180, "text": "Hôm nay mình sẽ nói" },
    { "startFrame": 180, "endFrame": 260, "text": "về Kubernetes." }
  ],
  "durationSeconds": 4.8,
  "chunks": 1,
  "elapsed": 1.923,
  "cached": false
}

Fields:
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded MP3 data |
| audioFormat | string | Always "mp3" |
| audioSize | int | Raw audio size in bytes |
| captions | array | Phrase-level captions with frame timing |
| captions[].startFrame | int | Start frame at 30fps |
| captions[].endFrame | int | End frame at 30fps |
| captions[].text | string | Caption text (3 to 6 words) |
| durationSeconds | float | Total audio duration |
| chunks | int | Number of text chunks processed |
| elapsed | float | Server processing time (seconds) |
| cached | bool | true if served from cache |
POST /tts/stream accepts the same request body and returns raw audio/mpeg, which suits browser <audio> playback or direct download.
# Save to file
curl -X POST http://localhost:3099/tts/stream \
-H "Content-Type: application/json" \
  -d '{"text": "Xin chào anh em"}' \
-o output.mp3
# Play directly (macOS)
curl -X POST http://localhost:3099/tts/stream \
-H "Content-Type: application/json" \
  -d '{"text": "Xin chào anh em"}' | afplay -

Response Headers:
| Header | Description |
|---|---|
| Content-Type | audio/mpeg |
| Content-Length | Audio size in bytes |
| X-Duration-Seconds | Total duration |
| X-Captions | Base64-encoded JSON captions array |
| X-Chunks | Number of chunks processed |
| X-Cached | true / false |
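Because X-Captions carries the captions as Base64-encoded JSON, a client has to decode the header before using it. The sketch below shows one way to do that in Python; `decode_captions_header` is a hypothetical helper, and it assumes standard Base64 over UTF-8 JSON. The sample value is built locally so the example runs without the service.

```python
import base64
import json


def decode_captions_header(header_value: str) -> list[dict]:
    """Decode a Base64-encoded JSON captions array into Python dicts."""
    raw = base64.b64decode(header_value)
    return json.loads(raw.decode("utf-8"))


# Round-trip demo with a caption shaped like the /tts response.
sample = [{"startFrame": 3, "endFrame": 48, "text": "Xin chào anh em,"}]
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
captions = decode_captions_header(encoded)
print(captions[0]["text"])
```

With the requests library, the header value would come from `resp.headers.get("X-Captions")` on a /tts/stream response.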
curl http://localhost:3099/health
{
  "ok": true,
  "version": "1.1.0",
  "uptime": 3600,
  "edge_tts": "7.0.2",
  "cache_size": 12,
  "voices_loaded": 441
}

curl http://localhost:3099/stats
{
  "requests_total": 150,
  "requests_ok": 148,
  "requests_error": 2,
  "chars_processed": 52340,
  "audio_bytes_generated": 3145728,
  "cache_hits": 45,
  "cache_misses": 105,
  "uptime": 7200,
  "cache_entries": 42,
  "cache_max": 100
}

curl "http://localhost:3099/voices?lang=vi"
curl "http://localhost:3099/voices?lang=en"
curl "http://localhost:3099/voices?lang=ja"
{
  "voices": [
    {
      "Name": "Microsoft Server Speech Text to Speech Voice (vi-VN, HoaiMyNeural)",
      "ShortName": "vi-VN-HoaiMyNeural",
      "Gender": "Female",
      "Locale": "vi-VN"
    },
    {
      "Name": "Microsoft Server Speech Text to Speech Voice (vi-VN, NamMinhNeural)",
      "ShortName": "vi-VN-NamMinhNeural",
      "Gender": "Male",
      "Locale": "vi-VN"
    }
  ],
  "total": 2
}

Open http://localhost:3099/docs in a browser for the interactive API docs.
import base64

import requests

# JSON response (base64 audio)
resp = requests.post("http://localhost:3099/tts", json={
    "text": "Xin chào, đây là X Dev",
    "voice": "vi-VN-HoaiMyNeural",
    "rate": "+0%",
})
resp.raise_for_status()  # fail early on HTTP errors
data = resp.json()
with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(data["audio"]))
print(f"Duration: {data['durationSeconds']}s, Captions: {len(data['captions'])}")

# Stream response (raw MP3)
resp = requests.post("http://localhost:3099/tts/stream", json={
    "text": "Xin chào, đây là X Dev",
})
resp.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(resp.content)

// Play audio directly in browser
async function speak(text) {
  const resp = await fetch("http://localhost:3099/tts/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice: "vi-VN-HoaiMyNeural" }),
  });
  const blob = await resp.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.play();

  // Parse captions from header
  const captionsB64 = resp.headers.get("X-Captions");
  if (captionsB64) {
    const captions = JSON.parse(atob(captionsB64));
    console.log("Captions:", captions);
  }
}
speak("Xin chào anh em, đây là X Dev");

// Node.js 18+ (built-in fetch). Top-level await is not available in
// CommonJS, so the request is wrapped in an async function.
const fs = require("fs");

async function main() {
  const resp = await fetch("http://localhost:3099/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: "Hello from Node.js",
      voice: "en-US-GuyNeural",
      rate: "+10%",
    }),
  });
  const { audio, captions, durationSeconds } = await resp.json();
  fs.writeFileSync("output.mp3", Buffer.from(audio, "base64"));
  console.log(`Saved ${durationSeconds}s audio with ${captions.length} captions`);
}

main();

// Use captions for subtitle rendering in Remotion
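const { captions } = ttsResponse;
// Render one Sequence per caption; Subtitle is your own component.
{captions.map((caption) => (
  <Sequence key={caption.startFrame} from={caption.startFrame} durationInFrames={caption.endFrame - caption.startFrame}>
    <Subtitle text={caption.text} />
  </Sequence>
))}

tests/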
├── test_text_utils.py           # 10 tests: text processing logic
│   ├── TestSplitText            # 5 tests: chunk splitting
│   ├── TestMergeWordCaptions    # 4 tests: caption merging
│   └── TestEstimateMp3Duration  # 1 test: duration estimation
│
└── test_api.py                  # 12 tests: cache & validation
    ├── TestLRUCache             # 5 tests: cache eviction & LRU
    ├── TestCacheKey             # 2 tests: key determinism
    └── TestTTSRequestValidation # 5 tests: model validation
graph LR
subgraph "test_text_utils.py"
T1[Split: short text<br/>→ single chunk]
T2[Split: paragraphs<br/>→ multi chunks]
T3[Split: long paragraph<br/>→ by sentence]
T4[Split: empty text<br/>→ fallback]
T5[Split: respects<br/>max_chars limit]
T6[Merge: empty<br/>→ empty]
T7[Merge: single word<br/>→ passthrough]
T8[Merge: max_words<br/>→ splits phrases]
T9[Merge: gap detection<br/>→ splits on pause]
T10[Duration: known size<br/>→ 1.0 sec]
end
subgraph "test_api.py"
A1[Cache: put/get]
A2[Cache: LRU eviction]
A3[Cache: access refresh]
A4[Cache: len]
A5[Cache: clear]
A6[Key: deterministic]
A7[Key: different inputs]
A8[Model: valid defaults]
A9[Model: rate formats]
A10[Model: invalid rate ✗]
A11[Model: invalid voice ✗]
A12[Model: valid voice ✓]
end
T1 & T2 & T3 & T4 & T5 --> TU[text_utils.py]
T6 & T7 & T8 & T9 --> TU
T10 --> TU
A1 & A2 & A3 & A4 & A5 --> CA[cache.py]
A6 & A7 --> CA
A8 & A9 & A10 & A11 & A12 --> MO[models.py]
style TU fill:#0f3460,stroke:#533483,color:#fff
style CA fill:#533483,stroke:#e94560,color:#fff
style MO fill:#16213e,stroke:#0f3460,color:#fff
# All tests
make test
# or
python -m pytest tests/ -v
# With coverage
make test-cov
# Specific test file
python -m pytest tests/test_text_utils.py -v
# Specific test class
python -m pytest tests/test_api.py::TestLRUCache -v

tests/test_api.py::TestLRUCache::test_put_and_get PASSED [ 4%]
tests/test_api.py::TestLRUCache::test_evicts_oldest PASSED [ 9%]
tests/test_api.py::TestLRUCache::test_access_refreshes_order PASSED [ 13%]
tests/test_api.py::TestLRUCache::test_len PASSED [ 18%]
tests/test_api.py::TestLRUCache::test_clear PASSED [ 22%]
tests/test_api.py::TestCacheKey::test_deterministic PASSED [ 27%]
tests/test_api.py::TestCacheKey::test_different_for_different PASSED [ 31%]
tests/test_api.py::TestTTSRequestValidation::test_valid_defaults PASSED [ 36%]
tests/test_api.py::TestTTSRequestValidation::test_valid_rates PASSED [ 40%]
tests/test_api.py::TestTTSRequestValidation::test_invalid_rate PASSED [ 45%]
tests/test_api.py::TestTTSRequestValidation::test_invalid_voice PASSED [ 50%]
tests/test_api.py::TestTTSRequestValidation::test_valid_voice PASSED [ 54%]
tests/test_text_utils.py::TestSplitText::test_short_text PASSED [ 59%]
tests/test_text_utils.py::TestSplitText::test_split_paragraph PASSED [ 63%]
tests/test_text_utils.py::TestSplitText::test_split_sentence PASSED [ 68%]
tests/test_text_utils.py::TestSplitText::test_empty_text PASSED [ 72%]
tests/test_text_utils.py::TestSplitText::test_respects_max PASSED [ 77%]
tests/test_text_utils.py::TestMergeWordCaptions::test_empty PASSED [ 81%]
tests/test_text_utils.py::TestMergeWordCaptions::test_single PASSED [ 86%]
tests/test_text_utils.py::TestMergeWordCaptions::test_max_words PASSED [ 90%]
tests/test_text_utils.py::TestMergeWordCaptions::test_gap PASSED [ 95%]
tests/test_text_utils.py::TestEstimateMp3Duration::test_known PASSED [100%]
====================== 22 passed in 0.25s ======================
make dev # uvicorn with --reload
make test # pytest -v
make test-cov # pytest with coverage
make lint # ruff check
make format # ruff format
make docker-up # docker compose up -d --build
make docker-down # docker compose down
make docker-logs # follow container logs
make clean # remove __pycache__, .pytest_cache, etc.

git clone https://github.com/xdev-asia-labs/xTTS.git
cd xTTS
cp .env.example .env # adjust settings for production
docker compose up -d
# Verify
curl http://YOUR_SERVER:3099/health

location /tts/ {
    proxy_pass http://127.0.0.1:3099/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 60s;
}

All settings can be configured via environment variables or a .env file.
cp .env.example .env

| Variable | Default | Description |
|---|---|---|
| Server | | |
| PORT | 3099 | HTTP server port |
| LOG_LEVEL | info | Uvicorn log level (debug, info, warning) |
| TTS Engine | | |
| TTS_MAX_CHUNK | 500 | Max characters per processing chunk |
| TTS_MAX_RETRIES | 3 | Retry count per chunk on transient errors |
| TTS_MAX_TEXT_LENGTH | 20000 | Maximum input text length |
| TTS_DEFAULT_VOICE | vi-VN-HoaiMyNeural | Default voice when not specified |
| TTS_DEFAULT_RATE | +0% | Default speaking rate |
| TTS_CONCURRENCY | 2 | Max concurrent edge-tts stream calls |
| Cache | | |
| TTS_CACHE_SIZE | 100 | Max cached TTS results (LRU eviction) |
| CORS | | |
| TTS_CORS_ORIGINS | * | Allowed origins, comma-separated |
| Captions | | |
| FPS | 30 | Frame rate for caption timing (match your video) |
| CAPTION_MAX_WORDS | 6 | Max words per caption phrase |
| CAPTION_MAX_GAP_FRAMES | 10 | Frame gap threshold to split captions |
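To make the two caption settings concrete, here is a minimal sketch of how word-level timings could be grouped into phrases using CAPTION_MAX_WORDS and CAPTION_MAX_GAP_FRAMES. The function names are hypothetical and this is not the code in text_utils.py; the real merger may apply further rules (the sample response suggests it also breaks at punctuation).

```python
def merge_word_captions_sketch(words, max_words=6, max_gap_frames=10):
    """Group word-level timings into phrases.

    `words` is a list of dicts with startFrame, endFrame and text.
    A new phrase starts when the current one already holds max_words
    words, or when the pause before the next word exceeds max_gap_frames.
    """
    phrases = []
    current = []
    for word in words:
        if current:
            gap = word["startFrame"] - current[-1]["endFrame"]
            if len(current) >= max_words or gap > max_gap_frames:
                phrases.append(_to_phrase(current))
                current = []
        current.append(word)
    if current:
        phrases.append(_to_phrase(current))
    return phrases


def _to_phrase(group):
    """Collapse a list of word dicts into one caption dict."""
    return {
        "startFrame": group[0]["startFrame"],
        "endFrame": group[-1]["endFrame"],
        "text": " ".join(w["text"] for w in group),
    }
```

Raising `max_words` gives longer subtitle lines, while lowering `max_gap_frames` makes the merger split more eagerly on pauses.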
Frame 0                 Frame 30                Frame 60
│                       │                       │
├────── 1 second ───────┼────── 1 second ───────┤
│                       │                       │
│  startFrame: 3        │  startFrame: 35       │
│  endFrame: 28         │  endFrame: 58         │
│  "Xin chào anh em"    │  "đây là X Dev"       │
Captions use frame numbers at the configured FPS (default 30).
To convert: time_seconds = frame / FPS
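The conversion is a single division, but subtitle formats usually want a formatted timestamp. The sketch below applies time_seconds = frame / FPS and also renders the SRT timestamp form (HH:MM:SS,mmm). Both function names are hypothetical helpers, not part of xTTS.

```python
def frame_to_seconds(frame: int, fps: int = 30) -> float:
    """Apply time_seconds = frame / FPS."""
    return frame / fps


def frame_to_srt_time(frame: int, fps: int = 30) -> str:
    """Format a frame number as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(frame * 1000 / fps)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


print(frame_to_seconds(48))   # 1.6
print(frame_to_srt_time(48))  # 00:00:01,600
```

If FPS is changed in the configuration, pass the same value as `fps` so the timestamps stay aligned with the audio.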
xTTS includes an MCP server (mcp_server.py), a separate FastAPI service that lets AI agents (GitHub Copilot, Claude, etc.) manage xTTS configuration programmatically: view and update .env, restart the service, and check Docker status.
┌─────────────────────┐        ┌─────────────────────────┐
│  VS Code / Copilot  │───────▶│  xtts-mcp (:3100)       │
│  (MCP Client)       │  HTTP  │  mcp_server.py          │
└─────────────────────┘        │                         │
                               │  ├─ Read/write .env     │
                               │  ├─ Docker restart      │
                               │  └─ Health check xtts   │
                               └────────────┬────────────┘
                                            │ docker.sock
                                            ▼
                               ┌─────────────────────────┐
                               │  xtts (:3099)           │
                               │  TTS service            │
                               └─────────────────────────┘
Both services start together:
docker compose up -d
# Verify MCP server
curl http://localhost:3100/mcp/config/schema

The workspace includes .vscode/mcp.json to register the MCP server automatically:
{
  "servers": {
    "xtts-config": {
      "type": "http",
      "url": "http://localhost:3100",
      "description": "xTTS deployment configuration manager"
    }
  }
}

After docker compose up -d, reload the VS Code window (Ctrl+Shift+P → "Reload Window") so Copilot picks up the server.
Returns all valid config keys with types, defaults, and descriptions.
curl http://localhost:3100/mcp/config/schema

Returns current .env values merged with defaults.
curl http://localhost:3100/mcp/config
{
  "config": {
    "PORT": {
      "value": "3099",
      "default": "3099",
      "is_custom": false,
      "type": "int",
      "description": "Server port"
    }
  }
}

curl http://localhost:3100/mcp/config/TTS_MAX_CHUNK

curl -X PUT http://localhost:3100/mcp/config \
  -H "Content-Type: application/json" \
  -d '{"key": "TTS_MAX_CHUNK", "value": "1000"}'

curl -X PUT http://localhost:3100/mcp/config/batch \
  -H "Content-Type: application/json" \
  -d '{"updates": {"TTS_MAX_CHUNK": "1000", "TTS_CONCURRENCY": "4"}}'

curl -X POST http://localhost:3100/mcp/config/reset/TTS_MAX_CHUNK

curl -X POST http://localhost:3100/mcp/config/reset

Proxies health check to the main TTS service.
curl http://localhost:3100/mcp/service/status

Triggers docker compose restart xtts via mounted Docker socket.
curl -X POST http://localhost:3100/mcp/service/restart

curl http://localhost:3100/mcp/docker/status

Shows which .env values differ from defaults.
curl http://localhost:3100/mcp/env/diff

xtts-mcp:
  build: .
  container_name: xtts-mcp
  restart: unless-stopped
  ports:
    - "3100:3100"
  volumes:
    - ./.env:/app/.env # Read/write config
    - ./.env.example:/app/.env.example:ro # Default reference
    - ./docker-compose.yml:/app/docker-compose.yml:ro
    - /var/run/docker.sock:/var/run/docker.sock # Docker control
    - /usr/bin/docker:/usr/bin/docker:ro
  command: ["python", "mcp_server.py"]

Note: The Docker socket mount allows the MCP server to restart the xtts container. Only expose this on trusted networks.
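For scripted use, the same MCP endpoints can be called from Python with the standard library. The sketch below only mirrors the curl calls shown earlier; the helper names are hypothetical, and the idea that a restart is needed after a config change is an assumption, not something the service documents here. The request builders do no network I/O, so they can be inspected without a running server.

```python
import json
import urllib.request

MCP_BASE = "http://localhost:3100"  # port taken from the compose snippet


def build_config_update(key: str, value: str) -> urllib.request.Request:
    """Build (but do not send) the PUT /mcp/config request."""
    body = json.dumps({"key": key, "value": value}).encode("utf-8")
    return urllib.request.Request(
        f"{MCP_BASE}/mcp/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_restart() -> urllib.request.Request:
    """Build (but do not send) the POST /mcp/service/restart request."""
    return urllib.request.Request(
        f"{MCP_BASE}/mcp/service/restart", data=b"", method="POST"
    )


def apply_setting(key: str, value: str) -> None:
    """Send the update, then restart (assumed necessary) the TTS service."""
    for request in (build_config_update(key, value), build_restart()):
        with urllib.request.urlopen(request, timeout=10) as response:
            print(request.get_method(), request.full_url, response.status)
```

With both containers running, `apply_setting("TTS_MAX_CHUNK", "1000")` would issue the two requests in order.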
MIT
