Skip to content

xdev-asia-labs/xTTS

Repository files navigation

πŸ”Š xTTS

Lightweight, blazing-fast TTS microservice powered by edge-tts.
Zero GPU required. Docker-ready. Vietnamese-first with 400+ voices.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  xTTS β€” Text-to-Speech Microservice                     β”‚
β”‚                                                          β”‚
β”‚  ✦ Direct edge-tts library calls (no subprocess)         β”‚
β”‚  ✦ Chunk-based processing for unlimited text length      β”‚
β”‚  ✦ Word-level captions merged into natural phrases       β”‚
β”‚  ✦ In-memory LRU cache for instant repeated requests     β”‚
β”‚  ✦ Streaming endpoint for browser <audio> playback       β”‚
β”‚  ✦ CORS-ready, Docker-ready, production-ready            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“ Architecture

                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β”‚      Client / Browser    β”‚
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                   β”‚       β”‚
                          POST /ttsβ”‚       β”‚POST /tts/stream
                                   β–Ό       β–Ό
                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β”‚     FastAPI (async)       β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚  CORS / Error  β”‚     β”‚
                          β”‚     β”‚  Middleware     β”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β”‚             β–Ό              β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚  Input Valid.  β”‚     β”‚
                          β”‚     β”‚  voice + rate  β”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β”‚             β–Ό              β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚  LRU Cache     │────── hit ──▢ Return cached
                          β”‚     β”‚  (100 entries) β”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β”‚          miss              β”‚
                          β”‚             β–Ό              β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚  Text Chunker  β”‚     β”‚
                          β”‚     β”‚  ≀500 chars    β”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β”‚             β–Ό              β”‚
                          β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                          β”‚   β”‚  edge-tts Workers    β”‚ β”‚
                          β”‚   β”‚  (2 concurrent)      β”‚ β”‚
                          β”‚   β”‚                      β”‚ β”‚
                          β”‚   β”‚  chunk₁ ──▢ MP3 + WB β”‚ β”‚
                          β”‚   β”‚  chunkβ‚‚ ──▢ MP3 + WB β”‚ β”‚
                          β”‚   β”‚  chunk₃ ──▢ MP3 + WB β”‚ β”‚
                          β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                          β”‚             β–Ό              β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚ Caption Merger β”‚     β”‚
                          β”‚     β”‚ words β†’ phrasesβ”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β”‚             β–Ό              β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚ MP3 Concat +   β”‚     β”‚
                          β”‚     β”‚ Cache Store    β”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β”‚             β–Ό              β”‚
                          β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
                          β”‚     β”‚  Response      β”‚     β”‚
                          β”‚     β”‚  base64 / MP3  β”‚     β”‚
                          β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Request Flow (Mermaid)

sequenceDiagram
    participant C as Client
    participant A as FastAPI
    participant V as Voice Validator
    participant $ as LRU Cache
    participant T as Text Chunker
    participant E as edge-tts

    C->>A: POST /tts {text, voice, rate}
    A->>V: Validate voice exists
    V-->>A: βœ“ OK
    A->>$: Lookup cache key
    alt Cache Hit
        $-->>A: Cached audio + captions
        A-->>C: 200 JSON (cached: true)
    else Cache Miss
        A->>T: Split text β†’ chunks
        loop Each chunk (2 concurrent)
            A->>E: Communicate.stream()
            E-->>A: MP3 bytes + WordBoundary events
        end
        A->>A: Merge captions, concat MP3
        A->>$: Store in cache
        A-->>C: 200 JSON {audio, captions, ...}
    end
Loading

Module Dependency

graph TD
    SERVER[server.py<br/>Entry point] --> APP[app/__init__.py<br/>create_app factory]
    APP --> CORS[CORS Middleware]
    APP --> ERR[Error Handler]
    APP --> R_TTS[routes/tts.py<br/>POST /tts, /tts/stream]
    APP --> R_SYS[routes/system.py<br/>GET /health, /stats, /voices]
    R_TTS --> ENGINE[tts_engine.py<br/>Core TTS logic]
    R_SYS --> ENGINE
    ENGINE --> CACHE[cache.py<br/>LRU Cache]
    ENGINE --> TEXT[text_utils.py<br/>Chunking & captions]
    ENGINE --> EDGE[edge-tts<br/>Microsoft TTS]
    APP --> CONFIG[config.py<br/>pydantic-settings]
    ENGINE --> CONFIG
    R_TTS --> MODELS[models.py<br/>Request/Response]
    R_SYS --> MODELS

    style SERVER fill:#1a1a2e,stroke:#e94560,color:#fff
    style APP fill:#16213e,stroke:#0f3460,color:#fff
    style ENGINE fill:#0f3460,stroke:#533483,color:#fff
    style CACHE fill:#533483,stroke:#e94560,color:#fff
    style TEXT fill:#533483,stroke:#e94560,color:#fff
    style EDGE fill:#e94560,stroke:#fff,color:#fff
    style CONFIG fill:#16213e,stroke:#0f3460,color:#fff
    style MODELS fill:#16213e,stroke:#0f3460,color:#fff
Loading

CI/CD Pipeline

CI/CD Pipeline

Git Push ──▢ Lint (ruff) ──▢ Test (pytest) ──▢ Docker Build ──▢ Deploy
Stage Tool Command
Lint ruff make lint
Test pytest make test
Build Docker docker compose build
Deploy Docker Compose docker compose up -d

πŸ“ Project Structure

xTTS/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ __init__.py          # App factory (create_app)
β”‚   β”œβ”€β”€ config.py            # Settings (pydantic-settings, .env)
β”‚   β”œβ”€β”€ models.py            # Request/Response Pydantic models
β”‚   β”œβ”€β”€ cache.py             # In-memory LRU cache
β”‚   β”œβ”€β”€ tts_engine.py        # Core TTS processing & voice validation
β”‚   β”œβ”€β”€ text_utils.py        # Text chunking & caption merging
β”‚   └── routes/
β”‚       β”œβ”€β”€ __init__.py      # Router registry
β”‚       β”œβ”€β”€ tts.py           # POST /tts, POST /tts/stream
β”‚       └── system.py        # GET /health, /stats, /voices
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_text_utils.py   # Text chunking & caption tests
β”‚   └── test_api.py          # Cache, model validation tests
β”œβ”€β”€ .env.example             # Environment variable template
β”œβ”€β”€ .dockerignore
β”œβ”€β”€ .gitignore
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ Makefile                 # dev, test, lint, docker shortcuts
β”œβ”€β”€ mcp_server.py            # MCP config management server (:3100)
β”œβ”€β”€ pyproject.toml           # Python project config
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
└── server.py                # Thin entry point

πŸš€ Quick Start

Docker (recommended)

docker compose up -d

# Check it's running
curl http://localhost:3099/health

Native

cp .env.example .env          # edit as needed
pip install -r requirements.txt
python server.py

Development (auto-reload)

pip install -e ".[dev]"
make dev
# β†’ http://localhost:3099/docs

πŸ“‘ API Reference

POST /tts β€” Generate TTS (JSON response)

Returns base64-encoded MP3 audio with frame-level captions.

Request:

curl -X POST http://localhost:3099/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Xin chΓ o anh em, Δ‘Γ’y lΓ  X Dev. HΓ΄m nay mΓ¬nh sαΊ½ nΓ³i về Kubernetes.",
    "voice": "vi-VN-HoaiMyNeural",
    "rate": "+0%"
  }'

Response:

{
  "audio": "//uQxAAAAAANIAAAAAExBTU...",
  "audioFormat": "mp3",
  "audioSize": 19728,
  "captions": [
    { "startFrame": 3,   "endFrame": 48,  "text": "Xin chΓ o anh em," },
    { "startFrame": 48,  "endFrame": 95,  "text": "Δ‘Γ’y lΓ  X Dev." },
    { "startFrame": 98,  "endFrame": 180, "text": "Hôm nay mình sẽ nói" },
    { "startFrame": 180, "endFrame": 260, "text": "về Kubernetes." }
  ],
  "durationSeconds": 4.8,
  "chunks": 1,
  "elapsed": 1.923,
  "cached": false
}

Fields:

Field Type Description
audio string Base64-encoded MP3 data
audioFormat string Always "mp3"
audioSize int Raw audio size in bytes
captions array Phrase-level captions with frame timing
captions[].startFrame int Start frame at 30fps
captions[].endFrame int End frame at 30fps
captions[].text string Caption text (3–6 words)
durationSeconds float Total audio duration
chunks int Number of text chunks processed
elapsed float Server processing time (seconds)
cached bool true if served from cache

POST /tts/stream β€” Generate TTS (binary MP3 stream)

Same request body. Returns raw audio/mpeg β€” ideal for browser <audio> or direct download.

# Save to file
curl -X POST http://localhost:3099/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "Xin chΓ o anh em"}' \
  -o output.mp3

# Play directly (macOS)
curl -X POST http://localhost:3099/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "Xin chΓ o anh em"}' | afplay -

Response Headers:

Header Description
Content-Type audio/mpeg
Content-Length Audio size in bytes
X-Duration-Seconds Total duration
X-Captions Base64-encoded JSON captions array
X-Chunks Number of chunks processed
X-Cached true / false

GET /health β€” Health Check

curl http://localhost:3099/health
{
  "ok": true,
  "version": "1.1.0",
  "uptime": 3600,
  "edge_tts": "7.0.2",
  "cache_size": 12,
  "voices_loaded": 441
}

GET /stats β€” Metrics

curl http://localhost:3099/stats
{
  "requests_total": 150,
  "requests_ok": 148,
  "requests_error": 2,
  "chars_processed": 52340,
  "audio_bytes_generated": 3145728,
  "cache_hits": 45,
  "cache_misses": 105,
  "uptime": 7200,
  "cache_entries": 42,
  "cache_max": 100
}

GET /voices?lang=vi β€” List Available Voices

curl "http://localhost:3099/voices?lang=vi"
curl "http://localhost:3099/voices?lang=en"
curl "http://localhost:3099/voices?lang=ja"
{
  "voices": [
    {
      "Name": "Microsoft Server Speech Text to Speech Voice (vi-VN, HoaiMyNeural)",
      "ShortName": "vi-VN-HoaiMyNeural",
      "Gender": "Female",
      "Locale": "vi-VN"
    },
    {
      "Name": "Microsoft Server Speech Text to Speech Voice (vi-VN, NamMinhNeural)",
      "ShortName": "vi-VN-NamMinhNeural",
      "Gender": "Male",
      "Locale": "vi-VN"
    }
  ],
  "total": 2
}

GET /docs β€” Interactive Swagger UI

Open http://localhost:3099/docs in browser.


πŸ’» Usage Examples

Python

import requests, base64

# JSON response (base64 audio)
resp = requests.post("http://localhost:3099/tts", json={
    "text": "Xin chΓ o, Δ‘Γ’y lΓ  X Dev",
    "voice": "vi-VN-HoaiMyNeural",
    "rate": "+0%",
})
data = resp.json()
with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(data["audio"]))
print(f"Duration: {data['durationSeconds']}s, Captions: {len(data['captions'])}")

# Stream response (raw MP3)
resp = requests.post("http://localhost:3099/tts/stream", json={
    "text": "Xin chΓ o, Δ‘Γ’y lΓ  X Dev",
})
with open("output.mp3", "wb") as f:
    f.write(resp.content)

JavaScript (Browser)

// Play audio directly in browser
async function speak(text) {
  const resp = await fetch("http://localhost:3099/tts/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice: "vi-VN-HoaiMyNeural" }),
  });

  const blob = await resp.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.play();

  // Parse captions from header
  const captionsB64 = resp.headers.get("X-Captions");
  if (captionsB64) {
    const captions = JSON.parse(atob(captionsB64));
    console.log("Captions:", captions);
  }
}

speak("Xin chΓ o anh em, Δ‘Γ’y lΓ  X Dev");

JavaScript (Node.js)

const resp = await fetch("http://localhost:3099/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello from Node.js",
    voice: "en-US-GuyNeural",
    rate: "+10%",
  }),
});

const { audio, captions, durationSeconds } = await resp.json();
const buffer = Buffer.from(audio, "base64");
require("fs").writeFileSync("output.mp3", buffer);
console.log(`Saved ${durationSeconds}s audio with ${captions.length} captions`);

Remotion (React video)

// Use captions for subtitle rendering in Remotion
const { captions } = ttsResponse;

<Sequence from={caption.startFrame} durationInFrames={caption.endFrame - caption.startFrame}>
  <Subtitle text={caption.text} />
</Sequence>

πŸ§ͺ Testing

Test Structure

tests/
β”œβ”€β”€ test_text_utils.py       # 10 tests β€” text processing logic
β”‚   β”œβ”€β”€ TestSplitText        #  5 tests β€” chunk splitting
β”‚   β”œβ”€β”€ TestMergeWordCaptions #  4 tests β€” caption merging
β”‚   └── TestEstimateMp3Dur.  #  1 test  β€” duration estimation
β”‚
└── test_api.py              # 12 tests β€” cache & validation
    β”œβ”€β”€ TestLRUCache          #  5 tests β€” cache eviction & LRU
    β”œβ”€β”€ TestCacheKey          #  2 tests β€” key determinism
    └── TestTTSRequestValid.  #  5 tests β€” model validation

Test Coverage Map

graph LR
    subgraph "test_text_utils.py"
        T1[Split: short text<br/>β†’ single chunk]
        T2[Split: paragraphs<br/>β†’ multi chunks]
        T3[Split: long paragraph<br/>β†’ by sentence]
        T4[Split: empty text<br/>β†’ fallback]
        T5[Split: respects<br/>max_chars limit]
        T6[Merge: empty<br/>β†’ empty]
        T7[Merge: single word<br/>β†’ passthrough]
        T8[Merge: max_words<br/>β†’ splits phrases]
        T9[Merge: gap detection<br/>β†’ splits on pause]
        T10[Duration: known size<br/>β†’ 1.0 sec]
    end

    subgraph "test_api.py"
        A1[Cache: put/get]
        A2[Cache: LRU eviction]
        A3[Cache: access refresh]
        A4[Cache: len]
        A5[Cache: clear]
        A6[Key: deterministic]
        A7[Key: different inputs]
        A8[Model: valid defaults]
        A9[Model: rate formats]
        A10[Model: invalid rate βœ—]
        A11[Model: invalid voice βœ—]
        A12[Model: valid voice βœ“]
    end

    T1 & T2 & T3 & T4 & T5 --> TU[text_utils.py]
    T6 & T7 & T8 & T9 --> TU
    T10 --> TU
    A1 & A2 & A3 & A4 & A5 --> CA[cache.py]
    A6 & A7 --> CA
    A8 & A9 & A10 & A11 & A12 --> MO[models.py]

    style TU fill:#0f3460,stroke:#533483,color:#fff
    style CA fill:#533483,stroke:#e94560,color:#fff
    style MO fill:#16213e,stroke:#0f3460,color:#fff
Loading

Running Tests

# All tests
make test
# or
python -m pytest tests/ -v

# With coverage
make test-cov

# Specific test file
python -m pytest tests/test_text_utils.py -v

# Specific test class
python -m pytest tests/test_api.py::TestLRUCache -v

Sample Output

tests/test_api.py::TestLRUCache::test_put_and_get              PASSED  [  4%]
tests/test_api.py::TestLRUCache::test_evicts_oldest             PASSED  [  9%]
tests/test_api.py::TestLRUCache::test_access_refreshes_order    PASSED  [ 13%]
tests/test_api.py::TestLRUCache::test_len                       PASSED  [ 18%]
tests/test_api.py::TestLRUCache::test_clear                     PASSED  [ 22%]
tests/test_api.py::TestCacheKey::test_deterministic             PASSED  [ 27%]
tests/test_api.py::TestCacheKey::test_different_for_different   PASSED  [ 31%]
tests/test_api.py::TestTTSRequestValidation::test_valid_defaults PASSED [ 36%]
tests/test_api.py::TestTTSRequestValidation::test_valid_rates   PASSED  [ 40%]
tests/test_api.py::TestTTSRequestValidation::test_invalid_rate  PASSED  [ 45%]
tests/test_api.py::TestTTSRequestValidation::test_invalid_voice PASSED  [ 50%]
tests/test_api.py::TestTTSRequestValidation::test_valid_voice   PASSED  [ 54%]
tests/test_text_utils.py::TestSplitText::test_short_text        PASSED  [ 59%]
tests/test_text_utils.py::TestSplitText::test_split_paragraph   PASSED  [ 63%]
tests/test_text_utils.py::TestSplitText::test_split_sentence    PASSED  [ 68%]
tests/test_text_utils.py::TestSplitText::test_empty_text        PASSED  [ 72%]
tests/test_text_utils.py::TestSplitText::test_respects_max      PASSED  [ 77%]
tests/test_text_utils.py::TestMergeWordCaptions::test_empty     PASSED  [ 81%]
tests/test_text_utils.py::TestMergeWordCaptions::test_single    PASSED  [ 86%]
tests/test_text_utils.py::TestMergeWordCaptions::test_max_words PASSED  [ 90%]
tests/test_text_utils.py::TestMergeWordCaptions::test_gap       PASSED  [ 95%]
tests/test_text_utils.py::TestEstimateMp3Duration::test_known   PASSED  [100%]

═══════════════════════ 22 passed in 0.25s ════════════════════════

βš™οΈ Makefile

make dev          # uvicorn with --reload
make test         # pytest -v
make test-cov     # pytest with coverage
make lint         # ruff check
make format       # ruff format
make docker-up    # docker compose up -d --build
make docker-down  # docker compose down
make docker-logs  # follow container logs
make clean        # remove __pycache__, .pytest_cache, etc.

🚒 Deploy

Docker Compose (recommended)

git clone https://github.com/xdev-asia-labs/xTTS.git
cd xTTS
cp .env.example .env   # adjust settings for production
docker compose up -d

# Verify
curl http://YOUR_SERVER:3099/health

Reverse Proxy (Nginx)

location /tts/ {
    proxy_pass http://127.0.0.1:3099/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 60s;
}

πŸ”§ Configuration

All settings can be configured via environment variables or .env file.

cp .env.example .env
Variable Default Description
Server
PORT 3099 HTTP server port
LOG_LEVEL info Uvicorn log level (debug, info, warning)
TTS Engine
TTS_MAX_CHUNK 500 Max characters per processing chunk
TTS_MAX_RETRIES 3 Retry count per chunk on transient errors
TTS_MAX_TEXT_LENGTH 20000 Maximum input text length
TTS_DEFAULT_VOICE vi-VN-HoaiMyNeural Default voice when not specified
TTS_DEFAULT_RATE +0% Default speaking rate
TTS_CONCURRENCY 2 Max concurrent edge-tts stream calls
Cache
TTS_CACHE_SIZE 100 Max cached TTS results (LRU eviction)
CORS
TTS_CORS_ORIGINS * Allowed origins, comma-separated
Captions
FPS 30 Frame rate for caption timing (match your video)
CAPTION_MAX_WORDS 6 Max words per caption phrase
CAPTION_MAX_GAP_FRAMES 10 Frame gap threshold to split captions

Caption Timing Explained

Frame 0                Frame 30               Frame 60
β”‚                      β”‚                      β”‚
β”œβ”€β”€β”€β”€ 1 second ───────────── 1 second ─────────
β”‚                      β”‚                      β”‚
β”‚  startFrame: 3       β”‚  startFrame: 35      β”‚
β”‚  endFrame: 28        β”‚  endFrame: 58        β”‚
β”‚  "Xin chΓ o anh em"  β”‚  "Δ‘Γ’y lΓ  X Dev"     β”‚

Captions use frame numbers at the configured FPS (default 30).
To convert: time_seconds = frame / FPS


πŸ€– MCP Server (AI Agent Config Manager)

xTTS includes an MCP Server (mcp_server.py) β€” a separate FastAPI service that lets AI agents (GitHub Copilot, Claude, etc.) manage xTTS configuration programmatically: view/update .env, restart the service, check Docker status.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  VS Code / Copilot  │◀──────▢│  xtts-mcp (:3100)       β”‚
β”‚  (MCP Client)       β”‚  HTTP   β”‚  mcp_server.py          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚                          β”‚
                                β”‚  β”œβ”€ Read/write .env      β”‚
                                β”‚  β”œβ”€ Docker restart       β”‚
                                β”‚  └─ Health check xtts    β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚ docker.sock
                                           β–Ό
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                β”‚  xtts (:3099)            β”‚
                                β”‚  TTS service             β”‚
                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Start with Docker Compose

Both services start together:

docker compose up -d

# Verify MCP server
curl http://localhost:3100/mcp/config/schema

VS Code Integration

The workspace includes .vscode/mcp.json to register the MCP server automatically:

{
  "servers": {
    "xtts-config": {
      "type": "http",
      "url": "http://localhost:3100",
      "description": "xTTS deployment configuration manager"
    }
  }
}

After docker compose up -d, reload VS Code window (Ctrl+Shift+P β†’ "Reload Window") for Copilot to pick up the server.

MCP API Endpoints

GET /mcp/config/schema β€” Config Schema

Returns all valid config keys with types, defaults, and descriptions.

curl http://localhost:3100/mcp/config/schema

GET /mcp/config β€” Current Config

Returns current .env values merged with defaults.

curl http://localhost:3100/mcp/config
{
  "config": {
    "PORT": {
      "value": "3099",
      "default": "3099",
      "is_custom": false,
      "type": "int",
      "description": "Server port"
    }
  }
}

GET /mcp/config/{key} β€” Get Single Key

curl http://localhost:3100/mcp/config/TTS_MAX_CHUNK

PUT /mcp/config β€” Update Single Key

curl -X PUT http://localhost:3100/mcp/config \
  -H "Content-Type: application/json" \
  -d '{"key": "TTS_MAX_CHUNK", "value": "1000"}'

PUT /mcp/config/batch β€” Update Multiple Keys

curl -X PUT http://localhost:3100/mcp/config/batch \
  -H "Content-Type: application/json" \
  -d '{"updates": {"TTS_MAX_CHUNK": "1000", "TTS_CONCURRENCY": "4"}}'

POST /mcp/config/reset/{key} β€” Reset Key to Default

curl -X POST http://localhost:3100/mcp/config/reset/TTS_MAX_CHUNK

POST /mcp/config/reset β€” Reset All to Defaults

curl -X POST http://localhost:3100/mcp/config/reset

GET /mcp/service/status β€” Check xTTS Health

Proxies health check to the main TTS service.

curl http://localhost:3100/mcp/service/status

POST /mcp/service/restart β€” Restart xTTS Container

Triggers docker compose restart xtts via mounted Docker socket.

curl -X POST http://localhost:3100/mcp/service/restart

GET /mcp/docker/status β€” Docker Compose Status

curl http://localhost:3100/mcp/docker/status

GET /mcp/env/diff β€” Show Custom Overrides

Shows which .env values differ from defaults.

curl http://localhost:3100/mcp/env/diff

Docker Compose Config

xtts-mcp:
  build: .
  container_name: xtts-mcp
  restart: unless-stopped
  ports:
    - "3100:3100"
  volumes:
    - ./.env:/app/.env                        # Read/write config
    - ./.env.example:/app/.env.example:ro     # Default reference
    - ./docker-compose.yml:/app/docker-compose.yml:ro
    - /var/run/docker.sock:/var/run/docker.sock  # Docker control
    - /usr/bin/docker:/usr/bin/docker:ro
  command: ["python", "mcp_server.py"]

Note: The Docker socket mount allows the MCP server to restart the xtts container. Only expose this on trusted networks.


πŸ“œ License

MIT

About

Lightweight TTS microservice using edge-tts with chunk-based processing, REST API, Docker ready

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors