calesthio · calesthio · May 7, 2026 · May 4, 2026
@@ -0,0 +1,97 @@
+---
+name: doubao-tts
+description: Generate Mandarin and multilingual narration with Volcengine Doubao Speech 2.0. Use when creating Chinese voiceovers, when the user prefers Doubao/Volcengine/火山引擎/豆包 TTS, or when narration needs character-level timestamp metadata for subtitles.
+---
+
+# Doubao TTS
+
+Requires `DOUBAO_SPEECH_API_KEY` in `.env`.
+Set `DOUBAO_SPEECH_VOICE_TYPE` for the default voice, or pass `voice_id` to the tool.
+
+## Current API
+
+Use the new-console API key flow:
+
+```text
+X-Api-Key: ${DOUBAO_SPEECH_API_KEY}
+X-Api-Resource-Id: seed-tts-2.0
+```
+
+Do not use `X-Api-App-Id` and `X-Api-Access-Key` with a new-console API Key. If the API returns `load grant: requested grant not found`, the key type or auth header is probably wrong.
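
A minimal sketch of the header construction described above, assuming `.env` has already been loaded into the process environment. The helper name `doubao_auth_headers` is hypothetical and is not taken from the OpenMontage code; only the two header names and the `seed-tts-2.0` resource id come from this document.

```python
import os


def doubao_auth_headers(resource_id: str = "seed-tts-2.0") -> dict:
    """Build request headers for the new-console API key flow (hypothetical helper)."""
    api_key = os.environ.get("DOUBAO_SPEECH_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending an empty key to the API.
        raise RuntimeError("DOUBAO_SPEECH_API_KEY is not set")
    # Only X-Api-Key is sent; X-Api-App-Id and X-Api-Access-Key are deliberately absent.
    return {
        "X-Api-Key": api_key,
        "X-Api-Resource-Id": resource_id,
    }
```

Keeping the header logic in one place makes it harder to mix the old app-id/access-key headers with a new-console key by accident.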
+
+For long-form video narration, prefer the async endpoints: submit the job, then query for the result:
+
+```text
+POST https://openspeech.bytedance.com/api/v3/tts/submit
+POST https://openspeech.bytedance.com/api/v3/tts/query
+```
+
+The query response includes `audio_url` plus `sentences[].words[]` timing metadata that can be used to build subtitles.
+
+## OpenMontage Usage
+
+Generate with the TTS selector:
+
+```python
+from tools.audio.tts_selector import TTSSelector
+
+result = TTSSelector().execute({
+    "preferred_provider": "doubao",
+    "text": "如果 AI 真的会改变未来，普通人到底该怎么参与？",
+    "voice_id": "zh_female_vv_uranus_bigtts",
+    "output_path": "projects/my-video/assets/audio/narration.mp3",
+    "speech_rate": 0,
+    "enable_timestamp": True,
+})
+```
+
+Or call the provider directly:
+
+```python
+from tools.audio.doubao_tts import DoubaoTTS
+
+result = DoubaoTTS().execute({
+    "text": "短样本试听文本。",
+    "voice_id": "zh_female_vv_uranus_bigtts",
+    "output_path": "projects/my-video/assets/audio/doubao_sample.mp3",
+})
+```
+
+The provider writes:
+
+- `output_path`: downloaded audio file
+- `metadata_path`: full query response JSON, defaulting to `<output_path>.json`
+
+## Recommended Workflow
+
+1. Generate a 10-15 second sample before a full paid narration.
+2. Ask the user to approve voice naturalness, accent, and speed.
+3. Generate the full narration only after approval.
+4. Keep the query JSON. It is the source of truth for subtitle timing.
+5. Build captions from `sentences[].words[]`, not from estimated text length.
+6. Group captions by Chinese semantic phrases before applying timestamps. Do not split only by fixed character count; it can break phrases like "在不押单个公司的情况下" or "可能会被慢慢稀释" and hurt comprehension.
+7. Let the video duration follow the approved voice rhythm unless the user explicitly asks to match a prior runtime.
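
Steps 5 and 6 can be sketched roughly as follows: phrases are segmented first (by a person or another tool), then each phrase takes its start and end from the word timings it covers. This is an illustration under stated assumptions, not the provider's implementation. The keys `"text"`, `"start"`, and `"end"` are hypothetical stand-ins for whatever fields `sentences[].words[]` actually carries, and the `words` list is assumed to be flattened across sentences already.

```python
import unicodedata


def _letters_only(text: str) -> str:
    """Keep letters and digits so punctuation and spaces do not affect alignment."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] in ("L", "N"))


def align_phrases(phrases, words):
    """Assign start/end times to pre-segmented phrases from word-level timings.

    `words` is a flat list of dicts with hypothetical keys "text", "start", "end".
    """
    cues = []
    i = 0
    for phrase in phrases:
        target = len(_letters_only(phrase))
        if target == 0:
            continue  # skip phrases that contain only punctuation or whitespace
        consumed, start, end = 0, None, None
        while i < len(words) and consumed < target:
            word = words[i]
            size = len(_letters_only(word["text"]))
            if size:
                if start is None:
                    start = word["start"]
                end = word["end"]
                consumed += size
            i += 1
        if consumed < target:
            raise ValueError(f"ran out of timing data before phrase: {phrase!r}")
        cues.append({"text": phrase, "start": start, "end": end})
    return cues
```

Because the phrase list drives the grouping, a phrase such as "在不押单个公司的情况下" stays on one caption regardless of its length. If a timing entry straddles a phrase boundary, this sketch assigns the whole entry to the earlier phrase.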
+
+## Parameters
+
+- `voice_id`: Doubao `speaker` / voice type. Defaults to `DOUBAO_SPEECH_VOICE_TYPE`.
+- `resource_id`: use `seed-tts-2.0` for Doubao Speech 2.0 voices.
+- `speech_rate`: `0` is normal, `100` is 2x, `-50` is 0.5x.
+- `sample_rate`: default `24000`.
+- `enable_timestamp`: default `true`.
+- `return_usage`: default `true`, requests usage metadata when available.
+
+Do not pass `additions.explicit_language` by default. Some endpoint/key combinations reject `zh-cn` with `unsupported additions explicit language zh-cn`.
+
+For calm Mandarin explainers, start with `speech_rate: 0`. If the result is too long for the approved format, make a short comparison sample with `speech_rate: 25` or `50` before regenerating the full narration. Do not speed up only to match a previous provider's duration if the user prefers Doubao's natural pace.
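
To judge whether `speech_rate: 25` or `50` is worth sampling, a rough duration estimate can help. The sketch below assumes the mapping is linear between the three documented points (`-50` is 0.5x, `0` is 1x, `100` is 2x) and that `-50..100` is the valid range; both are assumptions, and the function names are hypothetical. Treat the result as a planning figure and confirm with a real sample.

```python
def rate_to_multiplier(speech_rate: int) -> float:
    """Map speech_rate to a speed multiplier, assuming a linear scale."""
    if not -50 <= speech_rate <= 100:
        # Assumed range, taken from the lowest and highest documented points.
        raise ValueError("speech_rate is outside the assumed -50..100 range")
    return 1.0 + speech_rate / 100.0


def estimate_duration(baseline_seconds: float, speech_rate: int) -> float:
    """Estimate narration length at `speech_rate` from a duration measured at rate 0."""
    return baseline_seconds / rate_to_multiplier(speech_rate)
```

Under these assumptions, a 60-second narration at rate `0` would take about 48 seconds at `25` and about 40 seconds at `50`.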
+
+## Troubleshooting
+
+- `load grant: requested grant not found`: wrong key type or wrong auth header. Use `X-Api-Key` for new-console API Keys.
+- `speaker permission denied`: voice id is wrong or not authorized for the selected resource.
+- `quota exceeded`: the account's quota, lifetime character allowance, or concurrency limit has been exceeded.
+- Missing timestamps: verify `enable_timestamp: true`, keep the query JSON, and confirm the selected endpoint returned `sentences`.
+
+## Safety
+
+Never print or write the API key to logs, metadata, patches, or project artifacts. `.env.example` should contain only variable names with empty values.
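
One way to enforce this rule is to scrub the key from any structure before it is serialized. The sketch below is an assumption about how that could be done, not a description of the provider's code; `redact_secret` and `dump_metadata_safely` are hypothetical helpers.

```python
import json


def redact_secret(value, secret: str, placeholder: str = "***REDACTED***"):
    """Recursively replace every occurrence of `secret` in strings, dicts, and lists."""
    if not secret:
        return value  # nothing to redact
    if isinstance(value, str):
        return value.replace(secret, placeholder)
    if isinstance(value, dict):
        return {
            redact_secret(key, secret, placeholder): redact_secret(item, secret, placeholder)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact_secret(item, secret, placeholder) for item in value]
    return value  # numbers, booleans, None pass through unchanged


def dump_metadata_safely(metadata: dict, secret: str) -> str:
    """Serialize metadata to JSON text with the secret scrubbed first."""
    return json.dumps(redact_secret(metadata, secret), ensure_ascii=False, indent=2)
```

Redacting at the serialization boundary covers cases where the key leaks indirectly, for example inside an echoed request header or an error message.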
@@ -13,6 +13,8 @@ GOOGLE_API_KEY=              # Google Imagen images, Google Cloud TTS (700+ voic
 ELEVENLABS_API_KEY=          # TTS narration, music generation, sound effects
 OPENAI_API_KEY=              # OpenAI TTS fallback and DALL-E image generation
 XAI_API_KEY=                 # Grok image generation/editing and Grok video generation
+DOUBAO_SPEECH_API_KEY=       # Volcengine Doubao Speech TTS (new console API Key)
+DOUBAO_SPEECH_VOICE_TYPE=    # Default Doubao speaker/voice type, e.g. zh_female_vv_uranus_bigtts
 # Piper local voices do not require env vars; install `piper-tts` via pip
 
 # --- Music ---

@@ -39,6 +39,8 @@ GOOGLE_API_KEY=              # Google TTS + Google Imagen
 ELEVENLABS_API_KEY=          # TTS, music, sound effects (10K chars/month free)
 OPENAI_API_KEY=              # OpenAI TTS + DALL-E 3 images
 XAI_API_KEY=                 # xAI Grok image generation/editing + Grok video generation
+DOUBAO_SPEECH_API_KEY=       # Volcengine Doubao Speech TTS (strong Mandarin narration)
+DOUBAO_SPEECH_VOICE_TYPE=    # Default Doubao speaker/voice type
 
 # MULTI-MODEL GATEWAY (one key, 6+ tools)
 FAL_KEY=                     # FLUX, Recraft, Kling, Veo, MiniMax video
@@ -159,6 +161,52 @@ No subscription — pure pay-as-you-go, no minimum spend.
 
 ---
 
+### Doubao Speech — Mandarin TTS
+
+> **Strong Mandarin narration.** Volcengine Doubao Speech is a good choice for Chinese explainer voiceovers and long-form narration that needs subtitle timing metadata.
+
+**Tools unlocked:** `doubao_tts`
+**Env vars:** `DOUBAO_SPEECH_API_KEY`, `DOUBAO_SPEECH_VOICE_TYPE`
+
+#### Setup
+
+1. Open the Volcengine Doubao Speech console and enable Speech Synthesis 2.0.
+2. Create a new-console API Key.
+3. Choose a Speech 2.0 voice type, for example `zh_female_vv_uranus_bigtts`.
+4. Add to `.env`:
+   ```bash
+   DOUBAO_SPEECH_API_KEY=your-api-key
+   DOUBAO_SPEECH_VOICE_TYPE=zh_female_vv_uranus_bigtts
+   ```
+
+#### API Notes
+
+OpenMontage uses the new-console API key flow:
+
+```text
+X-Api-Key: ${DOUBAO_SPEECH_API_KEY}
+X-Api-Resource-Id: seed-tts-2.0
+```
+
+Do not pass a new-console API Key as `X-Api-App-Id` or `X-Api-Access-Key`. That mismatch can produce `load grant: requested grant not found`.
+
+#### What It Is Best For
+
+- Natural Mandarin narration for Chinese-language explainers
+- Async long-form narration via `/api/v3/tts/submit` and `/api/v3/tts/query`
+- Character-level timing metadata for subtitle alignment
+- Calm educational pacing where the video duration can follow the approved voice rhythm
+
+#### Pacing
+
+Start with `speech_rate: 0` for natural Mandarin delivery. If the approved format needs a tighter runtime, compare short samples at `speech_rate: 25` or `50` before generating the full narration. Do not force Doubao to match another provider's duration unless the user explicitly wants that tradeoff.
+
+#### Pricing
+
+Doubao Speech 2.0 is billed by character package or usage in Volcengine. OpenMontage estimates cost from text length and prefers provider-returned usage metadata when available.
+
+---
+
 ### Google — TTS + Imagen (Shared Key)
 
 > **One key, two tools.** Google Cloud TTS has 700+ voices in 50+ languages — the strongest localization option. Imagen 4 generates high-quality images.