97 changes: 97 additions & 0 deletions .agents/skills/doubao-tts/SKILL.md
@@ -0,0 +1,97 @@
---
name: doubao-tts
description: Generate Mandarin and multilingual narration with Volcengine Doubao Speech 2.0. Use when creating Chinese voiceovers, when the user prefers Doubao/Volcengine/火山引擎/豆包 TTS, or when narration needs character-level timestamp metadata for subtitles.
---

# Doubao TTS

Requires `DOUBAO_SPEECH_API_KEY` in `.env`.
Set `DOUBAO_SPEECH_VOICE_TYPE` for the default voice, or pass `voice_id` to the tool.

## Current API

Use the new-console API key flow:

```text
X-Api-Key: ${DOUBAO_SPEECH_API_KEY}
X-Api-Resource-Id: seed-tts-2.0
```

Do not use `X-Api-App-Id` and `X-Api-Access-Key` with a new-console API Key. If the API returns `load grant: requested grant not found`, the key type or auth header is probably wrong.
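
A minimal sketch of assembling these headers in Python, assuming the key has already been loaded into the process environment (for example from `.env`). The helper name `build_doubao_headers` is hypothetical and is not part of OpenMontage; it only illustrates which headers are sent and which are left out.

```python
import os


def build_doubao_headers(resource_id: str = "seed-tts-2.0") -> dict:
    """Build auth headers for a new-console API Key (hypothetical helper)."""
    api_key = os.environ.get("DOUBAO_SPEECH_API_KEY", "")
    if not api_key:
        # Fail early without echoing any key material.
        raise RuntimeError("DOUBAO_SPEECH_API_KEY is not set")
    # New-console keys go in X-Api-Key only; X-Api-App-Id and
    # X-Api-Access-Key are deliberately not sent.
    return {
        "X-Api-Key": api_key,
        "X-Api-Resource-Id": resource_id,
    }
```

The function raises rather than sending an empty key, so a missing `.env` entry surfaces before any request is made.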

For long-form video narration, prefer the async submit/query endpoints:

```text
POST https://openspeech.bytedance.com/api/v3/tts/submit
POST https://openspeech.bytedance.com/api/v3/tts/query
```

The query response includes `audio_url` plus `sentences[].words[]` timing metadata, which can be used to build subtitles.

## OpenMontage Usage

Generate with the TTS selector:

```python
from tools.audio.tts_selector import TTSSelector

result = TTSSelector().execute({
    "preferred_provider": "doubao",
    "text": "如果 AI 真的会改变未来,普通人到底该怎么参与?",
    "voice_id": "zh_female_vv_uranus_bigtts",
    "output_path": "projects/my-video/assets/audio/narration.mp3",
    "speech_rate": 0,
    "enable_timestamp": True,
})
```

Or call the provider directly:

```python
from tools.audio.doubao_tts import DoubaoTTS

result = DoubaoTTS().execute({
    "text": "短样本试听文本。",
    "voice_id": "zh_female_vv_uranus_bigtts",
    "output_path": "projects/my-video/assets/audio/doubao_sample.mp3",
})
```

The provider writes:

- `output_path`: downloaded audio file
- `metadata_path`: full query response JSON, defaulting to `<output_path>.json`

## Recommended Workflow

1. Generate a 10-15 second sample before a full paid narration.
2. Ask the user to approve voice naturalness, accent, and speed.
3. Generate the full narration only after approval.
4. Keep the query JSON. It is the source of truth for subtitle timing.
5. Build captions from `sentences[].words[]`, not from estimated text length.
6. Group captions by Chinese semantic phrases before applying timestamps. Do not split only by fixed character count; it can break phrases like "在不押单个公司的情况下" or "可能会被慢慢稀释" and hurt comprehension.
7. Let the video duration follow the approved voice rhythm unless the user explicitly asks to match a prior runtime.
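
Steps 5 and 6 can be sketched roughly as below. The per-word keys `word`, `start_time`, and `end_time` are assumptions, not confirmed field names; check them against the saved query JSON before relying on this. Breaking at punctuation is only a crude stand-in for semantic phrase grouping, and if the returned words omit punctuation, phrase boundaries would have to come from the original script instead.

```python
from typing import Any, Dict, List

# Punctuation treated as a phrase boundary (an assumption; tune per project).
PHRASE_BREAKS = "，。！？；,.!?;"


def _flush(words: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge consecutive timed words into one caption entry."""
    return {
        "text": "".join(w["word"] for w in words),
        "start": words[0]["start_time"],
        "end": words[-1]["end_time"],
    }


def build_captions(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Group timed words into captions that end at punctuation."""
    captions: List[Dict[str, Any]] = []
    for sentence in response.get("sentences", []):
        current: List[Dict[str, Any]] = []
        for word in sentence.get("words", []):
            current.append(word)
            # Assumed keys: "word", "start_time", "end_time".
            if word["word"] and word["word"][-1] in PHRASE_BREAKS:
                captions.append(_flush(current))
                current = []
        if current:  # trailing phrase without closing punctuation
            captions.append(_flush(current))
    return captions
```

Caption timing comes directly from the first and last word of each group, so no duration is estimated from text length.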

## Parameters

- `voice_id`: Doubao `speaker` / voice type. Defaults to `DOUBAO_SPEECH_VOICE_TYPE`.
- `resource_id`: use `seed-tts-2.0` for Doubao Speech 2.0 voices.
- `speech_rate`: `0` is normal, `100` is 2x, `-50` is 0.5x.
- `sample_rate`: default `24000`.
- `enable_timestamp`: default `true`.
- `return_usage`: default `true`; requests usage metadata when available.

Do not pass `additions.explicit_language` by default. Some endpoint/key combinations reject `zh-cn` with `unsupported additions explicit language zh-cn`.

For calm Mandarin explainers, start with `speech_rate: 0`. If the result is too long for the approved format, make a short comparison sample with `speech_rate: 25` or `50` before regenerating the full narration. Do not speed up only to match a previous provider's duration if the user prefers Doubao's natural pace.
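
To judge whether a faster rate is worth sampling, a rough runtime estimate can help. The sketch below assumes the speed multiplier is linear between the three documented points (`-50` is 0.5x, `0` is 1x, `100` is 2x) and that those points bound the valid range; both are assumptions, and the function names are hypothetical. Always confirm with a real short sample.

```python
def speech_rate_to_factor(speech_rate: int) -> float:
    """Map speech_rate to an approximate speed multiplier.

    Assumes linear interpolation between the documented points:
    -50 -> 0.5x, 0 -> 1x, 100 -> 2x.
    """
    if not -50 <= speech_rate <= 100:
        raise ValueError("speech_rate expected in [-50, 100]")
    return 1.0 + speech_rate / 100.0


def estimate_duration(base_seconds: float, speech_rate: int) -> float:
    """Estimate narration length from a sample measured at speech_rate=0."""
    return base_seconds / speech_rate_to_factor(speech_rate)
```

Under these assumptions, a narration that runs 60 seconds at `speech_rate: 0` would take about 48 seconds at `25` and about 40 seconds at `50`.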

## Troubleshooting

- `load grant: requested grant not found`: wrong key type or wrong auth header. Use `X-Api-Key` for new-console API Keys.
- `speaker permission denied`: voice id is wrong or not authorized for the selected resource.
- `quota exceeded`: the quota, lifetime character allowance, or concurrency limit was exceeded.
- Missing timestamps: verify `enable_timestamp: true`, keep the query JSON, and confirm the selected endpoint returned `sentences`.
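
The list above can be turned into a small lookup so that failures surface with an actionable hint. This is only an illustrative sketch: `explain_error` and `ERROR_HINTS` are hypothetical names, and matching on message substrings assumes the API keeps returning the wording quoted in this document.

```python
from typing import Optional

# Substrings of the error messages quoted in this document, mapped to hints.
ERROR_HINTS = {
    "requested grant not found": "Wrong key type or auth header; use X-Api-Key.",
    "speaker permission denied": "Voice id is wrong or not authorized for the resource.",
    "quota exceeded": "Quota, lifetime characters, or concurrency exceeded.",
    "unsupported additions explicit language": "Remove additions.explicit_language.",
}


def explain_error(message: str) -> Optional[str]:
    """Return a hint for a known error message, or None if unrecognized."""
    lowered = message.lower()
    for needle, hint in ERROR_HINTS.items():
        if needle in lowered:
            return hint
    return None
```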

## Safety

Never print or write the API key to logs, metadata, patches, or project artifacts. `.env.example` should contain only empty variable names.
2 changes: 2 additions & 0 deletions .env.example
@@ -13,6 +13,8 @@ GOOGLE_API_KEY= # Google Imagen images, Google Cloud TTS (700+ voic
ELEVENLABS_API_KEY= # TTS narration, music generation, sound effects
OPENAI_API_KEY= # OpenAI TTS fallback and DALL-E image generation
XAI_API_KEY= # Grok image generation/editing and Grok video generation
DOUBAO_SPEECH_API_KEY= # Volcengine Doubao Speech TTS (new console API Key)
DOUBAO_SPEECH_VOICE_TYPE= # Default Doubao speaker/voice type, e.g. zh_female_vv_uranus_bigtts
# Piper local voices do not require env vars; install `piper-tts` via pip

# --- Music ---
48 changes: 48 additions & 0 deletions docs/PROVIDERS.md
@@ -39,6 +39,8 @@ GOOGLE_API_KEY= # Google TTS + Google Imagen
ELEVENLABS_API_KEY= # TTS, music, sound effects (10K chars/month free)
OPENAI_API_KEY= # OpenAI TTS + DALL-E 3 images
XAI_API_KEY= # xAI Grok image generation/editing + Grok video generation
DOUBAO_SPEECH_API_KEY= # Volcengine Doubao Speech TTS (strong Mandarin narration)
DOUBAO_SPEECH_VOICE_TYPE= # Default Doubao speaker/voice type

# MULTI-MODEL GATEWAY (one key, 6+ tools)
FAL_KEY= # FLUX, Recraft, Kling, Veo, MiniMax video
@@ -159,6 +161,52 @@ No subscription — pure pay-as-you-go, no minimum spend.

---

### Doubao Speech — Mandarin TTS

> **Strong Mandarin narration.** Volcengine Doubao Speech is a good choice for Chinese explainer voiceovers and long-form narration that needs subtitle timing metadata.

**Tools unlocked:** `doubao_tts`
**Env vars:** `DOUBAO_SPEECH_API_KEY`, `DOUBAO_SPEECH_VOICE_TYPE`

#### Setup

1. Open the Volcengine Doubao Speech console and enable Speech Synthesis 2.0.
2. Create a new-console API Key.
3. Choose a Speech 2.0 voice type, for example `zh_female_vv_uranus_bigtts`.
4. Add to `.env`:
   ```bash
   DOUBAO_SPEECH_API_KEY=your-api-key
   DOUBAO_SPEECH_VOICE_TYPE=zh_female_vv_uranus_bigtts
   ```

#### API Notes

OpenMontage uses the new-console API key flow:

```text
X-Api-Key: ${DOUBAO_SPEECH_API_KEY}
X-Api-Resource-Id: seed-tts-2.0
```

Do not pass a new-console API Key as `X-Api-App-Id` or `X-Api-Access-Key`. That mismatch can produce `load grant: requested grant not found`.

#### What It Is Best For

- Natural Mandarin narration for Chinese-language explainers
- Async long-form narration via `/api/v3/tts/submit` and `/api/v3/tts/query`
- Character-level timing metadata for subtitle alignment
- Calm educational pacing where the video duration can follow the approved voice rhythm

#### Pacing

Start with `speech_rate: 0` for natural Mandarin delivery. If the approved format needs a tighter runtime, compare short samples at `speech_rate: 25` or `50` before generating the full narration. Do not force Doubao to match another provider's duration unless the user explicitly wants that tradeoff.

#### Pricing

Doubao Speech 2.0 is billed by character package or usage in Volcengine. OpenMontage estimates cost from text length and prefers provider-returned usage metadata when available.

---

### Google — TTS + Imagen (Shared Key)

> **One key, two tools.** Google Cloud TTS has 700+ voices in 50+ languages — the strongest localization option. Imagen 4 generates high-quality images.