feat(narration): add FishAudio as a selectable TTS provider#54

Open
fancyboi999 wants to merge 2 commits into nexu-io:main from fancyboi999:feat/fishaudio-tts-provider

@fancyboi999

Closes #53

What

Adds FishAudio alongside MiniMax as a narration (voiceover) backend, selectable per workspace in Studio → Settings → Audio. Background music stays MiniMax-only (FishAudio has no music generation).

Why

We use FishAudio TTS heavily (custom reference_id voices), but the studio's narration pipeline was MiniMax-only. See #53 for motivation.

Changes

  • core: new fishaudio.ts (resolveFishAudioCredentials / generateFishTts / listFishVoices), plus a shared TtsAudioResult type and 16 unit tests.
  • config: MediaConfigStore gains FishAudio creds + a narrationProvider setting; the speech model is env-controlled (FISH_AUDIO_MODEL, default s1).
  • server: narration routes by provider; new /api/config/fishaudio, /api/config/narration-provider, /api/fishaudio/voices (the key stays server-side — the browser never sees it).
  • studio UI: provider toggle + FishAudio key pane (single host, no region); a searchable voice picker (reference_id) with sample preview; en + zh strings.
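
To make the config and routing bullets concrete, here is a minimal sketch of how credential resolution and provider selection could fit together. It is not the code in fishaudio.ts: the `StoredConfig` shape, the `FISH_AUDIO_API_KEY` variable name, the "stored key wins over env" precedence, and the MiniMax default are all assumptions for illustration. Only `FISH_AUDIO_MODEL` and its default of `s1` come from this PR.

```typescript
// Hypothetical shapes; the real MediaConfigStore fields may be named differently.
type NarrationProvider = "minimax" | "fishaudio";

interface StoredConfig {
  narrationProvider?: NarrationProvider;
  fishaudioApiKey?: string;
}

// The environment is passed in explicitly so the sketch stays pure and testable.
type Env = Record<string, string | undefined>;

interface FishCredsSketch {
  apiKey: string;
  model: string;
}

// Assumed precedence: a key saved in Settings wins, then the environment.
// Returns null when no key exists, so the caller can surface a
// "key not configured" message instead of calling the API.
function resolveFishCredsSketch(config: StoredConfig, env: Env): FishCredsSketch | null {
  const apiKey = config.fishaudioApiKey?.trim() || env["FISH_AUDIO_API_KEY"]?.trim();
  if (!apiKey) return null;
  // The speech model is env-controlled and defaults to "s1" (from the PR).
  const model = env["FISH_AUDIO_MODEL"]?.trim() || "s1";
  return { apiKey, model };
}

// Assumed default: workspaces that never chose a provider stay on MiniMax.
function pickNarrationProvider(config: StoredConfig): NarrationProvider {
  return config.narrationProvider === "fishaudio" ? "fishaudio" : "minimax";
}
```

Keeping both functions free of I/O means the server route can call them before deciding which backend to hit, and the key never has to leave the server process.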

Design doc: research/2026-06-15-spec-10-fishaudio-tts-provider.md (RFC-10).

Provider differences absorbed

| | MiniMax | FishAudio |
|---|---|---|
| Model select | body field | model header (env-driven) |
| Response | JSON + hex envelope | raw binary stream |
| Voices | 6 fixed voice_id | account reference_id (searchable) |
| Region | intl / CN split | single global host |
| Music | yes | none (stays MiniMax) |
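
The Response difference is the one most likely to leak into callers, so here is a rough sketch of normalizing both shapes into one result type. The `TtsAudioResultSketch` fields, the `audioHex` field name, and the `audio/mpeg` MIME type for the MiniMax path are assumptions; the shared TtsAudioResult in this PR may look different.

```typescript
// Hypothetical stand-in for the shared TtsAudioResult type.
interface TtsAudioResultSketch {
  audio: Uint8Array;
  mimeType: string;
}

// Decode a hex string into bytes, rejecting odd lengths and non-hex characters.
function hexToBytes(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new Error("invalid hex audio payload");
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// MiniMax-style: a JSON envelope carrying hex-encoded audio (field name assumed).
function fromHexEnvelope(envelope: { audioHex: string }): TtsAudioResultSketch {
  return { audio: hexToBytes(envelope.audioHex), mimeType: "audio/mpeg" };
}

// FishAudio-style: the response body is already the raw audio bytes.
function fromBinaryBody(body: ArrayBuffer): TtsAudioResultSketch {
  return { audio: new Uint8Array(body), mimeType: "audio/mpeg" };
}
```

With both providers reduced to the same shape, the asset-creation and export steps do not need to know which backend produced the narration.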

Screenshots

Settings → Audio provider toggle

FishAudio searchable voice picker

Verification (real, not mocks)

  • Unit: pnpm --filter @html-video/core test → 16/16.
  • Real API: generateFishTts → real MP3 (ffprobe-valid); listFishVoices → account models.
  • Studio E2E: configure key → switch provider → search + pick a voice → synthesize → asset created; export muxes the FishAudio narration into the MP4; ffprobe shows h264 video + aac audio, and ffmpeg -f null - decodes clean.
  • Regression: cli 14/14, runtime 4/4, adapter-remotion 8/8, smoke all green; the MiniMax narration path is unchanged (only a creds variable rename).

Notes

  • A pre-existing adapter-hyperframes test script points at a missing test/ dir (unrelated to this change; present on main).
  • MiniMax actual generation wasn't re-run here (no key on hand); its routing + the graceful "key not configured" message were verified.

Adds FishAudio alongside MiniMax as a narration (voiceover) backend,
selectable per workspace in Studio → Settings → Audio. Background music
stays MiniMax-only (FishAudio has no music generation).

- core: new fishaudio.ts (resolveFishAudioCredentials / generateFishTts /
  listFishVoices) + shared TtsAudioResult type + 16 unit tests
- config: MediaConfigStore gains fishaudio creds + narrationProvider;
  the speech model is env-controlled (FISH_AUDIO_MODEL, default s1)
- server: narration routes by provider; new /api/config/fishaudio,
  /api/config/narration-provider, /api/fishaudio/voices (key stays server-side)
- studio UI: provider toggle + FishAudio key pane (no region); searchable
  voice picker (reference_id) with sample preview; en + zh strings

Verified end-to-end against the real FishAudio API. MiniMax path unchanged.
See research/2026-06-15-spec-10-fishaudio-tts-provider.md (RFC-10).
@lefarcen lefarcen requested a review from mrcfps June 15, 2026 04:36
@lefarcen lefarcen added size/XL Size XL (700-1499 LOC) risk/medium Medium risk type/feature Feature change labels Jun 15, 2026
@lefarcen

Contributor

Hey @fancyboi999! 👋

Nice work on the FishAudio integration: the design doc plus comprehensive testing (16 unit tests + real API verification + E2E export validation) gives a clear picture of what's changing. The provider abstraction looks clean, and keeping the key server-side is the right security move.

I've assigned this to @mrcfps for code review. A few things that look promising from the structure:

  • The credential resolution pattern mirrors MiniMax (env fallback + server-side only)
  • Provider differences are absorbed cleanly (model header vs body, binary stream vs JSON envelope)
  • The searchable voice picker is a nice UX touch for reference_id accounts

Related context: #45 discusses provider selection patterns, and #38 (SenseAudio) has overlapping config/UI seams worth checking for consistency.

Once @mrcfps reviews, we'll move this forward. Let me know if anything's unclear!

💡 To drive this PR to merge hands-free, paste this to your AI coding agent (Claude Code / Codex / opencode / Cursor …):
Take over nexu-io/html-video#54 until it merges — read https://raw.githubusercontent.com/nexu-io/looper/main/skills/pr-takeover/SKILL.md and follow it.
