
Cache spk weights#8

Open
BrandonYU34 wants to merge 7 commits into mtkresearch:main from BrandonYU34:cache-spk-weights

Conversation


BrandonYU34 commented May 26, 2025

This PR introduces a modification that allows users to cache speaker embeddings in a spk2info.pt file. Additionally, users can now specify a speaker_id at runtime to directly reference the stored speaker information.

  • Added functionality to add/remove speaker-related embeddings/info into spk2info.pt.
  • When a speaker_id is provided, the system retrieves speaker info from spk2info.pt instead of reprocessing the audio file.
  • Added a FastAPI server compatible with the CosyVoice API.
  • Added a WebUI interface for managing and previewing speaker_id.

This should significantly reduce latency during voice cloning or synthesis by avoiding redundant speaker-embedding extraction.

(Before/after latency screenshots omitted.)
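The cache-lookup flow described above can be sketched as follows. The function names and dict layout here are illustrative only: the real spk2info.pt is written with torch.save and holds tensors, while this stdlib-only sketch uses pickle just to show the hit/miss logic.

```python
import os
import pickle

def extract_speaker_info(audio_path):
    # Illustrative stand-in for the expensive step: in BreezyVoice this
    # runs the speaker encoder over the prompt audio.
    return {"embedding": [0.1, 0.2, 0.3], "prompt_audio": audio_path}

def load_spk2info(path):
    # The real cache is a .pt file written with torch.save;
    # pickle keeps this sketch dependency-free.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def get_speaker_info(spk_id, audio_path, cache_path="spk2info.pkl"):
    spk2info = load_spk2info(cache_path)
    if spk_id in spk2info:
        # Cache hit: skip re-extraction entirely.
        return spk2info[spk_id]
    # Cache miss: extract once, then persist for future runs.
    info = extract_speaker_info(audio_path)
    spk2info[spk_id] = info
    with open(cache_path, "wb") as f:
        pickle.dump(spk2info, f)
    return info
```

On the second call for the same spk_id, the stored entry is returned and the extraction step never runs, which is where the latency saving comes from.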

Contributor

Splend1d left a comment


Hi,

Thank you for your contribution to the codebase. Could you split the changes into the following three types you mentioned?

  1. Support speaker caching
  2. VRAM cleanup
  3. WebUI/API server

For speaker caching, please include an example of the two-step approach:

  1. generate cache files
  2. use cache files to generate speech

Please use an argument that points to the .pt file directly; don't use other model_dir arguments to find the cached path.

If you would like to adopt a spk_id approach, please register this during generating cached files. (Apologies if I missed this part)
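The requested interface (two-step flow, a flag pointing straight at the .pt file, and spk_id registration at cache-generation time) could be sketched like this. The subcommand names and the --spk2info_path flag are suggestions for illustration, not the project's actual CLI.

```python
import argparse

def build_parser():
    # Hypothetical CLI: --spk2info_path names the cache .pt file directly,
    # rather than deriving the path from a model directory.
    p = argparse.ArgumentParser(description="Two-step speaker caching")
    sub = p.add_subparsers(dest="step", required=True)

    # Step 1: generate the cache file, registering the spk_id here.
    gen = sub.add_parser("generate-cache")
    gen.add_argument("--spk_id", required=True)
    gen.add_argument("--speaker_prompt_audio_path", required=True)
    gen.add_argument("--spk2info_path", default="spk2info.pt")

    # Step 2: synthesize speech using the cached speaker info.
    tts = sub.add_parser("synthesize")
    tts.add_argument("--spk_id", required=True)
    tts.add_argument("--content_to_synthesize", required=True)
    tts.add_argument("--spk2info_path", default="spk2info.pt")
    return p
```

Keeping the cache path as its own flag means either step can be pointed at an arbitrary .pt file without touching any model_dir logic.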

Author

BrandonYU34 commented Jun 21, 2025

Thank you for your review! I'd like to address your questions as follows:

1. Regarding spk2info.pt

The spk2info.pt file included in this update is a modified version of the same file already provided in the BreezyVoice model folder on Hugging Face. The changes were made directly on top of that original file.

⚠️ Speakers from the original spk2info.pt are not compatible with this version due to slight differences in the internal structure.


2. How to Add a New Speaker and Generate Audio

Note: I encountered issues using the automatic Hugging Face model pulling feature, so I manually downloaded the model into the ./models directory. You can still run the commands with the default --model_dir MediaTek-Research/BreezyVoice-300M if you prefer to use the original model location.

(1) Add a New Speaker (using data/example.wav):

python3 add_spk.py --spk_id 臺灣女 \
  --speaker_prompt_audio_path "data/example.wav" \
  --speaker_prompt_text_transcription "在密碼學中,加密是將明文資訊改變為難以讀取的密文內容,使之不可讀的方法。只有擁有解密方法的對象,經由解密過程,才能將密文還原為正常可讀的內容。"

(2) Run TTS with the Specified Speaker:

python3 cache_inference.py --spk_id 臺灣女 \
  --content_to_synthesize "歡迎使用聯發創新基地 BreezyVoice 模型。" \
  --output_path results/output.wav

3. Launch the Server and Use the WebUI

For a more convenient experience, you can run the following command to start the server:

python3 server.py --port 50000

Then, open your browser and navigate to http://127.0.0.1:50000/.
This WebUI allows you to manage your current speakers and preview generated audio more easily.


4. Performance Note

🚀 With the optimizations in this version, running TTS on hardware at the level of an RTX 4090 now achieves near real-time performance—as long as the input is properly segmented by sentence.
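The per-sentence segmentation mentioned above can be approximated with a split on terminal punctuation. This regex heuristic is purely illustrative, not the project's actual preprocessing.

```python
import re

def split_sentences(text):
    # Split after CJK and Latin sentence-ending punctuation, keeping the
    # delimiter attached to its sentence via a lookbehind.
    parts = re.split(r"(?<=[。!?.!?])", text)
    return [p.strip() for p in parts if p.strip()]
```

Feeding the synthesizer one such segment at a time is what lets each chunk finish within roughly its own playback duration.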


@BrandonYU34 BrandonYU34 requested a review from Splend1d June 21, 2025 18:05
Contributor

Splend1d commented Jul 1, 2025

Hi Brandon,

I see, thank you for your clarification. I missed the part where spkinfo is part of the model.

However, regarding speaker caching, I would still require the following fixes:

  1. model-dir -> model_path
  2. default model_path has to be "MediaTek-Research/BreezyVoice". The example in README also has to inherit this default. If you are facing issues, please let us know about the issues. I am able to run your code with this setting.
  3. Please rename xxx_customized to reflect its functional purpose; in your case I think it should be main_cached_inference and frontend_with_cached_speaker_information. Feel free to adjust the naming to your liking.

Would you be able to commit this separately, without the TTS server and UI code? I would like to review the other components individually.

Best,
Jeff

Author

BrandonYU34 commented Jul 1, 2025

Hi Jeff,

Just a quick update — I’ve set the default model_path to "MediaTek-Research/BreezyVoice-300M" and confirmed everything runs smoothly. Apologies for the earlier mix-up; I mistakenly wrote model_dir instead of model_path in my previous message, which might’ve led to some confusion. The command in the latest message should work as intended now.

I’ve also cleaned up some variable and function names that were potentially unclear. Let me know if the new naming fits with the project’s style.

As for the API server and UI parts — I’ve removed them from this PR for now, per your suggestion. Once this is merged, I’ll open a separate PR to submit the cosyvoice API server and Web UI components.

Appreciate your time!
Brandon

