
Cache spk weights#8

Open
BrandonYU34 wants to merge 7 commits into mtkresearch:main from BrandonYU34:cache-spk-weights

Conversation


BrandonYU34 commented May 26, 2025

This PR introduces a modification that allows users to cache speaker embeddings in a spk2info.pt file. Additionally, users can now specify a speaker_id at runtime to directly reference the stored speaker information.

  • Added functionality to add/remove speaker-related embeddings/info into spk2info.pt.
  • When a speaker_id is provided, the system retrieves speaker info from spk2info.pt instead of reprocessing the audio file.
  • Added a FastAPI server compatible with the CosyVoice API.
  • Added a WebUI interface for managing and previewing speaker_id.

This should significantly reduce latency during voice cloning or synthesis by avoiding redundant speaker-embedding extraction.

(Before/after latency screenshots omitted.)
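The cache-lookup flow described above can be sketched as follows. The function names and dict layout here are illustrative only: the real spk2info.pt is written with torch.save and holds tensors, while this stdlib-only sketch uses pickle just to show the hit/miss logic.

```python
import os
import pickle

def extract_speaker_info(audio_path):
    # Illustrative stand-in for the expensive step: in BreezyVoice this
    # runs the speaker encoder over the prompt audio.
    return {"embedding": [0.1, 0.2, 0.3], "prompt_audio": audio_path}

def load_spk2info(path):
    # The real cache is a .pt file written with torch.save;
    # pickle keeps this sketch dependency-free.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def get_speaker_info(spk_id, audio_path, cache_path="spk2info.pkl"):
    spk2info = load_spk2info(cache_path)
    if spk_id in spk2info:
        # Cache hit: skip re-extraction entirely.
        return spk2info[spk_id]
    # Cache miss: extract once, then persist for future runs.
    info = extract_speaker_info(audio_path)
    spk2info[spk_id] = info
    with open(cache_path, "wb") as f:
        pickle.dump(spk2info, f)
    return info
```

On the second call for the same spk_id, the stored entry is returned and the extraction step never runs, which is where the latency saving comes from.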

Contributor

Splend1d left a comment


Hi,

Thank you for your contribution to the codebase. Could you split the changes into the following three types you mentioned?

  1. Support speaker caching
  2. VRAM cleanup
  3. WebUI/API server

For speaker caching, please include an example of the two-step approach:

  1. generate cache files
  2. use cache files to generate speech

Please use an argument that points to the .pt file directly; don't use other model_dir arguments to find the cached path.

If you would like to adopt a spk_id approach, please register this during generating cached files. (Apologies if I missed this part)
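The requested interface (two-step flow, a flag pointing straight at the .pt file, and spk_id registration at cache-generation time) could be sketched like this. The subcommand names and the --spk2info_path flag are suggestions for illustration, not the project's actual CLI.

```python
import argparse

def build_parser():
    # Hypothetical CLI: --spk2info_path names the cache .pt file directly,
    # rather than deriving the path from a model directory.
    p = argparse.ArgumentParser(description="Two-step speaker caching")
    sub = p.add_subparsers(dest="step", required=True)

    # Step 1: generate the cache file, registering the spk_id here.
    gen = sub.add_parser("generate-cache")
    gen.add_argument("--spk_id", required=True)
    gen.add_argument("--speaker_prompt_audio_path", required=True)
    gen.add_argument("--spk2info_path", default="spk2info.pt")

    # Step 2: synthesize speech using the cached speaker info.
    tts = sub.add_parser("synthesize")
    tts.add_argument("--spk_id", required=True)
    tts.add_argument("--content_to_synthesize", required=True)
    tts.add_argument("--spk2info_path", default="spk2info.pt")
    return p
```

Keeping the cache path as its own flag means either step can be pointed at an arbitrary .pt file without touching any model_dir logic.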

Author

BrandonYU34 commented Jun 21, 2025

Thank you for your review! I'd like to address your questions as follows:

1. Regarding spk2info.pt

The spk2info.pt file included in this update is a modified version of the same file already provided in the BreezyVoice model folder on Hugging Face. The changes were made directly on top of that original file.

⚠️ Speakers from the original spk2info.pt are not compatible with this version due to slight differences in the internal structure.


2. How to Add a New Speaker and Generate Audio

Note: I encountered issues using the automatic Hugging Face model pulling feature, so I manually downloaded the model into the ./models directory. You can still run the commands with the default --model_dir MediaTek-Research/BreezyVoice-300M if you prefer to use the original model location.

(1) Add a New Speaker (using data/example.wav):

python3 add_spk.py --spk_id 臺灣女 \
  --speaker_prompt_audio_path "data/example.wav" \
  --speaker_prompt_text_transcription "在密碼學中,加密是將明文資訊改變為難以讀取的密文內容,使之不可讀的方法。只有擁有解密方法的對象,經由解密過程,才能將密文還原為正常可讀的內容。"

(2) Run TTS with the Specified Speaker:

python3 cache_inference.py --spk_id 臺灣女 \
  --content_to_synthesize "歡迎使用聯發創新基地 BreezyVoice 模型。" \
  --output_path results/output.wav

3. Launch the Server and Use the WebUI

For a more convenient experience, you can run the following command to start the server:

python3 server.py --port 50000

Then, open your browser and navigate to http://127.0.0.1:50000/.
This WebUI allows you to manage your current speakers and preview generated audio more easily.


4. Performance Note

🚀 With the optimizations in this version, running TTS on hardware at the level of an RTX 4090 now achieves near real-time performance—as long as the input is properly segmented by sentence.
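The per-sentence segmentation mentioned above can be approximated with a split on terminal punctuation. This regex heuristic is purely illustrative, not the project's actual preprocessing.

```python
import re

def split_sentences(text):
    # Split after CJK and Latin sentence-ending punctuation, keeping the
    # delimiter attached to its sentence via a lookbehind.
    parts = re.split(r"(?<=[。!?.!?])", text)
    return [p.strip() for p in parts if p.strip()]
```

Feeding the synthesizer one such segment at a time is what lets each chunk finish within roughly its own playback duration.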


@BrandonYU34 BrandonYU34 requested a review from Splend1d June 21, 2025 18:05
Contributor

Splend1d commented Jul 1, 2025

Hi Brandon,

I see, thank you for your clarification. I missed the part where spkinfo is part of the model.

However, regarding speaker caching, I would still require the following fixes:

  1. model-dir -> model_path
  2. default model_path has to be "MediaTek-Research/BreezyVoice". The example in README also has to inherit this default. If you are facing issues, please let us know about the issues. I am able to run your code with this setting.
  3. Please rename xxx_customized to reflect its functional purpose; in your case I think it should be main_cached_inference and frontend_with_cached_speaker_information. Feel free to adjust the naming to your liking.

Would you be able to commit this separately, without the TTS server and UI code? I would like to review the other components individually.

Best,
Jeff

Author

BrandonYU34 commented Jul 1, 2025

Hi Jeff,

Just a quick update — I’ve set the default model_path to "MediaTek-Research/BreezyVoice-300M" and confirmed everything runs smoothly. Apologies for the earlier mix-up; I mistakenly wrote model_dir instead of model_path in my previous message, which might’ve led to some confusion. The command in the latest message should work as intended now.

I’ve also cleaned up some variable and function names that were potentially unclear. Let me know if the new naming fits with the project’s style.

As for the API server and UI parts — I’ve removed them from this PR for now, per your suggestion. Once this is merged, I’ll open a separate PR to submit the cosyvoice API server and Web UI components.

Appreciate your time!
Brandon

