
speech tokenizer training code #9

Open
indiejoseph opened this issue Oct 21, 2024 · 2 comments

Comments

@indiejoseph

This project demonstrates a very good way to tokenize speech with different features, such as style and pitch tokens, which enables downstream applications to have fine-grained control over the generated voice.

I've tested the speech tokenizer on Cantonese, and the output has a very strong accent, probably because the training dataset only contains English. I was wondering how I can train the tokenizer myself. I know Fairseq has HuBERT and HiFi-GAN training recipes, but I'm not sure how to go about the pitch and style features.

@hitchhicker
Contributor

@tuanh208 Could you share some insights for this question? Thanks!

@tuanh208
Contributor

Hi, I think the reason the output has a strong accent in Chinese is that we only trained the HiFi-GAN vocoder on Expresso (which is in English).

For the pitch tokenizer, as mentioned in the paper, we trained a VQ-VAE model on the extracted F0 (you can use any F0 extractor in this repo), following this work: https://github.com/facebookresearch/speech-resynthesis?tab=readme-ov-file#f0-quantizer-model
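To make the F0-quantizer idea concrete, here is a minimal sketch of turning an F0 contour into discrete pitch tokens. It substitutes a simple k-means codebook for the actual VQ-VAE trained in speech-resynthesis, and assumes F0 frames have already been extracted (unvoiced frames marked as 0); the function names are illustrative, not from the repo.

```python
import numpy as np

def fit_f0_codebook(f0_values, n_codes=32, n_iters=20, seed=0):
    """Fit a small 1-D k-means codebook on voiced log-F0 values.

    Illustrative stand-in for the VQ-VAE F0 quantizer; a real setup
    trains a VQ-VAE on F0 contours as in speech-resynthesis.
    """
    rng = np.random.default_rng(seed)
    voiced = f0_values[f0_values > 0]          # drop unvoiced frames (F0 == 0)
    logf0 = np.log(voiced)                     # quantize in log domain
    codebook = rng.choice(logf0, size=n_codes, replace=False)
    for _ in range(n_iters):
        # hard-assign each frame to its nearest code, then update centroids
        assign = np.abs(logf0[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_codes):
            members = logf0[assign == k]
            if len(members):
                codebook[k] = members.mean()
    return np.sort(codebook)

def quantize_f0(f0_values, codebook):
    """Map each frame to a pitch token: 0 for unvoiced frames,
    1 + nearest-codebook-index for voiced frames."""
    tokens = np.zeros(len(f0_values), dtype=int)
    voiced = f0_values > 0
    logf0 = np.log(f0_values[voiced])
    tokens[voiced] = 1 + np.abs(logf0[:, None] - codebook[None, :]).argmin(axis=1)
    return tokens
```

In practice you would replace the k-means step with the VQ-VAE recipe linked above, but the input/output contract (F0 frames in, discrete pitch tokens out) is the same.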

For the style tokenizer, we initially fine-tuned SpeechProp (from this work: https://ai.meta.com/research/publications/sonar-expressive-zero-shot-expressive-speech-to-speech-translation/) to predict styles on the Expresso dataset, and trained a k-means tokenizer on the features extracted from SpeechProp. For this release, however, we distilled a smaller wav2vec 2.0 model to predict the tokens produced by SpeechProp, which turned out to work quite well. So if you want to train a new style tokenizer, I would suggest fine-tuning a good speech encoder (e.g. wav2vec 2.0, WavLM) on some expressive datasets with style labels; that should work well.
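The k-means tokenization step described above can be sketched as follows. This is a hedged illustration, not the released pipeline: it assumes you already have frame-level features of shape `(n_frames, dim)` from a fine-tuned encoder (e.g. wav2vec 2.0 or WavLM hidden states), and the function names are hypothetical.

```python
import numpy as np

def fit_style_tokenizer(features, n_tokens=100, n_iters=25, seed=0):
    """Fit a k-means codebook over encoder features.

    `features` is (n_frames, dim) -- e.g. hidden states from a speech
    encoder fine-tuned on style labels, as suggested in the comment above.
    """
    rng = np.random.default_rng(seed)
    # initialize centroids from random frames (fancy indexing copies)
    centroids = features[rng.choice(len(features), size=n_tokens, replace=False)]
    for _ in range(n_iters):
        # squared Euclidean distance to every centroid, then hard assignment
        d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(n_tokens):
            members = features[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids

def style_tokens(features, centroids):
    """Assign each frame to its nearest centroid, yielding style token ids."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)
```

The distillation step then amounts to training the smaller student model with cross-entropy against these token ids as targets.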
