This project demonstrates a very good way to tokenize speech with different features, such as style and pitch tokens, which enables downstream applications to have fine-grained control over the generated voice.
I've tested the speech tokenizer on Cantonese, and the output has a very strong accent, probably because the training dataset contains only English. I was wondering how I can train the tokenizer myself? I know fairseq has HuBERT and HiFi-GAN training recipes, but I'm not sure how to go about the pitch and style features.
For the style tokenizer, we initially fine-tuned Speechprop (from this work: https://ai.meta.com/research/publications/sonar-expressive-zero-shot-expressive-speech-to-speech-translation/) to predict styles on the Expresso dataset, and trained a k-means tokenizer on the features extracted from Speechprop. For this release, however, we distilled a smaller wav2vec2 model to predict the tokens produced by Speechprop, which turned out to work reasonably well. So if you want to train a new style tokenizer, I would suggest fine-tuning a good speech encoder (e.g. w2v2, WavLM) on expressive datasets with style labels, and it should work well.
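For reference, here is a minimal sketch of the "train a k-means tokenizer on extracted features" step, assuming a Hugging Face wav2vec2 encoder and scikit-learn. Note the checkpoint name is just the public base model as a stand-in for your fine-tuned style encoder, and `corpus` / `new_waveform` are placeholder 16 kHz numpy arrays:

```python
import numpy as np
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Stand-in checkpoint: swap in your encoder fine-tuned on style labels.
ENCODER = "facebook/wav2vec2-base"

extractor = Wav2Vec2FeatureExtractor.from_pretrained(ENCODER)
encoder = Wav2Vec2Model.from_pretrained(ENCODER).eval()

@torch.no_grad()
def extract_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return frame-level hidden states of shape (T, D)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state  # (1, T, D)
    return hidden.squeeze(0).numpy()

# Pool frame features over the expressive corpus (list of 16 kHz arrays).
feats = np.concatenate([extract_features(w) for w in corpus], axis=0)

# Fit the k-means codebook; 100 clusters is an arbitrary choice here.
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10_000, n_init="auto")
kmeans.fit(feats)

# Tokenize a new utterance: each frame maps to its nearest centroid id.
style_tokens = kmeans.predict(extract_features(new_waveform))
```

The same pattern should apply to the distillation variant: once the k-means codebook exists, the centroid ids become the targets for training a smaller wav2vec2 model to predict directly.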