Test time speaker adaptation #6
Hi, good question! We didn't focus much on this, but the exact same TSA algorithm from NANSY can be applied to the speaker conversion model: we can treat the test-time utterance as new training data and fine-tune the model on it for a few steps (optionally freezing some parameters). There are some limitations in transferring the accent when using the pretrained models. I personally believe the causes are:
This should help mitigate (1). However, it will not help (2) if the accent is already encoded in the SSL features. In that case, converting the accent would require converting the source SSL features, which our model (or NANSY) cannot do. In terms of accent transferability, I think ASR-TTS cascade VC systems still outperform SSL-based approaches like our model and NANSY. As one simple piece of evidence, you can listen to the samples of NANSY++ (a recent improved version of NANSY): in the third sample of Zero-Shot Voice Conversion, their model also outputs native-accented English despite the heavily accented target speech. I think current VC systems that utilize SSL features are still limited to converting speaker F0 and timbre, and cannot transfer accent well. An interesting direction would be improving accent conversion by working on the discrete units (e.g., using fewer clusters to force the units to encode less diverse realizations of phonemes, or conditioning duration prediction on a speaker embedding, since accent is closely related to timing).
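For illustration, here is a minimal sketch of the "fewer clusters" idea mentioned above: quantize SSL features with a small k-means codebook so each discrete unit covers a broader range of phoneme realizations. The function names, shapes, and cluster count are illustrative assumptions, not taken from the UUVC codebase.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_small_codebook(ssl_features: np.ndarray, n_clusters: int = 50) -> KMeans:
    """Fit a small codebook on pooled SSL features of shape (num_frames, feature_dim)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(ssl_features)
    return kmeans

def quantize(ssl_features: np.ndarray, kmeans: KMeans) -> np.ndarray:
    """Map each frame to the index of its nearest centroid -> (num_frames,) unit IDs."""
    return kmeans.predict(ssl_features)
```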
Thanks for the clarification. I was more concerned about the identity mismatch than the accent difference. The voice texture of boman and prosenjit is not preserved in the generated audio. I tried other source audios, but the output texture didn't change. I presumed the problem might be in the speaker representation extracted from the target audio, hence the question about TSA. Also, is the voice texture different because the target audio is accented?
Got it. It should be straightforward to apply TSA by fine-tuning the speaker embedding extractor, but it requires some implementation work (basically a training loop inside inference.py). I'll keep that in mind, but I cannot guarantee when I will implement a TSA function for this repo.
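As a rough sketch of what such a loop inside inference.py could look like (assuming a placeholder `model` with a `speaker_encoder` module and a `reconstruct` forward call, which are not the actual UUVC interfaces):

```python
import torch

def test_time_adapt(model, target_mels, steps=50, lr=1e-4):
    # Freeze everything except the speaker embedding extractor.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.speaker_encoder.parameters():
        p.requires_grad_(True)

    opt = torch.optim.Adam(model.speaker_encoder.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for mel in target_mels:                    # a handful of target-speaker utterances
            pred = model.reconstruct(mel)          # placeholder forward pass
            loss = torch.nn.functional.l1_loss(pred, mel)
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    return model
```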
Thanks for the confirmation. I can send a PR with TSA support. I am facing an issue: when I try to use the predicted Mels from the code in the inference function, they have two extra frames (1, 863, 80) compared to the Mels extracted using
During training, you can pass the target Mel-spectrogram length to the model. The slight mismatch arises because I use a simple scale of 1.73 to calculate the expected Mel-spectrogram length, without accounting for the paddings/truncations during inference.
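A minimal sketch of why passing the real target length avoids the mismatch: the exact number of Mel frames follows from the waveform length and the STFT hop size, whereas the 1.73 scale only approximates that ratio. The hop size of 256 below is an assumed example value, not necessarily UUVC's config.

```python
def exact_mel_length(num_samples: int, hop_length: int = 256) -> int:
    # Typical frame count for a center-padded STFT.
    return num_samples // hop_length + 1

def approx_mel_length(num_source_frames: int, scale: float = 1.73) -> int:
    # The scale-based estimate can differ from the exact count by a frame or two.
    return int(num_source_frames * scale)
```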
Directly optimizing for L1 loss using the code in inference.py (with the mel length fix) results in further deterioration. I can see that the forward passes in training and inference differ. Is there anything I should carry over from there?
It's hard to judge without the code, but I can think of some pitfalls. Only certain modules should be trainable. Also, you need to feed in the ground-truth energy and pitch instead of the predicted ones for training; you can see this in the training code. We also use an adversarial loss during training. The checkpoint contains all the parameters needed to recover these.
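To illustrate the "feed ground truth, not predictions" point, a hedged sketch of extracting frame-level energy and pitch from the target audio itself; these generic extractors are stand-ins, and UUVC's actual feature extraction may differ.

```python
import torch
import torchaudio

def frame_energy(mel: torch.Tensor) -> torch.Tensor:
    """Frame-wise energy as the L2 norm over Mel bins; mel has shape (T, n_mels)."""
    return mel.norm(dim=-1)

def frame_pitch(wav: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Coarse per-frame F0 estimate (placeholder for the repo's own pitch extractor)."""
    return torchaudio.functional.detect_pitch_frequency(wav, sample_rate)

# During adaptation, these ground-truth features would replace the model's own
# pitch/energy predictions when reconstructing the target Mel-spectrogram.
```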
Actually, I kept the speaker encoding tensor trainable (initialized to
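A minimal sketch of adapting only a speaker-embedding tensor in this way: start from some initial embedding, mark it trainable, and optimize it directly while the rest of the model stays frozen. The embedding dimension and learning rate are illustrative assumptions.

```python
import torch

init_emb = torch.zeros(1, 256)                    # e.g. the extractor's output for the target speaker
spk_emb = torch.nn.Parameter(init_emb.clone())    # trainable copy of the speaker embedding
opt = torch.optim.Adam([spk_emb], lr=1e-4)        # only this tensor receives gradient updates
```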
I see. Hmm, I'm not sure that unplugging the adversarial loss would have that big an effect on performance.
I agree with both. Sharing the code. Let me know if you want to look at the audio samples as well.
I think the learning rate is too high. Can you try
It is indeed high. I deliberately increased it during initial experiments and forgot to revert it. Yes, the loss is going down. Tried
Also, why do we need https://github.com/b04901014/UUVC/blob/master/inference_exact_pitch.py#L162? I had to comment it out to make the shapes match.
You are right; it should be redundant in this context, since for speaker conversion we are not changing the duration and can use the source Mel-spectrogram length for the output. So we don't need to estimate the output length with the scaling factor. The original code on its own should also work fine. I think you get the shape error because you directly send the source Mel-spectrogram shape to the model, as I suggested earlier for TSA.
I used loudness normalization at the
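For reference, a minimal sketch of input loudness normalization with pyloudnorm, assuming that is the kind of preprocessing meant here; the -23 LUFS target and file name are arbitrary example values.

```python
import soundfile as sf
import pyloudnorm as pyln

wav, sr = sf.read("target_speaker.wav")            # load the target-speaker audio
meter = pyln.Meter(sr)                             # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(wav)          # measure integrated loudness
wav_norm = pyln.normalize.loudness(wav, loudness, -23.0)  # normalize to -23 LUFS
```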
So my current code has the following changes:
Now the identity match is better, but it has the following problems (new_samples.zip):
Proposal:
Do you have any more thoughts on getting this to work?
The identity really does seem to get better! But there is some weird noise, which should originally be caught by the adversarial loss. What you are doing sounds good to me. One additional suggestion may be to randomly batch the 5 minutes of speech to add a stochastic property to gradient descent, which may help the adversarial training. If you want to keep some layers trainable, maybe we can start by making all the parameters trainable and see whether that improves things or not.
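A hedged sketch of that random-batching suggestion: split the ~5 minutes of adaptation speech into short segments and sample a different mini-batch at every step, so the gradients (and the adversarial training) see some stochasticity. The segment length and batch size are arbitrary example choices.

```python
import random
import torch

def make_segments(mel: torch.Tensor, seg_len: int = 200) -> list:
    """Chop a long Mel-spectrogram of shape (T, n_mels) into fixed-length segments."""
    return [mel[i:i + seg_len] for i in range(0, mel.shape[0] - seg_len + 1, seg_len)]

def sample_batch(segments: list, batch_size: int = 8) -> torch.Tensor:
    """Randomly sample a mini-batch of segments for one adaptation step."""
    batch = random.sample(segments, k=min(batch_size, len(segments)))
    return torch.stack(batch)
```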
This was an awesome thread to read through. Any chance you could share some of this code as a PR or a fork @vishalbhavani?
I tried the pretrained model and the one-shot VC results are not good. Is there a way to do TSA, just like NANSY, on a few examples of the speakers to get a better speaker identity representation?