Question regarding data format and loss calculation in stage 1 #101
Only the audio token loss is calculated when training TTS.
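A minimal sketch of this masking, assuming a standard cross-entropy loss with `ignore_index=-100` in the Hugging Face style (tensor names, shapes, and the boolean mask are illustrative, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def tts_loss(logits, labels, is_audio_token):
    # logits: (B, T, V) model outputs; labels: (B, T) target token ids
    # is_audio_token: (B, T) bool mask, True where the target is an audio token
    masked = labels.masked_fill(~is_audio_token, -100)  # ignore text positions
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        masked.view(-1),
        ignore_index=-100,  # positions set to -100 contribute no loss/gradient
    )
```

For ASR the mask would simply be inverted, so only text-token positions are supervised.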
@superFilicos
We train these two tasks separately. I think you can also train them at the same time, since they train different modules of the model.
@superFilicos May I ask how much data and how many speakers were used to train the TTS adapter? When I worked on the TTS adapter task, I also only considered the audio token loss.
Hello. Our training output has only a single voice, and the audio data was synthesized with an internal industrial-grade model, so it is certainly more stable. I suspect your data has a relatively high bad-case rate. We have not tried Chinese.
Hi @superFilicos, thanks for the answers. Could the following questions be clarified further in the paper, so that people reproducing the work don't have to keep asking in the issues:
@superFilicos Still confused. What I want to ask is the TTS sample format used for training: is the GT text token used as input, or is it filled with text pad tokens?
In stage 1, only ASR and TTS are used.
ASR is Audio -> Text, so the loss is calculated only for text tokens, not for audio tokens, right?
TTS is Text -> Audio, but mini-omni outputs text and audio simultaneously, so I'm not sure how to format the input data for TTS.
Input: Text
Output: Text tokens + Audio tokens
When training TTS, are both text tokens and audio tokens fed into the LM, with the loss calculated only for the audio tokens? Or does TTS not use text at all (only pad tokens)?
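To make the two format options in the question concrete, here is a hedged sketch of one plausible TTS sample layout: the text stream carries the prompt and is then filled with pad tokens, while the parallel audio stream carries the target audio tokens. All names and pad ids (`TEXT_PAD`, `AUDIO_PAD`, `build_tts_sample`) are hypothetical placeholders, not mini-omni's actual implementation:

```python
TEXT_PAD = 0   # hypothetical pad id for the text stream
AUDIO_PAD = 0  # hypothetical pad id for the audio stream

def build_tts_sample(text_ids, audio_ids):
    # Text stream: the input prompt, then pads aligned with the audio tokens.
    text_stream = list(text_ids) + [TEXT_PAD] * len(audio_ids)
    # Audio stream: pads during the prompt, then the target audio tokens.
    audio_stream = [AUDIO_PAD] * len(text_ids) + list(audio_ids)
    # Loss mask: supervise only the audio-token positions.
    loss_mask = [False] * len(text_ids) + [True] * len(audio_ids)
    return text_stream, audio_stream, loss_mask
```

Under this layout, "only audio token loss" simply means the loss mask is `True` only where the audio stream holds real target tokens.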