Question regarding data format and loss calculation in stage 1 #101

Open
sphmel opened this issue Oct 16, 2024 · 7 comments

Comments

@sphmel

sphmel commented Oct 16, 2024

In stage 1, only ASR and TTS are used.

ASR is Audio -> Text, so the loss is calculated only for text tokens, not for audio tokens, right?

TTS is Text -> Audio, but mini-omni outputs text and audio simultaneously. I'm not sure how to format input data for TTS.

Input: Text
Output: Text with token + Audio token

When training TTS, are both text tokens and audio tokens fed into the LM, with the loss calculated only for audio tokens? Or does TTS not use text at all (only pad tokens)?

@superFilicos

Only the audio-token loss is calculated when training TTS.
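
The reply above can be read as a masked cross-entropy: every position has a label, but positions that should not be supervised carry an ignore marker and are left out of the average. The following is a minimal pure-Python sketch of that idea, not mini-omni's training code. `IGNORE`, `log_softmax`, `masked_cross_entropy`, and the toy logits are hypothetical names and values chosen for illustration.

```python
import math

IGNORE = -100  # hypothetical marker for positions excluded from the loss


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(logits_seq, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE."""
    total, count = 0.0, 0
    for logits, label in zip(logits_seq, labels):
        if label == IGNORE:
            continue  # e.g. text positions during TTS training
        total += -log_softmax(logits)[label]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, sequence of 3 positions.
logits_seq = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
audio_labels = [0, 1, 2]                # audio stream: supervised everywhere
text_labels = [IGNORE, IGNORE, IGNORE]  # text stream: excluded from the loss

audio_loss = masked_cross_entropy(logits_seq, audio_labels)
text_loss = masked_cross_entropy(logits_seq, text_labels)
tts_loss = audio_loss + text_loss  # only the audio term contributes
```

If ASR follows the pattern sphmel describes (not confirmed in this thread), the roles would swap: text labels would be supervised and audio labels set to `IGNORE`.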

@sphmel
Author

sphmel commented Oct 16, 2024

@superFilicos
Are text tokens fed into the transformer in TTS training? And are audio tokens fed into the transformer in ASR training?

@superFilicos

We train these two tasks separately. I think you can also train them at the same time, since they train different modules of the model.

@GuangChen2016

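@superFilicos May I ask how much data was used to train the TTS adapter, and how many speakers it covered? When I worked on the TTS adapter task, I also considered only the audio token loss.
I trained on multi-speaker data and added a speaker embedding (spk emb) as input to the LLM, but the results show a lot of repetition and dropped words. My experiments were on Chinese. Did you run into these problems as well?
Thanks!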

@superFilicos

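Hello, our training output has only one voice, and the audio data was synthesized with an internal industrial-grade model, so it is certainly more stable. I suspect the bad-case rate in your data is relatively high. We have not tried Chinese.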

@vra

vra commented Oct 18, 2024

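Hi @superFilicos, thanks for the answers. Could the following points be explained further in the paper, so that people reproducing the work do not have to keep raising questions in the issues?

  1. For each of the 3 training stages, how are the input audio tokens and text tokens obtained, and are they used in the loss calculation?
  2. Which losses are computed in each of the 3 training stages, and how are the gt and pred for each loss obtained?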

@sphmel
Author

sphmel commented Oct 22, 2024

@superFilicos I'm still confused. What I want to ask about is the TTS sample format used for training: are GT text tokens used as input, or is the text stream filled with text pad tokens?

<audio1> <audio2> ... <audio n>
<text1> <text2> ... <text-pad>

or

<audio1> <audio2> ... <audio n>
 <text-pad>  <text-pad> ... <text-pad>

?
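
For readers following the thread, the two candidate layouts in the question can be written out with a small helper. This is only a sketch of the question itself: the thread does not say which layout mini-omni uses. `build_tts_text_stream`, `use_gt_text`, and the string tokens are hypothetical and chosen for illustration.

```python
TEXT_PAD = "<text-pad>"  # placeholder token name taken from the question


def build_tts_text_stream(text_tokens, num_audio_frames, use_gt_text):
    """Return a text stream aligned to the length of the audio stream.

    use_gt_text=True  -> GT text tokens followed by padding (first layout)
    use_gt_text=False -> padding only (second layout)
    """
    if not use_gt_text:
        return [TEXT_PAD] * num_audio_frames
    if len(text_tokens) > num_audio_frames:
        raise ValueError("text is longer than the audio sequence")
    padding = [TEXT_PAD] * (num_audio_frames - len(text_tokens))
    return list(text_tokens) + padding


audio = ["<audio1>", "<audio2>", "<audio3>", "<audio4>"]
text = ["<text1>", "<text2>"]

layout_a = build_tts_text_stream(text, len(audio), use_gt_text=True)
layout_b = build_tts_text_stream(text, len(audio), use_gt_text=False)
```

In both layouts the text stream has the same length as the audio stream; they differ only in whether the model sees the GT text alongside the audio tokens.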
