
Clean the original dataset collected from different sources: YouTube, podcasts, and audiobooks. #132

Open
kerolos opened this issue Aug 25, 2023 · 1 comment

Comments

@kerolos

kerolos commented Aug 25, 2023

I would like to know, if possible, what procedures were used to filter the original dataset, for example the data collected from YouTube.
Is there any script you would recommend for filtering and cleanup?

I have used the Kaldi cleanup scripts in /egs/wsj/s5/steps/cleanup/:
A) GMM (clean_and_segment_data.sh - find_bad_utts.sh). "Did not work perfectly for me, especially when there are systematic errors in the dataset."
B) NNET (clean_and_segment_data_nnet3.sh - find_bad_utts_nnet3.sh). "It depends on the pretrained model, which is not good in my case."
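For context, a typical invocation of the GMM-based cleanup from a standard Kaldi egs/ recipe looks roughly like the sketch below. The directory names (`data/train`, `exp/tri3`, and the output paths) are placeholders from a generic recipe, not from the GigaSpeech pipeline:

```shell
#!/usr/bin/env bash
# Sketch of the GMM-based cleanup flow in a standard Kaldi egs/ recipe.
# Assumes path.sh/cmd.sh and a trained GMM system (here exp/tri3) exist;
# all directory names below are placeholders.

. ./path.sh
. ./cmd.sh

# Decode the training data with a biased LM and re-segment it,
# trimming or dropping utterances whose audio does not match the transcript.
steps/cleanup/clean_and_segment_data.sh \
  --cmd "$train_cmd" --nj 40 \
  data/train data/lang exp/tri3 \
  exp/tri3_cleanup data/train_cleaned

# Afterwards one would typically retrain on the cleaned data, e.g.:
# steps/train_sat.sh ... data/train_cleaned data/lang <ali-dir> exp/tri4
```

This is essentially a command fragment: it only runs inside a Kaldi recipe directory with a trained model, so treat the option values (`--nj 40`, the exp/ paths) as illustrative defaults rather than recommendations.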

You mentioned the steps in the paper, Section 3 (GigaSpeech creation pipeline), parts 3.2, 3.3, and 3.4. But I would like to know whether you used scripts different from Kaldi's, or what was modified in the original Kaldi "cleanup" scripts.

Thanks in advance, I really appreciate any support.

@dophist
Collaborator

dophist commented Sep 17, 2023

The pipeline was developed on top of the existing Kaldi scripts you mentioned above, but with many bug fixes and ad-hoc modifications. However, we have no near-term plan to open-source these tools, as it would require non-trivial effort to clean up and generalize the code.
