
Clean the original dataset collected from different sources: YouTube, podcasts, and audiobooks. #132

Open
kerolos opened this issue Aug 25, 2023 · 1 comment

Comments

@kerolos

kerolos commented Aug 25, 2023

I would like to know, if possible, what procedures were used to filter the original dataset, for example the data collected from YouTube.
Is there any script you would recommend for filtering and cleanup?

I have used the Kaldi cleanup scripts in /egs/wsj/s5/steps/cleanup/:
A) GMM (clean_and_segment_data.sh - find_bad_utts.sh). "Did not work perfectly for me, especially when there are systematic errors in the dataset."
B) NNET (clean_and_segment_data_nnet3.sh - find_bad_utts_nnet3.sh). "It depends on the pretrained model, which is not good in my case."
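For context, a typical invocation of the GMM-based cleanup from a standard Kaldi egs/ recipe looks roughly like the sketch below. The directory names (`data/train`, `exp/tri3`, and the output paths) are placeholders from a generic recipe, not from the GigaSpeech pipeline:

```shell
#!/usr/bin/env bash
# Sketch of the GMM-based cleanup flow in a standard Kaldi egs/ recipe.
# Assumes path.sh/cmd.sh and a trained GMM system (here exp/tri3) exist;
# all directory names below are placeholders.

. ./path.sh
. ./cmd.sh

# Decode the training data with a biased LM and re-segment it,
# trimming or dropping utterances whose audio does not match the transcript.
steps/cleanup/clean_and_segment_data.sh \
  --cmd "$train_cmd" --nj 40 \
  data/train data/lang exp/tri3 \
  exp/tri3_cleanup data/train_cleaned

# Afterwards one would typically retrain on the cleaned data, e.g.:
# steps/train_sat.sh ... data/train_cleaned data/lang <ali-dir> exp/tri4
```

This is essentially a command fragment: it only runs inside a Kaldi recipe directory with a trained model, so treat the option values (`--nj 40`, the exp/ paths) as illustrative defaults rather than recommendations.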

You mentioned the steps in the paper, Section 3 (GigaSpeech creation pipeline), parts 3.2, 3.3, and 3.4. But I would like to know whether you used scripts different from Kaldi's, or what was modified in the original Kaldi "cleanup" scripts.

Thanks in advance, I really appreciate any support.

@dophist
Collaborator

dophist commented Sep 17, 2023

The pipeline was developed on top of the existing Kaldi scripts you mentioned above, but with many bug fixes and ad-hoc modifications. However, we have no near-term plan to open-source these tools, as it would require non-trivial effort to clean up and generalize the code.
