Is XL subset the 33000hr unlabeled data? #114

mct10 · 2022-05-23T12:33:34Z

Hi,
As mentioned in the README, GigaSpeech contains "33,000+ hours for unsupervised/semi-supervised learning". I am trying to use this unlabeled data and have already downloaded the XL subset. However, when I summed the duration of each audio entry in GigaSpeech.json, the total was only around 25,000 hours.
So my question is: is the entire XL subset the 33,000-hour data, or are additional steps needed to retrieve the full 33,000 hours?
Many thanks!


dophist · 2022-06-07T13:30:00Z

There are 33,000+ hours of audio files in total under the GigaSpeech directory.
GigaSpeech.json contains 10,000 hours of audio segments with transcriptions for supervised training.
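
For anyone who wants to check these figures against their own download, here is a minimal sketch that totals hours two ways: per audio file and per transcribed segment. It assumes GigaSpeech.json has a top-level `audios` list whose entries carry a `duration` in seconds and a `segments` list with `begin_time` and `end_time` in seconds. Those field names are an assumption, so verify them against your copy of the file. `summarize_hours` and `summarize_file` are hypothetical helper names, not part of the GigaSpeech toolkit.

```python
import json


def summarize_hours(meta):
    """Return (audio_hours, segment_hours) for a GigaSpeech-style metadata dict.

    Assumed layout (check against your own GigaSpeech.json):
      meta["audios"][i]["duration"]                 -> audio length in seconds
      meta["audios"][i]["segments"][j]["begin_time"] -> segment start in seconds
      meta["audios"][i]["segments"][j]["end_time"]   -> segment end in seconds
    """
    audio_seconds = 0.0
    segment_seconds = 0.0
    for audio in meta.get("audios", []):
        # Whole-file duration, transcribed or not.
        audio_seconds += float(audio.get("duration", 0.0))
        # Only the spans that have a transcribed segment.
        for seg in audio.get("segments", []):
            segment_seconds += float(seg["end_time"]) - float(seg["begin_time"])
    return audio_seconds / 3600.0, segment_seconds / 3600.0


def summarize_file(path):
    """Load a metadata JSON file and summarize it (loads the whole file into memory)."""
    with open(path, encoding="utf-8") as f:
        return summarize_hours(json.load(f))
```

Calling `summarize_file("GigaSpeech.json")` returns the two totals side by side. Summing `duration` counts each listed audio file in full, while summing segments counts only the transcribed spans, so the two numbers are not expected to match.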

dophist added the "documentation" label on Jun 20, 2022
