gigaspeech.json has no entries for audio/podcast/P0081-P0084 #126

Open
wwfcnu opened this issue Feb 27, 2023 · 15 comments

Comments

@wwfcnu

wwfcnu commented Feb 27, 2023

gigaspeech.json does not contain the four files audio/podcast/P0081-P0084, although files.yaml does list them. After the download finishes, these four files are still missing.

@dophist
Collaborator

dophist commented Feb 27, 2023

We provide tools (utils/check_audio_md5.sh & utils/check_metadata_md5.sh) to check the integrity of your local copy.
Refer to #115 if your current download is incomplete or interrupted.
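For readers who want to see what such an integrity check amounts to, here is a minimal Python sketch. It is not the logic of utils/check_audio_md5.sh or utils/check_metadata_md5.sh, whose internals are not shown in this thread; the function names `file_md5` and `find_mismatches` and the `{relative_path: md5}` mapping are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root):
    """Compare files under root against a {relative_path: md5} mapping.

    The mapping format is hypothetical; build it from whatever checksum
    list you trust. Returns {relative_path: "missing" | "corrupted"}.
    """
    problems = {}
    for rel_path, want in expected.items():
        target = Path(root) / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif file_md5(target) != want.lower():
            problems[rel_path] = "corrupted"
    return problems
```

Separating "missing" from "corrupted" matters for a report like this one: a missing file points at the download or the file list, while a corrupted one points at an interrupted transfer.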

@wwfcnu
Author

wwfcnu commented Feb 27, 2023

> We provide tools (utils/check_audio_md5.sh & utils/check_metadata_md5.sh) to check the integrity of your local copy. Refer to #115 if your current download is incomplete or interrupted.

I re-ran the command, and those four files are still missing.

@chenguoguo
Collaborator

Which downloading host was it (you should be able to see it from the logs)? I got another person asking about a similar issue. @wwfcnu

@wwfcnu
Author

wwfcnu commented Feb 28, 2023

> Which downloading host was it (you should be able to see it from the logs)? I got another person asking about a similar issue. @wwfcnu

magicdata host

@chenguoguo
Collaborator

When you run utils/download_gigaspeech.sh, could you provide the host parameter, something like utils/download_gigaspeech.sh --host speechocean, and see whether that downloads the missing files? @wwfcnu

@wwfcnu
Author

wwfcnu commented Feb 28, 2023

> When you run the command utils/download_gigaspeech.sh, could you provide the host parameter, something like utils/download_gigaspeech.sh --host speechocean and see if that will be able to download the missing files? @wwfcnu

Specifying speechocean as the host raises an error:

bash utils/download_gigaspeech.sh --host speechocean
/mnt/data/asr_datasets/Gigaspeech/
utils/download_gigaspeech.sh: Downloading with PySpeechColab...
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/root/anaconda3/lib/python3.9/site-packages/speechcolab/datasets/gigaspeech.py", line 65, in download
    download(access_term_path, f'{self.gigaspeech_release_url}/TERMS_OF_ACCESS')
  File "/root/anaconda3/lib/python3.9/site-packages/speechcolab/utils/download.py", line 47, in download
    return download_from_ftp(local_filename, url_info.hostname, url_info.path, url_info.username,
  File "/root/anaconda3/lib/python3.9/site-packages/speechcolab/utils/download.py", line 33, in download_from_ftp
    ftp.login(username, password)
  File "/root/anaconda3/lib/python3.9/ftplib.py", line 414, in login
    resp = self.sendcmd('PASS ' + passwd)
  File "/root/anaconda3/lib/python3.9/ftplib.py", line 281, in sendcmd
    return self.getresp()
  File "/root/anaconda3/lib/python3.9/ftplib.py", line 254, in getresp
    raise error_perm(resp)
ftplib.error_perm: 500 OOPS: cannot change directory:/home/vsftp/GigaSpeech_01

@wwfcnu
Author

wwfcnu commented Feb 28, 2023

In principle, the gigaspeech.json metadata file should contain information for the four files audio/podcast/P0081-P0084. I also verified the JSON file with MD5, and it checks out.
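A check like the one described here can be scripted instead of done by eye, which also avoids confusing one source category with another. The sketch below is a rough illustration, assuming the metadata has a top-level "audios" list whose entries carry a "path" string such as audio/podcast/P0000/...; that layout is an assumption, not something confirmed in this thread, and the helper names are hypothetical.

```python
import json


def load_metadata(path):
    """Load the metadata JSON from disk (the file may be large)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def audio_paths_with_prefix(metadata, prefix):
    """Return the "path" of every audio entry that starts with prefix.

    Assumes metadata["audios"] is a list of dicts with a "path" key.
    """
    return [
        audio["path"]
        for audio in metadata.get("audios", [])
        if audio.get("path", "").startswith(prefix)
    ]


def missing_prefixes(metadata, prefixes):
    """Return the prefixes that no audio entry in the metadata matches."""
    return [p for p in prefixes if not audio_paths_with_prefix(metadata, p)]


# Example usage (paths are illustrative):
#   metadata = load_metadata("GigaSpeech.json")
#   wanted = [f"audio/podcast/P00{n}/" for n in range(81, 85)]
#   print(missing_prefixes(metadata, wanted))
```

Matching on the full prefix, including the category (podcast, youtube), keeps a P0081 entry under one category from being mistaken for the same number under another.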

@chenguoguo
Collaborator

There could be issues with the MagicData server. I'm downloading from tsinghua to see whether we hit the same issue. In the meantime, could you try bash utils/download_gigaspeech.sh --host tsinghua YOUR_GIGASPEECH_FOLDER and see if that fixes the issue? @wwfcnu

@dophist
Collaborator

dophist commented Feb 28, 2023

> In principle, the gigaspeech.json metadata file should contain information for the four files audio/podcast/P0081-P0084. I also verified the JSON file with MD5, and it checks out.

Please provide more info, such as the MD5 of your local gigaspeech.json.

@wwfcnu
Author

wwfcnu commented Feb 28, 2023

> > In principle, the gigaspeech.json metadata file should contain information for the four files audio/podcast/P0081-P0084. I also verified the JSON file with MD5, and it checks out.
>
> please provide more info such as the MD5 of your local gigaspeech.json

19c777dc296ff3eb714bc677a80620a3 GigaSpeech.json
380d7b6d180662d980129a630fe3f75b GigaSpeech.json.gz.aes

@chenguoguo
Collaborator

I downloaded my GigaSpeech.json from tsinghua. It has the same MD5 as yours, and it has sections for P0081; see the screenshot below:

image

@dophist
Collaborator

dophist commented Feb 28, 2023

And I just confirmed that the resources on the MagicData host are fine. You should always be able to re-run the download script to fix a corrupted download session, and remember to use the provided tools to verify the correctness of your local copy.

@wwfcnu
Author

wwfcnu commented Feb 28, 2023

> I downloaded my GigaSpeech.json from tsinghua, it has the same MD5 as yours and it has sections for P0081, see the screenshot below:
>
> image

That is the youtube section; what I'm missing is under podcast.

@chenguoguo
Collaborator

You are right, I was looking at the wrong category. It's possible that we removed all the segments of those few files from a certain version of the metadata due to quality issues, but still kept the files themselves because we wanted to keep the raw data as well. @wgb14 @chaisz19 do you still remember the details?

@chaisz19
Collaborator

chaisz19 commented Mar 4, 2023

The audio in podcast P0081-P0084 belongs to RADIO. The original RADIO transcripts had problems during processing, which left some text missing.
