Unexpected EOF Errors During Actinopterygii Genomes Download in RefSeq #294

Open
mkrg01 opened this issue Dec 3, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@mkrg01

mkrg01 commented Dec 3, 2023

Background:

Encountered multiple unexpected EOF errors while attempting to download the RefSeq genomes of Actinopterygii (taxon id: 7898) using datasets version 15.30.0.

Steps to Reproduce:

  1. Initial Download Command:

    datasets download genome taxon 7898 --dehydrated --reference --annotated --include gbff --assembly-source RefSeq --filename data/raw_data/Actinopterygii_dataset.zip
  2. Unzipping the Package:

    unzip data/raw_data/Actinopterygii_dataset.zip -d data/raw_data/Actinopterygii_dataset
  3. Rehydration Process (Error Occurs Here):

    datasets rehydrate --directory data/raw_data/Actinopterygii_dataset/ --no-progressbar

Observed Error Messages:

During the rehydration step, the process repeatedly fails with unexpected EOF errors. The error log is as follows:

Collecting 177 genome records [------------------------------------------------]   0% 0/177
Downloading: data/raw_data/Actinopterygii_dataset.zip    167kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
Error: 
unexpected EOF
[repeated multiple times]

Use datasets rehydrate <command> --help for detailed help about a command.

I would greatly appreciate your assistance in addressing this matter.

@ericcox1
Collaborator

ericcox1 commented Dec 4, 2023

Hi @mkrg01,

Thanks for opening this issue.

We were able to reproduce this bug and we are looking into a fix.

In the meantime, try adding the --gzip flag to your rehydrate command to bypass the error, like this:
datasets rehydrate --directory Actinopterygii_dataset/ --gzip

The downloaded data files will be gzip compressed on your computer.

I'll post a comment on this thread when the bug has been fixed.

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI

@mkrg01
Author

mkrg01 commented Dec 6, 2023

Hello @ericcox1,

Thank you for working on this issue.

I tried running it again with the --gzip flag, but I encountered the same error.

In any case, I am looking forward to the bug fix. Thank you.

@ericcox1
Collaborator

ericcox1 commented Dec 8, 2023

Hi @mkrg01,

Although I was able to reproduce the bug the first day that you reported it, I have tried several times and have not been able to reproduce it since. Would you mind checking if you still see the bug on your end?

Best,
Eric

@mkrg01
Author

mkrg01 commented Dec 10, 2023

Hi @ericcox1,

I tried downloading the datasets yesterday, but I still saw the same bug...

The scripts are as follows:

datasets download genome taxon 7898 --dehydrated --reference --annotated --include gbff --assembly-source RefSeq --filename data/raw_data/Actinopterygii_dataset.zip

unzip data/raw_data/Actinopterygii_dataset.zip -d data/raw_data/Actinopterygii_dataset

datasets rehydrate --directory data/raw_data/Actinopterygii_dataset/ --gzip --no-progressbar

I used a docker image (docker://aurelia01/deep_adapt_ncbi:v2) when running the command this time (and also the first time I posted on this issue). The datasets version is 15.30.0.

@ericcox1
Collaborator

Thanks for the update @mkrg01, we will continue looking into it. I'll comment on this thread with updates.

@alpole23

alpole23 commented Dec 18, 2023

Hello developers, I just wanted to add that this is not an isolated issue. I have been experiencing the same bug when downloading large datasets.

datasets download genome taxon "Enterobacteriales" --include gbff --assembly-source GenBank --filename Enterobacteriales.zip --exclude-atypical --dehydrated

unzip Enterobacteriales.zip -d ~/multismash/ncbi_datasets/Enterobacteriales

datasets rehydrate --directory ~/multismash/ncbi_datasets/Enterobacteriales/ --gzip

[Image: Capture]

I am running datasets version 15.33.0.

In the meantime, while you look for a fix for this issue, is there another way that I can download this dataset? Our research project depends on it.

This issue does not occur when downloading smaller datasets (for example, all of "Erwiniaceae"), only with large ones.

@ericcox1
Collaborator

Hi @alpole23,

Thanks for your comment on this issue. We are aiming to release a fix for this issue later today.

Best,
Eric

@ericcox1
Collaborator

Hi @mkrg01 and @alpole23,

We released a fix for this issue last night. You'll need to update to the latest version of the command line tools, 15.34.0.

Best,
Eric
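
For readers checking whether their installation meets the minimum version named above, here is a minimal sketch of a version comparison in shell. The helper name `version_ge` is hypothetical, and the sketch assumes `sort -V` (version sort, a GNU coreutils extension) is available. The installed version is passed in by hand because the exact output format of the tool's version command is not shown in this thread.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available on this system.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Example: compare an installed version (typed in by hand here)
# against the minimum version mentioned in the comment above.
if version_ge "15.33.0" "15.34.0"; then
  echo "installed version meets the minimum"
else
  echo "please update to 15.34.0 or later"
fi
```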

@alpole23

alpole23 commented Dec 19, 2023

Wonderful, thank you!

I tested the fix with the Yersiniaceae family and received another unexpected EOF error message.

datasets download genome taxon "Yersiniaceae" --include gbff --assembly-source GenBank --filename 9_Enterobacterales_Yersiniaceae.zip --exclude-atypical --dehydrated

unzip 9_Enterobacterales_Yersiniaceae.zip -d ~/multismash/ncbi_datasets/9_Enterobacterales_Yersiniaceae

datasets rehydrate --gzip --directory ~/multismash/ncbi_datasets/9_Enterobacterales_Yersiniaceae

[Image: Capture]

It does appear that all of the files rehydrated except for the three "EOF" error files. So, I assume it has something to do with a flaw in the GenBank entry? What exactly does "unexpected EOF" mean?

P.S. the files do exist when I check for them via the FTP site... so I was able to manually download the ones with the "unexpected EOF" error.

[Image: Capture]
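
For anyone who wants to locate a failed assembly on the FTP site by hand, the sketch below derives the parent directory from an assembly accession. It assumes the usual layout of the NCBI genomes site, where the nine digits of the accession are split into three path components. The helper name `accession_to_dir` is hypothetical, and the accession shown is only an example. The final subdirectory also contains the assembly name, which cannot be derived from the accession alone, so the sketch stops one level above it.

```shell
# Print the directory on the NCBI genomes site that holds all versions
# of an assembly accession such as GCF_002021735.2.
# Assumption: the site layout is genomes/all/<prefix>/<3 digits>/<3 digits>/<3 digits>/.
accession_to_dir() {
  acc=$1
  prefix=${acc%%_*}      # "GCF" or "GCA"
  digits=${acc#*_}       # e.g. "002021735.2"
  digits=${digits%%.*}   # e.g. "002021735"
  p1=$(printf '%s' "$digits" | cut -c1-3)
  p2=$(printf '%s' "$digits" | cut -c4-6)
  p3=$(printf '%s' "$digits" | cut -c7-9)
  printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/\n' \
    "$prefix" "$p1" "$p2" "$p3"
}

accession_to_dir GCF_002021735.2
```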

@ericcox1 ericcox1 reopened this Dec 19, 2023
@ericcox1
Collaborator

Thanks @alpole23, I appreciate the detailed report. Your help debugging this is much appreciated and if you don't mind I have some more questions for you.

  1. Could you check whether invalid files were created for the 3 failed downloads?
  2. Could you please rerun the command and see if you are able to download those 3 files on the second attempt?

Thanks!

@alpole23

I appreciate your time and help in working through this issue.
Oddly enough, in the first attempt, the files were actually created despite the EOF error.
However, everything worked perfectly on the second attempt, with no EOF errors.

[Image: Capture]

Thanks!
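
Since a second attempt succeeded here, rerunning can be scripted. Below is a minimal sketch of a generic retry wrapper in shell. The function name `retry` and the `RETRY_DELAY` variable are hypothetical, and the sketch assumes the wrapped command exits with a non-zero status when it fails, which this thread does not confirm for `datasets rehydrate`.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry <max_attempts> <command> [args...]
# RETRY_DELAY (seconds between attempts) defaults to 5.
retry() {
  max_attempts=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example, using the command from earlier in this thread:
# retry 3 datasets rehydrate --gzip --directory ~/multismash/ncbi_datasets/9_Enterobacterales_Yersiniaceae
```

Whether a rerun skips files that were already downloaded correctly is not stated in this thread, so each attempt may repeat work.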

@mkrg01
Author

mkrg01 commented Dec 20, 2023

Thank you for the release of the new version, @ericcox1.

I attempted to download genomes using datasets version 15.34.0 but encountered errors similar to those previously mentioned.

In fact, my use case was a bit different. I was trying to download genomes specifically for Teleostei (taxid: 32443), which is a subclade of Actinopterygii. The command I used is below:

datasets download genome taxon 32443 --dehydrated --reference --annotated --include gbff --assembly-source RefSeq --filename data/raw_data/Teleostei_dataset.zip
unzip data/raw_data/Teleostei_dataset.zip -d data/raw_data/Teleostei_dataset
datasets rehydrate --directory data/raw_data/Teleostei_dataset/ --gzip --no-progressbar

The error messages I received are as follows (only a part is displayed):

[Image: Screenshot 2023-12-20 at 23 47 40]

I confirmed that these files were created. However, their sizes seem too small (e.g., data/raw_data/Teleostei_dataset/ncbi_dataset/data/GCF_002021735.2/genomic.gbff.gz is 6.7 M, though the genome size is 2.4 Gb).
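
One way to confirm whether undersized `.gz` files like these are truncated is to test each archive with `gzip -t`, which reads the whole file and fails if the compressed stream ends early. The sketch below lists every `.gz` file under a directory that fails that test. The function name `list_truncated_gz` is hypothetical. A file that passes is a complete gzip stream, but that alone does not prove it holds the full genome.

```shell
# Print the path of every .gz file under directory $1 that fails
# gzip's integrity test (for example, because it is truncated).
list_truncated_gz() {
  find "$1" -type f -name '*.gz' \
    -exec sh -c 'gzip -t "$1" 2>/dev/null || printf "%s\n" "$1"' sh {} \;
}

# Example, using the directory from the commands above:
# list_truncated_gz data/raw_data/Teleostei_dataset
```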

@ericcox1
Collaborator

Thanks @mkrg01, I'll share this with the development team.

@alpole23

Hi @ericcox1,

I have a log file of all of the rehydrate errors that I encountered when downloading and rehydrating GenBank sequences from taxon Enterobacterales. I have attached that list here in case you and your team would like to use it for troubleshooting.

Enterbacterales_rehydrate_errors.txt
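
For readers who want to summarize a similar log, here is a minimal sketch that counts how many lines of a saved log contain the `unexpected EOF` message quoted in this thread. The function name `count_eof_errors` and the log file name in the example are hypothetical, and the sketch assumes the errors were captured to a plain-text file (for instance by redirecting the command's error output, which is itself an assumption about where the tool writes errors).

```shell
# Print the number of lines in log file $1 that contain "unexpected EOF".
# `|| true` keeps the function from failing when the count is zero,
# because grep exits with status 1 when nothing matches.
count_eof_errors() {
  grep -c 'unexpected EOF' "$1" || true
}

# Example (hypothetical log file name):
# count_eof_errors rehydrate_errors.log
```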

@olearyna
Contributor

Hi @alpole23,

Thanks for sharing this!

Nuala

@ericcox1
Collaborator

We are continuing to work on this issue. With the latest production release, we have implemented a feature to delete invalid files during rehydration.

-Eric

@ericcox1 ericcox1 added the bug Something isn't working label Mar 15, 2024