genome DB is unavailable #196

Open
karl-cottenie opened this issue Oct 11, 2024 · 5 comments
Comments

@karl-cottenie

I tested these two links, and they worked as expected.

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=6060535&retmote=rsr
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=rentrez

Here is my problem.

I received an error message for this search:
R> entrez_search(db = "genome", term = "sphingidae")
Error in ans[[1]] : subscript out of bounds
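
The `subscript out of bounds` message is raised inside the result-parsing code and says little about the cause. Below is a minimal sketch of a wrapper that catches the error and reports which database and term triggered it. `safe_entrez_search` and its `search_fn` argument are hypothetical names introduced for illustration, not part of rentrez; only `entrez_search(db, term)` is taken from the call above.

```r
# Hypothetical helper (not part of rentrez): run a search and turn any
# error into a warning plus a NULL result, so a script can carry on.
# `search_fn` defaults to rentrez::entrez_search; it is an argument only
# so the wrapper can be tried with a stub and no network access.
safe_entrez_search <- function(db, term, search_fn = rentrez::entrez_search) {
  tryCatch(
    search_fn(db = db, term = term),
    error = function(e) {
      warning(sprintf("Search of db '%s' for '%s' failed: %s",
                      db, term, conditionMessage(e)),
              call. = FALSE)
      NULL
    }
  )
}

# Stub that fails with the same message as the search above.
failing_search <- function(db, term) stop("subscript out of bounds")

res <- suppressWarnings(
  safe_entrez_search("genome", "sphingidae", failing_search)
)
print(is.null(res))  # TRUE
```

With rentrez installed and network access, `safe_entrez_search("genome", "sphingidae")` would use the real search function instead of the stub.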

I checked if the genome database was listed:
R> entrez_dbs()
[1] "pubmed" "protein" "nuccore" "ipg" "nucleotide" "structure"
[7] "genome" "annotinfo" "assembly" "bioproject" "biosample" "blastdbinfo"
[13] "books" "cdd" "clinvar" "gap" "gapplus" "grasp"
[19] "dbvar" "gene" "gds" "geoprofiles" "medgen" "mesh"
[25] "nlmcatalog" "omim" "orgtrack" "pmc" "popset" "proteinclusters"
[31] "pcassay" "protfam" "pccompound" "pcsubstance" "seqannot" "snp"
[37] "sra" "taxonomy" "biocollections" "gtr"

But then I found that the DB is unavailable:
R> entrez_db_summary(db = "genome")
DbName: genome
MenuName: Genome
Description: Genomic sequences, contigs, and maps
DbBuild:
Warning: pback220: DB is unavailable

While the other databases are updated, e.g.:
R> entrez_db_summary(db = "taxonomy")
DbName: taxonomy
MenuName: Taxonomy
Description: Taxonomy db
DbBuild: Build240912-1410.1
Count: 2744579
LastUpdate: 2024/09/12 16:00
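
The two summaries differ in a way that can be checked programmatically: the unavailable database carries a `Warning` field and has no `Count`. The sketch below tests for that pattern. It assumes the summary is a named character vector whose names match the labels printed above, which is an inference from the printed output rather than something confirmed here; `db_looks_available` is a hypothetical helper name.

```r
# Hypothetical helper: decide from a database summary whether the database
# looks usable. Assumes a named character vector whose names match the
# labels printed above (DbName, DbBuild, Count, Warning, ...).
db_looks_available <- function(db_summary) {
  fields <- names(db_summary)
  has_warning <- "Warning" %in% fields
  has_count <- "Count" %in% fields && nzchar(db_summary[["Count"]])
  !has_warning && has_count
}

# Values copied from the two summaries in this report.
genome_summary <- c(
  DbName = "genome", MenuName = "Genome",
  Description = "Genomic sequences, contigs, and maps",
  DbBuild = "", Warning = "pback220: DB is unavailable"
)
taxonomy_summary <- c(
  DbName = "taxonomy", MenuName = "Taxonomy", Description = "Taxonomy db",
  DbBuild = "Build240912-1410.1", Count = "2744579",
  LastUpdate = "2024/09/12 16:00"
)

print(db_looks_available(genome_summary))    # FALSE
print(db_looks_available(taxonomy_summary))  # TRUE

# With rentrez installed and network access, the live check would be:
# db_looks_available(rentrez::entrez_db_summary("genome"))
```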

I tested whether the genome database was available through the CLI, and that was the case:
$ datasets summary genome taxon 'bats' --assembly-source refseq --as-json-lines | dataformat tsv genome --fields accession,assminfo-name,annotinfo-name,annotinfo-release-date,organism-name
Assembly Accession  Assembly Name        Annotation Name                                    Annotation Release Date  Organism Name
GCF_004115265.2     mRhiFer1_v1.p        GCF_004115265.2-RS_2023_02                         2023-02-27               Rhinolophus ferrumequinum
GCF_022682495.1     HLdesRot8A           GCF_022682495.1-RS_2023_02                         2023-02-27               Desmodus rotundus
GCF_027574615.1     DD_ASM_mEF_20220401  GCF_027574615.1-RS_2023_03                         2023-03-28               Eptesicus fuscus
GCF_004126475.2     mPhyDis1.pri.v3      NCBI Phyllostomus discolor Annotation Release 101  2020-08-31               Phyllostomus discolor
[...]

In conclusion, it seems to me that the genome database is accessible, but not through the R entrez interface?

@allenbaron

allenbaron commented Oct 14, 2024

Sorry, can you clarify: did you run esearch on the command line with the same inputs, and did it work?

A quick debug shows that the error is returned from the entrez utils server. For me it returned: Search Backend failed: Database is not supported: genome.

entrez_db_searchable("genome") also returns nothing. And entrez_info("genome") returns only links.
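
One way to see the server's own message, independent of rentrez, is to request the esearch URL directly. The sketch below only builds that URL, using the same endpoint as the esearch test link in the original report; `esearch_url` is a hypothetical helper name. The fetch itself is left as a comment because the exact format of the server's reply is not shown in this thread.

```r
# Hypothetical helper: build an esearch request URL for a database and a
# search term, percent-encoding both values.
esearch_url <- function(db, term) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  query <- paste0("db=", utils::URLencode(db, reserved = TRUE),
                  "&term=", utils::URLencode(term, reserved = TRUE))
  paste0(base, "?", query)
}

url <- esearch_url("genome", "sphingidae")
print(url)
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=sphingidae

# With network access, the raw reply can be printed for inspection:
# writeLines(readLines(url, warn = FALSE))
```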

@karl-cottenie
Author

I accessed the genome database through the CLI following these instructions: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/get-genome-metadata/

with the following command:

datasets summary genome taxon 'bats' [... the rest of the command is just to reformat the output ...]

That command did give me output:

Assembly Accession  Assembly Name        Annotation Name                                    Annotation Release Date  Organism Name
GCF_004115265.2     mRhiFer1_v1.p        GCF_004115265.2-RS_2023_02                         2023-02-27               Rhinolophus ferrumequinum
GCF_022682495.1     HLdesRot8A           GCF_022682495.1-RS_2023_02                         2023-02-27               Desmodus rotundus
GCF_027574615.1     DD_ASM_mEF_20220401  GCF_027574615.1-RS_2023_03                         2023-03-28               Eptesicus fuscus
GCF_004126475.2     mPhyDis1.pri.v3      NCBI Phyllostomus discolor Annotation Release 101  2020-08-31               Phyllostomus discolor
[...]

So I just assumed that the genome database is accessible? Unless the database called "genome" through the CLI "datasets" command is different from the database called "genome" accessed through R?

@allenbaron

That looks like a different API. rentrez uses NCBI's Entrez Utilities (https://www.ncbi.nlm.nih.gov/home/tools/).

@karl-cottenie
Author

I did not know that, thanks for the clarification. So the fact that entrez_db_summary(db = "genome") reports the database as unavailable is a problem on NCBI's end, and there is nothing you can do about it?

@allenbaron

That is correct. The issues you are seeing all come directly from the data returned by NCBI's servers, not from rentrez.

NCBI has replaced the webpages corresponding to these databases with the datasets API you've mentioned. I can't tell you for sure if access to this database via rentrez/E-utilities has changed or not, since I don't use it regularly, but it is supposed to have stayed the same according to their documentation: https://support.nlm.nih.gov/kbArticle/?pn=KA-05455.
