Is there a way to download only a part of the taxonomy? #83

famosab · 2025-02-03T05:12:28Z

https://github.com/gjeunen/reference_database_creator?tab=readme-ov-file#511---download-taxonomy

I understand that I can only download one of the files, but what I would want is a smaller (sort of test-data) version of downloading the taxonomies. Or if not maybe you can point me to the original file locations and then I will try to create smaller test data myself?

Thank you!

marchoeppner · 2025-02-03T08:38:40Z

Not an author of this tool, but I don't think there is a technical reason why this wouldn't work, as long as you can make sure that all the entries in your database have a hit in the tax files. That said, the structure is not entirely 1-to-1, so reducing the tax database to a smaller test set may prove a little tricky.

I think this is the file that Crabs downloads for this:
https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

gjeunen · 2025-02-03T10:56:41Z

Hello @famosab and @marchoeppner,

Thank you for your query and response!

If you are referring to building the taxonomic lineages, CRABS already builds only the ones that are needed for the reference sequences and excludes all others. If you are referring to the initial download of the files, I'm not aware of a way to specify a subset, as NCBI stores this as a single file (nucl_gb.accession2taxid). If you find a solution that will enable a subset to be downloaded, please let me know and I'll implement it in the next update :)

Thanks,
Gert-Jan
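
Since the mapping ships as a single file, one option is to trim it after downloading rather than during the download. The following is a minimal sketch, not part of CRABS: it assumes the tab-separated layout of nucl_gb.accession2taxid (a header line, then accession, accession.version, taxid, gi), and the function name `filter_acc2taxid` is hypothetical.

```python
import io


def filter_acc2taxid(lines, wanted):
    """Yield the header line plus every row whose accession or
    accession.version is in `wanted` (hypothetical helper, not a CRABS API)."""
    wanted = set(wanted)
    for i, line in enumerate(lines):
        if i == 0:
            yield line  # keep the column header
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and (fields[0] in wanted or fields[1] in wanted):
            yield line


# Tiny in-memory stand-in for the real file.
demo = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "MT192765\tMT192765.1\t2697049\t1821109001\n"
    "NZ_LS483480\tNZ_LS483480.1\t727\t1409087034\n"
)
kept = list(filter_acc2taxid(demo, {"MT192765.1"}))
```

For the real file, the same generator could be fed from `gzip.open(path, "rt")` and its output written to a new file, so the full table never has to be held in memory.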

marchoeppner · 2025-02-03T11:02:42Z

I was assuming the idea is to do that "offline" with a locally stored version of nodes.dmp, names.dmp and nucl_gb.accession2taxid - and then, I don't know, grab 10 taxa and all the matching entries in whatever database is to be used, reduce all the tax files to those taxa and IDs, and build a very minimal db for linting purposes or checking pipeline function? In any case, I suspect it will take a lot of work to do this without breaking stuff.
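
The offline reduction described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: it assumes the NCBI taxdump layout in which fields are separated by tab, pipe, tab, with the taxid first and (in nodes.dmp) the parent taxid second. The function names are hypothetical, and the demo lines are shortened to three columns. The key point is that every ancestor of a chosen taxon must be kept, otherwise lineages cannot be rebuilt.

```python
def ancestors_closure(nodes_lines, seed_taxids):
    """Return the seed taxids plus all their ancestors up to the root."""
    parent = {}
    for line in nodes_lines:
        fields = line.split("\t|\t")
        parent[fields[0]] = fields[1]
    keep = set()
    for taxid in seed_taxids:
        # Walk upwards; the root (taxid 1) is its own parent, so stop on repeats.
        while taxid in parent and taxid not in keep:
            keep.add(taxid)
            taxid = parent[taxid]
    return keep


def subset_dmp(lines, keep):
    """Keep only the lines whose first field (the taxid) is in `keep`.
    Works for both nodes.dmp and names.dmp."""
    return [line for line in lines if line.split("\t|\t", 1)[0] in keep]


# Shortened demo lines (real nodes.dmp lines have more columns).
nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
    "10239\t|\t1\t|\tsuperkingdom\t|\n",
]
keep = ancestors_closure(nodes, {"2"})
small_nodes = subset_dmp(nodes, keep)
```

The same `keep` set would then be applied to names.dmp, and the taxid column of the accession2taxid file would need to be restricted to it as well.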

famosab · 2025-02-03T11:54:39Z

Thanks to both of you for the information! I will try and work with the tar.gz for our testcase. If it does not work I will tag you again @gjeunen.

famosab · 2025-02-04T03:19:07Z

I found downsampled data which is suitable for my case. Unfortunately I always run into this error:

/usr/local/lib/python3.12/site-packages/function/crabs_functions.py:775: SyntaxWarning: invalid escape sequence '\.'
  for item in ['_sp\.','_SP\.','_indet.', '_sp.', '_SP.']:
/usr/local/lib/python3.12/site-packages/function/crabs_functions.py:775: SyntaxWarning: invalid escape sequence '\.'
  for item in ['_sp\.','_SP\.','_indet.', '_sp.', '_SP.']:
Matplotlib created a temporary cache directory at /tmp/matplotlib-am3lbtwt because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.

/// CRABS | v1.0.7

|            Function | Import sequence data into CRABS format
| Read data to memory |   0% -:--:-- 0:00:00
Traceback (most recent call last):
  File "/usr/local/bin/crabs", line 847, in <module>
    crabs()
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/crabs", line 561, in crabs
    seq_input_dict, initial_seq_number = input_to_memory(task, progress_bar, input_)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/function/crabs_functions.py", line 393, in embl_to_memory
    seq_name = line.split('|')[1]
               ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

This is the command I executed:

crabs --import \
    --input genome.fasta \
    --output test.crabsdb.fa \
    --acc2tax nucl_gb.accession2taxid \
    --names names.dmp \
    --nodes nodes.dmp \
    --import-format embl --ranks 'superkingdom;phylum;class;order;family;genus;species'

Do you know what this could be related to? I used v1.0.7!

gjeunen · 2025-02-05T12:00:56Z

Hello @famosab,

It's likely that the file is structured differently. Can you post below the first couple of lines of the document please?

Best,
Gert-Jan

famosab · 2025-02-06T00:59:18Z

Sure!

This is the accession2taxid file:

accession	accession.version	taxid	gi
MT192765	MT192765.1	2697049	1821109001
NZ_LS483480	NZ_LS483480.1	727	1409087034

This is the names file:

2697049	|	2019-nCoV	|		|	equivalent name	|
2697049	|	COVID-19 virus	|		|	equivalent name	|
2697049	|	HCoV-19	|		|	equivalent name	|
2697049	|	Human coronavirus 2019	|		|	equivalent name	|
2697049	|	SARS-2	|		|	equivalent name	|
2697049	|	SARS2	|		|	equivalent name	|
2697049	|	SARS-CoV-2	|		|	acronym	|
2697049	|	SARS-CoV2	|		|	equivalent name	|
2697049	|	Severe acute respiratory syndrome coronavirus 2	|		|	scientific name	|
727	|	ATCC 33391	|	ATCC 33391 <type strain>	|	type material	|
727	|	"Bacterium influenzae" Lehmann and Neumann 1896	|		|	authority	|
727	|	Bacterium influenzae	|		|	synonym	|
727	|	CCUG 23945	|	CCUG 23945 <type strain>	|	type material	|
727	|	CIP 102514	|	CIP 102514 <type strain>	|	type material	|
727	|	"Coccobacillus pfeifferi" Neveu-Lemaire 1921	|		|	authority	|
727	|	Coccobacillus pfeifferi	|		|	synonym	|
727	|	DSM 4690	|	DSM 4690 <type strain>	|	type material	|
727	|	Haemophilus influenzae (Lehmann and Neumann 1896) Winslow et al. 1917	|		|	authority	|
727	|	Haemophilus influenzae	|		|	scientific name	|
727	|	"Haemophilus meningitidis" (Martins) Hauduroy et al. 1937	|		|	authority	|
727	|	Haemophilus meningitidis	|		|	synonym	|
727	|	"Influenza-bacillus" Pfeiffer 1892	|		|	authority	|
727	|	Influenza-bacillus	|		|	synonym	|
727	|	"Mycobacterium influenzae" (Lehmann and Neumann 1896) Chester 1901	|		|	authority	|
727	|	Mycobacterium influenzae	|		|	synonym	|
727	|	NCTC 8143	|	NCTC 8143 <type strain>	|	type material	|

And this is the nodes file:

1	|	1	|	no rank	|		|	8	|	0	|	1	|	0	|	0	|	0	|	0	|	0	|		|
10239	|	1	|	superkingdom	|		|	9	|	0	|	1	|	0	|	0	|	0	|	0	|	0	|		|
2559587	|	10239	|	clade	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
2732396	|	2559587	|	kingdom	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
2732408	|	2732396	|	phylum	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
2732506	|	2732408	|	class	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
76804	|	2732506	|	order	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
2499399	|	76804	|	suborder	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
11118	|	2499399	|	family	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
2501931	|	11118	|	subfamily	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
694002	|	2501931	|	genus	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
2509511	|	694002	|	subgenus	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
694009	|	2509511	|	species	|	SA	|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant; specified	|
2697049	|	694009	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
131567	|	1	|	no rank	|		|	8	|	1	|	1	|	1	|	0	|	1	|	1	|	0	|		|
2	|	131567	|	superkingdom	|		|	0	|	0	|	11	|	0	|	0	|	0	|	0	|	0	|		|
1224	|	2	|	phylum	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|		|
1236	|	1224	|	class	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
135625	|	1236	|	order	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
712	|	135625	|	family	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
724	|	712	|	genus	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
727	|	724	|	species	|	HI	|	0	|	1	|	11	|	1	|	0	|	1	|	1	|	0	|	code compliant; specified	|

The genome.fasta looks as follows:

>MT192765.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/PC00101P/2020, complete genome
GTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGT
GTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAG
TAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGG
GTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTT
ACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAG
ATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCG
GATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGG
TGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTA
ATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTAT
GAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGC
ATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTG
...... (more lines of bases)

gjeunen · 2025-02-14T00:52:01Z

Hello @famosab,

Apologies for the slow response, I'm currently out of office.

The issue seems to be the format of the input file, which doesn't follow EMBL formatting. I've placed an example of the EMBL format below. Can you please confirm that genome.fasta was downloaded from EMBL? I think it more closely resembles the NCBI format.

>ENA|KY468184|KY468184.1 Mus musculus clone TCONS_00185153 ilncRNA-EC14 lncRNA, complete sequence.
ATTGCTGAGTCAACAGTGGTTTCGTTGTTCCACTGGCTGATGGCTTAACATTTGATTGTC
TGTAGTATTTTTGTGAAGCAGGTAGTTTGAGCAAAAGACCTAAGGCTTTAAAGACCAGAG
TTTATCTAATAGTGGTGTTTTCATCACATATTATTTTCTGAGTTTATATAAGTATTCTAA
AATCCTTTTGCTGAGAGTCATAATGTTTAGACCTAGAACACACTTAGCAGTGAGCGTGCA
CCGTATTTTACACCGGGAAGTCTGCGGCCTGGACACATGAACTAACTTGCCCAAGAGCAC
ACAACTGTAGTTTAAATACCGCCATT
>ENA|KY787229|KY787229.1 Mus musculus transgenic clone B11_VH3-4-ContMZ-I-10C immunoglobulin heavy chain variable region gene, partial cds.
ACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAG
TGGTTACTACTGGAGCTGGATCCGCCAGTCCCCAGGGAAGGGGCTGGAGTGGATTGGGGA

Thanks,

Gert-Jan
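
To make the mismatch concrete: the traceback shows the EMBL importer calling `line.split('|')[1]`, which raises an IndexError on a header that contains no pipe characters. A minimal sketch for checking headers before import is given below. The helper names are hypothetical and not part of CRABS, and the `to_ena_style` rewrite is only an assumed workaround inferred from the example headers above; whether CRABS accepts rewritten headers in every later step is untested, so downloading the file from ENA remains the safer route.

```python
def is_ena_header(line):
    """True if a FASTA header has at least the pipe-separated
    'ENA|accession|accession.version' layout."""
    return line.startswith(">") and len(line.split("|")) >= 3


def to_ena_style(line):
    """Rewrite '>ACC.V description' as '>ENA|ACC|ACC.V description'.
    Hypothetical workaround; sequence lines and ENA headers pass through unchanged."""
    if not line.startswith(">") or is_ena_header(line):
        return line
    ending = "\n" if line.endswith("\n") else ""
    head, _, desc = line[1:].rstrip("\n").partition(" ")
    accession = head.rsplit(".", 1)[0]  # drop the version suffix
    new = f">ENA|{accession}|{head}"
    return new + (f" {desc}" if desc else "") + ending


ncbi = ">MT192765.1 Severe acute respiratory syndrome coronavirus 2\n"
ena = ">ENA|KY468184|KY468184.1 Mus musculus clone TCONS_00185153\n"
converted = to_ena_style(ncbi)
```

Running `is_ena_header` over the header lines of a file before calling `crabs --import` would flag an NCBI-style file early instead of failing inside the importer.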

famosab · 2025-02-17T10:18:22Z

Changing to the EMBL-formatted file downloaded from here: https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=OY074094&format=fasta&style=raw solved my problem! Thank you :)

famosab mentioned this issue Feb 3, 2025

add crabs/dbimport from readsimulator pipeline and rename to crabs/import nf-core/modules#6584


famosab closed this as completed Feb 17, 2025
