Is there a way to download only a part of the taxonomy? #83

Closed
famosab opened this issue Feb 3, 2025 · 9 comments

Comments

@famosab

famosab commented Feb 3, 2025

https://github.com/gjeunen/reference_database_creator?tab=readme-ov-file#511---download-taxonomy

I understand that I can choose to download only one of the files, but what I would like is a smaller (test-data-sized) version of the taxonomy download. If that is not possible, could you point me to the original file locations so that I can try to create smaller test data myself?

Thank you!

@marchoeppner

Not an author of this tool, but I don't think there is a technical reason for why this wouldn't work, as long as you can make sure that all the entries in your database have a hit in the tax files. That said, the structure is not entirely 1-to-1 so reducing the tax database to a smaller test set may prove a little tricky.

I think this is the file that Crabs downloads for this:
https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
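The condition mentioned above, that every entry in the database needs a hit in the tax files, can be checked before an import is attempted. A minimal sketch, assuming a tab-separated accession2taxid table with a header row and the columns accession, accession.version, taxid, gi, and FASTA headers that start with the versioned accession. The function names and the toy inputs are hypothetical and not part of CRABS:

```python
import io


def fasta_accessions(handle):
    """Yield the first whitespace-delimited token of every FASTA header."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def load_acc2taxid(handle):
    """Map accession.version -> taxid from an accession2taxid table."""
    mapping = {}
    next(handle, None)  # skip the header row
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            mapping[parts[1]] = int(parts[2])
    return mapping


def missing_accessions(fasta_handle, acc2taxid_handle):
    """Return the FASTA accessions that have no taxid in the table."""
    mapping = load_acc2taxid(acc2taxid_handle)
    return [acc for acc in fasta_accessions(fasta_handle) if acc not in mapping]


# Toy in-memory inputs (hypothetical) standing in for the real files.
fasta = io.StringIO(">MT192765.1 example\nACGT\n>XX000001.1 not in table\nACGT\n")
table = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "MT192765\tMT192765.1\t2697049\t1821109001\n"
)
missing = missing_accessions(fasta, table)
print(missing)
```

An empty list means every sequence has a taxid; anything else lists the accessions that would be left without a lineage.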

@gjeunen
Owner

gjeunen commented Feb 3, 2025

Hello @famosab and @marchoeppner,

Thank you for your query and response!

If you are referring to building the taxonomic lineages, CRABS already builds only the ones that are needed for the reference sequences and excludes all others. If you are referring to the initial download of the files, I'm not aware of a way to specify a subset, as NCBI stores this as a single file (nucl_gb.accession2taxid). If you find a solution that will enable a subset to be downloaded, please let me know and I'll implement it in the next update :)

Thanks,
Gert-Jan

@marchoeppner

I was assuming the idea is to do that "offline" with a locally stored version of nodes.dmp, names.dmp and nucl_gb.accession2taxid, and then, I don't know, grab 10 taxa and all the matching entries in whatever database is to be used, reduce all the tax files to those taxa and ids, and build a very minimal db for linting purposes or for checking pipeline function? In any case, I suspect this will be a lot of work to not break stuff.
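The offline reduction described above could look roughly like the sketch below. The part that is easy to get wrong is that nodes.dmp must keep every ancestor of the chosen taxa, otherwise lineages can no longer be walked up to the root. This assumes the standard NCBI .dmp layout (fields separated by tab, pipe, tab, with the root node listed as its own parent); all function names and the toy rows are hypothetical, and this is not how CRABS itself works:

```python
def dmp_fields(line):
    """Split one NCBI taxonomy .dmp row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def ancestor_closure(seed_taxids, parent_of):
    """Return the seed taxids plus every ancestor up to the root."""
    keep = set()
    for taxid in seed_taxids:
        # The root node is its own parent, so the walk always terminates.
        while taxid not in keep:
            keep.add(taxid)
            taxid = parent_of[taxid]
    return keep


def subset_taxonomy(seed_taxids, nodes_lines, names_lines, acc2taxid_lines):
    """Reduce the three taxonomy files (given as lists of lines) to the seeds."""
    seeds = set(seed_taxids)
    parent_of = {}
    for line in nodes_lines:
        fields = dmp_fields(line)
        parent_of[int(fields[0])] = int(fields[1])
    keep = ancestor_closure(seeds, parent_of)

    nodes_out = [l for l in nodes_lines if int(dmp_fields(l)[0]) in keep]
    names_out = [l for l in names_lines if int(dmp_fields(l)[0]) in keep]
    header, *rows = acc2taxid_lines
    acc_out = [header] + [r for r in rows if int(r.split("\t")[2]) in seeds]
    return nodes_out, names_out, acc_out


# Toy rows with made-up taxids; real nodes.dmp rows carry more fields.
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "11\t|\t10\t|\tspecies\t|\n",
    "20\t|\t1\t|\tgenus\t|\n",
]
NAMES = [
    "10\t|\tToygenus\t|\t\t|\tscientific name\t|\n",
    "11\t|\tToygenus alpha\t|\t\t|\tscientific name\t|\n",
    "20\t|\tOthergenus\t|\t\t|\tscientific name\t|\n",
]
ACC2TAXID = [
    "accession\taccession.version\ttaxid\tgi\n",
    "AB000001\tAB000001.1\t11\t111\n",
    "AB000002\tAB000002.1\t20\t222\n",
]
nodes_out, names_out, acc_out = subset_taxonomy({11}, NODES, NAMES, ACC2TAXID)
print(len(nodes_out), len(names_out), len(acc_out))
```

The full nucl_gb.accession2taxid is far too large to hold as a list of lines, so a real run would stream that file row by row instead.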

@famosab
Author

famosab commented Feb 3, 2025

Thanks to both of you for the information! I will try and work with the tar.gz for our testcase. If it does not work I will tag you again @gjeunen.

@famosab
Author

famosab commented Feb 4, 2025

I found downsampled data which is suitable for my case. Unfortunately I always run into this error:

/usr/local/lib/python3.12/site-packages/function/crabs_functions.py:775: SyntaxWarning: invalid escape sequence '\.'
  for item in ['_sp\.','_SP\.','_indet.', '_sp.', '_SP.']:
/usr/local/lib/python3.12/site-packages/function/crabs_functions.py:775: SyntaxWarning: invalid escape sequence '\.'
  for item in ['_sp\.','_SP\.','_indet.', '_sp.', '_SP.']:
Matplotlib created a temporary cache directory at /tmp/matplotlib-am3lbtwt because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.

/// CRABS | v1.0.7

|            Function | Import sequence data into CRABS format
| Read data to memory |                                       0% -:--:-- 0:00:00
Traceback (most recent call last):
  File "/usr/local/bin/crabs", line 847, in <module>
    crabs()
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/crabs", line 561, in crabs
    seq_input_dict, initial_seq_number = input_to_memory(task, progress_bar, input_)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/function/crabs_functions.py", line 393, in embl_to_memory
    seq_name = line.split('|')[1]
               ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

This is the command I executed:

crabs --import \
    --input genome.fasta \
    --output test.crabsdb.fa \
    --acc2tax nucl_gb.accession2taxid \
    --names names.dmp \
    --nodes nodes.dmp \
    --import-format embl --ranks 'superkingdom;phylum;class;order;family;genus;species' \

Do you know what this could relate to? I used v.1.0.7!

@gjeunen
Owner

gjeunen commented Feb 5, 2025

Hello @famosab,

It's likely that the file is structured differently. Can you post below the first couple of lines of the document please?

Best,
Gert-Jan

@famosab
Author

famosab commented Feb 6, 2025

Sure!

This is the accession2taxid file:

accession	accession.version	taxid	gi
MT192765	MT192765.1	2697049	1821109001
NZ_LS483480	NZ_LS483480.1	727	1409087034

This is the names file:

2697049	|	2019-nCoV	|		|	equivalent name	|
2697049	|	COVID-19 virus	|		|	equivalent name	|
2697049	|	HCoV-19	|		|	equivalent name	|
2697049	|	Human coronavirus 2019	|		|	equivalent name	|
2697049	|	SARS-2	|		|	equivalent name	|
2697049	|	SARS2	|		|	equivalent name	|
2697049	|	SARS-CoV-2	|		|	acronym	|
2697049	|	SARS-CoV2	|		|	equivalent name	|
2697049	|	Severe acute respiratory syndrome coronavirus 2	|		|	scientific name	|
727	|	ATCC 33391	|	ATCC 33391 <type strain>	|	type material	|
727	|	"Bacterium influenzae" Lehmann and Neumann 1896	|		|	authority	|
727	|	Bacterium influenzae	|		|	synonym	|
727	|	CCUG 23945	|	CCUG 23945 <type strain>	|	type material	|
727	|	CIP 102514	|	CIP 102514 <type strain>	|	type material	|
727	|	"Coccobacillus pfeifferi" Neveu-Lemaire 1921	|		|	authority	|
727	|	Coccobacillus pfeifferi	|		|	synonym	|
727	|	DSM 4690	|	DSM 4690 <type strain>	|	type material	|
727	|	Haemophilus influenzae (Lehmann and Neumann 1896) Winslow et al. 1917	|		|	authority	|
727	|	Haemophilus influenzae	|		|	scientific name	|
727	|	"Haemophilus meningitidis" (Martins) Hauduroy et al. 1937	|		|	authority	|
727	|	Haemophilus meningitidis	|		|	synonym	|
727	|	"Influenza-bacillus" Pfeiffer 1892	|		|	authority	|
727	|	Influenza-bacillus	|		|	synonym	|
727	|	"Mycobacterium influenzae" (Lehmann and Neumann 1896) Chester 1901	|		|	authority	|
727	|	Mycobacterium influenzae	|		|	synonym	|
727	|	NCTC 8143	|	NCTC 8143 <type strain>	|	type material	|

And this is the nodes file:

1	|	1	|	no rank	|		|	8	|	0	|	1	|	0	|	0	|	0	|	0	|	0	|		|
10239	|	1	|	superkingdom	|		|	9	|	0	|	1	|	0	|	0	|	0	|	0	|	0	|		|
2559587	|	10239	|	clade	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
2732396	|	2559587	|	kingdom	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
2732408	|	2732396	|	phylum	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
2732506	|	2732408	|	class	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
76804	|	2732506	|	order	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
2499399	|	76804	|	suborder	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
11118	|	2499399	|	family	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
2501931	|	11118	|	subfamily	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
694002	|	2501931	|	genus	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
2509511	|	694002	|	subgenus	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
694009	|	2509511	|	species	|	SA	|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|	code compliant; specified	|
2697049	|	694009	|	no rank	|		|	9	|	1	|	1	|	1	|	0	|	1	|	0	|	0	|		|
131567	|	1	|	no rank	|		|	8	|	1	|	1	|	1	|	0	|	1	|	1	|	0	|		|
2	|	131567	|	superkingdom	|		|	0	|	0	|	11	|	0	|	0	|	0	|	0	|	0	|		|
1224	|	2	|	phylum	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|		|
1236	|	1224	|	class	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
135625	|	1236	|	order	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
712	|	135625	|	family	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
724	|	712	|	genus	|		|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|	code compliant	|
727	|	724	|	species	|	HI	|	0	|	1	|	11	|	1	|	0	|	1	|	1	|	0	|	code compliant; specified	|
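To illustrate how a nodes file like this relates to the --ranks list in the command, the sketch below walks from a taxid up to the root and keeps the taxid found at each requested rank. The (taxid, parent, rank) triples are the first three columns of the Haemophilus influenzae rows posted above; the function name is hypothetical and this is only an illustration, not the CRABS implementation:

```python
# (taxid, parent taxid, rank) taken from the nodes rows shown above.
NODE_ROWS = [
    (1, 1, "no rank"),
    (131567, 1, "no rank"),
    (2, 131567, "superkingdom"),
    (1224, 2, "phylum"),
    (1236, 1224, "class"),
    (135625, 1236, "order"),
    (712, 135625, "family"),
    (724, 712, "genus"),
    (727, 724, "species"),
]
NODES = {taxid: (parent, rank) for taxid, parent, rank in NODE_ROWS}


def lineage_by_rank(taxid, nodes, ranks):
    """Return one taxid per requested rank (None where a rank is absent)."""
    wanted = set(ranks)
    found = {}
    while True:
        parent, rank = nodes[taxid]
        if rank in wanted and rank not in found:
            found[rank] = taxid
        if parent == taxid:  # the root node is its own parent
            break
        taxid = parent
    return [found.get(rank) for rank in ranks]


ranks = "superkingdom;phylum;class;order;family;genus;species".split(";")
lineage = lineage_by_rank(727, NODES, ranks)
print(lineage)  # [2, 1224, 1236, 135625, 712, 724, 727]
```

Turning these taxids into names would additionally need the "scientific name" rows of names.dmp for every taxid in the lineage.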

The genome.fasta looks as follows:

>MT192765.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/PC00101P/2020, complete genome
GTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGT
GTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAG
TAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGG
GTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTT
ACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAG
ATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCG
GATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGG
TGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTA
ATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTAT
GAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGC
ATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTG
...... (more lines of bases)

@gjeunen
Owner

gjeunen commented Feb 14, 2025

Hello @famosab,

Apologies for the slow response, I'm currently out of office.

The issue seems to be the format of the input file, which doesn't follow EMBL formatting. I've placed an example of the EMBL format below. Can you please confirm that genome.fasta was downloaded from EMBL? I think it more closely resembles the NCBI format.

>ENA|KY468184|KY468184.1 Mus musculus clone TCONS_00185153 ilncRNA-EC14 lncRNA, complete sequence.
ATTGCTGAGTCAACAGTGGTTTCGTTGTTCCACTGGCTGATGGCTTAACATTTGATTGTC
TGTAGTATTTTTGTGAAGCAGGTAGTTTGAGCAAAAGACCTAAGGCTTTAAAGACCAGAG
TTTATCTAATAGTGGTGTTTTCATCACATATTATTTTCTGAGTTTATATAAGTATTCTAA
AATCCTTTTGCTGAGAGTCATAATGTTTAGACCTAGAACACACTTAGCAGTGAGCGTGCA
CCGTATTTTACACCGGGAAGTCTGCGGCCTGGACACATGAACTAACTTGCCCAAGAGCAC
ACAACTGTAGTTTAAATACCGCCATT
>ENA|KY787229|KY787229.1 Mus musculus transgenic clone B11_VH3-4-ContMZ-I-10C immunoglobulin heavy chain variable region gene, partial cds.
ACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAG
TGGTTACTACTGGAGCTGGATCCGCCAGTCCCCAGGGAAGGGGCTGGAGTGGATTGGGGA

Thanks,

Gert-Jan
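The difference between the two header styles explains the IndexError in the traceback: the failing line there is `line.split('|')[1]`, and a header without any `|` splits into a single element. A small sketch reproducing this with the two headers quoted in this thread; `embl_accession` is a hypothetical helper that reports the problem more clearly, not a CRABS function:

```python
def embl_accession(header):
    """Return the accession from an ENA-style header: >ENA|ACC|ACC.VERSION ..."""
    parts = header.lstrip(">").split("|")
    if len(parts) < 3:
        raise ValueError(f"not an ENA-style header: {header!r}")
    return parts[1]


ena_header = (
    ">ENA|KY468184|KY468184.1 Mus musculus clone TCONS_00185153 "
    "ilncRNA-EC14 lncRNA, complete sequence."
)
ncbi_header = (
    ">MT192765.1 Severe acute respiratory syndrome coronavirus 2 isolate "
    "SARS-CoV-2/human/USA/PC00101P/2020, complete genome"
)

# The ENA-style header has three pipe-separated parts before the description.
print(ena_header.split("|")[1])

# The NCBI-style header contains no pipe, so index 1 does not exist.
try:
    ncbi_header.split("|")[1]
except IndexError as exc:
    print("IndexError:", exc)

# The helper turns the same situation into a more descriptive error.
try:
    embl_accession(ncbi_header)
except ValueError as exc:
    print("ValueError:", exc)
```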

@famosab
Author

famosab commented Feb 17, 2025

Changing to the EMBL-formatted file downloaded from https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=OY074094&format=fasta&style=raw solved my problem! Thank you :)

famosab closed this as completed Feb 17, 2025