Is there a way to download only a part of the taxonomy? #83
Comments
Not an author of this tool, but I don't think there is a technical reason for why this wouldn't work, as long as you can make sure that all the entries in your database have a hit in the tax files. That said, the structure is not entirely 1-to-1 so reducing the tax database to a smaller test set may prove a little tricky. I think this is the file that Crabs downloads for this: |
Hello @famosab and @marchoeppner, Thank you for your query and response! If you are referring to building the taxonomic lineages, CRABS already builds only the ones that are needed for the reference sequences and excludes all others. If you are referring to the initial download of the files, I'm not aware of a way to specify a subset, as NCBI stores this as a single file (nucl_gb.accession2taxid). If you find a solution that enables a subset to be downloaded, please let me know and I'll implement it in the next update :) Thanks, |
I was assuming the idea is to do that "offline" with a locally stored version of nodes.dmp, names.dmp and nucl_gb.accession2taxid: grab, say, 10 taxa and all the matching entries in whatever database is to be used, reduce all the tax files to those taxa and ids, and build a very minimal db for linting purposes or for checking pipeline function. In any case, I suspect it will take a lot of work to do this without breaking stuff. |
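The offline reduction described above could look roughly like the sketch below. It is not part of CRABS; all function names are hypothetical. It assumes the standard NCBI dump layout (fields in nodes.dmp and names.dmp separated by a tab, a pipe and a tab, with the tax_id first and, in nodes.dmp, the parent tax_id second) and a tab-separated nucl_gb.accession2taxid with a header line and the taxid in the third column. Ancestors of the chosen taxa are kept as well, so that lineages can still be resolved up to the root.

```python
# Hypothetical helpers for shrinking NCBI taxonomy files to a few taxa.
# Assumes the usual "\t|\t" field separator of the NCBI *.dmp files.

def read_parents(nodes_lines):
    """Map every tax_id to its parent tax_id, read from nodes.dmp lines."""
    parents = {}
    for line in nodes_lines:
        fields = line.split("\t|\t")
        parents[fields[0].strip()] = fields[1].strip()
    return parents


def lineage_closure(seed_taxids, parents):
    """Return the seed taxids plus all of their ancestors up to the root."""
    keep = set()
    for taxid in seed_taxids:
        # Stop at unknown ids, at already visited ids, and at the root,
        # which lists itself as its own parent.
        while taxid in parents and taxid not in keep:
            keep.add(taxid)
            taxid = parents[taxid]
    return keep


def filter_dmp(lines, keep):
    """Yield nodes.dmp or names.dmp lines whose tax_id is in `keep`."""
    for line in lines:
        if line.split("\t|\t", 1)[0].strip() in keep:
            yield line


def filter_acc2taxid(lines, keep):
    """Yield the header line plus all rows whose taxid column is in `keep`."""
    for index, line in enumerate(lines):
        if index == 0 or line.rstrip("\n").split("\t")[2] in keep:
            yield line
```

To use it, read nodes.dmp once with `read_parents`, build the set with `lineage_closure`, then stream each of the three files through the matching filter and write the surviving lines to new files. The sequences in the test database still need to be restricted to accessions that remain in the reduced accession2taxid file.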
Thanks to both of you for the information! I will try and work with the tar.gz for our testcase. If it does not work I will tag you again @gjeunen. |
I found downsampled data which is suitable for my case. Unfortunately I always run into this error:
/usr/local/lib/python3.12/site-packages/function/crabs_functions.py:775: SyntaxWarning: invalid escape sequence '\.'
  for item in ['_sp\.','_SP\.','_indet.', '_sp.', '_SP.']:
/usr/local/lib/python3.12/site-packages/function/crabs_functions.py:775: SyntaxWarning: invalid escape sequence '\.'
  for item in ['_sp\.','_SP\.','_indet.', '_sp.', '_SP.']:
Matplotlib created a temporary cache directory at /tmp/matplotlib-am3lbtwt because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.

/// CRABS | v1.0.7

| Function | Import sequence data into CRABS format
| Read data to memory | 0% -:--:-- 0:00:00
Traceback (most recent call last):
  File "/usr/local/bin/crabs", line 847, in <module>
    crabs()
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/crabs", line 561, in crabs
    seq_input_dict, initial_seq_number = input_to_memory(task, progress_bar, input_)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/function/crabs_functions.py", line 393, in embl_to_memory
    seq_name = line.split('|')[1]
               ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
This is the command I executed:
crabs --import \
  --input genome.fasta \
  --output test.crabsdb.fa \
  --acc2tax nucl_gb.accession2taxid \
  --names names.dmp \
  --nodes nodes.dmp \
  --import-format embl --ranks 'superkingdom;phylum;class;order;family;genus;species' \
Do you know what this could relate to? I used v.1.0.7! |
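A side note on the SyntaxWarning lines at the top of that log: they are separate from the IndexError that stops the run. Python 3.12 warns when a regular string literal contains a backslash sequence it does not recognise, such as the one in `'_sp\.'`, but it still keeps the backslash in the string. The small sketch below only illustrates that behaviour. Whether the backslash is intended in CRABS at all is not known from this thread, so the variable names are purely illustrative.

```python
# A raw string and an explicitly escaped backslash give the same value,
# and neither triggers the "invalid escape sequence" warning.
raw_item = r'_sp\.'
escaped_item = '_sp\\.'

same_value = raw_item == escaped_item   # both hold: _ s p \ .
item_length = len(raw_item)             # the backslash counts as one character

# Without the backslash the literal is a different, shorter string.
plain_item = '_sp.'
differs_from_plain = raw_item != plain_item
```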
Hello @famosab, It's likely that the file is structured differently. Can you post below the first couple of lines of the document please? Best, |
Sure! This is the accession2taxid file:
This is the names file:
And this is the nodes file:
The genome.fasta looks as follows:
|
Hello @famosab, Apologies for the slow response, I'm currently out of office. The issue seems to be the format of the input file, which doesn't follow EMBL formatting. I've placed an example of the EMBL format below. Can you please confirm that
Thanks, Gert-Jan |
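The traceback above fails on `line.split('|')[1]`, which raises an IndexError as soon as a line contains no `|` character, because the split then returns a single-element list. A quick pre-check along the following lines can flag such headers before running the import. It is a sketch, not CRABS code: the function name is hypothetical, and it assumes that the line being split is a FASTA header starting with `>`.

```python
def headers_missing_separator(fasta_lines, sep="|"):
    """Return (line_number, header) pairs for FASTA headers lacking `sep`."""
    problems = []
    for number, line in enumerate(fasta_lines, start=1):
        # Only header lines matter here; sequence lines are skipped.
        if line.startswith(">") and sep not in line:
            problems.append((number, line.rstrip("\n")))
    return problems


# Made-up example input: the second header has no "|" and would fail.
example = [
    ">db|ACC1|ACC1.1 example record\n",
    "ACGTACGT\n",
    ">chr1 example genome header\n",
    "ACGTACGT\n",
]
flagged = headers_missing_separator(example)
```

An empty result means that every header has at least one `|` to split on. It does not prove that the file is otherwise valid for the `embl` import format.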
Changing to the EMBL-formatted file downloaded from here: https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ena_sequence&id=OY074094&format=fasta&style=raw solved my problem! Thank you :) |
https://github.com/gjeunen/reference_database_creator?tab=readme-ov-file#511---download-taxonomy
I understand that I can only download one of the files, but what I would want is a smaller (sort of test-data) version of the taxonomy download. If that is not possible, could you point me to the original file locations so that I can try to create smaller test data myself?
Thank you!