Hello there,

Thanks for developing and maintaining `mmseqs2`. I have a slightly specific use case and I think I might be overcomplicating it, so I would like to ask for your advice on how to proceed.

For context, I am starting from a file with my targets, `probes.fasta`, which has N genes, each with different versions of the gene (let's say an average of M versions). So the total number of sequences there is N×M.

Now, separately, I have a set of samples, and after assembling them I want to filter the assembly contigs according to their match to the genes. The current simple approach is to do a `mmseqs search` with `probes.fasta` and the assembly, but I feel I can do better than that.

My idea is to cluster the genes, make profiles out of them, and use those profiles to do a more sensitive search. Now, the proper way to achieve this is what I am unsure about.

Firstly, I split `probes.fasta` per gene, so I now have a `split_probes` folder with N fasta files. The reason to do this is that I do not want to risk the clustering mixing genes. So to start clustering:

Now this is where it starts to get confusing, because for searching I want to have a single DB with all the profiles. Should I merge the clustering results and then make a profile? Or should I make N profiles and then merge them? In the end, I did the latter:

A small parenthesis: I found the CLI of `mmseqs mergedbs` quite odd. Why is the output the second positional argument? Why not the first or the last? It would help it play more nicely with wildcard use. Anyway, that's why these

Finally, I got a single DB with what I thought should be all the profiles. So I proceeded to do the `mmseqs search`, something like this:

    mmseqs search ../cleaning_test/assembled/filtering/dbs/samples/Andryala_integrifolia_ERR7618428 merged_profsDB andryala_results tmp_andryala --threads 4 -s 7.5 -e 1.00E-6 --min-length 15 --remove-tmp-files -a
    mmseqs convertalis ../cleaning_test/assembled/filtering/dbs/samples/Andryala_integrifolia_ERR7618428 merged_profsDB andryala_results my_prof_andryala_table.tsv --format-mode 4 --format-output "query,evalue,qstart,qend,qlen,tstart,tend,tlen,theader,gapopen,nident,mismatch" --threads 4
However, the table result I got seems malformed: my_prof_andryala_table.tsv.txt
To give you an idea of what I expect, here is what the current approach produces: Andryala_integrifolia_ERR7618428.tsv.txt
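In case it helps to pin down what "malformed" means here, a minimal sketch of a field-count check follows. It assumes the table is tab-separated and that every row, including the header line added by `--format-mode 4`, should carry the 12 fields listed in `--format-output`. The helper name `check_fields` and the demo table are hypothetical and only illustrate the idea; they are not real mmseqs output.

```shell
#!/bin/sh
# Hypothetical helper: list lines whose tab-separated field count
# differs from the expected one, then print a summary count.
# Usage: check_fields FILE EXPECTED_COLUMNS
check_fields() {
    awk -F '\t' -v n="$2" '
        NF != n { bad++; printf "line %d: %d fields\n", NR, NF }
        END     { printf "%d malformed line(s)\n", bad + 0 }
    ' "$1"
}

# Made-up three-column demo table: a header and two rows, one of them short.
demo=$(mktemp)
printf 'query\tevalue\tqstart\n' >  "$demo"
printf 'c1\t1e-10\t5\n'          >> "$demo"
printf 'c2\t1e-08\n'             >> "$demo"
check_fields "$demo" 3
rm -f "$demo"
```

On the real table this would be called as `check_fields my_prof_andryala_table.tsv 12`. It only flags rows with the wrong number of columns, so it would not catch values that are wrong but well-formed.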
That is it. I would really appreciate your advice on troubleshooting this.
Thank you in advance