Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help to troubleshoot a clustering-profile-search workflow #935

Open
sivico26 opened this issue Jan 18, 2025 · 0 comments
Open

Help to troubleshoot a clustering-profile-search workflow #935

sivico26 opened this issue Jan 18, 2025 · 0 comments

Comments

@sivico26
Copy link

Hello there,

Thanks for developing and maintaining mmseqs2. I have a slightly specific use case and I think I might be overcomplicating it, so I would like to ask for your advice on how to proceed.

For context, I am starting from a file with my target probes.fasta, which has N genes each with different versions of the gene (let's say an average of M versions). So the total number of sequences there is NxM.

Now, separately I have a set of samples and after assembling them I want to filter the assembly contigs according to the match to the genes. The current simple approach is to do a mmseqs search with probes.fasta and the assembly but I feel I can do better than that.

My idea is to cluster the genes and make profiles out of them and use those for profiles to do a more sensitive search. Now, the proper way to achieve this is what I am unsure about.

Firstly I split probes.fasta per gene, so I now have a split_probes folder where I have N fasta files. The reason to do this is that I do not want to risk the clustering to mix genes. So to start clustering:

mkdir -p seqDBs clustersDB
parallel mkdir -p tmp/tmp_{/.} ::: split_probes/*.fasta
parallel mmseqs createdb {} seqDBs/{/.} --dbtype 1 ::: split_probes/*.fasta
parallel mmseqs cluster seqDBs/{/.} clustersDB/{/.} tmp/tmp_{/.} --min-seq-id 0.8 --threads 1 ::: split_probes/*.fasta

Now this is where it starts to get confusing because for searching I want to have a single DB with all the profiles. Should I merge the clustering results and then make a profile? or should I make N profiles and then merge them? In the end, I did the latter:

mkdir -p repsDB profilesDB
parallel mmseqs createsubdb clustersDB/{/.} seqDBs/{/.} repsDB/{/.} ::: split_probes/*.fasta
parallel mmseqs createsubdb clustersDB/{/.} seqDBs/{/.}_h repsDB/{/.}_h ::: split_probes/*.fasta
parallel mmseqs result2profile repsDB/{/.} seqDBs/{/.} clustersDB/{/.} profilesDB/{/.} --threads 1 --mask-profile 0 ::: split_probes/*.fasta

A small parenthesis, I found the CLI of mmseqs mergedbs quite odd. Why is the output the second positional argument? Why not the first or the last? It would help to play more nicely with wildcard use. Anyway, that's why these

parallel echo profilesDB/{/.} ::: split_probes/*.fasta > profsDBs
mmseqs mergedbs $(head -n1 profsDBs) merged_profsDB $(tail -n +2 profsDBs)
parallel echo profilesDB/{/.}_h ::: split_probes/*.fasta > profsDBs_h
mmseqs mergedbs $(head -n1 profsDBs_h) merged_profsDB $(tail -n +2 profsDBs_h)

Finally, I got a single DB with what I thought should be a all the profiles. So I proceeded to do the mmseq search something like this:

mmseqs search ../cleaning_test/assembled/filtering/dbs/samples/Andryala_integrifolia_ERR7618428 merged_profsDB andryala_results tmp_andryala --threads 4 -s 7.5 -e 1.00E-6 --min-length 15 --remove-tmp-files -a
mmseqs convertalis ../cleaning_test/assembled/filtering/dbs/samples/Andryala_integrifolia_ERR7618428 merged_profsDB andryala_results my_prof_andryala_table.tsv --format-mode 4 --format-output "query,evalue,qstart,qend,qlen,tstart,tend,tlen,theader,gapopen,nident,mismatch" --threads 4

However, the table result I got seems malformed: my_prof_andryala_table.tsv.txt

To give you an idea of what I expect here is what the current approach produces: Andryala_integrifolia_ERR7618428.tsv.txt

This is it, I would really appreciate your advice to troubleshoot this.
Thank you in advance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant