example parallel command usage for speed-up #6

Open
splaisan opened this issue Jun 8, 2022 · 2 comments

splaisan commented Jun 8, 2022

I used the following scheme to process thousands of input proteins in a more reasonable time.
Maybe this can help others!

Check that you have enough RAM before running many jobs in parallel, since each job starts its own Java process.

# ECPred is installed for me at /opt/biotools/ECPred, edit for your own path
export ECPRED_PATH=/opt/biotools/ECPred

# split the multi-FASTA into single FASTA files, one per protein (faSplit is from UCSC tools)
mkdir -p splitseqs
faSplit byname multi-proteins.fa splitseqs/

# run the predictions with ${pthr} jobs in parallel (lower this if RAM is limited)
pthr=48
mkdir -p results

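# the command is single-quoted, so its variables are expanded by the shell that
# parallel starts for each job: ECPRED_PATH must be exported to be visible there
find splitseqs -type f -name '*.fa' | \
  parallel -j ${pthr} -k 'java -jar ${ECPRED_PATH}/ECPred.jar \
    weighted {} \
    ${ECPRED_PATH} \
    $PWD \
    results/$(basename {})_out'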

# collect and merge results
echo -e "Protein ID\tEC Number\tConfidence Score(max 1.0)" > ECPred_results.tsv
cat results/*_out | grep -v '^Protein' | sort -k 1V,1 >> ECPred_results.tsv
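
Since the RAM warning above depends on the machine, the job count can be derived instead of hard-coded. The following is only a sketch: `pick_jobs` is a hypothetical helper, and the memory needed per ECPred job is an assumption that should be measured on one real run first.

```shell
# pick_jobs CORES MEM_MB PER_JOB_MB   (hypothetical helper, not part of ECPred)
# Prints the smaller of CORES and floor(MEM_MB / PER_JOB_MB), but never less than 1.
pick_jobs() {
  cores=$1
  mem_mb=$2
  per_job_mb=$3
  by_mem=$(( mem_mb / per_job_mb ))   # how many jobs fit into memory
  jobs=$cores
  if [ "$by_mem" -lt "$jobs" ]; then
    jobs=$by_mem
  fi
  if [ "$jobs" -lt 1 ]; then
    jobs=1
  fi
  echo "$jobs"
}

# Example on Linux (assumes nproc and /proc/meminfo exist; 4000 MB per job is a guess):
#   avail_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
#   pthr=$(pick_jobs "$(nproc)" "$avail_mb" 4000)
```

With 48 cores and 64000 MB free, a 4000 MB per-job estimate would give 16 jobs rather than 48.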

fmoorhof commented Feb 24, 2023

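Thank you so much for this comment!

You can also avoid the third-party faSplit tool (from UCSC tools) by splitting with awk. This version names each file after the first word of the header line and closes each file before opening the next one, so it does not run out of file handles on large inputs:
mkdir -p splitseqs
awk '/^>/ {if (OUT) close(OUT); OUT="splitseqs/" substr($1,2) ".fa"} OUT {print > OUT}' multi-proteins.fa
If parallel is not installed, xargs -P ${pthr} can be used instead.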
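
Whichever splitter is used, it may be worth checking that the number of files matches the number of sequences, because duplicate IDs or IDs containing `/` can silently lose records. A minimal sketch, where `count_records`, `count_split_files` and `check_split` are hypothetical helper names:

```shell
# count_records FASTA: number of header lines, i.e. one per sequence
count_records() {
  grep -c '^>' "$1" || true   # grep -c still prints 0 when nothing matches
}

# count_split_files DIR EXT: number of files in DIR ending in .EXT
count_split_files() {
  find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# check_split FASTA DIR EXT: succeed only if both counts agree
check_split() {
  n_in=$(count_records "$1")
  n_out=$(count_split_files "$2" "$3")
  if [ "$n_in" -eq "$n_out" ]; then
    echo "OK: $n_in sequences, $n_out files"
  else
    echo "MISMATCH: $n_in sequences, $n_out files" >&2
    return 1
  fi
}

# Example: check_split multi-proteins.fa splitseqs fa
```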

dsaeedeh commented Nov 28, 2023

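Thanks for your help. Here is the complete xargs command for reference. Each job runs through sh -c so that basename is evaluated per file rather than once by the calling shell:
# adjust '*.fasta' to your file extension (the split steps above write '*.fa')
find splitseqs -type f -name '*.fasta' -print0 | \
  xargs -0 -P "${pthr}" -I {} sh -c \
    'java -jar "$1/ECPred.jar" weighted "$2" "$1" "$PWD" "results/$(basename "$2")_out"' \
    _ "${ECPRED_PATH}" {}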

# collect and merge results
echo -e "Protein ID\tEC Number\tConfidence Score(max 1.0)" > ECPred_results.tsv
cat results/*_out 2>/dev/null | grep -v '^Protein' | sort -k 1V,1 >> ECPred_results.tsv
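
With thousands of jobs, a few may fail (for example when memory runs out), and the merge step above will not report that. The sketch below lists inputs that have no non-empty result file. `list_missing` is a hypothetical helper, and it assumes the `results/<input file name>_out` naming used in the commands above.

```shell
# list_missing SPLIT_DIR RESULTS_DIR EXT   (hypothetical helper)
# Prints every input file that has no non-empty "<file name>_out" in RESULTS_DIR.
list_missing() {
  find "$1" -type f -name "*.$3" | sort | while IFS= read -r f; do
    out="$2/$(basename "$f")_out"
    # -s is true only if the file exists and is not empty
    [ -s "$out" ] || printf '%s\n' "$f"
  done
}

# Example: list_missing splitseqs results fasta > todo.txt
```

The resulting list could then be fed back into the same parallel or xargs command in place of the find output to rerun only the failed inputs.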
