I used the following scheme to process thousands of input proteins in a reasonable time.
Maybe this helps others!
Each parallel job starts its own Java process, so please check that you have enough RAM before using many cores!
# ECPred is installed for me at /opt/biotools/ECPred, edit for your own path
# (exported because the single-quoted parallel command below is expanded by a child shell)
export ECPRED_PATH=/opt/biotools/ECPred
# split the multi-FASTA into single FASTA files, one per protein (faSplit is from UCSC tools)
mkdir -p splitseqs
faSplit byname multi-proteins.fa splitseqs/
# run the predictions in parallel; pthr is the number of simultaneous jobs
pthr=48
mkdir -p results
find splitseqs -type f -name '*.fa' | \
parallel -j ${pthr} -k 'java -jar ${ECPRED_PATH}/ECPred.jar \
weighted {} \
/${ECPRED_PATH} \
$PWD \
results/$(basename {})_out'
# collect and merge results
echo -e "Protein ID\tEC Number\tConfidence Score(max 1.0)" > ECPred_results.tsv
cat results/*_out | grep -v '^Protein' | sort -k 1V,1 >> ECPred_results.tsv
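Before trusting the merged table, it may be worth checking that every split file actually produced output, since a job that ran out of memory can fail without stopping the others. The following is a minimal sketch, assuming the results/<file name>_out naming used above; the function name missing_results is hypothetical, and treating an empty result file as a failure is an assumption.

```shell
#!/usr/bin/env bash
# List split FASTA files whose expected result file is missing or empty.
# Assumes the naming scheme from the script above: <resdir>/<fasta file name>_out
missing_results() {
    local seqdir=$1 resdir=$2 fa
    for fa in "$seqdir"/*.fa; do
        # If there are no .fa files, the glob stays literal; skip it.
        [ -e "$fa" ] || continue
        # -s is true only for an existing, non-empty file.
        [ -s "$resdir/$(basename "$fa")_out" ] || printf '%s\n' "$fa"
    done
}
```

Calling missing_results splitseqs results prints one path per protein that still needs a run, so the list can be fed back into parallel or xargs.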
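You can also avoid the third-party faSplit (UCSC tools) dependency with awk, after creating the splitseqs directory: awk '/^>/ {if (OUT) close(OUT); split(substr($0,2), a, " "); OUT="splitseqs/" a[1] ".fa"} OUT {print > OUT}' multi-proteins.fa. This names each file after the first word of its header and closes it before opening the next one, so thousands of sequences do not exhaust the open-file limit. IDs that contain a "/" still need to be sanitized first.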
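Whichever splitter is used, a quick count can confirm that no sequence was lost, for example because two headers share the same ID and overwrote each other. This is only a sketch: the function name check_split is hypothetical, and it assumes the split files end in .fa.

```shell
#!/usr/bin/env bash
# Compare the number of FASTA headers in the input with the number of split files.
# Prints both counts and returns non-zero when they differ.
check_split() {
    local fasta=$1 dir=$2 n_seq n_files
    n_seq=$(grep -c '^>' "$fasta")
    n_files=$(find "$dir" -type f -name '*.fa' | wc -l)
    n_files=$((n_files))   # normalize padding that some wc implementations add
    echo "sequences=$n_seq files=$n_files"
    [ "$n_seq" -eq "$n_files" ]
}
```

For the files in this thread the call would be check_split multi-proteins.fa splitseqs.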
If GNU parallel is not installed, xargs -P ${pthr} can run the jobs in parallel instead.
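Thanks for your help. Here is the complete xargs command for reference:
# match the extension your split step produced (.fa with faSplit)
# $(basename ...) has to run inside sh -c; otherwise the calling shell expands it before xargs substitutes {}
find splitseqs -type f -name '*.fa' -print0 | \
xargs -0 -P ${pthr} -I {} sh -c 'java -jar "$1/ECPred.jar" \
weighted "$2" \
"$1" \
"$PWD" \
"results/$(basename "$2")_out"' _ "${ECPRED_PATH}" {}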
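On the RAM caveat from the first post: one way to pick the job count is to divide the available memory by the peak memory of a single run. The sketch below only does that arithmetic. The function name jobs_for_mem is hypothetical, and the per-job figure has to be measured on your own data (for example with GNU time, /usr/bin/time -v, on one protein); nothing here is a value reported for ECPred.

```shell
#!/usr/bin/env bash
# Print how many jobs fit into memory:
#   min(max_jobs, avail_kb / (per_job_mb * 1024)), but never less than 1.
# avail_kb   : available memory in kB (e.g. MemAvailable from /proc/meminfo)
# per_job_mb : assumed peak memory of one job in MB (measure this yourself)
# max_jobs   : upper bound, e.g. the number of cores
jobs_for_mem() {
    local avail_kb=$1 per_job_mb=$2 max_jobs=$3
    local n=$(( avail_kb / (per_job_mb * 1024) ))
    if [ "$n" -gt "$max_jobs" ]; then n=$max_jobs; fi
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}
```

On Linux this could be used as pthr=$(jobs_for_mem "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" 4096 "$(nproc)"), where 4096 is only a placeholder for the measured per-job memory in MB.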