Match protein sequences to a genome and predict genes in the matching genome regions. Locating the matches with pblat first and running exonerate only on those smaller matching sections is faster than running exonerate genome-wide, which is generally slow.
Takes a protein file and a DNA file (usually scaffolds) as input, aligns the proteins against the DNA sequences with pblat, then predicts genes in the matched sections (±500 nt) using exonerate protein2genome.
FastProteinExonerate_v220221.sh <protein file> <DNA file> <n cores> <maxIntron>
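As an illustration, the four arguments can be checked before the script is started. This is a minimal sketch: the preflight helper is hypothetical and not part of the script, and the file names, core count, and intron length in the commented call are placeholder values.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of FastProteinExonerate_v220221.sh.
# Returns 0 if the four arguments look usable, 1 otherwise.
preflight() {
    if [ "$#" -ne 4 ]; then
        echo "usage: <protein file> <DNA file> <n cores> <maxIntron>" >&2
        return 1
    fi
    local f n
    # Both input files must exist and be non-empty.
    for f in "$1" "$2"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input file: $f" >&2
            return 1
        fi
    done
    # Core count and maximum intron length must be whole numbers.
    for n in "$3" "$4"; do
        case "$n" in
            ''|*[!0-9]*) echo "expected a whole number: $n" >&2; return 1 ;;
        esac
    done
    return 0
}

# Example call with placeholder values (assumed, adjust to your data):
# preflight proteins.fasta scaffolds.fasta 8 10000 &&
#     bash FastProteinExonerate_v220221.sh proteins.fasta scaffolds.fasta 8 10000
```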
By default, the script looks for the conda setup script conda.sh at
CONDASH=/data/miniconda3/etc/profile.d/conda.sh
If your conda.sh is in a different location, edit the CONDASH path in the script.
All output files will be in a new folder called protExon. If this folder exists it will be overwritten!
- cleaned_proteins.fasta (Basic clean up of input .fasta file, a .fasta file in one-line format)
- protein_out.psl (Output of pblat)
- pblat.log (pblat log file)
- pblat.err (pblat error file)
- best_hits_protein_out.psl (Filtered pblat output file, only take the best hit for each input protein query)
- coord.info.tsv (a .tsv file with genome coordinate info +- 500nt of the matched region, one line for each matched protein sequence)
- match_coord.bed (matched coordinates in .bed format)
- match_sections.fasta (nucleotide sequences of regions)
- run.sh (The actual script that does all the work. It is created at runtime and will be quite large as it contains sequence data, not recommended to less/more/cat it)
- run.log (Log STDOUT file of the run)
- run.err (Error STDERR file of the run)
- final.gff (Output: Predicted genes in GFF format)
- final.proteins.fa (Output: translated protein sequences)
- final.cds.fa (Output: coding sequences (CDSs))
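The best-hit filtering and the 500 nt padding behind the intermediate files can be pictured with the sketch below. It is not the code from run.sh: the function name best_hits_to_bed is hypothetical, and it assumes that the "best" hit is the PSL row with the highest match count (column 1), which the script may define differently. Padded coordinates are clamped to the scaffold boundaries.

```shell
#!/usr/bin/env bash
# Sketch only: the real filtering lives in the generated run.sh and may differ.
# Assumption: "best hit" = the PSL row with the most matching residues (column 1).
# PSL columns used: 1 matches, 10 qName, 14 tName, 15 tSize, 16 tStart, 17 tEnd.
best_hits_to_bed() {
    local psl="$1" pad="${2:-500}"
    awk -v pad="$pad" 'BEGIN { FS = OFS = "\t" }
        # Only data rows start with a number; this skips the PSL header lines.
        $1 ~ /^[0-9]+$/ && NF >= 17 {
            if (!($10 in best) || $1 + 0 > best[$10]) {
                best[$10] = $1 + 0
                start = $16 - pad; if (start < 0) start = 0    # clamp at scaffold start
                end = $17 + pad;   if (end > $15) end = $15    # clamp at scaffold end
                row[$10] = $14 OFS start OFS end OFS $10
            }
        }
        END { for (q in row) print row[q] }' "$psl" | sort -k1,1 -k2,2n
}
```

Called as best_hits_to_bed protein_out.psl 500, it prints one BED-style line (scaffold, start, end, protein name) per query protein.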
The script attempts to create a conda environment named proteinexonerate and install the following dependencies into it. If proteinexonerate already exists, the script activates the existing environment instead.
- pblat
- bedtools
- exonerate
- gffread
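To confirm that the environment provides all four tools, a small check such as the following could be run after conda activate proteinexonerate. The missing_tools helper is a hypothetical convenience and not part of the script; it only reports names that are not found on PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the script: print every name from the
# argument list that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example (run inside the activated environment):
# missing_tools pblat bedtools exonerate gffread
```

No output means every listed tool was found.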