-
Notifications
You must be signed in to change notification settings - Fork 24
Home
Maintainer: Mike Lin
PhyloCSF is a method to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region. PhyloCSF does not rely on homology to known protein sequences; instead, it examines evolutionary signatures characteristic to alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and non-sense substitutions (CSF = Codon Substitution Frequencies). One of PhyloCSF’s main current applications is to help distinguish protein-coding and non-coding RNAs represented among novel transcript models obtained from high-throughput transcriptome sequencing.
The mathematical details of the method, and benchmarks comparing it to other approaches, are described in:
- Lin MF, Jungreis I, and Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011)
The following papers also provide relevant background and/or illustrate potential applications.
- Lin MF, Carlson JW, Crosby MA, Matthews BA, et al. (2007). Revisiting the protein-coding gene catalog of Drosophila melanogaster using twelve fly genomes. Genome Research 17:1823-1836
- Lin MF, Deoras AN, Rasmussen MD, and Kellis M (2008). Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. PLoS Comput Biol 4:e1000067
- Guttman M, Garber M, Levin JZ, Donaghey J, et al. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotech 28: 503-510
The remainder of this document discusses how to obtain and use the publicly available implementation of PhyloCSF. We keep a WishList of planned or (mostly) wished-for future improvements.
Input: The user must prepare a cross-species nucleotide sequence alignment for PhyloCSF. PhyloCSF can be applied to alignments of complete open reading frames (ORFs), individual exons, or complete or partial transcripts (i.e. concatenated exons).
Output: PhyloCSF outputs a score, positive if the alignment is likely to represent a conserved coding region, and negative otherwise.
OPAM Quick Install Instructions
This will install a minimal OCaml toolchain (if not already present on your system) and the OPAM package manager. Once it’s installed, initialize the installation by running opam init
and then set up your environment with eval $(opam config env)
.
opam install batteries ocaml+twt gsl
To successfully install the gsl
package, which provides OCaml bindings to the GNU Scientific Library (GSL), you may also need to install GSL itself, if it’s not already available on your system. On Debian/Ubuntu systems GSL is available in the libgsl0-dev
package, and on Mac OS X, gsl
is available in both Homebrew and MacPorts. Once GSL itself is installed, it’s still necessary to install the OCaml bindings using opam
, as shown above.
git clone git://github.com/mlin/PhyloCSF.git
cd PhyloCSF
make
Once the software compiles successfully, you can then run ./PhyloCSF
as in the examples shown below.
PhyloCSF should be usable in this way by anyone on the machine/cluster with rx
permissions; they don’t have to set up OPAM or anything else. PhyloCSF can be started from any working directory.
If you want to “install” PhyloCSF somewhere, you must transplant ./PhyloCSF
, ./PhyloCSF.ARCH
(where e.g. ARCH=Linux.x86_64
), and the ./PhyloCSF_Parameters
directory to the desired location.
PhyloCSF leaves it up to you to extract and align the orthologous loci from the genomes you wish to use. Galaxy and PHAST both have some tools that can help you extract and “stitch” blocks from MAF whole-genome multiple alignments. Alternatively, you might cook up your own alignments of mRNA sequences from different species (see the Aldh2 example below).
The input alignments for PhyloCSF need to be in FASTA format. Here’s an example, representing the complete ORF of tal-AA in Drosophila melanogaster, aligned with the orthologous regions from D. ananassae, D. pseudoobscura, D. willistoni, and D. virilis.
PhyloCSF_Examples/tal-AA.fa
>dmel|chr3R:9,639,593-9,639,688| ATGCTGGATCCCACTGGAACATACCGGCGACCACGCGACACGCAGGACTCCCGCCAAAAGAGGCGACAGGACTGCCTGGATCCAACCGGGCAGTAC >dana ATGCTGGATCCCACAGGAACGTACAGGCGACCCCGCGACTCGCAGGACACACGCCAGAAGCGGCGCCAGGATTGCCTGGATCCAACCGGGCAGTAC >dpse ATGCTGGATCCCACAGGAACCTACCGTCGCCCACGCGACACAAAAGACACACGCCAGAAACGGCGTCAGGACTGCCTAGATCCCACCGGGCAGTAC >dwil ATGCTGGATCCAACTGGTACATATCGCCGACCAAGGGATACACAGGACACACGCCATAAACGGC---------GCCTGGATCCAACTGGACAATAT >dvir ATGCTGGATCCAACGGGCACCTATCGGCGGCCGCGTGAGTCGCGTGACACGCGCCACAAGCAGCGGCAGCTGGCGCTGGATCCTACCGGGCAGTAC
Details:
- The alignment may represent anywhere from two up to all of the available species in one of the phylogenies for which parameters are available (see below). For example, we provide parameters for twelve Drosophila species but the above alignment includes only five.
- By default, the reference sequence (first row) must be ungapped, to ensure the reading frame is completely unambiguous. This can, however, be relaxed using command-line options.
- The order of the rows does not matter, as long as the species name in each header line can be matched up with the leaves of the phylogenetic tree in the parameters. In particular, any available species can be used as the reference sequence, as long as it comes first.
- Any gapped codons are marginalized out (standard for phylogenetic likelihood calculations)
- Everything including and after the first | (pipe) in a header line is ignored.
- The sequences can contain newlines but not any other whitespace.
- The program will pretend U’s are T’s.
Although your alignments may provide any subset of the species in these phylogenies, some subsets are more useful than others. For example, providing only human/chimp alignments will not lead to informative results. In fact, it is possible for vacuous or nearly-vacuous alignments (representing a reference sequence, but no appropriate informant species) to get a reasonably high score based solely on sequence composition of the reference sequence. (This arises from the generative nature of PhyloCSF’s underlying phylogenetic models.) Typically, with informative alignments, the substitution evidence is far stronger than the compositional biases; but in a vacuous alignment there is nothing else to go on. Therefore, it is generally advisable to provide PhyloCSF only with well-populated alignments that are likely to yield informative results about the patterns of codon substitution therein.
Regrettably, the algorithms for training the PhyloCSF parameters on different phylogenies are not yet available. We are still working on engineering a version suitable for public consumption. In the meantime, PhyloCSF does contain an experimental, simplified mode called the Omega Test for which it is much easier to provide parameters.
Suppose we have the alignment of a complete open reading frame, beginning at the start codon and ending immediately before the stop codon. This is the type of input one would typically provide to PAML. The above tal-AA.fa is such an example, and it’s included in the the distribution. Assuming we start in the main directory of the distribution, here’s how we’d run PhyloCSF on it:
$ ./PhyloCSF 12flies PhyloCSF_Examples/tal-AA.fa
PhyloCSF_Examples/tal-AA.fa decibans 297.6235
$
PhyloCSF thinks for a few seconds and reports that its model of protein-coding sequence evolution is strongly preferred over its non-coding model — by exactly 297.6235 decibans.
The first command line parameter 12flies
tells PhyloCSF where to find the parameters it needs for the phylogeny of interest, which are stored in the PhyloCSF_Parameters subdirectory of the distribution. The name is actually a prefix to several other files found in that directory (12flies.nh, 12flies_coding.ECM and 12flies_noncoding.ECM). If you were analyzing mammals instead of flies, you would say 29mammals instead of 12flies.
Here’s a six-way alignment of the fifth coding exon of ALDH2, a highly conserved mammalian gene.
PhyloCSF_Examples/ALDH2.exon5.fa
>Human GTATTATGCCGGCTGGGCTGATAAGTACCACGGGAAAACCATCCCCATTGACGGAGACTTCTTCAGCTACACACGCCATGAACCTGTGGGGGTGTGCGGGCAGATCATTCCG >Mouse CTATTACGCTGGCTGGGCTGACAAGTACCATGGGAAAACCATTCCCATCGACGGCGACTTCTTCAGCTATACCCGCCATGAGCCTGTGGGCGTGTGTGGACAGATCATTCCG >Rat CTATTATGCTGGCTGGGCTGACAAGTACCACGGGAAAACCATTCCCATCGATGGCGACTTCTTCAGCTACACCCGCCACGAGCCTGTGGGCGTGTGTGGACAGATCATTCCG >Cow TTACTATGCCGGCTGGGCTGACAAATACCATGGGAAAACCATTCCCATTGACGGGGACTACTTCAGCTACACCCGCCATGAACCTGTGGGAGTGTGTGGGCAGATCATCCCA >Horse TTATTATGCTGGCTGGGCCGATAAGTACCATGGGAAAACCATTCCCATCGATGGGGACTTCTTCAGCTACACCCGCCATGAACCTGTGGGGGTGTGTGGGCAGATCATTCCG >Dog TTACTATGCTGGCTGGGCTGATAAGTACCATGGGAAAACCATTCCTATCGATGGGGACTTTTTCAGTTATACCCGCCATGAACCTGTCGGGGTGTGCGGGCAGATCATTCCA
If we try to run this like the last example:
$ ./PhyloCSF 29mammals PhyloCSF_Examples/ALDH2.exon5.fa
PhyloCSF_Examples/ALDH2.exon5.fa decibans -178.9246
$
It gets a terrible score…what happened?? This exon has phase=1, i.e. it contains an extra nucleotide preceding the first complete codon. PhyloCSF doesn’t know that, though, so it evaluated the whole exon out-of-frame. We could trim this extra position off the alignment, but more conveniently, we can ask PhyloCSF to try all three reading frames and report the best:
$ ./PhyloCSF 29mammals PhyloCSF_Examples/ALDH2.exon5.fa --frames=3
PhyloCSF_Examples/ALDH2.exon5.fa decibans 218.2674 1 111
$
There are two extra columns to the output, indicating the relative positions within the sequence of the region PhyloCSF is reporting. These positions are zero-based, inclusive. In this case, PhyloCSF has figured out that there’s an extra nucleotide at the beginning (position zero), so it’s reporting the score of the region from the second nucleotide (position one) through the end of the alignment. In this mode, it will always go to the end of the alignment; this will change later.
Now, suppose we didn’t even know the strand of transcription for this exon, in addition to not knowing the reading frame. We can ask for a six-frame search:
$ ./PhyloCSF 29mammals PhyloCSF_Examples/ALDH2.exon5.fa --frames=6
PhyloCSF_Examples/ALDH2.exon5.fa decibans 218.2674 1 111 +
$
PhyloCSF tries the alignment (as given) in three frames, then reverse-complements it and tries three frames again, and finally reports the best of all six. In this case the ‘+’ indicates the best score was found in the alignment as given, not its reverse complement; it would be ‘-’ otherwise. The relative positions are always counting from the 5’-most position on the reported strand, so if the reported strand was ‘-’ then they would be counting from the end of the given alignment.
Search for the best ORF within a transcript alignment (boundaries, strand and/or reading frame unknown)
Now suppose we have an alignment of a (possibly incomplete) transcript model including one or more exons, and it may contain lengthy UTRs, or it may even be an entirely non-coding RNA. We can ask PhyloCSF to enumerate all possible ORFs above a given length, and report the best-scoring one.
For this example, we downloaded the complete mRNA of mouse Aldh2 (NM_009656.3) and its human (NM_000690.2) and dog (XM_848535.1) orthologs. Then we aligned them using EBI’s web server for MUSCLE. The results are in PhyloCSF_Examples/Aldh2.mRNA.fa
(not reproduced here due to length). Note that we are using the mouse mRNA as the reference sequence, demonstrating the program’s agnosticism to the species order.
First, let’s see what happens if we run this more or less as in the preceding section, trying the complete alignment in three frames:
$ ./PhyloCSF 29mammals PhyloCSF_Examples/Aldh2.mRNA.fa --frames=3 --removeRefGaps --aa
PhyloCSF_Examples/Aldh2.mRNA.fa decibans 1910.4972 1 2206 FSSPPYLHRCEP*APATLLGAHT...
$
We added two options to run it like this, --removeRefGaps
to automatically remove alignment columns that are gapped in the reference (mouse) sequence, and --aa
, which outputs the amino acid translation of the reported region (in the reference sequence). The translation has been truncated here as it is quite lengthy.
Note that although the alignment gets a high score overall, it contains some stop codons (the asterisks in the translation). This is because the mRNA includes UTRs. It doesn’t make a great deal of sense to include them in the analysis; if the UTRs were very lengthy, they could wash out the ORF and even lead to a negative overall score.
We can ask PhyloCSF to examine all start-to-stop ORFs within the transcript, and report the best:
$ ./PhyloCSF 29mammals PhyloCSF_Examples/Aldh2.mRNA.fa --orf=ATGStop --frames=3 --removeRefGaps --aa
PhyloCSF_Examples/Aldh2.mRNA.fa decibans 2013.9264 343 1899 MLRAALTTVRRGPRLSRLLSAAA...
$
Note that the score improved by excluding the UTRs, there are no stop codons in the protein translation, and in fact PhyloCSF found the Aldh2 protein sequence exactly.
Furthermore, if we didn’t even known the strand of transcription for the alignment, we could use --frames=6
.
There are several possible settings for the --orf
option, which enumerate possible ORFs in slightly different ways. The following schematic illustrates these settings, where the red bars indicate stop codons and the green bars indicate start codons in the reference sequence:
This assumes that all the illustrated regions are above a minimum length, which can be specified with the --minCodons
option. Also, it shows the ORFs only for one reading frame; it’s usually desirable to use an ORF search in three or six frames, in which case PhyloCSF will apply the requested strategy to each of the reading frames.
$ ./PhyloCSF --help usage: PhyloCSF.Linux.x86_64 parameter_set [file1 file2 ...] input will be read from stdin if no filenames are given.
options:
-h, --help show this help message and exit --strategy=mle|fixed|omega evaluation strategy (default mle)
input interpretation:
--files input list(s) of alignment filenames instead of individual alignment(s) --removeRefGaps automatically remove any alignment columns that are gapped in the reference sequence (nucleotide columns are removed individually; be careful about reading frame). By default, it is an error for the reference sequence to contain gaps --species=Species1,Species2,... hint that only this subset of species will be used in any of the alignments; this does not change the calculation mathematically, but can speed it up
searching mulitple reading frames and ORFs:
-f1|3|6, --frames=1|3|6 how many reading frames to search (default 1) --orf=AsIs|ATGStop|StopStop|StopStop3|ToFirstStop|FromLastStop|ToOrFromStop search for ORFs (default AsIs) --minCodons=INT minimum ORF length for searching over ORFs (default 25 codons) --allScores report scores of all regions evaluated, not just the max
output control:
--bls include alignment branch length score (BLS) for the reported region in output --ancComp include ancestral sequence composition score in output --dna include DNA sequence in output, the part of the reference (first) sequence whose score is reported --aa include amino acid translation in output --debug print extra information about parameters and errors
PhyloCSF outputs a score, positive if the alignment is likely to represent a conserved coding region, and negative otherwise. This score has a precise theoretical interpretation: it quantifies how much more probable the alignment is under PhyloCSF’s model of protein-coding sequence evolution than under the non-coding model. As indicated in the output, this likelihood ratio is expressed in units of decibans. A score of 10 decibans means the coding model is 10:1 more likely than the non-coding model; 20 decibans, 100:1; 30 decibans, 1000:1; and so on. A score of -20 means the non-coding model is 100:1 more likely, and a score of 0 indicates the two models are equally likely.
This theoretical interpretation is nice to have, at least compared to a purely ad hoc score. However, if the likelihood ratio is, say, 1000:1 (30 decibans), we shouldn’t necessarily take those exact odds with our own money at stake, for the underlying models of sequence evolution make many simplifying assumptions that are not fully satisfied in reality. There are also more general caveats, such as:
- An alignment with a good score could represent an ancestral coding gene that recently “died”, and is now a pseudogene in the species of interest.
- An alignment with a poor score could be protein-coding in the species of interest, but not conserved in the others.
- The input alignment may not represent an appropriate orthologous region from some or all of the species. For example, a duplicated pseudogene could be aligned to its parent gene, or more generally, a complex orthology/paralogy situation may not be correctly resolved.
Therefore: please do not claim that anything “is” or “is not” protein-coding based solely on PhyloCSF. PhyloCSF provides one line of evidence either for or against the hypothesis that the input alignment represents a coding region, by examining evolutionary signatures. If you wish to make strong statements about whether you have found coding or non-coding regions, you should complement PhyloCSF with additional lines of evidence, such as homology searches to known protein sequence databases (e.g. BLASTX on nr or HMMER on Pfam).
It can also be helpful to compare the scores of novel regions with positive and negative control sets, e.g. cross-species alignments of orthologous mRNAs of known genes (positive) and long non-coding RNAs or even random non-coding regions (negative), generated using the same alignment pipeline.
Lastly, it is always, always, always a good idea to manually inspect at least some of the alignments for which PhyloCSF reports an interesting score, to ensure you are finding the types of examples you want.
to be written
There are a few ways to use PhyloCSF to process a batch of alignments.
- Run it individually on each file, as above. This is easy but there is some inefficiency as the software must read the parameters for, and instantiate, large phylogenetic models each time you start it. If you do not specify a filename on the command line, PhyloCSF will expect an alignment piped through standard input.
- Provide several alignment filenames on the command line. PhyloCSF will proceed through each in turn, and the output is a reasonable tab-delimited format that includes the filename processed on each line.
- If you use the
--files
option, then each filename supplied on the command line is assumed to refer to a file which itself contains a list of individual alignment files to process, one filename per line. - If you use the
--files
option but do not supply any filenames on the command line, PhyloCSF will expect a list of alignment filenames piped through standard input, again one filename per line. In this mode PhyloCSF can be used on-line by a driver script: it reads an alignment filename from standard in, inputs and processes the alignment, reports the results (on standard output), and then input the next filename.
PhyloCSF will stop everything if it encounters a basic problem with the way an alignment is formatted (file not found, unknown species, invalid characters, etc.), as this suggests you need to make adjustments to your pipeline. For other errors, such as numerical convergence problems or no ORFs found in ORF search mode, PhyloCSF will report the problem but still continue on to the next alignment.