
Synteny matrix representation with maximum likelihood pipeline


zhaotao1987/Syn-MRL


Pipeline for whole-genome microsynteny-based phylogenetic inference

Our synteny-based phylogenetic reconstruction approach has four main steps, in order: phylogenomic synteny network construction, network clustering, matrix representation, and maximum-likelihood estimation. We call the approach ‘Syn-MRL’ for short.

Synteny network construction consists of two main steps. First, all-vs-all reciprocal comparisons of the annotated proteins of the whole genomes were performed using DIAMOND. Second, MCScanX was used for pairwise synteny block detection. Parameter settings for MCScanX have been tested and compared before; here we adopt ‘b5s5m25’ (b: number of top homologous pairs, s: minimum number of matched syntenic anchors, m: maximum number of gene gaps), which various studies have found appropriate for the evolutionary distances among angiosperm genomes. To avoid large numbers of local collinear gene pairs caused by tandem arrays, if consecutive homologs (up to five genes apart) share a common gene, the homologs are collapsed to one representative pair (the one with the smallest E-value). Further details regarding phylogenomic synteny network construction can be found in a tutorial available in the associated GitHub repository (https://github.com/zhaotao1987/SynNet-Pipeline).

Each pairwise synteny block contributes pairs of connected nodes (syntenic genes), and all identified pairwise synteny blocks together form a comprehensive synteny network with millions of nodes and edges. In this network, nodes are genes (from the synteny blocks) and edges connect syntenic genes. For our work, the entire synteny network summarizes information from 7,435,502 pairwise syntenic blocks and contains 3,098,333 nodes (genes) and 94,980,088 edges (syntenic connections).

The entire synteny network (database) is then clustered for further analysis. We used the Infomap algorithm, which detects clusters within the map equation framework (https://github.com/mapequation/infomap), to identify synteny clusters. We have discussed before why Infomap is more appropriate for clustering phylogenomic synteny networks. We used the two-level partitioning mode with ten trials (--clu -N 10 --map -2). The network was treated as undirected and unweighted.
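As an illustration of how an undirected, unweighted network could be prepared for clustering, the sketch below collapses syntenic gene pairs into unique edges and numbers the genes with integer ids for a plain "source target" link list. It is a minimal sketch and not the pipeline's own code: the function names, the example gene names, and the in-memory pair format are all hypothetical, and the actual parsing of MCScanX output is covered by the SynNet-Pipeline tutorial.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def build_undirected_edges(anchor_pairs: Iterable[Edge]) -> List[Edge]:
    """Collapse syntenic gene pairs into sorted, unique undirected edges."""
    edges: Set[Edge] = set()
    for gene_a, gene_b in anchor_pairs:
        if gene_a == gene_b:
            continue  # ignore self-loops
        # Store each edge in a canonical order so (a, b) and (b, a) coincide.
        edge = (gene_a, gene_b) if gene_a < gene_b else (gene_b, gene_a)
        edges.add(edge)
    return sorted(edges)


def to_link_list(edges: Iterable[Edge]) -> Tuple[Dict[str, int], List[str]]:
    """Assign integer ids (starting at 1) and emit 'source target' lines."""
    node_ids: Dict[str, int] = {}
    lines: List[str] = []
    for gene_a, gene_b in edges:
        for gene in (gene_a, gene_b):
            if gene not in node_ids:
                node_ids[gene] = len(node_ids) + 1
        lines.append(f"{node_ids[gene_a]} {node_ids[gene_b]}")
    return node_ids, lines


if __name__ == "__main__":
    # Hypothetical gene names; real input would come from the synteny blocks.
    pairs = [
        ("ath_G1", "osa_G7"),
        ("osa_G7", "ath_G1"),  # same edge, reversed
        ("ath_G1", "vvi_G3"),
        ("vvi_G3", "osa_G7"),
    ]
    edges = build_undirected_edges(pairs)
    node_ids, lines = to_link_list(edges)
    print(len(node_ids), "nodes,", len(edges), "edges")
    print("\n".join(lines))
```

Keeping the gene-to-id mapping makes it possible to translate cluster assignments back to gene names after clustering.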
The resulting synteny clusters vary in size and composition, depending on whether synteny is well conserved or rather lineage- or species-specific. A typical synteny cluster comprises syntenic genes shared by groups of species, which precisely represent the phylogenetic relatedness of genomic architecture among species. Here, we partitioned the entire synteny network into 137,833 synteny clusters.
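To inspect how cluster sizes vary, the node-to-cluster assignments can be grouped and tallied. The sketch below assumes a whitespace-separated listing with the node id in the first column, the cluster (module) id in the second, and comment lines starting with '#'. That layout is an assumption for illustration and should be checked against the output of the Infomap version actually used; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def parse_assignments(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Group node ids by cluster id from 'node cluster ...' lines."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        node_id, cluster_id = int(fields[0]), int(fields[1])
        clusters[cluster_id].append(node_id)
    return dict(clusters)


def size_distribution(clusters: Dict[int, List[int]]) -> Dict[int, int]:
    """Map cluster size -> number of clusters of that size."""
    return dict(Counter(len(members) for members in clusters.values()))


if __name__ == "__main__":
    # Toy assignment listing (assumed layout, not real output).
    example = [
        "# node cluster flow",
        "1 1 0.2",
        "2 1 0.2",
        "3 1 0.2",
        "4 2 0.2",
        "5 2 0.2",
    ]
    clusters = parse_assignments(example)
    print(clusters)
    print(size_distribution(clusters))
```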

A cluster's phylogenomic profile describes its composition as the number of nodes from each species. We summarize the information residing in all synteny clusters as a data matrix for tree inference. The phylogenomic profiles of all clusters form a large data matrix in which rows represent species and columns represent clusters. The matrix was then reduced to a binary presence-absence matrix to obtain the final synteny matrix. Tree estimation was based on maximum likelihood as implemented in IQ-TREE (version 1.7-beta7) (Nguyen et al., 2014), using the MK+R+FO model, where “M” stands for “Markov” and “k” refers to the number of states observed (in our case, k = 2). The +R (FreeRate) model was used to account for site heterogeneity; it typically fits large datasets better than the Gamma model. State frequencies were optimized by maximum likelihood (using ‘+FO’). We performed the SH-like approximate likelihood ratio test (SH-aLRT) with 1000 replicates and generated 1000 ultrafast bootstrap (UFBoot) replicates (-alrt 1000 -bb 1000).
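The matrix representation step can be sketched as follows: count nodes per species for each cluster, reduce the counts to presence (1) or absence (0), and write the result as a PHYLIP-style alignment with one row per species. This is a minimal sketch under stated assumptions, not the repository's implementation: the function names, the gene-to-species lookup, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def phylogenomic_profiles(
    clusters: Dict[int, List[str]],
    gene_to_species: Dict[str, str],
    species: List[str],
) -> Dict[str, List[int]]:
    """Rows are species, columns are clusters (sorted by id), cells are node counts."""
    matrix: Dict[str, List[int]] = {sp: [] for sp in species}
    for cluster_id in sorted(clusters):
        counts = Counter(gene_to_species[g] for g in clusters[cluster_id])
        for sp in species:
            matrix[sp].append(counts.get(sp, 0))
    return matrix


def binarize(matrix: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Reduce node counts to presence (1) / absence (0)."""
    return {sp: [1 if c > 0 else 0 for c in row] for sp, row in matrix.items()}


def to_phylip(binary: Dict[str, List[int]]) -> str:
    """Serialize the binary matrix as a PHYLIP-style text block."""
    n_taxa = len(binary)
    n_sites = len(next(iter(binary.values()))) if binary else 0
    lines = [f"{n_taxa} {n_sites}"]
    for sp, row in binary.items():
        lines.append(f"{sp} {''.join(str(v) for v in row)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data: three clusters over three hypothetical species codes.
    clusters = {1: ["a1", "a2", "o1"], 2: ["o2", "v1"], 3: ["a3", "o3", "v2"]}
    gene_to_species = {
        "a1": "ath", "a2": "ath", "a3": "ath",
        "o1": "osa", "o2": "osa", "o3": "osa",
        "v1": "vvi", "v2": "vvi",
    }
    profiles = phylogenomic_profiles(clusters, gene_to_species, ["ath", "osa", "vvi"])
    print(to_phylip(binarize(profiles)), end="")
```

With such a file (here hypothetically named synteny_matrix.phy), the options quoted above might be combined as `iqtree -s synteny_matrix.phy -m MK+R+FO -alrt 1000 -bb 1000`. The full command line is not given in this text, so treat that invocation as an assumption.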

 

Microsynteny-based vs sequence-alignment based phylogenetic reconstruction

