Sunday, March 31, 2019

What is the right way to do phylogenetic reconstruction?



“Phylogenies represent our attempts to reconstruct the evolutionary history of life” (Huelsenbeck and Ronquist, 2001). As computers have gained speed and capacity, more methods to infer phylogenetic trees have become available, some based on distances and others based on characters (Rizzo and Rouchka, 2007). Maximum Parsimony (MP) (Fitch, 1971), Maximum Likelihood (ML) (Felsenstein, 1981) and Bayesian Inference (IB) are three character-based methods in wide use today, but MP is the only one that does not estimate branch lengths, so it does not specify the amount of change (Egan and Crandall, 2006). ML, on the other hand, estimates branch lengths but assumes that the model used is accurate; if the model does not accurately reflect the data set, the method becomes inconsistent, which causes problems even though the method is designed to be robust. Further disadvantages are the extensive computation required and new evidence suggesting multiple maximum likelihood points for a given phylogenetic tree (Rizzo and Rouchka, 2007).

Now we are going to focus on IB, a method whose advanced techniques make it possible to analyze more than 350 sequences successfully with moderate computational effort and to implement evolutionary models that are more complex and realistic than before (Huelsenbeck et al., 2001; Ronquist and Huelsenbeck, 2003). For this reason, in this post I am going to evaluate the effect of changes in the size of the matrices and the number of taxa used in a phylogenetic analysis with IB. My hypothesis is that the resolution of the topologies is affected more by the size of the matrix than by the number of taxa.

To do this, I simulated 3 topologies, which I used to simulate 54 matrices in total under the HKY model, with three different numbers of characters (100, 500 and 2500) and two numbers of taxa (12 and 48); each type of data set had 3 replicates. Then each matrix was analyzed using Bayesian inference, and I counted the number of resolved nodes in each resulting topology to compare them. The scripts with the methodology, software specifications and parameters that I used are detailed in the following link
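The design described above can be sketched as a simple enumeration. This is only an illustration, not the script used for the post: it assumes that the 54 matrices come from crossing 3 topologies, 2 numbers of taxa, 3 numbers of characters and 3 replicates, and the names `build_design`, `topology` and `replicate` are hypothetical.

```python
from itertools import product

# Values taken from the text; how they are crossed is an assumption.
N_TOPOLOGIES = 3
TAXA = (12, 48)
CHAR_SIZES = (100, 500, 2500)
N_REPLICATES = 3


def build_design():
    """Return one dict per simulated matrix in the assumed full-factorial design."""
    return [
        {"topology": t, "n_taxa": n, "n_chars": c, "replicate": r}
        for t, n, c, r in product(
            range(1, N_TOPOLOGIES + 1),
            TAXA,
            CHAR_SIZES,
            range(1, N_REPLICATES + 1),
        )
    ]


design = build_design()
print(len(design))  # total number of matrices in the design
print(design[0])    # first combination, for inspection
```

Under this assumption each combination of number of taxa and number of characters is represented by 9 matrices (3 topologies times 3 replicates), which gives the 6 groups compared in the results.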

None of the topologies was completely resolved; nevertheless, the best resolution was found in topologies whose matrices had 2500 characters. The number of resolved nodes increased as the number of characters increased, and the change was most evident in the analyses with 48 taxa when the size of the matrix went from 100 to 500 characters. In contrast, the change in resolution of topologies with 12 taxa was minimal across the three numbers of characters (Figure 1).
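One way to count resolved nodes like those reported above is to count the internal groupings in a consensus tree. The sketch below is a minimal illustration and not necessarily the procedure used for this post. It assumes a plain Newick string without comments or quoted labels, written as an unrooted tree with a multifurcation at its base, so that a fully resolved tree with n taxa has n - 3 internal groupings. The function names are hypothetical.

```python
def count_taxa(newick: str) -> int:
    """Number of tips: a tree with k leaves contains k - 1 commas."""
    return newick.count(",") + 1


def count_resolved_nodes(newick: str) -> int:
    """Internal groupings, excluding the outermost (basal) pair of parentheses."""
    opens = newick.count("(")
    if opens == 0 or opens != newick.count(")"):
        raise ValueError("not a balanced Newick string")
    return opens - 1


def resolution(newick: str) -> float:
    """Fraction of the n - 3 possible internal groupings that are present."""
    max_nodes = count_taxa(newick) - 3
    if max_nodes <= 0:
        raise ValueError("need at least 4 taxa")
    return count_resolved_nodes(newick) / max_nodes


fully_resolved = "((A,B),(C,D),E);"
partly_resolved = "((A,B),C,D,E);"
print(count_resolved_nodes(fully_resolved), resolution(fully_resolved))
print(count_resolved_nodes(partly_resolved), resolution(partly_resolved))
```

Dividing by n - 3 makes trees with 12 and 48 taxa comparable, since a 48-taxon tree can have many more resolved nodes than a 12-taxon tree simply because it is larger.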


Fig. 1. Mean number of resolved nodes in the 6 types of topologies. The x axis shows the different sizes of matrices and numbers of taxa.

In this case, the size of the matrix had more impact on the resolution of the topology than the number of terminals only for sizes below 500 characters. This indicates that, in a phylogenetic analysis with Bayesian inference, increasing the number of characters helps to increase the precision and the number of resolved nodes, whereas increasing the number of taxa has a lower impact. In other words, we can say that the estimator is consistent.


So if I can choose a method for phylogenetic reconstruction, I prefer IB because it has several advantages. One is the ability to incorporate prior knowledge: the Bayesian framework offers a more direct expression of uncertainty, including complete ignorance, which makes it suitable for building cumulative knowledge. It also has computational advantages: it can handle highly complex models efficiently, and it does not assume or require normal distributions underlying the parameters of a model (Schoot et al., 2014; Huelsenbeck and Ronquist, 2001). Of course, the success of a Bayesian analysis depends on the objectivity and precision with which the priors are assigned. However, assuming that this is done correctly (as it should be), the advantages of a Bayesian analysis make it the best option for obtaining a well-made phylogeny.
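The role of the priors can be made explicit with Bayes' theorem as it is usually written for phylogenies. This is the standard textbook form, not something taken from the scripts of this post, and the symbols are notation chosen here for illustration: \tau_i is the i-th tree, X is the character matrix and B(s) is the number of possible trees for s taxa.

```latex
\[
  P(\tau_i \mid X) \;=\;
  \frac{P(X \mid \tau_i)\, P(\tau_i)}
       {\sum_{j=1}^{B(s)} P(X \mid \tau_j)\, P(\tau_j)}
\]
```

Here P(\tau_i) is the prior probability of tree i and P(X \mid \tau_i) is its likelihood, which in turn integrates over branch lengths and the parameters of the substitution model (HKY in this post). The sum in the denominator runs over all possible trees and cannot be computed directly for realistic data sets, which is why programs such as MrBayes approximate the posterior with Markov chain Monte Carlo.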


References
-Egan, A.N., and Crandall, K.A. (2006). Theory of Phylogenetic Estimation, in Evolutionary Genetics: Concepts and Case Studies. 1 ed. Oxford University Press, 426-436.

-Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17(6), 368-376.

-Fitch, W. (1971). Toward defining the course of evolution: minimum change for a specified tree topology. Syst. Zool., 20, 406-416.

-Huelsenbeck, J.P., Ronquist, F., Nielsen, R. and Bollback, J.P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294, 2310-2314.

-Huelsenbeck, J.P. and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8), 754-755.

-Rizzo, J. and Rouchka, E.C. (2007). Review of Phylogenetic Tree Construction. University of Louisville Bioinformatics Laboratory Technical Report Series, 2-7.

-Ronquist, F. and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12), 1572-1574.

-Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J. and Aken, M. A. (2014). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. Child Dev, 85, 842-860.



