Sunday, March 31, 2019

What is the right way to do phylogenetic reconstruction?



“Phylogenies represent our attempts to reconstruct the evolutionary history of life” (Huelsenbeck and Ronquist, 2001). As computers have gained speed and capacity, the number of methods for inferring phylogenetic trees has grown, both distance-based and character-based (Rizzo and Rouchka, 2007). Maximum Parsimony (MP) (Fitch, 1971), Maximum Likelihood (ML) (Felsenstein, 1981) and Bayesian Inference (BI) are three character-based methods in wide use today, but MP is the only one that does not estimate branch lengths, so it does not specify the amount of change (Egan and Crandall, 2006). ML, on the other hand, estimates branch lengths but assumes that the model used is correct; if the model does not accurately reflect the data set, the method can be inconsistent, which causes problems even though the method is designed to be robust. Further disadvantages are the extensive computation required and new evidence suggesting multiple maximum likelihood points for a given phylogenetic tree (Rizzo and Rouchka, 2007).

Now we are going to focus on Bayesian inference (BI), a method whose advanced techniques make it possible to analyze more than 350 sequences successfully with moderate computational effort, and to implement evolutionary models that are more complex and realistic than before (Huelsenbeck et al., 2001; Ronquist and Huelsenbeck, 2003). For this reason, in this post I am going to evaluate the effect of changes in matrix size and in the number of taxa used in a phylogenetic analysis with BI. My hypothesis is that the resolution of the topologies is more affected by the size of the matrix than by the number of taxa.

To do this, I simulated 3 topologies, which I used to simulate 54 matrices in total under the HKY model, with three different numbers of characters (100, 500 and 2500) and two numbers of taxa (12 and 48); each type of data had 3 replicates. Then, each matrix was analyzed using Bayesian inference, and I counted the number of resolved nodes in each resulting tree to compare them. The scripts with the methodology, software specifications and parameters that I used are detailed in the following link
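As a quick check of the arithmetic, the design above can be laid out as a grid. This is only a sketch: it assumes the four factors are fully crossed, and the variable and function names are illustrative rather than taken from my scripts.

```python
from itertools import product

# Factor levels taken from the text; the names are illustrative only.
N_TOPOLOGIES = 3
TAXON_NUMBERS = (12, 48)
CHARACTER_SIZES = (100, 500, 2500)
N_REPLICATES = 3


def simulation_design():
    """Return one record per simulated matrix, assuming fully crossed factors."""
    return [
        {"topology": t, "n_taxa": n, "n_chars": c, "replicate": r}
        for t, n, c, r in product(
            range(1, N_TOPOLOGIES + 1),
            TAXON_NUMBERS,
            CHARACTER_SIZES,
            range(1, N_REPLICATES + 1),
        )
    ]


design = simulation_design()
print(len(design))  # 3 * 2 * 3 * 3 = 54 matrices
```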

None of the topologies was completely resolved; nevertheless, the best resolution was obtained for topologies whose matrices had 2500 characters. The number of resolved nodes increased with the number of characters, but the change was most evident in the analyses with 48 taxa when the matrix size went from 100 to 500 characters. On the other hand, the change in resolution for topologies with 12 taxa was minimal across the three numbers of characters (Figure 1).


Fig. 1. Mean number of resolved nodes in the 6 types of topologies. The x axis shows the different matrix sizes and numbers of taxa.
In this case, only for matrices with fewer than 500 characters did matrix size have more impact on the resolution of the topology than the number of terminals. This indicates that, in phylogenetic analysis with Bayesian inference, increasing the number of characters helps to increase precision and the number of resolved nodes, whereas increasing the number of taxa has a lower impact. In other words, we can say that the estimator is consistent.
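To make "number of resolved nodes" concrete, here is a minimal sketch of one way to count them from a consensus tree in Newick format. It is not the script I used. It assumes simple unquoted tip labels, non-nested bracketed comments, and a tree written with a polytomy at its base (as unrooted consensus trees usually are), so that a fully resolved tree with n tips has n - 3 internal clades.

```python
import re
from typing import Tuple


def count_resolved_nodes(newick: str) -> Tuple[int, int]:
    """Return (internal clades excluding the root, number of tips)."""
    # Drop bracketed comments such as [&prob=0.98]; assumes they are not nested.
    text = re.sub(r"\[[^\]]*\]", "", newick)
    # Every "(" opens one clade; the outermost one is the whole tree.
    n_internal = max(text.count("(") - 1, 0)
    # Tip labels start right after "(" or ","; labels after ")" are node labels.
    n_tips = len(re.findall(r"[(,]\s*[^(),:;\s]", text))
    return n_internal, n_tips


def resolved_fraction(newick: str) -> float:
    """Share of the possible internal clades that are actually resolved."""
    resolved, n_tips = count_resolved_nodes(newick)
    max_resolved = n_tips - 3  # assumption: basal polytomy, unrooted tree
    return resolved / max_resolved if max_resolved > 0 else 0.0


print(count_resolved_nodes("((A,B),(C,D),E);"))  # (2, 5): fully resolved
print(count_resolved_nodes("(A,B,C,D,E);"))      # (0, 5): a star tree
```

If the consensus trees were written as rooted binary trees instead, the maximum would be n - 2 and `resolved_fraction` would need to be adjusted.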


So if I can choose a method for phylogenetic reconstruction, I would prefer Bayesian inference (BI), because it has several advantages. One is the ability to incorporate prior knowledge: the Bayesian framework offers a more direct expression of uncertainty, including complete ignorance, which is suitable for building cumulative knowledge. It also has computational advantages: it can handle highly complex models efficiently, and it does not assume or require normal distributions underlying the parameters of a model (Schoot et al., 2014; Huelsenbeck and Ronquist, 2001). Of course, the success of a Bayesian analysis depends on the objectivity and precision with which the priors are assigned; however, assuming that this is done correctly (as it should be), the advantages of a Bayesian analysis make it the best option for obtaining a well-made phylogeny.


References
-Egan, A.N. and Crandall, K.A. (2006). Theory of Phylogenetic Estimation, in Evolutionary Genetics: Concepts and Case Studies. 1 ed. Oxford University Press, 426-436.

-Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17(6), 368-376.

-Fitch, W. (1971). Toward defining the course of evolution: minimum change for a specified tree topology. Syst. Zool., 20, 406-416.

-Huelsenbeck, J.P., Ronquist, F., Nielsen, R. and Bollback, J.P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294, 2310-2314.

-Huelsenbeck, J.P. and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8), 754-755.

-Rizzo, J. and Rouchka, E.C. (2007). Review of Phylogenetic Tree Construction. University of Louisville Bioinformatics Laboratory Technical Report Series, 2-7.

-Ronquist, F. and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12), 1572-1574.

-Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J.B., Neyer, F.J. and Aken, M.A. (2014). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. Child Dev, 85, 842-860.




Some methods to do phylogeny and why choose them


When we perform a phylogenetic analysis, we can find many papers arguing for (or against) the use of particular phylogenetic software and methods. Here I am going to discuss three of them: parsimony, maximum likelihood (ML) and Bayesian inference (BI), in order to evaluate empirically which of them is the most appropriate for phylogenetic analysis, based on their statistical consistency as measured by distances between trees.

One argument used to justify the choice of a phylogenetic method is statistical consistency, a property debated between advocates of parsimony and ML (Brower, 2017). Felsenstein (1978) argued that ML is consistent, and that parsimony is consistent too when the probabilities of evolutionary change are small (in which case parsimony and ML behave the same), but not when the probability of change leading to some tips is much higher than for the others, a problem that persists as more data are added (Felsenstein, 1973). Moreover, consistency in BI concerns the posterior probability, and whether it holds at every point of the parameter space depends on the priors (Goshal, 1998).

To test these assumptions, I simulated two matrices of five taxa, each with 1000, 2000 and 4000 characters, using SeqGen (Rambaut & Grassly, 1997) in R v.3.5.2 under the JC model, with two different rooted trees (one for each matrix) whose branch lengths were all fixed to 0.5. The parsimony analyses were run in TNT v1.1 (Goloboff et al., 2008) with 30 replicates of SPR search; for ML I used PhyML v3.0 (Guindon & Gascuel, 2003) with the JC model of evolution; BI was performed in MrBayes v3.2.6 (Ronquist et al., 2012) under the JC model, with all other parameters left at their defaults. Topological distances were calculated as Robinson-Foulds (RF) distances (Robinson & Foulds, 1981), and branch score distances were calculated with the algorithm of Kuhner and Felsenstein (1994), both in R.
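For readers who have not met the RF distance before, the idea can be sketched in a few lines. This is an illustration in Python, not the R code I ran: trees are written as hypothetical nested tuples of tip names, they are compared as unrooted trees, and the distance is the number of non-trivial bipartitions found in one tree but not in the other.

```python
def tip_set(tree):
    """Collect the tip labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tip_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree, viewed as unrooted."""
    all_tips = tip_set(tree)
    reference = min(all_tips)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Store each split by the side that lacks the reference tip.
        side = all_tips - clade if reference in clade else clade
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if tip_set(tree_a) != tip_set(tree_b):
        raise ValueError("trees must share the same tips")
    return len(splits(tree_a) ^ splits(tree_b))


# Two made-up five-taxon trees that differ in one grouping.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree_1, tree_1))  # 0
print(robinson_foulds(tree_1, tree_2))  # 2
```

A distance of 0 therefore means that the two trees contain exactly the same groupings, regardless of their branch lengths.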

The distance between the resulting trees was 0 for trees from the same DNA matrix, since they recovered the same relationships (Fig. 1). Moreover, the branch-length distances between ML and BI trees showed a pattern: the trees built from fewer characters (ML phylogenies) presented the largest distances when compared with trees built from more characters (Fig. 2). Also, the ML tree of the second matrix with 4000 characters (ML2_4000 in Fig. 2) showed short distances to the other trees (see also Distances branch lengths in the GitHub link), in contrast mainly with ML1_1000, which had the longest distances. For BI trees, the distances show that the more data are added, the closer the branch lengths become, corroborating the precision of this method (Wiens & Moen, 2008), but not its accuracy.
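The branch-length distances discussed here can be illustrated with a small sketch of the branch score distance. It is written in Python rather than R and is not my original code. Each tree is given as a hypothetical dictionary that maps a split (the set of tips on one side of a branch, written the same way in both trees) to that branch's length; a branch missing from one tree counts as length 0. The square root is taken here, as tree-distance implementations commonly do; without it the value is the sum of squared differences.

```python
import math


def branch_score_distance(lengths_a, lengths_b):
    """Square root of the summed squared branch-length differences."""
    all_splits = set(lengths_a) | set(lengths_b)
    total = sum(
        (lengths_a.get(s, 0.0) - lengths_b.get(s, 0.0)) ** 2
        for s in all_splits
    )
    return math.sqrt(total)


# Made-up four-taxon example: same topology, two branches differ in length.
tree_a = {
    frozenset("BCD"): 0.5,  # branch leading to tip A, seen from the other side
    frozenset("B"): 0.5,
    frozenset("C"): 0.5,
    frozenset("D"): 0.5,
    frozenset("CD"): 0.5,   # the only internal branch
}
tree_b = dict(tree_a)
tree_b[frozenset("B")] = 0.9
tree_b[frozenset("CD")] = 0.2

# sqrt(0.4**2 + 0.3**2), which is approximately 0.5
print(branch_score_distance(tree_a, tree_b))
```

Unlike the RF distance, this measure can be greater than 0 for two trees with identical relationships, which is why it is useful when all methods recover the same topology.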

Distances between trees are not the best way to estimate consistency in BI, because they do not compare posterior probabilities and because all methods recovered the same relationships between taxa. However, that test will be considered in a later work.

In this way, BI, ML and parsimony recover the same relationships between taxa no matter how many characters are added, but if we rely on one characteristic of consistency (precision), BI behaves better than the other two methods.



Figure 1. Resulting trees: A) BI, first matrix; B) BI, second matrix; C) ML, first matrix; D) ML, second matrix; E) Parsimony, first matrix; and F) Parsimony, second matrix.

 

Figure 2. Sum of branch-length distances for each method and number of characters.
 

References

Brower, A. (2017). Statistical consistency and phylogenetic inference: a brief review. Cladistics, 0: 1-6.

Felsenstein, J. (1973). Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25: 471-492.

Felsenstein, J. (1978). Cases in which Parsimony or Compatibility Methods Will be Positively Misleading. Systematic Zoology, 27(4): 401-410.

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5): 774-786.

Goshal, S. (1998). A review of consistency and convergence of posterior distribution. Indian Statistical Institute. Calcutta, India.

Guindon, S., Gascuel, O. (2003). A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Systematic Biology, 52(5): 696-704.

Kuhner, M. K. and Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3): 459-468.

Rambaut, A. and Grassly, N. C. (1997). Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees. Computer Applications in the Biosciences, 13(3): 235-238.

Robinson, D. and Foulds, L. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1): 131-147.

Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M., Huelsenbeck, J. (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology, 61(3): 539-542.

Wiens, J., Moen, D. (2008). Missing data and the accuracy of Bayesian phylogenetics. Journal of Systematics and Evolution, 46(3): 307-314.


 Script and data: https://github.com/DanielaP10/Post-2