Sunday, 31 March 2019

Some methods for inferring phylogenies and why to choose them


When we perform a phylogenetic analysis, we find many papers arguing for (or against) the use of particular phylogenetic software and methods. Here I discuss three of them: parsimony, maximum likelihood (ML) and Bayesian inference (BI). The aim is to evaluate empirically which of them is the most appropriate for phylogenetic analysis, based on its statistical consistency as measured by distances between trees.

One argument used to justify the choice of a phylogenetic method is statistical consistency, a property debated between parsimony and ML advocates (Brower, 2017). Felsenstein (1978) stated that ML is consistent and that parsimony is too when the probabilities of evolutionary change are small (in that case parsimony and ML behave the same), but not when the probability of one set of character states at the tips is higher than that of the others as more data are added (Felsenstein, 1973). In BI, consistency concerns the posterior probability: whether it holds at every point of the parameter space depends on the priors (Goshal, 1998).

To test these assumptions, I simulated DNA matrices of five taxa with 1000, 2000 and 4000 characters using Seq-Gen (Rambaut & Grassly, 1997) in R v3.5.2, under the JC model and on two different rooted trees (one for each matrix, hereafter the first and second matrix) with all branch lengths fixed to 0.5. The parsimony analyses were run in TNT v1.1 (Goloboff et al., 2008) with 30 replicates of SPR search. For ML I used PhyML v3.0 (Guindon & Gascuel, 2003) with the JC substitution model. BI was performed in MrBayes v3.2.6 (Ronquist et al., 2012) under the JC model, with all other parameters left at their defaults. Topological distances were calculated as Robinson-Foulds (RF) distances (Robinson & Foulds, 1981), and branch score distances were calculated with the algorithm of Kuhner and Felsenstein (1994), both in R.
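
To give a feel for what a branch length of 0.5 means under the JC model, here is a minimal Python sketch. It is not the Seq-Gen code used for this post; the function names, the seed and the sequence length are illustrative assumptions. Under JC, a site differs from its ancestor after a branch of length t (expected substitutions per site) with probability 3/4 (1 - exp(-4t/3)), which is about 0.365 for t = 0.5.

```python
import math
import random

BASES = "ACGT"

def jc_change_probability(branch_length):
    """Probability that a site differs from its ancestor after
    `branch_length` expected substitutions per site under JC."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

def evolve(sequence, branch_length, rng):
    """Evolve a sequence along a single branch under JC."""
    p_change = jc_change_probability(branch_length)
    evolved = []
    for base in sequence:
        if rng.random() < p_change:
            # Under JC the three other bases are equally likely.
            evolved.append(rng.choice([b for b in BASES if b != base]))
        else:
            evolved.append(base)
    return "".join(evolved)

rng = random.Random(2019)  # arbitrary seed, only for reproducibility
root = "".join(rng.choice(BASES) for _ in range(1000))
tip = evolve(root, 0.5, rng)

# Observed proportion of differing sites between ancestor and descendant.
observed = sum(a != b for a, b in zip(root, tip)) / len(root)
print(jc_change_probability(0.5), observed)
```

With more characters the observed proportion should settle closer to the expected one, which is the intuition behind adding data in the simulations.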

The topological distance was 0 among trees from the same DNA matrix, that is, all methods recovered the same relationships (Fig. 1). Branch length distances among ML and BI trees showed a pattern: the trees built from fewer characters (ML phylogenies) had the largest distances compared with trees built from more characters (Fig. 2). The ML tree of the second matrix with 4000 characters (ML2_4000 in Fig. 2) showed shorter distances to the other trees than the rest did (see also "Distances branch lengths" in the GitHub link), in contrast mainly with ML1_1000, which had the longest distances. For BI trees, the distances show that the more data are added, the closer the branch lengths become, which corroborates the precision of this method (Wiens & Moen, 2008), but not its accuracy.
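
The two kinds of distance compared above can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code behind the figures: the nested-tuple tree format, the function names and the toy trees are hypothetical, trees are compared as unrooted, and the branch score distance is taken as the square root of the summed squared differences in branch length (the squared form is also in use).

```python
import math

def splits(tree, taxa):
    """Map each bipartition of the unrooted tree to its branch length.

    A leaf is (name, length); an internal node is ([children], length).
    The length attached to the root itself is ignored.
    """
    reference = min(taxa)
    table = {}

    def visit(node, is_root=False):
        label, length = node
        if isinstance(label, str):
            below = frozenset([label])
        else:
            below = frozenset().union(*(visit(child) for child in label))
        if not is_root:
            # Keep the side without the reference taxon, so the two edges
            # next to the root share one key and their lengths are summed.
            key = below if reference not in below else frozenset(taxa) - below
            table[key] = table.get(key, 0.0) + length
        return below

    visit(tree, is_root=True)
    return table

def rf_distance(a, b):
    """Robinson-Foulds distance: bipartitions found in only one tree."""
    return len(set(a) ^ set(b))

def branch_score_distance(a, b):
    """Branch score distance; a missing bipartition counts as length 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

taxa = ["A", "B", "C", "D", "E"]
# ((A,B),(C,(D,E))) with every branch set to 0.5
t1 = ([([("A", 0.5), ("B", 0.5)], 0.5),
       ([("C", 0.5), ([("D", 0.5), ("E", 0.5)], 0.5)], 0.5)], 0.0)
# Same topology, two branch lengths changed
t2 = ([([("A", 0.4), ("B", 0.5)], 0.5),
       ([("C", 0.5), ([("D", 0.5), ("E", 0.5)], 0.6)], 0.5)], 0.0)
# Different topology: ((A,C),(B,(D,E)))
t3 = ([([("A", 0.5), ("C", 0.5)], 0.5),
       ([("B", 0.5), ([("D", 0.5), ("E", 0.5)], 0.5)], 0.5)], 0.0)

s1, s2, s3 = (splits(t, taxa) for t in (t1, t2, t3))
print(rf_distance(s1, s2), branch_score_distance(s1, s2))
print(rf_distance(s1, s3), branch_score_distance(s1, s3))
```

In these toy trees, t1 and t2 share a topology, so their RF distance is 0 and only the branch score distance separates them. That is the same situation as in the results above, where every method recovered the same relationships and only branch lengths differed.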

Distances between trees are not the best way to assess consistency in BI, because they do not compare posterior probabilities and because all methods recovered the same relationships between taxa. That test will be considered in future work.

In summary, BI, ML and parsimony recovered the same relationships between taxa regardless of the number of characters added. Judged by one component of consistency (precision), however, BI behaved better than the other two methods.



Figure 1. Resulting trees: A) BI, first matrix; B) BI, second matrix; C) ML, first matrix; D) ML, second matrix; E) Parsimony, first matrix; and F) Parsimony, second matrix.

 

Figure 2. Sum of branch length distances for each method and number of characters.
 

References

Brower, A. (2017). Statistical consistency and phylogenetic inference: a brief review. Cladistics, 0: 1-6.

Felsenstein, J. (1973). Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25: 471-492.


Felsenstein, J. (1978). Cases in which Parsimony or Compatibility Methods Will be Positively Misleading. Systematic Zoology, 27(4): 401-410.

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5): 774-786.  

Goshal, S. (1998). A review of consistency and convergence of posterior distribution. Indian Statistical Institute. Calcutta, India.

Guindon, S., Gascuel, O. (2003). A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Systematic Biology, 52(5): 696-704.

Kuhner, M. K. and Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3): 459–468.
 
Rambaut, A. and Grassly, N. C. (1997). Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees. Computer Applications in the Biosciences, 13(3): 235-238.

Robinson, D. and Foulds, L. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1): 131–147.

Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M., Huelsenbeck, J. (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology, 61(3): 539-542.


Wiens, J., Moen, D. (2008). Missing data and the accuracy of Bayesian phylogenetics. Journal of Systematics and Evolution, 46(3): 307-314.


 Script and data: https://github.com/DanielaP10/Post-2
