Filosofía, especie y sistemática: Can you recover the truth?

Recovering the “real” topology is an ambiguous phrase: how can we know that what we get is real? We cannot. There is a second problem: the methods for phylogenetic reconstruction (Parsimony, Maximum Likelihood and Bayesian inference) may give different topologies. When we reconstruct a topology we are testing something, for example the hypothesis that a group is monophyletic. Hennig (1966) defined monophyly as follows: “A monophyletic group comprises all descendants of a group of individuals that at their time belonged to a [...] single biological species”. Ashlock (1971) said that to evaluate monophyly we have to run a cladistic analysis of the terminals and search for one or more characters shared by the ancestor and all of its descendants. In addition, Ashlock (1972) said: “Cladistic refers to branching sequences and connections within a phylogenetic dendrogram ...”. This clarification makes the concept of monophyly applicable to both the cladistic and the phenetic schools, and it allows us to compare topologies from different methods. In this post I tested the differences between topologies because I wanted to know which method is the most accurate, with accuracy understood as recovery of the real topology. In simple words: which method gives the correct topology?

As I said before, a reconstruction is always a test of something, and some authors test the monophyly of a taxonomic group (De Queiroz & Gauthier, 1990). In this post there is no taxonomic group; instead there are reference topologies, which I used to simulate molecular data. I used ape in R (Paradis and Schliep, 2018) to generate the reference topologies with different numbers of terminals (5, 10, 20, 30 and 40), with all branch lengths fixed to 0.5, and I simulated 3 topologies for each number of terminals. For the DNA simulation I used Seq-Gen v1.3.4 (Rambaut and Grassly, 1997) under the HKY model, with 10000 characters. Phylogenetic reconstructions were run under Parsimony, Maximum Likelihood (ML) and Bayesian methods. For Parsimony I applied two approaches, linear and concave (implied weights), both in TNT (Goloboff et al., 2008). ML was run in RAxML-NG (Kozlov et al., 2018) with two configurations: all defaults, and the model selected by AIC. For the Bayesian analysis I used MrBayes 3 (Ronquist and Huelsenbeck, 2003) with the GTR+Gamma model, 10000 generations and 4 chains; all other parameters were left at their defaults. To compare the recovered topologies I calculated the Robinson-Foulds (RF) distance with the phytools R package (Revell, 2012), always comparing the reconstruction against the reference topology. All scripts are available at https://github.com/andres1898/post2.
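To make the comparison step concrete, here is a minimal sketch of how a Robinson-Foulds distance can be computed. It is not the script from the repository (that one works in R); it is a plain Python illustration that assumes trees are written as nested tuples of terminal names, and the names `leaf_set`, `splits` and `rf_distance` are hypothetical helpers invented for this example. The idea is to list the non-trivial bipartitions (splits) of each unrooted tree and count the splits found in one tree but not in the other.

```python
def leaf_set(tree):
    """Return the set of terminal names below a node (a str is a terminal)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions of a tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed terminal used to pick one side of each split
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        # Always store the side of the split that does NOT contain the anchor,
        # so the same bipartition is written the same way in every tree.
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single terminal versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Unnormalised RF distance: splits present in exactly one of the trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must have the same terminals")
    return len(splits(tree_a) ^ splits(tree_b))


# Toy trees with 5 terminals (made up for this example, not simulation output).
reference = (("A", "B"), ("C", ("D", "E")))
estimate = (("A", "C"), ("B", ("D", "E")))
rerooted = ("A", ("B", ("C", ("D", "E"))))  # same unrooted tree as reference

print(rf_distance(reference, estimate))  # 2
print(rf_distance(reference, rerooted))  # 0
```

For two fully resolved unrooted trees with n terminals, the largest possible value of this distance is 2(n - 3), so the ceiling is 4 for 5 terminals and 74 for 40 terminals. That is worth keeping in mind when comparing raw RF values across different numbers of terminals.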

In figure 1 the methods behave in a similar way: the RF distance increases as the number of terminals increases. ML looks great. It does differ from the reference topology, but the values are below 20 and there are just two outlying values, which correspond to one replicate. Comparing the two ML configurations, there is no real difference, so the model did not influence the result. Parsimony looks more scattered than ML but less than Bayes, and again there is no difference between linear and concave weights; its RF distances are also below 20, but it has more distant values than ML. Bayes looks very dispersed and has the highest RF distances.

Figure 2 shows the behavior of the methods more clearly: Bayes is the most dispersed, followed by Parsimony, and ML is the least dispersed. But why? For the poor accuracy of Bayes I see just one explanation: the flat priors. In Bayesian inference it is very important to choose the correct priors (Huelsenbeck et al., 2002), and one is more important than the rest: the topological prior. This prior describes how the tree is built, and the nucleotide substitution model can change between branches and nodes (Yang and Rannala, 1997). In this post I specified the model as the number of substitution types and the rates; the other priors were left at their defaults. The topological prior is flat, meaning that all topologies are equally likely a priori. This is a problem because the number of generations needs to be higher to explore the whole topology space (Pickett and Randle, 2005). In my case there were just 10000 generations, so maybe not all possible topologies were explored, and as the number of terminals increases the number of possible topologies increases too. This could also explain why accuracy is better (lower RF distance) when the number of terminals is small.
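To give a feeling for how fast the topology space grows, here is a small Python sketch. It uses the standard count of fully resolved unrooted trees for n terminals, (2n - 5)!!, that is, the product 3 × 5 × ... × (2n - 5). The function name `unrooted_topologies` is made up for this example, and the figure of 10000 generations comes from the setup described in this post. An MCMC run does not need to visit every topology to work well, so this is only a rough picture of the scale, not a convergence test.

```python
def unrooted_topologies(n_terminals):
    """Number of fully resolved unrooted topologies: (2n - 5)!! for n >= 3."""
    if n_terminals < 3:
        raise ValueError("an unrooted tree needs at least 3 terminals")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for odd in range(3, 2 * n_terminals - 5 + 1, 2):
        count *= odd
    return count


generations = 10000  # generations per chain used in this post

for n in (5, 10, 20, 30, 40):
    total = unrooted_topologies(n)
    # Upper bound on the share of tree space one chain could even touch.
    share = min(1.0, generations / total)
    print(f"{n:>2} terminals: {total:.3e} topologies, at most {share:.1e} visited")
```

With 5 terminals there are only 15 possible topologies, so a short chain can see all of them many times. With 10 terminals there are already 2027025, and from 20 terminals onward the number is far beyond anything 10000 generations could cover.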

Parsimony looks great. That result was a surprise to me: I thought Parsimony would be the worst method, with the highest RF distances, but it was not. This high accuracy could be due to the simulation process and parameters and to the size of the molecular matrix. Parsimony reaches low RF distances because here it behaves like an ML approach, which follows from the simple model used (Goloboff, 2003), the absence of long branches and the large character matrix (Kim, 1993). That is exactly my setting: a simple substitution model, constant branch lengths with no long branches, and a big data matrix. Another curiosity is that there is no difference between linear and concave weights. That is because I am only interested in the topologies: as Goloboff (1993) explained, differential weighting changes the length of the final tree and decreases homoplasy, which may or may not influence the topology.
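The difference between linear and concave weighting can be sketched with a few lines of Python. Under implied weights, the fit of a character with h extra steps is usually written as k / (k + h), where k is the concavity constant, so each additional step on an already homoplastic character costs less than the previous one. The value k = 3 below is an assumption for illustration (the post does not say which k was used), the function names are made up, and the two lists of extra steps are invented numbers, not results from my matrices.

```python
def character_fit(extra_steps, k=3.0):
    """Concave fit of one character: 1.0 with no homoplasy, lower with more."""
    return k / (k + extra_steps)


def linear_cost(extra_steps_per_character):
    """Linear parsimony: every extra step counts the same."""
    return sum(extra_steps_per_character)


def concave_cost(extra_steps_per_character, k=3.0):
    """Implied weights: total distortion, the sum of 1 - fit over characters."""
    return sum(1.0 - character_fit(h, k) for h in extra_steps_per_character)


# Extra steps per character on two hypothetical candidate trees.
tree_x = [7, 0, 0, 0, 0]  # homoplasy concentrated in one character
tree_y = [1, 1, 1, 1, 1]  # homoplasy spread over all characters

print(linear_cost(tree_x), linear_cost(tree_y))    # 7 vs 5: linear prefers tree_y
print(concave_cost(tree_x), concave_cost(tree_y))  # concave prefers tree_x
```

In this toy case the two criteria disagree, which shows how concave weighting could change the preferred topology. When the data are clean, as in a simulation with a simple model and no long branches, both criteria can easily agree on the same tree, which would fit what I saw here.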

One problem of Maximum Likelihood is the tree search (Russo et al., 1996). Some programs, such as PAUP, explore a lot of tree space and then apply a perturbation (for example NNI) (Swofford, 2001), which slows down the analysis and could restrict the tree space explored. RAxML-NG implements a search with parsimony and heuristic starting trees, and for each tree it applies SPR perturbations at different levels, which speeds up finding the maximum likelihood topology (Kozlov et al., 2018). This intensive search could be what makes ML the most accurate method. On the other hand, models are crucial in ML (Huelsenbeck and Crandall, 1997), but in this post there is no real difference between models, because the simulation model was simple, the branch lengths were constant and there were no long branches.

Finally, I conclude that Maximum Likelihood is the best approach, at least as implemented in RAxML-NG. I know that I did not evaluate other programs, but in the literature there are many articles that criticize the computation time of analyses in programs like PAUP or MEGA; on the other hand, programs like PhyML could give the same accuracy and computation time. ML was the method with the lowest RF distance for all numbers of terminals, which means it had the best accuracy, and it was insensitive to model selection. The Bayesian approach was the method I defended in a previous post, but in this post I saw how difficult it is to choose correct priors and how sensitive the method is to the number of generations. A good Bayesian analysis could take more time, both to explore priors and to perform the computations, so for fast analyses with good accuracy I now prefer ML.

References

Ashlock, P. D. (1971). Monophyly and associated terms. Systematic Zoology, 20(1), 63-69.
Ashlock, P. D. (1972). Monophyly again. Systematic Zoology, 21(4), 430-438.
De Queiroz, K., & Gauthier, J. (1990). Phylogeny as a central principle in taxonomy: phylogenetic definitions of taxon names. Systematic Zoology, 39(4), 307-322.
Goloboff, P. A. (2003). Parsimony, likelihood, and simplicity. Cladistics, 19(2), 91-103.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.
Hennig, W. (1966). Phylogenetic systematics. University of Illinois Press, Urbana.
Huelsenbeck, J. P., & Crandall, K. A. (1997). Phylogeny estimation and hypothesis testing using maximum likelihood. Annual Review of Ecology and Systematics, 28(1), 437-466.
Huelsenbeck, J. P., Larget, B., Miller, R. E., & Ronquist, F. (2002). Potential applications and pitfalls of Bayesian inference of phylogeny. Systematic Biology, 51(5), 673-688.
Kim, J. (1993). Improving the accuracy of phylogenetic estimation by combining different methods. Systematic Biology, 42(3), 331-340.
Kozlov, A., Darriba, D., Flouri, T., Morel, B., & Stamatakis, A. (2018). RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference. BioRxiv, 447110.
Paradis, E., & Schliep, K. (2018). ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics, 35, 526-528.
Pickett, K. M., & Randle, C. P. (2005). Strange Bayes indeed: uniform topological priors imply non-uniform clade priors. Molecular Phylogenetics and Evolution, 34(1), 203-211.
Rambaut, A., & Grassly, N. C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics, 13(3), 235-238.
Revell, L. J. (2012). phytools: An R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3, 217-223.
Robinson, D. F., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2), 131-147.
Ronquist, F., & Huelsenbeck, J. P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572-1574.
Russo, C. A., Takezaki, N., & Nei, M. (1996). Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Molecular Biology and Evolution, 13(3), 525-536.
Swofford, D. L. (2001). Paup*: Phylogenetic analysis using parsimony (and other methods) 4.0. B5.
Yang, Z., & Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo method. Molecular Biology and Evolution, 14(7), 717-724.
GitHub repository with the scripts: https://github.com/andres1898/post2

Friday, 29 March 2019
