Filosofía, especie y sistemática: Me and the comparative biology: methods to reconstruct a phylogeny

The need to reconstruct phylogenies was born with Darwin's On the Origin of Species in 1859¹, which brought several key concepts together. Before that, the evolutionary history of groups had not been a subject of study. Haeckel was the first to use the word "phylogeny"¹, in 1866²; nowadays it describes the relationships among species, taking into account their evolutionary processes and patterns throughout history³. Groups in a phylogeny can be monophyletic, paraphyletic or polyphyletic⁴. Various methods have been developed to infer phylogenetic trees³, but they often disagree on which tree best supports the observations⁵⁻⁶. Some approaches are distance methods, which compress the phylogenetic information in a set of sequences into a matrix of pairwise distances³. Others are character-based and use all the information available at each homologous site of the sequences³. Before embarking on phylogenetic discovery, however, we must first define the research question of interest. A conceptual framework and a quantifiable hypothesis also need to be generated for the study group, and once the objective is clear we can start obtaining the respective sequences⁷. In this post, my goal is to compare how accurately maximum parsimony (MP), maximum likelihood (ML) and Bayesian inference (BI) reconstruct an initial topology.
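To make the idea of a distance method concrete, here is a minimal Python sketch of how a set of aligned sequences can be compressed into pairwise distances. It uses the simple p-distance (proportion of differing sites), which is an assumption chosen for illustration and not necessarily the distance used by any program in this post. The toy alignment and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def distance_matrix(alignment):
    """Return {(taxon_i, taxon_j): p-distance} for every unordered pair of taxa."""
    return {
        (i, j): p_distance(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (not data from this post).
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACGAACGTTC",
    "D": "TCGAACGTTG",
}

matrix = distance_matrix(alignment)
for pair, d in matrix.items():
    print(pair, round(d, 2))
```

Once the matrix is built, the individual sites are no longer consulted. That is the compression a distance method relies on, whereas a character-based method keeps working with every column of the alignment.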

In this study, I simulated the reference topologies with the ape⁸ package in R v3.5.4⁹ for different sample sizes (4, 10, 20 and 40 terminals), with all branch lengths equal to 0.5 and 3 replicates. For the DNA simulation I used Seq-Gen v1.3.4¹⁰ under the HKY model for 10,000 characters. For phylogenetic reconstruction under MP and ML I used PAUP* v4.0¹¹. For BI I used MrBayes v3.1.2¹² with 10,000 generations and all other parameters left at their defaults. To compare the recovered topologies with the reference ones, I chose the Robinson-Foulds (RF) distance¹³ as implemented in the phangorn¹⁴ package.
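The RF distance counts the bipartitions (splits) that appear in one tree but not in the other. The following Python sketch illustrates that idea on trees written as nested tuples. It is not phangorn's implementation and not the script used for this post; the tree encoding, the function names and the two example trees are assumptions made for illustration.

```python
def leaves(tree):
    """Set of taxon names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of a tree given as nested tuples, treated as unrooted."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # fixed taxon used to give each split one canonical side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset()
        for child in node:
            clade |= visit(child)
        # Keep only splits with at least two taxa on each side.
        if 1 < len(clade) < len(all_taxa) - 1:
            # Store the side that does NOT contain the anchor taxon.
            side = all_taxa - clade if anchor in clade else clade
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical six-taxon trees: the second swaps the positions of B and C.
reference = ((("A", "B"), "C"), ("D", ("E", "F")))
recovered = ((("A", "C"), "B"), ("D", ("E", "F")))

print(rf_distance(reference, reference))
print(rf_distance(reference, recovered))
```

A distance of 0 means the recovered topology has exactly the same splits as the reference, so lower values indicate a more faithful reconstruction.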

MP analysis is carried out under the implicit assumption¹⁵ that the best estimate of the phylogeny is the shortest tree, that is, the one requiring the fewest evolutionary transformations among the characters⁶. By favoring the least amount of change, is it claiming that evolution itself is parsimonious? Its defenders maintain that the only thing parsimony assumes is that there has been descent with modification³, but the method itself does not state an explicit proposition¹⁵. Also, when taxa on long branches are separated by short internal branches, MP tends to group the long branches together (long-branch attraction)¹⁶. The ML method is also criterion-based: the best tree is the one under which the observed data have the highest probability⁶, given a specific model of evolution, the tree topology and the branch lengths between nodes³,¹⁵. ML differs from MP in that it takes branch lengths into account through a model, but it can suffer from long-branch repulsion, failing to group sister taxa that both sit on long branches¹⁶. Bayesian inference also relies on likelihood calculations, but it estimates the probability of the tree topology given the data and the model, rather than the probability of the data given the model and the tree topology⁶. The choice of method makes a practical difference, yet there is still no clear answer as to which one is best: each phylogenetic approach has advantages and disadvantages.
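As a rough illustration of the "shortest tree" criterion, the sketch below counts the minimum number of changes a tree requires with Fitch's algorithm and compares the three possible topologies for four taxa. It assumes unordered, equally weighted characters and binary trees written as nested tuples. The alignment and function names are hypothetical, and this is not how PAUP* is implemented.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change at this node.
        return common, left_changes + right_changes
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, alignment):
    """Total number of changes over all sites (the tree length under MP)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical toy alignment (not data from this post).
alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTGT"}

# The three possible topologies for four taxa.
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

lengths = {name: parsimony_length(t, alignment) for name, t in candidates.items()}
best = min(lengths, key=lengths.get)
print(lengths)
print("shortest tree:", best)
```

With only three candidate trees the search is exhaustive. With 40 terminals the number of topologies is far too large for that, which is why real analyses depend on heuristic searches.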

Figure 1. RF distance between reference and recovered topologies for the MP (maximum parsimony), ML (maximum likelihood) and BI (Bayesian inference) methods. A lower distance means the initial topology was reconstructed more faithfully. By this measure, ML is the most accurate approach, while BI moves further from the reference as the number of terminals increases.

In my results, maximum parsimony and maximum likelihood gave similar distance values, but the difference between them became clearer beyond 10 terminals, with ML showing the lowest distance (Fig. 1). One common cause of reconstruction error is short sequence length¹⁷. Bootstrapping and Bayesian sampling provide a set of possible trees instead of a single-tree estimate, and features that are highly supported across that set are regarded as likely features of the true tree¹⁷. I have previously declared that my philosophy has a Bayesian spirit, but, at least for reconstructing the initial topology, BI shows the largest RF distances. BI is not only the least accurate method but also the one with the most variation (Fig. 2). BI explores the posterior probability space using MCMC to find the model and tree topology with the highest posterior probability⁶; if the initial parameters are not set appropriately, the bias grows. With many terminals the tree space to explore is larger and requires a longer walk, whereas with few terminals the topology was completely reconstructed (Fig. 1). The combination of complex evolutionary models, priors and posterior probabilities¹² does not diminish my interest, but to give a less error-prone answer and take a defined position, the experiment could be repeated with a greater number of generations and a more appropriate set of priors for BI.
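For reference, the quantity that BI targets can be written with Bayes' theorem. The symbols are my own notation rather than something taken from the cited sources: τ is a tree topology, D the aligned sequences and M the substitution model.

```latex
P(\tau \mid D, M) \;=\;
  \frac{P(D \mid \tau, M)\, P(\tau \mid M)}
       {\sum_{\tau'} P(D \mid \tau', M)\, P(\tau' \mid M)}
```

ML looks for the tree that maximizes the likelihood P(D | τ, M), whereas BI works with the posterior P(τ | D, M), which also depends on the prior P(τ | M). The sum in the denominator runs over every possible topology, so it cannot be computed directly for more than a handful of terminals. MCMC sidesteps it by comparing only ratios of posterior values, which is why the number of generations matters: a chain that is too short may not yet have reached the region of high posterior probability.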

Figure 2. Variation in RF distance for the MP (maximum parsimony), ML (maximum likelihood) and BI (Bayesian inference) methods. The pink points represent the mean and the brown points represent outliers of the RF distance. BI shows the clearest variation.

References

1. Darwin, C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray (1859).
2. Haeckel, E. Generelle Morphologie der Organismen. Berlin: Reimer (1866).
3. Besse, P. Molecular Plant Taxonomy: Methods and Protocols. Springer Science+Business Media New York 13, 257-275 (2014).
4. Hennig, W. Phylogenetic Systematics. Urbana: University of Illinois Press (1966).
5. Burnham, K. and Anderson, D. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer (1998).
6. Wiley, E. and Lieberman, B. Phylogenetics: Theory and Practice of Phylogenetic Systematics. John Wiley & Sons (2011).
7. Mount, D. Choosing a Method for Phylogenetic Prediction. Cold Spring Harbor Protocols 2008, pdb.ip49 (2008).
8. Paradis, E. Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer (2012).
9. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2012).
10. Rambaut, A. and Grassly, N. C. Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees. Computer Applications in the Biosciences 13(3), 235-238 (1997).
11. Swofford, D. L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts (2002).
12. Ronquist, F., Teslenko, M., van der Mark, P., et al. MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice across a Large Model Space. Systematic Biology 61(3), 539-542 (2012).
13. de Oliveira Martins, L., Mallo, D. and Posada, D. A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction. Systematic Biology 65(3), 397-416 (2016).
14. Schliep, K. P. Phangorn: Phylogenetic Analysis in R. Bioinformatics 27(4), 592-593 (2011).
15. Sober, E. The Contest Between Parsimony and Likelihood. Systematic Biology 53, 644-653 (2004).
16. Siddall, M. Success of Parsimony in the Four-Taxon Case: Long-Branch Repulsion by Likelihood in the Farris Zone. Cladistics 14, 209-220 (1998).
17. Huggins, P. M., Li, W., Haws, D., Friedrich, T., Liu, J. and Yoshida, R. Bayes Estimators for Phylogenetic Reconstruction. Systematic Biology 60(4), 528-540 (2011).

Script

https://github.com/lizsteny/Comparative-biology/blob/master/Post%202%20Script


Sunday, March 31, 2019
