Sunday, March 31, 2019

Comparative biology and me: methods to reconstruct a phylogeny


The need to reconstruct phylogenies arose with Darwin's On the Origin of Species in 1859 [1], which brought several key concepts together. Before that, the evolutionary history of groups had not been considered. Haeckel, in 1866 [2], was the first to use the word "phylogeny" [1]; nowadays it describes the relationships among species, taking into account their evolutionary processes and patterns throughout history [3]. The groups defined by these relationships can be monophyletic, paraphyletic or polyphyletic [4]. Various methods have been developed to infer phylogenetic trees [3]; however, they often disagree on which tree best supports the observations [5-6]. Some approaches use distances, compressing the phylogenetic information in a set of sequences into a pairwise distance matrix [3]. Others use characters, taking advantage of all the information available at each homologous site [3]. However, before embarking on phylogenetic discovery, we must first define the research question of interest. In addition, a conceptual framework and a quantifiable hypothesis need to be generated for the study group; once the objective is known, we can start obtaining the respective sequences [7]. In this post, my goal is to compare how accurately maximum parsimony (MP), maximum likelihood (ML) and Bayesian inference (BI) reconstruct a known initial topology.

In this study, I simulated the reference topologies with the ape [8] package in R v3.5.4 [9], for different sample sizes (4, 10, 20 and 40 terminals), with all branch lengths equal to 0.5 and 3 replicates. For the DNA simulation I used Seq-Gen v1.3.4 [10] under the HKY model for 10,000 characters. For phylogenetic reconstruction under MP and ML I used PAUP* v4.0 [11]. For BI I used MrBayes v3.1.2 [12] with 10,000 generations and the other parameters left at their defaults. To compare the recovered topologies with the reference ones, I chose the Robinson-Foulds (RF) distance [13] as implemented in the phangorn [14] package.
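To make the comparison metric concrete, here is a minimal Python sketch of the Robinson-Foulds distance for unrooted trees. It is not the phangorn implementation used in this post. The nested-tuple tree representation and the function names (`leaves`, `splits`, `rf_distance`) are assumptions made only for illustration. The idea is to list the non-trivial bipartitions (splits) that each tree induces on the taxa and count the splits found in only one of the two trees.

```python
from itertools import chain


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal branches."""
    all_taxa = leaves(tree)
    reference = min(all_taxa)  # fixed taxon used to name each split by one side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset()
        for child in node:
            clade |= visit(child)
        other = all_taxa - clade
        # Ignore trivial splits (a single taxon, or nothing, on one side).
        if len(clade) > 1 and len(other) > 1:
            # Store the side without the reference taxon, so the same
            # bipartition gets the same name whatever the rooting.
            found.add(other if reference in clade else clade)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical four-taxon example: a reference tree and two recovered trees.
reference_tree = (("A", "B"), ("C", "D"))
same_topology = ("A", ("B", ("C", "D")))  # same unrooted tree, other rooting
wrong_topology = (("A", "C"), ("B", "D"))

print(rf_distance(reference_tree, same_topology))
print(rf_distance(reference_tree, wrong_topology))
```

A distance of zero means the recovered tree has exactly the same splits as the reference, which is why lower values are read as a better reconstruction.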

The MP analysis is carried out under the implicit assumption [15] that the best estimate of the phylogeny is the shortest tree, measured by the number of evolutionary transformations among the characters [6]. Does choosing the tree with the fewest changes amount to claiming that evolution itself is parsimonious? Its defenders maintain that the only thing parsimony assumes is that there has been descent with modification [3], but this method does not express a real proposition [15]. Also, when taxa with long branches are separated by short internal branches, MP tends to group the long branches together (long-branch attraction) [16]. The ML method uses a criterion-based approach: the best tree is the one with the highest probability of producing the observed data [6], given a specific evolutionary model, the tree topology and the branch lengths between nodes [3,15]. ML differs from MP in that it considers branch lengths in relation to a model, but it is affected by long-branch repulsion when sister taxa have long branches [16]. Bayesian inference also uses likelihood calculations, but it estimates the probability of the tree topology given the data and the model, rather than the probability of the data given the model and the tree topology [6]. Choosing a method makes a practical difference, and which one is best does not yet have a clear answer: each phylogenetic approach has advantages and disadvantages.
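The parsimony criterion described above (the shortest tree is the one needing the fewest changes) can be illustrated with Fitch's algorithm, which counts the minimum number of state changes on a given tree. The sketch below is only a toy under stated assumptions: trees are rooted, strictly binary nested tuples, the alignment is a dictionary of equal-length strings, and the names `fitch_score` and `parsimony_length` are hypothetical. It scores a fixed tree and does not perform the tree search that PAUP* carries out.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one character on a rooted binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}  # a leaf carries its observed state
        left, right = (visit(child) for child in node)  # binary nodes only
        common = left & right
        if common:
            return common  # the children agree, no change is needed here
        changes += 1  # the children disagree, one change on this node
        return left | right

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Tree length: Fitch scores summed over every site of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical three-site alignment for four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

# MP prefers the tree with the smaller length.
print(parsimony_length(tree_1, alignment), parsimony_length(tree_2, alignment))
```

In this toy data set the tree grouping A with B explains the alignment with fewer changes, so it would be the one MP retains.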


Figure 1. RF distance between reference and recovered topologies for the MP (maximum parsimony), ML (maximum likelihood) and BI (Bayesian inference) methods. A lower distance reflects a better reconstruction of the initial topology. Under this criterion, ML is the most accurate approach, while BI moves further from the reference as the number of terminals increases.


In my results, maximum parsimony and maximum likelihood maintained similar distance values, but the difference became clearer beyond 10 terminals, with ML giving the lowest distance (Fig. 1). One common cause of reconstruction error is short sequence length [17]. Bootstrapping and Bayesian sampling methods provide a set of possible trees instead of a single-tree estimate, and highly supported features in that set are regarded as likely features of the true tree [17]. Previously, I declared that my philosophy has a Bayesian spirit, but, at least for reconstructing the initial topology, BI shows the largest RF distances. BI is not only the least accurate method but also the one with the most variation (Fig. 2). BI explores the posterior probability space using Markov chain Monte Carlo (MCMC) to find the model/tree topology with the highest posterior probability [6]; if the initial parameters are not defined correctly, the bias grows. With many terminals the space to explore is larger and requires a longer walk, whereas with few terminals the topology was completely reconstructed (Fig. 1). The combination of complex evolutionary models, priors and posterior probabilities [12] does not take away my interest, but to give a less erroneous answer and take a defined position, the experiment could be repeated with a greater number of generations and a more appropriate set of priors for BI.
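As a reminder of what the chain is sampling, the quantities involved can be written out as below. This is the generic textbook form, not the exact parameterization used by MrBayes. The symbols are assumptions introduced for illustration: D is the alignment, T a tree (topology and branch lengths), M the substitution model, and q the proposal distribution.

```latex
% ML picks the tree that maximizes the probability of the data.
\hat{T}_{\mathrm{ML}} = \arg\max_{T} \; P(D \mid T, M)

% BI targets the posterior probability of the tree given the data.
P(T \mid D, M) = \frac{P(D \mid T, M)\, P(T \mid M)}
                      {\sum_{T'} P(D \mid T', M)\, P(T' \mid M)}

% Metropolis-Hastings: a proposed tree T' replaces the current tree T
% with probability alpha, so the denominator above is never computed.
\alpha = \min\!\left(1,\;
  \frac{P(D \mid T', M)\, P(T' \mid M)\, q(T \mid T')}
       {P(D \mid T, M)\, P(T \mid M)\, q(T' \mid T)}\right)
```

Because each generation only proposes one local move, a short run may not have reached the high-posterior region when the number of terminals, and therefore of possible topologies, is large.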


Figure 2. Variation in RF distance among the MP (maximum parsimony), ML (maximum likelihood) and BI (Bayesian inference) methods. The pink points represent the means and the brown points represent the outliers of the RF distance. BI shows clearly greater variation.


References
  1. Darwin, C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray (1859).
  2. Haeckel, E. Generelle Morphologie der Organismen. Reimer, Berlin (1866).
  3. Besse P. Molecular Plant Taxonomy: Methods and Protocols. Springer Science+Business Media New York 13, 257-275 (2014).
  4. Hennig, W. Phylogenetic Systematics. University of Illinois Press, Urbana (1966).
  5. Burnham, K. and Anderson, D. Model Selection and Inference – a Practical Information Theoretic Approach. New York: Springer (1998).
  6. Wiley, E. and Lieberman, B. Phylogenetics: Theory and Practice of Phylogenetic Systematics. (John Wiley & Sons, 2011).
  7. Mount, D. Choosing a Method for Phylogenetic Prediction. Cold Spring Harbor Protocols 2008, pdb.ip49-pdb.ip49 (2008).
  8. Paradis, E. Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer (2012).
  9. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2012).
  10. Rambaut, A. and Grassly, N.C. “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees”, Computer Applications In The Biosciences, 13:3, 235-238 (1997).
  11. Swofford, D. L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts (2002).
  12. Ronquist, F., Teslenko, M., van der Mark, P., et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61(3): 539-542 (2012).
  13. de Oliveira Martins L., Mallo D., Posada D. A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction. Syst. Biol. 65(3): 397-416 (2016).
  14. Schliep K.P. Phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4) 592-593 (2011).
  15. Sober, E. The Contest Between Parsimony and Likelihood. Systematic Biology 53, 644-653 (2004).
  16. Siddall, M. Success of Parsimony in the Four-Taxon Case: Long-Branch Repulsion by Likelihood in the Farris Zone. Cladistics 14, 209-220 (1998).
  17. Huggins, P. M., Li, W., Haws, D., Friedrich, T., Liu, J., Yoshida, R. Bayes Estimators for Phylogenetic Reconstruction. Systematic Biology 60(4): 528-540 (2011).

Script

  • https://github.com/lizsteny/Comparative-biology/blob/master/Post%202%20Script

