Drawing together several key concepts, the need to reconstruct a phylogeny was
born with the publication of On the Origin of Species by Darwin in 1859 [1].
Before that, the evolutionary history of groups had not been considered.
Haeckel, in 1866 [2], was the first to use the word "phylogeny" [1]; nowadays
it describes the relationships among species, taking into account their
evolutionary processes and patterns throughout history [3]. The types
of phylogenetic relationship can be
monophyletic, paraphyletic or polyphyletic [4]. Various methods
were developed to infer phylogenetic trees [3];
however, the results often disagree on which tree best supports the
observations [5,6]. Some approaches use distances, compressing the phylogenetic
information in a set of sequences into a pairwise distance matrix [3].
Others use characters, taking advantage of all the information available
in each homologous character of the sequences [3]. However, before embarking on
phylogenetic discovery, we must first define the research question of interest.
In addition, a conceptual framework and a quantifiable hypothesis need
to be generated for our study group; once we know our objective, we can start
obtaining the respective sequences [7]. In this post, my goal is to compare how
accurately the maximum parsimony (MP), maximum likelihood (ML) and Bayesian
inference (BI) methods reconstruct an initial topology.
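To make the distance approach mentioned above concrete, here is a minimal Python sketch of how a set of aligned sequences can be compressed into a pairwise distance matrix. It assumes the uncorrected p-distance (the proportion of sites that differ), which is only one of many possible distance measures. The toy alignment and the function names `p_distance` and `p_distance_matrix` are hypothetical and are not taken from the script linked at the end of this post.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def p_distance_matrix(alignment):
    """Return a symmetric dict-of-dicts matrix of pairwise p-distances."""
    names = list(alignment)
    # Every taxon is at distance zero from itself.
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical toy alignment (illustrative only, not data from this post).
toy_alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAACGTTT",
    "taxon_D": "TCGAACGTTT",
}

distances = p_distance_matrix(toy_alignment)
for name, row in distances.items():
    print(name, [round(row[other], 2) for other in toy_alignment])
```

A distance method would then build a tree from this matrix alone, which is why the site-by-site information used by character methods is no longer available at that stage.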
In this study, I simulated the reference topologies with the ape [8] package
in R v3.5.4 [9], for different sample sizes (4, 10, 20 and 40
terminals), with all branch lengths equal to 0.5 and 3 replicates. For the DNA
simulation I used Seq-Gen v1.3.4 [10] under the HKY
model for 10000 characters. For
phylogenetic reconstruction under MP and ML I used PAUP* v4.0 [11]. For BI I
used MrBayes v3.1.2 [12] with 10000 generations and
the other parameters left at their defaults. To compare the recovered
topologies with the reference ones, I chose the
Robinson-Foulds (RF) distance [13] as implemented in the phangorn [14] package.
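As a rough illustration of what the RF distance measures, the sketch below counts the non-trivial splits (bipartitions) found in one tree but not in the other. It is not the phangorn implementation used for the analysis. It assumes that trees are written as nested Python tuples of terminal names and are treated as unrooted; the example trees and the helper names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of terminal names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree treated as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference terminal used to normalise splits
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset()
        for child in node:
            below |= visit(child)
        # A split is non-trivial when both sides hold at least two terminals.
        if 1 < len(below) < len(all_leaves) - 1:
            # Always store the side that does not contain the anchor, so the
            # same split is recorded identically whatever the rooting.
            side = all_leaves - below if anchor in below else below
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same terminals")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-terminal examples (not the simulated trees from this post).
reference = ((("A", "B"), "C"), ("D", "E"))
same_topology = (("D", "E"), ("C", ("B", "A")))   # same tree, drawn differently
other_topology = ((("A", "C"), "B"), ("D", "E"))  # A and C grouped instead of A and B

print(rf_distance(reference, same_topology))
print(rf_distance(reference, other_topology))
```

A distance of 0 means that the recovered tree contains exactly the splits of the reference tree, so lower values indicate a better reconstruction.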
The MP analysis is carried out under the implicit assumption [15] that
the best estimate of the phylogeny is the shortest tree, measured by the number
of evolutionary transformations among the characters [6]. Does favoring
the fewest changes amount to saying that evolution is
parsimonious? Its defenders maintain that the only thing parsimony assumes
is that there has been descent with modification [3], but the method itself
does not state an explicit proposition about evolution [15]. Also, when taxa
with long branches are separated by short internal branches, MP tends to group
the long branches together [16]. The ML method uses a criterion-based approach:
the best tree
is the one with the highest probability of producing the data we observe [6],
given a specific evolutionary model, the topology of the tree and the branch
lengths between nodes [3,15]. ML differs from MP in considering branch
lengths in relation to a model, but it is affected by long-branch repulsion
between sister taxa [16]. Bayesian inference also uses likelihood
calculations, but it estimates the probability of the
tree topology given the data and the model rather than the probability of
the data given the model and tree topology [6]. Choosing a method makes a
practical difference, and which one is best does not yet have a clear answer:
each phylogenetic approach has advantages and disadvantages.
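The parsimony criterion described above (the shortest tree is the one requiring the fewest transformations) can be sketched with Fitch's algorithm, which counts the minimum number of state changes a character needs on a given tree. This is only an illustration of the criterion, not what PAUP* does internally. It assumes rooted binary trees written as nested tuples and unordered, equally weighted characters; the toy alignment and function names are hypothetical.

```python
def fitch_site_length(tree, states):
    """Minimum number of changes for one character on a rooted binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = node  # assumes strictly binary internal nodes
        left_set, right_set = visit(left), visit(right)
        shared = left_set & right_set
        if shared:
            return shared
        # No state in common: at least one change is needed on this node.
        changes += 1
        return left_set | right_set

    visit(tree)
    return changes


def tree_length(tree, alignment):
    """Sum the Fitch lengths over every site of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_length(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy alignment (illustrative only, not data from this post).
toy_alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CGT"}

tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))

lengths = {"AB|CD": tree_length(tree_ab, toy_alignment),
           "AC|BD": tree_length(tree_ac, toy_alignment)}
print(lengths)
```

Under MP the tree with the smaller length is preferred; here the tree that groups A with B requires fewer changes than the one that groups A with C.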
Figure 1. RF distance between reference and recovered topologies for the MP (maximum parsimony), ML (maximum likelihood) and BI (Bayesian inference) methods. A lower distance reflects a better reconstruction of the initial topology. By this measure, ML is the most accurate approach, while BI moves further from the reference as the number of terminals increases.
In my results, maximum parsimony and maximum likelihood maintained similar
distance values, but the difference became clearer beyond 10 terminals, with
ML giving the lowest distance (Fig. 1). One common cause of reconstruction
error is short sequence length [17]. Bootstrapping and Bayesian sampling
methods provide a set of possible trees instead of a single-tree estimate, and
features that are highly supported across that set are regarded as likely
features of the true tree [17].
Previously, I declared that my philosophy has a Bayesian spirit, but, at
least for reconstructing the initial topology, BI shows the largest RF
distances. BI is not only the least accurate method but also the one with the
most variation (Fig. 2). BI explores posterior probability space using MCMC
to find the model and tree topology with the highest posterior probability [6];
if the initial parameters are not defined correctly, the bias will grow. With
many terminals the space to explore is larger and requires a longer walk,
whereas with few terminals the topology was completely reconstructed (Fig. 1).
The combination of complex evolutionary models, priors and posterior
probabilities [12] does not diminish my interest, but to give a less erroneous
answer and reach a defined position, the experiment could be repeated with a
greater number of generations and a more appropriate set of priors for BI.
Figure 2. Variation in RF distance among the MP (maximum parsimony), ML (maximum likelihood) and BI (Bayesian inference) methods. The pink points represent the means and the brown points represent the outliers of the RF distance. BI shows the clearest variation.
References
1. Darwin, C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray (1859).
2. Haeckel, E. Generelle Morphologie der Organismen. Reimer, Berlin (1866).
3. Besse, P. Molecular Plant Taxonomy: Methods and Protocols. Springer Science+Business Media New York 13, 257-275 (2014).
4. Hennig, W. Phylogenetic Systematics. University of Illinois Press, Urbana (1966).
5. Burnham, K. and Anderson, D. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer (1998).
6. Wiley, E. and Lieberman, B. Phylogenetics: Theory and Practice of Phylogenetic Systematics. John Wiley & Sons (2011).
7. Mount, D. Choosing a Method for Phylogenetic Prediction. Cold Spring Harbor Protocols 2008, pdb.ip49-pdb.ip49 (2008).
8. Paradis, E. Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer (2012).
9. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2012).
10. Rambaut, A. and Grassly, N. C. Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees. Computer Applications in the Biosciences 13(3), 235-238 (1997).
11. Swofford, D. L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts (2002).
12. Ronquist, F., Teslenko, M., van der Mark, P., et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61(3), 539-542 (2012).
13. de Oliveira Martins, L., Mallo, D. and Posada, D. A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction. Systematic Biology 65(3), 397-416 (2016).
14. Schliep, K. P. Phangorn: Phylogenetic analysis in R. Bioinformatics 27(4), 592-593 (2011).
15. Sober, E. The Contest Between Parsimony and Likelihood. Systematic Biology 53, 644-653 (2004).
16. Siddall, M. Success of Parsimony in the Four-Taxon Case: Long-Branch Repulsion by Likelihood in the Farris Zone. Cladistics 14, 209-220 (1998).
17. Huggins, P. M., Li, W., Haws, D., Friedrich, T., Liu, J. and Yoshida, R. Bayes Estimators for Phylogenetic Reconstruction. Systematic Biology 60(4), 528-540 (2011).
Script
- https://github.com/lizsteny/Comparative-biology/blob/master/Post%202%20Script