Introduction
Consistency refers to logical and numerical coherence; an estimator
has this property if it converges in probability to its estimand as
the sample size increases. Steel (2015) states it in
the following lemma: “Suppose k sites are generated i.i.d. by
s (T,θ). Under the sufficient condition (*) for statistical
consistency, the bootstrap support of every edge e of T−ρ
converges in probability to 1 as k→∞. Moreover, the bootstrap
support for T−ρ converges in probability to 1 as k→∞.” Here
T is the tree, θ is a vector of continuous parameters, ρ is the
root vertex, and T−ρ is the unrooted version of T. The lemma
implies that the distribution of the estimates becomes increasingly
concentrated around the true value, so that the probability of the
estimator being close to this value converges to one. The problem of
consistency can be assessed from three approaches: parsimony, maximum
likelihood (ML), and Bayesian analysis.
Felsenstein (1978) postulated that when long-branched taxa are
separated by a short internal branch, parsimony would group the long
branches together. Likelihood methods have been shown to overcome this
problem, recovering the true model tree more readily when
long-branched lineages are not sister taxa. Yang (1994), in turn,
provides evidence that the maximum likelihood method is consistent.
Finally, the Bayesian approach, which has been evaluated only rather
recently, appears to be a consistent estimator of phylogeny only if
it is guaranteed to return the correct tree given a large (possibly
infinite) amount of independent data.
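The convergence in probability invoked in this section can be stated compactly. As a sketch, write \hat{T}_k for the unrooted tree estimated from k sites and \hat{\phi}_k for a real-valued estimate of a quantity \phi; these symbols are introduced here for illustration only and are not taken from Steel (2015). Consistency then amounts to:

```latex
% Topology: the estimated tree equals the true unrooted tree with probability tending to 1
\lim_{k \to \infty} \Pr\left( \hat{T}_k = T^{-\rho} \right) = 1

% Real-valued quantity: for every \varepsilon > 0
\lim_{k \to \infty} \Pr\left( \left| \hat{\phi}_k - \phi \right| > \varepsilon \right) = 0
```

The first statement concerns the recovered topology, the second any continuous quantity such as a branch length.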
Given the differences among these three approaches, the purpose of
this experiment was to assess their consistency using four initial
topologies (Farris zone, Felsenstein zone, short branches, and long
branches) in the four-taxon case.
Materials and Methods
The reference trees were designed according to the different
topologies and zones I intended to test: the Farris zone, the
Felsenstein zone, long branches, and short branches. From these
initial topologies, I simulated DNA sequences of 10, 100, and 1000
characters with the software Seq-Gen (Rambaut and Grassly, 1997),
each under the evolutionary models F84, HKY, and GTR.
The tree search was done using three different analyses: parsimony,
maximum likelihood (ML), and Bayesian analysis. The parsimony
analysis was performed with TNT v.1.1 (Goloboff et al., 2008) under
the parameters: mult 100, tbr, coll 0. The ML analysis was
performed using PhyML (Guindon et al., 2009) under the model with
which the sequences were simulated, with NNI and SPR swapping. The Bayesian
analysis was performed in MrBayes v.3.2.4 (Ronquist et al., 2012)
with two Monte Carlo chains, 500,000 generations (the p value reached
was below 10⁻³), the GTR model, and a burn-in of 0.05. The reference
topologies were compared with those obtained in each analysis in
R 3.1.1 (R Core Team, 2014) using the packages ape 3.1.4 (Paradis et
al., 2004) and phangorn 1.99-11 (Schliep, 2011). The function 'treedist'
(Steel and Penny, 1993) calculated the branch score difference
(Kuhner and Felsenstein, 1994) and the Robinson & Foulds distance
(Robinson and Foulds, 1981). Since parsimony does not recover branch
lengths, only the Robinson & Foulds distance was estimated for it,
using the frequency of its maximum value of 2. The graphs were
produced in the same software using the package Plotly
(https://plot.ly/r/). To perform the experiment, I wrote the scripts
in Bash for Ubuntu 14.04 (see supplementary material).
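The two tree-comparison measures used here can be illustrated for the four-taxon case. The following is a minimal Python sketch, not the phangorn implementation: the taxon labels, the branch lengths, and the helper names (`split`, `make_tree`, `rf_distance`, `branch_score_difference`) are hypothetical, and the branch score difference is assumed to be the square root of the summed squared branch-length differences, with a branch absent from one tree counted as length 0.

```python
from math import sqrt

TAXA = frozenset("ABCD")  # hypothetical taxon labels


def split(*side):
    """Canonical form of a bipartition: the side that does not contain taxon 'A'."""
    side = frozenset(side)
    return TAXA - side if "A" in side else side


def make_tree(lengths):
    """Build a tree as {canonical split: branch length} from {taxa on one side: length}."""
    return {split(*side): length for side, length in lengths.items()}


def rf_distance(t1, t2):
    """Robinson & Foulds distance: number of splits found in exactly one of the trees."""
    return len(set(t1) ^ set(t2))


def branch_score_difference(t1, t2):
    """Square root of summed squared branch-length differences over all splits.

    A split missing from one tree contributes with length 0 in that tree.
    """
    splits = set(t1) | set(t2)
    return sqrt(sum((t1.get(s, 0.0) - t2.get(s, 0.0)) ** 2 for s in splits))


# Hypothetical reference tree AB|CD with two long branches (A and C)
reference = make_tree({
    ("A",): 1.0, ("B",): 0.1, ("C",): 1.0, ("D",): 0.1,
    ("A", "B"): 0.1,  # internal branch
})

# Hypothetical estimate AC|BD, i.e. the long branches grouped together
estimate = make_tree({
    ("A",): 0.9, ("B",): 0.1, ("C",): 0.9, ("D",): 0.1,
    ("A", "C"): 0.05,  # internal branch
})

print(rf_distance(reference, estimate))  # 2
print(branch_score_difference(reference, estimate))
```

With four taxa each unrooted tree has a single internal split, so the Robinson & Foulds distance can only be 0 (same topology) or 2 (different topology), which is why its maximum value of 2 is the quantity tallied for parsimony.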
Results and Discussion
Fig. 1. Robinson & Foulds distance and branch score difference estimated for the three methods. A. Parsimony. B, C. Maximum likelihood. D, E. Bayesian analysis.
Robinson & Foulds distance and approach: what about correctness?
Parsimony reconstructed the phylogeny with the correct nodes as the
length of the data set increased. Under the HKY and GTR models with
1000 characters, the distance was zero for all topologies, but under
F84 this value was not reached, except for the short-branched and
Farris-zone reference topologies (Fig. 1A). These latter results are
in line with Steel and Penny (1993). In ML, only the HKY model
yielded trees at distance zero, except in the Farris zone. Under the
F84 and GTR models, with long branches and in the Farris zone, it was
not possible to recover the reference topology (Fig. 1B). In the Bayesian analysis,
the frequency of the distance of 2 was above 20, which is a first
difference from the other two methods. Under the F84 and GTR models
with 1000 characters, the distance was zero for all topologies, while
under the HKY model with the same number of characters the distance
reached zero only in the Farris and Felsenstein zones (Fig. 1D).
These differences in identifying the correct tree under different
evolutionary conditions (evolutionary models) can be used to weigh a
method beyond merely evaluating its ability to produce a tree
(Huelsenbeck and Hillis, 1993).
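The frequencies discussed in this section count how often the Robinson & Foulds distance takes its maximum value of 2 within a given condition. A minimal Python sketch of such a tally follows; the function name `max_rf_frequency`, the grouping by (model, sequence length), and the input distances are hypothetical placeholders and are not results from this experiment.

```python
from collections import Counter


def max_rf_frequency(rf_by_condition, max_rf=2):
    """Count, for each condition, the comparisons whose RF distance equals max_rf."""
    return {
        condition: Counter(distances)[max_rf]
        for condition, distances in rf_by_condition.items()
    }


# Hypothetical placeholder distances (NOT data from this study),
# keyed by (evolutionary model, sequence length)
example = {
    ("HKY", 10): [2, 2, 0, 2],
    ("HKY", 1000): [0, 0, 0, 0],
}

frequencies = max_rf_frequency(example)
print(frequencies)
```

A frequency that falls towards zero as the sequence length grows is the pattern expected from a consistent method.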
Branch score and approach: is it then a matter of length?
Overall, in the ML analysis the branch score difference tended to
decrease as the number of characters increased, although the
estimated branch lengths still differed greatly from the original
ones. Interestingly, with short branches ML reconstructed a tree very
similar to the initial one (Fig. 1C). The Bayesian analysis showed
low values of the branch score difference for the short-branched
topologies (Fig. 1E). When all the branches were long, the Bayesian
analysis had more difficulty reconstructing the original topology
with the correct nodes and branch lengths. However, for all four
topologies the branch score difference clearly decreased as the
number of characters in the data set increased. Overall, the Farris
zone and Felsenstein zone showed an increase in the branch score
difference in long branches for almost every model.
References
International Statistical Institute. (2003). The Oxford Dictionary of Statistical Terms (Y. Dodge, Ed.). Oxford University Press. http://stats.oecd.org/glossary/detail.asp?ID=5125
DeBry, R. W. (1992). The consistency of several phylogeny-inference methods under varying evolutionary rates. Molecular Biology and Evolution, 9(3), 537–551.
Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774–786.
Guindon, S., Delsuc, F., Dufayard, J. F., & Gascuel, O. (2009). Estimating maximum likelihood phylogenies with PhyML. Methods in Molecular Biology, 537, 113–137.
Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59(3), 307–321.
Huelsenbeck, J. P., & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. Systematic Biology, 42(3), 247–264.
Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.
Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289–290.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.
Rambaut, A., & Grassly, N. C. (1997). Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences, 13, 235–238.
Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., … Huelsenbeck, J. P. (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology, 61(3), 539–542.
Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.
Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.
Supplementary Material: The scripts and reference topologies are available at: https://drive.google.com/file/d/0B1NHg7c5hIVbTU1uQy11amhiZ3c/view?usp=sharing. Authors: Ayus-Ortíz, V. González-Piñeres, N.