Consistency Test in the Four-Taxon Case under Varying Evolutionary Models for Parsimony, Maximum Likelihood, and Bayesian Methods
INTRODUCTION
In statistics, a consistent estimator has the property that, as the number of data points used increases indefinitely, the resulting sequence of estimates converges to the value being estimated, the ‘correct value’. The distributions of the estimates become more and more concentrated near the true value, so that the probability of the estimator being arbitrarily close to the true value converges to one (Nikulin, 2001). Consistency is therefore a property of what occurs as the sample size “grows to infinity”.
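The definition above can be written compactly. This is a hedged restatement rather than a quotation from the cited source; the symbols are introduced here only for illustration: \hat{\theta}_n is the estimate obtained from n data points, \theta is the true value, and \varepsilon is any positive tolerance.

```latex
\lim_{n \to \infty} \Pr\left( \left| \hat{\theta}_n - \theta \right| < \varepsilon \right) = 1
\quad \text{for every } \varepsilon > 0
```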
This concept can be extended to phylogenetics, where three approaches to tree estimation can be evaluated for consistency. The first is Parsimony. Felsenstein (1978) showed that when long-branched taxa are separated by a short internal branch, Parsimony tends to group the long branches together. For this problem, Likelihood methods have been shown to recover the true modeled tree more readily when the long-branched lineages are not sister taxa. Yang (1994) provides evidence that the Maximum Likelihood method of phylogenetic tree estimation is consistent. In that paper he also cites Kendall and Stuart (1979:39-64), who claim that “under regularity conditions, maximum likelihood (ML) estimators are well known to be consistent, that is, all the information in the data is taken into account and properly handled by the method”. The third approach is Bayesian inference, which has been adopted in phylogenetics more recently and for which evaluations of this kind are less well known. With this in mind, a phylogenetic method is a consistent estimator of phylogeny if and only if it is guaranteed to give the correct tree, given that sufficient (possibly infinite) independent data are examined (DeBry, 1992). The purpose of this experiment was therefore to assess the consistency of three methods, run in three different programs, based on four initial trees or ‘reference trees’ (Farris zone, Felsenstein zone, short branches, and long branches) in the four-taxon case.
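The long-branch effect described above can be checked with a small exact calculation. The following is a minimal sketch under stated assumptions, not the procedure used in this experiment: it uses a two-state symmetric model instead of F84, HKY, or GTR, and the change probabilities 0.4 and 0.05 are illustrative values chosen here. With four taxa, each parsimony-informative site supports exactly one of the three possible trees, so with unlimited data Parsimony picks the tree whose supporting pattern is most probable.

```python
from itertools import product

def site_pattern_support(p_a, p_b, p_c, p_d, p_internal):
    """Exact probability of each parsimony-informative site pattern.

    The true tree is ((A,B),(C,D)). Each argument is the probability that
    the character changes state along that branch (two-state symmetric model).
    """
    def branch(parent, child, p_change):
        return p_change if parent != child else 1.0 - p_change

    support = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    # x and y are the states at the two internal nodes; a..d are the tip states.
    for x, y, a, b, c, d in product((0, 1), repeat=6):
        prob = (0.5 * branch(x, y, p_internal)
                * branch(x, a, p_a) * branch(x, b, p_b)
                * branch(y, c, p_c) * branch(y, d, p_d))
        if a == b and c == d and a != c:
            support["AB|CD"] += prob
        elif a == c and b == d and a != b:
            support["AC|BD"] += prob
        elif a == d and b == c and a != b:
            support["AD|BC"] += prob
    return support

LONG, SHORT = 0.4, 0.05  # illustrative change probabilities (assumed values)

# Long branches on the non-sister taxa A and C, short internal branch.
short_internal = site_pattern_support(LONG, SHORT, LONG, SHORT, SHORT)
# Same tips, but the internal branch is as long as the long tips.
long_internal = site_pattern_support(LONG, SHORT, LONG, SHORT, LONG)

for label, support in (("short internal", short_internal),
                       ("long internal", long_internal)):
    favoured = max(support, key=support.get)
    rounded = {tree: round(p, 4) for tree, p in support.items()}
    print(label, rounded, "-> parsimony favours", favoured)
```

Under these assumed values, the short internal branch makes the pattern that groups the two long branches (AC|BD) the most probable one, so adding more sites only reinforces the wrong tree. Lengthening the internal branch restores AB|CD as the favoured tree.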
MATERIALS AND METHODS
The reference trees were designed according to the topologies and zones I wanted to test: the Farris zone, the Felsenstein zone, long branches, and short branches (Figure 1). From these initial topologies, I simulated DNA sequences of three lengths (10, 100, and 1000 characters) using the software Seq-Gen (Rambaut and Grassly, 1997), each under the evolutionary models F84, HKY, and GTR. The tree search was done using three different analyses: Parsimony, maximum likelihood (ML), and Bayesian analysis. The Parsimony analysis was performed with TNT v.1.1 (Goloboff et al., 2008) using the parameters mult 100, tbr, coll 0. The ML analysis was performed with PhyML (Guindon et al., 2010) under the model used to simulate the sequences, with NNI and SPR swapping. The Bayesian analysis was performed in MrBayes v.3.2.4 (Ronquist et al., 2012) with two Monte Carlo chains, 500,000 generations (p reached a value below 10^-3), the GTR model, and a burn-in of 0.5. The reference topologies were compared with those obtained in each analysis in R 3.1.1 (R Core Team, 2014) using the ape (3.1.4) (Paradis, Claude, & Strimmer, 2004) and phangorn (1.99-11) (Schliep, 2011) packages. The function 'treedist' (Steel & Penny, 1993) was used to calculate the branch score difference (Kuhner & Felsenstein, 1994) and the Robinson and Foulds distance (Robinson & Foulds, 1981). The graphs were produced in the same software using the package Plotly. To run the experiment, I wrote the scripts in Bash on Ubuntu 14.04 (see supplementary material).
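To make the two tree-comparison measures concrete, here is a minimal sketch for unrooted four-taxon trees. It is not phangorn's 'treedist' implementation. The helper names, the taxon labels A to D, and the branch lengths are hypothetical. The branch score difference is taken here as the square root of the summed squared branch-length differences, with a branch missing from one tree counted as length zero; whether the square root is applied is a convention assumed for this sketch.

```python
import math

TAXA = frozenset("ABCD")

def canonical(side):
    """Name a bipartition by the side that does not contain taxon 'A'."""
    side = frozenset(side)
    return TAXA - side if "A" in side else side

def four_taxon_tree(cherry, internal, terminal):
    """Map each split of an unrooted four-taxon tree to its branch length.

    `cherry` names one pair of sister taxa, e.g. "AB" for ((A,B),(C,D)).
    """
    tree = {canonical(cherry): internal}
    for taxon, length in terminal.items():
        tree[canonical(taxon)] = length
    return tree

def robinson_foulds(tree1, tree2):
    """Count internal splits found in exactly one of the two trees."""
    # With four taxa, the canonical side of an internal split has size 2.
    internal1 = {s for s in tree1 if len(s) == 2}
    internal2 = {s for s in tree2 if len(s) == 2}
    return len(internal1 ^ internal2)

def branch_score_difference(tree1, tree2):
    """Square root of summed squared branch-length differences over all splits."""
    splits = set(tree1) | set(tree2)
    return math.sqrt(sum((tree1.get(s, 0.0) - tree2.get(s, 0.0)) ** 2
                         for s in splits))

# Hypothetical branch lengths, chosen only for illustration.
tips = {"A": 0.5, "B": 0.05, "C": 0.5, "D": 0.05}
reference = four_taxon_tree("AB", 0.05, tips)
same_topology = four_taxon_tree("AB", 0.08, {**tips, "A": 0.45})
wrong_topology = four_taxon_tree("AC", 0.05, tips)

print(robinson_foulds(reference, same_topology))   # topologies agree
print(robinson_foulds(reference, wrong_topology))  # topologies differ
print(branch_score_difference(reference, same_topology))
print(branch_score_difference(reference, wrong_topology))
```

Because a four-taxon tree has a single internal branch, the Robinson and Foulds distance can only be 0 (same topology) or 2 (different topology), which is why 2 is the maximum value reported in the figures.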
Figure 1. Reference topologies used in this analysis. The short-branches tree is not shown.
RESULTS AND DISCUSSION
First, the branch score difference (bs.difference) was estimated between pairs of topologies. In the Bayesian analysis (Figure 2A), the lowest values of bs.difference were obtained for the topology with short branches. When all the branches are long, the Bayesian analysis has more trouble reconstructing the original tree with the correct nodes and branch lengths. Still, in all four topologies the bs.difference decreases when there are more characters in the dataset. However, a pattern can be observed: in the long-branches, Farris-zone, and Felsenstein-zone trees the bs.difference increases in almost all models. In the Maximum Likelihood (ML) analysis (Figure 3A), the branch score difference tended to decrease in all cases, but the estimated branch lengths remain far from the original ones. Interestingly, with short branches ML reconstructs a tree very similar to the initial tree. For Parsimony (Figure 4), because branch lengths are not available, only the Robinson & Foulds distance was estimated, and the frequency of its maximum value (2) was calculated. Parsimony reconstructs the phylogeny with the correct nodes as the length of the dataset increases.
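The effect of dataset length on Parsimony can be imitated with a small simulation. This is a sketch under simplifying assumptions and not a reproduction of the TNT analysis: it uses a two-state symmetric model, the same illustrative change probability (0.1) on all five branches, 200 replicates, and it counts ties as failures. The failure frequency plays the role of the frequency of the maximum Robinson & Foulds distance.

```python
import random
from itertools import product

TREES = ("AB|CD", "AC|BD", "AD|BC")

def pattern_probabilities(p_change):
    """Probability that one site supports each tree, or none ('other').

    True tree ((A,B),(C,D)), two-state symmetric model, and the same change
    probability on all five branches (a simplifying assumption).
    """
    def branch(parent, child):
        return p_change if parent != child else 1.0 - p_change

    probs = dict.fromkeys(TREES + ("other",), 0.0)
    # x and y are the states at the internal nodes; a..d are the tip states.
    for x, y, a, b, c, d in product((0, 1), repeat=6):
        prob = (0.5 * branch(x, y) * branch(x, a) * branch(x, b)
                * branch(y, c) * branch(y, d))
        if a == b and c == d and a != c:
            probs["AB|CD"] += prob
        elif a == c and b == d and a != b:
            probs["AC|BD"] += prob
        elif a == d and b == c and a != b:
            probs["AD|BC"] += prob
        else:
            probs["other"] += prob
    return probs

def failure_frequency(n_sites, probs, replicates, rng):
    """Fraction of replicates in which parsimony does not uniquely pick AB|CD."""
    labels = list(probs)
    weights = [probs[label] for label in labels]
    failures = 0
    for _ in range(replicates):
        sites = rng.choices(labels, weights=weights, k=n_sites)
        counts = [sites.count(tree) for tree in TREES]
        # Ties (including datasets with no informative site) count as failures.
        if not (counts[0] > counts[1] and counts[0] > counts[2]):
            failures += 1
    return failures / replicates

rng = random.Random(1)              # fixed seed so the run is repeatable
probs = pattern_probabilities(0.1)  # 0.1 is an illustrative value
results = {n: failure_frequency(n, probs, 200, rng) for n in (10, 100, 1000)}
for n_sites, frequency in results.items():
    print(n_sites, frequency)
```

With these assumed settings the failure frequency should drop as the number of sites goes from 10 to 1000, which is the behaviour expected of a consistent method outside the Felsenstein zone.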
Looking only at the correctness of the nodes via the Robinson & Foulds distance, in the Bayesian analysis (Figure 2B) under the GTR and F84 models with 1000 characters the distance is zero for all the topologies, whereas under the HKY model with 1000 characters the distance reached zero only in the Farris and Felsenstein zones. In Parsimony, under GTR and HKY with the same length (1000) the distance is zero for all the topologies; under F84, zero is not reached only for long branches. These results support Steel, Hendy, & Penny (1993) regarding the way in which Parsimony reconstructs and selects the tree. In ML (Figure 3B), only under HKY was it possible to reconstruct trees at distance zero, except in the Farris zone. Under the other two models, topologies equal to the reference were not obtained for long branches and the Farris zone. These differences in identifying the correct tree under many evolutionary conditions (evolutionary models) can be used to judge a method, rather than evaluating only its ability to produce a tree (Huelsenbeck & Hillis, 1993).
Figure 2. Results of the Bayesian analysis. A. Branch score difference for the comparisons with each reference topology, by model and length. Each set of three bars represents the variation within one model: F84, HKY, and GTR, respectively. B. Frequency of the maximum value (2) obtained for the Robinson & Foulds distance. Zero represents identical topologies.
Figure 3. Results of the Maximum Likelihood analysis. A. Branch score difference for the comparisons with each reference topology, by model and length. Each set of three bars represents the variation within one model: F84, HKY, and GTR, respectively. B. Frequency of the maximum value (2) obtained for the Robinson & Foulds distance. Zero represents identical topologies.
Figure 4. Results of the Parsimony analysis. The graph shows the frequency of the maximum value (2) obtained for the Robinson & Foulds distance. Zero represents identical topologies.
REFERENCES
DeBry, R. W. (1992). The consistency of several phylogeny-inference methods under varying evolutionary rates. Molecular Biology and Evolution, 9(3), 537–551.
Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774–786.
Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59(3), 307–321.
Huelsenbeck, J. P., & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. Systematic Biology, 42(3), 247–264.
Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.
Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289–290.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.
Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.
Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.
Yang, Z. (1994). Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Systematic Biology, 43(3), 329–342.
Based on the above, you can choose the method you find most convenient for your analysis and the model that fits best and proves most consistent. Finally, remember: the more data you have, the better.
1 comment:
The "Felsenstein Zone" tree has too long a central branch to result in inconsistency. The central branch should have the same length as the short branches.
If you made the central branch as short as the short branches you might get different results for that case.
Joe Felsenstein