Consistency Test in the Four-Taxon case with the variation of the Evolutionary Model in Parsimony, Maximum Likelihood and Bayesian methods.
In statistics, a consistent estimator has the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges to a value being evaluated or to the ‘correct value’. This means that the distributions of the estimates become more and more concentrated near the true value, so that the probability of the estimator being close to converges to one (Nikulin, M.S. 2001). Consistency is a property of what occurs as the sample size “grows to infinity”. This term can be extrapolated to the phylogenetic area, in which we have three approaches to evaluate the consistency. The first of them is Parsimony; about it (Felsenstein, 1978) postulated that when long branched taxa are separated by a short internal branch, Parsimony would group the long branches together. And in this problem Likelihood methods have been shown to recover the true modeled tree more readily in cases when long branched lineages are not sister taxa. (Yang, 1994) provides evidence that the Maximum Likelihood method of phylogenetic tree estimation is consistent. Also in this paper he cites to Kendall and Stuart, 1979:39-64 who claim that “under regularity conditions, maximum likelihood (ML) estimators are well known to be consistent, that is, all the information in the data is taken into account and properly handled by the method”. And the final approach is Bayesian, who is a little bit recent used in phylogeny and this type of evaluations are little known. So, with this in mind, a phylogenetic method is a consistent estimator of phylogeny if and only if it is guaranteed to give the correct tree, given that sufficient (possibly infinite) independent data are examined (DeBry, 1992). Due to the aforementioned, the purpose of this experiment was to assess the consistency of three methods in three different programs based on 4 initial trees or ‘reference trees’ (Farris zone, Felsenstein zone, short, and long branches) in the four-taxon case.
MATERIALS AND METHODS
The reference trees were designed according to the different topologies and zones I want to test: the Farris zone, Felsenstein zone, long branches, and short branches (Figure 1). Based on these initial topologies, I obtained the simulated DNA sequences, which vary in length (10, 100, and 1000), using the software Seq-gen (Rambaut and Grassly, 1997), each of them following the evolutionary models F84, HKY and GTR. The tree search was done using tree different analyses: Parsimony, maximum likelihood (ML), and Bayesian analysis. The analysis of parsimony was performed with TNT v.1.1 (Goloboff et al., 2001) under the parameters: mult 100, tbr, coll 0. The analysis of ML was performed using PhyML (Guindon et al., 2009) under the model in which were simulated the sequences and NNI and SPR swaping, the Bayesian analysis was performed in MrBayes v.3.2.4. (Ronquist et al., 2012) with two Monte-carlo chains, 500.000 generations (p reach a value below 10-3), GTR model and burnin of 0.5. The comparisons between reference topologies and the obtained in each analysis were performed in the software R 3.1.1 (R Core Team, 2014) using the ape (3.1.4) (Paradis, Claude, & Strimmer, 2004) and phangorn (1.99-11) (Schliep, 2011) packages. The function 'treedist' (Steel M. A. and Penny P. (1993) calculated the branch score difference (Kuhner & Felsenstein, 1994) and the Robinson and Foulds distance (Robinson & Foulds, 1981). The graphs were performed in the same software using the package Plotly ( ). In order to perform the experiment, I wrote the scripts using the command language of BASH, for Ubuntu 14.04 (see supplementary material).
Figure 1. Reference topologies used in this analysis. Short branches tree is not showed.
RESULTS AND DISCUSSION
Initially, when the branch score difference (bs.difference) was estimated between pairs of topologies, in the graphs of the Bayesian analysis (Figure 2, A), shows that the low values of bs.difference were obtained in topologies with short branches. When all the branches are long, Bayesian have more problems to reconstruct the original tree with the correctly nodes and branch longitude. But in the four topologies you can see the decreasing in the bs.difference when there are more characters in the dataset. However, a pattern can be observed that is, in long branches, Farris zone and Felsenstein zone the bs.difference increases in almost all models. Moreover, in Maximum Likelihood (ML) analysis (Figure3, A), in all the cases the tendency was to decrease the branch score difference, but it is so far to the originals branch longitudes. In an interesting way, in short branches ML reconstruct a tree very similar to the initial tree. For parsimony (Figure 4) because of the lack of branch longitude, only the Robinson & Foulds distance was estimated and calculated its maximum value =2 frequency. Parsimony reconstruct the phylogeny with de correctly nodes as the length of the data set increases. If we look only the correctness of the nodes via Robinson & Foulds distance, in Bayesian (Figure 2 B) in GTR and F84 models with 1000 characters the distance is zero for all the topologies, but in HKY model with 1000 characters the distance reached zero only in the Farris and Felsenstein Zone. In Parsimony GTR and HKY with the before longitude (1000) the distance is zero for all the topologies, in F84 just in long branches zero is not reached, these results can sustain to (Steel, Hendy, & Penny, 1993) like they said “in the way in which Parsimony reconstruct and select the tree. In ML (Figure 3 B) only with HKY was possible to reconstruct trees of distance zero except in the Farris zone. With the other two models in long branches and Farris-zone equal topologies were not obtained. These differences in the identification of the correct tree under many evolutionary conditions (evolutionary models) can be used to judge the method more than evaluated just the ability of making the tree (Huelsenbeck & Hillis, 1993).
Figure 2. Results of Bayesian analysis. A. The graph shows the branch score difference for the comapisons with each reference topology by model and length. Every three bars represent the variation of a model F84, HKY and GTR respectively. B. Frecuency of the máximum value obtained in Robinson Foulds distance (2). Zero represents identical topologies.
Figure 3. Results of Maximum likelihood analysis. A. The graph shows the branch score difference for the comapisons with each reference topology by model and length. Every three bars represent the variation of a model F84, HKY and GTR respectively. B. Frecuency of the máximum value obtained in Robinson Foulds distance (2). Zero represents identical topologies.
Figure 4. Results of Parsimony analysis. The graph shows the frecuency of the máximum value obtained in Robinson Foulds distance (2). Zero represents identical topologies.
DeBry, R. W. (1992). The consistency of several phylogeny-inference methods under varying evolutionary rates. Molecular Biology and Evolution, 9(3), 537–551.
Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.
Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic biology, 59(3), 307-321.
Huelsenbeck, J. P., & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. Systematic Biology, 42(3), 247–264.
Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.
Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289–290.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.
Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.
Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.
Yang, Z. (1994). Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Systematic Biology, 43(3), 329–342.
Based on the above, you can choose the method that you believe convenient for your analysis, the model that fixed better and show more consistent and finally remember if you have more data... is better.