lunes, 23 de febrero de 2015

Consistency Test in the Four-taxon Case

Assessing the influence of character length and evolutionary model

Introduction

Consistency refers to logical and numerical coherence, and this property is attributable to an estimator if it converges in probability to its estimant as sample size increases. Steel (2015), defines it within the following lemma: “Suppose k sites are generated i.i.d. by s (T,θ). Under the sufficient condition (*) for statistical consistency, the bootstrap support of every edge e of T−ρ converges in probability to 1 as k→∞. Moreover, the bootstrap support for T−ρ converges in probability to 1 as k→∞.” where T is the tree, θ represents a vector of continuous parameters, ρ represents the root vertex, T−ρ appears as an unrooted tree. The aforementioned implies that the distributions of the estimates become more and more concentrated towards the true value, so that the probability of the estimator being close to this value converges to one. The problem of consistency can be assessed from three approches: Parsimony, Maximum Likelihood (ML), and Bayesian analysis. (Felsestein, 1978) postulated that when long branched taxa are separated by a short internal branch, parsimony would group the long branches together. The likelihood methods have shown to overcome this problem recovering the true modeled tree more readily in cases when long branched lineages are not sister taxa. (Yang, 1994) on the other hand, provides evidence that the Maximum likelihood method is consistent. Finally, the Bayesian approach, whose evaluations have been performed and published rather recently, appears to be a consistent estimator of phylogeny only if it is guaranteed the provided tree to be the correct one, given a large amount (possibly infinite) of independent data to be examined.

Due to the differences that these three approaches display, the purpose of this experiment was to assess their consistency based on 4 initial topologies (Farris zone, Felsestein zone, short, and long branches) in the four-taxon case.

Materials and Methods

The reference trees were designed according to the different topologies and zones I intended to test: the Farris zone, Felsestein zone, long branches, and short branches. Based on these initial topologies, I obtained the simulated DNA sequences which vary in length 10, 100, and 1000 characters, using the software Seq-gen (Rambaut and Grassly, 1997), each of them following the evolutive models F84, HKY and GTR. The tree search was done using tree different analyses: parsimony, maximum likelihood (ML), and Bayesian analysis. The analysis of parsimony was performed with TNT v.1.1 (Goloboff et al., 2001) under the parameters: mult 100, tbr, coll 0. The analysis of ML was performed using PhyML (Guindon et al., 2009) under the model in which the sequences were simulated and NNI and SPR swaping. The Bayesian analysis was performed in MrBayes v.3.2.4. (Ronquist et al., 2012) with two Monte-carlo chains, 500.000 generations (p reach was below 10-3), GTR model and burnin of 0.05. The comparisons between reference topologies and the obtained in each analysis were performed in the software R 3.1.1. (R Core Team, 2014) using the packages ape 3.1.4 (Paradis and Strimmer, 2004), and phangorn (1.99-11) (Schiliep, 2011) using the ape and phangorn packages. The function 'treedist' (Steel and Penny, 1993) calculated the branch score difference (Kuhner and Felsestein, 1994), and the Robinson & Foulds distance (Robinson and Foulds, 1981). Since parsimony does not recover branch longitude, only this distance was estimated for the case, using the frequency of its maximum value of 2. The graphs were performed in the same software using the package Plotly (https://plot.ly/r/). In order to perform the experiment, I wrote the scripts using the command language of BASH, for Ubuntu 14.04 (see supplementary material).

Results and Discusion

Fig 1. Results of the estimation of Robinson & Foulds index and Branch score difference for the three methodologies. A. Parsimony. B.C. Maximum Likelihood. D.E. Bayesian analysis.



Robinson & Foulds distance and approach: what about correctness?
Parsimony reconstructed the phylogeny with the correct nodes as the length of the data set increased. The models HKY and GTR with a character length of 1000, showed a distance of zero for all the topologies, but in F84 it this value was not reached, except when the reference topology was short-branched and Farris-zone (Fig. 1. A). These latter results can sustain to (Steel and Penny, 1993). In ML only the HKY reconstructed trees of distance zero except in the Farris zone. With F84 and GTR models in long branches and Farris-zone it was not possible to obtain equal topologies(Fig. 1.B). In Bayesian analysis, the frequency of the distance of 2 was above 20, which makes a first difference regarding the other two methods. For the F84 and GTR models with 1000 characters, the distance was zero for all the topologies, while in the HKY model with the same character number, its distance reached zero only in the Farris and Felsestein zones (Fig. 1.D). These differences in the identification of the correct tree under many evolutionary conditions (evolutionary models) can be used to ponder the method more than only evaluating the ability of making the tree (Huelsenbeck and Hillis, 1993).


Branch score and approach: is it then a matter of length?
Overall, in ML analysis, the tendency was to decrease the branch score difference as the number of characters increased, but it greatly differed from the original branch lengths. Interestingly, in short branches ML reconstructed a very similar tree comparing to the initial one (Fig. 1.C). The Bayesian analysis showed low values of the branch score difference in the short branched topologies (Fig 1. E.). When all the branches were long, Bayesian had more problems to reconstruct the original topology with the correct nodes and branch length. However, in the four topologies it is evident that the bs difference decreased when there were more characters in the dataset. Overall, the Farris zone and Felsestein zone showed and increase in the bs difference in long branches for almost every model.


References


The International Statistical Institute, "The Oxford Dictionary of Statistical Terms", edited by Yadolah Dodge, Oxford University Press, 2003.

http://stats.oecd.org/glossary/detail.asp?ID=5125

DeBry, R. W. (1992). The consistency of several phylogeny-inference methods under varying evolutionary rates. Molecular Biology and Evolution, 9(3), 537–551.

Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.

Guindon, S., Delsuc, F., Dufayard, J.F., Gascuel, O. (2009). Estimating maximum likelihood phylogenies with PhyML. Methods Mol Biol. 537:113-37.

Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic biology, 59(3), 307-321.

Huelsenbeck, J. P., & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. Systematic Biology, 42(3), 247–264.

Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.

Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289–290.

R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.

Rambaut, A. and Grassly, N. C. (1997) Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13: 235-238.

Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., … Huelsenbeck, J. P. (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology, 61(3), 539–542.

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.
Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.

Yang, Z. (1994). Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Systematic Biology, 43(3), 329–342.


Supplementary Material: The scripts and reference topologies are available at: https://drive.google.com/file/d/0B1NHg7c5hIVbTU1uQy11amhiZ3c/view?usp=sharing. Authors: Ayus-Ortíz, V. González-Piñeres, N.

domingo, 22 de febrero de 2015

Consistency Test in the Four-Taxon case with the variation of the Evolutionary Model in Parsimony, Maximum Likelihood and Bayesian methods.



Consistency Test in the Four-Taxon case with the variation of the Evolutionary Model in Parsimony, Maximum Likelihood and Bayesian methods.

INTRODUCTION
In statistics, a consistent estimator has the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges to a value being evaluated or to the ‘correct value’. This means that the distributions of the estimates become more and more concentrated near the true value, so that the probability of the estimator being close to converges to one (Nikulin, M.S. 2001). Consistency is a property of what occurs as the sample size “grows to infinity”.  This term can be extrapolated to the phylogenetic area, in which we have three approaches to evaluate the consistency. The first of them is Parsimony; about it (Felsenstein, 1978)  postulated that when long branched taxa are separated by a short internal branch, Parsimony would group the long branches together. And in this problem Likelihood methods have been shown to recover the true modeled tree more readily in cases when long branched lineages are not sister taxa. (Yang, 1994) provides evidence that the Maximum Likelihood method of phylogenetic tree estimation is consistent. Also in this paper he cites to Kendall and Stuart, 1979:39-64 who claim that “under regularity conditions, maximum likelihood (ML) estimators are well known to be consistent, that is, all the information in the data is taken into account and properly handled by the method”. And the final approach is Bayesian, who is a little bit recent used in phylogeny and this type of evaluations are little known. So, with this in mind, a phylogenetic method is a consistent estimator of phylogeny if and only if it is guaranteed to give the correct tree, given that sufficient (possibly infinite) independent data are examined (DeBry, 1992). Due to the aforementioned, the purpose of this experiment was to assess the consistency of three methods in three different programs based on 4 initial trees or ‘reference trees’ (Farris zone, Felsenstein zone, short, and long branches) in the four-taxon case.


MATERIALS AND METHODS

The reference trees were designed according to the different topologies and zones I want to test:  the Farris zone, Felsenstein zone, long branches, and short branches (Figure 1). Based on these initial topologies, I obtained the simulated DNA sequences, which vary in length (10, 100, and 1000), using the software Seq-gen (Rambaut and Grassly, 1997), each of them following the evolutionary models F84, HKY and GTR. The tree search was done using tree different analyses: Parsimony, maximum likelihood (ML), and Bayesian analysis. The analysis of parsimony was performed with TNT v.1.1 (Goloboff et al., 2001) under the parameters: mult 100, tbr, coll 0. The analysis of ML was performed using PhyML (Guindon et al., 2009) under the model in which were simulated the sequences and NNI and SPR swaping, the Bayesian analysis was performed in MrBayes v.3.2.4. (Ronquist et al., 2012) with two Monte-carlo chains, 500.000 generations (p reach a value below 10-3), GTR model and burnin of 0.5. The comparisons between reference topologies and the obtained in each analysis were performed in the software R 3.1.1 (R Core Team, 2014) using the ape (3.1.4) (Paradis, Claude, & Strimmer, 2004) and phangorn (1.99-11) (Schliep, 2011) packages. The function 'treedist' (Steel M. A. and Penny P. (1993) calculated the branch score difference (Kuhner & Felsenstein, 1994) and the Robinson and Foulds distance (Robinson & Foulds, 1981). The graphs were performed in the same software using the package Plotly (  ). In order to perform the experiment, I wrote the scripts using the command language of BASH, for Ubuntu 14.04 (see supplementary material).


Figure 1. Reference topologies used in this analysis. Short branches tree is not showed.


RESULTS AND DISCUSSION

 Initially, when the branch score difference (bs.difference) was estimated between pairs of topologies, in the graphs of the Bayesian analysis (Figure 2, A), shows that the low values of bs.difference were obtained in topologies with short branches. When all the branches are long, Bayesian have more problems to reconstruct the original tree with the correctly nodes and branch longitude. But in the four topologies you can see the decreasing in the bs.difference when there are more characters in the dataset. However, a pattern can be observed that is, in long branches, Farris zone and Felsenstein zone the bs.difference increases in almost all models. Moreover, in Maximum Likelihood (ML) analysis (Figure3, A), in all the cases the tendency was to decrease the branch score difference, but it is so far to the originals branch longitudes. In an interesting way, in short branches ML reconstruct a tree very similar to the initial tree. For parsimony (Figure 4) because of the lack of branch longitude, only the Robinson & Foulds distance was estimated and calculated its maximum value =2 frequency. Parsimony reconstruct the phylogeny with de correctly nodes as the length of the data set increases. If we look only the correctness of the nodes via Robinson & Foulds distance, in Bayesian (Figure 2 B) in GTR and F84 models with 1000 characters the distance is zero for all the topologies, but in HKY model with 1000 characters the distance reached zero only in the Farris and Felsenstein Zone. In Parsimony GTR and HKY with the before longitude (1000) the distance is zero for all the topologies, in F84 just in long branches zero is not reached, these results can sustain to (Steel, Hendy, & Penny, 1993) like they said “in the way in which Parsimony reconstruct and select the tree. In ML (Figure 3 B) only with HKY was possible to reconstruct trees of distance zero except in the Farris zone. With the other two models in long branches and Farris-zone equal topologies were not obtained. These differences in the identification of the correct tree under many evolutionary conditions (evolutionary models) can be used to judge the method more than evaluated just the ability of making the tree (Huelsenbeck & Hillis, 1993)


Figure 2. Results of Bayesian analysis. A. The graph shows the branch score difference for the comapisons with each reference topology by model and length. Every three bars represent the variation of a model F84, HKY and GTR respectively. B. Frecuency of the máximum value obtained in Robinson Foulds distance (2). Zero represents identical topologies.
 




Figure 3. Results of Maximum likelihood analysis. A. The graph shows the branch score difference for the comapisons with each reference topology by model and length. Every three bars represent the variation of a model F84, HKY and GTR respectively. B. Frecuency of the máximum value obtained in Robinson Foulds distance (2). Zero represents identical topologies.

  

Figure 4. Results of Parsimony analysis. The graph shows the frecuency of the máximum value obtained in Robinson Foulds distance (2). Zero represents identical topologies.
 


REFERENCES

DeBry, R. W. (1992). The consistency of several phylogeny-inference methods under varying evolutionary rates. Molecular Biology and Evolution, 9(3), 537–551.
Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.

Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic biology, 59(3), 307-321.
 
Huelsenbeck, J. P., & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. Systematic Biology, 42(3), 247–264.
Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.
Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289–290.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.
Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.
Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.
Yang, Z. (1994). Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Systematic Biology, 43(3), 329–342.


Based on the above, you can choose the method that you believe convenient for your analysis, the model that fixed better and show more consistent and finally remember if you have more data... is better.