Monday, February 23, 2015

Consistency Test in the Four-taxon Case

Assessing the influence of character length and evolutionary model

Introduction

Consistency refers to logical and numerical coherence; in statistics, an estimator is said to be consistent if it converges in probability to its estimand as the sample size increases. Steel (2015) states the property for phylogenetic trees in the following lemma: “Suppose k sites are generated i.i.d. by s (T,θ). Under the sufficient condition (*) for statistical consistency, the bootstrap support of every edge e of T−ρ converges in probability to 1 as k→∞. Moreover, the bootstrap support for T−ρ converges in probability to 1 as k→∞.” Here T is the tree, θ is a vector of continuous parameters, ρ is the root vertex, and T−ρ is the corresponding unrooted tree. In other words, the distribution of the estimates becomes more and more concentrated around the true value, so that the probability of the estimator being close to this value converges to one.

The problem of consistency can be assessed from three approaches: parsimony, maximum likelihood (ML), and Bayesian analysis. Felsenstein (1978) showed that when long-branched taxa are separated by a short internal branch, parsimony tends to group the long branches together. Likelihood methods have been shown to overcome this problem, recovering the true model tree more readily when the long-branched lineages are not sister taxa. Yang (1994), in turn, provides evidence that the maximum likelihood method is consistent. Finally, the Bayesian approach, whose evaluations have been performed and published rather recently, appears to be a consistent estimator of phylogeny only if it is guaranteed to recover the correct tree given a large (possibly infinite) amount of independent data.
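The notion of consistency used in this introduction can be written out explicitly. The block below is only a notational sketch: the hats and the subscript k (an estimate obtained from k sites) are assumptions introduced here and do not come from Steel (2015). The first line is the usual definition of convergence in probability for the continuous parameters, and the second is the analogous statement for the tree topology. The arrow with a label needs the amsmath package.

```latex
\[
  \hat{\theta}_k \xrightarrow{\;p\;} \theta
  \quad\Longleftrightarrow\quad
  \lim_{k\to\infty}
  \Pr\bigl(\lVert \hat{\theta}_k - \theta \rVert > \varepsilon\bigr) = 0
  \quad \text{for every } \varepsilon > 0 ,
\]
\[
  \lim_{k\to\infty} \Pr\bigl(\hat{T}_k = T\bigr) = 1 .
\]
```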

Given the differences among these three approaches, the purpose of this experiment was to assess their consistency in the four-taxon case, starting from four initial topologies: the Farris zone, the Felsenstein zone, short branches, and long branches.
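The Felsenstein zone mentioned above can be illustrated with a small exact calculation. The sketch below is not part of the experiment: it assumes a two-state symmetric model, which is much simpler than the F84, HKY, and GTR models used here, and the change probabilities p and q are purely illustrative values. With four taxa, parsimony prefers the topology supported by the most frequent informative site pattern, so comparing the three pattern probabilities shows which tree parsimony would choose as the number of characters grows.

```python
from itertools import product


def edge_prob(change_prob, x, y):
    """Probability of going from state x to state y along one branch."""
    return change_prob if x != y else 1.0 - change_prob


def informative_pattern_probs(p, q):
    """Exact probabilities of the three parsimony-informative site patterns.

    True tree: ((A,B),(C,D)). The branches leading to A and C change state
    with probability p (long branches); the branches leading to B and D and
    the internal branch change state with probability q (short branches).
    """
    probs = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    # Enumerate the tip states a, b, c, d and the internal node states u, v.
    for a, b, c, d, u, v in product((0, 1), repeat=6):
        pr = (0.5  # uniform state distribution at internal node u
              * edge_prob(p, u, a) * edge_prob(q, u, b)
              * edge_prob(q, u, v)
              * edge_prob(p, v, c) * edge_prob(q, v, d))
        if a == b and c == d and a != c:
            probs["AB|CD"] += pr
        elif a == c and b == d and a != b:
            probs["AC|BD"] += pr
        elif a == d and b == c and a != b:
            probs["AD|BC"] += pr
    return probs


def parsimony_limit_tree(p, q):
    """Topology favoured by parsimony as the number of sites grows."""
    probs = informative_pattern_probs(p, q)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Illustrative values only: moderate versus very long branches to A and C.
    for p, q in [(0.1, 0.05), (0.4, 0.05)]:
        print(p, q, parsimony_limit_tree(p, q))
```

With the moderate value of p the true grouping AB|CD is favoured, while with the large value the two long branches are grouped together (AC|BD), which is the behaviour described by Felsenstein (1978).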

Materials and Methods

The reference trees were designed according to the different topologies and zones I intended to test: the Farris zone, the Felsenstein zone, long branches, and short branches. From these initial topologies I simulated DNA sequences of 10, 100, and 1000 characters with Seq-Gen (Rambaut and Grassly, 1997), each under the evolutionary models F84, HKY, and GTR.

The tree search was done with three different analyses: parsimony, maximum likelihood (ML), and Bayesian analysis. The parsimony analysis was performed in TNT v.1.1 (Goloboff et al., 2008) with the parameters mult 100, tbr, coll 0. The ML analysis was performed in PhyML (Guindon et al., 2009) under the model with which the sequences were simulated, using NNI and SPR swapping. The Bayesian analysis was performed in MrBayes v.3.2.4 (Ronquist et al., 2012) with two Monte Carlo chains, 500,000 generations (the p value reached was below 10^-3), the GTR model, and a burn-in of 0.05.

The reference topologies were compared with those obtained in each analysis in R 3.1.1 (R Core Team, 2014) using the packages ape 3.1.4 (Paradis et al., 2004) and phangorn 1.99-11 (Schliep, 2011). The function 'treedist' (Steel and Penny, 1993) calculated the branch score difference (Kuhner and Felsenstein, 1994) and the Robinson & Foulds distance (Robinson and Foulds, 1981). Since parsimony does not recover branch lengths, only the Robinson & Foulds distance was estimated for that method, summarized as the frequency of its maximum value of 2. The plots were produced in the same software with the package Plotly (https://plot.ly/r/). To run the experiment, I wrote the scripts in Bash for Ubuntu 14.04 (see supplementary material).
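To make the two distances concrete, here is a minimal Python sketch for unrooted four-taxon trees. It is not the R code used in the experiment (that relied on 'treedist' from phangorn), and every name and branch length in it is hypothetical. A tree is stored as a mapping from bipartitions to branch lengths. The Robinson & Foulds distance is taken as the number of bipartitions found in only one of the two trees, and the branch score difference as the square root of the summed squared branch-length differences, with a missing branch counted as length zero. The square root is an assumption about the convention; some definitions report the sum of squares itself.

```python
import math

TAXA = frozenset("ABCD")


def split(*side):
    """Canonical key for a bipartition: the unordered pair {side, complement}."""
    side = frozenset(side)
    return frozenset([side, TAXA - side])


def make_tree(internal_pair, internal_length, terminal_lengths):
    """Unrooted four-taxon tree as a dict mapping bipartitions to lengths."""
    tree = {split(taxon): length for taxon, length in terminal_lengths.items()}
    tree[split(*internal_pair)] = internal_length
    return tree


def rf_distance(tree1, tree2):
    """Number of bipartitions present in exactly one tree (0 or 2 here)."""
    return len(set(tree1) ^ set(tree2))


def branch_score_difference(tree1, tree2):
    """Square root of summed squared branch-length differences."""
    splits = set(tree1) | set(tree2)
    return math.sqrt(sum((tree1.get(s, 0.0) - tree2.get(s, 0.0)) ** 2
                         for s in splits))


def rf_max_frequency(reference, estimates):
    """How many estimated trees sit at the maximum distance of 2."""
    return sum(rf_distance(reference, est) == 2 for est in estimates)


# Hypothetical branch lengths: long, non-sister branches lead to A and C.
tips = {"A": 1.0, "B": 0.05, "C": 1.0, "D": 0.05}
reference = make_tree("AB", 0.05, tips)
same_topology = make_tree("CD", 0.02, tips)   # same split as AB|CD
wrong_topology = make_tree("AC", 0.05, tips)  # long branches grouped

if __name__ == "__main__":
    print(rf_distance(reference, same_topology))
    print(rf_distance(reference, wrong_topology))
    print(branch_score_difference(reference, wrong_topology))
```

Because a four-taxon tree has a single internal branch, the distance can only be 0 (same topology) or 2 (different topology), which is why the parsimony results are summarized as the frequency of the value 2.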

Results and Discussion

Fig. 1. Robinson & Foulds distance and branch score difference estimated for the three methods. A. Parsimony. B, C. Maximum likelihood. D, E. Bayesian analysis.



Robinson & Foulds distance and approach: what about correctness?
Parsimony reconstructed the phylogeny with the correct nodes as the length of the data set increased. Under the HKY and GTR models with 1000 characters, the distance was zero for all the topologies, but under F84 this value was not reached except when the reference topology was short-branched or in the Farris zone (Fig. 1.A). These latter results are in line with Steel et al. (1993). In ML, only the HKY model yielded trees at distance zero, except in the Farris zone. Under the F84 and GTR models, with long branches and in the Farris zone, it was not possible to recover the reference topology (Fig. 1.B). In the Bayesian analysis, the frequency of the distance of 2 was above 20, a first difference with respect to the other two methods. For the F84 and GTR models with 1000 characters, the distance was zero for all the topologies, while for the HKY model with the same number of characters it reached zero only in the Farris and Felsenstein zones (Fig. 1.D). These differences in identifying the correct tree under many evolutionary conditions (evolutionary models) can be used to weigh a method beyond merely evaluating its ability to produce a tree (Huelsenbeck and Hillis, 1993).


Branch score and approach: is it then a matter of length?
Overall, in the ML analysis the branch score difference tended to decrease as the number of characters increased, although the estimated branch lengths still differed greatly from the original ones. Interestingly, with short branches ML reconstructed a tree very similar to the initial one (Fig. 1.C). The Bayesian analysis showed low values of the branch score difference for the short-branched topologies (Fig. 1.E). When all the branches were long, the Bayesian analysis had more difficulty reconstructing the original topology with the correct nodes and branch lengths. However, for all four topologies it is evident that the branch score difference decreased as more characters were included in the data set. Overall, the Farris zone and the Felsenstein zone showed an increase in the branch score difference in long branches for almost every model.


References


The International Statistical Institute, "The Oxford Dictionary of Statistical Terms", edited by Yadolah Dodge, Oxford University Press, 2003.

http://stats.oecd.org/glossary/detail.asp?ID=5125

DeBry, R. W. (1992). The consistency of several phylogeny-inference methods under varying evolutionary rates. Molecular Biology and Evolution, 9(3), 537–551.

Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774–786.

Guindon, S., Delsuc, F., Dufayard, J. F., & Gascuel, O. (2009). Estimating maximum likelihood phylogenies with PhyML. Methods in Molecular Biology, 537, 113–137.

Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59(3), 307–321.

Huelsenbeck, J. P., & Hillis, D. M. (1993). Success of phylogenetic methods in the four-taxon case. Systematic Biology, 42(3), 247–264.

Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.

Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289–290.

R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.

Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.

Rambaut, A., & Grassly, N. C. (1997). Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences, 13, 235–238.

Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., … Huelsenbeck, J. P. (2012). MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology, 61(3), 539–542.

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.

Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.

Yang, Z. (1994). Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Systematic Biology, 43(3), 329–342.


Supplementary Material: The scripts and reference topologies are available at: https://drive.google.com/file/d/0B1NHg7c5hIVbTU1uQy11amhiZ3c/view?usp=sharing. Authors: Ayus-Ortíz, V. González-Piñeres, N.