Friday, March 18, 2016

Statistical consistency of Maximum Likelihood and Maximum Parsimony

Statistical consistency is one of the factors most often cited in favor of statistical estimation methods in phylogenetic reconstruction. It is defined as the property of an estimator whereby, as the amount of data (characters) tends to infinity, the inferred topology converges to the true phylogeny (Truszkowski & Goldman, 2015).

The debate about inconsistency has a long history. Since Felsenstein (1978) raised the issue with a simple evolutionary model, much work has been done to evaluate the statistical properties of phylogenetic estimators under a range of scenarios. Much of that effort has gone into showing that model-based approaches such as Maximum Likelihood (and Bayesian Inference) have advantages over Parsimony (Wheeler, 2011).

Here I test the statistical consistency of Maximum Likelihood and two flavours of parsimony reconstruction across different topologies and models of nucleotide substitution, using four-taxon trees in a simulation experiment.

Materials and methods

To analyze statistical consistency, nucleotide sequences were first simulated with Seq-Gen (Rambaut & Grassly, 1997) under three models of DNA evolution: JC, HKY, and GTR. Each model was analyzed on four topologies (the original topologies) whose branch lengths (q and p) differ as defined below:

• Farris Zone. q = 0.6, p = 0.1. Tips with long branches are sisters.
• Felsenstein Zone. p = 0.6, q = 0.1. Tips with branches of different length are sisters.
• Long Branches. p = q = 0.6.
• Short Branches. p = q = 0.1.
• The central branch of every topology always takes the value of q.
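
As an illustration only, the four configurations can be written as Newick strings of the kind a simulator such as Seq-Gen reads. This sketch is not the script used for this post: the function names, the tip labels A to D, and the assignment of p and q to particular tips are my assumptions, read off the descriptions above.

```python
def four_taxon_newick(a, b, c, d, internal):
    """Unrooted four-taxon tree AB|CD with the given branch lengths.

    A and B hang from one internal node, C and D from the other;
    `internal` is the central branch joining the two nodes.
    """
    return f"(A:{a},B:{b},(C:{c},D:{d}):{internal});"


def zone_trees():
    """Newick strings for the four zones (tip assignment is an assumption)."""
    long_, short = 0.6, 0.1
    return {
        # Farris zone: q = 0.6, p = 0.1; the two long tips (A, B) are sisters,
        # and the central branch takes the value of q.
        "farris": four_taxon_newick(long_, long_, short, short, long_),
        # Felsenstein zone: p = 0.6, q = 0.1; each pair of sisters has one
        # long and one short tip, and the central branch takes the value of q.
        "felsenstein": four_taxon_newick(long_, short, long_, short, short),
        # All branches long (p = q = 0.6) or all short (p = q = 0.1).
        "long": four_taxon_newick(long_, long_, long_, long_, long_),
        "short": four_taxon_newick(short, short, short, short, short),
    }


if __name__ == "__main__":
    for zone, newick in zone_trees().items():
        print(zone, newick)
```

Each string could then be saved to a file and passed to the simulator once per model and sequence length.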

For the maximum parsimony analysis, two approaches were used: static and dynamic homology. The static homology analysis was conducted in TNT (Goloboff et al., 2008), and POY (Varón et al., 2010) was used for the dynamic homology analysis. Maximum Likelihood inferences were carried out in PhyML (Guindon et al., 2010).

To assess differences between the inferred topologies and the originals, the symmetric difference (Steel & Penny, 1993) was used, as implemented in the RF.dist function of the R package phangorn (Schliep, 2011). The frequency with which the true tree was recovered was the measure of consistency used here.
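
The measure itself can be sketched in a few lines. The analysis here relied on RF.dist from phangorn; the Python below is only a minimal stand-in to show what is being counted, with trees written as nested tuples of tip names and all function names hypothetical.

```python
def leaf_set(node):
    """Set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def bipartitions(tree):
    """Non-trivial splits of an unrooted tree given as nested tuples."""
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)  # every split is stored as the side without the anchor
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = all_taxa - clade if anchor in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Symmetric difference: number of splits present in only one tree."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def recovery_frequency(true_tree, inferred_trees):
    """Fraction of inferred trees at distance 0 from the true tree."""
    hits = sum(1 for t in inferred_trees if rf_distance(true_tree, t) == 0)
    return hits / len(inferred_trees)


if __name__ == "__main__":
    true_tree = (("A", "B"), ("C", "D"))
    replicates = [
        (("A", "B"), ("C", "D")),
        (("B", "A"), ("D", "C")),
        (("A", "C"), ("B", "D")),
        ("A", ("B", ("C", "D"))),
    ]
    print(recovery_frequency(true_tree, replicates))
```

With four taxa there is a single internal split, so the distance can only be 0 (same topology) or 2 (different topology), and the recovery frequency is simply the share of replicates at distance 0.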

Results and Discussion

The results show that the maximum parsimony analyses are consistent in three of the four zones (topologies) evaluated: the Farris zone, short branches, and long branches. Maximum likelihood was likewise consistent in three zones: long branches, short branches, and the Felsenstein zone (Figs. 1, 2, 3).
There is no dramatic difference among models in the frequency with which the right tree is recovered.

The tendency is to converge to the right tree in three of the four zones as the available data increase, regardless of the model used to generate the data.

Fig. 1. Frequency of recovered trees under the JC model. The X axis shows the log of the number of base pairs.
Fig. 2. Frequency of recovered trees under the HKY model. The X axis shows the log of the number of base pairs.

Fig. 3. Frequency of recovered trees under the GTR model. The X axis shows the log of the number of base pairs.

It can be seen that direct optimization leads to inconsistency much faster than traditional static parsimony does under the JC model. These results confirm that, regardless of the concept of homology used, parsimony is inconsistent (Warnow, 2012) in the Felsenstein zone under the Jukes-Cantor model and under more complex models.
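
Why static parsimony fails in the Felsenstein zone can be checked without simulation, at least for JC. With four taxa and aligned sequences, only sites where two tips share one state and the other two share another discriminate between topologies, so parsimony ends up choosing the topology whose supporting pattern is most frequent. The sketch below computes the expected frequencies of those three pattern classes by brute-force enumeration. It is not part of the analysis reported here; the function names and the assignment of p and q to tips A to D are assumptions.

```python
import itertools
import math

STATES = "ACGT"


def jc_prob(parent, child, t):
    """JC69 transition probability along a branch of length t
    (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def pattern_class_probs(a, b, c, d, internal):
    """Expected frequencies of the informative site patterns when the
    true tree is AB|CD with tip branches a, b, c, d and a central branch.

    'xxyy' supports AB|CD (the true tree), 'xyxy' supports AC|BD and
    'xyyx' supports AD|BC.
    """
    probs = {"xxyy": 0.0, "xyxy": 0.0, "xyyx": 0.0}
    # u and v are the two internal nodes; sa..sd are the tip states.
    for u, v, sa, sb, sc, sd in itertools.product(STATES, repeat=6):
        p = (0.25 * jc_prob(u, sa, a) * jc_prob(u, sb, b)
             * jc_prob(u, v, internal)
             * jc_prob(v, sc, c) * jc_prob(v, sd, d))
        if sa == sb and sc == sd and sa != sc:
            probs["xxyy"] += p
        elif sa == sc and sb == sd and sa != sb:
            probs["xyxy"] += p
        elif sa == sd and sb == sc and sa != sb:
            probs["xyyx"] += p
    return probs


if __name__ == "__main__":
    # Felsenstein zone as read from the text: p = 0.6 on A and C,
    # q = 0.1 on B, D and the central branch.
    print("Felsenstein zone:", pattern_class_probs(0.6, 0.1, 0.6, 0.1, 0.1))
    # Short branches: every branch equal to 0.1.
    print("Short branches:  ", pattern_class_probs(0.1, 0.1, 0.1, 0.1, 0.1))
```

Under these assumed branch lengths the pattern supporting the wrong grouping (xyxy, which joins the two long tips) is expected to be more frequent than the pattern supporting the true tree, so adding sites only strengthens the wrong answer; with all branches short the true pattern dominates.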
Statistical consistency is a desirable property of estimators in phylogenetic reconstruction (Wheeler, 2011). Under different branch-length configurations the estimators behave similarly, and all fail in at least one zone, even under simple models.
If the three (or two) approaches assessed here behave similarly (each failing to achieve perfect consistency in at least one zone), and perfect consistency is never reached, one may choose freely among them, knowing that under some branch-length configurations any of them can yield the right tree when enough data are available.

References

Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.

Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology, 59(3), 307-321.

Rambaut, A., & Grassly, N. C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences: CABIOS, 13(3), 235-238.

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.

Steel, M. A., Hendy, M. D., & Penny, D. (1993). Parsimony can be consistent! Systematic Biology, 581–587.

Truszkowski, J., & Goldman, N. (2015). Maximum likelihood phylogenetic inference is consistent on multiple sequence alignments, with or without gaps. Systematic Biology, syv089.

Varón, A., Vinh, L. S., & Wheeler, W. C. (2010). POY version 4: phylogenetic analysis using dynamic homologies. Cladistics, 26, 72-85.

Warnow, T. (2012). Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLOS Currents Tree of Life, Edition 1. doi: 10.1371/currents.RRN1308.

Wheeler, W. C. (2011). Comparison of Optimality Criteria. Systematics: A Course of Lectures, 269-287.