Friday, March 18, 2016

ML and IB consistency

Statistical consistency is the property of an estimator of converging on the true value as the amount of data tends to infinity. Felsenstein (1978) showed that phylogenetic reconstruction using parsimony is inconsistent when the branch lengths have certain proportions (Figure 1); this region of branch-length space is known as the Felsenstein zone. Farris (1999), however, argued that phylogenetic reconstruction using Maximum Likelihood (ML) is also inconsistent when the branch lengths have certain other proportions (Figure 2); this region is known as the Farris zone.
In contrast to these phylogenetic reconstruction techniques, Bayesian inference (IB) is consistent regardless of the zone or the branch-length proportions (Steel, 2013).

Figure 1. Felsenstein zone


Figure 2. Farris zone


I nevertheless checked experimentally whether IB and ML are consistent. For this I used four types of unrooted trees with 4 tips (Farris zone, Felsenstein zone, "long branches", "small branches"). I chose three sequence lengths (100, 1000 and 10000 base pairs) and, for each length, generated 25 DNA sequence data sets from the tree under the JC model. I recovered the trees with MrBayes v 3.2.6 and PhyML 20160303, both compiled with MPI support, using the JC model. I compared each recovered tree with the original tree using the Robinson-Foulds metric (Steel & Penny, 1993). I define consistency as: number of trees with Robinson-Foulds metric = 0 / total number of trees, for each type of tree and sequence length.
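Two pieces of this procedure can be sketched in Python: simulating DNA on a 4-tip unrooted tree under JC, and computing the consistency proportion from the Robinson-Foulds metric. This is only a minimal illustration under stated assumptions, not the R code from the repository: the function names, the branch lengths and the representation of a tree by its single internal split are all hypothetical, and the tree inference itself (MrBayes, PhyML) is not reproduced, so the list of recovered trees is made up.

```python
import math
import random

BASES = "ACGT"


def jc_evolve(base, t, rng):
    """Evolve one base along a branch of length t (expected substitutions
    per site) under the Jukes-Cantor (JC) model."""
    # JC probability that the base at the end of the branch equals the start.
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    if rng.random() < p_same:
        return base
    # Otherwise the three other bases are equally likely.
    return rng.choice([b for b in BASES if b != base])


def simulate_quartet(n_sites, tip_lengths, internal_length, seed=None):
    """Simulate a 4-taxon alignment on the unrooted tree ((A,B),(C,D))."""
    rng = random.Random(seed)
    seqs = {name: [] for name in "ABCD"}
    for _ in range(n_sites):
        u = rng.choice(BASES)                    # node joining A and B
        v = jc_evolve(u, internal_length, rng)   # node joining C and D
        for name, node in (("A", u), ("B", u), ("C", v), ("D", v)):
            seqs[name].append(jc_evolve(node, tip_lengths[name], rng))
    return {name: "".join(sites) for name, sites in seqs.items()}


def quartet_split(pair, taxa="ABCD"):
    """Internal split of a resolved 4-tip unrooted tree, e.g. AB|CD."""
    side = frozenset(pair)
    return frozenset({side, frozenset(taxa) - side})


def rf_distance(split_a, split_b):
    """Robinson-Foulds distance between two resolved 4-tip unrooted trees:
    the number of internal splits found in one tree but not in the other."""
    return len({split_a} ^ {split_b})


def consistency(true_split, recovered_splits):
    """Proportion of recovered trees whose RF distance to the true tree is 0."""
    hits = sum(rf_distance(true_split, s) == 0 for s in recovered_splits)
    return hits / len(recovered_splits)


if __name__ == "__main__":
    # Hypothetical branch lengths, not the values used in the experiment.
    tips = {"A": 0.5, "B": 0.05, "C": 0.5, "D": 0.05}
    alignment = simulate_quartet(100, tips, internal_length=0.05, seed=1)
    print({name: seq[:20] for name, seq in alignment.items()})

    true_tree = quartet_split("AB")
    # Made-up inference results for 25 replicates.
    recovered = [quartet_split("AB")] * 20 + [quartet_split("AC")] * 5
    print(consistency(true_tree, recovered))
```

With only 4 tips, a resolved unrooted tree has a single internal split, so the Robinson-Foulds metric between two such trees is either 0 (same topology) or 2 (different topology). The consistency proportion is therefore simply the fraction of replicates that recover the true split.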
In the Farris zone (Figure 3) I need to sample more sequence lengths: with only three lengths ML appears consistent in the Farris zone, which contradicts the expected inconsistency of ML there, whereas IB was consistent in this zone. In the Felsenstein zone (Figure 4), ML and IB were both consistent. For "long branches" (Figure 5) and "small branches" (Figure 6), ML and IB were also consistent. These results support the claim that IB is consistent regardless of the zone or the branch-length proportions.






Figure 3. Statistical consistency in the Farris zone. ML = Maximum Likelihood, IB = Bayesian Inference

Figure 4. Statistical consistency in the Felsenstein zone. ML = Maximum Likelihood, IB = Bayesian Inference

Figure 5. Statistical consistency for "long branches". ML = Maximum Likelihood, IB = Bayesian Inference

 
I wrote some useful functions in R for working with the trees.

https://github.com/dpabon/bio_comparada/blob/master/consistencia/bin/R/functions.R
https://github.com/dpabon/bio_comparada/blob/master/consistencia/bin/R/gentree.R

If you want to run the whole analysis on your own computer, you can download a copy of the repository and follow the instructions.

https://github.com/dpabon/bio_comparada/tree/master/consistencia

Farris, J. (1999). Likelihood and inconsistency. Cladistics, 15, 199–204.

Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology, 27(4), 401–410.

Steel, M. A., & Penny, D. (1993). Distributions of tree comparison metrics: some new results. Systematic Biology, 42(2), 126–141.

Steel, M. (2013). Consistency of Bayesian inference of resolved phylogenetic trees. Journal of Theoretical Biology, 336, 246–249.