Friday, April 22, 2011

Bayesian Inference in Phylogenetic Analysis

The growing popularity of Bayesian methods in phylogenetic inference over the last two decades is the result of the implementation of Markov chain Monte Carlo (MCMC) algorithms, in particular the Metropolis-Hastings algorithm, for estimating posterior probability distributions and for exploring more parameter-rich evolutionary models (Nylander et al., 2004). Statistically, Bayesian inference calculates the posterior probability that a hypothesis is true given the data, combining the prior probability of each hypothesis with the likelihood of the data under it. Here probability represents uncertainty about the phylogeny and about the parameters of the model, not an expected frequency of occurrence as in classical or frequentist statistics (Yang, 2006). From a philosophical standpoint, the Bayesian and Maximum Likelihood approaches are inference methods based on calculating the probabilities of evolutionary transformations of characters (for example, according to a nucleotide substitution model) rather than on evaluating possible synapomorphies. In addition, any homologous characters (apomorphic or plesiomorphic) can be used to infer phylogenies, and these methods assume that evolution may be reticulate and not always dichotomous (Lukhtanov, 2010).
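As a minimal sketch of how a Metropolis-Hastings sampler approximates a posterior distribution, the Python example below estimates a single parameter, the distance d between two sequences under the Jukes-Cantor (JC69) model, rather than a full phylogeny. The site counts, the exponential prior with mean 0.2, the window width and all function names are illustrative assumptions, not values taken from the works cited above.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_mean):
    """Unnormalized log posterior of the JC69 distance d (assumed toy model)."""
    if d <= 0:
        return float("-inf")
    # Probability that a site differs between the two sequences under JC69.
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    if p <= 0.0:
        return float("-inf")
    log_lik = n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)
    # Exponential prior on d with the given mean (an assumption of this sketch).
    log_prior = -math.log(prior_mean) - d / prior_mean
    return log_lik + log_prior


def metropolis_hastings(n_sites, n_diff, prior_mean=0.2, n_iter=20000,
                        burn_in=2000, window=0.05, seed=1):
    """Return post-burn-in samples of d from a sliding-window MH chain."""
    rng = random.Random(seed)
    d = 0.5  # arbitrary starting point
    current = log_posterior(d, n_sites, n_diff, prior_mean)
    samples = []
    for i in range(n_iter):
        # Symmetric uniform proposal, reflected at zero to keep d non-negative.
        proposal = abs(d + rng.uniform(-window / 2.0, window / 2.0))
        candidate = log_posterior(proposal, n_sites, n_diff, prior_mean)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        if i >= burn_in:
            samples.append(d)
    return samples


if __name__ == "__main__":
    draws = metropolis_hastings(n_sites=948, n_diff=90)  # hypothetical counts
    print("posterior mean of d:", sum(draws) / len(draws))
```

With these hypothetical counts the draws should concentrate near the maximum likelihood estimate of roughly 0.10 substitutions per site. Real phylogenetic samplers propose changes to topology, branch lengths and model parameters in the same accept-or-reject fashion.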

At present, it is possible to identify six key topics related to the Bayesian approach in phylogenetic reconstruction: (1) integration of complex evolutionary models, (2) heterogeneity across sites, (3) heterogeneity across data partitions in the analysis of combined data (Huelsenbeck et al., 2001), (4) computational efficiency (Ronquist & Huelsenbeck, 2003), (5) the easy interpretation of the posterior probability of a tree or clade (Yang, 2006) and (6) incorporation of prior values. The development of Bayesian MCMC algorithms is responsible for the increase in computational efficiency, which makes it possible to analyze more complex and realistic evolutionary models. This does not mean that parameter-rich models are appropriate for every data set: we risk overparameterization, and more complex evolutionary models are associated with more topological uncertainty (Nylander et al., 2004). What is interesting, however, is that it opens up the possibility of exploring more realistic and complex models (although a realistic model can also be simple) that account for heterogeneity across sites and across combined data.

In terms of computational efficiency, Maximum Likelihood analysis has gained ground with the development and improvement of algorithms implemented in software such as PhyML (Guindon and Gascuel, 2003) and RAxML (Stamatakis et al., 2007), and it is now possible to perform moderately fast and accurate bootstrapping to assess confidence. In Bayesian inference, however, the posterior probability is easier to interpret: it is the probability that the tree or clade is correct given the data, model and priors (Yang, 2006). The interpretation of the bootstrap, although it tends to be more conservative, has been controversial, especially under model misspecification and when the signal is detectable at only some sites (Ronquist and Deans, 2009). Nevertheless, according to Yang (2006), Bayesian posterior probabilities can be spuriously high because of lack of convergence, poor mixing, misspecification of the substitution model used in the likelihood, and misspecification of and sensitivity to the priors. On this last point, the Bayesian approach allows prior knowledge about a particular hypothesis to be incorporated, or vague or uninformative priors to be used when little information is available or when one does not want to build the analysis on any previous results (Ronquist et al., 2008). Since priors can be subjective, it is advisable to assess their influence on the posterior probability. Pickett and Randle (2005) pointed out that with a uniform prior on topologies, the prior probability of a clade, and therefore its posterior probability, depends on both the size of the clade and the number of species.
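The dependence of clade priors on clade size and number of species can be checked by counting trees. The sketch below assumes rooted, binary, labeled trees, of which there are (2n - 3)!! for n taxa, and a uniform prior over them; the function names are hypothetical. A given clade of k taxa appears in every tree obtained by resolving the clade internally and then treating it as a single leaf among the remaining n - k taxa.

```python
from fractions import Fraction


def double_factorial(m):
    """Return m!! for odd m >= -1, using the convention (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def n_rooted_trees(n_taxa):
    """Number of rooted, binary, labeled trees on n_taxa: (2n - 3)!!."""
    return double_factorial(2 * n_taxa - 3)


def clade_prior(n_taxa, clade_size):
    """Prior probability of one specific clade under a uniform topology prior."""
    if not 1 <= clade_size <= n_taxa:
        raise ValueError("clade_size must be between 1 and n_taxa")
    # Ways to resolve the clade itself.
    inside = n_rooted_trees(clade_size)
    # Ways to place the collapsed clade among the remaining taxa.
    outside = n_rooted_trees(n_taxa - clade_size + 1)
    return Fraction(inside * outside, n_rooted_trees(n_taxa))


if __name__ == "__main__":
    for k in range(2, 5):
        print(k, clade_prior(5, k))
```

For five taxa this gives 1/7 for a specific clade of two or four taxa but 3/35 for a specific clade of three, so a prior that is flat over topologies is not flat over clades.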

Lack of convergence and poor mixing in the Bayesian approach lead to spuriously high support for the trees visited by the chain (Yang & Rannala, 2005); running multiple long chains from different starting points is a possible solution when the data set is very large. However, Bayesian inference faces another problem: inconsistency. Under the Bayesian criterion, tree topologies are estimated without estimating branch lengths, that is, by integrating the branch lengths of a given topology over a distribution of possible values (Goloboff & Pol, 2005). As a result, statistical consistency can be lost, and Bayesian inference is biased in favor of topologies that group long branches together, even when the true model and the prior distributions of the evolutionary parameters over a group of phylogenies are known. According to Kolaczkowski and Thornton (2009), this bias becomes more severe as more data are analyzed and as sequence sites evolve heterogeneously, and it is relatively weak when the true model is simple. Bayesian inference is therefore less efficient and less robust to the use of an incorrect evolutionary model than ML. Despite the above, Bayesian MCMC inference is a promising approach to phylogenetic analysis and computational biology, and it has advanced with the development of algorithms, methods and software. Nevertheless, more research is needed on how to incorporate previous knowledge into appropriate informative priors and on how to deal with complex models and large data sets without falling into inconsistency.
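Running several chains from different starting points is usually paired with a numerical check that the chains agree. One common choice, assumed here for illustration, is the potential scale reduction factor of Gelman and Rubin, computed for a scalar quantity (for example the log likelihood) sampled by each chain. The function name and input layout below are hypothetical; values close to 1 suggest that the chains sample the same distribution, while larger values suggest lack of convergence.

```python
import math
import statistics


def potential_scale_reduction(chains):
    """Gelman-Rubin statistic for equal-length chains of a scalar parameter."""
    m = len(chains)
    if m < 2:
        raise ValueError("at least two chains are required")
    n = len(chains[0])
    if n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("chains must have the same length of at least 2")
    chain_means = [statistics.fmean(chain) for chain in chains]
    # Mean within-chain variance.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    if within == 0:
        raise ValueError("chains have zero within-chain variance")
    # Between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)
```

The statistic only compares a summary of the chains; it does not replace inspecting topology-level agreement between runs.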

References


Goloboff P., Pol D. 2005. Parsimony and Bayesian phylogenetics. In: Albert V. (ed.) Parsimony, Phylogeny, and Genomics. Oxford University Press. 210-266.

Guindon S., Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology. 52(5): 696-704.

Huelsenbeck J. P., Ronquist F., Nielsen R., Bollback J. P. 2001. Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology. Science. 294(5550): 2310-2314.

Kolaczkowski B., Thornton J. W. 2009. Long-Branch Attraction Bias and Inconsistency in Bayesian Phylogenetics. PLoS One. 4:e7891. doi:10.1371/journal.pone.0007891

Lukhtanov V. A. 2010. From Haeckel’s Phylogenetics and Hennig’s Cladistics to the Method of Maximum Likelihood: Advantages and Limitations of Modern and Traditional Approaches to Phylogeny Reconstruction. Entomological Review. 90(3): 299-310.

Nylander J. A. A., Ronquist F., Huelsenbeck J. P., Nieves-Aldrey J. L. 2004. Bayesian Phylogenetic Analysis of Combined Data. Systematic Biology. 53(1): 47-67.

Pickett K. M., Randle C. P. 2005. Strange Bayes indeed: Uniform topological priors imply non-uniform clade priors. Molecular Phylogenetics and Evolution. 34: 203-211.

Ronquist F., Deans A. R. 2009. Bayesian Phylogenetics and Its Influence on Insect Systematics. Systematic Entomology. 35(3): 349-378.

Ronquist F., van der Mark P., Huelsenbeck J. P. 2009. Bayesian phylogenetic analysis using MrBayes. In: Lemey P., Salemi M., Vandamme A.-M. (eds.) The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing. Cambridge University Press. 219-236.

Stamatakis A., Blagojevic F., Nikolopoulos D., Antonopoulos C. 2007. Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell. The Journal of VLSI Signal Processing. 48: 271-286.

Yang Z. 2006. Computational Molecular Evolution. Oxford University Press, Oxford, England.

Yang Z., Rannala B. 2005. Branch-length prior influences Bayesian posterior probability of phylogeny. Systematic Biology. 54: 455-470.