Bootstrap versus Posterior Probability… Are their values related?
Phylogenetic methods used to reconstruct the historical relationships among organisms and molecular entities have been widely explored. The relationships in the data are summarized in the resulting topology. However, a researcher cannot affirm that those relationships are correct merely by inspecting the tree. To avoid subjectivity, statistical measures appear to be the best way to assess how well an inference is supported, in terms of confidence, reliability, support, or robustness. This step is important because it measures how strongly the inference is supported given the model and the data. One such measure is the nonparametric bootstrap, a technique that resamples the data to estimate support values. The proportion of times a particular phylogenetic relationship is observed across replicates can be interpreted as a probability and used as a measure of “confidence in” or “support for” that relationship (Cummings et al., 2003). Another measure is the posterior probability, which results from Bayesian analysis coupled with Markov chain Monte Carlo (MCMC) sampling. This approach determines the probability of a relationship given the prior probability, the likelihood function, and the data. Both measures are widely used as support for the relationships in a topology (Cummings et al., 2003). Although the two measures rest on independent theory, some authors claim that they are equivalent (Efron, Halloran, & Holmes, 1996). Previous work has reported that posterior probabilities are overestimated (Erixon, Svennblad, Britton, & Oxelman, 2003). The purpose of this work was to compare bootstrap support and posterior probability across different sequence lengths and numbers of taxa.
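The resampling idea behind the nonparametric bootstrap can be illustrated with a minimal Python sketch. It is not the procedure used in this work: the alignment is invented, and the function `closest_pair` is a hypothetical toy stand-in for a real tree search, returning only the two most similar taxa. What the sketch does show is the core mechanism: alignment columns are drawn with replacement, the inference is repeated on each pseudo-replicate, and support is the proportion of replicates in which a grouping is recovered.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Build one pseudo-replicate by drawing columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in cols)
            for taxon, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for tree inference (assumption): the pair of taxa
    with the fewest differing sites; ties go to the first pair."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(combinations(sorted(alignment), 2),
               key=lambda p: hamming(alignment[p[0]], alignment[p[1]]))


def bootstrap_support(alignment, n_replicates=100, seed=0):
    """Proportion of replicates in which each pair is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        pair = closest_pair(resample_columns(alignment, rng))
        counts[pair] = counts.get(pair, 0) + 1
    return {pair: n / n_replicates for pair, n in counts.items()}


# Invented four-taxon alignment, for illustration only.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "TCGAACGGAC",
    "D": "TCGAACGGTT",
}
support = bootstrap_support(toy_alignment)
for pair, value in sorted(support.items()):
    print(pair, value)
```

In a real analysis the toy function would be replaced by a full tree search on every pseudo-replicate, and support would be tallied per clade rather than per pair.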
MATERIALS AND METHODS
The reference trees were designed with three different numbers of taxa (6, 12, and 24 terminals). From these initial topologies, I simulated DNA sequences of two lengths (1000 and 4000 characters) using the software Seq-Gen (Rambaut and Grassly, 1997), each under the GTR evolutionary model. The tree search was done using two analyses: Maximum Likelihood (ML) and Bayesian analysis. The ML analysis was performed in PhyML (Guindon et al., 2009) under the model used to simulate the sequences, with NNI and SPR swapping and 100 bootstrap replicates. The Bayesian analysis was performed in MrBayes v.3.2.4 (Ronquist et al., 2012) with two Markov chain Monte Carlo chains, 1.00.000 generations (p reached a value below 10^-3), the GTR model, and a burn-in fraction of 0.5. Three replicates of the analysis were performed. Correlation and linear regression were computed in R 3.1.1 (R Core Team, 2014) between pairs of topologies with the same sequence length and number of taxa, using the support values of their common nodes. The plots were produced in the same software. To run the experiment, I wrote scripts in Bash for Ubuntu 14.04 (see supplementary material).
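The comparison step can be sketched as follows. The analysis in this work was run in R; the Python below is only an illustrative equivalent. It assumes that each tree has been reduced to a mapping from clade (a set of taxon names) to support value, and the support values shown are invented, not results from this study. Common nodes are the clades present in both trees, and an ordinary least-squares fit gives the slope, intercept, and coefficient of determination. The p-value is omitted to keep the sketch dependency-free.

```python
from math import sqrt


def linear_regression(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r, r * r


# Hypothetical clade -> support tables (illustrative values only).
ml_support = {
    frozenset({"A", "B"}): 0.98,
    frozenset({"C", "D"}): 0.62,
    frozenset({"A", "B", "C", "D"}): 0.75,
    frozenset({"E", "F"}): 0.40,   # clade found only in the ML tree
}
bayes_support = {
    frozenset({"A", "B"}): 1.00,
    frozenset({"C", "D"}): 0.95,
    frozenset({"A", "B", "C", "D"}): 0.99,
    frozenset({"D", "E"}): 0.55,   # clade found only in the Bayesian tree
}

# Common nodes: clades recovered by both analyses.
common = sorted(set(ml_support) & set(bayes_support), key=sorted)
bs = [ml_support[clade] for clade in common]
pp = [bayes_support[clade] for clade in common]

slope, intercept, r, r_squared = linear_regression(bs, pp)
print(len(common), "common nodes; r^2 =", round(r_squared, 3))
```

Keying clades by `frozenset` makes the matching independent of the order in which taxa are listed in either tree.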
RESULTS AND DISCUSSION
Figure 1. Topologies obtained in each analysis. Top: trees with 6 taxa, showing the variation in support at the common nodes (in red). Bottom: trees with 12 taxa, showing the support at the common nodes (in yellow).
Figure 2. Linear regression of bootstrap and posterior probability for the common nodes in one of the three replicates. r^2 = 0.3667 and p-value = 2.88 × 10^-7.
Comparing the topologies across the three replicates, support generally increased when more data were added to the analysis, which might suggest that the method, or the result, is consistent. However, as the number of taxa increased, support at the basal nodes decreased. These results are similar to those of Cummings et al. (2003), who stated that “Given sufficient data, the appropriate likelihood model, and the appropriate search strategies, the values of Pboot and P(τ | D) are similar”. An interesting observation is the apparent overestimation of the posterior probabilities in all trees and replicates. Maximum likelihood did not recover the same topology as the Bayesian analysis, and both made errors in reconstructing the original tree.
The linear regression showed a significant relationship between the two support measures. The coefficient of determination was 0.366, a value considered in the literature to indicate a moderate relationship. In most cases, when the bootstrap (BS) value was high (near 100), the posterior probability (PP) was also high. In some cases, however, PP remained high when BS was low. The same pattern was reported by Erixon et al. (2003), who showed that posterior probabilities are overestimated, which rejects a universal equivalence with bootstrap values. These findings contradict the view that the two measures are similar and equivalent. Nevertheless, some authors argue that Bayesian PP is a better measure of support because of its high sensitivity on short internodes (Alfaro, Zoller, & Lutzoni, 2003) and claim that the bootstrap requires more data to reach the same goal.
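The pattern of low BS with high PP can be made explicit with a short sketch that flags such nodes. The cut-offs used here (BS below 0.70, PP of at least 0.95) are assumptions chosen only for illustration, as are the paired values, and both measures are expressed as proportions rather than percentages.

```python
def discordant_nodes(pairs, bs_max=0.70, pp_min=0.95):
    """Return the (BS, PP) pairs with low bootstrap but high posterior
    probability. Thresholds are illustrative assumptions."""
    return [(bs, pp) for bs, pp in pairs if bs < bs_max and pp >= pp_min]


# Invented (BS, PP) values for five common nodes.
example_pairs = [
    (0.99, 1.00),
    (0.55, 0.98),
    (0.62, 0.97),
    (0.40, 0.60),
    (0.85, 0.90),
]

flagged = discordant_nodes(example_pairs)

# Mean difference PP - BS; a positive value means PP exceeds BS on average.
mean_gap = sum(pp - bs for bs, pp in example_pairs) / len(example_pairs)

print(flagged)
print(round(mean_gap, 3))
```

A count of flagged nodes per tree, together with the mean gap, gives a simple numerical summary to report alongside the regression.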
REFERENCES
Alfaro, M. E., Zoller, S., & Lutzoni, F. (2003). Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Molecular Biology and Evolution, 20(2), 255–266.
Cummings, M. P., Handley, S. A., Myers, D. S., Reed, D. L., Rokas, A., & Winka, K. (2003). Comparing bootstrap and posterior probability values in the four-taxon case. Systematic Biology, 52(4), 477–487.
Efron, B., Halloran, E., & Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, 93(23), 13429.
Erixon, P., Svennblad, B., Britton, T., & Oxelman, B. (2003). Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Systematic Biology, 52(5), 665–673.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., … Huelsenbeck, J. P. (2012). MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 61(3), 539–542. doi:10.1093/sysbio/sys029