Thursday, March 31, 2016

Summarizing branch lengths in phylogenies

Phylogenetic trees are a way to represent the evolutionary relationships of organisms, and they are used widely in biology. Phylogenetic analyses often yield more than one reconstruction, and deciding how to represent this variety of results can be complicated.

However, the overwhelming majority of attempts to summarize these results are carried out on trees whose branch lengths are ignored. Taking branch lengths into account makes sense in statistical reconstruction methods, but their significance is difficult to determine. One alternative is to use a measure of central tendency: the mean or the median of the branch lengths across all the alternative reconstructions. Felsenstein (2004) argues that one could use the median branch lengths on a consensus tree; however, problems arise when the clade of interest is not present in all the trees.
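The problem of a clade missing from some trees can be seen in a small Python sketch. The data structure is my own assumption for illustration: each tree is a dictionary mapping a clade (a frozenset of tip names) to the length of the branch leading to it, and the lengths are made up. It does not reproduce what any particular program does.

```python
from statistics import mean, median

# Hypothetical toy data: each tree maps a clade (frozenset of tip names)
# to the length of the branch leading to that clade.
toy_trees = [
    {frozenset("AB"): 0.10, frozenset("ABC"): 0.30},
    {frozenset("AB"): 0.14, frozenset("ABC"): 0.20},
    {frozenset("AC"): 0.05, frozenset("ABC"): 0.25},  # clade AB is absent here
]

def summarize_branch(trees, clade):
    """Mean and median length of the branch leading to `clade`,
    computed only over the trees in which the clade is present."""
    lengths = [tree[clade] for tree in trees if clade in tree]
    if not lengths:
        return None  # the clade was never sampled
    return {
        "n_trees": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }

ab_summary = summarize_branch(toy_trees, frozenset("AB"))
print(ab_summary)
```

Here the summary for clade AB rests on two of the three trees, so the number of trees behind each value should be reported together with the mean or median.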

Take for example the case of support measures. Bootstrap values and posterior probabilities are usually reported on the trees produced by Maximum Likelihood (ML) and Bayesian inference. But where exactly do these trees and their branch lengths come from?

In Maximum Likelihood the result is a point estimate: the tree with the highest likelihood value. To account for uncertainty, the data matrix is usually resampled by bootstrapping. The result is a measure of support for each internal node. Typically the Maximum Likelihood tree is reported, annotated with the frequency of each node among the bootstrap replicates.

Bayesian inference is more complex: at each sampled generation the MCMC algorithm samples a different tree. What you get is a posterior distribution of trees rather than a single point estimate, so some representation of the set of sampled trees has to be obtained.

The two most widely used programs for Bayesian phylogenetic inference are MRBAYES (Ronquist and Huelsenbeck, 2003) and BEAST (Drummond et al., 2012); in both, the information from a run can be summarized in a single tree. MRBAYES builds a majority-rule consensus of compatible trees and, if the branch lengths were saved, reports the mean branch lengths and their variance. In some cases this tree may not have been found during the search.
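The idea behind a majority-rule summary with mean branch lengths can be sketched in Python as follows. This is only an illustration of the principle, not the MRBAYES implementation: the tree representation (a dictionary from clade to branch length), the toy values, and the use of the population variance are all my assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical posterior sample: each tree maps a clade to its branch length.
sampled_trees = [
    {frozenset("AB"): 0.10, frozenset("ABC"): 0.30},
    {frozenset("AB"): 0.12, frozenset("ABC"): 0.28},
    {frozenset("AB"): 0.14, frozenset("ABC"): 0.32},
    {frozenset("AC"): 0.05, frozenset("ABC"): 0.30},
]

def majority_rule_summary(trees, threshold=0.5):
    """Keep clades found in more than `threshold` of the trees and
    summarize the lengths of their branches."""
    lengths = defaultdict(list)
    for tree in trees:
        for clade, length in tree.items():
            lengths[clade].append(length)
    summary = {}
    for clade, values in lengths.items():
        support = len(values) / len(trees)
        if support > threshold:
            summary[clade] = {
                "support": support,
                "mean_length": mean(values),
                # population variance, chosen here so a single value is allowed
                "variance": pvariance(values),
            }
    return summary

consensus = majority_rule_summary(sampled_trees)
for clade, stats in consensus.items():
    print(sorted(clade), stats)
```

Because the retained clades are chosen one by one, their combination does not have to match any single sampled tree, which is why the consensus may never have been visited during the search.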

BEAST, meanwhile, has its own program for summarizing the sampled trees: TreeAnnotator. By default this utility produces the tree that maximizes the product of the posterior probabilities of its clades (bipartitions), called the maximum clade credibility tree (MCCT). Branch lengths can be taken as the mean, the median or the common ancestor heights (hereafter ca), or the heights of the target tree can be kept. Heled and Bouckaert (2013) show that, across different metrics, no single summary method both maximizes the posterior probability of the clades and has good branch lengths.
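A minimal Python sketch of the maximum clade credibility criterion is given below. It assumes each sampled tree is reduced to the set of its clades (frozensets of tip names); the function name and the toy sample are hypothetical, and this is not TreeAnnotator's code.

```python
import math
from collections import Counter

# Hypothetical posterior sample for four tips A, B, C, D:
# each tree is represented only by its non-trivial clades.
posterior_sample = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("ABC")},
]

def maximum_clade_credibility(trees):
    """Return the index of the sampled tree whose clades have the highest
    product of posterior frequencies, together with its log score."""
    n = len(trees)
    counts = Counter(clade for tree in trees for clade in tree)

    def log_credibility(tree):
        # Sum of logs instead of a product, to avoid numerical underflow.
        return sum(math.log(counts[clade] / n) for clade in tree)

    best = max(range(n), key=lambda i: log_credibility(trees[i]))
    return best, log_credibility(trees[best])

best_index, best_score = maximum_clade_credibility(posterior_sample)
print(best_index, best_score)
```

Unlike a consensus, the tree chosen this way is always one of the sampled trees; only its branch lengths or node heights still have to be summarized afterwards.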

Here, in a much simpler test than theirs, I explore the differences between ways of summarizing the trees from a Bayesian inference analysis, including topology (consensus, maximum clade credibility) and branch length measurements (mean, median, ca).


Two scenarios of sequence evolution were explored: one under a strict molecular clock and the other under an uncorrelated relaxed molecular clock. For each scenario, a tree was generated using RateEvolver (Ho et al., 2005); this software produces a fully bifurcating phylogeny of 9 terminals whose branch lengths, measured as substitution rates, are modified according to the chosen molecular clock.
From these trees, 10 replicates of 1000-nucleotide sequences were generated in Seq-Gen (Rambaut and Grassly, 1997) under the JC model.
Phylogenetic reconstruction was carried out in BEAST v1.8.3 under the same molecular clock model used to generate the data. Each analysis was run for 1,500,000 generations, and the first 10% was discarded as burn-in.
I summarized the trees, after discarding the burn-in, using two programs: SumTrees (Sukumaran and Holder, 2015) and TreeAnnotator. The branch length measurements used were mean, median and ca. For topologies I used the MCCT and the majority-rule consensus.
For each replicate, the difference from the original tree was calculated with the treedist function of the phangorn R package (Schliep, 2011).
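As an illustration of the kind of distance involved, here is a Python sketch of a branch score difference between two trees. It is not the phangorn code: I assume each tree is a dictionary from a split (a frozenset with the tips on one side of a branch) to that branch's length, that a split missing from a tree counts as length zero, and that the square root of the summed squared differences is taken, which may differ between implementations.

```python
import math

def branch_score_difference(tree_a, tree_b):
    """Square root of the summed squared differences in branch length
    over all splits found in either tree (absent split = length 0)."""
    splits = set(tree_a) | set(tree_b)
    return math.sqrt(
        sum((tree_a.get(s, 0.0) - tree_b.get(s, 0.0)) ** 2 for s in splits)
    )

# Hypothetical reference and summary trees sharing one split out of two.
reference_tree = {frozenset("AB"): 0.1, frozenset("CD"): 0.3}
summary_tree = {frozenset("AB"): 0.1, frozenset("AC"): 0.4}

bsd = branch_score_difference(reference_tree, summary_tree)
print(bsd)
```

Under these assumptions a topological disagreement contributes the full length of the unmatched branches, so the measure penalizes both wrong clades and poorly estimated lengths.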

Results and Discussion

The results are shown in Figures 1 and 2.
Under an uncorrelated relaxed molecular clock the results are quite similar (Fig 1). The lowest differences are obtained with the majority-rule consensus, using either the mean or the median as branch lengths. A similar value is found with the MCCT when ca is chosen to represent the branches.

Fig 1. Uncorrelated Relaxed Clock. Results of mean Branch Score Difference for 10 replicates with respect to reference tree

The results obtained with a strict molecular clock (Fig 2) seem contradictory. The best Branch Score Difference (BSD) values with respect to the original tree are obtained with the maximum clade credibility tree. An abrupt difference in distance is observed compared to the consensus trees. Given this kind of molecular clock, the consensus and MCCT results were expected to be very similar. One possible explanation is an error in the software used, but this seems unlikely given that the results obtained under the relaxed molecular clock seem reasonable.

Fig 2. Strict Clock. Results of mean Branch Score Difference for 10 replicates with respect to reference tree

With these results it is not possible to draw a clear conclusion about the best way to summarize a phylogenetic reconstruction when branch lengths are taken into account. Choosing the mean or the median does not seem to greatly affect the branch lengths. However, more work is needed given the ambiguous results reported here.


Drummond, A. J., Suchard, M. A., Xie, D., & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29, 1969-197

Felsenstein, J. (2004). Inferring phylogenies (Vol. 2). Sunderland: Sinauer Associates.

Heled, J., & Bouckaert, R. R. (2013). Looking for trees in the forest: summary tree from posterior samples. BMC evolutionary biology, 13(1), 1.

Ho, S. Y., Phillips, M. J., Drummond, A. J., & Cooper, A. (2005). Accuracy of rate estimation using relaxed-clock models with a critical focus on the early metazoan radiation. Molecular Biology and Evolution, 22(5), 1355-1363.

Rambaut, A., & Grassly, N. C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences: CABIOS, 13(3), 235-238.

Ronquist, F. and J. P. Huelsenbeck. 2003. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574. 

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.

Sukumaran, J., & Holder, M. T. SumTrees: Phylogenetic Tree Summarization. 4.0.0 (Jan 31 2015). Available at https://github.com/jeetsukumaran/DendroPy.
