Filosofía, especie y sistemática: Summarizing branch lengths in phylogenies

Phylogenetic trees represent the evolutionary relationships among organisms, and they are used throughout biology. Phylogenetic analyses often yield more than one reconstruction, and deciding how to represent this variety of results can be complicated.

However, the overwhelming majority of attempts are carried out on trees whose branch lengths are ignored. Taking branch lengths into account makes sense in statistical reconstruction methods, but their significance is difficult to determine. One way around this is to use a measure of central tendency, the mean or the median of the branch lengths across all the alternative reconstructions. Felsenstein (2004) argues that one could use the median branch length on a consensus tree; however, problems arise when the clade of interest is not present in all the trees.
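
As a minimal sketch of this idea, suppose each tree in the sample is reduced to a dictionary that maps a clade (the set of its tip names) to the length of the branch leading to it. This representation, the function name summarize_clade_length and the toy values are assumptions made for illustration; they do not come from any of the programs discussed here. Taking the mean or median only over the trees that contain the clade, and reporting how often the clade appears, makes the missing-clade problem explicit:

```python
from statistics import mean, median


def summarize_clade_length(trees, clade):
    """Summarize the branch leading to `clade` over the trees that contain it.

    `trees` is a list of dicts {frozenset of tip names: branch length}
    (an assumed toy representation, not a real tree file format).
    Returns None when no tree in the sample contains the clade.
    """
    lengths = [tree[clade] for tree in trees if clade in tree]
    if not lengths:
        return None
    return {
        "frequency": len(lengths) / len(trees),  # share of trees with the clade
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Hypothetical sample of three trees; the clade (A,B) is missing from the last.
AB = frozenset({"A", "B"})
BC = frozenset({"B", "C"})
ABC = frozenset({"A", "B", "C"})
trees = [
    {AB: 0.10, ABC: 0.30},
    {AB: 0.20, ABC: 0.20},
    {BC: 0.05, ABC: 0.40},
]

summary = summarize_clade_length(trees, AB)
print(summary)
```

Here the summary of (A,B) rests on only two of the three trees, which is exactly the situation where a mean or median on a consensus tree becomes hard to interpret.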

Take, for example, support measures. Bootstrap values and posterior probabilities are usually reported on the trees produced by ML and Bayesian inference. But where exactly do these trees and their branch lengths come from?

In Maximum Likelihood the result is a point estimate: the tree with the highest likelihood value. To take confidence limits into account, the data matrix is generally resampled using the bootstrap. The result is a measure of support for each internal node. Typically the Maximum Likelihood tree is reported, annotated with the frequency of each node among the bootstrap replicates.
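
The two steps described above can be sketched as follows. The tree search on each pseudo-replicate is left out, and the replicate trees are written by hand as sets of clades, so the function names, the toy alignment and the toy trees are all assumptions for illustration only:

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by resampling alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def node_support(best_tree, replicate_trees):
    """Fraction of replicate trees that contain each clade of the best tree."""
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    return {clade: counts[clade] / len(replicate_trees) for clade in best_tree}


# Hypothetical four-taxon alignment (8 sites) and one pseudo-replicate of it.
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACGA", "D": "TCGAACGA"}
replicate = bootstrap_alignment(alignment, random.Random(1))

# Hypothetical ML tree and trees inferred from four pseudo-replicates,
# each reduced to its set of non-trivial clades.
AB = frozenset({"A", "B"})
BC = frozenset({"B", "C"})
ABC = frozenset({"A", "B", "C"})
ml_tree = {AB, ABC}
replicate_trees = [{AB, ABC}, {AB, ABC}, {BC, ABC}, {AB, ABC}]

support = node_support(ml_tree, replicate_trees)
print(support)
```

The support values are attached to the nodes of the ML tree; the replicate trees themselves, and their branch lengths, are normally discarded.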

Bayesian inference analysis is more complex: during each generation, a tree is sampled according to the MCMC algorithm. What you get is a posterior distribution of trees rather than a single point estimate. This means that some representation of the set of sampled trees has to be obtained.

The two most widely used programs for Bayesian inference in phylogenetics are MRBAYES (Ronquist and Huelsenbeck, 2003) and BEAST (Drummond et al., 2012); in both, the information from a run can be summarized in a single tree. MRBAYES builds a majority rule consensus of compatible trees and, if the branch lengths were saved, reports the average branch lengths and their variance. In some cases, this tree may never have been found during the search.
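
A minimal sketch of a majority rule consensus with averaged branch lengths is given below. It is not MRBAYES code: the dictionary representation of trees, the function name and the choice to average each branch only over the trees that contain the clade are assumptions. The toy sample is built so that the consensus is a tree that was never sampled:

```python
from collections import defaultdict
from statistics import mean, variance


def majority_rule_consensus(trees, threshold=0.5):
    """Keep clades found in more than `threshold` of the trees.

    `trees` is a list of dicts {frozenset of tip names: branch length},
    with the root clade omitted (assumed toy representation). Mean and
    sample variance are taken over the trees that contain the clade.
    """
    lengths = defaultdict(list)
    for tree in trees:
        for clade, length in tree.items():
            lengths[clade].append(length)
    consensus = {}
    for clade, values in lengths.items():
        frequency = len(values) / len(trees)
        if frequency > threshold:
            consensus[clade] = {
                "frequency": frequency,
                "mean_length": mean(values),
                "variance": variance(values) if len(values) > 1 else 0.0,
            }
    return consensus


# Hypothetical posterior sample of three rooted five-taxon trees.
AB, AC, DE = frozenset("AB"), frozenset("AC"), frozenset("DE")
ABC, ABCD, ABDE = frozenset("ABC"), frozenset("ABCD"), frozenset("ABDE")
trees = [
    {AB: 0.10, DE: 0.20, ABDE: 0.05},  # ((A,B),(D,E)),C
    {AB: 0.20, ABC: 0.10, ABCD: 0.05},  # (((A,B),C),D),E
    {AC: 0.10, ABC: 0.30, DE: 0.40},  # ((A,C),B),(D,E)
]

consensus = majority_rule_consensus(trees)
never_sampled = all(set(consensus) != set(tree) for tree in trees)
print(sorted("".join(sorted(c)) for c in consensus), never_sampled)
```

Each of the clades (A,B), (A,B,C) and (D,E) appears in two of the three trees, so the consensus is ((A,B),C),(D,E), a topology that none of the sampled trees has.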

BEAST, meanwhile, has its own program to summarize the trees: TreeAnnotator. By default, this utility produces the tree that maximizes the product of the posterior probabilities of its bipartitions, called the maximum clade credibility tree (MCCT). The branch lengths can be taken from the mean, the median or the common ancestor heights (hereafter ca), or the heights of the target tree can be kept. Heled and Bouckaert (2013) show that there is no single method for summarizing trees that both maximizes the posterior probability of the clades and yields good branch lengths under different metrics.
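
The maximum clade credibility criterion can be sketched in a few lines. This only illustrates the criterion as described above and is not TreeAnnotator's code; the set-of-clades representation, the function name and the toy sample are assumptions:

```python
import math
from collections import Counter


def maximum_clade_credibility(trees):
    """Return (index, log credibility) of the sampled tree whose clades have
    the highest product of posterior frequencies.

    `trees` is a list of sets of clades (frozensets of tip names), an
    assumed toy representation. Logs are summed instead of multiplying
    probabilities to avoid numerical underflow on large trees.
    """
    n_trees = len(trees)
    counts = Counter(clade for tree in trees for clade in tree)

    def log_credibility(tree):
        return sum(math.log(counts[clade] / n_trees) for clade in tree)

    best = max(range(n_trees), key=lambda i: log_credibility(trees[i]))
    return best, log_credibility(trees[best])


# Hypothetical posterior sample of four trees on the taxa A, B, C, D.
AB, BC = frozenset("AB"), frozenset("BC")
ABC, ABD = frozenset("ABC"), frozenset("ABD")
trees = [{AB, ABC}, {AB, ABC}, {AB, ABD}, {BC, ABC}]

best_index, best_log_cred = maximum_clade_credibility(trees)
print(best_index, math.exp(best_log_cred))
```

Unlike a consensus, the result is always one of the sampled trees, so the remaining choice is which heights (mean, median, ca or those of the target tree) to put on it.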

Here, in a much simpler test than theirs, I explore the differences among ways of summarizing the trees from a Bayesian inference analysis, including the topology (consensus, maximum clade credibility) and the branch length measure (mean, median, ca).

Methods

Two scenarios of sequence evolution were explored: one under a strict molecular clock and the other under an uncorrelated relaxed molecular clock. For both scenarios, a tree was generated using RateEvolver (Ho et al., 2005); this software produces a fully bifurcating phylogeny of 9 terminals whose branch lengths, measured as substitution rates, are modified according to the chosen molecular clock.
From these trees, 10 replicates of 1000-nucleotide sequences were generated in Seq-Gen (Rambaut and Grassly, 1997) under the JC model.
Phylogenetic reconstruction was carried out in BEAST v1.8.3 under the molecular clock model used to generate the data. Each analysis was run for 1,500,000 generations, and the first 10% was discarded as burn-in.
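
For reference, the burn-in arithmetic works out as in the sketch below. The sampling interval of 1000 generations is a hypothetical value (the interval actually used is not stated here), and keeping the sample that falls exactly on the cutoff is also an assumption:

```python
def discard_burnin(states, chain_length, burnin_percent=10):
    """Drop the samples logged before the burn-in cutoff.

    Integer arithmetic keeps the cutoff exact. Samples logged exactly at
    the cutoff are kept (an assumed convention).
    """
    cutoff = chain_length * burnin_percent // 100
    return [state for state in states if state >= cutoff]


chain_length = 1_500_000
sampling_interval = 1_000  # hypothetical: not given in the text
states = list(range(0, chain_length + 1, sampling_interval))

kept = discard_burnin(states, chain_length)
print(len(states), len(kept), kept[0])
```

Under these assumptions the cutoff is generation 150,000, and 1351 of the 1501 logged trees are left for the summary.
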
I summarized the trees with two programs, SumTrees (Sukumaran and Holder, 2015) and TreeAnnotator, discarding the burn-in. The branch length measures used were mean, median and ca. For the topologies I used the MCCT and the majority rule consensus.
For each replicate, the difference from the original tree was calculated with the treedist function of the phangorn R package (Schliep, 2011).
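
The Branch Score Difference reported in the figures can be sketched under its usual definition: the square root of the summed squared differences in branch length over all branches, where a branch absent from one tree counts as length zero. The code below is a Python illustration of that definition, not the phangorn implementation, and the dictionary representation and toy values are assumptions; the exact quantity returned by treedist should be checked against the phangorn documentation:

```python
import math


def branch_score_difference(tree_a, tree_b):
    """Branch score difference between two trees.

    Each tree is a dict {frozenset of tip names: branch length} that
    includes the terminal branches (assumed toy representation). A branch
    missing from one tree contributes its full length to the difference.
    """
    branches = set(tree_a) | set(tree_b)
    return math.sqrt(
        sum((tree_a.get(b, 0.0) - tree_b.get(b, 0.0)) ** 2 for b in branches)
    )


# Hypothetical reference tree ((A,B),(C,D)) and a summary tree in which
# the clade (C,D) is collapsed and the branch leading to (A,B) is shorter.
tips = {frozenset(t): 0.1 for t in "ABCD"}
AB, CD = frozenset("AB"), frozenset("CD")
reference = {**tips, AB: 0.5, CD: 0.4}
summary_tree = {**tips, AB: 0.2}

bsd = branch_score_difference(reference, summary_tree)
print(round(bsd, 6))
```

Because missing branches are penalized by their whole length, a poorly resolved consensus and a fully resolved MCCT can score quite differently even when their shared branches have similar lengths.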

Results and Discussion

The results are shown in Figures 1 and 2.
Under an uncorrelated relaxed molecular clock the results are quite similar (Fig 1). The lowest differences are obtained with the majority rule consensus, using either the mean or the median as branch lengths. A similar value is found with the MCCT when ca is chosen to represent the branches.

Fig 1. Uncorrelated Relaxed Clock. Results of mean Branch Score Difference for 10 replicates with respect to reference tree

The results obtained under a strict molecular clock (Fig 2) seem contradictory. The best distance values (BSD) from the original tree are obtained using the maximum clade credibility tree. An abrupt difference in distance is observed compared with the consensus trees. Given the kind of molecular clock, the results for the consensus and the MCCT were expected to be very similar. One possible explanation is a mistake in the software used, but this seems unlikely given that the results obtained under the relaxed molecular clock seem reasonable.

Fig 2. Strict Clock. Results of mean Branch Score Difference for 10 replicates with respect to reference tree

With these results it is not possible to draw a clear conclusion about the best way to summarize a phylogenetic reconstruction when branch lengths are taken into account. Choosing the mean or the median does not seem to greatly affect the branch lengths. However, more work is needed, given the ambiguous results reported here.

References.

Drummond, A. J., Suchard, M. A., Xie, D., & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29, 1969-197

Felsenstein, J. (2004). Inferring phylogenies. Sunderland: Sinauer Associates.

Heled, J., & Bouckaert, R. R. (2013). Looking for trees in the forest: summary tree from posterior samples. BMC Evolutionary Biology, 13(1), 1.

Ho, S. Y., Phillips, M. J., Drummond, A. J., & Cooper, A. (2005). Accuracy of rate estimation using relaxed-clock models with a critical focus on the early metazoan radiation. Molecular Biology and Evolution, 22(5), 1355-1363.

Rambaut, A., & Grassly, N. C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences: CABIOS, 13(3), 235-238.

Ronquist, F., & Huelsenbeck, J. P. (2003). MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572-1574.

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.

Sukumaran, J., & Holder, M. T. (2015). SumTrees: Phylogenetic Tree Summarization. 4.0.0 (Jan 31 2015). Available at https://github.com/jeetsukumaran/DendroPy.


Thursday, March 31, 2016
