lunes, 21 de septiembre de 2015

Estimating divergence times under the methods autocorrelated variation rate and independent rate depending on the size of the topology using MULTIDIVTIME and BEAST.

Introduction

Estimating divergence times can be performed by methods based on variation of substitution rates. The AR method establishes that the rates of change between ancestor and descendant are autocorrelated and the rates can follow a log-normal distribution with a mean equal to the parent rate (Thorne, Kishino and Painter, 1998; Kishino, Thorne and Bruno, 2001; Thorne and Kishino, 2002) or can be determined by a non-central χ 2 distribution (Lepage et al., 2007). Another method used in molecular clock analysis is the RR method which assigns random independent rates for each lineage and these rates are drawn from a single underlying parametric distribution such as an exponential or log-normal (Drummond et al., 2006; Rannala and Yang, 2007; Lepage et al., 2007). 

RR method has been implemented in BEAST the AR method has been implemented in MULTIDIVTIME. In terms of the programs, other difference is the priors on node ages, where MULTIDIVTIME uses a dirichlet distribution without an explicit assumption about the biological processes (Kishino, Thorne and Bruno, 2001; Thorne and Kishino, 2002) and in the other hand, BEAST uses a Yule prior and the Birth-Death prior (Yule, 1924; Rannala and Yang, 1996; Yang and Rannala, 1997).

These two methods have been widely used to dating phylogenies and it is recommended to use several calibration points; however, it is not common to find a number of fossil equivalent to the number of nodes in the topology. At the moment, it has not been established what is the minimum number of constraints to achieve correctness divergence times.

Figure 1. Number of citations. Comparison between Bayesian inference and Maximum Likelihood programs.

Objectives

General objective
To determine what are the minimum number of calibration points related to the number of tips of a topology using the Autocorrelated rate and Random rate methods.

Specific objectives
To correlate the divergence time estimated with the divergence time simulated.
To determine the delta of variation when the number of points increase.

Methods

1. Simulations

The trees will be simulated considering four different number of tips: 10, 25, 50 and 100, plus an age of 31Mya at the root, with the package phytools v.0.4-56 (Revell, 2012) in the R language and it will be replicated 3 times. Based on the these trees, the sequences will be simulated using the program Seq-gen v1.3.3 (Rambaut and Grass, 1997), under the model HKY, with base frequencies 0.30[T], 0.25[C], 0.30[A], 0.15[G]; a transition-transversion rate K=5, gamma parameter alpha = 0.5 and a length of 1000 bp.

2. Molecular clock analysis and calibration

The analyzes will be performed using the AR and RR methods respectively in MULTIDIVTIME v9.25.03 (Thorne and Kishino, 2002) (MDT) and BEAST v.1.8.2 (Drummond and Rambaut, 2007; Drummond et al., 2012). In MDT the divergence time will be calculated after 10⁶ generations employing a correlated relaxed lognormal clock. In BEAST the analyzes will be done under an uncorrelated relaxed lognormal clock, Yule speciation process, a normal distribution for the tmrca and different number of generations depending on the authors's recommendation about the relation between it and the number of tips. For both methods the calibration points will be chosen randomly, leaving the base node of the ingroup fixed to 30 Mya. The number of points will be replicated for every tree simulation.

3. Correlation and additional comparisons

The times simulated and the outcomes of these analyzes will be compared and related by a Pearson correlation for each topology of different size and replica. The standard deviation among replicas was calculated and the delta between number of points, with the aim of observe variance in the results.

Results and Discussion

Accurate in dating with AR and RR methods varies according to the number of tips in the phylogeny; by increasing the number of tips the correlation coefficient (R2) decrease, thus the number of calibration points must be in proportion to the number of tips. It is evidenced when for the two methods there is a point from which the correlation decrease (Figure 2, a, b).

A


B
Figura 2. Variation of the correlation coefficient after increase the number of calibrated nodes. A. Autocorrelated rate method. B. Random rate method.

The addition of the second point of calibration does not improve the dating of the phylogeny in approximately 90% of the cases reviewed (Figure 2). Similar results were obtained by Battistuzi et al., 2010, they assess the effect of multiples calibrations into the credibility Intervals (CrIs) size, and they do not found differences in dating with one or two points. On the other hand, the use of exact divergence times can generate this behavior, in which differences between one and two points are not perceived (Benton et al., 2009). Consequently, the minimum number of calibration points is one, because with just one point the value of the correlation coefficient is greater than 0.89, in all cases. However, when I add three points, the coeficient (R2) was high and in almost 70% of the cases was the maximum value.


Standard deviation among replicas was of 10⁻¹, although cases of standard deviation of 10⁻³ were recorded , it demonstrated that there is no changes in using different points and multiples runs of the programs. This is consistent with Hedges et al., 2004, who found that along many methods if the data and times are constant the estimations will be similar. Both methods evaluated in this work are based on Bayesian Inference and the priors play a key role in the estimations of divergence time and in all the analyses, but Brown and Yang in 2010 tested the influence of the priors and conclude that it affect the estimation, however, one can obtain similar results.

A

B

Figure 3. Standard deviation among replicas for each number of tips. A. MultiDivTime. B. BEAST.

The delta of increasing the number of points varied in the order of 10⁻¹ until 10⁻³. Thus, demonstrating, that the differences among number of calibrations vary in a little range. Then, when the position of the unique point was assessed, the position imports, is not the same dating towards the tips that dating the base node of the ingroup, but the value of the coefficient of correlation is not below 0.75 in both cases (Figure 4). According to Battistuzi et al., 2010, there is no significant differences in the relationship of the calibration nodes and the estimated nodes, “this suggests that the relative positions of the two nodes do not affect the time estimations as long as the two nodes are closely related (Battistuzi et al., 2010).” In contrast, with empirical data, the estimations depends on the position, the method, and the dataset (Hug and Roger, 2007), idea discussed by many other authors who assess this problem via simulations and empirical data and found similar results among different methods, or the parameters of the analysis.

Figure 4. Variation of the correlation coefficient according to different positions of the calibrated node.

In addition, to reduce the error by 5% and increase the coefficient of correlation in larger topologies compared to that obtained in small topologies, with a single point calibration, a number of points four time major than the points evaluated are required. However, a solution would be dated by small separate clades within the topology in order to obtain a high coefficient correlation for all phylogeny.

Conclusions

Finally, the number of tips influence the estimation, regardless the method used in estimating divergence times. Although a point is the minimum number, when the number of tips increase with the error occurs exactly the same.

References

Battistuzzi, F. U., Filipski, A., Hedges, S. B., & Kumar, S. (2010). Performance of relaxed-clock methods in estimating evolutionary divergence times and their credibility intervals. Molecular biology and evolution27(6), 1289-1300.

Brown, R. P., & Yang, Z. (2010). Bayesian dating of shallow phylogenies with a relaxed clock. Systematic biology59(2), 119-131.

Donoghue, P., Benton, M., Ziheng, Y., & Inoue, J. (2009). Calibrating and constraining the molecular clockJournal of vertebrate paleontology29, 89A-89A.

Drummond, A. J., Ho, S. Y., Phillips, M. J., & Rambaut, A. (2006). Relaxed phylogenetics and dating with confidence. PLoS biology4(5), 699.

Drummond, A. J., & Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC evolutionary biology7(1), 214.

Drummond, A. J., Suchard, M. A., Xie, D., & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular biology and evolution,29(8), 1969-1973.

Hedges, S. B., Blair, J. E., Venturi, M. L., & Shoe, J. L. (2004). A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC evolutionary biology4(1), 2.

Hug, L. A., & Roger, A. J. (2007). The impact of fossils and taxon sampling on ancient molecular dating analyses. Molecular biology and evolution24(8), 1889-1897.

Kishino, H., Thorne, J. L., & Bruno, W. J. (2001). Performance of a divergence time estimation method under a probabilistic model of rate evolution.Molecular Biology and Evolution18(3), 352-361.

Lepage, T., Bryant, D., Philippe, H., & Lartillot, N. (2007). A general comparison of relaxed molecular clock models. Molecular biology and evolution,24(12), 2669-2680.

Rambaut, A., & Grass, N. C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees.Computer applications in the biosciences: CABIOS13(3), 235-238.

Rannala, B., & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of molecular evolution43(3), 304-311.

Rannala, B., & Yang, Z. (2007). Inferring speciation times under an episodic molecular clock. Systematic Biology56(3), 453-466.

Revell, L. J. (2012). phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution3(2), 217-223.

Thorne, J. L., Kishino, H., & Painter, I. S. (1998). Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution,15(12), 1647-1657.

Thorne, J. L., & Kishino, H. (2002). Divergence time and evolutionary rate estimation with multilocus data.Systematic Biology51(5), 689-702.

Yang, Z., & Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Molecular biology and evolution14(7), 717-724.


Yule, G. U. (1924). An introduction to the theory of statistics. Bull. Amer. Math. Soc. 30 (1924), 465-466