jueves, 31 de marzo de 2016

Dirichlet pior


                                                                        Sensibility analysis: Dirichlet prior
Introduction
Compound Dirichlet priors for branch lengths was proposed as a solution to the problem of branch length overestimation in bayesian inference because the default exponential prior produces too large branch lengths and in some cases are unreal. Several authors have investigated this problem: Brown et al. (2010), Marshall (2010),  and Rannala et al. (2012).  Rannala et al. (2012) suggest the main cause is the poor default prior on branch lengths used in MrBayes, which assigns independient and identical distribution (i.i.d.) for branch lengths of the tree and proposed the compound Dirichlet priors on branch lengths, this uses a diffuse gamma or inverse gamma distribution on the tree length and the Dirichlet distribution is used to partition the tree length into branch lengths. The goal of this present work is compare default prior on branch lengths with the new compound Dirichlet priors implemented in MrBayes 3.1.2 (Ronquist and Huelsenbeck 2003).

Methodology
First, download sequences for 94 taxa of the froglet Crinia, genes : 12S, 16S, trnF, tRNA-Val using GenBank. Align by genes in MUSCLE (multiple sequence comparison by log-expectation) (Edgar, R.C. 2004). Set the evolutionary model using AIC in the software R version 3.2.3 (R Core Team, 2014) using phangorn (Schliep, 2011) package and the function modelTest. Create a nexus file to run  the Bayesian analysis in MrBayes version 3.1.2 (Ronquist and Huelsenbeck 2003), for compound Dirichlet prior the prior mean was set to be 0.0001, 0.001, 0.1, 1, 50, 100. Parameters α=c=1 were fixed. For each parameter was set four MCMC chains, and 2'500.000 generations and the analysis was run until the average standard deviation of split frequencies < 0.01. For the default exponential prior was used the same values of prior mean. Then, two nodes were extracted to see the behavour, one with a large branch and other one wihout large branchs.  To see the variation with respect to branch length was carried out the metric: Branch Score Difference (Kuhner & Felsenstein, 1994) taken exponential tree as reference topology and Robinson Foulds metric (Robinson & Foulds, 1981) to see if the topology changed. I simulated one tree with a large branch and another one without large branch using Seq-gen (Rambaut and Grassly, 1997).

Results and discussion



NODES





Branch lenght:
In general there is a pattern when Bt obtains values greater than 1, BSD decreases. In other words the difference in branch lengths between the consensus trees obtained using exponential and Dirichlet prior is becoming smaller when Bt increases (Fig 1).


Node 133 is similar to reference tree (exponential). Unlike 99 node which presents more variation respect the tree obtained under exponential prior, this can be explain by the large branch length (Fig 1).



Figure 1. Branch Score Difference (BSD) between each consensus tree of the exponential and Dirichlet prior for both nodes with differents values of Bt (prior mean).

For both nodes the exponential prior produces larger trees than compound Dirichlet priors for differents values of Bt. node 99 presents larger differences because exponential prior favors large branch (Zhang et al 2012) (Table 1).
 

 
Bt Expo Dir
Node 99 Node 133 Node 99 Node 133
1 0.3734763 0.05280543 0.134272 0.029316
0.1 0.386117 0.05436114 0.143912 0.029326
0.01 0.3879789 0.05413017 0.134548 0.029298
10 0.3100098 0.04519354 0.126194 0.028405
100 0.1683404 0.03025578 0.09396 0.022991
0.0001 0.3879235 0.05424801 0.145419 0.029751
50 0.2114827 0.03458418 0.105251 0.024711


Table 1. Tree length (TL) for each concensus topology under differents Bt values for exponential and Dirichlet priors.                                          

Topology:
Regarding topology there isn't difference among trees obtained from exponential and Dirichlet prior (RF = 0).

TOTAL TOPOLOGY

Posterior distribution:
In the posterior exponential prior generated longer trees than Dirichlet prior (Fig 2).


Figure 2.  Effect of the exponential and Dirichlet prior on the posterior distribution of tree length.. Each colour represents a differente BT value: Red=1, pink= 0.1, blue claro= 0.01, green= 10, dark blue=100 , orange=0.0001 , gray=50.

In the complete topology, exponential prior shows the largest phylogenetic trees especially when Bt takes values less than 1, when Bt takes values greater than 1 decreases the size. The standard distribution is larger for Dir prior than exponential prior being greater for Bt values less than 1 (Fig 3).
 
Figure 3.  Standard Deviation (SD) for the tree total length from each Bt value.

In the total topology the behavour is the same like 99 and 133 node. BSD begins to decrease for Bt values greater than 10 but RF shows differences.
SIMULATION
The values obtained are very close to the reference topology (tree simulated with and without large branch) using both priors and in Dir prior the variance was a little bit bigger than Exp prior, which was also found in 99 and 133 nodes.
Conclusion
Despite the difference between the results obtained from an empirical perspective and simulations, I think that this differences relate to the fact that in simulations all is perfect, all is in order and in biological systems is difficult to reach this point. I also found that there isn't influence in the posterior distribution from prior distribution in the Dirichlet prior and in exponential prior this does not happen. For this reasons I consider the Compound Dirichlet prior could be the solution to branch-length overestimation in bayesian inference because this produces more reasonable branch-length estimates than the default prior used in MrBayes and suggest to do more replicates and work with a known taxonomic group for which you know its diversification times to reach a best view. 

Bibliography

Brown JM, Hedtke SM, Lemmon AR, Lemmon EM. 2010. When trees grow too long: investigating the
causes of highly inaccurate Bayesian branch-length estimates. Syst Biol, 59:145-161.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Res. 32(5):1792-1797

Kuhner, M. K., & Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3), 459–468.

Marshall DC. 2010. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of
long trees. Syst Biol, 59:108-117.

Rannala B, Zhu T, Yang Z. 2011. Tail paradox, partial identifiability and influential priors in Bayesian
branch length inference. Mol Biol Evol.

R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.

Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.

Schliep, K. P. (2011). phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4), 592–593.

Zhang C, Rannala B, Yang Z. 2012. Robustness of compound dirichlet prior for bayesian inference of branch lenghts. Syst Biol.