Saturday, April 6, 2019

Dirichlet vs. fixed frequencies: Sensitivity of Bayesian Inference to the base frequencies prior



Introduction
Phylogenetic reconstruction through Bayesian Inference (BI) represents a significant advance, with advantages such as the possibility of incorporating prior information, easy interpretation of results, computational efficiency, and the ability to accommodate complex evolutionary models (Huelsenbeck et al., 2001; Ronquist & Huelsenbeck, 2003). Also, because it uses the likelihood function, it shares the efficiency and consistency of Maximum Likelihood (Huelsenbeck et al., 2002).

Nevertheless, BI has a critical weakness: its inferences always remain within the support of its priors. Therefore, what we would call "the truth" must be partially believed before it can be made known, so appropriate prior specification becomes crucial in this type of analysis (Gelman & Shalizi, 2012; Schoot et al., 2014).

For this reason, I estimate the sensitivity of BI, in terms of accuracy and precision, to changes in the base frequencies prior. The hypotheses I test are: (1) with fixed frequencies as the prior, BI is more precise than with a Dirichlet base frequencies prior, and (2) BI estimates with fixed frequencies are more accurate than estimates with a Dirichlet prior.

Methodology
Using RStudio, I simulated 3 topologies each with 12 and 48 terminals, for a total of 6 different topologies. From these I simulated 108 matrices in Seq-Gen under the HKY and JC models and 3 sequence lengths (100, 500 and 2500 characters), with three replicates per combination of topology, model and length. I then analysed every matrix with BI in MrBayes, once with fixed frequencies and once with a flat Dirichlet prior of (1,1,1,1). Because my interest was in the topology and not in the branch lengths, I calculated the Robinson-Foulds (RF) distance to indirectly estimate precision and accuracy in the recovery of trees under the two cases I analysed.
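
As an illustration of the Robinson-Foulds comparison, here is a minimal Python sketch (it is not the R code used in this work). It assumes that trees are written as nested tuples of taxon names, treats them as unrooted, and counts the non-trivial bipartitions found in one tree but not in the other. The function names and the two four-taxon trees are hypothetical.

```python
def leaf_set(tree):
    """Return the set of taxon names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (splits) of an unrooted tree."""
    taxa = leaf_set(tree)
    anchor = min(taxa)  # reference taxon: always store the side without it
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = clade if anchor not in clade else taxa - clade
        # Skip trivial splits (a single taxon against all the others).
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical example: a simulated tree and an inferred tree.
simulated = (("A", "B"), ("C", "D"))
inferred = (("A", "C"), ("B", "D"))
distance = robinson_foulds(simulated, inferred)
```

With trees of the same taxa, a distance of 0 means the two topologies are identical; larger values mean more conflicting groupings.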

Scripts with the detailed methodology, the parameters I used and the software specifications can be found at the following link:
  https://github.com/IndiraMG/Trabajo_final_BioComparada-I
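
The design described in the Methodology can also be laid out as a short Python sketch. It assumes that the 108 matrices are the product of 6 topologies, 2 models, 3 lengths and 3 replicates, and it writes a MrBayes block for each of the two priors. The use of `fixed(equal)` for the fixed-frequency case, the chain length and the sampling frequency are assumptions for illustration, not the settings from the linked scripts.

```python
from itertools import product

TAXON_COUNTS = (12, 48)
TOPOLOGIES_PER_SIZE = 3
MODELS = ("JC", "HKY")
SEQ_LENGTHS = (100, 500, 2500)
REPLICATES = 3

# MrBayes "lset nst": 1 substitution type for JC, 2 for HKY.
# Note: with nst=1 and estimated base frequencies the model is strictly F81.
NST = {"JC": 1, "HKY": 2}

# The two base-frequency priors compared; "fixed(equal)" is an assumption.
PRIORS = {
    "dirichlet": "dirichlet(1,1,1,1)",
    "fixed": "fixed(equal)",
}


def design():
    """Yield (n_taxa, topology, model, length, replicate) for every matrix."""
    yield from product(
        TAXON_COUNTS,
        range(1, TOPOLOGIES_PER_SIZE + 1),
        MODELS,
        SEQ_LENGTHS,
        range(1, REPLICATES + 1),
    )


def mrbayes_block(model, prior, ngen=1_000_000, samplefreq=1000):
    """Build the text of a MrBayes block for one model and one prior."""
    return "\n".join([
        "begin mrbayes;",
        f"  lset nst={NST[model]};",
        f"  prset statefreqpr={PRIORS[prior]};",
        f"  mcmc ngen={ngen} samplefreq={samplefreq};",  # hypothetical values
        "  sumt;",
        "end;",
    ])


matrices = list(design())
n_analyses = len(matrices) * len(PRIORS)  # each matrix is analysed twice
example_block = mrbayes_block("HKY", "dirichlet")
```

Under these assumptions `matrices` holds 108 entries and `n_analyses` is 216, one run per matrix and prior.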

Results and Discussion
Figures 1 and 2 show very similar trends in the mean and standard deviation of RF in all the scenarios observed. This is probably due to the use of a flat Dirichlet prior, (1,1,1,1), which is appropriate when we want to estimate these parameters from the data and have no prior knowledge about their values (Ronquist et al., 2011). It is therefore reasonable to think that the program estimated the base frequencies from the data and that these estimates were very close to the values set under fixed frequencies.
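
The pull of the data under a flat Dirichlet can be illustrated with a simplified Python sketch. It treats the bases of an alignment as plain multinomial counts, which is only an analogy, since MrBayes estimates the frequencies inside the full phylogenetic likelihood. Under that assumption the posterior mean of each frequency is (count + alpha) / (total + sum of alphas). The counts below are made up for illustration and are not taken from the simulated matrices.

```python
def posterior_mean(counts, alpha=(1, 1, 1, 1)):
    """Posterior mean of base frequencies under a Dirichlet prior.

    Simplified multinomial analogy: counts are the observed numbers of
    A, C, G and T; alpha holds the Dirichlet parameters.
    """
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]


# With no data, the flat prior alone gives equal frequencies.
prior_only = posterior_mean((0, 0, 0, 0))

# Hypothetical counts from 1200 bases generated with equal frequencies.
with_data = posterior_mean((305, 295, 302, 298))

# Largest departure of the estimate from the equal-frequency value 0.25.
largest_gap = max(abs(f - 0.25) for f in with_data)
```

In this toy case the estimates stay within about 0.005 of 0.25, so an analysis with a flat Dirichlet and one with frequencies fixed to equal values would be working with nearly the same base frequencies.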

On the other hand, the size of the matrix influenced the precision with which both approaches recovered the topologies. Under the HKY model, precision increased from 500 characters onward, in contrast to the JC model, which was more variable and therefore less precise. This probably happens because JC is the simplest model (Matthew et al., 2005), which may fit the data less well and so decrease precision. Regarding the increase in precision as the number of characters increases, it is important to highlight that changes in the number of taxa did not have any impact; however, a more general conclusion would require a wider range of taxon numbers to contrast.
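
The mean and standard deviation of RF per scenario, as summarised in the figures, can be computed along these lines. This is a minimal Python sketch with placeholder values, not the results of this work; the record layout (model, sequence length, RF) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise(records):
    """Group RF distances by (model, length) and return (mean, sd) per group."""
    groups = defaultdict(list)
    for model, length, rf in records:
        groups[(model, length)].append(rf)
    return {key: (mean(values), stdev(values)) for key, values in groups.items()}


# Placeholder RF distances for three replicates per scenario.
records = [
    ("HKY", 100, 4), ("HKY", 100, 6), ("HKY", 100, 8),
    ("HKY", 500, 2), ("HKY", 500, 2), ("HKY", 500, 2),
]
summary = summarise(records)
```

A smaller standard deviation within a scenario corresponds to higher precision in the sense used here, while the mean RF against the simulated tree relates to accuracy.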


Fig. 1. Summary statistics of RF for the inferred topologies using a Dirichlet distribution as the base frequency prior. (A) Mean of RF for topologies inferred only with the JC model. (B) Standard deviation of RF for topologies inferred only with the JC model. (C) Mean of RF for topologies inferred only with the HKY model. (D) Standard deviation of RF for topologies inferred only with the HKY model.

Fig. 2. Summary statistics of RF for the inferred topologies using fixed frequencies as the base frequency prior. (A) Mean of RF for topologies inferred only with the JC model. (B) Standard deviation of RF for topologies inferred only with the JC model. (C) Mean of RF for topologies inferred only with the HKY model. (D) Standard deviation of RF for topologies inferred only with the HKY model.

The Oxford dictionary defines accuracy as "the degree to which the result of a measurement, calculation, or specification conforms to the correct value or a standard". The problem in determining the accuracy of an estimator is that the reference value against which a result should be compared is unknown (Schoot et al., 2014). In this case, it is not possible to determine the accuracy of BI because, with so few replicates, there is no way to guarantee that the topologies estimated in either case (Dirichlet or fixed frequencies) are not influenced by the way the matrices were generated in Seq-Gen. For this reason, I cannot test my second hypothesis without introducing a large bias when comparing the recovered topologies against the initial topologies.

Conclusion
This work did not answer which prior is more accurate or precise, since with the available data there was no difference between the estimates using Dirichlet or fixed frequencies. It is clear, however, that both appear to be consistent, because their precision increases with larger data sets.

References
- Gelman, A., & Shalizi, C. R. (2012). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8-38.

- Huelsenbeck, J. P., Larget, B., Miller, R. E., & Ronquist, F. (2002). Potential applications and pitfalls of Bayesian inference of phylogeny. Systematic Biology, 51(5), 673-688.

- Huelsenbeck, J. P., Ronquist, F., Nielsen, R., & Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294, 2310-2314.

- Matthew, S., Edward, S., & Andrew, J. R. (2005). Likelihood, parsimony, and heterogeneous evolution. Molecular Biology and Evolution, 22(5), 1161-1164.

- Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & Aken, M. A. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85, 842-860.

- Ronquist, F., Huelsenbeck, J., & Teslenko, M. (2011). Draft MrBayes version 3.2 manual: Tutorials and model summaries. Distributed with the software from http://brahms.biology.rochester.edu/software.html

- The Oxford Dictionary. Taken from: https://en.oxforddictionaries.com



