Thursday, March 13, 2008

Estimating Branch Lengths: A Bayesian Approach

Villabona-Arenas, C. J.
Laboratorio de Sistemática y Biogeografía, Escuela de Biología
Universidad Industrial de Santander


Bayesian inference (BI) uses Bayes' theorem to calculate the posterior distribution of a parameter, that is, the conditional distribution of the parameter given the data (Holder and Lewis, 2003). Bayesian Evolutionary Analysis Sampling Trees (BEAST) is a software package for Bayesian MCMC analysis of molecular sequences that, in contrast to other programs, is oriented towards rooted, time-measured phylogenies. RNA viruses have broadly similar substitution rates despite differing in genome organization and biological properties, which implies that both the error rate of the RNA polymerase and the rate of viral replication are roughly constant (Holmes, 2003); RNA viruses are therefore suitable for the BEAST framework. In this work I explored how branch length estimates differ between Maximum Likelihood (ML) and Bayesian methods, using simulated and real data sets.
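To make the contrast between the two kinds of estimate concrete, here is a minimal sketch for a single branch. It is not what PhyML or BEAST do internally: it assumes the simpler Jukes-Cantor (JC69) model instead of HKY85, an exponential prior on the branch length, and a plain grid approximation instead of MCMC. The function names, the prior mean and the counts in the usage note are all hypothetical.

```python
import math

def jc69_p_diff(d):
    """Probability that a site differs across a branch of length d (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def ml_branch_length(k, n):
    """Closed-form JC69 maximum likelihood estimate from k differing sites out of n."""
    p_hat = k / n
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)

def posterior_mean_branch_length(k, n, prior_mean=0.1, d_max=2.0, steps=20000):
    """Grid approximation of the posterior mean of the branch length.

    Bayes' theorem: posterior is proportional to likelihood times prior.
    The exponential prior with mean `prior_mean` is an assumption of this sketch.
    """
    h = d_max / steps
    grid, log_post = [], []
    for i in range(1, steps + 1):
        d = i * h
        p = jc69_p_diff(d)
        log_lik = k * math.log(p) + (n - k) * math.log(1.0 - p)
        log_prior = -d / prior_mean
        grid.append(d)
        log_post.append(log_lik + log_prior)
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_post)
    weights = [math.exp(x - m) for x in log_post]
    total = sum(weights)
    return sum(d * w for d, w in zip(grid, weights)) / total
```

With, say, 100 differing sites out of 1000, `ml_branch_length(100, 1000)` and `posterior_mean_branch_length(100, 1000)` give nearly the same value, because with this much data the likelihood dominates the prior.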


Seq-Gen version 1.3.2 (Rambaut, 2007; http://tree.bio.ed.ac.uk/software/seqgen/) was used to simulate aligned sequences 1000 nucleotides long under the Hasegawa-Kishino-Yano 85 (HKY85) model, producing three replicate alignments for each of three sets of simulation parameters (Figure 1). Table 1 lists the taxa used for the analyses with real data. This data set included 11 published partial nucleoprotein gene sequences of rabies virus (RABV) isolated in Colombia during 1994-2005 and deposited in GenBank, plus the CTN-181 reference RABV strain as outgroup. The RABV nucleotide sequences were aligned with the MUSCLE 3.6 software package (Edgar, 2005) using default parameters. The alignments were used to reconstruct a Maximum Likelihood (ML) tree with PhyML 2.4.4 (Guindon and Gascuel, 2003) and Bayesian Inference (BI) trees with BEAST (Drummond and Rambaut, 2007). Bootstrapping with 1000 replicates was used to place confidence values on groupings within the ML tree. For BI, two approaches were used: one specifying the sampling dates of the sequences and one without them. The MCMC search was run for 1,000,000 generations, sampling the Markov chain every 1000 generations and using a coalescent tree prior that assumes a constant population size back through time. The first 25% of the sampled trees were discarded as “burn-in”; the remaining trees were used to summarize the posterior distribution of tree topologies and branch lengths by finding the maximum clade credibility tree and the mean node height for each clade. Each BI analysis was performed three times.
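The sampling and burn-in figures above can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that the initial state (generation 0) is logged along with every 1000th generation.

```python
def sampled_states(chain_length, sample_every):
    """Generation numbers at which the chain is logged (state 0 included by assumption)."""
    return list(range(0, chain_length + 1, sample_every))

def discard_burn_in(samples, burn_in_fraction):
    """Drop the first `burn_in_fraction` of the samples and keep the rest."""
    n_discard = int(len(samples) * burn_in_fraction)
    return samples[n_discard:]

states = sampled_states(1_000_000, 1000)   # chain settings from the text
kept = discard_burn_in(states, 0.25)       # 25% burn-in from the text
print(len(states), len(kept), kept[0])
```

Under these assumptions each run logs 1001 trees, of which 751 remain after burn-in, starting at generation 250,000.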


Figure 2 and Table 2 present the topologies and branch lengths obtained with ML and BI for each set of simulations, respectively. Maximum Likelihood recovered the true topology in all nine simulations, while Bayesian Inference did so in all replicates of cases A and B; in case C, BI recovered the true topology in only one of three replicates. Both methods obtained similar branch lengths, close to the initial values, for simulations A and B; ML also recovered the branch lengths for simulation C, but BI did not. The ML tree for RABV is presented in Figure 3; the BI trees are presented in Figures 4 and 5. The three trees show the same groupings. BI recovered the same branch lengths as ML when years were not specified; when sampling dates were specified, the branch lengths changed accordingly.


BI presented the rapidly evolving sequences in simulation C as closely related regardless of their true relationships; this suggests that the method can suffer from long-branch attraction. Because MCMC is a stochastic algorithm that produces sample-based estimates of a target distribution, and the BEAST implementation assumes calibrated trees, the method interprets this similarity as a descent relationship, increasing the probability that both taxa are sampled as sisters.
As for the branch lengths in that simulation, BEAST uses a strict or relaxed molecular clock as its basic model for rates among branches. Because of the strong assumption that the rate of evolutionary change of the sequences is approximately constant over time, the method does not recover the branch lengths well, mainly because it tries to fit the observed differences to a model in which strong rate variation among the branches of the tree is uncommon. On the other hand, BI works well when there is no such variation in rates, as seen in closely related species or within populations.
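One way to see what the clock assumption demands is to compare root-to-tip distances: under a strict clock with contemporaneous tips, every tip should be the same distance from the root. The sketch below checks this for two toy trees. The tree encoding, the function names and all branch lengths are hypothetical and are not the parameters of simulations A, B or C.

```python
def root_to_tip(node, dist=0.0, out=None):
    """Return {tip name: distance from root}; node = (name, branch_length, children)."""
    if out is None:
        out = {}
    name, length, children = node
    dist += length
    if not children:
        out[name] = dist
    for child in children:
        root_to_tip(child, dist, out)
    return out

def clock_spread(tree):
    """Largest minus smallest root-to-tip distance; near 0 for a clock-like tree."""
    distances = root_to_tip(tree).values()
    return max(distances) - min(distances)

# Hypothetical clock-like tree: every tip is 0.10 substitutions/site from the root.
clocklike = ("root", 0.0, [
    ("n1", 0.05, [("A", 0.05, []), ("B", 0.05, [])]),
    ("n2", 0.02, [("C", 0.08, []), ("D", 0.08, [])]),
])

# Hypothetical tree with two fast-evolving, non-sister lineages (B and D).
non_clocklike = ("root", 0.0, [
    ("n1", 0.05, [("A", 0.05, []), ("B", 0.60, [])]),
    ("n2", 0.02, [("C", 0.08, []), ("D", 0.70, [])]),
])

print(clock_spread(clocklike), clock_spread(non_clocklike))
```

The second tree has a large spread, which is the kind of rate variation among branches that a strict clock cannot accommodate.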
As Figures 3 and 4 show, when the data meet the main assumptions of the method, BI behaves like ML. When dates are incorporated into the model, they provide a source of information about the overall rate of evolutionary change, which in this case produces a change in the estimated branch lengths relative to the analysis in which no years were specified. As shown here, there are scenarios where BI does not work well; in general, when the data fit the framework of the method, BI can be used for evolutionary parameter estimation. Even though ML methods do not allow time data to be incorporated, ML recovered the branch lengths and correct topologies in all the evaluated scenarios, showing it to be a method that accurately describes molecular sequence variation.

1 comment:

Salva said...

Hi Christian

Well, too bad I couldn't see the figures...

One thing worried me:
"BEAST implementation assumes calibrated tress the method interprets this similarity as a descend-relationship increasing the probability that both taxa be sample a sisters"

Is BEAST a distance-based method that uses Bayesian inference? That really would be the last straw: everything bad about Bayesian methods, plus everything bad about distances!