martes, 22 de septiembre de 2015

Estimating Evolutionary Rates and Times to the last Common Ancestor DEV-2, using partitions vs complete coding region

Introduction
Dengue Virus serotype 2(DENV-2) is one of the four serotypes of the arthropod borne viruses (arbovirus) that belongs to the Flaviridae family. As the other serotypes, it is responsible for causing Dengue fever and severe Dengue. The genome is formed by a single stranded positive-sense RNA of approximately 11kb long. This viral RNA serves as an mRNA for the translation of 10 viral proteins: 3 structural(Capside C, Membrane precursor preM, Envelope E), that assemble the nucleocapside and envelope of mature virions; and 7 non-structural(NS1, NS2a, NS2b, NS3, NS4a, NS4B, NS5), which are mainly responsible for viral replication and other cell functions.

Due to its considerable diversity, DENV-2, has been grouped into 6 lineages, designated as genotypes (Rico-hesse et al., 1997, 1998; S Susanna Twiddy et al., 2002)⁠. 1.) Asian I genotype, represents strains from Thailand, Malaysia, Cambodia, Myanmar, Vietnam and Australia. 2.) Asian II, represents strains from China, Indonesia, The Philippines, Taiwan, Sri Lanka, India, Honduras and Mexico. 3.) Southeast (SE) Asian/American genotype comprises all strains from Southeast Asia, and strains collected in Central and South America and the Caribbean over the last 30 years. 4.) The Cosmopolitan genotype, represents strains distributed in a wide geographic area, including East and West Africa, the Middle East, the Indian subcontinent, Indian and Pacific Ocean Islands and Australia. 5.) American genotype, representing strains from Central and South America, the Caribbean and older strains collected in the Indian subcontinent and the Pacific Islands. Lastly, there is the Sylvatic genotype, representing strains from humans, canopy-dwelling arboreal mosquitoes, and non-human primates collected in West Africa and Southeast Asia. Moreover, the great genetic diversity within the DENV-2 genotypes, reflect their continual divergence and diverse geographic distribution (Chen & Vasilakis, 2011)⁠.

Several studies have been aimed at understanding the epidemiology of DENV, rates and dates of evolution, and selection pressure in its different genes and genotypes (Rico-hesse & Rico-hesse, 1990; Rico-Hesse et al., 1997; S. S. Twiddy, Holmes, & Rambaut, 2003; Weaver & Vasilakis, 2009; Zanotto, Gould, Gao, Harvey, & Holmes, 1996)⁠. Further, a relatively recent review (Chen & Vasilakis, 2011)⁠ reports large scale phylogenetic studies of all dengue serotypes and discusses how factors such as rates of evolution, selection pressures, population sizes and evolutionary constraints influences DENV transmission, pathogenesis and emergence. New approaches reveal the patterns in the spread and epidemiology of viral diseases by phylogeographic studies. On a global scale, DENV-2 has been reported as the serotype with the highest effective population size (Costa, Voloch, & Schrago, 2012)⁠, and the time to the last common ancestor as 338.1 (227.1–462.9) (Patil et al., 2011)⁠. Nonetheless, all these studies have been based only in estimations made from the E gene sequences.

The objectives of this study were to infer the time to the Most recent common ancestor MRCA of the DENV2 genotypes, and estimate the evolutionary rates of the data set, using the complete genome and individual genes.

Materials and Methods
Sequence Data
Complete genome sequences of DENV-1 were retrieved from GenBank(1093 sequences with collection dates (1944 to 2014), from (48) distinct countries, database up to June 2015). In order to delete duplicates, recombinants, and chimeras, I used uclust v.4.0.38 (Edgar, 2010)⁠, with a cut-off id value of 90%. An statistical test for detection of significant recombination events in the dataset was run using SplitsTree v.4.2 (Bruen, Philippe, & Bryant, 2006)⁠. Sequences were aligned using MUSCLE v3.8.31(Edgar, 2004)⁠. From this alignment, I obtained 11 partitions corresponding to the individual genes(10), and the complete coding sequence(1). Substitution model was selected under the Akaike information criterion, using PhyML (Guindon, Dufayard, Lefort, & Anisimova, 2010)⁠⁠, called with the function phymltest(package ape, R v.3.2.1.(R Core Team 2014)).
Phylogenetic Signal
In order to test the phylogenetic signal, ergo, the topological resolution yielded by the data set of each partition, a Likelihood mapping analysis was run for 10 million random quartets using TreePuzzle v.5.3.(Schmidt, Strimmer, Vingron, & von Haeseler, 2002)⁠. The quartets were reconstructed following a ML approach under the selected substitution models for each partition. Thus, an independent analysis was run for each dataset. The results were plotted as a triangle with fully resolved quartets shown as dots concentrated near the vertices, and the unresolved ones are located in the center. (See anexes)
Genotypes Assignation
a Maximum likelihood (ML) tree was inferred using RaxML, using the complete coding sequence (CDS), with the model GTR+G+I.
Estimation of Evolutionary rates and TMCRA
Tree topologies, model parameters, evolutionary rates for each partition dataset, and times of the MCRA for each genotype were co-estimated, using BEAST v.8.1.2.(A. J. Drummond, Suchard, Xie, & Rambaut, 2012)⁠. Inferrences were performed using a relaxed molecular clock (uncorrelated lognormal[rate: 7.5*10-4, variation: 0.4 (Costa et al., 2012)⁠]. A Bayesian skyline plot was used as a coalescent prior during the inference (a. J. Drummond, Rambaut, Shapiro, & Pybus, 2005; Ho & Shapiro, 2011)⁠, for it is an approach with no dependence on a pre-specified parametric model of demographic history, such as logistic, exponential. Analysis was run for twice for 50 million generations, sampling every 10000. Topology was rooted with a sequence of Japanese Encephalitis Virus(EF107523) from 1988.

Results and Discussion
The mean evolutionary rate for each gene ranged 5.3 10-4 –1.5 × 10-4 nucleotide substitutions per site per year, being the NS5, NS2B, and E, the ones with the highest substitution rates. This estimations can be compared to the ones obtained by(S. S. Twiddy et al., 2003)⁠. The complete coding sequence was not account for comparisons, because converge of this parameter is not guaranteed (ESS<100) for the partition. In general, evolutionary rates for the dataset obtained from the E gene sequence were similar compared to the non-structural genes NS2B, NS4A, NS4B, and NS5.
Table 1. Estimated Mean Nucleotide Substitution Rates per Site per Year for each partition of Dengue Virus Serotype 2. 


Rate(subs/site/year) Rate 95%HPD
C 4.10E-004 [9.5E-4, 2.4E-4]
preM 4.20E-004 [8.6E-4, 2.5E-4]
E 2.30E-004 [4.9E-4, 1.4E-4]
NS1 3.80E-004 [6.4E-4, 1.1E-4]
NS2A 5.30E-004 [8.1E-4, 1.8E-4]
NS2B 1.40E-004 [4.1E-4, 7.3E-3]
NS3 2.80E-004 [5.2E-4, 1.1E-4]
NS4A 2.50E-004 [6.5E-4, 1.3E-4]
NS4B 2.60E-004 [6.5E-4, 2.4E-4]
NS5 1.50E-004 [7.5E-4, 3.1E-3]
CDS 4.30E-004 [9.4E-4, 1.2E-3]

Regarding the time the most recent common ancestor (MRCA), results are consistent to the times found by (Walimbe, Lotankar, Cecilia, & Cherian, 2014), and are comparable to the times of outbreaks and introductions summarized in (Chen & Vasilakis, 2011)
In average, the partitions whose TMCRA for Asian/American genotype were closest to the proposed time in other studies (using the E gene sequence: 42.757-*44.4), were NS1:46.4, NS4A:45.6, and NS4B:43.7. For genotype Asian1/Asian2, the reference value (using the E gene sequence: 64.189-*70.45), was nearly recovered by NS1: 70.7, and preM: 67.8. For the Cosmopolitan genotype, (using the E gene sequence: 59.8-*67.9), the TMCRA was roughly recovered by NS2B: 63.885, NS4B: 60.929, and the CDS: 77.779. The TMCRA of the sylvatic2 genotype (using the E gene sequence: 121.1-*158.4), was recovered by NS5:157.1, CDS:157.2, NS1:120.1, NS2A:135.3, and NS4B: 126.9. The only exception for this case, would be the American 2 genotype, whose TMCRA estimations were different (103.3*), from the obtained in this work (56.2). * Values obtained by (Patil et al., 2011; Walimbe et al., 2014).
Genes pre-M, NS1, NS2A, NS4B, showed the closest values to the references. Thence, they could be an alternative to use when estimating evolutionary processes in DENV2. Besides, regarding the potential resolution of phylogenetic relationships between clades, these genes have shown good phylogenetic signal (>90%). Also, they have relatively smaller sizes compared to the E gene (with exception of NS1), which may be even practical when running long analyses.

Table 2. Estimated time to the most recent common ancestor (MRCA) for every partition, estimated under a Bayesian skyline coalescent tree prior.


root 95%HDP AA 95%HDP A1/A2 95%HDP Cosmo 95%HDP A2 95%HDP Sylvatic 95%HDP
C 2174.6 [1100.8, 3381.1] 55.1 [38.5, 74.4] 57.2 [50, 72.1] 64.9 [46.2, 87.7] 54.9 [40.1, 77.1] 100.8 [56.7, 160.1]
preM 3268.7 [1866.7, 4821.1] 55.1 [39.9, 72.5] 67.8 [54.3, 83.4] 54.9 [43.9, 68.4] 46.8 [38.3, 59.2] 107.4 [64.3, 165.1]
E 6175.5 [3780, 7200.1] 42.8 [34.5, 52.1] 64.2 [53.8, 77.5] 59.8 [45.5, 76.4] 56.7 [42.3, 78.9] 121.1 [70.2, 179.5]
NS1 5452.1 [3371.3, 7790.7] 46.4 [34.9, 58.7] 70.7 [56.2, 87.3] 56.5 [44.9, 70.7] 59.5 [40.1, 71.5] 120.1 [69.9, 181.3]
NS2A 3362.9 [1944.8, 4966.4] 43.5 [34.9, 54.5] 67.1 [54.1, 83.7] 59.1 [46.6, 74.7] 53.3 [39.5, 72.4] 135.3 [72.4, 206.5]
NS2B 2712.5 [1497.1, 4033.4] 49.9 [35.9, 66.4] 62.9 [51.4, 78.7] 63.9 [45.3, 89.1] 57.2 [43.2, 74.8] 101.4 [61.2, 156.5]
NS3 2914.1 [1502.1, 4102.2] 56.9 [38.3, 73.5] 63.7 [54.1, 76.1] 59.1 [42.8, 83.9] 58.8 [46.4, 71.6] 119.1 [58.2, 215.6]
NS4A 3989.9 [2283.7, 5881.3] 45.6 [35.3, 57.8] 65.3 [52.4, 82.1] 55.1 [43.9, 66.5] 45.1 [38.1, 53.3] 118 [69.6, 187.7]
NS4B 2454.2 [1290.3, 3688.6] 43.7 [35.3, 54.2] 60.9 [51.6, 72.4] 60.9 [46.1, 79.8] 46.1 [37.9, 58.4] 126.9 [73.1, 187.9]
NS5 5675.6 [3371.4, 7790.7] 58.6 [39.9,72.5] 77.9 [56.2, 95.1] 67.3 [52.5, 83.5] 63.5 [49.3, 76.9] 157.1 [72.4, 206.5]
CDS 10944.7 [3459.8, 25368.5] 105 [36.6, 550.8] 117.6 [53.6, 337.7] 77.8 [43.9, 172.7] 76.6 [39.5, 145] 157.2 [75, 290.6]
AA:American/Asian. A1A2:Asian1/Asian2. Cosmo: Cosmopolitan. A2: American2.

One visible problem with the analysis was that the convergence of parameter for the inference based on complete coding region sequence was not reached. Although this problem was expectable, increasing the number of generations to the double of the used in this study, would probably yield better estimations in terms of ESS, and I consider that this is a worth-doing comparison.


Conclusion
In terms of the estimation of evolutionary rates and TMRCA, NS1, NS2A, NS4B, would be good candidates to infer evolutionary processes in DENV-2.

Anexes

A1.) List of accessions and details of the sequences used in the study. (Link to Dropbox)

A2.) Substitution models for each partition
Partition Subs Model
C GTR+G+I
preM GTR+G+I
E GTR+G+I
NS1 GTR+G+I
NS2A GTR+G+I
NS2B TN93+G+I
NS3 GTR+G+I
NS4A TN93+G
NS4B GTR+G+I
NS5 GTR+G+I
CDS *

A3.) Phylogenetic signal for each partition.
A. Gene C; B. Gene preM; C. Gene E; D. Gene NS1; E. Gene NS2a; F. Gene NS2b; G. Gene NS3; H. Gene NS4a; I. Gene NS4b; J. Gen NS5; K. Complete CDS.
RQ= Resolved quartets.

References
Bruen, T. C., Philippe, H., & Bryant, D. (2006). A simple and robust statistical test for detecting the presence of recombination. Genetics172(4), 2665–2681. doi:10.1534/genetics.105.048975.

Chen, R., & Vasilakis, N. (2011). Dengue--quo tu et quo vadis? 
Viruses3(9), 1562–608. doi:10.3390/v3091562.


Costa, R. L., Voloch, C. M., & Schrago, C. G. (2012). Comparative evolutionary epidemiology of dengue virus serotypes. 
Infection, Genetics and Evolution12(2), 309–314. doi:10.1016/j.meegid.2011.12.011.


Drummond, a. J., Rambaut, a., Shapiro, B., & Pybus, O. G. (2005). Bayesian coalescent inference of past population dynamics from molecular sequences. 
Molecular Biology and Evolution22(5), 1185–1192. doi:10.1093/molbev/msi103.


Drummond, A. J., Suchard, M. a, Xie, D., & Rambaut, A. (2012). Bayesian P hylogenetics with BEAUti and the BEAST 1 . 7. 
Molecular Biology and Evolution29(8), 1969–1973. doi:10.1093/molbev/mss075.


Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. 
Nucleic Acids Research32(5), 1792–1797. doi:10.1093/nar/gkh340.


Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. 
Bioinformatics26(19), 2460–2461. doi:10.1093/bioinformatics/btq461.


Guindon, S., Dufayard, J.-F., Lefort, V., & Anisimova, M. (2010). New Alogrithms and Methods to Estimate Maximum- Likelihoods Phylogenies: Assessing the performance of PhyML 3.0. 
Systematic Biology59(3), 307–321.


Ho, S. Y. W., & Shapiro, B. (2011). Skyline-plot methods for estimating demographic history from nucleotide sequences. 
Molecular Ecology Resources11(3), 423–434. doi:10.1111/j.1755-0998.2011.02988.x.


Patil, J. a., Cherian, S., Walimbe, a. M., Patil, B. R., Sathe, P. S., Shah, P. S., & Cecilia, D. (2011). Evolutionary dynamics of the American African genotype of dengue type 1 virus in India (1962-2005). 
Infection, Genetics and Evolution11(6), 1443–1448. doi:10.1016/j.meegid.2011.05.011.


Rico-hesse, R., Harrison, L. M., Nisalak, A., Vaughn, D. W., Kalayanarooj, S., Green, S., … Ennis, F. A. (1998). MOLECULAR EVOLUTION OF DENGUE TYPE 2 VIRUS IN THAILAND, 
58(1), 96–101.


Rico-Hesse, R., Harrison, L. M., Salas, R. a, Tovar, D., Nisalak, a, Ramos, C., … da Rosa, a T. (1997). Origins of dengue type 2 viruses associated with increased pathogenicity in the Americas. 
Virology,230(2), 244–251. doi:10.1128/CVI.00363-08.


Rico-hesse, R., Harrison, L. M., Salas, R. A., Tovar, D., Nisalak, A., Ramos, C., … Nogueira, R. M. R. (1997). Origins of Dengue Type 2 Viruses Associated with Increased Pathogenicity in the Americas, 
251(230), 244–251.


Rico-hesse, R., & Rico-hesse, R. (1990). Molecular Evolution and Distribution of Dengue Viruses Type 1 and 2 in Nature, 
493, 479–493.


Schmidt, H. a, Strimmer, K., Vingron, M., & von Haeseler, A. (2002). TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. 
Bioinformatics (Oxford, England)18(3), 502–504. doi:10.1093/bioinformatics/18.3.502.


Twiddy, S. S., Farrar, J. J., Chau, N. V., Wills, B., Gould, E. A., Gritsun, T., … Holmes, E. C. (2002). Phylogenetic Relationships and Differential Selection Pressures among Genotypes of Dengue-2 Virus 1, 
72, 63–72. doi:10.1006/viro.2002.1447.


Twiddy, S. S., Holmes, E. C., & Rambaut, a. (2003). Inferring the Rate and Time-Scale of Dengue Virus Evolution. 
Molecular Biology and Evolution20(1), 122–129. doi:10.1093/molbev/msg010.


Walimbe, A. M., Lotankar, M., Cecilia, D., & Cherian, S. S. (2014). Global phylogeography of Dengue type 1 and 2 viruses reveals the role of India. 
Infection, Genetics and Evolution22, 30–39. doi:10.1016/j.meegid.2014.01.001.


Weaver, S. C., & Vasilakis, N. (2009). Molecular evolution of dengue viruses: contributions of phylogenetics to understanding the history and epidemiology of the preeminent arboviral disease. 
Infection, Genetics and Evolution : Journal of Molecular Epidemiology and Evolutionary Genetics in Infectious Diseases9(4), 523–40. doi:10.1016/j.meegid.2009.02.003.


Zanotto, P. M., Gould, E. a, Gao, G. F., Harvey, P. H., & Holmes, E. C. (1996). Population dynamics of flaviviruses revealed by molecular phylogenies. 
Proceedings of the National Academy of Sciences of the United States of America93(2), 548–553.


Zhang, Z., Li, J., & Yu, J. (2006). Computing Ka and Ks with a consideration of unequal transitional substitutions. 
BMC Evolutionary Biology6, 44. doi:10.1186/1471-2148-6-44.