Sunday, April 7, 2019

Do phylogenies based on DNA and protein sequence alignments agree?


Introduction

Phylogenies inferred from independent data partitions tend to differ from each other in topology, even when they are derived from the same set of taxa¹. The inference of phylogenetic trees from DNA and protein sequences has been hampered by the sensitivity of tree reconstruction algorithms to unequal rate effects². Consequently, the evolutionary rates at different sites within a sequence, the variation from site to site and alignment bias parameters²⁻³ affect the comparisons. Taxonomic congruence is the degree to which a proposed or defined group is recovered⁴, and incongruence can arise from biological or methodological sources. The biological causes depend on the degree of causal interdependence between the two sets of characters used⁵, while the methodological causes are due to sampling error or the use of inappropriate phylogenetic models⁶.

In this study, I test whether phylogenies reconstructed from DNA and protein sequence alignments are congruent, taking as the hypothesis that the reconstructed phylogenetic trees do not agree with each other.

Methods

♥ Simulation process
First, I simulated reference topologies for two sample sizes (4 and 16 terminals), with all branch lengths equal to 0.5 and 3 replicates each, using the ape⁷ package in R v3.5.4⁸. I then simulated sequences in Seq-Gen v1.3.4⁹ with 102, 1002 and 10002 characters, using the HKY model for DNA and the WAG model for amino acids.
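The simulation step can be sketched in Python as follows. This is only an illustration, not the ape code used in this study: `random_tree_newick` and `seq_gen_args` are hypothetical helper names, the random joining scheme is an assumption and is not guaranteed to match ape's tree generator, and the `seq-gen` executable name is assumed. The `-m` and `-l` options are Seq-Gen's model and sequence length options; the tree itself would be passed to Seq-Gen on standard input.

```python
import random


def random_tree_newick(n_terminals, branch_length=0.5, seed=None):
    """Return a random rooted binary tree in Newick format.

    Every branch gets the same length. Terminals are named t1..tn.
    Requires n_terminals >= 2.
    """
    rng = random.Random(seed)
    # Start with one subtree per terminal, each carrying its branch length.
    nodes = [f"t{i}:{branch_length}" for i in range(1, n_terminals + 1)]
    # Repeatedly join two randomly chosen subtrees until two remain.
    while len(nodes) > 2:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append(f"({a},{b}):{branch_length}")
    # The last two subtrees are joined at the root.
    return f"({nodes[0]},{nodes[1]});"


def seq_gen_args(model, n_chars):
    """Build (but do not run) a Seq-Gen argument list for one simulation."""
    return ["seq-gen", f"-m{model}", f"-l{n_chars}"]


if __name__ == "__main__":
    for n in (4, 16):
        print(random_tree_newick(n, seed=1))
    print(seq_gen_args("HKY", 1000))  # 1000 is an arbitrary example length
    print(seq_gen_args("WAG", 1000))
```

Fixing the seed makes each reference topology reproducible, which helps when the same tree has to be reused for the DNA and the amino acid simulations.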

♥ Estimation of phylogenetic trees
I estimated maximum parsimony trees in TNT v1.5¹⁰ with 100 replicates and TBR branch swapping. For maximum likelihood, I ran RAxML-NG¹¹ from two parsimony-based randomized stepwise addition starting trees and two random starting trees, and selected the best-scoring topology as the best tree. The models were GTR for nucleotides and LG for amino acids, with optimized model parameters.
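As a rough sketch, the maximum likelihood search described above corresponds to a RAxML-NG call along the following lines. The helper name `raxml_ng_args`, the alignment file names and the output prefixes are hypothetical; the options `--search`, `--msa`, `--model`, `--tree` and `--prefix` are standard RAxML-NG options, and `pars{2},rand{2}` requests two parsimony-based and two random starting trees. The exact command used in this study is not recorded here, so this is an assumption about its general shape.

```python
def raxml_ng_args(msa_path, model, prefix):
    """Build (but do not run) a RAxML-NG tree search command."""
    return [
        "raxml-ng",
        "--search",
        "--msa", msa_path,              # alignment file (hypothetical path)
        "--model", model,               # e.g. GTR for DNA, LG for proteins
        "--tree", "pars{2},rand{2}",    # 2 parsimony + 2 random starting trees
        "--prefix", prefix,             # prefix for the output files
    ]


if __name__ == "__main__":
    # Hypothetical file names, one run per data type.
    for msa, model in (("dna.phy", "GTR"), ("protein.phy", "LG")):
        args = raxml_ng_args(msa, model, prefix=msa.rsplit(".", 1)[0])
        print(" ".join(args))
        # To actually run it: subprocess.run(args, check=True)
```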

♥ Tree Comparisons
I compared trees using the Robinson-Foulds¹² (RF) and Kuhner-Felsenstein¹³ (KF) distances. To ensure that the resulting trees were not the product of variation in the initial topology, I first compared the reference trees with the estimated trees of each replicate, and then compared the trees reconstructed from DNA with those reconstructed from proteins. Finally, I plotted randomly selected trees with FigTree v1.4.4¹⁴ to illustrate the patterns observed in the data.
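To make the two distances concrete, here is a minimal, self-contained Python sketch. It is not the software used for the results in this post, and the function names are my own. It treats trees as unrooted, takes the RF distance as the number of bipartitions found in only one of the two trees, and takes the KF (branch score) distance as the square root of the summed squared differences in branch length over all bipartitions, with a missing bipartition counted as length 0.

```python
import math


def parse_newick(s):
    """Parse a Newick string into nested (children, name, length) tuples."""
    s = s.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # skip ")"
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        length = 0.0
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return children, name, length

    return node()


def splits(newick):
    """Map each bipartition of the taxa to its branch length."""
    edges = []

    def walk(n):
        children, name, length = n
        if children:
            clade = frozenset().union(*map(walk, children))
        else:
            clade = frozenset([name])
        edges.append((clade, length))
        return clade

    taxa = walk(parse_newick(newick))
    ref = min(taxa)                       # fixed taxon used to orient each split
    out = {}
    for clade, length in edges:
        side = taxa - clade if ref in clade else clade
        if side:                          # the root itself defines no split
            # The two branches at a bifurcating root share one split: add them.
            out[side] = out.get(side, 0.0) + length
    return out


def rf_distance(t1, t2):
    """Robinson-Foulds distance: splits present in exactly one tree."""
    return len(set(splits(t1)) ^ set(splits(t2)))


def kf_distance(t1, t2):
    """Kuhner-Felsenstein (branch score) distance."""
    s1, s2 = splits(t1), splits(t2)
    return math.sqrt(sum((s1.get(k, 0.0) - s2.get(k, 0.0)) ** 2
                         for k in set(s1) | set(s2)))


if __name__ == "__main__":
    ref_tree = "((A:0.5,B:0.5):0.5,(C:0.5,D:0.5):0.5);"
    same_topology = "((A:0.5,B:0.5):0.5,(C:0.5,D:0.1):0.5);"
    print(rf_distance(ref_tree, same_topology))   # 0: identical topology
    print(kf_distance(ref_tree, same_topology))   # > 0: branch lengths differ
```

The example at the bottom shows why both measures are needed: two trees can have an RF distance of 0 and still differ in KF distance, because RF ignores branch lengths.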

Results and discussion

The initial trees and their replicates did not vary. For maximum parsimony, all the trees estimated under the different sequence simulation models and character sizes showed the same topology, so I chose to plot replicate 2 from the character matrix simulated under HKY as an example (Fig. 1).

Figure 1. Maximum parsimony tree for 4 (A) and 16 (B) terminals with 10000 characters. Phylogenies based on DNA and protein alignments behaved identically, with an RF value of 0.


For maximum likelihood, Figure 2 shows that the distances between branch length values improve as the number of characters increases, for both 4 and 16 terminals. The KF distances between the reference topologies and the recovered topologies are higher for 16 terminals, but in both cases the values tend to converge as the number of characters increases.


Figure 2. KF distance between DNA vs protein (pink), DNA vs reference topology (green) and protein vs reference topology (brown) comparisons. 


Proteins have some regions that, due to their functional or structural importance, are very well conserved, while other regions evolve faster in terms of both nucleotide substitutions and insertions or deletions¹⁵. Protein sequence alignments reveal clearer patterns when the organisms are distantly related, because they take into account amino acid substitutions that a DNA sequence phylogeny could miss. In contrast, they lose all information about synonymous mutations or invariant codon sites. Proteins with a faster rate of evolution are less accurate for resolving close relationships because of the proportion of convergent changes. If we used proteins with a high rate of evolution that do not contain a high proportion of convergent changes, they could be as precise in constructing phylogenies as proteins with a lower evolutionary rate but a greater proportion of convergent changes¹⁶.
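The loss of synonymous information can be illustrated with a small Python example. The two DNA sequences below are made up for illustration and the helper names are my own; the codon table is the standard genetic code. The sequences differ at three third-codon positions, yet they translate to the same protein, so a protein alignment records no difference between them.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def n_differences(a, b):
    """Count mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    dna_a = "ATGCTGGGTAAA"   # hypothetical sequence
    dna_b = "ATGCTAGGCAAG"   # same codons with synonymous third positions
    print(n_differences(dna_a, dna_b))                        # 3
    print(translate(dna_a), translate(dna_b))                 # MLGK MLGK
    print(n_differences(translate(dna_a), translate(dna_b)))  # 0
```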


Substitution models of amino acid evolution are intended to mimic the evolution of protein data, and even though specific matrices should be used for certain analyses, general matrices such as WAG tend to perform well in many cases, as shown by Keane et al. (2006)¹⁸. I used the WAG model, but my simulated sequences did not take insertion and deletion events into account. Cleaned alignments produce better topologies, although with lower bootstrap values, indicating that divergent and problematic alignment regions may lead to apparently better supported but more biased topologies¹⁷. This could explain why increasing the number of characters improves performance while the variation in KF between DNA and proteins grows (Figure 3).




Figure 3. KF distance variation between DNA vs protein, DNA vs reference topology and protein vs reference topology comparisons. The pink points represent the mean and the brown points represent the outliers of the KF distance. The variation is clearest in the DNA vs protein values.


Alignment quality has as much impact on phylogenetic reconstruction as the phylogenetic methods used¹⁷. It is well established that the performance of model-based methods, such as maximum likelihood, depends on the ability of the chosen model to capture the evolutionary process correctly¹⁸. On the other hand, if complex models are used, the additional parameters will capture stochastic signal, and the decreased amount of information available for each calculation will lead to increased variation in parameter estimates¹⁹. This could happen when using RAxML-NG under parameter-rich models, even if they are the best-fitting ones.

Conclusion


The phylogenies reconstructed from DNA and protein sequence alignments are congruent in their topology but not in their branch lengths, probably due to differences in evolutionary rates. The hypothesis is therefore accepted: the phylogenies do not fully agree.

References

1.     Rodrigo, A.G., Kelly-Borges, M., Bergquist, P.R. and Bergquist, P.L. A randomization test of the null hypothesis that two cladograms are sample estimates of a parametric phylogenetic tree. N.Z. J. Bot. 31, 257–268 (1993).
2.     Lake, J.A. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Natl. Acad. Sci. 91, 1455-1459 (1994).
3.     Mindell, D. P. Phylogenetic Analysis of DNA Sequences. Oxford Univ. Press, London 119-136 (1991).
4.     Mickevich, M. F. Taxonomic congruence. Syst. Zool. 27, 143-158 (1978).
5.     Crisci, J. V. Taxonomic congruence. Taxon 33:2, 233 (1984).
6.     Hipp, A.L., Hall, J.C. and Sytsma, K.J. Congruence Versus Phylogenetic Accuracy: Revisiting the Incongruence Length Difference Test. Syst. Biol. 53:1, 81–89 (2004).
7.     Paradis, E. Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer (2012).
8.     R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2012).
9.     Rambaut, A. and Grassly, N.C. "Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees", Computer Applications In The Biosciences, 13:3, 235-238 (1997).
10.  Goloboff, P.A. and Catalano, S.A. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics 32, 221-238 (2016).
11.  Kozlov, A.M., Darriba, D., Flouri, T., Morel, B. and Stamatakis A. RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference. bioRxiv (2018).
12.  Robinson, D.F. and Foulds, L.R. Comparison of phylogenetic trees. Mathematical Biosciences 53:1–2, 131-147 (1981).
13.  Kuhner, M.K. and Felsenstein, J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:3, 459–468 (1994).
14.  Rambaut, A. FigTree v1.3.1. (2010).
15.  Pesole, G., Sbisa, E. and Saccone, C. The evolution of the mitochondrial D-loop region and the origin of modern man. Mol. Biol. Evol. 9, 587-598 (1992).
16.  Thorne, J. L. Models of protein sequence evolution and their applications. Current Opinion in Genetics & Development. 10, 602–605 (2000).
17.  Talavera, G. and Castresana, J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Systematic Biology, 56:4 564–577 (2007).
18.  Keane, T.M., Creevey, C.J., Pentony, M.M., Naughton, T.J. and McInerney, J.O. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol. Biol. 6:29 (2006).
19.  Sullivan, J. and Swofford, D. Should We Use Model-Based Methods for Phylogenetic Inference When We Know That Assumptions About Among-Site Rate Variation and Nucleotide Substitution Pattern Are Violated? Systematic Biology, 50:5, 723-729 (2001).
20.  Lemmon, A. and Moriarty, E. The Importance of Proper Model Assumption in Bayesian Phylogenetics. Systematic Biology, 53:2, 265-277 (2004).



