Introduction
Phylogenies inferred from independent data partitions often differ in topology even when the partitions come from the same set of taxa [1]. Inferring phylogenetic trees from DNA and protein sequences has been hampered by the sensitivity of tree reconstruction algorithms to unequal rate effects [2]. The evolutionary rates at different sites within a sequence, the variation among sites and alignment biases [2-3] therefore affect the comparisons. Taxonomic congruence is the degree to which a proposed or defined group is recovered [4], and incongruence can have biological or methodological sources. The biological causes depend on the degree of causal interdependence between the two sets of characters used [5], while the methodological causes are sampling error or the use of inappropriate phylogenetic models [6].
In this study, my purpose is to determine whether phylogenies based on DNA and protein alignments are congruent, testing the hypothesis that the reconstructed phylogenetic trees do not agree with each other.
Methods
♥ Simulation process
First, I simulated the reference topologies for two sample sizes (4 and 16 terminals), with all branch lengths equal to 0.5 and 3 replicates, using the ape package [7] in R v3.5.4 [8]. I then simulated the sequences in Seq-Gen v1.3.4 [9] using 102, 1002 and 10002 characters, with the HKY model for DNA and the WAG model for amino acids.
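The tree-simulation step can be illustrated with a minimal Python sketch. It is not the ape code used in this study: the text does not say which ape function was called or how the topology was drawn, so the random-joining scheme, the function name `random_topology` and the tip labels `t1..tn` are assumptions made for illustration only.

```python
import random


def random_topology(n_tips, branch_length=0.5, seed=None):
    """Return a Newick string for a random rooted binary tree.

    Hypothetical helper: tips are named t1..tn and every branch
    receives the same length (0.5 by default, as in the text).
    """
    rng = random.Random(seed)
    # Start with one subtree per terminal.
    subtrees = [f"t{i}" for i in range(1, n_tips + 1)]
    # Join two randomly chosen subtrees until a single tree remains.
    while len(subtrees) > 1:
        a = subtrees.pop(rng.randrange(len(subtrees)))
        b = subtrees.pop(rng.randrange(len(subtrees)))
        subtrees.append(f"({a}:{branch_length},{b}:{branch_length})")
    return subtrees[0] + ";"


# One reference topology per sample size used in the study.
for n in (4, 16):
    print(n, random_topology(n, seed=1))
```

A Newick string of this kind is the input format that a sequence simulator such as Seq-Gen reads.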
♥ Estimation of phylogenetic trees
I estimated maximum parsimony trees with 100 replicates and TBR branch swapping using TNT v1.5 [10]. For maximum likelihood, I ran RAxML-NG [11] from two parsimony-based randomized stepwise-addition starting trees and two random starting trees, and kept the best-scoring topology as the best tree. The substitution models were GTR for nucleotides and LG for amino acids, with optimized model parameters.
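The maximum likelihood setup described above can be written as a small command builder. This is only a sketch: the exact invocation used in the study is not given, so the file names, the prefix and the seed are hypothetical, and the helper only assembles the argument list without running RAxML-NG. The options `--msa`, `--model`, `--tree`, `--prefix` and `--seed` are standard RAxML-NG options.

```python
import shlex


def raxml_ng_command(msa_path, data_type, prefix, n_pars=2, n_rand=2, seed=1):
    """Build a RAxML-NG tree-search command as an argument list.

    Hypothetical helper: msa_path, prefix and seed are placeholders.
    """
    # Models named in the text: GTR for nucleotides, LG for amino acids.
    models = {"dna": "GTR", "protein": "LG"}
    return [
        "raxml-ng",
        "--msa", msa_path,
        "--model", models[data_type],
        # Two parsimony-based and two random starting trees.
        "--tree", f"pars{{{n_pars}}},rand{{{n_rand}}}",
        "--prefix", prefix,
        "--seed", str(seed),
    ]


# Print shell-ready commands for a DNA and a protein alignment.
for path, kind in (("dna_rep1.phy", "dna"), ("prot_rep1.phy", "protein")):
    print(shlex.join(raxml_ng_command(path, kind, prefix=f"ml_{kind}_rep1")))
```

The list form can be passed directly to `subprocess.run` if RAxML-NG is installed.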
♥ Tree Comparisons
I compared trees using the Robinson-Foulds (RF) [12] and Kuhner-Felsenstein (KF) [13] distances. To ensure that the resulting trees were not an artifact of variation in the initial topology, I first compared the reference trees with the estimated trees of each replicate, and then compared the trees reconstructed from DNA with those reconstructed from proteins. Finally, I plotted randomly chosen trees with FigTree v1.4.4 [14] to illustrate the observed behavior of the data.
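To make the two distances concrete, here is a minimal pure-Python sketch, not the software used in this study. It treats trees as unrooted, counts the internal splits found in only one tree (RF), and takes the square root of the summed squared branch-length differences over all splits (KF). The function names and the small Newick reader are assumptions for illustration; the reader expects well-formed Newick strings over the same taxa, without spaces or quoted labels.

```python
import math


def tree_splits(newick):
    """Return (tips, {split: branch length}) for a tree read as unrooted."""
    text = newick.strip()
    pos = 0
    edges = []  # (tips below the edge, branch length)

    def read_until(stops):
        nonlocal pos
        start = pos
        while text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def read_node():
        nonlocal pos
        tips = set()
        if text[pos] == "(":
            while text[pos] != ")":
                pos += 1  # skip "(" or ","
                tips |= read_node()
            pos += 1  # skip ")"
            read_until(":,();")  # ignore an optional internal label
        else:
            tips.add(read_until(":,();"))
        length = 0.0
        if text[pos] == ":":
            pos += 1
            length = float(read_until(",();"))
        edges.append((frozenset(tips), length))
        return tips

    all_tips = frozenset(read_node())
    ref = min(all_tips)  # each split is named by the side without this tip
    splits = {}
    for tips, length in edges:
        if tips == all_tips:
            continue  # the root node carries no branch
        side = tips if ref not in tips else all_tips - tips
        # The two branches at a binary root describe one split: add them up.
        splits[side] = splits.get(side, 0.0) + length
    return all_tips, splits


def rf_distance(newick_a, newick_b):
    """Number of internal splits present in exactly one of the two trees."""
    tips, a = tree_splits(newick_a)
    _, b = tree_splits(newick_b)
    n = len(tips)
    # Terminal branches are shared by every tree, so only internal splits count.
    inner_a = {s for s in a if 1 < len(s) < n - 1}
    inner_b = {s for s in b if 1 < len(s) < n - 1}
    return len(inner_a ^ inner_b)


def kf_distance(newick_a, newick_b):
    """Branch score distance; a split missing from a tree has length 0."""
    _, a = tree_splits(newick_a)
    _, b = tree_splits(newick_b)
    return math.sqrt(sum((a.get(s, 0.0) - b.get(s, 0.0)) ** 2
                         for s in set(a) | set(b)))


reference = "((A:0.5,B:0.5):0.5,(C:0.5,D:0.5):0.5);"
estimated = "((A:0.5,B:0.5):0.5,(C:0.5,D:0.8):0.5);"
print(rf_distance(reference, estimated), kf_distance(reference, estimated))
```

In the example, the two trees share their topology, so RF is 0 while KF still registers the single branch that differs in length. This is the pattern reported below for DNA versus protein trees.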
Results and discussion
The initial trees and their replicates did not vary. For maximum parsimony, all the trees estimated under the different sequence simulation models and character-matrix sizes showed the same topology, so I chose to plot replicate 2 from the character matrix simulated under HKY as an example (Fig. 1).
Figure 1. Maximum parsimony tree for 4 (A) and 16 (B) terminals with 10000 characters. Phylogenies based on DNA and protein alignments behaved identically, with an RF value of 0.
For maximum likelihood, Figure 2 shows that the distances between branch-length values improve as the number of characters increases, for both 4 and 16 terminals. The KF distances between the reference topologies and the recovered topologies are higher for 16 terminals, but in both cases the values tend to converge as the number of characters increases.
Figure 2. KF distance between DNA vs protein (pink), DNA vs reference topology (green) and protein vs reference topology (brown) comparisons.
Proteins have some regions that, because of their functional or structural importance, are very well conserved, while other regions evolve faster in terms of both nucleotide substitutions and insertions or deletions [15]. Protein sequence alignments reveal clearer patterns when the organisms are distantly related, because they capture amino acid substitutions that a DNA sequence phylogeny could miss. In contrast, they lose all information about synonymous mutations or invariant codon sites. Proteins with a faster rate of evolution are less accurate for resolving close relationships because of the proportion of convergent changes. If we used proteins with a high rate of evolution that do not contain a high proportion of convergent changes, they could be as precise for constructing phylogenies as proteins with a lower evolutionary rate but a greater proportion of convergent changes [16].
Substitution models of amino acid evolution are intended to mimic the evolution of protein data and, even if specific matrices should be implemented for certain analyses, general matrices such as WAG tend to perform well in many cases, as shown by Keane et al. (2006) [18]. I used the WAG model, but my simulated sequences did not take insertion and deletion events into account. Cleaned alignments produce better topologies, although with lower bootstrap values, which indicates that divergent and problematic alignment regions may lead to apparently better supported but more biased topologies [17]. This could explain why increasing the number of characters improves behavior while the variation of KF between DNA and proteins grows (Figure 3).
Figure 3. KF distance variation between DNA vs protein, DNA vs reference topology and protein vs reference topology comparisons. The pink points represent the mean and the brown points represent the outliers of the KF distance. A clearer variation is observed in the DNA vs protein values.
Alignment quality has as much impact on phylogenetic reconstruction as the phylogenetic methods used [17]. It is well established that the performance of model-based methods, such as maximum likelihood, depends on the ability of the chosen model to capture the evolutionary process correctly [18]. On the other hand, if complex models are used, the additional parameters will capture stochastic signal, and the reduced amount of information available for each calculation will increase the variation in parameter estimates [19]. This could happen when using RAxML-NG under parameter-rich models, even if they are the best models.
Conclusion
The phylogenies reconstructed from DNA and protein sequence alignments are congruent in their topology but not in their branch lengths, probably because of differences in evolutionary rates. The null hypothesis is therefore accepted: the phylogenies do not agree completely.
References
1. Rodrigo, A.G., Kelly-Borges, M., Bergquist, P.R. and Bergquist, P.L. A randomization test of the null hypothesis that two cladograms are sample estimates of a parametric phylogenetic tree. N.Z. J. Bot. 31, 257–268 (1993).
2. Lake, J.A. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Natl. Acad. Sci. 91, 1455-1459 (1994).
3. Mindel, D. P. Phylogenetic Analysis of DNA Sequences. Oxford Univ. Press, London 119-136 (1991).
4. Mickevich, M. F. Taxonomic congruence. Syst. Zool. 27, 143-158 (1978).
5. Crisci, J. V. Taxonomic congruence. Taxon 33:2, 233 (1984).
6. Hipp, A.L., Hall, J.C. and Sytsma, K.J. Congruence Versus Phylogenetic Accuracy: Revisiting the Incongruence Length Difference Test. Syst. Biol. 53:1, 81–89 (2004).
7. Paradis, E. Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer (2012).
8. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2012).
9. Rambaut, A. and Grassly, N.C. Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees. Computer Applications in the Biosciences 13:3, 235-238 (1997).
10. Goloboff, P.A. and Catalano, S.A. TNT version 1.5, including a full implementation of phylogenetic morphometrics. Cladistics 32, 221-238 (2016).
11. Kozlov, A.M., Darriba, D., Flouri, T., Morel, B. and Stamatakis, A. RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference. bioRxiv (2018).
12. Robinson, D.F. and Foulds, L.R. Comparison of phylogenetic trees. Mathematical Biosciences 53:1–2, 131-147 (1981).
13. Kuhner, M.K. and Felsenstein, J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:3, 459–468 (1994).
14. Rambaut, A. FigTree v1.3.1. (2010).
15. Pesole, G., Sbisa, E. and Saccone, C. The evolution of the mitochondrial D-loop region and the origin of modern man. Mol. Biol. Evol. 9, 587-598 (1992).
16. Thorne, J. L. Models of protein sequence evolution and their applications. Current Opinion in Genetics & Development 10, 602–605 (2000).
17. Talavera, G. and Castresana, J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Systematic Biology 56:4, 564–577 (2007).
18. Keane, T.M., Creevey, C.J., Pentony, M.M., Naughton, T.J. and McInerney, J.O. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol. Biol. 6:29 (2006).
19. Sullivan, J. and Swofford, D. Should We Use Model-Based Methods for Phylogenetic Inference When We Know That Assumptions About Among-Site Rate Variation and Nucleotide Substitution Pattern Are Violated? Systematic Biology 50(5), 723-729 (2001).
20. Lemmon, A. and Moriarty, E. The Importance of Proper Model Assumption in Bayesian Phylogenetics. Systematic Biology 53(2), 265-277 (2004).