Monday, December 18, 2017

Character Weighting on Binary Data

Homoplasy is understood as similarity between taxa that results from convergence or parallelism, not from common ancestry (Rendall & Di Fiore, 2007), in contrast to the phylogenetic definition of homology, in which a character is similar because the two taxa share a common ancestor (Vogt & Vogt, 2002). Farris (1983) established the negative relationship between homoplasy and the search for the most parsimonious tree. Taking the homoplasy of each character into account would therefore be useful when corroborating phylogenetic hypotheses. Implied weighting consists of assigning weights to the characters according to their fit to the tree, so the weight of a character is a value based on its homoplasy (Goloboff, 1993).
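As an illustration of how a character's weight can depend on its homoplasy, the concave fitting function usually associated with Goloboff (1993) is f = k / (k + h), where h is the number of extra steps of the character and k is the concavity constant. The following is only a minimal sketch of that formula; the function names are hypothetical and it is not TNT's implementation.

```python
def implied_weight_fit(extra_steps: int, k: float) -> float:
    """Fit of one character, f = k / (k + h), with h = extra steps (homoplasy)."""
    if k <= 0:
        raise ValueError("the concavity constant k must be positive")
    if extra_steps < 0:
        raise ValueError("extra steps cannot be negative")
    return k / (k + extra_steps)


def total_fit(extra_steps_per_character, k: float) -> float:
    """Sum of the fits of all characters; a higher total means a better fit."""
    return sum(implied_weight_fit(h, k) for h in extra_steps_per_character)


if __name__ == "__main__":
    # Fit of a character with 0, 1, 2 or 3 extra steps under several k values.
    for k in (3, 9, 15, 30, 500, 999):
        fits = [round(implied_weight_fit(h, k), 3) for h in range(4)]
        print(f"k = {k:>3}: {fits}")
```

Under this function, small values of k penalize homoplastic characters strongly, while very large values make every character count almost equally, which approaches a search without weighting.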

A tree of 20 terminals without branch lengths was generated using the "APE" package in R (Paradis, 2012); binary morphological data were then simulated in Mesquite 3.31 under the Mk1 model (Lewis, 2001). Three matrices of 100, 500 and 1000 characters were generated. Maximum parsimony analyses with implied weighting were run in TNT (Goloboff, Farris, & Nixon, 2008) under different values of the concavity constant (k = 3, k = 9, k = 15, k = 30, k = 500 and k = 999) and were also compared with a search without weighting. Finally, the differences between the topologies obtained were measured with the distance proposed by Robinson and Foulds in 1981.
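To make the comparison step concrete, the Robinson-Foulds distance can be sketched as the number of bipartitions (splits) found in one tree but not in the other; dividing by 2(n - 3), the maximum for two fully resolved unrooted trees of n terminals, gives a value between 0 and 1. The sketch below assumes trees written as nested Python tuples of taxon names; that representation and the function names are assumptions made for illustration, not the tools used in this analysis.

```python
def leaves(tree):
    """Set of taxon names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of a tree, treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    reference taxon, so the same split always has the same form.
    """
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("both trees must have the same terminals")
    return len(splits(tree_a) ^ splits(tree_b))


def normalized_rf(tree_a, tree_b):
    """RF distance divided by 2(n - 3), the maximum for binary unrooted trees."""
    n = len(leaves(tree_a))
    if n < 4:
        raise ValueError("at least four terminals are needed")
    return rf_distance(tree_a, tree_b) / (2 * (n - 3))


if __name__ == "__main__":
    tree_1 = ((("A", "B"), "C"), ("D", "E"))
    tree_2 = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(tree_1, tree_2), normalized_rf(tree_1, tree_2))
```

In the toy example the two trees share the split that separates D and E from the rest and differ in one split each, so the raw distance is 2 and the normalized value is 0.5.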

Results and Discussion
The minimum RF value was 0.4. Since RF ranges from 0 to 1, the Robinson-Foulds distances indicate that the weighting method made little difference with respect to the search without weighting, which also had a value close to 0.4. The shortest distances were obtained with k = 9, although the differences were too small to draw conclusions about a positive contribution to recovering the topology. The number of characters did not have much influence either, since the three matrices performed similarly in the tree search.

Figure 1: Robinson-Foulds distances of the trees obtained under each model. In red, the trees obtained with the 100-character matrix; in orange, those obtained with the 500-character matrix; and in yellow, those obtained with the 1000-character matrix.

In conclusion, although character weighting is generally considered to improve the search for the most parsimonious tree, it did not make a large difference for the matrices simulated under this model.

Farris, J. S. (1983). The Logical Basis of Phylogenetic Analysis. Advances in Cladistics 2: Proceedings of the Second Meeting of the Willi Hennig Society, (September), 7–36.
Goloboff, P. A. (1993). Estimating character weights during tree search. Cladistics, 9(1), 83–91.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774–786.
Lewis, P. O. (2001). A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Syst. Biol, 50(6), 913–925. Retrieved from
Paradis, E. (2012). Analysis of phylogenetics and evolution with R. Springer.
Rendall, D., & Di Fiore, A. (2007). Homoplasy, homology, and the perceived special status of behavior in evolution. Journal of Human Evolution, 52(5), 504–521.

Vogt, L., & Vogt, L. (2002). Testing and weighting characters. Organisms, Diversity and Evolution, 2(4), 319–333.

Sunday, December 17, 2017

Criteria for phylogenetic analyses

When performing a phylogenetic analysis, we encounter several obstacles. The first is tree space: we cannot be sure whether the tree obtained is the true tree or only the most suitable one. The second is heterogeneity among branches, since different characters change at different rates within and among evolutionary lineages (Gaut et al., 1992). The third is character convergence, which can suggest evolutionary similarity between taxa that are not phylogenetically related (Felsenstein, 1988). The fourth is missing data, either terminals or characters of genealogical importance. Because of this, researchers have proposed different criteria to reconstruct phylogenetic relationships among taxa: parsimony, genetic distance, likelihood and Bayesian inference.

Distance methods calculate the total distance between pairs of taxa from the differences in their sequences (Sourdis & Nei, 1987). The parsimony criterion is based on Occam's razor (Steel & Penny, 2000): one should prefer simpler explanations, which require fewer assumptions, over more complex, ad hoc ones. The tree that requires the fewest evolutionary events therefore explains the observed data best (Steel & Penny, 2000). Likelihood and Bayesian inference are criteria of statistical inference. Likelihood seeks the tree topology that confers the highest probability on the observed characteristics of the tip species (Sober, 2004). Maximum likelihood considers the fit between a model of the evolutionary process, the data and each of the possible phylogenetic trees to find the best tree (Salemi et al., 2009). Bayesian inference, on the other hand, uses probability distributions to describe the uncertainty of all unknown parameters, including the model parameters. In phylogenetics, the tree topology and the substitution model specify the statistical model of the data (Nascimento et al., 2017); the parameters include base frequencies, exchange rates and rate heterogeneity.
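The idea that the preferred tree is the one requiring the fewest evolutionary events can be illustrated by counting steps with Fitch's algorithm for unordered characters. The sketch below assumes rooted binary trees written as nested tuples and a small made-up binary matrix; the function names and the data are hypothetical and serve only as an example.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, steps) for one character.

    `tree` is a taxon name or a pair (left, right); `states` maps each
    taxon name to its observed state for this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:
        # The descendants agree on at least one state: no change needed here.
        return shared, left_steps + right_steps
    # No shared state: one change is required at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total number of steps of all characters on the tree."""
    n_characters = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_characters):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch_steps(tree, column)[1]
    return total


if __name__ == "__main__":
    # Made-up matrix: four taxa, three binary characters.
    matrix = {"A": "110", "B": "100", "C": "011", "D": "001"}
    tree_1 = (("A", "B"), ("C", "D"))
    tree_2 = (("A", "C"), ("B", "D"))
    print(tree_length(tree_1, matrix), tree_length(tree_2, matrix))
```

With this matrix the first tree needs 4 steps and the second 5, so parsimony prefers the first; the second character is homoplastic on it because it needs one step more than the minimum of one.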

Each of these approaches has advantages and disadvantages, and each has its own biases in reconstruction. Which criterion is the most appropriate therefore remains an open question in phylogenetic analysis.

Distance methods assume that the most similar organisms are the most closely related, so they can be misled by convergence. One of the most common distance-based methods is UPGMA, which can produce errors in phylogenetic reconstruction, such as grouping species that are not closely related, if the rate of evolutionary change is heterogeneous (Nei, 1991).
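The effect of rate heterogeneity on UPGMA can be seen in a three-taxon toy case. The sketch below is a minimal UPGMA (average-linkage clustering) that returns only the topology; the distances are invented for illustration and assume a true tree ((A, B), C) in which the branch leading to B is much longer than the others.

```python
def upgma(distances):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    `distances` maps frozenset({x, y}) to the distance between x and y.
    """
    d = dict(distances)
    # Number of original taxa inside each current cluster.
    sizes = {taxon: 1 for pair in d for taxon in pair}
    while len(sizes) > 1:
        # Join the two clusters separated by the smallest distance.
        pair = min(d, key=d.get)
        a, b = sorted(pair, key=str)
        merged = (a, b)
        updated = {}
        for c in sizes:
            if c in pair:
                continue
            # Average distance to the new cluster, weighted by cluster size.
            d_ac = d[frozenset((a, c))]
            d_bc = d[frozenset((b, c))]
            updated[frozenset((merged, c))] = (
                sizes[a] * d_ac + sizes[b] * d_bc
            ) / (sizes[a] + sizes[b])
        # Drop every distance that involved the two joined clusters.
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        d.update(updated)
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Invented path lengths on a true tree ((A, B), C) with branches
# A = 1, B = 5, internal = 1, C = 1 (B evolves much faster).
example_distances = {
    frozenset(("A", "B")): 6.0,
    frozenset(("A", "C")): 3.0,
    frozenset(("B", "C")): 7.0,
}

if __name__ == "__main__":
    print(upgma(example_distances))
```

Because A and C are the most similar pair, UPGMA joins them first and leaves B outside, even though the distances were built from a tree in which A and B are sister taxa.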

Parsimony is biased when the rate of change per branch is high and tends to reconstruct the wrong tree because of long-branch attraction, whereas likelihood does not suffer from these problems (Clemente et al., 2009). However, maximum likelihood and Bayesian methods have been shown to be inconsistent, and to perform worse than parsimony, when the characters under study evolve at non-uniform rates over time (Clemente et al., 2009).

In addition, phylogenetic reconstructions performed by Bayesian inference and likelihood can produce errors, such as misleading estimates of topology and branch lengths, when characters are ambiguous or evidence is lacking, whether characters or terminals (Lemmon et al., 2009). In turn, in Bayesian inference, the priors on branch lengths and on the rate-heterogeneity parameters can, with incomplete data, generate misleadingly high posterior probabilities (Lemmon et al., 2009).

The above are just some of the arguments available in the literature for and against the different criteria for phylogenetic reconstruction. Given them, it is not possible to offer a magic recipe or a single recommended criterion for a good phylogenetic analysis, since none of the criteria proposed so far resolves all the common problems we encounter when reconstructing the phylogenetic relationships among taxa.


  • Lemmon A. R., Brown J. M., Stanger-Hall K., Lemmon E. M. (2009). The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference. Syst. Biol. 58(1):130-145.
  • Nascimento F, Reis M, Yang Z. (2017). A biologist’s guide to Bayesian phylogenetic analysis. Nature Ecology & Evolution. DOI:10.1038/s41559-017-0280-x
  • Clemente J, Ikeo K, Valiente G, et al. (2009). Optimized ancestral state reconstruction using Sankoff parsimony. BMC Bioinformatics 10(1):51.
  • Felsenstein J. (1988). Phylogenies from molecular sequences: Inferences and reliability. Annual Review of Genetics 22:521-565.
  • Gaut B.S., Muse S.V., Clark W.D., Clegg M.T. (1992). Relative rates of nucleotide substitution at the rbcL locus in monocotyledonous plants. Journal of Molecular Evolution 35:292-303.
  • Goloboff P. A. (1999). Analyzing Large Data Sets in Reasonable Times: Solutions for Composite Optima. Cladistics 15:415-428.
  • Nei M. (1991). Efficiencies of different tree-making methods for molecular data. In: Miyamoto M. M., Cracraft J., Eds. Phylogenetic analysis of DNA sequences, Oxford, New York. 90-128.
  • Salemi M., Vandamme A.-M., Lemey P. (2009). The phylogenetic handbook: a practical approach to phylogenetic analysis and hypothesis testing. Cambridge University Press.
  • Steel M, Penny D. (2000). Parsimony, Likelihood, and the Role of Models in Molecular Phylogenetics.  2000 Jun;17(6):839-50.
  • Sober E. (2004). The Contest Between Parsimony and Likelihood. Syst. Biol. 53(4): 644-653.
  • Sourdis J., Nei M. (1987). Relative Efficiencies of the Maximum Parsimony and Distance-Matrix Methods in Obtaining the Correct Phylogenetic Tree. Center for Demographic and Population Genetics, The University of Texas Health Science Center at Houston. 14 pages.