Character weighting in cladistics refers to the assignment of differential costs to a character or a type of characters assuming they give different information, in order to obtain results that are consistent with the evolutive history of organisms (Farris, 1969; Hennig, 1968; Neff, 1986). This process can be a priori or a posteriori, according to the information you are using and the step of the phylogenetic reconstruction in which you use it (Neff, 1986). The a priori scheme is based on three notions: adaptation, independence, and chemistry (Neff, 1986; Patterson, 1982; Sneath and Sokal, 1973). The a posteriori weighthing is based on frequencies of kind of change. It is also founded on the notion that homoplasy is a “bad” thing and that homoplasy varies inversely with "reliability” (P. A. Goloboff, 1993; P Goloboff, 1998).
There are different strategies that has been known as successive weighting, and it is a posteriori differentially weighting (Farris, 1969; Goloboff, 1991) respect to a phylogeny. This concept of successive weighting (a posteriori) has more favorable rationale (Carpenter, 1994), even though the details of the method may be debated (Goloboff, 1993). Several of the methods of and a posteriori weighting and related schemes include linear and convex parsimony (Goloboff, 1998; Rodrigo, 1989), Implied Weighting (Goloboff, 1993) and Extended Weighting (Goloboff, 2013; Goloboff, 2014).
The amount of weight to be applied to the characters has been considered as a subjective exercise that does not always have the owing justification on the weights picking and their correpondent topology (Nixon & Carpenter, 2011, 2012). Therefore, a wide range of weights have been proposed, and for this reason it is not strange to find transition to transversion differential weighting ratios of zero to one, or even one to 10, often within a single phylogenetic study (Allard and Carpenter, 1996). Thence, examining these aspects of phylogenetic reconstruction using an empiric model is an interesting exercise to decide whether there is a specific weighting scheme to reconstruct it or not.
For this exercise I decided to use the butterfly family of Pieridae (order Lepidoptera) to test these schemes. Due to the above, the aim of this study was to assess the effect of the weighting scheme over the phylogenetic reconstruction of the family Pieridae (Lepidoptera: Pieridae). The butterfly family Pieridae is a relatively small family of Lepidoptera comprising about 1000 species placed in four subfamilies and 85 genera. These species have long been subjects of ecological and evolutionary studies. The monophyly of the family is now well established (Wahlberg et al. 2005; a 2014, Heikkila et al. 2012).
Due to the above, the aim of this study was to assess the effect of the weighting scheme over the phylogenetic reconstruction of the family Pieridae (Lepidoptera: Pieridae).
Materials and Methods
Data and Phylogenetic reconstruction
I downloaded the sequences of 4 genes from 8 taxa of the family Pieridae from the GenBank database: 1 of the mitochondrial region: Cytochrome oxidase subunit I (COI), and 3 from the nuclear regions: elongation factor I-alpha (EF-1a), isocitrate dehydrogenase (IDH) and malate dehydrogenase (MDH) (see Supplementary Material). The alignment was performed with MUSCLE v.3.5. (Edgar, 2004), to identify the sequences that present inconsistencies, such as duplicates, undefined nucleotides, or incompleteness. The phylogenetic analyses was performed using TNT (Goloboff, Farris, & Nixon, 2008), for all schemes I saved the strict consensus, with a bootstrap resampling of 100. To design the macros for each analysis I used the macro op.run (Arias & Miranda-Esquivel, 2010) as a reference. For linear parsimony I used a k value of 1, for convex parsimony a k value of 10, for implied weighting a value of 6. Although linear parsimony has been long used as the reference method (Goloboff, 2013; Goloboff, 1998; Swofford & Maddison, 1987), I used the topology of phyML as the reference topology (Guindon et al., 2009) under the model GTR, NNI and SPR swapping and 100 bootstrap replicates.
The comparisons between reference topologies and the obtained with each weighting scheme analysis were performed in the software R 3.1.1. (R Core Team, 2014) using the packages ape 3.1.4 (Paradis & Strimmer, 2004), phangorn (1.99-11) (Schiliep, 2011), and plotly (https://plot.ly/r/) to calculate the percentage of recovered nodes; the Robinson & Foulds distance (Robinson and Foulds, 1981) to determine how alike or different the topologies were; and a bootstrap score based on the sum of the bootstrap support observed in the nodes of the topologies, and the ratio of this value with the expected score. The graphs were displayed using seaview (Gouy et al.,2010). In order to run the experiment, I wrote the scripts using the command language of BASH, for Ubuntu 14.04 (see supplementary material).
Results and Discussion
In comparison to the resulting topologies from the Maximum Likelihood analysis, only the topologies based on nuclear regions EF-1a, MDH, IDH, and successfully recovered the totality of the nodes. This might be an interesting result, since the sample grouped only 21 species from 8 taxa (Fig 1.)
Fig 1. Percentage of recovered nodes in the topologies based on each gene and weighting scheme. The topologies obtained with phyML (PHY-REF) are the reference trees. It is observable that MDH and EF recovered the 100% of the maximum nodes in all the weighting schemes. The COI gene recovered less than the 80% of the maximum nodes.
Using the schemes of linear parsimony with k=1, and implied weighting with k=6, the topologies based on the nuclear genes EF-1a and MDH recovered the 100% of the nodes. Unlike these two regions, COI and IDH respectively recovered the 75% and 90% of the maximal nodes. For the scheme of convex parsimony, EF-1a and IDH recovered the 100% of the node, while COI and MDH did it for the 75% and 90%. Overall, there was no difference among the weighting schemes that I used, and the behavior of nodes recovery was equivalent, highlighting the nuclear gene-based topologies as the ones that recovered the totality of nodes in almost all cases, given a reference topology.
Fig 2. Robinson & Foulds distance of the resulting topologies. It is evident that the topologies greatly differ from one gene to another, and that these differences are less observable in the phyML topologies.
Robinson & Foulds Distance
I considered that this measure encompassed an interval from 0 to 2(n-2), representing the number of edges in the topology, thus 38 was attributed to identical trees, and 0 for trees sharing no bipartition -since the rationale of this distance is to deem every tree branch as a bipartition to all leaves-. The resultant topologies were different between each other, taking values from 10 to 36. Relatively, the pairs of COI and MDH topologies had the lowest values of this measure. As shown in the previous section of nodes recovery, the R&F distance did not variate among the three weighting schemes.
Fig 3. Values of Bootstrap score for the topologies build from different weighting schemes , based on the sum of the bootstrap values of all of the nodes of the topologies. The implied weighting scheme (IP) yielded the highest scores for the genes IDH and MDH.
Considering that all nodes should contribute equally to the topology, I calculated a bootstrap score, as the ratio of the sum of the support of all nodes and the number of nodes of the topology (See supplementary material). I distinguished the attribute of topologies as follows: 100 as the maximal tree support, 90-99 as strong support, 65–89 as moderate support, 50–64 as weak support, and <50 as negligible tree support (Lu et al., 2013). The topologies based on nuclear genes essentially resulted in moderate values of support, while the COI-based topologies showed weak values of support. Regarding the weighting scheme, the results did not show differences.
Limitations and Flaws
The experiment per se had several failures. Since the purpose of the study was to assess the influence of the weighting scheme over the phylogenetic reconstruction, the first step would have been using a different reference topology that could somehow group the different partitions into one sequence to thence consider it as the “complete” evidence. On the other hand, the range of concavity constant was definitely an eye pad. This was reflected in the little differences between each scheme. As recommended by other authors, it is reasonable to explore different values of k, to thus compare whether the resultant taxonomic groups are indeed well established or not (Goloboff et al., 2008). Although I decided to count gaps as characters, I did not consider to assign specific costs to the codon positions, which have been proved to affect the final results (Rota, 2011). Finally, the bootstrap score calculation was perhaps a good approach to assess support as a full-topology attribute. Nevertheless, I did not applied a significance test on the data, thence it was not sane to make any asseveration from those results, but the assignation of attributes for instance, strong or moderately supported.
Despite all of the aforementioned, I found some interesting results. First, the COI gene showed that it is not suitable to reconstruct the phylogeny of Pieridae under the weighting schemes I used. This, probably because it is a highly conservative gene (Lunt et al., 1996), which explains why it does not recover a correct phylogeny. This result is consistent with other studies that evaluate the phylogeny of a group, using different genome partitions ( Heikkilä et al., 2011; Chen et al., 2013). Second, unlike this gene, the EF-1a showed a different behavior in the different reconstructions, which is also consistent with Heikkilä et al., 2011. For each method, it managed to recover all the tribes from the study of Wahlberg et al., 2014 (See supplementary Material. Fig 1.). Accordingly, using this partition in further studies might be useful to yield reasonable results with a small part of the genome. As regards the weighting scheme, molecular data should be carefully managed to consider every partition as a single different unit for the analysis, thence a prior exploration of k values is convenient. Notwithstanding the poor resolution of the results presented here, it has been proved for other groups, that the weights of partitions do affect the support and stability of the phylogenetic reconstructions (Chen et al., 2013).
Fig 1S. Resultant topologies for each partition.
Allard, M. W., & Carpenter, J. M. (1996). On weighting and congruence. Cladistics, 12, 183–198.
Arias, J.S., Miranda-Esquivel, D.R. (2005) "Yuu-PRC", macros para TNT distributed by authors. Universidad Industrial de Santander, Bucaramanga (Colombia).
Braby, M. F., Vila, R., & Pierce, N. E. (2006). Molecular phylogeny and systematics of the Pieridae (Lepidoptera: Papilionoidea): Higher classification and biogeography. Zoological Journal of the Linnean Society, 147(1986), 239–275. doi:10.1111/j.1096-3642.2006.00218.x
Carpenter, J. M. (1994). Successive eighting, reliability and evidence. Cladistics, 10, 215–220.
Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. doi:10.1093/nar/gkh340
Farris, J. S. (1969). A Successive Approximations Approach to Character Weighting. Syst Biol, 18(4), 374–385.
Goloboff, P. (1998). Tree Searches Under Sankoff Parsimony. Cladistics, 14(3), 229–237. doi:10.1006/clad.1998.0068
Goloboff, P. a. (2014). Hide and vanish: Data sets where the most parsimonious tree is known but hard to find, and their implications for tree search methods. Molecular Phylogenetics and Evolution, 79, 118–31. doi:10.1016/j.ympev.2014.06.008
Goloboff, P. A. (1993). Estimating character weights during tree search. Cladistics, 9, 83–91.
Goloboff, P. A. (1997). Self-Weighted Optimization: Tree Searches and Character State Reconstructions under Implied Transformation Costs. Cladistics, 13(3), 225–245. doi:10.1111/j.1096-0031.1997.tb00317.x
Goloboff, P. A. (2013). Cladistics Extended implied weighting, 1–13.
Goloboff, P. a., Carpenter, J. M., Arias, J. S., & Esquivel, D. R. M. (2008). Weighting against homoplasy improves phylogenetic analysis of morphological data sets. Cladistics, 24(5), 758–773. doi:10.1111/j.1096-0031.2008.00209.x
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774–786. doi:10.1111/j.1096-0031.2008.00217.x
Heikkila, M., Kaila, L., Mutanen, M., Pena, C., & Wahlberg, N. (2012). Cretaceous origin and repeated tertiary diversification of the redefined butterflies. Proceedings of the Royal Society B: Biological Sciences, 279(September 2011), 1093–1099. doi:10.1098/rspb.2011.1430
Hennig, W. (1968). Elementos de una sistemática filogenética. Buenos Aires: Eudeba.
Neff, N. a. (1986). A Rational Basis for a Priori Character Weighting. Systematic Biology, 35(1), 110–123. doi:10.1093/sysbio/35.1.110
Nixon, K. C., & Carpenter, J. M. (2011). Cladistics On homology, 27, 1–10.
Nixon, K. C., & Carpenter, J. M. (2012). Cladistics On homology, 28, 160–169.
Patterson, C. (1988). Homology in Classical and Molecular Biology. Molecular Biology and Evolution, 5(6), 603–625.
Rodrigo, A. G. (1989). An information-rich character weighting procedure for parsimony analysis. New Zealand Natural Sciences: 16-97, 103.
Sneath R.R, S. P. H. A. (1973). Numerical Taxonomy. Freeman, San Francisco.
Swofford, D. L., & Maddison, W. P. (1987). Reconstructing ancestral character states under Wagner parsimony. Mathematical Biosciences, 87(2), 199–229. doi:http://dx.doi.org/10.1016/0025-5564(87)90074-5
Wahlberg, N., Rota, J., Braby, M. F., Pierce, N. E., & Wheat, C. W. (2014). Revised systematics and higher classification of pierid butterflies (Lepidoptera: Pieridae) based on molecular data. Zoologica Scripta, 1–10. doi:10.1111/zsc.12075