miércoles, 24 de junio de 2015

Host Phylogeny of West Nile Virus: Does it shape the spatiotemporal structure?

The West Nile Virus (WNV) is a mosquito-born flavivirus that causes neurologic diseases such as encephalitis, meningitis, and acute flaccid paralysis (Lim, Koraka, Osterhaus, & Martina, 2011)⁠. Similar to other flaviviruses, WNV is an enveloped virus with a single-stranded, positive sense, ∼11-kb RNA genome whose strains are grouped into at least 7 genetic lineages. WNV was first isolated in Uganda in 1937. Posteriorly, the first large outbreak of West Nile neuroinvasive disease (WNND) was recorded in Romania in 1996, with 393 confirmed cases (Tsai, Popovici, Cernescu, Campbell, & Nedelcu, 1998)⁠. Three years later, it became a global public health concern after its introduction into North America, and subsequently into Central and South America (Lanciotti et al., 1999)⁠. Since then, major outbreaks of WNV fever and encephalitis took place in all continents, apart from Antarctica, causing human and animal deaths. Although its enzootic cycle is mainly maintained between mosquitoes and birds, it can eventually infect horses, humans, and other vertebrates (Hayes et al., 2005)⁠. Despite this variety of hosts, studies on the host structure and its influence on the spatiotemporal structure are still scarce. Since host genetic factors have a significant influence on disease distribution patterns, the overall purpose of this study is to assess the host structure of the phylogenetic relationships of WNV in a phylogeographic context, taking the spatiotemporal structure into account.

Specific Objectives
To identify the lineages of each viral strain.
To infer the main events of host-shift.
To determine the transmission paths within spatiotemporal structure.

1. Sequence Data: All the available sequences of complete genome of WNV, with collection times, and geographic locations will be retrieved from GenBank. In order to identify and delete recombinants, clones, and duplicates from the data base, I used Uclust v1.2.22q with 99 % of identity. Sequences of Ilheus virus (ILHV), Usutu virus (USUV), and Japanese encephalitis virus (JEV) will be used as the outgroup. Subsequently, all the WNV sequences will be aligned using the algorithm of multiple sequence alignment, implemented in MUSCLE v3.8.31 (Edgar, 2004).

2. Evolutionary rates: From this alignment, I will obtain a subset of 11 partitions, which correspond to the complete genome, and the genes that constitute it (C, prM/M, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5). An exploration of evolution rates of every gene will be done using Distance Rates (DistR) method (Bevan, Lang, & Bryant, 2005)⁠ to get a first approach to the molecular evolution of the genes, as one of the key determinants of the occurrence of cross-species transmission (Longdon, Brockhurst, Russell, Welch, & Jiggins, 2014; Vrancken et al., 2015)⁠.

3. Lineages identification: The substitution model will be selected using Akaike information criterion with JmodelTest2 (Darriba, Taboada, Doallo, & Posada, 2012)⁠. With this model, a Maximum likelihood (ML) inference will be performed using ExaML v3.0.X, with 20 searches and 100 bootstrap replicates, which are considered as sufficient for large data sets (Kozlov, Aberer, & Stamatakis, 2015)⁠. Every lineage will be assumed as a monophyletic group as sugested by (MacKenzie & Williams, 2009)⁠, and all the obtained clades will be revised taking previous studies into account.

4. Phylodynamics: In order to evaluate every lineage independently, the data set will be down sampled. Thus, the tree topologies, model parameters, evolutionary rates, MRCA, viral population size variation over time will be co-estimated independently for the resultant lineages, using with an uncorrelated log-normal relaxed clock model (rationale given in (May, Davis, Tesh, & Barrett, 2011)⁠, and the MCMC method implemented in the BEAST package v1.8.2 (Drummond, Suchard, Xie, & Rambaut, 2012)⁠. Bayesian skyline plot will be used as a coalescent prior during the estimation over time of the change in effective population size per generation, per year (Ne.g). The MCMC analysis will be run twice for 50 million generations, with sampling every 10000. MCMC convergence will be measured by estimating the effective sampling size (ESS), using Tracer software version 1.5 (http://tree.bio.ed.ac.uk/software/tracer/). Uncertainties will be estimated as 95% high probability densities (95% HPD). The results for the two runs will be combined for final analysis and Bayesian Factor (BF) support for host shift. Transition rates supported by a BF > 3 will be considered as significant support for a host shift between species. The obtained topologies will be summarized in a maximum clade credibility (MCC) tree, and annotated by the use of TreeAnnotator (http://beast.bio.ed.ac.uk/treeannotator).

5. Host-Shift Events: To determine whether there is a stronger influence of cross species transmission (CST) in the genetic divergence over within species transmission, I will compute Genetic distances in PAUP* v.4.0b10 (http://paup.csit.fsu.edu/) using models of nucleotide substitution specific to the lineages, and compare them with a cutoff value. Subsequently, transmission of WNV will be quantified by Metropolis Coupled Markov Chain Monte Carlo (MC3) coalescent simulation of migration rates, implemented in the program Migrate-N v3.6 (Beerli & Palczewski, 2010)⁠. The model of transmission (whether asymmetrical, bi-directional, symmetrical, inter alia) will be assessed, and the transmission web will be visualized using this software. In order to estimate the potential of the strains to jump into new hosts (sensu (Frost & Volz, 2010)⁠, or to predict viral emergence, I will estimate the per capita cross-species transmission rate Rij, and the effective reproductive number of a pathogen Re.

6. Host Phylogeny and the spatiotemporal Structure: Genetic population predictors: Ne.g, Rij, and Rwill be plotted in function of time.


Beerli, P., & Palczewski, M. (2010). Unified framework to evaluate panmixia and migration direction among multiple sampling locations. Genetics, 185(1), 313–326. doi:10.1534/genetics.109.112532
Bevan, R. B., Lang, B. F., & Bryant, D. (2005). Calculating the evolutionary rates of different genes: a fast, accurate estimator with applications to maximum likelihood phylogenetic analysis. Systematic Biology, 54(6), 900–915. doi:10.1080/10635150500354829
Darriba, D., Taboada, G. L., Doallo, R., & Posada, D. (2012). jModelTest 2: more models, new heuristics and parallel computing. Nature Methods, 9(8), 772. doi:10.1038/nmeth.2109
Drummond, A. J., Suchard, M. a, Xie, D., & Rambaut, A. (2012). Bayesian P hylogenetics with BEAUti and the BEAST 1 . 7. Molecular Biology and Evolution, 29(8), 1969–1973. doi:10.1093/molbev/mss075
Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. doi:10.1093/nar/gkh340
Frost, S. D. W., & Volz, E. M. (2010). Viral phylodynamics and the search for an “effective number of infections”. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 365(1548), 1879–1890. doi:10.1098/rstb.2010.0060
Hayes, E. B., Komar, N., Nasci, R. S., Montgomery, S. P., O’Leary, D. R., & Campbell, G. L. (2005). Epidemiology and transmission dynamics of West Nile virus disease. Emerging Infectious Diseases, 11(8), 1167–1173. doi:10.3201/eid1108.050289a
Kozlov, a. M., Aberer, a. J., & Stamatakis, a. (2015). ExaML Version 3: A Tool for Phylogenomic Analyses on Supercomputers. Bioinformatics, (March), 1–3. doi:10.1093/bioinformatics/btv184
Lanciotti, R. S., Roehrig, J. T., Deubel, V., Smith, J., Parker, M., Steele, K., … Gubler, D. J. (1999). Origin of the West Nile virus responsible for an outbreak of encephalitis in the northeastern United States. Science (New York, N.Y.), 286(5448), 2333–2337. doi:10.1126/science.286.5448.2333
Lim, S. M., Koraka, P., Osterhaus, A. D. M. E., & Martina, B. E. E. (2011). West Nile virus: Immunity and pathogenesis. Viruses, 3(6), 811–828. doi:10.3390/v3060811
Longdon, B., Brockhurst, M. a, Russell, C. a, Welch, J. J., & Jiggins, F. M. (2014). The Evolution and Genetics of Virus Host Shifts. PLoS Pathogens, 10(11). doi:10.1371/journal.ppat.1004395
MacKenzie, J. S., & Williams, D. T. (2009). The zoonotic flaviviruses of southern, south-eastern and eastern Asia, and australasia: The potential for emergent viruses. Zoonoses and Public Health, 56(6-7), 338–356. doi:10.1111/j.1863-2378.2008.01208.x
May, F. J., Davis, C. T., Tesh, R. B., & Barrett, A. D. T. (2011). Phylogeography of West Nile virus: from the cradle of evolution in Africa to Eurasia, Australia, and the Americas. Journal of Virology, 85(6), 2964–2974. doi:10.1128/JVI.01963-10
Tsai, T. F., Popovici, F., Cernescu, C., Campbell, G. L., & Nedelcu, N. I. (1998). West Nile encephalitis epidemic in southeastern Romania. Lancet, 352(9130), 767–771. doi:10.1016/S0140-6736(98)03538-7
Vrancken, B., Lemey, P., Rambaut, A., Bedford, T., Longdon, B., Günthard, H. F., & Suchard, M. a. (2015). Simultaneously estimating evolutionary history and repeated traits phylogenetic signal: applications to viral and host phenotypic evolution. Methods in Ecology and Evolution, 6(1), 67–82. doi:10.1111/2041-210X.12293

Divergence time estimation using autocorrelated variation rate and independent rate, depending on the topology size using MULTIDIVTIME and BEAST.


Estimating divergence times can be performed by methods based on variation of substitution rates. The AR method establish that the rates of change between ancestor and descendant are autocorrelated and the rates can follow a log-normal distribution with a mean equal to the parent rate (Thorne, Kishino and Painter, 1998; Kishino, Thorne and Bruno, 2001; Thorne and Kishino, 2002) or can be determined by a non-central χ 2 distribution (Lepage et al., 2006). Another method used in molecular clock analysis is the RR method which assigns random independent rates for each lineage and these rates are drawn from a single underlying parametric distribution such as an exponential or log-normal (Drummond et al., 2006; Rannala and Yang, 2007; Lepage et al., 2007). RR method has been implemented in BEAST while the AR method has been implemented in MULTIDIVTIME. In terms of the programs, other difference is the priors on node ages which MULTIDIVTIME uses a dirichlet distribution without an explicit assumption about the biological processes (Kishino, Thorne and Bruno, 2001; Thorne and Kishino, 2002) and in the other hand BEAST uses a Yule prior and the Birth-Death prior (Yule, 1924; Rannala and Yang, 1996; Yang and Rannala, 1997). These two methods have been widely used to dating phylogenies and it is recommended to use several calibration points; however, it is not common to find a number of fossil equivalent to the number of nodes in the topology. At the moment, the minimum number of constraints to achieve correctness divergence times, has not been established.


General objective

  • To determine what the minimum number of calibration points is, related to the number of tips of a topology (size) using the Autocorrelated rate and Random rate methods.

Specifics objectives
  • To correlate the divergence time estimated with the divergence time simulated.
  • To determine the delta of variation when the number of points increase.
  • To assess the extent of the program to reconstruct the correct phylogeny.
  • To assess if the posterior probabilities is high in nodes where the divergence time estimated is approximately the same to the divergence time simulated.

1. Simulations

The trees will be simulated considering four different number of tips: 10, 25, 50 and 100, plus an age of 31Mya at the root, with the package phytools v.0.4-56 (Revell, 2012) in the R language and it will be replicated 10 times. Based on the these trees, the sequences will be simulated using the program Seq-gen v1.3.3 (Rambaut and Grass, 1997), under the model HKY, with base frequencies 0.30[T], 0.25[C], 0.30[A], 0.15[G]; a transition-transversion rate K=5, gamma parameter alpha = 0.5 (sensu Brown and Yang, 2010) and a length of 1000 bp.

2. Molecular clock analysis and calibration

The analyzes will be performed using the AR and RR methods respectively in MULTIDIVTIME v9.25.03 (Thorne and Kishino, 2002) (MDT) y BEAST v.1.8.2 (Drummond and Rambaut, 2007; Drummond, Suchard, Xie, and Rambaut, 2012). In MDT the divergence time will be calculated after 10⁶ generations employing a correlated relaxed lognormal clock. In BEAST the analyzes will be done under an uncorrelated relaxed lognormal clock, Yule speciation process, a normal distribution for the tmrca and different number of generations depending on the authors's recommendation about the relation between it and the number of tips. For both methods the calibration points will be chosen randomly, leaving the base node of the ingroup fixed to 30 Mya. The number of points will be replicated for every tree simulation.

3. Correlation and additional comparisons

The simulated times and the outcomes of these analyzes will be compared and related by a Pearson correlation for each topology of different size and replica. The values of common nodes will be correlated and the index of error in BEAST will be calculated for each reconstruction comparing the structure of the topology with the simulated one employing the Robinson & Foulds distance implemented in the package phangorn (Schliep, 2011) in the R language. Then, I will compare and graph the values of posterior probabilities of correct nodes with the probabilities of incorrect nodes.


Brown, R. P., & Yang, Z. (2010). Bayesian dating of shallow phylogenies with a relaxed clock. Systematic Biology, 59(2), 119–31. http://doi.org/10.1093/sysbio/syp082

Drummond, A. J., Ho, S. Y. W., Phillips, M. J., & Rambaut, A. (2006). Relaxed phylogenetics and dating with confidence. PLoS Biology, 4(5), e88. http://doi.org/10.1371/journal.pbio.0040088

Drummond, A. J., & Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1), 214. http://doi.org/10.1186/1471-2148-7-214

Drummond, A. J., Suchard, M. A., Xie, D., & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29(8), 1969–73. http://doi.org/10.1093/molbev/mss075

Kishino, H., Thorne, J. L., & Bruno, W. J. (2001). Performance of a Divergence Time Estimation Method under a Probabilistic Model of Rate Evolution. Molecular Biology and Evolution, 18(3), 352–361. http://doi.org/10.1093/oxfordjournals.molbev.a003811

Lepage, T., Bryant, D., Philippe, H., & Lartillot, N. (2007). A general comparison of relaxed molecular clock models. Molecular Biology and Evolution, 24(12), 2669–80. http://doi.org/10.1093/molbev/msm193

Lepage, T., Lawi, S., Tupper, P., & Bryant, D. (2006). Continuous and tractable models for the variation of evolutionary rates. Mathematical Biosciences, 199(2), 216–33. http://doi.org/10.1016/j.mbs.2005.11.002

R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria. Retrieved from http://www.r-project.org/

Rambaut, A., & Grass, N. C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics, 13(3), 235–238. http://doi.org/10.1093/bioinformatics/13.3.235

Rannala, B., & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. Journal of Molecular Evolution, 43(3), 304–311. http://doi.org/10.1007/BF02338839

Rannala, B., & Yang, Z. (2007). Inferring speciation times under an episodic molecular clock. Systematic Biology, 56(3), 453–66. http://doi.org/10.1080/10635150701420643

Revell, L. J. (2012). phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3(2), 217–223. http://doi.org/10.1111/j.2041-210X.2011.00169.x

Schliep, K. P. (2011). phangorn: phylogenetic analysis in R. Bioinformatics (Oxford, England), 27(4), 592–3. http://doi.org/10.1093/bioinformatics/btq706

Thorne, J. L., & Kishino, H. (2002). Divergence time and evolutionary rate estimation with multilocus data. Systematic Biology, 51(5), 689–702. http://doi.org/10.1080/10635150290102456

Thorne, J. L., Kishino, H., & Painter, I. S. (1998). Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution, 15(12), 1647–1657. http://doi.org/10.1093/oxfordjournals.molbev.a025892

Yang, Z., & Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method. Molecular Biology and Evolution, 14(7), 717–724. http://doi.org/10.1093/oxfordjournals.molbev.a025811

Yule, G. U. (1925). A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213, pp. 21–87. Retrieved from http://www.jstor.org/stable/92117

JackknifEvolution for endemism areas

¿It can be used a Jackknife for endemism areas, and how it will be the result ?. To answer this question a programing code will be written for the Jackknife calculation on R plataform 3.2.0 (2015-04-16). An analysis of areas of endemism will be made for Amazonas using Primates distributions (based on Da Silva & Oren, 1995) in the program NDM/VNDM 3 (Goloboff, 2005). From the matrix data used for NDM/VNDM, two types of Jackknife will be implemented: Species Jackknife and Occurences Jackknife. For each Jackknife type a 25% and 50% resample will be used with 50 replications each one. Finally support values of each resample and Jacknnife type will be compared between them.
Silva, J. M. C. D., & Oren, D. C. (1996). Application of parsimony analysis of endemicity in Amazonian biogeography: an example with primates. Biological Journal of the Linnean Society, 59(4), 427-437.

Quinn, G. P., & Keough, M. J. (2002). Experimental design and data analysis for biologists. Cambridge University Press.
DA SILVA, J. O. S. É., CARDOSO, M., RYLANDS, A. B., FONSECA, D., & GUSTAVO, A. (2005). The fate of the Amazonian areas of endemism. Conservation Biology, 19(3), 689-694.
Szumik, C. A., & Goloboff, P. A. (2004). Areas of endemism: an improved optimality criterion. Systematic biology, 53(6), 968-977.

Diversification of the Amazon biota: Rivers as vicariant barriers.

-Download georeferenced distributions through the Global Biodiversity Information Facility (GBIF: http://www.gbif.org) for species of the Amazon basin (Aves, Magnoliophyta, Insecta, Squamata, Amphibia, Mammalia).

-Gathering phylogenetic clades for distributed species in the Amazon basin.

-Data cleaning under the protocol proposed by R-Alarcon., & Miranda-Esquivel. (in prep). 

-Delimitation of areas of endemism under the optimality criterion implemented in the program NDM/VNDM. 

-Localization the isolating barriers for different taxa with distribution in the Amazon basin using the Vicariance Inference Program (VIP) (Arias et al. 2011).

-Analysis of events under models (Ronquist, MC, Reconciled) implemented in TreeFitter 1.3B1 (Ronquist, 2002).

Goloboff, P. 2005. NDM/VNDM ver. 2.5. Programs for identification of areas of 
endemism. Programs and documentation available at www.zmuc.dk/public/phylogeny/endemism

Arias JS, Szumik CA, Goloboff PA (2011) Spatial analysis of vicariance: A method for using direct geographical information in historical biogeography. Cladistics 27: 617–628. doi: 10.1111/j.1096-0031.2011.00353.

Ronquist, F. 2002. TREEFITTER, version 1.3b. Computer program and manual available by anonymous FTP from Uppsala University.

viernes, 6 de marzo de 2015

Partitions Weighting in the phylogenetic reconstruction of the family Pieridae (order Lepidoptera)


Character weighting in cladistics refers to the assignment of differential costs to a character or a type of characters assuming they give different information, in order to obtain results that are consistent with the evolutive history of organisms (Farris, 1969; Hennig, 1968; Neff, 1986)⁠. This process can be a priori or a posteriori, according to the information you are using and the step of the phylogenetic reconstruction in which you use it (Neff, 1986)⁠. The a priori scheme is based on three notions: adaptation, independence, and chemistry (Neff, 1986; Patterson, 1982; Sneath and Sokal, 1973)⁠⁠. The a posteriori weighthing is based on frequencies of kind of change. It is also founded on the notion that homoplasy is a “bad” thing and that homoplasy varies inversely with "reliability” (P. A. Goloboff, 1993; P Goloboff, 1998)⁠.

There are different strategies that has been known as successive weighting, and it is a posteriori differentially weighting (Farris, 1969; Goloboff, 1991) respect to a phylogeny. This concept of successive weighting (a posteriori) has more favorable rationale (Carpenter, 1994), even though the details of the method may be debated (Goloboff, 1993). Several of the methods of and a posteriori weighting and related schemes include linear and convex parsimony (Goloboff, 1998; Rodrigo, 1989), Implied Weighting (Goloboff, 1993) and Extended Weighting (Goloboff, 2013; Goloboff, 2014).

The amount of weight to be applied to the characters has been considered as a subjective exercise that does not always have the owing justification on the weights picking and their correpondent topology (Nixon & Carpenter, 2011, 2012). Therefore, a wide range of weights have been proposed, and for this reason it is not strange to find transition to transversion differential weighting ratios of zero to one, or even one to 10, often within a single phylogenetic study (Allard and Carpenter, 1996)⁠. Thence, examining these aspects of phylogenetic reconstruction using an empiric model is an interesting exercise to decide whether there is a specific weighting scheme to reconstruct it or not.

For this exercise I decided to use the butterfly family of Pieridae (order Lepidoptera) to test these schemes. Due to the above, the aim of this study was to assess the effect of the weighting scheme over the phylogenetic reconstruction of the family Pieridae (Lepidoptera: Pieridae). The butterfly family Pieridae is a relatively small family of Lepidoptera comprising about 1000 species placed in four subfamilies and 85 genera. These species have long been subjects of ecological and evolutionary studies. The monophyly of the family is now well established (Wahlberg et al. 2005; a 2014, Heikkila et al. 2012).

Due to the above, the aim of this study was to assess the effect of the weighting scheme over the phylogenetic reconstruction of the family Pieridae (Lepidoptera: Pieridae).

Materials and Methods

Data and Phylogenetic reconstruction

I downloaded the sequences of 4 genes from 8 taxa of the family Pieridae from the GenBank database: 1 of the mitochondrial region: Cytochrome oxidase subunit I (COI), and 3 from the nuclear regions: elongation factor I-alpha (EF-1a), isocitrate dehydrogenase (IDH) and malate dehydrogenase (MDH) (see Supplementary Material). The alignment was performed with MUSCLE v.3.5. (Edgar, 2004), to identify the sequences that present inconsistencies, such as duplicates, undefined nucleotides, or incompleteness. The phylogenetic analyses was performed using TNT (Goloboff, Farris, & Nixon, 2008), for all schemes I saved the strict consensus, with a bootstrap resampling of 100. To design the macros for each analysis I used the macro op.run (Arias & Miranda-Esquivel, 2010) as a reference. For linear parsimony I used a k value of 1, for convex parsimony a k value of 10, for implied weighting a value of 6. Although linear parsimony has been long used as the reference method (Goloboff, 2013; Goloboff, 1998; Swofford & Maddison, 1987), I used the topology of phyML as the reference topology (Guindon et al., 2009) under the model GTR, NNI and SPR swapping and 100 bootstrap replicates. 

Comparison metrics

The comparisons between reference topologies and the obtained with each weighting scheme analysis were performed in the software R 3.1.1. (R Core Team, 2014) using the packages ape 3.1.4 (Paradis & Strimmer, 2004), phangorn (1.99-11) (Schiliep, 2011), and plotly (https://plot.ly/r/) to calculate the percentage of recovered nodes; the Robinson & Foulds distance (Robinson and Foulds, 1981) to determine how alike or different the topologies were; and a bootstrap score based on the sum of the bootstrap support observed in the nodes of the topologies, and the ratio of this value with the expected score. The graphs were displayed using seaview (Gouy et al.,2010). In order to run the experiment, I wrote the scripts using the command language of BASH, for Ubuntu 14.04 (see supplementary material).

Results and Discussion

Recovered nodes

In comparison to the resulting topologies from the Maximum Likelihood analysis, only the topologies based on nuclear regions EF-1a, MDH, IDH, and successfully recovered the totality of the nodes. This might be an interesting result, since the sample grouped only 21 species from 8 taxa (Fig 1.)

Fig 1. Percentage of recovered nodes in the topologies based on each gene and weighting scheme. The topologies obtained with phyML (PHY-REF) are the reference trees. It is observable that MDH and EF recovered the 100% of the maximum nodes in all the weighting schemes. The COI gene recovered less than the 80% of the maximum nodes.

Using the schemes of linear parsimony with k=1, and implied weighting with k=6, the topologies based on the nuclear genes EF-1a and MDH recovered the 100% of the nodes. Unlike these two regions, COI and IDH respectively recovered the 75% and 90% of the maximal nodes. For the scheme of convex parsimony, EF-1a and IDH recovered the 100% of the node, while COI and MDH did it for the 75% and 90%. Overall, there was no difference among the weighting schemes that I used, and the behavior of nodes recovery was equivalent, highlighting the nuclear gene-based topologies as the ones that recovered the totality of nodes in almost all cases, given a reference topology.
Fig 2. Robinson & Foulds distance of the resulting topologies. It is evident that the topologies greatly differ from one gene to another, and that these differences are less observable in the phyML topologies.

Robinson & Foulds Distance

I considered that this measure encompassed an interval from 0 to 2(n-2), representing the number of edges in the topology, thus 38 was attributed to identical trees, and 0 for trees sharing no bipartition -since the rationale of this distance is to deem every tree branch as a bipartition to all leaves-. The resultant topologies were different between each other, taking values from 10 to 36. Relatively, the pairs of COI and MDH topologies had the lowest values of this measure. As shown in the previous section of nodes recovery, the R&F distance did not variate among the three weighting schemes.
Fig 3. Values of Bootstrap score for the topologies build from different weighting schemes , based on the sum of the bootstrap values of all of the nodes of the topologies. The implied weighting scheme (IP) yielded the highest scores for the genes IDH and MDH.

Bootstrap support

Considering that all nodes should contribute equally to the topology, I calculated a bootstrap score, as the ratio of the sum of the support of all nodes and the number of nodes of the topology (See supplementary material). I distinguished the attribute of topologies as follows: 100 as the maximal tree support, 90-99 as strong support, 65–89 as moderate support, 50–64 as weak support, and <50 as negligible tree support (Lu et al., 2013). The topologies based on nuclear genes essentially resulted in moderate values of support, while the COI-based topologies showed weak values of support. Regarding the weighting scheme, the results did not show differences.

Limitations and Flaws

The experiment per se had several failures. Since the purpose of the study was to assess the influence of the weighting scheme over the phylogenetic reconstruction, the first step would have been using a different reference topology that could somehow group the different partitions into one sequence to thence consider it as the “complete” evidence. On the other hand, the range of concavity constant was definitely an eye pad. This was reflected in the little differences between each scheme. As recommended by other authors, it is reasonable to explore different values of k, to thus compare whether the resultant taxonomic groups are indeed well established or not (Goloboff et al., 2008). Although I decided to count gaps as characters, I did not consider to assign specific costs to the codon positions, which have been proved to affect the final results (Rota, 2011). Finally, the bootstrap score calculation was perhaps a good approach to assess support as a full-topology attribute. Nevertheless, I did not applied a significance test on the data, thence it was not sane to make any asseveration from those results, but the assignation of attributes for instance, strong or moderately supported.


Despite all of the aforementioned, I found some interesting results. First, the COI gene showed that it is not suitable to reconstruct the phylogeny of Pieridae under the weighting schemes I used. This, probably because it is a highly conservative gene (Lunt et al., 1996), which explains why it does not recover a correct phylogeny. This result is consistent with other studies that evaluate the phylogeny of a group, using different genome partitions ( Heikkilä et al., 2011; Chen et al., 2013). Second, unlike this gene, the EF-1a showed a different behavior in the different reconstructions, which is also consistent with Heikkilä et al., 2011. For each method, it managed to recover all the tribes from the study of Wahlberg et al., 2014 (See supplementary Material. Fig 1.). Accordingly, using this partition in further studies might be useful to yield reasonable results with a small part of the genome. As regards the weighting scheme, molecular data should be carefully managed to consider every partition as a single different unit for the analysis, thence a prior exploration of k values is convenient. Notwithstanding the poor resolution of the results presented here, it has been proved for other groups, that the weights of partitions do affect the support and stability of the phylogenetic reconstructions (Chen et al., 2013).

Supplementary Material

Fig 1S. Resultant topologies for each partition. 


Allard, M. W., & Carpenter, J. M. (1996). On weighting and congruence. Cladistics, 12, 183–198.

Arias, J.S., Miranda-Esquivel, D.R. (2005) "Yuu-PRC", macros para TNT distributed by authors. Universidad Industrial de Santander, Bucaramanga (Colombia).

Braby, M. F., Vila, R., & Pierce, N. E. (2006). Molecular phylogeny and systematics of the Pieridae (Lepidoptera: Papilionoidea): Higher classification and biogeography. Zoological Journal of the Linnean Society, 147(1986), 239–275. doi:10.1111/j.1096-3642.2006.00218.x

Carpenter, J. M. (1994). Successive eighting, reliability and evidence. Cladistics, 10, 215–220.

Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. doi:10.1093/nar/gkh340

Farris, J. S. (1969). A Successive Approximations Approach to Character Weighting. Syst Biol, 18(4), 374–385.

Goloboff, P. (1998). Tree Searches Under Sankoff Parsimony. Cladistics, 14(3), 229–237. doi:10.1006/clad.1998.0068

Goloboff, P. a. (2014). Hide and vanish: Data sets where the most parsimonious tree is known but hard to find, and their implications for tree search methods. Molecular Phylogenetics and Evolution, 79, 118–31. doi:10.1016/j.ympev.2014.06.008

Goloboff, P. A. (1993). Estimating character weights during tree search. Cladistics, 9, 83–91.

Goloboff, P. A. (1997). Self-Weighted Optimization: Tree Searches and Character State Reconstructions under Implied Transformation Costs. Cladistics, 13(3), 225–245. doi:10.1111/j.1096-0031.1997.tb00317.x

Goloboff, P. A. (2013). Cladistics Extended implied weighting, 1–13.

Goloboff, P. a., Carpenter, J. M., Arias, J. S., & Esquivel, D. R. M. (2008). Weighting against homoplasy improves phylogenetic analysis of morphological data sets. Cladistics, 24(5), 758–773. doi:10.1111/j.1096-0031.2008.00209.x

Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774–786. doi:10.1111/j.1096-0031.2008.00217.x

Heikkila, M., Kaila, L., Mutanen, M., Pena, C., & Wahlberg, N. (2012). Cretaceous origin and repeated tertiary diversification of the redefined butterflies. Proceedings of the Royal Society B: Biological Sciences, 279(September 2011), 1093–1099. doi:10.1098/rspb.2011.1430

Hennig, W. (1968). Elementos de una sistemática filogenética. Buenos Aires: Eudeba.

Neff, N. a. (1986). A Rational Basis for a Priori Character Weighting. Systematic Biology, 35(1), 110–123. doi:10.1093/sysbio/35.1.110

Nixon, K. C., & Carpenter, J. M. (2011). Cladistics On homology, 27, 1–10.

Nixon, K. C., & Carpenter, J. M. (2012). Cladistics On homology, 28, 160–169.

Patterson, C. (1988). Homology in Classical and Molecular Biology. Molecular Biology and Evolution, 5(6), 603–625.

Rodrigo, A. G. (1989). An information-rich character weighting procedure for parsimony analysis. New Zealand Natural Sciences: 16-97, 103.

Sneath R.R, S. P. H. A. (1973). Numerical Taxonomy. Freeman, San Francisco.

Swofford, D. L., & Maddison, W. P. (1987). Reconstructing ancestral character states under Wagner parsimony. Mathematical Biosciences, 87(2), 199–229. doi:http://dx.doi.org/10.1016/0025-5564(87)90074-5

Wahlberg, N., Rota, J., Braby, M. F., Pierce, N. E., & Wheat, C. W. (2014). Revised systematics and higher classification of pierid butterflies (Lepidoptera: Pieridae) based on molecular data. Zoologica Scripta, 1–10. doi:10.1111/zsc.12075