martes, 22 de febrero de 2011

MAXIMUM LIKELIHOOD AND MAXIMUM PARSIMONY UNDER A SIMPLE MODEL.

Jiménez- Silva C. L.
Universidad Industrial de Santander.
Laboratorio de Sistemática y Biogeografía

INTRODUCTION

Stochastic models for nucleotide substitution are becoming increasingly important as a foundation for inferring phylogenetic trees from genetic sequence data. Such models allow for tree reconstruction through either maximum likelihood-based approaches or the _tting of transformed functions of the data to trees (see Swo_ord et al. (Swo_ord et al., 1996) for a recent survey). The models are also useful for analysing the performance of other, more conventional tree reconstruction methods, which are not explicitly based on such models, such as the popular maximum parsimony method (Fitch, 1971). Maximum parsimony (MP) is a popular technique for phylogeny reconstruction. However, MP is often criticized as being a statistically unsound method and one that fails to make explicit an underlying ‘‘model’’ of evolution (Steel and Penny, 2000). Parsimony does not make explicit assumptions about the evolutionary process. Some authors argue that parsimony makes no assumptions at all and that, furthermore, phylogenies should ideally be inferred without invoking any assumptions about the evolutionary process (Wiley 1981). Others point out that it is impossible to make any inference without a model; that a lack of explicit assumptions does not mean that the method is ‘assumption-free’ as the assumptions may be merely implicit; that the requirement for explicit specification of the assumed model is a strength rather than weakness of the model-based approach since then the fit of the model to data can be evaluated and improved (e.g. Felsenstein 1973).

METHODS

In the analysis were simulated sequences of 1000 bp under JC69 model and all branches on the tree are assumed to have the same length, were 10 replicates of each simulation under Seq-Gen v1.3.2 program (Rambaut & Grassly, 1997). For different topologies, 3 taxa, 4 taxa, 6 taxa and 12 taxa. The sequences generated were analyzed in parsimony using TNT (Goloboff et. al, 2008) program and Winclada and NONA (Goloboff, 1999; Nixon, K. C. 1999) program to 3 taxa. Also, the sequences were analyzed in Maximum likelihood using PhyML (Guindon & Gascuel, 2005), to check an equivalence between parsimony and likelihood under a particular model, this is JC69. The nucleotide models evaluated were JC69 in PhyML, For each simulation was performed the same procedure and finally the topologies generated were compared with Tree C program (Arias & Miranda-Esquivel) assuming on equal an exactly equal nodes. It obtains eventually a total of 200 comparisons were made and the process was automated by constructing scripts in bash.

RESULTS

When, I compared the phylogenetic reconstructions, data showed equivalence between parsimony and likelihood under a JC69 model. Equivalence here means that the most parsimonious tree and the ML tree under the said model are identical in every possible data set. But, this result was only present with the data set of a few terminals, ie 3 and 4 taxa.

JC model was assumed for comparison because I refer to it, as the fully symmetric model since it makes no distinction between any of the character states and being with each sequence being 1000 nt long. As a first approximation, there is no selection at any of the sites, and therefore it is more ‘‘parsimonious’’ to assume one common mechanism for all sites rather than 1000 different mechanisms, one for each site.

That parsimony and likelihood trees used for working with the JC model, sometimes Called the Neyman model with four states. It assumes rates of evolution on the branch of the tree each freely Vary from site to site. In this case, we have some underlying constraints on the type of substitution model (ie, Jukes-Cantor type), but no constraints on the edge parameters from site to site. This is even more general than the type of approach considered by Olsen (see Swofford et al. 1996, p. 443) in which the rate at which a site evolves can vary freely from site to site, but the ratios of the edge lengths are equal across the sites. (Steel and Penny, 2000). On the methodology used in this work, a free parameters model was assumed. For this purpose, was assigned the custom model option in Phyml. When the custom model option was selected, also it is possible to Give to the program a user-defined nucleotide frequency distribution at equilibrium, where calculated parameters are given by the data. Based on this, it is proposed that this type of Underlying model Almost Certainly is too flexible, because it allows many new parameters for each edge. It might be regarded as the model one might start with if one knew virtually nothing about any common underlying mechanism

linking the evolution of different characters on a tree. (Steel and Penny, 2000).

For the data set of 6 and 12 taxa the results were different; between both methods they were obtained under three equal nodes. This difference of equal nodes based on the number of terminals, in concordance with Hendy and Penny (1989) showed that with four species and binary characters evolving under the clock, parsimony is always consistent in recovering the tree, although ML and parsimony do not appear to be equivalent under the model. With five or more species evolving under the clock, it is known that parsimony can be inconsistent in estimating the trees (Hendy and Penny 1989; Zharkikh and Li 1993). Thus it is not equivalent to likelihood.

While, you can consider parsimony and likelihood to be equivalent under the JC69 model, those studies often used small trees with three to six taxa. The cases for much larger trees are not known. However, it appears easier to identify cases of inconsistency of parsimony on large trees than on small trees (Kim 1996; Huelsenbeck and Lander 2003), suggesting that likelihood and parsimony are in general not equivalent on large trees.

REFERENCES

Arias J. S., Miranda-Esquivel D. M. 2007. Tree C.

W. M. Fitch. Toward de_ning the course of evolution: minimum change for a speci_c tree topology.Systematic Zoology, 20:406{416, 1971.

D. L. Swo_ord, G. J. Olsen, P. J. Waddell, and D. M. Hillis. Phylogenetic inference. In D. M. Hillis, C. Moritz, and B. K. Marble, editors, Molecular Systematics, chapter 11, pages 407{514. Sinauer Associates, 2nd edition, 1996.

Golobo, P., 1999. NONA (No Name) ver. 2. Published by the author, Tucuman, Argentina

Felsenstein, J. 1973b. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Zool. 22:240–249.

Hendy, M. D. and Penny, D. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297–309.

Huelsenbeck, J. P. and Lander, K. M. 2003. Frequent inconsistency of parsimony under a simple model of cladogenesis. Syst Biol 52:641–648.

Kim, J. 1996. General inconsistency conditions for maximum parsimony: effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363–374.

Nixon, K. C. 1999. Winclada (BETA) ver. 0.9.9 PUBLISHED BY THE AUTHOR, ITHACA, NY. I have become weary of Clados generated trees being published without citation. Please cite the program

M. Steel and D. Penny, Parsimony, likelihood and the role of models in molecular phyloge-netics. Molecular Biology and Evolution 17 839{850 (2000).

Wiley, E. O. 1981. Phylogenetics. The Theory and Practice of Phylogenetic Systematics. John Wiley & Sons, New York.

Zharkikh, A. and Li,W. -H. 1993. Inconsistency of the maximum parsimony method: the case of five taxa with a molecular clock. Syst. Biol. 42:113–125.

sábado, 19 de febrero de 2011

Evaluation of topological congruence based on GTR model and different approaches to integrate the rate heterogeneity implemented under RAxML and PhyML


Introduction

The variation of substitution rates across nucleotide sites is a well-known process that can vary substantially for different positions in a sequence (Salemi & Vandamme). Accordingly, the rate of each sequence is modeled as a random variable drawn from a specified prior distribution, the most common empirical approach is to use a gamma density function to model the rate of distribution across sites (Mayrose et. al, 2005). However, is possible find opposing views for and against about implementing empirically this distribution especially with GTR model, some objections relate to the computational cost, memory consumption, rates not unimodals, etc. Innovative approaches have been raised to incorporate rate heterogeneity in programs such RAxML (Stamakis, 2006), nevertheless, the likelihood values produced can not be directly compared to likelihood values of other ML programs and little is known about the topological congruence when comparing its results. Given that, the objective of this study is to evaluate the topological congruence based on GTR model and different approaches to incorporates the rate heterogeneity under RAxML and PhyML programs.

Methods

In the analysis, sequences of 2000 bp were simulated under GRT and GTRGAMMA models under Seq-Gen v1.3.2 program (Rambaut & Grassly, 1997) in order to use GTR as control and avoid, a priori, a biased in favor of Γ. The sequences generated were analyzed in parsimony using TNT (Goloboff et. al, 2008) program and a topology was reconstructed. The parsimony´s topology and randomly selected topologies were used as starting trees in RAxML (Stamakis, 2006) and PhyML (Guindon & Gascuel, 2005) programs to determine if the initial topology influences the reconstruction of the phylogenetic relationships in both programs under the different models selected. The nucleotide models evaluated were GTRCAT, GTRMIX, GTRGAMMA and GTRCAT_GAMMA in RAxML versus GTR, GTRGAMMA in PhyML, also for each model was evaluated the number of categories to optimize the per-site individual rates in order to preclude the effect of over or subestimation, the categories defined were: 4, 10, 25, 35, 45, 50. For each simulation was performed the same procedure and finally the topologies generated were compared with Tree C program (Arias & Miranda-Esquivel) based on equal an exactly equal nodes. A total of 264 comparisons were made and the process was automated by constructing scripts in bash.

Results

The findings in this study indicate that RAxML with sequences simulated under GTR or GTRGAMMA recovered the shorter topology when it used the models GTRMIX, GTRGAMMA and GTRCAT_GAMMA from a random initial topology. However, if you provided a fixed starting tree (Maximum Parsimony topology) the topology recovered had an additional node (21 nodes). GTRCAT approximation was the only that recovered in both cases the same topology with 21 nodes coinciding with the GTR and GTRGAMMA models under PhyML. This implies, although GTRCAT approximation allows to computing trees under realistic approximation of nucleotide substitution (Stamakis, 2006), it do not perform the optimal topological inference based on the congruence between the nodes reconstructed. The analysis with distinct rate categories settings did not influence the tree recovered in PhyML, with one equal node and no node exactly equal in all cases, while, in RAxML the results are unstable between categories and did not follow a trend with increasing number of rate categories specified. Despite this, depending of the model selected, the number of equals nodes increases with specific categories, although the findings were variable under sequences simulated under GTR and GTRGAMMA (see table: http://dl.dropbox.com/u/16494208/Tabla.ods).

On the other hand, when the topologies estimated under different models implemented in PhyML and RAxML were compared (see appendix), the number or equal nodes recovered, from random starting tree, was better with the GTRMIX approximation in half the categories (10, 25, 50) and the worth scenario was obtained under GTRCAT with two equal nodes in 50 category. Additionally, topologies recovered from fixed starting tree (parsimony topology), showed a surprisingly high number of equal nodes (until 18 nodes) and exactly equal nodes (until 17 nodes) independently of the category analyzed.

Trees recovered with GTRGAMMA and GTRMIX models implemented in RAxML versus trees recovered with GTR/GTRGAMMA in PhyML, reconstructed the highest number of equal nodes followed by GTRGAMMA_CAT and GTRCAT, although the latter did not showed exactly equal nodes. The results found were exactly equals in the comparisons of the resulting trees under the models implemented in PhyML and the models used in RAxML in each simulation. But differences were detected when the comparision was performed under GTRAGAMMA simulation, because the behavior changed and the number of equal nodes increased in various categories under GRTCAT and GTRCAT_GAMMA followed by GTRGAMMA and GTRMIX versus GTR/GTRGAMMA in PhyML. Possibly, in this specific case, the rate heterogeneity described with gamma quantified from GRTCAT and GTRCAT_GAMMA approaches in RAxML is the closest to the response obtained with PhyML.

Conclusions

This study showed that the topological congruence in analysis using differents approaches to evaluate the rate heterogeneity in GTR is optimally treated with PhyML and RAxML, however PhyML is more stable and conservative also it is not strongly influenced by the number of categories. Contrary, RAxML is more sensible to the number of categories but it was not possible identify some type of trend or behavior to explain the changes. Additional findings, indicated that the topological congruence can be influenced by the starting tree in the analysis and by the model of simulation.

References

Arias J. S., Miranda-Esquivel D. M. 2007. Tree C.

Guindon S., Gascuel O. 2003. PhyML:"A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood."Systematic Biology. 2003 52(5):696-704.

Goloboff, P. A., Farris, J. S. and Nixon, K. C. 2008. TNT, a free program for phylogenetic analysis. Cladistics, 24: 774–786.

Mayrose I, Friedman N, Pupko T. 2005. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics. 21(Suppl. 2):ii151–ii158.

Rambaut, A. and Grassly, N. C. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13: 235-238.

Salemi M., Vandamme A. M. (eds.). 2003. The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing. 2nd Edition, 381-404, Cambridge University Press, Cambridge.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22.

Annexes

Script for simulation of molecular sequences

#!/bin/bash

cd /home/andromeda/Seq-Gen.v1.3.2/source

for m in *.tree; do

for a in *.txt; do

echo "./seq-gen -mGTR -l2000 -n1 -z150 -op < $m > gtr$m$l.dat" >>gtr.txt

echo "./seq-gen -mGTR -g4 -l2000 -n1 -z150 -op < $m > gtrg$m$l.dat" >>gtrg.txt

chmod +x $a

./$a

done

done

mkdir gtr

mv gtr*.dat gtr

Script to run on PhyML and RAxML

RAxML:

#!/bin/bash

cd /home/andromeda/RAxML-7.0.4/

for s in *.datdo

for mod in GTRGAMMA GTRCAT GTRMIX GTRCAT_GAMMA;

do

for a in 4 10 25 35 45 50;

do

./raxml -f d -t parsi.tree -s $s -c $a -m $mod -n par$s$a$mod

done

done

done

mkdir parsi

mv RAxML_result.* parsi

PhyML

#!/bin/bash

cd /home/andromeda/PhyML_3.0/

for s in *.dat;

do

# categories

for a in 4 10 25 35 45 50;

do

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -a 1 -s BEST -o tlr --r_seed 12345 -- print_site_lnl --run gg$s$a" >> PHYML$s$a.txt

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -s BEST -o tlr --r_seed 12345 --print_site_lnl --run g$s$a" >> PHYML$s$a.txt

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -a 1 -s BEST -u parsi.tree -o tlr --r_seed 12345 --print_site_lnl --run ggp$s$a" >> PHYML$s$a.txt

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -s BEST -u parsi.tree -o tlr --r_seed 12345 --print_site_lnl --run gp$s$a" >> PHYML$s$a.txt

chmod +x PHYML$s$a.txt

./PHYML$s$a.txt

done

done

mkdir parsiphy

#mv RAxML_result.* parsi

Script to add header and change format
#!/bin/bash
for i in pargtrgarbol.tree.*; do
sed -i "1itread 'tree(s) from TNT, for data in e.nex'" $i
echo "proc-;" >>$i
done

Tree C´s script : comparison trees
#!/bin/bash
for i in gpgtrgarbol.tree.dat25.txt; do
for e in pargtrgarbol.tree.*; do
echo "./treec_linux -f $i -l $e -o sal$e$i.dat -t" >> treec.txt
chmod +x treec.txt
./treec.txt
done
done
mkdir comparison
mv *.dat comparison