sábado, 19 de febrero de 2011

Evaluation of topological congruence based on GTR model and different approaches to integrate the rate heterogeneity implemented under RAxML and PhyML


Introduction

The variation of substitution rates across nucleotide sites is a well-known process that can vary substantially for different positions in a sequence (Salemi & Vandamme). Accordingly, the rate of each sequence is modeled as a random variable drawn from a specified prior distribution, the most common empirical approach is to use a gamma density function to model the rate of distribution across sites (Mayrose et. al, 2005). However, is possible find opposing views for and against about implementing empirically this distribution especially with GTR model, some objections relate to the computational cost, memory consumption, rates not unimodals, etc. Innovative approaches have been raised to incorporate rate heterogeneity in programs such RAxML (Stamakis, 2006), nevertheless, the likelihood values produced can not be directly compared to likelihood values of other ML programs and little is known about the topological congruence when comparing its results. Given that, the objective of this study is to evaluate the topological congruence based on GTR model and different approaches to incorporates the rate heterogeneity under RAxML and PhyML programs.

Methods

In the analysis, sequences of 2000 bp were simulated under GRT and GTRGAMMA models under Seq-Gen v1.3.2 program (Rambaut & Grassly, 1997) in order to use GTR as control and avoid, a priori, a biased in favor of Γ. The sequences generated were analyzed in parsimony using TNT (Goloboff et. al, 2008) program and a topology was reconstructed. The parsimony´s topology and randomly selected topologies were used as starting trees in RAxML (Stamakis, 2006) and PhyML (Guindon & Gascuel, 2005) programs to determine if the initial topology influences the reconstruction of the phylogenetic relationships in both programs under the different models selected. The nucleotide models evaluated were GTRCAT, GTRMIX, GTRGAMMA and GTRCAT_GAMMA in RAxML versus GTR, GTRGAMMA in PhyML, also for each model was evaluated the number of categories to optimize the per-site individual rates in order to preclude the effect of over or subestimation, the categories defined were: 4, 10, 25, 35, 45, 50. For each simulation was performed the same procedure and finally the topologies generated were compared with Tree C program (Arias & Miranda-Esquivel) based on equal an exactly equal nodes. A total of 264 comparisons were made and the process was automated by constructing scripts in bash.

Results

The findings in this study indicate that RAxML with sequences simulated under GTR or GTRGAMMA recovered the shorter topology when it used the models GTRMIX, GTRGAMMA and GTRCAT_GAMMA from a random initial topology. However, if you provided a fixed starting tree (Maximum Parsimony topology) the topology recovered had an additional node (21 nodes). GTRCAT approximation was the only that recovered in both cases the same topology with 21 nodes coinciding with the GTR and GTRGAMMA models under PhyML. This implies, although GTRCAT approximation allows to computing trees under realistic approximation of nucleotide substitution (Stamakis, 2006), it do not perform the optimal topological inference based on the congruence between the nodes reconstructed. The analysis with distinct rate categories settings did not influence the tree recovered in PhyML, with one equal node and no node exactly equal in all cases, while, in RAxML the results are unstable between categories and did not follow a trend with increasing number of rate categories specified. Despite this, depending of the model selected, the number of equals nodes increases with specific categories, although the findings were variable under sequences simulated under GTR and GTRGAMMA (see table: http://dl.dropbox.com/u/16494208/Tabla.ods).

On the other hand, when the topologies estimated under different models implemented in PhyML and RAxML were compared (see appendix), the number or equal nodes recovered, from random starting tree, was better with the GTRMIX approximation in half the categories (10, 25, 50) and the worth scenario was obtained under GTRCAT with two equal nodes in 50 category. Additionally, topologies recovered from fixed starting tree (parsimony topology), showed a surprisingly high number of equal nodes (until 18 nodes) and exactly equal nodes (until 17 nodes) independently of the category analyzed.

Trees recovered with GTRGAMMA and GTRMIX models implemented in RAxML versus trees recovered with GTR/GTRGAMMA in PhyML, reconstructed the highest number of equal nodes followed by GTRGAMMA_CAT and GTRCAT, although the latter did not showed exactly equal nodes. The results found were exactly equals in the comparisons of the resulting trees under the models implemented in PhyML and the models used in RAxML in each simulation. But differences were detected when the comparision was performed under GTRAGAMMA simulation, because the behavior changed and the number of equal nodes increased in various categories under GRTCAT and GTRCAT_GAMMA followed by GTRGAMMA and GTRMIX versus GTR/GTRGAMMA in PhyML. Possibly, in this specific case, the rate heterogeneity described with gamma quantified from GRTCAT and GTRCAT_GAMMA approaches in RAxML is the closest to the response obtained with PhyML.

Conclusions

This study showed that the topological congruence in analysis using differents approaches to evaluate the rate heterogeneity in GTR is optimally treated with PhyML and RAxML, however PhyML is more stable and conservative also it is not strongly influenced by the number of categories. Contrary, RAxML is more sensible to the number of categories but it was not possible identify some type of trend or behavior to explain the changes. Additional findings, indicated that the topological congruence can be influenced by the starting tree in the analysis and by the model of simulation.

References

Arias J. S., Miranda-Esquivel D. M. 2007. Tree C.

Guindon S., Gascuel O. 2003. PhyML:"A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood."Systematic Biology. 2003 52(5):696-704.

Goloboff, P. A., Farris, J. S. and Nixon, K. C. 2008. TNT, a free program for phylogenetic analysis. Cladistics, 24: 774–786.

Mayrose I, Friedman N, Pupko T. 2005. A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics. 21(Suppl. 2):ii151–ii158.

Rambaut, A. and Grassly, N. C. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13: 235-238.

Salemi M., Vandamme A. M. (eds.). 2003. The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing. 2nd Edition, 381-404, Cambridge University Press, Cambridge.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22.

Annexes

Script for simulation of molecular sequences

#!/bin/bash

cd /home/andromeda/Seq-Gen.v1.3.2/source

for m in *.tree; do

for a in *.txt; do

echo "./seq-gen -mGTR -l2000 -n1 -z150 -op < $m > gtr$m$l.dat" >>gtr.txt

echo "./seq-gen -mGTR -g4 -l2000 -n1 -z150 -op < $m > gtrg$m$l.dat" >>gtrg.txt

chmod +x $a

./$a

done

done

mkdir gtr

mv gtr*.dat gtr

Script to run on PhyML and RAxML

RAxML:

#!/bin/bash

cd /home/andromeda/RAxML-7.0.4/

for s in *.datdo

for mod in GTRGAMMA GTRCAT GTRMIX GTRCAT_GAMMA;

do

for a in 4 10 25 35 45 50;

do

./raxml -f d -t parsi.tree -s $s -c $a -m $mod -n par$s$a$mod

done

done

done

mkdir parsi

mv RAxML_result.* parsi

PhyML

#!/bin/bash

cd /home/andromeda/PhyML_3.0/

for s in *.dat;

do

# categories

for a in 4 10 25 35 45 50;

do

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -a 1 -s BEST -o tlr --r_seed 12345 -- print_site_lnl --run gg$s$a" >> PHYML$s$a.txt

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -s BEST -o tlr --r_seed 12345 --print_site_lnl --run g$s$a" >> PHYML$s$a.txt

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -a 1 -s BEST -u parsi.tree -o tlr --r_seed 12345 --print_site_lnl --run ggp$s$a" >> PHYML$s$a.txt

echo "./phyml -i $s -d nt -q -n 1 -p -m GTR -c $a -s BEST -u parsi.tree -o tlr --r_seed 12345 --print_site_lnl --run gp$s$a" >> PHYML$s$a.txt

chmod +x PHYML$s$a.txt

./PHYML$s$a.txt

done

done

mkdir parsiphy

#mv RAxML_result.* parsi

Script to add header and change format
#!/bin/bash
for i in pargtrgarbol.tree.*; do
sed -i "1itread 'tree(s) from TNT, for data in e.nex'" $i
echo "proc-;" >>$i
done

Tree C´s script : comparison trees
#!/bin/bash
for i in gpgtrgarbol.tree.dat25.txt; do
for e in pargtrgarbol.tree.*; do
echo "./treec_linux -f $i -l $e -o sal$e$i.dat -t" >> treec.txt
chmod +x treec.txt
./treec.txt
done
done
mkdir comparison
mv *.dat comparison