Monday, October 30, 2017

Philosophy in phylogenetic analysis

Any discussion of philosophy in phylogenetic analysis has to take several positions into account. Some authors favor methods with sound philosophical support (coherence with epistemological theory); others leave philosophy in the background and care more about the performance of the methods (simulating data on well-supported phylogenies and then evaluating which methods best "reconstruct" them, usually probabilistic ones).
Siddall & Kluge (1997) argued that parsimony fits the epistemology developed by Karl Popper, while probabilistic methods do not. They defended parsimony over probabilistic methods on the basis of Popper's concept of corroboration. For Popper, testing a theory means trying to refute it by means of a counterexample. If the theory withstands the attempt, it is corroborated and can be accepted provisionally, but it is never verified; that is, no theory is known to be absolutely true, only not yet refuted.
The principle of parsimony, known as Ockham's razor, is attributed to the Franciscan William of Ockham. It states that entities should not be multiplied beyond necessity, which is often interpreted to mean that when alternative hypotheses explain the data equally well, the simplest one is preferred 4. This principle is not the same as the parsimony method used for phylogenetic reconstruction, which ranks phylogenetic trees according to the minimum number of character transformations they require among the taxa. Parsimony is really a family of methods; it conforms to the principle of parsimony in that hypothetical transformations are not multiplied beyond necessity (the count depends on the costs assigned to different kinds of transformations). 4
Probabilistic methods, on the other hand, have the advantage of incorporating stochasticity (processes whose behavior is non-deterministic). Maximum likelihood evaluates how well the data fit a given hypothesis, where each tree is a hypothesis and the tree that gives the data the highest probability is chosen. Bayesian analysis quantifies how a prior probability is updated into a posterior probability in light of the data.
De Queiroz & Poe (2001) argue that, contrary to the views of authors who have criticized the likelihood approach to phylogenetic inference as incompatible with Karl Popper's degree of corroboration, an examination of Popper's own writings reveals that the general concept of likelihood is the basis of his degree of corroboration. Consequently, it is not surprising that likelihood methods of phylogenetic inference are fully compatible with Popper's corroboration, and that methods of cladistic parsimony are compatible with corroboration only if they are interpreted as incorporating probabilistic assumptions. But if parsimony methods are interpreted in that way, these assumptions are models that can be used in a likelihood context, and the non-probabilistic implementations of those methods are simply approximations of their probabilistic counterparts. Interpreted in this way, there is no conflict between parsimony and maximum likelihood, because the general statistical perspective of maximum likelihood and Popperian corroboration subsumes all the individual methods and models that can be applied within that perspective, including those of cladistic parsimony. 2, 5, 6
Other authors, such as Brooks et al. (2007), describe a newer position in which neither Popper's philosophy nor statistical coherence can give priority to any one method of quantitative phylogenetic analysis; it is common for an author to present parsimony, maximum likelihood and Bayesian inference and then state a preference for one of them. 1
Personally, I agree with De Queiroz & Poe (2001), since their proposal rests on methods with good philosophical support and on corroboration rather than on finding absolute truth. Phylogenetic analyses should be based on corroboration, not on absolute truths; whatever the method, each one yields a hypothesis that can be corroborated.
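To make "minimum number of character transformations" concrete, here is a minimal sketch of step counting for a single character. It assumes the simplest member of the parsimony family (unordered states with unit costs, counted with Fitch's procedure on a rooted binary tree); the function name `fitch_steps`, the nested-tuple tree encoding and the toy data are illustrative assumptions, not taken from any of the cited works.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character.

    tree   : rooted binary tree as nested 2-tuples of taxon names
    states : dict mapping each taxon name to its observed state
    Assumes unordered states and a cost of 1 for every change.
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):          # a terminal taxon
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                         # no change needed at this node
            return common
        steps += 1                         # disjoint sets imply one change
        return left | right

    state_set(tree)
    return steps


# Toy character: taxa A and B share state "0", taxa C and D share state "1".
character = {"A": "0", "B": "0", "C": "1", "D": "1"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(fitch_steps(tree_1, character))  # 1 step
print(fitch_steps(tree_2, character))  # 2 steps
```

Under these assumptions the first tree is preferred for this character because it requires fewer transformations; assigning different costs to different kinds of transformations would change the counts.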

1 Brooks, D; Bilewitch, J; Condy, C; Evans, D; Folinsbee, K; Fröbisch, J; Halas, D; Hill, S; McLennan, D; Mattern, M; Tsuji, L; Ward, L; Wahlberg, N; Zamparo, D & Zanatta, D. 2007. Quantitative Phylogenetic Analysis in the 21st Century. Análisis Filogenéticos Cuantitativos en el siglo XXI. Revista Mexicana de Biodiversidad 78: 225–252.

2 De Queiroz, K & Poe, S. 2001. Philosophy and Phylogenetic Inference: A Comparison of Likelihood and Parsimony Methods in the Context of Karl Popper's Writings on Corroboration. Systematic Biology 50(3): 305–321.

3 Siddall, M & Kluge, A. 1997. Probabilism and phylogenetic inference. Cladistics 13: 313–336.

4 Sober, E. 1994. From a biological point of view. Cambridge Univ. Press, Cambridge, England.

5 Popper K. R. 1968. The logic of scientific discovery. Harper and Row, New York. 544 p.

6 Popper K. R. 1997. The demarcation between science and metaphysics. In The philosophy of Rudolph Carnap, P. A. Schilpp (ed.). Open Court, La Salle. p. 183-22

Philosophy in homology

According to Popper, a hypothesis can be accepted as true only as long as no other hypothesis refutes it; for Popper, science proceeds by the deductive method, in which a hypothesis is corroborated or falsified by observations (Helfenbein & Desalle, 2005).

In this context, the scientific community has both criticized and applied the premise that a character used in comparative biology must pass Patterson's test, which consists of three tests (similarity, conjunction and congruence; Patterson, 1988), before it can be called a homology and analyzed to find the kinship relationships among all existing species.

According to De Pinna (1991), "a homology in its basic form is understood as the equivalence of the parts", that is, the relation derived from the parts that are homologous (Williams & Ebach, 2008).

Following the idea of corroboration proposed by Popper (1983), one can find the logical probability that a hypothesis (a cladogram, or branching diagram that summarizes general knowledge about the kinds and relationships of organisms; Platnick & Nelson, 1978) is supported by the evidence (homology) given the background knowledge (Figure 1).

Figure 1. Equation of corroboration (Popper, 1983).
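
The figure image is not reproduced in this text. As a hedged reconstruction, the form of Popper's degree of corroboration usually quoted in the systematics literature (for example by Queiroz & Poe, 2001) is:

```latex
C(h, e, b) = \frac{p(e, hb) - p(e, b)}{p(e, hb) - p(eh, b) + p(e, b)}
```

Here p(eh, b) is the probability of the conjunction of evidence and hypothesis given the background; the numerator compares how probable the evidence is with and without the hypothesis.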

The above expression corresponds to Popper's definition of corroboration, where p = probability, h = hypothesis, e = evidence, b = background knowledge, and hb is the conjunction of h and b, so p (e, hb) is the probability of the evidence given the hypothesis and the background; that is, it has the form of a conditional probability P (A | B). Thus, the value of C is positive when the evidence supports the hypothesis, negative when the evidence counts against the hypothesis, and C = 0 when the evidence is neutral with respect to the hypothesis (Queiroz & Poe, 2001).

A clear definition of homology is very important because it lets us, as a scientific community, agree to analyze the evolutionary tracks through the same lens; we share the statement of Williams & Ebach (2008) that homology must be considered the unit of classification.
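
As a small numerical sketch of how the sign of C behaves, the function below evaluates the degree of corroboration from three probabilities. It assumes the form of the formula quoted by Queiroz & Poe (2001) and obtains p(eh, b) from the product rule, p(eh, b) = p(e, hb) · p(h, b); the function name, argument names and the numbers are hypothetical and chosen only for illustration.

```python
def corroboration(p_e_given_hb, p_e_given_b, p_h_given_b):
    """Degree of corroboration C(h, e, b), under the assumed form

        C = [p(e,hb) - p(e,b)] / [p(e,hb) - p(eh,b) + p(e,b)]

    with p(eh,b) computed as p(e,hb) * p(h,b) (product rule).
    """
    p_eh_given_b = p_e_given_hb * p_h_given_b
    denominator = p_e_given_hb - p_eh_given_b + p_e_given_b
    if denominator == 0:
        raise ValueError("C is undefined when the denominator is zero")
    return (p_e_given_hb - p_e_given_b) / denominator


# Evidence is more probable with the hypothesis than without it: C > 0.
supports = corroboration(0.8, 0.4, 0.5)
# Evidence is less probable with the hypothesis than without it: C < 0.
undermines = corroboration(0.2, 0.4, 0.5)
# Evidence is equally probable either way: C = 0.
neutral = corroboration(0.4, 0.4, 0.5)

print(supports, undermines, neutral)
```

The sign of C depends only on the numerator, that is, on whether the hypothesis makes the evidence more or less probable than the background knowledge alone does.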

Helfenbein, K. G., & Desalle, R. (2005). Falsifications and corroborations: Karl Popper's influence on systematics, 35, 271–280.
Patterson, C. (1988). Homology in Classical and Molecular Biology, 5(6), 603–625.
Pinna, M. G. G. De. (1991). Concepts and tests of homology in the cladistic paradigm, 367–394.
Platnick, N. I., & Nelson, G. (1978). A method of analysis for historical biogeography.
Queiroz, K., & Poe, S. (2001). Philosophy and Phylogenetic Inference: A Comparison of Likelihood and Parsimony Methods in the Context of Karl Popper's Writings on Corroboration. Systematic Biology, 50(3), 305–321.
Williams, D., & Ebach, M. (2008). Foundations of Systematics and Biogeography.

Friday, April 1, 2016

Concave parsimony


Some authors believed that parsimony and weighting were somehow mutually exclusive, and that a parsimony analysis should give all characters the same weight (Watrous and Wheeler, 1981). Farris (1983) saw the relationship between parsimony and weighting: he showed that the most parsimonious tree is the hypothesis with the greatest explanatory power, given the weights of the characters. Several authors favor character weighting and consider characters with more homoplasy to be less reliable. In other words, weighting characters inversely to their homoplasy should improve the results (Goloboff, 1993). The method consists of searching for the trees with maximum total fit, where the fit of each character is defined as a concave function of its homoplasy. The fittest trees are those that imply fewer steps for the characters that fit the tree better. The weight of a character is thus a function of its fit to a tree: the fit of each character is measured as a function of its homoplasy, and the total fit of the tree is the sum of the fits of its characters (Goloboff, 1993).
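
The idea of a total fit built from a concave function of homoplasy can be sketched in a few lines. The sketch assumes the commonly cited form of the fit for one character, f = K / (K + ES), where ES is the number of extra steps (homoplasy) of the character on the tree and K is the concavity constant; the function names and the toy step counts are hypothetical, and this is not TNT's code.

```python
def character_fit(extra_steps, k):
    """Fit of one character: a concave, decreasing function of homoplasy.

    Assumes f = K / (K + ES); a character with no extra steps has fit 1.
    """
    return k / (k + extra_steps)


def total_fit(extra_steps_per_character, k):
    """Total fit of a tree: the sum of the fits of its characters."""
    return sum(character_fit(es, k) for es in extra_steps_per_character)


# Toy tree on which three characters need 0, 1 and 3 extra steps.
homoplasy = [0, 1, 3]

print(total_fit(homoplasy, k=3))     # 1 + 0.75 + 0.5 = 2.25
print(total_fit(homoplasy, k=1000))  # close to 3: weights are almost equal
```

With a small K, homoplastic characters are penalized strongly; as K grows, every fit approaches 1, so the total fit rises toward the number of characters and the analysis approaches equal weighting.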


DNA sequences were simulated in Seq-Gen (Rambaut and Grassly, 1997) with different sequence lengths (10, 100 and 1000) under the GTR, F84 and HKY evolutionary models for each sequence length. A parsimony analysis was then run using implied weighting with different values of the concavity constant (K=1, K=10, K=40, K=100 and K=1000), as implemented in TNT v.1.5 (Goloboff et al., 2008). To see the difference between implied weighting and equal weighting, another parsimony analysis was carried out without implied weighting. To explore the sensitivity of the topology to weighting, the Robinson-Foulds metric, or symmetric difference (Robinson & Foulds, 1981), was computed in R v 3.2.3 (R Core Team, 2014), both between the weighted and unweighted trees and between each of them and the reference tree (simulated in the Farris and Felsenstein zones).
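
The Robinson-Foulds metric used above can be illustrated with a short, self-contained sketch. The analysis in this post was done in R; the Python code below is only an illustration of the definition (the number of non-trivial bipartitions found in one tree but not in the other), with a hypothetical nested-tuple tree encoding and toy trees that are not the simulated ones.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of taxon names.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so the result does not depend on where the tree is rooted.
    """
    clades = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clades.append(below)
        return below

    all_taxa = leaves(tree)
    reference = min(all_taxa)
    result = set()
    for side in clades:
        if reference in side:
            side = all_taxa - side
        # Skip trivial splits (a single taxon versus all the others).
        if 1 < len(side) < len(all_taxa) - 1:
            result.add(side)
    return result


def robinson_foulds(tree_a, tree_b):
    """Symmetric difference between the bipartition sets of two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))

print(robinson_foulds(tree_1, tree_1))  # 0: identical topologies
print(robinson_foulds(tree_1, tree_2))  # 2: one bipartition differs in each tree
```

RF = 0 therefore means that two analyses recovered the same topology, which is how the values reported below should be read.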

Results and discussion

In both the Felsenstein and Farris zones, under the same model and sequence length, the Total Fit increases as the concavity constant (K) increases. In general, under the same model, the Total Fit also increases with sequence length. In the Felsenstein zone, the Total Fit under GTR-10 is zero for all K values, and under F84-10 and HKY-10 it is 1 for all K values. The highest Total Fit values are obtained under F84-10000 (Fig 1). In the Farris zone, the data sets with the shortest sequences have the lowest Total Fit values, which decrease as K decreases, and the data sets with the longest sequences have the highest Total Fit values, which increase as K increases (Fig 2).

The K values had no influence on the topology under the same model and sequence length; in other words, the same topology was recovered for every K. For that reason I applied the Robinson-Foulds metric between one implied-weights tree and one equal-weights tree for each model and sequence length. In the Felsenstein zone, implied weights and equal weights recovered the same topology (RF=0) for all models except GTR-10 (RF=2) (Fig 3). In the Farris zone, the F84-1000, GTR-1000 and HKY-1000 models recovered the same topology.

The RF metric between the reference tree and the implied-weights trees showed that in the Farris zone most topologies were recovered, except for F84-10 and GTR-10, while equal weights recovered the reference topology except for F84-10. In contrast, in the Felsenstein zone implied weights recovered the reference topology under GTR-1000 and HKY-1000, and equal weights recovered it under F84-1000, GTR-1000 and HKY-1000 (Fig 4, 5).

In conclusion, homoplasy can be related to the sequence length and to the nucleotide substitution model used; beyond this, I think it also depends on how conserved the nucleotide sequence is.

Figure 1. Felsenstein zone. Total Fit values for each concavity constant (K) under different models and sequence lengths. The Total Fit was transformed with the natural logarithm for easier comparison. Models GTR-10000, F84-10000 and HKY-10000 have the highest Total Fit values.

Figure 2. Farris zone. Total Fit values for each concavity constant (K) under different models and sequence lengths. The Total Fit was transformed with the natural logarithm for easier comparison. Under the HKY model with a sequence length of 10 the K value is 1 (log(1)=0).

Figure 3. Robinson-Foulds metric (RF) for the Felsenstein and Farris zones (implied weights vs equal weights).

Figures 4-5. RF metric for the Farris and Felsenstein zones (implied weights and equal weights) against the reference tree.


Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.

Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology, 27(4), 401–410.

Goloboff, P. (1993). Estimating character weights during tree search. Cladistics, 9, 83–91.

Goloboff, P. (1997). Self-weighted optimization: tree searches and character state reconstructions under implied transformation costs. Cladistics, 13, 225–245.

Goloboff, P. (2013). Extended implied weighting. Cladistics, 1–13.

Robinson, D., & Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1), 131–147.

R Core Team. (2014). R: A Language and Environment for Statistical Computing. Vienna, Austria.

Watrous, L., & Wheeler, Q. (1981). The out-group comparison method of character analysis. Systematic Zoology, 30, 1–11.