Combinatorics of Phylogenetic Trees

The following is based on a weekend project that I also presented as a short talk in an undergraduate combinatorics seminar. The project is self-contained and mostly based on independent work. Ideas and inspiration came from discussions with my teacher and from the introduction of Diaconis and Holmes (1998). Theorem 2 is from Semple and Steel (2003). Tree pictures were produced with Sagemath and Latex.

French pdf.

1. Introduction

A phylogenetic tree is a rooted binary tree with labeled leaves.

tree1

These trees are used in biology to represent the evolutive history of species. The leaves are the identified species, the root is a common anscestor, and branching represents speciation.

An interesting problem is that of reconstructing the phylogenetic tree that best explains the observed biological characterics of a set of species. A naive mathematical formulation of this problem is proposed in section 4, and used to implement a tree reconstruction algorithm.

fig1

2. Phylogenetic trees

Let X be a finite set, totally ordered in any way, such as a set of species. A phylogenetic tree on X is a rooted binary tree with leaves set X. Two phylogenetic trees are said to be equal if they are isomorphic under an application which, when restricted to X, is the identity on X. This is simply a formalisation of the idea that the leaves are labelled, whereas the other vertices are anonymous. In the following, we do not make special cases of this quotient.

fig2

2.1 Notations

We write \texttt{Phy}(X) for the set of all phylogenetic trees on X, and if n \in \mathbb{N}, then \texttt{Phy}(n) = \texttt{Phy}(\{1,2,\dots, n\}). For \mathcal{A} \in \texttt{Phy}(X) with root r and s any vertex, there exists a unique path from s to r. The vertices on this path are ordered in the obvious way as going away from the root. The subtree of a vertex s is the subtree of \mathcal{A} containing s and all the vertices under s. The principal branch of \mathcal{A} is the component of \mathcal{A} - r with the smallest leaf; its secondary branch is the other component of \mathcal{A} - r. Similarly, the principal branch of an interior vertex s is the principal branch of the subtree of s, and the secondary branch of s is the secondary branch of the subtree of s. If s is any vertex that is not the root, its parent p is the vertex above it and its cousin is the vertex under p that is not s.

2.2 Properties

We begin with an useful elementary lemma. The number of elements in X is n.

Lemma 1.
Let \mathcal{A} \in \texttt{Phy}(X). It has 2n-1 vertices and 2n-2 edges.

Proof.
If \mathcal{A} has only two leaves, then it has 3 vertex and 2 edges. Otherwise, let n_1 and n_2 be the number of leaves in the principal and secondary branches of \mathcal{A}. By induction, \mathcal{A} has (2n_1 - 1)+(2n_2-1)+1 = 2n-1 vertices and (2n_1-2)+(2n_2-2)+2 = 2n-2 edges. \Box

We can now count the number of phylogenetic trees with n leaves. At the same time, we will see a first recursive characterization of \texttt{Phy}(X).

Theorem 2 (See Semple and Steel, 2003).
For all n \in \mathbb{N}, there exists exactly

1 \cdots 3 \cdots 5 \cdots (2n-3) = \frac{(2n-3)!}{(n-2)! 2^{n-2}} =: (2n-3)!!

different phylogenetic trees with n leaves.

Proof.
Let T : \texttt{Phy}(n) \rightarrow \texttt{Phy}(n-1) be defined as follows. For \mathcal{A} a tree with root r, T(\mathcal{A}) is obtained in two steps:

  1. remove the root r and add an edge between the roots of the two components so obtained; and
  2. remove the leaf n.

The root of T(\mathcal{A}) is the vertex adjacent to the leaf n at the begining of the second step.

fig3

Each tree \mathcal{A}' \in \texttt{Phy}(n-1) has at most 2n-3 inverse images, since either n was adjacent to the root of the preimage, or one of the 2n-4 edges of \mathcal{A}' was added by T. Since a tree is identifiable to is principal and secondary branches, it is straightforward to verify that \mathcal{A}' has precisely 2n-3 inverse images. It follows that |\texttt{Phy}(n)| = (2n-3)|\texttt{Phy}(n-1)|. \Box

Remark 3.
This proof provides a recursive enumeration of \texttt{Phy}(n), by enumerating the preimage of each \mathcal{A}' \in \texttt{Phy}(n-1).

Remark 4.
It also provides a way of sampling random phylogenetic trees, uniformly distributed on \texttt{Phy}(n): given \mathcal{A}' \sim \text{Uniform}(\texttt{Phy}(n-1), we let \mathcal{A} \sim \text{Uniform}(T^{-1}(\mathcal{A}')). This can be used to integrate over \texttt{Phy}(n).

3. Exploring the tree space

3.1 Representation

Let us describe another recursive nature of \texttt{Phy}(X). Here, \sum denotes the disjoint union of sets. We then have a bijection

\texttt{Phy}(X) \rightarrow \sum\limits_{\min X \in A \subset X} \texttt{Phy}(A) \times \texttt{Phy}(X \backslash A)

that maps a tree to the pair of its principal and secondary branches. By letting \texttt{Phy}(\{x\}) \simeq x, for all x \in X, the elements of \texttt{Phy}(X) can be identified as shown below.

fig4

Furthermore, it follows from this bijection that the exponential generating function F(x) = \sum_{n=0}^{\infty} a_n x^n/n! of a_n := |\texttt{Phy}(n)|, a_0 = 0, a_1 = a_2 = 1, satisfies F(x) = x + F(x)^2/2. In other words,

F(x) = 1-\sqrt{1-2x}.

3.2 Graph structure on the tree space

Let \mathcal{A} \in \texttt{Phy}(X) and let s be an interior vertex of \mathcal{A} (s is not the root and is not a leaf). We consider two transformations of \mathcal{A}, \tau_1(s) and \tau_2(s), defined in the diagram below. Here, A is the principal branch of s, B is the secondary branch of s, and r is the parent of s. Also, C is the branch of r not containing s.

fig5-1fig5-2

It is possible to give a formal recursive definition of these transformations, but this suffices for our purposes.

Now, two trees \mathcal{A} and \mathcal{A}' are said to be adjacent in \texttt{Phy}(X) if one can be transformed into the other by such a transformation, for some interior vertex. The following theorem describes some important aspects of this graph structure on \texttt{Phy}(X).

Theorem 5.
Let n = |X|.We have

  1. each vertex of \texttt{Phy}(X) is of dregree 2n-4; and
  2. the graph \texttt{Phy}(X) is connected and its diameter is bounded by (n-1)(n-2).

Proof of (1).
Let \mathcal{A} \in \texttt{Phy}(X) and note that the transformations of the type \tau_i(s) are invertible. The degree of \mathcal{A} is therefore the cardinality of the set \{\tau_i(s)(\mathcal{A}) \,|\, i \in \{1,2\}, s \text{ is an interior vertex}\}. We show that \tau_i(s) (\mathcal{A}) \not = \tau_j(s') (\mathcal{A}) whenever (i, s) \not = (j,s'), from which the claim follows. Three cases are possible.

  1. If s = s' and i \not = j. By definition, \tau_i(s)(\mathcal{A}) \not = \tau_j(s)(\mathcal{A}).
  2. If none of s and s' is a descendant of the other. Then, there exists a vertex r such that s and s' are in distinct branches of r. It follows that \tau_i(s)(\mathcal{A}) \not = \tau_j(s')(\mathcal{A}).
  3. If s' is a descendant of s, s \not = s'. In this case, consider the parent r of s. The transform \tau_i(s) modifies the two branches of r, whereas \tau_j(s') only modifies one of them. It follows that \tau_i(s)(\mathcal{A}) \not = \tau_j(s')(\mathcal{A}). \Box

Proof of (2).
Suppose, without loss of generality, that X = \{1,2,\dots, n\}. Let \mathcal{E}_n \in \texttt{Phy}(n) be the elementary tree on X. fig8.pngLet \mathcal{A} \in \texttt{Phy}(X). We show that \mathcal{A} can be transformed into \mathcal{E}_n through at most {n-1 \choose 2} transformations of the type \tau_i(s). It takes at most n-2 transformations to make the leaf n adjacent to the root. By induction, it then takes at most n-2 + {n-2 \choose 2} = {n-1 \choose 2} steps to transform \mathcal{A} in \mathcal{E}(n). Since each \tau_i(s) is invertible, we obtain that \texttt{Phy}(X) is connected and that its diameter is bounded by 2 {n-1 \choose 2}. \Box

Remark 6.
Theorem 5 can be used to specify random walks on \texttt{Phy}(X) that converge to an arbitrary positive distribution, using a Metropolis-Hastings algorithm.

4. Phylogenetic tree reconstruction

Let X be a set of species. We assume that the evolutive history of the species in X is described by some unknown tree \mathsf{A} \in \texttt{Phy}(X). A priori, \mathsf{A} is (say) uniformly distributed on \texttt{Phy}(X). We observe for each specie x \in X a set of characters \mathcal{C}_x \subset \mathcal{C}, \mathcal{C} being the set of all biological characters under consideration. For instance, we may have \mathcal{C} = \{\text{fur}, \text{scales}\}. Bears have fur whereas snakes have scales.

Now, consider the following naive character propagation model. In a tree \mathcal{A} \in \texttt{Phy}(X), the root r has no distinctive characters. The children a and b of r spawn new characters sets \mathcal{C}_a and \mathcal{C}_b. We suppose that \mathcal{C}_a \cap \mathcal{C}_b = \emptyset (if a and b were to share a character c, then we would assume this character is also present in their parent r). Now, if s is any vertex of \mathcal{A} that is not a leaf, with character set \mathcal{C}_r, the children x and y spawn character sets \mathcal{C}_x\subset \mathcal{C} and \mathcal{C}_y\subset \mathcal{C} such that \mathcal{C}_x \cap \mathcal{C}_y = \mathcal{C}_s, through some probability distribution. This distribution should be specified by a biologist; here it will be implicitely defined in a very simple likelihood function.

Given observed characters \mathcal{D} = \{\mathcal{C}_x\}_{x \in X} and a candidate tree \mathcal{A}, we can reconstruct the character sets \mathcal{C}_s, for each s \in \mathcal{A}, as follows. If \mathcal{C}_x and \mathcal{C}_y are the character sets of the children of s, then \mathcal{C}_s = \mathcal{C}_x \cap \mathcal{C}_y. We finally define the likelihood of \mathcal{A} given \mathcal{D} through

\log P(\mathcal{D}\,|\, \mathcal{A}) = \text{cst.} + \sum_{s = (x, y)} \log(|\mathcal{C}_x| - |\mathcal{C}_s| + 1) + \log(|\mathcal{C}_y| - |\mathcal{C}_s| + 1),

where cst. is a universal constant and the summation is taken over all vertices s \in \mathcal{A} having children x and y. This likelihood was obtained through a partially specified probability model. It is maximized when the number of new characters spawned by each vertex (that is not the root) is constant, whenever that is possible.

To reconstruct \mathsf{A} \in \texttt{Phy}(X), we look for the tree \mathcal{A} \in \texttt{Phy}(X) of maximal a posteriori probability

P(\mathcal{A} \,|\, \mathcal{D}) \propto P(\mathcal{D} \,|\, \mathcal{A}) P(\mathcal{A}).

Through gradient ascent on the graph \texttt{Phy}(X), we obtain local maximums.

4.1 Simulation

(see the pdf.)

References.

[1] Semple, C. et Steel, M., Phylogenetics, Oxford University Press, 2003.
[2] Diaconis, P. et Holmes, S., Matchings and Phylogenetic Trees, Proceedings of the National Academy of Sciences of the United States of America, Vol. 95, No. 25, 1998.

Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s