The following is based on a weekend project that I also presented as a short talk in an undergraduate combinatorics seminar. The project is self-contained and mostly based on independent work. Ideas and inspiration came from discussions with my teacher and from the introduction of Diaconis and Holmes (1998). Theorem 2 is from Semple and Steel (2003). Tree pictures were produced with Sagemath and Latex.
A phylogenetic tree is a rooted binary tree with labeled leaves.
These trees are used in biology to represent the evolutive history of species. The leaves are the identified species, the root is a common anscestor, and branching represents speciation.
An interesting problem is that of reconstructing the phylogenetic tree that best explains the observed biological characterics of a set of species. A naive mathematical formulation of this problem is proposed in section 4, and used to implement a tree reconstruction algorithm.
2. Phylogenetic trees
Let be a finite set, totally ordered in any way, such as a set of species. A phylogenetic tree on is a rooted binary tree with leaves set . Two phylogenetic trees are said to be equal if they are isomorphic under an application which, when restricted to , is the identity on . This is simply a formalisation of the idea that the leaves are labelled, whereas the other vertices are anonymous. In the following, we do not make special cases of this quotient.
We write for the set of all phylogenetic trees on , and if , then . For with root and any vertex, there exists a unique path from to . The vertices on this path are ordered in the obvious way as going away from the root. The subtree of a vertex is the subtree of containing and all the vertices under . The principal branch of is the component of with the smallest leaf; its secondary branch is the other component of . Similarly, the principal branch of an interior vertex is the principal branch of the subtree of , and the secondary branch of is the secondary branch of the subtree of . If is any vertex that is not the root, its parent is the vertex above it and its cousin is the vertex under that is not .
We begin with an useful elementary lemma. The number of elements in is .
Let . It has vertices and edges.
If has only two leaves, then it has 3 vertex and 2 edges. Otherwise, let and be the number of leaves in the principal and secondary branches of . By induction, has vertices and edges.
We can now count the number of phylogenetic trees with leaves. At the same time, we will see a first recursive characterization of .
Theorem 2 (See Semple and Steel, 2003).
For all , there exists exactly
different phylogenetic trees with leaves.
Let be defined as follows. For a tree with root , is obtained in two steps:
- remove the root and add an edge between the roots of the two components so obtained; and
- remove the leaf .
The root of is the vertex adjacent to the leaf at the begining of the second step.
Each tree has at most inverse images, since either was adjacent to the root of the preimage, or one of the edges of was added by . Since a tree is identifiable to is principal and secondary branches, it is straightforward to verify that has precisely inverse images. It follows that .
This proof provides a recursive enumeration of , by enumerating the preimage of each .
It also provides a way of sampling random phylogenetic trees, uniformly distributed on : given , we let . This can be used to integrate over .
3. Exploring the tree space
Let us describe another recursive nature of . Here, denotes the disjoint union of sets. We then have a bijection
that maps a tree to the pair of its principal and secondary branches. By letting , for all , the elements of can be identified as shown below.
Furthermore, it follows from this bijection that the exponential generating function of , , , satisfies . In other words,
3.2 Graph structure on the tree space
Let and let be an interior vertex of ( is not the root and is not a leaf). We consider two transformations of , and , defined in the diagram below. Here, is the principal branch of , is the secondary branch of , and is the parent of . Also, is the branch of not containing .
It is possible to give a formal recursive definition of these transformations, but this suffices for our purposes.
Now, two trees and are said to be adjacent in if one can be transformed into the other by such a transformation, for some interior vertex. The following theorem describes some important aspects of this graph structure on .
Let .We have
- each vertex of is of dregree ; and
- the graph is connected and its diameter is bounded by .
Proof of (1).
Let and note that the transformations of the type are invertible. The degree of is therefore the cardinality of the set . We show that whenever , from which the claim follows. Three cases are possible.
- If and . By definition, .
- If none of and is a descendant of the other. Then, there exists a vertex such that and are in distinct branches of . It follows that .
- If is a descendant of , . In this case, consider the parent of . The transform modifies the two branches of , whereas only modifies one of them. It follows that .
Proof of (2).
Suppose, without loss of generality, that . Let be the elementary tree on . Let . We show that can be transformed into through at most transformations of the type . It takes at most transformations to make the leaf adjacent to the root. By induction, it then takes at most steps to transform in . Since each is invertible, we obtain that is connected and that its diameter is bounded by .
Theorem 5 can be used to specify random walks on that converge to an arbitrary positive distribution, using a Metropolis-Hastings algorithm.
4. Phylogenetic tree reconstruction
Let be a set of species. We assume that the evolutive history of the species in is described by some unknown tree . A priori, is (say) uniformly distributed on . We observe for each specie a set of characters , being the set of all biological characters under consideration. For instance, we may have . Bears have fur whereas snakes have scales.
Now, consider the following naive character propagation model. In a tree , the root has no distinctive characters. The children and of spawn new characters sets and . We suppose that (if and were to share a character , then we would assume this character is also present in their parent ). Now, if is any vertex of that is not a leaf, with character set , the children and spawn character sets and such that , through some probability distribution. This distribution should be specified by a biologist; here it will be implicitely defined in a very simple likelihood function.
Given observed characters and a candidate tree , we can reconstruct the character sets , for each , as follows. If and are the character sets of the children of , then . We finally define the likelihood of given through
where cst. is a universal constant and the summation is taken over all vertices having children and . This likelihood was obtained through a partially specified probability model. It is maximized when the number of new characters spawned by each vertex (that is not the root) is constant, whenever that is possible.
To reconstruct , we look for the tree of maximal a posteriori probability
Through gradient ascent on the graph , we obtain local maximums.
(see the pdf.)
 Semple, C. et Steel, M., Phylogenetics, Oxford University Press, 2003.
 Diaconis, P. et Holmes, S., Matchings and Phylogenetic Trees, Proceedings of the National Academy of Sciences of the United States of America, Vol. 95, No. 25, 1998.