Elementary topology and machine learning

The machine learning bubble may be overinflated, but it is not about to burst. Interdisciplinary research in this field is grounded in sound theory and has numerous empirical breakthroughs to show for it. As it finds more and more applications and concentrates public research funding, many of us are still wondering: how can mathematics contribute?
This is a case study of an interaction between elementary topology and machine learning’s binary classification problem: following classical theorems, we obtain a topologically accurate solution.

Correction: it should be \sup_{i,j}\left| E_{i,j}^n - (k+1)^2 \int_{R_{i,j}^n} f \right| = o(1/n), with k = k(n) growing correspondingly slower with n.



Bayesian learning

Friday, July 28, at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical Congress of the Americas and at the Canadian Undergraduate Mathematics Conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some tough problems of Bayesian nonparametrics, and then present some of our research results. All this must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and asymptotic correctness rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.

1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

1.1 The classical problem

Some stochastic mechanism generates data x_1, x_2, x_3, \dots, each x_i being an independent sample from an unknown distribution P_0. We want to use the data to learn about P_0.


A cryptographer observes words at random from an unknown countable language. Learning could be about:

  • What’s the distribution of the observed words?
  • What’s the probability that the next word to be observed has never been seen before?
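The talk leaves these questions open; outside the Bayesian framework, one classical answer to the second question is the Good–Turing estimate, which approximates the probability of seeing a new word by the fraction of observed words seen exactly once. A minimal sketch (the sample words are made up):

```python
from collections import Counter

def prob_next_unseen(words):
    """Good-Turing estimate: P(next word is new) ~ (# words seen exactly once) / n."""
    counts = Counter(words)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(words)

# Hypothetical observations: "cipher", "key" and "lattice" each occur once.
sample = ["the", "of", "the", "cipher", "key", "of", "the", "lattice"]
print(prob_next_unseen(sample))  # 3/8 = 0.375
```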

1.2 Bayesian Learning

Uncertain knowledge about what P_0 may be is encoded through a prior probability measure \Pi on the space of all possibilities for P_0. Thus, if A is some set of probability distributions, then \Pi(A) is the probability, representing the best of our knowledge, that “P_0 \in A”. Then, we observe the data \mathcal{D} = (x_1, x_2, \dots, x_n). This yields the posterior measure \Pi(\cdot \mid \mathcal{D}), obtained by adjusting \Pi in light of \mathcal{D} through probabilistic conditioning.

Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability p_0 that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what p_0 may be, could look like this:


It is most likely that the coin is approximately unbiased (p_0 \approx 1/2), but not impossible that it is strongly biased.

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.


It is most likely that the coin is biased and that p_0 \approx 1/4.

Certainly the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors:
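With a conjugate Beta(a, b) prior on p_0 (an assumption; the post does not say which prior generated the plots), both the update and the prior-sensitivity claim can be checked in closed form:

```python
def posterior_params(a, b, heads, tosses):
    """Conjugate update: a Beta(a, b) prior with Binomial data
    gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + (tosses - heads)

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# 5 heads in 20 tosses, starting from a prior peaked near 1/2.
a_post, b_post = posterior_params(5, 5, heads=5, tosses=20)
print(posterior_mean(a_post, b_post))  # 1/3

# A similar prior yields a similar posterior, as claimed.
print(posterior_mean(*posterior_params(4, 4, heads=5, tosses=20)))  # 9/28 ≈ 0.32
```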


2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word \omega has some probability P_0(\omega) of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always, and not necessarily in a way that makes Bayesian learning meaningful. In 1986, Diaconis and Freedman showed the following in the context of Bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimension. To approach this problem, we need to identify characteristics of prior distributions that

  1. describe how uncertainty is spread out; and
  2. ensure that Bayesian learning works correctly.

2.1 A positive result

An important positive result is based on the following two characteristics of prior distributions. We denote by \mathbb{F} the set of all probability densities on some common space, assuming that P_0 \in \mathbb{F} (the data-generating distribution has some density in \mathbb{F}).

  1. \Pi puts positive mass on the relative entropy neighborhoods of P_0: for every \varepsilon > 0,

\Pi\left(\left\{P \in \mathbb{F}\,|\, \int \log \frac{dP_0}{dP} \, dP_0 < \varepsilon \right\}\right) > 0.

This means that a priori, we’re not excluding the truth from the set of possibilities. Since P_0 is unknown, we require that this condition be satisfied whatever P_0 may be.

  2. \Pi is of finite entropy: for all \delta > 0, there exist 0 < \alpha < 1 and a covering \{A_i\} of \mathbb{F} by sets of L^1 diameter less than \delta such that

\sum_{i} \Pi(A_i)^\alpha < \infty.

This means that \Pi is not too complex and that we can make sense of it through discretization.

Under these two conditions, Bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004)
If conditions (1) and (2) are satisfied, then for any L^1 neighborhood N of P_0, we have

\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1

almost surely as n \rightarrow \infty, with x_i \sim P_0.
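To illustrate what this concentration means, here is a toy finite version (with hypothetical candidate distributions, far simpler than Walker's nonparametric setting): with three candidate coin biases and Bayes' rule, the posterior mass on the true bias tends to 1 as data accumulates.

```python
import math
import random

# Toy model: three candidate coin biases; the data actually comes from p0 = 0.5.
candidates = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]

random.seed(0)
data = [random.random() < 0.5 for _ in range(500)]  # i.i.d. draws from P_0

def posterior(prior, candidates, data):
    """Bayes' rule: Pi(p | data) ∝ Pi(p) * likelihood(data | p), in log space."""
    logs = [math.log(pr) + sum(math.log(p if x else 1 - p) for x in data)
            for p, pr in zip(candidates, prior)]
    m = max(logs)  # subtract the max before exponentiating, for stability
    w = [math.exp(l - m) for l in logs]
    return [x / sum(w) for x in w]

post = posterior(prior, candidates, data)
print(post)  # nearly all mass on the true bias 0.5
```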

This is helpful, but has not yet solved our problem.

  • How do we, generally, construct priors satisfying the two conditions?
  • How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

3. Some of our research

There are two contributions I want to present:

  • We developed a relationship between some statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
  • We use it to solve problems raised in the literature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.


We want to learn about the data-generating mechanism, e.g. do

  • density estimation,
  • hypothesis tests,
  • classification,

or any other statistical procedure. First, we specify a prior on the set \mathbb{F} of all bounded densities on the circle.

3.1 Prior specification

We begin with a density basis of trigonometric polynomials introduced in 2009:

C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.
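The normalizing constant is left implicit above; using the identity \int_0^{2\pi} (1+\cos u)^n \, du = 2\pi \binom{2n}{n}/2^n, each basis function can be normalized so that it is itself a density on the circle. A quick numerical check:

```python
import math
import numpy as np

def basis(j, n, u):
    """Normalized basis: C_{j,n}(u) = (1 + cos(u - 2*pi*j/(2n+1)))^n / const,
    where const = 2*pi * C(2n, n) / 2^n makes each C_{j,n} integrate to 1."""
    const = 2 * np.pi * math.comb(2 * n, n) / 2 ** n
    return (1 + np.cos(u - 2 * np.pi * j / (2 * n + 1))) ** n / const

# Riemann sum over a uniform grid on [0, 2*pi) checks the normalization.
u = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
vals = basis(3, 10, u)
print(vals.mean() * 2 * np.pi)  # ≈ 1: each C_{j,n} is itself a density
```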


We studied the statistical properties of this basis and used it to construct approximation operators

T_n : \mathbb{F} \ni f \mapsto \sum_j  C_{j,n} \int_{R_{j,n}}f(x) dx,

where R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right). It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.
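A numerical sketch of T_n under the same normalization (the target density f below is a hypothetical example, and the arc integrals are approximated by Riemann sums):

```python
import math
import numpy as np

def T_n(f_vals, u, n):
    """T_n f = sum_j C_{j,n} * (integral of f over R_{j,n}), where R_{j,n} is the
    arc of half-width pi/(2n+1) around the node 2*pi*j/(2n+1)."""
    du = u[1] - u[0]
    const = 2 * np.pi * math.comb(2 * n, n) / 2 ** n  # makes C_{j,n} a density
    out = np.zeros_like(u)
    for j in range(2 * n + 1):
        center = 2 * np.pi * j / (2 * n + 1)
        # Circular distance from each grid point to the node.
        dist = np.abs((u - center + np.pi) % (2 * np.pi) - np.pi)
        mass = np.sum(f_vals[dist < np.pi / (2 * n + 1)]) * du  # integral over R_{j,n}
        out += mass * (1 + np.cos(u - center)) ** n / const
    return out

u = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
f = (1 + np.cos(u)) / (2 * np.pi)  # a smooth density on the circle
approx = T_n(f, u, n=8)
print(np.max(np.abs(approx - f)))  # small sup-norm error, shrinking with n
```

Since the bin masses sum to 1 and each C_{j,n} integrates to 1, the output T_n f is again a density, as required for T_n to map \mathbb{F} to itself.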

By specifying priors \Pi_n on T_n(\mathbb{F}) and mixing them together, we obtain the prior

\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right)

on \mathbb{F}.

This reduces the problem of specifying a prior on an infinite dimensional space to specifying an infinite number of priors on finite dimensional spaces. This turns out to be easier, and from properties of T_n we obtain the asymptotic correctness of Bayesian learning.
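A draw from such a prior can be sketched as follows: sample a resolution n from \rho, then sample a random element of T_n(\mathbb{F}), e.g. via random weights on the basis (the geometric \rho and the Dirichlet weights below are illustrative assumptions, not from the talk):

```python
import math
import numpy as np

def sample_prior_density(u, p=0.3):
    """One draw from Pi: pick a resolution n ~ rho (geometric here), then
    Dirichlet weights w on the basis; f = sum_j w_j C_{j,n} is a random density."""
    n = np.random.geometric(p)                    # resolution level n ~ rho
    w = np.random.dirichlet(np.ones(2 * n + 1))   # random point of the simplex
    const = 2 * np.pi * math.comb(2 * n, n) / 2 ** n
    f = np.zeros_like(u)
    for j, wj in enumerate(w):
        f += wj * (1 + np.cos(u - 2 * np.pi * j / (2 * n + 1))) ** n / const
    return f

np.random.seed(1)
u = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
f = sample_prior_density(u)
print(f.min() >= 0, abs(f.mean() * 2 * np.pi - 1) < 1e-3)  # True True: a density
```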

Let T_n : L^1 \rightarrow L^1 be any linear operator mapping \mathbb{F} to itself, with \dim T_n(\mathbb{F}) increasing. If \|T_n f - f\|_\infty \rightarrow 0 for all continuous f, then taking \rho such that 0 < \rho(n) < e^{-\dim T_n(\mathbb{F})} and \Pi_n > 0 ensures that Bayesian learning based on \Pi is asymptotically correct.

What is interesting here:

  • The approach scales easily to (almost) any set of bounded densities on a compact metric space.
  • The approximation theory literature provides a wealth of well-studied approximation operators that can be used to construct such priors.
  • Properties of the operators translate into properties of the priors. If, for instance, it is known that the true density of P_0 is unimodal, then using a unimodality-preserving operator yields a prior on the space of unimodal densities.

3.2 Simple application

We use a prior of the type considered above to estimate the density of P_0, also quantifying uncertainty about our estimate. Below, the data-generating distribution P_0 is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded region quantifies uncertainty (50% and 90% credible regions).


4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you must already know, different fields of mathematics enrich each other through their interactions. Here, it is approximation theory that provides the tools ensuring that Bayesian learning works correctly.

Thank you for your attention!

