Bayesian learning

Friday, July 28 at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical Congress of the Americas and at the Canadian Undergraduate Mathematics Conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some tough problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and asymptotic correctness rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.

1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

1.1 The classical problem

Some stochastic mechanism generates data x_1, x_2, x_3, \dots, each x_i being an independent sample from an unknown distribution P_0. We want to use the data to learn about P_0.


For example, a cryptographer observes words drawn at random from an unknown countable language. Learning could be about:

  • What’s the distribution of the observed words?
  • What’s the probability that the next word to be observed has never been seen before?

1.2 Bayesian Learning

Uncertain knowledge about what P_0 may be is encoded through a prior probability measure \Pi on the space of all possibilities for P_0. Thus, if A is some set of probability distributions, then \Pi(A) is the probability, representing the best of our knowledge, that “P_0 \in A”. Then, we observe the data \mathcal{D} = (x_1, x_2, \dots, x_n). This yields the posterior measure \Pi(\cdot | \mathcal{D}), obtained by adjusting \Pi in light of \mathcal{D} through probabilistic conditioning.

Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability p_0 that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what p_0 may be, could look like this:


It is most likely that the coin is approximately unbiased (p_0 \approx 1/2), but not impossible that it is strongly biased.

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.


It is most likely that the coin is biased and that p_0 \approx 1/4.
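The update can be reproduced in a few lines. This is a minimal sketch, assuming a Beta(2, 2) prior: the prior actually used for the talk is not specified here, so the parameters are a hypothetical choice that merely matches the description "most likely unbiased, possibly strongly biased". With a Beta prior, conditioning on coin tosses only adds the observed counts to the parameters.

```python
import math

def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def update(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior becomes Beta(a + heads, b + tails)."""
    return a + heads, b + tails

def beta_mode(a, b):
    """Mode of Beta(a, b), valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

# Hypothetical prior, symmetric around 1/2 (an assumption, not the talk's prior).
prior = (2.0, 2.0)
# 20 tosses, 5 heads, 15 tails.
posterior = update(*prior, heads=5, tails=15)

print("prior mode:", beta_mode(*prior))
print("posterior mode:", beta_mode(*posterior))
print("posterior density at 1/4 and 1/2:",
      beta_pdf(0.25, *posterior), beta_pdf(0.5, *posterior))
```

Under this assumed prior the posterior mode moves from 1/2 to a value close to 1/4, in line with the description above.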

Certainly the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors:


2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word \omega has some probability P_0(\omega) of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always, and not necessarily in a way that makes Bayesian learning meaningful. Diaconis and Freedman showed the following in 1986, in the context of Bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimension. To approach this problem, we need to identify characteristics of prior distributions that

  1. describe how uncertainty is spread out; and
  2. ensure that Bayesian learning works correctly.

2.1 A positive result

An important positive result is based on the two following conditions / characteristics of prior distributions. We denote by \mathbb{F} the set of all probability densities on some common space, assuming that P_0 \in \mathbb{F} (the data-generating distribution has some density in \mathbb{F}).

  1. \Pi puts positive mass on the relative entropy neighborhoods of P_0: for all \varepsilon > 0,

\Pi\left(\left\{P \in \mathbb{F}| \int \log \frac{dP_0}{dP} dP_0 < \varepsilon \right\}\right) > 0.

This means that a priori, we’re not excluding the truth from the set of possibilities. Since P_0 is unknown, we require that this condition be satisfied whatever P_0 may be.

  2. \Pi is of finite entropy: for all \delta > 0, there exist 0 < \alpha < 1 and a covering \{A_i\} of \mathbb{F} by sets of L^1 diameter less than \delta such that

\sum_{i} \Pi(A_i)^\alpha < \infty.

This means that \Pi is not too complex and that we can make sense of it through discretization.

Under these two conditions, Bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004)
If conditions (1) and (2) are satisfied, then for any L^1 neighborhood N of P_0, we have

\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1

almost surely as n \rightarrow \infty, when the x_i are independent samples from P_0.
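The theorem is about infinite-dimensional priors, but the kind of concentration it describes is easy to see in the one-parameter coin example. The sketch below is only an illustration under assumptions of my own: a hypothetical Beta(2, 2) prior, a true value p_0 = 1/4, and idealized data in which exactly a quarter of the tosses are heads. It computes the posterior mass of a small neighborhood of p_0 as the number of tosses grows.

```python
import math

def posterior_mass(heads, tails, lo, hi, a=2.0, b=2.0, grid=20000):
    """Posterior probability that p lies in (lo, hi) under a Beta(a, b) prior.

    The posterior is Beta(a + heads, b + tails); it is normalised numerically
    on a midpoint grid, working in log space to avoid underflow.
    """
    pts = [(i + 0.5) / grid for i in range(grid)]
    logs = [(a + heads - 1) * math.log(p) + (b + tails - 1) * math.log(1 - p)
            for p in pts]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    inside = sum(w for p, w in zip(pts, weights) if lo < p < hi)
    return inside / sum(weights)

p0 = 0.25  # assumed true probability of heads
masses = {}
for n in (20, 200, 2000):
    heads = round(p0 * n)  # idealised data: exactly a quarter of the tosses are heads
    masses[n] = posterior_mass(heads, n - heads, p0 - 0.05, p0 + 0.05)
    print(n, masses[n])
```

The mass assigned to the neighborhood increases towards 1 with n. The theorem extends this behaviour to priors on infinite-dimensional spaces of densities.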

This is helpful, but has not yet solved our problem.

  • How do we, generally, construct priors satisfying the two conditions?
  • How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

3. Some of our research

There are two contributions I want to present.

  • We developed a relationship between some statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
  • We use it to solve problems raised in the literature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.


We want to learn about the data-generating mechanism, e.g. do

  • density estimation,
  • hypothesis tests,
  • classification,

or any other statistical procedure. First, we specify a prior on the set \mathbb{F} of all bounded densities on the circle.

3.1 Prior specification

We begin with a density basis of the trigonometric polynomials introduced in 2009:

C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.


We studied the statistical properties of this basis, and used it to construct approximation operators

T_n : \mathbb{F} \ni f \mapsto \sum_j  C_{j,n} \int_{R_{j,n}}f(x) dx,

where R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right). It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.
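To make the operators T_n concrete, here is a minimal numerical sketch. It assumes that C_{j,n} is normalised to integrate to one over the circle, uses the standard identity \int_0^{2\pi}(1+\cos u)^n du = 2\pi \binom{2n}{n} / 2^n for the normalising constant, and computes the masses on the arcs R_{j,n} with a midpoint rule. The test density f and all function names are hypothetical choices for illustration.

```python
import math

def basis_density(j, n, u):
    """C_{j,n}(u), normalised to integrate to 1 over [0, 2*pi)."""
    # Identity used: integral of (1 + cos u)^n over a period = 2*pi*comb(2n, n) / 2^n.
    norm = 2 * math.pi * math.comb(2 * n, n) / 2 ** n
    return (1 + math.cos(u - 2 * math.pi * j / (2 * n + 1))) ** n / norm

def bin_masses(f, n, steps=200):
    """Integrals of f over the arcs R_{j,n}, by the midpoint rule (f is 2*pi-periodic)."""
    h = 2 * math.pi / (2 * n + 1) / steps
    masses = []
    for j in range(2 * n + 1):
        left = (2 * j - 1) * math.pi / (2 * n + 1)
        masses.append(h * sum(f(left + (k + 0.5) * h) for k in range(steps)))
    return masses

def apply_T(f, n):
    """Return the function T_n f = sum_j C_{j,n} * (mass of f on R_{j,n})."""
    w = bin_masses(f, n)
    return lambda u: sum(w[j] * basis_density(j, n, u) for j in range(2 * n + 1))

def sup_error(f, n, grid=400):
    """Largest gap between T_n f and f on an evenly spaced grid of the circle."""
    g = apply_T(f, n)
    pts = [2 * math.pi * i / grid for i in range(grid)]
    return max(abs(g(u) - f(u)) for u in pts)

# Hypothetical test density on the circle.
f = lambda u: (1 + 0.5 * math.cos(u)) / (2 * math.pi)
for n in (5, 20, 40):
    print(n, sup_error(f, n))
```

For this test density the sup-norm error shrinks as n grows, and the weights returned by bin_masses sum to one, so T_n f is again a density.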

By specifying priors \Pi_n on T_n(\mathbb{F}) and mixing them together, we obtain the prior

\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right)

on \mathbb{F}.
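Drawing a random density from such a mixture is straightforward: pick n according to \rho, then pick an element of T_n(\mathbb{F}), that is, a mixture \sum_j w_j C_{j,n} with nonnegative weights summing to one. The sketch below makes two illustrative assumptions that are mine and not part of the construction above: \rho is taken proportional to e^{-n/4} on a truncated range of n, and \Pi_n is a flat Dirichlet distribution on the weights.

```python
import math
import random

def basis_density(j, n, u):
    """C_{j,n}(u), normalised to integrate to 1 over [0, 2*pi)."""
    # Identity used: integral of (1 + cos u)^n over a period = 2*pi*comb(2n, n) / 2^n.
    norm = 2 * math.pi * math.comb(2 * n, n) / 2 ** n
    return (1 + math.cos(u - 2 * math.pi * j / (2 * n + 1))) ** n / norm

def sample_density(rng, max_n=15):
    """Draw (n, weights, density) from a mixture prior sum_n rho(n) Pi_n."""
    # rho: hypothetical weights proportional to exp(-n / 4) on n = 1, ..., max_n.
    ns = list(range(1, max_n + 1))
    rho = [math.exp(-n / 4) for n in ns]
    n = rng.choices(ns, weights=rho)[0]
    # Pi_n: hypothetical flat Dirichlet on the 2n + 1 weights,
    # sampled by normalising independent Gamma(1, 1) draws.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(2 * n + 1)]
    total = sum(gammas)
    w = [g / total for g in gammas]
    density = lambda u: sum(w[j] * basis_density(j, n, u) for j in range(2 * n + 1))
    return n, w, density

rng = random.Random(0)  # fixed seed so the draw is reproducible
n, w, density = sample_density(rng)
print("sampled n:", n)
print("density at 0, pi/2, pi:", density(0.0), density(math.pi / 2), density(math.pi))
```

Each draw is a nonnegative function that integrates to one, so the prior does live on densities on the circle.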

This reduces the problem of specifying a prior on an infinite dimensional space to specifying an infinite number of priors on finite dimensional spaces. This turns out to be easier, and from properties of T_n we obtain asymptotic correctness of Bayesian learning.

Let T_n : L^1 \rightarrow L^1 be any linear operator mapping \mathbb{F} to itself, with \dim T_n (\mathbb{F}) increasing. If \|T_n f - f\|_\infty \rightarrow 0 for all continuous f, then taking \rho such that 0 < \rho(n) <  e^{-\dim T_n(\mathbb{F})} and \Pi_n > 0 ensures that Bayesian learning based on \Pi is asymptotically correct.

What is interesting here:

  • The approach scales easily to (almost) any set of bounded densities on a compact metric space.
  • The approximation theory literature provides a wealth of well-studied approximation operators that can be used to construct such priors.
  • Properties of the operators relate to properties of the priors. If, for instance, it is known that the true density of P_0 is unimodal, then using a unimodality-preserving operator yields a prior on the space of unimodal densities.

3.2 Simple application

We use a prior of the type considered above to estimate the density of P_0, also quantifying uncertainty about our estimate. Below, the data-generating distribution P_0 is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded region quantifies uncertainty (50% and 90% credible regions).


4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you must already know, different fields of mathematics enrich each other in their interactions. Here, it is approximation theory that provides tools that ensure Bayesian learning works correctly.

Thank you for your attention!




A 20-minute presentation at the séminaire du 5e.

I present the Weierstrass approximation theorem for periodic functions, using a basis of the trigonometric polynomials recently suggested by Róth et al. (2009). This basis lends itself naturally to our application.

Weierstrass approximation theorem.
Let f : \mathbb{R} \rightarrow \mathbb{R} be a 2\pi-periodic function. If f is continuous, then one can construct trigonometric polynomials f_1, f_2, f_3, \dots such that

f(x) = \sum_{i=1}^{\infty} f_i(x)

and such that the series above converges uniformly.

This theorem comes up in several areas: in topology to prove Brouwer's fixed-point theorem, in geometry for the isoperimetric inequality, and in algebraic geometry for the Nash-Tognoli theorem. It implies that \{1, \cos(x), \sin(x), \cos(2x), \sin(2x), \dots\}, as an orthogonal system, is complete in L^2(\mathbb{S}^1). More generally, it is used to reduce a problem about continuous functions to a problem about polynomials, where differential calculus and linear algebra apply. Constructive proofs of the theorem also provide tools for regression and for the reconstruction of curves and surfaces.

Basic notions

A trigonometric polynomial (of degree m) is a function of x of the form

a_0 + \sum_{n=1}^m \left\{a_n \cos(nx) + b_n \sin(nx)\right\}.

Note that sums and products of trigonometric polynomials are again trigonometric polynomials. Somewhat less obviously, they form a Chebyshev system: for any set \{x_i\}_{i=1}^{2m+1} of distinct points in [0, 2\pi) and any \{y_i\}_{i=1}^{2m+1}\subset \mathbb{R}, there exists a unique trigonometric polynomial P_m of degree m such that

P_m(x_i)=y_i, \quad \forall i \in \{1,2,\dots, 2m+1\}.

Continuous periodic functions on \mathbb{R} can be identified with continuous functions on the circle

\mathbb{S}^1 = \mathbb{R}/2\pi\mathbb{Z}

equipped with the arc-length distance

d(u,v) = \min_{k \in \mathbb{Z}} |u-v+2\pi k|.

This follows from a general principle: continuous functions on \mathbb{R}^k that are invariant under the action of a group G whose orbits have no accumulation points can be identified with continuous functions on the quotient \mathbb{R}^k / G, equipped with the distance d([u],[v]) = \min_{g, h\in G} \|g(u) - h(v)\|. (Here, [u] is the orbit of u under the action of G.) With such quotients one can obtain surfaces such as the torus, the Möbius strip and the Klein bottle.

Proof of the theorem

Consider the nonnegative function

C_{0,n}(u) = c_n\left(1+\cos(u)\right)^n,\quad c_n = \frac{2\pi}{2n+1}\left(\int_{0}^{2\pi} (1+\cos u)^n du\right)^{-1}

and its translates

C_{j,n}(u) = C_{0,n}\left(u - \tfrac{2\pi j}{2n+1} \right),\quad j=0,1, \dots, 2n

which are evenly spaced around the circle.

The C_{j,n} form a partition of unity.

Note that for every i \in \{0,1, \dots, 2n\}, we have

\sum_{j=0}^{2n} C_{j,n}\left(\tfrac{2\pi i}{2n+1}\right) = \sum_{j=0}^{2n} C_{j,n}(0)

by the cyclic symmetry of the C_{j,n}. Thus the function \sum_{j=0}^{2n}C_{j,n} takes the same value at 2n+1 distinct points. Since it is a trigonometric polynomial of degree n and these form a Chebyshev system, it must be constant. Moreover,

\int_{0}^{2\pi} \sum_{j=0}^{2n} C_{j,n}(u) du = \sum_{j=0}^{2n}  \frac{2\pi\int_0^{2\pi} \left(1+\cos\left(u - \tfrac{2\pi j}{2n+1}\right)\right)^n du}{(2n+1)\int_0^{2\pi}\left(1+\cos u\right)^n du} = 2\pi,

so that

\sum_{j=0}^{2n} C_{j,n} \equiv 1.
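The partition of unity can also be checked numerically. The sketch below is an illustration only: it evaluates c_n in closed form using the standard identity \int_0^{2\pi}(1+\cos u)^n du = 2\pi \binom{2n}{n} / 2^n, which gives c_n = 2^n / ((2n+1)\binom{2n}{n}), and sums the C_{j,n} at a few arbitrary points of the circle.

```python
import math

def C(j, n, u):
    """C_{j,n}(u) = c_n * (1 + cos(u - 2*pi*j/(2n+1)))^n, with c_n as defined above."""
    # Closed form of c_n, from: integral of (1 + cos u)^n over a period = 2*pi*comb(2n, n) / 2^n.
    c_n = 2 ** n / ((2 * n + 1) * math.comb(2 * n, n))
    return c_n * (1 + math.cos(u - 2 * math.pi * j / (2 * n + 1))) ** n

def total(n, u):
    """Sum of the 2n + 1 basis functions at the point u."""
    return sum(C(j, n, u) for j in range(2 * n + 1))

for n in (1, 5, 20):
    print(n, [total(n, u) for u in (0.0, 0.7, 2.0, 5.5)])
```

Up to floating-point rounding, every printed value equals 1, whatever the point u.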

Construction of the approximants

Let

T_n(u) = \sum_{j=0}^{2n} f\left(\tfrac{2\pi j}{2n+1}\right) C_{j,n}(u), \quad T_0 = 0,

and let us show that T_n converges uniformly to f as n \rightarrow \infty. Taking f_n = T_n-T_{n-1} then proves the theorem.

Using the fact that \sum_{j=0}^{2n} C_{j,n} \equiv 1, we compute

|T_n (u) - f(u)| \le \sum_{j=0}^{2n} \left| f\left(\tfrac{2\pi j}{2n+1}\right) - f(u) \right| C_{j,n}(u) = (*).

Now fix \varepsilon > 0 and choose \delta > 0 such that d(u,v) < \delta \Rightarrow |f(u) - f(v)| < \varepsilon; such a \delta exists because f is uniformly continuous on the compact circle. With A = \{j \,|\, d(u, \tfrac{2\pi j}{2n+1}) < \delta\}, the sum above satisfies

(*) \le \varepsilon \sum_{j \in A} C_{j,n}(u) + 2 \sup_x |f(x)| \sum_{j \in A^c} C_{j,n}(u) \le \varepsilon + 2 \|f\|_\infty \sum_{j \in A^c} C_{j,n}(u).

Moreover, since C_{0,n}(v) decreases as d(v, 0) increases,

\sum_{j \in A^c} C_{j,n}(u) \le (2n+1) C_{0,n}(\delta) \rightarrow 0,

so that for n sufficiently large we obtain

|T_n (u) - f(u)| < 2\varepsilon.

Since \delta and n do not depend on u, and since \varepsilon > 0 was arbitrary, T_n converges uniformly to f. QED.
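The convergence can be observed numerically. The following sketch builds T_n from samples of f at the nodes 2\pi j/(2n+1) and measures the largest error on a grid. The test function |\sin u| is a hypothetical choice (continuous and 2\pi-periodic, but not smooth), and c_n is again evaluated through the closed form c_n = 2^n / ((2n+1)\binom{2n}{n}), which is an identity I am assuming rather than one derived above.

```python
import math

def C(j, n, u):
    """C_{j,n}(u) = c_n * (1 + cos(u - 2*pi*j/(2n+1)))^n."""
    # Closed form of c_n, from: integral of (1 + cos u)^n over a period = 2*pi*comb(2n, n) / 2^n.
    c_n = 2 ** n / ((2 * n + 1) * math.comb(2 * n, n))
    return c_n * (1 + math.cos(u - 2 * math.pi * j / (2 * n + 1))) ** n

def T(f, n):
    """Return the approximant T_n(u) = sum_j f(2*pi*j/(2n+1)) * C_{j,n}(u)."""
    values = [f(2 * math.pi * j / (2 * n + 1)) for j in range(2 * n + 1)]
    return lambda u: sum(values[j] * C(j, n, u) for j in range(2 * n + 1))

def sup_error(f, n, grid=500):
    """Largest gap between T_n and f on an evenly spaced grid of the circle."""
    g = T(f, n)
    pts = [2 * math.pi * i / grid for i in range(grid)]
    return max(abs(g(u) - f(u)) for u in pts)

# Hypothetical continuous, 2*pi-periodic test function with kinks at 0 and pi.
f = lambda u: abs(math.sin(u))
for n in (5, 20, 50):
    print(n, sup_error(f, n))
```

The error decreases as n grows, but slowly near the kinks of f: the theorem guarantees uniform convergence and says nothing about its rate.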


[1] Róth, Á. et al. (2009). A cyclic basis for closed curve and surface modelling. Computer Aided Geometric Design, 26, 528-546.

[2] Bernstein, S. (1912). Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités.