# Bayesian learning

Friday, July 28, at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical Congress of the Americas and at the Canadian Undergraduate Mathematics Conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some tough problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and asymptotic correctness rather than consistency. These terms are essentially synonymous, although some authors use them in slightly different ways; this should not cause problems in an introductory talk.

# 1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

## 1.1 The classical problem

Some stochastic mechanism generates data $x_1, x_2, x_3, \dots$, each $x_i$ being an independent sample from an unknown distribution $P_0$. We want to use the data to learn about $P_0$.

### Example

A cryptographer observes words drawn at random from an unknown countable language. Learning could be about:

• What’s the distribution of the observed words?
• What’s the probability that the next word to be observed has never been seen before?
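Both questions can be made concrete with a toy simulation (my own illustration, not from the talk; the geometric word distribution and all names below are assumptions made purely to have something computable):

```python
import math, random

# A toy cryptographer's problem: words drawn i.i.d. from an assumed
# geometric distribution over a countably infinite vocabulary.
random.seed(1)
q = 0.8  # P0 assigns probability (1 - q) * q**k to word number k

def draw_word():
    u = 1.0 - random.random()          # u in (0, 1]
    return int(math.log(u) / math.log(q))  # inverse-CDF sample

n = 1000
sample = [draw_word() for _ in range(n)]
seen = set(sample)

# true probability that the next word is new (tail beyond k = 200 is negligible)
p_new = sum((1 - q) * q ** k for k in range(200) if k not in seen)

# the classical Good-Turing estimate: (number of words seen exactly once) / n
n1 = sum(1 for w in seen if sample.count(w) == 1)
print(p_new, n1 / n)
```

Here the simulation knows $P_0$, so the true probability of an unseen word can be compared with an estimate computed from the data alone.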

## 1.2 Bayesian Learning

Uncertain knowledge about what $P_0$ may be is encoded through a prior probability measure $\Pi$ on the space of all possibilities for $P_0$. Thus, if $A$ is some set of probability distributions, then $\Pi(A)$ is the probability, to the best of our knowledge, that “$P_0 \in A$“. Then we observe the data $\mathcal{D} = (x_1, x_2, \dots, x_n)$. This yields the posterior measure $\Pi(\cdot | \mathcal{D})$, obtained by adjusting $\Pi$ in light of $\mathcal{D}$ through probabilistic conditioning.

### Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability $p_0$ that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what $p_0$ may be, could look like this:

It is most likely that the coin is approximately unbiased ($p_0 \approx 1/2$), but not impossible that it is strongly biased.

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.

It is most likely that the coin is biased and that $p_0 \approx 1/4$.

Certainly the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors:
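The prior and posterior figures are not reproduced in these notes, but the update itself is easy to redo numerically. The sketch below assumes (my choice, for illustration only) a Beta(2, 2) prior peaked at $1/2$; conjugacy gives the exact posterior to compare against:

```python
# Coin-tossing update on a grid: 20 flips, 5 heads, assumed Beta(2, 2) prior.
heads, tails = 5, 15
a, b = 2.0, 2.0  # hypothetical Beta prior parameters

N = 10_001
grid = [i / (N - 1) for i in range(N)]
# unnormalized posterior = prior density (up to a constant) * binomial likelihood
w = [p ** (a - 1) * (1 - p) ** (b - 1) * p ** heads * (1 - p) ** tails for p in grid]
Z = sum(w)

post_mean = sum(p * wi for p, wi in zip(grid, w)) / Z
# conjugacy: the exact posterior is Beta(a + 5, b + 15), whose mean is
exact_mean = (a + heads) / (a + b + heads + tails)
print(post_mean, exact_mean)
```

With this prior the posterior mean lands near $0.29$, between the prior guess $1/2$ and the empirical frequency $1/4$, which is the qualitative behavior described above.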

# 2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word $\omega$ has some probability $P_0(\omega)$ of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always, and not necessarily in a way that makes Bayesian learning meaningful. In 1986, Diaconis and Freedman showed the following in the context of Bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimensions. To approach this problem, we need to identify characteristics of prior distributions that

1. describe how uncertainty is spread out; and
2. ensure that Bayesian learning works correctly.

## 2.1 A positive result

An important positive result is based on the following two conditions, or characteristics, of prior distributions. We denote by $\mathbb{F}$ the set of all probability densities on some common space, and assume that $P_0 \in \mathbb{F}$ (the data-generating distribution has a density in $\mathbb{F}$).

1. $\Pi$ puts positive mass on the relative entropy neighborhoods of $P_0$: for every $\varepsilon > 0$,

$\Pi\left(\left\{P \in \mathbb{F}\,\middle|\, \int \log \frac{dP_0}{dP}\, dP_0 < \varepsilon \right\}\right) > 0.$

This means that a priori, we’re not excluding the truth from the set of possibilities. Since $P_0$ is unknown, we require that this condition be satisfied whatever $P_0$ may be.

2. $\Pi$ is of finite entropy: for all $\delta > 0$, there exist $\alpha \in (0,1)$ and a covering $\{A_i\}$ of $\mathbb{F}$ by sets of $L^1$ diameter less than $\delta$ such that

$\sum_{i} \Pi(A_i)^\alpha < \infty.$

This means that $\Pi$ is not too complex and that we can make sense of it through discretization.

Under these two conditions, Bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004).
If conditions (1) and (2) are satisfied, then for any $L^1$ neighborhood $N$ of $P_0$ we have

$\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1$

almost surely as $n \rightarrow \infty$, where the $x_i$ are independent samples from $P_0$.
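To see what posterior concentration looks like, here is a toy instance (far simpler than Walker’s infinite-dimensional setting, and entirely my own sketch): three candidate coin biases, a uniform prior, and data generated from the middle one.

```python
import math, random

# Posterior concentration in a three-point model: mass piles up on the truth.
random.seed(0)
candidates = [0.3, 0.5, 0.7]
true_p = 0.5

log_post = [math.log(1.0 / len(candidates))] * len(candidates)
for _ in range(2000):
    x = 1 if random.random() < true_p else 0
    for i, p in enumerate(candidates):
        log_post[i] += math.log(p if x == 1 else 1 - p)

# normalize in log space for numerical stability
m = max(log_post)
weights = [math.exp(v - m) for v in log_post]
total = sum(weights)
post = [w / total for w in weights]
print(post)  # essentially all mass on the true value 0.5
```

After 2000 observations the posterior probability of the wrong candidates is astronomically small; the whole difficulty of Bayesian nonparametrics is that this behavior can fail when the model is infinite-dimensional.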

This is helpful, but has not yet solved our problem.

• How do we, generally, construct priors satisfying the two conditions?
• How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

# 3. Some of our research

There are two contributions I want to present.

• We developed a relationship between certain statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
• We used it to solve problems raised in the literature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.

We want to learn about the data-generating mechanism, e.g., perform

• density estimation,
• hypothesis tests,
• classification,

or any other statistical procedure. First, we specify a prior on the set $\mathbb{F}$ of all bounded densities on the circle.

## 3.1 Prior specification

We begin with a density basis of the trigonometric polynomials, introduced by Róth et al. (2009):

$C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.$

We studied the statistical properties of this basis and used it to construct the approximation operators

$T_n : \mathbb{F} \ni f \mapsto \sum_j C_{j,n} \int_{R_{j,n}}f(x) dx,$

where $R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right)$. It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

$\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.$
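The operator $T_n$ is simple enough to implement directly. The sketch below (my own illustration, not code from the talk) normalizes each $C_{j,n}$ to integrate to 1, so that $T_n$ maps densities to densities, and checks the approximation on a smooth test density:

```python
import math

# T_n f = sum_j C_{j,n} * (mass of f over R_{j,n}), with C_{j,n} normalized
# numerically so that each basis element integrates to 1 over [0, 2*pi).
def basis(n, m=2000):
    h = 2 * math.pi / m
    Z = sum((1 + math.cos(h * k)) ** n for k in range(m)) * h  # normalizer
    def C(j, u):
        return (1 + math.cos(u - 2 * math.pi * j / (2 * n + 1))) ** n / Z
    return C

def T(f, n):
    C = basis(n)
    masses = []
    for j in range(2 * n + 1):  # mass of f over R_{j,n}, midpoint quadrature
        lo = math.pi * (2 * j - 1) / (2 * n + 1)
        hi = math.pi * (2 * j + 1) / (2 * n + 1)
        h = (hi - lo) / 200
        masses.append(sum(f(lo + h * (k + 0.5)) for k in range(200)) * h)
    return lambda u: sum(masses[j] * C(j, u) for j in range(2 * n + 1))

# a smooth test density on the circle: von Mises with concentration 1
def f(u):
    return math.exp(math.cos(u)) / 7.954926521012846  # normalizer 2*pi*I_0(1)

Tf = T(f, 15)
us = [2 * math.pi * k / 100 for k in range(100)]
err = max(abs(Tf(u) - f(u)) for u in us)
integral = sum(Tf(u) for u in us) * (2 * math.pi / 100)
print(err, integral)
```

$T_{15} f$ is again a density (its integral is 1 up to quadrature error) and already tracks $f$ closely, illustrating the decomposition above.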

By specifying priors $\Pi_n$ on $T_n(\mathbb{F})$ and mixing them together, we obtain the prior

$\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right)$

on $\mathbb{F}$.

This reduces the problem of specifying a prior on an infinite-dimensional space to that of specifying an infinite number of priors on finite-dimensional spaces. This turns out to be easier, and from properties of $T_n$ we obtain the asymptotic correctness of Bayesian learning.
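To make the mixture construction concrete, here is a sketch (my own, with arbitrary choices where the construction leaves them open) of drawing one random density from such a prior: pick a degree $n$ with probability $\rho(n)$, then pick an element of $T_n(\mathbb{F})$. I take $\Pi_n$ to be a Dirichlet distribution on the basis weights, which is one simple way to have $\Pi_n > 0$.

```python
import math, random

random.seed(3)

def sample_degree(max_n=30):
    # rho(n) proportional to exp(-(2n+1)), i.e. below exp(-dim T_n(F))
    weights = [math.exp(-(2 * n + 1)) for n in range(1, max_n + 1)]
    r = random.random() * sum(weights)
    for n, w in enumerate(weights, start=1):
        r -= w
        if r <= 0:
            return n
    return max_n

def sample_density():
    n = sample_degree()
    g = [random.expovariate(1.0) for _ in range(2 * n + 1)]
    s = sum(g)
    w = [x / s for x in g]  # Dirichlet(1, ..., 1) basis weights
    m = 2000
    h = 2 * math.pi / m
    Z = sum((1 + math.cos(h * k)) ** n for k in range(m)) * h  # basis normalizer

    def density(u):
        return sum(
            wj * (1 + math.cos(u - 2 * math.pi * j / (2 * n + 1))) ** n / Z
            for j, wj in enumerate(w)
        )
    return density

f = sample_density()
m = 1000
integral = sum(f(2 * math.pi * k / m) for k in range(m)) * (2 * math.pi / m)
print(integral)  # a bona fide density: integrates to 1
```

Each draw lives in some finite-dimensional slice $T_n(\mathbb{F})$, yet the mixture over all $n$ charges a dense subset of $\mathbb{F}$.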

Theorem.
Let $T_n : L^1 \rightarrow L^1$ be any linear function mapping $\mathbb{F}$ to itself, with $\dim T_n (\mathbb{F})$ increasing. If $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$, then taking $\rho$ such that $0 < \rho(n) < e^{-\dim T_n(\mathbb{F})}$ and $\Pi_n > 0$ ensures that Bayesian learning based on $\Pi$ is asymptotically correct.

What is interesting here:

• The approach scales easily to (almost) any set of bounded densities on a compact metric space.
• The approximation theory literature provides a wealth of well-studied approximation operators that can be used to construct such priors.
• Properties of the operators translate into properties of the priors. If, for instance, it is known that the true density of $P_0$ is unimodal, then using a unimodality-preserving operator yields a prior on the space of unimodal densities.

## 3.2 Simple application

We use a prior of the type considered above to estimate the density of $P_0$, while also quantifying the uncertainty of our estimate. Below, the data-generating density of $P_0$ is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded regions quantify uncertainty (50% and 90% credible regions).
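The figure itself is not reproduced in these notes, but the mechanics of reading off such credible bands from posterior draws can be sketched generically (the fake “posterior draws” and all names below are mine, not the talk’s):

```python
import math, random

# Pointwise credible bands: each posterior draw is a density evaluated on a
# grid; the bands are pointwise quantiles across draws.
random.seed(7)
grid_size, n_draws = 50, 200

draws = []
for _ in range(n_draws):
    shift = random.gauss(0.0, 0.1)  # stand-in for real posterior variability
    draws.append([1.0 + shift * math.sin(2 * math.pi * k / grid_size)
                  for k in range(grid_size)])

def pointwise_band(draws, lo, hi):
    n = len(draws)
    band = []
    for k in range(len(draws[0])):
        col = sorted(d[k] for d in draws)
        band.append((col[int(lo * (n - 1))], col[int(hi * (n - 1))]))
    return band

band90 = pointwise_band(draws, 0.05, 0.95)  # 90% band
band50 = pointwise_band(draws, 0.25, 0.75)  # 50% band
print(band90[10], band50[10])
```

By construction the 50% band sits inside the 90% band at every grid point, which is exactly the nested shaded regions described above.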

# 4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you must already know, different fields of mathematics enrich each other through their interactions. Here, it is approximation theory that provides the tools ensuring that Bayesian learning works correctly.


# Approximation

A 20-minute presentation at the fifth-floor seminar.

I present the Weierstrass approximation theorem for periodic functions, using a basis of trigonometric polynomials recently suggested by Róth et al. (2009). This basis lends itself naturally to our application.

Weierstrass approximation theorem.
Let $f : \mathbb{R} \rightarrow \mathbb{R}$ be a $2\pi$-periodic function. If $f$ is continuous, then we can construct trigonometric polynomials $f_1, f_2, f_3, \dots$ such that

$f(x) = \sum_{i=1}^{\infty} f_i(x)$

and such that the convergence of the series above is uniform.

This theorem appears in several areas: in topology to prove Brouwer’s fixed point theorem, in geometry for the isoperimetric inequality, and in algebraic geometry for the Nash-Tognoli theorem. It implies that $\{1, \cos(x), \sin(x), \cos(2x), \sin(2x), \dots\}$, as an orthonormal system, is complete in $L^2(\mathbb{S}^1)$. More generally, it is used to reduce a problem about continuous functions to a problem about polynomials, where differential calculus and linear algebra apply. Constructive proofs of the theorem moreover provide tools for regression and for the reconstruction of curves and surfaces.

## Basic notions

A trigonometric polynomial (of degree $m$) is a function of $x$ of the form

$a_0 + \sum_{n=1}^m \left\{a_n \cos(nx) + b_n \sin(nx)\right\}.$

Note that sums and products of trigonometric polynomials are again such polynomials. Somewhat less obviously, they form a Chebyshev system: for every set $\{x_i\}_{i=1}^{2m+1}$ of distinct points and every $\{y_i\}_{i=1}^{2m+1}\subset \mathbb{R}$, there exists a unique trigonometric polynomial $P_m$ of degree $m$ such that

$P_m(x_i)=y_i, \quad \forall i \in \{1,2,\dots, 2m+1\}.$
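A quick numerical check of this interpolation property (my own sketch, not part of the seminar): for $m = 2$, assemble the $5 \times 5$ linear system in the basis $1, \cos(nx), \sin(nx)$ and solve it with plain Gaussian elimination. The Chebyshev property guarantees the system is nonsingular.

```python
import math

m = 2
xs = [0.3, 1.1, 2.0, 3.7, 5.0]   # 2m+1 distinct points in [0, 2*pi)
ys = [1.0, -0.5, 2.0, 0.0, 1.5]  # arbitrary target values

def features(x):
    # basis 1, cos(nx), sin(nx) for n = 1..m
    row = [1.0]
    for n in range(1, m + 1):
        row += [math.cos(n * x), math.sin(n * x)]
    return row

# assemble the augmented system and solve by Gauss-Jordan with pivoting
A = [features(x) + [y] for x, y in zip(xs, ys)]
k = 2 * m + 1
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(k):
        if r != col:
            factor = A[r][col] / A[col][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
c = [A[r][k] / A[r][r] for r in range(k)]

def P(x):
    return sum(ci * fi for ci, fi in zip(c, features(x)))

print([round(P(x), 6) for x in xs])  # reproduces ys
```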

Continuous periodic functions on $\mathbb{R}$ can be identified with continuous functions on the circle

$\mathbb{S}^1 = \mathbb{R}/2\pi\mathbb{Z}$

equipped with the arc-length distance

$d(u,v) = \min_{k \in \mathbb{Z}} |u-v+2\pi k|.$

This follows from a general principle: continuous functions on $\mathbb{R}^k$ that are invariant under the action of a group $G$ whose orbits have no accumulation points can be identified with continuous functions on the quotient $\mathbb{R}^k / G$ equipped with the distance $d([u],[v]) = \min_{g, h\in G} \|g(u) - h(v)\|$. (Here, $[u]$ is the orbit of $u$ under the action of $G$.) These quotients yield surfaces such as the torus, the Möbius strip, and the Klein bottle.

## Proof of the theorem

Consider the positive function

$C_{0,n}(u) = c_n\left(1+\cos(u)\right)^n,\quad c_n = \frac{2\pi}{2n+1}\left(\int_{0}^{2\pi} (1+\cos u)^n du\right)^{-1}$

and its translates

$C_{j,n}(u) = C_{0,n}\left(u - \tfrac{2\pi j}{2n+1} \right),\quad j=0,1, \dots, 2n$

which are arranged regularly around the circle.

#### The $C_{j,n}$ form a partition of unity.

Note that for every $i \in \{0,1, \dots, 2n\}$, we have

$\sum_{j=0}^{2n} C_{j,n}\left(\tfrac{2\pi i}{2n+1}\right) = \sum_{j=0}^{2n} C_{j,n}(0)$

by the cyclic symmetry of the $C_{j,n}$. Thus the function $\sum_{j=0}^{2n}C_{j,n}$ takes the same value at $2n+1$ points. Since it is a trigonometric polynomial of degree $n$, and these form a Chebyshev system, it must be a constant. Now,

$\int_{0}^{2\pi} \sum_{j=0}^{2n} C_{j,n}(u) du = \sum_{j=0}^{2n} \frac{2\pi\int_0^{2\pi} \left(1+\cos\left(u - \tfrac{2\pi j}{2n+1}\right)\right)^n du}{(2n+1)\int_0^{2\pi}\left(1+\cos u\right)^n du} = 2\pi,$

hence

$\sum_{j=0}^{2n} C_{j,n} \equiv 1.$

#### Construction of the approximants

Set

$T_n(u) = \sum_{j=0}^{2n} f\left(\tfrac{2\pi j}{2n+1}\right) C_{j,n}(u), \quad T_0 = 0,$

and let us show that $T_n$ converges uniformly to $f$ as $n \rightarrow \infty$. Taking $f_n = T_n - T_{n-1}$ then proves the theorem.

Using the fact that $\sum_{j=0}^{2n} C_{j,n} \equiv 1$, we compute

$|T_n (u) - f(u)| \le \sum_{j=0}^{2n} \left| f\left(\tfrac{2\pi j}{2n+1}\right) - f(u) \right| C_{j,n}(u) = (*).$

Now fix $\varepsilon > 0$ and choose $\delta > 0$ such that $d(u,v) < \delta \Rightarrow |f(u) - f(v)| < \varepsilon$. With $A = \{j \,|\, d(u, \tfrac{2\pi j}{2n+1}) < \delta\}$, the sum above satisfies

$(*) \le \varepsilon \sum_{j \in A} C_{j,n}(u) + 2 \sup_x |f(x)| \sum_{j \in A^c} C_{j,n}(u) \le \varepsilon + 2 \|f\|_\infty \sum_{j \in A^c} C_{j,n}(u).$

Moreover,

$\sum_{j \in A^c} C_{j,n}(u) \le (2n+1)\, C_{0,n}(\delta) \rightarrow 0$

(for small $\delta$, this follows from $\int_0^{2\pi} (1+\cos u)^n\, du \ge \tfrac{\delta}{2}\left(1+\cos\tfrac{\delta}{2}\right)^n$, which gives $(2n+1)\, C_{0,n}(\delta) \le \tfrac{4\pi}{\delta}\left(\tfrac{1+\cos\delta}{1+\cos(\delta/2)}\right)^n \rightarrow 0$), so for $n$ sufficiently large we obtain

$|T_n (u) - f(u)| < 2\varepsilon.$

Since $\delta$ and the required $n$ do not depend on $u$, and since $\varepsilon > 0$ was arbitrary, $T_n$ converges uniformly to $f$. QED.
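The proof can be checked numerically (my own illustration, not part of the seminar): build $T_n$ for a continuous but non-smooth function and watch the sup error shrink as $n$ grows.

```python
import math

# make_T builds the approximant T_n for a given f, with the normalization
# c_n of C_{0,n} computed numerically (rectangle rule on a periodic grid).
def make_T(f, n, m=1000):
    h = 2 * math.pi / m
    Z = sum((1 + math.cos(h * k)) ** n for k in range(m)) * h
    c = 2 * math.pi / ((2 * n + 1) * Z)  # this is c_n from the definition
    nodes = [2 * math.pi * j / (2 * n + 1) for j in range(2 * n + 1)]
    vals = [f(t) for t in nodes]
    def Tn(u):
        return sum(v * c * (1 + math.cos(u - t)) ** n for v, t in zip(vals, nodes))
    return Tn

# partition of unity: applying T_n to the constant function 1 returns 1
ones = make_T(lambda x: 1.0, 5)
print(ones(0.7))

# sup error for the continuous, non-differentiable function |sin|
f = lambda x: abs(math.sin(x))
us = [2 * math.pi * k / 200 for k in range(200)]
errs = []
for n in (2, 8, 32):
    Tn = make_T(f, n)
    errs.append(max(abs(Tn(u) - f(u)) for u in us))
print(errs)
```

The convergence is slow near the kinks of $|\sin|$, consistent with the fact that the kernel $(1+\cos u)^n$ only concentrates at rate roughly $1/\sqrt{n}$.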

## References

[1] Róth, Á., et al. (2009). A cyclic basis for closed curve and surface modelling. Computer Aided Geometric Design, 26, 528-546.

[2] Bernstein, S. (1912). Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités.