# Bayesian learning

Friday july 28 at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical congress of the americas and at the Canadian undergraduate mathematics conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some though problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and “asymptotic correctness” rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.

# 1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

## 1.1 The classical problem

Some stochastic mechanism generates data $x_1, x_2, x_3, \dots$, each $x_i$ being an independent sample from a unknown distribution $P_0$. We want to use the data to learn about $P_0$.

### Example

A cryptographer observes word at random in an unknown countable language. Learning could be about:

• What’s the distribution of the observed words?
• What’s the probability that the next word to be observed has never been seen before?

## 1.2 Bayesian Learning

Uncertain knowledge about what $P_0$ may is encoded through a prior probability measure $\Pi$ on the space of all possibilities for $P_0$. Thus, if $A$ is some set of probability distributions, then $\Pi(A)$ is the probability, representing the best of our knowledge, that “$P_0 \in A$“. Then, we observe the data $\mathcal{D} = (x_1, x_2, \dots, x_n)$. This yields the posterior measure $\Pi(\cdot | \mathcal{D})$, obtained by adjusting $\Pi$ in light of $\mathcal{D}$ through probabilistic conditioning.

### Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability $p_0$ that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what $p_0$ may be, could look like this:

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.

Certainly the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors:

# 2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word $\omega$ has some probability $P_0(\omega)$ of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always and not necessarily in a way that makes bayesian learning meaningfull. Diaconis and Freedman showed, in 1986, showed the following in the context of bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimension. To approach this problem, we need to figure out charateristics of prior distributions that

1. describe how uncertainty is spread out; and
2. ensure that bayesian learning works correctly.

## 2.1 A positive result

An important positive result is based on the two following conditions / characteristics of prior distributions. We denote by $\mathbb{F}$ the set of all probability densities on some common space, assuming that $P_0 \in \mathbb{F}$ (the data-generating distribution has some density in $\mathbb{F}$).

1. $\Pi$ puts positive mass to the relative entropy neighborhoods of $P_0$:

$\Pi\left(\left\{P \in \mathbb{F}| \int \log \frac{dP_0}{dP} dP_0 < \varepsilon \right\}\right) > 0.$

This means that a priori, we’re not excluding the truth from the set of possibilities. Since $P_0$ is unknown, we require that this condition be satisfied whatever $P_0$ may be.

2. $\Pi$ is of finite entropy: for all $\delta > 0$, there exists $1 > \alpha > 0$ and a covering $\{A_i\}$ of $\mathbb{F}$ of $L^1$ diameter less than $\delta$ such that

$\sum_{i} \Pi(A_i)^\alpha < \infty.$

This means that $\Pi$ is not too complex and that we can make sense of it through discretization.

Under these two conditions, bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004)
If the conditions (1) and (2) are satisfied, than for any $L^1$ neighborhood $N$ of $P_0$, we have

$\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1$

almost surely as $x_i \sim P_0$.

This is helpful, but has not yet solved our problem.

• How do we, generally, construct priors satisfying the two conditions?
• How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

# 3. Some of our research

Two contributions I want to present.

• We developped a relationship between some statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
• We use it to solve problems raised in the litterature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.

We want to learn about the data-generating mechanism, e.g. do

• density estimation,
• hypothesis tests,
• classification,

or any other statistical procedure. First, we specify a prior on the set $\mathbb{F}$ of all bounded densities on the circle.

## 3.1 Prior specification

We begin with a density basis of the trigonometric polynomials introduced in 2009:

$C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.$

We studied the statistical properties of this basis, and use it to construct approximation operators

$T_n : \mathbb{F} \ni f \mapsto \sum_j C_{j,n} \int_{R_{j,n}}f(x) dx,$

where $R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right)$. It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

$\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.$

By specifying priors $\Pi_n$ on $T_n(\mathbb{F})$ and mixing them together, we obtain the prior

$\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right).$

on $\mathbb{F}$.

This reduces the problem of specifying a prior on an infinite dimensional space, to specifying an infinite number of priors on finite dimensional spaces. This turns out to be easier, and from properties of $T_n$ we obtain asymptotic correctness of bayesian learning.

Theorem.
Let $T_n : L^1 \rightarrow L^1$ be any linear function mapping $\mathbb{F}$ to itself, with $\dim T_n (\mathbb{F})$ increasing. If $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$, then taking $\rho$ such that $0 < \rho(n) < e^{-\dim T_n(\mathbb{F})}$ and $\Pi_n > 0$ ensures that bayesian learning based on $\Pi$ is asymptotically correct.

What is interesting here:

• The approach scales easily to (almost) any set of bounded densities on a compact metric space.
• The approximation theory litterature provides a wealth of well-studied approximation operators that can be used to construct such priors.
• Properties of the operators relate to properties of the priors. If, for instance, it is known that the true density of $P_0$ is unimodal, then using a unimodality-preserving operator yields a prior the space of unimodal densities.

## 3.2 Simple application

We use a prior of the type considered above to estimate the density of $P_0$, also quantifying uncertainty about our estimate. Below, the data-generating distribution $P_0$ is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded region quantifies uncertainty (50% and 90% credible regions).

# 4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you must already know, different fields of mathematics enrich each other in their interactions. Here, it is approximation theory that provides tools that ensure bayesian learning works correctly.

References.

Diaconis, P. & Freedman, D. (1986). On the Consistency of Bayes Estimates. Ann. Stat. 14, 1–26 .

Roth, A. et al. (2009). A cyclic basis for closed curve and surface modeling, Computer Aided Geometric Design, 26(5), 528-546.

Walker, S. (2004). New approaches to Bayesian consistency. Ann. Statist. 32(5), 2028-2043.

McVinish, R. \& Mengersen, K. (2008). Semiparametric Bayesian circular statistics. Computational Statistics and Data Analysis 52, 4722–4730.

Petrone, S. \& Veronese, P. (2010). Feller operators and mixture priors in Bayesian nonparametrics. Statistica Sinica, 20(1), 379-404.

Wu, Y., Ghosal, S. (2008) Kullback Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics, 2, 298-331.

Binette, O. & Guillotte, S. (2017) Bayesian Nonparametrics for Directional Statistics. Working paper.

## One thought on “Bayesian learning”

1. […] Bayesian learning […]