# Bayesian learning

Friday july 28 at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical congress of the americas and at the Canadian undergraduate mathematics conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some though problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and asymptotic correctness rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.

# 1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

## 1.1 The classical problem

Some stochastic mechanism generates data $x_1, x_2, x_3, \dots$, each $x_i$ being an independent sample from a unknown distribution $P_0$. We want to use the data to learn about $P_0$.

### Example

A cryptographer observes word at random in an unknown countable language. Learning could be about:

• What’s the distribution of the observed words?
• What’s the probability that the next word to be observed has never been seen before?

## 1.2 Bayesian Learning

Uncertain knowledge about what $P_0$ may is encoded through a prior probability measure $\Pi$ on the space of all possibilities for $P_0$. Thus, if $A$ is some set of probability distributions, then $\Pi(A)$ is the probability, representing the best of our knowledge, that “$P_0 \in A$“. Then, we observe the data $\mathcal{D} = (x_1, x_2, \dots, x_n)$. This yields the posterior measure $\Pi(\cdot | \mathcal{D})$, obtained by adjusting $\Pi$ in light of $\mathcal{D}$ through probabilistic conditioning.

### Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability $p_0$ that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what $p_0$ may be, could look like this:

It is most likely that the coin is approximately unbiaised ($p_0 \approx 1/2$), but not impossible that it is strongly biaised.

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.

It is most likely that the coin is biased and that $p_0 \approx 1/4$.

Certainly the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors:

# 2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word $\omega$ has some probability $P_0(\omega)$ of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always and not necessarily in a way that makes bayesian learning meaningfull. Diaconis and Freedman showed, in 1986, showed the following in the context of bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimension. To approach this problem, we need to figure out charateristics of prior distributions that

1. describe how uncertainty is spread out; and
2. ensure that bayesian learning works correctly.

## 2.1 A positive result

An important positive result is based on the two following conditions / characteristics of prior distributions. We denote by $\mathbb{F}$ the set of all probability densities on some common space, assuming that $P_0 \in \mathbb{F}$ (the data-generating distribution has some density in $\mathbb{F}$).

1. $\Pi$ puts positive mass to the relative entropy neighborhoods of $P_0$:

$\Pi\left(\left\{P \in \mathbb{F}| \int \log \frac{dP_0}{dP} dP_0 < \varepsilon \right\}\right) > 0.$

This means that a priori, we’re not excluding the truth from the set of possibilities. Since $P_0$ is unknown, we require that this condition be satisfied whatever $P_0$ may be.

2. $\Pi$ is of finite entropy: for all $\delta > 0$, there exists $1 > \alpha > 0$ and a covering $\{A_i\}$ of $\mathbb{F}$ of $L^1$ diameter less than $\delta$ such that

$\sum_{i} \Pi(A_i)^\alpha < \infty.$

This means that $\Pi$ is not too complex and that we can make sense of it through discretization.

Under these two conditions, bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004)
If the conditions (1) and (2) are satisfied, than for any $L^1$ neighborhood $N$ of $P_0$, we have

$\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1$

almost surely as $x_i \sim P_0$.

This is helpful, but has not yet solved our problem.

• How do we, generally, construct priors satisfying the two conditions?
• How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

# 3. Some of our research

Two contributions I want to present.

• We developped a relationship between some statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
• We use it to solve problems raised in the litterature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.

We want to learn about the data-generating mechanism, e.g. do

• density estimation,
• hypothesis tests,
• classification,

or any other statistical procedure. First, we specify a prior on the set $\mathbb{F}$ of all bounded densities on the circle.

## 3.1 Prior specification

We begin with a density basis of the trigonometric polynomials introduced in 2009:

$C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.$

We studied the statistical properties of this basis, and use it to construct approximation operators

$T_n : \mathbb{F} \ni f \mapsto \sum_j C_{j,n} \int_{R_{j,n}}f(x) dx,$

where $R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right)$. It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

$\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.$

By specifying priors $\Pi_n$ on $T_n(\mathbb{F})$ and mixing them together, we obtain the prior

$\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right).$

on $\mathbb{F}$.

This reduces the problem of specifying a prior on an infinite dimensional space, to specifying an infinite number of priors on finite dimensional spaces. This turns out to be easier, and from properties of $T_n$ we obtain asymptotic correctness of bayesian learning.

Theorem.
Let $T_n : L^1 \rightarrow L^1$ be any linear function mapping $\mathbb{F}$ to itself, with $\dim T_n (\mathbb{F})$ increasing. If $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$, then taking $\rho$ such that $0 < \rho(n) < e^{-\dim T_n(\mathbb{F})}$ and $\Pi_n > 0$ ensures that bayesian learning based on $\Pi$ is asymptotically correct.

What is interesting here:

• The approach scales easily to (almost) any set of bounded densities on a compact metric space.
• The approximation theory litterature provides a wealth of well-studied approximation operators that can be used to construct such priors.
• Properties of the operators relate to properties of the priors. If, for instance, it is known that the true density of $P_0$ is unimodal, then using a unimodality-preserving operator yields a prior the space of unimodal densities.

## 3.2 Simple application

We use a prior of the type considered above to estimate the density of $P_0$, also quantifying uncertainty about our estimate. Below, the data-generating distribution $P_0$ is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded region quantifies uncertainty (50% and 90% credible regions).

# 4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you must already know, different fields of mathematics enrich each other in their interactions. Here, it is approximation theory that provides tools that ensure bayesian learning works correctly.

References.

# The discretization trick

I explain the discretization trick that I mentioned in my previous post (Posterior consistency under possible misspecification). I think it was introduced by Walker (New approaches to Bayesian consistency (2004)).

Let $\mathbb{F}$ be a set of densities and let $\Pi$ be a prior on $\mathbb{F}$. If $x_1, x_2, x_3, \dots$ follows some distribution $P_0$ having density $f_0$, then the posterior distribution of $\Pi$ can be written as

$\Pi(A | x_1, \dots, x_n) \propto \int_A \prod_{i=1}^n f(x_i) \Pi(df).$

The discretization trick is to find densities $f_{1}, f_2, f_3, \dots$ in the convex hull of $A$ (taken in the space of all densities) such that

$\int_A \prod_{i=1}^n f(x_i) = \prod_{i=1}^n f_i(x_i) \Pi(A).$

For example, suppose $\varepsilon > 2\delta > 0$, $A = A_\varepsilon = \{f \,|\, D_{\frac{1}{2}}(f_0, f) > \varepsilon\}$ and that $A_i$ is a partition of $A$ of diameter at most $\delta$. If there exists $1 > \alpha > 0$ such that

$\sum_i \Pi(A_i)^\alpha < \infty,$

then for some $\beta > 0$ we have that

$e^{n \beta} \left(\int_{A_{\varepsilon}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi (df) \right)^\alpha \le e^{n \beta} \sum_i \prod_{j=1}^n \left(\frac{f_{i,j}(x_j)}{f_0(x_j)}\right)^\alpha \Pi(A_i)^\alpha \rightarrow 0$

almost surely. This is because, with $A_{\alpha}$ the $\alpha$-affinity defined here, we have that

goes exponentially fast towards 0 when $\beta$ is sufficiently small.  Hence the Borel-Cantelli lemma applies, yielding the claim.

## Construction

The $f_{i,j}$‘s are defined as the posterior mean predictive density, when the posterior is conditioned on $A_i$. That is,

$f_{i,j} : x \mapsto \frac{\int_{A_i} f(x) \prod_{k=1}^{j-1}f(x_k) \Pi(df)}{\int_{A_i} \prod_{k=1}^{j-1}f(x_k) \Pi(df)}$

and

$f_{i, 1} : x \mapsto \frac{\int_{A_i} f(x) \Pi(df)}{\Pi(A_i)}.$

Clearly

$\int_{A_i} \prod_{i=1}^n f(x_i) \Pi(df) = \prod_{j=1}^n f_{i,j}(x_j) \Pi(A_i).$

Furthermore, if $A_i$ is contained in a Hellinger ball of center $g_i$ and of radius $\delta$, then also

$H(f_{i,j}, g_i) < \delta.$

This follows form the convexity of the Hellinger balls (an important remark for the generalization of this trick).

# Posterior consistency under (possible) misspecification

We assume, without too much loss of generality, that our priors are discrete. When dealing with Hellinger separable density spaces, it is possible to discretize posterior distributions to study consistency (see this post about it).

Let $\Pi$ be a prior on a countable space $\mathcal{N} = \{f_1, f_2, f_3, \dots\}$ of probability density functions, with $\Pi(f) > 0$ for all $f \in \mathcal{N}$. Data $X_1, X_2, X_3, \dots$ follows (independently) some unknown distribution $P_0$ with density $f_0$.

We denote by $D_{KL}(f_0, f) = \int f_0 \log\frac{f_0}{f}$ the Kullback-Leibler divergence and we let $D_{\frac{1}{2}}(f_0, f) = 1 - \int \sqrt{f_0 f}$ be half of the squared Hellinger distance.

The following theorem states that the posterior distribution of $\Pi$ accumulates in Hellinger neighborhoods of $f_0$, assuming the prior is root-summable (i.e. $\sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty$ for some $\alpha > 0$) . In the well-specified case (i.e. $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0$), the posterior accumulates in any neighborhood of $f_0$. In the misspecified case, small neighborhoods of $f_0$ could be empty, but the posterior distribution still accumulates in sufficiently large neighborhoods (how large exactly is a function of $\alpha$ and $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f)$).

The result was essentially stated by Barron (Discussion: On the Consistency of Bayes Estimates, 1986). In the case where $\Pi$ is not necessarily discrete, a similar result was obtained, through a discretization argument, by Walker (Bayesian asymptotics with misspecified models, 2013). See also Xing (Sufficient conditions for Bayesian consistency, 2009) for a thorough treatment of Bayesian consistency using the same method of proof.

Theorem (Barron).
Suppose $\beta_0 :=\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) < \infty$ and that

$\alpha := \inf \left\{ p \in [\tfrac{1}{2},1] \,|\, \sum_{f \in \mathcal{N}} \Pi(f)^p < \infty \right\} < 1.$

If $\varepsilon > 0$ is such that

$\varepsilon > 1- \exp\left( \frac{-\beta_0 \alpha}{2(1-\alpha)} \right)$

and if $A_\varepsilon := \{f \in \mathcal{N} \,|\, D_{\frac{1}{2}} (f_0, f) < \varepsilon\} \not = \emptyset$, then

$\Pi\left(A_\varepsilon \,|\, \{x_i\}_{i=1}^n\right) \rightarrow 1$

almost surely as $n \rightarrow \infty$.

Remarks.

1 – If $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0$, then any $\varepsilon > 0$ can be used.

2 – $\alpha$ is related to the rate of convergence of $\rho(n) := \Pi(f_n)$ towards $0$. The quantity $H_\alpha (\Pi) = \log \sum_{f \in \mathcal{N}} \Pi(f)^\alpha$ can be thought as measure of entropy.

3 – Walker (2013) considered the case $\sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty$ for some $\alpha < \frac{1}{2}$. This implies that $\sum_{f \in \mathcal{N}} \sqrt{\Pi(f)} < \infty$ and the above theorem can also be applied in this case.

## Demonstration

The proof is brief. I do not dwell on explanations.

Background.
First let me recall some concepts. The $\alpha$-affinity between two densities $f$ and $g$ is defined as

$A_\alpha(f, g) = \int g^\alpha f^{1-\alpha}. \qquad (1)$

Note that $0 \le A_{\alpha}(f, g) \le 1$ and that $A_\alpha(f, g) = 1$ if and only if $f = g$. Furthermore, when $\alpha \geq \frac{1}{2}$, Holder’s inequality and Jensen’s inequality yield

$A_{\frac{1}{2}}(f, g) \le A_{\alpha}(f, g) \le \left(A_{\frac{1}{2}}(f,g)\right)^{2(1-\alpha)}. \qquad (2)$

Proof.
We can now begin the proof. Let $\tilde{\alpha}$ be such that $1 > \tilde{\alpha} > \alpha$. Then, we have

If $\beta > 0$ and $g \in \mathcal{N}$ is such that $D_{KL}(f_0, g) < \beta_0 + \beta$, then

almost surely. Furthermore, using (2), we find

Here $\text{cst.}$ is a constant. Since $\beta > 0$ is arbitrary and since $\tilde \alpha$ can be taken so that $2(1-\tilde \alpha) \log (1-\varepsilon) + \tilde \alpha \beta_0 < 0$, we obtain that $(**)$ converges exponentially fast towards $0$. Hence, by the Borel-Cantelli lemma, we have

almost surely. This, together with $(3)$, implies that $(*) \rightarrow 0$ almost surely. $\Box$

# The choice of prior in bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

# The structure of $\mathcal{M}$

Recall that $\mathbb{M}$ is is a Polish space (ie. a complete and separable metric space). It is endowed with its borel $\sigma$-algebra $\mathfrak{B}$ which is the smallest family of subsets of $\mathbb{M}$ that contains its topology and that is closed under countable unions and intersections. All subsets of $\mathbb{M}$ we consider in the following are supposed to be part of $\mathfrak{B}$. A probability measure on $\mathbb{M}$ is a function $\mu : \mathfrak{B} \rightarrow [0,1]$ such that for any countable partition $A_1, A_2, A_3, \dots$ of $\mathbb{M}$ we have that $\sum_{i=1}^\infty \mu(A_i) = 1.$ The set $\mathcal{M}$ consists of all such probability measures.

Note that since $\mathbb{M}$ is complete and separable, every probability measure $\mu \in \mathcal{M}$ is regular (and tight). It means that the measure of any $A\subset \mathbb{M}$ can be well approximated from the measure of compact subsets of $A$ as well as from the measure of open super-sets of $A$:

$\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.$

## Metrics on $\mathcal{M}$

Let me review some facts. A natural metric used to compare the mass allocation of two measures $\mu, \nu \in \mathbb{M}$ is the total variation distance defined by

$\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.$

It is relatively straightforward to verify that $\mathcal{M}$ is complete under this distance, but it is not in general separable. To see this, suppose that $\mathbb{M} = [0,1]$. If a ball centered at $\mu$ contains a dirac measure $\delta_x$, $x \in [0,1]$, then $\mu$ must have a point mass at $x$. Yet any measure contains at most a countable number of point masses, and there is an uncountable number of dirac measures on $[0,1]$. Thus no countable subset of $\mathcal{M}$ can cover $\mathcal{M}$ up to an $\varepsilon$ of error.

This distance can be relaxed to the Prokhorov metric, comparing mass allocation up to $\varepsilon$-neighborhoods. It is defined as

$d_P(\mu, \nu) = \inf \left\{ \varepsilon > 0 \,|\, \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon \text{ and } \nu(A) \le \mu(A^{\varepsilon}) + \varepsilon,\; \forall A \subset \mathbb{M} \right\},$

where $A^{\varepsilon} = \{x \in \mathbb{M} \,|\, d(x, A) < \varepsilon\}$ is the $\varepsilon$-neighborhood of $A$. It is a metrization of the topology of weak convergence of probability measures, and $\mathcal{M}$ is separable under this distance.

The compact sets of $\mathcal{M}$ under the Prokhorov metric admit a simple characterization given by the Prokhorov theorem: $P \subset \mathcal{M}$ is precompact if and only if $P$ is uniformly tight (for each $\varepsilon > 0$, there exists a compact $K \subset X$ such that $\sup_{\mu \in P} \mu(K) \geq 1-\varepsilon$). This means that a sequence $\{\mu_n\} \subset \mathcal{M}$ admits a weakly convergent subsequence if and only if $\{\mu_n\}$ is uniformly tight.

Characterizations of weak convergence are given by the Portemanteau theorem, which says in particular that $\mu_n$ converges weakly to $\mu$ if and only if

$\int f d\mu_n \rightarrow \int f d\mu$

for all continuous and bounded and continuous $f$. It is also equivalent to

$\mu_n(A) \rightarrow \mu(A)$

for all sets $A$ such that $\mu(\partial A) = 0$.

## Measures of divergence

In addition to metrics, that admit a geometric interpretation through the triangle inequality, statistical measures of divergence can also be considered. Here, we consider functions $D : \mathcal{M}\times \mathcal{M} \rightarrow [0, \infty]$ that can be used to determine the rate of convergence of the likelihood ratio

$\prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0,$

where $x_i \sim \nu$ and $\mu, \nu \in \mathcal{M}$.

### Kullback-Leibler divergence

The weight of evidence in favor of the hypothesis “$\lambda = \mu$” versus “$\lambda = \nu$” given a sample $x$ is defined as

$W(x) = \log\frac{d\mu}{d\nu}.$

It measures how much information about the hypotheses is brought by the observation of $x$. (For a justification of this interpretation, see Good (Weight of evidence: a brief survey, 1985).) The Kullback-Leibler divergence $D_{KL}$ between $\mu$ and $\nu$ is defined as the expected weight of evidence given that $x \sim \mu$:

$D_{KL}(\mu, \nu) =\mathbb{E}_{x \sim \mu} W(x) = \int \log \frac{d\mu}{d\nu} d\mu.$

The following properties of the Kullback-Leibler divergence support its interpretation as an expected weight of evidence.

Theorem 1 (Kullback and Leibler, 1951).
We have

1. $D_{KL}(\mu, \nu) \geq 0$ with equality if and only if $\mu = \nu$;
2. $D_{KL}(\mu T^{-1}, \nu T^{-1}) \geq D_{KL}(\mu, \nu)$ with equality if and only if $T: \mathbb{M} \rightarrow \mathbb{M}'$ is a sufficient statistic for $\{\mu, \nu\}$.

Furthermore, the KL divergence can be used to precisely identify exponential rates of convergence of the likelihood ratio. The first part of the next proposition says that $D_{KL}(\lambda, \nu)$ is finite if and only if the likelihood ratio $\prod_{i} \frac{d\nu}{d\lambda}(x_i)$, $x_i \sim \lambda$ cannot convergence super-exponentially fast towards $0$. The second part identifies the rate of convergence then the KL divergence is finite.

Proposition 2.
Let $x_1, x_2, x_3, \dots \sim \lambda$ (independently). The KL divergence $D_{KL}(\lambda, \nu)$ is finite if and only if there exists an $\alpha > 0$ such that

$e^{n\alpha} \prod_{i=1}^n \frac{d\nu}{d\lambda}(x_i) \rightarrow \infty$

with positive probability.

Finally, suppose we are dealing with a submodel $\mathcal{F} \subset \mathcal{M}$ such that the rates of convergences of the likelihood ratios in $\mathcal{F}$ are of an exponential order. By the previous proposition, this is equivalent to the fact that $\forall \mu, \nu \in \mathcal{F}$, $D_{KL}(\mu, \nu) < \infty$. We can show that the KL divergence is, up to topological equivalence, the best measure of divergence that determines the convergence of the likelihood ratio. That is, suppose $D: \mathcal{F} \times \mathcal{F}\rightarrow [0, \infty]$ is such that

$D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i) \rightarrow 0$

at an exponential rate, almost surely when $x_i \sim \lambda$, and that $D(\lambda, \mu) = 0$ if and only if $\lambda = \mu$. Then, the topology induced by $D_{KL}$ is coarser than the topology induced by $D$.

Proposition 3.
Let $D$ be as above and let $\mathcal{F} \subset \mathcal{M}$ be such that $\forall \mu, \nu \in \mathcal{F}$, $D_{KL}(\mu, \nu) < \infty$. Then, the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the topology induced by $D$. More precisely, we have that

$D(\lambda, \mu) < D(\lambda, \nu) \Rightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu).$

### alpha-affinity and alpha-divergence

We define the $\alpha$-affinity between two probability measures as the expectancy of another transform of the likelihood ratio. Let $\mu, \nu$ be two probability measures dominated by $\lambda$, with $d\mu = f d\lambda$ and $d\nu = g d\lambda$. Given $0 < \alpha < 1$, the $\alpha$-affinity between $\mu$ and $\nu$ is

$A_\alpha(\mu, \nu) = \int \left(\frac{g}{f}\right)^\alpha d\mu = \int g^\alpha f^{1-\alpha} d\lambda.$

Proposition 4.
For all $0 < \alpha < 1$, we have that

1. $A_\alpha(\mu, \nu) \le 1$ with equality if and only if $\mu = \nu$;

2. $A_\alpha$ is monotonous in $\alpha$ and jointly concave in its arguments;

3. $A_\alpha$ is jointly multiplicative under products:

$A_\alpha (\mu^{(n)}, \nu^{(n)}) = \left(A_{\alpha}(\mu, \nu)\right)^n.$

4. if $\frac{1}{2} \leq \alpha$, then

$A_{\frac{1}{2}} \le A_\alpha \le \left(A_{\frac{1}{2}}\right)^{2(1-\alpha)};$

Proof.
1-2 follow from Jensen’s inequality and the joint concavity of $(x,y) \mapsto x^\alpha y^{1-\alpha}$. 3 follows from Fubini’s theorem. For
(iv), the first inequality is a particular case of 2 and Hölder’s inequality finally yields

$A_{\alpha}(\mu, \nu) = \int (fg)^{1-\alpha} g^{2\alpha - 1} d\lambda \le \left( \int \sqrt{fg} \,d\lambda \right)^{2-2\alpha} = A_{\frac{1}{2}}(\mu, \nu).$

$\Box$

The $\alpha$-divergence $D_\alpha$ is obtained as

$D_\alpha = 1 - A_\alpha.$

Other similar divergences considered in the litterature are

$\frac{1-A_\alpha}{\alpha(1-\alpha)}\; \text{ and }\; \frac{\log A_\alpha}{\alpha(1-\alpha)},$

but we prefer $D_\alpha$ for its simplicity. When $\alpha = \frac{1}{2}$, it is closely related to the hellinger distance

$H(\mu, \nu) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^2d\lambda\right)^{\frac{1}{2}}$

through

$D_{\frac{1}{2}}(\mu, \nu) = \frac{H(\mu, \nu)^2}{2}.$

Other important and well-known inequalities are given below.

Proposition 5.
We have

$D_{\frac{1}{2}}(\mu, \nu) \le \|\mu-\nu\|_{TV} \le \sqrt{2 D_{\frac{1}{2}}(\mu, \nu)}$

and

$2D_{\frac{1}{2}}(\mu, \nu) \le D_{KL}(\mu, \nu) \le 2\left\|\frac{f}{g}\right\|_\infty \|\mu-\nu\|_{TV}.$

This, together with proposition 4 (4) , yields similar bounds for the other divergences.

## Finite models

Let $\Pi$ be a prior on $\mathcal{M}$ that is finitely supported. That is, $\Pi = \sum_{i=1}^n p_i \delta_{\mu_i}$ for some $\mu_i \in \mathcal{M}$ and $p_i > 0$ with $\sum_i p_i = 1$. Suppose that $x_1, x_2, x_3, \dots$ independently follow some $\mu_* \in \mathcal{M}$.

The following proposition ensures that as data is gathered, the posterior distribution of $\Pi$ concentrates on the measures $\mu_i$ that are closest to $\mu_*$.

Proposition 6.
Let $A_{\varepsilon} = \{\mu_i \,|\, D_{KL}(\mu_*, \mu_i) < \varepsilon \}$. If $A_\varepsilon \not = \emptyset$, then

$\Pi(A_\varepsilon \,|\, \{x_i\}_{i=1}^m) \rightarrow 1$

almost surely as $m \rightarrow \infty$.

# The choice of prior in bayesian nonparametrics – Introduction

In preparation for the 11th Bayesian nonparametrics conference, I’m writing (and rewriting) notes on the background of our research (i.e. some of the general theory of bayesian nonparametrics). There are some good books on the subject (such as Bayesian Nonparametrics (Ghosh and Ramamoorthi, 2003)), but I wanted a more introductory focus and to present Choi and Ramamoorthi’s very clear point of view on posterior consistency (Remarks on the consistency of posterior distributions, 2008).

# 1. Introduction

Let $\mathbb{X}$ be a complete and separable metric space and let $\mathcal{M}$ be the space of all probability measures on $\mathbb{X}$. Some unknown distribution $P_0\in \mathcal{M}$ is generating observable data $\mathcal{D}_n = (X_1, X_2, \dots, X_n) \in \mathbb{X}^n$, where each $X_i$ is independently drawn from $P_0$. The problem is to learn about $P_0$ using only $\mathcal{D}_n$ and prior knowledge.

Example (Discovery probabilities).
A cryptographer observes words, following some distribution $P_0$, in an unknown countable language $\mathcal{L}$. What are the $P_0$-probabilities of the words observed thus far? What is the probability that the next word to be observed has never been observed before?

## 1.1 Learning and uncertainty

We need an employable definition of learning. As a first approximation, we can consider learning to be the reduction of uncertainty about what is $P_0$. This requires a quantification of how uncertain we are to begin with. Then, hopefully, as data is gathered out uncertainty decreases and we are able to pinpoint $P_0$.

This is the core of Bayesian learning, alghough our definition is not yet entirely satisfactory. There are some difficulties with this idea of quantifying uncertainty, at least when using information-theoric concepts. The solution we adopt here is the use of probabilities to quantify uncertain knowledge (bayesians would also talk of subjective probabilities quantifying rational belief). For example, you may know that a coin flip is likely to be fair, although it is not impossible the two sides of the coin are both the same. This is uncertain knowledge about the distribution of heads and tails in the coin flips, and you could assign probabilities to the different possibilities.

More formally, prior uncertain knowledge about what is $P_0$ is quantified by a probability measure $\Pi$ on $\mathcal{M}$. For any $A \subset \mathcal{M}$, $\Pi(A)$ is the the prior probability that “$P_0 \in A$“. Then, given data $\mathcal{D}_n$, prior probabilities are adjusted to posterior probabilities: $\Pi$ becomes $\Pi_n$, the conditional distribution of $\Pi$ given $\mathcal{D}_n$. The celebrated Bayes’ theorem provides a formula to calculate $\Pi_n$ from $\Pi$ and $\mathcal{D}_n$. Thus we have an operational definition of learning in our statistical framework.

Learning is rationally adjusting uncertain knowledge in the light of new information.

For explanations as to why probabilities are well suited to the representation of uncertain knowledge, I refer the reader to Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988). We will also see that the operation of updating the prior to posterior posterior probabilities does work as intended.

## 1.2 The choice of prior

Specifying prior probabilities, that is quantifying prior uncertain knowledge, is not a simple task. It is especially difficult when uncertainty is over the non-negligeable part $\mathcal{M}$ of an infinite dimensional vector space. Fortunately, “probability is not about numbers, it is about the structure of reasoning”, as Glenn Shafer puts it (cited in Pearl, 1988, p.15). The exact numbers given to the events “$P_0 \in A$” are not of foremost importance; what matters is how probabilities are more qualitatively put together, and how this relates to the learning process.

Properties of prior distributions, opening them to scrutiny, criticism and discussion, must be identified and related to what happens as more and more data is gathered.

Part 2.

# Explanations

Angular data arises in many scientific fields, such as in experimental biology for the study of animal orientation, and in bioinformatics in relation to the protein structure prediction problem.

The statistical analysis of this data requires adapted tools such as $2\pi$-periodic density models. Fernandez-Duran (Biometrics, 60(2), 2004) proposed non-negative trigonometric sums (i.e. non-negative trigonometric polynomials) as a flexible family of circular distributions. However, the coefficients of trigonometric polynomials expressed in the standard basis $1, \cos(x), \sin(x), \dots$ are difficult to interpret and we do not see how an informative prior could be specified through this parametrization. Moreover, the use of this basis was criticized by Ferreira et al. (Bayesian Analysis, 3(2), 2008) as resulting in a “wigly approximation, unlikely to be useful in most real applications”.

### Trigonometric density basis

Here, we suggest the use of a density basis of the trigonometric polynomials and argue it is well suited to statistical applications. In particular, coefficients of trigonometric densities expressed in this basis possess an intuitive geometric interpretation. Furthermore, we show how “wiggliness” can be precisely controlled using this basis and how another geometric constraint, periodic unimodality, can be enforced [first proposition on the poster]. To ensure that nothing is lost by using this basis, we also show that the whole model consists of precisely all positive trigonometric densities, together with the basis functions [first theorem on the poster].

### Prior specification

Priors can be specified on the coefficients of mixtures in our basis and on the degree of the trigonometric polynomials to be used. Through the interpretability of the coefficients and the shape-preserving properties of the basis, different types of prior knowledge may be incorporated. Together with an approximate understanding of mass allocation, these include:

• periodic unimodality;
• bounds on total variation; and
• knowledge of the marginal distributions (in the multivariate case).

The priors obtained this way are part of a well-studied family called sieve priors, including the well-known Bernstein-Dirichlet prior, and are finite mixtures with an unknown number of components. Most results and interpretations about the Bernstein-Dirichlet prior (see Petrone & Wasserman (J. R. Stat. Soc. B., 64(1),  2002), Kruijer and Van der Vaart (J. Stat. Plan. Inference, 138(7), 2008), McVinish et al. (Scand. J. Statist., 36(2), 2009) can carry over to the priors we consider, but we dot not discuss them further.

### Approximation-theoric framework

Our density models arise as the image of “shape-perserving” linear approximation operators. This approximation-theoric relationship is used to obtain a notably large prior Kullback-Leibler support and ensures strong posterior consistency at all bounded (not necessarily continuous) density. The result partly relies on known properties of sieve priors, as well as general consistency results (Walker (Ann. Statist., 32(5), 2004)), but extends known result by removing an usual continuity hypothesis on the densities at which consistency is achieved (see Wu & Ghosal (‎Electron. J. Stat., 2, 2008), Petrone & Veronese (Statistica Sinica, 20, 2010)). For contraction rates, higher order smoothness conditions are usually required (see Shen & Ghosal (Scand. J. Statist., 42(4), 2015)).

For example, consider the prior induced by the random density

$T_n \mathcal{D} := \sum_i \mathcal{D}(R_{i,n}) C_{i,n},\qquad (1)$

where $\mathcal{D}$ is a Dirichlet process, $n$ is distributed on $\mathbb{N}$ and $R_{i,n}$ is a partition of the circle. It has the strong posterior consistency at all bounded density provided that the associated operator

$T_n : f \mapsto \sum_i C_{i,n} \int_{R_{i,n}} f$

is such that $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$.

More generally, let $\mathbb{F}$ be a set of bounded densities on some compact metric space $\mathbb{M}$, let $T_n : L^1(\mathbb{M}) \rightarrow L^1(\mathbb{M})$, $n \in \mathbb{N}$, be a sequence of operators that are:

• shape preserving: $T_n$ maps densities to densities and $T_n(\mathbb{F}) \subset \mathbb{F}$; and
• approximating: $\|T_n f - f\|_\infty \rightarrow 0$ for all continuous $f$;

and finally let $\Pi_n$ be priors on $T_n(\mathbb{F})$ with full support. A sieve prior on $\mathbb{F}$ is defined by

$\Pi : A \mapsto \sum_n \rho(n) \Pi_n(A \cap T_n(\mathbb{F}))$.

Theorem.
If $0 < \rho(n) < Ce^{-c d_n}$ for some increasing sequence $d_n$ bounding the dimensions of $T_n (\mathbb{F})$, then the posterior distribution of $\Pi$ is strongly consistent at each density of $\mathbb{F}$.

The approximation theory literature is rich in such operators. The theorem shows that they provide strongly consistent priors on arbitrary density spaces simply given priors $\Pi_n$ on $T_n(\mathbb{F})$.

Basic density estimation:

A thousand samples (grey histogram) were drawn from the density in orange. The prior is defined by (1) with the Dirichlet process centered on the uniform density and with a precision parameter of 2. The degree $n$ is distributed as a $\text{Poiss}(15)$. The blue line is the posterior mean, the dark blue shaded region is a 50% pointwise credible region around the median, and the light blue shaded region is a 90% credible region.