# The choice of prior in Bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

# The structure of $\mathcal{M}$

Recall that $\mathbb{M}$ is a Polish space (i.e. a complete and separable metric space). It is endowed with its Borel $\sigma$-algebra $\mathfrak{B}$, which is the smallest family of subsets of $\mathbb{M}$ that contains its topology and that is closed under countable unions and intersections. All subsets of $\mathbb{M}$ we consider in the following are supposed to be part of $\mathfrak{B}$. A probability measure on $\mathbb{M}$ is a function $\mu : \mathfrak{B} \rightarrow [0,1]$ such that for any countable partition $A_1, A_2, A_3, \dots$ of $\mathbb{M}$ we have that $\sum_{i=1}^\infty \mu(A_i) = 1.$ The set $\mathcal{M}$ consists of all such probability measures.

Note that since $\mathbb{M}$ is complete and separable, every probability measure $\mu \in \mathcal{M}$ is regular (and tight). This means that the measure of any $A\subset \mathbb{M}$ can be approximated arbitrarily well by the measure of compact subsets of $A$ as well as by the measure of open supersets of $A$:

$\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.$

## Metrics on $\mathcal{M}$

Let me review some facts. A natural metric used to compare the mass allocation of two measures $\mu, \nu \in \mathcal{M}$ is the total variation distance defined by

$\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.$

It is relatively straightforward to verify that $\mathcal{M}$ is complete under this distance, but it is not in general separable. To see this, suppose that $\mathbb{M} = [0,1]$. If a ball of radius $\varepsilon < 1$ centered at $\mu$ contains a Dirac measure $\delta_x$, $x \in [0,1]$, then $\mu$ must have a point mass at $x$. Yet any measure has at most countably many point masses, and there are uncountably many Dirac measures on $[0,1]$. Thus no countable subset of $\mathcal{M}$ can approximate every element of $\mathcal{M}$ to within $\varepsilon$.
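For finitely supported measures, the supremum in the definition can be checked by brute force. The sketch below is only an illustration: the helper names `tv_bruteforce` and `tv_half_l1`, and the representation of a measure as a dictionary of atoms, are assumptions of this example rather than notation from the text. It compares the supremum over all subsets with the formula $\frac{1}{2}\sum_x |\mu(\{x\}) - \nu(\{x\})|$, and computes the distance between two distinct Dirac measures.

```python
from itertools import chain, combinations


def tv_bruteforce(mu, nu):
    """sup_A |mu(A) - nu(A)| over all subsets A of the joint support."""
    points = sorted(set(mu) | set(nu))
    subsets = chain.from_iterable(
        combinations(points, k) for k in range(len(points) + 1)
    )
    return max(
        abs(sum(mu.get(x, 0.0) for x in A) - sum(nu.get(x, 0.0) for x in A))
        for A in subsets
    )


def tv_half_l1(mu, nu):
    """Half of the L1 distance between the probability mass functions."""
    points = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in points)


# Measures are dictionaries mapping atoms in [0, 1] to their masses.
mu = {0.0: 0.5, 0.5: 0.3, 1.0: 0.2}
nu = {0.0: 0.1, 0.5: 0.3, 1.0: 0.6}
dirac_x = {0.25: 1.0}
dirac_y = {0.75: 1.0}

print(tv_bruteforce(mu, nu), tv_half_l1(mu, nu))
print(tv_half_l1(dirac_x, dirac_y))
```

Two Dirac measures at distinct points are always at total variation distance $1$, however close the points are, which is what drives the non-separability argument.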

This distance can be relaxed to the Prokhorov metric, comparing mass allocation up to $\varepsilon$-neighborhoods. It is defined as

$d_P(\mu, \nu) = \inf \left\{ \varepsilon > 0 \,|\, \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon \text{ and } \nu(A) \le \mu(A^{\varepsilon}) + \varepsilon,\; \forall A \subset \mathbb{M} \right\},$

where $A^{\varepsilon} = \{x \in \mathbb{M} \,|\, d(x, A) < \varepsilon\}$ is the $\varepsilon$-neighborhood of $A$. It is a metrization of the topology of weak convergence of probability measures, and $\mathcal{M}$ is separable under this distance.
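As a worked example, using nothing beyond the definition above, take two Dirac measures $\delta_x$ and $\delta_y$ with $r = d(x, y) > 0$. If $\varepsilon > r$, then any set $A$ containing $x$ satisfies $y \in A^{\varepsilon}$, so both inequalities hold for every $A$. If $\varepsilon \le r$, then choosing $A = \{x\}$ gives $\delta_x(A) = 1$ and $\delta_y(A^{\varepsilon}) = 0$, so the first inequality forces $\varepsilon \ge 1$. Hence

$d_P(\delta_x, \delta_y) = \min\{d(x, y),\, 1\}.$

In particular $d_P(\delta_x, \delta_y) \rightarrow 0$ as $y \rightarrow x$, whereas $\|\delta_x - \delta_y\|_{TV} = 1$ for all $x \neq y$.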

The compact sets of $\mathcal{M}$ under the Prokhorov metric admit a simple characterization given by the Prokhorov theorem: $P \subset \mathcal{M}$ is precompact if and only if $P$ is uniformly tight (for each $\varepsilon > 0$, there exists a compact $K \subset \mathbb{M}$ such that $\inf_{\mu \in P} \mu(K) \geq 1-\varepsilon$). In particular, every uniformly tight sequence $\{\mu_n\} \subset \mathcal{M}$ admits a weakly convergent subsequence.

Characterizations of weak convergence are given by the Portmanteau theorem, which says in particular that $\mu_n$ converges weakly to $\mu$ if and only if

$\int f d\mu_n \rightarrow \int f d\mu$

for all bounded and continuous $f$. It is also equivalent to

$\mu_n(A) \rightarrow \mu(A)$

for all sets $A$ such that $\mu(\partial A) = 0$.
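The role of the boundary condition can be seen on the simplest possible sequence, $\mu_n = \delta_{1/n}$, which converges weakly to $\delta_0$ on $\mathbb{M} = \mathbb{R}$. The following sketch is illustrative only; the function names and the two test sets $A = (0, 1]$ and $B = (-1, 1)$ are choices made for this example.

```python
import math


def integral_against_dirac(f, x):
    """Integral of f with respect to the Dirac measure at x."""
    return f(x)


def dirac_mass(indicator, x):
    """Mass given by the Dirac measure at x to the set {indicator is True}."""
    return 1.0 if indicator(x) else 0.0


def in_A(t):
    # A = (0, 1]; its boundary {0, 1} has delta_0-measure 1.
    return 0.0 < t <= 1.0


def in_B(t):
    # B = (-1, 1); its boundary {-1, 1} has delta_0-measure 0.
    return -1.0 < t < 1.0


for n in (2, 10, 100, 1000):
    x_n = 1.0 / n
    print(n, integral_against_dirac(math.cos, x_n),
          dirac_mass(in_A, x_n), dirac_mass(in_B, x_n))

print("limit", integral_against_dirac(math.cos, 0.0),
      dirac_mass(in_A, 0.0), dirac_mass(in_B, 0.0))
```

The integrals of the bounded continuous function $\cos$ converge, and so do the masses of $B$. The masses of $A$ stay equal to $1$ while the limit gives $A$ mass $0$, which is allowed because $\delta_0(\partial A) = 1$.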

## Measures of divergence

In addition to metrics, which admit a geometric interpretation through the triangle inequality, statistical measures of divergence can also be considered. Here, we consider functions $D : \mathcal{M}\times \mathcal{M} \rightarrow [0, \infty]$ that can be used to determine the rate of convergence of the likelihood ratio

$\prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0,$

where $x_i \sim \nu$ and $\mu, \nu \in \mathcal{M}$.

### Kullback-Leibler divergence

The weight of evidence in favor of the hypothesis “$\lambda = \mu$” versus “$\lambda = \nu$” given a sample $x$ is defined as

$W(x) = \log\frac{d\mu}{d\nu}(x).$

It measures how much information about the hypotheses is brought by the observation of $x$. (For a justification of this interpretation, see Good (Weight of evidence: a brief survey, 1985).) The Kullback-Leibler divergence $D_{KL}$ between $\mu$ and $\nu$ is defined as the expected weight of evidence given that $x \sim \mu$:

$D_{KL}(\mu, \nu) =\mathbb{E}_{x \sim \mu} W(x) = \int \log \frac{d\mu}{d\nu} d\mu.$
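For measures on a finite set, the integral reduces to a finite sum. Here is a minimal sketch, assuming that $\mu$ and $\nu$ are given as probability vectors over the same atoms; the function name `kl_divergence` and the example vectors are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p, q) = sum_x p(x) log(p(x) / q(x)) for probability vectors p, q.

    Returns math.inf when p puts mass where q does not (p is not dominated by q).
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


mu = [0.5, 0.3, 0.2]
nu = [0.25, 0.25, 0.5]

print(kl_divergence(mu, nu))   # expected weight of evidence for mu when x ~ mu
print(kl_divergence(nu, mu))   # a different value: the divergence is not symmetric
print(kl_divergence(mu, mu))
print(kl_divergence([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))
```

The last call illustrates that the divergence is infinite as soon as $\mu$ is not absolutely continuous with respect to $\nu$.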

The following properties of the Kullback-Leibler divergence support its interpretation as an expected weight of evidence.

Theorem 1 (Kullback and Leibler, 1951).
We have

1. $D_{KL}(\mu, \nu) \geq 0$ with equality if and only if $\mu = \nu$;
2. $D_{KL}(\mu T^{-1}, \nu T^{-1}) \leq D_{KL}(\mu, \nu)$ with equality if and only if $T: \mathbb{M} \rightarrow \mathbb{M}'$ is a sufficient statistic for $\{\mu, \nu\}$.

Furthermore, the KL divergence can be used to precisely identify exponential rates of convergence of the likelihood ratio. The first part of the next proposition says that $D_{KL}(\lambda, \nu)$ is finite if and only if the likelihood ratio $\prod_{i} \frac{d\nu}{d\lambda}(x_i)$, $x_i \sim \lambda$, cannot converge super-exponentially fast towards $0$. The second part identifies the rate of convergence when the KL divergence is finite.

Proposition 2.
Let $x_1, x_2, x_3, \dots \sim \lambda$ (independently). The KL divergence $D_{KL}(\lambda, \nu)$ is finite if and only if there exists an $\alpha > 0$ such that

$e^{n\alpha} \prod_{i=1}^n \frac{d\nu}{d\lambda}(x_i) \rightarrow \infty$

with positive probability.
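The exponential rate can be illustrated numerically. By the strong law of large numbers, when $D_{KL}(\lambda, \nu)$ is finite we have $\frac{1}{n}\sum_{i=1}^n \log \frac{d\nu}{d\lambda}(x_i) \rightarrow -D_{KL}(\lambda, \nu)$ almost surely, so the likelihood ratio decays like $e^{-n D_{KL}(\lambda, \nu)}$ up to sub-exponential factors. The simulation below is a sketch under assumptions of my choosing: two-point distributions with arbitrary weights, a fixed seed, and $n = 20000$ draws.

```python
import math
import random

lam = [0.5, 0.5]   # sampling distribution lambda on {0, 1} (arbitrary choice)
nu = [0.2, 0.8]    # alternative distribution nu (arbitrary choice)

# Exact value of D_KL(lambda, nu) for these two-point distributions.
kl = sum(l * math.log(l / v) for l, v in zip(lam, nu))

rng = random.Random(0)  # fixed seed so that the run is reproducible
n = 20000
log_ratio = 0.0         # log of prod_i (d nu / d lambda)(x_i)
for _ in range(n):
    x = 0 if rng.random() < lam[0] else 1
    log_ratio += math.log(nu[x] / lam[x])

empirical_rate = -log_ratio / n
print(kl, empirical_rate)
```

Any $\alpha > D_{KL}(\lambda, \nu)$ then makes $e^{n\alpha}\prod_{i=1}^n \frac{d\nu}{d\lambda}(x_i)$ diverge, in line with the proposition.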

Finally, suppose we are dealing with a submodel $\mathcal{F} \subset \mathcal{M}$ such that the rates of convergence of the likelihood ratios in $\mathcal{F}$ are of an exponential order. By the previous proposition, this is equivalent to the fact that $\forall \mu, \nu \in \mathcal{F}$, $D_{KL}(\mu, \nu) < \infty$. We can show that the KL divergence is, up to topological equivalence, the best measure of divergence that determines the convergence of the likelihood ratio. That is, suppose $D: \mathcal{F} \times \mathcal{F}\rightarrow [0, \infty]$ is such that

$D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i) \rightarrow 0$

at an exponential rate, almost surely when $x_i \sim \lambda$, and that $D(\lambda, \mu) = 0$ if and only if $\lambda = \mu$. Then, the topology induced by $D_{KL}$ is coarser than the topology induced by $D$.

Proposition 3.
Let $D$ be as above and let $\mathcal{F} \subset \mathcal{M}$ be such that $\forall \mu, \nu \in \mathcal{F}$, $D_{KL}(\mu, \nu) < \infty$. Then, the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the topology induced by $D$. More precisely, we have that

$D(\lambda, \mu) < D(\lambda, \nu) \Rightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu).$

### $\alpha$-affinity and $\alpha$-divergence

We define the $\alpha$-affinity between two probability measures as the expectation of another transform of the likelihood ratio. Let $\mu, \nu$ be two probability measures dominated by $\lambda$, with $d\mu = f d\lambda$ and $d\nu = g d\lambda$. Given $0 < \alpha < 1$, the $\alpha$-affinity between $\mu$ and $\nu$ is

$A_\alpha(\mu, \nu) = \int \left(\frac{g}{f}\right)^\alpha d\mu = \int g^\alpha f^{1-\alpha} d\lambda.$

Proposition 4.
For all $0 < \alpha < 1$, we have that

1. $A_\alpha(\mu, \nu) \le 1$ with equality if and only if $\mu = \nu$;

2. $A_\alpha$ is monotone in $\alpha$ and jointly concave in its arguments;

3. $A_\alpha$ is jointly multiplicative under products:

$A_\alpha (\mu^{(n)}, \nu^{(n)}) = \left(A_{\alpha}(\mu, \nu)\right)^n.$

4. if $\frac{1}{2} \leq \alpha$, then

$A_{\frac{1}{2}} \le A_\alpha \le \left(A_{\frac{1}{2}}\right)^{2(1-\alpha)};$

Proof.
1-2 follow from Jensen’s inequality and the joint concavity of $(x,y) \mapsto x^\alpha y^{1-\alpha}$. 3 follows from Fubini’s theorem. For 4, the first inequality is a particular case of 2, and Hölder’s inequality yields

$A_{\alpha}(\mu, \nu) = \int (fg)^{1-\alpha} g^{2\alpha - 1} d\lambda \le \left( \int \sqrt{fg} \,d\lambda \right)^{2-2\alpha} \left( \int g \,d\lambda \right)^{2\alpha - 1} = \left(A_{\frac{1}{2}}(\mu, \nu)\right)^{2(1-\alpha)}.$

$\Box$
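Some of these properties are easy to check numerically on a finite space. The sketch below assumes that $\lambda$ is the counting measure on three points, so that $f$ and $g$ are probability vectors; the helper names `affinity` and `product_measure` and the example vectors are hypothetical. It checks the bound $A_\alpha \le 1$, the multiplicativity under products (with $n = 3$) and the upper bound obtained from Hölder’s inequality.

```python
import math
from itertools import product


def affinity(p, q, alpha):
    """A_alpha(mu, nu) = sum_x q(x)^alpha * p(x)^(1 - alpha), where p and q
    are the probability vectors of mu and nu respectively."""
    return sum((qx ** alpha) * (px ** (1.0 - alpha)) for px, qx in zip(p, q))


def product_measure(p, n):
    """Probability vector of the n-fold product measure, one entry per n-tuple."""
    return [math.prod(p[i] for i in idx)
            for idx in product(range(len(p)), repeat=n)]


mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]

a_half = affinity(mu, nu, 0.5)
for alpha in (0.5, 0.6, 0.75, 0.9):
    a = affinity(mu, nu, alpha)
    a_prod = affinity(product_measure(mu, 3), product_measure(nu, 3), alpha)
    # Columns: alpha, A_alpha, A_alpha^3, affinity of the 3-fold products,
    # and the Hoelder upper bound A_{1/2}^{2(1 - alpha)}.
    print(alpha, a, a ** 3, a_prod, a_half ** (2 * (1 - alpha)))
```

The third and fourth printed columns agree up to rounding, and the second never exceeds the last.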

The $\alpha$-divergence $D_\alpha$ is obtained as

$D_\alpha = 1 - A_\alpha.$

Other similar divergences considered in the literature are

$\frac{1-A_\alpha}{\alpha(1-\alpha)}\; \text{ and }\; \frac{-\log A_\alpha}{\alpha(1-\alpha)},$

but we prefer $D_\alpha$ for its simplicity. When $\alpha = \frac{1}{2}$, it is closely related to the Hellinger distance

$H(\mu, \nu) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^2d\lambda\right)^{\frac{1}{2}}$

through

$D_{\frac{1}{2}}(\mu, \nu) = \frac{H(\mu, \nu)^2}{2}.$

Other important and well-known inequalities are given below.

Proposition 5.
We have

$D_{\frac{1}{2}}(\mu, \nu) \le \|\mu-\nu\|_{TV} \le \sqrt{2 D_{\frac{1}{2}}(\mu, \nu)}$

and

$2D_{\frac{1}{2}}(\mu, \nu) \le D_{KL}(\mu, \nu) \le 2\left\|\frac{f}{g}\right\|_\infty \|\mu-\nu\|_{TV}.$

This, together with Proposition 4 (4), yields similar bounds for the other divergences.
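As a sanity check, the inequalities of Proposition 5 and the relation between $D_{\frac{1}{2}}$ and the Hellinger distance can be verified on a small example. This is only a sketch: it assumes that $\lambda$ is the counting measure on three points, and the function names and the two probability vectors are arbitrary choices.

```python
import math


def tv(p, q):
    """Total variation distance sup_A |p(A) - q(A)| = half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(p, q):
    """Kullback-Leibler divergence, assuming q(x) > 0 wherever p(x) > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def d_half(p, q):
    """D_{1/2} = 1 - A_{1/2} = 1 - sum_x sqrt(p(x) q(x))."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


mu = [0.5, 0.3, 0.2]     # density f with respect to the counting measure
nu = [0.25, 0.25, 0.5]   # density g, strictly positive so that f/g is bounded

sup_ratio = max(a / b for a, b in zip(mu, nu))   # ||f/g||_infinity

print(d_half(mu, nu), hellinger(mu, nu) ** 2 / 2)
print(d_half(mu, nu), tv(mu, nu), math.sqrt(2 * d_half(mu, nu)))
print(2 * d_half(mu, nu), kl(mu, nu), 2 * sup_ratio * tv(mu, nu))
```

The first line prints the same number twice (up to rounding), and each of the next two lines is increasing from left to right.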

## Finite models

Let $\Pi$ be a prior on $\mathcal{M}$ that is finitely supported. That is, $\Pi = \sum_{i=1}^n p_i \delta_{\mu_i}$ for some $\mu_i \in \mathcal{M}$ and $p_i > 0$ with $\sum_i p_i = 1$. Suppose that $x_1, x_2, x_3, \dots$ independently follow some $\mu_* \in \mathcal{M}$.

The following proposition ensures that as data is gathered, the posterior distribution of $\Pi$ concentrates on the measures $\mu_i$ that are closest to $\mu_*$.

Proposition 6.
Let $A_{\varepsilon} = \{\mu_i \,|\, D_{KL}(\mu_*, \mu_i) < \varepsilon \}$. If $A_\varepsilon \not = \emptyset$, then

$\Pi(A_\varepsilon \,|\, \{x_i\}_{i=1}^m) \rightarrow 1$

almost surely as $m \rightarrow \infty$.
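The concentration phenomenon is easy to observe in a simulation. The sketch below makes several assumptions for the sake of illustration: the prior is uniform over three Bernoulli models, the data come from a Bernoulli distribution with parameter $0.45$ (which is not in the support of the prior), and the seed and sample sizes are arbitrary. Among the three models, the one with parameter $0.5$ minimizes $D_{KL}(\mu_*, \mu_i)$.

```python
import math
import random


def log_likelihood(theta, data):
    """Log-likelihood of i.i.d. Bernoulli(theta) observations."""
    ones = sum(data)
    return ones * math.log(theta) + (len(data) - ones) * math.log(1.0 - theta)


def posterior(prior, thetas, data):
    """Posterior weights over finitely many Bernoulli models, in log space."""
    log_post = [math.log(p) + log_likelihood(t, data)
                for p, t in zip(prior, thetas)]
    shift = max(log_post)                       # log-sum-exp stabilization
    weights = [math.exp(lp - shift) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


thetas = [0.2, 0.5, 0.8]          # hypothetical models mu_1, mu_2, mu_3
prior = [1 / 3, 1 / 3, 1 / 3]     # the weights p_i
true_theta = 0.45                 # mu_* lies outside the support of the prior

rng = random.Random(1)            # fixed seed for reproducibility
data = [1 if rng.random() < true_theta else 0 for _ in range(2000)]

for size in (10, 100, 2000):
    print(size, posterior(prior, thetas, data[:size]))
```

As the sample size grows, the posterior weight of the model with parameter $0.5$ approaches $1$, even though the model is misspecified.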