# Posterior consistency under (possible) misspecification

We assume, without too much loss of generality, that our priors are discrete. When dealing with Hellinger separable density spaces, it is possible to discretize posterior distributions to study consistency (see this post about it).

Let $\Pi$ be a prior on a countable space $\mathcal{N} = \{f_1, f_2, f_3, \dots\}$ of probability density functions, with $\Pi(f) > 0$ for all $f \in \mathcal{N}$. Data $X_1, X_2, X_3, \dots$ follows (independently) some unknown distribution $P_0$ with density $f_0$.

We denote by $D_{KL}(f_0, f) = \int f_0 \log\frac{f_0}{f}$ the Kullback-Leibler divergence and we let $D_{\frac{1}{2}}(f_0, f) = 1 - \int \sqrt{f_0 f}$ be half of the squared Hellinger distance.

The following theorem states that the posterior distribution of $\Pi$ accumulates in Hellinger neighborhoods of $f_0$, assuming the prior is root-summable (i.e. $\sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty$ for some $\alpha > 0$) . In the well-specified case (i.e. $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0$), the posterior accumulates in any neighborhood of $f_0$. In the misspecified case, small neighborhoods of $f_0$ could be empty, but the posterior distribution still accumulates in sufficiently large neighborhoods (how large exactly is a function of $\alpha$ and $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f)$).

The result was essentially stated by Barron (Discussion: On the Consistency of Bayes Estimates, 1986). In the case where $\Pi$ is not necessarily discrete, a similar result was obtained, through a discretization argument, by Walker (Bayesian asymptotics with misspecified models, 2013). See also Xing (Sufficient conditions for Bayesian consistency, 2009) for a thorough treatment of Bayesian consistency using the same method of proof.

Theorem (Barron).
Suppose $\beta_0 :=\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) < \infty$ and that

$\alpha := \inf \left\{ p \in [\tfrac{1}{2},1] \,|\, \sum_{f \in \mathcal{N}} \Pi(f)^p < \infty \right\} < 1.$

If $\varepsilon > 0$ is such that

$\varepsilon > 1- \exp\left( \frac{-\beta_0 \alpha}{2(1-\alpha)} \right)$

and if $A_\varepsilon := \{f \in \mathcal{N} \,|\, D_{\frac{1}{2}} (f_0, f) < \varepsilon\} \not = \emptyset$, then

$\Pi\left(A_\varepsilon \,|\, \{x_i\}_{i=1}^n\right) \rightarrow 1$

almost surely as $n \rightarrow \infty$.

Remarks.

1 – If $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0$, then any $\varepsilon > 0$ can be used.

2 – $\alpha$ is related to the rate of convergence of $\rho(n) := \Pi(f_n)$ towards $0$. The quantity $H_\alpha (\Pi) = \log \sum_{f \in \mathcal{N}} \Pi(f)^\alpha$ can be thought as measure of entropy.

3 – Walker (2013) considered the case $\sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty$ for some $\alpha < \frac{1}{2}$. This implies that $\sum_{f \in \mathcal{N}} \sqrt{\Pi(f)} < \infty$ and the above theorem can also be applied in this case.

## Demonstration

The proof is brief. I do not dwell on explanations.

Background.
First let me recall some concepts. The $\alpha$-affinity between two densities $f$ and $g$ is defined as

$A_\alpha(f, g) = \int g^\alpha f^{1-\alpha}. \qquad (1)$

Note that $0 \le A_{\alpha}(f, g) \le 1$ and that $A_\alpha(f, g) = 1$ if and only if $f = g$. Furthermore, when $\alpha \geq \frac{1}{2}$, Holder’s inequality and Jensen’s inequality yield

$A_{\frac{1}{2}}(f, g) \le A_{\alpha}(f, g) \le \left(A_{\frac{1}{2}}(f,g)\right)^{2(1-\alpha)}. \qquad (2)$

Proof.
We can now begin the proof. Let $\tilde{\alpha}$ be such that $1 > \tilde{\alpha} > \alpha$. Then, we have

If $\beta > 0$ and $g \in \mathcal{N}$ is such that $D_{KL}(f_0, g) < \beta_0 + \beta$, then

almost surely. Furthermore, using (2), we find

Here $\text{cst.}$ is a constant. Since $\beta > 0$ is arbitrary and since $\tilde \alpha$ can be taken so that $2(1-\tilde \alpha) \log (1-\varepsilon) + \tilde \alpha \beta_0 < 0$, we obtain that $(**)$ converges exponentially fast towards $0$. Hence, by the Borel-Cantelli lemma, we have

almost surely. This, together with $(3)$, implies that $(*) \rightarrow 0$ almost surely. $\Box$

# The choice of prior in bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

# The structure of $\mathcal{M}$

Recall that $\mathbb{M}$ is is a Polish space (ie. a complete and separable metric space). It is endowed with its borel $\sigma$-algebra $\mathfrak{B}$ which is the smallest family of subsets of $\mathbb{M}$ that contains its topology and that is closed under countable unions and intersections. All subsets of $\mathbb{M}$ we consider in the following are supposed to be part of $\mathfrak{B}$. A probability measure on $\mathbb{M}$ is a function $\mu : \mathfrak{B} \rightarrow [0,1]$ such that for any countable partition $A_1, A_2, A_3, \dots$ of $\mathbb{M}$ we have that $\sum_{i=1}^\infty \mu(A_i) = 1.$ The set $\mathcal{M}$ consists of all such probability measures.

Note that since $\mathbb{M}$ is complete and separable, every probability measure $\mu \in \mathcal{M}$ is regular (and tight). It means that the measure of any $A\subset \mathbb{M}$ can be well approximated from the measure of compact subsets of $A$ as well as from the measure of open super-sets of $A$:

$\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.$

## Metrics on $\mathcal{M}$

Let me review some facts. A natural metric used to compare the mass allocation of two measures $\mu, \nu \in \mathbb{M}$ is the total variation distance defined by

$\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.$

It is relatively straightforward to verify that $\mathcal{M}$ is complete under this distance, but it is not in general separable. To see this, suppose that $\mathbb{M} = [0,1]$. If a ball centered at $\mu$ contains a dirac measure $\delta_x$, $x \in [0,1]$, then $\mu$ must have a point mass at $x$. Yet any measure contains at most a countable number of point masses, and there is an uncountable number of dirac measures on $[0,1]$. Thus no countable subset of $\mathcal{M}$ can cover $\mathcal{M}$ up to an $\varepsilon$ of error.

This distance can be relaxed to the Prokhorov metric, comparing mass allocation up to $\varepsilon$-neighborhoods. It is defined as

$d_P(\mu, \nu) = \inf \left\{ \varepsilon > 0 \,|\, \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon \text{ and } \nu(A) \le \mu(A^{\varepsilon}) + \varepsilon,\; \forall A \subset \mathbb{M} \right\},$

where $A^{\varepsilon} = \{x \in \mathbb{M} \,|\, d(x, A) < \varepsilon\}$ is the $\varepsilon$-neighborhood of $A$. It is a metrization of the topology of weak convergence of probability measures, and $\mathcal{M}$ is separable under this distance.

The compact sets of $\mathcal{M}$ under the Prokhorov metric admit a simple characterization given by the Prokhorov theorem: $P \subset \mathcal{M}$ is precompact if and only if $P$ is uniformly tight (for each $\varepsilon > 0$, there exists a compact $K \subset X$ such that $\sup_{\mu \in P} \mu(K) \geq 1-\varepsilon$). This means that a sequence $\{\mu_n\} \subset \mathcal{M}$ admits a weakly convergent subsequence if and only if $\{\mu_n\}$ is uniformly tight.

Characterizations of weak convergence are given by the Portemanteau theorem, which says in particular that $\mu_n$ converges weakly to $\mu$ if and only if

$\int f d\mu_n \rightarrow \int f d\mu$

for all continuous and bounded and continuous $f$. It is also equivalent to

$\mu_n(A) \rightarrow \mu(A)$

for all sets $A$ such that $\mu(\partial A) = 0$.

## Measures of divergence

In addition to metrics, that admit a geometric interpretation through the triangle inequality, statistical measures of divergence can also be considered. Here, we consider functions $D : \mathcal{M}\times \mathcal{M} \rightarrow [0, \infty]$ that can be used to determine the rate of convergence of the likelihood ratio

$\prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0,$

where $x_i \sim \nu$ and $\mu, \nu \in \mathcal{M}$.

### Kullback-Leibler divergence

The weight of evidence in favor of the hypothesis “$\lambda = \mu$” versus “$\lambda = \nu$” given a sample $x$ is defined as

$W(x) = \log\frac{d\mu}{d\nu}.$

It measures how much information about the hypotheses is brought by the observation of $x$. (For a justification of this interpretation, see Good (Weight of evidence: a brief survey, 1985).) The Kullback-Leibler divergence $D_{KL}$ between $\mu$ and $\nu$ is defined as the expected weight of evidence given that $x \sim \mu$:

$D_{KL}(\mu, \nu) =\mathbb{E}_{x \sim \mu} W(x) = \int \log \frac{d\mu}{d\nu} d\mu.$

The following properties of the Kullback-Leibler divergence support its interpretation as an expected weight of evidence.

Theorem 1 (Kullback and Leibler, 1951).
We have

1. $D_{KL}(\mu, \nu) \geq 0$ with equality if and only if $\mu = \nu$;
2. $D_{KL}(\mu T^{-1}, \nu T^{-1}) \geq D_{KL}(\mu, \nu)$ with equality if and only if $T: \mathbb{M} \rightarrow \mathbb{M}'$ is a sufficient statistic for $\{\mu, \nu\}$.

Furthermore, the KL divergence can be used to precisely identify exponential rates of convergence of the likelihood ratio. The first part of the next proposition says that $D_{KL}(\lambda, \nu)$ is finite if and only if the likelihood ratio $\prod_{i} \frac{d\nu}{d\lambda}(x_i)$, $x_i \sim \lambda$ cannot convergence super-exponentially fast towards $0$. The second part identifies the rate of convergence then the KL divergence is finite.

Proposition 2.
Let $x_1, x_2, x_3, \dots \sim \lambda$ (independently). The KL divergence $D_{KL}(\lambda, \nu)$ is finite if and only if there exists an $\alpha > 0$ such that

$e^{n\alpha} \prod_{i=1}^n \frac{d\nu}{d\lambda}(x_i) \rightarrow \infty$

with positive probability.

Finally, suppose we are dealing with a submodel $\mathcal{F} \subset \mathcal{M}$ such that the rates of convergences of the likelihood ratios in $\mathcal{F}$ are of an exponential order. By the previous proposition, this is equivalent to the fact that $\forall \mu, \nu \in \mathcal{F}$, $D_{KL}(\mu, \nu) < \infty$. We can show that the KL divergence is, up to topological equivalence, the best measure of divergence that determines the convergence of the likelihood ratio. That is, suppose $D: \mathcal{F} \times \mathcal{F}\rightarrow [0, \infty]$ is such that

$D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i) \rightarrow 0$

at an exponential rate, almost surely when $x_i \sim \lambda$, and that $D(\lambda, \mu) = 0$ if and only if $\lambda = \mu$. Then, the topology induced by $D_{KL}$ is coarser than the topology induced by $D$.

Proposition 3.
Let $D$ be as above and let $\mathcal{F} \subset \mathcal{M}$ be such that $\forall \mu, \nu \in \mathcal{F}$, $D_{KL}(\mu, \nu) < \infty$. Then, the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the topology induced by $D$. More precisely, we have that

$D(\lambda, \mu) < D(\lambda, \nu) \Rightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu).$

### alpha-affinity and alpha-divergence

We define the $\alpha$-affinity between two probability measures as the expectancy of another transform of the likelihood ratio. Let $\mu, \nu$ be two probability measures dominated by $\lambda$, with $d\mu = f d\lambda$ and $d\nu = g d\lambda$. Given $0 < \alpha < 1$, the $\alpha$-affinity between $\mu$ and $\nu$ is

$A_\alpha(\mu, \nu) = \int \left(\frac{g}{f}\right)^\alpha d\mu = \int g^\alpha f^{1-\alpha} d\lambda.$

Proposition 4.
For all $0 < \alpha < 1$, we have that

1. $A_\alpha(\mu, \nu) \le 1$ with equality if and only if $\mu = \nu$;

2. $A_\alpha$ is monotonous in $\alpha$ and jointly concave in its arguments;

3. $A_\alpha$ is jointly multiplicative under products:

$A_\alpha (\mu^{(n)}, \nu^{(n)}) = \left(A_{\alpha}(\mu, \nu)\right)^n.$

4. if $\frac{1}{2} \leq \alpha$, then

$A_{\frac{1}{2}} \le A_\alpha \le \left(A_{\frac{1}{2}}\right)^{2(1-\alpha)};$

Proof.
1-2 follow from Jensen’s inequality and the joint concavity of $(x,y) \mapsto x^\alpha y^{1-\alpha}$. 3 follows from Fubini’s theorem. For
(iv), the first inequality is a particular case of 2 and Hölder’s inequality finally yields

$A_{\alpha}(\mu, \nu) = \int (fg)^{1-\alpha} g^{2\alpha - 1} d\lambda \le \left( \int \sqrt{fg} \,d\lambda \right)^{2-2\alpha} = A_{\frac{1}{2}}(\mu, \nu).$

$\Box$

The $\alpha$-divergence $D_\alpha$ is obtained as

$D_\alpha = 1 - A_\alpha.$

Other similar divergences considered in the litterature are

$\frac{1-A_\alpha}{\alpha(1-\alpha)}\; \text{ and }\; \frac{\log A_\alpha}{\alpha(1-\alpha)},$

but we prefer $D_\alpha$ for its simplicity. When $\alpha = \frac{1}{2}$, it is closely related to the hellinger distance

$H(\mu, \nu) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^2d\lambda\right)^{\frac{1}{2}}$

through

$D_{\frac{1}{2}}(\mu, \nu) = \frac{H(\mu, \nu)^2}{2}.$

Other important and well-known inequalities are given below.

Proposition 5.
We have

$D_{\frac{1}{2}}(\mu, \nu) \le \|\mu-\nu\|_{TV} \le \sqrt{2 D_{\frac{1}{2}}(\mu, \nu)}$

and

$2D_{\frac{1}{2}}(\mu, \nu) \le D_{KL}(\mu, \nu) \le 2\left\|\frac{f}{g}\right\|_\infty \|\mu-\nu\|_{TV}.$

This, together with proposition 4 (4) , yields similar bounds for the other divergences.

## Finite models

Let $\Pi$ be a prior on $\mathcal{M}$ that is finitely supported. That is, $\Pi = \sum_{i=1}^n p_i \delta_{\mu_i}$ for some $\mu_i \in \mathcal{M}$ and $p_i > 0$ with $\sum_i p_i = 1$. Suppose that $x_1, x_2, x_3, \dots$ independently follow some $\mu_* \in \mathcal{M}$.

The following proposition ensures that as data is gathered, the posterior distribution of $\Pi$ concentrates on the measures $\mu_i$ that are closest to $\mu_*$.

Proposition 6.
Let $A_{\varepsilon} = \{\mu_i \,|\, D_{KL}(\mu_*, \mu_i) < \varepsilon \}$. If $A_\varepsilon \not = \emptyset$, then

$\Pi(A_\varepsilon \,|\, \{x_i\}_{i=1}^m) \rightarrow 1$

almost surely as $m \rightarrow \infty$.

# Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

## The problem.

Let $f, g, h$ be three densities and suppose that, $x_i \sim h$, $i \in \mathbb{N}$, independently. What happens to the likelihood ratio

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$

as $n\rightarrow \infty$?

Clearly, it depends. If $h = g \not = f$, then

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$

almost surely at an exponential rate. More generally, if $h$ is closer to $g$ than to $f$, in some sense, we’d expect that $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$. Such a measure of “closeness” of “divergence” between probability distributions is given by the Kullback-Leibler divergence

$D_{KL}(f, g) = \int f \log\frac{f}{g}.$

It can be verified that $D_{KL}(f,g) \geq 0$ with equality if and only if $f=g$, and that

$D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)$

almost surely at an exponential rate. Thus the K.L.-divergence can be used to solve our problem.

## Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in $(1)$ (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

This is of interest in Bayesian nonparametrics. The hypothesis that a density $f_0$ is in the Kullback-Leibler support of a prior $\Pi$ on a density space $\mathbb{F}$ is used to ensure that

$\int_{\mathbb{F}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi(df)$

does not converge to $0$ exponentially fast as $n\rightarrow \infty$ and $x_i \sim f_0$. Under the conditions we specify, our remark implies that the hypothesis that “$f_0$ is in the Kullback-Leibler support of $\Pi$” may not be replaced by a weaker one.

# Statement of the remark

Some notations. Let $\mathcal{M}$ the space of all probability measures on some measurable space $\mathbb{M}$. If $\mu, \nu \in \mathcal{M}$, then both are absolutely continuous with respect to $\tau = \mu + \nu$ and possess densities $f, g$ such that $d\mu = fd\tau$ and $d\nu gd\tau$. We denote by $d\mu /d\nu$ the ratio of densities $f / g$, which in fact does not depend on the choice of dominating measure $\tau$. The likelihood ratio $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$ is abbreviated to $\frac{d\mu}{d\nu}(X)$, depending implicitely on $n$. The Kullback-Leibler divergence between $\mu$ and $\nu$ is

$D_{KL}(\mu, \nu) = \int \log\frac{d\mu}{d\nu} d\mu.$

We let $D$ be any other function $\mathcal{M}\times \mathcal{M} \rightarrow [0,\infty]$ such that $D(\mu, \nu) = 0$ iff $\mu = \nu$ and such that if $D(\lambda, \nu) < D(\lambda, \mu)$, then there exists a $\varepsilon > 0$ with

$e^{\varepsilon n} \frac{d\mu}{d\nu}(X) = e^{\varepsilon n} \prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0$

almost surely as $n \rightarrow \infty$. The topology on a subset $\mathcal{F} \subset \mathcal{M}$ induced by such a function $D$ is the topology induced by the sets

$\{\nu \,|\, D(\mu, \nu) < \varepsilon\},\quad \mu \in \mathcal{F}, \, \varepsilon > 0.$

The remark below shows that any exponential rate of convergence of the likelihood ratio is picked up by the KL divergence. It is rather obvious (albeit a bit technical), but I thought it was worth writing it up properly.

Remark.
Let $x_i \sim \mu$, $i \in \mathbb{N}$, independently, and let $\frac{d\nu}{d\mu}(X) = \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i)$.

1. We have that $D_{KL}(\mu, \nu) < \infty$ if and only if $\frac{d\nu}{d\mu}(X)$ does not converge more than exponentially fast to $0$ (i.e. there exists $\varepsilon > 0$ such that $e^{\varepsilon n}\frac{d\nu}{d\mu} (X) \rightarrow \infty$).
2. If $\mathcal{F} \subset \mathcal{M}$ is such that $D_{KL}(\mu, \nu) < \infty$ for all $\mu, \nu \in \mathcal{F}$, then

$D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu)$

and the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the topology of any other function $D$ defined as above.

Proof of 1.
Suppose that $D_{KL}(\mu, \nu) < \infty$. Then, since by the strong law of large numbers $D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow \varepsilon > 0$, we find that

$\log\left(e^{n (D_{KL}(\mu, \nu) + \varepsilon)} \frac{d\nu}{d\mu}(X)\right) = n (D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)) \rightarrow \infty$

for all $\varepsilon > 0$.

If $D_{KL}(\mu, \nu) = \infty$, then for all $\varepsilon > 0$ we have

$\log \left(e^{n \varepsilon}\frac{d\nu}{d\mu}(X)\right) = n\left(\varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu(x_i)}{d\nu(x_i)}\right) \rightarrow -\infty$

since $\frac{1}{n}\sum_{i=1}^n \log\frac{d\mu(x_i)}{d\nu(x_i)} \rightarrow \infty$. $\Box$

Proof of 2.
Suppose that there exists $\lambda, \mu, \nu \in \mathcal{F}$ such that $D(\lambda, \mu) < D(\lambda, \nu)$ but $D_{KL}(\lambda, \mu) = D_{KL}(\lambda, \nu)$. Then, there is a $\varepsilon > 0$ such that

$e^{2\varepsilon n} \frac{d\nu}{d\mu}(X) = e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) /\left( e^{n(d_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \right) \rightarrow 0.$

But $e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) \rightarrow \infty$ and $e^{n(d_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \rightarrow 0$, which yields the contradiction.

Since $D(\lambda, \mu) =0$ iff $\lambda = \mu$, this implies that the topology on $\mathcal{F}$ induced by $D_{KL}$ is weaker than the one induced by $D$. $\Box$

# The choice of prior in bayesian nonparametrics – Introduction

In preparation for the 11th Bayesian nonparametrics conference, I’m writing (and rewriting) notes on the background of our research (i.e. some of the general theory of bayesian nonparametrics). There are some good books on the subject (such as Bayesian Nonparametrics (Ghosh and Ramamoorthi, 2003)), but I wanted a more introductory focus and to present Choi and Ramamoorthi’s very clear point of view on posterior consistency (Remarks on the consistency of posterior distributions, 2008).

# 1. Introduction

Let $\mathbb{X}$ be a complete and separable metric space and let $\mathcal{M}$ be the space of all probability measures on $\mathbb{X}$. Some unknown distribution $P_0\in \mathcal{M}$ is generating observable data $\mathcal{D}_n = (X_1, X_2, \dots, X_n) \in \mathbb{X}^n$, where each $X_i$ is independently drawn from $P_0$. The problem is to learn about $P_0$ using only $\mathcal{D}_n$ and prior knowledge.

Example (Discovery probabilities).
A cryptographer observes words, following some distribution $P_0$, in an unknown countable language $\mathcal{L}$. What are the $P_0$-probabilities of the words observed thus far? What is the probability that the next word to be observed has never been observed before?

## 1.1 Learning and uncertainty

We need an employable definition of learning. As a first approximation, we can consider learning to be the reduction of uncertainty about what is $P_0$. This requires a quantification of how uncertain we are to begin with. Then, hopefully, as data is gathered out uncertainty decreases and we are able to pinpoint $P_0$.

This is the core of Bayesian learning, alghough our definition is not yet entirely satisfactory. There are some difficulties with this idea of quantifying uncertainty, at least when using information-theoric concepts. The solution we adopt here is the use of probabilities to quantify uncertain knowledge (bayesians would also talk of subjective probabilities quantifying rational belief). For example, you may know that a coin flip is likely to be fair, although it is not impossible the two sides of the coin are both the same. This is uncertain knowledge about the distribution of heads and tails in the coin flips, and you could assign probabilities to the different possibilities.

More formally, prior uncertain knowledge about what is $P_0$ is quantified by a probability measure $\Pi$ on $\mathcal{M}$. For any $A \subset \mathcal{M}$, $\Pi(A)$ is the the prior probability that “$P_0 \in A$“. Then, given data $\mathcal{D}_n$, prior probabilities are adjusted to posterior probabilities: $\Pi$ becomes $\Pi_n$, the conditional distribution of $\Pi$ given $\mathcal{D}_n$. The celebrated Bayes’ theorem provides a formula to calculate $\Pi_n$ from $\Pi$ and $\mathcal{D}_n$. Thus we have an operational definition of learning in our statistical framework.

Learning is rationally adjusting uncertain knowledge in the light of new information.

For explanations as to why probabilities are well suited to the representation of uncertain knowledge, I refer the reader to Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988). We will also see that the operation of updating the prior to posterior posterior probabilities does work as intended.

## 1.2 The choice of prior

Specifying prior probabilities, that is quantifying prior uncertain knowledge, is not a simple task. It is especially difficult when uncertainty is over the non-negligeable part $\mathcal{M}$ of an infinite dimensional vector space. Fortunately, “probability is not about numbers, it is about the structure of reasoning”, as Glenn Shafer puts it (cited in Pearl, 1988, p.15). The exact numbers given to the events “$P_0 \in A$” are not of foremost importance; what matters is how probabilities are more qualitatively put together, and how this relates to the learning process.

Properties of prior distributions, opening them to scrutiny, criticism and discussion, must be identified and related to what happens as more and more data is gathered.

Part 2.

# Sudden jumps and almost surely slow convergence

Estimating a mean can be an arduous task. Observe, for instance, the “sudden jumps” in trajectories of the Monte Carlo estimate $\overline X_n$ of $\mathbb{E}X$, where $\mathbb{E}X^2 = \infty$.

Proposition.
Let $X_1, X_2, X_3, \dots$ be a sequence of i.i.d. and integrable random variables, and let $\overline X_k = \frac{1}{k}\sum_{i=1}^k X_i$. For all $M > 0$ we have that

$P\left( \left| \overline X_{n} - \overline X_{n-1}\right| > Mn^{-1/2} \;\; \text{ i.o.} \right)$

is $1$ or $0$ according to $\text{Var}(X_n) =\infty$ or $\text{Var}(X_n) < \infty$.

Corollary.
If $\text{Var}(X_k) = \infty$, then $\overline X_n$ converges towards $\mathbb{E}X_k$ at a rate almost surely slower than $\frac{1}{\sqrt{n}}$ :

$\left|\overline X_n - \mathbb{E}X_n \right| \not = \mathcal{O}\left(n^{-1/2}\right) \quad \text{a.s.}$

Proof of the proposition.
We can suppose that $M=1/2$, by the scale invariance of the problem. Note also that

$|\overline X_{n} - \overline X_{n-1}| = |X_{n} - \overline X_{n-1}|/n \geq |X_{n}|/n - \left|\overline X_{n-1}\right|/n,$

from which we find

$P\left(\left| \overline X_n - \overline X_{n-1} \right| > \frac{1}{2\sqrt{n}}\; \text{ i.o.} \right) \geq P\left( | X_n | > \frac{\sqrt{n}}{2} + \left|\overline X_{n-1}\right|\; \text{i.o}\right) \geq P\left(|X_n| > \sqrt{n}\; \text{ i.o}\right)$

since $|\overline X_{n-1}| = o(1)$ almost surely. From the Borel-Cantelli lemma and

$\infty = \mathbb{E}X_k^2 = \int_0^\infty P\left(X_k^2 > u\right) du \le 1+\sum_{n=1}^{\infty} P\left(X_{n}^2 > n\right),$

we obtain that $P\left(\left| \overline X_n - \overline X_{n-1} \right| > \frac{1}{2\sqrt{n}}\; \text{ i.o.} \right) = 1$ in this case.

When $\text{Var}(X_n) < \infty$, we let $M=2$ and note that $|\overline X_{n} - \overline X_{n-1}| \le |X_{n}|/n + \left|\overline X_{n-1}\right|/n$. Therefore

Since the sum over $n$ of the right hand side probabilities converges (again using the fact that $\mathbb{E}X_n^2 = \int_{0}^\infty P(X_n^2 > u) du$), the Borel-Cantelli lemma yields the result. $\Box$

Remark.
The result can be refined knowing that $\mathbb{E} |X_k|^p = \infty$ for some small $1 < p$: $\mathbb{E} X_k^{1/(1-\varepsilon)} = \infty$ implies $P\left(|\overline X_{n} - \overline X_{n-1}| > n^{-\varepsilon}\; \text{i.o.}\right) =1$. The almost sure lower bound on the rate of convergence can be correspondingly adjusted.

Many thanks to Xi’an who wondered about the occurence of these jumps.

# Comment on The Sample Size Required in Importance Sampling

I summarize and comment part of The Sample Size Required in Importance Sampling (Chatterjee and Diaconis, 2015). One innovative idea is to bound the mean estimation error in terms of the tail behavior of $d\mu/d\lambda$, where $\mu$ and $\lambda$ are the importance sampling target and proposal distributions, respectively.

The problem is to evaluate

$I = I(f) = \int f d\mu,$

where $\mu$ is a probability measure on a space $\mathbb{M}$ and where $f: \mathbb{M} \rightarrow \mathbb{R}$ is measurable. The Monte-Carlo estimate of $I$ is

$\frac{1}{n}\sum_{i=1}^n f(x_i), \qquad x_i \sim \mu.$

When it is too difficult to sample $\mu$, for instance, other estimates can be obtained. Suppose that $\mu$ is absolutely continuous with respect to another probability measure $\lambda$, and that the density of $\mu$ with respect to $\lambda$ is given by $\rho$. Another unbiaised estimate of $I$ is then

$I_n(f) = \frac{1}{n}\sum_{i=1}^n f(y_i)\rho(y_i),\qquad y_i \sim \lambda.$

This is the general framework of importance sampling, with the Monte-Carlo estimate recovered by taking $\lambda = \mu$. An important question is the following.

How large should $n$ be for $I_n(f)$ to be close to $I(f)$?

An answer is given, under certain conditions, by Chatterjee and Diaconis (2015). Their main result can be interpreted as follows. If $X \sim \mu$ and if $\log \rho(X)$ is concentrated around its expected value $L=\text{E}[\log \rho(X)]$, then a sample size of approximately $n = e^{L}$ is both necessary and sufficient for $I_n$ to be close to $I$. The exact sample size needed depends on $\|f\|_{L^2(\mu)}$ and on the tail behavior of $\log\rho(X)$. I state below their theorem with a small modification.

Theorem 1. (Chatterjee and Diaconis, 2015)
As above, let $X \sim \mu$. For any $a > 0$ and $n \in \mathbb{N}$,

$\mathbb{E} |I_n(f) - I(f)| \le \|f\|_{L^2(\mu)}\left( \sqrt{a/n} + 2\sqrt{\mathbb{P} (\rho(X) > a)} \right).$

Conversely, for any $\delta \in (0,1)$ and $b > 0$,

$\mathbb{P}(1 - I_n(1) \le \delta) \le \frac{n}{b} + \frac{\mathbb{P}(\rho(X) \le b)}{1-\delta}.$

Remark 1.
Suppose $\|f\|_{L^2(\mu)} \le 1$ and that $\log\rho(X)$ is concentrated around $L = \mathbb{E} \log\rho(X)$, meaning that for some $t > 0$ we have that $\mathbb{P}(\log \rho(X) > L+t/2)$ and $\mathbb{P}(\log\rho(X) < L-t/2)$ are both less than an arbitrary $\varepsilon > 0$. Then, taking $n \geq e^{L+t}$ we find

$\mathbb{E} |I_n(f) - I| \le e^{-t/4} + 2\varepsilon.$

However, if $n \leq e^{L-t}$, we obtain

$\mathbb{P}\left(1 - I_n(1) \le \tfrac{1}{2}\right) \le e^{-t/2} + 2 \varepsilon.$

meaning that there can be a  high probability that $I(1)$ and $I_n(1)$ are not close.

Remark 2.
Let $\lambda = \mu$, so that $\rho = 1$. In that case, $\log\rho(X)$ only takes its expected value $0$. The theorem yields

$\mathbb{E} |I_n(f) - I(f)| \le \frac{\|f\|_{L^2(\mu)}}{\sqrt{n}}$

and no useful bound on $\mathbb{P}(1-I_n(1) \le \delta)$.

Comment.
For the theorem to yield a sharp cutoff, it is necessary that $L = \mathbb{E} \log\rho(X)$ be relatively large and that $\log\rho(X)$ be highly concentrated around $L$. The first condition is not aimed at in the practice of importance sampling. This difficulty contrasts with the broad claim that “a sample of size approximately $e^{L}$ is necessary and sufficient for accurate estimation by importance sampling”. The result in conceptually interesting, but I’m not convinced that a sharp cutoff is common.

# Example

I consider their example 1.4. Here $\lambda$ is the exponential distribution of mean $1$, $\mu$ is the exponential distribution of mean 2,$\rho(x) = e^{x/2}/2$ and $f(x) = x$. Thus $I(f) = 2$. We have $L = \mathbb{E}\log\rho(X) = 1-\log(2) \approxeq 0.3$, meaning that the theorem yields no useful cutoff. Furthermore, ${}\mathbb{P}(\rho(X) > a) = \tfrac{1}{2a}$ and $\|f\|_{L^2(\mu)} = 2$. Optimizing the bound given by the theorem yields

$\mathbb{E}|I_n(f)-2| \le \frac{4\sqrt{2}}{(2n)^{1/4}}.$

The figure below shows $100$ trajectories of $I_k(f)$. The shaded area bounds the expected error.

This next figure shows $100$ trajectories for the Monte-Carlo estimate of $2 = \int x d\mu$, taking $\lambda = \mu$ and $\rho = 1$. Here the theorem yields

$\mathbb{E}|I_n(f)-2| \le \frac{2}{\sqrt{n}}.$

References.

Chatterjee, S. and Diaconis, P. The Sample Size Required in Importance Sampling. https://arxiv.org/abs/1511.01437v2

# The asymptotic behaviour of posterior distributions – 1

I describe key ideas of the theory of posterior distribution asymptotics and work out small examples.

# 1. Introduction

Consider the problem of learning about the probability distribution that a stochastic mechanism is following, based on independent observations. You may quantify your uncertainty about what distribution the mechanism is following through a prior $P$ defined on the space of all possible distributions, and then obtain the conditional distribution of $P$ given the observations to correspondingly adjust your uncertainty. In the simplest case, the prior is concentrated on a space $\mathcal{M}$ of distributions dominated by a common measure $\lambda$. This means that each probability measure in $\mu \in \mathcal{M}$ is such that there exists $f$ with $\mu(E) = \int_E f \,d\lambda$ for all measurable $E$. We may thus identify $\mathcal{M}$ with a space $\mathbb{F}$ of probability densities. Given independent observations $X = (X_1, \dots, X_n)$, the conditional distribution of $P$ given $X$ is then

$P(A \mid X) = \frac{\int_A f(X) P(df)}{\int_{\mathbb{F}} f(X) P(df)}, \quad A \subset \mathbb{F},$

where $f(X) = \Pi_{i=1}^nf(X_i)$ and $A$ is measurable. This conditional distribution is called the posterior distribution of $P$, and is understood as a Lebesgue integral relative to the measure $P$ defined on $\mathbb{F} \simeq \mathcal{M}$.

The procedure of conditionning $P$ on $X$ is bayesian learning. It is expected that as more and more data is gathered, that the posterior distribution will converge, in a suitable way, to a point mass located at the true distribution that the data is following. This is the asymptotic behavior we study.

## 1.1 Notations

The elements of measure theory and the language of mathematical statistics I use are standard and very nicely reviewed in Halmos (1949). To keep notations light, I avoid making explicit references to measure spaces. All sets and maps considered are measurable.

# 2. The relative entropy of probability measures

Let $\mu_1$ and $\mu_2$ be two probability measures on a metric space $\mathbb{M}$, and let $\lambda$ be a common $\sigma$-finite dominating measure such as $\mu_1 + \mu_2$. That is, $\mu_i(E) = 0$ whenever $E \subset \mathbb{M}$ is of $\lambda$-measure $0$, and by the Radon-Nikodym theorem this is equivalent to the fact that there exists unique densities $f_1$, $f_2$ with $d\mu_i = f_i d\lambda$. The relative entropy of $\mu_1$ and $\mu_2$, also known as the Kullback-Leibler divergence, is defined as

$K(\mu_1, \mu_2) = \int \log \frac{f_1}{f_2} d\mu_1\,.$

The following inequality will be of much use.

Lemma 1 (Kullback and Leibler (1951)). We have $K(\mu_1, \mu_2) \geq 0$, with equality if and only if $\mu_1 = \mu_2$.

Proof: We may assume that $\mu_2$ is dominated by $\mu_1$, as otherwise $K(\mu_1, \mu_2) = \infty$ and the lemma holds. Now, let $\phi (t) = t\log t$, $g = f_1/f_2$ and write ${}K(\mu_1, \mu_2) = \int \phi \circ g d\mu_2$. Since $\phi (t) = (t-1) +(t-1)^2 / h(t)$ for some $h(t) > 0$, we find

$K(\mu_1, \mu_2) = \int (g-1) d\mu_2 + \int \frac{(g-1)^2}{h} d\mu_2 = \int \frac{(g-1)^2}{h\circ g} d\mu_2 \geq 0,$

with equality if and only if $f_1/f_2 = 1$, $\mu_1$-almost everywhere. QED.

We may already apply the lemma to problems of statistical inference.

## 2.1 Discriminating between point hypotheses

Alice observes $X = (X_1, \dots, X_n)$, $X_i \sim^{ind}\mu_*$, $\mu_* \in \{\mu_1, \mu_2\}$, and considers the hypotheses $H_i = \{\mu_i\}$, $i=1,2$, representing “$\mu_* = \mu_i$“. A prior distribution on $H_1 \cup H_2$ takes the form

$P = \alpha \delta_1 + (1-\alpha)\delta_2, \quad 0 < \alpha < 1,$

where $\delta_i(A)$ is $1$ when $\mu_i \in A$ and is $0$ otherwise. Given a sample $X_1 \sim \mu_1$, the weight of evidence for $H_1$ versus $H_2$ (Good, 1985), also known as “the information in $X_1$ for discrimination between $H_1$ and $H_2$” (Kullback, 1951), is defined as

$W(X_1) = \log\frac{f_1(X_1)}{f_2(X_1)}.$

This quantity is additive for independent sample: if $X = (X_1, \dots, X_n)$, $X_i \sim^{ind} \mu_*$, then

$W(X) = \sum_{i=1}^n W(X_i);$

and the posterior log-odds are given by
$\log\frac{P(H_1 \mid X)}{P(H_2 \mid X)} = W(X) + \log\frac{P(H_1)}{P(H_2)}.$
Thus $K(\mu_1, \mu_2)$ is the expected weight of evidence for $H_1$ against $H_2$ brought by an unique sample $X_1 \sim \mu_1$ and is strictly positive whenever $\mu_1 \not = \mu_2$. By the additivity of $W$, Alice should expect that as more and more data is gathered, the weight of evidence grows to infinity. In fact, this happens $\mu_1$-almost surely.

Proposition 2. Almost surely as the number of observations grows, $W(X) \rightarrow \infty$ and $P(H_1 | X)\rightarrow 1$.

Proof: By the law of large numbers (the case $K(\mu_1, \mu_2) = \infty$ is easily treated separatly), we have

$\frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)}\rightarrow K(\mu_1, \mu_2) > 0, \quad \mu_1\text{-a.s.}.$

Therefore

$W(X) = n \left( \frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)} \right) \rightarrow \infty \quad \mu_1\text{-a.s.}$

and $P(H_1\mid X) \rightarrow 1$ $\mu_1$-almost surely. QED.

## 2.2 Finite mixture prior under misspecification

Alice wants to learn about a fixed unknown distribution $\mu_*$ through data $X =(X_1, \dots, X_n)$, where $X_i \sim^{ind.} \mu_*$. She models what $\mu_*$ may be as one of the distribution in $\mathcal{M} = \{\mu_1, \dots, \mu_k\}$, and quantifies her uncertainty through a prior $P$ on $\mathcal{M}$. We may assume that $K(\mu_*, \mu_i) < \infty$ for some $i$, as otherwise she will eventually observe data that is impossible under $\mathcal{M}$ and adjust her model. (Indeed, $K(\mu_*, \mu_i)=\infty$ implies that there exists a $\mu_*$ non-negligible set $E_i$ such that $\mu_i(E_i) = 0 < \mu_*(E_i)$. Alice will $\mu_*$-almost surely observe $X_m \in E_i$ and conclude that $\mu_* \not = \mu_k$.) If $\mu_* \not\in \mathcal{M}$, she may not realise that her model is wrong, but the following proposition ensures that the posterior distribution will concentrate on the $\mu_i$‘s that are closest to $\mu_*$.

Proposition 3. Let $A = \{\mu_i \in \mathcal{M} \mid K(\mu_*, \mu_i) = \min_j K(\mu_*, \mu_j)\}$. Almost surely as the number of observations grows, ${} P(A | X) \rightarrow 1$.

Proof: The prior takes the form ${}P = \sum_{i=1}^{k} \alpha_i \delta_i$, ${}\alpha_i > 0$, where ${}\delta_i$ is a point mass at ${}\mu_i$. The model ${}\mathcal{M}$ is dominated by a ${}\sigma$-finite measure ${}\lambda$, such as ${}\sum \mu_i$, so that the posterior distribution is

$P(A\mid X) = \frac{\sum_{\mu_i \in A} f_i(X) \alpha_i }{\sum_{\mu_i \in \mathcal{M}} f_i(X) \alpha_i}, \quad d\mu_i = f_i d\lambda.$

Because ${}K(\mu_\star, \mu_i) < \infty$ for some ${}i$, ${}\mu_\star$ is also absolutely continuous with respect to ${}\lambda$ and we let ${}d\mu_\star = f_\star d\lambda$. Now, let ${}\varepsilon = \min_j K(\mu_\star, \mu_j)$ and ${}\delta = \min_j \{K(\mu_\star, \mu_j \mid \mu_j \in A^c)\} > \varepsilon$. Write

$\log \frac{P(A | X)}{P(A^c | X)} \geq \log \frac{f_A(X)}{f_{A^c}(X)} + \log \frac{\sum_{\mu_i \in A} \alpha_i}{ \sum_{\mu_i \in A^c} \alpha_i},$

where ${}f_A = \arg \min_{f_i : \mu_i \in A} f_i(X)$ and ${}f_{A^c} = \arg \max_{f_i : \mu_i \in A^c} f_i(X)$ both depend on ${}n$. Using the law of large numbers, we find

$\liminf_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^n \log\frac{f_\star(X_i)}{f_{A^c}(X_i)} \geq \delta > \varepsilon = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n \log\frac{f_\star(X_i)}{f_A (X_i)}\quad \mu_\star\text{-a.s.},$

and therefore

${}\log \frac{f_A(X)}{f_{A^c}(X)} = n\left( \frac{1}{n} \sum_{i=1}^n \log\frac{f_\star(X_i)}{f_{A^c}(X_i)} - \frac{1}{n}\sum_{i=1}^n \log\frac{f_\star(X_i)}{f_A (X_i)} \right) \rightarrow \infty\quad \mu_*\text{-a.s.}.$

This implies ${}P(A \mid X) \rightarrow 1$, ${}\mu_\star$-almost surely. QED.

## 2.3 Properties of the relative entropy

The following properties justify the interpretation of the relative entropy $K(\mu_1, \mu_2)$ as an expected weight of evidence.

Lemma 4. The relative entropy ${}K(\mu_1, \mu_2) = \int f_1 \log\frac{f_1}{f_2} d\lambda$, ${}d\mu_i = f_id\lambda$, does not depend on the choice of dominating measure ${}\lambda$.

Let ${}T: \mathbb{M} \rightarrow \mathbb{M}'$ be a mapping onto a space ${}\mathbb{M}'$. If ${}X \sim \mu$, then ${}T(X) \sim \mu T^{-1}$, where ${}\mu T^{-1}$ is the probability measure on ${}\mathbb{M}'$ defined by ${}\mu T^{-1}(F) = \mu (T^{-1}(F))$. The following proposition, a particular case of (Kullback, 1951, theorem 4.1), states that transforming the data through ${}T$ cannot increase the expected weight of evidence.

Proposition 5 (see Kullback and Leibler (1951)). For any two probability measures ${}\mu_1, \mu_2$ on ${}\mathbb{M}$, we have

${}K(\mu_1 T^{-1}, \mu_2 T^{-1}) \le K(\mu_1, \mu_2).$

Proof:  Let ${}\lambda$ be a dominating measure for ${}\{\mu_1, \mu_2\}$, ${}d\mu_i = f_i d\lambda$. Therefore, ${}\mu_i T^{-1}$ is dominated by ${}\lambda T^{-1}$, and we may write ${}d\mu_i T^{-1} = g_i d\lambda T^{-1}$, ${}i=1,2$. The measures on the two spaces are related by the formula

${} \int_{T^{-1}(F)} h \circ T \, d\alpha = \int_{F} h \,d\alpha T^{-1},$

where ${}\alpha$ may be one of ${}\mu_1$, ${}\mu_2$ and ${}\lambda$ and ${}h$ is any measurable function (Halmos, 1949, lemma 3). Therefore,

${}K(\mu_1, \mu_2) - K(\mu_1 T^{-1}, \mu_2 T^{-1}) = \int_{\mathbb{M}} \log\frac{f_1}{f_2} d\mu_1 - \int_{\mathbb{M}'} \log \frac{g_1}{g_2} d\mu_1T^{-1} \\ = \int \log\frac{f_1}{f_2} d\mu_1 - \int \log\frac{g_1 \circ T}{g_2 \circ T} d\mu_1 = (\star).$

By letting ${}h = \frac{f_1 g_2 \circ T}{f_2 g_1 \circ T}$ we find

${} (\star) = \int h \log h \, \frac{g_1\circ T}{g_2 \circ T} d\mu_2 = \int h \log h \,d\gamma,$

where ${}\gamma$ is such that ${}d\gamma = \frac{g_1\circ T}{g_2 \circ T} d\mu_2$. It is a probability measure since ${}\int \frac{g_1\circ T}{g_2 \circ T} d\mu_2 = 1$.

By convexity of ${}\phi : t \mapsto t \log t$, we find

${} \int h \log h \,d\gamma \geq \phi\left( \int h \,d\gamma \right) = \phi(1) = 0$

and this finishes the proof.

# References.

Good, I. (1985). Weight of evidence: A brief survey. In A. S. D. L. J.M. Bernardo, M.H. DeGroot (Ed.), Bayesian Statistics 2, pp. 249–270. North-Holland B.V.: Elsevier Science Publishers.

Halmos, P. R. and L. J. Savage (1949, 06). Application of the radon-nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20 (2), 225–241.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22(1), 79-86