The asymptotic behaviour of posterior distributions – 1

I describe key ideas of the theory of posterior distribution asymptotics and work out small examples.

1. Introduction

Consider the problem of learning about the probability distribution that a stochastic mechanism is following, based on independent observations. You may quantify your uncertainty about which distribution the mechanism follows through a prior P defined on the space of all possible distributions, and then obtain the conditional distribution of P given the observations to correspondingly adjust your uncertainty. In the simplest case, the prior is concentrated on a space \mathcal{M} of distributions dominated by a common measure \lambda. This means that for each probability measure \mu \in \mathcal{M} there exists a density f with \mu(E) = \int_E f \,d\lambda for all measurable E. We may thus identify \mathcal{M} with a space \mathbb{F} of probability densities. Given independent observations X = (X_1, \dots, X_n), the conditional distribution of P given X is then

P(A \mid X) = \frac{\int_A f(X) P(df)}{\int_{\mathbb{F}} f(X) P(df)}, \quad A \subset \mathbb{F},

where f(X) = \prod_{i=1}^n f(X_i) and A is measurable. This conditional distribution is called the posterior distribution of P, and the integrals above are understood as Lebesgue integrals relative to the measure P defined on \mathbb{F} \simeq \mathcal{M}.
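
To make the posterior formula concrete, here is a small numerical sketch in Python; the Bernoulli densities, the grid, the uniform prior and the event A below are illustrative choices of mine, not part of the general setup. The prior sits on a finite grid of densities, so the integrals over \mathbb{F} reduce to sums.

import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)                     # observations from a "true" Bernoulli(0.7)

theta_grid = np.linspace(0.01, 0.99, 99)              # a finite grid of candidate Bernoulli densities
prior = np.ones_like(theta_grid) / theta_grid.size    # uniform prior P on the grid

# f_theta(X) = prod_i f_theta(X_i), computed for every candidate density at once
likelihood = np.prod(theta_grid[None, :] ** X[:, None]
                     * (1 - theta_grid[None, :]) ** (1 - X[:, None]), axis=0)

# P(A | X): the numerator sums f(X) P({f}) over A, the denominator over all of F
posterior = prior * likelihood
posterior /= posterior.sum()

A = theta_grid > 0.5                                  # an example event A: "the success probability exceeds 1/2"
print(posterior[A].sum())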

The procedure of conditioning P on X is Bayesian learning. It is expected that, as more and more data is gathered, the posterior distribution will converge, in a suitable way, to a point mass located at the true distribution that the data is following. This is the asymptotic behaviour we study.

1.1 Notation

The elements of measure theory and the language of mathematical statistics I use are standard and very nicely reviewed in Halmos and Savage (1949). To keep notation light, I avoid making explicit references to measure spaces. All sets and maps considered are measurable.

2. The relative entropy of probability measures

Let \mu_1 and \mu_2 be two probability measures on a metric space \mathbb{M}, and let \lambda be a common \sigma-finite dominating measure, such as \mu_1 + \mu_2. That is, \mu_i(E) = 0 whenever E \subset \mathbb{M} is of \lambda-measure 0, and by the Radon–Nikodym theorem this is equivalent to the existence of (\lambda-almost everywhere unique) densities f_1, f_2 with d\mu_i = f_i d\lambda. The relative entropy of \mu_1 with respect to \mu_2, also known as the Kullback–Leibler divergence, is defined as

K(\mu_1, \mu_2) = \int \log \frac{f_1}{f_2} d\mu_1\,.
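
As a quick illustration (the numbers are my own), take two distributions on a three-point space with counting measure as \lambda, so that the densities are the probability mass functions and K reduces to a finite sum. Note that K is not symmetric in its two arguments.

import numpy as np

f1 = np.array([0.5, 0.3, 0.2])        # density (mass function) of mu_1
f2 = np.array([0.2, 0.5, 0.3])        # density (mass function) of mu_2

K12 = np.sum(f1 * np.log(f1 / f2))    # K(mu_1, mu_2)
K21 = np.sum(f2 * np.log(f2 / f1))    # K(mu_2, mu_1)
print(K12, K21)                       # both are positive, and they differ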

The following inequality will be of much use.

Lemma 1 (Kullback and Leibler (1951)). We have K(\mu_1, \mu_2) \geq 0, with equality if and only if \mu_1 = \mu_2.

Proof: We may assume that \mu_1 is dominated by \mu_2, as otherwise K(\mu_1, \mu_2) = \infty and the lemma holds. Now, let \phi(t) = t\log t, g = f_1/f_2 and write K(\mu_1, \mu_2) = \int \phi \circ g \, d\mu_2. Since \phi(t) = (t-1) + (t-1)^2/h(t) for some function h(t) > 0 (the line t - 1 is tangent to the strictly convex \phi at t = 1), we find

K(\mu_1, \mu_2) = \int (g-1) \, d\mu_2 + \int \frac{(g-1)^2}{h\circ g} \, d\mu_2 = \int \frac{(g-1)^2}{h\circ g} \, d\mu_2 \geq 0,

with equality if and only if g = f_1/f_2 = 1, \mu_2-almost everywhere, that is, if and only if \mu_1 = \mu_2. QED.

We may already apply the lemma to problems of statistical inference.

2.1 Discriminating between point hypotheses

Alice observes X = (X_1, \dots, X_n), X_i \sim^{ind}\mu_*, \mu_* \in \{\mu_1, \mu_2\}, and considers the hypotheses H_i = \{\mu_i\}, i=1,2, representing “\mu_* = \mu_i”. A prior distribution on H_1 \cup H_2 takes the form

P = \alpha \delta_1 + (1-\alpha)\delta_2, \quad 0 < \alpha < 1,

where \delta_i(A) is 1 when \mu_i \in A and is 0 otherwise. Given a sample X_1 \sim \mu_1, the weight of evidence for H_1 versus H_2 (Good, 1985), also known as “the information in X_1 for discrimination between H_1 and H_2” (Kullback and Leibler, 1951), is defined as

W(X_1) = \log\frac{f_1(X_1)}{f_2(X_1)}.

This quantity is additive over independent samples: if X = (X_1, \dots, X_n), X_i \sim^{ind} \mu_*, then

W(X) = \sum_{i=1}^n W(X_i);

and the posterior log-odds are given by
\log\frac{P(H_1 \mid X)}{P(H_2 \mid X)} = W(X) + \log\frac{P(H_1)}{P(H_2)}.
Thus K(\mu_1, \mu_2) is the expected weight of evidence for H_1 against H_2 brought by a single sample X_1 \sim \mu_1, and it is strictly positive whenever \mu_1 \neq \mu_2. By the additivity of W, Alice should expect the weight of evidence to grow to infinity as more and more data is gathered. In fact, this happens \mu_1-almost surely.
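
Before making this precise, here is a quick numerical check of the log-odds identity above; the Bernoulli hypotheses, the prior weight and the sample size are illustrative choices of mine.

import numpy as np

rng = np.random.default_rng(1)
X = rng.binomial(1, 0.7, size=30)                      # data from mu_1 = Bernoulli(0.7)

def lik(theta):                                        # f_theta(X) = prod_i f_theta(X_i)
    return np.prod(theta ** X * (1 - theta) ** (1 - X))

alpha = 0.3                                            # prior weight P(H_1)
W = np.log(lik(0.7) / lik(0.4))                        # weight of evidence W(X) for H_1 versus H_2

direct = np.log(alpha * lik(0.7)) - np.log((1 - alpha) * lik(0.4))   # posterior log-odds, computed directly
via_W = W + np.log(alpha / (1 - alpha))                              # posterior log-odds via the identity
print(direct, via_W)                                                 # the two numbers agree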

Proposition 2. Suppose \mu_* = \mu_1 \neq \mu_2. Then, \mu_1-almost surely as the number of observations grows, W(X) \rightarrow \infty and P(H_1 \mid X) \rightarrow 1.

Proof: By the law of large numbers (the case K(\mu_1, \mu_2) = \infty is easily treated separately), we have

\frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)}\rightarrow K(\mu_1, \mu_2) > 0, \quad \mu_1\text{-a.s.}.

Therefore

W(X) = n \left( \frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)} \right) \rightarrow \infty \quad \mu_1\text{-a.s.}

and, by the posterior log-odds formula, P(H_1\mid X) \rightarrow 1, \mu_1-almost surely. QED.
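
A short simulation illustrates the proposition; the Bernoulli hypotheses and the sample size are again illustrative choices of mine. The weight of evidence drifts to infinity at a roughly linear rate (the slope being K(\mu_1, \mu_2)) and the posterior probability of H_1 approaches 1.

import numpy as np

rng = np.random.default_rng(2)
theta1, theta2, alpha = 0.7, 0.4, 0.5                  # mu_1, mu_2 and the prior weight of H_1
X = rng.binomial(1, theta1, size=5000)                 # data generated from mu_1

increments = (np.log(theta1 ** X * (1 - theta1) ** (1 - X))
              - np.log(theta2 ** X * (1 - theta2) ** (1 - X)))
W = np.cumsum(increments)                              # W(X_1, ..., X_n) for each n up to 5000

post_H1 = 1.0 / (1.0 + np.exp(-(W + np.log(alpha / (1 - alpha)))))
for n in (10, 100, 1000, 5000):
    print(n, W[n - 1], post_H1[n - 1])                 # W grows and P(H_1 | X) tends to 1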

2.2 Finite mixture prior under misspecification

Alice wants to learn about a fixed unknown distribution \mu_* through data X = (X_1, \dots, X_n), where X_i \sim^{ind.} \mu_*. She models \mu_* as one of the distributions in \mathcal{M} = \{\mu_1, \dots, \mu_k\}, and quantifies her uncertainty through a prior P on \mathcal{M}. We assume that K(\mu_*, \mu_i) < \infty for some i. (In particular, if \mu_* is not absolutely continuous with respect to any \mu_i, then for each i there is a \mu_*-non-negligible set E_i with \mu_i(E_i) = 0 < \mu_*(E_i); Alice will \mu_*-almost surely observe some X_m \in E_i, conclude that \mu_* \neq \mu_i, and eventually adjust her model.) If \mu_* \not\in \mathcal{M}, she may never realise that her model is wrong, but the following proposition ensures that the posterior distribution concentrates on the \mu_i's that are closest to \mu_* in relative entropy.

Proposition 3. Let A = \{\mu_i \in \mathcal{M} \mid K(\mu_*, \mu_i) = \min_j K(\mu_*, \mu_j)\}. Almost surely as the number of observations grows, P(A \mid X) \rightarrow 1.

Proof: The prior takes the form P = \sum_{i=1}^{k} \alpha_i \delta_i, \alpha_i > 0, where \delta_i is a point mass at \mu_i. The model \mathcal{M} is dominated by a \sigma-finite measure \lambda, such as \sum_i \mu_i, so that the posterior distribution is

P(A\mid X) = \frac{\sum_{\mu_i \in A} f_i(X) \alpha_i }{\sum_{\mu_i \in \mathcal{M}} f_i(X) \alpha_i}, \quad d\mu_i = f_i d\lambda.

Because K(\mu_*, \mu_i) < \infty for some i, \mu_* is also absolutely continuous with respect to \lambda and we let d\mu_* = f_* d\lambda. We may assume A^c is non-empty, as otherwise there is nothing to prove. Now, let \varepsilon = \min_j K(\mu_*, \mu_j) and \delta = \min \{K(\mu_*, \mu_j) \mid \mu_j \in A^c\} > \varepsilon. Write

\log \frac{P(A | X)}{P(A^c | X)} \geq \log \frac{f_A(X)}{f_{A^c}(X)} + \log \frac{\sum_{\mu_i \in A} \alpha_i}{ \sum_{\mu_i \in A^c} \alpha_i},

where f_A = \arg \min_{f_i : \mu_i \in A} f_i(X) and f_{A^c} = \arg \max_{f_i : \mu_i \in A^c} f_i(X) both depend on n. Using the law of large numbers, we find

\liminf_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^n \log\frac{f_*(X_i)}{f_{A^c}(X_i)} \geq \delta > \varepsilon = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n \log\frac{f_*(X_i)}{f_A (X_i)}\quad \mu_*\text{-a.s.},

and therefore

\log \frac{f_A(X)}{f_{A^c}(X)} = n\left( \frac{1}{n} \sum_{i=1}^n \log\frac{f_*(X_i)}{f_{A^c}(X_i)} - \frac{1}{n}\sum_{i=1}^n \log\frac{f_*(X_i)}{f_A (X_i)} \right) \rightarrow \infty\quad \mu_*\text{-a.s.}.

This implies P(A \mid X) \rightarrow 1, \mu_*-almost surely. QED.
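
A small simulation illustrates the proposition under misspecification; the model and the true distribution below are illustrative choices of mine. The data come from Bernoulli(0.55), which lies outside the model \mathcal{M} = \{Bernoulli(0.3), Bernoulli(0.5), Bernoulli(0.9)\}, and the posterior concentrates on Bernoulli(0.5), the element of \mathcal{M} closest to \mu_* in relative entropy.

import numpy as np

rng = np.random.default_rng(3)
X = rng.binomial(1, 0.55, size=3000)                   # mu_* = Bernoulli(0.55), not in M

thetas = np.array([0.3, 0.5, 0.9])                     # the model M
alphas = np.array([1/3, 1/3, 1/3])                     # prior weights alpha_i

log_lik = X.sum() * np.log(thetas) + (len(X) - X.sum()) * np.log(1 - thetas)   # log f_i(X)
log_post = np.log(alphas) + log_lik
post = np.exp(log_post - log_post.max())               # normalise on the log scale to avoid underflow
post /= post.sum()
print(post)                                            # essentially all mass on theta = 0.5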

2.3 Properties of the relative entropy

The following properties justify the interpretation of the relative entropy K(\mu_1, \mu_2) as an expected weight of evidence.

Lemma 4. The relative entropy K(\mu_1, \mu_2) = \int f_1 \log\frac{f_1}{f_2} \, d\lambda, d\mu_i = f_i d\lambda, does not depend on the choice of dominating measure \lambda.
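
One way to see this: if \lambda' is another \sigma-finite dominating measure, set \nu = \lambda + \lambda' and write d\lambda = r \, d\nu. Then d\mu_i = f_i r \, d\nu, and since the factor r cancels inside the logarithm,

\int f_1 r \log \frac{f_1 r}{f_2 r} \, d\nu = \int f_1 \log\frac{f_1}{f_2} \, r \, d\nu = \int f_1 \log\frac{f_1}{f_2} \, d\lambda.

Thus the value computed from \lambda agrees with the value computed from \nu, and the same argument applies to \lambda'.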

Let T: \mathbb{M} \rightarrow \mathbb{M}' be a mapping into a space \mathbb{M}'. If X \sim \mu, then T(X) \sim \mu T^{-1}, where \mu T^{-1} is the probability measure on \mathbb{M}' defined by \mu T^{-1}(F) = \mu (T^{-1}(F)). The following proposition, a particular case of (Kullback and Leibler, 1951, Theorem 4.1), states that transforming the data through T cannot increase the expected weight of evidence.

Proposition 5 (see Kullback and Leibler (1951)). For any two probability measures \mu_1, \mu_2 on \mathbb{M}, we have

K(\mu_1 T^{-1}, \mu_2 T^{-1}) \le K(\mu_1, \mu_2).

Proof: We may assume that \mu_1 is dominated by \mu_2, as otherwise K(\mu_1, \mu_2) = \infty and the inequality is trivial. Let \lambda be a \sigma-finite dominating measure for \{\mu_1, \mu_2\}, say \lambda = \mu_1 + \mu_2, with d\mu_i = f_i \, d\lambda. Then \mu_i T^{-1} is dominated by \lambda T^{-1}, and we may write d\mu_i T^{-1} = g_i \, d\lambda T^{-1}, i=1,2. The measures on the two spaces are related by the formula

\int_{T^{-1}(F)} h \circ T \, d\alpha = \int_{F} h \,d\alpha T^{-1},

where \alpha may be any of \mu_1, \mu_2 and \lambda, and h is any measurable function (Halmos and Savage, 1949, Lemma 3). Therefore,

K(\mu_1, \mu_2) - K(\mu_1 T^{-1}, \mu_2 T^{-1}) = \int_{\mathbb{M}} \log\frac{f_1}{f_2} \, d\mu_1 - \int_{\mathbb{M}'} \log \frac{g_1}{g_2} \, d\mu_1 T^{-1} = \int \log\frac{f_1}{f_2} \, d\mu_1 - \int \log\frac{g_1 \circ T}{g_2 \circ T} \, d\mu_1 = (\star).

By letting h = \frac{f_1 \, g_2 \circ T}{f_2 \, g_1 \circ T} we find

(\star) = \int h \log h \, \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2 = \int h \log h \,d\gamma,

where \gamma is such that d\gamma = \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2. It is a probability measure since, by the change-of-variables formula above, \int \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2 = \int \frac{g_1}{g_2} \, d\mu_2 T^{-1} = \int g_1 \, d\lambda T^{-1} = 1.

By convexity of \phi : t \mapsto t \log t and Jensen's inequality, we find

\int h \log h \,d\gamma \geq \phi\left( \int h \,d\gamma \right) = \phi(1) = 0, since \int h \, d\gamma = \int \frac{f_1}{f_2} \, d\mu_2 = 1,

and this finishes the proof. QED.
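
As a small numerical illustration (the normal distributions and the map T below are my own choices), take \mu_1 = N(0,1), \mu_2 = N(1,1) and T(x) = 1\{x > 0\}, which reduces each observation to a single bit.

import math

K_original = (0.0 - 1.0) ** 2 / 2                      # closed form: K(N(0,1), N(1,1)) = (difference of means)^2 / 2

def normal_sf(x, mean):                                # P(N(mean, 1) > x)
    return 0.5 * math.erfc((x - mean) / math.sqrt(2))

p1 = normal_sf(0.0, 0.0)                               # T(X) ~ Bernoulli(p1) under mu_1
p2 = normal_sf(0.0, 1.0)                               # T(X) ~ Bernoulli(p2) under mu_2

def kl_bernoulli(p, q):                                # relative entropy of Bernoulli(p) w.r.t. Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(K_original, kl_bernoulli(p1, p2))                # 0.5 versus roughly 0.31: T cannot add information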

References.

Good, I. J. (1985). Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics 2, pp. 249–270. Amsterdam: North-Holland / Elsevier Science Publishers.

Halmos, P. R. and Savage, L. J. (1949). Application of the Radon–Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20(2), 225–241.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22(1), 79–86.
