Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

The problem.

Let f, g, h be three densities and suppose that x_i \sim h, i \in \mathbb{N}, independently. What happens to the likelihood ratio

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}

as n\rightarrow \infty?

Clearly, it depends. If h = g \not = f, then

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0

almost surely at an exponential rate. More generally, if h is closer to g than to f, in some sense, we’d expect that \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0. Such a measure of “closeness” or “divergence” between probability distributions is given by the Kullback-Leibler divergence

D_{KL}(f, g) = \int f \log\frac{f}{g}.

It can be verified that D_{KL}(f,g) \geq 0 with equality if and only if f=g, and that

D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)

almost surely at an exponential rate. Thus the K.-L. divergence can be used to solve our problem.
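To see this numerically, here is a minimal simulation sketch (assuming NumPy and SciPy; the particular normal densities f, g, h below are hypothetical choices, not taken from the discussion above). It tracks the log of the likelihood ratio, which drifts to -\infty at a linear rate when h is closer to g than to f in the Kullback-Leibler sense.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical densities: f = N(1, 1), g = N(0.2, 1), and the data follow
# h = N(0, 1), so that D_KL(h, g) < D_KL(h, f).
f = stats.norm(loc=1.0, scale=1.0)
g = stats.norm(loc=0.2, scale=1.0)
h = stats.norm(loc=0.0, scale=1.0)

x = h.rvs(size=5000, random_state=rng)

# Log of the running likelihood ratio: sum_i log f(x_i) - log g(x_i).
log_ratio = np.cumsum(f.logpdf(x) - g.logpdf(x))

# The drift per observation is D_KL(h, g) - D_KL(h, f) = 0.02 - 0.5 < 0,
# so the likelihood ratio itself decays to 0 exponentially fast.
print(log_ratio[[99, 999, 4999]])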

Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in (1) (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

The asymptotic behaviour of posterior distributions – 1

I describe key ideas of the theory of posterior distribution asymptotics and work out small examples.

1. Introduction

Consider the problem of learning about the probability distribution that a stochastic mechanism is following, based on independent observations. You may quantify your uncertainty about what distribution the mechanism is following through a prior P defined on the space of all possible distributions, and then obtain the conditional distribution of P given the observations to correspondingly adjust your uncertainty. In the simplest case, the prior is concentrated on a space \mathcal{M} of distributions dominated by a common measure \lambda. This means that each probability measure \mu \in \mathcal{M} is such that there exists f with \mu(E) = \int_E f \,d\lambda for all measurable E. We may thus identify \mathcal{M} with a space \mathbb{F} of probability densities. Given independent observations X = (X_1, \dots, X_n), the conditional distribution of P given X is then

P(A \mid X) = \frac{\int_A f(X) P(df)}{\int_{\mathbb{F}} f(X) P(df)}, \quad A \subset \mathbb{F},

where f(X) = \prod_{i=1}^n f(X_i) and A is measurable. This conditional distribution is called the posterior distribution of P, and is understood as a Lebesgue integral relative to the measure P defined on \mathbb{F} \simeq \mathcal{M}.

The procedure of conditioning P on X is Bayesian learning. It is expected that, as more and more data is gathered, the posterior distribution will converge, in a suitable way, to a point mass located at the true distribution that the data is following. This is the asymptotic behavior we study.

1.1 Notations

The elements of measure theory and the language of mathematical statistics I use are standard and very nicely reviewed in Halmos and Savage (1949). To keep notation light, I avoid making explicit references to measure spaces. All sets and maps considered are measurable.

2. The relative entropy of probability measures

Let \mu_1 and \mu_2 be two probability measures on a metric space \mathbb{M}, and let \lambda be a common \sigma-finite dominating measure such as \mu_1 + \mu_2. That is, \mu_i(E) = 0 whenever E \subset \mathbb{M} is of \lambda-measure 0, and by the Radon-Nikodym theorem this is equivalent to the fact that there exist unique densities f_1, f_2 with d\mu_i = f_i \, d\lambda. The relative entropy of \mu_1 and \mu_2, also known as the Kullback-Leibler divergence, is defined as

K(\mu_1, \mu_2) = \int \log \frac{f_1}{f_2} d\mu_1\,.

The following inequality will be of much use.

Lemma 1 (Kullback and Leibler (1951)). We have K(\mu_1, \mu_2) \geq 0, with equality if and only if \mu_1 = \mu_2.

Proof: We may assume that \mu_1 is dominated by \mu_2, as otherwise K(\mu_1, \mu_2) = \infty and the lemma holds. Now, let \phi(t) = t\log t, g = f_1/f_2, and write K(\mu_1, \mu_2) = \int \phi \circ g \, d\mu_2. Since \phi(t) = (t-1) + (t-1)^2/h(t) for some function h(t) > 0, we find

K(\mu_1, \mu_2) = \int (g-1) \, d\mu_2 + \int \frac{(g-1)^2}{h\circ g} \, d\mu_2 = \int \frac{(g-1)^2}{h\circ g} \, d\mu_2 \geq 0,

since \int (g-1) \, d\mu_2 = \mu_1(\mathbb{M}) - 1 = 0 when \mu_1 is dominated by \mu_2. Equality holds if and only if f_1/f_2 = 1, \mu_2-almost everywhere, that is, if and only if \mu_1 = \mu_2. QED.
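As a quick numerical companion to Lemma 1, here is a small sketch (assuming NumPy; the discrete distributions p and q are hypothetical examples) that computes the relative entropy of two distributions on a finite set and illustrates non-negativity, the equality case, and the lack of symmetry.

import numpy as np

def relative_entropy(p, q):
    # K(p, q) = sum_i p_i log(p_i / q_i) for distributions on a finite set.
    # Terms with p_i = 0 contribute 0; if q_i = 0 < p_i, then K(p, q) = +infinity.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Hypothetical distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(relative_entropy(p, q))  # strictly positive since p != q
print(relative_entropy(p, p))  # 0, the equality case of Lemma 1
print(relative_entropy(q, p))  # differs from K(p, q): K is not symmetric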

We may already apply the lemma to problems of statistical inference.

2.1 Discriminating between point hypotheses

Alice observes X = (X_1, \dots, X_n), X_i \sim^{ind}\mu_*, \mu_* \in \{\mu_1, \mu_2\}, and considers the hypotheses H_i = \{\mu_i\}, i=1,2, representing “\mu_* = \mu_i”. A prior distribution on H_1 \cup H_2 takes the form

P = \alpha \delta_1 + (1-\alpha)\delta_2, \quad 0 < \alpha < 1,

where \delta_i(A) is 1 when \mu_i \in A and is 0 otherwise. Given a sample X_1 \sim \mu_1, the weight of evidence for H_1 versus H_2 (Good, 1985), also known as “the information in X_1 for discrimination between H_1 and H_2” (Kullback and Leibler, 1951), is defined as

W(X_1) = \log\frac{f_1(X_1)}{f_2(X_1)}.

This quantity is additive for independent samples: if X = (X_1, \dots, X_n), X_i \sim^{ind} \mu_*, then

W(X) = \sum_{i=1}^n W(X_i);

and the posterior log-odds are given by

\log\frac{P(H_1 \mid X)}{P(H_2 \mid X)} = W(X) + \log\frac{P(H_1)}{P(H_2)}.

Thus K(\mu_1, \mu_2) is the expected weight of evidence for H_1 against H_2 brought by a single sample X_1 \sim \mu_1, and it is strictly positive whenever \mu_1 \not = \mu_2. By the additivity of W, Alice should expect that, as more and more data is gathered, the weight of evidence will grow to infinity. In fact, this happens \mu_1-almost surely.

Proposition 2. Almost surely as the number of observations grows, W(X) \rightarrow \infty and P(H_1 | X)\rightarrow 1.

Proof: By the law of large numbers (the case K(\mu_1, \mu_2) = \infty is easily treated separately), we have

\frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)}\rightarrow K(\mu_1, \mu_2) > 0, \quad \mu_1\text{-a.s.},

so that

W(X) = n \left( \frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)} \right) \rightarrow \infty \quad \mu_1\text{-a.s.}

and P(H_1\mid X) \rightarrow 1, \mu_1-almost surely. QED.
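A small simulation sketch of Proposition 2 (assuming NumPy and SciPy; the two normal hypotheses and the uniform prior are hypothetical choices): it accumulates the weight of evidence W(X) and converts the posterior log-odds into P(H_1 \mid X).

import numpy as np
from scipy import stats
from scipy.special import expit  # logistic function: log-odds -> probability

rng = np.random.default_rng(1)

# Hypothetical point hypotheses mu_1 = N(0, 1) and mu_2 = N(0.5, 1),
# data generated from mu_1, and prior weight alpha = 1/2 on H_1.
mu1 = stats.norm(0.0, 1.0)
mu2 = stats.norm(0.5, 1.0)
alpha = 0.5

x = mu1.rvs(size=2000, random_state=rng)

# Additivity: W(X_1, ..., X_n) is the cumulative sum of per-observation weights.
W = np.cumsum(mu1.logpdf(x) - mu2.logpdf(x))

# Posterior log-odds = W(X) + prior log-odds; P(H_1 | X) via the logistic map.
posterior_H1 = expit(W + np.log(alpha / (1 - alpha)))

print(W[[9, 99, 1999]])            # grows roughly linearly, slope K(mu_1, mu_2) = 1/8
print(posterior_H1[[9, 99, 1999]]) # tends to 1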

2.2 Finite mixture prior under misspecification

Alice wants to learn about a fixed unknown distribution \mu_* through data X = (X_1, \dots, X_n), where X_i \sim^{ind.} \mu_*. She models what \mu_* may be as one of the distributions in \mathcal{M} = \{\mu_1, \dots, \mu_k\}, and quantifies her uncertainty through a prior P on \mathcal{M}. We may assume that K(\mu_*, \mu_i) < \infty for some i. (If \mu_* is dominated by no \mu_i, then for each i there exists a \mu_*-non-negligible set E_i such that \mu_i(E_i) = 0 < \mu_*(E_i); Alice will \mu_*-almost surely observe some X_m \in E_i and conclude that \mu_* \not = \mu_i, so she will eventually see data that is impossible under \mathcal{M} and adjust her model.) If \mu_* \not\in \mathcal{M}, she may not realise that her model is wrong, but the following proposition ensures that the posterior distribution will concentrate on the \mu_i’s that are closest to \mu_*.

Proposition 3. Let A = \{\mu_i \in \mathcal{M} \mid K(\mu_*, \mu_i) = \min_j K(\mu_*, \mu_j)\}. Almost surely as the number of observations grows, P(A \mid X) \rightarrow 1.

Proof: The prior takes the form P = \sum_{i=1}^{k} \alpha_i \delta_i, \alpha_i > 0, where \delta_i is a point mass at \mu_i. The model \mathcal{M} is dominated by a \sigma-finite measure \lambda, such as \sum_i \mu_i, so that the posterior distribution is

P(A\mid X) = \frac{\sum_{\mu_i \in A} f_i(X) \alpha_i }{\sum_{\mu_i \in \mathcal{M}} f_i(X) \alpha_i}, \quad d\mu_i = f_i d\lambda.

Because K(\mu_*, \mu_i) < \infty for some i, \mu_* is also absolutely continuous with respect to \lambda and we let d\mu_* = f_* \, d\lambda. Now, let \varepsilon = \min_j K(\mu_*, \mu_j) and \delta = \min\{K(\mu_*, \mu_j) \mid \mu_j \in A^c\} > \varepsilon (if A^c is empty there is nothing to prove). Write

\log \frac{P(A | X)}{P(A^c | X)} \geq \log \frac{f_A(X)}{f_{A^c}(X)} + \log \frac{\sum_{\mu_i \in A} \alpha_i}{ \sum_{\mu_i \in A^c} \alpha_i},

where f_A = \arg \min_{f_i : \mu_i \in A} f_i(X) and f_{A^c} = \arg \max_{f_i : \mu_i \in A^c} f_i(X) both depend on n. Using the law of large numbers, we find

\liminf_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^n \log\frac{f_*(X_i)}{f_{A^c}(X_i)} \geq \delta > \varepsilon = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n \log\frac{f_*(X_i)}{f_A (X_i)}\quad \mu_*\text{-a.s.},

and therefore

\log \frac{f_A(X)}{f_{A^c}(X)} = n\left( \frac{1}{n} \sum_{i=1}^n \log\frac{f_*(X_i)}{f_{A^c}(X_i)} - \frac{1}{n}\sum_{i=1}^n \log\frac{f_*(X_i)}{f_A (X_i)} \right) \rightarrow \infty\quad \mu_*\text{-a.s.}

This implies P(A \mid X) \rightarrow 1, \mu_*-almost surely. QED.
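Here is a minimal sketch of Proposition 3 (assuming NumPy and SciPy; the misspecified truth \mu_* and the three-point model are hypothetical choices). Two of the model points are equally close to \mu_* in relative entropy, so A contains both of them and the posterior mass of A tends to 1.

import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(2)

# Hypothetical misspecified setting: the data follow mu_* = N(0.5, 1), while
# the model is M = {N(0,1), N(1,1), N(3,1)} with uniform prior weights.
# K(mu_*, N(0,1)) = K(mu_*, N(1,1)) = 1/8 < K(mu_*, N(3,1)), so A consists
# of the first two components.
mu_star = stats.norm(0.5, 1.0)
model = [stats.norm(0.0, 1.0), stats.norm(1.0, 1.0), stats.norm(3.0, 1.0)]
log_prior = np.log(np.full(len(model), 1.0 / len(model)))

x = mu_star.rvs(size=5000, random_state=rng)

# Unnormalised log posterior weights log(alpha_i) + log f_i(X).
log_weights = log_prior + np.array([m.logpdf(x).sum() for m in model])
posterior = np.exp(log_weights - logsumexp(log_weights))

print(posterior)            # almost all mass on the two closest components
print(posterior[:2].sum())  # P(A | X), close to 1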

2.3 Properties of the relative entropy

The following properties justify the interpretation of the relative entropy K(\mu_1, \mu_2) as an expected weight of evidence.

Lemma 4. The relative entropy K(\mu_1, \mu_2) = \int f_1 \log\frac{f_1}{f_2} \, d\lambda, d\mu_i = f_i \, d\lambda, does not depend on the choice of dominating measure \lambda.

Let T: \mathbb{M} \rightarrow \mathbb{M}' be a mapping onto a space \mathbb{M}'. If X \sim \mu, then T(X) \sim \mu T^{-1}, where \mu T^{-1} is the probability measure on \mathbb{M}' defined by \mu T^{-1}(F) = \mu (T^{-1}(F)). The following proposition, a particular case of (Kullback and Leibler, 1951, Theorem 4.1), states that transforming the data through T cannot increase the expected weight of evidence.

Proposition 5 (see Kullback and Leibler (1951)). For any two probability measures \mu_1, \mu_2 on \mathbb{M}, we have

K(\mu_1 T^{-1}, \mu_2 T^{-1}) \le K(\mu_1, \mu_2).

Proof: Let \lambda be a dominating measure for \{\mu_1, \mu_2\}, d\mu_i = f_i \, d\lambda. Therefore, \mu_i T^{-1} is dominated by \lambda T^{-1}, and we may write d\mu_i T^{-1} = g_i \, d\lambda T^{-1}, i=1,2. The measures on the two spaces are related by the formula

\int_{T^{-1}(F)} h \circ T \, d\alpha = \int_{F} h \, d\alpha T^{-1},

where \alpha may be any one of \mu_1, \mu_2, and \lambda, and h is any measurable function (Halmos and Savage, 1949, Lemma 3). Therefore,

K(\mu_1, \mu_2) - K(\mu_1 T^{-1}, \mu_2 T^{-1}) = \int_{\mathbb{M}} \log\frac{f_1}{f_2} \, d\mu_1 - \int_{\mathbb{M}'} \log \frac{g_1}{g_2} \, d\mu_1 T^{-1} \\ = \int \log\frac{f_1}{f_2} \, d\mu_1 - \int \log\frac{g_1 \circ T}{g_2 \circ T} \, d\mu_1 = (\star).

By letting h = \frac{f_1 \, g_2 \circ T}{f_2 \, g_1 \circ T}, we find

(\star) = \int h \log h \, \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2 = \int h \log h \, d\gamma,

where \gamma is such that d\gamma = \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2. It is a probability measure since \int \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2 = 1.

By convexity of \phi : t \mapsto t \log t, we find

\int h \log h \, d\gamma \geq \phi\left( \int h \, d\gamma \right) = \phi(1) = 0,

since \int h \, d\gamma = \int \frac{f_1}{f_2} \, d\mu_2 = 1 (as in the proof of Lemma 1, we may assume that \mu_1 is dominated by \mu_2, for otherwise K(\mu_1, \mu_2) = \infty and the inequality is trivial). This finishes the proof. QED.
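A quick numerical check of Proposition 5 (assuming NumPy; the four-point distributions and the coarsening map T are hypothetical choices): pushing \mu_1 and \mu_2 forward through T can only decrease, and here strictly decreases, their relative entropy.

import numpy as np

def relative_entropy(p, q):
    # Discrete K(p, q); assumes every entry of q is positive.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Hypothetical distributions on M = {0, 1, 2, 3} and the coarsening map
# T(0) = T(1) = 0, T(2) = T(3) = 1; mu T^{-1} sums probabilities within blocks.
mu1 = np.array([0.40, 0.10, 0.30, 0.20])
mu2 = np.array([0.25, 0.25, 0.25, 0.25])

def push_forward(mu):
    return np.array([mu[0] + mu[1], mu[2] + mu[3]])

print(relative_entropy(mu1, mu2))                              # K(mu_1, mu_2), about 0.107
print(relative_entropy(push_forward(mu1), push_forward(mu2)))  # 0 here, <= the value above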


References

Good, I. J. (1985). Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics 2, pp. 249–270. Amsterdam: North-Holland/Elsevier Science Publishers.

Halmos, P. R. and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20(2), 225–241.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22(1), 79–86.