# Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

## The problem.

Let $f, g, h$ be three densities and suppose that $x_i \sim h$, $i \in \mathbb{N}$, independently. What happens to the likelihood ratio

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$

as $n\rightarrow \infty$?

Clearly, it depends. If $h = g \not = f$, then

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$

almost surely at an exponential rate. More generally, if $h$ is closer to $g$ than to $f$, in some sense, we’d expect that $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$. Such a measure of “closeness” or “divergence” between probability distributions is given by the Kullback-Leibler divergence

$D_{KL}(f, g) = \int f \log\frac{f}{g}.$

It can be verified that $D_{KL}(f,g) \geq 0$ with equality if and only if $f=g$, and that

$D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)$

almost surely at an exponential rate. Thus the K.-L. divergence can be used to solve our problem.
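A quick simulation makes $(1)$ concrete. The snippet below is an illustrative sketch (the choice of normal densities and all names are mine): it draws from $h$ and tracks the log-likelihood ratio, whose slope is $D_{KL}(h,g) - D_{KL}(h,f)$ by the strong law of large numbers.

```python
import math
import random

random.seed(0)

# Three unit-variance normal densities: f = N(2,1), g = N(0.5,1), h = N(0,1).
# For unit-variance normals, D_KL(N(a,1), N(b,1)) = (a - b)^2 / 2, so here
# D_KL(h, g) = 0.125 < D_KL(h, f) = 2, and (1) predicts the ratio tends to 0.
def log_ratio(x, mu_f=2.0, mu_g=0.5):
    """log f(x) - log g(x) for the two unit-variance normals."""
    return -(x - mu_f) ** 2 / 2 + (x - mu_g) ** 2 / 2

n = 20000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # x_i ~ h
log_lr = sum(log_ratio(x) for x in xs)

# The strong law gives log_lr / n -> D_KL(h, g) - D_KL(h, f) = -1.875 a.s.,
# so the likelihood ratio itself decays exponentially fast.
print(log_lr / n)  # close to -1.875
```

The negative slope is exactly the gap between the two divergences, which is the sense in which the convergence in $(1)$ is exponential.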

## Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in $(1)$ (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

# The asymptotic behaviour of posterior distributions – 1

I describe key ideas of the theory of posterior distribution asymptotics and work out small examples.

# 1. Introduction

Consider the problem of learning about the probability distribution that a stochastic mechanism is following, based on independent observations. You may quantify your uncertainty about what distribution the mechanism is following through a prior $P$ defined on the space of all possible distributions, and then obtain the conditional distribution of $P$ given the observations to correspondingly adjust your uncertainty. In the simplest case, the prior is concentrated on a space $\mathcal{M}$ of distributions dominated by a common measure $\lambda$. This means that for each probability measure $\mu \in \mathcal{M}$ there exists $f$ with $\mu(E) = \int_E f \,d\lambda$ for all measurable $E$. We may thus identify $\mathcal{M}$ with a space $\mathbb{F}$ of probability densities. Given independent observations $X = (X_1, \dots, X_n)$, the conditional distribution of $P$ given $X$ is then

$P(A \mid X) = \frac{\int_A f(X) P(df)}{\int_{\mathbb{F}} f(X) P(df)}, \quad A \subset \mathbb{F},$

where $f(X) = \prod_{i=1}^n f(X_i)$ and $A$ is measurable. This conditional distribution is called the posterior distribution of $P$, and is understood as a Lebesgue integral relative to the measure $P$ defined on $\mathbb{F} \simeq \mathcal{M}$.

The procedure of conditioning $P$ on $X$ is Bayesian learning. It is expected that, as more and more data are gathered, the posterior distribution will converge, in a suitable way, to a point mass located at the true distribution that the data is following. This is the asymptotic behavior we study.

## 1.1 Notations

The elements of measure theory and the language of mathematical statistics I use are standard and very nicely reviewed in Halmos and Savage (1949). To keep notations light, I avoid making explicit references to measure spaces. All sets and maps considered are measurable.

# 2. The relative entropy of probability measures

Let $\mu_1$ and $\mu_2$ be two probability measures on a metric space $\mathbb{M}$, and let $\lambda$ be a common $\sigma$-finite dominating measure, such as $\mu_1 + \mu_2$. That is, $\mu_i(E) = 0$ whenever $E \subset \mathbb{M}$ is of $\lambda$-measure $0$; by the Radon-Nikodym theorem, this is equivalent to the existence of (almost everywhere unique) densities $f_1$, $f_2$ with $d\mu_i = f_i \, d\lambda$. The relative entropy of $\mu_1$ and $\mu_2$, also known as the Kullback-Leibler divergence, is defined as

$K(\mu_1, \mu_2) = \int \log \frac{f_1}{f_2} d\mu_1\,.$

The following inequality will be of much use.

Lemma 1 (Kullback and Leibler (1951)). We have $K(\mu_1, \mu_2) \geq 0$, with equality if and only if $\mu_1 = \mu_2$.

Proof: We may assume that $\mu_1$ is dominated by $\mu_2$, as otherwise $K(\mu_1, \mu_2) = \infty$ and the lemma holds. Now, let $\phi(t) = t\log t$, $g = f_1/f_2$, and write $K(\mu_1, \mu_2) = \int \phi \circ g \, d\mu_2$. Since $\phi(t) = (t-1) + (t-1)^2 / h(t)$ for some function $h(t) > 0$, we find

$K(\mu_1, \mu_2) = \int (g-1) \, d\mu_2 + \int \frac{(g-1)^2}{h\circ g} \, d\mu_2 = \int \frac{(g-1)^2}{h\circ g} \, d\mu_2 \geq 0,$

where the first integral vanishes since $\int g \, d\mu_2 = \int f_1 \, d\lambda = 1$, with equality if and only if $f_1/f_2 = 1$, $\mu_2$-almost everywhere. QED.
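As a small sanity check of Lemma 1, the relative entropy between two univariate normal distributions has a well-known closed form, evaluated below (the function name is mine, for illustration):

```python
import math

def kl_normal(m1, s1, m2, s2):
    """Closed-form K(N(m1, s1^2), N(m2, s2^2)) for univariate normals."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_normal(0, 1, 0, 1))  # 0.0: equal measures give zero divergence
print(kl_normal(0, 1, 1, 2))  # strictly positive when the measures differ
print(kl_normal(1, 2, 0, 1))  # also positive, but different: K is not symmetric
```

The last two lines also illustrate that $K$ is not a metric: it is asymmetric in its arguments.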

We may already apply the lemma to problems of statistical inference.

## 2.1 Discriminating between point hypotheses

Alice observes $X = (X_1, \dots, X_n)$, $X_i \sim^{ind} \mu_*$, $\mu_* \in \{\mu_1, \mu_2\}$, and considers the hypotheses $H_i = \{\mu_i\}$, $i=1,2$, representing “$\mu_* = \mu_i$”. A prior distribution on $H_1 \cup H_2$ takes the form

$P = \alpha \delta_1 + (1-\alpha)\delta_2, \quad 0 < \alpha < 1,$

where $\delta_i(A)$ is $1$ when $\mu_i \in A$ and is $0$ otherwise. Given a sample $X_1 \sim \mu_1$, the weight of evidence for $H_1$ versus $H_2$ (Good, 1985), also known as “the information in $X_1$ for discrimination between $H_1$ and $H_2$” (Kullback and Leibler, 1951), is defined as

$W(X_1) = \log\frac{f_1(X_1)}{f_2(X_1)}.$

This quantity is additive for independent samples: if $X = (X_1, \dots, X_n)$, $X_i \sim^{ind} \mu_*$, then

$W(X) = \sum_{i=1}^n W(X_i);$

and the posterior log-odds are given by
$\log\frac{P(H_1 \mid X)}{P(H_2 \mid X)} = W(X) + \log\frac{P(H_1)}{P(H_2)}.$
Thus $K(\mu_1, \mu_2)$ is the expected weight of evidence for $H_1$ against $H_2$ brought by a single sample $X_1 \sim \mu_1$, and it is strictly positive whenever $\mu_1 \neq \mu_2$. By the additivity of $W$, Alice should expect the weight of evidence to grow to infinity as more and more data are gathered. In fact, this happens $\mu_1$-almost surely.

Proposition 2. $\mu_1$-almost surely as the number of observations grows, $W(X) \rightarrow \infty$ and $P(H_1 \mid X)\rightarrow 1$.

Proof: By the law of large numbers (the case $K(\mu_1, \mu_2) = \infty$ is easily treated separately), we have

$\frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)}\rightarrow K(\mu_1, \mu_2) > 0 \quad \mu_1\text{-a.s.}$

Therefore

$W(X) = n \left( \frac{1}{n}\sum_{i=1}^n \log\frac{f_1(X_i)}{f_2(X_i)} \right) \rightarrow \infty \quad \mu_1\text{-a.s.}$

and $P(H_1\mid X) \rightarrow 1$ $\mu_1$-almost surely. QED.
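Proposition 2 is easy to watch numerically. The sketch below (an illustration under assumptions of my choosing: two unit-variance normal hypotheses and a uniform prior) accumulates the weight of evidence sample by sample and converts the posterior log-odds into $P(H_1 \mid X)$.

```python
import math
import random

random.seed(1)

# H1: mu_1 = N(0,1), H2: mu_2 = N(1,1); the data are drawn from mu_1.
# Here K(mu_1, mu_2) = 1/2, so the log-odds grow like n/2.
def weight_of_evidence(x, m1=0.0, m2=1.0):
    """W(x) = log f1(x) - log f2(x) for the two unit-variance normals."""
    return -(x - m1) ** 2 / 2 + (x - m2) ** 2 / 2

alpha = 0.5                                   # prior P(H1)
log_odds = math.log(alpha / (1 - alpha))      # prior log-odds
for _ in range(2000):
    log_odds += weight_of_evidence(random.gauss(0.0, 1.0))

posterior_h1 = 1 / (1 + math.exp(-log_odds))  # invert the log-odds
print(log_odds, posterior_h1)  # log-odds near n * K(mu_1, mu_2) = 1000
```

With 2000 observations the posterior probability of $H_1$ is numerically indistinguishable from $1$, as the proposition predicts.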

## 2.2 Finite mixture prior under misspecification

Alice wants to learn about a fixed unknown distribution $\mu_*$ through data $X =(X_1, \dots, X_n)$, where $X_i \sim^{ind.} \mu_*$. She models what $\mu_*$ may be as one of the distributions in $\mathcal{M} = \{\mu_1, \dots, \mu_k\}$, and quantifies her uncertainty through a prior $P$ on $\mathcal{M}$. We may assume that $K(\mu_*, \mu_i) < \infty$ for some $i$, as otherwise she will eventually observe data that is impossible under $\mathcal{M}$ and adjust her model. (Indeed, when $K(\mu_*, \mu_i)=\infty$ because $\mu_*$ is not dominated by $\mu_i$, there exists a $\mu_*$-non-negligible set $E_i$ such that $\mu_i(E_i) = 0 < \mu_*(E_i)$. Alice will $\mu_*$-almost surely observe some $X_m \in E_i$ and conclude that $\mu_* \neq \mu_i$.) If $\mu_* \not\in \mathcal{M}$, she may not realise that her model is wrong, but the following proposition ensures that the posterior distribution will concentrate on the $\mu_i$’s that are closest to $\mu_*$.

Proposition 3. Let $A = \{\mu_i \in \mathcal{M} \mid K(\mu_*, \mu_i) = \min_j K(\mu_*, \mu_j)\}$. Almost surely as the number of observations grows, $P(A \mid X) \rightarrow 1$.

Proof: The prior takes the form $P = \sum_{i=1}^{k} \alpha_i \delta_i$, $\alpha_i > 0$, where $\delta_i$ is a point mass at $\mu_i$. The model $\mathcal{M}$ is dominated by a $\sigma$-finite measure $\lambda$, such as $\sum_i \mu_i$, so that the posterior distribution is

$P(A\mid X) = \frac{\sum_{\mu_i \in A} f_i(X) \alpha_i }{\sum_{\mu_i \in \mathcal{M}} f_i(X) \alpha_i}, \quad d\mu_i = f_i d\lambda.$

Because $K(\mu_*, \mu_i) < \infty$ for some $i$, $\mu_*$ is also absolutely continuous with respect to $\lambda$, and we let $d\mu_* = f_* \, d\lambda$. Now, let $\varepsilon = \min_j K(\mu_*, \mu_j)$ and $\delta = \min\{K(\mu_*, \mu_j) \mid \mu_j \in A^c\} > \varepsilon$. Write

$\log \frac{P(A | X)}{P(A^c | X)} \geq \log \frac{f_A(X)}{f_{A^c}(X)} + \log \frac{\sum_{\mu_i \in A} \alpha_i}{ \sum_{\mu_i \in A^c} \alpha_i},$

where $f_A = \arg \min_{f_i : \mu_i \in A} f_i(X)$ and $f_{A^c} = \arg \max_{f_i : \mu_i \in A^c} f_i(X)$ both depend on $n$. Using the law of large numbers, we find

$\liminf_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^n \log\frac{f_*(X_i)}{f_{A^c}(X_i)} \geq \delta > \varepsilon = \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n \log\frac{f_*(X_i)}{f_A (X_i)}\quad \mu_*\text{-a.s.},$

and therefore

$\log \frac{f_A(X)}{f_{A^c}(X)} = n\left( \frac{1}{n} \sum_{i=1}^n \log\frac{f_*(X_i)}{f_{A^c}(X_i)} - \frac{1}{n}\sum_{i=1}^n \log\frac{f_*(X_i)}{f_A (X_i)} \right) \rightarrow \infty\quad \mu_*\text{-a.s.}$

This implies $P(A \mid X) \rightarrow 1$, $\mu_*$-almost surely. QED.
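Proposition 3 can also be checked by simulation. The sketch below uses assumptions of my own choosing: the truth $\mu_* = N(0.4, 1)$ lies outside the model $\mathcal{M} = \{N(0,1), N(1,1), N(3,1)\}$, and $K(\mu_*, N(m,1)) = (0.4-m)^2/2$ is minimized at $m = 0$, so $A$ is the singleton containing $N(0,1)$.

```python
import math
import random

random.seed(2)

# Misspecified model: the truth mu_* = N(0.4, 1) is not among the candidates.
means = [0.0, 1.0, 3.0]
alphas = [1 / 3, 1 / 3, 1 / 3]          # uniform prior weights

log_post = [math.log(a) for a in alphas]  # unnormalized log posterior weights
for _ in range(5000):
    x = random.gauss(0.4, 1.0)            # X_i ~ mu_*
    for j, m in enumerate(means):
        log_post[j] += -(x - m) ** 2 / 2  # add log-likelihood (constant dropped)

# Normalize with the log-sum-exp trick to avoid underflow.
top = max(log_post)
weights = [math.exp(lp - top) for lp in log_post]
total = sum(weights)
probs = [w / total for w in weights]
print(probs)  # nearly all mass on the KL-closest candidate, N(0,1)
```

Even though no candidate is correct, the posterior concentrates on the element of $\mathcal{M}$ at minimal relative entropy from $\mu_*$, exactly as the proposition states.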

## 2.3 Properties of the relative entropy

The following properties justify the interpretation of the relative entropy $K(\mu_1, \mu_2)$ as an expected weight of evidence.

Lemma 4. The relative entropy $K(\mu_1, \mu_2) = \int f_1 \log\frac{f_1}{f_2} \, d\lambda$, $d\mu_i = f_i \, d\lambda$, does not depend on the choice of dominating measure $\lambda$.

Let $T: \mathbb{M} \rightarrow \mathbb{M}'$ be a mapping into a space $\mathbb{M}'$. If $X \sim \mu$, then $T(X) \sim \mu T^{-1}$, where $\mu T^{-1}$ is the probability measure on $\mathbb{M}'$ defined by $\mu T^{-1}(F) = \mu (T^{-1}(F))$. The following proposition, a particular case of Kullback and Leibler (1951, Theorem 4.1), states that transforming the data through $T$ cannot increase the expected weight of evidence.

Proposition 5 (see Kullback and Leibler (1951)). For any two probability measures $\mu_1, \mu_2$ on $\mathbb{M}$, we have

$K(\mu_1 T^{-1}, \mu_2 T^{-1}) \le K(\mu_1, \mu_2).$

Proof: Let $\lambda$ be a dominating measure for $\{\mu_1, \mu_2\}$, $d\mu_i = f_i \, d\lambda$. Then $\mu_i T^{-1}$ is dominated by $\lambda T^{-1}$, and we may write $d\mu_i T^{-1} = g_i \, d\lambda T^{-1}$, $i=1,2$. The measures on the two spaces are related by the formula

$\int_{T^{-1}(F)} h \circ T \, d\alpha = \int_{F} h \, d\alpha T^{-1},$

where $\alpha$ may be any of $\mu_1$, $\mu_2$, or $\lambda$, and $h$ is any measurable function (Halmos and Savage, 1949, Lemma 3). Therefore,

$K(\mu_1, \mu_2) - K(\mu_1 T^{-1}, \mu_2 T^{-1}) = \int_{\mathbb{M}} \log\frac{f_1}{f_2} \, d\mu_1 - \int_{\mathbb{M}'} \log \frac{g_1}{g_2} \, d\mu_1 T^{-1} \\ = \int \log\frac{f_1}{f_2} \, d\mu_1 - \int \log\frac{g_1 \circ T}{g_2 \circ T} \, d\mu_1 = (\star).$

By letting $h = \frac{f_1 \, g_2 \circ T}{f_2 \, g_1 \circ T}$, so that $(\star) = \int \log h \, d\mu_1$ and $d\mu_1 = h \, \frac{g_1 \circ T}{g_2 \circ T} \, d\mu_2$, we find

$(\star) = \int h \log h \, \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2 = \int h \log h \, d\gamma,$

where $\gamma$ is the measure with $d\gamma = \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2$. It is a probability measure since $\int \frac{g_1\circ T}{g_2 \circ T} \, d\mu_2 = \int_{\mathbb{M}'} g_1 \, d\lambda T^{-1} = 1$.

By convexity of $\phi : t \mapsto t \log t$ and Jensen's inequality, we find

$\int h \log h \, d\gamma \geq \phi\left( \int h \, d\gamma \right) = \phi(1) = 0,$

since $\int h \, d\gamma = \int f_1 \, d\lambda = 1$, and this finishes the proof. QED.
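Proposition 5 is transparent in the discrete case, where the pushforward $\mu T^{-1}$ just sums probabilities over the fibers of $T$. The sketch below (the distributions and the coarsening map are my own illustrative choices) verifies the inequality for a two-to-one map $T$.

```python
import math

def kl(p, q):
    """K(p, q) for finite distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two distributions on {0, 1, 2, 3}.
p = [0.1, 0.4, 0.3, 0.2]
q = [0.25, 0.25, 0.25, 0.25]

# T coarsens the space: T(0) = T(1) = 'low', T(2) = T(3) = 'high'.
# The pushforwards p T^{-1}, q T^{-1} sum probabilities over each fiber.
pT = [p[0] + p[1], p[2] + p[3]]
qT = [q[0] + q[1], q[2] + q[3]]

print(kl(p, q), kl(pT, qT))  # the pushforward divergence is no larger
```

Here $T$ discards the within-fiber detail that distinguishes $p$ from $q$, so the transformed data carry less expected weight of evidence, consistent with the proposition.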

# References.

Good, I. J. (1985). Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (Eds.), Bayesian Statistics 2, pp. 249–270. Amsterdam: North-Holland.

Halmos, P. R. and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20(2), 225–241.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22(1), 79–86.