Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

The problem.

Let f, g, h be three densities and suppose that x_i \sim h, i \in \mathbb{N}, independently. What happens to the likelihood ratio

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}

as n\rightarrow \infty?

Clearly, it depends. If h = g \not = f, then

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0

almost surely at an exponential rate. More generally, if h is closer to g than to f, in some sense, we’d expect that \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0. Such a measure of “closeness” or “divergence” between probability distributions is given by the Kullback-Leibler divergence

D_{KL}(f, g) = \int f \log\frac{f}{g}.

It can be verified that D_{KL}(f,g) \geq 0 with equality if and only if f=g, and that

D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)

almost surely at an exponential rate. Thus the K.-L. divergence can be used to solve our problem.
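
To make (1) concrete, here is a small numerical sketch; the Gaussian choices (and all numbers) below are mine, not part of the argument. With h = N(0,1), g = N(1/2,1) and f = N(2,1), we have D_{KL}(h,g) = 1/8 < 2 = D_{KL}(h,f), so by the strong law of large numbers the average \frac{1}{n}\sum_{i=1}^n \log\frac{f(x_i)}{g(x_i)} should settle near D_{KL}(h,g) - D_{KL}(h,f) = -15/8, i.e. the likelihood ratio decays exponentially.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Data generated from h = N(0, 1); compare f = N(2, 1) against g = N(0.5, 1).
    h, g, f = norm(0, 1), norm(0.5, 1), norm(2, 1)
    n = 10_000
    x = h.rvs(size=n, random_state=rng)

    # Running average of the per-observation log likelihood ratio log f(x_i) - log g(x_i).
    running_mean = np.cumsum(f.logpdf(x) - g.logpdf(x)) / np.arange(1, n + 1)

    # For unit-variance Gaussians, D_KL(N(a,1), N(b,1)) = (a - b)^2 / 2, so the limit
    # is D_KL(h, g) - D_KL(h, f) = 0.125 - 2 = -1.875 < 0: the product of the
    # ratios f(x_i)/g(x_i) goes to 0 exponentially fast.
    print(running_mean[-1])  # close to -1.875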

Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in (1) (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

This is of interest in Bayesian nonparametrics. The hypothesis that a density f_0 is in the Kullback-Leibler support of a prior \Pi on a density space \mathbb{F} is used to ensure that

\int_{\mathbb{F}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi(df)

does not converge to 0 exponentially fast as n\rightarrow \infty when x_i \sim f_0 independently. Under the conditions we specify, our remark implies that the hypothesis that “f_0 is in the Kullback-Leibler support of \Pi” cannot be replaced by a weaker one.
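
A toy sketch of this mechanism (the finite prior and all numbers below are my own choices; an actual nonparametric prior would instead put positive mass on every Kullback-Leibler neighborhood of f_0): if \Pi charges f_0 itself, the Kullback-Leibler support condition holds trivially and the integrated likelihood ratio is bounded below by a constant, so it cannot decay exponentially.

    import numpy as np
    from scipy.stats import norm
    from scipy.special import logsumexp

    rng = np.random.default_rng(1)

    # Toy "prior" Pi: uniform weights on three unit-variance Gaussians; the first
    # equals the true density f_0 = N(0, 1), so f_0 is in the KL support of Pi.
    means = np.array([0.0, 1.0, 3.0])
    weights = np.full(3, 1 / 3)

    f0 = norm(0, 1)
    n = 5_000
    x = f0.rvs(size=n, random_state=rng)

    # log of prod_i f_k(x_i)/f_0(x_i) for each density f_k in the support of Pi.
    log_lr = np.array([np.sum(norm(m, 1).logpdf(x) - f0.logpdf(x)) for m in means])

    # Integrated likelihood ratio int prod_i f(x_i)/f_0(x_i) Pi(df), on the log scale.
    log_marginal = logsumexp(log_lr, b=weights)
    print(log_marginal / n)  # ~ 0: no exponential decay, since Pi({f_0}) = 1/3 > 0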

Statement of the remark

Some notation. Let \mathcal{M} be the space of all probability measures on some measurable space \mathbb{M}. If \mu, \nu \in \mathcal{M}, then both are absolutely continuous with respect to \tau = \mu + \nu and possess densities f, g such that d\mu = f d\tau and d\nu = g d\tau. We denote by d\mu /d\nu the ratio of densities f / g, which in fact does not depend on the choice of dominating measure \tau. The likelihood ratio \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} is abbreviated to \frac{d\mu}{d\nu}(X), depending implicitly on n. The Kullback-Leibler divergence between \mu and \nu is

D_{KL}(\mu, \nu) = \int \log\frac{d\mu}{d\nu} d\mu.

We let D be any other function \mathcal{M}\times \mathcal{M} \rightarrow [0,\infty] such that D(\mu, \nu) = 0 iff \mu = \nu and such that if D(\lambda, \nu) < D(\lambda, \mu), then there exists an \varepsilon > 0 with

e^{\varepsilon n} \frac{d\mu}{d\nu}(X) = e^{\varepsilon n} \prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0

almost surely as n \rightarrow \infty when x_i \sim \lambda, i \in \mathbb{N}, independently. The topology on a subset \mathcal{F} \subset \mathcal{M} induced by such a function D is the topology generated by the sets

\{\nu \,|\, D(\mu, \nu) < \varepsilon\},\quad \mu \in \mathcal{F}, \, \varepsilon > 0.

The remark below shows that any exponential rate of convergence of the likelihood ratio is picked up by the K.-L. divergence. It is rather obvious (albeit a bit technical), but I thought it was worth writing up properly.

Remark.
Let x_i \sim \mu, i \in \mathbb{N}, independently, and let \frac{d\nu}{d\mu}(X) = \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i).

  1. We have that D_{KL}(\mu, \nu) < \infty if and only if \frac{d\nu}{d\mu}(X) does not converge to 0 more than exponentially fast (i.e. there exists \varepsilon > 0 such that e^{\varepsilon n}\frac{d\nu}{d\mu}(X) \rightarrow \infty).
  2. If \mathcal{F} \subset \mathcal{M} is such that D_{KL}(\mu, \nu) < \infty for all \mu, \nu \in \mathcal{F}, then

D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu)

and the topology on \mathcal{F} induced by D_{KL} is weaker than the topology induced by any other function D defined as above.

Proof of 1.
Suppose that D_{KL}(\mu, \nu) < \infty. Then, since by the strong law of large numbers D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow \varepsilon > 0, we find that

\log\left(e^{n (D_{KL}(\mu, \nu) + \varepsilon)} \frac{d\nu}{d\mu}(X)\right) = n \left(D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)\right) \rightarrow \infty

for all \varepsilon > 0.

If D_{KL}(\mu, \nu) = \infty, then for all \varepsilon > 0 we have

\log \left(e^{n \varepsilon}\frac{d\nu}{d\mu}(X)\right) = n\left(\varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)\right) \rightarrow -\infty

since \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow D_{KL}(\mu, \nu) = \infty almost surely; the strong law of large numbers still applies here because the negative part of \log\frac{d\mu}{d\nu} is \mu-integrable (it is bounded by \frac{d\nu}{d\mu}, since \log x \leq x). \Box
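
To see part 1 at work when the divergence is infinite, here is a small simulation; the choice \mu = standard Cauchy and \nu = standard normal is mine, and gives D_{KL}(\mu, \nu) = \infty because \log\frac{d\mu}{d\nu}(x) grows like x^2/2 while the Cauchy tails only decay like 1/x^2. The running average \frac{1}{n}\log\frac{d\nu}{d\mu}(X) should then drift to -\infty rather than stabilize, so that e^{\varepsilon n}\frac{d\nu}{d\mu}(X) \rightarrow 0 for every \varepsilon > 0.

    import numpy as np
    from scipy.stats import cauchy, norm

    rng = np.random.default_rng(2)

    # mu = standard Cauchy, nu = standard normal: D_KL(mu, nu) is infinite.
    mu, nu = cauchy(), norm()
    n = 100_000
    x = mu.rvs(size=n, random_state=rng)

    # Running average (1/n) log(dnu/dmu)(X). With D_KL(mu, nu) infinite it drifts
    # to -infinity instead of settling at a finite value, i.e. the likelihood
    # ratio dnu/dmu(X) converges to 0 faster than any exponential rate.
    avg = np.cumsum(nu.logpdf(x) - mu.logpdf(x)) / np.arange(1, n + 1)
    print(avg[[99, 9_999, n - 1]])  # increasingly negative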

Proof of 2.
Suppose, to the contrary, that there exist \lambda, \mu, \nu \in \mathcal{F} such that D(\lambda, \mu) < D(\lambda, \nu) but D_{KL}(\lambda, \mu) \geq D_{KL}(\lambda, \nu). By the defining property of D (with x_i \sim \lambda), there is an \varepsilon > 0 such that e^{2\varepsilon n} \frac{d\nu}{d\mu}(X) \rightarrow 0. On the other hand, since D_{KL}(\lambda, \nu) - D_{KL}(\lambda, \mu) \leq 0,

e^{2\varepsilon n} \frac{d\nu}{d\mu}(X) \geq e^{n(D_{KL}(\lambda, \nu) - D_{KL}(\lambda, \mu) + 2\varepsilon)} \frac{d\nu}{d\mu}(X) = e^{n(D_{KL}(\lambda, \nu) + \varepsilon)} \frac{d\nu}{d\lambda}(X) \Big/ \left( e^{n(D_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \right).

But, arguing as in the proof of 1 (both Kullback-Leibler divergences being finite on \mathcal{F}), e^{n(D_{KL}(\lambda, \nu) + \varepsilon)} \frac{d\nu}{d\lambda}(X) \rightarrow \infty and e^{n(D_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \rightarrow 0, so the right-hand side tends to \infty, which yields the contradiction.

Since D(\lambda, \mu) =0 iff \lambda = \mu, this implies that the topology on \mathcal{F} induced by D_{KL} is weaker than the one induced by D. \Box
