My short paper A Note on Reverse Pinsker Inequalities was published in early access in the IEEE Transactions on Information Theory a few days ago (arXiv version here). I thought I would try to explain to non-specialists what this paper is about and why I got interested.
Let $p$, $q_0$ and $q_1$ be three densities and suppose that $X_1, X_2, \ldots \sim p$, independently. What happens to the likelihood ratio

$$L_n = \prod_{i=1}^n \frac{q_1(X_i)}{q_0(X_i)}?$$

Clearly, it depends. If $p = q_1 \neq q_0$, then

$$L_n \to \infty$$

almost surely at an exponential rate. More generally, if $p$ is closer to $q_1$ than to $q_0$, in some sense, we'd expect that $L_n \to \infty$. Such a measure of "closeness" or "divergence" between probability distributions is given by the Kullback-Leibler divergence

$$D(p \| q) = \int p \log\left(\frac{p}{q}\right).$$

It can be verified that $D(p \| q) \geq 0$ with equality if and only if $p = q$, and that whenever $D(p \| q_1) < D(p \| q_0)$,

$$\frac{1}{n} \log L_n \to D(p \| q_0) - D(p \| q_1) > 0,$$

so that $L_n \to \infty$ almost surely at an exponential rate. Thus the Kullback-Leibler divergence can be used to solve our problem.
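To see this convergence numerically, here is a small simulation sketch (my own illustration, not from the paper), using unit-variance normal densities whose Kullback-Leibler divergences have a closed form; the means and sample size are arbitrary choices:

```python
import math
import random

# Sketch (my own illustration): check that (1/n) log L_n converges to
# D(p||q0) - D(p||q1) when X_i ~ p, for unit-variance normal densities.

def kl_normal(m1, m2):
    # D(N(m1,1) || N(m2,1)) = (m1 - m2)^2 / 2
    return (m1 - m2) ** 2 / 2

def log_density(x, m):
    # log of the N(m, 1) density at x
    return -0.5 * math.log(2 * math.pi) - (x - m) ** 2 / 2

random.seed(0)
m_p, m_q1, m_q0 = 0.0, 0.5, 1.0   # arbitrary choices
n = 200_000
xs = [random.gauss(m_p, 1.0) for _ in range(n)]

# (1/n) log L_n, with L_n = prod_i q1(X_i) / q0(X_i)
rate = sum(log_density(x, m_q1) - log_density(x, m_q0) for x in xs) / n
expected = kl_normal(m_p, m_q0) - kl_normal(m_p, m_q1)  # 0.5 - 0.125 = 0.375
print(f"empirical rate {rate:.3f}, limit {expected:.3f}")
```

With these means the limit is $0.5 - 0.125 = 0.375$, and the empirical average of $\log(q_1(X_i)/q_0(X_i))$ settles near that value, so the likelihood ratio grows exponentially.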
Better measures of divergence?
There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as above (e.g. the discrete distance). However, in this note, I give conditions under which the Kullback-Leibler divergence is, up to topological equivalence, the "best" measure of divergence.
I describe key ideas of the theory of posterior distribution asymptotics and work out small examples.
Consider the problem of learning about the probability distribution $P$ that a stochastic mechanism is following, based on independent observations. You may quantify your uncertainty about what distribution the mechanism is following through a prior $\Pi$ defined on the space of all possible distributions, and then obtain the conditional distribution of $P$ given the observations to correspondingly adjust your uncertainty. In the simplest case, the prior is concentrated on a space $\mathcal{M}$ of distributions dominated by a common measure $\mu$. This means that each probability measure $P$ in $\mathcal{M}$ is such that there exists a density $f$ with $P(A) = \int_A f \, d\mu$ for all measurable $A$. We may thus identify $\mathcal{M}$ with a space $\mathcal{F}$ of probability densities. Given independent observations $X_1, \ldots, X_n$, the conditional distribution of $f$ given $X_1, \ldots, X_n$ is then

$$\Pi(A \mid X_1, \ldots, X_n) = \frac{\int_A \prod_{i=1}^n f(X_i) \, d\Pi(f)}{\int_{\mathcal{F}} \prod_{i=1}^n f(X_i) \, d\Pi(f)},$$

where $A \subset \mathcal{F}$ is measurable. This conditional distribution is called the posterior distribution of $f$, and $\int_A \prod_{i=1}^n f(X_i) \, d\Pi(f)$ is understood as a Lebesgue integral relative to the measure $\Pi$ defined on $\mathcal{F}$.
The procedure of conditioning on the observations is Bayesian learning. It is expected that, as more and more data is gathered, the posterior distribution will converge, in a suitable sense, to a point mass located at the true distribution that the data is following. This is the asymptotic behavior we study.
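To make the posterior formula concrete, here is a toy computation (my own, with arbitrary choices of candidate parameters and data): a uniform prior over three Bernoulli densities, updated on 100 binary observations. Working with log-probabilities avoids numerical underflow in the products:

```python
import math

# Toy illustration (mine, not from the post): posterior over a finite set
# of candidate Bernoulli densities under a uniform prior, using
# Pi({theta} | X_1..X_n) proportional to prior(theta) * prod_i f_theta(X_i).
thetas = [0.2, 0.5, 0.8]          # arbitrary candidate parameters
prior = [1 / 3, 1 / 3, 1 / 3]

data = [1] * 52 + [0] * 48        # 52 successes in 100 trials

log_post = [
    math.log(w) + sum(math.log(t if x == 1 else 1 - t) for x in data)
    for w, t in zip(prior, thetas)
]
m = max(log_post)
unnorm = [math.exp(lp - m) for lp in log_post]  # stabilise before normalising
posterior = [u / sum(unnorm) for u in unnorm]
print({t: round(p, 4) for t, p in zip(thetas, posterior)})
```

With 52 successes in 100 trials, essentially all of the posterior mass lands on the candidate $\theta = 0.5$, a first hint of the concentration phenomenon studied below.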
The elements of measure theory and the language of mathematical statistics I use are standard and very nicely reviewed in Halmos and Savage (1949). To keep notation light, I avoid making explicit references to measure spaces. All sets and maps considered are measurable.
2. The relative entropy of probability measures
Let $P$ and $Q$ be two probability measures on a metric space $\mathcal{X}$, and let $\mu$ be a common $\sigma$-finite dominating measure, such as $\mu = P + Q$. That is, $P(A) = Q(A) = 0$ whenever $A$ is of $\mu$-measure $0$, and by the Radon-Nikodym theorem this is equivalent to the fact that there exist unique densities $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$ with $P(A) = \int_A p \, d\mu$ and $Q(A) = \int_A q \, d\mu$. The relative entropy of $P$ and $Q$, also known as the Kullback-Leibler divergence, is defined as

$$D(P \| Q) = \int_{\mathcal{X}} p \log\left(\frac{p}{q}\right) d\mu,$$

with the conventions that $0 \log 0 = 0$ and that $D(P \| Q) = \infty$ when $P$ is not dominated by $Q$.
The following inequality will be of much use.
Lemma 1 (Kullback and Leibler (1951)). We have $D(P \| Q) \geq 0$, with equality if and only if $P = Q$.
Proof: We may assume that $P$ is dominated by $Q$, as otherwise $D(P \| Q) = \infty$ and the lemma holds. Now, let $\mu = P + Q$, and write $D(P \| Q) = \int_{\{p > 0\}} p \log(p/q) \, d\mu$. Since $\log x \leq x - 1$ for all $x > 0$, we find

$$-D(P \| Q) = \int_{\{p > 0\}} p \log\left(\frac{q}{p}\right) d\mu \leq \int_{\{p > 0\}} p \left(\frac{q}{p} - 1\right) d\mu = \int_{\{p > 0\}} q \, d\mu - 1 \leq 0,$$

with equality if and only if $p = q$, $\mu$-almost everywhere. QED.
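Lemma 1 is easy to check numerically for discrete distributions; the following sketch (mine, not from the text) verifies nonnegativity on random probability vectors, along with the equality case:

```python
import math
import random

# Numerical sanity check of Lemma 1 (my own, for discrete distributions):
# D(P||Q) >= 0, with equality when P = Q.

def kl(p, q):
    # relative entropy for probability vectors with matching support
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_simplex(k, rng):
    # a random probability vector of length k
    raw = [rng.random() for _ in range(k)]
    s = sum(raw)
    return [v / s for v in raw]

rng = random.Random(0)
for _ in range(1000):
    p = random_simplex(5, rng)
    q = random_simplex(5, rng)
    assert kl(p, q) >= 0.0

print(kl([0.5, 0.5], [0.25, 0.75]), kl([0.3, 0.7], [0.3, 0.7]))
```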
We may already apply the lemma to problems of statistical inference.
2.1 Discriminating between point hypotheses
Alice observes $X_1, X_2, \ldots \sim P$, independently, and considers the hypotheses $H_0$, $H_1$, with $H_i$ representing "$P = P_i$". A prior distribution on $\{P_0, P_1\}$ takes the form

$$\Pi = \pi_0 \delta_{P_0} + \pi_1 \delta_{P_1},$$

where $\delta_{P_i}(A)$ is $1$ when $P_i \in A$ and is $0$ otherwise. Given a sample $X$, the weight of evidence for $H_1$ versus $H_0$ (Good, 1985), also known as "the information in $X$ for discrimination between $H_1$ and $H_0$" (Kullback and Leibler, 1951), is defined as

$$W(X) = \log \frac{p_1(X)}{p_0(X)},$$

where $p_0$ and $p_1$ are densities of $P_0$ and $P_1$ with respect to a common dominating measure.
This quantity is additive over independent samples: if $X_1, \ldots, X_n \sim P$, independently, then

$$W(X_1, \ldots, X_n) = \sum_{i=1}^n W(X_i),$$

and the posterior log-odds are given by

$$\log \frac{\Pi(H_1 \mid X_1, \ldots, X_n)}{\Pi(H_0 \mid X_1, \ldots, X_n)} = \log \frac{\pi_1}{\pi_0} + W(X_1, \ldots, X_n).$$

Thus $D(P_1 \| P_0) = \mathrm{E}_{P_1}[W(X)]$ is the expected weight of evidence for $H_1$ against $H_0$ brought by a single observation when $H_1$ holds, and by Lemma 1 it is strictly positive whenever $P_1 \neq P_0$. By the additivity of $W$, Alice should expect that, as more and more data is gathered, the weight of evidence grows to infinity. In fact, this happens $P_1$-almost surely.
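The additivity of the weight of evidence and the log-odds update can be illustrated with a quick computation (my own toy example, testing $H_1 : P = N(1,1)$ against $H_0 : P = N(0,1)$ on an arbitrary sample):

```python
import math

# Sketch (my illustration): additivity of the weight of evidence and the
# posterior log-odds update, for H1: N(1,1) versus H0: N(0,1).

def log_density(x, m):
    # log of the N(m, 1) density at x
    return -0.5 * math.log(2 * math.pi) - (x - m) ** 2 / 2

def weight_of_evidence(xs, m1=1.0, m0=0.0):
    # W(X_1..X_n) = sum_i log p1(X_i)/p0(X_i)
    return sum(log_density(x, m1) - log_density(x, m0) for x in xs)

xs = [0.9, 1.1, 1.3]                      # arbitrary toy sample
w = weight_of_evidence(xs)
parts = [weight_of_evidence([x]) for x in xs]

prior_log_odds = math.log(0.5 / 0.5)      # equal prior weights pi_1 = pi_0
posterior_log_odds = prior_log_odds + w
print(round(w, 6), round(sum(parts), 6), round(posterior_log_odds, 6))
```

Here each observation contributes $\log\frac{p_1(x)}{p_0(x)} = x - \tfrac{1}{2}$ to the total weight, and the posterior log-odds are simply the prior log-odds shifted by that total.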
Proposition 2. Suppose that $P = P_1 \neq P_0$. Almost surely as the number of observations grows, $W(X_1, \ldots, X_n) \to \infty$ and $\Pi(H_1 \mid X_1, \ldots, X_n) \to 1$.

Proof: By the law of large numbers (the case $D(P_1 \| P_0) = \infty$ is easily treated separately), we have

$$\frac{1}{n} W(X_1, \ldots, X_n) \to D(P_1 \| P_0) > 0$$

$P_1$-almost surely, so that $W(X_1, \ldots, X_n) \to \infty$ and, by the posterior log-odds formula, $\Pi(H_1 \mid X_1, \ldots, X_n) \to 1$, $P_1$-almost surely. QED.
2.2 Finite mixture prior under misspecification
Alice wants to learn about a fixed unknown distribution $P_0$ through data $X_1, X_2, \ldots \sim P_0$, independently. She models what $P_0$ may be as one of the distributions in $\mathcal{M} = \{P_1, P_2, \ldots, P_k\}$, and quantifies her uncertainty through a prior on $\mathcal{M}$. We may assume that $P_0 \ll P_i$ for some $i$, as otherwise she will eventually observe data that is impossible under her model and adjust it. (Indeed, $P_0 \not\ll P_i$ implies that there exists a non-negligible set $A_i$ such that $P_i(A_i) = 0 < P_0(A_i)$. Alice will $P_0$-almost surely observe some $X_n \in A_i$ and conclude that $P_0 \neq P_i$.) If $P_0 \not\in \mathcal{M}$, she may not realise that her model is wrong, but the following proposition ensures that the posterior distribution will concentrate on the $P_i$'s that are closest to $P_0$ in relative entropy.
Proposition 3. Let $\mathcal{M}^\star = \{P_i : D(P_0 \| P_i) = \min_j D(P_0 \| P_j)\}$. Almost surely as the number of observations grows, $\Pi(\mathcal{M}^\star \mid X_1, \ldots, X_n) \to 1$.
Proof: The prior takes the form $\Pi = \sum_{i=1}^k \pi_i \delta_{P_i}$, $\pi_i > 0$, where $\delta_{P_i}$ is a point mass at $P_i$. The model is dominated by a $\sigma$-finite measure $\mu$, such as $\mu = \sum_{i=1}^k P_i$, so that the posterior distribution is

$$\Pi(\{P_j\} \mid X_1, \ldots, X_n) = \frac{\pi_j \prod_{i=1}^n p_j(X_i)}{\sum_{l=1}^k \pi_l \prod_{i=1}^n p_l(X_i)}, \qquad p_j = \frac{dP_j}{d\mu}.$$

Because $P_0 \ll P_i$ for some $i$, $P_0$ is also absolutely continuous with respect to $\mu$ and we let $p_0 = \frac{dP_0}{d\mu}$. Now, let $P_{j^\star} \in \mathcal{M}^\star$ and $P_j \not\in \mathcal{M}^\star$. Write

$$\frac{\Pi(\{P_j\} \mid X_1, \ldots, X_n)}{\Pi(\{P_{j^\star}\} \mid X_1, \ldots, X_n)} = C e^{n R_n},$$

where $C = \pi_j / \pi_{j^\star}$ and $R_n = \frac{1}{n} \sum_{i=1}^n \log \frac{p_j(X_i)}{p_{j^\star}(X_i)}$ both depend on $j$. Using the law of large numbers, we find

$$R_n \to D(P_0 \| P_{j^\star}) - D(P_0 \| P_j) < 0$$

$P_0$-almost surely. This implies $\Pi(\mathcal{M}^\star \mid X_1, \ldots, X_n) \to 1$, $P_0$-almost surely. QED.
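A small simulation (my own illustration) shows this concentration under misspecification: the data follow $N(0.3, 1)$, a distribution outside the two-point model $\{N(0,1), N(1,1)\}$, and the posterior piles up on the Kullback-Leibler-closest model, $N(0,1)$:

```python
import math
import random

# Simulation sketch (mine) of the misspecified setting: data from N(0.3, 1),
# model {N(0,1), N(1,1)}. Since KL(N(0.3,1) || N(m,1)) = (0.3 - m)^2 / 2,
# the KL-closest model is N(0,1), and the posterior should pile up on it.

def log_density(x, m):
    return -0.5 * math.log(2 * math.pi) - (x - m) ** 2 / 2

random.seed(2)
m_true, models, prior = 0.3, [0.0, 1.0], [0.5, 0.5]
xs = [random.gauss(m_true, 1.0) for _ in range(5000)]

log_post = [
    math.log(w) + sum(log_density(x, m) for x in xs)
    for w, m in zip(prior, models)
]
mx = max(log_post)
unnorm = [math.exp(lp - mx) for lp in log_post]  # stabilise before normalising
posterior = [u / sum(unnorm) for u in unnorm]
print([round(p, 6) for p in posterior])
```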
2.3 Properties of the relative entropy
The following properties justify the interpretation of the relative entropy as an expected weight of evidence.
Lemma 4. The relative entropy $D(P \| Q) = \int p \log(p/q) \, d\mu$, $p = \frac{dP}{d\mu}$, $q = \frac{dQ}{d\mu}$, does not depend on the choice of the dominating measure $\mu$.
Let $T : \mathcal{X} \to \mathcal{Y}$ be a mapping onto a space $\mathcal{Y}$. If $X \sim P$, then $T(X) \sim P T^{-1}$, where $P T^{-1}$ is the probability measure on $\mathcal{Y}$ defined by $P T^{-1}(A) = P(T^{-1}(A))$. The following proposition, a particular case of (Kullback and Leibler, 1951, theorem 4.1), states that transforming the data through $T$ cannot increase the expected weight of evidence.
Proposition 5 (see Kullback and Leibler (1951)). For any two probability measures $P$, $Q$ on $\mathcal{X}$, we have

$$D(P T^{-1} \| Q T^{-1}) \leq D(P \| Q).$$
Proof: Let $\mu$ be a dominating measure for $P$ and $Q$, such as $\mu = P + Q$. Therefore, $P T^{-1}$ and $Q T^{-1}$ are dominated by $\mu T^{-1}$, and we may write $\bar{p} = \frac{d(P T^{-1})}{d(\mu T^{-1})}$, $\bar{q} = \frac{d(Q T^{-1})}{d(\mu T^{-1})}$. The measures on the two spaces are related by the formula

$$\int_{\mathcal{Y}} g \, d(\nu T^{-1}) = \int_{\mathcal{X}} (g \circ T) \, d\nu,$$

where $\nu$ may be one of $P$, $Q$ and $\mu$, and $g$ is any measurable function (Halmos and Savage, 1949, lemma 3). Therefore,

$$D(P T^{-1} \| Q T^{-1}) = \int_{\mathcal{Y}} \bar{p} \log\left(\frac{\bar{p}}{\bar{q}}\right) d(\mu T^{-1}) = \int_{\mathcal{Y}} \bar{q} \, \frac{\bar{p}}{\bar{q}} \log\left(\frac{\bar{p}}{\bar{q}}\right) d(\mu T^{-1}).$$

By letting $d\nu_y = \frac{q}{\bar{q}(y)} \, d\mu_y$ we find

$$\frac{\bar{p}(y)}{\bar{q}(y)} = \int_{\mathcal{X}} \frac{p}{q} \, d\nu_y,$$

where $\mu_y$ is such that $\int_A \mu_y(B) \, d(\mu T^{-1})(y) = \mu(B \cap T^{-1}(A))$ for all measurable $A \subset \mathcal{Y}$, $B \subset \mathcal{X}$, so that $\int_{\mathcal{X}} p \, d\mu_y = \bar{p}(y)$ and $\int_{\mathcal{X}} q \, d\mu_y = \bar{q}(y)$. It is a probability measure since $\int_{\mathcal{X}} q \, d\mu_y = \bar{q}(y)$.

By convexity of $x \mapsto x \log x$, we find

$$\frac{\bar{p}(y)}{\bar{q}(y)} \log \frac{\bar{p}(y)}{\bar{q}(y)} \leq \int_{\mathcal{X}} \frac{p}{q} \log \frac{p}{q} \, d\nu_y;$$

multiplying both sides by $\bar{q}(y)$ and integrating with respect to $\mu T^{-1}$ yields $D(P T^{-1} \| Q T^{-1}) \leq D(P \| Q)$, and this finishes the proof.
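Proposition 5 is easy to verify numerically in the discrete case, where the pushforward $P T^{-1}$ is a simple sum over preimages; the following sketch (my own, with arbitrary distributions and an arbitrary coarsening map) checks the inequality:

```python
import math

# Discrete sanity check (my own) of Proposition 5: coarsening the sample
# space through T cannot increase the relative entropy.

def kl(p, q):
    # relative entropy for probability vectors with matching support
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pushforward(p, t, size):
    # image measure P T^{-1} on {0, ..., size-1} for an index map t
    out = [0.0] * size
    for i, pi in enumerate(p):
        out[t(i)] += pi
    return out

def t(i):
    return i // 2                 # T: {0,1,2,3} -> {0,1}, an arbitrary coarsening

p = [0.1, 0.4, 0.3, 0.2]
q = [0.25, 0.25, 0.25, 0.25]

pt, qt = pushforward(p, t, 2), pushforward(q, t, 2)
print(kl(pt, qt), "<=", kl(p, q))
```

Here the coarsening maps both $p$ and $q$ to the same two-point distribution, so the transformed divergence drops all the way to zero while $D(P \| Q)$ stays positive.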
Good, I. J. (1985). Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith (Eds.), Bayesian Statistics 2, pp. 249–270. North-Holland: Elsevier Science Publishers B.V.
Halmos, P. R. and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20(2), 225–241.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist. 22(1), 79–86.