Elementary topology and machine learning

The machine learning bubble may be overinflated, but it is not about to burst. Interdisciplinary research in this field is grounded in sound theory and has numerous empirical breakthroughs to show for it. As it finds more and more applications and concentrates public research funding, many of us are still wondering: how can mathematics contribute?
This is a case study of an interaction between elementary topology and machine learning’s binary classification problem. Following classical theorems, we obtain a topologically accurate solution.

Correction: It should be \sup_{i,j}| E_{i,j}^n - (k+1)^2 \int_{R_{i,j}^n} f | = o(1/n), with k=k(n) growing correspondingly slower with n.

PDF note here.

 

Bayesian binary classification using partitions of unity

Sunday afternoon project. I think it is possible to get topological guarantees for the reconstruction of the classification boundary using the Nash-Tognoli theorem.

1. The problem

Points x_i, i=1,\dots, N are distributed on some space \mathbb{M}, and each is associated with a label \ell_i\in \{0,1\}. It is assumed that

\ell_i \sim \text{Ber}(p(x_i)), \qquad (1)

where p: \mathbb{M} \rightarrow [0,1] is some unknown integrable function. Estimating p from the data \{(x_i,\ell_i)\,|\, i=1,2,\dots, N\} allows us to predict \ell_{N+1} given x_{N+1}.

It is usually assumed that p is a binary classification rule, i.e. that p(\mathbb{M}) \subset \{0,1\}. Letting p be a probabilistic classification rule, taking values anywhere in [0,1], accounts for class overlap and provides a form of uncertainty quantification.

1.1 Alternative formulation

The problem may be alternatively formulated as one of estimating the densities f_0 and f_1 which are such that

\ell_i \sim \frac{\sum_{j=0}^1 \alpha_j f_j(x_i)\delta_j}{\sum_{j=0}^1\alpha_j f_j(x_i)},

for some \alpha_j \geq 0, \sum \alpha_j = 1. This is the approach pursued in Bhattacharya and Dunson (2012, Nonparametric Bayes Classification and Hypothesis Testing on Manifolds). It makes available general theory and standard tools, while also relating the classification problem to testing for differences between groups.

For the purpose of this Sunday afternoon project, however, we prefer the simpler model (1). We also let \mathbb{M} = [0,1]^2 to facilitate visualizations. The approach considered below would be relatively straightforward to extend to other spaces.

2. Bayesian solution

We model p in (1) by linear combinations of basis functions. That is, we assume

p = \sum_{j=0}^n a_j \phi_j^n

for some set \{\phi_j^n : \mathbb{M} \rightarrow \mathbb{R}\,|\, j=0,1,\dots, n\} of functions. In order to make the statistical analysis independent of class relabellings, the functions \phi_j^n must form a partition of unity.

Definition 1.
A partition of unity on a set \mathbb{M} is a finite set of functions \{\phi_i^n: \mathbb{M} \rightarrow [0,1]\,|\, i=0,1,\dots, n\} such that \sum_{i=0}^n \phi_{i}^n(x) = 1 for all x \in \mathbb{M}.

Here, we consider \mathbb{M} = [0,1]^2 and use the partition of unity

(x^1, x^2) \mapsto B_{j}^n(x^1) B_k^n(x^2),\quad B_j^n(u) = {n \choose j} u^j (1-u)^{n-j},\quad j,k \in \{0,\dots, n\}

obtained by taking tensor products of the Bernstein polynomial densities. The model for p is then

p(x) = \sum_{j,k=0}^n a_{j,k} B_j^n(x^1) B_k^n(x^2),\quad a_{j,k} \in [0,1].
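For concreteness, here is a minimal Python sketch of how this model can be evaluated (this is not the source code linked at the end of the post; the function names, the value of n and the random coefficients are mine). It uses the fact that B_j^n(u) is the Binomial(n, u) pmf evaluated at j:

```python
import numpy as np
from scipy.stats import binom

def bernstein(j, n, u):
    # B_j^n(u) = C(n, j) u^j (1 - u)^(n - j), i.e. the Binomial(n, u) pmf at j
    return binom.pmf(j, n, u)

def p_eval(x1, x2, a, n):
    # p(x) = sum_{j,k} a[j, k] B_j^n(x1) B_k^n(x2), with a of shape (n+1, n+1)
    B1 = bernstein(np.arange(n + 1), n, x1)
    B2 = bernstein(np.arange(n + 1), n, x2)
    return B1 @ a @ B2

# toy usage: random coefficients in [0, 1] with n = 8
rng = np.random.default_rng(0)
n = 8
a = rng.uniform(size=(n + 1, n + 1))
print(p_eval(0.3, 0.7, a, n))  # a value in [0, 1], since the basis sums to one
```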

2.1 Prior on p

A prior on p is obtained by letting

a_{j,k} \sim^{ind.} \text{Beta}(\alpha, \alpha).

Here n, the degree of the Bernstein polynomials, is assumed to be known. It is not too complicated to let n adapt to the data by placing a prior distribution on it. This reduces the efficiency of the posterior sampling algorithm, however, and can lead to computational difficulties.

2.2 Posterior computation

I used a simple independent Metropolis-Hastings algorithm to produce the figures above. The log-likelihood of p = \sum_{j,k} a_{j,k} B_j^nB_k^n given (x_i, \ell_i), i=1,\dots, N, is a translate of

\sum_{i=1}^N\left\{ \ell_i \log \left(\sum_{j,k=0}^n a_{j,k}B_j^n(x_i^1)B_k^n(x_i^2) \right) + (1-\ell_i)\log \left(\sum_{j,k=0}^n (1-a_{j,k})B_j^n(x_i^1)B_k^n(x_i^2) \right)\right\}.

New coefficients are proposed by

a_{j,k}' \sim \text{Beta}\left(\alpha + \kappa N_0(j,k), \alpha + \kappa N_1(j,k)\right),

where \kappa is a tuning parameter and where N_0, N_1 are data-dependent functions used to give a very rough approximation of the posterior.
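To make this concrete, here is a minimal sketch of one way such an independence Metropolis-Hastings sampler could look (it is not the source code linked below). The post does not define N_0 and N_1, so the basis-weighted label counts S1 and S0 below are a hypothetical stand-in for them; all names and tuning values are mine, and the log-likelihood is the one displayed above.

```python
import numpy as np
from scipy.stats import binom, beta

rng = np.random.default_rng(0)

def basis_matrix(x, n):
    # Row i holds the (n+1)^2 products B_j^n(x_i^1) B_k^n(x_i^2).
    B1 = binom.pmf(np.arange(n + 1)[None, :], n, x[:, 0:1])   # (N, n+1)
    B2 = binom.pmf(np.arange(n + 1)[None, :], n, x[:, 1:2])   # (N, n+1)
    return np.einsum('ij,ik->ijk', B1, B2).reshape(len(x), -1)

def log_lik(a, Phi, labels):
    p = Phi @ a   # p(x_i) = sum_{j,k} a_{j,k} B_j^n(x_i^1) B_k^n(x_i^2)
    return np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def run_mh(x, labels, n=8, alpha=1.0, kappa=1.0, iters=2000):
    Phi = basis_matrix(x, n)
    # Basis-weighted label counts: my stand-in for the post's N_0, N_1,
    # giving a rough data-dependent Beta proposal for each coefficient.
    S1 = alpha + kappa * Phi.T @ labels
    S0 = alpha + kappa * Phi.T @ (1 - labels)

    def log_target(a):   # log-likelihood plus log Beta(alpha, alpha) prior
        return log_lik(a, Phi, labels) + np.sum(beta.logpdf(a, alpha, alpha))

    def log_prop(a):     # log density of the independence proposal
        return np.sum(beta.logpdf(a, S1, S0))

    a = rng.beta(S1, S0)
    samples = []
    for _ in range(iters):
        a_new = rng.beta(S1, S0)
        # Independence-sampler acceptance ratio.
        log_acc = (log_target(a_new) - log_prop(a_new)) - (log_target(a) - log_prop(a))
        if np.log(rng.uniform()) < log_acc:
            a = a_new
        samples.append(a)
    return np.array(samples)

# Toy data on [0,1]^2 with label 1 more likely towards the upper-right corner.
N = 300
x = rng.uniform(size=(N, 2))
labels = (rng.uniform(size=N) < 0.2 + 0.6 * x[:, 0] * x[:, 1]).astype(float)
samples = run_mh(x, labels)
print(samples.mean(axis=0).reshape(9, 9).round(2))   # posterior means of the a_{j,k}
```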

Calculations for the images shown here take about one to two minutes in Python.

3. Examples

The thick black line is the level set of the posterior mean of p corresponding to the value 1/2. The transparent grey lines are the preimages of 1/2 under posterior samples. I hand-picked different values of n without giving it too much thought. The last picture uses n=15, while the others use n=8 or n=10.

[Figures test2, test8, test9, test10: posterior classification boundaries for the examples.]

Source code available here.

Bayesian learning

Friday, July 28, at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical Congress of the Americas and at the Canadian Undergraduate Mathematics Conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some tough problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and asymptotic correctness rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.

1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

1.1 The classical problem

Some stochastic mechanism generates data x_1, x_2, x_3, \dots, each x_i being an independent sample from an unknown distribution P_0. We want to use the data to learn about P_0.

Example

A cryptographer observes words at random from an unknown countable language. Learning could be about:

  • What’s the distribution of the observed words?
  • What’s the probability that the next word to be observed has never been seen before?

1.2 Bayesian Learning

Uncertain knowledge about what P_0 may be is encoded through a prior probability measure \Pi on the space of all possibilities for P_0. Thus, if A is some set of probability distributions, then \Pi(A) is the probability, representing the best of our knowledge, that “P_0 \in A”. Then, we observe the data \mathcal{D} = (x_1, x_2, \dots, x_n). This yields the posterior measure \Pi(\cdot | \mathcal{D}), obtained by adjusting \Pi in light of \mathcal{D} through probabilistic conditioning.

Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability p_0 that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what p_0 may be, could look like this:

[Figure: a prior density for p_0, peaked around 1/2.]

It is most likely that the coin is approximately unbiased (p_0 \approx 1/2), but not impossible that it is strongly biased.

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.

[Figure: the posterior density for p_0 after observing 5 heads in 20 tosses.]

It is most likely that the coin is biased and that p_0 \approx 1/4.
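As a minimal sketch of this update, assuming a conjugate Beta prior (the prior actually used for the figures may well differ), the posterior after 5 heads in 20 tosses is again a Beta distribution:

```python
from scipy.stats import beta

# Hypothetical Beta(2, 2) prior, mildly peaked at 1/2 (the post's prior may differ).
a0, b0 = 2, 2
heads, tails = 5, 15                     # 5 heads in 20 tosses

prior = beta(a0, b0)
posterior = beta(a0 + heads, b0 + tails)

print(prior.mean(), posterior.mean())    # 0.5 versus 7/24, about 0.29
print(posterior.interval(0.95))          # 95% credible interval for p_0
```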

Certainly, the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors.

 

2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word \omega has some probability P_0(\omega) of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always, and not necessarily in a way that makes Bayesian learning meaningful. Diaconis and Freedman showed the following, in 1986, in the context of Bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimensions. To approach this problem, we need to identify characteristics of prior distributions that

  1. describe how uncertainty is spread out; and
  2. ensure that Bayesian learning works correctly.

2.1 A positive result

An important positive result is based on the following two conditions on (characteristics of) prior distributions. We denote by \mathbb{F} the set of all probability densities on some common space and assume that the data-generating distribution P_0 has its density in \mathbb{F}.

  1. \Pi puts positive mass on every relative entropy neighborhood of P_0: for all \varepsilon > 0,

\Pi\left(\left\{P \in \mathbb{F}| \int \log \frac{dP_0}{dP} dP_0 < \varepsilon \right\}\right) > 0.

This means that a priori, we’re not excluding the truth from the set of possibilities. Since P_0 is unknown, we require that this condition be satisfied whatever P_0 may be.

2. \Pi is of finite entropy: for all \delta > 0, there exist \alpha \in (0,1) and a covering \{A_i\} of \mathbb{F} by sets of L^1 diameter less than \delta such that

\sum_{i} \Pi(A_i)^\alpha < \infty.

This means that \Pi is not too complex and that we can make sense of it through discretization.

Under these two conditions, Bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004)
If the conditions (1) and (2) are satisfied, then for any L^1 neighborhood N of P_0, we have

\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1

almost surely as x_i \sim P_0.

This is helpful, but has not yet solved our problem.

  • How do we, generally, construct priors satisfying the two conditions?
  • How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

3. Some of our research

Two contributions I want to present.

  • We developed a relationship between some statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
  • We use it to solve problems raised in the literature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.

 

We want to learn about the data-generating mechanism, e.g. do

  • density estimation,
  • hypothesis tests,
  • classification,

or any other statistical procedure. First, we specify a prior on the set \mathbb{F} of all bounded densities on the circle.

3.1 Prior specification

We begin with a basis of trigonometric polynomial densities introduced in 2009:

C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.

[Figure: the density basis functions C_{j,n}.]

We studied the statistical properties of this basis and used it to construct the approximation operators

T_n : \mathbb{F} \ni f \mapsto \sum_j  C_{j,n} \int_{R_{j,n}}f(x) dx,

where R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right). It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.
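Here is a minimal numerical sketch of these operators (the normalizing constants of the C_{j,n} are computed numerically, and the function names and the test density are mine):

```python
import numpy as np

def C_basis(u, n):
    # C_{j,n}(u) proportional to (1 + cos(u - 2*pi*j/(2n+1)))**n, normalized
    # numerically so that each basis function integrates to 1 over [0, 2*pi).
    theta = 2 * np.pi * np.arange(2 * n + 1) / (2 * n + 1)
    grid = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
    norm = ((1 + np.cos(grid[:, None] - theta)) ** n).sum(axis=0) * (2 * np.pi / 4000)
    return (1 + np.cos(np.atleast_1d(u)[:, None] - theta)) ** n / norm

def apply_T(f, n, m=400):
    # T_n f = sum_j C_{j,n} * (integral of f over R_{j,n}), where R_{j,n} is the
    # arc of length 2*pi/(2n+1) centred at 2*pi*j/(2n+1), taken mod 2*pi.
    coeffs = np.empty(2 * n + 1)
    for j in range(2 * n + 1):
        lo = np.pi * (2 * j - 1) / (2 * n + 1)
        hi = np.pi * (2 * j + 1) / (2 * n + 1)
        s = lo + (np.arange(m) + 0.5) * (hi - lo) / m       # midpoint rule
        coeffs[j] = np.mean(f(np.mod(s, 2 * np.pi))) * (hi - lo)
    return lambda u: C_basis(u, n) @ coeffs

# toy usage: smooth a von Mises-type density on the circle
f = lambda u: np.exp(np.cos(u)) / (2 * np.pi * np.i0(1.0))
g = apply_T(f, n=10)
print(g(np.array([0.0, np.pi])), f(np.array([0.0, np.pi])))
```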

By specifying priors \Pi_n on T_n(\mathbb{F}) and mixing them together, we obtain the prior

\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right)

on \mathbb{F}.

This reduces the problem of specifying a prior on an infinite dimensional space to that of specifying an infinite number of priors on finite dimensional spaces. This turns out to be easier, and from properties of T_n we obtain asymptotic correctness of Bayesian learning.

Theorem.
Let T_n : L^1 \rightarrow L^1 be any linear function mapping \mathbb{F} to itself, with \dim T_n (\mathbb{F}) increasing. If \|T_n f - f\|_\infty \rightarrow 0 for all continuous f, then taking \rho such that 0 < \rho(n) <  e^{-\dim T_n(\mathbb{F})} and priors \Pi_n with full support on T_n(\mathbb{F}) ensures that Bayesian learning based on \Pi is asymptotically correct.

What is interesting here:

  • The approach scales easily to (almost) any set of bounded densities on a compact metric space.
  • The approximation theory literature provides a wealth of well-studied approximation operators that can be used to construct such priors.
  • Properties of the operators relate to properties of the priors. If, for instance, it is known that the true density of P_0 is unimodal, then using a unimodality-preserving operator yields a prior on the space of unimodal densities.

3.2 Simple application

We use a prior of the type considered above to estimate the density of P_0, also quantifying uncertainty about our estimate. Below, the data-generating distribution P_0 is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded region quantifies uncertainty (50% and 90% credible regions).

[Figure: posterior density estimate from 100 samples, with the true density in orange and credible bands in blue.]

4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you must already know, different fields of mathematics enrich each other in their interactions. Here, it is approximation theory that provides tools that ensure bayesian learning works correctly.

Thank you for your attention!


A limit

Félix Locas presented me with this problem.

Let r(n) = \lfloor \log_2 \frac{n}{\log_2 n} \rfloor. Show that

\lim_{n \rightarrow \infty} \left( \log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} \right)^n = 1.

My solution

The series \sum_{k=1}^{\infty} \frac{1}{k(k+1) 2^k} is easy to calculate. It is, for instance, the difference between the integrals of geometric series:

\sum_{k=1}^\infty \frac{1}{k(k+1) 2^k} = \sum_{k=1}^\infty \frac{1}{k 2^k} - \sum_{k=1}^\infty \frac{1}{(k+1) 2^k} = 1-\log 2.

Furthermore, abbreviating r = r(n),

r^{3/2} 2^{r} \sum_{k=r+1}^\infty \frac{1}{k(k+1) 2^{k}} \le \sum_{k=0}^\infty \frac{1}{\sqrt{r} 2^k} \xrightarrow{r \rightarrow \infty} 0

implies that for n sufficiently large we have \sum_{k=r+1}^\infty \frac{1}{k(k+1) 2^{k}} < r^{-3/2} 2^{-r} and

\log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} = 1-\sum_{k=r+1}^{\infty} \frac{1}{k(k+1) 2^k} \geq 1 - r^{-3/2} 2^{-r}. \qquad (*)

Finally, since r = \log_2 \frac{n}{\log_2 n} - \varepsilon_n for some 0 \le\varepsilon_n < 1, we have

n r^{-3/2} 2^{-r} = \frac{2^{\varepsilon_n}\log_2 n}{(\log_2 n - \log_2 \log_2 n - \varepsilon_n)^{3/2}} \rightarrow 0

which implies that

\left( 1 - r^{-3/2} 2^{-r} \right)^n \rightarrow 1.

Since also \log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} \le 1, comparing this with (*) yields

\lim_{n \rightarrow \infty} \left( \log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} \right)^n = 1.
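A quick numerical sanity check of this limit (note that the convergence is quite slow):

```python
import math

def value(n):
    r = math.floor(math.log2(n / math.log2(n)))
    s = math.log(2) + sum(1 / (k * (k + 1) * 2**k) for k in range(1, r + 1))
    return s**n

for n in [10**3, 10**6, 10**9, 10**12]:
    print(n, value(n))   # slowly approaches 1 as n grows
```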

The discretization trick

I explain the discretization trick that I mentioned in my previous post (Posterior consistency under possible misspecification). I think it was introduced by Walker (New approaches to Bayesian consistency, 2004).

Let \mathbb{F} be a set of densities and let \Pi be a prior on \mathbb{F}. If x_1, x_2, x_3, \dots follows some distribution P_0 having density f_0, then the posterior distribution of \Pi can be written as

\Pi(A | x_1, \dots, x_n) \propto \int_A \prod_{i=1}^n f(x_i) \Pi(df).

The discretization trick is to find densities f_{1}, f_2, f_3, \dots in the convex hull of A (taken in the space of all densities) such that

\int_A \prod_{i=1}^n f(x_i) \Pi(df) = \prod_{i=1}^n f_i(x_i) \Pi(A).

For example, suppose \varepsilon > 2\delta > 0, A = A_\varepsilon = \{f \,|\, D_{\frac{1}{2}}(f_0, f) > \varepsilon\} and that \{A_i\} is a partition of A into sets of diameter at most \delta. If there exists 1 > \alpha > 0 such that

\sum_i \Pi(A_i)^\alpha < \infty,

then for some \beta > 0 we have that

e^{n \beta} \left(\int_{A_{\varepsilon}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi (df) \right)^\alpha \le e^{n \beta} \sum_i \prod_{j=1}^n \left(\frac{f_{i,j}(x_j)}{f_0(x_j)}\right)^\alpha \Pi(A_i)^\alpha  \rightarrow 0

almost surely. This is because, with A_{\alpha} the \alpha-affinity defined here, the expectation of the right-hand side above goes exponentially fast towards 0 when \beta is sufficiently small. Hence the Borel-Cantelli lemma applies, yielding the claim.

Construction

The f_{i,j}'s are defined as posterior predictive densities, where the posterior is conditioned on A_i. That is,

f_{i,j} : x \mapsto \frac{\int_{A_i} f(x) \prod_{k=1}^{j-1}f(x_k) \Pi(df)}{\int_{A_i} \prod_{k=1}^{j-1}f(x_k) \Pi(df)}

and

f_{i, 1} : x \mapsto \frac{\int_{A_i} f(x) \Pi(df)}{\Pi(A_i)}.

Clearly

\int_{A_i} \prod_{j=1}^n f(x_j) \Pi(df) = \prod_{j=1}^n f_{i,j}(x_j) \Pi(A_i).
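To spell out the telescoping behind this identity, write I_j = \int_{A_i}\prod_{k=1}^{j} f(x_k)\,\Pi(df) and I_0 = \Pi(A_i) (notation mine). The definitions above give f_{i,j}(x_j) = I_j / I_{j-1}, so

\prod_{j=1}^n f_{i,j}(x_j) = \prod_{j=1}^n \frac{I_j}{I_{j-1}} = \frac{I_n}{I_0} = \frac{1}{\Pi(A_i)}\int_{A_i} \prod_{j=1}^n f(x_j)\, \Pi(df).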

Furthermore, if A_i is contained in a Hellinger ball of center g_i and of radius \delta, then also

H(f_{i,j}, g_i) < \delta.

This follows from the convexity of Hellinger balls (an important remark for the generalization of this trick).

Posterior consistency under (possible) misspecification

We assume, without too much loss of generality, that our priors are discrete. When dealing with Hellinger separable density spaces, it is possible to discretize posterior distributions to study consistency (see this post about it).

Let \Pi be a prior on a countable space \mathcal{N} = \{f_1, f_2, f_3, \dots\} of probability density functions, with \Pi(f) > 0 for all f \in \mathcal{N}. Data X_1, X_2, X_3, \dots follows (independently) some unknown distribution P_0 with density f_0.

We denote by D_{KL}(f_0, f) = \int f_0 \log\frac{f_0}{f} the Kullback-Leibler divergence and we let D_{\frac{1}{2}}(f_0, f) = 1 - \int \sqrt{f_0 f} be half of the squared Hellinger distance.

The following theorem states that the posterior distribution of \Pi accumulates in Hellinger neighborhoods of f_0, assuming the prior is root-summable (i.e. \sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty for some 0 < \alpha < 1). In the well-specified case (i.e. \inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0), the posterior accumulates in any neighborhood of f_0. In the misspecified case, small neighborhoods of f_0 could be empty, but the posterior distribution still accumulates in sufficiently large neighborhoods (how large exactly is a function of \alpha and \inf_{f \in \mathcal{N}} D_{KL}(f_0, f)).

The result was essentially stated by Barron (Discussion: On the Consistency of Bayes Estimates, 1986). In the case where \Pi is not necessarily discrete, a similar result was obtained, through a discretization argument, by Walker (Bayesian asymptotics with misspecified models, 2013). See also Xing (Sufficient conditions for Bayesian consistency, 2009) for a thorough treatment of Bayesian consistency using the same method of proof.

Theorem (Barron).
Suppose \beta_0 :=\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) < \infty and that

\alpha := \inf \left\{ p \in [\tfrac{1}{2},1] \,|\, \sum_{f \in \mathcal{N}} \Pi(f)^p < \infty \right\} < 1.

If \varepsilon > 0 is such that

\varepsilon > 1- \exp\left( \frac{-\beta_0 \alpha}{2(1-\alpha)} \right)

and if A_\varepsilon := \{f \in \mathcal{N} \,|\, D_{\frac{1}{2}} (f_0, f) < \varepsilon\} \not = \emptyset, then

\Pi\left(A_\varepsilon \,|\, \{x_i\}_{i=1}^n\right) \rightarrow 1

almost surely as n \rightarrow \infty.

Remarks.

1 – If \inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0, then any \varepsilon > 0 can be used.

2 – \alpha is related to the rate of convergence of \rho(n) := \Pi(f_n) towards 0. The quantity H_\alpha (\Pi) = \log \sum_{f \in \mathcal{N}} \Pi(f)^\alpha can be thought of as a measure of entropy.

3 – Walker (2013) considered the case \sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty for some \alpha < \frac{1}{2}. This implies that \sum_{f \in \mathcal{N}} \sqrt{\Pi(f)} < \infty and the above theorem can also be applied in this case.

Demonstration

The proof is brief. I do not dwell on explanations.

Background.
First let me recall some concepts. The \alpha-affinity between two densities f and g is defined as

A_\alpha(f, g) = \int g^\alpha f^{1-\alpha}. \qquad (1)

Note that 0 \le A_{\alpha}(f, g) \le 1 and that A_\alpha(f, g) = 1 if and only if f = g. Furthermore, when \alpha \geq \frac{1}{2}, Hölder's inequality and Jensen's inequality yield

\left(A_{\frac{1}{2}}(f, g)\right)^{2\alpha} \le A_{\alpha}(f, g) \le \left(A_{\frac{1}{2}}(f,g)\right)^{2(1-\alpha)}. \qquad (2)

Proof.
We can now begin the proof. Let \tilde{\alpha} be such that 1 > \tilde{\alpha} > \alpha. Then, we have

eq_1

If \beta > 0 and g \in \mathcal{N} is such that D_{KL}(f_0, g) < \beta_0 + \beta, then

eq_2

almost surely. Furthermore, using (2), we find

eq_3

Here \text{cst.} is a constant. Since \beta > 0 is arbitrary and since \tilde \alpha can be taken so that 2(1-\tilde \alpha) \log (1-\varepsilon) + \tilde \alpha \beta_0 < 0, we obtain that (**) converges exponentially fast towards 0. Hence, by the Borel-Cantelli lemma, we have

eq_4

almost surely. This, together with (3), implies that (*) \rightarrow 0 almost surely. \Box

 

Poster critique

I printed my poster today (see this post) for the 11th Bayesian nonparametrics conference. Here’s the final version (42in x 60in).

[Image: the final poster.]

While I was at first satisfied with it as it appeared on my computer screen, staring at it at full size changed my perception.

Before jumping into the critique, let me list a few things I like about the poster.

  1. The content, which has been thoroughly reviewed.
  2. The text size (44pt). It is clear and easy to read from 3 to 9 feet away.
  3. It is self-contained and explains itself.
  4. I left plenty out. I listed the two main contributions of our work and that’s what I talk about.

Now for the things I dislike.

  1. The titles are too small.
  2. Using letters in both calligraphic and normal font. This makes it harder to refer to the poster.
  3. It looks messy. The structure of the poster should be graphically emphasized rather than follow columns of text.
  4. The white background is unattractive; the title banner is bland.
  5. “References” should be replaced by “Works cited”.
  6. Should I talk more about Bayesian procedures / posterior simulation? I could have shown posterior mean estimates, the graphs of densities smoothed using our operators, etc.

In short, I have mixed feelings about the poster. On one hand, I like that the text is large, that it is self-contained and that it goes pretty much straight to the point. There are no meaningless graphics under a “result” banner. On the other hand, it looks messy and I’m afraid people will get lost in it.

Next one will be better.

The choice of prior in Bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

The structure of \mathcal{M}

Recall that \mathbb{M} is a Polish space (i.e. a complete and separable metric space). It is endowed with its Borel \sigma-algebra \mathfrak{B}, the smallest family of subsets of \mathbb{M} that contains its topology and is closed under complements and countable unions. All subsets of \mathbb{M} considered in the following are assumed to belong to \mathfrak{B}. A probability measure on \mathbb{M} is a function \mu : \mathfrak{B} \rightarrow [0,1] such that for any countable partition A_1, A_2, A_3, \dots of \mathbb{M} we have \sum_{i=1}^\infty \mu(A_i) = 1. The set \mathcal{M} consists of all such probability measures.

Note that since \mathbb{M} is complete and separable, every probability measure \mu \in \mathcal{M} is regular (and tight). This means that the measure of any A\subset \mathbb{M} can be approximated arbitrarily well by the measures of compact subsets of A, as well as by the measures of open supersets of A:

\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.

Metrics on \mathcal{M}

Let me review some facts. A natural metric used to compare the mass allocation of two measures \mu, \nu \in \mathcal{M} is the total variation distance defined by

\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.

It is relatively straightforward to verify that \mathcal{M} is complete under this distance, but it is not in general separable. To see this, suppose that \mathbb{M} = [0,1]. If a ball of radius less than 1 centered at \mu contains a Dirac measure \delta_x, x \in [0,1], then \mu must have a point mass at x. Yet any measure has at most a countable number of point masses, and there is an uncountable number of Dirac measures on [0,1]. Thus no countable subset of \mathcal{M} can cover \mathcal{M} within a small \varepsilon of error.

This distance can be relaxed to the Prokhorov metric, comparing mass allocation up to \varepsilon-neighborhoods. It is defined as

d_P(\mu, \nu) = \inf \left\{ \varepsilon > 0 \,|\, \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon \text{ and } \nu(A) \le \mu(A^{\varepsilon}) + \varepsilon,\; \forall A \subset \mathbb{M} \right\},

where A^{\varepsilon} = \{x \in \mathbb{M} \,|\, d(x, A) < \varepsilon\} is the \varepsilon-neighborhood of A. It is a metrization of the topology of weak convergence of probability measures, and \mathcal{M} is separable under this distance.

The compact sets of \mathcal{M} under the Prokhorov metric admit a simple characterization given by Prokhorov's theorem: P \subset \mathcal{M} is precompact if and only if P is uniformly tight (for each \varepsilon > 0, there exists a compact K \subset \mathbb{M} such that \inf_{\mu \in P} \mu(K) \geq 1-\varepsilon). In particular, a uniformly tight sequence \{\mu_n\} \subset \mathcal{M} admits a weakly convergent subsequence.

Characterizations of weak convergence are given by the Portmanteau theorem, which says in particular that \mu_n converges weakly to \mu if and only if

\int f d\mu_n \rightarrow \int f d\mu

for all bounded continuous f. It is also equivalent to

\mu_n(A) \rightarrow \mu(A)

for all sets A such that \mu(\partial A) = 0.

Measures of divergence

In addition to metrics, which admit a geometric interpretation through the triangle inequality, statistical measures of divergence can also be considered. Here, we consider functions D : \mathcal{M}\times \mathcal{M} \rightarrow [0, \infty] that can be used to determine the rate of convergence of the likelihood ratio

\prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0,

where x_i \sim \nu and \mu, \nu \in \mathcal{M}.

Kullback-Leibler divergence

The weight of evidence in favor of the hypothesis “\lambda = \mu” versus “\lambda = \nu” given a sample x is defined as

W(x) = \log\frac{d\mu}{d\nu}.

It measures how much information about the hypotheses is brought by the observation of x. (For a justification of this interpretation, see Good (Weight of evidence: a brief survey, 1985).) The Kullback-Leibler divergence D_{KL} between \mu and \nu is defined as the expected weight of evidence given that x \sim \mu:

D_{KL}(\mu, \nu) =\mathbb{E}_{x \sim \mu} W(x) = \int \log \frac{d\mu}{d\nu} d\mu.

The following properties of the Kullback-Leibler divergence support its interpretation as an expected weight of evidence.

Theorem 1 (Kullback and Leibler, 1951).
We have

  1. D_{KL}(\mu, \nu) \geq 0 with equality if and only if \mu = \nu;
  2. D_{KL}(\mu T^{-1}, \nu T^{-1}) \leq D_{KL}(\mu, \nu) with equality if and only if T: \mathbb{M} \rightarrow \mathbb{M}' is a sufficient statistic for \{\mu, \nu\}.

Furthermore, the KL divergence can be used to precisely identify exponential rates of convergence of the likelihood ratio. The first part of the next proposition says that D_{KL}(\lambda, \nu) is finite if and only if the likelihood ratio \prod_{i} \frac{d\nu}{d\lambda}(x_i), x_i \sim \lambda, cannot converge super-exponentially fast towards 0. The second part identifies the rate of convergence when the KL divergence is finite.

Proposition 2.
Let x_1, x_2, x_3, \dots \sim \lambda (independently). The KL divergence D_{KL}(\lambda, \nu) is finite if and only if there exists an \alpha > 0 such that

e^{n\alpha} \prod_{i=1}^n \frac{d\nu}{d\lambda}(x_i) \rightarrow \infty

with positive probability.

Finally, suppose we are dealing with a submodel \mathcal{F} \subset \mathcal{M} such that the rates of convergence of the likelihood ratios in \mathcal{F} are of an exponential order. By the previous proposition, this is equivalent to the fact that \forall \mu, \nu \in \mathcal{F}, D_{KL}(\mu, \nu) < \infty. We can show that the KL divergence is, up to topological equivalence, the best measure of divergence that determines the convergence of the likelihood ratio. That is, suppose D: \mathcal{F} \times \mathcal{F}\rightarrow [0, \infty] is such that

D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i) \rightarrow 0

at an exponential rate, almost surely when x_i \sim \lambda, and that D(\lambda, \mu) = 0 if and only if \lambda = \mu. Then, the topology induced by D_{KL} is coarser than the topology induced by D.

Proposition 3.
Let D be as above and let \mathcal{F} \subset \mathcal{M} be such that \forall \mu, \nu \in \mathcal{F}, D_{KL}(\mu, \nu) < \infty. Then, the topology on \mathcal{F} induced by D_{KL} is weaker than the topology induced by D. More precisely, we have that

D(\lambda, \mu) < D(\lambda, \nu) \Rightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu).

 

alpha-affinity and alpha-divergence

We define the \alpha-affinity between two probability measures as the expectation of another transform of the likelihood ratio. Let \mu, \nu be two probability measures dominated by \lambda, with d\mu = f d\lambda and d\nu = g d\lambda. Given 0 < \alpha < 1, the \alpha-affinity between \mu and \nu is

A_\alpha(\mu, \nu) = \int \left(\frac{g}{f}\right)^\alpha d\mu = \int g^\alpha f^{1-\alpha} d\lambda.

Proposition 4.
For all 0 < \alpha < 1, we have that

1. A_\alpha(\mu, \nu) \le 1 with equality if and only if \mu = \nu;

2. \alpha \mapsto A_\alpha(\mu, \nu) is log-convex on (0,1), and A_\alpha is jointly concave in its arguments;

3. A_\alpha is jointly multiplicative under products:

A_\alpha (\mu^{(n)}, \nu^{(n)}) = \left(A_{\alpha}(\mu, \nu)\right)^n.

4. if \frac{1}{2} \leq \alpha < 1, then

\left(A_{\frac{1}{2}}\right)^{2\alpha} \le A_\alpha \le \left(A_{\frac{1}{2}}\right)^{2(1-\alpha)}.

Proof.
1 and the joint concavity in 2 follow from Jensen's inequality and the joint concavity of (x,y) \mapsto x^\alpha y^{1-\alpha}; the log-convexity in 2 follows from Hölder's inequality. 3 follows from Fubini's theorem. For 4, the first inequality follows from Jensen's inequality applied to the concave map x \mapsto x^{1/(2\alpha)}, and Hölder's inequality (with exponents \frac{1}{2(1-\alpha)} and \frac{1}{2\alpha - 1}) yields

A_{\alpha}(\mu, \nu) = \int (fg)^{1-\alpha} g^{2\alpha - 1} d\lambda \le \left( \int \sqrt{fg} \,d\lambda \right)^{2-2\alpha} \left(\int g \,d\lambda\right)^{2\alpha-1} = \left(A_{\frac{1}{2}}(\mu, \nu)\right)^{2-2\alpha}.

\Box

The \alpha-divergence D_\alpha is obtained as

D_\alpha = 1 - A_\alpha.

Other similar divergences considered in the literature are

\frac{1-A_\alpha}{\alpha(1-\alpha)}\; \text{ and }\; \frac{\log A_\alpha}{\alpha(1-\alpha)},

but we prefer D_\alpha for its simplicity. When \alpha = \frac{1}{2}, it is closely related to the Hellinger distance

H(\mu, \nu) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^2d\lambda\right)^{\frac{1}{2}}

through

D_{\frac{1}{2}}(\mu, \nu) = \frac{H(\mu, \nu)^2}{2}.

Other important and well-known inequalities are given below.

Proposition 5.
We have

D_{\frac{1}{2}}(\mu, \nu) \le \|\mu-\nu\|_{TV} \le \sqrt{2 D_{\frac{1}{2}}(\mu, \nu)}

and

2D_{\frac{1}{2}}(\mu, \nu) \le D_{KL}(\mu, \nu) \le 2\left\|\frac{f}{g}\right\|_\infty \|\mu-\nu\|_{TV}.

This, together with Proposition 4 (4), yields similar bounds for the other divergences.

Finite models

Let \Pi be a prior on \mathcal{M} that is finitely supported. That is, \Pi = \sum_{i=1}^n p_i \delta_{\mu_i} for some \mu_i \in \mathcal{M} and p_i > 0 with \sum_i p_i = 1. Suppose that x_1, x_2, x_3, \dots independently follow some \mu_* \in \mathcal{M}.

The following proposition ensures that as data is gathered, the posterior distribution of \Pi concentrates on the measures \mu_i that are closest to \mu_*.

Proposition 6.
Let A_{\varepsilon} = \{\mu_i \,|\, D_{KL}(\mu_*, \mu_i) < \varepsilon \}. If A_\varepsilon \not = \emptyset, then

\Pi(A_\varepsilon \,|\, \{x_i\}_{i=1}^m) \rightarrow 1

almost surely as m \rightarrow \infty.
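A minimal simulation illustrating this concentration (the candidate measures, prior weights and data-generating distribution are arbitrary choices of mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Three candidate measures mu_i with equal prior weights p_i.
candidates = [norm(0.0, 1.0), norm(0.5, 1.0), norm(3.0, 1.0)]
log_prior = np.log(np.ones(3) / 3)

# Data from mu_* = N(0.4, 1): no candidate is exactly right, but N(0.5, 1)
# minimizes D_KL(mu_*, mu_i) = (0.4 - m_i)^2 / 2 among the three.
x = rng.normal(0.4, 1.0, size=2000)

log_post = log_prior + np.array([c.logpdf(x).sum() for c in candidates])
post = np.exp(log_post - log_post.max())
post /= post.sum()
print(post.round(4))   # mass concentrates on the candidate closest to mu_* in KL
```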

Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

The problem.

Let f, g, h be three densities and suppose that x_i \sim h, i \in \mathbb{N}, independently. What happens to the likelihood ratio

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}

as n\rightarrow \infty?

Clearly, it depends. If h = g \not = f, then

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0

almost surely at an exponential rate. More generally, if h is closer to g than to f, in some sense, we’d expect that \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0. Such a measure of “closeness” or “divergence” between probability distributions is given by the Kullback-Leibler divergence

D_{KL}(f, g) = \int f \log\frac{f}{g}.

It can be verified that D_{KL}(f,g) \geq 0 with equality if and only if f=g, and that

D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)

almost surely at an exponential rate. Thus the K.L.-divergence can be used to solve our problem.
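A small simulation illustrating (1), with arbitrary Gaussian choices of f, g and h: by the law of large numbers, \frac{1}{n}\sum_{i=1}^n \log\frac{f(x_i)}{g(x_i)} \rightarrow D_{KL}(h, g) - D_{KL}(h, f), which is negative when g is closer to h, so the likelihood ratio decays exponentially fast.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

h, f, g = norm(0.0, 1.0), norm(2.0, 1.0), norm(0.5, 1.0)   # g is closer to h than f
x = rng.normal(0.0, 1.0, size=(100, 5000))                  # 100 paths of length n = 5000

# (1/n) sum_i log(f(x_i)/g(x_i)) -> D_KL(h, g) - D_KL(h, f) = 0.125 - 2 < 0,
# so prod_i f(x_i)/g(x_i) -> 0 exponentially fast along (almost) every path.
slopes = (f.logpdf(x) - g.logpdf(x)).mean(axis=1)
print(slopes.mean(), slopes.max())   # both close to -1.875, and negative
```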

Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in (1) (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

This is of interest in Bayesian nonparametrics. The hypothesis that a density f_0 is in the Kullback-Leibler support of a prior \Pi on a density space \mathbb{F} is used to ensure that

\int_{\mathbb{F}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi(df)

does not converge to 0 exponentially fast as n\rightarrow \infty and x_i \sim f_0. Under the conditions we specify, our remark implies that the hypothesis that “f_0 is in the Kullback-Leibler support of \Pi” may not be replaced by a weaker one.

Statement of the remark

Some notation. Let \mathcal{M} be the space of all probability measures on some measurable space \mathbb{M}. If \mu, \nu \in \mathcal{M}, then both are absolutely continuous with respect to \tau = \mu + \nu and possess densities f, g such that d\mu = fd\tau and d\nu = g d\tau. We denote by d\mu /d\nu the ratio of densities f / g, which in fact does not depend on the choice of dominating measure \tau. The likelihood ratio \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} is abbreviated to \frac{d\mu}{d\nu}(X), depending implicitly on n. The Kullback-Leibler divergence between \mu and \nu is

D_{KL}(\mu, \nu) = \int \log\frac{d\mu}{d\nu} d\mu.

We let D be any other function \mathcal{M}\times \mathcal{M} \rightarrow [0,\infty] such that D(\mu, \nu) = 0 iff \mu = \nu and such that if D(\lambda, \nu) < D(\lambda, \mu), then there exists an \varepsilon > 0 with

e^{\varepsilon n} \frac{d\mu}{d\nu}(X) = e^{\varepsilon n} \prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0

almost surely as n \rightarrow \infty. The topology on a subset \mathcal{F} \subset \mathcal{M} induced by such a function D is the topology induced by the sets

\{\nu \,|\, D(\mu, \nu) < \varepsilon\},\quad \mu \in \mathcal{F}, \, \varepsilon > 0.

The remark below shows that any exponential rate of convergence of the likelihood ratio is picked up by the KL divergence. It is rather obvious (albeit a bit technical), but I thought it was worth writing up properly.

Remark.
Let x_i \sim \mu, i \in \mathbb{N}, independently, and let \frac{d\nu}{d\mu}(X) = \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i).

  1. We have that D_{KL}(\mu, \nu) < \infty if and only if \frac{d\nu}{d\mu}(X) does not converge more than exponentially fast to 0 (i.e. there exists \varepsilon > 0 such that e^{\varepsilon n}\frac{d\nu}{d\mu} (X) \rightarrow \infty).
  2. If \mathcal{F} \subset \mathcal{M} is such that D_{KL}(\mu, \nu) < \infty for all \mu, \nu \in \mathcal{F}, then

D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu)

and the topology on \mathcal{F} induced by D_{KL} is weaker than the topology of any other function D defined as above.

Proof of 1.
Suppose that D_{KL}(\mu, \nu) < \infty. Then, since by the strong law of large numbers D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow \varepsilon > 0, we find that

\log\left(e^{n (D_{KL}(\mu, \nu) + \varepsilon)} \frac{d\nu}{d\mu}(X)\right) = n (D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)) \rightarrow \infty

for all \varepsilon > 0.

If D_{KL}(\mu, \nu) = \infty, then for all \varepsilon > 0 we have

\log \left(e^{n \varepsilon}\frac{d\nu}{d\mu}(X)\right) = n\left(\varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu(x_i)}{d\nu(x_i)}\right) \rightarrow -\infty

since \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu(x_i)}{d\nu(x_i)} \rightarrow \infty. \Box

Proof of 2.
Suppose that there exist \lambda, \mu, \nu \in \mathcal{F} such that D(\lambda, \mu) < D(\lambda, \nu) but D_{KL}(\lambda, \mu) = D_{KL}(\lambda, \nu). Then, there is an \varepsilon > 0 such that

e^{2\varepsilon n} \frac{d\nu}{d\mu}(X)  = e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) /\left( e^{n(D_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \right) \rightarrow 0.

But e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) \rightarrow \infty and e^{n(D_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \rightarrow 0, which yields the contradiction.

Since D(\lambda, \mu) =0 iff \lambda = \mu, this implies that the topology on \mathcal{F} induced by D_{KL} is weaker than the one induced by D. \Box