Constrained semiparametric modelling (for directional statistics)

main

Explanations

Angular data arises in many scientific fields, such as in experimental biology for the study of animal orientation, and in bioinformatics in relation to the protein structure prediction problem.

 

 

The statistical analysis of this data requires adapted tools such as 2\pi-periodic density models. Fernandez-Duran (Biometrics, 60(2), 2004) proposed non-negative trigonometric sums (i.e. non-negative trigonometric polynomials) as a flexible family of circular distributions. However, the coefficients of trigonometric polynomials expressed in the standard basis 1, \cos(x), \sin(x), \dots are difficult to interpret and we do not see how an informative prior could be specified through this parametrization. Moreover, the use of this basis was criticized by Ferreira et al. (Bayesian Analysis, 3(2), 2008) as resulting in a “wigly approximation, unlikely to be useful in most real applications”.

Trigonometric density basis

Here, we suggest the use of a density basis of the trigonometric polynomials and argue it is well suited to statistical applications. In particular, coefficients of trigonometric densities expressed in this basis possess an intuitive geometric interpretation. Furthermore, we show how “wiggliness” can be precisely controlled using this basis and how another geometric constraint, periodic unimodality, can be enforced [first proposition on the poster]. To ensure that nothing is lost by using this basis, we also show that the whole model consists of precisely all positive trigonometric densities, together with the basis functions [first theorem on the poster].

Prior specification

Priors can be specified on the coefficients of mixtures in our basis and on the degree of the trigonometric polynomials to be used. Through the interpretability of the coefficients and the shape-preserving properties of the basis, different types of prior knowledge may be incorporated. Together with an approximate understanding of mass allocation, these include:

  • periodic unimodality;
  • bounds on total variation; and
  • knowledge of the marginal distributions (in the multivariate case).

The priors obtained this way are part of a well-studied family called sieve priors, including the well-known Bernstein-Dirichlet prior, and are finite mixtures with an unknown number of components. Most results and interpretations about the Bernstein-Dirichlet prior (see Petrone & Wasserman (J. R. Stat. Soc. B., 64(1),  2002), Kruijer and Van der Vaart (J. Stat. Plan. Inference, 138(7), 2008), McVinish et al. (Scand. J. Statist., 36(2), 2009) can carry over to the priors we consider, but we dot not discuss them further.

Approximation-theoric framework

Our density models arise as the image of “shape-perserving” linear approximation operators. This approximation-theoric relationship is used to obtain a notably large prior Kullback-Leibler support and ensures strong posterior consistency at all bounded (not necessarily continuous) density. The result partly relies on known properties of sieve priors, as well as general consistency results (Walker (Ann. Statist., 32(5), 2004)), but extends known result by removing an usual continuity hypothesis on the densities at which consistency is achieved (see Wu & Ghosal (‎Electron. J. Stat., 2, 2008), Petrone & Veronese (Statistica Sinica, 20, 2010)). For contraction rates, higher order smoothness conditions are usually required (see Shen & Ghosal (Scand. J. Statist., 42(4), 2015)).

For example, consider the prior induced by the random density

T_n \mathcal{D} := \sum_i \mathcal{D}(R_{i,n}) C_{i,n},\qquad (1)

where \mathcal{D} is a Dirichlet process, n is distributed on \mathbb{N} and R_{i,n} is a partition of the circle. It has the strong posterior consistency at all bounded density provided that the associated operator

T_n : f \mapsto \sum_i C_{i,n} \int_{R_{i,n}} f

is such that \|T_n f - f\|_\infty \rightarrow 0 for all continuous f.

More generally, let \mathbb{F} be a set of bounded densities on some compact metric space \mathbb{M}, let T_n : L^1(\mathbb{M}) \rightarrow L^1(\mathbb{M}), n \in \mathbb{N}, be a sequence of operators that are:

  • shape preserving: T_n maps densities to densities and T_n(\mathbb{F}) \subset \mathbb{F}; and
  • approximating: \|T_n f - f\|_\infty \rightarrow 0 for all continuous f;

and finally let \Pi_n be priors on T_n(\mathbb{F}) with full support. A sieve prior on \mathbb{F} is defined by

\Pi : A \mapsto \sum_n \rho(n) \Pi_n(A \cap T_n(\mathbb{F})).

Theorem.
If 0 < \rho(n) < Ce^{-c d_n} for some increasing sequence d_n bounding the dimensions of T_n (\mathbb{F}), then the posterior distribution of \Pi is strongly consistent at each density of \mathbb{F}.

The approximation theory literature is rich in such operators. The theorem shows that they provide strongly consistent priors on arbitrary density spaces simply given priors \Pi_n on T_n(\mathbb{F}).

Basic density estimation:

res_5_15_2

A thousand samples (grey histogram) were drawn from the density in orange. The prior is defined by (1) with the Dirichlet process centered on the uniform density and with a precision parameter of 2. The degree n is distributed as a \text{Poiss}(15). The blue line is the posterior mean, the dark blue shaded region is a 50% pointwise credible region around the median, and the light blue shaded region is a 90% credible region.

A limit

Félix Locas presented me this problem.

Let r(n) = \lfloor \log_2 \frac{n}{\log_2 n} \rfloor. Show that

\lim_{n \rightarrow \infty} \left( \log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} \right)^n = 1.

My solution

The series \sum_{k=1}^{\infty} \frac{1}{k(k+1) 2^k} is easy to calculate. It is, for instance, the difference between the integrals of geometric series:

\sum_{k=1}^\infty \frac{1}{k(k+1) 2^k} = \sum_{k=1}^\infty \frac{1}{k 2^k} - \sum_{k=1}^\infty \frac{1}{(k+1) 2^k} = 1-\log 2.

Furthermore, abbreviating r = r(n),

r^{3/2} 2^{r} \sum_{k=r+1}^\infty \frac{1}{k(k+1) 2^{k}} \le \sum_{k=0}^\infty \frac{1}{\sqrt{r} 2^k} \xrightarrow{r \rightarrow \infty} 0

implies that for n sufficiently large we have \sum_{k=r+1}^\infty \frac{1}{k(k+1) 2^{k}} < r^{-3/2} 2^{-r} and

\log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} = 1-\sum_{k=r+1}^{\infty} \frac{1}{k(k+1) 2^k} \geq 1 - r^{-3/2} 2^{-r}. \qquad (*)

Finally, since r = \log_2 \frac{n}{\log_2 n} - \varepsilon_n for some 0 \le\varepsilon_n < 1, we have

n r^{-3/2} 2^{-r} = \frac{2^{\varepsilon_n}\log_2 n}{(\log_2 n - \log_2 \log_2 n - \varepsilon_n)^{3/2}} \rightarrow 0

which implies that

\left( 1 - r^{-3/2} 2^{-r} \right)^n \rightarrow 1.

Since also \log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} \le 1 comparing this with (*) yields

\lim_{n \rightarrow \infty} \left( \log 2+\sum_{k=1}^{r(n)} \frac{1}{k(k+1) 2^k} \right)^n = 1.

The discretization trick

I explain the discretization trick that I mentioned in my previous post (Posterior consistency under possible misspecification). I think it was introduced by Walker (New approaches to Bayesian consistency (2004)).

Let \mathbb{F} be a set of densities and let \Pi be a prior on \mathbb{F}. If x_1, x_2, x_3, \dots follows some distribution P_0 having density f_0, then the posterior distribution of \Pi can be written as

\Pi(A | x_1, \dots, x_n) \propto \int_A \prod_{i=1}^n f(x_i) \Pi(df).

The discretization trick is to find densities f_{1}, f_2, f_3, \dots in the convex hull of A (taken in the space of all densities) such that

\int_A \prod_{i=1}^n f(x_i) = \prod_{i=1}^n f_i(x_i) \Pi(A).

For example, suppose \varepsilon > 2\delta > 0, A = A_\varepsilon = \{f \,|\, D_{\frac{1}{2}}(f_0, f) > \varepsilon\} and that A_i is a partition of A of diameter at most \delta. If there exists 1 > \alpha > 0 such that

\sum_i \Pi(A_i)^\alpha < \infty,

then for some \beta > 0 we have that

e^{n \beta} \left(\int_{A_{\varepsilon}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi (df) \right)^\alpha \le e^{n \beta} \sum_i \prod_{j=1}^n \left(\frac{f_{i,j}(x_j)}{f_0(x_j)}\right)^\alpha \Pi(A_i)^\alpha  \rightarrow 0

almost surely. This is because, with A_{\alpha} the \alpha-affinity defined here, we have that

eq_5

goes exponentially fast towards 0 when \beta is sufficiently small.  Hence the Borel-Cantelli lemma applies.

Construction

The f_{i,j}‘s are defined as the posterior mean predictive density, when the posterior is conditioned on A_i. That is,

f_{i,j} : x \mapsto \frac{\int_{A_i} f(x) \prod_{k=1}^{j-1}f(x_k) \Pi(df)}{\int_{A_i} \prod_{k=1}^{j-1}f(x_k) \Pi(df)}

and

f_{i, 1} : x \mapsto \frac{\int_{A_i} f(x) \Pi(df)}{\Pi(A_i)}.

Clearly

\int_{A_i} \prod_{i=1}^n f(x_i) \Pi(df) = \prod_{j=1}^n f_{i,j}(x_j) \Pi(A_i).

Furthermore, if A_i is contained in a Hellinger ball of center g_i and of radius \delta, then also

H(f_{i,j}, g_i) < \delta.

This follows form the convexity of the Hellinger balls (an important remark for the generalization of this trick).

Posterior consistency under (possible) misspecification

We assume, without too much loss of generality, that our priors are discrete. When dealing with Hellinger separable density spaces, it is possible to discretize posterior distributions to study consistency (see this post about it).

Let \Pi be a prior on a countable space \mathcal{N} = \{f_1, f_2, f_3, \dots\} of probability density functions, with \Pi(f) > 0 for all f \in \mathcal{N}. Data X_1, X_2, X_3, \dots follows (independently) some unknown distribution P_0 with density f_0.

We denote by D_{KL}(f_0, f) = \int f_0 \log\frac{f_0}{f} the Kullback-Leibler divergence and we let D_{\frac{1}{2}}(f_0, f) = 1 - \int \sqrt{f_0 f} be half of the squared Hellinger distance.

The following theorem states that the posterior distribution of \Pi accumulates in Hellinger neighborhoods of f_0, assuming the prior is root-summable (i.e. \sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty for some \alpha > 0) . In the well-specified case (i.e. \inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0), the posterior accumulates in any neighborhood of f_0. In the misspecified case, small neighborhoods of f_0 could be empty, but the posterior distribution still accumulates in sufficiently large neighborhoods (how large exactly is a function of \alpha and \inf_{f \in \mathcal{N}} D_{KL}(f_0, f)).

The result was essentially stated by Barron (Discussion: On the Consistency of Bayes Estimates, 1986). In the case where \Pi is not necessarily discrete, a similar result was obtained, through a discretization argument, by Walker (Bayesian asymptotics with misspecified models, 2013). See also Xing (Sufficient conditions for Bayesian consistency, 2009) for a thorough treatment of Bayesian consistency using the same method of proof.

Theorem (Barron).
Suppose \beta_0 :=\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) < \infty and that

\alpha := \inf \left\{ p \in [\tfrac{1}{2},1] \,|\, \sum_{f \in \mathcal{N}} \Pi(f)^p < \infty \right\} < 1.

If \varepsilon > 0 is such that

\varepsilon > 1- \exp\left( \frac{-\beta_0 \alpha}{2(1-\alpha)} \right)

and if A_\varepsilon := \{f \in \mathcal{N} \,|\, D_{\frac{1}{2}} (f_0, f) < \varepsilon\} \not = \emptyset, then

\Pi\left(A_\varepsilon \,|\, \{x_i\}_{i=1}^n\right) \rightarrow 1

almost surely as n \rightarrow \infty.

Remarks.

1 – If \inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0, then any \varepsilon > 0 can be used.

2 – \alpha is related to the rate of convergence of \rho(n) := \Pi(f_n) towards 0. The quantity H_\alpha (\Pi) = \log \sum_{f \in \mathcal{N}} \Pi(f)^\alpha can be thought as measure of entropy.

3 – Walker (2013) considered the case \sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty for some \alpha < \frac{1}{2}. This implies that \sum_{f \in \mathcal{N}} \sqrt{\Pi(f)} < \infty and the above theorem can also be applied in this case.

Demonstration

The proof is brief. I do not dwell on explanations.

Background.
First let me recall some concepts. The \alpha-affinity between two densities f and g is defined as

A_\alpha(f, g) = \int g^\alpha f^{1-\alpha}. \qquad (1)

Note that 0 \le A_{\alpha}(f, g) \le 1 and that A_\alpha(f, g) = 1 if and only if f = g. Furthermore, when \alpha \geq \frac{1}{2}, Holder’s inequality and Jensen’s inequality yield

A_{\frac{1}{2}}(f, g) \le A_{\alpha}(f, g) \le \left(A_{\frac{1}{2}}(f,g)\right)^{2(1-\alpha)}. \qquad (2)

Proof.
We can now begin the proof. Let \tilde{\alpha} be such that 1 > \tilde{\alpha} > \alpha. Then, we have

eq_1

If \beta > 0 and g \in \mathcal{N} is such that D_{KL}(f_0, g) < \beta_0 + \beta, then

eq_2

almost surely. Furthermore, using (2), we find

eq_3

Here \text{cst.} is a constant. Since \beta > 0 is arbitrary and since \tilde \alpha can be taken so that 2(1-\tilde \alpha) \log (1-\varepsilon) + \tilde \alpha \beta_0 < 0, we obtain that (**) converges exponentially fast towards 0. Hence, by the Borel-Cantelli lemma, we have

eq_4

almost surely. This, together with (3), implies that (*) \rightarrow 0 almost surely. \Box

 

Poster critique

I printed my poster today (see this post) for the 11th Bayesian nonparametrics conference. Here’s the final version (42in x 60in).

poster_final.png

While I was first satisfied with it, as it appeared on my computer screen, staring at it full size changed my perception.

Before jumping into the critique, let me list a few things I like about the poster.

  1. The content, which has been thoroughly reviewed.
  2. The text size (44pt). It is clear and easy to read between 3 to 9 feet away.
  3. It is self-contained and explains itself.
  4. I left plenty out. I listed the two main contributions of our work and that’s what I talk about.

Now for the things I dislike.

  1. The titles are too small.
  2. Using letters in both calligraphic and normal font. This makes it harder to refer to the poster.
  3. It looks messy. The structure of the poster should be graphically emphasized rather than follow columns of text.
  4. The white background is unattractive; the title banner is bland.
  5. “References” should be replaced by “Works cited”.
  6. Should I talk more about Bayesian procedures / posterior simulation? I could have shown posterior mean estimates, the graphs of densities smoothed using our operators, etc.

In short, I have mixed feelings about the poster. On one hand, I like that the text is large, that it is self-contained and that it goes pretty much straight to the point. There are no meaningless graphics under a “result” banner. On the other hand, it looks messy and I’m afraid people will get lost in it.

Next one will be better.

The choice of prior in bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

The structure of \mathcal{M}

Recall that \mathbb{M} is is a Polish space (ie. a complete and separable metric space). It is endowed with its borel \sigma-algebra \mathfrak{B} which is the smallest family of subsets of \mathbb{M} that contains its topology and that is closed under countable unions and intersections. All subsets of \mathbb{M} we consider in the following are supposed to be part of \mathfrak{B}. A probability measure on \mathbb{M} is a function \mu : \mathfrak{B} \rightarrow [0,1] such that for any countable partition A_1, A_2, A_3, \dots of \mathbb{M} we have that \sum_{i=1}^\infty \mu(A_i) = 1. The set \mathcal{M} consists of all such probability measures.

Note that since \mathbb{M} is complete and separable, every probability measure \mu \in \mathcal{M} is regular (and tight). It means that the measure of any A\subset \mathbb{M} can be well approximated from the measure of compact subsets of A as well as from the measure of open super-sets of A:

\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.

Metrics on \mathcal{M}

Let me review some facts. A natural metric used to compare the mass allocation of two measures \mu, \nu \in \mathbb{M} is the total variation distance defined by

\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.

It is relatively straightforward to verify that \mathcal{M} is complete under this distance, but it is not in general separable. To see this, suppose that \mathbb{M} = [0,1]. If a ball centered at \mu contains a dirac measure \delta_x, x \in [0,1], then \mu must have a point mass at x. Yet any measure contains at most a countable number of point masses, and there is an uncountable number of dirac measures on [0,1]. Thus no countable subset of \mathcal{M} can cover \mathcal{M} up to an \varepsilon of error.

This distance can be relaxed to the Prokhorov metric, comparing mass allocation up to \varepsilon-neighborhoods. It is defined as

d_P(\mu, \nu) = \inf \left\{ \varepsilon > 0 \,|\, \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon \text{ and } \nu(A) \le \mu(A^{\varepsilon}) + \varepsilon,\; \forall A \subset \mathbb{M} \right\},

where A^{\varepsilon} = \{x \in \mathbb{M} \,|\, d(x, A) < \varepsilon\} is the \varepsilon-neighborhood of A. It is a metrization of the topology of weak convergence of probability measures, and \mathcal{M} is separable under this distance.

The compact sets of \mathcal{M} under the Prokhorov metric admit a simple characterization given by the Prokhorov theorem: P \subset \mathcal{M} is precompact if and only if P is uniformly tight (for each \varepsilon > 0, there exists a compact K \subset X such that \sup_{\mu \in P} \mu(K) \geq 1-\varepsilon). This means that a sequence \{\mu_n\} \subset \mathcal{M} admits a weakly convergent subsequence if and only if \{\mu_n\} is uniformly tight.

Characterizations of weak convergence are given by the Portemanteau theorem, which says in particular that \mu_n converges weakly to \mu if and only if

\int f d\mu_n \rightarrow \int f d\mu

for all continuous and bounded and continuous f. It is also equivalent to

\mu_n(A) \rightarrow \mu(A)

for all sets A such that \mu(\partial A) = 0.

Measures of divergence

In addition to metrics, that admit a geometric interpretation through the triangle inequality, statistical measures of divergence can also be considered. Here, we consider functions D : \mathcal{M}\times \mathcal{M} \rightarrow [0, \infty] that can be used to determine the rate of convergence of the likelihood ratio

\prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0,

where x_i \sim \nu and \mu, \nu \in \mathcal{M}.

Kullback-Leibler divergence

The weight of evidence in favor of the hypothesis “\lambda = \mu” versus “\lambda = \nu” given a sample x is defined as

W(x) = \log\frac{d\mu}{d\nu}.

It measures how much information about the hypotheses is brought by the observation of x. (For a justification of this interpretation, see Good (Weight of evidence: a brief survey, 1985).) The Kullback-Leibler divergence D_{KL} between \mu and \nu is defined as the expected weight of evidence given that x \sim \mu:

D_{KL}(\mu, \nu) =\mathbb{E}_{x \sim \mu} W(x) = \int \log \frac{d\mu}{d\nu} d\mu.

The following properties of the Kullback-Leibler divergence support its interpretation as an expected weight of evidence.

Theorem 1 (Kullback and Leibler, 1951).
We have

  1. D_{KL}(\mu, \nu) \geq 0 with equality if and only if \mu = \nu;
  2. D_{KL}(\mu T^{-1}, \nu T^{-1}) \geq D_{KL}(\mu, \nu) with equality if and only if T: \mathbb{M} \rightarrow \mathbb{M}' is a sufficient statistic for \{\mu, \nu\}.

Furthermore, the KL divergence can be used to precisely identify exponential rates of convergence of the likelihood ratio. The first part of the next proposition says that D_{KL}(\lambda, \nu) is finite if and only if the likelihood ratio \prod_{i} \frac{d\nu}{d\lambda}(x_i), x_i \sim \lambda cannot convergence super-exponentially fast towards 0. The second part identifies the rate of convergence then the KL divergence is finite.

Proposition 2.
Let x_1, x_2, x_3, \dots \sim \lambda (independently). The KL divergence D_{KL}(\lambda, \nu) is finite if and only if there exists an \alpha > 0 such that

e^{n\alpha} \prod_{i=1}^n \frac{d\nu}{d\lambda}(x_i) \rightarrow \infty

with positive probability.

Finally, suppose we are dealing with a submodel \mathcal{F} \subset \mathcal{M} such that the rates of convergences of the likelihood ratios in \mathcal{F} are of an exponential order. By the previous proposition, this is equivalent to the fact that \forall \mu, \nu \in \mathcal{F}, D_{KL}(\mu, \nu) < \infty. We can show that the KL divergence is, up to topological equivalence, the best measure of divergence that determines the convergence of the likelihood ratio. That is, suppose D: \mathcal{F} \times \mathcal{F}\rightarrow [0, \infty] is such that

D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i) \rightarrow 0

at an exponential rate, almost surely when x_i \sim \lambda, and that D(\lambda, \mu) = 0 if and only if \lambda = \mu. Then, the topology induced by D_{KL} is coarser than the topology induced by D.

Proposition 3.
Let D be as above and let \mathcal{F} \subset \mathcal{M} be such that \forall \mu, \nu \in \mathcal{F}, D_{KL}(\mu, \nu) < \infty. Then, the topology on \mathcal{F} induced by D_{KL} is weaker than the topology induced by D. More precisely, we have that

D(\lambda, \mu) < D(\lambda, \nu) \Rightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu).

 

alpha-affinity and alpha-divergence

We define the \alpha-affinity between two probability measures as the expectancy of another transform of the likelihood ratio. Let \mu, \nu be two probability measures dominated by \lambda, with d\mu = f d\lambda and d\nu = g d\lambda. Given 0 < \alpha < 1, the \alpha-affinity between \mu and \nu is

A_\alpha(\mu, \nu) = \int \left(\frac{g}{f}\right)^\alpha d\mu = \int g^\alpha f^{1-\alpha} d\lambda.

Proposition 4.
For all 0 < \alpha < 1, we have that

1. A_\alpha(\mu, \nu) \le 1 with equality if and only if \mu = \nu;

2. A_\alpha is monotonous in \alpha and jointly concave in its arguments;

3. A_\alpha is jointly multiplicative under products:

A_\alpha (\mu^{(n)}, \nu^{(n)}) = \left(A_{\alpha}(\mu, \nu)\right)^n.

4. if \frac{1}{2} \leq \alpha, then

A_{\frac{1}{2}} \le A_\alpha \le \left(A_{\frac{1}{2}}\right)^{2(1-\alpha)};

Proof.
1-2 follow from Jensen’s inequality and the joint concavity of (x,y) \mapsto x^\alpha y^{1-\alpha}. 3 follows from Fubini’s theorem. For
(iv), the first inequality is a particular case of 2 and Hölder’s inequality finally yields

A_{\alpha}(\mu, \nu) = \int (fg)^{1-\alpha} g^{2\alpha - 1} d\lambda \le \left( \int \sqrt{fg} \,d\lambda \right)^{2-2\alpha} = A_{\frac{1}{2}}(\mu, \nu).

\Box

The \alpha-divergence D_\alpha is obtained as

D_\alpha = 1 - A_\alpha.

Other similar divergences considered in the litterature are

\frac{1-A_\alpha}{\alpha(1-\alpha)}\; \text{ and }\; \frac{\log A_\alpha}{\alpha(1-\alpha)},

but we prefer D_\alpha for its simplicity. When \alpha = \frac{1}{2}, it is closely related to the hellinger distance

H(\mu, \nu) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^2d\lambda\right)^{\frac{1}{2}}

through

D_{\frac{1}{2}}(\mu, \nu) = \frac{H(\mu, \nu)^2}{2}.

Other important and well-known inequalities are given below.

Proposition 5.
We have

D_{\frac{1}{2}}(\mu, \nu) \le \|\mu-\nu\|_{TV} \le \sqrt{2 D_{\frac{1}{2}}(\mu, \nu)}

and

2D_{\frac{1}{2}}(\mu, \nu) \le D_{KL}(\mu, \nu) \le 2\left\|\frac{f}{g}\right\|_\infty \|\mu-\nu\|_{TV}.

This, together with proposition 4 (4) , yields similar bounds for the other divergences.

Finite models

Let \Pi be a prior on \mathcal{M} that is finitely supported. That is, \Pi = \sum_{i=1}^n p_i \delta_{\mu_i} for some \mu_i \in \mathcal{M} and p_i > 0 with \sum_i p_i = 1. Suppose that x_1, x_2, x_3, \dots independently follow some \mu_* \in \mathcal{M}.

The following proposition ensures that as data is gathered, the posterior distribution of \Pi concentrates on the measures \mu_i that are closest to \mu_*.

Proposition 6.
Let A_{\varepsilon} = \{\mu_i \,|\, D_{KL}(\mu_*, \mu_i) < \varepsilon \}. If A_\varepsilon \not = \emptyset, then

\Pi(A_\varepsilon \,|\, \{x_i\}_{i=1}^m) \rightarrow 1

almost surely as m \rightarrow \infty.

Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

The problem.

Let f, g, h be three densities and suppose that, x_i \sim h, i \in \mathbb{N}, independently. What happens to the likelihood ratio

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}

as n\rightarrow \infty?

Clearly, it depends. If h = g \not = f, then

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0

almost surely at an exponential rate. More generally, if h is closer to g than to f, in some sense, we’d expect that \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0. Such a measure of “closeness” of “divergence” between probability distributions is given by the Kullback-Leibler divergence

D_{KL}(f, g) = \int f \log\frac{f}{g}.

It can be verified that D_{KL}(f,g) \geq 0 with equality if and only if f=g, and that

D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)

almost surely at an exponential rate. Thus the K.L.-divergence can be used to solve our problem.

Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in (1) (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.

This is of interest in Bayesian nonparametrics. The hypothesis that a density f_0 is in the Kullback-Leibler support of a prior \Pi on a density space \mathbb{F} is used to ensure that

\int_{\mathbb{F}} \prod_{i=1}^n \frac{f(x_i)}{f_0(x_i)} \Pi(df)

does not converge to 0 exponentially fast as n\rightarrow \infty and x_i \sim f_0. Under the conditions we specify, our remark implies that the hypothesis that “f_0 is in the Kullback-Leibler support of \Pi” may not be replaced by a weaker one.

Statement of the remark

Some notations. Let \mathcal{M} the space of all probability measures on some measurable space \mathbb{M}. If \mu, \nu \in \mathcal{M}, then both are absolutely continuous with respect to \tau = \mu + \nu and possess densities f, g such that d\mu = fd\tau and d\nu gd\tau. We denote by d\mu /d\nu the ratio of densities f / g, which in fact does not depend on the choice of dominating measure \tau. The likelihood ratio \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} is abbreviated to \frac{d\mu}{d\nu}(X), depending implicitely on n. The Kullback-Leibler divergence between \mu and \nu is

D_{KL}(\mu, \nu) = \int \log\frac{d\mu}{d\nu} d\mu.

We let D be any other function \mathcal{M}\times \mathcal{M} \rightarrow [0,\infty] such that D(\mu, \nu) = 0 iff \mu = \nu and such that if D(\lambda, \nu) < D(\lambda, \mu), then there exists a \varepsilon > 0 with

e^{\varepsilon n} \frac{d\mu}{d\nu}(X) = e^{\varepsilon n} \prod_{i=1}^n \frac{d\mu}{d\nu}(x_i) \rightarrow 0

almost surely as n \rightarrow \infty. The topology on a subset \mathcal{F} \subset \mathcal{M} induced by such a function D is the topology induced by the sets

\{\nu \,|\, D(\mu, \nu) < \varepsilon\},\quad \mu \in \mathcal{F}, \, \varepsilon > 0.

The remark below shows that any exponential rate of convergence of the likelihood ratio is picked up by the KL divergence. It is rather obvious (albeit a bit technical), but I thought it was worth writing it up properly.

Remark.
Let x_i \sim \mu, i \in \mathbb{N}, independently, and let \frac{d\nu}{d\mu}(X) = \prod_{i=1}^n \frac{d\nu}{d\mu}(x_i).

  1. We have that D_{KL}(\mu, \nu) < \infty if and only if \frac{d\nu}{d\mu}(X) does not converge more than exponentially fast to 0 (i.e. there exists \varepsilon > 0 such that e^{\varepsilon n}\frac{d\nu}{d\mu} (X) \rightarrow \infty).
  2. If \mathcal{F} \subset \mathcal{M} is such that D_{KL}(\mu, \nu) < \infty for all \mu, \nu \in \mathcal{F}, then

D(\lambda, \mu) < D(\lambda, \nu) \Longrightarrow D_{KL}(\lambda, \mu) < D_{KL}(\lambda, \nu)

and the topology on \mathcal{F} induced by D_{KL} is weaker than the topology of any other function D defined as above.

Proof of 1.
Suppose that D_{KL}(\mu, \nu) < \infty. Then, since by the strong law of large numbers D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i) \rightarrow \varepsilon > 0, we find that

\log\left(e^{n (D_{KL}(\mu, \nu) + \varepsilon)} \frac{d\nu}{d\mu}(X)\right) = n (D_{KL}(\mu, \nu) + \varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu}{d\nu}(x_i)) \rightarrow \infty

for all \varepsilon > 0.

If D_{KL}(\mu, \nu) = \infty, then for all \varepsilon > 0 we have

\log \left(e^{n \varepsilon}\frac{d\nu}{d\mu}(X)\right) = n\left(\varepsilon - \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu(x_i)}{d\nu(x_i)}\right) \rightarrow -\infty

since \frac{1}{n}\sum_{i=1}^n \log\frac{d\mu(x_i)}{d\nu(x_i)} \rightarrow \infty. \Box

Proof of 2.
Suppose that there exists \lambda, \mu, \nu \in \mathcal{F} such that D(\lambda, \mu) < D(\lambda, \nu) but D_{KL}(\lambda, \mu) = D_{KL}(\lambda, \nu). Then, there is a \varepsilon > 0 such that

e^{2\varepsilon n} \frac{d\nu}{d\mu}(X)  = e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) /\left( e^{n(d_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \right) \rightarrow 0.

But e^{n(D_{KL}(\lambda, \nu) + \varepsilon )} \frac{d\nu}{d\lambda}(X) \rightarrow \infty and e^{n(d_{KL}(\lambda, \mu) - \varepsilon)} \frac{d\mu}{d\lambda}(X) \rightarrow 0, which yields the contradiction.

Since D(\lambda, \mu) =0 iff \lambda = \mu, this implies that the topology on \mathcal{F} induced by D_{KL} is weaker than the one induced by D. \Box

The choice of prior in bayesian nonparametrics – Introduction

In preparation for the 11th Bayesian nonparametrics conference, I’m writing (and rewriting) notes on the background of our research (i.e. some of the general theory of bayesian nonparametrics). There are some good books on the subject (such as Bayesian Nonparametrics (Ghosh and Ramamoorthi, 2003)), but I wanted a more introductory focus and to present Choi and Ramamoorthi’s very clear point of view on posterior consistency (Remarks on the consistency of posterior distributions, 2008).

1. Introduction

Let \mathbb{X} be a complete and separable metric space and let \mathcal{M} be the space of all probability measures on \mathbb{X}. Some unknown distribution P_0\in \mathcal{M} is generating observable data \mathcal{D}_n = (X_1, X_2, \dots, X_n) \in \mathbb{X}^n, where each X_i is independently drawn from P_0. The problem is to learn about P_0 using only \mathcal{D}_n and prior knowledge.

Example (Discovery probabilities).
A cryptographer observes words, following some distribution P_0, in an unknown countable language \mathcal{L}. What are the P_0-probabilities of the words observed thus far? What is the probability that the next word to be observed has never been observed before?

1.1 Learning and uncertainty

We need an employable definition of learning. As a first approximation, we can consider learning to be the reduction of uncertainty about what is P_0. This requires a quantification of how uncertain we are to begin with. Then, hopefully, as data is gathered out uncertainty decreases and we are able to pinpoint P_0.

This is the core of Bayesian learning, alghough our definition is not yet entirely satisfactory. There are some difficulties with this idea of quantifying uncertainty, at least when using information-theoric concepts. The solution we adopt here is the use of probabilities to quantify uncertain knowledge (bayesians would also talk of subjective probabilities quantifying rational belief). For example, you may know that a coin flip is likely to be fair, although it is not impossible the two sides of the coin are both the same. This is uncertain knowledge about the distribution of heads and tails in the coin flips, and you could assign probabilities to the different possibilities.

More formally, prior uncertain knowledge about what is P_0 is quantified by a probability measure \Pi on \mathcal{M}. For any A \subset \mathcal{M}, \Pi(A) is the the prior probability that “P_0 \in A“. Then, given data \mathcal{D}_n, prior probabilities are adjusted to posterior probabilities: \Pi becomes \Pi_n, the conditional distribution of \Pi given \mathcal{D}_n. The celebrated Bayes’ theorem provides a formula to calculate \Pi_n from \Pi and \mathcal{D}_n. Thus we have an operational definition of learning in our statistical framework.

Learning is rationally adjusting uncertain knowledge in the light of new information.

For explanations as to why probabilities are well suited to the representation of uncertain knowledge, I refer the reader to Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988). We will also see that the operation of updating the prior to posterior posterior probabilities does work as intended.

1.2 The choice of prior

Specifying prior probabilities, that is quantifying prior uncertain knowledge, is not a simple task. It is especially difficult when uncertainty is over the non-negligeable part \mathcal{M} of an infinite dimensional vector space. Fortunately, “probability is not about numbers, it is about the structure of reasoning”, as Glenn Shafer puts it (cited in Pearl, 1988, p.15). The exact numbers given to the events “P_0 \in A” are not of foremost importance; what matters is how probabilities are more qualitatively put together, and how this relates to the learning process.

Properties of prior distributions, opening them to scrutiny, criticism and discussion, must be identified and related to what happens as more and more data is gathered.

Part 2.

Approximation

Présentation (20 minutes) au séminaire du 5e.

Je présente le théorème d’approximation de Weierstrass pour les fonctions périodiques, en utilisant une base des polynômes trigonométriques récemment suggérée par Róth et al. (2009). Celle-ci se prête naturellement bien à notre application.

Théorème d’approximation de Weierstrass.
Soit f : \mathbb{R} \rightarrow \mathbb{R} une fonction 2\pi-périodique. Si f est continue, alors on peut construire des polynômes trigonométriques f_1, f_2, f_3, \dots tels que

f(x) = \sum_{i=1}^{\infty} f_i(x)

et tels que la convergence de la série ci-dessus est uniforme.

Ce théorème intervient dans plusieurs domaines: en topologie pour démontrer le théorème du point fixe de Brouwer, en géométrie pour l’inégalité isopérimétrique et en géométrie algébrique pour le théorème de Nash-Tognoli. Il implique que \{1, \cos(x), \sin(x), \cos(2x), \sin(2x), \dots\}, en tant que système orthonormal, est complèt dans L^2(\mathbb{S}^1). Plus généralement, on s’en sert pour ramener un problème sur les fonctions continues à un problème sur les polynômes, où le calcul différentiel et l’algèbre linéaire s’appliquent. Les démonstrations constructives du théorème fournissent de plus des outils permettant d’effectuer la régression ou la reconstruction de courbes et de surfaces.

Notions de base

Un polynôme trigonométrique (de degré m) est une fonction de x prenant la forme

a_0 + \sum_{n=1}^m \left\{a_n \cos(nx) + b_n \sin(nx)\right\}.

Notons que les sommes et les produits de polynômes trigonométriques sont encore de tels polynômes. De façon un peu moins évidente, ils forment un système de Chebyshev: pour tout ensemble \{x_i\}_{i=1}^{2m+1} de points distincts et pour tout \{y_i\}_{i=1}^{2m+1}\subset \mathbb{R}, il existe un unique polynôme trigonométrique P_m de degré m tel que

P_m(x_i)=y_i, \quad \forall i \in \{1,2,\dots, 2m+1\}.

Les fonctions périodiques et continues sur \mathbb{R} s’identifie aux fonctions continues sur le cercle

\mathbb{S}^1 = \mathbb{R}/2\pi\mathbb{Z}

munit de la distance de la longueur d’arc

d(u,v) = \min_{k \in \mathbb{Z}} |u-v+2\pi k|.

Cela suit d’un principe général: les fonctions continues sur \mathbb{R}^k et invariantes sous l’action d’un groupe G, dont les orbites ne possèdent pas de points d’accumulation, s’identifient aux fonctions continues sur le quotient \mathbb{R}^k / G munit de la distance d([u],[v]) = \min_{g, h\in G} \|g(u) - h(v)\|. (Ici, [u] est l’orbite de u sous l’action de G.) On peut obtenir, avec ces quotients, des surfaces telles le tore, le ruban de Mobius et la bouteille de Klein.

Démonstration du théorème

Considérons la fonction positive

C_{0,n}(u) = c_n\left(1+\cos(u)\right)^n,\quad c_n = \frac{2\pi}{2n+1}\left(\int_{0}^{2\pi} (1+\cos u)^n du\right)^{-1}

et ses translatées

C_{j,n}(u) = C_{0,n}\left(u - \tfrac{2\pi j}{2n+1} \right),\quad j=0,1, \dots, 2n

qui sont disposés d’une façon régulière autour du cercle.

Les C_{j,n} forment une partition de l’unité.

Remarquons que pour tout i \in \{0,1, \dots, 2n\}, on a

\sum_{j=0}^{2n} C_{j,n}\left(\tfrac{2\pi j}{2n+1}\right) = \sum_{j=0}^{2n} C_{j,n}(0)

par symétrie cyclique des C_{j,n}. Ainsi, la fonction \sum_{j=0}^{2n}C_{j,n} est constante en 2n+1 points. Comme c’est un polynôme trigonométrique de degré n et que ceux-ci forment un système de Chebyshev, il faut alors que ce soit une constante. Or,

\int_{0}^{2\pi} \sum_{j=0}^{2n} C_{j,n}(u) du = \sum_{j=0}^{2n}  \frac{2\pi\int_0^{2\pi} \left(1+\cos\left(u - \tfrac{2\pi j}{2n+1}\right)\right)^n du}{(2n+1)\int_0^{2\pi}\left(1+\cos u\right)^n du} = 2\pi,

d’où

\sum_{j=-n}^n C_{j,n} \equiv 1.

Construction des approximants

Posons

T_n(u) = \sum_{j=0}^{2n} f\left(\tfrac{2\pi j}{2n+1}\right) C_{j,n}(u), \quad T_0 = 0,

et montrons que T_n converge uniformément vers f lorsque n \rightarrow \infty. En prenant f_n = T_n-T_{n-1}, on aura démontré le théorème.

On calcule, en utilisant le fait que \sum_{j=0}^{2n} C_{j,n} \equiv 1,

|T_n (u) - f(u)| \le \sum_{j=0}^{2n} \left| f\left(\tfrac{2\pi j}{2n+1}\right) - f(u) \right| C_{j,n}(u) = (*).

Fixons maintenant \varepsilon > 0 et posons \delta > 0 tel que d(u,v) < \delta \Rightarrow |f(u) - f(v)| < \varepsilon. Avec A = \{j \,|\, d(u, \tfrac{2\pi j}{2n+1}) < \delta\}, la somme ci-dessus s’écrit

(*) \le \varepsilon \sum_{j \in A} C_{j,n}(u) + 2 \sup_x |f(x)| \sum_{j \in A^c} C_{j,n}(u) \le \varepsilon + 2 \|f\|_\infty \sum_{j \in A^c} C_{j,n}(u).

Or,

\sum_{j \in A^c} C_{j,n}(u) \le (2n+1) C_{0,n}(\delta) \rightarrow 0,

d’où pour n suffisamment grand on obtient

|T_n (u) - f(u)| < 2\varepsilon.

Comme \delta et n ne dépendent pas de u, et puisque \varepsilon > 0 était arbitraire, T_n converge uniformément vers f. CQFD.

References

[1] Róth. Á. et al. (2009). A cyclic basis for closed curve and surface modelling. Computer Aided Geometric Design, 26, 528-546.

[2] Bernstein, S. (1912). Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités.