Elementary topology and machine learning

The machine learning bubble may be overinflated, but it is not about to burst. Interdisciplinary research in this field is grounded in sound theory and has numerous empirical breakthroughs to show for it. As the field finds more and more applications and attracts a growing share of public research funding, many of us are still wondering: how can mathematics contribute?
This is a case study of an interaction between elementary topology and machine learning’s binary classification problem. Following classical theorems, we obtain a topologically accurate solution.

Correction: it should read \sup_{i,j}\left| E_{i,j}^n - (k+1)^2 \int_{R_{i,j}^n} f \right| = o(1/n), with k = k(n) growing correspondingly more slowly with n.

PDF note here.

 

Bayesian learning

Friday, July 28 at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical Congress of the Americas and at the Canadian Undergraduate Mathematics Conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some tough problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and asymptotic correctness rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.

1. What is learning?

This is the question that drives the talk. We do not want to deal with it in full generality, but rather want an operational definition of learning that allows us to program algorithms that can learn from experience. The suggestion is the following:

Learning is about rationally adjusting uncertain knowledge in the light of new information.

Bayesian learning implements this idea through the calculus of probability. Let’s see this in a simple context.

1.1 The classical problem

Some stochastic mechanism generates data x_1, x_2, x_3, \dots, each x_i being an independent sample from an unknown distribution P_0. We want to use the data to learn about P_0.

Example

A cryptographer observes words at random from an unknown countable language. Learning could be about:

  • What’s the distribution of the observed words?
  • What’s the probability that the next word to be observed has never been seen before?

1.2 Bayesian Learning

Uncertain knowledge about what P_0 may be is encoded through a prior probability measure \Pi on the space of all possibilities for P_0. Thus, if A is some set of probability distributions, then \Pi(A) is the probability, representing the best of our knowledge, that “P_0 \in A”. Then, we observe the data \mathcal{D} = (x_1, x_2, \dots, x_n). This yields the posterior measure \Pi(\cdot \mid \mathcal{D}), obtained by adjusting \Pi in light of \mathcal{D} through probabilistic conditioning.

Example (coin tossing)

We suspect a coin of being biased. Tossing it, there is an unknown probability p_0 that the coin falls on heads. A prior probability density function, quantifying our uncertainty about what p_0 may be, could look like this:

fig1

It is most likely that the coin is approximately unbiased (p_0 \approx 1/2), but not impossible that it is strongly biased.

Now, we flip the coin 20 times and observe only 5 heads. The density of the posterior distribution, updating the prior in light of this information, is the following.

fig2

It is most likely that the coin is biased and that p_0 \approx 1/4.
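To make this concrete, here is a minimal numerical sketch of the update, assuming a conjugate Beta(3, 3) prior purely for illustration (the exact prior behind the figures is not specified in the post):

from scipy.stats import beta

a, b = 3, 3              # hypothetical prior pseudo-counts, favouring p_0 near 1/2
heads, tosses = 5, 20    # the observed data

posterior = beta(a + heads, b + tosses - heads)    # conjugate Beta-Binomial update

print(posterior.mean())                            # about 0.31
print((a + heads - 1) / (a + b + tosses - 2))      # posterior mode, about 0.29
print(posterior.interval(0.90))                    # a 90% credible interval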

Certainly the results obtained depend on the choice of prior distribution, but similar priors yield similar posteriors:

 

2. Bayesian nonparametrics

It is tempting to apply the same procedure to more complex problems.

Reconsider, for instance, the cryptographer example. Each word \omega has some probability P_0(\omega) of appearing. Since there is an infinite number of words, there is an infinite number of parameters for which we need to quantify uncertainty. A reasonable question is then:

Is it feasible to faithfully quantify uncertain knowledge on an infinite dimensional space?

The answer is no. At least, not always, and not necessarily in a way that makes Bayesian learning meaningful. In 1986, Diaconis and Freedman showed the following in the context of Bayesian nonparametrics:

“Some statisticians, as more and more data is gathered, will become more and more convinced of the wrong answer.”

Bayesian learning does not always work in infinite dimensions. To approach this problem, we need to identify characteristics of prior distributions that

  1. describe how uncertainty is spread out; and
  2. ensure that Bayesian learning works correctly.

2.1 A positive result

An important positive result is based on the following two conditions, or characteristics, of prior distributions. We denote by \mathbb{F} the set of all probability densities on some common space, assuming that P_0 \in \mathbb{F} (the data-generating distribution has a density in \mathbb{F}).

  1. \Pi puts positive mass on the relative entropy (Kullback-Leibler) neighborhoods of P_0: for every \varepsilon > 0,

\Pi\left(\left\{P \in \mathbb{F}| \int \log \frac{dP_0}{dP} dP_0 < \varepsilon \right\}\right) > 0.

This means that a priori, we’re not excluding the truth from the set of possibilities. Since P_0 is unknown, we require that this condition be satisfied whatever P_0 may be.

  2. \Pi has finite entropy: for all \delta > 0, there exist 0 < \alpha < 1 and a covering \{A_i\} of \mathbb{F} by sets of L^1 diameter less than \delta such that

\sum_{i} \Pi(A_i)^\alpha < \infty.

This means that \Pi is not too complex and that we can make sense of it through discretization.

Under these two conditions, Bayesian learning is asymptotically correct: the posterior distribution concentrates around the truth.

Theorem (Walker, 2004)
If conditions (1) and (2) are satisfied, then for any L^1 neighborhood N of P_0, we have

\Pi(N \,|\, x_1, x_2, \dots, x_n) \rightarrow 1

almost surely as n \rightarrow \infty, where the x_i are drawn independently from P_0.

This is helpful, but has not yet solved our problem.

  • How do we, generally, construct priors satisfying the two conditions?
  • How can we use these priors to solve practical problems?

This is where our research (as well as a lot of other research in Bayesian nonparametrics) comes in.

3. Some of our research

There are two contributions I want to present:

  • We developed a relationship between some statistical models and approximation theory that can be used to easily construct priors satisfying the two conditions.
  • We used it to solve problems raised in the literature.

Let’s see how this works in the context of directional statistics. Say we have some data distributed on the circle or, more generally, on a compact metric space.

 

We want to learn about the data-generating mechanism, e.g. do

  • density estimation,
  • hypothesis tests,
  • classification,

or any other statistical procedure. First, we specify a prior on the set \mathbb{F} of all bounded densities on the circle.

3.1 Prior specification

We begin with a density basis of the trigonometric polynomials, introduced by Róth et al. in 2009:

C_{j,n}(u) \propto \left(1 + \cos\left( u - \frac{2\pi j}{2n+1}\right)\right)^n, \quad j = 0,1,\dots, 2n.

basis_2

We studied the statistical properties of this basis and used it to construct the approximation operators

T_n : \mathbb{F} \ni f \mapsto \sum_j  C_{j,n} \int_{R_{j,n}}f(x) dx,

where R_{j,n} = \left[\frac{\pi(2j-1)}{2n+1}, \frac{\pi(2j+1)}{2n+1}\right). It can be shown that these operators are variation-diminishing and possess shape-preserving properties. More importantly, they give the decomposition

\overline{\cup_{n\in \mathbb{N}} T_n(\mathbb{F})} = \mathbb{F}.
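As an aside, here is an illustrative numerical sketch (not code from the paper) of the operator T_n, with the basis functions normalized to be densities; the grid size and test density below are arbitrary choices used only to check that T_n f approaches a smooth density f as n grows.

import numpy as np

grid = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)
du = grid[1] - grid[0]                                   # uniform spacing, for Riemann sums

def basis(j, n):
    # density-normalized C_{j,n}(u), proportional to (1 + cos(u - 2*pi*j/(2n+1)))^n
    raw = (1 + np.cos(grid - 2 * np.pi * j / (2 * n + 1))) ** n
    return raw / (raw.sum() * du)                        # normalized to integrate to 1

def T(f, n):
    # T_n f = sum_j C_{j,n} * (integral of f over R_{j,n})
    out = np.zeros_like(grid)
    for j in range(2 * n + 1):
        a = np.pi * (2 * j - 1) / (2 * n + 1)            # R_{j,n} = [a, b) on the circle
        b = np.pi * (2 * j + 1) / (2 * n + 1)
        in_cell = ((grid - a) % (2 * np.pi)) < (b - a)   # handles the cell that wraps around 0
        weight = f[in_cell].sum() * du                   # integral of f over R_{j,n}
        out += weight * basis(j, n)
    return out

f = (1 + 0.8 * np.sin(grid)) / (2 * np.pi)               # a smooth density on the circle
for n in (5, 20, 80):
    print(n, np.max(np.abs(T(f, n) - f)))                # sup-norm error decreases with n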

By specifying priors \Pi_n on T_n(\mathbb{F}) and mixing them together, we obtain the prior

\Pi = \sum_{n}\rho(n) \Pi_n :\, A \mapsto \sum_n \rho(n) \Pi_n\left(A \cap T_n(\mathbb{F})\right)

on \mathbb{F}.

This reduces the problem of specifying a prior on an infinite dimensional space to that of specifying an infinite number of priors on finite dimensional spaces. This turns out to be easier, and from properties of T_n we obtain asymptotic correctness of Bayesian learning.

Theorem.
Let T_n : L^1 \rightarrow L^1 be any linear function mapping \mathbb{F} to itself, with \dim T_n(\mathbb{F}) increasing. If \|T_n f - f\|_\infty \rightarrow 0 for all continuous f, then taking \rho such that 0 < \rho(n) < e^{-\dim T_n(\mathbb{F})} and priors \Pi_n with full support ensures that Bayesian learning based on \Pi is asymptotically correct.
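For instance, here is a hypothetical sketch of drawing a single random density from such a prior, taking \rho(n) \propto e^{-(2n+1)} as in the theorem and, as one concrete choice of \Pi_n, a flat Dirichlet distribution on the basis weights.

import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 2 * np.pi, 2000, endpoint=False)
du = grid[1] - grid[0]

def sample_prior_density(n_max=50):
    # rho(n) proportional to exp(-(2n+1)) = exp(-dim T_n(F)), truncated at n_max for simulation
    ns = np.arange(1, n_max + 1)
    rho = np.exp(-(2.0 * ns + 1.0))
    n = rng.choice(ns, p=rho / rho.sum())

    # Pi_n: a symmetric Dirichlet prior on the 2n+1 mixture weights
    w = rng.dirichlet(np.ones(2 * n + 1))

    # the sampled density is a mixture of the density-normalized basis functions C_{j,n}
    dens = np.zeros_like(grid)
    for j in range(2 * n + 1):
        raw = (1 + np.cos(grid - 2 * np.pi * j / (2 * n + 1))) ** n
        dens += w[j] * raw / (raw.sum() * du)
    return n, dens

n, dens = sample_prior_density()
print(n, dens.sum() * du)    # the sampled degree, and an integral close to 1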

What is interesting here:

  • The approach scales easily to (almost) any set of bounded densities on a compact metric space.
  • The approximation theory literature provides a wealth of well-studied approximation operators that can be used to construct such priors.
  • Properties of the operators translate into properties of the priors. If, for instance, it is known that the true density of P_0 is unimodal, then using a unimodality-preserving operator yields a prior on the space of unimodal densities.

3.2 Simple application

We use a prior of the type considered above to estimate the density of P_0, also quantifying uncertainty about our estimate. Below, the data-generating distribution P_0 is drawn in orange. A hundred samples are represented in the grey histogram. The blue line is an estimate of the unknown orange function, and the blue shaded region quantifies uncertainty (50% and 90% credible regions).

n_100.png

4. Take-home message

There are two things I would like you to remember.

1. The calculus of probability provides an operational definition of learning. That is what Bayesian statistics is about.

2. As you surely already know, different fields of mathematics enrich each other through their interactions. Here, it is approximation theory that provides the tools ensuring that Bayesian learning works correctly.

Thank you for your attention!


Constrained semiparametric modelling (for directional statistics)

main

Explanations

Angular data arises in many scientific fields, such as in experimental biology for the study of animal orientation, and in bioinformatics in relation to the protein structure prediction problem.

 

 

The statistical analysis of this data requires adapted tools such as 2\pi-periodic density models. Fernandez-Duran (Biometrics, 60(2), 2004) proposed non-negative trigonometric sums (i.e. non-negative trigonometric polynomials) as a flexible family of circular distributions. However, the coefficients of trigonometric polynomials expressed in the standard basis 1, \cos(x), \sin(x), \dots are difficult to interpret, and we do not see how an informative prior could be specified through this parametrization. Moreover, the use of this basis was criticized by Ferreira et al. (Bayesian Analysis, 3(2), 2008) as resulting in a “wiggly approximation, unlikely to be useful in most real applications”.

Trigonometric density basis

Here, we suggest the use of a density basis of the trigonometric polynomials and argue it is well suited to statistical applications. In particular, coefficients of trigonometric densities expressed in this basis possess an intuitive geometric interpretation. Furthermore, we show how “wiggliness” can be precisely controlled using this basis and how another geometric constraint, periodic unimodality, can be enforced [first proposition on the poster]. To ensure that nothing is lost by using this basis, we also show that the whole model consists of precisely all positive trigonometric densities, together with the basis functions [first theorem on the poster].

Prior specification

Priors can be specified on the coefficients of mixtures in our basis and on the degree of the trigonometric polynomials to be used. Through the interpretability of the coefficients and the shape-preserving properties of the basis, different types of prior knowledge may be incorporated. Together with an approximate understanding of mass allocation, these include:

  • periodic unimodality;
  • bounds on total variation; and
  • knowledge of the marginal distributions (in the multivariate case).

The priors obtained this way are part of a well-studied family called sieve priors, which includes the well-known Bernstein-Dirichlet prior, and are finite mixtures with an unknown number of components. Most results and interpretations concerning the Bernstein-Dirichlet prior (see Petrone & Wasserman (J. R. Stat. Soc. B, 64(1), 2002), Kruijer & van der Vaart (J. Stat. Plan. Inference, 138(7), 2008), McVinish et al. (Scand. J. Statist., 36(2), 2009)) carry over to the priors we consider, but we do not discuss them further.

Approximation-theoretic framework

Our density models arise as the image of “shape-preserving” linear approximation operators. This approximation-theoretic relationship is used to obtain a notably large prior Kullback-Leibler support and ensures strong posterior consistency at every bounded (not necessarily continuous) density. The result partly relies on known properties of sieve priors, as well as on general consistency results (Walker (Ann. Statist., 32(5), 2004)), but extends known results by removing the usual continuity hypothesis on the densities at which consistency is achieved (see Wu & Ghosal (Electron. J. Stat., 2, 2008), Petrone & Veronese (Statistica Sinica, 20, 2010)). For contraction rates, higher order smoothness conditions are usually required (see Shen & Ghosal (Scand. J. Statist., 42(4), 2015)).

For example, consider the prior induced by the random density

T_n \mathcal{D} := \sum_i \mathcal{D}(R_{i,n}) C_{i,n},\qquad (1)

where \mathcal{D} is a Dirichlet process, n is distributed on \mathbb{N}, and \{R_{i,n}\}_i is a partition of the circle. It has strong posterior consistency at every bounded density provided that the associated operator

T_n : f \mapsto \sum_i C_{i,n} \int_{R_{i,n}} f

is such that \|T_n f - f\|_\infty \rightarrow 0 for all continuous f.

More generally, let \mathbb{F} be a set of bounded densities on some compact metric space \mathbb{M}, let T_n : L^1(\mathbb{M}) \rightarrow L^1(\mathbb{M}), n \in \mathbb{N}, be a sequence of operators that are:

  • shape preserving: T_n maps densities to densities and T_n(\mathbb{F}) \subset \mathbb{F}; and
  • approximating: \|T_n f - f\|_\infty \rightarrow 0 for all continuous f;

and finally let \Pi_n be priors on T_n(\mathbb{F}) with full support. A sieve prior on \mathbb{F} is defined by

\Pi : A \mapsto \sum_n \rho(n) \Pi_n(A \cap T_n(\mathbb{F})).

Theorem.
If 0 < \rho(n) < Ce^{-c d_n} for some increasing sequence d_n bounding the dimensions of T_n (\mathbb{F}), then the posterior distribution of \Pi is strongly consistent at each density of \mathbb{F}.

The approximation theory literature is rich in such operators. The theorem shows that they provide strongly consistent priors on arbitrary density spaces simply given priors \Pi_n on T_n(\mathbb{F}).

Basic density estimation:

res_5_15_2

A thousand samples (grey histogram) were drawn from the density in orange. The prior is defined by (1) with the Dirichlet process centered on the uniform density and with a precision parameter of 2. The degree n is distributed as a \text{Poiss}(15). The blue line is the posterior mean, the dark blue shaded region is a 50% pointwise credible region around the median, and the light blue shaded region is a 90% credible region.
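For reference, a draw from this prior (not the posterior) can be simulated directly, using the fact that a Dirichlet process assigns Dirichlet-distributed masses to the cells of a partition; the sketch below uses the hyperparameters just described and is an illustration only.

import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 2 * np.pi, 2000, endpoint=False)
du = grid[1] - grid[0]

n = rng.poisson(15)                                   # prior on the degree
alpha = 2.0                                           # precision of the Dirichlet process
# masses of the uniform-centered Dirichlet process over the partition {R_{i,n}}
masses = rng.dirichlet(np.full(2 * n + 1, alpha / (2 * n + 1)))

density = np.zeros_like(grid)
for i in range(2 * n + 1):
    raw = (1 + np.cos(grid - 2 * np.pi * i / (2 * n + 1))) ** n
    density += masses[i] * raw / (raw.sum() * du)     # density-normalized C_{i,n}
print(n, density.sum() * du)                          # sampled degree; integral close to 1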

Approximation

Presentation (20 minutes) at the 5th-floor seminar.

I present the Weierstrass approximation theorem for periodic functions, using a basis of the trigonometric polynomials recently suggested by Róth et al. (2009). This basis lends itself naturally to our application.

Weierstrass approximation theorem.
Let f : \mathbb{R} \rightarrow \mathbb{R} be a 2\pi-periodic function. If f is continuous, then we can construct trigonometric polynomials f_1, f_2, f_3, \dots such that

f(x) = \sum_{i=1}^{\infty} f_i(x)

and such that the convergence of the above series is uniform.

This theorem appears in several areas: in topology to prove Brouwer’s fixed point theorem, in geometry for the isoperimetric inequality, and in algebraic geometry for the Nash-Tognoli theorem. It implies that \{1, \cos(x), \sin(x), \cos(2x), \sin(2x), \dots\}, as an orthonormal system, is complete in L^2(\mathbb{S}^1). More generally, it is used to reduce a problem about continuous functions to a problem about polynomials, where differential calculus and linear algebra apply. Constructive proofs of the theorem furthermore provide tools for regression and for the reconstruction of curves and surfaces.

Basic notions

A trigonometric polynomial (of degree m) is a function of x of the form

a_0 + \sum_{n=1}^m \left\{a_n \cos(nx) + b_n \sin(nx)\right\}.

Note that sums and products of trigonometric polynomials are again trigonometric polynomials. Somewhat less obviously, they form a Chebyshev system: for every set \{x_i\}_{i=1}^{2m+1} of distinct points and every \{y_i\}_{i=1}^{2m+1}\subset \mathbb{R}, there exists a unique trigonometric polynomial P_m of degree m such that

P_m(x_i)=y_i, \quad \forall i \in \{1,2,\dots, 2m+1\}.
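As a quick illustrative check of this property (not part of the talk), one can solve the interpolation system numerically in the standard basis; all numbers below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
m = 3
x = np.sort(rng.uniform(0.0, 2 * np.pi, 2 * m + 1))   # 2m+1 distinct points on the circle
y = rng.normal(size=2 * m + 1)                        # arbitrary target values

# design matrix in the basis 1, cos(x), sin(x), ..., cos(mx), sin(mx)
columns = [np.ones_like(x)]
for k in range(1, m + 1):
    columns += [np.cos(k * x), np.sin(k * x)]
A = np.column_stack(columns)

coef = np.linalg.solve(A, y)       # unique solution: A is invertible for distinct nodes
print(np.allclose(A @ coef, y))    # True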

Periodic continuous functions on \mathbb{R} are identified with continuous functions on the circle

\mathbb{S}^1 = \mathbb{R}/2\pi\mathbb{Z}

equipped with the arc-length distance

d(u,v) = \min_{k \in \mathbb{Z}} |u-v+2\pi k|.

This follows from a general principle: continuous functions on \mathbb{R}^k that are invariant under the action of a group G whose orbits have no accumulation points are identified with continuous functions on the quotient \mathbb{R}^k / G equipped with the distance d([u],[v]) = \min_{g, h\in G} \|g(u) - h(v)\|. (Here, [u] is the orbit of u under the action of G.) These quotients yield surfaces such as the torus, the Möbius strip and the Klein bottle.

Proof of the theorem

Consider the positive function

C_{0,n}(u) = c_n\left(1+\cos(u)\right)^n,\quad c_n = \frac{2\pi}{2n+1}\left(\int_{0}^{2\pi} (1+\cos u)^n du\right)^{-1}

and its translates

C_{j,n}(u) = C_{0,n}\left(u - \tfrac{2\pi j}{2n+1} \right),\quad j=0,1, \dots, 2n

which are arranged regularly around the circle.

The C_{j,n} form a partition of unity.

Note that for every i \in \{0,1, \dots, 2n\}, we have

\sum_{j=0}^{2n} C_{j,n}\left(\tfrac{2\pi i}{2n+1}\right) = \sum_{j=0}^{2n} C_{j,n}(0)

by the cyclic symmetry of the C_{j,n}. Thus, the function \sum_{j=0}^{2n}C_{j,n} is constant at 2n+1 points. Since it is a trigonometric polynomial of degree n, and since these form a Chebyshev system, it must be constant. Now,

\int_{0}^{2\pi} \sum_{j=0}^{2n} C_{j,n}(u) du = \sum_{j=0}^{2n}  \frac{2\pi\int_0^{2\pi} \left(1+\cos\left(u - \tfrac{2\pi j}{2n+1}\right)\right)^n du}{(2n+1)\int_0^{2\pi}\left(1+\cos u\right)^n du} = 2\pi,

hence

\sum_{j=0}^{2n} C_{j,n} \equiv 1.
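A quick numerical sanity check of this identity, with the normalization above (an illustration only; the degree and grid sizes are arbitrary):

import numpy as np

n = 6
u = np.linspace(0.0, 2 * np.pi, 1000)
fine = np.linspace(0.0, 2 * np.pi, 20000, endpoint=False)
integral = ((1 + np.cos(fine)) ** n).sum() * (fine[1] - fine[0])   # integral of (1 + cos u)^n over [0, 2*pi)
c_n = (2 * np.pi / (2 * n + 1)) / integral

total = sum(c_n * (1 + np.cos(u - 2 * np.pi * j / (2 * n + 1))) ** n
            for j in range(2 * n + 1))
print(np.max(np.abs(total - 1)))   # essentially zero, up to quadrature error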

Construction of the approximants

Set

T_n(u) = \sum_{j=0}^{2n} f\left(\tfrac{2\pi j}{2n+1}\right) C_{j,n}(u), \quad T_0 = 0,

and let us show that T_n converges uniformly to f as n \rightarrow \infty. Taking f_n = T_n - T_{n-1}, this will prove the theorem.

Using the fact that \sum_{j=0}^{2n} C_{j,n} \equiv 1, we compute

|T_n (u) - f(u)| \le \sum_{j=0}^{2n} \left| f\left(\tfrac{2\pi j}{2n+1}\right) - f(u) \right| C_{j,n}(u) = (*).

Now fix \varepsilon > 0 and choose \delta > 0 such that d(u,v) < \delta \Rightarrow |f(u) - f(v)| < \varepsilon. With A = \{j \,|\, d(u, \tfrac{2\pi j}{2n+1}) < \delta\}, the sum above satisfies

(*) \le \varepsilon \sum_{j \in A} C_{j,n}(u) + 2 \sup_x |f(x)| \sum_{j \in A^c} C_{j,n}(u) \le \varepsilon + 2 \|f\|_\infty \sum_{j \in A^c} C_{j,n}(u).

Now,

\sum_{j \in A^c} C_{j,n}(u) \le (2n+1) C_{0,n}(\delta) \rightarrow 0,

so for n sufficiently large we obtain

|T_n (u) - f(u)| < 2\varepsilon.

Since \delta and n do not depend on u, and since \varepsilon > 0 was arbitrary, T_n converges uniformly to f. QED.

References

[1] Róth, Á. et al. (2009). A cyclic basis for closed curve and surface modelling. Computer Aided Geometric Design, 26, 528-546.

[2] Bernstein, S. (1912). Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités.