# Fractional Posteriors and Hausdorff alpha-entropy

Bhattacharya, Pati & Yan (2016) wrote an interesting paper on Bayesian fractional posteriors. These are based on fractional likelihoods – likelihoods raised to a fractional power – and provide robustness to misspecification. One of their results shows that fractional posterior contraction can be obtained as only a function of prior mass attributed to neighborhoods, in a sort of Kullback-Leibler sense, of the parameter corresponding to the true data generating distribution (or the one closest to it in the Kullback-Leibler sense). With regular posteriors, on the other hand, a complexity constraint on the prior distribution is usually also required in order to show posterior contraction.

Their result made me think of the approach of Xing & Ranneby (2008) to posterior consistency. Therein, a prior complexity constraint specified through the so-called Hausdorff $\alpha$-entropy is used to allow bounding the regular posterior distribution by something that is similar to a fractional posterior distribution. As it turns out, the proof of Theorem 3.2 of of Battacharya & al. (2016) can almost directly be adapted to regular posteriors in certain cases, using the Hausdorff $\alpha$-entropy to bridge the gap. Let me explain this in some more detail.

Le me consider well-specified discrete priors for simplicity. More generally, the discretization trick could possibly yield similar results for non-discrete priors.

I will follow as closely as possible the notations of Battacharya & al. (2016). Let $\{p_{\theta}^{(n)} \mid \theta \in \Theta\}$ be a dominated statistical model, where $\Theta = \{\theta_1, \theta_2, \theta_3, \dots\}$ is discrete. Assume $X^{(n)} \sim p_{\theta_0}^{(n)}$ for some $\theta_0 \in \Theta$, let

$B_n(\varepsilon, \theta_0) = \left\{ \int p_{\theta_0}^{(n)}\log\frac{p_{\theta_0}^{(n)}}{p_{\theta}^{(n)}} < n\varepsilon^2,\, \int p_{\theta_0}^{(n)}\log^2\frac{p_{\theta_0}^{(n)}}{p_{\theta}^{(n)}} < n\varepsilon^2 \right\}$

and define the Renyi divergence of order $\alpha$ as

$D^{(n)}_{\alpha}(\theta, \theta_0) = \frac{1}{\alpha-1}\log\int\{p_{\theta}^{(n)}\}^\alpha \{p_{\theta_0}^{(n)}\}^{1-\alpha}.$

We let $\Pi_n$ be a prior on $\Theta$ and its fractional posterior distribution of order $\alpha$ is defined as

$\Pi_{n, \alpha}(A \mid X^{(n)}) \propto \int_{A}p_{\theta}^{(n)}\left(X^{(n)}\right)^\alpha\Pi_n(d\theta)$

In this well-specified case, one of their result is the following:

Theorem 3.2 of Bhattacharya & al. (particular case)
Fix $\alpha \in (0,1)$ and assume that $\varepsilon_n$ satisfies $n\varepsilon_n^2 \geq 2$ and

$\Pi_n(B_n(\varepsilon_n, \theta_0)) \geq e^{-n\varepsilon_n^2}.$

Then, for any $D \geq 2$ and $t > 0$,

$\Pi_{n,\alpha}\left( \frac{1}{n}D_\alpha^{(n)}(\theta, \theta_0) \geq \frac{D+3t}{1-\alpha} \varepsilon_n^2 \mid X^{(n)} \right) \leq e^{-tn\varepsilon_n^2}.$

holds with probability at least $1-2/\{(D-1+t)^2n\varepsilon_n^2\}$.

## What about regular posteriors?

Let us define the $\alpha$-entropy of the prior $\Pi_n$ as

$H_\alpha(\Pi_n) = \sum_{\theta \in \Theta} \Pi_n(\theta)^\alpha.$

An adaptation of the proof of the previous Theorem, in our case where $\Pi_n$ is discrete, yields the following.

Proposition (Regular posteriors)
Fix $\alpha \in (0,1)$ and assume that $\varepsilon_n$ satisfies $n\varepsilon_n^2 \geq 2$ and

$\Pi_n(B_n(\varepsilon_n, \theta_0)) \geq e^{-n\varepsilon_n^2}.$

Then, for any $D \geq 2$ and $t > 0$,

$\Pi_{n}\left( \frac{1}{n}D_\alpha^{(n)}(\theta, \theta_0) \geq \frac{D+3t}{1-\alpha} \varepsilon_n^2 \mid X^{(n)} \right)^\alpha \leq H_\alpha(\Pi_n) e^{-tn\varepsilon_n^2}.$

holds with probability at least $1-2/\{(D-1+t)^2n\varepsilon_n^2\}$.

Note that $H_\alpha(\Pi_n)$ may be infinite, in which case the upper bound on the tails of $\frac{1}{n}D_\alpha^{(n)}$ is trivial. When the prior is not discrete, my guess is that the complexity term $H_\alpha(\Pi_n)$ should be replaced by a discretization entropy ${}H_\alpha(\Pi_n; \varepsilon_n)$ which is the $\alpha$-entropy of a discretized version of $\Pi_n$ whose resolution (in the Hellinger sense) is some function of $\varepsilon_n$.

# Topology and Machine Learning

Won best poster award at the CSSC.

# Bayesian numerical analysis

Distributing points $\{x_i\}_{i=1}^n$ on the sphere as to minimize the mean square error

$\mathbb{E}\left[\left(q_n(f) - \int_{\mathbb{S}^2}f(s)\,ds\right)^2\right]$

of the quadrature formula $q_n(f) =\frac{1}{n}\sum_{i=1}^n f(x_i)$, where $f$ is a centered Gaussian process with covariance function $C(x,y) = \exp(\langle x, y \rangle)$. Shown is $n=6, 12, 23$.

# Bayesian binary classification using partitions of unity

Sunday afternoon project. I think it is possible to get topological garanties for the reconstruction of the classification boundary using the Nash-Tognoli theorem.

# 1. The problem

Points $x_i$, $i=1\dots, N$ are distributed on some space $\mathbb{M}$ and associated a label $\ell_i\in \{0,1\}$. It is assumed that

$\ell_i \sim \text{Ber}(p(x_i)), \qquad (1)$

where $p: \mathbb{M} \rightarrow [0,1]$ is some unknown integrable function. Estimating $p$ from the data $\{(x_i,\ell_i)\,|\, i=1,2,\dots, N\}$ allows us to predict $\ell_{n+1}$ given $x_{n+1}$.Read More »

# Bayesian learning

Friday july 28 at 17:00
Rutherford Physics Building, Room 118, McGill

Next week, I’ll be talking about Bayesian learning at the Mathematical congress of the americas and at the Canadian undergraduate mathematics conference. These are somewhat challenging talks: I need to sell the idea of Bayesian statistics to a general mathematical audience (which knows nothing about it), interest them in some though problems of Bayesian nonparametrics, and then present some of our research results. This must be done in under 20 minutes.

To make the presentation more intuitive and accessible, I borrowed some language from machine learning. I’m talking about learning rather than inference, uncertain knowledge rather than subjective belief, and “asymptotic correctness” rather than consistency. These are essentially synonymous, although some authors might use them in different ways. This should not cause problems for this introductory talk.Read More »

# The discretization trick

I explain the discretization trick that I mentioned in my previous post (Posterior consistency under possible misspecification). I think it was introduced by Walker (New approaches to Bayesian consistency (2004)).

Let $\mathbb{F}$ be a set of densities and let $\Pi$ be a prior on $\mathbb{F}$. If $x_1, x_2, x_3, \dots$ follows some distribution $P_0$ having density $f_0$, then the posterior distribution of $\Pi$ can be written as

$\Pi(A | x_1, \dots, x_n) \propto \int_A \prod_{i=1}^n f(x_i) \Pi(df).$

The discretization trick is to find densities $f_{1}, f_2, f_3, \dots$ in the convex hull of $A$ (taken in the space of all densities) such that

$\int_A \prod_{i=1}^n f(x_i) = \prod_{i=1}^n f_i(x_i) \Pi(A).$Read More »

# Posterior consistency under (possible) misspecification

We assume, without too much loss of generality, that our priors are discrete. When dealing with Hellinger separable density spaces, it is possible to discretize posterior distributions to study consistency (see this post about it).

Let $\Pi$ be a prior on a countable space $\mathcal{N} = \{f_1, f_2, f_3, \dots\}$ of probability density functions, with $\Pi(f) > 0$ for all $f \in \mathcal{N}$. Data $X_1, X_2, X_3, \dots$ follows (independently) some unknown distribution $P_0$ with density $f_0$.

We denote by $D_{KL}(f_0, f) = \int f_0 \log\frac{f_0}{f}$ the Kullback-Leibler divergence and we let $D_{\frac{1}{2}}(f_0, f) = 1 - \int \sqrt{f_0 f}$ be half of the squared Hellinger distance.

The following theorem states that the posterior distribution of $\Pi$ accumulates in Hellinger neighborhoods of $f_0$, assuming the prior is root-summable (i.e. $\sum_{f \in \mathcal{N}} \Pi(f)^\alpha < \infty$ for some $\alpha > 0$) . In the well-specified case (i.e. $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f) = 0$), the posterior accumulates in any neighborhood of $f_0$. In the misspecified case, small neighborhoods of $f_0$ could be empty, but the posterior distribution still accumulates in sufficiently large neighborhoods (how large exactly is a function of $\alpha$ and $\inf_{f \in \mathcal{N}} D_{KL}(f_0, f)$).Read More »

# The choice of prior in bayesian nonparametrics – part 2

See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

# The structure of $\mathcal{M}$

Recall that $\mathbb{M}$ is is a Polish space (ie. a complete and separable metric space). It is endowed with its borel $\sigma$-algebra $\mathfrak{B}$ which is the smallest family of subsets of $\mathbb{M}$ that contains its topology and that is closed under countable unions and intersections. All subsets of $\mathbb{M}$ we consider in the following are supposed to be part of $\mathfrak{B}$. A probability measure on $\mathbb{M}$ is a function $\mu : \mathfrak{B} \rightarrow [0,1]$ such that for any countable partition $A_1, A_2, A_3, \dots$ of $\mathbb{M}$ we have that $\sum_{i=1}^\infty \mu(A_i) = 1.$ The set $\mathcal{M}$ consists of all such probability measures.

Note that since $\mathbb{M}$ is complete and separable, every probability measure $\mu \in \mathcal{M}$ is regular (and tight). It means that the measure of any $A\subset \mathbb{M}$ can be well approximated from the measure of compact subsets of $A$ as well as from the measure of open super-sets of $A$:

$\mu(A) = \sup \left\{\mu(K) \,|\, K \subset A \text{ is compact}\right\}\\ = \inf \left\{\mu(U) \,|\, U \supset A \text{ is open}\right\}.$

## Metrics on $\mathcal{M}$

Let me review some facts. A natural metric used to compare the mass allocation of two measures $\mu, \nu \in \mathbb{M}$ is the total variation distance defined by

$\|\mu - \nu\|_{TV} = \sup_{A \subset \mathbb{M}}|\mu(A) - \nu(A)|.$Read More »

# Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

## The problem.

Let $f, g, h$ be three densities and suppose that, $x_i \sim h$, $i \in \mathbb{N}$, independently. What happens to the likelihood ratio

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}$

as $n\rightarrow \infty$?

Clearly, it depends. If $h = g \not = f$, then

$\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$

almost surely at an exponential rate. More generally, if $h$ is closer to $g$ than to $f$, in some sense, we’d expect that $\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0$. Such a measure of “closeness” of “divergence” between probability distributions is given by the Kullback-Leibler divergence

$D_{KL}(f, g) = \int f \log\frac{f}{g}.$

It can be verified that $D_{KL}(f,g) \geq 0$ with equality if and only if $f=g$, and that

$D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)$

almost surely at an exponential rate. Thus the K.L.-divergence can be used to solve our problem.

## Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in $(1)$ (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.Read More »

# The choice of prior in bayesian nonparametrics – Introduction

In preparation for the 11th Bayesian nonparametrics conference, I’m writing (and rewriting) notes on the background of our research (i.e. some of the general theory of bayesian nonparametrics). There are some good books on the subject (such as Bayesian Nonparametrics (Ghosh and Ramamoorthi, 2003)), but I wanted a more introductory focus and to present Choi and Ramamoorthi’s very clear point of view on posterior consistency (Remarks on the consistency of posterior distributions, 2008).

# 1. Introduction

Let $\mathbb{X}$ be a complete and separable metric space and let $\mathcal{M}$ be the space of all probability measures on $\mathbb{X}$. Some unknown distribution $P_0\in \mathcal{M}$ is generating observable data $\mathcal{D}_n = (X_1, X_2, \dots, X_n) \in \mathbb{X}^n$, where each $X_i$ is independently drawn from $P_0$. The problem is to learn about $P_0$ using only $\mathcal{D}_n$ and prior knowledge.

Example (Discovery probabilities).
A cryptographer observes words, following some distribution $P_0$, in an unknown countable language $\mathcal{L}$. What are the $P_0$-probabilities of the words observed thus far? What is the probability that the next word to be observed has never been observed before?

## 1.1 Learning and uncertainty

We need an employable definition of learning. As a first approximation, we can consider learning to be the reduction of uncertainty about what is $P_0$. This requires a quantification of how uncertain we are to begin with. Then, hopefully, as data is gathered out uncertainty decreases and we are able to pinpoint $P_0$.

This is the core of Bayesian learning, alghough our definition is not yet entirely satisfactory. There are some difficulties with this idea of quantifying uncertainty, at least when using information-theoric concepts. The solution we adopt here is the use of probabilities to quantify uncertain knowledge (bayesians would also talk of subjective probabilities quantifying rational belief). For example, you may know that a coin flip is likely to be fair, although it is not impossible the two sides of the coin are both the same. This is uncertain knowledge about the distribution of heads and tails in the coin flips, and you could assign probabilities to the different possibilities.

More formally, prior uncertain knowledge about what is $P_0$ is quantified by a probability measure $\Pi$ on $\mathcal{M}$. For any $A \subset \mathcal{M}$, $\Pi(A)$ is the the prior probability that “$P_0 \in A$“. Then, given data $\mathcal{D}_n$, prior probabilities are adjusted to posterior probabilities: $\Pi$ becomes $\Pi_n$, the conditional distribution of $\Pi$ given $\mathcal{D}_n$. The celebrated Bayes’ theorem provides a formula to calculate $\Pi_n$ from $\Pi$ and $\mathcal{D}_n$. Thus we have an operational definition of learning in our statistical framework.

Learning is rationally adjusting uncertain knowledge in the light of new information.

For explanations as to why probabilities are well suited to the representation of uncertain knowledge, I refer the reader to Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988). We will also see that the operation of updating the prior to posterior posterior probabilities does work as intended.

## 1.2 The choice of prior

Specifying prior probabilities, that is quantifying prior uncertain knowledge, is not a simple task. It is especially difficult when uncertainty is over the non-negligeable part $\mathcal{M}$ of an infinite dimensional vector space. Fortunately, “probability is not about numbers, it is about the structure of reasoning”, as Glenn Shafer puts it (cited in Pearl, 1988, p.15). The exact numbers given to the events “$P_0 \in A$” are not of foremost importance; what matters is how probabilities are more qualitatively put together, and how this relates to the learning process.

Properties of prior distributions, opening them to scrutiny, criticism and discussion, must be identified and related to what happens as more and more data is gathered.

Part 2.