# Blog has migrated!

I will continue to post on

It will be much easier for me to write math and code on this new platform + I get to style it any way I want.

# What this blog is about

My professional page is at olivierbinette.ca.

# Posterior Concentration in Terms of the Separation alpha-Entropy

This post continues the series on posterior concentration under misspecification. Here I introduce a unifying point of view on the subject through the separation $\alpha$-entropy. We use this notion of prior entropy to bridge the gap between Bayesian fractional posteriors and regular posterior distributions: in the case where this entropy is finite, direct analogues to some of the concentration results for fractional posteriors (Bhattacharya et al., 2019) are recovered.

This post is going to be quite abstract, just like last week. I’ll talk in a future post about how this separation $\alpha$-entropy generalizes the covering numbers for testing under misspecification of Kleijn et al. (2006) as well as the prior summability conditions of De Blasi et al. (2013).

Quick word of warning: this is not the definitive version of the results I’m working on, but I still had to get them out somewhere.

Another word of warning: WordPress has gotten significantly worse at dealing with math recently. I will find a new platform, but for now expect to find typos and some rendering issues.

## The framework

We continue in the same theoretical framework as before: $\mathbb{F}$ is a set of densities on a complete and separable metric space $\mathcal{X}$ with respect to a $\sigma$-finite measure $\mu$ defined on the Borel $\sigma$-algebra of $\mathcal{X}$, $H$ is the Hellinger distance defined by

$H(f, g) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^2 \, d\mu\right)^{1/2}$

and we make use of the Rényi divergences defined by

$d_\alpha(f, g) = -\alpha^{-1}\log A_\alpha(f, g),\quad A_\alpha(f, g) = \int f^{\alpha}g^{1-\alpha} \,d\mu .$

Here we assume that data is generated following a distribution $f_0 \in \mathbb{F}$ having a density in our model (this assumption could be weakened), and we therefore define the off-centered Rényi divergence

$d_\alpha^{f_0}(f, f^\star) = -\alpha^{-1}\log(A_\alpha^{f_0}(f, f^\star))$

where

$A_\alpha^{f_0}(f, f^\star) = \int (f/f^\star)^\alpha f_0\,d\mu$

assuming that all this is well defined.
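To make these definitions concrete, here is a minimal sketch on a finite sample space, where $\mu$ is the counting measure and densities are just lists of probabilities. The function names and the example numbers are my own illustrative choices, not part of the framework above.

```python
import math

def hellinger(f, g):
    """Hellinger distance H(f, g) for densities w.r.t. counting measure."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g)))

def affinity(f, g, alpha):
    """alpha-affinity A_alpha(f, g) = sum of f^alpha * g^(1 - alpha)."""
    return sum(a ** alpha * b ** (1 - alpha) for a, b in zip(f, g))

def renyi(f, g, alpha):
    """d_alpha(f, g) = -log(A_alpha(f, g)) / alpha."""
    return -math.log(affinity(f, g, alpha)) / alpha

def off_centered_renyi(f, f_star, f0, alpha):
    """d_alpha^{f0}(f, f_star) = -log(sum of (f / f_star)^alpha * f0) / alpha."""
    a = sum((p / q) ** alpha * r for p, q, r in zip(f, f_star, f0))
    return -math.log(a) / alpha

# Hypothetical densities on a three-point space (all entries positive).
f0 = [0.5, 0.3, 0.2]
f = [0.2, 0.3, 0.5]

print(hellinger(f, f0), renyi(f, f0, 0.5))
# With f_star = f0, the off-centered divergence reduces to d_alpha(f, f0).
print(off_centered_renyi(f, f0, f0, 0.5))
```

The sketch assumes strictly positive densities so that the ratio $f/f^\star$ is always defined; handling zeros would need the support conventions to be spelled out.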

### Prior and posterior distributions

Now let $\Pi$ be a prior on $\mathbb{F}$. Given either a single data point $X \sim f_0$ or a sequence of independent variables $X^{(n)} = \{X_i\}_{i=1}^n$ with common probability density function $f_0$, the posterior distribution of $\Pi$ given $X^{(n)}$ is the random quantity $\Pi(\cdot \mid X^{(n)})$ defined by

$\Pi\left(A\mid X^{(n)}\right) = \int_A \prod_{i=1}^n f(X_i) \Pi(df)\Big/ \int_{\mathbb{F}} \prod_{i=1}^n f(X_i) \Pi(df)$

and $\Pi(\cdot \mid X) = \Pi(\cdot \mid X^{(1)})$. This may not always be well-defined, but I don’t want to get into technicalities for now.
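As a toy illustration of the posterior formula, here is a sketch for a prior supported on finitely many densities over a finite sample space, so that the integrals become sums. The names `posterior`, `prior`, `densities` and the example values are assumptions made for this example only.

```python
import math

def posterior(prior, densities, data):
    """Posterior weights over a finite family of densities.

    prior[j] is the prior mass of the j-th density and densities[j][x]
    is the value of that density at the observation x.
    """
    log_w = [
        math.log(p) + sum(math.log(d[x]) for x in data)
        for p, d in zip(prior, densities)
    ]
    m = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Two hypothetical densities on {0, 1} and a uniform prior over them.
densities = [[0.5, 0.5], [0.2, 0.8]]
prior = [0.5, 0.5]
print(posterior(prior, densities, [1, 1, 0, 1]))
```

Working on the log scale avoids underflow when the product of likelihood terms gets small.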

## Separation $\alpha$-entropy

We state our concentration results in terms of the separation $\alpha$-entropy. It is inspired by the Hausdorff $\alpha$-entropy introduced in Xing et al. (2009), although the separation $\alpha$-entropy has no relationship with the Hausdorff measure and instead builds upon the concept of $\delta$-separation of Choi et al. (2008) defined below.

Given a set $A \subset \mathbb{F}$, we denote by $\langle A \rangle$ the convex hull of $A$: it is the set of all densities of the form $\int_A f \,\nu(df)$ where $\nu$ is a probability measure on $A$.

Definition ($\delta$-separation).
Let $f_0 \in \mathbb{F}$ be fixed as above. A set of densities $A \subset \mathbb{F}$ is said to be $\delta$-separated from $f^\star \in \mathbb{F}$ with respect to the divergence $d_\alpha^{f_0}$ if for every $f \in \langle A \rangle$,

$d_\alpha^{f_0}\left(f, f^\star\right) \geq \delta.$

A collection of sets $\{A_i\}_{i=1}^\infty$ is said to be $\delta$-separated from $f^\star$ if every $A \in \{A_i\}_{i=1}^\infty$ is $\delta$-separated from $f^\star$.

An important property of $\delta$-separation, first noted by Walker (2004) and used for the study of posterior consistency, is that it scales with product densities. The general result is stated in the following lemma.

Lemma (Separation of product densities).
Let $(\mathcal{X}_i, \mathcal{B}_{i}, \mu_i)$, $i \in\{ 1,2, \dots, n\}$, be a sequence of $\sigma$-finite measure spaces where each $\mathcal{X}_i$ is a complete and separable locally compact metric space and $\mathcal{B}_i$ is the corresponding Borel $\sigma$-algebra. Denote by $\mathbb{F}_i$ the set of probability density functions on $(\mathcal{X}_i, \mathcal{B}_{i}, \mu_i)$, fix $f_{0,i} \in \mathbb{F}_i$ and let $A_i \subset \mathbb{F}_i$ be $\delta_i$-separated from $f_{i}^\star \in \mathbb{F}_i$ with respect to $d_\alpha^{f_{0,i}}$ for some $\delta_i \geq 0$. Let $\prod_{i=1}^n A_i = \left\{\prod_{i=1}^n f_{i} \mid f_i \in A_i\right\}$ where $\prod_{i=1}^n f_i$ is the product density on $\prod_{i=1}^n \mathcal{X}_i$ defined by $(x_1, \dots, x_n) \mapsto \prod_{i=1}^n f_i(x_i)$. Then $\prod_{i=1}^nA_i$ is $\left(\sum_{i=1}^n\delta_i\right)$-separated from $\prod_{i=1}^n f_{i}^\star$ with respect to $d_\alpha^{f_0}$ where $f_0 = \prod_{i=1}^nf_{0,i}$.
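The easy half of this lemma can be checked numerically: for a single product density, the off-centered affinity factorizes, so the off-centered divergences add up. The sketch below verifies only that additivity on finite spaces. It does not touch the real content of the lemma, which is that the bound survives taking the convex hull of the product set. All names and numbers are illustrative assumptions.

```python
import math
from itertools import product

def off_centered_renyi(f, f_star, f0, alpha):
    """d_alpha^{f0}(f, f_star) on a finite space with counting measure."""
    a = sum((p / q) ** alpha * r for p, q, r in zip(f, f_star, f0))
    return -math.log(a) / alpha

def product_density(f1, f2):
    """Product density on the product space, flattened in row-major order."""
    return [a * b for a, b in product(f1, f2)]

alpha = 0.4
# Hypothetical coordinate-wise densities on two small spaces.
f1, f1_star, f01 = [0.2, 0.8], [0.5, 0.5], [0.6, 0.4]
f2, f2_star, f02 = [0.1, 0.3, 0.6], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]

d1 = off_centered_renyi(f1, f1_star, f01, alpha)
d2 = off_centered_renyi(f2, f2_star, f02, alpha)
d12 = off_centered_renyi(
    product_density(f1, f2),
    product_density(f1_star, f2_star),
    product_density(f01, f02),
    alpha,
)
print(d12, d1 + d2)  # the two values agree up to rounding
```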

We can now define the separation $\alpha$-entropy of a set $A\subset \mathbb{F}$ with parameter $\delta > 0$ as the minimal $\alpha$-entropy of a $\delta$-separated covering of $A$. When this entropy is finite, we can study the concentration properties of the posterior distribution using simple information-theoretic techniques similar to those used in Bhattacharya et al. (2019) for the study of Bayesian fractional posteriors.

Definition (Separation $\alpha$-entropy).
Fix $\delta > 0$, $\alpha \in (0,1)$ and let $A$ be a subset of $\mathbb{F}$. Recall $\Pi$, $f_0$ and $f^\star$ fixed as previously. The separation $\alpha$-entropy of $A$ is defined as

$\mathcal{S}_\alpha^\star(A, \delta) = \mathcal{S}_\alpha^\star(A, \delta; \Pi, f_0, f^\star) = \inf \,(1-\alpha)^{-1} \log \sum_{i=1}^\infty \left(\frac{\Pi(A_i)}{\Pi(A)}\right)^\alpha$

where the infimum is taken over all (measurable) families $\{A_i\}_{i=1}^\infty$, $A_i \subset \mathbb{F}$, satisfying $\Pi(A \backslash (\cup_{i}A_i)) = 0$ and which are $\delta$-separated from $f^\star$ with respect to the divergence $d_\alpha^{f_0}$. When no such covering exists we let $\mathcal{S}_\alpha^\star(A, \delta) = \infty$, and when $\Pi(A) = 0$ we define $\mathcal{S}_\alpha^\star(A, \delta) = 0$.
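To get a feel for the quantity inside the infimum, here is a small sketch that evaluates it for one given covering, described only by the prior masses $\Pi(A_i)$ of its pieces. It does not check $\delta$-separation and does not take the infimum, so for a valid covering it only gives an upper bound on the separation $\alpha$-entropy. The function name and inputs are hypothetical.

```python
import math

def covering_entropy(cover_masses, total_mass, alpha):
    """(1 - alpha)^{-1} * log of sum_i (Pi(A_i) / Pi(A))^alpha, for one covering."""
    s = sum((m / total_mass) ** alpha for m in cover_masses)
    return math.log(s) / (1 - alpha)

# A single covering set of full mass contributes zero entropy.
print(covering_entropy([1.0], 1.0, 0.5))
# Splitting into k pieces of equal mass gives log(k), here log(4) up to rounding.
print(covering_entropy([0.25] * 4, 1.0, 0.5))
```

The equal-mass case shows why this behaves like an entropy: the value is $\log k$ whatever $\alpha \in (0,1)$ is.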

Remark.
When $f_0 = f^\star$, so that $d_\alpha^{f_0}(f, f^\star) = d_\alpha(f, f_0)$, we drop the indicator $\star$ and denote $\mathcal{S}_\alpha(A, \delta) = \mathcal{S}_\alpha^\star(A, \delta)$, to emphasize the fact.

Proposition (Properties of the separation $\alpha$-entropy).
The separation $\alpha$-entropy $\mathcal{S}_\alpha^\star(A, \delta)$ of a set $A\subset \mathbb{F}$ is non-negative and $\mathcal{S}_\alpha^\star(A, \delta) = 0$ if $A$ is $\delta$-separated from $f^\star$ with respect to the divergence $d_\alpha^{f_0}$. Furthermore, if $0 < \alpha \leq \beta < 1$ and $0 < \delta \leq \delta'$, then

${}\mathcal{S}_\alpha^\star(A, \delta) \leq \mathcal{S}_\alpha^\star(A, \delta')$

and if also $f^\star = f_0$, then

${}\mathcal{S}_\beta(A, \tfrac{1-\beta}{\beta}\delta) \leq \mathcal{S}_\alpha(A, \tfrac{1-\alpha}{\alpha}\delta).$

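For a subset $A \subset B \subset \mathbb{F}$ with $\Pi(A) > 0$, we have

${}\Pi(A)^{\alpha}\left(\exp\mathcal{S}_\alpha^\star(A, \delta)\right)^{1-\alpha} \leq \Pi(B)^\alpha \left(\exp\mathcal{S}_\alpha^\star(B, \delta)\right)^{1-\alpha}$

and, more generally, if $A \subset \bigcup_{n=1}^\infty B_n$ for subsets $B_n \subset \mathbb{F}$, then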

${}\Pi(A)^{\alpha}\left(\exp\mathcal{S}_\alpha^\star(A, \delta)\right)^{1-\alpha} \leq \sum_{n=1}^\infty \Pi(B_n)^\alpha \left(\exp\mathcal{S}_\alpha^\star(B_n, \delta)\right)^{1-\alpha}.$

## Posterior consistency

Theorem (Posterior consistency).
Let $f_0, f^\star \in \mathbb{F}$ and let $\{X_i\}$ be a sequence of independent random variables with common probability density $f_0$. Suppose there exists $\delta > 0$ such that

$\Pi\left(\left\{f \in \mathbb{F} \mid D(f_0| f) < \delta\right\}\right) > 0.$

If $A \subset \mathbb{F}$ satisfies $\mathcal{S}_\alpha^\star\left(A, \delta\right) < \infty$ for some $\alpha \in (0,1)$, then $\Pi\left(A\mid \{X_i\}_{i=1}^n\right) \rightarrow 0$ almost surely as $n\rightarrow \infty$.

Remark.
The condition $\mathcal{S}_\alpha^\star(A, \delta) < \infty$ implies in particular that $A \subset \{f\in \mathbb{F} \mid d_\alpha^{f_0}(f, f^\star) \geq \delta\}$.

Corollary (Well-specified consistency).
Suppose that $f_0$ is in the Kullback-Leibler support of $\Pi$. If $A \subset \mathbb{F}$ satisfies $\mathcal{S}_\alpha(A, \delta) < \infty$ for some $\alpha \in (0,1)$ and for some $\delta > 0$, then $\Pi_n(A) \rightarrow 0$ almost surely as $n \rightarrow \infty$, where $\Pi_n = \Pi\left(\cdot \mid \{X_i\}_{i=1}^n\right)$ denotes the posterior distribution.

Corollary (Well-specified Hellinger consistency).
Suppose that $f_0$ is in the Kullback-Leibler support of $\Pi$ and fix $\varepsilon > 0$. If there exists a covering $\{A_i\}_{i=1}^\infty$ of $\mathbb{F}$ by Hellinger balls of diameter at most $\delta < \varepsilon$ satisfying $\sum_{i=1}^\infty \Pi(A_i)^\alpha < \infty$ for some $\alpha \in (0,1)$, then $\Pi_n\left(\left\{f \in \mathbb{F} \mid H(f, f_0) \geq \varepsilon \right\}\right) \rightarrow 0$ almost surely as $n\rightarrow \infty$.

## Posterior concentration

Following Kleijn et al. (2006) and Bhattacharya et al. (2019), we let

$B(\delta, f^\star;f_0) = \left\{ f\in \mathbb{F} \mid \int \log \left(\frac{f}{f^\star}\right) f_0 \,d\mu \leq \delta,\, \int \left(\log \left(\frac{f}{f^\star}\right)\right)^2 f_0 \,d\mu \leq \delta \right\}$

be a Kullback-Leibler type neighborhood of $f^\star$ (relative to $f_0$) where the second moment of the log likelihood ratio $\log(f/f^\star)$ is also controlled.

Theorem (Posterior concentration bound).
Let $f_0, f^\star \in \mathbb{F}$ and let $X \sim f_0$. For any $\delta > 0$ and $\kappa > 1$ we have that

$\log\Pi(A \mid X) \leq \frac{1-\alpha}{\alpha}\mathcal{S}_\alpha^\star(A, \kappa \delta)- \log\Pi(B(\delta, f^\star;f_0)) - \kappa\delta$

holds with probability at least $1-8/(\alpha^2\delta)$.

Corollary (Posterior concentration bound, i.i.d. case).
Let $f_0, f^\star \in \mathbb{F}$ and let $\{X_i\}$ be a sequence of independent random variables with common probability density $f_0$. For any $\delta > 0$ and $\kappa > 1$ we have that

$\log\Pi\left(A \mid \{X_i\}_{i=1}^n\right) \leq \frac{1-\alpha}{\alpha}\mathcal{S}_\alpha^\star(A, \kappa \delta)- \log\Pi(B(\delta, f^\star;f_0)) - n\kappa\delta$

holds with probability at least $1-8/(\alpha^2 n \delta)$.


# Some Comparison Inequalities for Off-Centered Rényi Divergences

Divergences between probability distributions $P$, $Q$ where say $P \ll Q$, provide distributional characteristics of the likelihood ratio $\frac{dP}{dQ}(x)$ when $x \sim Q$. This post is about simple properties of what I call “off-centered” divergences, where the concern is about distributional characteristics of $\frac{dP}{dQ}(x)$ in the misspecified case $x \sim Q_0$ when it may be the case that $Q_0 \not = Q$. The need arises from the study of likelihood-based inference in misspecified models (Kleijn and van der Vaart (2006); Bhattacharya et al. (2019)).

So here’s the framework in which we work. Let $\mathcal{X}$ be a complete and separable metric space together with its Borel $\sigma$-algebra $\mathcal{B}_{\mathcal{X}}$ and a $\sigma$-finite measure $\mu$ on $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$. We denote by $\mathbb{F}$ the set of all probability distributions which are absolutely continuous with respect to $\mu$ and we identify every element $f \in \mathbb{F}$ to (a chosen version of) its probability density function satisfying $f \geq 0$ and necessarily $\int f\, d\mu = 1$. Our basic metric structure on $\mathbb{F}$ is provided by the Hellinger distance

$H(f,g) = \left(\int (\sqrt{f} - \sqrt{g})^2\,d\mu\right)^{1/2}.$

Additionally, we make use of the Rényi divergence of order $\alpha \in (0,1]$ here given by

$d_\alpha(f, g) = -\alpha^{-1}\log A_{\alpha}(f,g),\quad A_\alpha(f, g) = \int_{\{g > 0\}} f^{\alpha}g^{1-\alpha}\,d\mu,$

where $A_\alpha(f, g)$ is referred to as the $\alpha$-affinity between $f$ and $g$. In the case where $\alpha = 0$, we let $d_0$ be the Kullback-Leibler divergence (or relative entropy) defined as

$d_0(f, g) = D(g | f) = \int_{\{g >0\}} \log(g/f) g\,d\mu.$

Furthermore, we note the following standard inequalities relating together $d_\alpha$ and $H$ for different levels of $\alpha \in (0,1]$ (van Erven (2014); Bhattacharya et al. (2019)):

(i) $d_{1/2}(f,g) = -2\log\left(1-\tfrac{1}{2}H(f,g)^2\right)$, since $A_{1/2}(f,g) = 1 - \tfrac{1}{2}H(f,g)^2$;
(ii) if $0 < \alpha \leq \beta < 1$, then $d_\beta \leq d_\alpha \leq \frac{1-\alpha}{\alpha}\frac{\beta}{1-\beta} d_\beta$;
(iii) if $D(g|f) < \infty$, then $d_\alpha(f, g) \rightarrow d_0(f,g) = D(g|f)$ as $\alpha \rightarrow 0$.

Point (ii) can be improved when $\alpha = 1/2$. In this case, Proposition 3.1 of Zhang (2006) implies that for $\beta \in (0, 1/2]$, $d_{1/2} \geq 2\beta d_\beta$ and for $\beta \in [1/2, 1)$, $d_{1/2} \leq 2\beta d_\beta$.
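As a sanity check, the monotonicity in $\alpha$ and the Kullback-Leibler limit can be verified numerically on a finite sample space. This is only a sketch with made-up densities, using the definitions of $d_\alpha$ and $D(g|f)$ given above with $\mu$ the counting measure.

```python
import math

def renyi(f, g, alpha):
    """d_alpha(f, g) = -log(sum of f^alpha * g^(1 - alpha)) / alpha."""
    a = sum(p ** alpha * q ** (1 - alpha) for p, q in zip(f, g))
    return -math.log(a) / alpha

def kl(g, f):
    """D(g | f) = sum of g * log(g / f)."""
    return sum(q * math.log(q / p) for p, q in zip(f, g))

# Hypothetical densities on a three-point space.
f = [0.1, 0.2, 0.7]
g = [0.5, 0.3, 0.2]
alpha, beta = 0.3, 0.7

d_a, d_b = renyi(f, g, alpha), renyi(f, g, beta)
upper = (1 - alpha) / alpha * beta / (1 - beta) * d_b

print(d_b <= d_a <= upper)          # the two-sided comparison in alpha
print(renyi(f, g, 1e-6), kl(g, f))  # small alpha approaches D(g | f)
```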

## 1. Off-centered divergences

Fix $Q_0$ any probability measure on $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$. We let $f, g \in \mathbb{F}$. In order to study the behaviour of $\frac{f}{g}(X)$ when $X \sim Q_0$, assuming this is well-defined, we consider

$A_\alpha^{Q_0}(f, g) = \mathbb{E}_{X \sim Q_0}\left[ \left(\frac{f(X)}{g(X)}\right)^{\alpha} \right] = \int_{\{f > 0\}} \left(f/g\right)^{\alpha}\,d Q_0$

and similarly we define the off-centered Rényi divergence

$d_\alpha^{Q_0}(f,g) = -\alpha^{-1}\log\left( A_\alpha^{Q_0}(f,g) \right).$

Finally, we make use of

$d_0^{Q_0}(f,g) = D^{Q_0}(g|f) = \int_{\{g > 0\}} \log\left(g/f\right)\,d Q_0.$

Note that unless we assume $Q_0 \ll \mu$, there is a dependence in the definition of $d_\alpha^{Q_0}$ to the choice of density representatives $f$ and $g$. That is, $f$ and $g$ must be measurable functions that are well-defined pointwise and not only up to $\mu$-equivalence.

Furthermore, typically, $d_\alpha^{Q_0}$ will take negative values. Considering $d_\alpha^{Q_0}(f,g)$ over $f \in \mathcal{P} \subset\mathbb{F}$ where $\mathcal{P}$ is some fixed convex part of $\mathbb{F}$, and if there exists $f \in \mathcal{P}$ such that $D(Q_0|f) < \infty$ (which implies in particular that $Q_0\ll f \ll \mu$), then we can say that $d_\alpha^{Q_0}(f,g)\geq 0$ for every $f \in \mathcal{P}$ if and only if $g \in \arg\min_{h \in \mathcal{P}}D(Q_0| h)$. Sufficiency follows from Kleijn and van der Vaart (2006) while necessity is a consequence of Proposition 2 below.

## 2. Comparison inequalities

Our first inequalities provide results analogous to $d_{\beta} \leq d_\alpha \leq \frac{1-\alpha}{\alpha}\frac{\beta}{1-\beta}d_\beta$ when $0 < \alpha \leq \beta < 1$: the off-centered divergence $d_\alpha^{Q_0}$ is also decreasing in $\alpha$, and the reverse inequality holds up to some modifications.

Proposition 1.
Let $d_\alpha^{Q_0}$ be defined as before in terms of a probability measure $Q_0$ on ${}(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$. For any $0 < \alpha \leq \beta < 1$, we have

$d_{\beta}^{Q_0} \leq d_{\alpha}^{Q_0} \leq \frac{1-\alpha}{\alpha}\frac{\beta}{1-\beta} d_{\beta}^{Q_0} + \frac{\alpha-\beta}{\alpha(1-\beta)} d_1^{Q_0}.$

Proof.
These are straightforward applications of Jensen’s inequality. For the first inequality, since $\beta \geq \alpha$,

$A_\beta^{Q_0}(f, g) = \int_{\{f > 0\}} \left(\frac{f}{g} \right)^{\beta}\, dQ_0 \geq \left(\int_{\{f > 0\}} \left(\frac{f}{g}\right)^{\alpha}\,dQ_0 \right)^{\beta / \alpha}.$

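Applying the decreasing function $-\beta^{-1} \log(\cdot)$ yields the result. For the second inequality, first assume that $Q_0(\{f > 0,\, g = 0\}) = 0$. Then using the fact that $\frac{1-\alpha}{1-\beta} \geq 1$ and Jensen’s inequality with respect to the probability measure proportional to $(f/g)\,dQ_0$, we find

$A_\beta^{Q_0}(f,g)^{\frac{1-\alpha}{1-\beta}} = \left(\int_{\{f > 0\}} \left(\frac{f}{g}\right)^{\beta-1} \frac{f}{g}\, dQ_0\right)^{\frac{1-\alpha}{1-\beta}} \leq A_1^{Q_0}(f,g)^{\frac{\beta-\alpha}{1-\beta}} \int_{\{f > 0\}} \left(\frac{f}{g}\right)^{\alpha-1} \frac{f}{g}\, dQ_0 = A_1^{Q_0}(f,g)^{\frac{\beta-\alpha}{1-\beta}} A_\alpha^{Q_0}(f,g).$

Rearranging and applying the function $-\alpha^{-1}\log(\cdot)$ then yields the result. When $Q_0(\{f>0,\, g = 0\}) > 0$, then both $d_\alpha^{Q_0}$ and $d_\beta^{Q_0}$ are infinite and the inequality also holds. //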
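Here is a quick numerical check of Proposition 1 on a finite sample space, with $Q_0$ given by a probability vector. The densities and the values of $\alpha$, $\beta$ are arbitrary choices for illustration. Note that the off-centered divergences come out negative in this example, which is the typical situation when $g$ is not a Kullback-Leibler minimizer.

```python
import math

def off_centered_renyi(f, g, q0, alpha):
    """d_alpha^{Q0}(f, g) = -log(sum of (f / g)^alpha * q0) / alpha."""
    a = sum((p / q) ** alpha * r for p, q, r in zip(f, g, q0))
    return -math.log(a) / alpha

# Hypothetical densities and data-generating distribution on three points.
f = [0.1, 0.2, 0.7]
g = [0.5, 0.3, 0.2]
q0 = [0.3, 0.3, 0.4]
alpha, beta = 0.3, 0.7

d_a = off_centered_renyi(f, g, q0, alpha)
d_b = off_centered_renyi(f, g, q0, beta)
d_1 = off_centered_renyi(f, g, q0, 1.0)

# Right-hand side of the second inequality in Proposition 1.
upper = ((1 - alpha) / alpha * beta / (1 - beta) * d_b
         + (alpha - beta) / (alpha * (1 - beta)) * d_1)

print(d_b, d_a, upper)
print(d_b <= d_a <= upper)
```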

The following Proposition shows how $d_\alpha^{Q_0}$-neighborhoods of the form $\{f \in \mathbb{F} \mid d_\alpha^{Q_0}(f,g) < \varepsilon\}$ around $g\in \mathbb{F}$ are related to $d_\alpha$-neighborhoods around $Q_0$. It also provides the converse to the non-negativity result $d_\alpha^{Q_0}(f, g) \geq 0$ when $g$ is a point of minimal Kullback-Leibler divergence: when $D(Q_0| f) < D(Q_0|g)$, then necessarily $d_\alpha^{Q_0}(f, g) < 0$.

Proposition 2.
Let $Q_0$ be a probability measure on ${}(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ that is absolutely continuous with respect to $\mu$ with density $q_0 \in \mathbb{F}$ and let $f, g \in \mathbb{F}$ be such that $Q_0(\{f > 0,\, g = 0\}) = 0$, $g(\{f > 0,\, q_0 = 0\}) = 0$. Then

$d_\alpha(f, Q_0) \geq (1-\alpha) d_\alpha^{Q_0}(f,g) + \alpha d_\alpha(f,g)$

and

$d_\alpha^{Q_0}(f,g) \leq d_0^{Q_0}(f,g).$

Proof.
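Applying Jensen’s inequality with respect to the probability measure proportional to $f^\alpha g^{1-\alpha}\,d\mu$, we find

$A_\alpha(f, q_0) = \int_{\{g > 0\}} \left(\frac{q_0}{g}\right)^{1-\alpha} f^{\alpha} g^{1-\alpha}\,d\mu \leq A_\alpha(f,g)^{\alpha} \left(\int_{\{g > 0\}} \left(\frac{f}{g}\right)^{\alpha} q_0\,d\mu\right)^{1-\alpha} = A_\alpha(f,g)^{\alpha} A_\alpha^{Q_0}(f,g)^{1-\alpha}.$

Applying the decreasing function $-\alpha^{-1}\log(\cdot)$ then yields the result. For the second inequality, note that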


# The Misspecified Mathematical Theory of Bayesian Misspecification

The last few weeks (and months) have been quite busy for me. On top of university visits and preparing my immigration to the US, I’ve had to max out my assistant teaching load this semester. I’m also teaching a few courses about Bayesian stats to senior undergrads in order to help out my advisor, I’ve been co-organizing a student statistics conference, judging a science fair, etc. It’s all fun and games and good experience, but this has also been an excuse for me to avoid what’s bothering me in my research. So here’s what’s going on in that area and how I feel about it.

There’s been renewed interest recently in the behaviour of posterior distributions when dealing with misspecified models, i.e. when the true data-generating distribution falls outside of the (Kullback-Leibler) support of the prior. For instance, Grunwald and van Ommen (2017) have empirically shown that inconsistency can arise even in standard regression models, when the data is corrupted by samples from a point mass at zero. Their solution to this problem revolves around dampening the posterior distribution by raising the likelihood function to a fractional power, resulting in what is called a generalized or fractional posterior distribution (a particular case of a Gibbs posterior distribution).

A general theory of posterior concentration for fractional posterior distributions has been developed by Bhattacharya et al. (2019), in which they show that the use of fractional posteriors alleviates the prior complexity constraint that is typically required by known results for posterior consistency. They get finite sample posterior concentration bounds and oracle inequalities in this context.

## My research

On my side, I’ve been working on the use of a new notion of prior complexity, which I refer to as the separation $\alpha$-entropy (closely related to the Hausdorff $\alpha$-entropy of Xing and Ranneby (2009) and to the concept of $\alpha$-separation discussed in Choi and Ramamoorthi (2008)), which allows one to bridge the gap between regular and fractional posterior distributions: when this entropy is finite, results analogous to those of Bhattacharya et al. (2019) are recovered. This notion of prior complexity also generalizes the covering numbers for testing under misspecification introduced by Kleijn and van der Vaart (2006), avoiding testing arguments, and generalizes as well the prior root summability conditions of Walker (2004).

I think it provides a neat way to unify a literature that otherwise might be a bit difficult to get into, but it is more conceptually interesting than practically useful: we never know in what way a model is misspecified, and things can get as bad as the misspecification gets. So in practice we still have little clue of what’s going on. Yet the tools developed for this conceptual study and the general understanding that we get might turn out helpful in developing ways to detect misspecification. So I’m not too worried about doing “mathy stuff” as, I hope, it will turn out helpful at some point.

What worries me is that our mathematical tools are unable to grasp the kind of misspecification that really happens in practice. That is, the typical mathematical theory of misspecification, worked out in dominated models, might be itself entirely misspecified. I have some ideas of how I could fix part of the problem, but it’s quite a big issue.

## The issue

Ok, so here’s what’s going on. We have a model $\mathcal{M}$ dominated by some measure $\mu$. Very roughly speaking, we can say that the model is misspecified if the true data-generating distribution, corresponding to a parameter $\theta_0$, falls outside of the model $\mathcal{M}$. In convex parametric models, assuming that the truth doesn’t fall too far from the model and in particular has a density with respect to our dominating measure, the posterior distribution will typically converge to a point mass at the Kullback-Leibler minimizer $\theta^\star$ under manageable conditions.

It’s possible to define nonparametric models (and prior distributions with full support on these models) that encompass all or nearly all probability distributions absolutely continuous with respect to $\mu$, and we can ensure that the posterior distribution will converge to $\theta_0$ (in the large sample limit of i.i.d. observations from $\theta_0$), under some verifiable conditions. There is a large literature devoted to this, and it is typically required that $\theta_0$ has a continuous density. In my paper Bayesian Nonparametrics for Directional Statistics (2019), we developed a general framework for density estimation on compact metric spaces which ensures convergence at all bounded, possibly discontinuous densities, when the dominating measure $\mu$ is finite. I.i.d. misspecification (when staying inside of the dominating model) then becomes a non-issue.

The picture in the nonparametric case looks something like this below: the model is dense in the space of all absolutely continuous distributions, although there might still be some “holes” in there.

There are a few results about what’s going on in this case, although most of them are about i.i.d. misspecification. My research has been about unifying and extending these results to non-i.i.d. misspecification in this context of dominated models, providing finite sample posterior concentration bounds and asymptotic convergence in terms of prior complexity and concentration.

I can theoretically deal with problems such as a target data-generating distribution which keeps shifting as we gather more and more data, but we require that the true data-generating distribution has a density with respect to the dominating measure (or its nth product). And changing the dominating measure in order to incorporate $\theta_0$ in the model would be cheating: we don’t know in advance what $\theta_0$ is.

## The real world: singular misspecification

There’s typically no reason why the true data-generating distribution should have a density with respect to our chosen dominating measure. While we can still do Bayesian computation in that case, our theoretical analysis breaks down: anything can happen depending on the choice of density representatives in our model.

This is singular misspecification: the true data-generating distribution has a component which is singular to $\mu$.

This issue has led to some confusion in the literature, and some of the limitations involved in typical mathematical frameworks of consistency under misspecification seem to have been neglected. For instance, in their inconsistency example, Grunwald and van Ommen (2017) considered a true data-generating distribution which is a mixture containing a point mass at zero: in this case, there is no density in the model and no point of minimal Kullback-Leibler divergence (contrary to what they state in their paper; at least as far as I understood what they meant in this regard). They show that the sufficient conditions for posterior consistency of De Blasi and Walker (2013) do not hold in their context, but let’s be clear: even if these regularity conditions were to hold, the results of De Blasi and Walker (2013) are inapplicable in this context.

So what can we do? I think there might still be things to say when we consider models of continuous densities, but more work is required in this area.


# Counting cells in microscopic images

The Statistical Society of Canada posted a few weeks ago its Case Studies (a grad data science competition) for the 2019 annual meeting held in Calgary on May 26 to 29. One of the case studies is about counting cells in microscopic images which look like this:

Unfortunately, the organizers forgot to remove from the test set of images the actual cell counts.

Ok, that’s not quite fair. Truth is that they tried to remove the true cell counts, but didn’t quite manage to do so.

So here’s what’s going on. The file names of the images in the training set take forms such as A01_C1_F1_s01_w2, and the number following the letter “C” in the name indicates the true cell count in the image. While they removed this number, they forgot to remove the number following the letter “A”, which is in a simple bijection with the true cell count… The file names in the test set look like this: A01_F1_s01_w1.
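To see what is left in the file names, here is a small sketch that splits a name into its numbered fields with a regular expression. It only relies on the two name patterns quoted above. The function name `parse_name` is my own, and I don’t reproduce the mapping from the “A” number to the cell count since its exact form isn’t spelled out here.

```python
import re

# The "C<count>_" part is optional: present in training names, absent in test names.
NAME_PATTERN = re.compile(
    r"A(?P<A>\d+)_(?:C(?P<C>\d+)_)?F(?P<F>\d+)_s(?P<s>\d+)_w(?P<w>\d+)"
)

def parse_name(name):
    """Return the numeric fields of an image file name as a dict (C may be None)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    return {key: (int(val) if val is not None else None)
            for key, val in match.groupdict().items()}

print(parse_name("A01_C1_F1_s01_w2"))  # training-style name
print(parse_name("A01_F1_s01_w1"))     # test-style name, no C field
```

Tabulating the parsed “A” field against the “C” field over the training set is then enough to see the relationship between the two.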

Now even if that number following the letter “A” was removed, there would still be other problems: the number following the letter “s” in the file name also carries quite a bit of information… I don’t know why they left all that in.

I’ve contacted the organizers about this, but they don’t see it as being an important problem for the competition, even though 60% of each team’s score will be based on an RMSE prediction score.

Another fun fact about this case study: it is possible to get a root mean square error (RMSE) of about 1-2 cells through linear regression with only one covariate. Try to guess what predictor I used (hint: it’s roughly invariant under the type of blurring that they applied to some of the images.)

# 3D Data Visualization with WebGL/three.js

I wanted to make a web tool for high-dimensional data exploration through spherical multidimensional scaling (S-MDS). The basic idea of S-MDS is to map a possibly high-dimensional dataset on the sphere while approximately preserving a matrix of pairwise distances (or divergences). An interactive visualization tool could help explore the mapped dataset and translate observations back to the original data domain. I’m not quite finished, but I made a frontend prototype. The next step would be to implement the multidimensional scaling algorithm in Javascript. I may get to this if I find the time.
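For the scaling step, a rough sketch of the kind of objective one might minimize is given below: points are placed on the unit sphere and the great-circle distances between them are compared to a target matrix of pairwise distances. I’m assuming a plain squared-error (raw stress) criterion here, which is one common choice in multidimensional scaling and not necessarily what the final tool will use. The prototype itself is in JavaScript; this sketch is in Python only for brevity.

```python
import math

def to_unit_vector(lat, lon):
    """Convert latitude and longitude (in radians) to a point on the unit sphere."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def geodesic(u, v):
    """Great-circle distance between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding

def stress(points, target):
    """Sum of squared differences between geodesic and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (geodesic(points[i], points[j]) - target[i][j]) ** 2
    return total

north = to_unit_vector(math.pi / 2, 0.0)
equator = to_unit_vector(0.0, 0.0)
target = [[0.0, math.pi / 2], [math.pi / 2, 0.0]]
print(stress([north, equator], target))  # close to zero: distances already match
```

An actual S-MDS implementation would minimize this stress over the point positions, for instance by gradient descent on the latitude and longitude coordinates.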

In the current applet, you can visualize the positions and depths of earthquakes of magnitude greater than 6 from January 1st 2014 up to January 1st 2019. Data is from the US Geological Survey (usgs.gov). Code is on GitHub.

# Fractional Posteriors and Hausdorff alpha-entropy

Bhattacharya, Pati & Yang (2016) wrote an interesting paper on Bayesian fractional posteriors. These are based on fractional likelihoods – likelihoods raised to a fractional power – and provide robustness to misspecification. One of their results shows that fractional posterior contraction can be obtained as only a function of prior mass attributed to neighborhoods, in a sort of Kullback-Leibler sense, of the parameter corresponding to the true data generating distribution (or the one closest to it in the Kullback-Leibler sense). With regular posteriors, on the other hand, a complexity constraint on the prior distribution is usually also required in order to show posterior contraction.

Their result made me think of the approach of Xing & Ranneby (2008) to posterior consistency. Therein, a prior complexity constraint specified through the so-called Hausdorff $\alpha$-entropy is used to allow bounding the regular posterior distribution by something that is similar to a fractional posterior distribution. As it turns out, the proof of Theorem 3.2 of Bhattacharya & al. (2016) can almost directly be adapted to regular posteriors in certain cases, using the Hausdorff $\alpha$-entropy to bridge the gap. Let me explain this in some more detail.

Let me consider well-specified discrete priors for simplicity. More generally, the discretization trick could possibly yield similar results for non-discrete priors.

I will follow as closely as possible the notations of Bhattacharya & al. (2016). Let $\{p_{\theta}^{(n)} \mid \theta \in \Theta\}$ be a dominated statistical model, where $\Theta = \{\theta_1, \theta_2, \theta_3, \dots\}$ is discrete. Assume $X^{(n)} \sim p_{\theta_0}^{(n)}$ for some $\theta_0 \in \Theta$, let

$B_n(\varepsilon, \theta_0) = \left\{ \int p_{\theta_0}^{(n)}\log\frac{p_{\theta_0}^{(n)}}{p_{\theta}^{(n)}} < n\varepsilon^2,\, \int p_{\theta_0}^{(n)}\log^2\frac{p_{\theta_0}^{(n)}}{p_{\theta}^{(n)}} < n\varepsilon^2 \right\}$

and define the Rényi divergence of order $\alpha$ as

$D^{(n)}_{\alpha}(\theta, \theta_0) = \frac{1}{\alpha-1}\log\int\{p_{\theta}^{(n)}\}^\alpha \{p_{\theta_0}^{(n)}\}^{1-\alpha}.$

We let $\Pi_n$ be a prior on $\Theta$ and its fractional posterior distribution of order $\alpha$ is defined as

$\Pi_{n, \alpha}(A \mid X^{(n)}) \propto \int_{A}p_{\theta}^{(n)}\left(X^{(n)}\right)^\alpha\Pi_n(d\theta)$
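In the discrete setting considered here, the fractional posterior is easy to compute: multiply each prior mass by the likelihood raised to the power $\alpha$ and normalize. The sketch below does this on the log scale for a finite list of parameters. The inputs (`prior`, `log_lik`) are hypothetical values chosen for illustration.

```python
import math

def fractional_posterior(prior, log_lik, alpha):
    """Fractional posterior weights: proportional to prior * likelihood^alpha.

    prior[k] is the prior mass of theta_k and log_lik[k] is the
    log-likelihood of the observed data under theta_k.
    """
    log_w = [math.log(p) + alpha * l for p, l in zip(prior, log_lik)]
    m = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

prior = [0.5, 0.3, 0.2]
log_lik = [-10.0, -12.0, -15.0]  # made-up log-likelihood values

print(fractional_posterior(prior, log_lik, 1.0))  # regular posterior
print(fractional_posterior(prior, log_lik, 0.5))  # tempered, closer to the prior
```

Taking $\alpha = 1$ recovers the regular posterior, while smaller values of $\alpha$ pull the weights back toward the prior.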

In this well-specified case, one of their results is the following:

Theorem 3.2 of Bhattacharya & al. (particular case)
Fix $\alpha \in (0,1)$ and assume that $\varepsilon_n$ satisfies $n\varepsilon_n^2 \geq 2$ and

$\Pi_n(B_n(\varepsilon_n, \theta_0)) \geq e^{-n\varepsilon_n^2}.$

Then, for any $D \geq 2$ and $t > 0$,

$\Pi_{n,\alpha}\left( \frac{1}{n}D_\alpha^{(n)}(\theta, \theta_0) \geq \frac{D+3t}{1-\alpha} \varepsilon_n^2 \mid X^{(n)} \right) \leq e^{-tn\varepsilon_n^2}.$

holds with probability at least $1-2/\{(D-1+t)^2n\varepsilon_n^2\}$.

## What about regular posteriors?

Let us define the $\alpha$-entropy of the prior $\Pi_n$ as

$H_\alpha(\Pi_n) = \sum_{\theta \in \Theta} \Pi_n(\theta)^\alpha.$
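As a small example, the sketch below evaluates this sum for a geometric prior $\Pi_n(\theta_k) = 2^{-k}$, $k \geq 1$, truncated to finitely many terms. The prior is an assumption made purely for illustration. For this choice the series is geometric, so $H_{1/2}(\Pi_n) = \sum_{k \geq 1} 2^{-k/2} = \sqrt{2} + 1$.

```python
import math

def alpha_entropy(masses, alpha):
    """H_alpha = sum over theta of Pi(theta)^alpha, for a list of prior masses."""
    return sum(m ** alpha for m in masses)

# Geometric prior Pi(theta_k) = 2^{-k}, truncated at 200 terms
# (the neglected tail is far below floating-point precision).
geometric = [2.0 ** -k for k in range(1, 201)]

print(alpha_entropy(geometric, 0.5), math.sqrt(2) + 1)
```

Heavier tails can make the sum diverge: with $\Pi_n(\theta_k) \propto k^{-2}$ and $\alpha = 1/2$, the terms are proportional to $1/k$ and $H_\alpha(\Pi_n) = \infty$.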

An adaptation of the proof of the previous Theorem, in our case where $\Pi_n$ is discrete, yields the following.

Proposition (Regular posteriors)
Fix $\alpha \in (0,1)$ and assume that $\varepsilon_n$ satisfies $n\varepsilon_n^2 \geq 2$ and

$\Pi_n(B_n(\varepsilon_n, \theta_0)) \geq e^{-n\varepsilon_n^2}.$

Then, for any $D \geq 2$ and $t > 0$,

$\Pi_{n}\left( \frac{1}{n}D_\alpha^{(n)}(\theta, \theta_0) \geq \frac{D+3t}{1-\alpha} \varepsilon_n^2 \mid X^{(n)} \right)^\alpha \leq H_\alpha(\Pi_n) e^{-tn\varepsilon_n^2}.$

holds with probability at least $1-2/\{(D-1+t)^2n\varepsilon_n^2\}$.

Note that $H_\alpha(\Pi_n)$ may be infinite, in which case the upper bound on the tails of $\frac{1}{n}D_\alpha^{(n)}$ is trivial. When the prior is not discrete, my guess is that the complexity term $H_\alpha(\Pi_n)$ should be replaced by a discretization entropy ${}H_\alpha(\Pi_n; \varepsilon_n)$ which is the $\alpha$-entropy of a discretized version of $\Pi_n$ whose resolution (in the Hellinger sense) is some function of $\varepsilon_n$.


# ISM at the Eureka! science festival

We’ve been hard at work getting ready for the Eureka! science festival held this weekend at the Montreal Science Centre. Come check it out!

At the festival: