I will continue to post on
It will be much easier for me to write math and code on this new platform + I get to style it any way I want.
This post continues the series on posterior concentration under misspecification. Here I introduce a unifying point of view on the subject through the separation -entropy. We use this notion of prior entropy to bridge the gap between Bayesian fractional posteriors and regular posterior distributions: when this entropy is finite, direct analogues of some of the concentration results for fractional posteriors (Bhattacharya et al., 2019) are recovered.
This post is going to be quite abstract, just like last week. I’ll talk in a future post about how this separation -entropy generalizes the covering numbers for testing under misspecification of Kleijn et al. (2006) as well as the prior summability conditions of De Blasi et al. (2013).
Quick word of warning: this is not the definitive version of the results I’m working on, but I still had to get them out somewhere.
Another word of warning: WordPress has gotten significantly worse at dealing with math recently. I will find a new platform, but for now expect to find typos and some rendering issues.
We continue in the same theoretical framework as before: is a set of densities on a complete and separable metric space with respect to a -finite measure defined on the Borel -algebra of , is the Hellinger distance defined by
and we make use of the Rényi divergences defined by
Here we assume that data is generated following a distribution having a density in our model (this assumption could be weakened), and therefore define the off-centered Rényi divergence
assuming that all this is well defined.
Now let be a prior on . Given either a single data point or a sequence of independent variables with common probability density function , the posterior distribution of given is the random quantity defined by
and . This may not always be well-defined, but I don’t want to get into technicalities for now.
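Since the formula itself did not survive the rendering, here is the standard Bayes expression this presumably corresponds to, written for n i.i.d. observations (the symbols $\Pi$ for the prior and $\mathcal{F}$ for the model are my own choices, as the originals were lost):

```latex
\Pi\big(A \mid X_1,\dots,X_n\big)
  \;=\;
  \frac{\displaystyle\int_A \prod_{i=1}^{n} f(X_i)\,\Pi(\mathrm{d}f)}
       {\displaystyle\int_{\mathcal{F}} \prod_{i=1}^{n} f(X_i)\,\Pi(\mathrm{d}f)},
  \qquad A \subseteq \mathcal{F} \text{ measurable}.
```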
We state our concentration results in terms of the separation -entropy. It is inspired by the Hausdorff -entropy introduced in Xing et al. (2009), although the separation -entropy has no relationship with the Hausdorff measure and instead builds upon the concept of -separation of Choi et al. (2008) defined below.
Given a set , we denote by the convex hull of : it is the set of all densities of the form where is a probability measure on .
Let be fixed as above. A set of densities is said to be -separated from with respect to the divergence if for every ,
A collection of sets is said to be -separated from if every is -separated from .
An important property of -separation, first noted by Walker (2004) and used for the study of posterior consistency, is that it scales with product densities. The general result is stated in the following lemma.
Lemma (Separation of product densities).
Let , , be a sequence of -finite measure spaces where each is a complete and separable locally compact metric space and is the corresponding Borel -algebra. Denote by the set of probability density functions on , fix and let be -separated from with respect to for some . Let where is the product density on defined by . Then is -separated from with respect to where .
We can now define the separation -entropy of a set with parameter as the minimal -entropy of a -separated covering of . When this entropy is finite, we can study the concentration properties of the posterior distribution using simple information-theoretic techniques similar to those used in Bhattacharya (2019) for the study of Bayesian fractional posteriors.
Definition (Separation -entropy).
Fix , and let be a subset of . Recall , and fixed as previously. The separation -entropy of is defined as
where the infimum is taken over all (measurable) families , , satisfying and which are -separated from with respect to the divergence . When no such covering exists we let , and when we define .
When , so that , we drop the indicator and denote , to emphasize the fact.
Proposition (Properties of the separation -entropy).
The separation -entropy of a set is non-negative and if is -separated from with respect to the divergence . Furthermore, if and , then
and if also , then
For a subset with , we have
and, more generally, if for subsets , then
Theorem (Posterior consistency).
Let and let be a sequence of independent random variables with common probability density . Suppose there exists such that
If satisfies for some , then almost surely as .
The condition implies in particular that .
Corollary (Well-specified consistency).
Suppose that is in the Kullback-Leibler support of . If satisfies for some and for some , then almost surely as .
Corollary (Well-specified Hellinger consistency).
Suppose that is in the Kullback-Leibler support of and fix . If there exists a covering of by Hellinger balls of diameter at most satisfying for some , then almost surely as .
Following Kleijn et al. (2006) and Bhattacharya et al. (2019), we let
be a Kullback-Leibler type neighborhood of (relative to ) where the second moment of the log-likelihood ratio is also controlled.
Theorem (Posterior concentration bound).
Let and let . For any and we have that
holds with probability at least .
Corollary (Posterior concentration bound, i.i.d. case).
Let and let be a sequence of independent random variables with common probability density . For any and we have that
holds with probability at least .
Divergences between probability distributions , where say , provide distributional characteristics of the likelihood ratio when . This post is about simple properties of what I call “off-centered” divergences, where the concern is about distributional characteristics of in the misspecified case when it may be the case that . The need arises from the study of likelihood-based inference in misspecified models (Kleijn and van der Vaart (2006); Bhattacharya et al. (2019)).
So here’s the framework in which we work. Let be a complete and separable metric space together with its Borel -algebra and a -finite measure on . We denote by the set of all probability distributions which are absolutely continuous with respect to and we identify every element to (a chosen version of) its probability density function satisfying and necessarily . Our basic metric structure on is provided by the Hellinger distance
Additionally, we make use of the Rényi divergence of order here given by
where (f, g) is referred to as the -affinity between and . In the case where , we let be the Kullback-Leibler divergence (or relative entropy) defined as
Furthermore, we note the following standard inequalities relating together and for different levels of (van Erven (2014); Bhattacharya et al. (2019)):
Point (ii) can be improved when . In this case, Proposition 3.1 of Zhang (2006) implies that for , and for , .
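For reference, the standard facts of this type (van Erven (2014)) include monotonicity in the order and a link with the Hellinger distance; I state them here in my own notation, since the original formulas were lost to rendering, and with the unhalved Hellinger convention:

```latex
\alpha \;\mapsto\; D_\alpha(f\,\|\,g) \text{ is nondecreasing on } (0,1),
\qquad
\lim_{\alpha \uparrow 1} D_\alpha(f\,\|\,g) = \mathrm{KL}(f\,\|\,g),
```

and, writing $H^2(f,g) = \int (\sqrt{f} - \sqrt{g}\,)^2 \,\mathrm{d}\mu$ so that the $1/2$-affinity equals $1 - H^2(f,g)/2$,

```latex
D_{1/2}(f\,\|\,g)
  \;=\; -2\log\Big(1 - \tfrac{1}{2} H^2(f,g)\Big)
  \;\ge\; H^2(f,g).
```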
Fix any probability measure on . We let . In order to study the behaviour of when , assuming this is well-defined, we consider
and similarly we define the off-centered Rényi divergence
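The definition itself was garbled in rendering; one natural candidate in the style of Kleijn and van der Vaart (2006) and Bhattacharya et al. (2019) — an assumption on my part, including the notation — is

```latex
D_\alpha(f \,\|\, g \,;\, f_0)
  \;=\;
  \frac{1}{\alpha - 1}\,
  \log \int \Big(\frac{f}{g}\Big)^{\alpha - 1} f_0 \,\mathrm{d}\mu,
```

which measures the likelihood ratio $f/g$ under the true density $f_0$ and reduces to the usual Rényi divergence $D_\alpha(f \,\|\, g)$ when $f_0 = f$.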
Finally, we make use of
Note that unless we assume , the definition of depends on the choice of density representatives and . That is, and must be measurable functions that are well-defined pointwise, not only up to -equivalence.
Furthermore, typically, will take negative values. Considering over where is some fixed convex part of , and if there exists such that (which implies in particular that ), then we can say that for every if and only if . Sufficiency follows from Kleijn and van der Vaart (2006) while necessity is a consequence of Proposition 2 below.
Our first inequalities provide results analogous to when : the off-centered divergence is also decreasing in , and the reverse inequality holds up to some modifications.
Let be defined as before in terms of a probability measure on . For any , we have
These are straightforward applications of Jensen’s inequality. For the first inequality, since ,
Applying the decreasing function yields the result. For the second inequality, first assume that . Then using the fact that we find
Applying the function then yields the result. When , then both and are infinite and the inequality also holds. //
The following Proposition shows how -neighborhoods of the form around are related to -neighborhoods around . It also provides the converse to the non-negativity result when is a point of minimal Kullback-Leibler divergence: when , then necessarily .
Let be a probability measure on that is absolutely continuous with respect to with density and let be such that , . Then
Applying Jensen’s inequality, we find
Applying the decreasing function then yields the result. For the second inequality, note that
The last few weeks (and months) have been quite busy for me. On top of university visits and preparing my immigration to the US, I’ve had to max out my assistant teaching load this semester. I’m also teaching a few courses about Bayesian stats to senior undergrads in order to help out my advisor, I’ve been co-organizing a student statistics conference, judging a science fair, etc. It’s all fun and games and good experience, but this has also been an excuse for me to avoid what’s bothering me in my research. So here’s what’s going on in that area and how I feel about it.
There’s been renewed interest recently in the behaviour of posterior distributions when dealing with misspecified models, i.e. when the true data-generating distribution falls outside of the (Kullback-Leibler) support of the prior. For instance, Grunwald and van Ommen (2017) have empirically shown that inconsistency can arise even in standard regression models, when the data is corrupted by samples from a point mass at zero. Their solution to this problem revolves around dampening the posterior distribution by raising the likelihood function to a fractional power, resulting in what is called a generalized or fractional posterior distribution (a particular case of a Gibbs posterior distribution).
A general theory of posterior concentration for fractional posterior distributions has been developed by Bhattacharya et al. (2019), in which they show that the use of fractional posteriors alleviates the prior complexity constraint that is typically required by known results for posterior consistency. They obtain finite sample posterior concentration bounds and oracle inequalities in this context.
On my side, I’ve been working on the use of a new notion of prior complexity, which I refer to as the separation -entropy (closely related to the Hausdorff -entropy of Xing and Ranneby (2009) and to the concept of -separation discussed in Choi and Ramamoorthi (2008)), which allows us to bridge the gap between regular and fractional posterior distributions: when this entropy is finite, results analogous to those of Bhattacharya et al. (2019) are recovered. This notion of prior complexity also generalizes the covering numbers for testing under misspecification introduced by Kleijn and van der Vaart (2006), avoiding testing arguments, and generalizes as well the prior root summability conditions of Walker (2004).
I think it provides a neat way to unify a literature that otherwise might be a bit difficult to get into, but it is more conceptually interesting than practically useful: we never know in what way a model is misspecified, and things can get as bad as the misspecification gets. So in practice we still have little clue of what’s going on. Yet the tools developed for this conceptual study and the general understanding that we get might turn out helpful in developing ways to detect misspecification. So I’m not too worried about doing “mathy stuff” as, I hope, it will turn out helpful at some point.
What worries me is that our mathematical tools are unable to grasp the kind of misspecification that really happens in practice. That is, the typical mathematical theory of misspecification, worked out in dominated models, might be itself entirely misspecified. I have some ideas of how I could fix part of the problem, but it’s quite a big issue.
Ok, so here’s what’s going on. We have a model dominated by some measure . Very roughly speaking, we can say that the model is misspecified if the true data-generating distribution, corresponding to a parameter , falls outside of the model . In convex parametric models, assuming that the truth doesn’t fall too far from the model and in particular has a density with respect to our dominating measure, the posterior distribution will typically converge to a point mass at the Kullback-Leibler minimizer under manageable conditions.
It’s possible to define nonparametric models (and prior distributions with full support on these models) that encompass all or nearly all probability distributions absolutely continuous with respect to , and we can ensure that the posterior distribution will converge to (in the large sample limit of i.i.d. observations from ), under some verifiable conditions. There is a large literature devoted to this, and it is typically required that has a continuous density. In my paper Bayesian Nonparametrics for Directional Statistics (2019), we developed a general framework for density estimation on compact metric spaces which ensures convergence at all bounded, possibly discontinuous densities, when the dominating measure is finite. I.i.d. misspecification (when staying inside of the dominated model) then becomes a non-issue.
The picture in the nonparametric case looks something like this below: the model is dense in the space of all absolutely continuous distributions, although there might still be some “holes” in there.
There are a few results about what’s going on in this case, although most of them are about i.i.d. misspecification. My research has been about unifying and extending these results to non-i.i.d. misspecification in this context of dominated models, providing finite sample posterior concentration bounds and asymptotic convergence in terms of prior complexity and concentration.
I can theoretically deal with problems such as a target data-generating distribution which keeps shifting as we gather more and more data, but we require that the true data-generating distribution has a density with respect to the dominating measure (or its nth product). And changing the dominating measure in order to incorporate in the model would be cheating: we don’t know in advance what is.
There’s typically no reason why the true data-generating distribution should have a density with respect to our chosen dominating measure. While we can still do Bayesian computation in that case, our theoretical analysis breaks down: anything can happen depending on the choice of density representatives in our model.
This is singular misspecification: the true data-generating distribution has a component which is singular to .
This issue has led to some confusion in the literature, and some of the limitations involved in typical mathematical frameworks of consistency under misspecification seem to have been neglected. For instance, in their inconsistency example, Grunwald and van Ommen (2017) considered a true data-generating distribution which is a mixture containing a point mass at zero: in this case, there is no density in the model and no point of minimal Kullback-Leibler divergence (contrary to what they state in their paper; at least as far as I understood what they meant in this regard). They show that the sufficient conditions for posterior consistency of De Blasi and Walker (2013) do not hold in their context, but let’s be clear: even if these regularity conditions were to hold, the results of De Blasi and Walker (2013) are inapplicable in this context.
So what can we do? I think there might still be things to say when we consider models of continuous densities, but more work is required in this area.
A few weeks ago, the Statistical Society of Canada posted its Case Studies (a grad data science competition) for the 2019 annual meeting held in Calgary on May 26 to 29. One of the case studies is about counting cells in microscopic images, which look like this:
Unfortunately, the organizers forgot to remove from the test set of images the actual cell counts.
Ok, that’s not quite fair. Truth is that they tried to remove the true cell counts, but didn’t quite manage to do so.
So here’s what’s going on. The file names of the images in the training set take forms such as A01_C1_F1_s01_w2, and the number following the letter “C” in the name indicates the true cell count in the image. While they removed this number, they forgot to remove the number following the letter “A”, which is in a simple bijection with the true cell count… The file names in the test set look like this: A01_F1_s01_w1.
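To make the leak concrete, here is a minimal sketch (function names are my own) of reading those fields off the file names; the mapping from A-index to cell count would then simply be tabulated from the training files:

```python
import re

def leaked_index(filename: str) -> int:
    """Extract the number following the letter 'A' in a case-study file
    name such as A01_C1_F1_s01_w2 (training) or A01_F1_s01_w1 (test).

    This index survives in the test-set names and, per the post, is in
    bijection with the true cell count; the bijection itself has to be
    looked up from the training set.
    """
    m = re.match(r"A(\d+)", filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(m.group(1))

def training_count(filename: str) -> int:
    """In the training set, the number after 'C' is the true cell count."""
    m = re.search(r"_C(\d+)_", filename)
    if m is None:
        raise ValueError(f"no C field (test-set file?): {filename}")
    return int(m.group(1))
```

For a training file, pairing `leaked_index` with `training_count` builds the lookup table; applying `leaked_index` to the test names is all the leak requires.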
Now even if that number following the letter “A” was removed, there would still be other problems: the number following the letter “s” in the file name also carries quite a bit of information… I don’t know why they left all that in.
I’ve contacted the organizers about this, but they don’t see it as an important problem for the competition, even though 60% of each team’s score will be based on an RMSE prediction score.
Another fun fact about this case study: it is possible to get a root mean square error (RMSE) of about 1-2 cells through linear regression with only one covariate. Try to guess what predictor I used (hint: it’s roughly invariant under the type of blurring that they applied to some of the images).
In the current applet, you can visualize the positions and depths of earthquakes of magnitude greater than 6 from January 1st 2014 up to January 1st 2019. Data is from the US Geological Survey (usgs.gov). Code is on GitHub.
Félix Locas presented me with this problem.
Let . Show that
The series is easy to calculate. It is, for instance, the difference between the integrals of geometric series:
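Since the particular series was garbled in rendering, here is the generic version of the trick it alludes to: integrating a geometric series term by term gives

```latex
\sum_{n \ge 1} \frac{x^{n}}{n}
  \;=\; \sum_{n \ge 1} \int_0^x t^{\,n-1}\,\mathrm{d}t
  \;=\; \int_0^x \frac{\mathrm{d}t}{1-t}
  \;=\; -\log(1-x), \qquad |x| < 1,
```

and taking the difference of two such integrated geometric series handles variants running over even or odd indices only.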