The last few weeks (and months) have been quite busy for me. On top of university visits and preparing my immigration to the US, I’ve had to max out my assistant teaching load this semester. I’m also teaching a few courses about Bayesian stats to senior undergrads in order to help out my advisor, I’ve been co-organizing a student statistics conference, judging a science fair, etc. It’s all fun and games and good experience, but this has also been an excuse for me to avoid what’s bothering me in my research. So here’s what’s going on in that area and how I feel about it.
There’s been renewed interest recently in the behaviour of posterior distributions under misspecified models, i.e. when the true data-generating distribution falls outside of the (Kullback-Leibler) support of the prior. For instance, Grünwald and van Ommen (2017) showed empirically that inconsistency can arise even in standard regression models when the data is corrupted by samples from a point mass at zero. Their solution revolves around dampening the posterior distribution by raising the likelihood function to a fractional power, resulting in what is called a generalized or fractional posterior distribution (a particular case of a Gibbs posterior distribution).
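To make the tempering concrete, here is a minimal sketch of the idea in Python. The model (normal likelihood), prior, data, and the value of α are all illustrative assumptions of mine, not taken from the paper; the point is only that the likelihood is raised to a power α ∈ (0, 1) before being combined with the prior, which widens the resulting posterior.

```python
import numpy as np

def fractional_posterior(grid, log_prior, log_lik, alpha):
    """Fractional (Gibbs) posterior on a parameter grid:
    density proportional to prior(theta) * likelihood(theta)^alpha."""
    log_post = log_prior + alpha * log_lik
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    dx = grid[1] - grid[0]
    return post / (post.sum() * dx)       # normalize on the grid

# Illustrative setup: N(theta, 1) likelihood with a N(0, 1) prior.
grid = np.linspace(-3.0, 3.0, 1201)
data = np.array([0.9, 1.1, 1.3, 0.8, 1.0, 1.2, 0.7, 1.4])
log_prior = -0.5 * grid**2
log_lik = np.array([-0.5 * np.sum((data - t)**2) for t in grid])

full = fractional_posterior(grid, log_prior, log_lik, alpha=1.0)  # usual posterior
frac = fractional_posterior(grid, log_prior, log_lik, alpha=0.5)  # dampened posterior

def variance(p):
    dx = grid[1] - grid[0]
    m = np.sum(grid * p) * dx
    return np.sum((grid - m)**2 * p) * dx

# The fractional posterior is wider (more cautious) than the full posterior.
assert variance(frac) > variance(full)
```

In this conjugate setting the effect is exactly that of downweighting the sample size from n to αn, which is one way to think about why tempering buys robustness.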
A general theory of posterior concentration for fractional posterior distributions has been developed by Bhattacharya et al. (2019), in which they show that the use of fractional posteriors alleviates the prior complexity constraint typically required by known results for posterior consistency. They obtain finite sample posterior concentration bounds and oracle inequalities in this context.
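Schematically, and in my own notation, the appeal of their result is that only a prior mass condition is needed: if the prior puts enough mass on Kullback-Leibler neighbourhoods of the truth,

```latex
\Pi\big(B_n(P_0, \epsilon_n)\big) \ge e^{-n\epsilon_n^2}
\quad\Longrightarrow\quad
\Pi_{n,\alpha}\Big(\big\{P : D_\alpha(P_0, P) \ge M\epsilon_n^2\big\} \,\Big|\, X_1, \dots, X_n\Big) \longrightarrow 0,
```

where Π_{n,α} is the α-fractional posterior and D_α an α-Rényi-type divergence, with no entropy or testing condition on the model. (This is a rough paraphrase of their contraction theorem, not a precise statement.)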
On my side, I’ve been working on a new notion of prior complexity, which I refer to as the separation α-entropy (closely related to the Hausdorff α-entropy of Xing and Ranneby (2009) and to the concept of separation discussed in Choi and Ramamoorthi (2008)), which makes it possible to bridge the gap between regular and fractional posterior distributions: when this entropy is finite, results analogous to those of Bhattacharya et al. (2019) are recovered. This notion of prior complexity also generalizes the covering numbers for testing under misspecification introduced by Kleijn and van der Vaart (2006), while avoiding testing arguments, as well as the prior root summability conditions of Walker (2004).
I think it provides a neat way to unify a literature that otherwise might be a bit difficult to get into, but it is more conceptually interesting than practically useful: we never know in what way a model is misspecified, and things can get as bad as the misspecification gets. So in practice we still have little clue of what’s going on. Yet the tools developed for this conceptual study, and the general understanding it yields, might turn out helpful in developing ways to detect misspecification. So I’m not too worried about doing “mathy stuff” as, I hope, it will turn out helpful at some point.
What worries me is that our mathematical tools are unable to grasp the kind of misspecification that really happens in practice. That is, the typical mathematical theory of misspecification, worked out in dominated models, might be itself entirely misspecified. I have some ideas of how I could fix part of the problem, but it’s quite a big issue.
Ok, so here’s what’s going on. We have a model {P_θ : θ ∈ Θ} dominated by some measure μ. Very roughly speaking, we can say that the model is misspecified if the true data-generating distribution P₀ does not correspond to any parameter θ ∈ Θ, i.e. falls outside of the model. In convex parametric models, assuming that the truth doesn’t fall too far from the model and in particular has a density with respect to our dominating measure, the posterior distribution will typically converge to a point mass at the Kullback-Leibler minimizer under manageable conditions.
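In symbols (my notation), the limit point is the Kullback-Leibler projection of the truth onto the model:

```latex
\theta^\star = \operatorname*{arg\,min}_{\theta \in \Theta} \, \mathrm{KL}(P_0 \,\|\, P_\theta),
\qquad
\Pi(\cdot \mid X_1, \dots, X_n) \rightsquigarrow \delta_{\theta^\star},
```

where δ_{θ⋆} denotes the point mass at θ⋆.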
It’s possible to define nonparametric models (and prior distributions with full support on these models) that encompass all or nearly all probability distributions absolutely continuous with respect to μ, and we can ensure that the posterior distribution converges to P₀ (in the large sample limit of i.i.d. observations from P₀) under some verifiable conditions. There is a large literature devoted to this, and it is typically required that P₀ has a continuous density. In my paper Bayesian Nonparametrics for Directional Statistics (2019), we developed a general framework for density estimation on compact metric spaces which ensures convergence at all bounded, possibly discontinuous densities, when the dominating measure is finite. I.i.d. misspecification (when staying inside of the dominated model) then becomes a non-issue.
The picture in the nonparametric case looks something like this: the model is dense in the space of all absolutely continuous distributions, although there might still be some “holes” in there.
There are a few results about what’s going on in this case, although most of them are about i.i.d. misspecification. My research has been about unifying and extending these results to non-i.i.d. misspecification in this context of dominated models, providing finite sample posterior concentration bounds and asymptotic convergence in terms of prior complexity and concentration.
I can theoretically deal with problems such as a target data-generating distribution which keeps shifting as we gather more and more data, but we require that the true data-generating distribution P₀ has a density with respect to the dominating measure (or its n-fold product). And changing the dominating measure in order to incorporate P₀ in the model would be cheating: we don’t know in advance what P₀ is.
The real world: singular misspecification
There’s typically no reason why the true data-generating distribution should have a density with respect to our chosen dominating measure. While we can still do Bayesian computation in that case, our theoretical analysis breaks down: anything can happen depending on the choice of density representatives in our model.
This is singular misspecification: the true data-generating distribution has a component which is singular to μ.
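Here is a toy Python illustration (the densities and data are made up for this purpose) of why the analysis breaks down: two densities that are equal Lebesgue-almost-everywhere, and hence represent the same element of the model, assign wildly different likelihoods once the data contains exact zeros drawn from a point-mass component.

```python
import numpy as np

# Standard normal density, and a version altered on the Lebesgue-null set {0}.
def f(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def f_tilde(x):
    # Equal to f almost everywhere: f and f_tilde represent the SAME distribution.
    return np.where(x == 0.0, 1e-12, f(x))

# Data with exact zeros, as from a mixture with a point mass at zero.
data = np.array([0.0, 0.0, 0.0, 1.2, -0.4, 0.7])

loglik_f = np.log(f(data)).sum()
loglik_f_tilde = np.log(f_tilde(data)).sum()

# Same distribution, arbitrarily different likelihoods at the atom.
assert loglik_f_tilde < loglik_f - 50
```

A posterior computed with f̃ would effectively rule out the standard normal while one computed with f would not, even though nothing in the dominated-model theory distinguishes the two representatives.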
This issue has led to some confusion in the literature, and some of the limitations of typical mathematical frameworks for consistency under misspecification seem to have been neglected. For instance, in their inconsistency example, Grünwald and van Ommen (2017) considered a true data-generating distribution which is a mixture containing a point mass at zero: in this case, there is no density in the model and no point of minimal Kullback-Leibler divergence (contrary to what they state in their paper, at least as far as I understood what they meant in this regard). They show that the sufficient conditions for posterior consistency of De Blasi and Walker (2013) do not hold in their context, but let’s be clear: even if these regularity conditions were to hold, the results of De Blasi and Walker (2013) would be inapplicable in this context.
So what can we do? I think there might still be things to say when we consider models of continuous densities, but more work is required in this area.
- Bhattacharya, A., D. Pati, and Y. Yang (2019). Bayesian fractional posteriors. Ann. Statist. 47(1), 39–66.
- Choi, T. and R. V. Ramamoorthi (2008). Remarks on consistency of posterior distributions. Volume 3, pp. 170–186. Beachwood, Ohio, USA: Institute of Mathematical Statistics.
- De Blasi, P. and S. G. Walker (2013). Bayesian asymptotics with misspecified models. Statistica Sinica, 169–187.
- Grünwald, P. and T. van Ommen (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12(4), 1069–1103.
- Kleijn, B. J. and A. W. van der Vaart (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34(2), 837–877.
- Ramamoorthi, R. V., K. Sriram, and R. Martin (2015). On posterior concentration in misspecified models. Bayesian Anal. 10(4), 759–789.
- Walker, S. (2004). New approaches to Bayesian consistency. Ann. Statist. 32(5), 2028–2043.
- Xing, Y. and B. Ranneby (2009). Sufficient conditions for Bayesian consistency. J. Stat. Plan. Inference 139(7), 2479–2489.