Divergences between two probability distributions provide distributional characteristics of their likelihood ratio when the data are drawn from one of the two distributions. This post is about simple properties of what I call “off-centered” divergences, where the concern is about distributional characteristics of the likelihood ratio in the misspecified case, when the data may be drawn from a third distribution. The need arises from the study of likelihood-based inference in misspecified models (Kleijn and van der Vaart (2006); Bhattacharya et al. (2019)).
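So here’s the framework in which we work. Let $\mathbb{X}$ be a complete and separable metric space together with its Borel $\sigma$-algebra $\mathcal{B}$ and a $\sigma$-finite measure $\mu$ on $(\mathbb{X}, \mathcal{B})$. We denote by $\mathcal{P}$ the set of all probability distributions which are absolutely continuous with respect to $\mu$, and we identify every element of $\mathcal{P}$ with (a chosen version of) its probability density function $f$, satisfying $f \geq 0$ and necessarily $\int f \, d\mu = 1$. Our basic metric structure on $\mathcal{P}$ is provided by the Hellinger distance

$$H(f, g) = \left( \int \left( \sqrt{f} - \sqrt{g} \right)^2 d\mu \right)^{1/2}.$$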
Additionally, we make use of the Rényi divergence of order here given by
where $A_\alpha(f, g)$ is referred to as the $\alpha$-affinity between $f$ and $g$. In the case where , we let be the Kullback-Leibler divergence (or relative entropy) defined as
Furthermore, we note the following standard inequalities relating together and for different levels of $\alpha$ (van Erven and Harremoës (2014); Bhattacharya et al. (2019)):
- (i) if , then ;
- (ii) if , then as .
Point (ii) can be improved when . In this case, Proposition 3.1 of Zhang (2006) implies that for , and for , .
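The quantities introduced above can be sketched numerically for distributions on a finite set. This is only an illustration under assumed conventions, which may differ from the normalizations used in this post: the Hellinger distance is taken without a factor 1/2, the $\alpha$-affinity is taken to be $\sum_x f(x)^{1-\alpha} g(x)^{\alpha}$, and the Rényi-type divergence is taken to be $-\alpha^{-1} \log$ of that affinity. The function names and the example densities `f` and `g` are hypothetical.

```python
import numpy as np

def hellinger(f, g):
    # Assumed normalization: no factor 1/2 inside the square root.
    return float(np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2)))

def affinity(f, g, alpha):
    # Assumed convention: A_alpha(f, g) = sum_x f(x)^(1 - alpha) * g(x)^alpha.
    return float(np.sum(f ** (1 - alpha) * g ** alpha))

def renyi(f, g, alpha):
    # Assumed convention: d_alpha(f, g) = -(1 / alpha) * log A_alpha(f, g).
    return -np.log(affinity(f, g, alpha)) / alpha

def kl(f, g):
    # Kullback-Leibler divergence, assuming strictly positive densities.
    return float(np.sum(f * np.log(f / g)))

# Hypothetical densities on a three-point space.
f = np.array([0.5, 0.3, 0.2])
g = np.array([0.2, 0.3, 0.5])

h2 = hellinger(f, g) ** 2
print("H^2            :", h2)
print("2(1 - A_{1/2}) :", 2 * (1 - affinity(f, g, 0.5)))
print("d_{1/2}        :", renyi(f, g, 0.5))
print("d_alpha, small :", renyi(f, g, 1e-6))
print("KL             :", kl(f, g))
```

Under these assumed conventions, the squared Hellinger distance equals $2(1 - A_{1/2})$ and is bounded above by $d_{1/2}$, and $d_\alpha$ approaches the Kullback-Leibler divergence as $\alpha$ decreases to 0.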
1. Off-centered divergences
Fix any probability measure on . We let . In order to study the behaviour of when , assuming this is well-defined, we consider
and similarly we define the off-centered Rényi divergence
Finally, we make use of
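Note that unless we assume that the fixed probability measure is absolutely continuous with respect to the dominating measure, the definitions above depend on the choice of density representatives $f$ and $g$. That is, $f$ and $g$ must be measurable functions that are well-defined pointwise and not only up to almost-everywhere equivalence.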
Furthermore, typically, will take negative values. Considering over where is some fixed convex part of , and if there exists such that (which implies in particular that ), then we can say that for every if and only if . Sufficiency follows from Kleijn and van der Vaart (2006) while necessity is a consequence of Proposition 2 below.
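To see how negative values can arise, here is a minimal sketch on a finite set. It assumes that the off-centered Kullback-Leibler quantity is the expectation of $\log(f/g)$ under the fixed measure, represented here by a hypothetical density `f0`; the function name and the example densities are likewise illustrative assumptions.

```python
import numpy as np

def off_centered_kl(f0, f, g):
    # Assumed definition: expectation of log(f / g) under the density f0.
    return float(np.sum(f0 * np.log(f / g)))

# Hypothetical densities on a three-point space.
f = np.array([0.5, 0.3, 0.2])
g = np.array([0.2, 0.3, 0.5])

# Centered case f0 = f: this is the usual Kullback-Leibler divergence.
centered = off_centered_kl(f, f, g)

# Off-centered case f0 = g: this equals minus the KL divergence from g to f.
shifted = off_centered_kl(g, f, g)

print("f0 = f:", centered)
print("f0 = g:", shifted)
```

In this example the centered value is positive, while taking `f0 = g` gives a negative value, since the data then favour `g` over `f` on average.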
2. Comparison inequalities
Our first inequalities provide results analogous to when : the off-centered divergence is also decreasing in , and the reverse inequality holds up to some modifications.
Proposition 1. Let be defined as before in terms of a probability measure on . For any , we have
These are straightforward applications of Jensen’s inequality. For the first inequality, since ,
Applying the decreasing function yields the result. For the second inequality, first assume that . Then using the fact that we find
Applying the function then yields the result. When , then both and are infinite and the inequality also holds. //
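The monotonicity in the order of the divergence can be checked numerically on a finite set. The sketch below assumes that the off-centered Rényi divergence is $-\alpha^{-1} \log$ of the expectation of $(g/f)^\alpha$ under a hypothetical density `f0`; this convention, the function name, and the example densities are assumptions made for illustration only.

```python
import numpy as np

def off_centered_renyi(f0, f, g, alpha):
    # Assumed definition: -(1 / alpha) * log E_{f0}[(g / f)^alpha].
    return float(-np.log(np.sum(f0 * (g / f) ** alpha)) / alpha)

# Hypothetical densities on a three-point space.
f0 = np.array([0.1, 0.6, 0.3])
f = np.array([0.5, 0.3, 0.2])
g = np.array([0.2, 0.3, 0.5])

alphas = np.linspace(0.1, 1.0, 10)
values = [off_centered_renyi(f0, f, g, a) for a in alphas]

# By Jensen's inequality, the values should not increase with alpha.
is_decreasing = all(x >= y - 1e-12 for x, y in zip(values, values[1:]))

# Jensen also bounds each value by the expectation of log(f / g) under f0.
kl_off = float(np.sum(f0 * np.log(f / g)))

for a, v in zip(alphas, values):
    print(f"alpha = {a:.1f}: {v:.6f}")
print("non-increasing in alpha:", is_decreasing)
print("upper bound (alpha -> 0):", kl_off)
```

The check relies only on the fact that $\alpha \mapsto (E[Z^\alpha])^{1/\alpha}$ is non-decreasing for a positive random variable $Z$, which is the Jensen step used in the proof.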
The following Proposition shows how -neighborhoods of the form around are related to -neighborhoods around . It also provides the converse to the non-negativity result when is a point of minimal Kullback-Leibler divergence: when , then necessarily .
Proposition 2. Let be a probability measure on that is absolutely continuous with respect to with density and let be such that , . Then
Applying Jensen’s inequality, we find
Applying the decreasing function then yields the result. For the second inequality, note that
- Bhattacharya, A., D. Pati, and Y. Yang (2019). Bayesian fractional posteriors. The Annals of Statistics 47(1), 39–66.
- Kleijn, B. J. and A. W. van der Vaart (2006). Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics 34(2), 837–877.
- Zhang, T. (2006). From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics 34(5), 2180–2210.
- van Erven, T. and P. Harremoës (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60(7), 3797–3820.