Some Comparison Inequalities for Off-Centered Rényi Divergences

Divergences between probability distributions P and Q with, say, P \ll Q, provide distributional characteristics of the likelihood ratio \frac{dP}{dQ}(x) when x \sim Q. This post is about simple properties of what I call “off-centered” divergences, which concern the distributional characteristics of \frac{dP}{dQ}(x) in the misspecified case x \sim Q_0, where possibly Q_0 \neq Q. The need arises from the study of likelihood-based inference in misspecified models (Kleijn and van der Vaart (2006); Bhattacharya et al. (2019)).

So here’s the framework in which we work. Let \mathcal{X} be a complete and separable metric space equipped with its Borel \sigma-algebra \mathcal{B}_{\mathcal{X}} and a \sigma-finite measure \mu on (\mathcal{X}, \mathcal{B}_{\mathcal{X}}). We denote by \mathbb{F} the set of all probability distributions that are absolutely continuous with respect to \mu, and we identify every element f \in \mathbb{F} with (a chosen version of) its probability density function, satisfying f \geq 0 and \int f\, d\mu = 1. Our basic metric structure on \mathbb{F} is provided by the Hellinger distance

H(f,g) = \left(\int (\sqrt{f} - \sqrt{g})^2\, d\mu\right)^{1/2}.

Additionally, we make use of the Rényi divergence of order \alpha \in (0,1], given here by

d_\alpha(f, g) = -\alpha^{-1}\log A_{\alpha}(f,g),\quad A_\alpha(f, g) = \int_{\{g > 0\}} f^{\alpha}g^{1-\alpha}\,d\mu,

where A_\alpha(f, g) is referred to as the \alpha-affinity between f and g. In the case where \alpha = 0, we let d_0 be the Kullback-Leibler divergence (or relative entropy) defined as

d_0(f, g) = D(g | f) = \int_{\{g > 0\}} \log(g/f)\, g\,d\mu.

Furthermore, we note the following standard inequalities relating d_\alpha and H at different levels of \alpha \in (0,1] (van Erven and Harremoës (2014); Bhattacharya et al. (2019)):

  • (i) d_{1/2}(f,g) = -2\log\left(1 - \tfrac{1}{2}H(f,g)^2\right);
  • (ii) if 0 < \alpha \leq \beta < 1, then d_\beta \leq d_\alpha \leq \frac{1-\alpha}{\alpha}\frac{\beta}{1-\beta} d_\beta;
  • (iii) if D(g|f) < \infty, then d_\alpha(f, g) \rightarrow d_0(f,g) = D(g|f) as \alpha \rightarrow 0.

Point (ii) can be improved when \alpha = 1/2. In this case, Proposition 3.1 of Zhang (2006) implies that for \beta \in (0, 1/2], d_{1/2} \geq 2\beta d_\beta and for \beta \in [1/2, 1), d_{1/2} \leq 2\beta d_\beta.
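
As a quick numerical sanity check of (i) and (ii) (and of the sign and scaling conventions used above), here is a small Python snippet; the two Bernoulli densities and the grid of \alpha values are arbitrary choices, just for illustration.

import numpy as np

# Two Bernoulli densities on {0, 1}, with mu the counting measure.
f = np.array([0.3, 0.7])   # Bernoulli(0.7)
g = np.array([0.6, 0.4])   # Bernoulli(0.4)

def affinity(f, g, alpha):
    # alpha-affinity A_alpha(f, g): integral of f^alpha * g^(1 - alpha) over {g > 0}
    m = g > 0
    return np.sum(f[m]**alpha * g[m]**(1 - alpha))

def d(f, g, alpha):
    # Renyi divergence d_alpha(f, g) = -log(A_alpha(f, g)) / alpha
    return -np.log(affinity(f, g, alpha)) / alpha

# (i): d_{1/2} agrees with -2 log(1 - H^2 / 2)
H2 = np.sum((np.sqrt(f) - np.sqrt(g))**2)   # squared Hellinger distance
print(d(f, g, 0.5), -2 * np.log(1 - H2 / 2))

# (ii): alpha -> d_alpha is non-increasing
alphas = np.linspace(0.01, 0.99, 99)
vals = [d(f, g, a) for a in alphas]
assert all(x >= y - 1e-12 for x, y in zip(vals, vals[1:]))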

1. Off-centered divergences

Fix any probability measure Q_0 on (\mathcal{X}, \mathcal{B}_{\mathcal{X}}) and let f, g \in \mathbb{F}. In order to study the behaviour of \frac{f}{g}(X) when X \sim Q_0, assuming this ratio is well-defined, we consider

A_\alpha^{Q_0}(f, g) = \mathbb{E}_{X \sim Q_0}\left[ \left(\frac{f(X)}{g(X)}\right)^{\alpha} \right] = \int_{\{f > 0\}} \left(f/g\right)^{\alpha}\,d Q_0,

and similarly we define the off-centered Rényi divergence

d_\alpha^{Q_0}(f,g) = -\alpha^{-1}\log\left( A_\alpha^{Q_0}(f,g) \right).

Finally, we make use of

d_0^{Q_0}(f,g) = D^{Q_0}(g|f) = \int_{\{g > 0\}} \log\left(g/f\right)\,d Q_0.

Note that unless we assume Q_0 \ll \mu, the definition of d_\alpha^{Q_0} depends on the choice of density representatives f and g. That is, f and g must be measurable functions that are well-defined pointwise, and not only up to \mu-equivalence.
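
To make the definition concrete, here is a minimal Monte Carlo sketch of d_\alpha^{Q_0}. The Gaussian choices f = N(0,1), g = N(1,1) and Q_0 = N(1/2, 1) below are hypothetical, picked only because the off-centered affinity is then available in closed form: a direct computation gives A_\alpha^{Q_0}(f,g) = e^{\alpha^2/2}, so that d_\alpha^{Q_0}(f,g) = -\alpha/2.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical densities: f = N(0, 1), g = N(1, 1); off-center Q_0 = N(0.5, 1).
def f(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
def g(x): return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)

def d_oc(alpha, n=10**6):
    # Monte Carlo estimate of d_alpha^{Q_0}(f, g) = -(1/alpha) log E_{X ~ Q_0}[(f(X)/g(X))^alpha]
    x = rng.normal(0.5, 1.0, size=n)        # X ~ Q_0
    a_hat = np.mean((f(x) / g(x))**alpha)   # estimates A_alpha^{Q_0}(f, g)
    return -np.log(a_hat) / alpha

print(d_oc(0.5))   # close to the exact value -alpha/2 = -0.25

Note that the exact value here is negative, which anticipates the next remark.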

Furthermore, d_\alpha^{Q_0} will typically take negative values. Consider d_\alpha^{Q_0}(f,g) for f ranging over a fixed convex subset \mathcal{P} \subset \mathbb{F}. If there exists f \in \mathcal{P} such that D(Q_0|f) < \infty (which implies in particular that Q_0 \ll f \ll \mu), then d_\alpha^{Q_0}(f,g) \geq 0 for every f \in \mathcal{P} if and only if g \in \arg\min_{h \in \mathcal{P}}D(Q_0| h). Sufficiency follows from Kleijn and van der Vaart (2006), while necessity is a consequence of Proposition 2 below.
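
For example, if Q_0 \ll \mu with density q_0 \in \mathbb{F} and \mathcal{P} = \mathbb{F}, then the minimizer is g = q_0, and the off-centered divergence reduces to the centered one: A_\alpha^{Q_0}(f, q_0) = \int_{\{f > 0,\, q_0 > 0\}} f^{\alpha} q_0^{1-\alpha}\, d\mu = A_\alpha(f, q_0), so that d_\alpha^{Q_0}(f, q_0) = d_\alpha(f, q_0) \geq 0 for every f \in \mathbb{F}.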

2. Comparison inequalities

Our first inequalities are analogous to point (ii) above, namely d_{\beta} \leq d_\alpha \leq \frac{1-\alpha}{\alpha}\frac{\beta}{1-\beta}d_\beta for 0 < \alpha \leq \beta < 1: the off-centered divergence d_\alpha^{Q_0} is also non-increasing in \alpha, and the reverse inequality holds up to an additional term involving d_1^{Q_0}.

Proposition 1.
Let d_\alpha^{Q_0} be defined as before in terms of a probability measure Q_0 on (\mathcal{X}, \mathcal{B}_{\mathcal{X}}). For any 0 < \alpha \leq \beta < 1, we have

d_{\beta}^{Q_0} \leq  d_{\alpha}^{Q_0} \leq \frac{1-\alpha}{\alpha}\frac{\beta}{1-\beta} d_{\beta}^{Q_0} + \frac{\alpha-\beta}{\alpha(1-\beta)} d_1^{Q_0}.

Proof.
These are straightforward applications of Jensen’s inequality. For the first inequality, since \beta \geq \alpha,

A_\beta^{Q_0}(f, g) = \int_{\{f > 0\}} \left(\frac{f}{g} \right)^{\beta}\, dQ_0 \geq \left(\int_{\{f > 0\}} \left(\frac{f}{g}\right)^{\alpha}\,dQ_0 \right)^{\beta / \alpha} = A_\alpha^{Q_0}(f,g)^{\beta/\alpha}.

Applying the decreasing function -\beta^{-1} \log(\cdot) yields the result. For the second inequality, first assume that Q_0(\{f > 0,\, g = 0\}) = 0 and that A_1^{Q_0}(f,g) < \infty (if A_1^{Q_0}(f,g) = \infty and \alpha < \beta, the right-hand side equals +\infty and there is nothing to prove), and let R be the probability measure defined by dR = A_1^{Q_0}(f,g)^{-1} (f/g)\, \mathbb{1}_{\{f > 0\}}\, dQ_0. Then, using the fact that \frac{1-\alpha}{1-\beta} \geq 1, so that x \mapsto x^{\frac{1-\alpha}{1-\beta}} is convex, we find

A_\alpha^{Q_0}(f, g) = A_1^{Q_0}(f,g) \int \left(\frac{f}{g}\right)^{\alpha - 1} dR \geq A_1^{Q_0}(f,g) \left(\int \left(\frac{f}{g}\right)^{\beta - 1} dR\right)^{\frac{1-\alpha}{1-\beta}} = A_1^{Q_0}(f,g)^{\frac{\alpha-\beta}{1-\beta}}\, A_\beta^{Q_0}(f,g)^{\frac{1-\alpha}{1-\beta}}.

Applying the function -\alpha^{-1}\log(\cdot) then yields the result. When Q_0(\{f>0,\, g = 0\}) > 0, then A_\alpha^{Q_0} = A_\beta^{Q_0} = \infty, so that d_\alpha^{Q_0} = d_\beta^{Q_0} = -\infty and both inequalities hold trivially. //
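
Again as a sanity check (not part of the argument), the two-sided bound of Proposition 1 can be verified numerically on a small discrete example; all choices below are arbitrary.

import numpy as np

f  = np.array([0.3, 0.7])   # arbitrary densities on {0, 1} (mu = counting measure)
g  = np.array([0.6, 0.4])
q0 = np.array([0.2, 0.8])   # density of an arbitrary Q_0

def d_oc(alpha, f, g, q0):
    # off-centered divergence d_alpha^{Q_0}(f, g); integral over {f > 0}
    m = f > 0
    return -np.log(np.sum((f[m] / g[m])**alpha * q0[m])) / alpha

alpha, beta = 0.2, 0.7
lhs = d_oc(beta, f, g, q0)
mid = d_oc(alpha, f, g, q0)
rhs = ((1 - alpha) / alpha) * (beta / (1 - beta)) * lhs \
      + ((alpha - beta) / (alpha * (1 - beta))) * d_oc(1.0, f, g, q0)
print(lhs <= mid <= rhs)   # True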

The following proposition shows how d_\alpha^{Q_0}-neighborhoods of the form \{f \in \mathbb{F} \mid d_\alpha^{Q_0}(f,g) \leq \varepsilon\} around g \in \mathbb{F} are related to d_\alpha-neighborhoods around Q_0. It also provides the converse to the non-negativity result d_\alpha^{Q_0}(f, g) \geq 0 when g is a point of minimal Kullback-Leibler divergence: if D(Q_0|f) < D(Q_0|g), then necessarily d_\alpha^{Q_0}(f, g) < 0.

Proposition 2.
Let Q_0 be a probability measure on (\mathcal{X}, \mathcal{B}_{\mathcal{X}}) that is absolutely continuous with respect to \mu with density q_0 \in \mathbb{F}, and let f, g \in \mathbb{F} be such that Q_0(\{f > 0,\, g = 0\}) = 0 and \int_{\{f > 0,\, q_0 = 0\}} g\, d\mu = 0. Then

d_\alpha(f, q_0) \geq (1-\alpha)\, d_\alpha^{Q_0}(f,g) + \alpha\, d_\alpha(f,g)

and

d_\alpha^{Q_0}(f,g) \leq d_0^{Q_0}(f,g).

Proof.
For the first inequality, we may assume A_\alpha^{Q_0}(f,g) \in (0, \infty), and we let R be the probability measure defined by dR = A_\alpha^{Q_0}(f,g)^{-1}(f/g)^{\alpha}\, \mathbb{1}_{\{f > 0\}}\, dQ_0. Observe that f^{\alpha} q_0^{1-\alpha} = \left[(f/g)^{\alpha} q_0\right] (g/q_0)^{\alpha} on \{f > 0,\, g > 0,\, q_0 > 0\}, and that the two null-set assumptions guarantee that the complement of this set contributes nothing to the integrals involved. Applying Jensen’s inequality (concavity of x \mapsto x^{\alpha}), we find

A_\alpha(f, q_0) = A_\alpha^{Q_0}(f,g) \int \left(\frac{g}{q_0}\right)^{\alpha} dR \leq A_\alpha^{Q_0}(f,g) \left(\int \frac{g}{q_0}\, dR\right)^{\alpha} = A_\alpha^{Q_0}(f,g)^{1-\alpha}\, A_\alpha(f,g)^{\alpha}.

Applying the decreasing function -\alpha^{-1}\log(\cdot) then yields the result. For the second inequality, note that by Jensen’s inequality (convexity of the exponential),

A_\alpha^{Q_0}(f,g) = \int_{\{f > 0\}} e^{-\alpha \log(g/f)}\, dQ_0 \geq \exp\left(-\alpha \int_{\{f > 0\}} \log(g/f)\, dQ_0 \right),

and applying -\alpha^{-1}\log(\cdot) gives d_\alpha^{Q_0}(f,g) \leq \int_{\{f > 0\}} \log(g/f)\, dQ_0 \leq d_0^{Q_0}(f,g), where the last inequality uses the assumption Q_0(\{f > 0,\, g = 0\}) = 0. //
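
As with Proposition 1, here is a small numerical check of both inequalities of Proposition 2, reusing the arbitrary densities on \{0, 1\} from before (all entries are positive, so the null-set assumptions hold trivially).

import numpy as np

f  = np.array([0.3, 0.7])   # arbitrary densities on {0, 1}
g  = np.array([0.6, 0.4])
q0 = np.array([0.2, 0.8])   # density of Q_0
alpha = 0.4

def d(f, g, alpha):
    # centered divergence d_alpha(f, g); integral over {g > 0}
    m = g > 0
    return -np.log(np.sum(f[m]**alpha * g[m]**(1 - alpha))) / alpha

def d_oc(f, g, q0, alpha):
    # off-centered divergence d_alpha^{Q_0}(f, g); integral over {f > 0}
    m = f > 0
    return -np.log(np.sum((f[m] / g[m])**alpha * q0[m])) / alpha

m = g > 0
d0_oc = np.sum(np.log(g[m] / f[m]) * q0[m])   # d_0^{Q_0}(f, g)

print(d(f, q0, alpha) >= (1 - alpha) * d_oc(f, g, q0, alpha) + alpha * d(f, g, alpha))  # True
print(d_oc(f, g, q0, alpha) <= d0_oc)                                                  # True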

References:

  • Bhattacharya, A., D. Pati, and Y. Yang (2019). Bayesian fractional posteriors. The Annals of Statistics 47(1), 39–66.
  • Kleijn, B. J. K. and A. W. van der Vaart (2006). Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics 34(2), 837–877.
  • Zhang, T. (2006). From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics 34(5), 2180–2210.
  • van Erven, T. and P. Harremoës (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60(7), 3797–3820.
