Fractional Posteriors and Hausdorff alpha-entropy

Bhattacharya, Pati & Yan (2016) wrote an interesting paper on Bayesian fractional posteriors. These are based on fractional likelihoods – likelihoods raised to a fractional power – and provide robustness to misspecification. One of their results shows that fractional posterior contraction can be obtained as only a function of prior mass attributed to neighborhoods, in a sort of Kullback-Leibler sense, of the parameter corresponding to the true data generating distribution (or the one closest to it in the Kullback-Leibler sense). With regular posteriors, on the other hand, a complexity constraint on the prior distribution is usually also required in order to show posterior contraction.

Their result made me think of the approach of Xing & Ranneby (2008) to posterior consistency. Therein, a prior complexity constraint specified through the so-called Hausdorff \alpha-entropy is used to allow bounding the regular posterior distribution by something that is similar to a fractional posterior distribution. As it turns out, the proof of Theorem 3.2 of of Battacharya & al. (2016) can almost directly be adapted to regular posteriors in certain cases, using the Hausdorff \alpha-entropy to bridge the gap. Let me explain this in some more detail.

Le me consider well-specified discrete priors for simplicity. More generally, the discretization trick could possibly yield similar results for non-discrete priors.

I will follow as closely as possible the notations of Battacharya & al. (2016). Let \{p_{\theta}^{(n)} \mid \theta \in \Theta\} be a dominated statistical model, where \Theta = \{\theta_1, \theta_2, \theta_3, \dots\} is discrete. Assume X^{(n)} \sim p_{\theta_0}^{(n)} for some \theta_0 \in \Theta, let

B_n(\varepsilon, \theta_0) = \left\{ \int p_{\theta_0}^{(n)}\log\frac{p_{\theta_0}^{(n)}}{p_{\theta}^{(n)}} < n\varepsilon^2,\, \int p_{\theta_0}^{(n)}\log^2\frac{p_{\theta_0}^{(n)}}{p_{\theta}^{(n)}} < n\varepsilon^2 \right\}

and define the Renyi divergence of order \alpha as

D^{(n)}_{\alpha}(\theta, \theta_0) = \frac{1}{\alpha-1}\log\int\{p_{\theta}^{(n)}\}^\alpha \{p_{\theta_0}^{(n)}\}^{1-\alpha}.

We let \Pi_n be a prior on \Theta and its fractional posterior distribution of order \alpha is defined as

\Pi_{n, \alpha}(A \mid X^{(n)}) \propto \int_{A}p_{\theta}^{(n)}\left(X^{(n)}\right)^\alpha\Pi_n(d\theta)

In this well-specified case, one of their result is the following:

Theorem 3.2 of Bhattacharya & al. (particular case)
Fix \alpha \in (0,1) and assume that \varepsilon_n satisfies n\varepsilon_n^2 \geq 2 and

\Pi_n(B_n(\varepsilon_n, \theta_0)) \geq e^{-n\varepsilon_n^2}.

Then, for any D \geq 2 and t > 0,

\Pi_{n,\alpha}\left( \frac{1}{n}D_\alpha^{(n)}(\theta, \theta_0) \geq \frac{D+3t}{1-\alpha} \varepsilon_n^2 \mid X^{(n)} \right) \leq e^{-tn\varepsilon_n^2}.

holds with probability at least 1-2/\{(D-1+t)^2n\varepsilon_n^2\}.

What about regular posteriors?

Let us define the \alpha-entropy of the prior \Pi_n as

H_\alpha(\Pi_n) = \sum_{\theta \in \Theta} \Pi_n(\theta)^\alpha.

An adaptation of the proof of the previous Theorem, in our case where \Pi_n is discrete, yields the following.

Proposition (Regular posteriors)
Fix \alpha \in (0,1) and assume that \varepsilon_n satisfies n\varepsilon_n^2 \geq 2 and

\Pi_n(B_n(\varepsilon_n, \theta_0)) \geq e^{-n\varepsilon_n^2}.

Then, for any D \geq 2 and t > 0,

\Pi_{n}\left( \frac{1}{n}D_\alpha^{(n)}(\theta, \theta_0) \geq \frac{D+3t}{1-\alpha} \varepsilon_n^2 \mid X^{(n)} \right)^\alpha \leq H_\alpha(\Pi_n) e^{-tn\varepsilon_n^2}.

holds with probability at least 1-2/\{(D-1+t)^2n\varepsilon_n^2\}.

Note that H_\alpha(\Pi_n) may be infinite, in which case the upper bound on the tails of \frac{1}{n}D_\alpha^{(n)} is trivial. When the prior is not discrete, my guess is that the complexity term H_\alpha(\Pi_n) should be replaced by a discretization entropy {}H_\alpha(\Pi_n; \varepsilon_n) which is the \alpha-entropy of a discretized version of \Pi_n whose resolution (in the Hellinger sense) is some function of \varepsilon_n.

Read More »

Échantillonnage préférentiel



Le problème est de calculer I(f) = \int f d\lambda,\lambda est une mesure de probabilité sur un espace X et f : X \rightarrow \mathbb{R} est intégrable. Si \{X_n\} est une suite de variables aléatoires indépendantes et distribuées selon \mu, alors on peut approximer I(f) par

I_n(f) = \frac{1}{n}\sum_{i=1}^n f(X_i),

qui est dit un estimateur de Monte-Carlo.
En pratique, il peut être difficile de générer X_n \sim \lambda. On préférera alors introduire une mesure \mu, avec \lambda absolument continue par rapport à \mu, de sorte que

I_n(f;\mu) = \frac{1}{n}\sum_{i=1}^n f(Y_i) \tfrac{d\lambda}{d\mu}(Y_i), \quad Y_i \sim^{ind.} \mu,

soit une estimée de I(f) plus commode à calculer. Cette technique, dite de l’échantillonnage préférentiel, peut aussi servir à améliorer la qualité de l’estimateur I_n par exemple en réduisant sa variance.Read More »

Remark on the asymptotics of the likelihood ratio and the K.-L. divergence

The problem.

Let f, g, h be three densities and suppose that, x_i \sim h, i \in \mathbb{N}, independently. What happens to the likelihood ratio

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)}

as n\rightarrow \infty?

Clearly, it depends. If h = g \not = f, then

\prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0

almost surely at an exponential rate. More generally, if h is closer to g than to f, in some sense, we’d expect that \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0. Such a measure of “closeness” of “divergence” between probability distributions is given by the Kullback-Leibler divergence

D_{KL}(f, g) = \int f \log\frac{f}{g}.

It can be verified that D_{KL}(f,g) \geq 0 with equality if and only if f=g, and that

D_{KL}(h,g) < D_{KL}(h,f) \Longrightarrow \prod_{i=1}^n \frac{f(x_i)}{g(x_i)} \rightarrow 0 \qquad (1)

almost surely at an exponential rate. Thus the K.L.-divergence can be used to solve our problem.

Better measures of divergence?

There are other measures of divergence that can determine the asymptotic behavior of the likelihood ratio as in (1) (e.g. the discrete distance). However, in this note, I give conditions under which the K.-L. divergence is, up to topological equivalence, the “best” measure of divergence.Read More »

Statistical aspects of protein structure prediction

The basics

Amino acids are small molecules of the form


where R is a side chain called the R-group. There are 20 different amino acids found in proteins, each characterized by its R-group.

Peptides and proteins are chains of amino acids. Proteins are long such chains, whereas peptides and polypeptides are shorter ones. The amino acids are linked together by peptide bonds:


Read More »