See part 1. Most proofs are omitted; I’ll post them with the complete pdf later this week.

# The structure of $\mathcal{M}$

Recall that $\mathbb{M}$ is a Polish space (i.e. a complete and separable metric space). It is endowed with its Borel $\sigma$-algebra $\mathfrak{B}$, which is the smallest family of subsets of $\mathbb{M}$ that contains its topology and that is closed under countable unions and intersections. All subsets of $\mathbb{M}$ we consider in the following are supposed to be part of $\mathfrak{B}$. A probability measure on $\mathbb{M}$ is a function $\mu : \mathfrak{B} \to [0,1]$ such that for any countable partition $A_1, A_2, A_3, \dots$ of $\mathbb{M}$ we have that $\sum_{i=1}^\infty \mu(A_i) = 1$. The set $\mathcal{M}$ consists of all such probability measures.

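Note that since $\mathbb{M}$ is complete and separable, every probability measure $\mu \in \mathcal{M}$ is *regular* (and *tight*). This means that the measure of any $A \in \mathfrak{B}$ can be approximated arbitrarily well by the measure of compact subsets of $A$ as well as by the measure of open supersets of $A$:

$$\mu(A) = \sup\{\mu(K) : K \subset A,\ K \text{ compact}\} = \inf\{\mu(U) : U \supset A,\ U \text{ open}\}.$$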

## Metrics on $\mathcal{M}$

Let me review some facts. A natural metric used to compare the mass allocation of two measures $\mu, \nu \in \mathcal{M}$ is the *total variation distance* defined by

$$\|\mu - \nu\|_{TV} = \sup_{A \in \mathfrak{B}} |\mu(A) - \nu(A)|.$$

It is relatively straightforward to verify that $\mathcal{M}$ is complete under this distance, but it is not in general separable. To see this, suppose that $\mathbb{M} = [0,1]$. If a ball of radius $\varepsilon < 1$ centered at $\mu$ contains a Dirac measure $\delta_x$, $x \in [0,1]$, then $\mu$ must have a point mass at $x$. Yet any measure has at most countably many point masses, and there are uncountably many Dirac measures on $[0,1]$. Thus no countable subset of $\mathcal{M}$ can cover $\mathcal{M}$ up to an $\varepsilon$ of error.
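For finitely supported measures, the supremum in the total variation distance reduces to half the $\ell^1$ distance between the mass functions, so the argument above is easy to check numerically. The sketch below is only an illustration: storing measures as Python dicts from points to masses, and the helper names `total_variation` and `dirac`, are my own assumptions.

```python
def total_variation(mu, nu):
    """Total variation distance sup_A |mu(A) - nu(A)|.

    mu, nu: dicts mapping points to masses (finitely supported measures).
    For such measures the supremum equals half the l1 distance.
    """
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)


def dirac(x):
    """Dirac measure at x."""
    return {x: 1.0}


# Two distinct Dirac measures are at distance 1, however close their atoms.
print(total_variation(dirac(0.5), dirac(0.5 + 1e-9)))

# A measure at distance < 1 from dirac(x) must put mass on x.
mu = {0.2: 0.25, 0.5: 0.75}
print(total_variation(mu, dirac(0.5)))  # mu has an atom at 0.5
print(total_variation(mu, dirac(0.3)))  # mu has no atom at 0.3
```

Only the Dirac measures sitting on an atom of `mu` come closer than distance 1, which is the reason a countable family of balls cannot cover all of them.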

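This distance can be relaxed to the Prokhorov metric, comparing mass allocation up to $\varepsilon$-neighborhoods. It is defined as

$$d_P(\mu, \nu) = \inf\{\varepsilon > 0 : \mu(A) \le \nu(A^\varepsilon) + \varepsilon \text{ for all } A \in \mathfrak{B}\},$$

where $A^\varepsilon = \{x \in \mathbb{M} : d(x, A) < \varepsilon\}$ is the $\varepsilon$-neighborhood of $A$ and $d$ is the metric on $\mathbb{M}$. It is a metrization of the topology of weak convergence of probability measures, and $\mathcal{M}$ is separable under this distance.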
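To see how this relaxes the total variation distance, here is a rough brute-force sketch for finitely supported measures on the real line. It assumes measures stored as dicts from points to masses, uses the one-sided condition $\mu(A) \le \nu(A^\varepsilon) + \varepsilon$ checked over subsets of the support of $\mu$ (which suffices for probability measures), and locates the infimum by bisection. The helper `prokhorov` is hypothetical and its cost is exponential in the support size.

```python
from itertools import combinations


def prokhorov(mu, nu, tol=1e-6):
    """Approximate Prokhorov distance between finitely supported measures on R.

    mu, nu: dicts mapping real points to masses summing to 1.
    """
    pts = list(mu)
    # All non-empty subsets of the support of mu.
    subsets = [c for r in range(1, len(pts) + 1) for c in combinations(pts, r)]

    def ok(eps):
        """Check mu(A) <= nu(A^eps) + eps for every subset A of supp(mu)."""
        for A in subsets:
            mass_A = sum(mu[x] for x in A)
            # nu-mass of the open eps-neighbourhood of A
            mass_nbhd = sum(m for y, m in nu.items()
                            if any(abs(x - y) < eps for x in A))
            if mass_A > mass_nbhd + eps + 1e-12:  # small slack for rounding
                return False
        return True

    # ok is monotone in eps and always holds at eps = 1, so bisect on [0, 1].
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Dirac measures at nearby points are close in Prokhorov distance,
# although their total variation distance is 1.
print(prokhorov({0.0: 1.0}, {0.3: 1.0}))
print(prokhorov({0.0: 0.5, 1.0: 0.5}, {0.1: 0.5, 1.1: 0.5}))
```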

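The compact sets of $\mathcal{M}$ under the Prokhorov metric admit a simple characterization given by the Prokhorov theorem: $P \subset \mathcal{M}$ is precompact if and only if $P$ is uniformly tight (for each $\varepsilon > 0$, there exists a compact $K \subset \mathbb{M}$ such that $\sup_{\mu \in P} \mu(K^c) \le \varepsilon$). This means that every subsequence of $\{\mu_n\} \subset \mathcal{M}$ admits a weakly convergent further subsequence if and only if $\{\mu_n\}$ is uniformly tight.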

Characterizations of weak convergence are given by the Portmanteau theorem, which says in particular that $\mu_n$ converges weakly to $\mu$ if and only if

$$\int f \, d\mu_n \to \int f \, d\mu$$

for all bounded and continuous $f : \mathbb{M} \to \mathbb{R}$. It is also equivalent to

$$\mu_n(A) \to \mu(A)$$

for all sets $A$ such that $\mu(\partial A) = 0$.
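A standard example, sketched here with my own choice of measures: $\mu_n = \delta_{1/n}$ converges weakly to $\mu = \delta_0$. Integrals of bounded continuous functions converge, but $\mu_n(A)$ does not converge to $\mu(A)$ for $A = (-\infty, 0]$, whose boundary $\{0\}$ carries all the mass of $\mu$. The dict representation and the helper names are assumptions of the example.

```python
import math


def integrate(f, mu):
    """Integral of f against a finitely supported measure (dict point -> mass)."""
    return sum(mass * f(x) for x, mass in mu.items())


def measure(indicator, mu):
    """Mass that mu gives to the set described by the indicator function."""
    return sum(mass for x, mass in mu.items() if indicator(x))


def mu_n(n):
    """Dirac measure at 1/n."""
    return {1.0 / n: 1.0}


mu = {0.0: 1.0}  # weak limit: Dirac measure at 0


def in_A(x):
    return x <= 0.0  # A = (-inf, 0], boundary {0} has mu-mass 1


def in_B(x):
    return x <= 0.5  # B = (-inf, 0.5], boundary {0.5} has mu-mass 0


for n in (1, 10, 100, 1000):
    print(n,
          integrate(math.cos, mu_n(n)),  # approaches cos(0) = 1
          measure(in_A, mu_n(n)),        # stays at 0 although mu(A) = 1
          measure(in_B, mu_n(n)))        # equals mu(B) = 1 once 1/n <= 0.5
```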

## Measures of divergence

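In addition to metrics, which admit a geometric interpretation through the triangle inequality, statistical measures of divergence can also be considered. Here, we consider functions $D : \mathcal{M} \times \mathcal{M} \to [0, \infty]$ that can be used to determine the rate of convergence of the likelihood ratio

$$\prod_{i=1}^n \frac{d\nu}{d\mu}(X_i) \to 0,$$

where $X_i \sim \mu$ independently and $\mu, \nu \in \mathcal{M}$.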

### Kullback-Leibler divergence

The *weight of evidence* in favor of the hypothesis “$X \sim \mu$” versus “$X \sim \nu$” given a sample $x$ is defined as

$$W(x) = \log \frac{d\mu}{d\nu}(x).$$

It measures how much information about the hypotheses is brought by the observation of $x$. (For a justification of this interpretation, see Good (*Weight of evidence: a brief survey*, 1985).) The Kullback-Leibler divergence between $\mu$ and $\nu$ is defined as the expected weight of evidence given that $X \sim \mu$:

$$D_{KL}(\mu \| \nu) = \mathbb{E}_{X \sim \mu}[W(X)] = \int \log \frac{d\mu}{d\nu} \, d\mu.$$

The following properties of the Kullback-Leibler divergence support its interpretation as an expected weight of evidence.

**Theorem 1** (Kullback and Leibler, 1951).

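*We have*

*1. $D_{KL}(\mu \| \nu) \ge 0$, with equality if and only if $\mu = \nu$;*

*2. $D_{KL}(\mu T^{-1} \| \nu T^{-1}) \le D_{KL}(\mu \| \nu)$ for any measurable map $T$, with equality if and only if $T$ is a sufficient statistic for $\{\mu, \nu\}$.*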
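Both properties can be checked by hand on finite sample spaces. The sketch below assumes discrete measures stored as dicts and a statistic $T$ that merges points; `kl` and `pushforward` are hypothetical helper names, and the measures are arbitrary examples of mine.

```python
import math


def kl(mu, nu):
    """KL divergence sum_x mu(x) log(mu(x) / nu(x)) for discrete measures."""
    total = 0.0
    for x, p in mu.items():
        if p == 0.0:
            continue
        q = nu.get(x, 0.0)
        if q == 0.0:
            return math.inf  # mu is not absolutely continuous w.r.t. nu
        total += p * math.log(p / q)
    return total


def pushforward(mu, T):
    """Distribution of T(X) when X ~ mu."""
    out = {}
    for x, p in mu.items():
        out[T(x)] = out.get(T(x), 0.0) + p
    return out


def T(x):
    return x // 2  # merges {0, 1} and {2, 3}


nu = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

# Here T loses information: the ratio mu/nu varies inside the merged cells.
mu = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
print(kl(mu, nu), kl(pushforward(mu, T), pushforward(nu, T)))

# Here T is sufficient: the ratio mu2/nu is constant on each merged cell.
mu2 = {0: 0.1, 1: 0.1, 2: 0.4, 3: 0.4}
print(kl(mu2, nu), kl(pushforward(mu2, T), pushforward(nu, T)))
```

In the first case the divergence drops after applying $T$, and in the second it is preserved up to rounding.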

Furthermore, the KL divergence can be used to precisely identify exponential rates of convergence of the likelihood ratio. The first part of the next proposition says that $D_{KL}(\mu \| \nu)$ is finite if and only if the likelihood ratio $\prod_{i=1}^n \frac{d\nu}{d\mu}(X_i)$, $X_i \sim \mu$, cannot converge super-exponentially fast towards $0$. The second part identifies the rate of convergence when the KL divergence is finite.

**Proposition 2.**

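*Let $X_1, X_2, X_3, \dots \sim \mu$ (independently). The KL divergence $D_{KL}(\mu \| \nu)$ is finite if and only if there exists an $\alpha > 0$ such that*

$$e^{n\alpha} \prod_{i=1}^n \frac{d\nu}{d\mu}(X_i) \to \infty$$

*with positive probability.*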
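When the KL divergence is finite, the strong law of large numbers gives the exponential rate: with $X_i \sim \mu$, the average $\frac{1}{n} \log \prod_{i=1}^n \frac{d\nu}{d\mu}(X_i)$ converges almost surely to $-D_{KL}(\mu \| \nu)$. A small simulation sketch, assuming Bernoulli measures of my own choosing and hypothetical helper names:

```python
import math
import random


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def mean_log_likelihood_ratio(p, q, n, seed=0):
    """(1/n) log prod_i nu(X_i)/mu(X_i), X_i ~ mu = Bernoulli(p), nu = Bernoulli(q)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.random() < p  # one draw from Bernoulli(p)
        total += math.log(q / p) if x else math.log((1 - q) / (1 - p))
    return total / n


p, q = 0.5, 0.3
print("minus KL divergence:", -kl_bernoulli(p, q))
for n in (100, 1000, 20000):
    print(n, mean_log_likelihood_ratio(p, q, n))
```

The printed averages should settle near the first line, so the likelihood ratio decays roughly like $e^{-n D_{KL}(\mu \| \nu)}$ in this example.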

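Finally, suppose we are dealing with a submodel $\mathcal{F} \subset \mathcal{M}$ such that the rates of convergence of the likelihood ratios in $\mathcal{F}$ are of an exponential order. By the previous proposition, this is equivalent to the fact that $D_{KL}(\mu \| \nu) < \infty$, $\forall \mu, \nu \in \mathcal{F}$. We can show that the KL divergence is, up to topological equivalence, the best measure of divergence that determines the convergence of the likelihood ratio. That is, suppose $D : \mathcal{F} \times \mathcal{F} \to [0, \infty]$ is such that

$$D(\mu, \nu_1) < D(\mu, \nu_2) \implies \prod_{i=1}^n \frac{d\nu_2}{d\nu_1}(X_i) \to 0$$

at an exponential rate, almost surely when $X_i \sim \mu$, and that $D(\mu, \nu) = 0$ if and only if $\mu = \nu$. Then, the topology induced by $D$ is coarser than the topology induced by $D_{KL}$.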

**Proposition 3.**

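Let $D$ be as above and let $\mathcal{F} \subset \mathcal{M}$ be such that $D_{KL}(\mu \| \nu) < \infty$, $\forall \mu, \nu \in \mathcal{F}$. Then, the topology on $\mathcal{F}$ induced by $D$ is weaker than the topology induced by $D_{KL}$. More precisely, we have that

$$D(\mu, \nu_1) < D(\mu, \nu_2) \implies D_{KL}(\mu \| \nu_1) < D_{KL}(\mu \| \nu_2).$$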

### $\alpha$-affinity and $\alpha$-divergence

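We define the $\alpha$-affinity between two probability measures as the expectation of another transform of the likelihood ratio. Let $\mu, \nu$ be two probability measures dominated by $\lambda$, with $d\mu = f \, d\lambda$ and $d\nu = g \, d\lambda$. Given $0 < \alpha < 1$, the $\alpha$-affinity between $\mu$ and $\nu$ is

$$A_\alpha(\mu, \nu) = \int g^\alpha f^{1-\alpha} \, d\lambda.$$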

**Proposition 4.**

*For all $\alpha \in (0, 1)$, we have that*

*1. $A_\alpha(\mu, \nu) \le 1$, with equality if and only if $\mu = \nu$;*

*2. is monotonous in and jointly concave in its arguments;*

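*3. $A_\alpha$ is jointly multiplicative under products:*

$$A_\alpha(\mu_1 \otimes \mu_2, \nu_1 \otimes \nu_2) = A_\alpha(\mu_1, \nu_1) \, A_\alpha(\mu_2, \nu_2);$$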

*4. if , then*

*Proof.*

1-2 follow from Jensen’s inequality and the joint concavity of $(x, y) \mapsto x^\alpha y^{1-\alpha}$. 3 follows from Fubini’s theorem. For 4, the first inequality is a particular case of 2 and Hölder’s inequality finally yields

The $\alpha$-divergence is obtained as

$$D_\alpha(\mu, \nu) = 1 - A_\alpha(\mu, \nu).$$

Other similar divergences considered in the literature are

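but we prefer $D_\alpha = 1 - A_\alpha$ for its simplicity. When $\alpha = 1/2$, it is closely related to the Hellinger distance

$$H(\mu, \nu) = \left( \int \left( \sqrt{f} - \sqrt{g} \right)^2 d\lambda \right)^{1/2}$$

through

$$H(\mu, \nu)^2 = 2 \, D_{1/2}(\mu, \nu).$$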

Other important and well-known inequalities are given below.

**Proposition 5.**

*We have*

*and*

This, together with Proposition 4 (4), yields similar bounds for the other divergences.
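Some of the facts in this section are easy to check numerically on a finite sample space. The sketch below assumes the convention $A_\alpha(\mu, \nu) = \sum_x \nu(x)^\alpha \mu(x)^{1-\alpha}$ for discrete measures and the unnormalized Hellinger distance $H^2 = \sum_x (\sqrt{\mu(x)} - \sqrt{\nu(x)})^2$. Other normalizations exist, and the helper names and measures are my own.

```python
import math


def affinity(mu, nu, alpha):
    """alpha-affinity sum_x nu(x)^alpha * mu(x)^(1 - alpha), 0 < alpha < 1."""
    return sum(nu.get(x, 0.0) ** alpha * p ** (1 - alpha) for x, p in mu.items())


def hellinger_sq(mu, nu):
    """Squared Hellinger distance sum_x (sqrt(mu(x)) - sqrt(nu(x)))^2."""
    support = set(mu) | set(nu)
    return sum((math.sqrt(mu.get(x, 0.0)) - math.sqrt(nu.get(x, 0.0))) ** 2
               for x in support)


def product(mu1, mu2):
    """Product measure on pairs (x, y)."""
    return {(x, y): p * q for x, p in mu1.items() for y, q in mu2.items()}


mu = {0: 0.2, 1: 0.5, 2: 0.3}
nu = {0: 0.4, 1: 0.4, 2: 0.2}

for alpha in (0.25, 0.5, 0.75):
    # The affinity is at most 1, and equals 1 when both measures coincide.
    print(alpha, affinity(mu, nu, alpha), affinity(mu, mu, alpha))

# Multiplicativity under products.
print(affinity(product(mu, mu), product(nu, nu), 0.3), affinity(mu, nu, 0.3) ** 2)

# Relation with the Hellinger distance at alpha = 1/2.
print(hellinger_sq(mu, nu), 2 * (1 - affinity(mu, nu, 0.5)))
```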

## Finite models

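Let $\Pi$ be a prior on $\mathcal{M}$ that is finitely supported. That is, $\Pi = \sum_{i=1}^k p_i \delta_{\mu_i}$ for some $\mu_i \in \mathcal{M}$ and $p_i > 0$ with $\sum_{i=1}^k p_i = 1$. Suppose that $X_1, X_2, X_3, \dots$ independently follow some $\mu_* \in \mathcal{M}$.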

The following proposition ensures that as data is gathered, the posterior distribution of $\Pi$ concentrates on the measures $\mu_i$ that are closest to $\mu_*$.

**Proposition 6.**

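*Let $A_\varepsilon = \{\mu_i : D_{KL}(\mu_* \| \mu_i) < \varepsilon\}$. If $A_\varepsilon \neq \emptyset$, then*

$$\Pi(A_\varepsilon \mid X_1, \dots, X_n) \to 1$$

*almost surely as $n \to \infty$.*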
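Here is a small simulation of this concentration phenomenon. It is a sketch under assumptions of my own: the model consists of four Bernoulli measures with a uniform prior, the data-generating measure is Bernoulli(0.35) and lies outside the model, and closeness is measured by the KL divergence from the data-generating measure, which makes Bernoulli(0.3) the closest model element. The helper `posterior` is hypothetical.

```python
import math
import random


def posterior(prior, models, data):
    """Posterior weights over finitely many Bernoulli models.

    prior: prior weights p_i; models: success probabilities theta_i;
    data: list of boolean observations.
    """
    log_w = []
    for p_i, theta in zip(prior, models):
        loglik = sum(math.log(theta) if x else math.log(1 - theta) for x in data)
        log_w.append(math.log(p_i) + loglik)
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    s = sum(w)
    return [v / s for v in w]


rng = random.Random(0)
true_theta = 0.35              # data-generating measure, not in the model
models = [0.1, 0.3, 0.6, 0.9]  # finite support of the prior
prior = [0.25] * 4
data = [rng.random() < true_theta for _ in range(2000)]

for n in (10, 100, 2000):
    print(n, [round(w, 3) for w in posterior(prior, models, data[:n])])
```

As `n` grows, the posterior weight of the second model should approach 1.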
