The choice of prior in Bayesian nonparametrics – Introduction

In preparation for the 11th Bayesian nonparametrics conference, I’m writing (and rewriting) notes on the background of our research (i.e. some of the general theory of Bayesian nonparametrics). There are some good books on the subject (such as Bayesian Nonparametrics (Ghosh and Ramamoorthi, 2003)), but I wanted a more introductory focus and to present Choi and Ramamoorthi’s very clear point of view on posterior consistency (Remarks on the consistency of posterior distributions, 2008).

1. Introduction

Let \mathbb{X} be a complete and separable metric space and let \mathcal{M} be the space of all probability measures on \mathbb{X}. Some unknown distribution P_0\in \mathcal{M} is generating observable data \mathcal{D}_n = (X_1, X_2, \dots, X_n) \in \mathbb{X}^n, where each X_i is independently drawn from P_0. The problem is to learn about P_0 using only \mathcal{D}_n and prior knowledge.

Example (Discovery probabilities).
A cryptographer observes words, following some distribution P_0, in an unknown countable language \mathcal{L}. What are the P_0-probabilities of the words observed thus far? What is the probability that the next word to be observed has never been observed before?
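
To make these questions concrete, here is a small simulation sketch (not part of the original example): “words” are drawn from a hypothetical Zipf-like P_0 standing in for the unknown language, and the probability that the next observation is a new word is estimated by the classical Good–Turing rule, i.e. the proportion of observations corresponding to words seen exactly once. The distribution and its parameters are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: "words" are drawn i.i.d. from a Zipf-like P_0 over a
# countable vocabulary, standing in for the unknown language L.
n = 1000
words = rng.zipf(a=1.5, size=n)

# Empirical estimates of the P_0-probabilities of the words observed so far.
values, counts = np.unique(words, return_counts=True)
empirical_probs = dict(zip(values, counts / n))

# Good-Turing estimate of the probability that the (n+1)-th word is new:
# the proportion of observations that are of words seen exactly once.
prob_new = np.sum(counts == 1) / n
print(f"Estimated probability that the next word is unseen: {prob_new:.3f}")
```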

1.1 Learning and uncertainty

We need a workable definition of learning. As a first approximation, we can take learning to be the reduction of uncertainty about what P_0 is. This requires quantifying how uncertain we are to begin with. Then, hopefully, our uncertainty decreases as data is gathered and we are able to pinpoint P_0.

This is the core of Bayesian learning, although our definition is not yet entirely satisfactory. There are some difficulties with this idea of quantifying uncertainty, at least when using information-theoretic concepts. The solution we adopt here is to use probabilities to quantify uncertain knowledge (Bayesians would also speak of subjective probabilities quantifying rational belief). For example, you may know that a coin is likely to be fair, although it is not impossible that both of its sides are the same. This is uncertain knowledge about the distribution of heads and tails in the coin flips, and you could assign probabilities to the different possibilities.

More formally, prior uncertain knowledge about P_0 is quantified by a probability measure \Pi on \mathcal{M}. For any (measurable) A \subset \mathcal{M}, \Pi(A) is the prior probability that “P_0 \in A”. Then, given data \mathcal{D}_n, prior probabilities are adjusted to posterior probabilities: \Pi becomes \Pi_n, the conditional distribution of \Pi given \mathcal{D}_n. The celebrated Bayes’ theorem provides a formula to compute \Pi_n from \Pi and \mathcal{D}_n. Thus we have an operational definition of learning in our statistical framework.

Learning is rationally adjusting uncertain knowledge in the light of new information.

For explanations as to why probabilities are well suited to the representation of uncertain knowledge, I refer the reader to Pearl (Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988). We will also see that the operation of updating prior to posterior probabilities does work as intended.
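
To see this updating in the simplest possible (finite-dimensional) setting, consider again the coin-flip example: \mathcal{M} is the family of Bernoulli(\theta) distributions, and a Beta prior on \theta plays the role of \Pi. The sketch below (an added illustration, with arbitrary numbers) compares the prior and posterior probabilities of the event A = “the coin is nearly fair”, using Beta–Bernoulli conjugacy so that \Pi_n is again a Beta distribution.

```python
from scipy.stats import beta

# Illustrative coin-flip example: M is the family of Bernoulli(theta)
# distributions, and the prior Pi on M is a Beta(2, 2) distribution on theta.
a0, b0 = 2.0, 2.0

# Event A = "the coin is nearly fair", i.e. theta in [0.45, 0.55].
def prob_A(a, b):
    return beta.cdf(0.55, a, b) - beta.cdf(0.45, a, b)

print(f"Prior probability Pi(A)       = {prob_A(a0, b0):.3f}")

# Observe D_n: say 70 heads in 100 flips. By Beta-Bernoulli conjugacy, the
# posterior Pi_n is Beta(a0 + heads, b0 + tails) -- Bayes' theorem in closed form.
heads, tails = 70, 30
print(f"Posterior probability Pi_n(A) = {prob_A(a0 + heads, b0 + tails):.3f}")
```

With this data the posterior assigns almost no probability to A: the evidence points to a biased coin, and the uncertain knowledge has been adjusted accordingly.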

1.2 The choice of prior

Specifying prior probabilities, that is, quantifying prior uncertain knowledge, is not a simple task. It is especially difficult when the uncertainty is over \mathcal{M}, a non-negligible subset of an infinite-dimensional vector space. Fortunately, “probability is not about numbers, it is about the structure of reasoning”, as Glenn Shafer puts it (cited in Pearl, 1988, p. 15). The exact numbers assigned to the events “P_0 \in A” are not of foremost importance; what matters is how the probabilities are more qualitatively put together, and how this relates to the learning process.

Properties of prior distributions must be identified, opening them to scrutiny, criticism and discussion, and related to what happens as more and more data is gathered.

Part 2.
