## Sunday, March 22, 2015

### What Are the Units In Bayes Rule?

I often better understand statistics by drawing physical analogs. Expected value is the balancing of a distribution on a pin, skewness is a misloaded washing machine banging away, and so on. Sometimes the analogs come easily, and other times I have to work at it. I find that I tend to look for the units of things.

Surprisingly, consideration of units is not too common. Let me give you an example. Suppose I am modeling the mass of Brazil nuts (in grams) as a normal with some mean and variance. Most analysts know immediately that the units of the mean and variance parameters are $$gm$$ and $$gm^2$$, respectively. How about the units for the likelihood function for 10 observations? For any hypothetical value of the mean and variance, the evaluation of the likelihood is a number, and it has physical units. I'll get to them later.

Considering the units of Bayes Rule has affected how I view what it means, and describing this effect is the point of this blog post. Let's start with the units of density, progress to the units of likelihood, and finally to Bayes Rule itself.

# The Units of Density

The figure shows the typical normal curve, and I am using it here to describe the mass of Brazil nuts. Most of us know there are numbers on the y-axis even though we hardly ever give them any thought. What units are they in? We start with probability, and note that the area under the curve on an interval is the probability that an observation will occur in that interval. For example, the shaded area between 1.15 gm and 1.25 gm is .121, and, consequently, the probability that the next observation will fall between 1.15 and 1.25 is .121. We can approximate this area, $$A$$, for a small interval between $$x-\Delta/2$$ and $$x+\Delta/2$$ as:
$A=f(x) \times \Delta.$
The figure shows the construction for $$x=1.2$$ and $$\Delta=.1$$. Now let's look at the units. The area $$A$$ is a probability, so it is a pure number without units. The interval $$\Delta$$ is an interval on mass in grams, so it has units $$gm$$. To balance the units, the remaining term, $$f(x)$$, must be in units of $$1/gm$$. That is, the density is the rate of change in the probability measure per gram. No wonder the curve is called a probability density function! It is just like a physical density: a change in (probability) mass per unit of the variable. In general, we may write the probability that an observation $$X$$ is in a small interval around $$x$$ as
$Pr(X \in [x-\Delta/2,x+\Delta/2]) = f(x)\Delta.$
Here the units of $$f(x)$$ (in $$gm^{-1}$$) and of $$\Delta$$ (in $$gm$$) cancel out as they should.
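
To see the approximation at work, here is a minimal sketch using Python's standard library. The mean (1.2 gm) and standard deviation (0.3 gm) are hypothetical values picked for illustration; they are not the parameters behind the figure. The sketch compares the exact interval probability with $$f(x)\Delta$$ as $$\Delta$$ shrinks.

```python
from statistics import NormalDist

# Hypothetical parameters (not from the post): mean 1.2 gm, sd 0.3 gm.
nut_mass = NormalDist(mu=1.2, sigma=0.3)

def interval_probability(dist, x, delta):
    """Exact probability (a pure number) of [x - delta/2, x + delta/2]."""
    return dist.cdf(x + delta / 2) - dist.cdf(x - delta / 2)

def density_times_width(dist, x, delta):
    """Approximation f(x) * delta: (1/gm) * gm, so also a pure number."""
    return dist.pdf(x) * delta

x = 1.2
for delta in (0.1, 0.01, 0.001):
    exact = interval_probability(nut_mass, x, delta)
    approx = density_times_width(nut_mass, x, delta)
    print(f"delta={delta}: exact={exact:.6f}, f(x)*delta={approx:.6f}")
```

The two columns should agree more closely as the interval narrows, which is what lets us treat $$f(x)\Delta$$ as a probability for small $$\Delta$$.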

All density functions are in units of the reciprocal of the measurement unit. For example, the normal density is
$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$
The unit is determined by the $$1/\sigma$$ in the leading term; the exponent is a ratio of squared masses and so has no units. Because $$\sigma$$ is in the units of $$x$$, $$1/\sigma$$ is in units of $$1/x$$. A valid density is always in reciprocal measurement units.
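
One way to convince yourself that density values carry units is to change the units and watch the numbers change. The sketch below, again with hypothetical parameters of my choosing, evaluates the same normal density once in grams and once in milligrams. Because $$1/gm = 1/(1000\,mg)$$, the two values should differ by a factor of about 1000.

```python
from statistics import NormalDist

MG_PER_GM = 1000.0

# Hypothetical parameters (not from the post), expressed in each unit.
in_grams = NormalDist(mu=1.2, sigma=0.3)
in_milligrams = NormalDist(mu=1.2 * MG_PER_GM, sigma=0.3 * MG_PER_GM)

x_gm = 1.0
f_gm = in_grams.pdf(x_gm)                    # units: 1/gm
f_mg = in_milligrams.pdf(x_gm * MG_PER_GM)   # units: 1/mg

print(f_gm, f_mg, f_gm / f_mg)  # the ratio is about 1000
```

A probability, by contrast, would come out the same in either unit, since the change in the density is offset by the change in the interval width.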

# The Probability of An Observation is Zero

One of the questions we can ask about our density is the probability that a Brazil nut weighs a specific value, say 1 gram. The answer, perhaps somewhat surprisingly, is 0. No Brazil nut weighs exactly 1.0000… grams to infinite precision. The area under a point is zero. This is why we always ask about probabilities that observations are in intervals rather than at points.

# The Unit of Likelihood Functions

Let's look at the likelihood function for two parameters, $$\mu$$ and $$\sigma^2$$, and for two observations, $$x_1$$ and $$x_2$$. What are the units of the likelihood? The likelihood is the joint density of the observations, which may be denoted $$f(x_1,x_2;\mu,\sigma^2)$$, treated as a function of the parameters. Now this density describes how probability mass varies as a function of both $$x_1$$ and $$x_2$$; that is, it lives on a plane. We may write
$Pr(X_1 \in [x_1-\Delta/2,x_1+\Delta/2] \wedge X_2 \in [x_2-\Delta/2,x_2+\Delta/2]) = f(x_1,x_2)\Delta^2.$
It is clear from here that the units of $$\Delta^2$$ are in $$gm^2$$, and, hence, the units of $$f(x_1,x_2)$$ are in $$gm^{-2}$$. If we had $$n$$ observations, the unit of the joint density among them is $$gm^{-n}$$.

The likelihood is just the joint density repeatedly evaluated at a point for different parameter values. The evaluation, though, is that of a density, and the units are those of the density. Hence, the likelihood function's units are the reciprocal of those of the data taken jointly. If the mass of 10 Brazil nuts is observed in grams, the likelihood function has units $$gm^{-10}$$.
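
The $$gm^{-n}$$ claim can be checked the same way. The sketch below uses ten made-up masses and hypothetical parameter values, and it assumes the observations are independent so that the joint density is a product. Re-expressing everything in milligrams should shrink the likelihood by a factor of about $$1000^{10}$$.

```python
import math
from statistics import NormalDist

MG_PER_GM = 1000.0

# Made-up masses of 10 Brazil nuts in grams (illustrative only).
masses_gm = [1.05, 1.31, 0.98, 1.22, 1.40, 1.12, 1.27, 0.91, 1.18, 1.35]

def likelihood(data, mu, sigma):
    """Joint density of independent normal observations at (mu, sigma)."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return math.prod(dist.pdf(x) for x in data)

mu_gm, sigma_gm = 1.2, 0.3  # hypothetical parameter values, in gm

lik_gm = likelihood(masses_gm, mu_gm, sigma_gm)  # units: gm^-10
masses_mg = [x * MG_PER_GM for x in masses_gm]
lik_mg = likelihood(masses_mg, mu_gm * MG_PER_GM, sigma_gm * MG_PER_GM)  # mg^-10

n = len(masses_gm)
print(lik_gm / lik_mg)  # close to 1000**n
```

The size of that factor is a reminder that a raw likelihood value means little on its own; it only becomes interpretable once it is compared with another quantity in the same units.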

# The Units of Bayes Rule

Bayes Rule is the application of the Law of Conditional Probability to parameters in a statistical model. The Law of Conditional Probability is that
$\Pr(A|B)=\frac{\Pr(B|A)\Pr(A)}{\Pr(B)}$
There are no units to worry about here as all numbers are probabilities without units.

The application for a statistical model goes as follows: Let $$y=y_1,y_2,\ldots,y_n$$ be a sequence of $$n$$ observations that are modeled as $$Y=Y_1,Y_2,\ldots,Y_n$$, a sequence of $$n$$ random variables. Let $$\Theta=\Theta_1,\Theta_2,\ldots,\Theta_p$$ be a sequence of $$p$$ parameters, which, because we are Bayesian, are also random variables. Bayes Rule tells us how to update beliefs about a particular parameter value $$\theta$$, and it is often written as:
$\Pr(\theta|y) = \frac{\Pr(y|\theta)\Pr(\theta)}{\Pr(y)}$
which is shorthand for the more obtuse
$\Pr(\Theta=\theta|Y=y) = \frac{\Pr(Y=y|\Theta=\theta)\Pr(\Theta=\theta)}{\Pr(Y=y)}.$

At first glance, this equation is not only obtuse, but makes no sense. If $$Y$$ and $$\Theta$$ are continuous, then all the probabilities are identically zero. Bayes rule is then $$0=(0\times 0)/0$$, which is not very helpful.

The problem is that Bayes rule is usually written with hard-to-understand shorthand. The equation is not really about probabilities of the random quantities at set points, but about how random quantities fall into little intervals around the points. For example, $$Pr(\Theta_1=\theta_1)$$ is horrible shorthand for $$Pr(\Theta_1 \in [\theta_1-\Delta_{\theta_1}/2,\theta_1+\Delta_{\theta_1}/2])$$. Fortunately, it may be written as $$f(\theta_1)\Delta_{\theta_1}$$. The same holds for the joint probabilities as well. For example, $$Pr(\Theta=\theta)$$ is shorthand for $$f(\theta)\Delta_\theta$$. In the Brazil-nut example, let $$\theta_1$$ and $$\theta_2$$ be the mean and variance parameters in $$gm$$ and $$gm^2$$, respectively. Then $$\Delta_\theta$$ is in units of $$gm\times gm^2$$ (or $$gm^3$$) and $$f(\theta)$$ is in units of $$1/gm^3$$.

With this notation, we may rewrite Bayes Rule as
$f(\theta|y)\Delta_\theta = \frac{f(y|\theta)\Delta_y f(\theta)\Delta_\theta}{f(y)\Delta_y}.$
The units of $$\Delta_\theta$$ are $$gm^3$$; the units of $$\Delta_y$$ are $$gm^n$$; the units of $$f(y)$$ and $$f(y|\theta)$$ are in $$gm^{-n}$$; the units of $$f(\theta)$$ and $$f(\theta|y)$$ are in $$gm^{-3}$$. Everything is in balance as it must be.

Of course, we can cancel out the $$\Delta$$ terms yielding the following form of Bayes Rule for continuous parameters and data:
$f(\theta|y) = \frac{f(y|\theta)f(\theta)}{f(y)}.$

Bayes Rule describes a conservation of units, and that conservation becomes more obvious when each side is unit free. Let's move the terms around so that Bayes Rule is expressed as
$\frac{f(\theta|y)}{f(\theta)} = \frac{f(y|\theta)}{f(y)}$
The left side describes how beliefs should change about a parameter value $$\theta$$ in light of observations $$y$$. The right side describes how much better or worse a parameter value $$\theta$$ predicts the data than the prediction averaged over all parameter values. The symmetry here is obvious and beautiful. It is my go-to explanation of Bayes Rule, and I am going to write more about it in future blog posts.
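
The unit-free form can be checked numerically. The sketch below is a simplification rather than the two-parameter Brazil-nut model: it assumes the standard deviation is known, places a normal prior on the mean alone, and uses made-up data and prior settings. The posterior comes from the standard conjugate result for a normal mean, while $$f(y)$$ is computed separately by a midpoint sum over $$\theta$$, so the two sides are obtained independently.

```python
import math
from statistics import NormalDist

# Made-up data and settings (not from the post); sigma is treated as known.
y = [1.05, 1.31, 0.98, 1.22, 1.40]      # gm
sigma = 0.3                              # gm
prior = NormalDist(mu=1.0, sigma=0.5)    # prior on the mean theta, in gm

def likelihood(theta):
    """f(y | theta): joint density of the data, units gm^-n."""
    dist = NormalDist(mu=theta, sigma=sigma)
    return math.prod(dist.pdf(obs) for obs in y)

# Conjugate posterior for a normal mean with known sigma.
n = len(y)
post_precision = 1 / prior.stdev**2 + n / sigma**2
post_mean = (prior.mean / prior.stdev**2 + sum(y) / sigma**2) / post_precision
posterior = NormalDist(mu=post_mean, sigma=post_precision**-0.5)

# Marginal f(y): integrate f(y|theta) f(theta) over theta with a midpoint sum.
lo, hi, steps = 0.0, 2.5, 5000
width = (hi - lo) / steps
midpoints = (lo + (i + 0.5) * width for i in range(steps))
marginal = sum(likelihood(t) * prior.pdf(t) * width for t in midpoints)

theta = 1.15
left = posterior.pdf(theta) / prior.pdf(theta)   # (1/gm) / (1/gm)
right = likelihood(theta) / marginal             # gm^-n / gm^-n
print(left, right)
```

Both printed numbers are pure ratios and should agree up to integration error. A value above 1 says the data raised the plausibility of that $$\theta$$; a value below 1 says they lowered it.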

Sometimes, Bayes Rule is written as
$f(\theta|y) \propto f(y|\theta) f(\theta)$
$\mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior}.$
Jackman (2009) and Rouder and Lu (2005) take this approach in their explanations of Bayes Rule. For example, Jackman's Figure 1.2, the one that depicts updating, is not only without units, it is without a vertical axis altogether! When we don't track units, the role of $$f(y)$$ becomes hidden, and the above conservation is missed. My own view is that while the proportional expression of Bayes Rule is exceedingly useful for computations, it precludes the deepest understanding of Bayes Rule.