Sunday, March 22, 2015

What Are the Units In Bayes Rule?

I often better understand statistics by drawing physical analogs. Expected value is the balancing of a distribution on a pin, skewness is a misloaded washing machine banging away, and so on. Sometimes the analogs come easily, and other times I have to work at it. I find that I tend to look for the units of things.

Surprisingly, consideration of units is not too common. Let me give you an example. Suppose I am modeling the mass of Brazil nuts (in grams) as a normal with some mean and variance. Most analysts know immediately that the units of the mean and variance parameters are \(gm\) and \(gm^2\), respectively. How about the units for the likelihood function for 10 observations? For any hypothetical value of the mean and variance, the evaluation of the likelihood is a number, and it has physical units. I'll get to them later.

Considering the units of Bayes Rule has affected how I view what it means, and describing this effect is the point of this blog post. Let's start with the units of density, progress to the units of likelihood, and finally to Bayes Rule itself.

The Units of Density

The figure shows the typical normal curve, and I am using it here to describe the mass of Brazil nuts. Most of us know there are numbers on the y-axis even though we hardly ever give them any thought. What units are they in? We start with probability, and note that the area under the curve on an interval is the probability that an observation will occur in that interval. For example, the shaded area between 1.15 gm and 1.25 gm is .121, and, consequently, the probability that the next observation will fall between 1.15 and 1.25 is .121. We can approximate this area, \(A\), for a small interval between \(x-\Delta/2\) and \(x+\Delta/2\) as:
A=f(x) \times \Delta.
The figure shows the construction for \(x=1.2\) and \(\Delta=.1\). Now let's look at the units. The area \(A\) is a probability, so it is a pure number without units. The interval \(\Delta\) is an interval on mass in grams, so it has units \(gm\). To balance the units, the remaining term, \(f(x)\) must be in units of \(1/gm\). That is, the density is the rate of change in the probability measure per gram. No wonder the curve is called a probability density function! It is just like a physical density—a change in (probability) mass per unit of the variable.

In general, we may write the probability that an observation \(X\) is in a small interval around \(x\) as
Pr(X \in [x-\Delta/2,x+\Delta/2]) = f(x)\Delta.
Here the units of \(f(x)\) (in \(gm^{-1}\)) and of \(\Delta\) (in \(gm\)) cancel out as they should.
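This approximation is easy to check numerically. Here is a minimal Python sketch; the mean of 1.2 gm and standard deviation of 0.33 gm are hypothetical values chosen for illustration, not taken from the figure:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density; the leading 1/sigma carries the 1/gm units."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Normal cumulative probability via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 1.2, 0.33   # hypothetical Brazil-nut parameters, in gm
x, delta = 1.2, 0.1     # interval of width 0.1 gm centered at 1.2 gm

# exact area under the curve vs. the f(x) * Delta approximation
exact = normal_cdf(x + delta / 2, mu, sigma) - normal_cdf(x - delta / 2, mu, sigma)
approx = normal_pdf(x, mu, sigma) * delta   # (1/gm) * gm -> a unitless probability
print(round(exact, 4), round(approx, 4))    # the two agree to about three decimals
```

The density evaluation carries units of \(1/gm\); multiplying by the interval width in \(gm\) is exactly what turns it into a unitless probability.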

All density functions are in units of the reciprocal of the measurement unit. For example, the normal density is
f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
The unit is determined by the \(1/\sigma\) in the leading term. And where \(\sigma\) is in units of \(x\), \(1/\sigma\) is in units of \(1/x\). A valid density is always in reciprocal measurement units.

The Probability of An Observation is Zero

One of the questions we can ask about our density is the probability that a Brazil nut weighs a specific value, say 1 gram. The answer, perhaps somewhat surprisingly, is 0. No Brazil nut weighs exactly 1.0000… grams to infinite precision. The area under a point is zero. This is why we always ask about probabilities that observations are in intervals rather than at points.

The Unit of Likelihood Functions

Let's look at the likelihood function for two parameters, \(\mu\) and \(\sigma^2\), and for two observations, \(x_1\) and \(x_2\). What are the units of the likelihood? The likelihood is the joint density of the observations, which may be denoted \(f(x_1,x_2;\mu,\sigma^2)\), treated as a function of the parameters. Now this density describes how probability mass varies as a function of both \(x_1\) and \(x_2\); that is, it lives on a plane. We may write
Pr(X_1 \in [x_1-\Delta/2,x_1+\Delta/2] \wedge X_2 \in [x_2-\Delta/2,x_2+\Delta/2]) = f(x_1,x_2)\Delta^2.
It is clear from here that the units of \(\Delta^2\) are in \(gm^2\), and, hence, the units of \(f(x_1,x_2)\) are in \(gm^{-2}\). If we had \(n\) observations, the unit of the joint density among them is \(gm^{-n}\).

The likelihood is just the joint density repeatedly evaluated at a point, the observed data, for different parameter values. The evaluation, though, is that of a density, and the units are those of the density. Hence, the likelihood function's units are the reciprocal of those of the data taken jointly. If the masses of 10 Brazil nuts are observed in grams, the likelihood function has units \(gm^{-10}\).
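The reciprocal units have a concrete consequence: express the same data in kilograms rather than grams and every density picks up a factor of 1000, so the likelihood for \(n\) observations picks up a factor of \(1000^n\). A quick sketch with ten hypothetical masses (any ten would do):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """Joint density of independent observations; units are (data unit)^-n."""
    value = 1.0
    for x in data:
        value *= normal_pdf(x, mu, sigma)
    return value

grams = [1.1, 1.3, 1.2, 1.25, 1.15, 1.05, 1.35, 1.2, 1.1, 1.3]  # hypothetical masses
kilos = [g / 1000 for g in grams]

like_gm = likelihood(grams, 1.2, 0.1)                 # units of gm^-10
like_kg = likelihood(kilos, 1.2 / 1000, 0.1 / 1000)   # same model, units of kg^-10
print(like_kg / like_gm)   # the ratio is the pure unit-conversion factor, 1000**10
```

The likelihood values differ by thirty orders of magnitude, yet nothing about the data or the model has changed; only the units have.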

The Units of Bayes Rule

Bayes Rule is the application of the Law of Conditional Probability to parameters in a statistical model. The Law of Conditional Probability is that
\Pr(A|B) = \frac{\Pr(B|A)\Pr(A)}{\Pr(B)}.
There are no units to worry about here as all numbers are probabilities without units.

The application for a statistical model goes as follows: Let \(y=y_1,y_2,\ldots,y_n\) be a sequence of \(n\) observations that are modeled as \(Y=Y_1,Y_2,\ldots,Y_n\), a sequence of \(n\) random variables. Let \(\Theta=\Theta_1,\Theta_2,\ldots,\Theta_p\) be a sequence of \(p\) parameters, which, because we are Bayesian, are also random variables. Bayes Rule tells us how to update beliefs about a particular parameter value \(\theta\), and it is often written as:
\Pr(\theta|y) = \frac{\Pr(y|\theta)\Pr(\theta)}{\Pr(y)}
which is shorthand for the more obtuse
\Pr(\Theta=\theta|Y=y) = \frac{\Pr(Y=y|\Theta=\theta)\Pr(\Theta=\theta)}{\Pr(Y=y)}.

At first glance, this equation is not only obtuse, but makes no sense. If \(Y\) and \(\Theta\) are continuous, then all the probabilities are identically zero. Bayes rule is then \(0=(0\times 0)/0\), which is not very helpful.

The problem is that Bayes rule is usually written with hard-to-understand shorthand. The equation is not really about probabilities of the random quantities at set points, but about how random quantities fall into little intervals around the points. For example, \(Pr(\Theta_1=\theta_1)\) is horrible shorthand for \(Pr(\Theta_1 \in [\theta_1-\Delta_{\theta_1}/2,\theta_1+\Delta_{\theta_1}/2])\). Fortunately, it may be written as \(f(\theta_1)\Delta_{\theta_1}\). The same holds for the joint probabilities as well. For example, \(Pr(\Theta=\theta)\) is shorthand for \(f(\theta)\Delta_\theta\). In the Brazil-nut example, let \(\theta_1\) and \(\theta_2\) be the mean and variance parameters in \(gm\) and \(gm^2\), respectively. Then \(\Delta_\theta\) is in units of \(gm\times gm^2\) (or \(gm^3\)) and \(f(\theta)\) is in units of \(1/gm^3\).

With this notation, we may rewrite Bayes Rule as
f(\theta|y)\Delta_\theta = \frac{f(y|\theta)\Delta_y f(\theta)\Delta_\theta}{f(y)\Delta_y}.
The units of \(\Delta_\theta\) are \(gm^3\); the units of \(\Delta_y\) are \(gm^n\); the units of \(f(y)\) and \(f(y|\theta)\) are in \(gm^{-n}\); the units of \(f(\theta)\) and \(f(\theta|y)\) are in \(gm^{-3}\). Everything is in balance as it must be.

Of course, we can cancel out the \(\Delta\) terms yielding the following form of Bayes Rule for continuous parameters and data:
f(\theta|y) = \frac{f(y|\theta)f(\theta)}{f(y)}.

Bayes Rule describes a conservation of units, and that conservation becomes more obvious when each side is unit free. Let's move the terms around so that Bayes Rule is expressed as
\frac{f(\theta|y)}{f(\theta)} = \frac{f(y|\theta)}{f(y)}
The left side describes how beliefs about a parameter value \(\theta\) should change in light of observations \(y\). The right side describes how well the parameter value \(\theta\) predicts the data relative to the average across all parameter values. The symmetry here is obvious and beautiful. It is my go-to explanation of Bayes Rule, and I am going to write more about it in future blog posts.
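This symmetric form can be checked numerically. In the sketch below (all values hypothetical), a single Brazil-nut mass is observed, \(\sigma\) is known, and the mean gets a normal prior; the posterior is computed on a grid and the two ratios are compared point by point:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

y, sigma = 1.25, 0.1           # one observation (gm) and a known sd (gm)
prior_mu, prior_sd = 1.2, 0.2  # normal prior on the mean, theta

d = 0.001                                   # grid spacing, Delta_theta, in gm
thetas = [0.8 + i * d for i in range(801)]  # grid over theta from 0.8 to 1.6 gm

prior = [normal_pdf(t, prior_mu, prior_sd) for t in thetas]  # f(theta), in 1/gm
like = [normal_pdf(y, t, sigma) for t in thetas]             # f(y|theta), in 1/gm

f_y = sum(l * p * d for l, p in zip(like, prior))            # marginal f(y), in 1/gm
post = [l * p / f_y for l, p in zip(like, prior)]            # f(theta|y), in 1/gm

# f(theta|y)/f(theta) equals f(y|theta)/f(y) at every grid point
assert all(abs(po / pr - l / f_y) < 1e-9 for po, pr, l in zip(post, prior, like))
print(round(sum(po * d for po in post), 3))  # posterior integrates to 1.0
```

Both ratios are unit free: \(1/gm\) over \(1/gm\) on each side, which is the conservation at work.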

Sometimes, Bayes Rule is written as
f(\theta|y) \propto f(y|\theta) f(\theta)
or, more informally:
\mbox{Posterior} \propto \mbox{Likelihood} \times \mbox{Prior}.
Jackman (2009) and Rouder and Lu (2005) take this approach in their explanations of Bayes Rule. For example, Jackman's Figure 1.2, the one that depicts updating, is not only without units, it is without a vertical axis altogether! When we don't track units, the role of \(f(y)\) becomes hidden, and the above conservation is missed. My own view is that while the proportional expression of Bayes Rule is exceedingly useful for computations, it precludes the deepest understanding of Bayes Rule.

Saturday, March 14, 2015

Estimating Effect Sizes Requires Some Thought

The small effect sizes observed in the Many Labs replication project have me thinking….

Researchers think that effect size is a basic, important, natural measure for summarizing experimental research. The approach goes by “estimation” because the measured sample effect size is an estimate of the true or population effect size. Estimation seems pretty straightforward: the effect-size estimate is the mean divided by the standard deviation. Who could disagree with that?

Let's step back and acknowledge that true effect size is not a real thing in nature. It is a parameter in a model of data. Models are assuredly our creations. In practice, we tend to forget about the models and think about effect sizes as real quantities readily meaningful and accessible. And it is here that we can occasionally run into trouble, especially for smaller effect-size measurements.

Estimating Coin-Flip Outcomes

Consider the following digression to coin flips to build intuition about estimation. Suppose a wealthy donor came to you offering to help fund your research. The only question is the amount. The donor says:

I have a certain coin. I let you see 1000 flips first, and then you can estimate the outcome for the next 1000. If you get it right, I will donate $10,000 to your research. If you're close, I will reduce the amount by squared error. So, if you are off by 1, I will donate $9,999; off by 2, I will donate $9,996; if you are off by 20, I will donate $9,600, and so on. If you are off by more than 100, then there will be no donation.

Let's suppose the outcome on the first 1000 flips is 740 heads. It seems uncontroversial to think that perhaps this value, 740, is the best estimate for the next 1000. But suppose that the outcome is 508 heads. Now we have a bit of drama. Some of you might think that 508 is the best estimate for the next 1000, but I suspect that we are chasing a bit of noise. Perhaps the coin is fair. The fair-coin hypothesis is entirely plausible; after all, have you ever seen a biased coin? Moreover, the result of the first 1000 flips, the 508 heads, should bolster confidence in this fair-coin hypothesis. And if we are convinced of the fair coin after seeing the 508 heads, then the best estimate is 500 rather than 508. Why leave money on the table?

Now let's ratchet up the tension. Suppose you observed 540 heads on the first 1000 flips. I chose this value purposefully. If you were bold enough to specify the possibilities, the fair-coin null and a uniformly distributed alternative, then 540 is an equivalence point. If you observe between 460 and 540 heads, then you should gain greater confidence in the null. If you observe fewer than 460 or more than 540 heads, then you should gain greater confidence in the uniform alternative. At 540, you remain unswayed. If we believe that both the null and alternative are equally plausible going in, and our beliefs do not change, then we should average. That is, the best estimate, the one that minimizes squared-error loss, is halfway in between. If we want to maximize the donation, then we should estimate neither 500 nor 540 but 520!

The estimate of the number of heads, \(\hat{Y}\), is a weighted average. Let \(X\) be the number of heads on the first 1000 and \(\mbox{Pr}(\mbox{Fair}|X)\) be our belief that the coin is fair after observing \(X\) heads.
\[ \hat{Y} = \mbox{Pr}(\mbox{Fair}|X) \times 500 + (1-\mbox{Pr}(\mbox{Fair}|X)) X.\]
The following figure shows how the estimation works. The left plot is a graph of \(\mbox{Pr}(\mbox{Fair}|X)\) as a function of \(X\). (The computations assume that the unfair-coin probabilities a priori follow a uniform and that, a priori, the fair and unfair coin models are equally likely.) It shows that for values near 500, the belief increases, but for values away from 500, the belief decreases. The right plot shows the weighted-average estimate \(\hat{Y}\). Note the departure from the diagonal. As can be seen, the possibility of a fair coin provides for an expanded region where the fair-coin estimate (500) is influential. Note that this influence depends on the first 1000 flips: if the number of heads from the first 1000 is far from 500, then the fair-coin hypothesis has virtually no influence.
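The computation behind the figure can be sketched in a few lines of Python, using the two models described above: a point-mass fair coin and a uniform alternative (whose marginal over heads is flat), with equal prior odds:

```python
from math import comb

def p_fair(x, n=1000):
    """Posterior probability of the fair-coin model after x heads in n flips."""
    f_null = comb(n, x) * 0.5**n   # P(X = x | fair coin)
    f_alt = 1 / (n + 1)            # P(X = x | p ~ Uniform(0, 1)): a flat marginal
    return f_null / (f_null + f_alt)  # equal prior odds on the two models

def estimate(x, n=1000):
    """Model-averaged prediction for the number of heads in the next n flips."""
    p = p_fair(x, n)
    return p * (n / 2) + (1 - p) * x

for x in [508, 540, 740]:
    print(x, round(p_fair(x), 3), round(estimate(x), 1))
```

Near 540 the two models remain about equally plausible and the estimate lands near 520, as in the text; at 740 the fair-coin model carries essentially no weight and the estimate is the observed count.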

Model Averaging For Effect Sizes

I think we should estimate effect sizes by model averaging including the possibility that some effects are identically zero. The following figure shows the case for a sample size of 50. There is an ordinary effects model, where effect size is distributed a priori as a standard normal, and a null model. When the sample effect size is small, the plausibility of the null model increases, and the estimate is shrunk toward zero. When the sample effect size is large, the effects model dominates, and the estimate is very close to the sample value.

Of interest to me are the small effect sizes. Consider, say, a sample effect size of .06. The model-averaged effect-size estimate is .0078, about 13% of the sample statistic. We see here dramatic shrinkage for small effect sizes, as there should be if we have increased confidence in the null. I wonder if many of the reported effect sizes in Many Labs 3 are more profitably shrunk, perhaps dramatically, to zero.
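This shrinkage can be sketched under a simple approximation I'll adopt here (not necessarily the exact computation behind the figure): treat the sample effect size as normal with variance \(1/n\), put a standard-normal prior on the true effect size under the effects model, and give the two models equal prior odds:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-(x - mu)**2 / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

def averaged_effect(d, n):
    """Model-averaged effect-size estimate for sample effect size d, sample size n."""
    f_null = normal_pdf(d, 0, math.sqrt(1 / n))     # marginal of d under the null model
    f_alt = normal_pdf(d, 0, math.sqrt(1 + 1 / n))  # marginal under delta ~ Normal(0, 1)
    p_alt = f_alt / (f_null + f_alt)                # posterior probability of an effect
    post_mean = d * n / (n + 1)                     # posterior mean under effects model
    return p_alt * post_mean                        # the null model contributes zero

print(round(averaged_effect(0.06, 50), 4))  # small effects shrink hard toward zero
print(round(averaged_effect(0.50, 50), 3))  # large effects are barely shrunk
```

With these assumptions, a sample effect size of .06 at \(n=50\) shrinks to about .0078, roughly 13% of the sample value, matching the magnitude described above, while a sample effect size of .50 is left nearly untouched.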

But the Null is Never True….What?

Cohen and Meehl were legendary contributors. And one of their most famous dictums was that the null is never true to arbitrary precision. If I had $10 for every researcher who repeats this proposition, then I would be quite well off. I have three replies: 1. the dictum is irrelevant, 2. the dictum is assuredly wrong in some cases, and 3. in other cases, the dictum is better treated as an empirical question than a statement of faith.

  • Irrelevant: The relevant question is not whether the null is true or not. It is whether the null is theoretically important. Invariances, regularities, and lawfulness are often of interest even if they never hold to arbitrary precision. Jupiter, for example, does not orbit the sun in a perfect ellipse. There are, after all, small tugs from other objects. Yet Kepler's Laws are of immense importance even if they hold only approximately. My own take on psychological science is that if we allowed people to treat the null as important and interesting, they surely would, and this would be good.
  • Wrong in some cases: Assuredly, there are cases where the point null does hold exactly. Consider the random-number generator in common packages, say those that produce a uniform distribution between 0 and 1. I could hope that the next number is high, say above .9. I contend that this hope has absolutely no effect whatsoever on the next generated number.
  • Testable in others: The evidence for the null vs. a specified alternative may be assessed. But to do so, you need to start with real mass on the null. And that is what we do with our Bayes factor implementations. So, why be convinced a priori of something that you know is wrong in some cases and can be tested in others?

If You've Gotten This Far

If you are convinced by my arguments, then may I suggest you downweight effect sizes as a useful measure. They are clearly marginal or averaged across different models. Of more meaning to me is not the averaged size of effects, but the probability of an effect. If you turn your attention there, then welcome to Bayesian model comparison!

Monday, March 9, 2015

High Reliability in the Lab

Ten days ago, Tim pulled me aside.  "Jeff, there's the keyboard from Room 2," he said pointing at the trash.  "When I pressed the 'z' key, I had to hit it three times for it to respond.  I wanted to tell you first because I know it affects some of the experiments."

"Thanks, we'll deal with it," I said as that sinking feeling set in.   The questions immediately came to mind.  How long has this keyboard been bad?  What data are compromised?  What can we do to catch this sooner?  Are our systems working?  What the hell are we doing?

About five years ago I had the good fortune of being on a nursing PhD committee.  The candidate was applying the theory of highly-reliable organizations to the nursing setting in hopes of improving patient safety.  As I read the characteristics of highly-reliable organizations in the thesis, I started making comparisons to my lab.  We made mistakes in the lab, and although none had bitten us in the ass, perhaps we were more fortunate than good.  And I knew my way of dealing with mistakes, asking people to be more careful, wasn't that effective.  But the described characteristics of highly reliable organizations seemed abstract and not immediately translatable.  They were for large-scale organizations like hospitals and air-traffic control, and they were for dangerous environments where people could die.  My lab is small.  Not only have we never injured anyone, it seems highly unlikely that we ever will.

But mistakes and other adverse events not only cost us time, they affect the trust other people may place in our data and analyses.  We need to keep them at a minimum and understand them when they occur.  So I began adapting the characteristics of highly reliable organizations for my lab.   Making changes does take time and effort, but I suspect the rewards are well worth it.

Here are the characteristics of highly reliable organizations and our evolving implementations:
  1. Sensitivity to Operations.  Sensitivity to operations in the context of a lab means enhanced attention to the processes by which data (and interpretations) are produced.  We implemented the characteristic by (a) defining a clear audit trail of lab activities via a MySQL relational database.  This database has tabs for IRBs, experiments, experimenters, participants, and runs.  (b) Automating many tasks so that smooth operation is less reliant on human meticulousness.  (c) Codifying uniformity among different experiments and experimenters so that more people have a greater understanding of more elements of the operations.
  2. Preoccupation with Failure.  My goal is to anticipate some of the mistakes that can be made and think of ways to prevent them.  For me this has been achieved primarily in two ways:  First, I am structuring the lab so computers do more and people do less.  Examples include writing scripts that all experiments use to name files, add metadata, log demographics, populate the MySQL databases, and upload data to repositories.  These steps have increased the accessibility and quality of the data and metadata we collect.  It's still evolving.  Second, we discuss near-miss mistakes a lot.  Near-miss mistakes are things that didn't quite go right even though they had no effect on the integrity of the lab.  A near-miss mistake, if not identified, could be an experiment-ruining mistake next time around.
  3. Resilience.  It is hard to define a resilient lab.  For us, it means that we accept that mistakes will happen and we put in place procedures to deal with them and, hopefully, learn from them.  We have an adverse-events tab in our MySQL database.  All issues are logged as an adverse event without blame or shame.  What's the problem?  What's the short term solution?  What are the long-term solutions and needed policy changes?  Adverse events are discussed at lab meetings, perhaps repeatedly, until we can get some resolution. 
  4. Reluctance to simplify interpretations.  The simple interpretation of mistakes in my lab is that people occasionally make mistakes and hardware occasionally fails.  We should be more careful where we can and live with the rest.  We, however, are reluctant to take this stance.  Mistakes are a failure (by the PI) to anticipate a certain constellation of conditions, and we should try to identify such constellations as best we can envision them and adapt where necessary.
  5. Deference to Expertise.  All lab members have some expertise in data collection.  Perhaps the most qualified experts are the graduate students and post-docs.  They write the code, organize the undergraduates, and analyze the data.   The undergrads are on the front lines.  They are setting up the machines, briefing, debriefing, and interacting with the participants.  I bring a long-run view.  Together, with deference to appropriate expertise, we figure out solutions.
So, with last week's keyboard failure, we can see some of these principles in action.  I logged an adverse event and typed up the problem.  I asked Tim how and why he discovered it (he was changing an experiment in Room 2).  We brought it up at the next lab meeting and divided our discussion into immediate concerns and long-term solutions as follows:

The immediate concerns were about the integrity of the data in a few experiments that use the "z" key.  We had a record of which participants were run in these experiments.  We also have a record of which room and computer they were run in; it is automatically logged.  I had previously set up an environment variable for each machine, and in the startup scripts this variable is read and output to the database along with other session variables.  Therefore we know exactly which data are at risk, about 10% of participants.  Moreover, the bad "z" key was used for catch trials.  The response times weren't analyzed, so it is plausible that the failure is inconsequential to the integrity of the experiments.  A bit of good luck here.

The long-term solutions were harder.  Hardware will fail.  Yet the single-sticky-key keyboard failure is pretty rare in that it is so hard to detect.  The two most common hardware failures are hard-drive and motherboard failures, which are immediately detectable.  We may go another decade without seeing a hard-to-detect hardware failure.  So, with this in mind, we considered a few options.  One option we considered was doing nothing.  After all, the failure is rare and we were able to mitigate the damage.  Another option was to create a hardware-assurance protocol that could be run weekly.  Yet the idea of spending much time on such a rare, mitigable problem was deemed wasteful of lab resources.  So we settled on the following.  When each of us develops a new experiment, we are required to run it on ourselves once, fully, to confirm that the conditions appear as we intended, appear at the frequencies we intended, and that the data are analyzable as intended.  We switch experiments say once a month or so, and so this check occurs about as often.  Most experimenters run these checks in Rooms 1 and 4 or on their own machines.  From now on, the checks are to be run in the lab, and the machine on which the check was run is noted in the database.  Experimenters are asked to see which machine was last checked and make sure all machines get checked now and again.  In this manner, we can detect these failures more quickly without any additional time or effort.  Another solution I thought of after our meeting is asking participants if they noticed anything strange that we need to be aware of.  I will bring up this issue at the next lab meeting, to my experts, to see if it is helpful.

Thursday, March 5, 2015

Just Tell Us What Is Proper and Appropriate: Thoughts on the Upcoming ASA Statement

These are great times to be a methodologist.  There is a crisis in reproducibility; there is a loss of confidence across several domains; and standards are being debated and changed.  It is a time of flux, innovation, and chaos.  None of us is quite sure how it is going to shake out.

The most recent installment in this drama is the banning of significance tests in a social-psychology journal.  This banning has received much attention, and the comments that have caught my eye are those from the American Statistical Association's (ASA) Ronald Wasserstein.  In them, Wasserstein reports that a group of more than two dozen experts is developing an official ASA response.  Until then, ASA "encourages journal editors and others... [to] not discard the proper and appropriate use of statistical inference." (emphasis added)  ASA's statement provided me an opportunity to more closely examine the relationship between scientists and statisticians.  Here is my message to ASA:

We psychologists tend to put you statisticians on a pedestal as experts who tell us what to do and what not to do.  You statisticians should not welcome being put on this pedestal, for there is a dark side.  What we psychologists tend to do is cleave off analysis from the other parts of the research process.  Whereas we see our research overall as perhaps creative and insightful, many of us see analysis as procedural, formulaic, hard, and done to meet sanctioned standards.  In this view, we psychologists are shifting our responsibilities to you statisticians.  We ask you to sanction our methods, to declare our calculations kosher, to absolve us from being thoughtful, judicious, transparent, and accountable in analysis.  My sense is that this transfer of responsibility is as rampant as it is problematic.  It results in a less transparent, more manufactured, more arbitrary psychological science.

When ASA weighs in, it should be mindful of the current pedestal/responsibility dynamic.  ASA should encourage researchers to be responsible for their analyses---that is, to justify them without recourse to blind authority or mere convention.  ASA should encourage thoughtfulness rather than adherence.  Telling us which options are *proper and appropriate* won't do.  Promoting responsibility and thoughtfulness, however, seems easier said than done.  Good luck.  Have fun.  Knock yourselves out.

Perhaps my experience is helpful.  Statisticians add immeasurable value by helping me instantiate competing theories as formal models.  These models imply competing constraints on data.  We work as a team in developing a  computationally-convenient analysis plan for assessing the evidence for the models from the data.  And if we do a good job instantiating theories as models, then we may interpret our statistical inference as inference about the theories.  In the end we share responsibility as a team for understanding how the constraint in the models is theoretically relevant, and how patterns in the data may be interpreted in theoretical terms.