Sunday, May 31, 2015

Simulating Bayes Factors and p-Values

I see people critiquing Bayes factors based on simulations these days; examples include recent blog posts by Uri Simonsohn and Dr.-R. These authors assume some truth, say that the true effect size is .4, and then simulate what the distribution of Bayes factors looks like across many replicate samples.  The resulting claim is that Bayes factors are biased and don't control long-run error rates.  I think the use of such simulations is not helpful.  With tongue in cheek, I consider them frequocentrist.  Yeah, I just made up that word.  Let's pronounce it as "freak-quo-centrist."  It refers to using frequentist criteria and standards to evaluate Bayesian arguments.

To show that frequocentric arguments are lacking, I am going to do the reverse here.  I am going to evaluate p-values with a Bayescentric simulation.

I created a set of 40,000 replicate experiments of 10 observations each.  Half of these sets were from the null model; half were from an alternative model with a true effect size of .4.   Let's suppose you picked one of these 40,000 and asked if it were from the null model or from the effect model.  If you ignore the observations entirely, then you would rightly think it is a 50-50 proposition.  The question is how much do you gain from looking at the data.

Figure 1A shows the histograms of observed effect sizes for each model.  The top histogram (salmon) is for the effect model; the bottom, downward going histogram (blue) is for the null model.  I drew it downward to reduce clutter.

The arrows highlight the bin between .5 and .6.  Suppose we had observed an effect size there.  According to the simulation, 2,221 of the 20,000 replicates under the alternative model are in this bin.  And 599 of the 20,000 replicates under the null model are in this bin.   If we had observed an effect size in this bin, then the proportion of times it comes from the null model is 599/(2,221+599) = .21.  So, with this observed effect size, the probability goes from 50-50 to 20-80.  Figure 1B shows the proportion of replicates from the null model, and the dark point is for the highlighted bin.  As a rule, the proportion of replicates from the null decreases with effect size.
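For readers who want to reproduce the bin counts, here is a minimal sketch of the simulation. It assumes normally distributed observations with unit standard deviation and an observed effect size computed as the sample mean divided by the sample standard deviation. The seed and the function names are my own choices for illustration, so the exact counts will differ a little from the ones quoted above.

```python
import math
import random

def observed_effect_size(true_effect, n, rng):
    """Draw n normal observations (sd = 1) and return mean / sample sd."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / math.sqrt(var)

def count_in_bin(true_effect, reps, n, lo, hi, rng):
    """Count replicates whose observed effect size lands in [lo, hi)."""
    return sum(
        lo <= observed_effect_size(true_effect, n, rng) < hi
        for _ in range(reps)
    )

rng = random.Random(2015)  # arbitrary seed, assumed for reproducibility
reps, n = 20_000, 10       # 20,000 replicates per model, 10 observations each

null_count = count_in_bin(0.0, reps, n, 0.5, 0.6, rng)
effect_count = count_in_bin(0.4, reps, n, 0.5, 0.6, rng)

# Proportion of replicates in the bin that came from the null model.
prop_null = null_count / (null_count + effect_count)
print(null_count, effect_count, round(prop_null, 2))
```

The proportion should land close to the .21 computed above, up to simulation noise.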

We can see how well p-values match these probabilities.  The dark red solid line is the one-tail p-values, and these are miscalibrated.  They clearly overstate the evidence against the null and for an effect.  Bayes factors, in contrast, get this problem exactly right; it is the problem they are designed to solve.  The dashed lines show the probabilities derived from the Bayes factors, and they are spot on.  Of course, we didn't need simulations to show this concordance.  It follows directly from the law of conditional probability.
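The gap between the two quantities can be sketched without any simulation. The snippet below is a simplified illustration, not the computation behind the figure: it assumes a known standard deviation of 1 and treats the sample mean as the observed effect size, so the Bayes factor is just a ratio of two normal densities. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def prob_null_given_mean(xbar, n, effect=0.4, prior_null=0.5):
    """Posterior probability of the null for two point models (sd assumed known = 1)."""
    sd_mean = 1.0 / math.sqrt(n)
    like_null = NormalDist(0.0, sd_mean).pdf(xbar)
    like_effect = NormalDist(effect, sd_mean).pdf(xbar)
    bf_effect_vs_null = like_effect / like_null
    prior_odds_effect = (1.0 - prior_null) / prior_null
    post_odds_effect = prior_odds_effect * bf_effect_vs_null
    return 1.0 / (1.0 + post_odds_effect)

def one_tailed_p(xbar, n):
    """One-tailed p-value from a z-test against a null mean of zero."""
    return 1.0 - NormalDist().cdf(xbar * math.sqrt(n))

xbar, n = 0.55, 10  # middle of the highlighted bin
p_null = prob_null_given_mean(xbar, n)
p_value = one_tailed_p(xbar, n)
print(round(p_null, 3), round(p_value, 3))
```

Under these assumptions the probability of the null comes out near .20, in line with the simulated proportion for the highlighted bin, while the one-tailed p-value is several times smaller.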

Some of you might find this demonstration unhelpful because it misses the point of what a p-value is and what it does.  I get it.  It's exactly how I feel about others' simulations of Bayes factors.

This blog post is based on my recent PBR paper: Optional Stopping: No Problem for Bayesians.  It shows that Bayes factors solve the problem they are designed to solve even in the presence of optional stopping.


Dr. R said...

Dear Jeff,

I am happy that we are engaging in a constructive dialogue that may not result in agreement but at least better understanding of the sources of disagreement.

I am also happy that this exchange is happening online in real time without months of delay and hidden peer-reviews.

My first question is why you used a one-tailed p-value. My second question is what your prior distribution is (I assume a half Cauchy with scaling factor .707). My third question is how you would interpret an observed effect size of d = -1, with an a posteriori probability of ~ 1 in favor of the null-hypothesis. Finally, I am not sure how a posteriori probabilities are related to Bayes-Factors.

Sincerely, Dr. R (Uli)

Jeff Rouder said...

Hi Uli,

1. I used a 1-sided p-value because I knew the direction of the effect was positive. The goal was to see how much more info there was when conditioning on the data than when not.

2. The BF was for the models at hand. They were given. So in this case it was two points: one at zero and one at .4.

3. Yes, d = -1 is evidence for the null vs. .4.

4. post-p = BF/(1+BF) for equal prior odds. If the prior odds are pi, then pi*BF/(1+pi*BF)
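As a small sketch of the formula in point 4, written as a function. The name and argument order are hypothetical; the Bayes factor and the prior odds must both favor the same model, and the result is the posterior probability of that model.

```python
def posterior_probability(bf, prior_odds=1.0):
    """Posterior probability of the model that bf and prior_odds favor."""
    post_odds = prior_odds * bf
    return post_odds / (1.0 + post_odds)

# Equal prior odds: a Bayes factor of 4 gives 4 / (1 + 4).
print(posterior_probability(4.0))
# Prior odds of 1-to-4 against the model cancel a Bayes factor of 4.
print(posterior_probability(4.0, prior_odds=0.25))
```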

Best, Jeff

Winthrop Harvey said...

Firstly, it’s not true that Bayesian methods in principle are biased against small effect sizes. This depends on your priors. But I don’t think the “default” prior being biased against small effect sizes is a weakness. I think it’s a strength!

To paraphrase Cohen, "The Null is Always False (Even When It is True)."

Oh sure, the theoretical null hypothesis can be really, truly TRUE - but if you take a large enough sample size you will find an effect "showing otherwise" to whichever p-value you want. This is not because the null hypothesis you stated is false - but rather because your experiment is imperfect, and is not actually testing the theoretical null hypothesis but an approximation, the experimental null hypothesis.

We use controls to minimize confounds, and good study design does a good job of making sure that any confounds remaining are very small. But it is practically impossible to ELIMINATE confounds. You can't control conditions perfectly. There's no such thing as a perfect experiment. Even in simulation studies you could have minute imperfections in random number generation or the physical activity of your computer producing some extraordinarily slight confound. You can't control everything, and when it comes to chaotic real world systems everything has SOME effect. Maybe it's .00001, but there's going to be an effect, and if you have enough n, and your power actually increases with n, then you'll eventually detect it.

If you model the level of confounds in your experiment as a random variable, what is the probability that you just happen to hit exactly 0? It doesn't even matter what the probability distribution is, the chance of hitting EXACTLY 0 to perfect precision is, in fact, EXACTLY 0. The only thing you're sure about is that your experiment isn't perfect.

The point being... if you get p=.0000001, on a difference of .5%, and then you say you reject the null hypothesis because it's just so UNLIKELY... you're in for some pain. Because what you've detected isn't that the null isn't true, what you've detected is the imperfection in your ability to create an experimental setup that actually tests the theoretical null.

The experimental null you're testing is an APPROXIMATION of the theoretical null. You cannot reasonably expect to ever create an experiment with NO confounds of any arbitrarily small magnitude.

The theoretical null may or may not be true. The experimental null is ALWAYS false, in the limit of large n. You cannot control for every confound - you cannot even conceive of every confound!

But the problem is when people ignore the fact that experimental or systematic error can only be reduced, not eliminated, and then go on to think that p=.000000001 at a minuscule effect size is strong evidence against the null. But what a Bayesian says is, "I expect (have a prior) that even if the theoretical null is true, there's going to be some tiny confound I couldn't control, so if I see a very small effect, it's most likely a confound." Unless you SPECIFICALLY hypothesized (had a prior for!) a very small effect size, finding a small effect is strong evidence FOR the null regardless of the p value!

If you were looking for an effect d=.4, but find an effect d=.05 with very low p-value, it’s really tempting to say, “Well, the effect was (much) smaller than we thought it was, but it’s real, here it is, look at this tiny p-value!”

Bayes keeps us honest because it forces us to reveal via our priors EXACTLY how large an effect has to be before we will be able to think we are gathering evidence for it over the null. This is not a weakness, but a strength! If you don’t do this, you find ESP is real because that darn null hypothesis is so improbable.

Winthrop Harvey said...

Continued off last post:

On Dr. R’s post, he says, “The main difference [between Bayes factors and p-values] is that p-values have a constant meaning for different sample sizes. That is, p = .04 has the same meaning in studies with N = 10, 100, or 1000 participants. However, the interpretation of Bayes-Factors changes with sample size.” He clarifies later, “In contrast, p-values have a consistent meaning. They quantify how probable it is that random sampling error alone could have produced a deviation between an observed sample parameter and a postulated population parameter.”

This is true. The p-value is the probability that sampling error alone could have produced the result.

But sampling error is never alone. Our experiments aren’t perfect. The null is always false (even when it's true).

This means that the INTERPRETATION of a p-value VERY MUCH depends on sample size, and if you don’t acknowledge this you’re in for a world of trouble! P=.04 for 10 participants is very, very different from p=.04 for 1000 because the observed effect size to get p=.04 for 10 participants is very large, whereas the observed effect size needed to get p=.04 for 1000 participants is rather low. Not only is this of practical concern even if you think that the evidence is somehow equally good for both effects because the p-value is the same, but the evidence is NOT equally good for both because the small effect in the n=1000 sample could much more easily be a confound!
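A rough sketch of how the effect size behind p=.04 shrinks with sample size. It assumes a two-sided one-sample z-test, a normal approximation that understates the required effect at n=10 where a t-test would be used; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def required_effect_size(p_two_sided, n):
    """Observed standardized effect that yields this two-sided p in a z-test."""
    z_crit = NormalDist().inv_cdf(1.0 - p_two_sided / 2.0)
    return z_crit / math.sqrt(n)

for n in (10, 100, 1000):
    print(n, round(required_effect_size(0.04, n), 3))
```

Under this approximation the required effect falls with the square root of n, so the n=10 study needs an observed effect ten times larger than the n=1000 study to reach the same p-value.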

If an effect is of a real, fixed size, then you expect increased sample size to result in decreased p. What this means is that for a given effect, increased sample size actually requires SMALLER p-values to provide the same level of evidence! If you keep getting the same p-value as n increases, that means your effect size is decreasing with n (which is pretty odd, if it’s a real effect!). Wagenmakers’ 2007 article “A practical solution to the pervasive problems of p values” demonstrates this quite well (especially figure 6). (If you think it’s unfair to use Bayesian methods to assess p-values, keep in mind also that Bayesian methods are actually, and provably, correct).

This result is quite counterintuitive, which makes it extraordinarily important. Most people would say that a 500 person p=.04 study provides stronger evidence for an effect than a 50 person p=.04 study.

Most people are wrong, and this is BEFORE even considering confounds which an overpowered study might become ensnared in! At best, you can say they provide equal evidence for their effects, but that the 500 person study has a much smaller one. But to even say that is to assume that your study is perfect. Smaller effect sizes are inherently less reliable even with equal statistical evidence.

So if Dr. R thinks that “the interpretation of Bayes-Factors changes with sample size” is a unique weakness of Bayes factors, he’s simply not reasoning correctly. P-values have exactly the same problem, only it’s far, far worse because people aren’t aware of it.
Again and again, Bayesian methods get attacked for some problem (subjectivity, interpretability) that p-values supposedly lack, when the case is really that Bayes keeps us honest and forces us to reveal analytical lability front and center, while p-values let us hide it.

Jeff Rouder said...

Thank you Winthrop, these are very insightful and helpful comments.