Thursday, April 9, 2015

Reply to Uri Simonsohn's Critique of Default Bayesian Tests

How we draw inferences about theoretical positions from data, that is, how we know things, is central to any science. My colleagues and I advocate Bayes factors, and Uri Simonsohn provides a critique. I welcome Uri’s comments because they push the conversation forward. I am pleased to be invited to reply, and am helped by Joe Hilgard, an excellent graduate student here at Mizzou.

The key question is what to conclude from small effects when we observe them. It is a critical question because psychological science is littered with these pesky small effects---effects that are modest in size and barely reach significance. Such observed effects could reflect many underlying truths, including null effects, small effects, or even large effects.

According to Uri, Bayes factors are prejudiced against small effects because small effects when true may be interpreted as evidence for the null. By contrast, Ward Edwards (1965, Psy Bull, p. 400) described classical methods as violently biased against the null because if the null is true, these methods never provide support for it.

Uri's argument assumes that observed small effects reflect true small effects, as shown in his Demo 1. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths? I discuss the difference between these questions, why the Bayes factor question is more intellectually satisfying, and how to simulate data to assess the veracity of Bayes factors in my 2014 Psy Bull Rev paper.

The figure below shows the Bayes-factor evidence for the alternative model relative to the null model conditional on data, which in this case is the observed effect size. We also show the same for p-values, and while the figures are broadly similar, they also differ in places. I am not sure we can discern which is more reasonable without deeper analysis. Even though we think Uri's Demo 1 is not that helpful, it is still noteworthy that the Bayes factors favor the null to a greater extent as the true effect size decreases.

The question remains how to interpret observed small effects. We use Bayes factors to assess relative evidence for the alternative vs. the null. Importantly, we give the null a fighting chance. Nulls are statements of invariance, sameness, conservation, lawfulness, constraint, and regularity. We suspect that if nulls are taken seriously, that is, if one can state evidence for or against them as dictated by data, then they will prove useful in building theory. Indeed, statements of invariance, constraint, lawfulness and the like have undergirded scientific advancement for over four centuries. It is time we joined this bandwagon.

Bayes factors are formed by comparing predictions of competing models.  The logic is intuitive and straightforward.  Consider two competing explanations.  Before we observe the data, we use the explanations to predict where the observed effect size might be.  We do this by making probability statements.  For example, we may calculate the probability that the observed effect size is between .2 and .3.  Making predictions under the null is easy, and these probabilities follow gracefully from the t-distribution.

Making predictions under the alternative is not much harder for simple cases.  We specify a reasonable distribution over true effect sizes, and based on these, we can calculate probabilities on effect size intervals as with the null.
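To make the idea of interval predictions concrete, here is a minimal sketch. It rests on assumptions that are not part of the argument above: a one-sample design, a normal approximation in place of the t-distribution, and a Normal(0, 0.5^2) distribution over true effect sizes standing in for "a reasonable distribution". The function name and the numbers are illustrative only.

```python
from statistics import NormalDist


def interval_probability(lo, hi, n, prior_sd=0.0):
    """P(lo < observed effect size < hi), stated before seeing data.

    Assumption: observed d ~ Normal(delta, 1/n), a normal approximation
    to the t-based sampling distribution for a one-sample design.
    prior_sd = 0 is the null (delta fixed at 0); prior_sd > 0 is an
    alternative with delta ~ Normal(0, prior_sd**2).
    """
    # Marginal (predictive) spread = prior spread + sampling noise.
    predictive = NormalDist(0.0, (prior_sd**2 + 1.0 / n) ** 0.5)
    return predictive.cdf(hi) - predictive.cdf(lo)


n = 50  # hypothetical sample size
p_null = interval_probability(0.2, 0.3, n)
p_alt = interval_probability(0.2, 0.3, n, prior_sd=0.5)
print(f"P(.2 < d < .3 | null)        = {p_null:.4f}")
print(f"P(.2 < d < .3 | alternative) = {p_alt:.4f}")
```

Both models commit to a probability for every interval of observed effect sizes, which is what makes them comparable.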

Once the predictions are stated, the rest is a matter of comparing how well the observed effect size matches the predictions.  That's it.  Did the data conform to one prediction better than another?  And that is exactly what the Bayes factor tells us.  Prediction is evidence.
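The comparison can be sketched in a few lines. This is a simplified illustration, not the default Bayes factor my colleagues and I advocate: it assumes a one-sample design, a normal approximation to the sampling distribution, and a hypothetical Normal(0, 0.5^2) distribution on true effect size under the alternative.

```python
from statistics import NormalDist


def bayes_factor_10(d_obs, n, prior_sd=0.5):
    """Evidence for the alternative relative to the null at d_obs.

    Assumptions (illustrative): observed d ~ Normal(delta, 1/n);
    null: delta = 0; alternative: delta ~ Normal(0, prior_sd**2).
    The Bayes factor is the ratio of the two predictive densities
    evaluated at the observed effect size.
    """
    null_pred = NormalDist(0.0, (1.0 / n) ** 0.5)
    alt_pred = NormalDist(0.0, (prior_sd**2 + 1.0 / n) ** 0.5)
    return alt_pred.pdf(d_obs) / null_pred.pdf(d_obs)


# Values above 1 favor the alternative; values below 1 favor the null.
for d_obs in (0.0, 0.1, 0.3, 0.5):
    print(d_obs, round(bayes_factor_10(d_obs, n=50), 3))
```

Under these assumptions, observed effects near zero are better predicted by the null, and larger observed effects are better predicted by the alternative.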

My sense is that researchers need some time to get used to making predictions because it is new in psychological science.   I realize that some folks would rather not make predictions---it requires care, thought, and judiciousness.  Even worse, the predictions are bold and testable.  My colleagues and I are here to help, whether it is help in specifying models that make predictions or deriving the predictions themselves.

I would recommend that researchers who are unwilling to specify models of the alternative hypothesis avoid inference altogether. There is no principled way of pitting a null model that makes predictions against a vague alternative that does not. Inference is not always appropriate, and for some perfectly good research endeavors, description is powerful and persuasive. The difficult case, though, is continuing in what we are doing---rejecting nulls on the basis of small effects and scant evidence while denying the beauty and usefulness of invariances in psychological science.


P. Gomez said...

"My colleagues and I are here to help ..."


Dr. R said...

Dear Dr. Rouder,

You frame the discussion in terms of observed effect sizes (i.e., a mean difference/covariance in a sample).

"The key question is what to conclude from small effects when we observe them."

I think framing the discussion in terms of observed effect sizes is not productive.

A much more important question is how useful different statistical approaches are to investigate effects, given the actual size of an effect (i.e., the effect in the population or an unbiased sample drawn from a population).

How useful is the default Bayesian approach to the study of effects that are actually small? Let's use Cohen's d = .2 as a small effect size.

Now a researcher wants to examine whether a small effect exists. For example, Bem (2011) claimed to expect a small effect for ESP.

A traditional way of demonstrating this effect is to conduct a study with 80% power for an effect size of d = .2. If the true effect size is d = .2, the test will produce a significant result 8 out of 10 times. If the true effect is 0, it will do so only 5% of the time.
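For a sense of scale, the sample size behind such a study can be approximated as follows. This is a rough sketch assuming a two-sided, two-group comparison with equal group sizes and a normal approximation to the t-test; the design and the function name are assumptions, since the comment does not specify them.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample test.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d)**2,
    which slightly understates the n an exact t-test calculation gives.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = z.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


print(n_per_group(0.2))  # d = .2 at 80% power, alpha = .05
print(n_per_group(0.5))  # a medium effect, for comparison
```

Under this approximation, d = .2 calls for 393 participants per group.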

For a Bayesian test, it is necessary to contrast two hypotheses. In this case, it makes sense to contrast d = 0 with d = .2. If the true effect size were d = .2 and the sample size is large enough to minimize the effect of sampling error, the BF would favor H: d = .2. If the true effect size is d = 0, it would favor H: d = 0.

However, what would happen if Bem (2011) overestimated the actual effect size of PSI based on a meta-analysis with inflated effect sizes and the true effect size was only d = .05?

In this case, BF would favor H:d=0 over H:d=.2 because the true effect size is closer to 0 than to .2.
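This claim can be checked with a short calculation. The sketch below assumes the observed effect size is approximately Normal(true d, 1/n), which is a simplification; the function name is hypothetical.

```python
def expected_log_bf(true_d, n, d_alt=0.2):
    """Expected log Bayes factor for H: d = d_alt over H: d = 0.

    Assumption: observed d ~ Normal(true_d, 1/n). For two point
    hypotheses, log BF = n * (d_alt * d_obs - d_alt**2 / 2), so its
    expectation simply replaces d_obs with true_d.
    """
    return n * (d_alt * true_d - d_alt**2 / 2)


# Negative values favor H: d = 0; positive values favor H: d = .2.
for true_d in (0.0, 0.05, 0.1, 0.2):
    print(true_d, expected_log_bf(true_d, n=1000))
```

Under this approximation the tipping point is the midpoint d = .1: any true effect below it, including d = .05, pushes the evidence toward H: d = 0 as n grows.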

Thus, the interpretation of a BF depends on the two hypotheses that are being pitted against each other.
It is not clear what is gained by pitting two hypotheses against each other. It is also possible to compute p-values for d = 0 and d = .2. This might show that both hypotheses are being rejected (p < .05) and neither one describes the actual empirical data well.

Personally, I believe that Bayesian statistics is useful in research areas where theory is strong enough to make quantitative predictions (I use the Bayesian Information Criterion all the time in structural equation modeling).

However, to apply this method to studies that merely examine whether an effect exists seems to create more problems than it solves.

Sincerely, Dr. R

Jeff Rouder said...

Hi Dr. R. Please call me Jeff; hopefully we can be on a first-name basis.

While it may appear tangential, I hated the Heisenberg Uncertainty Principle as a kid. I had no need for limits on my knowledge no matter how principled those limits might be. Only later in life did I learn to appreciate these limits as a beautiful part of nature.

Now let's take the ESP example. Some might want to know the estimate of an effect-size parameter in a model. I tend to use mixture models for this purpose that allow for point nulls (see, but which model to use is a matter of taste.

Others might want to make a statement about a theory, say that there is no such thing as ESP or that there is ESP. Both of these are really important theories, and neither deserves to be shortchanged at all. In NHST or inference by confidence intervals and the like, one encodes just one of these two theories, the null, and either rejects it or says nothing. What a pathetic choice if we consider both theoretical statements important. It is unappealing that we can never state evidence for the null, even when the null is true. I can't endorse not being able to state anything about such a reasonable theory as "there is no ESP." When do we reject the alternative?

So, if we decide we wish to treat both theories on a fair, level playing ground, we need to decide how the alternative---there is ESP---relates to data. We need a model that makes predictions in its own right. If you think the null is theoretically attractive and reasonable, then you must specify the alternative so that it makes predictions. There is no way around this point. It's a matter of math and logic. I guess you can start with the null as unimportant, never true, rejection fodder. But then why do testing in the first place if the null isn't on the same theoretical ground as the alternative?

What you are asking for is a principled way of testing without specification. That, in my opinion, is like asking to be able to localize the speed and position of a particle to infinite precision. It violates what may be learned.


Alex ETz said...

"It is not clear what is gained by pitting two hypotheses against each other."

Better stop doing all of those power calculations then...

Unknown said...

"Did the data conform to one prediction better than another? And that is exactly what the Bayes factor tells us. Prediction is evidence. [...] My sense is that researchers need some time to get used to making predictions because it is new in psychological science."

Huh? It conforms to one model vs. another. I don't see predictions anywhere. You are looking at the probability of the observed data under each model -- in one way of writing the Bayes Factor, this is p(y|H1)/p(y|H0). It's not a posterior predictive distribution, there is no comparison of "new" data points. Nor are you using out-of-sample estimate of error, cross-validation to compare models, or performing forecasting, etc. So I feel it is misleading to call this "prediction" or to claim "prediction is evidence" in this context. While I agree prediction is evidence (out-of-sample error is a great measure), you are not performing prediction here, you are model fitting under two different hypotheses/models.

Jeff Rouder said...

Hi Alex, Thanks for writing. I think you missed something here. Let's look at p(y|H1), the probability of the data given the model instantiating H1. Now, even before we collect data, we can evaluate the probability for any possible set of observed data. If we do this for all possible data, we derive a proper distribution that tells us where the not-yet-collected data are probable and where they are not. To me, this is a beautiful definition of a prediction. We are placing probabilities on intervals before seeing the data about where the data will occur. In fact, this is about as good as it gets: We can talk about the most probable point (say the mean or mode), about the dispersion, even about higher moments as well. I cannot think of a better definition for the word "prediction" than these a priori marginal probability distributions of data. I hope you will join me.
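One way to see this distribution over not-yet-collected data is to simulate it. The sketch below makes illustrative assumptions: observed effect size is approximately Normal(delta, 1/n), and H1 places a hypothetical Normal(0, 0.5^2) distribution on the true effect delta.

```python
import random
from statistics import NormalDist


def simulate_prior_predictive(n, prior_sd, draws, seed=1):
    """Draw observed effect sizes that H1 predicts before any data exist.

    Assumptions (illustrative): delta ~ Normal(0, prior_sd**2) under H1,
    and observed d ~ Normal(delta, 1/n).
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        delta = rng.gauss(0.0, prior_sd)                 # a true effect from H1
        sims.append(rng.gauss(delta, (1.0 / n) ** 0.5))  # data given that effect
    return sims


sims = simulate_prior_predictive(n=50, prior_sd=0.5, draws=50_000)
share = sum(0.2 < d < 0.3 for d in sims) / len(sims)

# The same interval probability from the closed-form marginal distribution.
marginal = NormalDist(0.0, (0.5**2 + 1 / 50) ** 0.5)
exact = marginal.cdf(0.3) - marginal.cdf(0.2)
print(f"simulated: {share:.4f}   analytic: {exact:.4f}")
```

The simulated share and the analytic probability should agree closely, and p(y|H1) for a given dataset is the height of this same distribution at the observed value.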

I have provided examples in several newer blog posts including

Unknown said...

Jeff, it seems to me you are describing a prior predictive distribution. That's fine, but that's not what the Bayes Factor is, since p(y|H1) is the marginal distribution of the observed data, not the distribution of all possible data. Moreover, p(y|H1) is also not a posterior predictive distribution, as I said. If you wanted to compare the posterior predictive distributions conditional on H0 and H1, respectively, (aka "posterior predictive checks" for the two hypotheses) I would agree with you.

Unknown said...

Jeff, I read your newer post. While I would phrase it a bit differently, I do see now what you mean. Nice post!

Jeff Rouder said...

Thanks. I should say, "Strength of evidence is predictive accuracy." The phrase "evidence is prediction" is too shorthand and too vague to be effective.