Thursday, April 9, 2015
Reply to Uri Simonsohn's Critique of Default Bayesian Tests
How we draw inferences about theoretical positions from data, how we know things, is central to any science. My colleagues and I advocate Bayes factors, and Uri Simonsohn provides a critique. I welcome Uri’s comments because they push the conversation forward. I am pleased to be invited to reply, and am helped by Joe Hilgard, an excellent graduate student here at Mizzou.
The key question is what to conclude from small effects when we observe them. It is a critical question because psychological science is littered with these pesky small effects---effects that are modest in size and barely reach significance. Such observed effects could reflect many underlying truths, including null effects, small effects, or even large effects.
According to Uri, Bayes factors are prejudiced against small effects because, when small effects are true, the resulting data may be interpreted as evidence for the null. By contrast, Ward Edwards (1965, Psy Bull, p. 400) described classical methods as violently biased against the null because, if the null is true, these methods never provide support for it.
Uri's argument assumes that observed small effects reflect true small effects, as shown in his Demo 1. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths? I discuss the difference between these questions, why the Bayes factor question is more intellectually satisfying, and how to simulate data to assess the veracity of Bayes factors in my 2014 Psy Bull Rev paper.
The figure below shows the Bayes-factor evidence for the alternative model relative to the null model conditional on data, which in this case is the observed effect size. We show the same for p-values, and while the two figures are broadly similar, they also differ; I am not sure we can discern which is more reasonable without deeper analysis. Even though we think Uri's Demo 1 is not that helpful, it is still noteworthy that the Bayes factors favor the null to a greater extent as the true effect size decreases.
The question remains how to interpret observed small effects. We use Bayes factors to assess the relative evidence for the alternative vs. the null. Importantly, we give the null a fighting chance. Nulls are statements of invariance, sameness, conservation, lawfulness, constraint, and regularity. We suspect that if nulls are taken seriously, that is, if one can state evidence for or against them as dictated by data, then they will prove useful in building theory. Indeed, statements of invariance, constraint, lawfulness, and the like have undergirded scientific advancement for over four centuries. It is time we joined this bandwagon.
Bayes factors are formed by comparing predictions of competing models. The logic is intuitive and straightforward. Consider two competing explanations. Before we observe the data, we use the explanations to predict where the observed effect size might be. We do this by making probability statements. For example, we may calculate the probability that the observed effect size is between .2 and .3. Making predictions under the null is easy, and these probabilities follow gracefully from the t-distribution.
Making predictions under the alternative is not much harder for simple cases. We specify a reasonable distribution over true effect sizes and, based on this distribution, calculate probabilities over effect-size intervals just as we did under the null.
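As a rough sketch of these predictions (not the authors' own code), the two probability statements can be computed numerically. I assume, purely for illustration, a one-sample design with n = 50 and a Cauchy prior with scale √2/2 on the true effect size; the interval [.2, .3] comes from the example above.

```python
import numpy as np
from scipy import stats, integrate

n = 50                        # hypothetical sample size (illustrative)
df = n - 1
lo, hi = 0.2, 0.3             # interval of observed effect sizes d = t / sqrt(n)

# Under the null, the t-statistic follows a central t-distribution.
p_null = stats.t.cdf(hi * np.sqrt(n), df) - stats.t.cdf(lo * np.sqrt(n), df)

# Under the alternative, place a Cauchy(0, r) prior on true effect size delta
# (r = sqrt(2)/2 is an illustrative choice), and average the noncentral-t
# interval probability over that prior.
r = np.sqrt(2) / 2

def p_interval_given_delta(delta):
    nc = delta * np.sqrt(n)   # noncentrality parameter
    return stats.nct.cdf(hi * np.sqrt(n), df, nc) - stats.nct.cdf(lo * np.sqrt(n), df, nc)

p_alt, _ = integrate.quad(
    lambda d: p_interval_given_delta(d) * stats.cauchy.pdf(d, scale=r),
    -np.inf, np.inf)

print(p_null, p_alt)          # each model's predicted probability for [.2, .3]
```

Each model now stakes a concrete, checkable prediction on the interval before the data arrive; that is the sense in which the models compete.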
Once the predictions are stated, the rest is a matter of comparing how well the observed effect size matches the predictions. That's it. Did the data conform to one prediction better than another? And that is exactly what the Bayes factor tells us. Prediction is evidence.
My sense is that researchers need some time to get used to making predictions because it is new in psychological science. I realize that some folks would rather not make predictions---it requires care, thought, and judiciousness. Even worse, the predictions are bold and testable. My colleagues and I are here to help, whether it is help in specifying models that make predictions or deriving the predictions themselves.
I would recommend that researchers who are unwilling to specify models of the alternative hypothesis avoid inference altogether. There is no principled way of pitting a null model that makes predictions against a vague alternative that does not. Inference is not always appropriate, and for some perfectly good research endeavors, description is powerful and persuasive. What is hard to justify, though, is continuing as we are: rejecting nulls from small effects on scant evidence while denying the beauty and usefulness of invariances in psychological science.