## Saturday, March 14, 2015

### Estimating Effect Sizes Requires Some Thought

The small effect sizes observed in the Many Labs replication project have me thinking….

Researchers think that effect size is a basic, important, natural measure for summarizing experimental research. The approach goes by “estimation” because the measured sample effect size is an estimate of the true or population effect size. Estimation seems pretty straightforward: the effect-size estimate is the sample mean divided by the sample standard deviation. Who could disagree with that?

Let's step back and acknowledge that true effect size is not a real thing in nature. It is a parameter in a model of data. Models are assuredly our creations. In practice, we tend to forget about the models and think about effect sizes as real quantities that are readily meaningful and accessible. And it is here that we can occasionally run into trouble, especially for smaller effect-size measurements.

# Estimating Coin-Flip Outcomes

Consider the following digression to coin flips to build intuition about estimation. Suppose a wealthy donor came to you offering to help fund your research. The only question is the amount. The donor says:

“I have a certain coin. I will let you see 1000 flips first, and then you can estimate the outcome for the next 1000. If you get it right, I will donate $10,000 to your research. If you're close, I will reduce the amount by the squared error. So, if you are off by 1, I will donate $9,999; off by 2, I will donate $9,996; if you are off by 20, I will donate $9,600, and so on. If you are off by more than 100, then there will be no donation.”
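The donor's rule is a squared-error penalty capped at zero. As a minimal sketch (the function name `donation` and its parameters are illustrative, not from the donor's offer), it can be written as:

```python
def donation(estimate, actual, full_amount=10_000):
    """Dollars donated: the full amount minus the squared error, never below zero."""
    error = estimate - actual
    return max(0, full_amount - error**2)


# The donor's own examples: off by 1, by 2, by 20, and by more than 100.
for miss in (1, 2, 20, 150):
    print(miss, donation(500 + miss, 500))
```

Because the penalty grows with the square of the miss, the estimate that maximizes the expected donation is the one that minimizes expected squared error.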

Let's suppose the outcome on the first 1000 flips is 740 heads. It seems uncontroversial to think that perhaps this value, 740, is the best estimate for the next 1000. But suppose that the outcome is 508 heads. Now we have a bit of drama. Some of you might think that 508 is the best estimate for the next 1000, but I suspect that we are chasing a bit of noise. Perhaps the coin is fair. The fair-coin hypothesis is entirely plausible; after all, have you ever seen a biased coin? Moreover, the result of the first 1000 flips, the 508 heads, should bolster confidence in this fair-coin hypothesis. And if we are convinced, then after seeing the 508 heads, the best estimate is 500 rather than 508. Why leave money on the table?

Now let's ratchet up the tension. Suppose you observed 540 heads on the first 1000 flips. I chose this value purposefully. If you were bold enough to specify the possibilities, the fair-coin null and a uniformly distributed alternative, then 540 is an equivalence point. If you observe between 460 and 540 heads, then you should gain greater confidence in the null. If you observe fewer than 460 or more than 540 heads, then you should gain greater confidence in the uniform alternative. At 540, you remain unswayed. If we believe that both the null and alternative are equally plausible going in, and our beliefs do not change, then we should average. That is, the best estimate, the one that minimizes squared-error loss, is halfway in between. If we want to maximize the donation, then we should estimate neither 500 nor 540 but 520!
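The equivalence point can be checked directly. As a sketch, let $$X$$ denote the number of heads in the first 1000 flips, and write $$B_{01}$$ (a symbol introduced here for illustration) for the ratio of the probability of $$X$$ under the fair coin to its probability under the uniform alternative. Under the uniform alternative, every count from 0 to 1000 is equally likely:

$\mbox{Pr}(X|\mbox{Fair}) = \binom{1000}{X} 2^{-1000}, \qquad \mbox{Pr}(X|\mbox{Unfair}) = \int_0^1 \binom{1000}{X} p^{X} (1-p)^{1000-X} \, dp = \frac{1}{1001}.$

$B_{01}(X) = \frac{\mbox{Pr}(X|\mbox{Fair})}{\mbox{Pr}(X|\mbox{Unfair})} = 1001 \binom{1000}{X} 2^{-1000}.$

At $$X = 540$$ this ratio is approximately 1.03, so with equal prior odds the posterior probability of the fair coin stays near one half, and the estimate that minimizes squared error is close to $$\frac{1}{2}(500) + \frac{1}{2}(540) = 520$$.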

The estimate of the number of heads, $$\hat{Y}$$, is a weighted average. Let $$X$$ be the number of heads on the first 1000 and $$\mbox{Pr}(\mbox{Fair}|X)$$ be our belief that the coin is fair after observing $$X$$ heads.

$\hat{Y} = \mbox{Pr}(\mbox{Fair}|X) \times 500 + (1-\mbox{Pr}(\mbox{Fair}|X)) X.$

The following figure shows how the estimation works. The left plot is a graph of $$\mbox{Pr}(\mbox{Fair}|X)$$ as a function of $$X$$. (The computations assume that the unfair-coin probabilities follow a uniform distribution a priori and that, a priori, the fair and unfair coin models are equally likely.) It shows that for values near 500, belief in the fair coin increases, but for values away from 500, it decreases. The right plot shows the weighted-average estimate $$\hat{Y}$$. Note the departure from the diagonal. As can be seen, the possibility of a fair coin provides for an expanded region where the fair-coin estimate (500) is influential. Note that this influence depends on the first 1000 flips: if the number of heads from the first 1000 is far from 500, then the fair-coin hypothesis has virtually no influence.
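The weighted average above can be sketched in a few lines of Python. This is a minimal illustration under the assumptions stated in the text (a uniform prior on the unfair coin's heads probability and equal prior odds on the two models); the function names `prob_fair` and `estimate_heads` are hypothetical, and the code is not the script that produced the figure.

```python
from math import comb

N_FLIPS = 1000  # flips observed, and also flips to be predicted


def prob_fair(x, n=N_FLIPS, prior_fair=0.5):
    """Posterior probability that the coin is fair after x heads in n flips."""
    # Probability of x heads under a fair coin: binomial with p = 1/2.
    like_fair = comb(n, x) / 2**n
    # Under a uniform prior on p, every count 0..n is equally likely.
    like_unfair = 1 / (n + 1)
    numer = prior_fair * like_fair
    return numer / (numer + (1 - prior_fair) * like_unfair)


def estimate_heads(x, n=N_FLIPS):
    """Weighted average: n/2 if the coin is fair, the observed count x otherwise."""
    w = prob_fair(x, n)
    return w * (n / 2) + (1 - w) * x


# The three cases discussed in the text.
for x in (508, 540, 740):
    print(x, round(prob_fair(x), 3), round(estimate_heads(x), 1))
```

Running the loop shows the pattern described above: at 508 the estimate sits very near 500, at 540 it lands near 520, and at 740 the fair-coin model has essentially no pull.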

# Model Averaging For Effect Sizes

I think we should estimate effect sizes by model averaging, including the possibility that some effects are identically zero. The following figure shows the case for a sample size of 50. There is an ordinary effects model, where effect size is distributed a priori as a standard normal, and a null model. When the sample effect size is small, the plausibility of the null model increases, and the estimate is shrunk toward zero. When the sample effect size is large, the effects model dominates, and the estimate is very close to the sample value.

Of interest to me are the small effect sizes. Consider, say, a sample effect size of .06. The model-averaged effect-size estimate is .0078, about 13% of the sample statistic. For small effect sizes we see dramatic shrinkage, as there should be if we have increased confidence in the null. I wonder if many of the reported effect sizes in Many Labs 3 are more profitably shrunk, perhaps dramatically, toward zero.
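The shrinkage can be sketched with a simple approximation. The code below assumes, for illustration only, that the sample effect size $$d$$ is normally distributed around the true effect size with variance $$1/n$$, and that the null and effects models are equally likely a priori; neither assumption is spelled out in the post, and the function name `model_averaged_effect` is hypothetical.

```python
from math import exp, sqrt


def model_averaged_effect(d, n, prior_null=0.5):
    """Model-averaged effect-size estimate from a null model and a N(0, 1) effects model.

    Assumes d ~ Normal(delta, 1/n), an approximation chosen for this sketch.
    """
    var_null = 1.0 / n          # variance of d when delta is exactly zero
    var_effect = 1.0 + 1.0 / n  # marginal variance of d when delta ~ Normal(0, 1)
    # Bayes factor (null over effects): ratio of two zero-mean normal densities at d.
    bf01 = sqrt(var_effect / var_null) * exp(
        -0.5 * d**2 * (1.0 / var_null - 1.0 / var_effect)
    )
    posterior_odds_null = bf01 * prior_null / (1.0 - prior_null)
    prob_effect = 1.0 / (1.0 + posterior_odds_null)
    # Posterior mean of delta under the effects model (normal-normal update).
    mean_under_effect = d * n / (n + 1.0)
    # The null model contributes an estimate of exactly zero.
    return prob_effect * mean_under_effect


print(round(model_averaged_effect(0.06, 50), 4))
```

Under these assumptions, a sample effect size of .06 with a sample size of 50 gives an estimate of about .0078, in line with the value quoted above, while a large sample effect size is left almost untouched.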

# But the Null is Never True….What?

Cohen and Meehl were legendary contributors. And one of their most famous dictums was that the null is never true to arbitrary precision. If I had $10 for every researcher who repeats this proposition, then I would be quite well off. I have three replies: 1. the dictum is irrelevant, 2. the dictum is assuredly wrong in some cases, and 3. in other cases, the dictum is better treated as an empirical question rather than a statement of faith.

• Irrelevant: The relevant question is not whether the null is true or not. It is whether the null is theoretically important. Invariances, regularity, and lawfulness are often of interest even if they are never true to arbitrary precision. Jupiter, for example, does not orbit the sun in a perfect ellipse. There are, after all, small tugs from other objects. Yet Kepler's laws are of immense importance even if they hold only approximately. My own take on psychological science is that if we allowed people to treat the null as important and interesting, they surely would, and this would be good.

• Wrong in some cases: Assuredly, there are cases where the point null does hold exactly. Consider the random-number generator in common packages, say those that produce a uniform distribution between 0 and 1. I could hope that the next number is high, say above .9. I contend that this hope has absolutely no effect whatsoever on the next generated number.

• Testable in others: The evidence for the null vs. a specified alternative may be assessed. But to do so, you need to start with real mass on the null. And that is what we do with our Bayes factor implementations.

So, why be convinced of something a priori that you know is wrong in some cases and can be tested in others?

# If You've Gotten This Far

If you are convinced by my arguments, then may I suggest you downweight effect sizes as a useful measure. They are clearly marginal or averaged across different models.
Of more meaning to me is not the averaged size of effects, but the probability of an effect. If you turn your attention there, then welcome to Bayesian model comparison!

#### 7 comments:

Unknown said...

Nice post! It looks similar to penalized regression estimation, such as ridge and LASSO methods. Cf. http://genomemedicine.com/content/supplementary/s13073-014-0117-z-s3.pdf for a picture not too dissimilar from your picture on the right-hand side.

(Minor note: when the donor says "If you are off by more than 100, then there will be no donation", he is formally correct, but he could have said "If you are off by more than sqrt(1000) = 32, then there will be no donation")

Jeff Rouder said...

Thank you. Model averaging along with ridge and lasso are regularization techniques designed to boost predictive accuracy. They share a similar spirit. I think the "penalty" of the null model in model averaging is more pronounced than with the other methods.

As for the donor, I am confused. The donor has $10,000, and if you are off by 100, then $10,000 - $(100^2) = $0. I was just trying to prevent negative donations :)

Unknown said...

Apologies, I misread the text. I read that the donor had $1,000.

amisis said...

If I can leave my 2 cents, Ridge and Lasso regressions are formally penalizations over the L2 and L1 norms of the regression coefficients, respectively (where all the measurements have to be standardized). From a Bayesian point of view, this is exactly equivalent to putting a prior on the values of these coefficients (Gaussian or Laplacian with standard deviation 1 for the L2 and L1 penalizations). So they correspond to the use of an informed prior on the estimate of the parameters.

Jeff Rouder said...

Thanks Amisis. Your two cents are always welcome here. Yup, Lasso and Ridge are ways of shrinking the effect-size estimate toward zero. The shrinkage I am seeking from model averaging is in the same spirit, but the prior specification is even more informed. I place a spike at zero and use what are known as spike-and-slab priors. The amount of shrinkage or regularization may be much larger in the spike-and-slab approach.

amisis said...

This is a good example of the importance of the priors in the analysis :)
Even with the same generative model, the two different priors describe two different analyses. In the Lasso regression the idea is that all the elements are relevant, but maybe the effect is so small that it can be safely removed (the set of parameters with value 0 has null measure), while in your model there is a proper possibility that the effect is really zero.

Unknown said...