Saturday, January 5, 2019

P-values and Sample Sizes, the Survey

I ran a brief 24-hour survey in which many of you participated.  Thank you.

The main goal was to explore how people weigh sample size against p-values.  I think with the adoption of power and sample-size planning, many people have mistakenly used pre-data intuitions for post-data analysis.  Certainly, if we had no data, we would correctly think, all other things being equal, that a larger study has greater potential to be more evidential than a smaller one.  But what about after the data are collected?

Here is the survey.  The darker blue bar is the most popular response.

The Answers

My own feeling is that the study with the smaller sample size is more evidential.  Let's take it from a few points of view:

Significance Testing:  If you are a strict adherent of significance testing, then you would use the p-values.  You might choose "same."  However, the example shows why significance testing is critiqued.  Let's consider comparisons across small and very large sample sizes, say N1=50 and N2=1,000,000.  The observed effect size for the first experiment is a healthy .32; that for the second is a meager .002.  So, as sample size increases while the p-value stays the same, we are observing smaller and smaller effects.
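The arithmetic behind this comparison can be sketched as follows.  This is a minimal sketch, assuming a one-sample, two-sided z-test; the p-value of .024 is my own choice (not stated in the post) picked because it roughly reproduces the .32 and .002 figures above.

```python
# Effect size implied by a fixed p-value at different sample sizes.
# Assumes a one-sample, two-sided z-test; p = .024 is an illustrative
# assumption chosen to roughly match the .32 and .002 figures.
from statistics import NormalDist
from math import sqrt

def implied_effect_size(p, n):
    """Observed standardized effect size d that yields a two-sided
    p-value of `p` in a one-sample z-test with sample size `n`."""
    z = NormalDist().inv_cdf(1 - p / 2)  # z-score corresponding to p
    return z / sqrt(n)                   # d = z / sqrt(n)

p = 0.024
for n in (50, 1_000_000):
    print(f"N = {n:>9}: d = {implied_effect_size(p, n):.3f}")
```

Because d = z / sqrt(N), holding the p-value (and hence z) fixed while N grows forces the observed effect to shrink toward zero.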

Modern Testing I: Modern testing has been influenced by considerations of effect sizes.  If effect size is to matter in inference at all, then the correct answer is the smaller sample size.  After all, the p-values are equal, and the smaller sample size has the larger effect size.

Modern Testing II: Another way of thinking about modern testing is that the analyst chooses an alpha level based on context.  An obvious factor is sample size, and many authors recommend lowering alpha with increasing sample size.  Hence, the same p-value is more likely to be significant with the smaller sample size.

Bayesian Testing:  For all reasonable priors, the Bayes factor favors the smaller sample size because larger effect sizes are more compatible, in general, with the effect than with the null.  Tom Faulkenberry notes that if you get to see the data first and fine-tune the priors, then you can game a higher Bayes factor for N2 than for N1.
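A quick numeric illustration of this point, using the BIC approximation to the Bayes factor for a one-sample t-test.  This is a crude stand-in for a proper default-prior Bayes factor, and t = 2.26 is my own assumed value, chosen so that both studies have roughly the same p-value:

```python
# BIC-approximate Bayes factor for a one-sample t-test:
#   BF01 ~= sqrt(n) * (1 + t^2/(n-1))^(-n/2)
# This is only a rough approximation to a default Bayes factor; t = 2.26
# is an assumed value giving roughly equal p-values in both studies.
from math import sqrt, log, exp

def bf01_bic(t, n):
    """BIC-approximate Bayes factor favoring the null over the
    alternative; computed on the log scale to avoid underflow."""
    log_bf01 = 0.5 * log(n) - (n / 2) * log(1 + t**2 / (n - 1))
    return exp(log_bf01)

t = 2.26
for n in (50, 1_000_000):
    bf01 = bf01_bic(t, n)
    print(f"N = {n:>9}: BF01 = {bf01:8.2f} (BF10 = {1 / bf01:.3f})")
```

At the same t, the small study yields a Bayes factor that leans toward the effect, while the million-subject study yields strong evidence for the null: the familiar Jeffreys-Lindley phenomenon, in which a fixed p-value becomes evidence for the null as N grows.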

What We Learned

For me, the best answer is N1 because it captures the appropriate post-data intuition that, everything else being equal, larger effect sizes are preferable to smaller ones when establishing effects.  Unfortunately, it was the least popular choice at 18%.

One of the shocking things to me is the popularity of N2 (24%).  I can't think of any inferential strategy that would give credence to an N2 response.  So, if you chose N2, you may wish to rethink how you evaluate the significance of effects.  The "same" response (18%) makes sense only if you are willing to ignore effect size.  Ignoring effect size, however, strikes me as unwise in the current climate.

The most popular response was "depends" (40%).  I am not sure what to make of the "depends" responses.  I suspect for some of you, it was a cop-out to see the results.  For others, it was an overly technical response to cover your bases.  In any case, it really doesn't depend that much.  Go with bigger effects when establishing effects.


Shravan Vasishth said...

What about Type M error? I would have said it depends on the prospective power function. If I know that N1 or N2 will likely lead to low power, I would not believe either result.

Jeff Rouder said...

Thx. I think, sure, before data that intuition is correct. But once the data are in, that intuition is misleading. I think it is a mistake that ppl keep bashing small N studies after the fact.

Shravan Vasishth said...

But even after the study is done, I can still ask what the design properties of the expt are, a priori. That's what Gelman and Carlin 2014 is all about, I believe. Maybe I am missing something here.

Moritz Körber said...

I guess it depends whether you condition on significance or not, see for example Royall 1986

Jeff Rouder said...

Shravan, I'll have to read it again. Maybe they are wrong about this. Moritz, Thanks for the link.

Justin said...

Depends plenty. I'd like to know the quality of the experimental designs / studies themselves. Are we assuming they are of equivalent quality here?