Thursday, February 28, 2019

Reimagining Meta-Analysis





Fruit and Meta-Analysis

The fruit in my house weigh on average 93 grams.  I know this because I weighed them.  The process of doing so is a good analogy for meta-analysis, though a lot less painful.

I bet you find the value of 93 grams rather uninformative.  It reflects what my family likes to eat, say, bananas more than kiwis and strawberries more than blackberries.  In fact, even though I went through the effort of gathering the fruit, setting fruit criteria (for the record, I excluded a cucumber because cucumbers, while bearing seeds, just don't taste like fruit), and weighing them, dare I say this mean doesn't mean anything.  And this is the critique Julia Haaf (@JuliaHaaf), Joe Hilgard (@JoeHilgard), Clint Davis-Stober (@ClintinS), and I provide for meta-analytic means in our just-accepted Psychological Methods paper.  Just because you can compute a sample mean doesn't mean that it is automatically helpful.

Means are most meaningful to me when they measure the central tendency of some naturally interesting random process.  For example, if I were studying environmental impacts on children's growth in various communities, the mean (and quantiles) of height and weight for children of given ages is certainly meaningful.  Even though the sample of kids in a community is diverse in wealth, race, etc., the mean is helpful in understanding, say, environmental factors such as a local pesticide factory.

In meta-analysis, the mean is over something haphazard: whatever paradigms happen to be trendy for certain questions.  The collection of studies is more like the collection of fruit in my house.  And just as the fruit mean reflects my family's preferences about fruit as much as any biological variation among seeded plant things, the meta-analytic mean reflects the sociology of researchers (how they decide what data to collect) as much as the phenomenon under study.

Do All Studies Truly?

In our recent paper, we dispense with the meta-analytic mean.  It simply is not a target for us for scientific inference.  Instead, we ask a different question: "Do All Studies Truly...?"  To set the stage, we note that most findings have a canonical direction.  For example, we might think that playing violent video games increases rather than decreases subsequent aggressive behavior.  "Increases" here is the canonical direction, and we can call it the positive effect.  If we gather a collection of studies on the effects of video game violence, do all truly have an effect in this positive direction?  That is, do all truly increase aggression, or do some truly increase and others truly decrease it?  Next, let's focus on "truly."  Truly for a study refers to what would happen in the large-sample limit of many people.  In any finite sample for any study, we might observe a negative-direction effect from sampling noise, but the main question is about the true values.  Restated, how plausible is it that all studies have a true positive effect even though some might have negative sample effects?  Using Julia's and my previous work, we show how to compute this plausibility across a collection of studies.
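To make the distinction between true and sample effects concrete, here is a toy simulation in R.  This is my own illustration, not the model from the paper, and every number in it is made up.  All of the simulated studies have truly positive effects, yet sampling noise pushes some of the observed effect sizes below zero.

set.seed(123)
K <- 20                                # number of studies
n <- 40                                # participants per study
true_effect <- runif(K, .05, .30)      # every true effect is positive
observed <- rnorm(K, mean = true_effect, sd = 1 / sqrt(n))  # noisy sample effect sizes
sum(observed < 0)                      # typically a few observed effects still come out negative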

So What?

Let's say "yes," it is plausible that all studies in a collection truly have effects in a common direction, say that violent video games do indeed increase aggression.  What is implied is much more constrained than some statement about the meta-analytic mean.  It is about robustness.  Whatever the causes of variation in the data set, the main finding is robust to these causes.  It is not just that the average shows the effect; all studies plausibly do.  What a strong statement to make when it holds!

Now, let's take the opposite possibility, "no."  It is not plausible that all studies truly have effects in a common direction.  With high probability, some have true effects in the opposite direction.  The upshot is a rich puzzle.  Which studies go one way and which go the other?  Why?  What are the moderators?

In our view, then, the very first meta-analytic question is "do all studies truly."  The answer will surely shape what we do next.

Can You Do It Too?

Maybe, maybe not.  The actual steps are not that difficult.  One needs to perform a Bayesian analysis and gather the posterior samples.  The models are pretty straightforward and are easy to implement in the BayesFactor package, Stan, or JAGS.  Then, to compute the plausibility of the "do all studies truly" question, one needs to count how many posterior samples fall in certain ranges.  So, if you can gather MCMC posterior samples for a model and count, you are in good shape.
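Here is a minimal sketch of the counting step in R.  The matrix theta stands in for posterior draws of the study-specific true effects, with rows as MCMC draws and columns as studies; in practice it would come out of BayesFactor, Stan, or JAGS, and the fake draws below are generated only so the snippet runs on its own.

set.seed(456)
M <- 5000                                   # number of posterior draws
K <- 8                                      # number of studies
theta <- matrix(rnorm(M * K, mean = .15, sd = .10), M, K)   # stand-in posterior draws

all_positive <- apply(theta > 0, 1, all)    # for each draw: are all K true effects positive?
mean(all_positive)                          # proportion of such draws, the plausibility that all studies truly go one way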

We realize that some people may be drawn to the question but repelled by the lack of an automatic solution.  Julia and I have unrealized dreams of automating the process.  But in the meantime, if you have a cool data set and an interest in the does-every-study-truly question, let us know.

Rouder, J. N., Haaf, J. M., Davis-Stober, C., & Hilgard, J. (in press). Beyond overall effects: A Bayesian approach to finding constraints across a collection of studies in meta-analysis. Psychological Methods.





Saturday, January 5, 2019

P-values and Sample Sizes, the Survey

I ran a brief 24-hour survey in which many of you participated.  Thank you.

The main goal was to explore how people weigh sample size against p-values.  I think that with the adoption of power and sample-size planning, many people have mistakenly used pre-data intuitions for post-data analysis.  Certainly, if we had no data, we would correctly think, all other things being equal, that a larger study has greater potential to be more evidential than a smaller one.  But what about after the data are collected?

Here is the survey.  The darker blue bar is the most popular response.




The Answers

My own feeling is that the study with the smaller sample size is more evidential.  Let's take it from a few points of view:

Significance Testing:  If you are a strict adherent of significance testing, then you would use the p-values.  You might choose "same."  However, the example shows why significance testing is critiqued.  Let's consider comparisons across small and very large sample sizes, say N1=50 and N2=1,000,000.  The observed effect size for the first experiment is a healthy .32; that for the second is a meager .002.  So, as sample size increases and the p-value does not, we are observing smaller and smaller effects.
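Here is a quick back-of-the-envelope check of those numbers in R.  I am assuming a one-sample t-test and a shared two-sided p-value of about .03; the exact test and p-value are my assumptions, not something stated in the survey.

p <- .03                                 # the shared two-sided p-value (assumed)
for (N in c(50, 1e6)) {
  t_obs <- qt(1 - p / 2, df = N - 1)     # t-value that produces this p-value
  d <- t_obs / sqrt(N)                   # implied observed effect size
  cat("N =", N, " t =", round(t_obs, 2), " d =", round(d, 3), "\n")
}
# d comes out near .32 for N = 50 and near .002 for N = 1,000,000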

Modern Testing I: Modern testing has been influenced by considerations of effect sizes.  If effect size is to matter for inference at all, then the correct answer is the smaller sample size.  After all, the p-values are equal and the smaller sample size has the larger effect size.

Modern Testing II: Another way of thinking about modern testing is that the analyst chooses an alpha level based on context.  An obvious factor is sample size, and many authors recommend lowering alpha with increasing sample size.  Hence, the same p-value is more likely to be significant with the smaller sample size.

Bayesian Testing:  For all reasonable priors, the Bayes factor favors the smaller sample size because larger effect sizes are more compatible, in general, with the effect than with the null.  Tom Faulkenberry notes that if you get to see the data first and fine-tune the priors, then you can game a higher Bayes factor for N2 than N1.
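To see the direction of the result, here is a rough, hand-rolled Bayes factor sketch in R.  It is not the default JZS/Cauchy-prior Bayes factor from the BayesFactor package; instead it places a normal prior on effect size (the prior standard deviation of .5 is my choice) and approximates the observed effect size as normal with standard error 1/sqrt(N).  The point survives these simplifications.

bf10 <- function(d, N, prior_sd = 0.5) {
  dnorm(d, 0, sqrt(prior_sd^2 + 1 / N)) /  # marginal likelihood of d under an effect
    dnorm(d, 0, sqrt(1 / N))               # likelihood of d under the null
}
bf10(d = .32,  N = 50)    # around 3: modest evidence for an effect
bf10(d = .002, N = 1e6)   # well below 1: the same p-value now favors the null

With this prior, the smaller study provides modest evidence for an effect, while the million-person study with the same p-value actually favors the null.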

What We Learned

For me, the best answer is N1 because it captures the appropriate post-data intuition that, everything else equal, larger effect sizes are preferable to smaller effect sizes when establishing effects.  Unfortunately, it was the least popular choice at 18%.

One of the shocking things to me is the popularity of N2 (24%).  I can't think of any inferential strategy that would give credence to an N2 response.  So, if you chose N2, you may wish to rethink how you evaluate the significance of effects.  The "same" response (18%) makes sense only if you are willing to ignore effect size.  Ignoring effect size, however, strikes me as unwise in the current climate.

The most popular response is "depends" (40%).  I am not sure what to make of the depends responses.  I suspect that for some of you, it was a cop-out chosen just to see the results.  For others, it was an overly technical response to cover your bases.  In any case, it really doesn't depend that much.  Go with bigger effects when establishing effects.