Saturday, February 27, 2016

Bayesian Analysis: Smear and Compare

This blog post is co-written with Julia Haaf (@JuliaHaaf).

Suppose Theory A makes a prediction about main effects and Theory B makes a prediction about interactions.  Can we compare the theories with data?  This question was posed on the Facebook Psychological Methods Group by Rickard Carlsson.

Uli Schimmack (@R_Index) put the discussion in general terms with this setup:
Should patients take Drug X to increase their life expectancy?
Theory A = All patients benefit equally (unlikely to be true).
Theory B1 = Women benefit, but men's LE is not affected.
Theory B2 = Women benefit, but men's LE decreases (cross-over).
Theory B3 = Women benefit, and men benefit too, but less.

We are going to assume that each of these statements represents a theoretically interesting position of constraint. The goal is to use data to state the relative evidence for or against these positions.  This question is pretty hard from a frequentist perspective as there are difficult order constraints to be considered.  Fortunately, it is relatively simple from a Bayesian model-comparison perspective.

Model Specifications

The first and perhaps most important step is representing these verbal positions as competing statistical models, or performing model specification.  Model specification is a bit of an art, and here is our approach:

Let \(Y_{ijk}\) be life expectancy where \(i\) denotes gender (\(i=w\) for women; \(i=m\) for men), where \(j\) denotes drug status (\(j=0\) for placebo; \(j=1\) for treatment), and where \(k\) denotes the replicate as there are several people per cell. We can start with the standard setup:
\(Y_{ijk} \sim \mbox{Normal}(\mu_{ij},\sigma^2).\)

The next step is building in meaningful constraints on true cell means \(\mu_{ij}\).  The standard approach is to think in terms of the grand mean, main effects, and interactions.  We think in this case and for these positions, the standard approach is not as suitable as the following two-cornerstone approach:

\(\mu_{w0} = \alpha\)
\(\mu_{w1} = \alpha + \beta\)
\(\mu_{m0} = \gamma \)
\(\mu_{m1} = \gamma+\delta \)

With this parameterization, all the models can be expressed as various constraints on the relationship between \(\beta\), the effect for women, and \(\delta\), the effect for men.

Model A: The constraint in Theory A is instantiated by setting \(\beta=\delta\). We place a half normal on this single parameter, the equal effect of the drug on men and women. See the Figure below.

Model B1: The constraint in Theory B1 is instantiated by setting \(\beta>0\) and \(\delta=0\).  A half normal on \(\beta\) will do.

Model B2: The constraint in Theory B2 is instantiated by setting \(\beta>0\) and \(\delta<0\).  We used independent half normals here.

Model B3: The constraint in Theory B3 is that \(0<\delta<\beta\). This model is also shown, and it is similar to the preceding one; the difference is in the form of the constraints.

Of course, there are other models which might be useful including the null, models with no commitment to benefit for women, or models that do not assume women benefit more than men. Adding them presents no additional difficulties in the Bayesian approach.

Analysis: Smear & Compare

If we are willing to make fine specifications, as above, then it is straightforward to derive predictions for data of a set sample size. These predictions are shown as a function of the sample effect for women and men, that is, the change in lifespan between treatment and placebo for each gender.  These effects are denoted as \(\hat{\beta}\) and \(\hat{\delta}\), respectively.  Here are the predictions:

Notice how these predictions are smeared versions of the model.  That is what sample noise does!

With these predictions, we are ready to observe data. Suppose we observe that the treatment extends women's lives by 2 years and men's lives by 1 year.  We have now included this observed value as a red dot in the figures below.

As can be seen, the observation is best predicted by Model B3.

The Bayes factor is the relative comparison of these predictions.  We can ask by how much B3 beats the other models.  Here it is:

B3 beats A by 3.7-to-1
B3 beats B1 by 18.2-to-1
B3 beats B2 by 212-to-1
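
The smear-and-compare logic can be sketched in a few lines of Python. This is only an illustration, not the code behind the numbers above: the post does not report the prior scales or the sample sizes, so `PRIOR_SD` and `SE` are assumed placeholder values, and the form of the B3 prior (two half normals ordered so that \(\delta<\beta\)) is likewise an assumption. The ratios it prints will therefore not match the values reported above. Each model's prediction at the observed point is approximated by averaging the likelihood of \((\hat{\beta},\hat{\delta})=(2,1)\) over draws from that model's prior.

```python
import math
import random

random.seed(1)

PRIOR_SD = 2.0   # ASSUMED scale of the half-normal priors, in years
SE = 0.5         # ASSUMED standard error of each sample effect, in years
N_DRAWS = 20000  # Monte Carlo draws per model


def half_normal():
    """One draw from a half normal with scale PRIOR_SD."""
    return abs(random.gauss(0.0, PRIOR_SD))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Each function returns one prior draw of (beta, delta) under a model.
def draw_a():
    theta = half_normal()          # equal effect for women and men
    return theta, theta


def draw_b1():
    return half_normal(), 0.0      # women benefit, men unaffected


def draw_b2():
    return half_normal(), -half_normal()   # women benefit, men harmed


def draw_b3():
    x, y = half_normal(), half_normal()
    return max(x, y), min(x, y)    # assumed prior obeying 0 < delta < beta


def marginal_likelihood(draw, beta_hat, delta_hat):
    """Average the likelihood of the observed effects over prior draws."""
    total = 0.0
    for _ in range(N_DRAWS):
        beta, delta = draw()
        total += normal_pdf(beta_hat, beta, SE) * normal_pdf(delta_hat, delta, SE)
    return total / N_DRAWS


models = {"A": draw_a, "B1": draw_b1, "B2": draw_b2, "B3": draw_b3}
evidence = {name: marginal_likelihood(draw, 2.0, 1.0)
            for name, draw in models.items()}

for name in ("A", "B1", "B2"):
    print(f"B3 beats {name} by {evidence['B3'] / evidence[name]:.1f}-to-1")
```

Changing `PRIOR_SD` or `SE` changes how much each model is smeared, and with it the Bayes factors, which is why the specification step matters so much.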

Saturday, February 6, 2016

What It Would Take To Believe in ESP?

"Bem (2011) is still not retracted.  Not enough public critique?"  -- R-Index, Tweet on February 5th, 2016.
Bem's 2011 paper remains controversial because of the main claim of ESP.  Many researchers probably agree with me that the odds that ESP is true are quite small.  My subjective belief is that it is about three times as unlikely as winning the PowerBall jackpot.  Yet, Bem's paper is well written and well argued.  In many ways it is a model of how psychology papers should be written.  And so we have a problem---either there is ESP or the everyday way we produce and communicate knowledge is grievously flawed.  One benefit of Bem (2011) is that it forces us to reevaluate our production of knowledge perhaps more forcefully than any direct argument could.  How could the ordinary applications of our methods lead to the ESP conclusion?

There Is Evidence for an ESP Effect

The call to retract Bem is unfortunate.   There is no evidence of any specific fraud nor any element of substantial incompetence.  That does not mean the paper is free from critique---there is much to criticize as I will briefly mention subsequently (see also Tal Yarkoni's blog).  Yet, even when the critiques are taken into account, there is evidence from the reported data of an ESP effect.  Morey and I found a Bayes factor of about 40-to-1 in favor of an ESP effect.

In getting this value, we noted a number of issues as follows:  We feel Experiments 5, 6, and 7 were too opportunistic.  There was no clear prediction for the direction of the effect---either retroactive mere exposure where future repeats increase the feeling of liking, or retroactive habituation where future repeats decrease the feeling of liking.  Both of these explanations were used post-hoc to explain different ESP trends; we argue this usage is suspect, and we discarded these results.  We also worried about the treatment of non-erotic stimuli.  In Experiments 2-4, emotional non-erotic stimuli elicited ESP; in Experiments 8-9 neutral stimuli elicited ESP.  In Experiment 1, however, these non-erotic stimuli did not elicit ESP; in fact, only the erotic ones did.  So, we feel Experiment 1 is a failure of ESP for these non-erotic stimuli and treated it as such in our analysis.  Even with these corrections, there was 40-to-1 evidence for an ESP effect.

In fact, the same basic story holds for telepathy.  Storm et al. meta-analytically reviewed 67 studies and found a z of about 6, indicating overwhelming evidence for this ESP effect.  We went in, examined a bunch of these studies and trimmed out several that did not meet the criterion.  Even so, the Bayes factor was as much as 330-to-1 in favor of an ESP effect!  (see Rouder et al., 2013)

Do I Believe In ESP?

No.  I believe that there is some evidence in the data for something, but the odds that it is ESP are too remote.  Perhaps there are a lot of missing negative studies.

Toward Believing In ESP: The Movie Theatre Experiment

So what would it take to believe in ESP?  I think Feynman once noted that a physicist would not be satisfied with such small effects.  She would build a better detector or design a better experiment.  (I can't find the Feynman cite, please help).  So here is what would convince me:

I'd like to get 500 people in a movie theatre and see if they could feel the same future.  Each would have an iPad, and beforehand, each would have provided his or her preferences for erotica.  A trial would start with a prediction---each person would have to predict whether an ensuing coin flip will land heads or tails.  From this, we tally the predictions to get a group point prediction.  If more people predict heads than tails, the group prediction is heads; if more people predict tails, the group prediction is tails.  Now we flip the coin.  If the group got it right, then everyone is rewarded with the erotica of their choice.  If the group got it wrong, then everyone is shown a gross IAPS photo of decapitated puppies and the like.  We can run some 100 trials.  I bet people would have fun.

Here is the frequentist analysis:  Let's suppose under the ESP alternative that people feel the future with a rate of .51 compared to the .50 baseline.  So, how often is the group prediction from 500 people correct? The answer is .66. Telling whether performance is .66 or .50 is not too hard.  If we run 100 total trials, we can divide up at 58: 58 or fewer group-correct trials is evidence for the null; 59 or more group-correct trials is evidence for ESP.  The probability of getting more than 58 group-correct trials under the null is .044.  The probability of getting fewer than 59 group-correct trials under the ESP alternative is .058.  The group prediction about a shared future is a better detector than the usual way.
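
These numbers can be checked with exact binomial sums. The sketch below uses only the Python standard library. How to score a 250-250 tie is not specified above, so the code assumes that a tie counts as no correct group prediction; the miss rate is computed with the rounded group accuracy of .66 quoted above, and it may differ slightly if the unrounded value is used instead.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def binom_tail_ge(k_min, n, p):
    """Probability of k_min or more successes in n independent trials."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))


# Group accuracy: a strict majority of 500 people, each correct at rate .51.
# ASSUMPTION: a 250-250 tie is scored as not correct.
n_people = 500
p_person = 0.51
p_group = binom_tail_ge(n_people // 2 + 1, n_people, p_person)

# Decision rule over 100 trials: 59 or more group-correct trials favors ESP.
n_trials = 100
cutoff = 59
false_alarm = binom_tail_ge(cutoff, n_trials, 0.50)   # null: chance level
miss = 1.0 - binom_tail_ge(cutoff, n_trials, 0.66)    # ESP: rounded .66

print(f"group accuracy: {p_group:.3f}")
print(f"P(59 or more | null): {false_alarm:.3f}")
print(f"P(58 or fewer | ESP): {miss:.3f}")
```

Raising `n_people` pushes the group accuracy further from .50, which is the sense in which the crowd acts as a better detector.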

Of course, I would perform a Bayesian analysis of the data.  I would put a distribution on the per person ESP effect, allowing some people to not feel the future at all.  Then I would generalize this to a distribution for the group, derive predictions for this model and the null, and do the usual Bayes factor comparison.  I am not sure this experiment would fully convince me, but it would change my skeptical beliefs by a few orders of magnitude.  Do it twice and I might even be a believer!

Now, how to get funding to run the experiment?  Mythbusters?

Closing Thoughts: Retractions and Shaming

The claim that Bem (2011) should be retracted perhaps comes from the observation that getting 9 of 9 significant effects with such a small effect size and with the reported sample sizes is pretty rare.  I am not a fan of this type of argument for retraction.  I would much rather the critique be made, and we move on.  Bem's paper has done the field much good.  Either Bem has found the most important scientific finding in the last 100 years or has taught us much about how we do research.  Either way, it is a win-win.  I welcome his meta-analysis on the same grounds.