Saturday, February 6, 2016

What Would It Take to Believe in ESP?

"Bem (2011) is still not retracted.  Not enough public critique?"  -- R-Index, Tweet on February 5th, 2016.
Bem's 2011 paper remains controversial because of its main claim of ESP.  Many researchers probably agree with me that the odds that ESP is true are quite small.  My subjective belief is that it is about three times as unlikely as winning the PowerBall jackpot.  Yet Bem's paper is well written and well argued.  In many ways it is a model of how psychology papers should be written.  And so we have a problem---either there is ESP or the everyday way we produce and communicate knowledge is grievously flawed.  One benefit of Bem (2011) is that it forces us to reevaluate our production of knowledge perhaps more forcefully than any direct argument could.  How could the ordinary application of our methods lead to the ESP conclusion?

There Is Evidence for an ESP Effect

The call to retract Bem is unfortunate.  There is no evidence of any specific fraud or any substantial incompetence.  That does not mean the paper is free from critique---there is much to criticize, as I will briefly note subsequently (see also Tal Yarkoni's blog).  Yet even when the critiques are taken into account, there is evidence in the reported data for an ESP effect.  Morey and I found a Bayes factor of about 40-to-1 in favor of an ESP effect.

In getting this value, we noted a number of issues.  We felt Experiments 5, 6, and 7 were too opportunistic.  There was no clear prediction for the direction of the effect: either retroactive mere exposure, where future repeats increase the feeling of liking, or retroactive habituation, where future repeats decrease the feeling of liking.  Both explanations were invoked post hoc to explain different ESP trends; we argued this usage is suspect and discarded those results.  We also worried about the treatment of non-erotic stimuli.  In Experiments 2-4, emotional non-erotic stimuli elicited ESP; in Experiments 8-9, neutral stimuli elicited ESP.  In Experiment 1, however, non-erotic stimuli did not elicit ESP; only the erotic ones did.  So we treated Experiment 1 as a failure of ESP for non-erotic stimuli in our analysis.  Even with these corrections, there was 40-to-1 evidence for an ESP effect.

In fact, the same basic story holds for telepathy.  Storm et al. meta-analytically reviewed 67 studies and found a z of about 6, indicating overwhelming evidence for this ESP effect.  We examined a number of these studies and trimmed out several that did not meet the inclusion criteria.  Even so, the Bayes factor was as much as 330-to-1 in favor of an ESP effect (see Rouder et al., 2013)!
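To give a flavor of that calculation, here is a minimal sketch of a binomial Bayes factor against the .25 chance baseline of a four-alternative ganzfeld task.  The counts below are made-up placeholders, not Storm et al.'s data, and the uniform prior on the hit rate is my simplification, not the prior we used in Rouder et al. (2013):

```python
from scipy import stats, integrate

# Hypothetical pooled counts -- placeholders, not Storm et al.'s data.
hits, trials = 350, 1200
chance = 0.25  # ganzfeld tasks are four-alternative, so .25 under the null

# Null: hit rate fixed at chance.
m0 = stats.binom.pmf(hits, trials, chance)

# Alternative: hit rate above chance, uniform prior on (.25, .50).
m1, _ = integrate.quad(
    lambda p: stats.binom.pmf(hits, trials, p) / (0.50 - 0.25), chance, 0.50
)

print(f"Bayes factor (ESP over null): {m1 / m0:.1f}")
```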

Do I Believe In ESP?

No.  I believe that there is some evidence in the data for something, but the odds that it is ESP are too remote.  Perhaps there are a lot of missing negative studies.

Toward Believing In ESP: The Movie Theatre Experiment

So what would it take to believe in ESP?  I think Feynman once noted that a physicist would not be satisfied with such small effects.  She would build a better detector or design a better experiment.  (I can't find the Feynman cite; please help.)  So here is what would convince me:

I'd like to get 500 people in a movie theatre and see if they could feel the same future.  Each would have an iPad, and beforehand each would have provided his or her preferences for erotica.  A trial would start with a prediction---each person would predict whether an ensuing coin flip will land heads or tails.  From these we tally the predictions to get a group point prediction: if more people predict heads than tails, the group prediction is heads; if more people predict tails, the group prediction is tails.  Now we flip the coin.  If the group got it right, then everyone is rewarded with the erotica of their choice.  If the group got it wrong, then everyone is shown a gross IAPS photo of decapitated puppies and the like.  We can run some 100 trials.  I bet people would have fun.
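A quick simulation makes the procedure concrete.  This is a sketch, not a protocol: the .51 per-person rate is the value assumed in the analysis below, and everything else is placeholder.

```python
import numpy as np

rng = np.random.default_rng(2016)
n_people, n_trials, p_person = 500, 100, 0.51  # .51 = assumed ESP rate

group_correct = 0
for _ in range(n_trials):
    flip = rng.integers(2)  # the ensuing coin flip: 0 = tails, 1 = heads
    # Under the ESP alternative, each person's prediction matches the
    # future flip with probability .51.
    matches = rng.random(n_people) < p_person
    predictions = np.where(matches, flip, 1 - flip)
    group_prediction = int(predictions.sum() > n_people / 2)  # majority vote
    group_correct += int(group_prediction == flip)

print(f"Group-correct trials out of {n_trials}: {group_correct}")
```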

Here is the frequentist analysis: Suppose under the ESP alternative that people feel the future at a rate of .51 against the .50 baseline.  How often, then, is the group prediction from 500 people correct?  The answer is .66.  Telling whether performance is .66 or .50 is not too hard.  If we run 100 total trials, we can divide at 58: 58 or fewer group-correct trials is evidence for the null; 59 or more is evidence for ESP.  The probability of getting more than 58 group-correct trials under the null is .044.  The probability of getting fewer than 59 group-correct trials under the ESP alternative is .058.  The group prediction about a shared future is a better detector than the usual way.
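These numbers are easy to check.  Here is a sketch of the computation (my own, using scipy's binomial routines):

```python
from scipy import stats

n_people, n_trials, p_person = 500, 100, 0.51

# P(majority of 500 is correct) when each person is right at rate .51:
# upper tail of a Binomial(500, .51) above 250 correct people.
p_group = stats.binom.sf(n_people // 2, n_people, p_person)
print(f"group-correct rate under ESP: {p_group:.2f}")          # ~ .66

# Decision rule over 100 trials: call ESP at 59 or more group-correct trials.
alpha = stats.binom.sf(58, n_trials, 0.50)     # ~ .044, false alarm under null
miss = stats.binom.cdf(58, n_trials, p_group)  # ~ .058, miss under ESP
print(f"false-alarm rate: {alpha:.3f}, miss rate: {miss:.3f}")
```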

Of course, I would perform a Bayesian analysis of the data.  I would put a distribution on the per-person ESP effect, allowing some people to not feel the future at all.  Then I would generalize this to a distribution for the group, derive predictions for this model and the null, and do the usual Bayes factor comparison.  I am not sure this experiment would fully convince me, but it would change my skeptical beliefs by a few orders of magnitude.  Do it twice and I might even be a believer!
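Here is a minimal sketch of what that comparison might look like.  The spike-and-slab prior, its shapes, and the observed count of 66 are placeholders of my own choosing, not a worked-out model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_people, n_trials = 500, 100

def group_rate(person_rates, n_sims=1000):
    # Monte Carlo estimate of P(majority vote is correct) given
    # heterogeneous per-person accuracies.
    correct = rng.random((n_sims, len(person_rates))) < person_rates
    return np.mean(correct.sum(axis=1) > len(person_rates) / 2)

def marginal_likelihood_esp(k, n_draws=200):
    # Average the binomial likelihood of k group-correct trials over the
    # prior: a fraction omega of people feel the future, with a small bump.
    ml = 0.0
    for _ in range(n_draws):
        omega = rng.beta(1, 4)   # most people may not feel the future at all
        bump = rng.beta(1, 50)   # small per-person effect for those who do
        feels = rng.random(n_people) < omega
        rates = np.where(feels, 0.5 + bump / 2, 0.5)
        ml += stats.binom.pmf(k, n_trials, group_rate(rates))
    return ml / n_draws

k_obs = 66  # hypothetical outcome of the movie-theatre experiment
bf = marginal_likelihood_esp(k_obs) / stats.binom.pmf(k_obs, n_trials, 0.50)
print(f"Bayes factor (ESP over null): {bf:.1f}")
```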

Now, how to get funding to run the experiment?  MythBusters?


Closing Thoughts: Retractions and Shaming

The claim that Bem (2011) should be retracted perhaps comes from the observation that getting 9 of 9 significant effects with such a small effect size and the reported sample sizes is pretty rare.  I am not a fan of this type of argument for retraction.  I would much rather the critique be made, and we move on.  Bem's paper has done the field much good.  Either Bem has found the most important scientific finding of the last 100 years or he has taught us much about how we do research.  Either way, it is a win-win.  I welcome his meta-analysis on the same grounds.


3 comments:

Ulrich Schimmack said...


Hi Jeff,

disappointing. First, you could have mentioned the arguments for retraction.

1. Correlation of r = -.9 between sample size and effect size in an article that claimed to have planned all studies with fixed effect size and sample size of N = 100 to have 80% power.

2. Published articles that show the percentage of significant results is not justified by the actual power of the studies (Francis, 2012; Schimmack, 2012).

3. Strong evidence that the studies are based on dishonest/selective reporting of results.

https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Your casual comment "I am not a fan of these arguments" may work for the Super Bowl, but falls a bit short of the criteria for scientific discourse.

Jeff Rouder said...

Uli, Thanks for your comments and the reference to the blog post. I hadn't seen the TIVA analysis previously, and it makes a lot of sense. The value of 1.0 is a good lower limit on the variance of the z-scores of a set of studies; clearly just about everything else should inflate it, including positive noncentrality. My concern is the assumption that studies are randomly selected and iid, which seems to underlie the method.
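In code, the check I take the post to be making looks something like this (the z-scores here are made up for illustration):

```python
import numpy as np
from scipy import stats

# TIVA, as I read it: across k honest, independent studies, var(z) should be
# at least 1; (k - 1) * var(z) is referred to a chi-square with k - 1 df,
# and a small left-tail p suggests insufficient variance.
z = np.array([2.23, 1.80, 2.55, 2.03, 1.96, 2.39])  # made-up z-scores
k = len(z)
tiva = (k - 1) * np.var(z, ddof=1)
p_left = stats.chi2.cdf(tiva, df=k - 1)
print(f"var(z) = {np.var(z, ddof=1):.3f}, left-tail p = {p_left:.3f}")
```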

As for the Bem article, I find Bem to be very open about what he did. We may not agree with it, but dishonest is far too strong. I would rather reserve that word for outright fraud, and even in those cases I have empathy for the perpetrators. Bem reports his file drawer; he reports his methods in clear detail; it is easy to pinpoint the analytic steps I don't agree with because he is so straightforward about them. I respect Bem for the paper and the effort. I tend to think he is fooling himself, but I am not upset by it at all. Quite the contrary. I invite you to read our analysis in Rouder and Morey (2011).

I guess I don't get rattled by outrageous claims in psychology. I tend to want to do my own work, focus on invariances and other useful constraints in data, try to relate them to mechanisms, and move on. I really have no interest in verbal theory, two-by-three designs, and seeing the world through the prism of effects, effect sizes, and power. The game at its core is corrupt; why should I get too upset at the most outward manifestations when there is a whole core to expose?

phayes said...

“I believe that there is some evidence in the data for something [...] So what would it take to believe in ESP? I think Feynman once noted that a physicist would not be satisfied with such small effects.”

It would take a completely different kind of experiment. You're simply never going to demonstrate retrocausal psi (or any other phenomenon that defies firmly established physics) with data coming from the “messy, incorrigible realm of clinical trials”. Small effects aren't really the issue: what really matters is whether or not your experiment can convincingly exclude all the somethings that are not ESP but are no less plausible.

When I first read about some 'statistical' dice-rolling experiments that had supposedly established the existence of PK, I wondered if they'd firmed up those results with some more cogent experiments, e.g. ones based on a torsion balance. Instead I found they'd gone in exactly the opposite direction (and provided a nonsensical rationalisation).

“A great surprise of the early work was that PK affected only rolling dice, but could not be measured as a force acting on a stationary die on a sensitive scale. PK seemed to act only where chance processes were involved. This suggested that PK could not be considered as a force, comparable to electric or magnetic forces.”

I was very amused by that and I expect Feynman would've been too.