Tuesday, March 15, 2016

Statistical Difficulties from the Outer Limits

You would think that the more data we collect, the closer we should be to the truth.  

This blog post falls into the "I may be wrong" category.  I hope many of you comment.

ESP: God's Gift To Bayesians?


It seems like ESP is God's gift to Bayesians.  We use it like a club to reinforce the plausibility of null hypotheses and to point out the difficulties of frequentist analysis.

In the 1980s, a group of Princeton University engineers set out to test ESP by asking people to use their minds to change the outcome of a random noise generator (check out their website).   Over the course of a decade, these engineers collected an astounding 104,490,000 trials.  On each trial, the random noise generator flipped a gate with known probability of exactly .5.  The question was whether a human operator using only the power of his or her mind could increase this rate.  Indeed, they found 52,263,471 gate flips, or 0.5001768 of all trials.  This proportion, though only slightly larger than .5, is nonetheless significantly larger with a damn low p-value of .0003.   The figure below shows the distribution of successes under the null, and the observation is far to the right.  The green interval is the 99% CI, and it does not include the null.
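These numbers are easy to check.  Here is a quick sketch in R (my own illustration, not the original analysis):

N <- 104490000                                   # total trials
y <- 52263471                                    # observed gate flips
phat <- y / N                                    # 0.5001768
z <- (y - N * 0.5) / sqrt(N * 0.25)              # about 3.6 standard errors above chance
2 * pbinom(y - 1, N, 0.5, lower.tail = FALSE)    # two-sided p-value, about .0003
phat + c(-1, 1) * qnorm(.995) * sqrt(phat * (1 - phat) / N)   # 99% CI; it excludes .5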



Let's assume these folks have a decent setup and that the true probability, without human ESP intervention, should be exactly .5.  Did they show ESP?

What do you think?  The data are numerous, but do you feel closer to the truth?  Impressed by the low p-value?  Bothered by the wafer-thin effect?  Form an opinion; leave a comment.

Bayesians love this example because we can't really fathom what a poor frequentist would do.  The p-value is certainly lower than .05, even lower than .01, and even lower than .001.  So, it seems like a frequentist would need to buy in.  The only way out is to lower the Type I error rate in response to the large sample size.  But to what value, and why?


ESP: The Trojan Horse?


ESP might seem like God's gift to Bayesians, but maybe it is a Trojan Horse.  A Bayes factor model comparison analysis goes as follows.  The no-ESP null model is
\[ M_0: Y  \sim \mbox{Binomial}(.5,N) \]

The ESP alternative is
\[ M_1: Y|\theta  \sim \mbox{Binomial}(\theta,N) \]
A prior on \(\theta\) is needed to complete the specification.  For the moment, let's use a flat one, \(\theta \sim \mbox{Unif}(0,1) \).

It is pretty easy to calculate a Bayes factor here, and the answer is 12-to-1 in favor of the null.   What a relief.
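Here is one way to get that number in R (a sketch of mine rather than the code behind the figures; the helper name bf10_unif is made up).  For a uniform prior on \(\theta\) over an interval, the marginal likelihood has a closed form because \( {N \choose y} B(y+1, N-y+1) = 1/(N+1) \), so the integral reduces to a difference of beta CDFs:

# Bayes factor for M1: theta ~ Unif(lo, hi) against M0: theta = .5
bf10_unif <- function(y, N, lo, hi) {
  m0 <- dbinom(y, N, 0.5)                        # marginal likelihood under the null
  m1 <- (pbeta(hi, y + 1, N - y + 1) - pbeta(lo, y + 1, N - y + 1)) /
        ((hi - lo) * (N + 1))                    # marginal likelihood under the uniform prior
  m1 / m0                                        # values above 1 favor the alternative
}
1 / bf10_unif(52263471, 104490000, 0, 1)         # about 12, i.e., 12-to-1 for the null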

ESP proponents might rightly criticize this prior as too dispersed.  We may reasonably assume that \( \theta \) should not be less than .5, because the human operators are following the direction to increase rather than decrease the proportion of gate flips.   Also, the original investigators might argue that it is unreasonable to expect anything more than a .1% effect, so the top might be .501.  In fact, they might argue they ran such a large experiment precisely because they expected, a priori, such a small effect.  If the prior is \(\theta \sim \mbox{Unif}(.5,.501) \), then the Bayes factor is 84-to-1 in favor of an ESP effect.
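The same sketched helper gives the flipped answer for the narrow prior:

bf10_unif(52263471, 104490000, .5, .501)         # about 84, i.e., 84-to-1 for ESP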

The situation seems tenuous.  The figure below shows the Bayes factors for both priors as a function of the number of trials.  To draw these curves, I simply kept the proportion of successes constant at 0.5001768.  The line marks the observed number of trials.  With this proportion fixed, the Bayes factors not only depend on the prior, they also depend in unintuitive ways on sample size.  For instance, if we doubled the number of trials and successes, the Bayes factors become 40-to-1 and 40,000-to-1 in favor of ESP, respectively, for the flat prior and the very small interval one.
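Curves like those in the figure can be roughly reconstructed with the same sketched helper by holding the proportion of successes at 0.5001768 and varying the number of trials (the ranges and styling here are my guesses, not the original plotting code):

Ns <- round(10^seq(5, 9, by = .1))               # numbers of trials to consider
bf_flat   <- sapply(Ns, function(N) bf10_unif(round(N * 0.5001768), N, 0, 1))
bf_narrow <- sapply(Ns, function(N) bf10_unif(round(N * 0.5001768), N, .5, .501))
matplot(Ns, cbind(bf_flat, bf_narrow), type = "l", lty = 1, log = "xy",
        xlab = "Number of Trials", ylab = "Bayes Factor (alternative over null)")
abline(v = 104490000)                            # the observed number of trials
# Doubling the trials and the successes flips and sharpens the conclusion:
bf10_unif(2 * 52263471, 2 * 104490000, 0, 1)     # roughly 40-to-1 for ESP
bf10_unif(2 * 52263471, 2 * 104490000, .5, .501) # roughly 40,000-to-1 for ESP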




Oh, I can see the anti-Bayes crowd getting ready to chime in should they read this.   Sanjay Srivastava may take the high road and discuss the apparent lack of practicality of the Bayes factor.  Uri Simonsohn may boldly declare that Bayesians can't find truth.   And perhaps Uli Schimmack will create a new index, the M-Index, where M stands for moron.  Based on his analysis of my advocacy, he may declare I have the second-highest known M-Index, perhaps surpassed only by E.-J. Wagenmakers.

Seems like ESP was a bit of a Trojan Horse.  It looked all good, and then turned on us.

But What Happened?


Bayes' rule is fine, of course.  The problem is us.  We tend to ask too much of statistics.   Before I get to my main points, I need to address one issue: what is the model?  Many will call the data model, the binomial specification in this case, "the model," and treat the priors on parameters as something separate, "the prior."  Yet it is better to think of "the model" as the combination of the binomial and the prior specification.  It's all one model, and this one model provides an a priori predictive distribution about where the data should fall (see my last blog post).  The binomial is a conditional specification, and the prior completes the model.
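If it helps to make this concrete, each specification, binomial plus prior, can be simulated forward to give its a priori predictive distribution for the count (a toy sketch):

N <- 104490000
y_null   <- rbinom(1e5, N, .5)                   # null model: theta fixed at .5
y_flat   <- rbinom(1e5, N, runif(1e5, 0, 1))     # binomial plus Unif(0,1) prior
y_narrow <- rbinom(1e5, N, runif(1e5, .5, .501)) # binomial plus Unif(.5,.501) prior
# The Bayes factor is the ratio of how well two such predictive distributions
# anticipate the observed count of 52,263,471.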

With this in mind, the above figure strikes me as quite reasonable.  Consider the red line, the one that compares the null to the model where the underlying probability ranges across the full interval.  Take the point for 10,000 trials.   The number of successes is 5,002, which is almost exactly half of all trials.  Not surprisingly, this value is evidence for the null compared to this diffuse alternative.  But the same value is not evidence for the null compared to the more constrained alternative model where \(.5<\theta<.501\).  The null and this alternative are so similar that each predicts 5,002 successes out of 10,000 trials about equally well.  Hence, the Bayes factor is equivocal, and it takes far more data to discriminate between them.   As we gain more and more data, say 100,000,000 trials, the slight discrepancy from 1/2 can be resolved, and the Bayes factors start to favor the alternative models.  As the sample size increases further, the evidence becomes more pronounced.  Everything in that figure makes beautiful sense to me; it all is as it should be.  Bayes' rule is fine.
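To put numbers on the 10,000-trial case with the sketched helper from above (10,000 times 0.5001768 rounds to 5,002):

bf10_unif(5002, 10000, 0, 1)                     # far below 1: evidence for the null over the diffuse alternative
bf10_unif(5002, 10000, .5, .501)                 # near 1: the data cannot tell the null and the narrow alternative apart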

Having more and more data doesn't get us closer to the truth.  What it does give us, however, is greater resolution to more finely discriminate among models.

Loose Ends  


The question, "is there an effect" strikes me as ill formed.   Yet, we answer the question affirmatively daily.  Sometimes, effects are obvious, and they hit you between the eyes.  How can that be if the question is not well formed?

I think when there are large effects, just about any diffuse alternative model will do.  As long as the alternative is diffuse, data with large effects easily discriminate this diffuse alternative from the null.  It is in this sense that large effects are obvious.

What this example shows is that if one tries to resolve small effects with large sample sizes, there is intellectual tension.  Models matter.  Models are all that matter.  Large data give you greater resolution to discriminate among similar models.  And perhaps little else.

The Irony Is...


This ESP example is ironic.  The data are so numerous that they are capable of finely discriminating among just about any set of models we wish, even a point null and a uniform alternative spanning only .001 in width on the probability scale.  The irony is that we have no bona fide competing models to discriminate.  ESP by definition seemingly precludes any scientific explanation, and without such an explanation, all alternatives to the null are a bit contrived.  So while we can discriminate among models, there really is only one plausible model, the null, and no need for discrimination at all.

If forced to do inference here (which means someone buys me a beer),  I would choose the full-range uniform as the alternative model and state the 12-to-1 ratio in favor of the null.  ESP is such a strange proposition that I see no reason why values of \( \theta \) near .5 should be any more a priori plausible than those far from it.


5 comments:

Ulrich Schimmack said...



1. It depends on the error rate that you find acceptable to reject the null hypothesis. In quantum physics the rule is to use sigma > 5 to claim a discovery (Higgs boson discovery).

p = .0003 is only about 3 sigma.

qnorm(1-.0003*2) = 3.24

So, we may say that it doesn't look random, but the evidence is not strong enough to say randomness cannot explain this outcome.

2. Rejecting randomness as an explanation does not automatically lead to the conclusion that ESP exists. It simply means it wasn't random. Maybe the random generator was not perfect? Maybe the experimental protocol was not followed exactly? A real discovery would require an independent replication.

3. The effect size is very small. It would be amazing if it works, but what practical application could benefit from this effect size?

It would also be interesting to see what a Bayesian analysis would show. What is the BF with a scaling factor of 0.0001?


JP de Ruiter said...

Jeff, isn't this just also a case of ESP having (for you and me, at least) a very low hypothesis prior, which is so low that even a large BF10 will not make the posterior for H1 larger than very small?

Jeff Rouder said...

Thank you both for commenting.

Uli, 1. What do you think is appropriate here? 2. Agreed. The question then is whether there is evidence for a systematic effect, regardless of whether it is ESP or some artifact. 3. Oooh, don't go there. Many very practical technologies rely on tiny quantum effects, including GPS and tunneling microscopes. Radar relies on a very small phase change as well. As for the BF with even smaller scaling factors, it looks like the blue line but dives down quicker.

JP: Let's not worry about ESP per se and think about an ESP effect, whether it is due to an artifact in the machine or to true ESP. The question is "What is the evidence for an ESP effect?"

Sam Schwarzkopf said...

Great post! I have been pondering this question in various ways for quite some time, even going so far as to invent a new bootstrapping test for it ;) (and given the large number of trials involved this would fail in this scenario unless you have a looooot of time on your hands... :P).

I think the argument about the effect size being too small is a red herring. Those ESP researchers (just read the latest Bem study) are wont to point towards quantum mechanics or the fact that the second law of thermodynamics is probabilistic, not a fixed law. In essence, they try to argue that ESP like this *is* possible even if it is improbable.

As I keep arguing, even if this assumption is correct the effect sizes have to be minuscule. So actually the effect size in your example here is much better evidence for ESP than, say, Bem's experiment, in which the effect size is actually several orders of magnitude too high (even 51% correct when chance is 50% seems extreme for a real precognition ability - both based on the flaky quantum handwaving and also practically for the observable consequences it should have in the world outside the lab).

Now I am not saying that this tiny effect here is evidence for ESP. But the effect size falls more into the ballpark I would predict if it were. Whether or not it is ESP is not for statistics to decide, at least not until you can make testable predictions as to how an artifact should differ from "true" ESP. Not sure that you can. Until someone comes up with a good test I will favour the simpler explanation that it is some kind of fluke or artifact.

Jeff Rouder said...

Sam, Thanks for the comments. I agree about the artifacts. My sense is that artifacts probably result in a uniform distribution of p, but only the smallest ones will not be detected outright as such. So I guess the artifact distribution is also quite small and near .5.