Friday, October 28, 2016

A Probability Riddle

Some flu strains can jump from people to birds, and, perhaps vice-versa.

Suppose \(A\) is the event that there is a flu outbreak in a certain community say in the next month, and let \(P(A)\) denote the probability of this event occurring.    Suppose \(B\) is the even that there is flu outbreak among chickens in the same community in the same time frame, with \(P(B)\) being the probability of this event as well.

Now let's focus in on the relative flu risk to humans from chickens.  Let's define this risk as
If the flu strain jumps from chickens to people, then the conditional probability, \(P(A|B)\) may well be higher than baserate, \(P(A)\), and the risk to people will be greater than 1.0.

Now, if you are one of those animal-lover types, you might worry about the relative flu risk to chickens from people.  It is:

At this point, you might have the intuition that there is no good reason to think \(R_h\) would be the same value as \(R_c\).  You might think that the relative risk is a function of say the virology and biology of chickens, people, and viruses.

And you would be wrong.  While it may be that chickens and people have different base rates and different conditions, it must be that \(R_h=R_c\).  It is a matter of math rather than biology or virology.

To see the math, let's start with the Law of Conditional Probability:
P(A|B) = \frac{P(B|A)P(A)}{P(B)}.

We can move \(P(A)\) from one side to the other, arriving at
\frac{P(A|B)}{P(A)} = \frac{P(B|A)}{P(B)} .

Now, note that the left-hand side is the risk to people and the right hand side is the risk to chickens.

I find the fact that these risk ratios are preserved to be a bit counterintuitive.  It is part of what makes conditional probability hard.

Sunday, April 10, 2016

This Summer's Challenge: Share Your Data

"It would take me weeks of going through my data and coordinating them, documenting them, and cleaning them if I were to share them." anonymous senior faculty member

"Subject 7 didn't show. There is an empty file. Normally the program would label the next person Subject 8 and we would just exclude Subject 7 in analysis. But now that we are automatically posting data, what should I do? Should I delete the empty file so the next person is Subject 7?" anonymous student in my lab

"Why? Data from a bad study is, by definition, no good." @PsychScienctists, in response to my statement that all data should be curated and available.

All three of the above quotes illustrate a common way of thinking about data. Our data reflect something about us. When we share them, we are sharing something deep and meaningful about ourselves. Our data may be viewed as statements about our competence, our organizational skills, our meticulousness, our creativity, and our lab culture. Even the student in my lab feels this pressure. This student is worried that our shared data won't be viewed as sufficiently systematic because we have no data for Subject 7. Maybe we want to present a better image.

The Data-Are-The-Data Mindset

I don't subscribe to the Judge-Me-By-My-Data mindset. Instead, I think of data as follows:
  • Scientific data are precious resources collected for the common good.
  • We should think in terms of stewardship rather than ownership. Be good stewards
  • Data are neither good nor bad, nor are they neat nor messy. They just are.
  • We should judge each other by the authenticity of our data

Mistake-Free Data Stewardship through Born-Open Data

To be good stewards and to insure authentic data, we upload everything, automatically, every night. Nobody has to remember anything, nobody makes decisions---it all just happens. Data are uploaded to GitHub where everyone can see them. In fact, I don't even use locally stored data for analysis; I point my analyses to the copy on GitHub. We upload data from well-though-out experiments. We upload data from poorly-thought-out-bust experiments. We upload pilot data. We upload incomplete data. If we collected it, it is uploaded. We have an accurate record of what happened in the lab, and you all are welcome to look in at our GitHub account. I call this approach born-open data, and have an in-press paper coming out about it. We have been doing born-open data for about a year.

So far, the main difference I have noticed is an increase in quality control with no energy or time spent to maintain this quality. Nothing ever gets messed up, and there is no after-the-fact reconstruction of what had happened. There is only one master copy of data---the one on GitHub. Analysis code points to the GitHub version. We never analyze the wrong or incomplete data. And it is trivially easy to share our analyses among lab members and others. In fact, we can build the analyses right into our papers with Knitr and Markdown. Computers are so much more meticulous than we will ever be. They never take a night off!

This Summer's Challenge: Automatic Data Curation

I'd like to propose a challenge: Set up your own automatic data curation system for new data that you collect. Work with your IT people. Set up the scripts. Hopefully, when next Fall rolls around, you too are practicing born-open data!

Tuesday, April 5, 2016

The Bayesian Guarantee And Optional Stopping.

Frequentist intuitions run so deep in us that we often mistakenly interpret Bayesian statistics in frequentist ones. Optional stopping has always been a case in point.  Bayesian quantities, when interpreted correctly, are not affected by optional stopping.  This fact is guaranteed by Bayes' Theorem.  Previously, I have shown how this guarantee works for Bayes factors.  Here, let's consider the simple case of estimating an effect size.

For demonstration purposes, let's generate data from a normal with unknown mean, \(\mu\), but known variance of 1.  I am going to use a whacky optional stopping rule that favors sample means near .5 over others.  Here is how it works:  I. As each observation comes in, compute the running sample mean. II. Compute a probability of stopping that is dependent on the sample mean according to the figure below.  The probability favors stopping for sample means near .5.  III. Flip a coin with sides labeled "STOP" and "GO ON" with the below probability.  IV. Do what the coin says (up to a maximum of 50 observations, then stop no matter).

The results of this rule is a bias toward sample means near .5.   I ran a simulation with a true mean of zero for ten thousand replicates (blue histogram below).  The key property is a biasing of the observed sample means higher than the true value of zero.   Bayesian estimation seems biased too.    The green histogram shows the posterior means when the prior on \(\mu\) is a normal with mean of zero and a standard deviation of .5.  The bias is less, but that just reflects the details of the situation where the true value, zero, is also favored by the prior.

So it might seem I have proved the opposite of my point---namely that optional stopping affects Bayesian estimation.

Nope.  The above case offers a frequentist interpretation, and that interpretation entered when we examined the behavior on a true value, the value zero.  Bayesians don't interpret analyses conditional on unknown "truths".

The Bayesian Guarantee 

Bayes' Theorem provides a guarantee.  If you start with your prior and observed data, then Bayes' Theorem guarantees that the posterior is the optimal set of probability statements about the parameter at hand.  It is a bit subtle to see this in simulation because one needs to condition on data rather than on some unknown truth.

Here is how a Bayesian uses simulation shows the Bayesian Guarantee.

I. On each replicate, sample a different true value from the prior.  In my case, I just draw from a normal centered at zero with standard deviation of .5 since that is my prior on effect size for this  post.  Then, on each replicate, simulate data from that truth value for that replicate.  I have chosen data of 25 observations (from a normal with variance of 1).  A histogram of the sample mean across these varying true values is provided below, left panel.   I ran the simulation for 100,000 replicates.

II. The histogram is that of data (sample means) we expect under our prior.  We need to condition on data, so let's condition on an observed sample mean of .3.  I have highlighted a small bin between .25 and .35 with red.  Observations fall in this bin about 6% of the time.

III.  Look at all the true values that generated those sample means in the bin with .3.  These true values are shown in the yellow histogram.  This histogram is the target of Bayes' Theorem, that is, we can use Bayes Theorem to describe this distribution without going through the simulations.   I have computed the posterior distribution for a sample mean of .3 and 25 observations under my prior, and plotted it as the line.  Notice the correspondence.  This correspondence is the simulation showing that Bayes Theorem works.  It works, by the way, for every bin though I have just shown it for the one centered on .3.

TAKE HOME 1: Bayes Theorem tells the distribution of true values given your prior and the data.

Is The Bayesian Guarantee Affected By Optional Stopping?

So, we come to the crux move.  Let's simulate the whacky optional stopping rule that favors sample means near .5.  Once again, we start with the prior, and for each replicate we choose a different truth value as a sample from the prior.  Then we simulate data using optional stopping, and the resulting sample means are shown in the histogram on the left.  Optional stopping has affected these data dramatically.  No matter, we choose our bin, again around .3, and plot the true values that led to these sample means.  These true values are shown as the yellow histogram on the right.  They are far more spread out than in the previous simulation without optional stopping primarily because stopping occurred often for less than 25 observations.  Now, is this spread predicted?  Yes.  On each replication we obtain a posterior distribution, and these vary from replication-to-replication because the sample size is random.  I averaged these posteriors (as I should), and the result is the line that corresponds well to the histogram.

TAKE HOME  II: Bayes Theorem tells you where the true values are given your prior and the data, and it doesn't matter how the data were sampled!

And this should be good news.


R code



runMean=function(y) cumsum(y)/(1:length(y))
minIndex=function(y) order(y)[1]




hist(ybar,col='lightblue',main="",xlab="Sample Mean",breaks=50,xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
mtext("Distortion of Sample Mean",side=3,adj=.5,line=1,cex=1.3)
hist(v*c,col='lightgreen',main="",xlab="Posterior Mean",xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
mtext("Distortion of Posterior Mean",side=3,adj=.5,line=1,cex=1.3)



good=(ybar >(mid-.05) & ybar<(mid+.05))
plot(bars,col=myCol,xlab="Sample Mean",main="")
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")


hist([good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',
for (i in 1:length(myES))
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")


good=(ybar >(mid-.05) & ybar<(mid+.05))
plot(bars,col=myCol,xlab="Sample Mean",main="",xlim=c(-4,3))
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")


hist([good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',main="",
for (i in 1:length(myES))
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")

#stop probability

plot(ybar,1-prob,typ='l',lwd=2,ylab="Stopping Probability",xlab="Sample Mean",ylim=c(0,.55))
mtext("Optional Stopping Depends on Sample Mean",side=3,adj=.5,line=-1,cex=1.3)

Monday, March 28, 2016

The Effect-Size Puzzler, The Answer

I wrote the Effect-Size Puzzler because it seemed to me that people have reduced the concept of effect size to a few formulas on a spreadsheet.  It is a useful concept that deserves a bit more thought.

In the example I had provided is the simplest case I can think of that is germane to experimental psychologists.  We ask 25 people to perform 50 trials in each of 2 conditions, and ask what is the effect size of the condition effect.  Think Stroop if you need a context.

The answer, by the way, is \(+\infty\).  I'll get to it.

The good news about effect sizes  

Effect sizes have revolutionized how we compare and understand experimental results.  Nobody knows whether a 3% change in error rate is big or small or comparable across experiments; everybody knows what an effect size of .3 means.  And our understanding is not associate or mnemonic, we can draw a picture like the one below and talk about overlap and difference.  It is this common meaning and portability that licenses a modern emphasis on estimation.  Sorry estimators, I think you are stuck with standardized effect sizes.

Below is a graph from Many Labs 3 that makes the point.  Here, the studies have vastly different designs and dependent measures.  Yet, they can all be characterized in unison with effect size.

The bad news about effect size

Even for the simplest experiment above, there is a lot of confusion.  Jake Westfall provides 5 different possibilities and claims that perhaps 4 of these 5 are reasonable at least under certain circumstances.  The following comments were provided on Twitter and Facebook: Daniel Lakens makes recommendations as to which one we shall consider the preferred effect size measure.  Tal Yarkoni and Uli Shimmack wonder about the appropriateness of effect size in within subject designs and prefer unstandarized effects (see Jan Vanhove's blog).  Rickard Carlson prefers effect sizes in physical units where possible, say in milliseconds in my Effect Size Puzzler.   Sanjay Srinivasta needs the goals and contexts first before weighing in.  If I got this wrong, please let me know.

From an experimental perspective, The Effect Size Puzzler is as simple as it gets.  Surely we can do better than to abandon the concept of standardized effect sizes or to be mired in arbitrary choices.

Modeling: the only way out

Psychologists often think of statistics as procedures, which, in my view, is the most direct path to statistical malpractice.  Instead, statistical reasoning follows from statistical models.  And if we had a few guidelines and a model, then standardized effect sizes are well defined and useful.  Showing off the power of model thinking rather than procedure thinking is why I came up with the puzzler.

Effect-size guidelines

#1:  Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population.

#2:  Measures of true effect and true amount of variability are only defined in statistical models.  They don't really exist accept within the context of a model.  The model is important.  It needs to be stated.

#3: The true effect size should not be tied to the number of participants nor the number of trials per participant.  True effect sizes characterize a state of nature independent of our design.

The Puzzler Model

I generated the data to be realistic.  They had the right amount of skew and offset, and the tails fell like real RTs do.   Here is a graph of the generating model for the fastest and slowest individuals:

All data had a lower shift of .3s (see green arrow), because we typically trim these out as being too fast for a choice RT task.  The scale was influenced by both an overall participant effect and a condition effect, and the influence was multiplicative.  So faster participants had smaller effects; slower participants had bigger effects.  This pattern too is typical of RT data.   The best way to describe these data is in terms of percent-scale change.  The effect was to change the scale by 10.5%, and this amount was held constant across all people.  And because it was held constant, that is, there was no variability in the effect,  the standardized effect size in this case is infinitely large.

Now, let's go explore the data.  I am going to skip over all the exploratory stuff that would lead me to the following transform, Y = log(RT-.3), and just apply it.  Here is a view of the transformed generating model:

So, lets put plain-old vanilla normal models on Y.  First, let's take care of replicates.
\[ Y_{ijk} \sim \mbox{Normal} (\mu_{ij},\sigma^2)\]
where \(i\)$ indexes individuals, \(j=1,2\) indexes conditions, and \(k\) indexes replicates.  Now, lets model \(\mu_{ij}\).  A general formulation is
\[\mu_{ij} = \alpha_i+x_j\beta_i,\]
where \(x_j\) is a dummy code of 0 for Condition 1 and 1 for Condition 2.  The term \(\beta_i\) is the ith individual's effect.  We can model it as
\[\beta_i \sim \mbox{Normal}(\beta_0,\delta^2)\]
where \(\beta_0\) is the mean effect across people and \(\delta^2\) is the variation of the effect across people.

With this model, the true effect size is \[d_t = \frac{\beta_0}{\delta}.\] Here, by true, I just mean that it is a parameter rather than a sample statistic.  And that's it, and there is not much more to say in my opinion.   In my simulations the true value of each individual's effect was .1.  So the mean, \( \beta_0\), is .1 and the standard deviation, \(\delta\), is, well, zero.  Consequently, the true standardized effect size is \(d_t=+\infty\).   I can't justify any other standardized measure that captures the above principles.


Could a good analyst have found this infinite value?  That is a fair question. The plot below shows individuals' effects, and I have ordered them from smallest to largest.  A key question is whether these are spread out more than expected from within-cell sample noise alone.  It these individual sample effects are more spread out, then there is evidence for true individual variation in \(\beta_i\).  If these stay as clustered as predicted by sample noise alone, then there is evidence that people's effects do not vary.  The solid line is the prediction within within-cell noise alone.   It is pretty darn good.  (The dashed line is the null that people have the same, zero-valued true effect).  I also computed a one-way random-effects F statistic to see if there is a common effect or many individual effects.  It was one effect F(24,2450) = 1.03.  Seems like one effect.

These one-effect results should be heeded.  It is a structural element that I would not want to miss in any data set.   We should hold plausible the idea that the standardized effect size is exceedingly high as the variation across people seems very small if not zero.

To estimate effect sizes, we need a hierarchical model.  You can use Mplus, AMOS, LME4, WinBugs, JAGS, or whatever you wish.  Because I am an old and don't learn new tricks easily, I will do what I always do and program these models from scratch.

I used the general model above in the Bayesian context.  The key specification is the prior on \( \delta^2\).   In the log-normal, the variance is a shape parameter, and it is somewhere around \(.4^2\).  Effects across people are usually about 1/5th of this say \(.08^2\).  To capture variances in this range, I would use a  \(\delta^2 \sim \mbox{Inverse Gamma(.1,.01)} \) prior for general estimation.  This is a flexible prior tuned for the 10 to 100 millisecond range for variation in effects across people.  The following plot shows the resulting estimates of individual effects as a function of the sample effect values.
The noteworthy feature is the lack of variation in model estimates of individual's effects!  This type of pattern where variation in model estimates are attenuated compared to sample statistics is called shrinkage, and it occurs because the hierarchical models don't chase within-cell sample noise.  Here the shrinkage is nearly complete, leading again to the conclusion that there is no real variation across people, or an infinitely large standardized effect size.  For the record, the estimated effect size here is 5.24, which, in effect size units, is getting quite large!

The final step for me is comparing this variable effect model to a model with no variation, say \( \beta_i = \beta_0 \) for all people.  I would do this comparison with Bayes factor.  But, I am out of energy and you are out of patience, so we will save it for another post.

Back To Jake Westfall

Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure.  He does this to stay true to Cohen's formulae.  I think it is a conceptual mistake.

I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people.  In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design.   Within-subject designs are superior, so why go backwards and mash the sources of variances together when you don't have to?  This advise strikes me as crazy.  To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d.  Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures.  Just don't forget the hierarchical shrinkage when you use it!

Thursday, March 24, 2016

The Effect-Size Puzzler

Effect sizes are bantered around as useful summaries of the data.  Most people think they are straightforward and obvious.  So if you think so, perhaps you won't mind a bit of a challenge?  Let's call it "The Effect-Size Puzzler," in homage to NPR's CarTalk.  I'll buy the first US winner a nice Mizzou sweatshirt (see here).  Standardized effect size please.

I have created a data set with 25 people each observing 50 trials in 2 conditions.  It's from a priming experiment.  It looks about like real data.  Here is the download.

The three columns are:

  • id (participant: 1...25)
  • cond (condition: 1,2)
  • rt (response time in seconds).  

There are a total of 2500 rows.

I think it will take you just a few moments to load it and tabulate your effect size for the condition effect.  Have fun.  Write your answer in a comment or write me an email.

I'll provide the correct answer in a blog next week.

HINT: If you wish to get rid of the skew and stabilize the variances, try the transform y=log(rt-.3)

Monday, March 21, 2016

Roll Your Own II: Bayes Factors With Null Intervals

The Bayes factors we develop compare the null model to an alternative model.  This null model is almost always a single point---the true effect is identically zero.    People sometimes confuse our advocacy for Bayes factor with that for point-null-hypothesis testing.  They even critique Bayes factor with the Cohenesque claim that the point null is never true.

Bayes factor is a general way of measuring the strength of evidence from data for competing models.  It is not tied to the point null.

We develop for the point null because we think it is a useful, plausible, theoretically meaningful model.  Others might disagree, and these disagreements are welcome as part of the exchange of viewpoints in science.

In the blog post Roll Your Own: How to Compute Bayes Factors for Your Priors, I provided R code to compute a Bayes factor between a point-null and a user-specified alternative for a simple setup motivated by the one-sample t-test.  I was heartened by the reception and I hope a few of you are using the code (or the comparable code provided by Richard Morey).  There have been some requests to generalize the code for non-point nulls.  Here, let's explore the Bayes factor for any two models in a simple setup.  As it turns out, the generalization is instructive and computationally trivial.   We have all we need from the previous posts.

Using Interval Nulls: An Example

Consider the following two possibilities:

I. Perhaps you feel the point null is too constrained and would rather adopt a null model with mass on a small region around zero rather than at the point.  John Kruschke calls these regions ROPEs (regions of posterior equivalence).

II. Perhaps you are more interested in the direciton in an effect rather than whether it is zero or not.  In this case, you might consider testing two one-sided models against each other.

For this blog, I am going to retain four different priors. Let’s start with a data model. Data are independent normal draws with mean \(\mu\) and variance \(\sigma^2\). It is more convenient re-express the normal as a function of effect size, \(\delta\) and \(\sigma^2\) where \(\delta=\mu/\sigma)\). Here is the formal specification:
\[ Y_i \mid \delta,\sigma^2 \stackrel{iid}{\sim} \mbox{Normal}(\sigma\delta,\sigma^2).\]
Now, the substantive positions as prior models on effect size:
  1. \(M_0\), A Point Null Model: \(\delta=0\)
  2. \(M_1\), A ROPE Model: \(\delta \sim \mbox{Unif}(-.25,.25)\)
  3. \(M_2\), A Positive Model: \(\delta \sim \mbox{Gamma(3,2.5)}\)
  4. \(M_3\), A Negative Model: \(-\delta \sim \mbox{Gamma(3,2.5)}\)
Here are these four models expressed graphically as distributions:

I picked these four models, but you can pick as many ones as you wish. For example, you can include a normal if you wish.

Oh, let's look at some data.  Suppose the observed effect size is .35 for an N of 60.

Going Transitive

Bayes factors are the comparison between two models.  Hence we would like to compute the Bayes factors between any of these models.  Let \(B_{ij}\) be the comparison between the ith and jth model.  We want a Table like this:

\(B_{00}\) \(B_{01}\) \(B_{02}\) \(B_{03}\)
\(B_{10}\) \(B_{11}\) \(B_{12}\) \(B_{13}\)
\(B_{20}\) \(B_{21}\) \(B_{22}\) \(B_{23}\)
\(B_{30}\) \(B_{31}\) \(B_{32}\) \(B_{33}\)

Off the bat, we know the Bayes factor between a model and itself is 1 and that \(B_{ij} = 1/B_{ji}\). So we only need to worry about the lower corner.

\(B_{10}\) 1
\(B_{20}\) \(B_{21}\) 1
\(B_{30}\) \(B_{31}\) \(B_{32}\) 1

We can use the code below, from the previous post to figure out the null vs. all the other models.

\[ B_{10} = 4.9, \quad B_{20} = 4.2, \quad B_{30} = .0009 \]

Here we see that the point null is not as attractive or the ROPE null or the positive model. It is more attractive, however, than the negative model.

Suppose, however, that you are most interested in the ROPE null and its comparison to the positive and negative model. The missing Bayes factors are \(B_{12}\), \(B_{13}\), and \(B_{23}\).

The key application of transitivity is as follows:

\[ B_{ij} = B_{ik} \times B_{kj}. \]

So, we can compute \(B_{12}\) as follows: \(B_{12} = B_{10} \times B_{02} = B_{10}/B_{20} = 4.9/4.2 = 1.2\).

The other two Bayes factors are computed likewise: \(B_{13} = 5444 \) and \(B_{23} = 4667\)

So what have we learned. Clearly, if you were pressed to choose a direction, it is in the positive direction. That said, the evidence for a positive effect is slight when compared to a ROPE null.

Snippets of R Code

#First, Define Your Models as a List
#lo, lower bound of support
#hi, upper bound of support
#fun, density function

#here are Models M1, M2, M3 
#add or change here for your models 

mod1=list(lo=-.25,hi=.25,fun=function(x,lo,hi) dunif(x,lo,hi))
mod2=list(lo=0,hi=Inf,fun=function(x,lo,hi) dgamma(x,shape=3,rate=2.5))
mod3=list(lo=-Inf,hi=0,fun=function(x,lo,hi) dgamma(-x,shape=3,rate=2.5))

#note, we dont need to specify the point null, it is built into the code

#Lets make sure the densities are proper, here is a function to do so:

normalize=function(mod) return(c(mod,K=1/integrate(mod$fun,lower=mod$lo,upper=mod$hi,lo=mod$lo,hi=mod$hi)$value))

#and now we normalize the three models

#Observed Data

#Here is the key function that computes the Bayes factor between a model and the point null
f= function(delta) mod$fun(delta,mod$lo,mod$hi)*mod$K


print(paste("B10=",B10,"   B20=",B20,"   B30=",B30))


print(paste("B12=",B12,"   B13=",B13,"   B23=",B23))

Tuesday, March 15, 2016

Statistical Difficulties from the Outer Limits

You would think that the more data we collect, the closer we should be to the truth.  

This blog post falls into the "I may be wrong" category.  I hope many of you comment.

ESP: God's Gift To Bayesians?

It seems like ESP is God's gift to Bayesians.  We use it like a club to reinforce the plausibility of null hypotheses and to point out the difficulties of frequentist analysis.

In the 1980s, a group of Princeton University engineers set out to test ESP by asking people to use their minds to change the outcome of a random noise generator (check out their website).   Over the course of a decade, these engineers collected an astounding 104,490,000 trials.  On each trial, the random noise generator flipped a gate with known probability of exactly .5.  The question was whether a human operator using only the power of his or her mind could increase this rate.  Indeed, they found 52,263,471 gate flips, or 0.5001768 of all trials.  This proportion, though only slightly larger than .5, is nonetheless significantly larger with a damn low p-value of .0003.   The figure below shows the distribution of successes under the null, and the observation is far to the right.  The green interval is the 99% CI, and it does not include the null.

Let's assume these folks have a decent set up and the true probability should be .5 without human ESP intervention.  Did they show ESP?

What do you think?  There data are numerous, but do you feel closer to the truth?  Impressed by the low p-value?  Bothered by the wafer-thin effect?  Form an opinion; leave a comment.  

Bayesians love this example because we can't really fathom what a poor frequentist would do?  The p-value is certainly lower than .05, even lower than .01, and even lower than .001.  So, it seems like a frequentist would need to buy in.  The only way out is to lower the Type I error rate in response to the large sample size.  But to what value and why?

ESP: The Trojan Horse?

ESP might seem like God's gift to Bayesians, but maybe it is a Trojan Horse.  A Bayes factor model comparison analysis goes as following.  The no-ESP null model is
\[ M_0: Y  \sim \mbox{Binomial}(.5,N) \]

The ESP alternative is
\[ M_1: Y|\theta  \sim \mbox{Binomial}(\theta,N) \]
A prior on \(\theta\) is needed to complete the specification.  For the moment, let's use a flat one, \(\theta \sim \mbox{Unif}(0,1) \).

It is pretty easy to calculate a Bayes factor here, and the answer is 12-to-1 in favor of the null.   What a relief.

ESP proponents might rightly criticize this prior as too dispersed.  We may reasonably assume that \( \theta \) should not be less than .5 as we can assume the human operators are following the direction to increase rather than decrease the proportion of gate flips.   Also, the original investigators might argue that it is unreasonable to expect anything more than a .1% effect, so the top might be .501.  In fact, they might argue they ran such a large experiment because they expected a prior such a small effect.  The prior is   \(\theta \sim \mbox{Unif}(.5,.501) \), then the Bayes factor is 84-to-1 for an ESP effect.

The situation seems tenuous.  The below figure shows the Bayes factors for both priors as a function of the number of trials.  To draw these curves, I simply kept the proportion of success constant at 0.5001768.  The line is for the observed number of trials.  With this proportion, the Bayes factor not only depend on the prior, but they also depend in unintuitive ways on sample size.  For instance, if we doubled the number of trials and successes, the Bayes factors become 40-to-1 and 40,000-to-1, respectively, for the flat prior and the very small interval one.

Oh, I can see the anti-Bayes crowd getting ready to chime in should they read this.   Sanjay Srivastava may take the high road and discuss the apparent lack of practicality of the Bayes factor.  Uri Simonsohn may boldly declare that Bayesians can't find truth.   And perhaps Uli Shimmack will create a new index, the M-Index, where M stands for moron.  Based on his analysis of my advocacy, he may declare I have the second highest known M-Index, perhaps surpassed only by E.-J. Wagenmakers.

Seems like ESP was a bit of a Trojan Horse.  It looked all good, and then turned on us.

But What Happened?

Bayes' rule is ok of course.  The problem is us.  We tend to ask too much of statistics.   Before I get to my main points, I need to address one issue,  What is the model?  Many will call the data model, the binomial specification in this case, "the model."  The other part, the priors on parameters, is not part of "the model", it is the prior.  Yet, it is better to think of "the model" as the combination of the binomial and prior specification.  It's all one model, and this one model provides a priori predictive distribution about where the data should fall (see my last blog post).  The binomial is a conditional specification, and the prior completes the model.

With this in mind that the above figure strikes me as quite reasonable.  Consider the red line, the one that compares the null to the model where the underlying probability ranges across the full interval.  Take the point for 10,000 trials.   The number of successes is 5,002 which is almost 1/2 of all trials.  Not surprisingly,  this value is evidence for the null compared to this diffuse alternative.  But the same value is not evidence for the null compared to the more constrained alternative model where \(.5<\theta<.501\).  Both the null and this alternative are about the same for 10,000 trials, and each predict 5,002 successes out of 10,000 trials equally well.  Hence,  the Bayes factor is equivocal.  This alternative and the null are so similar that it takes way more data to discriminate among them.   As we gain more and more data, say 100,000,000 trials, the  slight discrepancy from 1/2 can be resolved, and the Bayes factors start to favor the alternative models.  As the sample size is increased further, the discrepancy becomes more pronounced.  Everything in that figure makes beautiful sense to me--- it all is as it should be.  Bayes rule is ok.

Having more and more data doesn't get us closer to the truth.  It does, however, is give us greater resolution to more finely discriminate among models.

Loose Ends  

The question, "is there an effect" strikes me as ill formed.   Yet, we answer the question affirmatively daily.  Sometimes, effects are obvious, and they hit you between the eyes.  How can that be if the question is not well formed?

I think when there are large effects, just about any diffuse alternative model will do.  As long as the alternative is diffuse, data with large effects easily discriminate this diffuse alternative from the null.  It is in this sense that effects are obviously large.

What this example shows that if one tries to resolve small effects with large sample sizes, there is intellectual tension.  Models matter.  Models are all that matter.  Large data gives you greater resolution to discriminate among similar models.  And perhaps little else.

The Irony Is...

This ESP example is ironic.  The data are so numerous that they are capable of finely discriminating among just about any set of models we wish, even the difference between a point null and a uniform null subtending .001 in width on the probability scale.  The irony is that we have no bona-fide competing models to discriminate.  ESP by definition seemingly precludes any scientific explanation, and without such explanation, all alternatives to the null are a bit contrived.  So while we can discriminate among models, there really is only one plausible one, the null, and no need for discrimination at all.

If forced to do inference here (which means someone buys me a beer),  I would choose the full-range uniform as the alternative model and state the 12-to-1 ratio for the null.  ESP is such a strange proposition that why would values of \( \theta \) near .5 be any more a priori plausible than those away from it?

Saturday, February 27, 2016

Bayesian Analysis: Smear and Compare

This blog post is co-written with Julia Haaf (@JuliaHaaf).

Suppose Theory A makes a prediction about main effects and Theory B makes a prediction about interactions.  Can we compare the theories with data?  This question was posed on the Facebook Psychological Methods Group by Rickard Carlsson.

Uli Schimmak (@R_Index) put the discussion on general terms with this setup:
Should patients take Drug X to increase their life expectancy?
Theory A = All patients benefit equally (unlikely to be true).
Theory B1 = Women benefit, but men's LE is not affected.
Theory B2 = Women benefit, but men's LE decreases (cross-over).
Theory B3 = Women benefit, and men benefit too, but less.

We are going to assume that each of these statements represents a theoretically interesting position of constraint. The goal is to use data to state the relative evidence for or against these positions.  This question is pretty hard from a frequentist perspective as there are difficult order constraints to be considered.  Fortunately, it is relatively simple from a Bayesian model-comparison perspective.

Model Specifications

The first and perhaps most important steps is representing these verbal positions as competing statistical models, or performing model specification.  Model specification is a bit of an art, and here is our approach:

Let \(Y_{ijk}\) be life expectancy where \(i\) denotes gender (\(i=w\) for women; \(i=m\) for men), where \(j\) denotes drug status (\(j=0\) for placebo; \(j=1\) for treatment), and where \(k\) denote the replicate as there are several people per cell. We can start with the standard setup:
Y_{ijk} \sim \mbox{Normal}(\mu_{ij},\sigma^2).

The next step is building in meaningful constraints on true cell means \(\mu_{ij}\).  The standard approach is to think in terms of the grand mean, main effects, and interactions.  We think in this case and for these positions, the standard approach is not as suitable as the following two-cornerstone approach:

\(\mu_{w0} = \alpha\)
\(\mu_{w1} = \alpha + \beta\)
\(\mu_{m0} = \gamma \)
\(\mu_{m1} = \gamma+\delta \)

With this parameterization, all the models can be expressed as various constraints on the relationship between \(\beta\), the effect for women, and \(\delta\), the effect for men.

Model A: The constraint in Theory A is instantiated by setting \(\beta=\delta\). We place a half normal on this single parameter, the equal effect of the drug on men and women. See the Figure below.

Model B1: The constraint in Theory B1 instantiated by setting \(\beta>0\) and \(\delta=0\).  A half normal on \(\beta\) will do.

Model B2: The constraint in Theory B2 is instantiated by setting \(\beta>0\) and \(\delta<0\).  We used independent half normals here.

Model B3: The constrain in Theory B3 is that \(0<\delta<\beta\). This model is also shown, and it is similar to the preceding one; the difference is in the form of the constraints.

Of course, there are other models which might be useful including the null, models with no commitment to benefit for women, or models that do not assume women benefit more than men. Adding them presents no additional difficulties in the Bayesian approach.

Analysis: Smear & Compare

If we are willing to make fine specifications, as above, then it is straight forward to derive predictions for data of a set sample size. These predictions are shown as a function of the sample effect for men and women, that is, the change in lifespan between treatment and placebo for each gender.   These effects are denoted as as \(\hat{\beta}\) and \(\hat{\delta}\), respectively.  Here are the predictions:

Notice how these predictions are smeared versions of the model.  That is what sample noise does!

With these predictions, we are ready to observe data. Suppose we observe that the treatment extends women's lives by 2 years and men's lives by one year.  We have now included this observed value as a red dot in the below figures.

As can be seen, the observation is best predicted by Model B3.

The Bayes factor is the relative comparisons of these predictions.  We can ask how much B3 beats the other models.  Here it is:

B3 beats A by 3.7-to-1
B3 beats B1 by 18.2-to-1
B3 beats B2 by 212-to-1

Saturday, February 6, 2016

What It Would Take To Believe in ESP?

"Bem (2011) is still not retracted.  Not enough public critique?"  -- R-Index, Tweet on February 5th, 2016.
Bem's 2011 paper remains controversial because of the main claim of ESP.  Many researchers probably agree with me that the odds that ESP is true is quite small.  My subjective belief is that it is about three times as unlikely as winning the PowerBall jackpot.  Yet, Bem's paper is well written and well argued.  In many ways it is a model of how psychology papers should be written.  And so we have a problem---either there is ESP or the everyday way we produce and communicate knowledge is grievously flawed.   One benefit of Bem (2011) is that it forces us to reevaluate our production of knowledge perhaps more forcefully than any direct argument could.  How could the ordinary applications of our methods lead to the ESP conclusion?

There Is Evidence for an ESP Effect

The call to retract Bem is unfortunate.   There is no evidence of any specific fraud nor any element of substantial incompetence.  That does not mean the paper is free from critique---there is much to criticize as I will briefly mention subsequently (see also Tal Yarkoni's blog).  Yet, even when the critiques are taken into account, there is evidence from the reported data of an ESP effect.  Morey and I found a Bayes factor of about 40-to-1 in favor of an ESP effect.

In getting this value, we noted a number of issues as follows:  We feel Experiments 5, 6, and 7 were too opportunistic.  There was no clear prediction for the direction of the effect---either retroactive mere exposure where future repeats increase the feeling of liking, or retroactive habituation where future repeats decrease the feeling of liking.  Both of these explanations were used post-hoc to explain different ESP trends, and we argue this usage is suspect and discarded these results.  We also worried about the treatment of non-erotic stimuli.  In Experiments 2-4, emotional non-erotic stimuli elicited ESP; in Experiments 8-9 neutral stimuli elicited ESP.  In Experiment 1, however, these non-erotic stimuli did not elicit ESP, in fact only the erotic ones did.  So, we feel Experiment 1 is a failure of ESP for these non-erotic stimuli and treated it as such in our analysis.  Even with  these corrections, there was 40-to-1 evidence for an ESP effect.

In fact, the same basic story holds for telepathy.  Storm et al. meta-analytically  reviewed 67 studies and found a z of about 6, indicating overwhelming evidence for this ESP effect.  We went in, examined a bunch of these studies and trimmed out several that did not meet the criterion.  Even so, the Bayes factor was as much as 330-to-1 in favor of an ESP effect!  (see Rouder et al,, 2013)

Do I Believe In ESP

No.  I believe that there is some evidence in the data for something, but the odds that it is ESP is too remote.  Perhaps there are a lot of missing negative studies.

Toward Believing In ESP: The Movie Theatre Experiment

So what would it take to believe in ESP?  I think  Feynman once noted that a physicist would not be satisfied with such small effects.  She would build a better detector or design a better experiment.  (I can't find the Feyman cite, please help).  So here is what would convince me:

I'd like to get 500 people in a movie theatre and see if they could feel the same future.  Each would have an iPad, and before hand, each would have provided his or her preferences for erotica.  A trial would start with a prediction---each person would have to predict whether an ensuing coin flip will land heads or tails.  From this, we tally the predictions to get a group point prediction.  If more people predict a head than a tail, the group prediction is heads; if more people predict a tails, the group prediction is tails.  Now we flip the coin.  If the group got it right, then everyone is rewarded with the erotica of their choice.  If the group got it wrong, then everyone is shown  a gross IAPS photo of decapitated puppies and the like.   We can run some 100 trials.  I bet people would have fun.

Here is the frequentist analysis:  Let's suppose under the ESP alternative that people feel the future with a rate of .51 compared to the .50 baseline.  So, how often is the group prediction from 500 people correct? The answer is .66. Telling whether performance is .66 or .50 is not too hard.  If we run 100 total trials, we can divide up at 58: 58 or less group-correct trials is evidence for the null; 59 or more group-correct trials is evidence for ESP.  The odds of getting over 58 group-correct trials under the null is .044.  The odds of getting under 59 group-correct trials under the ESP alternative is .058    The group prediction about a shared future is a better detector than the usual way.

Of course, I would perform a Bayesian analysis of the data.  I would put a distribution on the per person ESP effect, allowsing some people to not feel the future at all.  Then I would generalize this to a distribution for the group, derive predictions for this mode and the null, and do the usual Bayes factor comparison.  I am not sure this experiment would fully convince me, but it would change my skeptical beliefs by a few orders of magnitude.  Do it twice and I might even be a believer!

Now, how to get funding to run the experiment?  Mythbusters?

Closing Thoughts: Retractions and Shaming

The claim that Bem (2011) should be retracted perhaps comes from the observations that getting 9 or 9 significant effects with such a small effect size and with the reported sample sizes is pretty rare.  I am not a fan of this type of argument for retraction.  I would much rather the critique be made, and we move on.  Bem's paper has done the field much good.  Either Bem has found the most important scientific finding in the last 100 years or has taught us much about how we do research.  Either way, it is a win-win.  I welcome his meta-analysis on the same grounds. 

Sunday, January 24, 2016

Roll Your Own: How to Compute Bayes Factors For Your Priors

Sometimes people ask what prior they should use in computing Bayes factor. It’s a great question, and trying to answer it leads to a deeper understanding of Bayesian methods.  Here I provide some R code so that you can visualize and compute Bayes factors for (almost) any prior you wish. At the end of this blog I provide all the R code in one block. You might even want to go down there and cut-and-paste it into your R editor now.

The Bayes factor compares two models. Let’s take one of them to be a point null. The other, the alternative, is up to you.


Let’s first start with a data model. Data are independent normal draws with mean μ and variance σ2. It is more convenient re-express the normal as a function of effect size, δ and σ2 where δ = μ/σ). Here is the formal specification:
Y_i \mid \delta,\sigma^2 \stackrel{iid}{\sim} \mbox{Normal}(\sigma\delta,\sigma^2).

A null model is implemented by setting δ = 0. The alternative model on effect size is up to you. Here is how you do it. Below is the specification for the R function altDens, the density of the alternative. I have chosen a normal with mean of .5 and a standard deviation of .3. I have also truncated this distribution at zero, and negative values are not allowed. You can see that in the specification of the function altDens. Set lo and hi, the support of the alternative. Here I set them to 0 and infinity, respectively.
#Specify Alternative (up to constant of propertionality)
lo=0 #lower bound of support
hi=Inf #upper bound of support
You may notice that this alternative is not quite a proper density because it does not integrate to 1 That’s ok, the following function scales the alternative so the density does indeed integrate to 1.0.
#Normalize alternative density in case user does not, 
f=function(delta) K*altDens(delta)

Visualizing the Models

Let’s now take a look at the competing models on δ, the effect size. Here is the code and the graphs:
plot(delta,f(delta),typ='n',xlab="Effect Size Parameter Delta",ylab="Density",ylim=c(0,1.4*maxAlt),main="Models")

Bayes Factor Computation

Bayes factor is based on prediction. It is the ratio of the predicted density of the data under one model relative to that under another. The predicted density of data for the null is easy to compute and is related to the central t distribution. Here is the code.
nullPredF=function(obs,N) dt(sqrt(N)*obs,N-1)
We can compute this predicted density for any observed effect size or for all of them. The following code does it for a reasonable range of effect sizes for a sample size of 30. You can change N, the sample size, as needed.
Getting the predictive density for the alternative is a bit harder. For each nonzero effect size parameter, the distribution of the observed effect follows a noncentral t distribution. Hence, to obtain predictions across all nonzero effect size parameters, we need to integrate the alternative model against the noncentral t distribution. Here is the code with a simple loop:


for (i in 1:I) altPred[i]=altPredF(obs[i],N)
Now we can plot the predictions for all observed effect sizes:
plot(type='l',obs,nullPred,ylim=c(0,top),xlab="Observed Effect Size",ylab="Density",main="Predictions",col='darkblue',lwd=2)

Let’s Run an Experiment

Suppose we just ran an experiment and observed a sample effect size of .4. Let’s look at the predictions for this value.
plot(type='l',obs,nullPred,ylim=c(0,top),xlab="Observed Effect Size",ylab="Density",main="Predictions",col='darkblue',lwd=2)

#cat("Predictive Density under the null is ",valNull)
#cat("Predictive Density under the specified alternative is ",valAlt)
#cat("Bayes factor (alt/null) is ",valAlt/valNull)
How well did the models do? Well, this value was clearly better predicted under the specified alternative than under the null. The density under the null is .04; the density under the alternative .205, and the ratio between them is 5.16-to-1. This ratio is the Bayes factor. In this case the Bayes factor indicate support of slightly more than 5-to-1 for the specified alternative.

Other Alternatives

The alternative is specified with just a bit of R code that can be changed. Suppose, for example, one wished a flat or uniform alternative from -.2 to 1.2. Here is the bit of code
#Specify Alternative (up to constant of propertionality)
lo=-.2 #lower bound of support
hi=1.2 #upper bound of support
Run this first, and then run the rest to see what happens.
Here is another specification; this alternative is a gamma
#Specify Alternative (up to constant of propertionality)
lo=0 #lower bound of support
hi=Inf #upper bound of support

The R Code in One Block


#Specify Alternative (up to constant of propertionality)
#Change this section to change alternative
lo=0 #lower bound of support
hi=Inf #upper bound of support

#Normalize alternative density in case user does not, 
f=function(delta) K*altDens(delta)


#Plot Alternative as a density and Null as a point arrow
plot(delta,f(delta),typ='n',xlab="Effect Size Parameter Delta",ylab="Density",ylim=c(0,1.4*maxAlt),main="Models")

#Prediction Function Under Null
nullPredF=function(obs,N) dt(sqrt(N)*obs,N-1)

#Prediction Function Under the Alternative

for (i in 1:I) altPred[i]=altPredF(obs[i],N)

#Plot The Predictions
plot(type='l',obs,nullPred,ylim=c(0,top),xlab="Observed Effect Size",ylab="Density",main="Predictions",col='darkblue',lwd=2)

#Evaluate Predicted Density at Observed Value
cat("Bayes factor (alt/null) is ",valAlt/valNull)