Sunday, April 10, 2016

This Summer's Challenge: Share Your Data

"It would take me weeks of going through my data and coordinating them, documenting them, and cleaning them if I were to share them." anonymous senior faculty member

"Subject 7 didn't show. There is an empty file. Normally the program would label the next person Subject 8 and we would just exclude Subject 7 in analysis. But now that we are automatically posting data, what should I do? Should I delete the empty file so the next person is Subject 7?" anonymous student in my lab

"Why? Data from a bad study is, by definition, no good." @PsychScienctists, in response to my statement that all data should be curated and available.

All three of the above quotes illustrate a common way of thinking about data. Our data reflect something about us. When we share them, we are sharing something deep and meaningful about ourselves. Our data may be viewed as statements about our competence, our organizational skills, our meticulousness, our creativity, and our lab culture. Even the student in my lab feels this pressure. This student is worried that our shared data won't be viewed as sufficiently systematic because we have no data for Subject 7. Maybe we want to present a better image.

The Data-Are-The-Data Mindset

I don't subscribe to the Judge-Me-By-My-Data mindset. Instead, I think of data as follows:
  • Scientific data are precious resources collected for the common good.
  • We should think in terms of stewardship rather than ownership; be good stewards.
  • Data are neither good nor bad, neither neat nor messy. They just are.
  • We should judge each other by the authenticity of our data.

Mistake-Free Data Stewardship through Born-Open Data

To be good stewards and to ensure authentic data, we upload everything, automatically, every night. Nobody has to remember anything, nobody makes decisions---it all just happens. Data are uploaded to GitHub where everyone can see them. In fact, I don't even use locally stored data for analysis; I point my analyses to the copy on GitHub. We upload data from well-thought-out experiments. We upload data from poorly-thought-out, busted experiments. We upload pilot data. We upload incomplete data. If we collected it, it is uploaded. We have an accurate record of what happened in the lab, and you are all welcome to look in at our GitHub account. I call this approach born-open data, and I have an in-press paper about it. We have been doing born-open data for about a year.
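If you are curious what the nightly automation amounts to, here is a minimal sketch of such a script, run nightly with cron or any task scheduler. The paths and repository names are made up for illustration; this is a sketch of the idea, not our actual scripts.

# nightly-upload.R -- hypothetical nightly script; paths and repository are made up
setwd("~/lab/data-repo")                      # local clone of the data repository
system("git add data/")                       # stage any new or changed data files
msg=paste("nightly data upload,", Sys.Date())
system(paste0('git commit -m "', msg, '"'))   # commit (does nothing if no new data)
system("git push origin master")              # publish to GitHub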

So far, the main difference I have noticed is an increase in quality control with no extra energy or time spent maintaining it. Nothing ever gets messed up, and there is no after-the-fact reconstruction of what happened. There is only one master copy of the data---the one on GitHub. Analysis code points to the GitHub version. We never analyze the wrong or incomplete data. And it is trivially easy to share our analyses among lab members and others. In fact, we can build the analyses right into our papers with knitr and Markdown. Computers are so much more meticulous than we will ever be. They never take a night off!
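Pointing an analysis at the GitHub copy is a one-liner in R; the repository path below is hypothetical:

# read the canonical copy of a data file straight from GitHub (hypothetical path)
dat=read.csv("https://raw.githubusercontent.com/myLab/myExperiment/master/data/sub01.csv")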

This Summer's Challenge: Automatic Data Curation

I'd like to propose a challenge: Set up your own automatic data curation system for new data that you collect. Work with your IT people. Set up the scripts. Hopefully, when next fall rolls around, you too will be practicing born-open data!



Tuesday, April 5, 2016

The Bayesian Guarantee And Optional Stopping.

Frequentist intuitions run so deep in us that we often mistakenly interpret Bayesian statistics in frequentist terms. Optional stopping has always been a case in point.  Bayesian quantities, when interpreted correctly, are not affected by optional stopping.  This fact is guaranteed by Bayes' Theorem.  Previously, I have shown how this guarantee works for Bayes factors.  Here, let's consider the simple case of estimating an effect size.

For demonstration purposes, let's generate data from a normal with unknown mean, \(\mu\), but known variance of 1.  I am going to use a whacky optional stopping rule that favors sample means near .5 over others.  Here is how it works:  I. As each observation comes in, compute the running sample mean.  II. Compute a probability of stopping that depends on the sample mean according to the figure below; the probability favors stopping for sample means near .5.  III. Flip a coin with sides labeled "STOP" and "GO ON" using this stopping probability.  IV. Do what the coin says (up to a maximum of 50 observations, after which we stop regardless).
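For concreteness, here is a loop-based sketch of the rule for a single stream of observations.  It mirrors the logistic stopping curve used in the code at the end of this post, but it is an illustration rather than the simulation code itself; the function names are mine.

# one run of the optional-stopping rule (illustrative sketch, not the simulation code below)
stopProb=function(ybar) 1-plogis((ybar-.6)^2,0,.2)   # stopping probability as a function of the running mean
oneRun=function(t.mu,topN=50){
  y=c()
  for (n in 1:topN){
    y=c(y,rnorm(1,t.mu,1))                           # I. collect the next observation
    ybar=mean(y)                                     #    and update the running sample mean
    if (rbinom(1,1,stopProb(ybar))==1) break         # II.-IV. flip the STOP/GO-ON coin and obey it
  }
  return(c(ybar=ybar,N=length(y)))
}
oneRun(0)                                            # example: one run with a true mean of zero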



The result of this rule is a bias toward sample means near .5.  I ran a simulation with a true mean of zero for ten thousand replicates (blue histogram below).  The key property is that the observed sample means are biased upward relative to the true value of zero.  Bayesian estimation seems biased too.  The green histogram shows the posterior means when the prior on \(\mu\) is a normal with mean of zero and a standard deviation of .5.  The bias is smaller, but that just reflects the particulars of this situation, in which the true value, zero, is also favored by the prior.



So it might seem I have proved the opposite of my point---namely that optional stopping affects Bayesian estimation.

Nope.  The above case offers a frequentist interpretation, and that interpretation entered when we examined behavior conditional on a particular true value, here zero.  Bayesians don't interpret analyses conditional on unknown "truths".

The Bayesian Guarantee 

Bayes' Theorem provides a guarantee.  If you start with your prior and observed data, then Bayes' Theorem guarantees that the posterior is the optimal set of probability statements about the parameter at hand.  It is a bit subtle to see this in simulation because one needs to condition on data rather than on some unknown truth.
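Concretely, for the model in this post (known sampling variance of 1 and prior \(\mu \sim \mbox{Normal}(m_0, v_0)\) with \(m_0 = 0\) and \(v_0 = .5^2\)), the posterior after \(N\) observations with sample mean \(\bar{y}\) is the usual conjugate-normal result, written here as mean and variance:

\[
\mu \mid \bar{y}, N \sim \mbox{Normal}\!\left(\frac{N\bar{y} + m_0/v_0}{N + 1/v_0},\; \frac{1}{N + 1/v_0}\right).
\]

These are exactly the quantities in the R code at the end of this post: the posterior variance is v=1/(N+1/v0) and the posterior mean is c*v with c=N*ybar+m0/v0.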

Here is how a Bayesian uses simulation to show the Bayesian Guarantee.

I. On each replicate, sample a different true value from the prior.  In my case, I just draw from a normal centered at zero with standard deviation of .5, since that is my prior on effect size for this post.  Then, on each replicate, simulate data from that replicate's true value.  I have chosen samples of 25 observations (from a normal with variance of 1).  A histogram of the sample means across these varying true values is provided below, left panel.  I ran the simulation for 100,000 replicates.

II. The histogram is that of data (sample means) we expect under our prior.  We need to condition on data, so let's condition on an observed sample mean of .3.  I have highlighted a small bin between .25 and .35 with red.  Observations fall in this bin about 6% of the time.

III.  Look at all the true values that generated sample means in the bin around .3.  These true values are shown in the yellow histogram.  This histogram is the target of Bayes' Theorem; that is, we can use the theorem to describe this distribution without going through the simulations.  I have computed the posterior distribution for a sample mean of .3 and 25 observations under my prior, and plotted it as the line.  Notice the correspondence; it is the simulation's demonstration that Bayes' Theorem works.  It works, by the way, for every bin, though I have shown it only for the one centered on .3.

TAKE HOME I: Bayes' Theorem tells you the distribution of true values given your prior and the data.





Is The Bayesian Guarantee Affected By Optional Stopping?

So, we come to the crux move.  Let's simulate the whacky optional stopping rule that favors sample means near .5.  Once again, we start with the prior, and for each replicate we draw a different true value from the prior.  Then we simulate data using optional stopping, and the resulting sample means are shown in the histogram on the left.  Optional stopping has affected these data dramatically.  No matter, we choose our bin, again around .3, and plot the true values that led to these sample means.  These true values are shown as the yellow histogram on the right.  They are far more spread out than in the previous simulation without optional stopping, primarily because stopping often occurred at fewer than 25 observations.  Now, is this spread predicted?  Yes.  On each replicate we obtain a posterior distribution, and these vary from replicate to replicate because the sample size is random.  I averaged these posteriors (as I should), and the result is the line, which corresponds well to the histogram.
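In symbols, the plotted line is the average of the replicate-specific posteriors over the \(K\) replicates that landed in the bin, each with its own sample size \(N_k\) and sample mean \(\bar{y}_k\) (the second and third arguments below are the posterior mean and variance):

\[
\bar{p}(\mu) = \frac{1}{K}\sum_{k=1}^{K} \mbox{Normal}\!\left(\mu;\ \frac{N_k\bar{y}_k + m_0/v_0}{N_k + 1/v_0},\ \frac{1}{N_k + 1/v_0}\right),
\]

which is what the averaging step in the code below computes (the line post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))).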




TAKE HOME II: Bayes' Theorem tells you where the true values are given your prior and the data, and it doesn't matter how the data were sampled!

And this should be good news.




*****************

R code

set.seed(123)

# prior on mu: Normal(m0, v0)
m0=0
v0=.5^2

# running sample mean after each observation
runMean=function(y) cumsum(y)/(1:length(y))
# index of the first minimum of a vector (used to find the first "STOP" flip)
minIndex=function(y) order(y)[1]

# optional-stopping sampler: for each replicate, observations arrive one at a time
# (up to topN) and sampling stops with a probability that depends on the running
# sample mean; returns the sample mean at stopping and the sample size
mySampler=function(t.mu,topN)
{
  M=length(t.mu)
  mean=rep(t.mu,topN)
  y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))          # all potential observations
  ybar=t(apply(y,1,runMean))                               # running means, one row per replicate
  prob=plogis((ybar-.6)^2,0,.2)                            # probability of continuing
  another=matrix(nrow=M,ncol=topN,rbinom(M*topN,1,prob))   # 1 = "GO ON", 0 = "STOP"
  stop=apply(another,1,minIndex)                           # first "STOP" flip
  return(list("ybar"=ybar[cbind(1:M,stop)],"N"=stop))
}

# fixed-N sampler: each replicate gets exactly topN observations; returns the sample means
goodSampler=function(t.mu,topN){
  M=length(t.mu)
  mean=rep(t.mu,topN)
  y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
  return(apply(y,1,mean))}

# optional stopping with a fixed true mean of zero (figure 'freqResults.png')
M=10000

png('freqResults.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))
t.mu=rep(0,M)
out=mySampler(t.mu,50)
ybar=out$ybar
N=out$N
v=1/(N+1/v0)           # posterior variance for each replicate
c=(N*ybar+m0/v0)       # posterior mean is c*v
hist(ybar,col='lightblue',main="",xlab="Sample Mean",breaks=50,xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
mtext("Distortion of Sample Mean",side=3,adj=.5,line=1,cex=1.3)
abline(v=mean(ybar),lwd=3,lty=2)
hist(v*c,col='lightgreen',main="",xlab="Posterior Mean",xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
abline(v=mean(v*c),lwd=3,lty=2)
mtext("Distortion of Posterior Mean",side=3,adj=.5,line=1,cex=1.3)
dev.off()


###############################
# Bayesian guarantee: fixed N = 25, true values drawn from the prior (figure 'bayesGuarantee.png')
set.seed(456)
png('bayesGuarantee.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))


M=100000
N=25
t.mu=rnorm(M,m0,sqrt(v0))     # a different true value per replicate, drawn from the prior
ybar=goodSampler(t.mu,N)      # fixed-N sample mean for each replicate
myBreak=seq(-2.45,2.45,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))   # replicates whose sample mean lands in the bin around .3
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="")
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")

v=1/(N+1/v0)         # posterior variance
c=(N*mid+m0/v0)      # posterior mean is c*v, evaluated at ybar = .3

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',
ylim=c(0,2.2),main="",xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")

dev.off()
########################
# optional stopping with true values drawn from the prior (figure 'moneyShot.png')
set.seed(790)
png('moneyShot.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))


M=100000
t.mu=rnorm(M,m0,sqrt(v0))     # again, a different true value per replicate, drawn from the prior
out=mySampler(t.mu,50)        # but now with optional stopping (capped at 50 observations)
ybar=out$ybar
N=out$N                       # sample sizes vary across replicates
myBreak=seq(-5.95,5.95,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="",xlim=c(-4,3))
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")

v=1/(N[good]+1/v0)               # posterior variance for each replicate in the bin
c=(N[good]*ybar[good]+m0/v0)     # posterior mean is c*v, replicate by replicate

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',main="",
ylim=c(0,2.2),xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))   # average of the replicate-specific posteriors
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")

dev.off()



######################
# stopping-probability curve (figure 'probStop.png')

png(file="probStop.png",width=480,height=480)
par(cex=1.3,mar=c(4,4,1,1),mgp=c(2,1,0))
ybar=seq(-2,3,.01)
prob=plogis((ybar-.6)^2,0,.2)
plot(ybar,1-prob,typ='l',lwd=2,ylab="Stopping Probability",xlab="Sample Mean",ylim=c(0,.55))
mtext("Optional Stopping Depends on Sample Mean",side=3,adj=.5,line=-1,cex=1.3)
dev.off()