## Tuesday, April 5, 2016

### The Bayesian Guarantee And Optional Stopping.

Frequentist intuitions run so deep in us that we often mistakenly interpret Bayesian statistics in frequentist ones. Optional stopping has always been a case in point.  Bayesian quantities, when interpreted correctly, are not affected by optional stopping.  This fact is guaranteed by Bayes' Theorem.  Previously, I have shown how this guarantee works for Bayes factors.  Here, let's consider the simple case of estimating an effect size.

For demonstration purposes, let's generate data from a normal with unknown mean, $$\mu$$, but known variance of 1.  I am going to use a whacky optional stopping rule that favors sample means near .5 over others.  Here is how it works:  I. As each observation comes in, compute the running sample mean. II. Compute a probability of stopping that is dependent on the sample mean according to the figure below.  The probability favors stopping for sample means near .5.  III. Flip a coin with sides labeled "STOP" and "GO ON" with the below probability.  IV. Do what the coin says (up to a maximum of 50 observations, then stop no matter).

The results of this rule is a bias toward sample means near .5.   I ran a simulation with a true mean of zero for ten thousand replicates (blue histogram below).  The key property is a biasing of the observed sample means higher than the true value of zero.   Bayesian estimation seems biased too.    The green histogram shows the posterior means when the prior on $$\mu$$ is a normal with mean of zero and a standard deviation of .5.  The bias is less, but that just reflects the details of the situation where the true value, zero, is also favored by the prior.

So it might seem I have proved the opposite of my point---namely that optional stopping affects Bayesian estimation.

Nope.  The above case offers a frequentist interpretation, and that interpretation entered when we examined the behavior on a true value, the value zero.  Bayesians don't interpret analyses conditional on unknown "truths".

### The Bayesian Guarantee

Bayes' Theorem provides a guarantee.  If you start with your prior and observed data, then Bayes' Theorem guarantees that the posterior is the optimal set of probability statements about the parameter at hand.  It is a bit subtle to see this in simulation because one needs to condition on data rather than on some unknown truth.

Here is how a Bayesian uses simulation shows the Bayesian Guarantee.

I. On each replicate, sample a different true value from the prior.  In my case, I just draw from a normal centered at zero with standard deviation of .5 since that is my prior on effect size for this  post.  Then, on each replicate, simulate data from that truth value for that replicate.  I have chosen data of 25 observations (from a normal with variance of 1).  A histogram of the sample mean across these varying true values is provided below, left panel.   I ran the simulation for 100,000 replicates.

II. The histogram is that of data (sample means) we expect under our prior.  We need to condition on data, so let's condition on an observed sample mean of .3.  I have highlighted a small bin between .25 and .35 with red.  Observations fall in this bin about 6% of the time.

III.  Look at all the true values that generated those sample means in the bin with .3.  These true values are shown in the yellow histogram.  This histogram is the target of Bayes' Theorem, that is, we can use Bayes Theorem to describe this distribution without going through the simulations.   I have computed the posterior distribution for a sample mean of .3 and 25 observations under my prior, and plotted it as the line.  Notice the correspondence.  This correspondence is the simulation showing that Bayes Theorem works.  It works, by the way, for every bin though I have just shown it for the one centered on .3.

TAKE HOME 1: Bayes Theorem tells the distribution of true values given your prior and the data.

### Is The Bayesian Guarantee Affected By Optional Stopping?

So, we come to the crux move.  Let's simulate the whacky optional stopping rule that favors sample means near .5.  Once again, we start with the prior, and for each replicate we choose a different truth value as a sample from the prior.  Then we simulate data using optional stopping, and the resulting sample means are shown in the histogram on the left.  Optional stopping has affected these data dramatically.  No matter, we choose our bin, again around .3, and plot the true values that led to these sample means.  These true values are shown as the yellow histogram on the right.  They are far more spread out than in the previous simulation without optional stopping primarily because stopping occurred often for less than 25 observations.  Now, is this spread predicted?  Yes.  On each replication we obtain a posterior distribution, and these vary from replication-to-replication because the sample size is random.  I averaged these posteriors (as I should), and the result is the line that corresponds well to the histogram.

TAKE HOME  II: Bayes Theorem tells you where the true values are given your prior and the data, and it doesn't matter how the data were sampled!

And this should be good news.

*****************

R code

set.seed(123)

m0=0
v0=.5^2

runMean=function(y) cumsum(y)/(1:length(y))
minIndex=function(y) order(y)

mySampler=function(t.mu,topN)
{
M=length(t.mu)
mean=rep(t.mu,topN)
y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
ybar=t(apply(y,1,runMean))
prob=plogis((ybar-.6)^2,0,.2)
another=matrix(nrow=M,ncol=topN,rbinom(M*topN,1,prob))
stop=apply(another,1,minIndex)
return(list("ybar"=ybar[cbind(1:M,stop)],"N"=stop))
}

goodSampler=function(t.mu,topN){
M=length(t.mu)
mean=rep(t.mu,topN)
y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
return(apply(y,1,mean))}

M=10000

png('freqResults.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))
t.mu=rep(0,M)
out=mySampler(t.mu,50)
ybar=out$ybar N=out$N
v=1/(N+1/v0)
c=(N*ybar+m0/v0)
hist(ybar,col='lightblue',main="",xlab="Sample Mean",breaks=50,xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
mtext("Distortion of Sample Mean",side=3,adj=.5,line=1,cex=1.3)
abline(v=mean(ybar),lwd=3,lty=2)
hist(v*c,col='lightgreen',main="",xlab="Posterior Mean",xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
abline(v=mean(v*c),lwd=3,lty=2)
mtext("Distortion of Posterior Mean",side=3,adj=.5,line=1,cex=1.3)
dev.off()

###############################
set.seed(456)
png('bayesGuarantee.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))

M=100000
N=25
t.mu=rnorm(M,m0,sqrt(v0))
ybar=goodSampler(t.mu,N)
myBreak=seq(-2.45,2.45,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red' plot(bars,col=myCol,xlab="Sample Mean",main="") mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior") v=1/(N+1/v0) c=(N*mid+m0/v0) hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow', ylim=c(0,2.2),main="",xlim=c(-1.75,1.75)) myES=seq(-2,2,.01) post=1:length(myES) for (i in 1:length(myES)) post[i]=mean(dnorm(myES[i],c*v,sqrt(v))) lines(myES,post,lwd=2) mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3") dev.off() ######################## set.seed(790) png('moneyShot.png',width=960,height=480) par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0)) M=100000 t.mu=rnorm(M,m0,sqrt(v0)) out=mySampler(t.mu,50) ybar=out$ybar
N=out$N myBreak=seq(-5.95,5.95,.1) bars=hist(ybar,breaks=myBreak,plot=F) mid=.3 good=(ybar >(mid-.05) & ybar<(mid+.05)) myCol=rep("white",length(myBreak)) myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="",xlim=c(-4,3))
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")

v=1/(N[good]+1/v0)
c=(N[good]*ybar[good]+m0/v0)

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',main="",
ylim=c(0,2.2),xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")

dev.off()

######################
#stop probability

png(file="probStop.png",width=480,height=480)
par(cex=1.3,mar=c(4,4,1,1),mgp=c(2,1,0))
ybar=seq(-2,3,.01)
prob=plogis((ybar-.6)^2,0,.2)
plot(ybar,1-prob,typ='l',lwd=2,ylab="Stopping Probability",xlab="Sample Mean",ylim=c(0,.55))
mtext("Optional Stopping Depends on Sample Mean",side=3,adj=.5,line=-1,cex=1.3)
dev.off()

#### 12 comments:

Tal said...

I don't understand the rationale for conditioning your analysis on a sample mean of around .3. In the real world, it would be rather un-Bayesian to condition this way if you believed that all samples were being generated by the same underlying process (which they are in your simulation). Surely samples with a mean other than .3 are just as useful in updating our estimate of the "true" parameter.

The salient question is whether, in your optional stopping simulation, the posterior distribution over all samples (i.e., not just ones selected post-hoc based on the sample mean) matches the prior distribution. Ideally, it should given that you're sampling the true population values according to the prior. But it won't in this case, because of the bias introduced by the optional stopping rule. Of course, you're welcome to argue that bias is a frequentist property that Bayesians shouldn't care about it, but I suspect that in that case your simulation results will look more like a modus tollens than a modus ponens to most people.

Alexander Etz said...

"I don't understand the rationale for conditioning your analysis on a sample mean of around .3."
The point of a Bayesian analysis is to infer which values of the parameter plausibly led to the actually observed data (represented by the posterior). In this example the actually observed data is .3, so that's what the posterior is conditioned on.

John K. Kruschke said...

This demo and its underlying math are clear, but this Bayesian framing isn't answering the question that I think is being asked.

The question is not, For a specific data value, what posterior parameter distribution is inferred? (And is that parameter distribution biased relative to the specific data value?)

The question is, When the Bayesian posterior distributions for each data value are integrated across all the sample data sets, is that aggregate distribution biased relative to the known generating prior?

I don't think it helps the asker of that question to tell them they shouldn't ask that question, or that it's not a Bayesian question. They still want an answer to the question.

Jeff Rouder said...

Tal, I am having a hard time understanding your question/points. I think perhaps you misunderstand my goals. Sometimes people simulate how optional stopping affects Bayesian quantities by simulating data from known truth and then plotting histograms of the resulting Bayesian estimates and seeing how they deviate from the ground truth. Bayesian queries are not focused on this conditional but on the reverse one. What can we learn about the plausibility of various true values conditional on data (and our prior). So, in simulation, you need to condition on some data value, I chose .3 but analogous results hold for any value. I collected all the truth values that produced a sample mean near .3. I used the simulations to construct one distribution of true values conditional on some sample mean, and then I showed that Bayes Theorem describes this distribution. And this good description always holds, which is, well, quite neat. In practice, you would have a single sample mean---the result of your experiment---and you would like to know the plausibility of true values. Bayes Theorem well describes these plausibilities even optional stopping. And that too, is quite neat.

Jeff Rouder said...

John, Thanks for your comments. I think the question asked depends on whether the analyst is Bayesian or frequentist. I am Bayesian. I ask, given my data, what are my beliefs about the plausibility of different truth statements. Bayes' Theorem answers this question. I am heartened to know that Bayes' Theorem holds regardless of how the data were collected.

The context for the blog is a simulation on Facebook by John Cleavenger who simulated Bayesian estimation from a single ground truth value as in the top analysis. The claim is that Bayesian estimation is problematic. I replied then that it is problematic if you interpret the Bayesian quantities as a frequentist rather than as a Bayesian. So here I offer what I consider to be the proper Bayesian interpretation.

Tal Yarkoni said...

Jeff, I think I understand your goals, and my point is basically the same as John's: you're answering a question that nobody really cares about. And I don't actually think it's a frequentist/Bayesian split in this case; I suspect that many people who call themselves Bayesians would also conclude that the parameter estimates are biased in a perfectly meaningful sense if you use optional stopping. The fact that Bayesians are trying to infer plausible values from data and not the other way around doesn't mean that a Bayesian should reject the idea that there is a single asymptotic parameter estimate that one would converge on given all of the data in the world. It should be clear from your setup that if you use an optional stopping rule like the one you propose, you will never converge on the true parameter estimate no matter how much data you collect. Now, you can insist that this idea of convergence on some true parameter value in the long run is a frequentist notion if you like, but my point (and John's too, I think) was that if you do that, the main consequence will be to push people away from Bayesian inference, because it will just seem to most people like Bayesian inference can't tell them what they want to know (which is not true--it can tell them what they want to know, provided one accepts that the stopping rule does matter).

In effect, what you're saying is: given any prior, any data, and any sampling plan, Bayes will give you the optimal answer. Well, sure. But that's not what researchers want; I have a hard time seeing many people wanting a posterior estimate that depends on the sampling plan! I mean, suppose I decided that my sampling plan is to visually inspect my results every 10 subjects and drop all the observations that don't fit with my hypothesis. Strictly speaking, you would still be correct to say that "Bayes Theorem tells you where the true values are given your prior and the data, and it doesn't matter how the data were sampled". But this would be a completely useless analysis; nobody in their right mind would trust the resulting parameter estimate--not because of Bayes (or lack thereof), but because it's a stupid and clearly biased stopping rule. The same is true (though to a lesser degree) of the stopping rule in your example. Nobody is going to care if your estimate is optimal conditional on some really bad decisions if the upshot is that no amount of data will ever lead to the asymptotically correct estimate you'd get from sampling the entire population in one shot.

JP de Ruiter said...

Tal, the question Jeff answers here is not one that "nobody cares about". It's a question that scientists using data in order to learn about the truths of their beliefs *should* care about.

Felix Thoemmes said...

Thanks a lot for posting the R code! Much appreciated.

Jeff Rouder said...

Well, now get your excellent point. It is a strong challenge. We can make it even stronger. Suppose an unscrupulous researcher collected data and changed all the negatively-vaued observations to positively-valued observations. I could simulate data from this process, and I can even show that given a set of observations, Bayes' Theorem accurately describes the true values that generated these data. Now your point is "so what." We have a data integrity problem here and the fact that Bayes' Theorem works is at best inconsequential.

My attention is captured and my thinking could stand some sharpening. Thanks.

Some points:

1. If your question is how well you capture a hypothetical truth should it hold, by all means please use frequentist methods. In most cases they are tuned in efficiency and bias, we have a lot good sense about how they behave, and they control long-term error rates well.

2. If you choose Bayesian methods because of the frequentist properties, say so. Say I am a frequentist and I am using Bayesian because it is more tractable and in some cases. Also some people choose certain prior because they have excellent frequentist properties. This is fine, but, in my view, at heart this is frequentist endeavor with Bayes used as a convenient computational engine. But let's be honest about the goal. Is it about updating beliefs or about error control for select hypotheticals. Mine is the former.

3. I think the issue with the unscrupulous researcher who changes the sign of negative values is a model issue rather than an analysis issue. The Bayesian research is learning exactly about the parameters in the model, but the model is not well describing the data-generation and fraud process. What we need (in addition to honest, open reporting) are more useful models. Model specification to me is a substantive issue not a statistical one. Our inference about parameters is only as good as our models. That point probably needs to be said a few times because most researchers reify these estimates as properties of the world. To me, they are collectively fabricated.

Tal Yarkoni said...

I agree with all of the above. I don't see the issue of sampling scheme (which I agree can be thought of as a form of model specification) as being intrinsically related to the choice of frequentist vs. Bayesian statistical methods. Which is why I object to the argument that optional stopping is irrelevant if you're a Bayesian. It's true that the machinery of Bayesian inference works just fine no matter what data you feed it (whereas the frequentist machinery won't always), but that doesn't mean that a researcher who's using Bayesian inference doesn't need to think about the sampling scheme. The fact that one camera might work perfectly well underwater while another one doesn't is no guarantee that the photographer is going to get the best shot of the beach from below the water line--and depending on many other factors, it could turn out that the less resilient instrument produces more useful results.

I agree that people should be able to explain why they're using whatever approach they're using.

Jeff Rouder said...

Yeah, I think we are there. Thanks for the comments---I learned a lot. I think I would write the blog post a bit differently if I could do it over again. Jeff

Joachim Vandekerckhove said...

The "data integrity problem" is associated with a relatively simple quantitative rule, I think. The sampling plan matters for Bayesians iff it affects the kernel of the likelihood (and matters for frequentists even if it doesn't). Some censoring rules will have that effect.