Monday, March 28, 2016

The Effect-Size Puzzler, The Answer

I wrote the Effect-Size Puzzler because it seemed to me that people have reduced the concept of effect size to a few formulas on a spreadsheet.  It is a useful concept that deserves a bit more thought.

The example I provided is the simplest case I can think of that is germane to experimental psychologists.  We ask 25 people to perform 50 trials in each of 2 conditions, and ask what is the effect size of the condition effect.  Think Stroop if you need a context.

The answer, by the way, is \(+\infty\).  I'll get to it.

The good news about effect sizes  

Effect sizes have revolutionized how we compare and understand experimental results.  Nobody knows whether a 3% change in error rate is big or small or comparable across experiments; everybody knows what an effect size of .3 means.  And our understanding is not merely associative or mnemonic: we can draw a picture like the one below and talk about overlap and difference.  It is this common meaning and portability that licenses a modern emphasis on estimation.  Sorry estimators, I think you are stuck with standardized effect sizes.

Below is a graph from Many Labs 3 that makes the point.  Here, the studies have vastly different designs and dependent measures.  Yet, they can all be characterized in unison with effect size.

The bad news about effect size

Even for the simplest experiment above, there is a lot of confusion.  Jake Westfall provides 5 different possibilities and claims that perhaps 4 of these 5 are reasonable at least under certain circumstances.  The following comments were provided on Twitter and Facebook: Daniel Lakens makes recommendations as to which one we shall consider the preferred effect size measure.  Tal Yarkoni and Uli Schimmack wonder about the appropriateness of effect size in within-subject designs and prefer unstandardized effects (see Jan Vanhove's blog).  Rickard Carlsson prefers effect sizes in physical units where possible, say in milliseconds in my Effect Size Puzzler.   Sanjay Srivastava needs the goals and contexts first before weighing in.  If I got this wrong, please let me know.

From an experimental perspective, The Effect Size Puzzler is as simple as it gets.  Surely we can do better than to abandon the concept of standardized effect sizes or to be mired in arbitrary choices.

Modeling: the only way out

Psychologists often think of statistics as procedures, which, in my view, is the most direct path to statistical malpractice.  Instead, statistical reasoning follows from statistical models.  And if we have a few guidelines and a model, then standardized effect sizes are well defined and useful.  Showing off the power of model thinking rather than procedure thinking is why I came up with the puzzler.

Effect-size guidelines

#1:  Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population.

#2:  Measures of true effect and true amount of variability are only defined in statistical models.  They don't really exist except within the context of a model.  The model is important.  It needs to be stated.

#3: The true effect size should not be tied to the number of participants nor the number of trials per participant.  True effect sizes characterize a state of nature independent of our design.

The Puzzler Model

I generated the data to be realistic.  They had the right amount of skew and offset, and the tails fell like real RTs do.   Here is a graph of the generating model for the fastest and slowest individuals:

All data had a lower shift of .3s (see green arrow), because we typically trim these out as being too fast for a choice RT task.  The scale was influenced by both an overall participant effect and a condition effect, and the influence was multiplicative.  So faster participants had smaller effects; slower participants had bigger effects.  This pattern too is typical of RT data.   The best way to describe these data is in terms of percent-scale change.  The effect was to change the scale by 10.5%, and this amount was held constant across all people.  And because it was held constant, that is, there was no variability in the effect,  the standardized effect size in this case is infinitely large.
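To see why a multiplicative scale effect sitting on a fixed shift points toward a log transform, here is a minimal sketch in R.  Only the .3 s shift and the 10.5% scale change come from the description above; the three participant scales and the value of z are hypothetical numbers picked for illustration.

```r
# Illustrative sketch: shift and percent change are from the text;
# the participant scales and the point z are made-up values.
shift <- 0.3                       # lower shift in seconds
scale_change <- 1.105              # condition 2 stretches the scale by 10.5%
base_scale <- c(0.2, 0.4, 0.8)     # hypothetical fast, medium, slow participants
z <- 1.7                           # an arbitrary point on the unit-scale distribution

rt1 <- shift + base_scale * z                  # condition 1
rt2 <- shift + base_scale * scale_change * z   # condition 2

raw_effect <- rt2 - rt1                            # grows with the participant's scale
log_effect <- log(rt2 - shift) - log(rt1 - shift)  # the same for every participant

print(round(raw_effect, 4))
print(round(log_effect, 4))
```

In seconds the effect differs from person to person, but after subtracting the shift and taking logs it is the constant log(1.105), which is about .1.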

Now, let's go explore the data.  I am going to skip over all the exploratory stuff that would lead me to the following transform, Y = log(RT-.3), and just apply it.  Here is a view of the transformed generating model:

So, let's put plain-old vanilla normal models on Y.  First, let's take care of replicates.
\[ Y_{ijk} \sim \mbox{Normal} (\mu_{ij},\sigma^2)\]
where \(i\) indexes individuals, \(j=1,2\) indexes conditions, and \(k\) indexes replicates.  Now, let's model \(\mu_{ij}\).  A general formulation is
\[\mu_{ij} = \alpha_i+x_j\beta_i,\]
where \(x_j\) is a dummy code of 0 for Condition 1 and 1 for Condition 2.  The term \(\beta_i\) is the ith individual's effect.  We can model it as
\[\beta_i \sim \mbox{Normal}(\beta_0,\delta^2)\]
where \(\beta_0\) is the mean effect across people and \(\delta^2\) is the variation of the effect across people.

With this model, the true effect size is \[d_t = \frac{\beta_0}{\delta}.\] Here, by true, I just mean that it is a parameter rather than a sample statistic.  And that's it, and there is not much more to say in my opinion.   In my simulations the true value of each individual's effect was .1.  So the mean, \( \beta_0\), is .1 and the standard deviation, \(\delta\), is, well, zero.  Consequently, the true standardized effect size is \(d_t=+\infty\).   I can't justify any other standardized measure that captures the above principles.
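Guideline #3 can be checked against this model with a small simulation.  The sketch below is only an illustration under assumed values: \(\beta_0=.1\) and \(\delta=0\) follow the text, \(\sigma=.4\) is taken as a plausible log-scale value, and the participant baselines and the function name `sample_d` are hypothetical.  It computes a conventional sample statistic (mean of participants' sample effects over their standard deviation) and shows that it moves with the number of trials, while \(d_t\) does not.

```r
# Sketch under assumed values; baselines (mean -1, sd .3) are hypothetical.
set.seed(1)

sample_d <- function(I, K, beta0 = 0.1, delta = 0, sigma = 0.4) {
  alpha <- rnorm(I, mean = -1, sd = 0.3)        # hypothetical participant baselines
  beta  <- rnorm(I, mean = beta0, sd = delta)   # true individual effects
  # mean of K replicates in each condition for every participant
  ybar1 <- sapply(alpha,        function(m) mean(rnorm(K, m, sigma)))
  ybar2 <- sapply(alpha + beta, function(m) mean(rnorm(K, m, sigma)))
  eff <- ybar2 - ybar1                          # participants' sample effects
  mean(eff) / sd(eff)                           # effect over its observed spread
}

# Large-sample value of this statistic: the spread of sample effects mixes
# true variation (delta^2) with trial noise (2 * sigma^2 / K)
expected <- function(K, beta0 = 0.1, delta = 0, sigma = 0.4)
  beta0 / sqrt(delta^2 + 2 * sigma^2 / K)

# Many participants so the statistic is stable; only K differs between calls
d_K50  <- sample_d(I = 2000, K = 50)
d_K200 <- sample_d(I = 2000, K = 200)

print(c(d_K50 = d_K50, expected = expected(50)))
print(c(d_K200 = d_K200, expected = expected(200)))
```

With \(\delta=0\) the statistic is driven entirely by trial noise, so quadrupling the trials roughly doubles it.  That dependence on the design is what the model-based \(d_t\) avoids.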


Could a good analyst have found this infinite value?  That is a fair question. The plot below shows individuals' effects, and I have ordered them from smallest to largest.  A key question is whether these are spread out more than expected from within-cell sample noise alone.  If these individual sample effects are more spread out, then there is evidence for true individual variation in \(\beta_i\).  If these stay as clustered as predicted by sample noise alone, then there is evidence that people's effects do not vary.  The solid line is the prediction from within-cell noise alone.   It is pretty darn good.  (The dashed line is the null that people have the same, zero-valued true effect.)  I also computed a one-way random-effects F statistic to see if there is a common effect or many individual effects.  The result was F(24,2450) = 1.03.  Seems like one effect.
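For readers who want to try this kind of check, here is a rough sketch on simulated data.  It assumes the F statistic is the usual person-by-condition interaction tested against within-cell noise, which matches the degrees of freedom reported above but is my reading rather than a record of the original computation.  The data are generated from the normal model on Y with hypothetical baselines; the puzzler file itself is not loaded.

```r
# Sketch on simulated data; beta0 = .1 follows the text, sigma = .4 and the
# baselines (mean -1, sd .3) are assumed values.
set.seed(3)
I <- 25; K <- 50
alpha <- rnorm(I, mean = -1, sd = 0.3)
dat <- expand.grid(rep = 1:K, cond = 1:2, id = 1:I)
dat$y <- rnorm(nrow(dat), mean = alpha[dat$id] + (dat$cond - 1) * 0.1, sd = 0.4)

# Person-by-condition interaction against the within-cell residual
fit <- lm(y ~ factor(id) * factor(cond), data = dat)
tab <- anova(fit)
inter <- tab["factor(id):factor(cond)", ]
print(inter)
```

Because every simulated person has the same true effect, the interaction F should hover around 1, with 24 and 2450 degrees of freedom.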

These one-effect results should be heeded.  It is a structural element that I would not want to miss in any data set.   We should hold plausible the idea that the standardized effect size is exceedingly high as the variation across people seems very small if not zero.

To estimate effect sizes, we need a hierarchical model.  You can use Mplus, AMOS, lme4, WinBUGS, JAGS, or whatever you wish.  Because I am an old dog and don't learn new tricks easily, I will do what I always do and program these models from scratch.

I used the general model above in the Bayesian context.  The key specification is the prior on \( \delta^2\).   In the log-normal, the variance is a shape parameter, and it is somewhere around \(.4^2\).  Effects across people are usually about 1/5th of this say \(.08^2\).  To capture variances in this range, I would use a  \(\delta^2 \sim \mbox{Inverse Gamma(.1,.01)} \) prior for general estimation.  This is a flexible prior tuned for the 10 to 100 millisecond range for variation in effects across people.  The following plot shows the resulting estimates of individual effects as a function of the sample effect values.
The noteworthy feature is the lack of variation in model estimates of individuals' effects!  This type of pattern, where variation in model estimates is attenuated compared to sample statistics, is called shrinkage, and it occurs because hierarchical models don't chase within-cell sample noise.  Here the shrinkage is nearly complete, leading again to the conclusion that there is no real variation across people, or an infinitely large standardized effect size.  For the record, the estimated effect size here is 5.24, which, in effect size units, is getting quite large!
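The logic of shrinkage can be sketched without a full Bayesian fit.  The block below is not the hierarchical model with the inverse-gamma prior described above; it swaps in a simple moment-based (empirical Bayes style) stand-in, run on simulated data with assumed values (\(\sigma=.4\), hypothetical baselines), just to show how sample effects get pulled toward the mean when their spread is no larger than trial noise predicts.

```r
# Moment-based stand-in for hierarchical shrinkage, on simulated data.
# Not the Bayesian model from the text; all generating values are assumptions.
set.seed(2)
I <- 25; K <- 50
alpha <- rnorm(I, mean = -1, sd = 0.3)              # hypothetical baselines
dat <- expand.grid(rep = 1:K, cond = 1:2, id = 1:I)
dat$y <- rnorm(nrow(dat), mean = alpha[dat$id] + (dat$cond - 1) * 0.1, sd = 0.4)

cell_mean <- tapply(dat$y, list(dat$id, dat$cond), mean)   # I x 2 cell means
cell_var  <- tapply(dat$y, list(dat$id, dat$cond), var)    # I x 2 cell variances
eff <- cell_mean[, 2] - cell_mean[, 1]                     # sample effects

noise_var  <- 2 * mean(cell_var) / K          # variance of a sample effect from trial noise
delta2_hat <- max(0, var(eff) - noise_var)    # what is left over for true variation
w <- delta2_hat / (delta2_hat + noise_var)    # 0 = complete shrinkage, 1 = none

shrunk <- mean(eff) + w * (eff - mean(eff))   # shrunken individual effects
print(round(c(weight = w, sd_sample = sd(eff), sd_shrunk = sd(shrunk)), 4))
```

When the leftover variance is zero or close to it, the weight is near zero and every individual estimate collapses toward the common mean, which is the pattern described above.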

The final step for me is comparing this variable effect model to a model with no variation, say \( \beta_i = \beta_0 \) for all people.  I would do this comparison with Bayes factor.  But, I am out of energy and you are out of patience, so we will save it for another post.

Back To Jake Westfall

Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure.  He does this to stay true to Cohen's formulae.  I think it is a conceptual mistake.

I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people.  In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design.   Within-subject designs are superior, so why go backwards and mash the sources of variance together when you don't have to?  This advice strikes me as crazy.  To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d.  Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures.  Just don't forget the hierarchical shrinkage when you use it!

Thursday, March 24, 2016

The Effect-Size Puzzler

Effect sizes are bandied about as useful summaries of the data.  Most people think they are straightforward and obvious.  So if you think so, perhaps you won't mind a bit of a challenge?  Let's call it "The Effect-Size Puzzler," in homage to NPR's Car Talk.  I'll buy the first US winner a nice Mizzou sweatshirt (see here).  Standardized effect size please.

I have created a data set with 25 people each observing 50 trials in 2 conditions.  It's from a priming experiment.  It looks about like real data.  Here is the download.

The three columns are:

  • id (participant: 1...25)
  • cond (condition: 1,2)
  • rt (response time in seconds).  

There are a total of 2500 rows.

I think it will take you just a few moments to load it and tabulate your effect size for the condition effect.  Have fun.  Write your answer in a comment or write me an email.

I'll provide the correct answer in a blog next week.

HINT: If you wish to get rid of the skew and stabilize the variances, try the transform y=log(rt-.3)

Monday, March 21, 2016

Roll Your Own II: Bayes Factors With Null Intervals

The Bayes factors we develop compare the null model to an alternative model.  This null model is almost always a single point---the true effect is identically zero.    People sometimes confuse our advocacy for Bayes factor with that for point-null-hypothesis testing.  They even critique Bayes factor with the Cohenesque claim that the point null is never true.

Bayes factor is a general way of measuring the strength of evidence from data for competing models.  It is not tied to the point null.

We develop for the point null because we think it is a useful, plausible, theoretically meaningful model.  Others might disagree, and these disagreements are welcome as part of the exchange of viewpoints in science.

In the blog post Roll Your Own: How to Compute Bayes Factors for Your Priors, I provided R code to compute a Bayes factor between a point-null and a user-specified alternative for a simple setup motivated by the one-sample t-test.  I was heartened by the reception and I hope a few of you are using the code (or the comparable code provided by Richard Morey).  There have been some requests to generalize the code for non-point nulls.  Here, let's explore the Bayes factor for any two models in a simple setup.  As it turns out, the generalization is instructive and computationally trivial.   We have all we need from the previous posts.

Using Interval Nulls: An Example

Consider the following two possibilities:

I. Perhaps you feel the point null is too constrained and would rather adopt a null model with mass on a small region around zero rather than at the point.  John Kruschke calls these regions ROPEs (regions of practical equivalence).

II. Perhaps you are more interested in the direction of an effect rather than whether it is zero or not.  In this case, you might consider testing two one-sided models against each other.

For this blog, I am going to retain four different priors. Let’s start with a data model. Data are independent normal draws with mean \(\mu\) and variance \(\sigma^2\). It is more convenient to re-express the normal as a function of effect size, \(\delta\), and \(\sigma^2\), where \(\delta=\mu/\sigma\). Here is the formal specification:
\[ Y_i \mid \delta,\sigma^2 \stackrel{iid}{\sim} \mbox{Normal}(\sigma\delta,\sigma^2).\]
Now, the substantive positions as prior models on effect size:
  1. \(M_0\), A Point Null Model: \(\delta=0\)
  2. \(M_1\), A ROPE Model: \(\delta \sim \mbox{Unif}(-.25,.25)\)
  3. \(M_2\), A Positive Model: \(\delta \sim \mbox{Gamma(3,2.5)}\)
  4. \(M_3\), A Negative Model: \(-\delta \sim \mbox{Gamma(3,2.5)}\)
Here are these four models expressed graphically as distributions:

I picked these four models, but you can pick as many as you wish. For example, you can include a normal if you wish.

Oh, let's look at some data.  Suppose the observed effect size is .35 for an N of 60.

Going Transitive

Bayes factors are the comparison between two models.  Hence we would like to compute the Bayes factors between any of these models.  Let \(B_{ij}\) be the comparison between the ith and jth model.  We want a Table like this:

\(B_{00}\) \(B_{01}\) \(B_{02}\) \(B_{03}\)
\(B_{10}\) \(B_{11}\) \(B_{12}\) \(B_{13}\)
\(B_{20}\) \(B_{21}\) \(B_{22}\) \(B_{23}\)
\(B_{30}\) \(B_{31}\) \(B_{32}\) \(B_{33}\)

Off the bat, we know the Bayes factor between a model and itself is 1 and that \(B_{ij} = 1/B_{ji}\). So we only need to worry about the lower corner.

\(B_{10}\) 1
\(B_{20}\) \(B_{21}\) 1
\(B_{30}\) \(B_{31}\) \(B_{32}\) 1

We can use the code below, from the previous post, to figure out the null vs. all the other models.

\[ B_{10} = 4.9, \quad B_{20} = 4.2, \quad B_{30} = .0009 \]

Here we see that the point null is not as attractive as the ROPE null or the positive model. It is more attractive, however, than the negative model.

Suppose, however, that you are most interested in the ROPE null and its comparison to the positive and negative model. The missing Bayes factors are \(B_{12}\), \(B_{13}\), and \(B_{23}\).

The key application of transitivity is as follows:

\[ B_{ij} = B_{ik} \times B_{kj}. \]

So, we can compute \(B_{12}\) as follows: \(B_{12} = B_{10} \times B_{02} = B_{10}/B_{20} = 4.9/4.2 = 1.2\).

The other two Bayes factors are computed likewise: \(B_{13} = 5444 \) and \(B_{23} = 4667\)

So what have we learned? Clearly, if you were pressed to choose a direction, it is the positive direction. That said, the positive model gains nothing over the ROPE null: \(B_{12}=1.2\) leans slightly toward the ROPE null, so the data barely discriminate between the two.
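The transitivity arithmetic can be done in one step.  This small sketch takes the three Bayes factors against the point null reported above as given and fills in the whole table; the vector name `b0` is just a label chosen here.

```r
# Bayes factors of each model against the point null, as reported above
b0 <- c(M0 = 1, M1 = 4.9, M2 = 4.2, M3 = 0.0009)

# Transitivity: B_ij = B_i0 / B_j0, so the full table is an outer ratio
B <- outer(b0, b0, "/")
print(signif(B, 3))

# The three entries discussed in the text
print(c(B12 = B["M1", "M2"], B13 = B["M1", "M3"], B23 = B["M2", "M3"]))
```

Rows index the model in the numerator, so an entry above 1 favors the row model over the column model.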

Snippets of R Code

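#First, Define Your Models as a List
#lo, lower bound of support
#hi, upper bound of support
#fun, density function

#here are Models M1, M2, M3 
#add or change here for your models 

mod1=list(lo=-.25,hi=.25,fun=function(x,lo,hi) dunif(x,lo,hi))
mod2=list(lo=0,hi=Inf,fun=function(x,lo,hi) dgamma(x,shape=3,rate=2.5))
mod3=list(lo=-Inf,hi=0,fun=function(x,lo,hi) dgamma(-x,shape=3,rate=2.5))

#note, we dont need to specify the point null, it is built into the code

#Lets make sure the densities are proper, here is a function to do so:

normalize=function(mod) return(c(mod,K=1/integrate(mod$fun,lower=mod$lo,upper=mod$hi,lo=mod$lo,hi=mod$hi)$value))

#and now we normalize the three models

mod1=normalize(mod1)
mod2=normalize(mod2)
mod3=normalize(mod3)

#Observed Data: the effect size and N from the example above

es=.35
N=60

#Here is the key function that computes the Bayes factor between a model and the point null
#the body assumes the one-sample t setup of the previous post:
#the observed t-value has a t density under the null and a noncentral t density otherwise

BF=function(mod,es,N)
{
#normalized prior density on effect size
f= function(delta) mod$fun(delta,mod$lo,mod$hi)*mod$K
#predicted density of the observed t-value under the point null
pred.null=dt(sqrt(N)*es,N-1)
#predicted density under the model, averaged over its prior
integrand=function(delta) dt(sqrt(N)*es,N-1,ncp=sqrt(N)*delta)*f(delta)
pred.mod=integrate(integrand,lower=mod$lo,upper=mod$hi)$value
return(pred.mod/pred.null)
}

#each model against the point null
B10=BF(mod1,es,N)
B20=BF(mod2,es,N)
B30=BF(mod3,es,N)

print(paste("B10=",B10,"   B20=",B20,"   B30=",B30))

#the rest by transitivity
B12=B10/B20
B13=B10/B30
B23=B20/B30

print(paste("B12=",B12,"   B13=",B13,"   B23=",B23))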

Tuesday, March 15, 2016

Statistical Difficulties from the Outer Limits

You would think that the more data we collect, the closer we should be to the truth.  

This blog post falls into the "I may be wrong" category.  I hope many of you comment.

ESP: God's Gift To Bayesians?

It seems like ESP is God's gift to Bayesians.  We use it like a club to reinforce the plausibility of null hypotheses and to point out the difficulties of frequentist analysis.

In the 1980s, a group of Princeton University engineers set out to test ESP by asking people to use their minds to change the outcome of a random noise generator (check out their website).   Over the course of a decade, these engineers collected an astounding 104,490,000 trials.  On each trial, the random noise generator flipped a gate with known probability of exactly .5.  The question was whether a human operator using only the power of his or her mind could increase this rate.  Indeed, they found 52,263,471 gate flips, or 0.5001768 of all trials.  This proportion, though only slightly larger than .5, is nonetheless significantly larger with a damn low p-value of .0003.   The figure below shows the distribution of successes under the null, and the observation is far to the right.  The green interval is the 99% CI, and it does not include the null.
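The frequentist numbers are easy to reproduce from the two counts.  The sketch below uses a two-sided normal approximation and a Wald interval; both are assumptions on my part about how the p-value and the 99% CI were obtained.

```r
# Counts as reported above
N <- 104490000          # trials
y <- 52263471           # gate flips

p_hat <- y / N
z <- (y - N / 2) / sqrt(N / 4)        # normal approximation under theta = .5
p_two_sided <- 2 * pnorm(-abs(z))     # assumed: two-sided test

# Assumed: Wald interval around the observed proportion
ci99 <- p_hat + c(-1, 1) * qnorm(0.995) * sqrt(p_hat * (1 - p_hat) / N)

print(c(proportion = p_hat, z = z, p = p_two_sided))
print(ci99)
```

The z-value is a bit over 3.6, the two-sided p-value is about .0003, and the interval sits just above .5.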

Let's assume these folks have a decent set up and the true probability should be .5 without human ESP intervention.  Did they show ESP?

What do you think?  Their data are numerous, but do you feel closer to the truth?  Impressed by the low p-value?  Bothered by the wafer-thin effect?  Form an opinion; leave a comment.  

Bayesians love this example because we can't really fathom what a poor frequentist would do.  The p-value is certainly lower than .05, even lower than .01, and even lower than .001.  So, it seems like a frequentist would need to buy in.  The only way out is to lower the Type I error rate in response to the large sample size.  But to what value and why?

ESP: The Trojan Horse?

ESP might seem like God's gift to Bayesians, but maybe it is a Trojan Horse.  A Bayes factor model comparison analysis goes as follows.  The no-ESP null model is
\[ M_0: Y  \sim \mbox{Binomial}(.5,N) \]

The ESP alternative is
\[ M_1: Y|\theta  \sim \mbox{Binomial}(\theta,N) \]
A prior on \(\theta\) is needed to complete the specification.  For the moment, let's use a flat one, \(\theta \sim \mbox{Unif}(0,1) \).

It is pretty easy to calculate a Bayes factor here, and the answer is 12-to-1 in favor of the null.   What a relief.

ESP proponents might rightly criticize this prior as too dispersed.  We may reasonably assume that \( \theta \) should not be less than .5 as we can assume the human operators are following the direction to increase rather than decrease the proportion of gate flips.   Also, the original investigators might argue that it is unreasonable to expect anything more than a .1% effect, so the top might be .501.  In fact, they might argue they ran such a large experiment because they expected, a priori, such a small effect.  If the prior is \(\theta \sim \mbox{Unif}(.5,.501) \), then the Bayes factor is 84-to-1 for an ESP effect.

The situation seems tenuous.  The figure below shows the Bayes factors for both priors as a function of the number of trials.  To draw these curves, I simply kept the proportion of successes constant at 0.5001768.  The line is for the observed number of trials.  With this proportion, the Bayes factors not only depend on the prior, but they also depend in unintuitive ways on sample size.  For instance, if we doubled the number of trials and successes, the Bayes factors become 40-to-1 and 40,000-to-1 in favor of an ESP effect, respectively, for the flat prior and the very small interval one.
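These Bayes factors can be computed in a few lines.  The sketch below is one way to do it, not necessarily how the figure was made: with a uniform prior on an interval, the marginal probability of the data is the Beta(y+1, N-y+1) mass on that interval divided by (N+1) times the interval width.  The function name `bf_alt_vs_null` is my own.

```r
# Bayes factor for theta ~ Unif(lo, hi) against the point null theta = .5
bf_alt_vs_null <- function(y, N, lo, hi) {
  # integral of the binomial likelihood over (lo, hi) equals
  # P(lo < Beta(y + 1, N - y + 1) < hi) / (N + 1)
  mass <- pbeta(hi, y + 1, N - y + 1) - pbeta(lo, y + 1, N - y + 1)
  log_alt  <- log(mass) - log(N + 1) - log(hi - lo)
  log_null <- dbinom(y, N, 0.5, log = TRUE)
  exp(log_alt - log_null)
}

N <- 104490000
y <- 52263471

null_vs_flat   <- 1 / bf_alt_vs_null(y, N, 0, 1)      # flat prior: favors the null
narrow_vs_null <- bf_alt_vs_null(y, N, 0.5, 0.501)    # narrow prior: favors ESP

# Doubling both trials and successes keeps the proportion fixed
flat2   <- bf_alt_vs_null(2 * y, 2 * N, 0, 1)
narrow2 <- bf_alt_vs_null(2 * y, 2 * N, 0.5, 0.501)

print(round(c(null_vs_flat = null_vs_flat, narrow_vs_null = narrow_vs_null), 1))
print(signif(c(flat_doubled = flat2, narrow_doubled = narrow2), 3))
```

The first pair comes out near 12 and 84; the doubled-data pair comes out near 40 and 40,000, both in the direction of the alternative.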

Oh, I can see the anti-Bayes crowd getting ready to chime in should they read this.   Sanjay Srivastava may take the high road and discuss the apparent lack of practicality of the Bayes factor.  Uri Simonsohn may boldly declare that Bayesians can't find truth.   And perhaps Uli Schimmack will create a new index, the M-Index, where M stands for moron.  Based on his analysis of my advocacy, he may declare I have the second highest known M-Index, perhaps surpassed only by E.-J. Wagenmakers.

Seems like ESP was a bit of a Trojan Horse.  It looked all good, and then turned on us.

But What Happened?

Bayes' rule is ok of course.  The problem is us.  We tend to ask too much of statistics.   Before I get to my main points, I need to address one issue: what is the model?  Many will call the data model, the binomial specification in this case, "the model."  The other part, the priors on parameters, is not part of "the model"; it is the prior.  Yet, it is better to think of "the model" as the combination of the binomial and prior specification.  It's all one model, and this one model provides an a priori predictive distribution about where the data should fall (see my last blog post).  The binomial is a conditional specification, and the prior completes the model.

With this in mind, the above figure strikes me as quite reasonable.  Consider the red line, the one that compares the null to the model where the underlying probability ranges across the full interval.  Take the point for 10,000 trials.   The number of successes is 5,002, which is almost 1/2 of all trials.  Not surprisingly,  this value is evidence for the null compared to this diffuse alternative.  But the same value is not evidence for the null compared to the more constrained alternative model where \(.5<\theta<.501\).  Both the null and this alternative are about the same for 10,000 trials, and each predicts 5,002 successes out of 10,000 trials equally well.  Hence,  the Bayes factor is equivocal.  This alternative and the null are so similar that it takes way more data to discriminate among them.   As we gain more and more data, say 100,000,000 trials, the  slight discrepancy from 1/2 can be resolved, and the Bayes factors start to favor the alternative models.  As the sample size is increased further, the discrepancy becomes more pronounced.  Everything in that figure makes beautiful sense to me--- it all is as it should be.  Bayes' rule is ok.
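To follow this reasoning numerically, the sketch below traces both Bayes factors over a grid of sample sizes with the proportion held at 0.5001768.  The grid of powers of ten and the helper name `bf_alt_vs_null` are my choices for illustration; the calculation uses the closed form for a uniform prior on an interval rather than whatever produced the original curves.

```r
# Bayes factor for theta ~ Unif(lo, hi) against the point null theta = .5
bf_alt_vs_null <- function(y, N, lo, hi) {
  # Beta(y + 1, N - y + 1) mass on (lo, hi), divided by (N + 1) and the width
  mass <- pbeta(hi, y + 1, N - y + 1) - pbeta(lo, y + 1, N - y + 1)
  exp(log(mass) - log(N + 1) - log(hi - lo) - dbinom(y, N, 0.5, log = TRUE))
}

prop <- 0.5001768
N <- 10^(4:8)              # illustrative grid of sample sizes
y <- round(prop * N)       # successes at a fixed proportion

tab <- data.frame(
  N = N,
  y = y,
  null_vs_flat   = 1 / mapply(bf_alt_vs_null, y, N, MoreArgs = list(lo = 0, hi = 1)),
  narrow_vs_null = mapply(bf_alt_vs_null, y, N, MoreArgs = list(lo = 0.5, hi = 0.501))
)
print(signif(tab, 4))
```

At 10,000 trials the null handily beats the flat alternative, while the narrow alternative and the null are indistinguishable, with a Bayes factor very near 1.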

Having more and more data doesn't get us closer to the truth.  It does, however, give us greater resolution to more finely discriminate among models.

Loose Ends  

The question, "is there an effect" strikes me as ill formed.   Yet, we answer the question affirmatively daily.  Sometimes, effects are obvious, and they hit you between the eyes.  How can that be if the question is not well formed?

I think when there are large effects, just about any diffuse alternative model will do.  As long as the alternative is diffuse, data with large effects easily discriminate this diffuse alternative from the null.  It is in this sense that effects are obviously large.

What this example shows is that if one tries to resolve small effects with large sample sizes, there is intellectual tension.  Models matter.  Models are all that matter.  Large data gives you greater resolution to discriminate among similar models.  And perhaps little else.

The Irony Is...

This ESP example is ironic.  The data are so numerous that they are capable of finely discriminating among just about any set of models we wish, even the difference between a point null and a uniform alternative subtending .001 in width on the probability scale.  The irony is that we have no bona-fide competing models to discriminate.  ESP by definition seemingly precludes any scientific explanation, and without such explanation, all alternatives to the null are a bit contrived.  So while we can discriminate among models, there really is only one plausible one, the null, and no need for discrimination at all.

If forced to do inference here (which means someone buys me a beer),  I would choose the full-range uniform as the alternative model and state the 12-to-1 ratio for the null.  ESP is such a strange proposition that why would values of \( \theta \) near .5 be any more a priori plausible than those away from it?