Monday, March 28, 2016

The Effect-Size Puzzler, The Answer

I wrote the Effect-Size Puzzler because it seemed to me that people have reduced the concept of effect size to a few formulas on a spreadsheet.  It is a useful concept that deserves a bit more thought.

The example I provided is the simplest case I can think of that is germane to experimental psychologists.  We ask 25 people to perform 50 trials in each of 2 conditions, and ask what the effect size of the condition effect is.  Think Stroop if you need a context.

The answer, by the way, is \(+\infty\).  I'll get to it.

The good news about effect sizes  

Effect sizes have revolutionized how we compare and understand experimental results.  Nobody knows whether a 3% change in error rate is big or small or comparable across experiments; everybody knows what an effect size of .3 means.  And our understanding is not merely associative or mnemonic; we can draw a picture like the one below and talk about overlap and difference.  It is this common meaning and portability that licenses a modern emphasis on estimation.  Sorry estimators, I think you are stuck with standardized effect sizes.

Below is a graph from Many Labs 3 that makes the point.  Here, the studies have vastly different designs and dependent measures.  Yet, they can all be characterized in unison with effect size.

The bad news about effect size

Even for the simplest experiment above, there is a lot of confusion.  Jake Westfall provides 5 different possibilities and claims that perhaps 4 of these 5 are reasonable at least under certain circumstances.  The following comments were provided on Twitter and Facebook: Daniel Lakens makes recommendations as to which one we should consider the preferred effect size measure.  Tal Yarkoni and Uli Schimmack wonder about the appropriateness of effect size in within-subject designs and prefer unstandardized effects (see Jan Vanhove's blog).  Rickard Carlsson prefers effect sizes in physical units where possible, say in milliseconds in my Effect Size Puzzler.   Sanjay Srivastava needs the goals and contexts first before weighing in.  If I got this wrong, please let me know.

From an experimental perspective, The Effect Size Puzzler is as simple as it gets.  Surely we can do better than to abandon the concept of standardized effect sizes or to be mired in arbitrary choices.

Modeling: the only way out

Psychologists often think of statistics as procedures, which, in my view, is the most direct path to statistical malpractice.  Instead, statistical reasoning follows from statistical models.  With a few guidelines and a model, standardized effect sizes are well defined and useful.  Showing off the power of model thinking rather than procedure thinking is why I came up with the puzzler.

Effect-size guidelines

#1:  Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population.

#2:  Measures of true effect and true amount of variability are only defined in statistical models.  They don't really exist except within the context of a model.  The model is important.  It needs to be stated.

#3: The true effect size should not be tied to the number of participants nor the number of trials per participant.  True effect sizes characterize a state of nature independent of our design.

The Puzzler Model

I generated the data to be realistic.  They had the right amount of skew and offset, and the tails fell like real RTs do.   Here is a graph of the generating model for the fastest and slowest individuals:

All data had a lower shift of .3s (see green arrow), because we typically trim these out as being too fast for a choice RT task.  The scale was influenced by both an overall participant effect and a condition effect, and the influence was multiplicative.  So faster participants had smaller effects; slower participants had bigger effects.  This pattern too is typical of RT data.   The best way to describe these data is in terms of percent-scale change.  The effect was to change the scale by 10.5%, and this amount was held constant across all people.  And because it was held constant, that is, there was no variability in the effect,  the standardized effect size in this case is infinitely large.
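
The description above can be sketched in a few lines. This is only an illustration, not the code that produced the puzzler data: it assumes that log(RT - .3) is normal, and the spread of participant scales (`alpha`), the noise level `sigma`, and the function name `simulate_puzzler` are all made up for the example. The constant effect of .1 on the log scale corresponds to the 10.5% change in scale, since exp(.1) is about 1.105.

```python
import numpy as np

def simulate_puzzler(n_people=25, n_trials=50, shift=0.3, effect=0.1,
                     sigma=0.4, seed=1):
    """Simulate shifted log-normal RTs (in seconds) for two conditions.

    Assumption: log(RT - shift) is normal. The post does not show the
    actual generating code, so every default here is illustrative.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical spread of participant log-scales (overall speed).
    alpha = rng.normal(-0.7, 0.2, size=n_people)
    x = np.array([0.0, 1.0])  # dummy code: condition 1 = 0, condition 2 = 1
    # The same additive effect on the log scale for every person, so the
    # effect is multiplicative in seconds and has no variability.
    mu = alpha[:, None] + effect * x[None, :]      # shape (people, conditions)
    y = rng.normal(mu[:, :, None], sigma, size=(n_people, 2, n_trials))
    return shift + np.exp(y)                        # RT in seconds

rt = simulate_puzzler()
print(rt.shape)        # (25, 2, 50)
print(np.exp(0.1))     # about 1.105, a 10.5% change in scale
```

Because the effect multiplies each person's scale, slower simulated participants show larger effects in seconds even though the effect on the log scale is identical for everyone.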

Now, let's go explore the data.  I am going to skip over all the exploratory stuff that would lead me to the following transform, Y = log(RT-.3), and just apply it.  Here is a view of the transformed generating model:

So, let's put plain-old vanilla normal models on Y.  First, let's take care of replicates.
\[ Y_{ijk} \sim \mbox{Normal} (\mu_{ij},\sigma^2)\]
where \(i\) indexes individuals, \(j=1,2\) indexes conditions, and \(k\) indexes replicates.  Now, let's model \(\mu_{ij}\).  A general formulation is
\[\mu_{ij} = \alpha_i+x_j\beta_i,\]
where \(x_j\) is a dummy code of 0 for Condition 1 and 1 for Condition 2.  The term \(\beta_i\) is the ith individual's effect.  We can model it as
\[\beta_i \sim \mbox{Normal}(\beta_0,\delta^2)\]
where \(\beta_0\) is the mean effect across people and \(\delta^2\) is the variation of the effect across people.

With this model, the true effect size is \[d_t = \frac{\beta_0}{\delta}.\] Here, by true, I just mean that it is a parameter rather than a sample statistic.  And that's it, and there is not much more to say in my opinion.   In my simulations the true value of each individual's effect was .1.  So the mean, \( \beta_0\), is .1 and the standard deviation, \(\delta\), is, well, zero.  Consequently, the true standardized effect size is \(d_t=+\infty\).   I can't justify any other standardized measure that captures the above principles.
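
A short worked step connects the value .1 here with the 10.5% change in scale. It assumes, for illustration only, that the shifted RT is log-normal, and it introduces \(\epsilon_{ijk}\) as a label for the within-cell noise:
\[ RT_{ijk} - .3 = \exp(Y_{ijk}) = \exp(\alpha_i)\,\exp(x_j\beta_i)\,\exp(\epsilon_{ijk}), \qquad \epsilon_{ijk} \sim \mbox{Normal}(0,\sigma^2). \]
With \(\beta_i = .1\) for every person, Condition 2 multiplies each person's scale by \(\exp(.1) \approx 1.105\), which is a 10.5% change.  Holding \(\beta_0 = .1\) fixed and letting the variability shrink gives
\[ d_t = \frac{\beta_0}{\delta} = \frac{.1}{\delta} \rightarrow +\infty \quad \mbox{as} \quad \delta \rightarrow 0^{+}. \]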


Could a good analyst have found this infinite value?  That is a fair question. The plot below shows individuals' effects, ordered from smallest to largest.  A key question is whether these are spread out more than expected from within-cell sample noise alone.  If these individual sample effects are more spread out, then there is evidence for true individual variation in \(\beta_i\).  If they stay as clustered as predicted by sample noise alone, then there is evidence that people's effects do not vary.  The solid line is the prediction with within-cell noise alone.   It is pretty darn good.  (The dashed line is the null that people have the same, zero-valued true effect).  I also computed a one-way random-effects F statistic to see if there is a common effect or many individual effects.  The result favored one effect, F(24,2450) = 1.03.  Seems like one effect.
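
One way to compute an F ratio of this kind is sketched here, reading it as the person-by-condition interaction mean square over the pooled within-cell mean square. That reading is an assumption, although it does reproduce the degrees of freedom (24, 2450). The data are freshly simulated with made-up settings rather than taken from the puzzler, so the value will not be 1.03, and `interaction_f` is a hypothetical helper name.

```python
import numpy as np

def interaction_f(y):
    """F ratio for the person-by-condition interaction.

    y has shape (people, 2, trials) and holds transformed RTs.
    """
    n_people, n_cond, n_trials = y.shape
    cell_means = y.mean(axis=2)
    d = cell_means[:, 1] - cell_means[:, 0]        # each person's sample effect
    # With two conditions, SS_interaction = (K / 2) * sum_i (d_i - mean(d))^2.
    ms_int = (n_trials / 2) * np.sum((d - d.mean()) ** 2) / (n_people - 1)
    # The pooled within-cell variance estimates sigma^2.
    df_err = n_people * n_cond * (n_trials - 1)
    ms_err = np.sum((y - cell_means[:, :, None]) ** 2) / df_err
    return ms_int / ms_err, (n_people - 1, df_err)

# Hypothetical data on the transformed scale with a constant true effect of .1.
rng = np.random.default_rng(2)
alpha = rng.normal(-0.7, 0.2, size=25)
mu = alpha[:, None] + 0.1 * np.array([0.0, 1.0])
y = rng.normal(mu[:, :, None], 0.4, size=(25, 2, 50))

f_stat, df = interaction_f(y)
print(df)       # (24, 2450)
print(f_stat)   # expected to be near 1 when effects do not vary across people
```

If effects truly varied across people, the numerator would pick up that extra variance and the ratio would drift above 1.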

These one-effect results should be heeded.  It is a structural element that I would not want to miss in any data set.   We should hold plausible the idea that the standardized effect size is exceedingly high as the variation across people seems very small if not zero.

To estimate effect sizes, we need a hierarchical model.  You can use Mplus, AMOS, lme4, WinBUGS, JAGS, or whatever you wish.  Because I am an old dog and don't learn new tricks easily, I will do what I always do and program these models from scratch.

I used the general model above in the Bayesian context.  The key specification is the prior on \( \delta^2\).   In the log-normal, the variance is a shape parameter, and it is somewhere around \(.4^2\).  Effects across people are usually about 1/5th of this, say \(.08^2\).  To capture variances in this range, I would use a  \(\delta^2 \sim \mbox{Inverse Gamma(.1,.01)} \) prior for general estimation.  This is a flexible prior tuned for the 10 to 100 millisecond range for variation in effects across people.  The following plot shows the resulting estimates of individual effects as a function of the sample effect values.

The noteworthy feature is the lack of variation in model estimates of individuals' effects!  This type of pattern, where variation in model estimates is attenuated compared to sample statistics, is called shrinkage, and it occurs because hierarchical models don't chase within-cell sample noise.  Here the shrinkage is nearly complete, leading again to the conclusion that there is no real variation across people, or an infinitely large standardized effect size.  For the record, the estimated effect size here is 5.24, which, in effect size units, is getting quite large!
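
A minimal Gibbs-sampler sketch can illustrate the shrinkage. It is not the program used for the analysis above and simplifies in several ways, all of which are assumptions: it works on each person's sample effect with a plug-in sampling variance instead of the full trial-level model, it puts a flat prior on \(\beta_0\), it reads Inverse Gamma(.1,.01) as (shape, scale), and it runs on freshly simulated data. The numbers will therefore not reproduce 5.24.

```python
import numpy as np

rng = np.random.default_rng(3)
n_people, n_trials, sigma = 25, 50, 0.4

# Hypothetical data: constant true effect of .1 on the transformed scale.
alpha = rng.normal(-0.7, 0.2, size=n_people)
mu = alpha[:, None] + 0.1 * np.array([0.0, 1.0])
y = rng.normal(mu[:, :, None], sigma, size=(n_people, 2, n_trials))

cell_means = y.mean(axis=2)
d = cell_means[:, 1] - cell_means[:, 0]            # sample effects
s2_within = np.sum((y - cell_means[:, :, None]) ** 2) / (n_people * 2 * (n_trials - 1))
s2 = 2 * s2_within / n_trials                      # plug-in sampling variance of each d_i

a, b = 0.1, 0.01                                   # Inverse Gamma(.1, .01) prior on delta^2
n_iter, burn = 5000, 1000
beta0, delta2 = d.mean(), 0.01                     # starting values
beta_draws = np.empty((n_iter, n_people))
es_draws = np.empty(n_iter)

for t in range(n_iter):
    # beta_i | rest: precision-weighted blend of d_i and beta0.
    prec = 1 / s2 + 1 / delta2
    mean = (d / s2 + beta0 / delta2) / prec
    beta = rng.normal(mean, np.sqrt(1 / prec))
    # beta0 | rest, under a flat prior (an assumption of this sketch).
    beta0 = rng.normal(beta.mean(), np.sqrt(delta2 / n_people))
    # delta^2 | rest: conjugate inverse-gamma update.
    shape = a + n_people / 2
    rate = b + 0.5 * np.sum((beta - beta0) ** 2)
    delta2 = 1 / rng.gamma(shape, 1 / rate)
    beta_draws[t] = beta
    es_draws[t] = beta0 / np.sqrt(delta2)

post_mean = beta_draws[burn:].mean(axis=0)
print(np.std(d), np.std(post_mean))   # spread of sample effects vs. model estimates
print(np.median(es_draws[burn:]))     # posterior median of beta0 / delta
```

The posterior means of the individual effects cluster much more tightly than the sample effects, which is the shrinkage pattern described above.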

The final step for me is comparing this variable effect model to a model with no variation, say \( \beta_i = \beta_0 \) for all people.  I would do this comparison with a Bayes factor.  But, I am out of energy and you are out of patience, so we will save it for another post.

Back To Jake Westfall

Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure.  He does this to stay true to Cohen's formulae.  I think it is a conceptual mistake.

I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people.  In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design.   Within-subject designs are superior, so why go backwards and mash the sources of variance together when you don't have to?  This advice strikes me as crazy.  To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d.  Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures.  Just don't forget the hierarchical shrinkage when you use them!


Jeromy Anglim said...

Interesting post:
You write that "Here, by true, I just mean that it is a parameter rather than a sample statistic."

Note there are many possible effect sizes that could be defined as parameters based on the parameters of a hierarchical model.

However, the parameter you describe is not aligned with my understanding of what is intended by Cohen's d.

Just to recap, I'd define the standard sample effect size in this case as follows: (1) aggregate the data so you have the average rt for each person in each condition, (2) the effect size is the difference in the means for the conditions divided by the standard deviation of data aggregated to the person-level within a condition (if one condition was clearly the control group, you might use that SD, otherwise take the pooled SD).

Of course, this is not design-independent. If you had more trials in each condition, you would measure the person's expected RT for that condition more reliably. As a result, the effect size would get larger on average. This is the basis of the corrections to effect sizes commonly performed in meta-analyses (e.g., Hunter and Schmidt approach).

That said, in a decent design where we have fairly good reliability (e.g., .80 or .90) then the difference between the expected effect size with imperfect reliability and the effect size with asymptotic reliability would be small. Furthermore, many studies in psychology have reliability in the .80 to .90 zone. So as a heuristic, sample cohen's d is quite useful and as long as you're comparing effect sizes from studies with similar reliabilities and not concerned with small differences, then all should be fine.

Ultimately, the value of cohen's d is that it uses the standard deviation in differences between people as a metric which makes it easier to compare effects across studies when we are less familiar with the metric.
Thus, if you want the effect size to be meaningfully comparable to effect sizes used elsewhere in the literature, it's best to treat within-subjects replicate data as if it was (a) aggregated up to the person-level, and (b) calculate cohen's d as you would as if it was a between subjects design.

If this was a hierarchical model, I imagine there are various models that could be defined, but ultimately the meaningful parameter that is consistent with what is intended by cohen's d would be the difference in means over a measure of between-person standard deviation in person-level averages.

I haven't checked this too much, but I think it could look something like the following:

y_ijk ~ N(mu_ij, sigma^2)

mu_ij = beta_0i + beta_1 x_j

beta_0i ~ N(mu_beta0, sigma^2_beta0)

d = beta_1 / sqrt(sigma^2 + sigma^2_beta0)

where x_j is 0 or 1 depending on condition, i is person, j is condition, and k is replicate

Jake Westfall said...

I'm not really sure where to even begin with my responses, so I guess I'll just take it line-by-line. My response got too long, so I've split it up into two comments.

"#1: Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population."
Well, okay, just as long as we all acknowledge that this is a definition handed down from Jeff Rouder, not really justified in any way, and never previously applied in any paper or discussion of effect size that I've personally seen. It's an interesting proposal to consider, but it would be nice to see some kind of argument for why we should accept it. Three immediate issues that I can see: (1) it excludes most derivatives of Cohen's d, as well as all "variance explained" measures such as R^2; (2) it implies that effect sizes are inherently inestimable in between-subject designs; (3) it is not clear how it can be generalized to studies with multiple random factors, where there is not a unitary notion of "the population," but rather multiple populations being sampled in the study (e.g., students and classrooms, participants and stimuli).

"#3: The true effect size should not be tied to the number of participants nor the number of trials per participant. True effect sizes characterize a state of nature independent of our design."
This depends a bit on what counts as "the design." It seems that for you "the design" refers to the sample sizes (number of subjects, number of replicates, etc.). Many other people would also consider "the design" to encompass things like the number and structure of conditions (e.g., is the experimental drug compared only to a passive placebo, or to both active and passive placebos, or in a 2x2 factorial structure; do subjects receive both drug and placebo or only one of these; etc.). My guess is that, in your view, these different designs all involve different data models and thus effect sizes are inherently incomparable between them. Correct me if this imputation is wrong. But if it's right, then we should note that this is another controversial aspect of your proposed effect size definition.

Jake Westfall said...

"I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people. In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design. Within-subject designs are superior, so why go backwards and mash the sources of variances together when you don't have to?"
Hold up. It almost sounds like you read me as advocating that we run our studies using between-subject designs and not within-subject designs. Just to be clear, I did not and would not say this. As for why we would choose to put all of the variance components in the denominator of the effect size even in a within-subjects design, I think I pretty much laid out my reasons why in my blog post, and as far as I can see you haven't actually responded to or even acknowledged them. The short answer is that doing this is the only way that we can in principle meaningfully compare effect sizes across different designs, such as between-subject and within-subject designs. This is one of the major motivating reasons for using standardized effect sizes in the first place. Your proposed effect size seems to preclude the possibility of any such comparisons as a matter of definition.

"Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure."

"He does this to stay true to Cohen's formulae."
Not exactly. I do this because of (I think) well-reasoned considerations that I laid out in my blog post, which I won't repeat here, but which you are free to read and respond to. It turns out that these considerations compel us to adopt Cohen's classical formula. But there is nothing sacred about doing things Cohen's way just because that's how he did them, and if these considerations had implied some other effect size definition, then I would be urging us to forget Cohen.

"To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d. Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures. "
Actually, Rouder's within-subject totally-appropriate standardized effect-size measure wasn't even included in my list of 5 possible effect sizes. It is similar but not equivalent to d_z.

Rickard Carlsson said...

Fascinating post! You said we should speak up if our points were not correctly characterised. I wasn't trying to make the point that we should interpret the ms directly; quite the opposite! I was advocating an approach where we calculate a score based on our research question and only after that calculate Cohen's d based on the variability in this score. Although I used a different terminology, it's conceptually very similar to what you did. In principle, with the same understanding of the research question and data, we could end up with very similar effect sizes, I think. But my approach is more subjective and qualitative in nature than yours. Again, really interesting post.

Jake: Wouldn't you agree that every experimenter sets up his/her experiment to control the variance in relation to the research question? Isn't this just one way to do that? That is, by comparing statistical models, Jeff reduces the variability thus leading to a larger effect size for this comparison. The way I see it he's answering a different question here than the Cohen's d does.

Jeff Rouder said...

Thank you all for the comments. I will slowly start to digest and respond. This response goes to Jeromy's and Jake's question of whether we should include the within-cell variance, sigma, in our effect-size principles. I think not. Consider the following thought experiment with a between-subjects comparison, say the height of men and women. One researcher carefully measures height with a doctor's office instrument; the other eyeballs it. The first gets a larger effect size measure than the second because the second has far more measurement error. Now, I think there is a true effect size in this case, and the first researcher's value, the bigger estimate, is closer. Moreover, if we knew the error from the second researcher's eyeballing procedure, wouldn't we wish to correct for it? Correcting for it in the within-subject case is parceling out variance into within-cell variation and across-cell variation and focusing solely on across-cell variation, or delta in my case.

Jake Westfall said...

Your thought experiment is about measurement error, but sigma does not only contain measurement error. (Indeed, whether it contains measurement error at all is a question of whether it makes sense in a given context to view the observed outcome as being a noisy indicator of an underlying "true" outcome variable -- which might make sense in the thought experiment, but probably doesn't for reaction times.) Even if we suppose, for argument's sake, that we would like to correct the height estimates for measurement error, this would not imply removing sigma entirely from the effect size denominator. At most it would imply using a smaller, corrected estimate of sigma that excluded measurement error but still contained many other sources of unexplained variation.

Jake Westfall said...

The thought experiment also does not imply that we should exclude participant mean variance from the effect size denominator.

Jeff Rouder said...

Well, as an experimentalist who runs these things all the time, I view sigma^2 as capturing measurement error. The error is from the person who is merely an instrument. I want to know their true scale or, in the normal case, log-scale, or beta_i in my text. Relative to beta_i, the RTs are perturbed by measurement error, the same type as our researcher who eyeballs height. They are exactly equivalent from my point of view as an experimentalist. To beat this error, I run each person through more trials, just as our researcher might ask additional people to eyeball height. So, sigma^2 is my measurement error, and God knows I don't want to talk about it much or let it affect my view of beta_i. I care greatly about beta_i and its variance in the population of course. This case is completely analogous to the between-subject case: if you had a less than perfect instrument and could characterize the error, what I do is exactly what you would do in a between-subject design. The problem is that you can't characterize what I am terming measurement error. But the analogy is just fine.

hardsci said...

Here’s another thought experiment. Suppose I design a study to test the effect of an antidepressant on positive mood. Two groups get a course of antidepressant or placebo. We do an experience-sampling assessment where each subject is queried via smartphone about their current mood 50 random times over 2 weeks, and we fit a multilevel model to the data.

By Rouder’s rationale, we’d ignore the within-person variability (sigma^2) and just retain the between-person variance (delta^2). But there’s a problem! Mood, by definition, is something that varies within persons. If we define “the effect size” as the ratio of the group difference to delta, we are not indexing an effect on mood. It is an effect on something else, maybe personality or temperament — not mood.

That’s why on Twitter I (only half-jokingly) said the correct answer is “it depends.” It depends what your question is; and the question drives both the model specification and the choice of what parameter from it you call “the effect size.” Because if I post my thought-experiment data to OSF, and then someone comes along and wants to know the effect of antidepressant on personality, they can run the same model I did but calculate a different effect size corresponding to their different question.

Taking it a step further though, Jeff is worried about measurement error. Fair enough: it’s reasonable to assume that within-person variance in mood assessments reflects both true within-person mood variability and measurement error. So does that mean that the effect size is undefined/unestimable in this design?

The answer is “it depends” because there is no such thing as “THE effect size.” There are different effect sizes because they are answers to different questions. The original-recipe Cohen’s d from Jake’s blog post is an answer to a perfectly reasonable question: “What is the relationship between operations?” (i.e., between assignment to levels of the manipulation of the IV and a single measurement of the DV.) There isn’t one answer to “what is THE effect size” but if we are going to pick a default, it’s a pretty good one because it’s both calculable and comparable across many different designs. Of course Jake gave some other answers, which apply when you have some other questions. Shrunken variances or latent variables open up even more possibilities, but that doesn’t make any of the other ones wrong — they’re just answering different questions.

I think Jake earned a shirt.

Matt Williams said...

Fascinating stuff. I wrote a longer reply but it didn't seem to work... Anyhow:

"Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population"

I guess 'effect size' is a somewhat ambiguous term that can be defined in several different ways, but this is an odd definition that I haven't seen before.

Conceptually the definition does seem a bit problematic: I'm happy with the idea of acknowledging the idea that effects vary (remembering that most statistical analyses used in psychology assume constant effects within conditions). However, if we are going to acknowledge that effects vary, surely we should treat the average size of an effect and its variability as two very different concepts? Why should we regard an effect that is more variable as being smaller?

Furthermore, don't you feel like defining effect size in such a way that it is undefined* whenever the effect doesn't vary might be a bit problematic?

*Undefined, not infinite, surely?

Jeff Rouder said...

I am most glad that we can all agree effect size is a fleeting, difficult concept. And that was my goal in making the puzzler. I actually feel a bit put off by the emphasis on effect size and especially power. I know my expts have high resolution with say 20 people because I collect some 600 trials per person divided in only a few conditions. I look at only one DV, and I am convinced the usual power and effect size calculations are understating what we can learn from these experiments. I am most perturbed by the notion that there are procedures that yield correct answers. To me, there are only models.

The essence of the argument is whether we should measure the mean effect relative to delta^2, the between-subject variance in the population, or relative to sigma^2+delta^2, the composite within- and between-subject variation. I gather if you are going to consider mean effect then delta is the appropriate measure. If you are not happy with the concept of mean effect across all people, then of course effect size is not the measure for you. Sanjay, your example is not about variance, but about the interpretability of the mean mood. Perhaps mean mood does not capture the scientific phenomenon of interest. But if it does, then I think delta^2 is the variance of interest.

I'll send Jake a Mizzou shirt; that seems fair. Email me your postal address.

Jeff Rouder said...

Matt, I think I am using it the way Cohen intended it. Effects might have a metric, but the ratio of the mean to the standard deviation (the reciprocal of the coefficient of variation) forms the basis for Cohen's d. Also, it is important to differentiate infinite from undefined. Infinite is a perfectly good limit and we can define say 3 divided by infinity as zero. In fact infinity divided by any finite number is infinite; and any finite number divided by infinity is zero. Undefined is when you divide infinity by infinity or zero by zero.

Jake Westfall said...

Thanks for the shirt, Jeff. If it's all the same to you, I'd prefer if you donated it to a local homeless shelter instead. It looks like there's one fairly close to the Mizzou campus:

I'm right there with you in being put off by emphasis on standardized effect size. But I'm not sure I follow your "and especially power" remark. Do you think it's a bad thing that there is increased emphasis these days on designing well-powered experiments? I could understand someone like you not being a fan of the classical technical definition of power, since it is all based around NHST. But I get the feeling that in most non-technical discussions of power, people (including me) really mean "power" in a looser sense of "ability of an experiment to yield precise and informative results," whether the analysis is based on NHST or Bayes or whatever. And surely this looser notion of power is a good thing from anyone's perspective.

You also say that you are convinced that the "usual power calculations" understate what can be learned in an experiment like the hypothetical one you described, but again, I'm not sure what you mean by this. In one sense, it's obviously true, since power analysis is not supposed to tell us about "everything that can be learned" from an experiment, but only about whether we are likely to reject H0 if a particular H1 is true. Whether such a result would be considered compelling or decisive evidence for or against different theoretical perspectives is a whole other layer of considerations. If this sort of thing is what you mean, then I agree. But if you mean something else, then I guess you'll need to elaborate. In particular, if you think there are alternative procedures -- "unusual power calculations" -- that would be more informative, I'd like to hear what you think those are.

By the way, just to be clear, my preferred effect size does not use (square root of) sigma^2+delta^2 in the denominator, it uses sigma^2 + delta^2 + variance of the participant means. In other words, all the variance components go in the denominator. (Not that I think this would change the basic argument much, but just for the sake of being properly understood.)
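
The denominators discussed in this thread can be put side by side in a small sketch. The variance components below are hypothetical values on the log scale, and the function and key names are made up for illustration. The three entries follow the formulas as stated in the post and comments: the variability of the effect alone, the within-cell variance plus the variance of participant means, and all components summed.

```python
import math

def standardized_effects(beta0, sigma2, delta2, tau2):
    """Mean effect divided by three candidate standard deviations.

    sigma2: within-cell (trial) variance
    delta2: variance of the effect across people
    tau2:   variance of participant means
    Assumes beta0 > 0, so a zero delta2 gives +infinity.
    """
    return {
        "effect_variability_only": beta0 / math.sqrt(delta2) if delta2 > 0 else math.inf,
        "within_cell_plus_participant": beta0 / math.sqrt(sigma2 + tau2),
        "all_components": beta0 / math.sqrt(sigma2 + delta2 + tau2),
    }

# Hypothetical variance components, not estimates from the puzzler data.
results = standardized_effects(beta0=0.1, sigma2=0.4**2, delta2=0.0, tau2=0.2**2)
for name, value in results.items():
    print(name, round(value, 3))
```

With no variability in the effect, the first measure is infinite while the other two stay modest, which is the disagreement in a nutshell.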

Jeff Rouder said...


As a within-subject experimentalist who uses several hundred trials per person per condition, I am by-and-large unconcerned with the number of participants I run. We stop when we get a bit bored. That shocks people, but I think it is appropriate for the phenomena we study. I insist on many more trials per condition than just about anyone else I know. I use very few factors and often few levels. We often target a single, critical contrast per experiment. Now, I detest having to figure out power in these cases. I know even on a per-subject basis I have a well powered experiment. Simple computations will do. If RT has a standard deviation of 300 ms (reasonable estimate), 400 trials per condition gives us a resolution of 15 ms per person per condition. Pool a bit, and we are down below 10 ms. So, 40 ms effects are at least 2-times my effective resolution per person. Why isn't 10 people complete overkill to show effects? Yet, by the usual definitions my effect sizes are much smaller and my power much lower. But I know my experiments are overkill. See the problem; between-subject effect-size power and logic really doesn't work in a within-subject setup unless subject-by-treatment interactions are large. They hardly ever are in cognition and perception because dominance almost always holds, that is, everyone names congruent colors faster in Stroop than incongruent colors. Nobody is Stroop pathological, that is, nobody names incongruent colors faster.
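
The resolution figure is the standard error of a cell mean, 300 / sqrt(400) = 15 ms. The sketch below reproduces it and adds two follow-on quantities that are not in the comment: the standard error of one person's difference between conditions, and of the mean difference across 10 people. Both assume independent trials and no variation in the effect across people, and the function name is hypothetical.

```python
import math

def cell_mean_se(sd_ms, n_trials):
    """Standard error of one person's condition mean, in ms."""
    return sd_ms / math.sqrt(n_trials)

se_cell = cell_mean_se(300, 400)
# One person's difference between two condition means (independent trials assumed).
se_diff = math.sqrt(2) * se_cell
# Mean difference across 10 people, assuming the effect does not vary across people.
se_group = se_diff / math.sqrt(10)

print(se_cell)             # 15.0
print(round(se_diff, 1))   # 21.2
print(round(se_group, 1))  # 6.7
```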

Jeff Rouder said...

Of course, for between-subject designs, more power is helpful in general. Of course, once the data are collected, power no longer becomes a concern. The data are what they are.

Jake Westfall said...

Jeff, that's all fine with me, but one final comment. You say "by the usual definitions my effect sizes are much smaller and my power much lower." But this is just wrong.

Depending on which definition of standardized effect size we're talking about, it might be true that the effect size works out to be not that big. But even if so, it's of course not (necessarily) true that power would be estimated as low. The "usual definition" of power depends not only on the effect size and number of subjects, but also on the design of the study, the number of replicates, and the variance components (and if the data are unbalanced, then other things too). Any reasonable power analysis of a study on the Stroop effect involving 10 people who complete 600 trials each is going to estimate that power is very high. Your example "puzzler" dataset is a case in point: it involved 25 people and Cohen's d = 0.25, but the post-hoc power (i.e., setting all parameters equal to the parameter estimates) is virtually 100% (the t-statistic was something like 6.5).

You seem to suggest that people might do the power analysis as if it were a between-subject study where everyone is measured once, but that would be just plain wrong on a technical level, and hopefully no one would actually do that.

Dean Eckles said...

"#1: Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population."
This is definitely inconsistent with all widely-used definitions and measures of effect size.

This is a statement about the degree of treatment effect heterogeneity, and has basically no conceptual relationship to effect size as normally defined. In fact, the only way it seems to agree is when ATE=0 [and either (a) the variance of the effect is non-zero or (b) you define 0/0 = 0].

This is also what Jake Westfall is saying, but I wanted to add another voice here in case any readers think his response is idiosyncratic — it is rather the #1 in the post by Professor Rouder that is idiosyncratic.

Jeff Rouder said...

Thanks for commenting, Dean. And please feel free to call me Jeff. I am not sure where you are writing from, and different places have different norms; at my institution, it is a sign of respect among professionals to be on a first-name basis.

I haven't reread all the comments and it has been a while. Almost all of my colleagues faced with a similar design would tabulate mean RT across people and condition as the primary outcome. That is, they would enter two cell means per person into the analysis. In fact, they would take the difference, and report an effect size based on the one-sample or paired t-test. That, in fact, is nearly ubiquitous; done say in 99% of the cases. And it is not just cognitive psychologists who take the mean and use it as raw data in massively repeated experimental designs, it is the developmental and social and other experimental psychologists. All the time. Swear. And they do so for an obvious reason. If the design is balanced, you get the same p-value whether you treat the design as a randomized block and consider all observations or just enter cell means and run a t-test. So why not do the easy thing, the thing that SPSS can handle!

So, for me, it is important when treating the data hierarchically to have effect size measures that match up with my colleagues'. When they take cell means, and let's assume they have a lot of data so they approximate each individual's true cell mean well, the resulting effect size is just what I claimed. So, far from being idiosyncratic, my definition matches what is being estimated by the vast majority of experimental psychologists when they run massively repeated designs. Of course, they do not get any shrinkage because they entered cell means, but that is just an estimation issue, not an issue of what defines the true effect size.

We should have a concept of a true effect size or nature's effect size for a problem. For me, this is easy. I don't want it to depend on the # of trials per person or the number of people. So, the only natural answer is the one I provide.

It may well be field differences---I get the feeling many of you are not used to working with 500 trials per person per condition. You really get to beat within-person, within-trial variation.

Jake Westfall said...


I didn't expect to be returning to this thread, but here we are I guess. A couple things:

1. "Almost all of my colleagues [...] would take the difference, and report an effect size based on the one-sample or paired t-test. That, in fact, is nearly ubiquitous; done say in 99% of the cases. And it is not just cognitive psychologists who take the mean and use it as raw data in massively repeated experimental designs, it is the developmental and social and other experimental psychologists. All the time. Swear."

Right, I don't dispute that it's very common to compute "an effect size based on the one-sample t-test" -- specifically, to use d_z, where the denominator is the standard deviation of the difference scores.

Will you acknowledge that d_z -- the denominator of which is sqrt(delta^2 + sigma^2/m), where m is the number of trials per person -- is NOT equal to the effect size you defined in your post, which just contains delta in the denominator?

Will you further acknowledge that what people would do is compute d_z, NOT compute an effect size using just the estimate of delta in the denominator? If you disagree, can you point to even a single published paper where the researchers actually compute the effect size that you define?

Now your point seems to be that as the number of trials per person grows, the denominator of d_z approaches delta, and thus your effect size can be seen as an asymptotic limit of d_z. Yes, I see that. So this is an interesting motivation for why one might want to look at the effect size that you define. Great. But that is not what Dean and I are objecting to. We are objecting to your claim that "Rouder's within-subject totally-appropriate standardized effect size" -- featuring only delta in the denominator -- is something that psychologists in the real world have actually computed, used, or really even discussed, ever. Again, if you think these objections are wrong, please point to even a single published paper that features your effect size.

2. "For me, this is easy. I don't want [the true effect size] to depend on the # of trials per person or the number of people. So, the only natural answer is the one I provide."

Well, no. Any d-like effect size that uses a function of only the estimated variance components in the denominator, and not any of the sample sizes, will have this property. Including... the classical Cohen's d!
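
The asymptotic argument in point 1 can be illustrated numerically. The sketch below uses the d_z denominator exactly as written above, sqrt(delta^2 + sigma^2/m), and shows d_z climbing toward the delta-only effect size as the number of trials m grows. The numeric values for the mean effect, delta, and sigma are hypothetical, chosen only for illustration, and the function name is made up.

```python
import math

def d_z(mean_effect, delta, sigma, m):
    """d_z with the denominator quoted in the comment above:
    sqrt(delta^2 + sigma^2 / m), where m is trials per person."""
    return mean_effect / math.sqrt(delta**2 + sigma**2 / m)

# Hypothetical values in milliseconds (assumptions, not estimates from data).
MEAN_EFFECT = 40.0   # average true effect across people
DELTA = 15.0         # SD of true effects across people
SIGMA = 300.0        # trial-to-trial SD

# Delta-only effect size: the limit of d_z as m goes to infinity.
limit = MEAN_EFFECT / DELTA

trial_counts = [10, 50, 400, 10_000]
d_z_values = [d_z(MEAN_EFFECT, DELTA, SIGMA, m) for m in trial_counts]

for m, d in zip(trial_counts, d_z_values):
    print(f"m = {m:>6}: d_z = {d:.3f}")
print(f"limit (mean effect / delta) = {limit:.3f}")
```

With these made-up numbers, d_z depends heavily on m at realistic trial counts, which is why the two quantities should not be treated as interchangeable in practice.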

Jeff Rouder said...

Hi Jake, Sure. Sort of. Certainly my denominator just has delta. Not sigma^2. That is the defining feature!

Let's find common ground. Tell me where I am wrong,

1. Most ppl do not even think about how or what to compute when they compute an effect size in these designs. They aggregate to participant-by-condition means, run the t, and move on. The data are too simple and this is too quick to be modeled. Remember, throughout most of the laboratory experimental lit, simple models, ANOVA in particular, are ubiquitous.

2. I don't know of anyone in the Stroop or related lit, or in massively repeated designs, who has computed effect size the way I am recommending, with shrinkage and with separating out delta from sigma.

3. I don't know of anyone who has done it your way in massively repeated designs in Stroop, Simon, even priming. Again, the key here is massively repeated, say 50 or 100 trials per condition, not 3. Could be me, but hierarchical linear modeling in a single task like I highlight is pretty damn rare and not a single instance comes to mind. Help out if you wish.

4. None of this really matters to me. The question that really matters is, "if you were going to talk about a true effect size as a thing to a cognitive psychologist, what would they want to know?" I contend that the vast majority would want to know my definition. That is what they are aspiring to. They want to know how large true effects are across people. They would want to know the average of these true effects and how much these true effects varied. It is a good thing to estimate, no?

5. I guess you have two choices for the denominator: one is sqrt(delta^2 + sigma^2), the other is delta. I just can't see the usefulness of the former because we run massively repeated designs to deflate the influence of sigma. So we are "estimating" something that is sigma free. That is the goal, which is why delta makes so much more sense to me. I just can't see how this is not obvious. I think if you surveyed dyed-in-the-wool experimentalists about what they think the true effect size is, they would point to this level without hesitation. They simply don't use it that much and hardly think in terms of hierarchical models this way.

Jake Westfall said...


1. I agree. And honestly, that's fine with me. If standardized effect sizes are useful at all, in my opinion it is only in the planning phase of a study. That is, they can be useful for doing a rough power analysis or as something to define a prior in terms of. Once the data are in, I think we're better off interpreting the effects on the scales in which the variables were actually measured.

3. I think there may be a misunderstanding about what "my way" is. My position is that, if you must compute a standardized effect size for a categorical predictor, classical Cohen's d is probably the best bet in most cases. Computing Cohen's d does not require a hierarchical linear model. Not sure where that idea comes from. As for examples of people who have used classical Cohen's d (as opposed to, say, d_z) in "massively repeated" designs, I went to Google Scholar and searched for "priming meta-analysis" and looked at the top 3 results. This one used classical Cohen's d (actually an adjusted version that corrects for the slight positive bias). This one used d_z. And this one apparently used a range of inconsistent effect sizes, including some incorrect conversion formulae (i.e., Eqs. 3 and 4), but they at least claim to have computed Cohen's d (Eq. 1).

Jake Westfall said...

4. I don't know. Maybe. Your effect size is interesting and has a sensible interpretation, I agree with that. I suspect that most cognitive psychologists don't really know what they want to estimate. If/when they do compute a standardized effect size, I think the primary goal is just to get a number that they can compare to other studies. So one of the primary motivations is likely just consistency with what others have done. (Unfortunately, this consistency can be specious, since you and I know that effect size definitions that are sensitive to design features like the sample sizes and number of repetitions are not generally comparable across different designs.)

5. To be clear, sqrt(delta^2 + sigma^2) is not what I advocate -- this does not correspond to Cohen's d because it is missing the subject mean variance. The denominator I would advocate, to the extent that I would advocate at all, would be, in your notation, sqrt(var(alpha) + delta^2 + sigma^2). In other words, all-sources-summed-and-mashed-together. As for the usefulness of this effect size compared to yours, there are a few nice things. First, it can be computed for any design with a categorical predictor. Yours requires that the predictor of interest varies within-subjects. Second, it yields effect sizes that are on a comparable scale for any design. This is particularly useful for study planning (power analysis or choosing a prior), because we don't have to adjust the expected effect size up or down based on incidental design features like the number of repetitions. Yours has this property too, at least for the designs where it can be computed. Third, it can be easily extended to designs with multiple random factors, including hierarchical 3-level designs and designs with crossed random factors like participants and stimuli. All you have to do is include the additional variance components in the denominator. It is not clear to me how your effect size would handle designs like this. You could of course just ignore the new variance components and just keep computing your effect size the same way in those cases, but then such an effect size would be a lot less relevant if we knew that the effect also varied randomly across stimuli, labs, and so on. Power analyses, for example, would need to incorporate that information.
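
To make the three candidate denominators in this exchange concrete, here is a minimal sketch that computes each one from the same set of variance components. All numbers are hypothetical, the helper name is made up, and extending to more random factors is assumed to mean simply appending their variances to the list, as described above.

```python
import math

def standardized_effect(mean_effect, variance_components):
    """Mean effect divided by the square root of the summed variances."""
    return mean_effect / math.sqrt(sum(variance_components))

# Hypothetical values (ms for the mean effect, ms^2 for the variances).
MEAN_EFFECT = 40.0
VAR_ALPHA = 100.0**2   # var(alpha): variance of subject means
VAR_DELTA = 15.0**2    # delta^2: variance of true effects across people
VAR_SIGMA = 300.0**2   # sigma^2: trial-to-trial variance

# Delta-only denominator (the effect size defined in the post).
d_delta_only = standardized_effect(MEAN_EFFECT, [VAR_DELTA])

# sqrt(delta^2 + sigma^2): omits the subject mean variance.
d_delta_sigma = standardized_effect(MEAN_EFFECT, [VAR_DELTA, VAR_SIGMA])

# sqrt(var(alpha) + delta^2 + sigma^2): all sources summed.
d_all_sources = standardized_effect(
    MEAN_EFFECT, [VAR_ALPHA, VAR_DELTA, VAR_SIGMA]
)

print(f"delta only:           {d_delta_only:.3f}")
print(f"delta and sigma:      {d_delta_sigma:.3f}")
print(f"all sources summed:   {d_all_sources:.3f}")
```

None of the three depends on the number of people or trials, since only variance components enter the denominator; they differ only in which sources of variability they count.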

Donald Williams said...

It is common knowledge that in a multilevel framework one computes effect sizes by dividing the estimate by the variance components and residuals summed. Since the variance is partitioned, this ensures we do not get inflated effect sizes, such as the one over 5 in your model. This is well documented and goes by delta subscript t (delta_t). However, some may suggest dividing by only the residuals. This is controversial since it leads to implausible effect size estimates.