I provide common sense justification for sample size (un)planning for within-subject designs in cognition and perception. You can cite my rule of thumb, below, and this blog in your grant proposals. You can even ding people who don't cite it. I'll PsyArxiv this post for your citational convenience soon, after you tell me where all the typos are.

**Rouder's Rule of Thumb**: If you run within-subject designs in cognition and perception, you can often get high powered experiments with 20 to 30 people so long as they run about 100 trials per condition.

**Setup.** You have \(N\) people observing \(M\) trials per condition. How big should \(N\) and \(M\) be?

**People.** Each person has a true effect, \(\mu_i\). This true effect would be known exactly if we had gazillions of trials per condition, that is, as \(M\rightarrow\infty\). We don't have gazillions, so we don't know it exactly. But let's play for the moment like we do. We will fix the situation subsequently.

**Crux Move.** The crux move here is a stipulation that \(\mu_i \geq 0\) for all people. What does this mean? It means that true scores all have the same sign. For example, in the Stroop effect, it implies that nobody truly responds more quickly to incongruent than congruent colors. In perception, it implies that nobody responds quicker to faint tones than loud ones. Let's call this stipulation *The Stipulation*.

**Should You Make *The Stipulation*?** That depends on your experiment and paradigm. It is quite reasonable in most cognition and perception paradigms. After all, it is implausible that some people truly reverse Stroop or truly identify faint tones more quickly than loud ones. In fact, in almost all cases I routinely deal with, in attention, memory, perception and cognition, this is a reasonable stipulation. You may not be able to make or justify this stipulation. For example, personal preferences may violate *the stipulation*. So, whether you make or reject *the stipulation* will affect your (un)planning. Overall, however, *the stipulation* should be attractive for most cases in cognition and perception. If you make *the stipulation*, read on. If not, the following (un)planning is not for you.

**Individual Variation is Limited with The Stipulation**: If you stipulate, then the gain is that the variability across people is limited. When all effects are positive, and the mean is a certain size, then the variability cannot be too big, or else the mean would be bigger. This is the key to sample size (un)planning. For the sake of argument, let's assume that true effects are distributed as below, either as the blue or green line. I used a Gamma distribution with a shape of 2. Not only is the distribution all positive, but the shape value of 2 is reasonable for a distribution of true effects. And as a bonus, the shape of 2 gives the right tail a normal-like fall off. Two curves are plotted, the blue one for a small effect; the green one for a large effect.

The limited-variation proposition is now on full display. The blue curve with the smaller effect also has smaller variance. The effect size, the ratio of the mean to the standard deviation, is the same for both curves! It is \(\sqrt{2}\), about 1.4, or the root of the shape.
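The ratio claimed above can be checked in a few lines of R. This is only a minimal sketch: the function name `gamma_effect_size` and the two scale values (20 and 40, giving mean effects of 40 and 80) are illustrative assumptions, not the values behind the figure.

```r
# Effect size (mean / sd) of a Gamma distribution of true effects.
# For a Gamma: mean = shape * scale and sd = sqrt(shape) * scale,
# so the scale cancels and only the shape matters.
gamma_effect_size <- function(shape, scale) {
  mean_effect <- shape * scale
  sd_effect <- sqrt(shape) * scale
  mean_effect / sd_effect
}

# Hypothetical scales standing in for a "small effect" and a "large effect" curve.
small <- gamma_effect_size(shape = 2, scale = 20)
large <- gamma_effect_size(shape = 2, scale = 40)
print(c(small = small, large = large))  # both equal sqrt(2)
```

Changing `scale` moves the mean and the standard deviation together, which is why the two curves share one effect size.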

Am I claiming that if you had a gazillion trials per condition, all experiments would have an effect size of about 1.4? Well yes, more or less to first order. Once you make *the stipulation*, it is natural to use a scale family distribution like the gamma. In this family the shape sets the effect size (the effect size is the root of the shape), and reasonable shapes yield effect sizes between about 1 and 2. *The stipulation* is really consequential as it stabilizes the true effect sizes! This licenses unplanning.

**Power and (Un)Planning:** Experiments capable of detecting effect sizes of, say, 1.4, the size in the figure, do not require many subjects. Ten is more than enough. For example, at a .05 level, \(N=10\) yields a power of .98 [R code: `1-pt(qt(.975,9),9,ncp=sqrt(2)*sqrt(10))`]. This ease of powering designs also holds for more realistic cases without a gazillion trials.
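The R one-liner in the post can be wrapped in a small function to try other values of \(N\). This is a sketch under the same assumptions as that one-liner (a t-test on per-person effects, counting only the upper rejection region); the name `power_within` and its argument names are my own.

```r
# Power of a t-test on per-person effects, generalizing the post's one-liner:
# 1 - pt(qt(.975, 9), 9, ncp = sqrt(2) * sqrt(10))
# d:     effect size (mean / sd of per-person effects)
# n:     number of people
# alpha: two-sided test level
power_within <- function(d, n, alpha = .05) {
  df <- n - 1
  crit <- qt(1 - alpha / 2, df)          # critical value of the central t
  1 - pt(crit, df, ncp = d * sqrt(n))    # upper-tail mass of the noncentral t
}

p10 <- power_within(sqrt(2), 10)  # the post's N = 10 case
print(p10)

# Power for a few other sample sizes at the same effect size.
print(sapply(c(5, 10, 20), function(n) power_within(sqrt(2), n)))
```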

**Finite Trials:** Nobody wants to run a gazillion trials. Let's slim \(M\) down. Let's take two cases, one for RT and another for accuracy:

For RT, we are searching for a 40 ms effect, and the residual variation is some 300 ms. This 300 ms value is a good estimate for tasks that take about 750 ms to complete, which is typical for many paradigms. The variability of a condition mean over \(M\) trials is \(300/\sqrt{M}\), and if we take a contrast, we need an additional factor of \(\sqrt{2}\) for the subtraction. If \(M=100\), then we expect variability of about 42 ms. Combining this with the variability across participants from the above Gamma distribution (40/\(\sqrt{2}\), about 28 ms) yields a total variability of about 51 ms, or an effect size of 40/51 = .78. Now, it doesn't take that many participants to power up this effect size value. \(N=20\) corresponds to a power of .91. We can explore fewer trials per condition too. If \(M=50\), then the effective effect size is .60, and the power at \(N=25\) is .82, which is quite acceptable.
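The RT arithmetic above can be reproduced as follows. Treat this as a sketch: the function names `effective_d` and `power_within` are hypothetical, and it assumes trial noise and person-to-person variation add in variance, with the person-to-person standard deviation taken from a Gamma of shape 2 (mean divided by \(\sqrt{2}\)).

```r
# Effective effect size once each person's effect is measured with trial noise.
# effect:   mean true effect (ms); trial_sd: residual trial-level sd (ms)
# m:        trials per condition;  shape:    Gamma shape for true effects
effective_d <- function(effect, trial_sd, m, shape = 2) {
  within_sd <- sqrt(2) * trial_sd / sqrt(m)  # noise in a difference of two condition means
  between_sd <- effect / sqrt(shape)         # Gamma: sd = mean / sqrt(shape)
  effect / sqrt(within_sd^2 + between_sd^2)
}

# Same power calculation as the post's one-liner, as a function.
power_within <- function(d, n, alpha = .05) {
  df <- n - 1
  1 - pt(qt(1 - alpha / 2, df), df, ncp = d * sqrt(n))
}

d100 <- effective_d(effect = 40, trial_sd = 300, m = 100)  # about .78
d50 <- effective_d(effect = 40, trial_sd = 300, m = 50)    # about .60
print(c(d100 = d100, power_n20 = power_within(d100, 20)))
print(c(d50 = d50, power_n25 = power_within(d50, 25)))
```

Raising `shape` to 3 or 4 shrinks `between_sd` and so raises the effective effect size for the same number of trials.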

For accuracy, the calculations are as follows: Suppose we are trying to detect the difference between .725 and .775, or a .05 difference in the middle of a two-alternative forced-choice range. The standard deviation for observed proportions for \(M\) trials is \(\sqrt{p(1-p)/M}\). For 100 trials, it is .043, and if we throw in the factor of \(\sqrt{2}\) for the contrast, it is .061. Combining this with the variability across participants from the above Gamma distribution yields a total variability of .070, or an effective effect size of .71. \(N=25\) corresponds to a power of .925. Even for \(M=50\), the power remains quite high at \(N=30\), and is .80.
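The accuracy case follows the same recipe. In this sketch the function names are hypothetical, and I assume the binomial standard deviation is evaluated at the midpoint of the two proportions (.75), which reproduces the .043 quoted above.

```r
# Effective effect size for a difference between two accuracy conditions.
# p1, p2: true proportions correct; m: trials per condition
accuracy_d <- function(p1, p2, m, shape = 2) {
  effect <- p2 - p1
  p_mid <- (p1 + p2) / 2                                 # assumption: sd taken at the midpoint
  within_sd <- sqrt(2) * sqrt(p_mid * (1 - p_mid) / m)   # contrast of two observed proportions
  between_sd <- effect / sqrt(shape)                     # Gamma: sd = mean / sqrt(shape)
  effect / sqrt(within_sd^2 + between_sd^2)
}

# Same power calculation as the post's one-liner, as a function.
power_within <- function(d, n, alpha = .05) {
  df <- n - 1
  1 - pt(qt(1 - alpha / 2, df), df, ncp = d * sqrt(n))
}

a100 <- accuracy_d(.725, .775, m = 100)  # about .71
a50 <- accuracy_d(.725, .775, m = 50)
print(c(a100 = a100, power_n25 = power_within(a100, 25)))
print(c(a50 = a50, power_n30 = power_within(a50, 30)))
```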

So, for accuracy and RT, somewhere between 20 and 30 participants and 50 to 100 trials per condition is quite sufficient. And this holds so long as one is willing to make *the stipulation*, which, again, seems quite reasonable in most cases to me.

**Gamma of Shape 2?** Because so many of you are as argumentative as you are smart, you are bound to complain about the Gamma distribution. Why a shape of 2.0? Suppose the shape is lower? And how would we know? Let's go backwards. The way to know what is a good shape (or true effect size across people) is by running a good number of people for lots of trials each. We are pretty good at this in my lab, better than most. Our experiments are biased toward many trials with few conditions. But this is not enough. One needs an analytic method for decomposing trial-by-trial noise from population noise, so we also use hierarchical models. The results are always a bit shocking. There is usually a large degree of regularization, meaning that trial-by-trial noise dominates over people noise. People are truly not much different from each other.

The following graph is typical of this finding. In the experiment, there are 50 people each observing 50 trials in 6 conditions. The remainder of the details are unimportant for this blog. The data are pretty, and the means tell a great story (Panel A). Panel B shows the individual-level differences or contrasts among the conditions. Each line is for a different individual. These individual differences have 10s if not 100s of milliseconds in variation. But when a reasonable hierarchical model is fit (Panel C), there is a great degree of regularization, indicating that almost all the noise comes from the trial-by-trial variability. The size of the effect relative to the variability is stable and large! We find this type of result repeated often and in many paradigms. From looking at many such plots, it is my expert opinion that the gamma of shape 2 is wildly conservative, and a more defensible shape might be 3 or 4. Hence, the power estimates here are, if anything, conservative too.

Of course, your mileage may differ, but probably not by that much.

## 8 comments:

Jeff, with all due respect, your post grossly mischaracterizes the earlier discussion. You make it sound as if you meekly suggested that samples of 20 to 30 are perfectly adequate in many areas of cognitive psychology, only to be met with cries of shock and dismay. That's not what happened at all. What actually happened was that Daniel Lakens tweeted "If you submit a research grant, and ask for money to pay participants, but do not justify your sample sizes, that's *really* not good"--to which you replied "Have you ever heard of within-subject designs? It barely matters after 20 or 25." I hope you can see the difference between the weak claim you defend in this post--that "you can often get high powered experiments with 20 to 30 people so long as they run about 100 trials per condition"--and your earlier statement that people don't have to justify sample size if they're using within-subject designs. Several of us explicitly pointed out on Twitter that nobody was denying that there are domains where 30 subjects (or even 4!) is perfectly sufficient. The objection was to your unreasonable claim that no justification is required for sample size in grants involving within-subject designs in almost all areas of cognition and perception.

Note also that the context of the discussion was *grant proposals*. The reason it's important to justify sample size in grant proposals--more so than in papers, perhaps--is that research funding does not grow on trees, and researchers have a responsibility to use public funds in an efficient manner. Your post here conveniently ignores the fact that it's no good to say "if you have enough trials, you don't need that many subjects" when collecting additional trials *also* costs money. Knowing that you can do the job just fine by collecting 1,000 trials from each of 20 subjects provides no basis whatsoever for failing to justify your sample size in a grant proposal if a quick calculation would have shown that it would be much more efficient to collect 100 trials from each of 40 people. Again, *nobody* on twitter (or probably anywhere else) was claiming that it's impossible to plan a successful study without explicitly thinking about number of subjects. The objection--which you did not deign to reply to, as far as I can tell--was that there is absolutely no basis for saying that researchers should not justify their sample size in a grant proposal. This remains true whether or not you're using a within-subject design, and regardless of what area you work in.

The irony is that, in articulating the conditions under which a researcher wouldn't have to worry about sample size, you've actually done exactly what Daniel was arguing one needs to do in a grant proposal: you've justified your sample size. Notably, you've made it quite clear that your heuristic depends on "The Stipulation" that the entire population of individuals must have true scores in the same direction. Now, personally, I think this is a crazy assumption that's untenable in even most of cognition and perception (though it may well hold in *some* areas). But that's neither here nor there, and I don't care to argue the point. What I think should be quite clear is that your stipulation is not something that can be taken for granted in any grant proposal that uses a within-subject design, and must at the very least be made explicit--and preferably, actively justified. The suggestion that as long as a researcher is using a within-subject design, they don't need to explain why they're only sampling 20 subjects is absurd on its face. If you want to write a blog post pointing out that many cognitive psychology studies are perfectly adequately powered with n=20 subjects, that's great. I wholeheartedly agree with that (and I suspect that so would everyone else who criticized your comments earlier). But I think you can do that just fine without completely misrepresenting the context in which the discussion arose.

Tal, with all due respect, it is nice to see you have no critiques of my post. If you want to critique a twitter conversation, be my guest. It would be quite hard to get any of this in 140 characters.

Now, since you are not a cognitive type, let me tell you what is good common-sense folklore in the field. Run 20 to 30 subjects if you are not doing psychophysics, run 5 if you are. We leave it to researchers to know how to trade off trials per condition for number of conditions based on all sorts of things. As a functioning cognitive person, I know these trade-offs are found out with a bit of pilot work and with a lot of experience, and they have also reflected technical concerns (the scanner or, in the old days, how many slides your projector could hold). The golden rule is run as many trials per subject as you can get into your time frame. And these rules (run as many trials as you can in a session, run within whenever possible, and run 20 to 30) are just good rules that should require no further justification to a practicing cognitive psychologist. And, in the limited space of a grant, there is no way I would waste PIs' time and grant reviewers' time arguing over these things. And I hate to see young people waste their energy on it. It is a disservice. The old advice is just fine. It works. What should scientists worry about in grants? Innovation. Transformation. Impact. Significance.

It's not "wasting energy" at all. It's not enough to judge a grant by innovation, transformation, impact, etc. You NEED to verify that the money would be efficiently used. If your field or niche only needs 20-25 because you use 100-1000 trials, then great! In this post, you articulate that such designs are high-powered. Again, great! People in your particular subdiscipline who study similar things and expect similar effect sizes with similar designs can point to this and say it's enough.

However, when you're not in *your* particular subdiscipline, studying your particular effect with a similar design, we need to know how many participants are needed to have a decent shot at having something even remotely conclusive. That's why power analyses exist, and rules of thumb sucked for years. We used to simply say "yeah, 20 per cell is enough", then "yeah, 30 per cell is enough", then "50 per cell is enough"; then we finally said "oh, you know what? It really depends on the design, question, effect size..." and power analysis is then used to formalize the N requested.

If I were a grant reviewer, and I saw someone in social psych studying implicit or subtle effects, and they said "40 is good; that's what X, Y, and Z said from 1982", I would stamp a big fat "NOPE" on the application and demand that they 1) provide some adequate statistical/inferential goal 2) conduct an analysis to see what N is needed given the design and the inferential goal. Usually, this is a power analysis. Just because the rule of thumb you use for within subject designs is fine for your particular niche (and can be justified as you did above) does *not* mean that anyone with within-subject designs can just say "yup, 20-25 is fine"; again, I use w/in subject designs, and 25 would not be *nearly* enough.

Interesting discussion!

I want to add that increasing the number of trials per person does not do much to improve the power of the design to test the AVERAGE main effect. For instance, if icc=.1, one person can only contribute 10 to the effective sample size. So, even if that person had 1,000,000 trials, it only counts as approx. 10 observations.

Increasing number of trials is generally grossly overrated as a way to increase power!
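One way to see the ceiling described in this comment is the standard design-effect formula for clustered observations. This is a sketch in R that assumes that formula applies here; the comment itself does not say which formula was used, and the function name `effective_n` is hypothetical.

```r
# Effective number of independent observations contributed by one person
# who runs `trials` trials, when trials within a person correlate at `icc`.
# Assumes the usual design effect: 1 + (trials - 1) * icc.
effective_n <- function(trials, icc) {
  trials / (1 + (trials - 1) * icc)
}

# With icc = .1 the contribution approaches 1 / icc = 10 as trials grow.
print(effective_n(c(10, 100, 1000, 1e6), icc = .1))
```

Under this formula the limit is `1 / icc`, so a smaller true icc raises the ceiling and makes extra trials more valuable.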

Marcel, Thanks. True. Yes. Good point about low ICC (I assume you mean true ICC as opposed to observed ICC which can be low for small numbers of trials per condition). We have a nice figure of this behavior in our new paper at https://psyarxiv.com/2mf96/. But if you have an ICC of .1, are you really interested in a mean effect? There is so much variability. In most WI designs of the type in cognition, it means that like 40% have a negative effect and 60% have a positive effect. That fact, not the mean average, becomes more a target of inquiry. Why are some ppl negative and others positive? My own view is that crossing zero is super theoretically important and interesting. Thanks again.

"If you run within-subject designs in cognition and perception, you can often get high powered experiments with 20 to 30 people so long as they run about 100 trials per condition." My impression is that many cognitive/perception experiments use a 2x2 repeated measures design in which the key question is whether or not IV B affects the size of a well-established effect of IV A (e.g., does standing up versus sitting down affect the size of the difference in RT for colour naming of congruent vs. incongruent trials). It is also my impression that if the size of that interaction effect is small, power with 100 trials/cell and 20 subjects will be modest. I guess that just how modest it will be would depend on how noisy the RT data were and how correlated RT is on congruent and incongruent items, but as I understand it, under a range of reasonable assumptions power to detect that interaction would be modest with N = 20 and would be nontrivially greater with, say, N = 40. But I'm no stats maven.

Steve

Thx Steve, let's take Stroop, and let's say you have a manipulation that knocks Stroop in half, say from 40 ms to 20 ms. Do you think anyone has the opposite effect, that is, the manipulation increases their Stroop effect? If not, the variability across people must be tiny because a 20 ms reduction is a small mean reduction and nobody has a positive increase (bounded below at zero). So, interactions overall have small effects, but the variability of these effects across people is also decreased. And that is how power is maintained.

Jeff, thanks for the link to the paper! I was planning to write a paper along these lines as well, but in another context (meta-analysis).

Short reaction to your comment, with which I completely agree, but I want to add some comments too.

J: Yes. Good point about low ICC (I assume you mean true ICC as opposed to observed ICC which can be low for small numbers of trials per condition).

--> Sure! I used the equations for effective sample size, and se = sqrt((tau2+sigma2/n)/K)

J: We have a nice figure of this behavior in our new paper at https://psyarxiv.com/2mf96/. But if you have an ICC of .1, are you really interested in a mean effect? There is so much variability. In most WI designs of the type in cognition, it means that like 40% have a negative effect and 60% have a positive effect.

--> This of course depends on the main effect... But still, you are right that heterogeneity is like interactions, and in case of interactions interpreting main effects is tricky at best.

J: That fact, not the mean average, becomes more a target of inquiry. Why are some ppl negative and others positive? My own view is that crossing zero is super theoretically important and interesting. Thanks again.

--> I agree

Some additional comments:

- Uncertainty in assessing heterogeneity (tau2, or icc) is huge. So, don't expect to reliably assess it. Consequently, in most practical situations one can only make assumptions about icc = 0, and not "prove" it.

- A consequence of the last point is that it remains dangerous to have a design with a large number of trials per person and a small number of persons... I would not bet on homogeneity and risk lack of generalizability and low precision estimating the average effect by testing just a few subjects

- Longitudinal designs may have an icc of .5!
