Tuesday, April 21, 2015

How many participants do I need? Insights from a Dominance Principle

The Twitter Fight About Numbers of Participants

About a month back, there was an amusing Twitter fight about how many participants one needs in a within-subject design.  Cognitive and perceptual types tend to use just a few participants but run many replicates per subject.  For example, we tend to run about 30 people with 100 trials per condition.  Social psychologists tend to run between-subject designs with one or a handful of trials per condition, but with many more than 30 people.  The Twitter discussion was a critique of the small sample sizes used in cognitive experiments.

In this post, I ask how wise the cognitive psychologists are by examining the ramifications of these small numbers of participants.   This examination is informed by a dominance principle, a reasonable conjecture about how people differ from one another.  I show why the cognitive psychologists are right---why only small numbers of people are needed even to detect small effects.


Consider a small priming effect, say an average 30 ms difference between primed and unprimed conditions.  There are a few sources of variation in such an experiment:

The variability within a person and condition across trials.  What happens when the same person responds repeatedly in the same condition?  In response time measures, this variation is usually large, say a standard deviation of 300 ms.  We can call this within-cell variability.

The variability across people.  Regardless of condition, some people just flat out respond faster than others.  Let's suppose we knew exactly how fast each person was, that is, we had many repetitions in each condition.  In my experience, across-people variability is a tad less than within-person-condition variability.  Let's take the standard deviation at 200 ms.  We can call this across-people variability.

The variability in the effect across people.  Suppose we knew exactly how fast each person was in both conditions.  The difference is the true effect, and we should assume that this true effect varies.  Not everyone is going to have an exact 30 ms priming effect.  Some people are going to have a 20 ms effect, others are going to have a 40 ms effect.   How big is the variability of the effect across people?    Getting a handle on this variability is critical because it is the limiting factor in within-subject designs.   And this is where the dominance principle comes in.

A Dominance Principle

The dominance principle here is that nobody has a true reversal of the priming effect.  We might see a reversal in observed data, but this reversal is only because of sample noise.  Had we enough trials per person per condition, we would see that everyone has at least a positive priming effect.  The responses are unambiguously quicker in the primed condition---they dominate.

The Figure below shows the dominance principle in action.  Shown are two distributions of true effects across people---an exponential and a truncated normal.  The dominance principle stipulates that the true effect is in the same direction for everyone, that is, there is no mass below zero.  And if there is no mass below zero and the average is 30 ms, then the distributions cannot be too variable.  Indeed, the two shown distributions have a mean of 30 ms, and the standard deviations for these exponential and truncated normal distributions are 30 ms and 20 ms, respectively.  This variability is far less than the 300 ms of within-cell variability or the 200 ms of across-people variability.  The effect size across people, 30 ms divided by these standard deviations, is actually quite large.  It is 1 and 1.5 respectively for the shown distributions.
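These numbers are easy to check by simulation.  The sketch below (Python; my own quick check, not part of the original analysis) draws true effects from the exponential case, the most variable of the two distributions shown, and confirms that under dominance the across-people effect size is about 1:

```python
import numpy as np

rng = np.random.default_rng(2015)

# True effects under dominance: everyone's effect is positive.
# An exponential with mean 30 ms is the more variable of the two
# distributions in the figure (for an exponential, SD = mean).
effects = rng.exponential(scale=30.0, size=200_000)

mean_eff = effects.mean()        # close to 30 ms
sd_eff = effects.std()           # close to 30 ms
effect_size = mean_eff / sd_eff  # close to 1

print(mean_eff, sd_eff, effect_size)
```

The truncated normal in the figure is even less variable (SD of 20 ms), so its effect size is larger still, at 1.5.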

You may see the wisdom in the dominance principle or you may be skeptical.  If you are skeptical (and why not), then hang on.  I am first going to explore the ramifications of the principle, and then I am going to show it is probably ok.

Between and Within Subject Designs

The overall variability in a between subject design is the sum of these variances, and it is dominated by the much larger within-cell and across-people variabilities.  This is why it might be hard to see a 30 ms priming effect in a typical between subject design.  The effect size is somewhere south of .1.

The overall variability in a within subject design depends on the number of trials per participant.   In these designs, we calculate each person's mean effect.  This difference has two properties: first, it effectively subtracts out across-participant variability; second, its within-cell variability decreases with the number of trials per participant.  If this number is large, then the overall variability is limited by the variability in the effect across people.  As stated above, under the dominance principle this variability is small, say about the size of the effect under consideration.  Therefore, as we increase the number of observations per person, we can expect effect sizes of 1 or even bigger.
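The variance bookkeeping can be made concrete with a back-of-the-envelope calculation (Python sketch; the 300 ms, 200 ms, and 20 ms values are the ones assumed in this post).  The SD of a person's observed mean difference is \( \sqrt{\sigma^2_\theta + 2\sigma^2_w/K} \): the across-people baseline variability subtracts out, the within-cell noise shrinks with the \( K \) trials per condition, and only the effect variability \( \sigma_\theta \) remains as the floor:

```python
import math

sigma_w = 300.0     # within-cell SD (ms)
sigma_theta = 20.0  # SD of the true effect across people (kept small by dominance)
K = 100             # trials per person per condition

# Noise in one person's observed primed-minus-unprimed mean difference:
noise_sd = math.sqrt(2 * sigma_w**2 / K)            # about 42.4 ms
# Total SD of the observed per-person effect:
total_sd = math.sqrt(sigma_theta**2 + noise_sd**2)  # about 46.9 ms
# Within-subject effect size for a 30 ms effect:
dz = 30.0 / total_sd                                # about 0.64

print(noise_sd, total_sd, dz)
```

As K grows, the noise term vanishes and the effect size approaches 30/20 = 1.5, the across-people limit.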

Simulating Power for Within Subject Designs

Simulations seem to convince people of points perhaps even more than math.  So here are mine to show off the power of within-subject designs under the dominance principle.  I used the 300 ms within-cell and 200 ms across-people variabilities and sampled 100 observations per person per condition.  Each person had a true positive effect, and these effects were sampled from a truncated normal distribution with an overall mean of \( \mu \).  Here are the power results for several sample sizes (numbers of people) and values of the average effect \( \mu \).
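The simulation is short enough to sketch.  The version below is a minimal reconstruction in Python, not the original code: the two-tailed one-sample t-test at \( \alpha = .05 \) and the 20 ms SD of the truncated normal are my assumptions here.

```python
import numpy as np
from scipy import stats

def power_sim(mu, n_people, n_trials=100, sd_within=300.0,
              sd_effect=20.0, n_sims=2000, alpha=0.05, seed=0):
    """Estimate power of a massively-repeated within-subject design.

    True effects are normal(mu, sd_effect) truncated at zero, so every
    person has a positive effect (the dominance principle).
    """
    rng = np.random.default_rng(seed)
    # SD of one person's observed mean difference given n_trials/condition
    noise_sd = np.sqrt(2 * sd_within**2 / n_trials)
    hits = 0
    for _ in range(n_sims):
        theta = rng.normal(mu, sd_effect, size=n_people)
        while np.any(theta < 0):          # truncate at zero by redrawing
            neg = theta < 0
            theta[neg] = rng.normal(mu, sd_effect, size=neg.sum())
        observed = theta + rng.normal(0.0, noise_sd, size=n_people)
        _, p = stats.ttest_1samp(observed, 0.0)
        hits += p < alpha
    return hits / n_sims
```

Under these assumed settings, power_sim(30, 30) comes out near the .93 quoted below.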

The news is quite good.  Although a 10 ms effect cannot be resolved with fewer than a hundred participants, the power for larger effects is reasonable.  For example, the power to resolve a 30 ms effect with 30 participants is .93!  Indeed, cognitive psychologists know that even small effects can be successfully resolved with limited participants in massively-repeated within-subjects designs.  It's why we do it routinely.

The bottom line message is that if one assumes the dominance principle, then the power of within-subject designs is surprisingly high.  Of course, without dominance all bets are off.  Power remains a function of the variability of the effect across people, which must be specified.

Logic and Defense of the Dominance Principle

You may be skeptical of the dominance principle.  I suspect, however, that you will need to assert it.

1. Sizes of effects are difficult to interpret without the dominance principle.   Let's suppose that the dominance principle is massively violated.  In what sense is the mean effect useful or interpretable?  For example, suppose one has a 30 ms average effect with 60% of people having a true positive effect and 40% of people having a true negative priming effect.  The value of the 30 ms seems unimportant.  What is critically important in this case is why the effect differs in direction across people.  A good start is exploring which person variables are associated with positive and negative priming.

2. The dominance principle is testable.  All you have to do is collect a few thousand trials per person to beat the 300 ms within-cell variability.  If you want, say, 10 ms resolution per person, just collect 1000 observations per person.  I have done it on several occasions, collecting as many as 8,000 trials per person in some studies (see Ratcliff and Rouder, 1998, Psych Sci).  I cannot recall a violation, though I have no formal analysis... yet.   The key is making sure you do not confound within-participant variability, which is often large, with between-participant variability.  You need a lot of trials per individual to deconfound these sources.  If you know of a dominance violation, then please pass the info along.

Odds are you are not going to collect enough data to test for dominance.  And odds are that you are going to want to interpret the average effect size across people as meaningful.  And to do so, in my view, you will therefore need to assume dominance!  And this strikes me as a good thing.  Dominance is reasonable in most contexts, strengthens the interpretation of effects, and leads to high power even with small sample sizes in within-subject designs.


Jake Westfall said...

Very interesting post, Jeff.

About the relative sizes of the variance components. The assumption here -- which, for the record, I think is reasonable -- is that the variance of the participants' condition effects is likely to be much smaller than the participant mean variance or the error variance. I've built a similar assumption into my "PANGEA" power app ( http://jakewestfall.org/pangea/ ), appealing to the so-called "hierarchical ordering principle" that is discussed in the literature on design of experiments.

But I have to say that I don't totally understand how the dominance principle is a justification for why we might observe hierarchical ordering. You say that "if there is no mass below zero and the average is 30 ms, then the distributions cannot be too variable." I would like to hear you say a little more about what precisely you mean here. Taken at face value it doesn't seem to be true. As a counter-example, if the distribution of the true effects is gamma (so that there is no mass below zero), then constraining the mean to be 30 places no constraint at all on the variance. It seems like in order to get that constraint we have to also make some sort of assumption about the shape of the distribution. But you don't mention anything about that here. So, while I'm fine to tentatively grant the dominance principle as an approximate rule to guide study design, I don't really see how this leads to the constraints on the variance components that you say it does.

Tentative assumptions aside, as a matter of empirical fact I suspect that violations of the dominance principle are common. (As a side comment, you say that this is difficult to test because you need a lot of observations per person, but I don't think that's generally true: it's probably true for response times, which are famously noisy, but the outcomes in a lot of other literatures are far less so, once other sources of variation such as participant mean variance are removed.) Consider judgments of attractiveness. We have a sample of straight female participants rate the attractiveness of a set of photographs of men. Most women prefer men with short hair, but a minority prefer men with long hair. Most women prefer men with relatively little facial hair, but a minority prefer men with heavy beards. My point is not simply that there exists some research situation where the dominance principle probably doesn't hold. My point is that it's really pretty easy to think of such counter-examples. I see why it's a convenient assumption, and maybe a useful one if it leads to sensible constraints on the variance components (although, like I said, I don't really get that), but it would be nice to see at least *some* plausible argument for why it might tend to be true empirically in reasonably diverse research domains.

Bruno said...

very interesting! I was wondering if you have some references to research checking whether the dominance principle holds or not for some specific cognitive phenomena.

Jeff Rouder said...

Hi Jake, Bruno. Thanks for the excellent comments and queries. Comments like these make blogging more rewarding. I will leave a few replies as I have time, but it is a busy day.

Let me address the relationship between dominance and variance. Jake, you are right. There needs to be some condition on skewness. I was thinking of a nondecreasing hazard (nothing more skewed than an exponential). Including that discussion would have clouded the post for readers not versed in mathematical statistics, so I was hoping sophisticated readers would fill in the gaps. More to follow later today.

Dr. R said...

Dear Jeff,

Interesting post. Within-subject designs are very appealing to study small effects, if it is possible to reduce between-subject variability by aggregating across repeated observations. I have two comments about power in a within-subject design.

The dominance principle

The dominance principle is not a principle or law. It is an assumption. The assumption can be stated as follows: the true effect size has the same sign for all participants (e.g., the mean difference between condition A and condition B is always positive).
The blog post implies that small numbers of participants are sufficient to have high power to show an effect. This is true, but it directly follows from standard power analysis. The only factors that influence power are the effect size and sampling error, and the criterion to declare an effect to be present (statistically significant).
So, if we want to know how powerful a study is to demonstrate a reliable effect with a 10ms, 20ms, or 30ms effect size, we need to know the sampling error. With a 10ms effect size, sampling error has to be 5ms to obtain a t-value of 2, which is significant with p < .05. If you use default Bayesian t-tests, you need less sampling error to achieve a Bayes factor of 10.
In a within-subject design, sampling error is a function of sample size, between-subject standard deviation and the correlation across conditions. A standard power analysis makes an even stronger assumption about individual differences than the dominance principle. It assumes that the true effect size is the same for all participants. It does not care about the sources of between-subject variance. This variance can be sampling error or true variation across participants.
To determine power of within-subject designs, we can use GPower. Open the determine window and enter means and standard deviations. I entered means 10ms, 0ms, and standard deviation as 200ms based on the suggestion of the blog. The crucial aspect for a within-subject power analysis is to provide an estimate of the correlation between means in condition A and condition B. I entered r = .90. A higher value would increase power.
GPower informs me that these statistics imply a standardized mean difference of dz = .11, which is a small effect. A two-tailed t-test with N = 100 would have 20% power to detect this effect. I would need N = 630 participants to achieve 80% power, and 1042 participants to achieve 95% power.
Cognitive psychologists can achieve higher power by further increasing the number of trials. As the number of trials increases, the correlation between condition A and condition B increases further. A pilot study could be used to estimate this correlation to plan a study.
With r = .99, dz increases from d = .11 to d = .35. This does not mean the effect size changed. It is still 10ms, but the standardized effect size increased because aggregation across trials eliminates random error in response times. With r = .99, a sample of N = 106 is sufficient to have 95% power to detect an effect size of 10ms with 200ms between-subject standard deviation.
Remember, this analysis assumes that the effect size is the same for every individual (fixed effect size). Power will be attenuated by any form of person x situation interaction effects; that is the effect of condition A and condition B is moderated by an individual difference variable. The reason is that aggregation will not eliminate this systematic source of variation.
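These GPower figures can be checked against the usual normal-approximation formulas for a paired t-test. A minimal sketch (Python; the hardcoded z-quantiles and the ceiling are my additions, and the approximation lands within a few participants of GPower's exact t-based answers):

```python
import math

sd_between = 200.0   # between-subject SD in each condition (ms)
effect = 10.0        # raw effect (ms)

def dz(r):
    """Standardized within-subject effect size given correlation r
    between conditions: effect / SD of the difference scores."""
    sd_diff = math.sqrt(2 * sd_between**2 * (1 - r))
    return effect / sd_diff

def n_needed(r, z_alpha=1.960, z_beta=0.842):
    """Approximate N for a two-tailed paired t-test (defaults: alpha
    = .05, 80% power; z_beta = 1.645 gives 95% power)."""
    return math.ceil(((z_alpha + z_beta) / dz(r)) ** 2)

print(dz(0.90))        # about .11, as stated above
print(n_needed(0.90))  # about 630 for 80% power
print(dz(0.99))        # about .35
```

With r = .99 and z_beta = 1.645, the same formula gives roughly the N = 106 for 95% power quoted above.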

Simulating Power for Within Subject Designs
There is relevant information missing in the power analysis. Most importantly, you do not mention the criterion used to declare an observed effect significant (p < .05, one-tailed or two-tailed? BF > 10? If so, what is the alternative hypothesis?).

Jeff Rouder said...

Bruno, I don't have a reference. I may try to pool some of my big-trial together, say from Ratcliff and Rouder, 1998 (Psych Sci) and from Rouder et al. (2010, Psych Rev.). I'll post if I do. Do you know of any data sets?

Jeff Rouder said...

Jake, Thanks for the violations of dominance---these are helpful examples. I think these cases help make my point about the need for dominance in interpreting average effects. Does it matter that on average women prefer beards with an effect size of say .3? Well, yes, if there is dominance. Well, no if there is not. If say 70% of women prefer beards and 30% do not, then the more topical question in this case is what explains the variation. Dominance licenses interpreting the average effect; violations of dominance beg the question of correlates of variation. That is my two cents at this point, though my views are evolving.

Bruno said...

Jeff, no, I work in psycholinguistics and we don't usually have many items, but sometimes a lot of participants. My impression from my own research is that participants sometimes have different strategies and effects can flip (more or less consistently) for some participants.

But there are other phenomena, like prediction in language, where it makes sense that the dominance principle should hold. (I don't imagine that some people can consistently slow down on predictable words, for example.)

But a problem that may arise with a lot of items is that people figure out the experiment or develop strategies. There is also evidence for statistical learning in some experiments (especially from Florian Jaeger's lab). So I guess that the dominance principle also assumes that the length of the experiment doesn't affect the effect. Because of all that, I was wondering if someone has actually examined some phenomena and checked whether this principle held. If not, this would be something very interesting to try :)

tal said...

Hi Jeff,

Interesting post. I generally agree with Jake and Dr. R's comments above: I think your assumptions are reasonable insofar as they facilitate sensible general guidelines for research design (i.e., collect more subjects if you're doing between-subjects studies), but as Dr. R points out, the standard power framework doesn't require you to believe, or even explicitly think about, the dominance assumption (though you could argue it may be implicit in whatever considerations lead one to come up with a plausible effect size estimate).

That said, I agree with Jake that the assumption is almost certainly false in a very large proportion of cases. And I don't see why an average effect should be considered uninterpretable unless one is willing to assume dominance. I think that assumption would actually require us to throw out much (most?) of social and biomedical science. To continue Jake's example, I don't see why it's not meaningful to know that most women prefer clean-shaven men but some like men with beards. I mean, if I were single, and I believed that claim, I might very well shave my beard. It doesn't seem so hard at all to interpret that finding or act on it. Of course, I might also like to know why some women like clean-shaven or bearded men (because then maybe I could keep my beard and specifically seek out beard-loving women), but absent any other data, there is still clear, actionable information associated with a difference in means, even when there is known heterogeneity in the direction of effect.

Or a very medical example: there are many drugs that seem beneficial on average, but clearly make things worse for some proportion of people (either because they have intolerable side effects, or because they actually make the target condition worse). It would, of course, be very helpful to know why some people respond positively and others respond negatively (e.g., in some cases, such variability can be traced to individual gene variants), but surely in cases where we don't have such knowledge (which is the vast majority of them), we shouldn't conclude that we've learned nothing! If giving a dopamine agonist to patients with Parkinson's produces major improvements for most people, but worsens things for some, we don't walk away saying the effect is uninterpretable. We've learned something about the effects of dopamine on Parkinson's--just not everything.

Jeff Rouder said...

Thanks Dr. R, Tal for the comments. Let me address a few outstanding issues:

1. The context for this blog post is the inane assertion in social-psychology circles as expressed on twitter that you need a certain number of subjects in all designs. I clearly prove this to be an inaccurate and unhelpful assertion for my work. We can reasonably assume dominance for strength variables and even context variables in most research in cognition, memory, and perception. Nobody detects a dim flash more slowly as the flash is increased in intensity. In almost all performance based cases, harder is at least as hard for everyone.

2. I agree that without dominance, it is a pretty standard power argument. In my defense, I stressed dominance from the title onward, so it all revolves around dominance.

3. I considered the ordering argument---interactions have smaller variance than main effects---but I don't know how to justify it. It is a good assumption, less restrictive than dominance, but on what basis? At least with dominance, I am fairly certain there is an important set of phenomena that obey dominance. It is also important in IRT with the Rasch model and the concept of consistent item orderings.

4. I want to double down on my assertion that dominance is critically important for interpreting mean effects. I think it comes down to science and principle. Let's take the case of a cold remedy that truly shortens the duration of symptoms for 70% of people, say by an average of 8 hrs, and lengthens it for the remaining 30% by 6 hours. I gather the marginal mean, almost a 4 hr gain, is important for the average cold sufferer (whatever that might mean), but it is not important for science. Understanding whether an experiment is powerful enough to detect this overall mean strikes me as an unimportant question. Really, who cares? The question however becomes immediately important with dominance. Before you ask about an overall mean, think about dominance. If you are convinced dominance won't hold, then change your question to a more scientifically meaningful and interesting one.

5. Overall, I am challenging you to accept noise structures that are more varied, interesting, and complex than the tired old normal distribution. Zero is almost always theoretically important, and the fact that it plays no role in common modeling is leaving money on the table.

Thanks again for the comments. --Jeff

Jake Westfall said...

I can think of three plausible arguments for why variance components from experiments often tend to be hierarchically ordered...(I just wrote a little about this in my dissertation so it's fresh on my mind)...but that discussion might be a little longer than a comment thread would warrant. Maybe it's time to finally set up my blog and post about this there. Anyway, dominance is definitely a very interesting idea. I agree with the general sentiment that it is useful and important to seriously consider these kinds of general principles (invariances!) in research.

Jeff Rouder said...

Jake, I look forward to reading about it. Post a comment here or on twitter when you do so. Best, Jeff

tal said...

Well, I'll double down on my argument that requiring dominance as a precondition for scientific understanding is basically writing off most of science, where such an assumption is simply not plausible. I don't really see any reason to view scientific understanding as an all-or-none phenomenon. I personally have no trouble saying "hey, the fact that dopamine agonists produce such-and-such behavior in most of the population, in line with our prediction, is pretty good evidence that we understand *something* about dopamine function. But of course, if we understood why it seems to reliably produce the opposite effect in a small segment of the population, that would be even better, and we should keep trying to go further."

Actually, that strikes me as exactly the way most scientists think about things. Now it may not be the way most cognitive psychologists think (to be honest, I actually doubt that, but am willing to take your word for it), but I think you're unlikely to be taken seriously by most people who work on problems dissimilar to yours if you are really claiming that people can't possibly be understanding anything about the mechanisms governing the relationship between different variables unless they can also specify under what conditions an association might reverse.

It's also worth noting that requiring this assumption places you on kind of precarious ground, because let's suppose that I think that in extremely rare cases, you *could* actually get reversal of priming effects--something that would be incredibly difficult to demonstrate one way or the other. Would that mean that I therefore couldn't possibly learn anything from any priming effects you report? In point of fact, I *am* actually quite prepared to believe that there *could* be certain (very unusual) conditions under which priming reverses. But I think it would be odd indeed to demand that unless I share your assumption about dominance, there's just nothing I could learn about the mechanisms underlying priming.

Jeff Rouder said...

Tal, I think you are confusing the concept of the interpretability of an overall mean (my words) with what you call "scientific understanding" and "learn anything" (your words). They are quite different. If you think there is massive indominance, then different questions are warranted than the one addressed in power analyses.

Jake Westfall said...

Okay, I made my inaugural blog post about possible arguments for the hierarchical ordering principle: http://jakewestfall.org/blog/index.php/2015/05/11/the-hierarchical-ordering-principle/
