Saturday, September 30, 2017

Your Input Needed: Are There Theories that Predict Variabilities of Individual Differences?

Hi Folks,

Input please about this individual-differences question.


Suppose I have a few tasks, say Task A, Task B, Task C, etc.  These tasks can be any tasks, but for the sake of concreteness, let's assume each is a two-choice task that yields accuracy as a dependent measure, with chance being .5 and ceiling being 1.  Suppose I choose task parameters so that each task yields a mean accuracy across participants of .75.


Here is an example: Task A might be a perception task where I flash letters and mask them, and the participant has to decide whether the letter has a curved element, like in Q but not in X.  Task B might be a recognition memory task where the participant decides if items were previously studied or new.  By playing with the duration of the flash in the first task and the number of memoranda in the second task, I can set up the experiments so that the mean performance across people is near .75.


If we calculate the variability across individuals, can you predict which task would be more variable?  The figures below show three cases.  Which would hold?  Why?  Obviously it depends on the tasks.  My question is whether there are any tasks for which you could predict the order.

Example Revisited (and an answer)

Now, if we were running the above perception and memory tasks, people would be more variable in the perception task.  At 30 ms, some people will be at ceiling, others will be at floor, and the rest will be well distributed across the range.  At 100 items, most people in memory will be between 60% and 90% accurate.  I know of no theory, however, that addresses, predicts, or anticipates this degree of variability.

Variability In The Shadows

In psychophysics, we give each person unique parameters to keep accuracy controlled.  In cognition, we focus on mean levels rather than variability.  In individual differences, it is the correlation of people across tasks rather than the marginal variability in tasks that is of interest.

Questions Refined:

1. Do you think documenting and theorizing about this variability is helpful?  Foundational?  Arbitrary?

2. Do you know of any theory that addresses this question for any set of tasks?

3. My hunch is that the more complex or high-level a task is, the less variability.  Likewise, the more perceptual, simple, or low-level a task is, the more variability.  This seems a bit backwards in some sense, but it matches my observations as a cognitive person.  Does this hunch seem plausible?

Wednesday, September 27, 2017

The Justification of Sample Size (Un)Planning for Cognition and Perception

There apparently is keen interest in justifying sample sizes, especially in the reviewing of external grant applications.  I am resistant, at least for cognitive and perceptual psychology.  I and other senior colleagues advise that 20 to 30 participants is usually sufficient for most studies in cognition and perception, so long as you have tens or hundreds of replicates per condition.  That advice, however, rankled many of my tweeps.  It especially rankles my comrades in the Methodological Revolution, and I hope you don't paint me one with too much flair.  It has been a Twitter day, and even worse, you rankled methodological comrades have motivated me to write this blog post.

I provide a common-sense justification for sample size (un)planning for within-subject designs in cognition and perception.  You can cite my rule of thumb, below, and this blog in your grant proposals.  You can even ding people who don't cite it.  I'll PsyArXiv this post for your citational convenience soon, after you tell me where all the typos are.

Rouder's Rule of Thumb:  If you run within-subject designs in cognition and perception, you can often get high-powered experiments with 20 to 30 people so long as they run about 100 trials per condition.

Setup.  You have \(N\) people observing \(M\) trials per condition.  How big should \(N\) and \(M\) be?  

People.  Each person has a true effect, \(\mu_i\).   This true effect would be known exactly if we had gazillions of trials per condition, that is, as \(M\rightarrow\infty\).  We don't have gazillions, so we don't know it exactly.   But let's play for the moment like we do.  We will fix the situation subsequently.

Crux Move. The crux move here is a stipulation that \(\mu_i \geq 0\) for all people.  What does this mean?  It means that true scores all have the same sign.  For example, in the Stroop effect, it implies that nobody truly responds more quickly to incongruent than congruent colors.  In perception, it implies that nobody truly responds more quickly to faint tones than loud ones.  Let's call this stipulation The Stipulation.

Should You Make The Stipulation?  That depends on your experiment and paradigm.  It is quite reasonable in most cognition and perception paradigms.  After all, it is implausible that some people truly reverse Stroop or truly identify faint tones more quickly than loud ones.  In fact, in almost all cases I routinely deal with, in attention, memory, perception and cognition, this is a reasonable stipulation.  You may not be able to make or justify this stipulation.  For example, personal preferences may violate the stipulation.  So, whether you make or reject the stipulation will affect your (un)planning.  Overall, however, the stipulation should be attractive for most cases in cognition and perception.  If you make the stipulation, read on.  If not, the following (un)planning is not for you.

Individual Variation is Limited with The Stipulation: If you stipulate, then the gain is that the variability across people is limited.  When all effects are positive, and the mean is a certain size, then the variability cannot be too big, else the mean would be bigger.  This is the key to sample size (un)planning.  For the sake of argument, let's assume that true effects are distributed as below, either as the blue or green line.  I used a Gamma distribution with a shape of 2.  Not only is the distribution all positive, the shape value of 2 is reasonable for a distribution of true effects.  And as a bonus, the shape of 2 gives the right tail a normal-like falloff.  Two curves are plotted, the blue one for a small effect and the green one for a large effect.

The limited-variation proposition is now on full display.  The blue curve with the smaller effect also has smaller variance.  The effect size, the ratio of the mean to the standard deviation, is the same for both curves!  It is \(\sqrt{2}\), about 1.4, or the root of the shape.

Am I claiming that if you had a gazillion trials per condition, all experiments would have an effect size of about 1.4?  Well yes, more or less, to first order.  Once you make the stipulation, it is natural to use a scale-family distribution like the gamma.  In this family the shape sets the effect size, which is the square root of the shape, and reasonable shapes yield effect sizes between about 1 and 2.  The stipulation is really consequential as it stabilizes the true effect sizes!  This licenses unplanning.
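To make the scale-family point concrete, here is a minimal R sketch.  The function name and the two scale values (20 and 50, standing in for a "small" and a "large" effect) are my own illustrative assumptions, not values taken from the figure.  It only uses the textbook mean and standard deviation of a gamma distribution.

```r
# Effect size (mean / sd) for gamma-distributed true effects.
# Gamma mean is shape * scale; gamma sd is sqrt(shape) * scale.
gamma_effect_size <- function(shape, scale) {
  true_mean <- shape * scale
  true_sd <- sqrt(shape) * scale
  true_mean / true_sd
}

# Same shape, different (hypothetical) scales: the scale cancels,
# so both give sqrt(2).
small_es <- gamma_effect_size(shape = 2, scale = 20)
large_es <- gamma_effect_size(shape = 2, scale = 50)
print(c(small = small_es, large = large_es))

# Shapes 1 through 4 span effect sizes from 1 to 2.
es_by_shape <- sapply(1:4, gamma_effect_size, scale = 20)
print(round(es_by_shape, 3))
```

The scale drops out of the ratio, which is why the small-effect and large-effect curves share one effect size.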

Power and (Un)Planning: Experiments capable of detecting effect sizes of, say, 1.4, the size in the figure, do not require many subjects.  Ten is more than enough.  For example, at a .05 level, \(N=10\) yields a power of .98 [R code: 1-pt(qt(.975,9),9,ncp=sqrt(2)*sqrt(10))].  This ease of powering designs also holds for more realistic cases without a gazillion trials.
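If you want to try other sample sizes, the one-liner above can be wrapped in a small function.  This is just a sketch: the function name and the grid of \(N\) values are my own choices, and like the one-liner it counts only the upper rejection region of a two-sided one-sample t-test.

```r
# Power of a two-sided one-sample t-test at level alpha, counting only
# the upper rejection region (same calculation as the inline one-liner).
power_one_sample <- function(es, n, alpha = 0.05) {
  df <- n - 1
  crit <- qt(1 - alpha / 2, df)
  1 - pt(crit, df, ncp = es * sqrt(n))
}

# Effect size sqrt(2) from the gamma of shape 2, over an arbitrary grid of N.
n_grid <- c(5, 10, 15, 20)
power_grid <- sapply(n_grid, function(n) power_one_sample(sqrt(2), n))
print(round(setNames(power_grid, paste0("N=", n_grid)), 3))
```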

Finite Trials:  Nobody wants to run a gazillion trials.  Let's slim \(M\) down.  Let's take two cases, one for RT and another for accuracy:

For RT, we are searching for a 40 ms effect, and the residual variation is some 300 ms.  This 300 ms value is a good estimate for tasks that take about 750 ms to complete, which is typical for many paradigms.  The variability for \(M\) trials is \(300/\sqrt{M}\), and if we take a contrast, we need an additional factor of \(\sqrt{2}\) for the subtraction.  If \(M=100\), then we expect variability of about 42 ms.  The variability across participants from the above Gamma distribution is \(40/\sqrt{2}\), about 28 ms.  Combining the two, \(\sqrt{42^2+28^2}\), yields a total variability of about 51 ms, or an effect size of 40/51 = .78.  Now, it doesn't take that many participants to power up this effect size value.  \(N=20\) corresponds to a power of .91.  We can explore fewer trials per condition too.  If \(M=50\), then the effective effect size is .60, and the power at \(N=25\) is .82, which is quite acceptable.
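Here is the RT arithmetic as a short R sketch so you can plug in your own numbers.  The function names are mine, and the power function is the same upper-tail t-test calculation as the one-liner earlier in the post.

```r
# Trial-noise sd of a condition contrast: sqrt(2) * trial_sd / sqrt(M).
rt_noise_sd <- function(M, trial_sd = 300) sqrt(2) * trial_sd / sqrt(M)

# Effective effect size: mean effect over the combined sd of trial noise
# and gamma-distributed true effects (people sd = effect / sqrt(shape)).
effective_es <- function(effect, noise_sd, shape = 2) {
  people_sd <- effect / sqrt(shape)
  effect / sqrt(noise_sd^2 + people_sd^2)
}

# Power of a two-sided one-sample t-test (upper rejection region only).
power_one_sample <- function(es, n, alpha = 0.05) {
  df <- n - 1
  1 - pt(qt(1 - alpha / 2, df), df, ncp = es * sqrt(n))
}

es_100 <- effective_es(40, rt_noise_sd(100))  # the post reports about .78
es_50 <- effective_es(40, rt_noise_sd(50))    # the post reports about .60
print(round(c(es_100 = es_100, es_50 = es_50), 2))
print(round(c(power_M100_N20 = power_one_sample(es_100, 20),
              power_M50_N25 = power_one_sample(es_50, 25)), 2))
```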

For accuracy, the calculations are as follows:  Suppose we are trying to detect the difference between .725 and .775, or a .05 difference in the middle of a two-alternative forced-choice range.  The standard deviation for observed proportions for \(M\) trials is \(\sqrt{p(1-p)/M}\).  For 100 trials, it is .043, and if we throw in the factor of \(\sqrt{2}\) for the contrast, it is .061.  The variability across participants from the above Gamma distribution is \(.05/\sqrt{2}\), about .035.  Combining the two yields a total variability of .070, or an effective effect size of .71.  \(N=25\) corresponds to a power of .925.  Even for \(M=50\), the power remains quite high at \(N=30\), and is .80.
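The accuracy case differs only in where the trial noise comes from: the binomial standard deviation at \(p=.75\).  A sketch, again with function names of my own choosing and the same upper-tail power calculation as before:

```r
# Binomial trial-noise sd of a contrast between two observed proportions,
# evaluated at the midpoint p (an approximation for .725 versus .775).
acc_noise_sd <- function(M, p = 0.75) sqrt(2) * sqrt(p * (1 - p) / M)

# Effect size once trial noise is combined with gamma-distributed true
# effects, whose sd is effect / sqrt(shape).
acc_effective_es <- function(effect, M, shape = 2) {
  people_sd <- effect / sqrt(shape)
  effect / sqrt(acc_noise_sd(M)^2 + people_sd^2)
}

# Power of a two-sided one-sample t-test (upper rejection region only).
power_one_sample <- function(es, n, alpha = 0.05) {
  df <- n - 1
  1 - pt(qt(1 - alpha / 2, df), df, ncp = es * sqrt(n))
}

acc_es_100 <- acc_effective_es(0.05, M = 100)  # the post reports about .71
acc_es_50 <- acc_effective_es(0.05, M = 50)
print(round(c(es_M100 = acc_es_100, es_M50 = acc_es_50), 2))
print(round(c(power_M100_N25 = power_one_sample(acc_es_100, 25),
              power_M50_N30 = power_one_sample(acc_es_50, 30)), 2))
```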

So, for accuracy and RT, somewhere between 20 and 30 participants and 50 to 100 trials per condition is quite sufficient.  And this holds so long as one is willing to make the stipulation, which, again, seems quite reasonable in most cases to me.

Gamma of Shape 2?  Because so many of you are as argumentative as you are smart, you are bound to complain about the Gamma distribution.  Why a shape of 2.0?  Suppose the shape is lower?  And how would we know?  Let's go backwards.  The way we are to know what is a good shape (or true effect size across people) is by running a good number of people for lots of trials each.  We are pretty good at this in my lab, better than most.  Our experiments are biased toward many trials with few conditions.  But this is not enough.  One needs an analytic method for decomposing trial-by-trial noise from population noise.  For that, we use hierarchical models.  The results are always a bit shocking.  There is usually a large degree of regularization, meaning that trial-by-trial noise dominates over people noise.  People are truly not much different from each other.  The following graph is typical of this finding.  In the experiment, there are 50 people each observing 50 trials in 6 conditions.  The remainder of the details are unimportant for this blog.  The data are pretty, and the means tell a great story (Panel A).  Panel B shows the individual-level differences or contrasts among the conditions.  Each line is for a different individual.  These individual differences have 10s if not 100s of milliseconds in variation.  But when a reasonable hierarchical model is fit (Panel C), there is a great degree of regularization, indicating that almost all the noise comes from the trial-by-trial variability.  The size of the effect relative to the variability is stable and large!  We find that this type of result is repeated often and in many paradigms.  From looking at many such plots, it is my expert opinion that the gamma of shape 2 is wildly conservative, and a more defensible shape might be 3 or 4.  Hence, the power estimates here are, if anything, conservative too.
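To get a feel for why observed individual differences overstate true ones, here is a small simulation sketch.  It is not the hierarchical model behind Panel C: it is a crude method-of-moments decomposition on made-up data, with one contrast instead of 6 conditions.  The seed, the 40 ms mean effect, and the 300 ms trial noise are assumptions borrowed from the RT example above.

```r
set.seed(2017)  # arbitrary seed, for reproducibility only

n_people <- 50
n_trials <- 50   # trials per condition
trial_sd <- 300  # assumed trial-by-trial sd in ms

# Hypothetical true effects: gamma of shape 2 with a mean of 40 ms.
true_effect <- rgamma(n_people, shape = 2, scale = 20)

# Observed effect per person: difference of two condition means.
observed_effect <- sapply(true_effect, function(mu) {
  mean(rnorm(n_trials, mean = mu, sd = trial_sd)) -
    mean(rnorm(n_trials, mean = 0, sd = trial_sd))
})

# Variance that trial noise alone adds to each person's contrast.
trial_noise_var <- 2 * trial_sd^2 / n_trials

# Moment estimate of the people variance, floored at zero.
people_var_hat <- max(0, var(observed_effect) - trial_noise_var)

print(round(c(sd_observed = sd(observed_effect),
              sd_true = sd(true_effect),
              sd_people_estimate = sqrt(people_var_hat)), 1))
```

The observed contrasts spread much more widely than the true effects that generated them, which is the gap between Panels B and C.  A hierarchical model does this decomposition more gracefully than the subtraction here.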

Of course, your mileage may differ, but probably not by that much.