## Monday, April 2, 2018

To those concerned about methodological practice in psychology,

You are our people. We are a tribe of kindred spirits wandering in the social science wilderness.

Two difficult issues in this wilderness are the lack of constraint from theories and the lack of constraint in data.  The typical theory predicts that "there is an effect in a certain direction," and the typical analysis is "yes, p<.05."   Even when hacked, we still haven't risked or learned much.

Many of you have focused on cleaning up the field by improving when we may claim there is an effect (or an invariance).  Your efforts in promoting preregistrations, awareness of QRPs, and more thoughtful statistical analysis are admirable and efficacious.

Nonetheless, the basic testing question---is there an effect---hasn't changed.  We are still playing low-theory, low-risk science.  So what to do?

We came up with what we consider the next question after asking "is there an effect?" It is, "does everybody?"

Take evaluative conditioning.  Participants judge the emotional valence of relatively neutral objects, say tables.  Some neutral objects are repeatedly paired with negative images (think bloody decapitated puppies), while others are repeatedly paired with positive images (think smiling children playing with adorable baby goats).   Not too surprisingly, tables are rated more positively when paired with smiling children than with decapitated puppies.  The next question is whether this evaluative conditioning effect is universal---does everybody plausibly show an evaluative conditioning effect in the same direction?

Some phenomena clearly hold universally.  If people can hear, then they respond faster to unexpected loud tones (startle) than unexpected soft tones.  Startle is a low-level, subcortical phenomenon, and nobody has a reverse startle where they respond faster to unexpected soft tones than loud ones.  Some phenomena clearly are not universal.  Handedness is a good example---most people can throw a ball further with their right hand; others throw further with their left hand.

Why does it matter?  If a phenomenon is universal, that is, it holds for everyone (or everyone in a subpopulation), we can seek a common explanation.  Further questions might even be metric---is there a metric relationship between the intensity of the sound and the intensity of the startle response?  If a phenomenon is not universal, the next questions are: Why do people differ?  What are the correlates of, say, left-handedness?  Why are some people left-handed?  For evaluative conditioning, a universal answer begs the question of whether the mechanism is the same as ordinary associative learning; variation begs the question of why some people would view a table more positively when paired with decapitated puppies.

Of course, not every question lends itself to "does everyone?"  Questions about preference, for example, will not be universal.  Areas where the does-everyone idea is fruitful include perception, cognition, and social cognition.

The hard part of this question is statistical.  In commonly sized samples, we always observe some people who reverse the effect.  But the real question is whether these reversals are due to sample noise (trials are noisy) or variation in true values.  So, one needs to ask the more nuanced question, "does everyone plausibly?" and use latent variable models, with true and observed values, to answer the question.  The hard part is deciding how to evaluate the evidence because the analyst is assessing whether an ordering holds for each individual simultaneously.
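Here is a minimal sketch of why observed reversals alone cannot settle the question.  It is only an illustration, not the model from our paper: the Gamma distribution of true effects, the 40 ms mean, the 300 ms trial noise, and the 100 trials per condition are all assumed values chosen for concreteness.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility

n_people <- 1000                 # assumed number of participants
n_trials <- 100                  # assumed trials per condition

# True effects (ms): Gamma-distributed, so every true effect is positive.
true_effect <- rgamma(n_people, shape = 2, scale = 20)

# Observed effect = true effect + sampling noise in a difference of two means.
noise_sd <- sqrt(2) * 300 / sqrt(n_trials)
observed <- rnorm(n_people, mean = true_effect, sd = noise_sd)

mean(true_effect < 0)            # proportion of true reversals: none by construction
mean(observed < 0)               # proportion of observed reversals: clearly above zero
```

Everybody in this simulated world has a true effect in the same direction, yet a fair share of people show an observed reversal.  Counting reversals in the data therefore says little about whether anybody truly reverses.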

We were very proud to present the "does everybody plausibly" question and solve the statistical problem.   Our first paper on it was published in Psychological Methods.

So, fellow methodological terrorists, research parasites, and allies.  Please consider the does-everyone question in your work.  If you need help with the statistics, we are here.

The Struggle Continues,
Jeff and Julia

## Saturday, March 3, 2018

### Hating on Sir Ronald? My Two Cents

This week, poor Sir Ronald A. Fisher took it on the chin in cyberspace.  Daniel Lakens, for example, writes on Twitter:

"For some, to me, incomprehensible, reason, most people seem not educated well on Neyman-Pearson statistics, and still use the ridiculous Fisherian interpretation of p-values"  (3/1)

And Uli Schimmack wrote on Facebook:

"the value of this article is to realize that Fisher was an egomaniac who hated it that Neyman improved on his work by adding type-II errors and power. So, he rejected this very sensible improvement with the consequence that many social scientists never learned about power and why it is important to conduct powerful studies.   Fisher -> Bem -> Train Wreck." (2/22)

So, I thought I would give poor Sir Ronald some love.  Rather than dig up quotes claiming poor Sir Ronald was misunderstood, let me see if I can provide a common sense example of why Fisher's view of the p-value remains intuitive and helpful, at least to the degree that a p-value can be intuitive and helpful.

Here is the setup.  Suppose two researchers, Researcher A and Researcher B, are each running an experiment, and fortuitously each used the same alpha, had the same sample size, and did the same pre-experimental power calculation.  Both get p-values below their alphas, which also happen to be the same, say .01.  Now, each rejects the null with the N-P safety net that had they each done their experiment over and over and over again, if the null were true, they would only make this rejection for 1% of the experiments.

Fine.  Except Researcher A's p-value was  .0099 and Researcher B's p-value was .00000001.  So, my question is whether you think Researcher B is entitled to make a stronger inferential statement than Researcher A?  If you read two papers with these p-values, could you form a judgment about which is more likely to have documented a true effect?  As I understand the state of things, if you think so, then you are using a Fisherian interpretation of the p-value.

In Neyman-Pearson testing, one starts with a specification of what the alternative would be if an effect were present.  This alternative is a point, say an effect of .4.  Then we design an experiment to have a sample large enough to detect this alternative effect with some power level  while maintaining a Type I error rate of some set value, usually .05.  And then, with our power informing our sample size, we collect data.  When finished, we compute a p-value and compare it to our Type I error rate.  If the p-value is below, we are justified in rejecting the null; otherwise, we are not.

As an upshot of N-P testing, the p-value is interpretable only as much as it falls on one side or the other of alpha.  That is it.  It is either less than alpha or greater than it.  The actual value is not informative for inference because it does not affect the long-term error rates the researcher is seeking to preserve.  Both Researcher A and B are entitled to the same inferential statements---both reject the null at .01---and that is it.  There is no sense that Researcher B's p-value is stronger or more likely to generalize.
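To make the N-P recipe concrete, here is a rough sketch in R.  The point alternative of .4 and the alpha levels come from the text; the target power of .80, the one-sample design, and the helper name np_decision are my assumptions for illustration only.

```r
# Design step: sample size needed to detect a point alternative of .4
# (in standard-deviation units) at alpha = .05 with an assumed power of .80.
design <- power.t.test(delta = .4, sd = 1, sig.level = .05, power = .80,
                       type = "one.sample", alternative = "two.sided")
n_needed <- ceiling(design$n)    # participants to recruit under these assumptions

# Decision step: only the side of alpha on which p falls matters.
# np_decision is a hypothetical helper, not a function from any package.
np_decision <- function(p, alpha = .05) if (p < alpha) "reject" else "retain"

np_decision(.0099, alpha = .01)       # "reject"
np_decision(.00000001, alpha = .01)   # "reject"
```

Researcher A and Researcher B get the identical output from the decision step.  Under N-P, that is all either of them is entitled to say.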

So, do you think Researcher B has a better case?  If so, you are straying from N-P testing.

The beauty of Fisher is that, in his view, the p-value is the strength of evidence against the null.  Smaller p-values always correspond to more evidence.  The ordinal relation between any two p-values, whether one is less than the other, can always be interpreted.

My sense is that this property makes intuitive sense to researchers.  Researcher B's rejection probably generalizes better than Researcher A's rejection.  And if you think so, I think you should be singing the praises of Sir Ronald.

The main difference between Fisher and N-P is whether you can interpret the numerical values of p-values as statements about a specific experiment.  For Fisher, you could.  For N-P, you cannot.  N-P viewed alpha as a statement about the procedure you were using, more specifically, about its average performance across a large collection of studies.  This difference is most transparent for confidence intervals, where the only reasonable interpretation is Neyman's procedural one (see Morey et al., 2016).

There are difficulties in the Fisherian interpretation---if one states evidence against the null, what is one stating evidence for?  Fisher understood that p-values overstate the evidence against the null which is why he pursued fiducial probability (see here for an entree into fiducial probability).

From my humble POV, Bayes gives us everything we want.  It is far less assumptive than specifying points used for computing power.  And we can interpret the evidence in data without recourse to a sequence of infinitely many experiments.  And we interpret it far more fully than the straitjacket dichotomy of "in the rejection region" or "not in the rejection region."

## Saturday, September 30, 2017

### Your Input Needed: Are There Theories that Predict Variabilities of Individual Differences?

Hi Folks,

Setup

Suppose I have a few  tasks, say Task A, Task B, Task C, etc.  These tasks can be any tasks, but for the sake of concreteness, let's assume each is a two-choice task that yields accuracy as a dependent measure with chance being .5 and ceiling being 1.  Suppose I choose task parameters so that each task yields a mean accuracy across participants of .75.

Example

Here is an example: Task A might be a perception task where I flash letters and mask them, and the participant has to decide whether the letter has a curved element, like in Q but not in X.  Task B might be a recognition memory task where the participant decides if items were previously studied or new.  By playing with the duration of the flash in the first task and the number of memoranda in the second task, I can set up the experiments so that the mean performance across people is near .75.

Question

If we calculate the variability across individuals, can you predict which task would be more variable?   The below figures show three cases.   Which would hold?  Why?  Obviously it depends on the tasks.  My question is whether there are any tasks for which you could predict the order.

Now, if we were running the above perception and memory tasks, people would be more variable in the perception task.  At 30 ms, some people will be at ceiling, others will be at floor, and the rest will be well distributed across the range.  At 100 items, most people in memory will be between 60% and 90% accurate.   I know of no theory however that addresses, predicts, or anticipates this degree of variability.

In psychophysics, we give each person unique parameters to keep accuracy controlled.  In cognition, we focus on mean levels rather than variability.  In individual differences, it is the correlation of people across tasks rather than the marginal variability in tasks that is of interest.

Questions Refined:

2. Do you know of any theory that addresses this question for any set of tasks?

3. My hunch is that the more complex or high-level a task is, the less variability.  Likewise, the more perceptual, simple, or low-level a task is, the more variability.  This seems a bit backwards in some sense, but it matches my observations as a cognitive person.  Does this hunch seem plausible?

## Wednesday, September 27, 2017

### The Justification of Sample Size (Un)Planning for Cognition and Perception

There apparently is keen interest in justifying sample sizes, especially in the reviewing of external grant applications. I am resistant, at least for cognitive and perceptual psychology.  I and other senior colleagues advise that 20 to 30 participants is usually sufficient for most studies in cognition and perception, so long as you have tens or hundreds of replicates per condition.  That advice, however, rankled many of my tweeps.  It especially rankles my comrades in the Methodological Revolution, and I hope you don't paint me one with too much flair.  It has been a Twitter day, and even worse, you rankled methodological-comrades have motivated me to write this blog post.

I provide common sense justification for sample size (un)planning for within-subject designs in cognition and perception.  You can cite my rule of thumb, below, and this blog in your grant proposals.  You can even ding people who don't cite it.  I'll PsyArxiv this post for your citational convenience soon, after you tell me where all the typos are.

Rouder's Rule of Thumb:  If you run within-subject designs in cognition and perception, you can often get high powered experiments with 20 to 30 people so long as they run about 100 trials per condition.

Setup.  You have $$N$$ people observing $$M$$ trials per condition.  How big should $$N$$ and $$M$$ be?

People.  Each person has a true effect, $$\mu_i$$.   This true effect would be known exactly if we had gazillions of trials per condition, that is, as $$M\rightarrow\infty$$.  We don't have gazillions, so we don't know it exactly.   But let's play for the moment like we do.  We will fix the situation subsequently.

Crux Move. The crux move here is a stipulation  that $$\mu_i \geq 0$$  for all people.  What does this mean?  It means that true scores all have the same sign.  For example, in the Stroop effect, it implies that nobody truly responds more quickly to incongruent than congruent colors.    In perception, it implies that nobody responds more quickly to faint tones than loud ones.  Let's call this stipulation The Stipulation.

Should You Make The Stipulation?  That depends on your experiment and paradigm.  It is quite reasonable in most cognition and perception paradigms.  After all, it is implausible that some people truly reverse Stroop or truly identify faint tones more quickly than loud ones.  In fact, in almost all cases I routinely deal with, in attention, memory, perception and cognition, this is a reasonable stipulation.  You may not be able to make or justify this stipulation.  For example, personal preferences may violate the stipulation.  So, whether you make or reject the stipulation will affect your (un)planning.  Overall, however, the stipulation should be attractive for most cases in cognition and perception.  If you make the stipulation, read on.  If not, the following (un)planning is not for you.

Individual Variation is Limited with The Stipulation: If you stipulate, then the gain is that the variability across people is limited.  When all effects are positive, and the mean is a certain size, then the variability cannot be too big, else the mean would be bigger.  This is the key to sample size (un)planning.  For the sake of argument, let's assume that true effects are distributed as below, either as the blue or green line.  I used a Gamma distribution with a shape of 2, and not only is the distribution all positive, the shape value of 2 is reasonable for a distribution of true effects.  And as a bonus, the shape of 2 gives the right tail a normal-like fall off.   Two curves are plotted, the blue one for a small effect; the green one for a large effect.

The limited-variation proposition is now on full display.  The blue curve with the smaller effect also has smaller variance.  The effect size, the ratio of the mean to the standard deviation, is the same for both curves!  It is $$\sqrt{2}$$, about 1.4, or the root of the shape.

Am I claiming that if you had a gazillion trials per condition, all experiments have an effect size of about 1.4?  Well yes, more or less to first order.  Once you make the stipulation, it is natural to use a scale family distribution like the gamma.  In this family the square root of the shape is the effect size, and reasonable shapes yield effect sizes between about 1 and 2.   The stipulation is really consequential as it stabilizes the true effect sizes!  This licenses unplanning.
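A quick check of the scale-family claim, as a sketch.  A Gamma distribution with a given shape and scale has mean shape × scale and standard deviation sqrt(shape) × scale, so the scale cancels in the ratio.  The two scale values below are made up; they stand in for the small-effect and large-effect curves.

```r
shape <- 2
scale <- c(small = 10, large = 40)   # hypothetical scales for the two curves

gamma_mean <- shape * scale          # mean of a Gamma(shape, scale)
gamma_sd   <- sqrt(shape) * scale    # standard deviation of a Gamma(shape, scale)

gamma_mean / gamma_sd                # sqrt(2) for both curves, whatever the scale

sqrt(1:4)                            # shapes 1 through 4 give effect sizes from 1 to 2
```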

Power and (Un)Planning: Experiments capable of  detecting effect sizes of say 1.4, the size in the figure, do not require many subjects.  Ten is more than enough.  For example, at a .05 level, $$N=10$$ yields a power of .98.    This ease of powering designs also holds for more realistic cases without a gazillion trials.  [R code: 1-pt(qt(.975,9),9,ncp=sqrt(2)*sqrt(10))].
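For those who want to try other values, the one-liner above can be wrapped in a small function.  This is just a convenience sketch; the name power_t is mine.  Like the one-liner, it counts only the upper rejection region of a two-sided test, which is where essentially all the power lives when the effect is positive.

```r
# Power of a two-sided one-sample (or paired) t-test at level alpha.
# d is the true effect size (mean over standard deviation), n the number of people.
power_t <- function(d, n, alpha = .05) {
  df   <- n - 1
  crit <- qt(1 - alpha / 2, df)            # critical value of the central t
  1 - pt(crit, df, ncp = d * sqrt(n))      # mass of the noncentral t beyond it
}

power_t(sqrt(2), 10)                       # the N = 10 case from the text, about .98
```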

Finite Trials:  Nobody wants to run a gazillion trials.  Let's slim $$M$$ down.  Let's take two cases, one for RT and another for accuracy:

For RT, we are searching for a 40 ms effect, and the residual variation is some 300 ms.  This 300 ms value is a good estimate for tasks that take about 750 ms to complete, which is typical for many paradigms.  The variability for $$M$$ trials is $$300/\sqrt{M}$$, and if we take a contrast, we need an additional $$\sqrt{2}$$ for the subtraction.  If $$M=100$$, then we expect variability of about  42 ms.  Combining this with the variability across participants from the above Gamma distribution yields a total variability of about 51 ms, or an effect size of 40/51 = .78.  Now, it doesn't take that many participants to power up this effect size value.  N=20 corresponds to a power of .91.  We can explore fewer trials per condition too.  If $$M=50$$, then the effective effect size is .60, and the power  at N=25 is .82, which is quite acceptable.

For accuracy, the calculations are as follows:  Suppose we are trying to detect the difference between .725 and .775, or a .05 difference in the middle of a two-alternative forced-choice range.  The standard deviation for observed proportions for $$M$$ trials is $$\sqrt{p(1-p)/M}$$.  For 100 trials, it is .043, and if we throw in the factor of $$\sqrt{2}$$ for the contrast, it is .061.  Combining this with the variability across participants from the above Gamma distribution yields a total variability of .070, or an effective effect size of .71.  N=25 corresponds to a power of .925.   Even for M=50, the power remains quite high at N=30, and is .80.

So, for accuracy and RT, somewhere between 20 and 30 participants and 50 to 100 trials per condition is quite sufficient.  And this holds so long as one is willing to make the stipulation, which, again, seems quite reasonable in most cases to me.
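Here is a sketch of the arithmetic behind the RT and accuracy numbers, so you can plug in your own values.  The function names effective_d and power_t are mine.  I am assuming that trial noise and true individual variation are independent, so their variances add, and that the Gamma standard deviation is the mean effect divided by the root of the shape.

```r
# Effective effect size with M trials per condition.
# effect:   mean true effect; trial_sd: standard deviation of a single trial.
effective_d <- function(effect, trial_sd, M, shape = 2) {
  noise_sd  <- sqrt(2) * trial_sd / sqrt(M)   # noise in one person's contrast
  people_sd <- effect / sqrt(shape)           # spread of true effects (Gamma)
  effect / sqrt(noise_sd^2 + people_sd^2)
}

# Power of a two-sided t-test, upper rejection region only.
power_t <- function(d, n, alpha = .05) {
  1 - pt(qt(1 - alpha / 2, n - 1), n - 1, ncp = d * sqrt(n))
}

# RT: 40 ms effect, 300 ms trial-to-trial standard deviation.
d_rt_100 <- effective_d(40, 300, M = 100)
d_rt_50  <- effective_d(40, 300, M = 50)

# Accuracy: .05 effect around p = .75, binomial trial standard deviation.
acc_sd    <- sqrt(.75 * .25)
d_acc_100 <- effective_d(.05, acc_sd, M = 100)
d_acc_50  <- effective_d(.05, acc_sd, M = 50)

round(c(d_rt_100, d_rt_50, d_acc_100, d_acc_50), 2)
round(c(power_t(d_rt_100, 20), power_t(d_rt_50, 25),
        power_t(d_acc_100, 25), power_t(d_acc_50, 30)), 2)
```

The first line of output gives the effective effect sizes and the second the corresponding powers, in the same order as they appear in the text.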

Gamma of Shape 2?  Because so many of you are as argumentative as you are smart, you are bound to complain about the Gamma distribution.  Why shape of 2.0?  Suppose the shape is lower?  And how would we know?  Let's go backwards.  The way we are to know what is a good shape (or true effect size across people) is by running a good number of people for lots of trials each.  We are pretty good at this in my lab, better than most. Our experiments are biased toward many trials with few conditions.  But this is not enough.  One needs an analytic method for decomposing trial-by-trial noise from population noise.  For this, we use hierarchical models.  The results are always a bit shocking.  There is usually a large degree of regularization, meaning that trial-by-trial noise dominates over people noise.   People are truly not much different from each other.  The following graph is typical of this finding.  In the experiment, there are 50 people each observing 50 trials in 6 conditions.  The remainder of the details are unimportant for this blog.  The data are pretty, and the means tell a great story (Panel A).  Panel B is the individual-level differences or contrasts among the conditions.  Each line is for a different individual.  These individual differences have 10s if not 100s of milliseconds in variation.  But when a reasonable hierarchical model is fit (Panel C), there is a great degree of regularization, indicating that almost all the noise comes from the trial-by-trial variability.  The size of the effect relative to the variability is stable and large!  We find that this type of result is repeated often and in many paradigms.  From looking at many such plots, it is my expert opinion that the gamma of shape 2 is wildly conservative, and a more defensible shape might be 3 or 4.  Hence, the power estimates here are if anything conservative too.

Of course, your mileage may differ, but probably by not that much.

## Wednesday, March 8, 2017

Please  help.  This is a real-life case of likely false conviction where your input can help.  A man is spending life in jail without parole for a murder he likely did not commit.

### Background:

• In 1969, Jane Mixer, a law student, was murdered.  The case went cold.
• The case was reopened 33 years later when crime-scene evidence was submitted to DNA analysis.
• The DNA yielded two matches; both matches were from samples that were analyzed in the same lab and at the same time as the crime-scene DNA analysis.  All three samples were analyzed in late 2001 and early 2002.
• One match was to John Ruelas.  Mr. Ruelas was 4 in 1969 and was excluded as a suspect.
• The other match was to Gary Leiterman.  Mr. Leiterman was 26 at the time.  He was convicted in 2005 and is serving life without parole.  His appeal was denied in 2007.
• There is no doubt that Mr. Leiterman's DNA was deposited on the crime scene sample.  The match is 176-trillion-to-1.
• The question is whether the DNA was deposited at the crime scene in 1969 or if there was a cross-contamination event in the lab in 2002.

### A Very Easy and Helpful PowerPoint:

• This case comes from John Wixted, a psychologist at UCSD
• He has made a detailed and convincing presentation.  Click here for The Power Point from John's website.
• John has helped to persuade the Innocence Clinic at the University of Michigan to investigate the Leiterman case.
• John and I are convinced this is an injustice.  We are working pro bono.

### Our Job:

• Our job is to make an educated assessment of Mr. Leiterman's guilt or innocence.  It would greatly help the Innocence Clinic to assess whether there is sufficient evidence to appeal.
• The jury heard that DNA is a trillion-to-1 accurate and there was only a very tiny chance of cross contamination.   Yet, we know these are the wrong conditional probabilities to compute.
• Consider the two hypotheses above that Leiterman's DNA was deposited at the crime scene or, alternatively, that it was deposited in the lab through cross contamination.  Conditional on the match, compute posterior probabilities.

### My Analysis:

I have done my own analyses and typeset them.  But reasoning is tricky, and I would like some backup.  It is just too important to mess up.  Can you try your own analysis?  Then we can decide what is best.

You will need more information.  I used  the following specifications.  Write me if you want more:

• John and I assumed 2.5M people are possible suspects in 1969.  It is a good guess based on a population estimate of the Detroit metro area.
• The lab processes 12,000 samples a year.  The time period the DNA overlapped can be assumed to be 6 months, that is, 6,000 other samples could be cross contaminated with Mixer or Leiterman.
• The known rate of DNA cross-contamination is 1-in-1500.  That is, each time they do a mouth swab from one person, they end up with two or more DNA profiles with probability of 1/1500. We assume this rate holds for unknowable cross-contamination such as that in processing a crime scene.
• The probability of getting usable DNA from a 33-year-old sample is 1/2.

Jeff's answer is at GitHub, https://github.com/rouderj/leiterman

Thank you,
Jeff Rouder
John Wixted

## Tuesday, January 3, 2017

### Why Is It So Hard To Organize My Lab?

It is clear I need to pay more attention to the organization of my lab.   Organization is a challenge to me, it causes much apprehension, and seems to be a chronic need in all aspects of my life.  Let's focus on the lab.

### Parameters:

1. Minimizing mistakes.  There is no upside in analyzing the wrong data set, using the wrong parameters, including the wrong figure, or reporting the wrong statistics.  These mistakes are in my view unacceptable in science.  Minimizing them is the highest priority.

2.  Knowing what we did.  Some time in the future, way in the future, we or someone else will visit what we did.  Can we figure out what happened?  I'd like to plan on the time scale of decades rather than months or years.

3.  Planning for human fallibility.  Some people think science is for those who are meticulous.  Then count me out.  I am messy, careless, and chronically clueless.  A good organization anticipates human mistakes.

4. Easy to learn.  I collaborate with a lot of people.  The organization structure should be fairly intuitive and self-explanatory.

### What we do:

1. Data acquisition and curation.  I think we have this wired.  We use a born-open data model where data are collected, logged, versioned, and uploaded nightly to GitHub automatically.  We also automatically populate local mysql tables including information on subjects and sessions, and have additional tables for experiments, experimenters, computers, and IRB info.  We even have an adverse-events table to record and address any flaws in the organizational system.  The basic unit of organization is the dataset, and it works well.

2. Outputs.  We have the usual outputs: papers, talks, grant proposals, dissertations, etc.  Some are collaborative; some are individual; some are important; some go nowhere.  The basic unit here is pretty obvious---we know exactly where each paper, talk, dissertation, etc., begins and ends.

3. Value-added endeavors.  A value-added endeavor (VAE) is a small unit of intellectual contribution.  It could be a proof, a simulation, a specific analysis, or (on occasion) a verbal argument.  VAEs, as important as they are, are ill-defined in size and scope.  And it is sometimes unclear (perhaps arbitrary) where one ends and another begins.

### The Current System, The Good:

Perhaps the strongest element of my lab's organization is that we use really good tools for open and high-integrity science.  Pretty much everything is script based, and scripts are in many ways self-documenting, especially when compared to menu-driven alternatives.  Our analyses are done in R, our papers in LaTeX and Markdown, and the two are integrated with RMarkdown and knitr.  Moreover, we use a local git server and curate all development in repositories.

### The Current System, The Bad and Ugly:

We use projects as our basic organization unit.  Projects are basically repositories on our local git server.  They contain ad-hoc organizations of files.  But what a project encompasses and how it is organized is ad-hoc, disordered, unstandardized, and idiosyncratic.   Here are the issues:

1. There is no natural relation between projects and the three things we do: acquire and curate data, produce outputs, and produce VAEs.  One VAE might serve several different papers; likewise, one dataset might serve several different papers.  Papers and talks encompass several different experiments (usually) and VAEs.

2. Projects have no systematic relations to VAEs, outputs or datasets.  This is why I am unhappy.  Does a project mean one paper?  Does it mean one analysis?  One development?  A collection of related papers?  A paper and all talks and the supporting dissertation?  We have done all of these.

### Help

What do you do?  Are there good standards?  What should be the basic organization unit?  Stay with project?  I am thinking about a strict output model where every output is a repository as the main organizing unit.  The problem is what-to-do about VAEs that span several outputs.  Say I have an analysis or graph that is common for a paper, a dissertation, and a talk.  I don't think I want this VAE repeated in three places.  I don't want symbolic links or hard codings because it makes it difficult to publicly archive.  That is why projects were so handy.   VAEs themselves are too small and too ill-defined to be organizing units.  Ideas?

## Friday, October 28, 2016

### A Probability Riddle

Some flu strains can jump from people to birds, and, perhaps, vice versa.

Suppose $$A$$ is the event that there is a flu outbreak in a certain community, say in the next month, and let $$P(A)$$ denote the probability of this event occurring.    Suppose $$B$$ is the event that there is a flu outbreak among chickens in the same community in the same time frame, with $$P(B)$$ being the probability of this event as well.

Now let's focus in on the relative flu risk to humans from chickens.  Let's define this risk as
$R_h=\frac{P(A|B)}{P(A)},$
If the flu strain jumps from chickens to people, then the conditional probability, $$P(A|B)$$, may well be higher than the base rate, $$P(A)$$, and the risk to people will be greater than 1.0.

Now, if you are one of those animal-lover types, you might worry about the relative flu risk to chickens from people.  It is:
$R_c=\frac{P(B|A)}{P(B)}$

At this point, you might have the intuition that there is no good reason to think $$R_h$$ would be the same value as $$R_c$$.  You might think that the relative risk is a function of say the virology and biology of chickens, people, and viruses.

And you would be wrong.  While it may be that chickens and people have different base rates and different conditions, it must be that $$R_h=R_c$$.  It is a matter of math rather than biology or virology.

To see the math, let's start with Bayes' rule, which follows from the Law of Conditional Probability:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$

Dividing both sides by $$P(A)$$, we arrive at
$\frac{P(A|B)}{P(A)} = \frac{P(B|A)}{P(B)} .$

Now, note that the left-hand side is the risk to people and the right hand side is the risk to chickens.
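If you don't trust the algebra, here is a numerical check.  The three probabilities are made up for illustration; any valid joint probability gives the same conclusion, because both ratios reduce to the joint probability divided by the product of $$P(A)$$ and $$P(B)$$.

```r
p_A  <- .10   # hypothetical P(A): flu outbreak among people
p_B  <- .02   # hypothetical P(B): flu outbreak among chickens
p_AB <- .01   # hypothetical P(A and B): both outbreaks occur

p_A_given_B <- p_AB / p_B        # P(A|B)
p_B_given_A <- p_AB / p_A        # P(B|A)

R_h <- p_A_given_B / p_A         # relative risk to people from chickens
R_c <- p_B_given_A / p_B         # relative risk to chickens from people

c(R_h = R_h, R_c = R_c)          # the same value twice
all.equal(R_h, R_c)              # TRUE
```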

I find the fact that these risk ratios are preserved to be a bit counterintuitive.  It is part of what makes conditional probability hard.