Monday, November 19, 2018

Preregistration: Try it (Or not)

So, as the Statistical War and Tone War are in a lull, the Preregistration conflict has flared up yet again.   A few thoughts on the airplane back home from Psychonomics.


A. To be honest, it has taken me quite a long time to sort out my thoughts on preregistration.   I am not telling you to preregister or not.  Moreover, how I read your work is not dependent on whether you preregistered or not.  Perhaps you might find my thoughts helpful in your decision; perhaps not.

B. I don't believe in the usefulness of the exploratory/confirmatory distinction.  All of my research is motivated by some theoretical issue (so it is not exploratory) and I am always open to alternatives that I have not considered (so it is not confirmatory).  Arguments that rely on the exploratory/confirmatory distinction are not persuasive to me, and I will not be using them here (or elsewhere).

Why I Preregistered, the story:

I used preregistration because my students forced me to.  I found the experience rewarding and will preregister again.  Perhaps the strongest argument for preregistration is that it may clarify the researcher's thinking before seeing the data.   I think most of us can agree that writing is hard, and one of the reasons it is hard is that it forces you to clarify your thinking on things.   Preregistration in some sense provides the opportunity for that type of clarification before the data are collected.    As we wrote the preregistration, my team realized we hadn't thought enough about what type of models could instantiate one of the theoretical alternatives.  So, we made a set of additional model specifications before seeing the data.  That was quite helpful.

Why I don't Take the Preregistration Too Seriously:

I feel no hesitation to break my preregistration.  In fact, I do not know if we did or did not break our preregistration because I never went back and read it!  I don't care if we did or not, to be honest.   I actually think this is not such a bad idea.  Here is why:

As Shiffrin notes, science requires good judgment.  In fact, being open-minded, flexible, and judicious are probably more important characteristics than being smart, industrious, or meticulous.  Now, what I like about preregistration is that it summons me to provide my best judgment at a particular point in time.  But, as new information comes in, including data, I need to exercise good judgment yet again.  Hopefully, the previous efforts will make the current efforts easier and more successful.  But that is where it stops.   I will have no contract with my preregistration; instead I will use good judgment.  Preregistration is used to improve the pre-data steps, and hopefully that will improve post-data steps too.

So if you preregister, consider the following:

1. Try not to substitute your preregistration for your best judgment.  You can add value judiciously.  Don't trade in what you know now for what you knew then.

2. Don't forget to have a conversation with your data.  Nature only whispers; you need to communicate with her softly and subtly.  You gently ask one thing, she whispers something else.  And you go back and forth.  Please do not downgrade this conversation because it might be the most important thing you do with your data.

If You Want Others To Preregister:

Tell your story.   Maybe in detail.  What might you have done differently?

Saturday, September 1, 2018

Making Mistakes

Oh the irony.

I made a potentially dreadful mistake last month.  I zipped up and submitted the wrong version of a manuscript.  It was the final version for typesetting.  It could have been a fucking disaster---imagine if the original submission rather than the revision had been published.  I didn't catch the mistake; the amazing Michele Nathan at Psychological Science did.

And the manuscript that I made this mistake on was itself about minimizing mistakes (Click Here For The Paper).  That's right, we made a dreadful mistake in processing a paper for minimizing such mistakes.   Boy, do I feel like a fraud.

One solution is to berate myself, lose sleep, hide in shame, and promise I will be more careful.  This solution, of course, is useless because (a) there are much better things to feel shame about and (b) no matter my intentions, I will not be more careful.

The other solution is to do what we say in the paper.  We use the principles of High Reliability Organizations to deal with the mistake.

Step 1: Mistakes Happen, Record Them

Yeah, it happened.  We know mistakes happen and we are prepared for them.  So, the first thing I did was open up a little form in our database called "Adverse Events."  All labs should have a procedure for documenting mistakes.  In my lab, we do all adverse events together so everyone is in on the problem and the resolution.

Step 2: What Was the Mistake

I submitted directory sub2 instead of directory rev1.  Directory sub2 was the second version and was originally submitted.  Directory rev1 was the third, and it was in response to reviews.  Obviously, with one second of thought, most of us would know that rev1 supersedes sub2.  We could even check the dates and the log, which were accurate.  But, I happen to be one of the least detail-oriented people on the planet.  So, I must have seen the 2 in sub2 and zipped and submitted that directory.

Step 3: The Cause

It is obvious the problem here is a deficient set of naming conventions for files and directories.  We do a pretty good job of naming files in the lab.  We have git repositories, and all repos have the same few directories: dev, share, papers, presentations, grants.  Also, we have good mysql logs of when we submit papers.  So tracking down the mistake and why it occurred was easy.

What we did not have was a firm naming convention for versions within the directory "papers".  Clearly, we need to standardize this convention.

Step 4:  Resolution

In our system there are two workable solutions.  The first is to use successive version numbers on a single name, e.g., minMistakes.1 might have been the first version, minMistakes.2 might have been the second, and so on.  Here, instead of calling directories sub2 and rev1, we number successively and use a more informative root name.  We lose where we are in the process, though.  Fortunately, we keep pretty good records of things that we do in our database, so there we could record that minMistakes.2 was the first version submitted, etc.  An even better solution is to use git to sort the major versions.  We probably should not be changing directories at all; we should just use git tags for major versions and process notes.  That is what I am going to do from now on.

Here is our adverse event report now in our database.

Anyways, with the new conventions, I most likely won't make that mistake again.

Monday, August 27, 2018

Are there human universals in task performance? How might we know?

Science has traditionally proceeded by understanding constraint among variables. In a series of new papers, Julia Haaf and I ask if there are human universals in perception, cognition, and performance.   Here is an outline of our development.

What Do We Mean By "Human Universal"?

Let's take the Stroop task as an example. Are there any human universals in the Stroop task?  In this task, people name the color of compatible color words (e.g., RED) faster than incompatible ones (e.g., GREEN).   We will call this a positive Stroop effect, and, from many, many experiments, we know that on average, people have positive Stroop effects.  But what about each individual?   A good candidate for a human universal is that each individual shows a true positive effect.  And, conversely, no individual has a true negative Stroop effect where incongruent colors are named faster than congruent ones.  We call this the "Does Everyone" question.  Does everyone have a true nonnegative Stroop effect?  We propose a universal order-constraint on true performance.

The above figure shows observed Stroop effects across many individuals (data courtesy of Claudia von Bastian).  As can be seen, 8 of 121 people have negative Stroop effects.  But that doesn't mean the "Does Everyone" condition is violated.  The negative-going observations might be due to sample noise.  To help calibrate, we added individual 95% CIs.  Just by looking at the figure, it seems plausible from these CIs that indeed, everybody Stroops.
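
To see how negative observed effects can arise from noise alone, here is a minimal sketch in R. It is not the analysis behind the figure: the number of trials, the trial-level noise, the baseline of 700 ms, and the gamma distribution for true effects are all assumed values, and the names (true.effect, sim.person, obs) are made up for illustration.

```r
# Illustrative simulation (all settings are assumptions, not the real data):
# every person has a strictly positive true effect, yet some observed
# effects come out negative because of trial-level noise.
set.seed(1)
n.people <- 121     # matches the number of people in the figure
n.trials <- 50      # trials per person per condition (assumed)
trial.sd <- 200     # trial-to-trial noise in ms (assumed)

# True effects in ms: gamma-distributed, so all are positive (assumed shape)
true.effect <- rgamma(n.people, shape = 4, scale = 15)

# Simulate one person's trials and summarize the observed effect with a 95% CI
sim.person <- function(theta) {
  congruent   <- rnorm(n.trials, mean = 700, sd = trial.sd)
  incongruent <- rnorm(n.trials, mean = 700 + theta, sd = trial.sd)
  tt <- t.test(incongruent, congruent)   # Welch two-sample t-test
  c(effect = unname(tt$estimate[1] - tt$estimate[2]),
    lower = tt$conf.int[1], upper = tt$conf.int[2])
}
obs <- t(sapply(true.effect, sim.person))   # one row per person

all(true.effect > 0)                        # TRUE by construction
sum(obs[, "effect"] < 0)                    # negative observed effects from noise alone
mean(obs[, "lower"] <= true.effect & true.effect <= obs[, "upper"])  # CI coverage
```

With settings like these, a handful of people typically show negative observed effects even though nobody has a negative true effect.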

"True effect" in this context means in the limit of many trials, or as the CIs become vanishingly small.  The question is "if you had a really, really large number of trials for each person in the congruent and incongruent conditions, then would each and every individual have a positive effect?"

Is The "Does Everyone" Question Interesting?

We would like to hear your opinion here.  In exchange, we offer our own.  

We think the "does everyone" question is fascinating.   We can think of some domains in which it seemingly holds, say Stroop or perception (nobody identifies quite dim words more quickly than modestly brighter ones).  Another domain is priming---it is hard to imagine that there are people who respond faster to DOCTOR following PLATE than to DOCTOR following NURSE.     And then there are other domains where it is assuredly violated including handedness (some people truly do throw a ball farther with their left hand) and preference (as strange as it may seem, some people truly prefer sweetened tea to unsweetened tea).    Where are the bounds?  Why?

The "does everyone" question seemingly has theoretical ramifications too.  An affirmative answer means that the processes underlying the task may be common across all people---that is, we may all access semantic meaning from text the same way.  A negative answer means that there may be variability in the processes and strategies people bring to bear.

Answering The Does Everyone Question.

The "Does Everyone" question is surprisingly hard to answer.  It is more than just classifying each individual as positive, zero, or negative.  It is a statement about the global configuration of effects.   We have seen cases where we can say with high confidence that at least one person has a negative true effect without being able to say with high confidence who these people are.  This happens when there are too many people with slightly negative effects.  We have been working hard over the last few years to develop the methodology to answer the question.  For the plot above, yes, there is evidence from our developments that everybody Stroops positive.

Our Requests:

  • Comment please.  Honestly, we wrote the below papers, but as far as we can tell, they have yet to be noticed.  
  • Is the "Does Everyone" question interesting to you?  Perhaps not.  Perhaps you know the answer for your domain.  
  • Can you think of a cool domain where whether everyone does something the same way is really theoretically important?  We are looking for collaborators!  (You can email or answer publicly.)

Some of Our Papers on the Topic:

  1. Haaf & Rouder (2017), Developing Constraint in Bayesian Mixed Models.  Psychological Methods, 22, p 779-.  In this poorly entitled paper, we introduce the Does Everyone question and provide a Bayes factor approach for answering it.  We apply the approach to several data sets with simple inhibition tasks (Stroop, Simon, Flanker).   The bottom line is that when we can see overall effects, we often see evidence that everyone has the effect in the same direction.
  2. Haaf & Rouder (in press), Some do and some don’t? Accounting for variability of individual difference structures.  Psychonomic Bulletin & Review.  We also include a mixture model where some people have identically no effect while others have an order constrained effect.
  3. Thiele, Haaf, and Rouder (2017) Is there variation across individuals in processing? Bayesian analysis for systems factorial technology.  Journal of Mathematical Psychology,  81, p40-54.  A neat application to systems factorial technology (SFT).  SFT is a technique for telling whether people are processing different stimulus dimensions in serial, parallel, or co-actively by looking at the direction of a specific interaction contrast.  We ask whether all people have the same direction of the interaction contrast.

Thursday, May 24, 2018

Do you study individual differences? A Challenge

Can you solve the following problem that I think is hard, fun, and important?  I cannot.

The problem is that of characterizing individual differences for individuals performing cognitive tasks.  Each task has a baseline and an experimental condition, and the difference, the effect, is the target of interest.   Each person performs a great number of trials in each condition in each task, and the outcome on each trial is quite variable (necessitating the multiple trials).  There are a number of tasks, and the goal is to estimate the correlation matrix among the task effects.  That is, if a person has a large effect in Task 1, are they more likely to have a large effect in Task 2?

Let's try an experiment with 200 people, 6 tasks, and 150 replicates per task per condition with simulated data.  When you factor in the two conditions, there are 360,000 observations in total.  Our goal is to estimate the 15 unique correlation coefficients in the 6-by-6 correlation matrix.  Note we have what seems to be a lot of data, 360K observations, for just 15 critical parameters.   Seems easy.

Unfortunately, the problem is seemingly more difficult, at least for the settings which I think are realistic for priming and context tasks, than one might think.

Here is code to make what I consider realistic data:


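I=200 #ppl
J=6 #tasks
K=2 #conditions
L=150 # reps

# ASSUMPTION: the values .8 and .4 are illustrative placeholders that follow the pattern described in the text
myCor <- diag(J) # true correlations among task effects
myCor[1,2] <- myCor[3,4] <- myCor[5,6] <- .8 # tightly correlated pairs (assumed value)
myCor[1,3] <- myCor[1,4] <- myCor[2,3] <- myCor[2,4] <- .4 # moderate correlations (assumed value)
myCor[lower.tri(myCor)]  <- t(myCor)[lower.tri(myCor)] # copy the upper triangle into the lower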


When you create the data, you are trying to estimate the correlation matrix of t.theta, the true effects per person per task.
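
A minimal sketch of one way to generate such data follows. Only I, J, K, L and the names myCor and t.theta come from the text; the correlation values (.8 and .4), the effect mean and standard deviation, the baseline distribution, the trial noise, and the names t.alpha, dat, eff.mean, eff.sd, and noise.sd are assumptions chosen for illustration.

```r
# A sketch of a data generator; every numeric setting below is an assumption.
set.seed(123)
I <- 200  # people
J <- 6    # tasks
K <- 2    # conditions
L <- 150  # replicates per person per task per condition

# Size of the problem: total observations and unique correlations
c(observations = I * J * K * L, correlations = choose(J, 2))

# Assumed true correlation matrix with the pattern described in the text
myCor <- diag(J)
myCor[1, 2] <- myCor[3, 4] <- myCor[5, 6] <- 0.8                 # tight pairs
myCor[1, 3] <- myCor[1, 4] <- myCor[2, 3] <- myCor[2, 4] <- 0.4  # moderate
myCor[lower.tri(myCor)] <- t(myCor)[lower.tri(myCor)]

# True effects per person per task: multivariate normal via a Cholesky factor
eff.mean <- 0.05   # mean true effect in seconds (assumed)
eff.sd   <- 0.025  # sd of true effects in seconds (assumed)
z <- matrix(rnorm(I * J), I, J)
t.theta <- eff.mean + eff.sd * z %*% chol(myCor)

# Trial-level data in long format, one row per observation
dat <- expand.grid(rep = 1:L, cond = 1:K, task = 1:J, sub = 1:I)
t.alpha <- matrix(rnorm(I * J, 0.7, 0.1), I, J)  # baseline speed (assumed)
noise.sd <- 0.2                                  # trial noise in seconds (assumed)
idx <- cbind(dat$sub, dat$task)                  # (person, task) index per row
dat$y <- t.alpha[idx] + (dat$cond - 1) * t.theta[idx] +
  rnorm(nrow(dat), 0, noise.sd)
```

The target of estimation is then cor(t.theta), which is known here only because the data are simulated.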


You will notice that Tasks 1 and 2 are highly correlated, Tasks 3 and 4 are highly correlated, and Tasks 5 and 6 are highly correlated.  And there is moderate correlation across Tasks 1 and 3, 1 and 4, 2 and 3, and 2 and 4.  The rest are in the weeds.  Can you estimate that pattern?

If you just take means as estimators, you are swamped by measurement error.  The tight correlation among the pairs of tasks is greatly attenuated.  Here is the code:
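
A minimal sketch of that naive estimator, under assumed settings: the correlation values, effect and noise standard deviations, and the names mu, y, cell.mean, obs.effect, and reliability are all illustrative choices, not the original listing.

```r
# Naive estimator: per-person, per-task difference of condition means.
# All numeric settings are assumptions chosen for illustration.
set.seed(456)
I <- 200; J <- 6; K <- 2; L <- 150
myCor <- diag(J)
myCor[1, 2] <- myCor[3, 4] <- myCor[5, 6] <- 0.8
myCor[1, 3] <- myCor[1, 4] <- myCor[2, 3] <- myCor[2, 4] <- 0.4
myCor[lower.tri(myCor)] <- t(myCor)[lower.tri(myCor)]
eff.sd <- 0.025; noise.sd <- 0.2   # seconds (assumed)
t.theta <- 0.05 + eff.sd * matrix(rnorm(I * J), I, J) %*% chol(myCor)

# True cell means as an I x J x K array; replicates go in the 4th dimension
mu <- array(0.7, c(I, J, K))
mu[, , 2] <- mu[, , 2] + t.theta
y <- array(rep(mu, L) + rnorm(I * J * K * L, 0, noise.sd), c(I, J, K, L))

# Sample mean per cell, then the observed effect per person per task
cell.mean <- apply(y, 1:3, mean)
obs.effect <- cell.mean[, , 2] - cell.mean[, , 1]

round(cor(t.theta), 2)      # correlations among true effects
round(cor(obs.effect), 2)   # correlations among sample effects (attenuated)

# Expected attenuation when both tasks share the same reliability:
# a sample effect has variance eff.sd^2 + 2 * noise.sd^2 / L
reliability <- eff.sd^2 / (eff.sd^2 + 2 * noise.sd^2 / L)
reliability * myCor[1, 2]   # approximate expected naive correlation for Tasks 1 and 2
```

Under these assumed settings the reliability of a sample effect is only a little above one half, so a true correlation of .8 is expected to show up as roughly .43 in the sample effects.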


I guess I am wondering if there is any way to recover the correlations with acceptable precision.  Perhaps they are forever lost to measurement noise.  I certainly cannot with my home-baked, roll-your-own Bayesian models.  If I tune the priors, I can get high correlations where I am supposed to, but the other ones are too variable to be useful.  So either the problem is not so tractable or my models/methods are inferior.  I can share what I did if you wish.

So, can you recover the 15 correlations with acceptable precision?  I appreciate your help and insight.


Monday, April 2, 2018

Advancing The Research Question

To those concerned about methodological practice in psychology,

You are our people. We are a tribe of kindred spirits wandering in the social science wilderness.

Two difficult issues in this wilderness are the lack of constraint from theories and the lack of constraint in data.  The typical theory predicts that "there is an effect in a certain direction," and the typical analysis is "yes, p<.05."   Even when hacked, we still haven't risked or learned much.   

Many of you have focused on cleaning up the field by improving when we may claim there is an effect (or an invariance).  Your efforts in promoting preregistrations, awareness of QRPs, and more thoughtful statistical analysis are admirable and efficacious.

Nonetheless, the basic testing question---is there an effect---hasn't changed.  We are still playing low-theory, low-risk science.  So what to do?

We came up with what we consider the next question after asking "is there an effect?" It is, "does everybody?"  

Take evaluative conditioning.  Participants judge the emotional valence of relatively neutral objects, say tables.  Some neutral objects are repeatedly paired with negative images (think bloody decapitated puppies), while others are repeatedly paired with positive images (think smiling children playing with adorable baby goats).   Not too surprisingly, tables are rated more positively when paired with smiling children than with decapitated puppies.  The next question is whether this evaluative conditioning effect is universal---does everybody plausibly show an evaluative conditioning effect in the same direction?

Some phenomena clearly hold universally.  If people can hear, then they respond faster to unexpected loud tones (startle) than unexpected soft tones.  Startle is a low-level, subcortical phenomenon, and nobody has a reverse startle where they respond faster to unexpected soft tones than loud ones.  Some phenomena clearly are not universal.  Handedness is a good example---most people can throw a ball further with their right hand; others throw further with their left hand.

Why does it matter?  If a phenomenon is universal, that is, it holds for everyone (or everyone in a subpopulation), we can seek a common explanation.  Further questions might even be metric---is there a metric relationship between the intensity of the sound and the intensity of the startle response?  If a phenomenon is not universal, the next questions are: Why do people differ?  What are the correlates of, say, left-handedness?  Why are some people left-handed?  For evaluative conditioning, a universal answer raises the question of whether the mechanism is the same as ordinary associative learning; variation raises the question of why some people would view a table more positively when paired with decapitated puppies.

Of course, not every question lends itself to "does everyone?"  Questions about preference, for example, will not be universal.  Areas where the does-everyone idea is fruitful include perception, cognition, and social cognition.

The hard part of this question is statistical.  In commonly sized samples, we always observe some people who reverse the effect.  But the real question is whether these reversals are due to sample noise (trials are noisy) or variation in true values.  So, one needs to ask the more nuanced question, "does everyone plausibly?" and use latent variable models, with true and observed values, to answer the question.  The hard part is deciding how to evaluate the evidence because the analyst is assessing whether an ordering holds for each individual simultaneously.  

We were very proud to present the "does everybody plausibly" question and solve the statistical problem.   Our first paper on it was published in Psychological Methods.  

So, fellow methodological terrorists, research parasites, and allies.  Please consider the does-everyone question in your work.  If you need help with the statistics, we are here.

The Struggle Continues,
Jeff and Julia

Saturday, March 3, 2018

Hating on Sir Ronald? My Two Cents

This week, poor Sir Ronald A. Fisher took it on the chin in cyberspace.  Daniel Lakens, for example, writes on Twitter:

"For some, to me, incomprehensible, reason, most people seem not educated well on Neyman-Pearson statistics, and still use the ridiculous Fisherian interpretation of p-values"  (3/1)

And Uli Schimmack wrote on Facebook:

"the value of this article is to realize that Fisher was an egomaniac who hated it that Neyman improved on his work by adding type-II errors and power. So, he rejected this very sensible improvement with the consequence that many social scientists never learned about power and why it is important to conduct powerful studies.   Fisher -> Bem -> Train Wreck." (2/22)

So, I thought I would give poor Sir Ronald some love.  Rather than dig up quotes claiming poor Sir Ronald was misunderstood, let me see if I can provide a common sense example of why Fisher's view of the p-value remains intuitive and helpful, at least to the degree that a p-value can be intuitive and helpful.

Here is the setup.  Suppose two researchers, Researcher A and Researcher B, are each running an experiment, and fortuitously each used the same alpha, had the same sample size, and did the same pre-experimental power calculation.  Both get p-values below their shared alpha, say .01.  Now, each rejects the null with the N-P safety net that had they each done their experiment over and over and over again, if the null were true, they would only make this rejection for 1% of the experiments.

Fine.  Except Researcher A's p-value was  .0099 and Researcher B's p-value was .00000001.  So, my question is whether you think Researcher B is entitled to make a stronger inferential statement than Researcher A?  If you read two papers with these p-values, could you form a judgment about which is more likely to have documented a true effect?  As I understand the state of things, if you think so, then you are using a Fisherian interpretation of the p-value.

In Neyman-Pearson testing, one starts with a specification of what the alternative would be if an effect were present.  This alternative is a point, say an effect of .4.  Then we design an experiment to have a sample large enough to detect this alternative effect with some power level while maintaining a Type I error rate of some set value, usually .05.  And then, with our power informing our sample size, we collect data.  When finished, we compute a p-value and compare it to our Type I error rate.  If the p-value is below, we are justified in rejecting the null; otherwise, we are not.
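
The design steps above can be sketched in R. The effect of .4 and the Type I error rate of .05 come from the paragraph; the power of .80, the two-sample t-test, the standard deviation of 1, and the helper name np.decision are assumptions made for illustration.

```r
# Neyman-Pearson style design sketch. The effect of .4 and alpha of .05 come
# from the text; power = .80, a two-sample t-test, and sd = 1 are assumed.
design <- power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.80,
                       type = "two.sample", alternative = "two.sided")
n.per.group <- ceiling(design$n)   # sample size fixed before data collection

# After data collection, the p-value is used only for the comparison to alpha
# (np.decision is a hypothetical helper, not part of any package)
np.decision <- function(p, alpha = 0.05) {
  if (p < alpha) "reject the null" else "do not reject the null"
}
np.decision(0.03)
```

Note that nothing in np.decision depends on how far below alpha the p-value falls.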

As an upshot of N-P testing, the p-value is interpretable only as much as it falls on one side or the other of alpha.  That is it.  It is either less than alpha or greater than it.  The actual value is not informative for inference because it does not affect the long-term error rates the researcher is seeking to preserve.  Both Researcher A and B are entitled to the same inferential statements---both reject the null at .01---and that is it.  There is no sense that Researcher B's p-value is stronger or more likely to generalize.

So, do you think Researcher B has a better case?  If so, you are straying from N-P testing.

The beauty of Fisher is that, in his view, the p-value is the strength of evidence against the null.  Smaller p-values always correspond to more evidence.  The ordinal relation between any two p-values, whether one is less than the other, can always be interpreted.
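
The contrast can be made concrete with the two p-values from the setup. This is only an illustrative sketch: converting each p-value to a two-sided z-score assumes a z-test, which the example does not specify.

```r
alpha <- 0.01
p <- c(A = 0.0099, B = 0.00000001)   # the two p-values from the setup

# Neyman-Pearson: only the side of alpha matters, so both decisions are identical
p < alpha

# Fisher: the size of p is read as strength of evidence. One way to see how
# different the two results are is the two-sided z-score each p-value implies
# (assuming, for illustration, a z-test).
z <- qnorm(p / 2, lower.tail = FALSE)
round(z, 2)
```

Both researchers reject at .01, yet the implied z-scores are roughly 2.6 and 5.7.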

My sense is that this property makes intuitive sense to researchers.  Researcher B's rejection probably generalizes better than Researcher A's rejection.  And if you think so, I think you should be singing the praises of Sir Ronald.

The main difference between Fisher and N-P is whether you can interpret the numerical values of p-values as statements about a specific experiment.  For Fisher, you could.  For N-P, you cannot.  N-P viewed alpha as a statement about the procedure you were using, more specifically, about its average performance across a large collection of studies.  This difference is most transparent for confidence intervals, where the only reasonable interpretation is Neyman's procedural one (see Morey et al., 2016, paper here).

There are difficulties in the Fisherian interpretation---if one states evidence against the null, what is one stating evidence for?  Fisher understood that p-values overstate the evidence against the null which is why he pursued fiducial probability (see here for an entree into fiducial probability).

From my humble POV, Bayes gives us everything we want.  It is far less assumptive than specifying points used for computing power.  And we can interpret the evidence in data without recourse to a sequence of infinitely many experiments.  And we interpret it far more fully than the straitjacket dichotomy of "in the rejection region" or "not in the rejection region."