Thursday, April 30, 2015

Brainstorming: What Is A Good Data-Sharing Policy?

This blog post is coauthored with Richard Morey.

Many talented and generous people are working hard to change the culture of psychological science for the better.  Key in this change is the call for transparency and openness in the production, transmission, and consumption of knowledge in our field.  And change is happening---the field is becoming more transparent.   It is delightful to see.

To keep up with these changes, I have recently reconfigured how data are curated in my lab.  In that process I examined APA's statement on the ethical obligation of researchers to share data:

8.14 Sharing Research Data for Verification
(a) After research results are published, psychologists do not withhold the data on which their conclusions are based from other competent professionals who seek to verify the substantive claims through reanalysis and who intend to use such data only for that purpose, provided that the confidentiality of the participants can be protected and unless legal rights concerning proprietary data preclude their release. This does not preclude psychologists from requiring that such individuals or groups be responsible for costs associated with the provision of such information. 
(b) Psychologists who request data from other psychologists to verify the substantive claims through reanalysis may use shared data only for the declared purpose. Requesting psychologists obtain prior written agreement for all other uses of the data.

More commentary is provided in APA's 6th Edition Publication Manual (pp. 12-13):

To avoid misunderstanding, it is important for the researcher requesting the data and the researcher providing the data to come to a written agreement about the conditions under which the data are to be shared.  Such an agreement must specify how the shared data may be used (e.g., for verification of already published results, for inclusion in meta-analytic studies, for secondary analysis).  The written agreement should also include a formal statement about limits on the distribution of shared data....  Furthermore, the agreement should specify limits on the dissemination .... of the results of analyses performed on the data and authorship expectations.

What's Good About This Policy?

The above policy shines a spotlight on verification.  Verification is needed for a healthy psychological science.  Researchers are prone to mistakes, and allowing others to see the reasoning from data to conclusion provides an added layer of integrity to the process.  It also provides at least some expectation that data may be shared, albeit perhaps with stringent limits.

Understanding The Difference Between "Data" and "Interpretation"

To act ethically, we need to differentiate data from the interpretation of data.  

Data serve as facts, and like facts, are interpreted in the context of theories and explanations.  The data themselves are incontrovertible inasmuch as the actual values stand on their own.  The interpretation of these facts, whether and how they support certain theories, explanations, and accounts, is not incontrovertible.  These interpretations are just that.  They are creative, insightful, varied, judged, negotiated, personal, etc.

Who is qualified to interpret these data?  We all feel like we are uniquely qualified to interpret our own data, but we are not.  On the contrary, we should respect the abilities of others to interpret the data, and we should respect that their interpretations may differ from ours.  Our interpretations should derive no special authority or consideration from the fact that we collected the data.  When we collect data, we gain the right to interpret them first, but not last.  

What's Wrong With The Policy

Jelte Wicherts and Marjan Bakker have recently pointed out the flaws in APA's policies.  Their unpublished paper and commentary in Nature are important reads in understanding the possible ramifications.

The APA policy does not ensure others' rights to independently interpret the data.  The PI explicitly retains sanctioned control of subsequent interpretations.  They may exercise this control by refusing to share without being granted authorship, limiting the scope of use of data, or limiting where the results may appear.  The APA policy serves to privilege the data collectors' interpretation where no such privilege should ethically exist.

What Does A Good Policy Look Like?

Let's brainstorm this as a community.  As a starting point, we propose the following statement:  Ethical psychologists endeavor to ensure that others may independently interpret their data without constraint.  How should that play out?  We welcome your ideas.

Tuesday, April 21, 2015

How many participants do I need? Insights from a Dominance Principle

The Twitter Fight About Numbers of Participants

About a month back, there was an amusing Twitter fight about how many participants one needs in a within-subject design.  Cognitive and perceptual types tend to use just a few participants but run many replicates per subject.  For example, we tend to run about 30 people with 100 trials per condition.  Social psychologists tend to run between-subject designs with one or a handful of trials per condition, but they tend to run experiments with more people than 30.  The Twitter discussion was a critique of the small sample sizes used in cognitive experiments.

In this post, I ask how wise the cognitive psychologists are by examining the ramifications of these small numbers of participants.  This examination is informed by a dominance principle, a reasonable conjecture on how people differ from one another.  I show why the cognitive psychologists are right---why you need only small numbers of people to detect even small effects.


Consider a small priming effect, say an average 30 ms difference between primed and unprimed conditions.  There are a few sources of variation in such an experiment:

The variability within a person and condition across trials.  What happens when the same person responds in the same condition?  In response time measures, this variation is usually large, say a standard deviation of 300 ms.  We can call this within-cell variability.

The variability across people.  Regardless of condition, some people just flat out respond faster than others.  Let's suppose we knew exactly how fast each person was, that is we had many repetitions in a condition.  In my experience, across-people variability is a tad less than within-person-condition variability.  Let's take the standard deviation at 200 ms.  We can call this across-people variability.

The variability in the effect across people.  Suppose we knew exactly how fast each person was in both conditions.  The difference is the true effect, and we should assume that this true effect varies.  Not everyone is going to have an exact 30 ms priming effect.  Some people are going to have a 20 ms effect, others are going to have a 40 ms effect.   How big is the variability of the effect across people?    Getting a handle on this variability is critical because it is the limiting factor in within-subject designs.   And this is where the dominance principle comes in.

A Dominance Principle

The dominance principle here is that nobody has a true reversal of the priming effect.  We might see a reversal in observed data, but this reversal is only because of sample noise.  Had we enough trials per person per condition, we would see that everyone has at least a positive priming effect.  The responses are unambiguously quicker in the primed condition---they dominate.

The Figure below shows the dominance principle in action.  Shown are two distributions of true effects across people---an exponential and a truncated normal.  The dominance principle stipulates that the true effect is in the same direction for everyone, that is, there is no mass below zero.  And if there is no mass below zero and the average is 30 ms, then the distributions cannot be too variable.  Indeed, the two shown distributions have a mean of 30 ms, and the standard deviations for these exponential and truncated normal distributions are 30 ms and 20 ms, respectively.  This variability is far less than the 300 ms of within-cell variability or the 200 ms of across-people variability.  The effect size across people, 30 ms divided by these standard deviations, is actually quite large.  It is 1 and 1.5 respectively for the shown distributions.

You may see the wisdom in the dominance principle or you may be skeptical.  If you are skeptical (and why not), then hang on.  I am first going to explore the ramifications of the principle, and then I am going to show it is probably ok.

Between and Within Subject Designs

The overall variance in a between-subject design is the sum of the component variances, and it is determined in large part by the much larger within-cell and across-people variabilities.  This is why it might be hard to see a 30 ms priming effect in a typical between-subject design.  The effect size is somewhere south of .1.
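Here is a rough sketch of that arithmetic, assuming each participant contributes a single trial in a single condition and that the three sources of variation are independent, so their variances add.  The function name is made up for illustration, and the 20 ms effect standard deviation is the truncated-normal value from above; folding it into both groups is a simplification that barely moves the answer.

```python
def between_subject_effect_size(effect, within_sd, people_sd, effect_sd):
    """Standardized effect for one trial per person in a between-subject design.

    Assumes independent sources of variation, so the variances simply add.
    """
    total_var = within_sd**2 + people_sd**2 + effect_sd**2
    return effect / total_var**0.5


# 30 ms effect, 300 ms within-cell SD, 200 ms across-people SD, 20 ms effect SD
d_between = between_subject_effect_size(30.0, 300.0, 200.0, 20.0)
print(round(d_between, 3))  # 0.083
```

The within-cell and across-people terms swamp everything else, which is why the standardized effect lands below .1.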

The overall variability in a within-subject design depends on the number of trials per participant.  In these designs, we calculate each person's mean effect.  This difference has two properties: first, we effectively subtract out across-participant variability; second, the contribution of within-cell variability decreases with the number of trials per participant.  If this number is great, then the overall variability is limited by the variability in the effect across people.  As we stated above, due to the dominance principle, this variability is small, say about the size of the effect under consideration.  Therefore, as we increase the numbers of observations per person, we can expect effect sizes of 1 or even bigger.
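To see how the within-subject effect size grows with trials, here is a small sketch.  It assumes each person's observed effect is the difference of two condition means, each based on the same number of trials, so the trial noise in the difference has variance 2 x (within-cell SD)^2 / trials.  The function name and the 20 ms effect standard deviation (the truncated-normal case) are illustrative choices, not part of any analysis reported here.

```python
def within_subject_effect_size(effect, within_sd, effect_sd, n_trials):
    """Standardized effect of per-person mean differences.

    Across-people variability cancels in the difference, so only the
    trial noise and the true variability of the effect remain.
    """
    noise_var = 2 * within_sd**2 / n_trials  # noise in a difference of two cell means
    return effect / (effect_sd**2 + noise_var) ** 0.5


for n_trials in (10, 100, 1000, 10000):
    d = within_subject_effect_size(30.0, 300.0, 20.0, n_trials)
    print(n_trials, round(d, 2))
```

With 100 trials per condition the effect size is already around .64, and with more trials it climbs toward the ceiling of 30/20 = 1.5 set by the variability of the effect across people.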

Simulating Power for Within Subject Designs

Simulations seem to convince people of points perhaps even more than math.  So here are mine to show off the power of within-subject designs under the dominance principle.  I used the 300 ms within-cell and 200 ms across-people variabilities and sampled 100 observations per person per condition.  Each person had a true positive effect, and these effects were sampled from the truncated normal distribution with an overall mean of \( \mu \).  Here are the power results for several sample sizes (numbers of people) and values of the average effect \( \mu \).

The news is quite good.  Although a 10 ms effect cannot be resolved with fewer than a hundred participants, the power for larger effects is reasonable.  For example, the power to resolve a 30 ms effect with 30 participants is .93!  Indeed, cognitive psychologists know that even small effects can be successfully resolved with limited participants in massively-repeated within-subjects designs.  It's why we do it routinely.
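For readers who want to try this themselves, here is a minimal sketch of such a simulation.  It is not the original simulation code, and several pieces are assumptions: the parent normal (location 20, scale 27) truncated at zero is chosen only because it gives a mean near 30 ms and a standard deviation near 20 ms; the 700 ms baseline is arbitrary; cell means are drawn directly with standard error 300/sqrt(100) rather than trial by trial; and the test is a two-sided paired t-test at .05, with the critical value for 29 degrees of freedom (2.045) passed in by hand.

```python
import random
import statistics


def sample_true_effect(rng, loc=20.0, scale=27.0):
    """Draw one person's true effect from a normal truncated at zero (rejection sampling)."""
    while True:
        x = rng.gauss(loc, scale)
        if x > 0:
            return x


def simulate_power(n_people, t_crit, n_sims=2000, n_trials=100,
                   within_sd=300.0, people_sd=200.0, seed=1):
    """Proportion of simulated experiments in which |t| exceeds t_crit."""
    rng = random.Random(seed)
    cell_se = within_sd / n_trials ** 0.5  # SE of one person's cell mean
    hits = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_people):
            base = rng.gauss(700.0, people_sd)  # person's overall speed (arbitrary mean)
            effect = sample_true_effect(rng)    # always positive: dominance
            unprimed = rng.gauss(base + effect, cell_se)
            primed = rng.gauss(base, cell_se)
            diffs.append(unprimed - primed)     # the baseline cancels here
        mean = statistics.fmean(diffs)
        se = statistics.stdev(diffs) / n_people ** 0.5
        if abs(mean / se) > t_crit:
            hits += 1
    return hits / n_sims


power = simulate_power(n_people=30, t_crit=2.045)
print(round(power, 2))
```

Under these assumptions the estimate should land in the same neighborhood as the .93 quoted above for 30 participants, though the exact value depends on the seed and on the truncated-normal parameters chosen.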

The bottom line message is that if one assumes the dominance principle, then the power of within-subject designs is surprisingly high.  Of course, without dominance all bets are off.  Power remains a function of the variability of the effect across people, which must be specified.

Logic and Defense of the Dominance Principle

You may be skeptical of the dominance principle.  I suspect, however, that you will need to assert it.

1. The sizes of effects are difficult to interpret without the dominance principle.  Let's suppose that the dominance principle is massively violated.  In what sense is the mean effect useful or interpretable?  For example, suppose one has a 30 ms average effect with 60% of people having a true positive effect and 40% of people having a true negative priming effect.  The value of the 30 ms seems unimportant.  What is critically important in this case is why the effect is different in direction across people.  A good start is exploring what person variables are associated with positive and negative priming.

2. The dominance principle is testable.  All you have to do is collect a few thousand trials per person to beat the 300 ms within-cell variability.  If you want, say, 10 ms resolution per person, just collect 1000 observations per person.  I have done it on several occasions, collecting as many as 8,000 trials per person on some (see Ratcliff and Rouder, 1998, Psych Sci).  I cannot recall a violation though I have no formal analysis....yet.  The key is making sure you do not confound within-participant variability, which is often large, with between-participant variability.  You need a lot of trials per individual to deconfound these sources.  If you know of a dominance violation, then please pass the info along.
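The trial counts in the second point follow from the usual standard-error formula.  Here is a quick sketch, assuming independent trials with a 300 ms standard deviation and reading "resolution" as the standard error of a cell mean; the function names are made up for illustration.

```python
import math


def standard_error(sd, n_trials):
    """Standard error of a mean based on n_trials independent trials."""
    return sd / n_trials ** 0.5


def trials_for_resolution(sd, target_se):
    """Smallest number of trials whose mean has a standard error at or below target_se."""
    return math.ceil((sd / target_se) ** 2)


print(trials_for_resolution(300.0, 10.0))       # 900
print(round(standard_error(300.0, 1000), 1))    # 9.5
```

So about 1000 trials pins a single cell mean down to roughly 10 ms.  The difference between two such cell means is noisier by a factor of sqrt(2), which is one reason a few thousand trials per person is the safer target.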

Odds are you are not going to collect enough data to test for dominance.  And odds are that you are going to want to interpret the average effect size across people as meaningful.  And to do so, in my view, you will therefore need to assume dominance!  And this strikes me as a good thing.  Dominance is reasonable in most contexts, strengthens the interpretation of effects, and leads to high power even with small sample sizes in within-subject designs.

Thursday, April 9, 2015

Reply to Uri Simonsohn's Critique of Default Bayesian Tests

How we draw inferences about theoretical positions from data, how we know things, is central to any science.  My colleagues and I advocate Bayes factors, and Uri Simonsohn provides a critique.  I welcome Uri’s comments  because they push the conversation forward.  I am pleased to be invited to reply, and am helped by Joe Hilgard, an excellent graduate student here at Mizzou.

The key question is what to conclude from small effects when we observe them.  It is a critical question because psychological science is littered with these pesky small effects---effects that are modest in size and evaluate to be barely significant. Such observed effects could reflect many possibilities including null effects, small effects, or even large effects.

According to Uri, Bayes factors are prejudiced against small effects because small effects when true may be interpreted as evidence for the null. By contrast, Ward Edwards (1965, Psy Bull, p. 400) described classical methods as violently biased against the null because if the null is true, these methods never provide support for it.

Uri's argument assumes that observed small effects reflect true small effects, as shown in his Demo 1. The Bayes factor is designed to answer a different question: What is the best model conditional on data, rather than how do statistics behave in the long run conditional on unknown truths?  I discuss the difference between these questions; why the Bayes factor question is more intellectually satisfying; and how to simulate data to assess the veracity of Bayes factors in my 2014 Psy Bull Rev paper.

The figure below shows the Bayes-factor evidence for the alternative model relative to the null model conditional on data, which in this case is the observed effect size. We also show the same for p-values, and while the figures are broadly similar they also have dissimilarities.  I am not sure we can discern which is more reasonable without deeper analysis.  Even though we think Uri's Demo 1 is not that helpful, it is still noteworthy that the Bayes factors favor the null to a greater extent as the true effect size decreases.

The question remains how to interpret observed small effects.  We use Bayes factors to assess relative evidence for the alternative vs. the null.  Importantly, we give the null a fighting chance.  Nulls are statements of invariance, sameness, conservation, lawfulness, constraint, and regularity.  We suspect that if nulls are taken seriously, that is, that if one can state evidence for or against them as dictated by data, then they will prove useful in building theory.  Indeed, statements of invariances, constraint, lawfulnesses and the like have undergirded scientific advancement for over four centuries.  It is time we joined this bandwagon.

Bayes factors are formed by comparing predictions of competing models.  The logic is intuitive and straightforward.  Consider two competing explanations.  Before we observe the data, we use the explanations to predict where the observed effect size might be.  We do this by making probability statements.  For example, we may calculate the probability that the observed effect size is between .2 and .3.  Making predictions under the null is easy, and these probabilities follow gracefully from the t-distribution.

Making predictions under the alternative is not much harder for simple cases.  We specify a reasonable distribution over true effect sizes, and based on these, we can calculate probabilities on effect size intervals as with the null.

Once the predictions are stated, the rest is a matter of comparing how well the observed effect size matches the predictions.  That's it.  Did the data conform to one prediction better than another?  And that is exactly what the Bayes factor tells us.  Prediction is evidence.
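Here is a toy sketch of this prediction-comparison logic.  It is not the default Bayes factor we advocate: to keep it short, the t-distribution is swapped for a normal approximation in which the observed effect size is distributed around the true effect size with variance 1/n, and the alternative places a normal prior with standard deviation 0.5 on true effect sizes.  The sample size of 50, the prior scale, and all function names are illustrative assumptions.

```python
from statistics import NormalDist


def predictive_models(n, prior_sd):
    """Predicted distributions of the observed effect size under each model.

    Normal approximation: observed effect size ~ Normal(true effect, 1/n).
    Null: true effect is 0.  Alternative: true effect ~ Normal(0, prior_sd**2).
    """
    null = NormalDist(0.0, (1.0 / n) ** 0.5)
    alt = NormalDist(0.0, (prior_sd**2 + 1.0 / n) ** 0.5)
    return null, alt


def interval_prob(model, lo, hi):
    """Probability the model assigns to an observed effect size in [lo, hi]."""
    return model.cdf(hi) - model.cdf(lo)


def bayes_factor_10(d_obs, n, prior_sd):
    """How much better the alternative predicted d_obs than the null did."""
    null, alt = predictive_models(n, prior_sd)
    return alt.pdf(d_obs) / null.pdf(d_obs)


null, alt = predictive_models(n=50, prior_sd=0.5)
print(interval_prob(null, 0.2, 0.3), interval_prob(alt, 0.2, 0.3))
for d_obs in (0.0, 0.2, 0.4, 0.6):
    print(d_obs, bayes_factor_10(d_obs, n=50, prior_sd=0.5))
```

Values below 1 mean the null predicted the observed effect size better, and values above 1 favor the alternative.  In this sketch, observed effects near zero support the null while larger ones support the alternative, which is exactly the fighting chance for the null that classical tests lack.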

My sense is that researchers need some time to get used to making predictions because it is new in psychological science.   I realize that some folks would rather not make predictions---it requires care, thought, and judiciousness.  Even worse, the predictions are bold and testable.  My colleagues and I are here to help, whether it is help in specifying models that make predictions or deriving the predictions themselves.

I would recommend that researchers who are unwilling to specify models of the alternative hypothesis avoid inference altogether.  There is no principled way of pitting a null model that makes predictions against a vague alternative that does not.  Inference is not always appropriate, and for some perfectly good research endeavors, description is powerful and persuasive.  The difficult case, though, is continuing in what we are doing---rejecting nulls with small effects with scant evidence while denying the beauty and usefulness of invariances in psychological science.