Thursday, May 9, 2019

Two Writing Tips

The subject of writing tips has come up on Twitter, and I have more than 280 characters to say.  My two thoughts:

Liberate yourself from your work.  Many of us are scared to write.  Here is why:  The worst thing for an academic to feel is stupid.  Yet, we rightfully feel stupid a good proportion of the time.  The literature is vast, everyone knows more than we do on every topic, and our ideas feel pedestrian.  If we don't write, then our stupidity is not exposed.  If we do write, then we risk being exposed for the frauds that we are.  I feel stupid even writing about this.  

Here is the thing.  Your writing is not about you.  It is its own thing, in fact, its own story.  You have the privilege of telling this story, but it is not your story.  So, liberate yourself from it.  Focus on the story itself and not what it says about you.  Because people care about the story.  They don't care that much about you or what the story says about you.  That people don't care about you is freedom.  You still need to be brave and vulnerable to write.  But, you are free to focus exclusively on the story---to give it justice---without excessively worrying about how people will see you, because, honestly, they are not looking at you.

Motivation.  People don't read just because you write.  In fact, the easiest thing for a reader to do is stop reading.  I stop reading all the time.  I do so as soon as I no longer know why I am reading something.  And I think this is universal.  People stop reading when they do not know why they are reading.  This need for knowing why holds at all levels of the paper.  Is your material motivated?  One way of motivating material is to focus on a problem/solution format.  Readers can relate to paragraphs that define problems or provide solutions.  If a paragraph is not serving one of these roles, you might want to reconsider what it is doing.  

Often the paragraphs that are unmotivated are the ones about the writer and not the story.  They are attempts to show that the writer has mastered the literature or some skill.  They might even be those fear paragraphs where we fear that if we don't genuflect properly, some generic "they" won't like our paper.  If you can't say why you are writing something, then don't write it.

Saturday, March 23, 2019

Teaching Undergrad Stats without p, F, or t.

I taught a 10-week intro-level stats course without p, F, or t.

The course proceeded in the usual way:
  • one-group designs
  • two-group designs
  • many-group designs
  • factorial designs
  • OLS regression
  • multiple regression
  • mixed factors and continuous covariates (ANCOVA)
The key to the course was the concept of a model.  I spent quite a bit of time on the concept of a distribution and an observation from it.  I also talked about random-variable notation, writing \(Y_i\) for the \(i\)th observation in a data set.  Then, all inference in the class was model comparison.  No hypotheses, only models that instantiate theoretically interesting positions.  The models always had to be written down before analysis.

So, for the one-group design, we carried two models.  Students had to master the following notation:

$$ \mbox{Null Model}: \quad Y_i \sim \mbox{Normal}(0,\sigma^2) $$
$$ \mbox{Effect Model}: \quad Y_i \sim \mbox{Normal}(\mu,\sigma^2) $$
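These two models are just statements about where observations come from.  Here is a tiny simulation sketch to make that concrete (the values of \(\mu\) and \(\sigma\) are illustrative, not estimates from any class data):

```python
import numpy as np

rng = np.random.default_rng(1)

# The effect model says each observation Y_i is an independent draw
# from a Normal(mu, sigma^2) distribution.
mu, sigma = -0.5, 1.0                        # illustrative values only
y = rng.normal(loc=mu, scale=sigma, size=4)
print(y)                                     # four simulated "observations"
```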

I then asked them how each model accounted for each observation.  For example, if the data were the observations (-2, -1, 0, 1), we would make the following tables:

Null-Model Table

Data   Account   Error   Squared Error
 -2       0       -2           4
 -1       0       -1           1
  0       0        0           0
  1       0        1           1


Effects-Model Table

Data   Account   Error   Squared Error
 -2      -.5     -1.5        2.25
 -1      -.5      -.5         .25
  0      -.5       .5         .25
  1      -.5      1.5        2.25


The next step is to calculate \(SS_E\), \(R^2\), and BIC, with \(R^2\) serving as a measure of effect size and BIC as a relative model-comparison statistic.  Of course, I taught the formulas and meanings of these three quantities.  BIC, for example, was taught in terms of error and penalty, e.g., \( \mbox{BIC}=n\log(SS_E/n)+k\log(n)\), where \(n\) is the sample size and \(k\) is the number of free parameters.  Continuing, the following model-comparison table was produced:

Model     Parameters   SSE    R^2    BIC
Null          0         6      0     1.62
Effect        1         5     .167   2.27

Here the effect was actually pretty big in \(R^2\), but we do not have the resolution to prefer the effects model over the null model given the small sample size.  To help students understand what \(R^2\) means, I provided a list of accounted-for variances for various phenomena, say how much variance smoking accounts for in lung-cancer rates. 

Students were taught how to calculate all the values by calculator in small data sets and by spreadsheet in large data sets.  
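For those who prefer code to a spreadsheet, here is a minimal sketch of the same calculation in Python (only numpy is assumed; the data are the toy example above):

```python
import numpy as np

y = np.array([-2., -1., 0., 1.])
n = len(y)

def ss_error(y, account):
    """Sum of squared errors for a model's account of each observation."""
    return np.sum((y - account) ** 2)

def bic(ss, n, k):
    """BIC as taught in class: an error term plus a penalty for k free parameters."""
    return n * np.log(ss / n) + k * np.log(n)

ss_null = ss_error(y, 0.0)           # null model accounts for every value with 0  -> 6
ss_effect = ss_error(y, y.mean())    # effect model uses the sample mean (-.5)     -> 5

r2 = 1 - ss_effect / ss_null         # .167
print(r2, bic(ss_null, n, k=0), bic(ss_effect, n, k=1))   # .167, 1.62, 2.27
```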


Interpretation


For interpretation, students were taught that:
a. their inference was only as good as their models
b. no model was true or false, all were just helpful abstractions
c. their primary goal was model comparison rather than testing
d. they need not make decisions, just assess evidence judiciously
e. they should consider both model comparison (BIC) and effect size \(R^2\) in assessment

Extension

The above model-comparison-through-error approach extends gracefully to all linear-model applications.  In this sense, once the mechanics and interpretations are mastered in the one-sample case, the extensions to more complex models, including multiple regression and multi-factor ANOVA, are straightforward.  By getting the mechanics out of the way early, we can focus on the models and how they account for phenomena.  Contrast this to the usual case where you may teach one set of mechanics for the t-test, another for F-tests, and a third for regression.  

Results

The quarter went well.  The students mastered the material effectively.  I had fun.  They had fun.  And I never lost any sleep at night wondering why I was teaching what I was teaching.



Thursday, February 28, 2019

Reimagining Meta-Analysis





Fruit and Meta-Analysis

The fruit in my house weigh on average 93 grams.  I know this because I weighed them.  The process of doing so is a good analogy for meta-analysis, though a lot less painful.

I bet you find the value of 93 grams rather uninformative.  It reflects what my family likes to eat, say bananas more than kiwis and strawberries more than blackberries.  In fact, even though I went through the effort of gathering the fruit, setting fruit criteria (for the record, I excluded a cucumber because cucumbers, while bearing seeds, just don't taste like fruit), and weighing them, dare I say this  mean doesn't mean anything.  And this is the critique Julia Haaf (@JuliaHaaf), Joe Hilgard (@JoeHilgard), Clint Davis-Stober (@ClintinS) and I  provide for meta-analytic means in our just-accepted Psychological Methods paper.   Just because you can compute  a sample mean doesn't mean that it is automatically helpful.

Means are most meaningful to me when they measure the central tendency of some naturally interesting random process.  For example, if I were studying environmental impacts on children's growth in various communities, the mean (and quantiles) of height and weight for children of given ages is certainly meaningful.  Even though the sample of kids in a community is diverse in wealth, race, etc., the mean is helpful in understanding, say, environmental factors such as a local pesticide factory.

In meta-analysis, the mean is over something haphazard: whatever paradigms happen to be trendy for certain questions.  The collection of studies is more like the collection of fruit in my house.  And just as the fruit mean reflects my family's preferences about fruit as much as any biological variation among seeded plant things, the meta-analytic mean reflects the sociology of researchers (how they decide what data to collect) as much as the phenomenon under study.

Do All Studies Truly?

In our recent paper, we dispense with the meta-analytic mean.  It simply is not a target for scientific inference for us.  Instead, we ask a different question: "Do all studies truly...?"  To set the stage, we note that most findings have a canonical direction.  For example, we might think that playing violent video games increases rather than decreases subsequent aggressive behavior.  Increase is the canonical direction here, and we can call it the positive effect.  If we gather a collection of studies on the effects of video-game violence, do all truly have an effect in this positive direction?  That is, do all truly increase aggression, or do some truly increase and others truly decrease it?  Next, let's focus on truly.  Truly for a study refers to what would happen in the large-sample limit of many people.  In any finite sample for any study, we might observe a negative-direction effect from sampling noise, but the main question is about the true values.  Restated, how plausible is it that all studies have a true positive effect even though some might have negative sample effects?  Using Julia's and my previous work, we show how to compute this plausibility across a collection of studies.

So What?

Let's say, "yes," it is plausible that all studies of a collection truly have effects in a common direction, say violent video games do indeed increase aggression.  What is implied is much more constrained than some statement about the meta-analytic mean.  It is about robustness.  Whatever the causes of variation in the data set, the main finding is robust to these causes.  It is not that just the average shows the effect, but all studies plausibly do.  What a strong statement to make when it holds!

Now, let's take the opposite possibility, "no."  It is not plausible that all studies truly have effects in a common direction.  With high probability some have true effects in the opposite direction.  The upshot is a rich puzzle.  Which studies go one way and which go the other way?  Why?  What are the mediators?

In our view, then, the very first meta-analytic question is "do all studies truly."  The answer will surely shape what we do next.

Can You Do It Too?

Maybe, maybe not.  The actual steps are not that difficult.  One needs to perform a Bayesian analysis and gather the posterior samples.  The models are pretty straightforward and are easy to implement in the BayesFactor package, Stan, or JAGS.  Then, to compute the plausibility of the "do all studies truly" question, one needs to count how many posterior samples fall in certain ranges.  So, if you can gather MCMC posterior samples for a model and count, you are in good shape.
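For concreteness, here is a minimal sketch of that counting step in Python.  The array name, file, and shape are hypothetical; fitting the model itself (in the BayesFactor package, Stan, or JAGS) is not shown:

```python
import numpy as np

# Posterior samples of the true study effects from your own model fit;
# hypothetical file, shape (n_mcmc_iterations, n_studies).
theta = np.load("posterior_theta.npy")

# On each MCMC iteration, check whether every study's true effect is positive.
all_positive = np.all(theta > 0, axis=1)

# The plausibility that all studies truly go in the positive direction is
# the proportion of iterations on which that holds.
print("P(all studies truly positive | data) =", all_positive.mean())
```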

We realize that some people may be drawn to the question and may be repelled by the lack of an automatic solution.   Julia and I have unrealized dreams of automating the process.  But, in the meantime, if you have a cool data set and an interest in the does-every-study-truly question, let us know.

Rouder, J. N., Haaf, J. M., Davis-Stober, C., & Hilgard, J. (in press). Beyond Overall Effects: A Bayesian Approach to Finding Constraints Across a Collection of Studies in Meta-Analysis. Psychological Methods.





Saturday, January 5, 2019

P-values and Sample Sizes, the Survey

I ran a brief 24-hour survey in which many of you participated.  Thank you.

The main goal was to explore how people weigh sample size against p-values.  I think with the adoption of power analysis and sample-size planning, many people have mistakenly used pre-data intuitions for post-data analysis.  Certainly, if we had no data, we would correctly think, all other things being equal, that a larger study has greater potential to be more evidential than a smaller one.  But what about after the data are collected?

Here is the survey.  The darker blue bar is the most popular response.




The Answers

My own feeling is that the study with the smaller sample size is more evidential.  Let's take it from a few points of view:

Significance Testing:  If you are a strict adherent of significance testing, then you would use the p-values.  You might choose "same."  However, the example shows why significance testing is critiqued.  Let's consider comparisons across small and very large sample sizes, say N1=50 and N2=1,000,000.  The observed effect size for the first experiment is a healthy .32; that for the second is a meager .002.  So, as sample size increases while the p-value stays the same, we are observing smaller and smaller effects.
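To see the arithmetic behind that claim, here is a small sketch.  The survey's exact design is not reproduced here; the sketch simply assumes a one-sample, two-sided t-test that lands exactly at p = .05, so the particular numbers are illustrative, but the pattern (the implied effect shrinks as N grows) is the point:

```python
from scipy import stats

def implied_effect_size(n, p=0.05):
    """Observed standardized effect (Cohen's d) that yields exactly p
    in a one-sample, two-sided t-test with n observations."""
    t_crit = stats.t.ppf(1 - p / 2, df=n - 1)
    return t_crit / n ** 0.5

for n in (50, 1_000_000):
    print(n, round(implied_effect_size(n), 3))
# 50        -> about 0.28
# 1,000,000 -> about 0.002
```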

Modern Testing I: Modern testing has been influenced by considerations of effect sizes.  If effect size is to matter for inference at all, then the correct answer is the smaller sample size.  After all, the p-values are equal and the smaller sample size has the larger effect size.

Modern Testing II: Another way of thinking about modern testing is that the analyst chooses an alpha level based on context.  An obvious factor is sample size, and many authors recommend lowering alpha with increasing sample size.  Hence, the same p-value is more likely to be significant with the smaller sample size.  

Bayesian Testing:  For all reasonable priors, the Bayes factor favors the smaller sample size because larger effect sizes are more compatible, in general, with the effect than with the null.  Tom Faulkenberry notes that if you get to see the data first and fine-tune the priors, then you can game a higher Bayes factor for N2 than N1. 
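Here is a deliberately simplified numerical sketch of that Bayesian point.  It uses a normal approximation with a Normal(0, tau^2) prior on the standardized effect rather than the default JZS prior that the BayesFactor package would use, and the prior scale tau = 0.5 is an assumption, so the numbers are illustrative only:

```python
from scipy import stats

def bf10(d_hat, n, tau=0.5):
    """Approximate Bayes factor for an effect vs. the null.  The observed
    standardized effect d_hat is treated as normal with standard error
    1/sqrt(n); under H1 the true effect has a Normal(0, tau^2) prior."""
    se = 1 / n ** 0.5
    marginal_h1 = stats.norm.pdf(d_hat, loc=0, scale=(tau**2 + se**2) ** 0.5)
    marginal_h0 = stats.norm.pdf(d_hat, loc=0, scale=se)
    return marginal_h1 / marginal_h0

for n in (50, 1_000_000):
    d_hat = 1.96 / n ** 0.5          # an observed effect sitting right at p = .05
    print(n, round(bf10(d_hat, n), 3))
# n = 50        -> BF10 near 1.6  (weak evidence for an effect)
# n = 1,000,000 -> BF10 near 0.01 (strong evidence for the null)
```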

What We Learned

For me, the best answer is N1 because it captures the appropriate post-data intuition that, everything else being equal, larger effect sizes are preferable to smaller effect sizes when establishing effects.  Unfortunately, it was the least popular choice at 18%.

One of the shocking things to me is the popularity of N2 (24%).  I can't think of any inferential strategy that would give credence to an N2 response.  So, if you chose N2, you may wish to rethink how you evaluate the significance of effects.  The "same" response (18%) makes sense only if you are willing to ignore effect size.  Ignoring effect size, however, strikes me as unwise in the current climate.  

The most popular response was "depends" (40%).  I am not sure what to make of the depends responses.  I suspect for some of you, it was a cop-out to see the results.  For others, it was an overly technical response to cover your bases.  In any case, it really doesn't depend that much.  Go with bigger effects when establishing effects.



Monday, November 19, 2018

Preregistration: Try it (Or not)

So, as the Statistical War and Tone War are in a lull, the Preregistration conflict has flared up yet again.  Here are a few thoughts from the airplane ride back home from Psychonomics.

Prologue:

A. To be honest, it has taken me quite a long time to sort out my thoughts on preregistration.  I am not telling you to preregister or not.  Moreover, how I read your work does not depend on whether you preregistered.  Perhaps you might find my thoughts helpful in your decision; perhaps not.

B. I don't believe in the usefulness of the exploratory/confirmatory distinction.  All of my research is motivated by some theoretical issue (so it is not exploratory) and I am always open to alternatives that I have not considered (so it is not confirmatory).  Arguments that rely on the exploratory/confirmatory distinction are not persuasive to me, and I will not be using them here (or elsewhere).

Why I Preregistered, the story:

I used preregistration because my students forced me to.  I found the experience rewarding and will preregister again.  Perhaps the strongest argument for preregistration is that it may clarify the researcher's thinking before seeing the data.  I think most of us can agree that writing is hard, and one of the reasons it is hard is that it forces you to clarify your thinking.  Preregistration in some sense provides the opportunity for that type of clarification before the data are collected.  As we wrote the preregistration, my team realized we hadn't thought enough about what type of models could instantiate one of the theoretical alternatives.  So, we made a set of additional model specifications before seeing the data.  That was quite helpful.

Why I don't Take the Preregistration Too Seriously:

I feel no hesitation to break my preregistration.  In fact, I do not know if we did or did not break our preregistration because I never went back and read it!  I don't care if we did or not, to be honest.   I actually think this is not such a bad idea.  Here is why:

As Shiffrin notes, science requires good judgment.  In fact, being open-minded, flexible, and judicious are probably more important characteristics than being smart, industrious, or meticulous.  Now, what I like about preregistration is that it summons me to provide my best judgment at a particular point in time.  But, as new information comes in, including data, I need to exercise good judgment yet again.  Hopefully, the previous efforts will make the current efforts easier and more successful.  But that is where it stops.  I will have no contract with my preregistration; instead, I will use good judgment.  Preregistration is used to improve the pre-data steps, and hopefully that will improve the post-data steps too.

So if you preregister, consider the following:

1. Try not to substitute your preregistration for your best judgment.  You can add value judiciously.  Don't trade in what you know now for what you knew then.

2. Don't forget to have a conversation with your data.  Nature only whispers; you need to communicate with her softly and subtly.  You gently ask one thing, and it whispers something else.  And you go back and forth.  Please do not downgrade this conversation, because it might be the most important thing you do with your data.

If You Want Others To Preregister:

Tell your story.   Maybe in detail.  What might you have done differently?

Saturday, September 1, 2018

Making Mistakes




Oh the irony.

I made a potentially dreadful mistake last month: I zipped up and submitted the wrong version of a manuscript.  It was the final version for typesetting.  It could have been a fucking disaster---imagine if the original submission rather than the revision had been published.  I didn't catch the mistake; the amazing Michele Nathan at Psychological Science did.

And the manuscript I made this mistake on was itself about minimizing mistakes (Click Here For The Paper).  That's right, we made a dreadful mistake while processing a paper about minimizing such mistakes.  Boy, do I feel like a fraud.

One solution is to berate myself, lose sleep, hide in shame, and promise I will be more careful.  This solution, of course, is useless because (a) there are much better things to feel shame about and (b) no matter my intentions, I will not be more careful.

The other solution is to do what we say in the paper.  We use the principles of High Reliability Organizations to deal with the mistake.

Step 1: Mistakes Happen, Record Them

Yeah, it happened.  We know mistakes happen, and we are prepared for them.  So, the first thing I did was open up a little form in our database called "Adverse Events."  All labs should have a procedure for documenting mistakes.  In my lab, we handle all adverse events together so everyone is in on the problem and the resolution.

Step 2: What Was the Mistake

I submitted directory sub2 instead of directory rev1.  Directory sub2 was the second version and was the one originally submitted.  Directory rev1 was the third, and it was the revision in response to reviews.  Obviously, with one second of thought, most of us would know that rev1 supersedes sub2.  We could even check the dates and the log, which were accurate.  But I happen to be one of the least detail-oriented people on the planet.  So, I must have seen the 2 in sub2 and zipped and submitted that directory.

Step 3: The Cause

It is obvious that the problem here is a deficient set of naming conventions for files and directories.  We do a pretty good job of naming files in the lab.  We have git repositories, and all repos have the same few directories: dev, share, papers, presentations, grants.  Also, we have good MySQL logs of when we submit papers.  So tracking down the mistake and why it occurred was easy.

What we did not have was a firm naming convention for versions within the directory "papers".  Clearly, we need to standardize this convention.

Step 4:  Resolution


In our system there are two workable solutions.  The first is to use successive version numbers on a single name, e.g., minMistakes.1 might have been the first version, minMistakes.2 the second, and so on.  Here, instead of calling directories sub2 and rev1, we number successively and use a more informative root name.  We lose where we are in the process, though.  Fortunately, we keep pretty good records of what we do in our database, so there we could record that minMistakes.2 was the first version submitted, etc.  An even better solution is to use git to sort out the major versions.  We probably should not be creating new directories at all; instead we should use git tags for major versions and process notes.  That is what I am going to do from now on.

Here is our adverse event report now in our database.



Anyways, with the new conventions, I most likely won't make that mistake again.


Monday, August 27, 2018

Are there human universals in task performance? How might we know?

Science has traditionally proceeded by understanding constraint among variables. In a series of new papers, Julia Haaf and I ask if there are human universals in perception, cognition, and performance.   Here is an outline of our development.

What Do We Mean By "Human Universal"?


Let's take the Stroop task as an example.  Are there any human universals in the Stroop task?  In this task, people name the print color of congruent color words (e.g., RED displayed in red) faster than incongruent ones (e.g., GREEN displayed in red).  We will call this a positive Stroop effect, and, from many, many experiments, we know that on average people have positive Stroop effects.  But what about each individual?  A good candidate for a human universal is that each individual shows a true positive effect.  And, conversely, no individual has a true negative Stroop effect where the colors of incongruent words are named faster than those of congruent words.  We call this the "Does Everyone" question.  Does everyone have a true nonnegative Stroop effect?  We propose a universal order constraint on true performance.



The above figure shows observed Stroop effects across many individuals (data courtesy of Claudia von Bastian).  As can be seen, 8 of 121 people have negative observed Stroop effects.  But that doesn't mean the "Does Everyone" condition is violated.  The negative-going observations might be due to sampling noise.  To help calibrate, we added individual 95% CIs.  Just by looking at the figure, it seems plausible from these CIs that, indeed, everybody Stroops.

"True effect" in this context means in the limit of many trials, or as the CIs become vanishingly small.  The question is "if you had a really, really large number of trials for each person in the congruent and incongruent conditions, then would each and every individual have a positive effect.

Is The "Does Everyone" Question Interesting?

We would like to hear your opinion here.  In exchange, we offer our own.  

We think the "does everyone" question is fascinating.  We can think of some domains in which it seemingly holds, say Stroop or perception (nobody identifies quite dim words more quickly than modestly brighter ones).  Another domain is priming---it is hard to imagine that there are people who respond faster to DOCTOR following PLATE than to DOCTOR following NURSE.  And then there are other domains where it is assuredly violated, including handedness (some people truly do throw a ball farther with their left hand) and preference (as strange as it may seem, some people truly prefer sweetened tea to unsweetened tea).  Where are the bounds?  Why?

The "does everyone" question seemingly has theoretical ramifications too.  An affirmative answer means that the processes underlying the task may be common across all people---that is, we may all access semantic meaning from text the same way.  A negative answer means that there may be variability in the processes and strategies people bring to bear.

Answering The Does Everyone Question.


The "Does Everyone" question is surprisingly hard to answer.  It is more than just classifying each individual as positive, zero, or negative.  It is a statement about the global configuration of effect.   We have seen cases where we can say with high confidence that at least one person has a negative true effect without being able to say with high confidence who these people are.  This happens when there are too many people with slightly negative effects.  We have been working hard over the last few years to develop the methodology to answer the question.  For the plot above, yes, there is evidence from our developments that everybody Stroops positive.
  

Our Requests:

  • Comment please.  Honestly, we wrote the papers below, but as far as we can tell, they have yet to be noticed.  
  • Is the "Does Everyone" question interesting to you?  Perhaps not.  Perhaps you know the answer for your domain.  
  • Can you think of a cool domain where whether everyone does something the same way is really theoretically important?  We are looking for collaborators!  (You can email jrouder@uci.edu or answer publicly.)

Some of Our Papers on the Topic:

  1. Haaf & Rouder (2017), Developing Constraint in Bayesian Mixed Models.  Psychological Methods, 22, p 779-.  In this poorly entitled paper, we introduce the Does Everyone question and provide a Bayes factor approach for answering it.  We apply the approach to several data sets with simple inhibition tasks (Stroop, Simon, Flanker).   The bottom line is that when we can see overall effects, we often see evidence that everyone has the effect in the same direction.
  2. Haaf & Rouder (in press), Some do and some don’t? Accounting for variability of individual difference structures.  Psychonomic Bulletin & Review.  We also include a mixture model where some people have identically no effect while others have an order constrained effect.
  3. Thiele, Haaf, and Rouder (2017) Is there variation across individuals in processing? Bayesian analysis for systems factorial technology.  Journal of Mathematical Psychology,  81, p40-54.  A neat application to systems factorial technology (SFT).  SFT is a technique for telling whether people are processing different stimulus dimensions in serial, parallel, or co-actively by looking at the direction of a specific interaction contrast.  We ask whether all people have the same direction of the interaction contrast.