## Thursday, May 9, 2019

### Two Writing Tips

The subject of writing tips has come up on Twitter, and I have more than 280 characters to say.  My two thoughts:

Liberate yourself from your work.  Many of us are scared to write.  Here is why:  The worst thing for an academic to feel is stupid.  Yet, we rightfully feel stupid a good proportion of the time.  The literature is vast, everyone knows more than we do on every topic, and our ideas feel pedestrian.  If we don't write, then our stupidity is not exposed.  If we do write, then we risk being exposed for the frauds that we are.  I feel stupid even writing about this.

Here is the thing.  Your writing is not about you.  It is its own thing, in fact, its own story.  You have the privilege of telling this story, but it is not your story.  So, liberate yourself from it.  Focus on the story itself and not what it says about you.  Because people care about the story.  They don't care that much about you or what the story says about you.  That people don't care about you is freedom.  You still need to be brave and vulnerable to write.  But, you are free to focus exclusively on the story---to give it justice---without excessively worrying about how people will see you, because, honestly, they are not looking at you.

Motivation.  People don't read just because you write.  In fact, the easiest thing for a reader to do is stop reading.  I stop reading all the time.  I do so as soon as I no longer know why I am reading something.  And I think this is universal.  People stop reading when they do not know why they are reading.  This need for knowing why holds at all levels of the paper.  Is your material motivated?  One way of motivating material is to focus on a problem/solution format.  Readers can relate to paragraphs that define problems or provide solutions.  If a paragraph is not serving one of these roles, you might want to reconsider what it is doing.

Often paragraphs that are unmotivated are the ones about the writer and not the story.  They are attempts to show that the writer mastered the literature or some skill.  They might even be those fear paragraphs where we fear that if we don't genuflect properly, some generic "they" won't like our paper.  If you can't say why you are writing something, then don't write it.

## Saturday, March 23, 2019

### Teaching Undergrad Stats without p, F, or t.

I taught a 10-week intro-level stats course without p, F, or t.

The course proceeded in the usual way:
• one-group designs
• two-group designs
• many-group designs
• factorial designs
• OLS regression
• multiple regression
• mixed factors and continuous covariates (ANCOVA)

The key to the course was the concept of a model.  I spent quite a bit of time on a distribution and an observation from it.  I also talked about random-variable notation like $$Y_i$$ for the $$i$$th observation in a data set.  Then, all inference in the class was model comparison.  No hypotheses, only models that instantiate theoretically interesting positions.  The models always have to be written down before analysis.

So, for the one-group design, we carried two models.  Students had to master the following notation:

$$\mbox{Null Model}: \quad Y_i \sim \mbox{Normal}(0,\sigma^2)$$
$$\mbox{Effect Model}: \quad Y_i \sim \mbox{Normal}(\mu,\sigma^2)$$

I then asked them how the model accounted for each observation.  For example, if the data were the observations  (-2,-1,0,1), we would make the following tables:

Null-Model Table

| Data | Account | Error | Squared Error |
|---|---|---|---|
| -2 | 0 | -2 | 4 |
| -1 | 0 | -1 | 1 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |

Effects-Model Table

| Data | Account | Error | Squared Error |
|---|---|---|---|
| -2 | -.5 | -1.5 | 2.25 |
| -1 | -.5 | -.5 | .25 |
| 0 | -.5 | .5 | .25 |
| 1 | -.5 | 1.5 | 2.25 |

The next step is to calculate $$SS_E$$, $$R^2$$, and BIC, with $$R^2$$ serving as a measure of effect size and BIC as a relative model comparison statistic.  Of course, I taught the formulas and meanings of these three quantities.  BIC, for example, was taught in terms of error and penalty, e.g., $$\mbox{BIC}=n\log(SS_E/n)+k\log(n)$$.   Continuing, the following model-comparison table was produced:

| Model | Parameters | $$SS_E$$ | $$R^2$$ | BIC |
|---|---|---|---|---|
| Null | 0 | 6 | 0 | 1.62 |
| Effect | 1 | 5 | .167 | 2.27 |

Here the effect was actually pretty big in $$R^2$$, but we do not have the resolution to prefer the effects model over the null model given the small sample size.  To help students understand what $$R^2$$ means, I provided a list of accounted-for variances for various phenomena, say how much variance smoking accounts for in lung-cancer rates.

Students were taught how to calculate all the values by calculator in small data sets and by spreadsheet in large data sets.
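
The calculator steps above can also be sketched in a few lines of Python. This is only an illustration, not the course material: the helper name `fit_summary` is made up, the logarithm is assumed to be the natural log, and $$R^2$$ is assumed to be the proportional reduction in $$SS_E$$ relative to the null model. With the four observations from the example it gives $$SS_E$$ of 6 and 5, $$R^2$$ of about .167, and BIC values of about 1.62 and 2.28 (2.27 if truncated rather than rounded).

```python
import math

def fit_summary(data, account, k):
    """Return (SS_E, BIC) for a model that gives every observation the same account.

    `account` is the model's predicted value and `k` is the number of free
    mean parameters (0 for the null model, 1 for the effect model).
    """
    n = len(data)
    ss_e = sum((y - account) ** 2 for y in data)
    # Error term plus penalty term, natural log assumed.
    bic = n * math.log(ss_e / n) + k * math.log(n)
    return ss_e, bic

data = [-2, -1, 0, 1]
sample_mean = sum(data) / len(data)  # -0.5, the effect model's account

ss_null, bic_null = fit_summary(data, 0.0, k=0)
ss_effect, bic_effect = fit_summary(data, sample_mean, k=1)

# Assumed definition: variance accounted for relative to the null model.
r2_effect = 1 - ss_effect / ss_null

print(f"Null:   SS_E={ss_null:.2f}  BIC={bic_null:.3f}")
print(f"Effect: SS_E={ss_effect:.2f}  R^2={r2_effect:.3f}  BIC={bic_effect:.3f}")
```

The smaller BIC belongs to the null model here, which is the "not enough resolution" conclusion reached in the text.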

Interpretation

For interpretation, students were taught that:
a. their inference was only as good as their models
b. no model was true or false, all were just helpful abstractions
c. their primary goal was model comparison rather than testing
d. they need not make decisions, just assess evidence judiciously
e. they should consider both model comparison (BIC) and effect size ($$R^2$$) in assessment

Extension

The above model-comparison-through-error approach extends gracefully to all linear model applications.  In this sense, once the mechanics and interpretations are mastered in the one-sample case, the extensions to more complex models, including multiple regression and multi-factor ANOVA, are straightforward.  By getting the mechanics out early, we can focus on the models and how they account for phenomena.  Contrast this to the usual case where you may teach one set of mechanics for the t-test, another for F-tests, and a third for regression.

Results

The quarter went well.  The students mastered the material effectively.  I had fun.  They had fun.  And I never lost any sleep at night wondering why I was teaching what I was teaching.

## Thursday, February 28, 2019

### Reimagining Meta-Analysis

Fruit and Meta-Analysis

The fruit in my house weigh on average 93 grams.  I know this because I weighed them.  The process of doing so is a good analogy for meta-analysis, though a lot less painful.

I bet you find the value of 93 grams rather uninformative.  It reflects what my family likes to eat, say bananas more than kiwis and strawberries more than blackberries.  In fact, even though I went through the effort of gathering the fruit, setting fruit criteria (for the record, I excluded a cucumber because cucumbers, while bearing seeds, just don't taste like fruit), and weighing them, dare I say this  mean doesn't mean anything.  And this is the critique Julia Haaf (@JuliaHaaf), Joe Hilgard (@JoeHilgard), Clint Davis-Stober (@ClintinS) and I  provide for meta-analytic means in our just-accepted Psychological Methods paper.   Just because you can compute  a sample mean doesn't mean that it is automatically helpful.

Means are most meaningful to me when they measure the central tendency of some naturally interesting random process.  For example, if I were studying environmental impacts on children's growth in various communities, the mean (and quantiles) of height and weight for children of given ages is certainly meaningful.  Even though the sample of kids in a community is diverse in wealth, race, etc., the mean is helpful in understanding, say, environmental factors such as a local pesticide factory.

In meta-analysis, the mean is over something haphazard...what type of paradigms happen to be trendy for certain questions.   The collection of studies is more like a collection of fruit in my house.  And just as the fruit mean reflects my family's preferences about fruit as much as any biological variation among seeded plant things, the meta-analytic mean reflects the sociology of researchers (how they decide what data to collect) as much as the phenomenon under study.

Do All Studies Truly?

In our recent paper, we dispense with the meta-analytic mean.  It simply is not a target for us for scientific inference.  Instead, we ask a different question, "Do All Studies Truly...."  To set the stage, we note that most findings have a canonical direction.  For example, we might think that playing violent video games increases rather than decreases subsequent aggressive behavior.  Increases here is the canonical direction, and we can call it the positive effect.  If we gather a collection of studies on the effects of video game violence, do all truly have an effect in this positive direction?  That is, do all truly increase aggression, or do some truly increase and others truly decrease aggression?  Next, let's focus on truly.  Truly for a study refers to what would happen in the large-sample limit of many people.  In any finite sample for any study, we might observe a negative-direction effect from sampling noise, but the main question is about the true values.  Restated, how plausible is it that all studies have a true positive effect even though some might have negative sample effects?  Using Julia's and my previous work, we show how to compute this plausibility across a collection of studies.

So What?

Let's say, "yes," it is plausible that all studies of a collection truly have effects in a common direction, say violent video games do indeed increase aggression.  What is implied is much more constrained than some statement about the meta-analytic mean.  It is about robustness.  Whatever the causes of variation in the data set, the main finding is robust to these causes.  It is not that just the average shows the effect, but all studies plausibly do.  What a strong statement to make when it holds!

Now, let's take the opposite possibility, "no."  It is not plausible that all studies truly have effects in a common direction.  With high probability some have true effects in the opposite direction.  The upshot is a rich puzzle.  Which studies go one way and which go the other way?  Why?  What are the mediators?

In our view, then, the very first meta-analytic question is "do all studies truly."  The answer will surely shape what we do next.

Can You Do It Too?

Maybe, maybe not.  The actual steps are not that difficult.  One needs to perform a Bayesian analysis and gather the posterior samples.  The models are pretty straightforward and are easy to implement in the BayesFactor package, Stan, or JAGS.  Then, to compute the plausibility of the "do all studies truly" question, one needs to count how many posterior samples fall in certain ranges.  So, if you can gather MCMC posterior samples for a model and count, you are in good shape.
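
The counting step can be sketched as follows. This is a minimal illustration, not the procedure from the paper: the function name and the tiny table of draws are hypothetical, and a real analysis would use thousands of MCMC draws from a fitted model. The idea is that each row is one posterior draw, each column is one study's true effect, and we count the rows in which every study's effect is positive.

```python
def proportion_all_positive(draws):
    """Fraction of draws in which every study's true effect is above zero."""
    hits = sum(1 for draw in draws if all(theta > 0 for theta in draw))
    return hits / len(draws)

# Hypothetical posterior draws: rows are MCMC iterations, columns are studies.
posterior_draws = [
    [0.30, 0.12, 0.05],
    [0.25, 0.08, -0.02],  # third study negative in this draw
    [0.41, 0.15, 0.01],
    [0.28, -0.01, 0.03],  # second study negative in this draw
]

posterior_prop = proportion_all_positive(posterior_draws)
print(posterior_prop)
```

The same function could be applied to draws from the prior. Comparing the two proportions is one way to gauge how much the data moved the plausibility of "all positive," though whether that matches the paper's exact computation is an assumption here.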

We realize that some people may be drawn to the question and may be repelled by the lack of an automatic solution.   Julia and I have unrealized dreams of automating the process.  But, in the meantime, if you have a cool data set and an interest in the does-every-study-truly question, let us know.

Rouder, JN, Haaf, JM, Davis-Stober, C, Hilgard, J (in press). Beyond Overall Effects: A Bayesian Approach to Finding Constraints Across A Collection Of Studies In Meta-Analysis. Psychological Methods.

## Saturday, January 5, 2019

### P-values and Sample Sizes, the Survey

I ran a brief 24-hour survey in which many of you participated.  Thank you.

The main goal was to explore how people trade off sample size and p-values.  I think with the adoption of power and sample-size planning, many people have mistakenly used pre-data intuitions for post-data analysis.  Certainly, if we had no data, we would correctly think, all other things being equal, that a larger study has greater potential to be more evidential than a smaller one.  But what about after the data are collected?

Here is the survey.  The darker blue bar is the most popular response.

My own feeling is that the study with the smaller sample size is more evidential.   Let's take it from a few points-of-view:

Significance Testing:  If you are a strict adherent of significance testing, then you would use the p-values.  You might choose "same."  However, the example shows why significance testing is critiqued.  Let's consider comparisons across small and very large sample sizes, say N1=50 and N2=1,000,000.  The observed effect size for the first experiment is a healthy .32; that for the second is a meager .002.  So, as sample size increases and p-values do not, we are observing smaller and smaller effects.
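
To see how quickly the implied effect size shrinks, here is a rough sketch. It rests on assumptions that are not in the survey: a one-sample design, a normal (z) approximation to the test statistic, and a hypothetical two-sided p-value of .05, so it will not reproduce the .32 and .002 above exactly. Under those assumptions the observed standardized effect is just the critical z divided by the square root of N.

```python
import math
from statistics import NormalDist

def implied_effect_size(p_two_sided, n):
    """Standardized effect that yields the given two-sided p in a one-sample z-test."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return z / math.sqrt(n)

p = 0.05  # hypothetical p-value shared by both studies
d_small_n = implied_effect_size(p, 50)
d_large_n = implied_effect_size(p, 1_000_000)

print(f"N=50:        d = {d_small_n:.3f}")
print(f"N=1,000,000: d = {d_large_n:.5f}")
```

Holding p fixed, multiplying the sample size by 20,000 divides the observed effect by about 141, the square root of 20,000.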

Modern Testing I: Modern testing has been influenced by considerations of effect sizes.  If effect size is to matter for inference at all, then the correct answer is the smaller sample size.  After all, the p-values are equal and the smaller sample size has the larger effect size.

Modern Testing II: Another way of thinking about modern testing is that the analyst chooses a level based on context.  An obvious factor is sample size, and many authors recommend lowering alpha with increasing sample size.  Hence, the same p-value is more likely to be significant with the smaller sample size.

Bayesian Testing:  For all reasonable priors, the Bayes factor favors the smaller sample size because larger effect sizes are more compatible, in general, with the effect than with the null.  Tom Faulkenberry notes that if you get to see the data first and fine tune the priors, then you can game a higher Bayes factor for N2 than N1.
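
A small worked example of the Bayesian point, under loudly stated assumptions: a one-sample design with known unit variance, a Normal(0, g) prior on the standardized effect under the effect model, and a hypothetical shared test statistic of z = 1.96. None of these choices come from the survey, and the function name is made up. In this setup the sample mean is Normal(0, 1/N) under the null and Normal(0, g + 1/N) under the effect model, so the Bayes factor is the ratio of those two densities.

```python
import math

def bf10_normal_prior(z, n, g=1.0):
    """Bayes factor (effect vs. null) for a z statistic from n observations.

    Assumes known unit variance and a Normal(0, g) prior on the effect size.
    """
    shrink = n * g / (1.0 + n * g)
    return math.sqrt(1.0 / (1.0 + n * g)) * math.exp(0.5 * z * z * shrink)

z = 1.96  # hypothetical test statistic shared by both studies
bf_small_n = bf10_normal_prior(z, 50)
bf_large_n = bf10_normal_prior(z, 1_000_000)

print(f"N=50:        BF10 = {bf_small_n:.3f}")
print(f"N=1,000,000: BF10 = {bf_large_n:.5f}")
```

With this particular prior, the same z yields far more support for the effect model at N=50 than at N=1,000,000, where the data strongly favor the null. Other prior widths change the numbers, so treat this as one illustration rather than a general proof.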

What We Learned

For me, the best answer is N1 because it captures the appropriate post-data intuition that everything else equal larger effect sizes are preferable to smaller effect sizes when establishing effects.  Unfortunately, it was the least popular choice at 18%.

One of the shocking things to me is the popularity of N2 (24%).  I can't think of any inferential strategy that would give credence to an N2 response.  So, if you chose N2, you may wish to rethink how you evaluate the significance of effects.  The same response (18%) makes sense only if you are willing to ignore effect size.  This ignorance, however, strikes me as unwise in the current climate.

The most popular response is "depends" (40%).  I am not sure what to make of depends responses.  I suspect for some of you, it was a cop-out to see the results.  For others, it was an overly technical response to cover your bases.  In any case, it really doesn't depend that much.  Go with bigger effects when establishing effects.