Thursday, December 12, 2019

Are you strong in math stats or probability?  I have a conjecture.  It is about order statistics.  The result is a bit surprising and I haven't figured out why it is so beautiful.

First,  an easy problem.  Suppose
$X_i \stackrel{iid}{\sim }\mbox{G},\; i=1,\ldots,I$
is a sequence of independent and identically distributed continuous random variables with some distribution $$G$$.  For example, $$G$$ could be a standard normal distribution.  And suppose the same for $$Y_i$$:
$Y_i \stackrel{iid}{\sim }\mbox{G},\; i=1,\ldots,I.$
Further, define $$A=\{i: \mbox{ such that } X_i<Y_i\}$$.  Let $$n$$ be the size of $$A$$, that is, $$n$$ is the number of times $$X_i<Y_i$$.  It should come as no surprise that $$n$$ is distributed as a binomial with a probability of .5 and a size of $$I$$, and this holds regardless of $$G$$.  So, for example, if $$I=5$$, then the probability distribution of $$n$$ is:

That is the easy problem.  Here is the hard problem.  Suppose we do the same, but instead of comparing the samples as drawn, we sort them first so $$X_1<X_2<\dots<X_I$$ and $$Y_1<Y_2<\ldots<Y_I$$.  Then, my contention is that $$n$$ is distributed as a discrete uniform with mass $$1/(I+1)$$ on points $$0,\ldots,I$$.  The following is some R code:

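I=5                  # size of each sample
M=100000             # number of simulated data sets
n=numeric(M)         # n[m] will hold the count for data set m
for (m in 1:M){
X=sort(rnorm(I))     # sort each sample before comparing
Y=sort(rnorm(I))
n[m]=sum(X<Y)}       # number of positions where X_i<Y_i
p=table(n)/M         # estimated probability mass function of n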
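
To see the easy and hard problems side by side, here is a minimal sketch that runs both versions of the simulation and lines the estimated probabilities up against the binomial and the discrete uniform.  The helper names (count_less, pmf), the seed, and the smaller M are illustrative choices, not part of the original simulation, and agreement in simulation is of course not a proof.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
I <- 5        # size of each sample
M <- 20000    # number of simulated data sets (smaller than above, for speed)

# Count how often X_i < Y_i, with or without sorting the samples first
count_less <- function(sorted) {
  X <- rnorm(I)
  Y <- rnorm(I)
  if (sorted) {
    X <- sort(X)
    Y <- sort(Y)
  }
  sum(X < Y)
}

n_unsorted <- replicate(M, count_less(sorted = FALSE))
n_sorted   <- replicate(M, count_less(sorted = TRUE))

# Estimated probability mass on 0, ..., I (zero counts are kept)
pmf <- function(n) as.vector(table(factor(n, levels = 0:I))) / M

comparison <- data.frame(
  n        = 0:I,
  unsorted = pmf(n_unsorted),                        # easy problem
  binomial = dbinom(0:I, size = I, prob = 0.5),      # its known answer
  sorted   = pmf(n_sorted),                          # hard problem
  uniform  = rep(1 / (I + 1), I + 1)                 # the conjecture
)
print(round(comparison, 3))
```

Changing rnorm to another continuous generator, say rexp, is a quick way to probe whether the pattern depends on $$G$$.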

Here is the graph of the probability mass function:

Cool result.  I just can't prove it.  Any help?  Be a math-stat hero.  Thanks!

Thursday, May 9, 2019

Two Writing Tips

The subject of writing tips has come up on twitter, and I have more than 280 characters to say.  My two thoughts:

Liberate yourself from your work.  Many of us are scared to write.  Here is why:  The worst thing for an academic to feel is stupid.  Yet, we rightfully feel stupid a good proportion of the time.  The literature is vast, everyone knows more than we do on every topic, and our ideas feel pedestrian.  If we don't write, then our stupidity is not exposed.  If we do write, then we risk being exposed for the frauds that we are.  I feel stupid even writing about this.

Here is the thing.  Your writing is not about you.  It is its own thing, in fact, its own story.  You have the privilege of telling this story, but it is not your story.  So, liberate yourself from it.  Focus on the story itself and not what it says about you.  Because people care about the story.  They don't care that much about you or what the story says about you.  That people don't care about you is freedom.  You still need to be brave and vulnerable to write.  But, you are free to focus exclusively on the story---to give it justice---without excessively worrying about how people will see you, because, honestly, they are not looking at you.

Motivation.  People don't read just because you write.  In fact, the easiest thing for a reader to do is stop reading.  I stop reading all the time.  I do so as soon as I no longer know why I am reading something.  And I think this is universal.  People stop reading when they do not know why they are reading.  This need for knowing why holds at all levels of the paper.  Is your material motivated?  One way of motivating material is to focus on a problem/solution format.  Readers can relate to paragraphs that define problems or provide solutions.  If a paragraph is not serving one of these roles, you might want to reconsider what it is doing.

Often paragraphs that are unmotivated are the ones about the writer and not the story.  They are attempts to show that the writer mastered the literature or some skill.  They might even be those fear paragraphs where we fear if we don't genuflect properly, some generic "they" won't like our paper.  If you can't say why you are writing something, then don't write it.

Saturday, March 23, 2019

Teaching Undergrad Stats without p, F, or t.

I taught a 10 week intro-level stats course without p, F, or t.

The course proceeded in the usual way:
• one-group designs
• two-group designs
• many-group designs
• factorial designs
• OLS regression
• multiple regression
• mixed factors and continuous covariates (ANCOVA)
The key to the course was the concept of a model.  I spent quite a bit of time on a distribution and an observation from it.  I also talked about the random-variable notation like $$Y_i$$ for each of $$i$$ observations in a data set.    Then, all inference in the class was model comparison.  No hypotheses, only models that instantiate theoretically interesting positions.  The models always have to be written down before analysis.

So, for the one-group design, we carried two models.  Students had to master the following notation:

$$\mbox{Null Model}: \quad Y_i \sim \mbox{Normal}(0,\sigma^2)$$
$$\mbox{Effect Model}: \quad Y_i \sim \mbox{Normal}(\mu,\sigma^2)$$

I then asked them how the model accounted for each observation.  For example, if the data were the observations  (-2,-1,0,1), we would make the following tables:

Null-Model Table

Data Account Error Squared Error
-2 0 -2 4
-1 0 -1 1
0 0 0 0
1 0 1 1

Effects-Model Table

Data Account Error Squared Error
-2 -.5 -1.5 2.25
-1 -.5 -.5 .25
0 -.5 .5 .25
1 -.5 1.5 2.25

The next step is to calculate $$SS_E$$, $$R^2$$, and BIC, with $$R^2$$ serving as a measure of effect size and BIC as a relative model comparison statistic.  Of course, I taught the formulas and meanings of these three quantities.  BIC, for example, was taught in terms of error and penalty, e.g., $$\mbox{BIC}=n\log(SS_E/n)+k\log(n)$$.   Continuing, the following model-comparison table was produced:

Model Parameters SSE R^2 BIC
Null 0 6 0 1.62
Effect 1 5 .167 2.27
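
For readers who prefer code to a calculator or spreadsheet, here is a minimal sketch of the same calculation in R for the data (-2,-1,0,1).  The function names are hypothetical, and $$R^2$$ is taken here as the proportional reduction in $$SS_E$$ relative to the null model, which is an assumption inferred from the values in the table.  The output should match the table up to rounding.

```r
y <- c(-2, -1, 0, 1)   # the example data
n <- length(y)

# Sum of squared errors when each observation is accounted for by `account`
sse <- function(y, account) sum((y - account)^2)

# BIC as error plus penalty; k is the number of mean parameters
bic <- function(sse, n, k) n * log(sse / n) + k * log(n)

sse_null   <- sse(y, 0)         # null model accounts for every observation with 0
sse_effect <- sse(y, mean(y))   # effect model accounts with the sample mean

comparison <- data.frame(
  Model      = c("Null", "Effect"),
  Parameters = c(0, 1),
  SSE        = c(sse_null, sse_effect),
  R2         = c(0, 1 - sse_effect / sse_null),   # assumed definition, see above
  BIC        = c(bic(sse_null, n, 0), bic(sse_effect, n, 1))
)
print(comparison)
```

The model with the lower BIC is the one preferred after the penalty for the extra parameter.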

Here the effect was actually pretty big in $$R^2$$, but we do not have the resolution to prefer the effects model over the null model given the small sample size.  To help understand what $$R^2$$ means, I provide a list of accounted-variances for various phenomena, say how much variance does smoking account for in lung-cancer rates.

Students were taught how to calculate all the values by calculator in small data sets and by spreadsheet in large data sets.

Interpretation

For interpretation, students were taught that:
a. their inference was only as good as their models
b. no model was true or false, all were just helpful abstractions
c. their primary goal was model comparison rather than testing
d. they need not make decisions, just assess evidence judiciously
e. they should consider both model comparison (BIC) and effect size ($$R^2$$) in assessment

Extension

The above model-comparison-through-error approach extends gracefully to all linear model applications.  In this sense, once the mechanics and interpretations are mastered in the one-sample case, the extensions to more complex models, including multiple regression and multi-factor ANOVA are straightforward.  By getting the mechanics out early, we can focus on the models and how they account for phenomena.  Contrast this to the usual case where you may teach one set of mechanics for the t-test, another for F-tests, and a third for regression.

Results

The quarter went well.  The students mastered the material effectively.  I had fun.  They had fun.  And I never lost any sleep at night wondering why I was teaching what I was teaching.

Thursday, February 28, 2019

Reimagining Meta-Analysis

Fruit and Meta-Analysis

The fruit in my house weigh on average 93 grams.  I know this because I weighed them.  The process of doing so is a good analogy for meta-analysis, though a lot less painful.

I bet you find the value of 93 grams rather uninformative.  It reflects what my family likes to eat, say bananas more than kiwis and strawberries more than blackberries.  In fact, even though I went through the effort of gathering the fruit, setting fruit criteria (for the record, I excluded a cucumber because cucumbers, while bearing seeds, just don't taste like fruit), and weighing them, dare I say this  mean doesn't mean anything.  And this is the critique Julia Haaf (@JuliaHaaf), Joe Hilgard (@JoeHilgard), Clint Davis-Stober (@ClintinS) and I  provide for meta-analytic means in our just-accepted Psychological Methods paper.   Just because you can compute  a sample mean doesn't mean that it is automatically helpful.

Means are most meaningful to me when they measure the central tendency of some naturally interesting random process.  For example, if I was studying environmental impacts on children's growth in various communities, the mean (and quantiles) of height and weight for children of given ages is certainly meaningful.  Even though the sample of kids in a community is diverse in wealth, race, etc., the mean is helpful in understanding say environmental factors such as a local pesticide factory.

In meta-analysis, the mean is over something haphazard...what type of paradigms happen to be trendy for certain questions.   The collection of studies is more like a collection of fruit in my house.  And just as the fruit mean reflects my family's preferences about fruit as much as any biological variation among seeded plant things, the meta-analytic mean reflects the sociology of researchers (how they decide what data to collect) as much as the phenomenon under study.

Do All Studies Truly?

In our recent paper, we dispense with the meta-analytic mean.  It simply is not a target for us for scientific inference.  Instead, we ask a different question, "Do All Studies Truly...."  To set the stage, we note that most findings have a canonical direction.  For example, we might think that playing violent video games increases rather than decreases subsequent aggressive behavior.  Increases here is the canonical direction, and we can call it the positive effect.  If we gather a collection of studies on the effects of video game violence, do all truly have an effect in this positive direction, that is, do all truly increase aggression, or do some truly increase and others truly decrease aggression?  Next, let's focus on truly.  Truly for a study refers to what would happen in the large-sample limit of many people.  In any finite sample for any study, we might observe a negative-direction effect from sampling noise, but the main question is about the true values.  Restated, how plausible is it that all studies have a true positive effect even though some might have negative sample effects?  Using Julia's and my previous work, we show how to compute this plausibility across a collection of studies.

So What?

Let's say, "yes," it is plausible that all studies of a collection truly have effects in a common direction, say violent video games do indeed increase aggression.  What is implied is much more constrained than some statement about the meta-analytic mean.  It is about robustness.  Whatever the causes of variation in the data set, the main finding is robust to these causes.  It is not that just the average shows the effect, but all studies plausibly do.  What a strong statement to make when it holds!

Now, let's take the opposite possibility, "no."  It is not plausible that all studies truly have effects in a common direction.  With high probability some have true effects in the opposite direction.  The upshot is a rich puzzle.  Which studies go one way and which go the other way?  Why?  What are the mediators?

In our view, then, the very first meta-analytic question is "do all studies truly."  The answer will surely shape what we do next.

Can You Do It Too?

Maybe, maybe not.  The actual steps are not that difficult.  One needs to perform a Bayesian analysis and gather the posterior samples.  The models are pretty straightforward and are easy to implement in the BayesFactor package, Stan, or JAGS.  Then, to compute the plausibility of the "do all studies truly" question, one needs to count how many posterior samples fall in certain ranges.  So, if you can gather MCMC posterior samples for a model and count, you are in good shape.
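
To give a feel for the counting step only, here is a minimal sketch in R.  The posterior samples are simulated stand-ins rather than output from a fitted model, and the number of studies, the means, and the standard deviation are made up for illustration.  With real MCMC output you would replace the matrix with your own draws, one column per study.  How the count is turned into a formal statement of evidence is worked out in the paper and is not reproduced here.

```r
set.seed(123)      # arbitrary seed, only for reproducibility
n_studies <- 8     # hypothetical number of studies
n_draws   <- 10000 # hypothetical number of posterior draws

# Stand-in posterior: each column holds draws of one study's true effect.
# In practice this matrix would come from your sampler, not from rnorm.
post_mean <- seq(0.1, 0.4, length.out = n_studies)
posterior <- sapply(post_mean, function(m) rnorm(n_draws, mean = m, sd = 0.1))

# For each draw (row), are all study effects in the positive direction?
all_positive <- apply(posterior > 0, 1, all)

# Proportion of posterior draws in which every study truly has a positive effect
post_prob <- mean(all_positive)
print(post_prob)
```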

We realize that some people may be drawn to the question and may be repelled by the lack of an automatic solution.   Julia and I have unrealized dreams of automating the process.  But, in the meantime, if you have a cool data set and an interest in the does-every-study-truly question, let us know.

Rouder, JN, Haaf, JM, Davis-Stober, C, Hilgard, J (in press) Beyond Overall Effects: A Bayesian Approach to Finding Constraints Across A Collection Of Studies In Meta-Analysis. Psychological Methods.

Saturday, January 5, 2019

P-values and Sample Sizes, the Survey

I ran a brief 24 hour survey in which many of you participated.  Thank you.

The main goal was to explore how people trade off sample size and p-values.  I think with the adoption of power and sample-size planning, many people have mistakenly used pre-data intuitions for post-data analysis.  Certainly, if we had no data, we would correctly think, all other things being equal, that a larger study has greater potential to be more evidential than a smaller one.  But what about after the data are collected?

Here is the survey.  The darker blue bar is the most popular response.

My own feeling is that the study with the smaller sample size is more evidential.   Let's take it from a few points-of-view:

Significance Testing:  If you are a strict adherent of significance testing, then you would use the p-values.  You might choose "same."  However, the example shows why significance testing is critiqued.  Let's consider comparisons across small and very large sample sizes, say N1=50 and N2=1,000,000.  The observed effect size for the first experiment is a healthy .32; that for the second is a meager .002.  So, as sample size increases and p-values do not, we are observing smaller and smaller effects.
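
The arithmetic behind those two effect sizes can be sketched in a few lines of R.  The design and the common p-value are assumptions for illustration: a one-sample, two-sided t-test with p = .03 in both studies, which happens to give values close to the ones quoted above.  The observed standardized effect is then the t-value divided by the square root of the sample size.

```r
# Assumed for illustration: one-sample, two-sided t-test, same p in both studies
p <- 0.03
N <- c(N1 = 50, N2 = 1e6)

# t-value that yields exactly this p-value at each sample size
t_obs <- qt(1 - p / 2, df = N - 1)

# Observed standardized effect size implied by that t-value
d <- t_obs / sqrt(N)
print(signif(d, 2))
```

Holding p fixed, the implied effect size shrinks roughly in proportion to one over the square root of N.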

Modern Testing I: Modern testing has been influenced by considerations of effect sizes.  If effect size is to matter for inference at all, then the correct answer is the smaller sample size.  After all, the p-values are equal and the smaller sample size has the larger effect size.

Modern Testing II: Another way of thinking about modern testing is that the analyst chooses a level based on context.  An obvious factor is sample size, and many authors recommend lowering alpha with increasing sample size.  Hence, the same p-value is more likely to be significant with the smaller sample size.

Bayesian Testing:  For all reasonable priors, the Bayes factor favors the smaller sample size because larger effect sizes are more compatible, in general, with the effect than with the null.  Tom Faulkenberry notes that if you get to see the data first and fine tune the priors, then you can game a higher Bayes factor for N2 than N1.

What We Learned

For me, the best answer is N1 because it captures the appropriate post-data intuition that everything else equal larger effect sizes are preferable to smaller effect sizes when establishing effects.  Unfortunately, it was the least popular choice at 18%.

One of the shocking things to me is the popularity of N2 (24%).  I can't think of any inferential strategy that would give credence to an N2 response.  So, if you chose N2, you may wish to rethink how you evaluate the significance of effects.  The same response (18%) makes sense only if you are willing to ignore effect size.  This ignorance, however, strikes me as unwise in the current climate.

The most popular response is "depends" (40%).  I am not sure what to make of depends responses.  I suspect for some of you, it was a cop out to see the results.  For others, it was an overly technical response to cover your bases.  In any case, it really doesn't depend that much.  Go with bigger effects when establishing effects.

Monday, November 19, 2018

Preregistration: Try it (Or not)

So, as the Statistical War and Tone War are in a lull, the Preregistration conflict has flared up yet again.   A few thoughts on the airplane back home from Psychonomics.

Prologue:

A. To be honest, it has taken me quite a long time to sort out my thoughts on preregistration.   I am not telling you to preregister or not.  Moreover, how I read your work is not dependent on whether you preregistered or not.  Perhaps you might find my thoughts helpful in your decision; perhaps not.

B. I don't believe in the usefulness of the exploratory/confirmatory distinction.  All of my research is motivated by some theoretical issue (so it is not exploratory) and I am always open to alternatives that I have not considered (so it is not confirmatory).  Arguments that rely on the exploratory/confirmatory distinction are not persuasive to me, and I will not be using them here (or elsewhere).

Why I Preregistered, the story:

I used preregistration because my students forced me to.  I found the experience rewarding and will preregister again.  Perhaps the strongest argument for preregistration is that it may clarify the researcher's thinking before seeing the data.  I think most of us can agree that writing is hard, and one of the reasons it is hard is that it forces you to clarify your thinking on things.  Preregistration in some sense provides the opportunity for that type of clarification before the data are collected.  As we wrote the preregistration, my team realized we hadn't thought enough about what type of models could instantiate one of the theoretical alternatives.  So, we made a set of additional model specifications before seeing the data.  That was quite helpful.

Why I don't Take the Preregistration Too Seriously:

I feel no hesitation to break my preregistration.  In fact, I do not know if we did or did not break our preregistration because I never went back and read it!  I don't care if we did or not, to be honest.   I actually think this is not such a bad idea.  Here is why:

As Shiffrin notes, science requires good judgment.  In fact, being open-minded, flexible, and judicious are probably more important characteristics than being smart, industrious, or meticulous.  Now, what I like about preregistration is that it summons me to provide my best judgment at a particular point in time.  But, as new information comes in, including data, I need to exercise good judgment yet again.  Hopefully, the previous efforts will make the current efforts easier and more successful.  But that is where it stops.  I will have no contract with my preregistration; instead I will use good judgment.  Preregistration is used to improve the pre-data steps, and hopefully that will improve post-data steps too.

So if you preregister, consider the following:

1. Try not to substitute your preregistration for your best judgment.  You can add value judiciously.  Don't trade in what you know now for what you knew then.

2. Don't forget to have a conversation with your data.  Nature only whispers; you need to communicate with her softly and subtly.  You gently ask one thing, she whispers something else.  And you go back and forth.  Please do not downgrade this conversation because it might be the most important thing you do with your data.

If You Want Others To Preregister:

Tell your story.   Maybe in detail.  What might you have done differently?

Saturday, September 1, 2018

Making Mistakes

Oh the irony.

I made a potentially dreadful mistake last month.  I zipped up and submitted the wrong version of a manuscript.  It was the final version for typesetting.  It could have been a fucking disaster---imagine if the original submission rather than the revision had been published.  I didn't catch the mistake; the amazing Michele Nathan at Psychological Science did.

And the manuscript that I made this mistake on was itself about minimizing mistakes (Click Here For The Paper).  That's right, we made a dreadful mistake in processing a paper for minimizing such mistakes.   Boy, do I feel like a fraud.

One solution is to berate myself, lose sleep, hide in shame, and promise I will be more careful.  This solution, of course, is useless because (a) there are much better things to feel shame about and (b) no matter my intentions, I will not be more careful.

The other solution is to do what we say in the paper.  We use the principles of High Reliability Organizations to deal with the mistake.

Step 1: Mistakes Happen, Record Them

Yeah, it happened.  We know mistakes happen and we are prepared for them.  So, the first thing I did was open up a little form in our database called "Adverse Events."  All labs should have a procedure for documenting mistakes.  In my lab, we do all adverse events together so everyone is in on the problem and the resolution.

Step 2: What Was the Mistake

I submitted directory sub2 instead of directory rev1.  Directory sub2 was the second version and was originally submitted.  Directory rev1 was the third, and it was in response to reviews.  Obviously, with one second of thought, most of us would know that rev1 supersedes sub2.  We could even check the dates and the log, which were accurate.  But, I happen to be one of the least detail-oriented people on the planet.  So, I must have seen the 2 in sub2 and zipped and submitted that directory.

Step 3: The Cause

It is obvious the problem here is a deficient set of naming conventions for files and directories.  We do a pretty good job of naming files in the lab.  We have git repositories, and all repos have the same few directories: dev, share, papers, presentations, grants.  Also, we have good mysql logs of when we submit papers.  So tracking down the mistake and why it occurred was easy.

What we did not have was a firm naming convention for versions within the directory "papers".  Clearly, we need to standardize this convention.

Step 4:  Resolution

In our system there are two workable solutions.  The first is to use successive version numbers on a single name, e.g., minMistakes.1 might have been the first version, minMistakes.2 might have been the second, and so on.  Here, instead of calling directories sub2 and rev1, we number successively and use a more informative root name.  We lose where we are in the process, though.  Fortunately, we keep pretty good records of things that we do in our database, so there we could record that minMistakes.2 was the first version submitted, etc.  An even better solution is to use git to sort the major versions.  We probably should not be changing directories at all and should just use git tags for major versions and process notes.  That is what I am going to do from now on.

Here is our adverse event report now in our database.

Anyways, with the new conventions, I most likely won't make that mistake again.