## Saturday, March 3, 2018

### Hating on Sir Ronald? My Two Cents

This week, poor Sir Ronald A. Fisher took it on the chin in cyberspace.  Daniel Lakens, for example, wrote on Twitter:

"For some, to me, incomprehensible, reason, most people seem not educated well on Neyman-Pearson statistics, and still use the ridiculous Fisherian interpretation of p-values"  (3/1)

And Uli Schimmack wrote on Facebook:

"the value of this article is to realize that Fisher was an egomaniac who hated it that Neyman improved on his work by adding type-II errors and power. So, he rejected this very sensible improvement with the consequence that many social scientists never learned about power and why it is important to conduct powerful studies.   Fisher -> Bem -> Train Wreck." (2/22)

So, I thought I would give poor Sir Ronald some love.  Rather than dig up quotes claiming poor Sir Ronald was misunderstood, let me see if I can provide a common sense example of why Fisher's view of the p-value remains intuitive and helpful, at least to the degree that a p-value can be intuitive and helpful.

Here is the setup.  Suppose two researchers, Researcher A and Researcher B, each run an experiment, and fortuitously each used the same alpha, the same sample size, and the same pre-experimental power calculation.  Both get p-values below their shared alpha of, say, .01.  Now, each rejects the null with the N-P safety net that had they done their experiment over and over and over again, if the null were true, they would make this rejection for only 1% of the experiments.

Fine.  Except Researcher A's p-value was  .0099 and Researcher B's p-value was .00000001.  So, my question is whether you think Researcher B is entitled to make a stronger inferential statement than Researcher A?  If you read two papers with these p-values, could you form a judgment about which is more likely to have documented a true effect?  As I understand the state of things, if you think so, then you are using a Fisherian interpretation of the p-value.

In Neyman-Pearson testing, one starts with a specification of what the alternative would be if an effect were present.  This alternative is a point, say an effect of .4.  Then we design an experiment with a sample large enough to detect this alternative effect with some power level while maintaining a Type I error rate of some set value, usually .05.  And then, with our power informing our sample size, we collect data.  When finished, we compute a p-value and compare it to our Type I error rate.  If the p-value is below, we are justified in rejecting the null; otherwise, we are not.
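To make the recipe concrete, here is a minimal sketch of the sample-size step for a two-sample design, using the standard normal-approximation formula.  The effect of .4 and the alpha of .05 come from the paragraph above; the power target of .8 is my illustrative assumption, not anything from a real study.

```python
from statistics import NormalDist
from math import ceil

# Illustrative N-P design inputs: point alternative d = .4,
# two-sided alpha = .05, and a power target of .80 (my assumption).
d, alpha, power = 0.4, 0.05, 0.80

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = NormalDist().inv_cdf(power)          # quantile for the power target

# Normal-approximation per-group sample size for a two-sample test:
# n = 2 * ((z_alpha + z_power) / d) ** 2
n_per_group = 2 * ((z_alpha + z_power) / d) ** 2
print(ceil(n_per_group))  # -> 99
```

Nothing here is specific to the N-P philosophy; it is just the arithmetic that turns a point alternative, an alpha, and a power target into a sample size before any data are collected.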

As an upshot of N-P testing, the p-value is interpretable only insofar as it falls on one side or the other of alpha.  That is it.  It is either less than alpha or greater than it.  The actual value is not informative for inference because it does not affect the long-run error rates the researcher is seeking to preserve.  Both Researcher A and B are entitled to the same inferential statement---both reject the null at .01---and that is it.  There is no sense in which Researcher B's p-value is stronger or more likely to generalize.

So, do you think Researcher B has a better case?  If so, you are straying from N-P testing.

The beauty of Fisher is that, on his account, the p-value measures the strength of evidence against the null.  Smaller p-values always correspond to more evidence.  The ordinal relation between any two p-values, whether one is less than the other, can always be interpreted.

My sense is that this property makes intuitive sense to researchers.  Researcher B's rejection probably generalizes better than Researcher A's rejection.  And if you think so, I think you should be singing the praises of Sir Ronald.
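Here is one way to make that intuition concrete---a sketch of my own, not anything Fisher or N-P wrote down.  Convert each p-value to its two-sided z-equivalent and ask how much more likely that z is under a planned point alternative than under the null.  The noncentrality of 2.8 is an illustrative assumption (roughly what a design powered at .8 with a two-sided alpha of .05 targets).

```python
from statistics import NormalDist

std = NormalDist()

def z_from_p(p):
    """z-equivalent of a two-sided p-value."""
    return std.inv_cdf(1 - p / 2)

def likelihood_ratio(z, mu=2.8):
    """Density of z under the planned point alternative (noncentrality mu,
    an illustrative assumption) relative to its density under the null."""
    return std.pdf(z - mu) / std.pdf(z)

z_a = z_from_p(0.0099)       # Researcher A, about 2.58
z_b = z_from_p(0.00000001)   # Researcher B, about 5.73
lr_a = likelihood_ratio(z_a)
lr_b = likelihood_ratio(z_b)
print(lr_a, lr_b)  # B's data favor the alternative far more strongly
```

Both researchers reject at .01, but the likelihood ratios differ by orders of magnitude---which is exactly the graded, Fisherian reading of the two p-values.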

The main difference between Fisher and N-P is whether you can interpret the numerical value of a p-value as a statement about a specific experiment.  For Fisher, you could.  For N-P, you cannot.  N-P viewed alpha as a statement about the procedure you were using, more specifically, about its average performance across a large collection of studies.  This difference is most transparent for confidence intervals, where the only reasonable interpretation is Neyman's procedural one (see Morey et al., 2016, paper here).

There are difficulties in the Fisherian interpretation---if one states evidence against the null, what is one stating evidence for?  Fisher understood that p-values overstate the evidence against the null, which is why he pursued fiducial probability (see here for an entree into fiducial probability).

From my humble POV, Bayes gives us everything we want.  It is far less assumptive than specifying points used for computing power.  And we can interpret the evidence in data without recourse to a sequence of infinitely many experiments.  And we interpret it far more fully than the straitjacket dichotomy of "in the rejection region" or "not in the rejection region."
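As one hedged sketch of what that looks like, consider a simple Bayes factor on the z scale: model z | mu ~ N(mu, 1), and under the alternative put a prior mu ~ N(0, sigma^2), so that marginally z ~ N(0, 1 + sigma^2) and the Bayes factor is a ratio of two normal densities.  The prior scale of 1 is my illustrative choice, not a recommendation.

```python
from statistics import NormalDist

std = NormalDist()

def bayes_factor(z, prior_sd=1.0):
    """Bayes factor for the alternative over the null for an observed z.
    Model: z | mu ~ N(mu, 1); under H1, mu ~ N(0, prior_sd**2), so
    marginally z ~ N(0, 1 + prior_sd**2).  Under H0, z ~ N(0, 1).
    The prior_sd of 1 is an illustrative assumption."""
    h1 = NormalDist(0, (1 + prior_sd ** 2) ** 0.5)
    return h1.pdf(z) / std.pdf(z)

z_a = std.inv_cdf(1 - 0.0099 / 2)      # Researcher A's z, about 2.58
z_b = std.inv_cdf(1 - 0.00000001 / 2)  # Researcher B's z, about 5.73
bf_a = bayes_factor(z_a)
bf_b = bayes_factor(z_b)
print(bf_a, bf_b)
```

The output is graded, not dichotomous: both researchers get evidence for an effect, but Researcher B gets vastly more of it, and no appeal to a long run of hypothetical experiments is needed to say so.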