Cohen, J. (1995). The earth is round (p < .05): Rejoinder. American Psychologist, 50(12), 1103. doi:10.1037/0003-066X.50.12.1103
I am greatly pleased by, and thankful to, the many readers who responded to my article on null hypothesis significance testing (NHST; Cohen, December 1994). The purpose of the article was to begin a crusade to replace meaningless NHST with confidence limits on effect sizes. The responses were generally positive, and those that raised questions were stimulating.
To those who rushed to the defense of NHST, I concede that there are circumstances in which the direction, not the size of an effect, is central to the purpose of research. An example is a strictly controlled experiment, such as a clinical trial (although even in a clinical trial, nothing is lost and much may be gained with confidence limits). But the ritual of nil hypothesis testing has so dominated our research practice that it has inhibited our interest in the magnitude of the phenomena we study and the units in which they are measured, the basic stuff of which quantitative sciences are made. Parker (1995, this issue) worries about the equality of the units of our measures. The problem with his examples is twofold.
One problem is his presumption that a demonstration of equality of units in some abstract sense is a necessary condition for effect size measurement. Such a demonstration cannot be necessary, as it is not possible. Instead, measurement proceeds “in intimate relation with the empirical-theoretical structure of a scientific field” (Cliff, 1993, p. 61). Such a structure has existed for IQ for many years with no great concern about the equality of IQ units. I wouldn’t claim that every IQ unit is equal to every other IQ unit, but I think that—averaged over subjects and IQ units—they are equal enough.
The other problem is that the measures in Parker’s (1995) examples (timidity ratings, number of chili peppers eaten, number of correct identifications) arise from no web of relations in an empirical-theoretical structure; they are ad hoc measures, adequate only to the task of performing a significance test. Parker is quite right: We gain little (and risk a bellyache) from two to four chili peppers. And that, most emphatically, is the problem. Only by developing measures that psychologists in a given area can agree upon and use in their research can we have meaningful measurement units with which to build a cumulative scientific structure.
McGraw (1995, this issue) asserts that in “purely exploratory research that is unguided by any rigorous theoretical conceptualization and that has no literature to draw upon… the prior probability [that H_0 is true] is large” (p. 1100). Because I hold that the nil hypothesis is never true, its prior probability is zero. Even if H_0 is taken to mean trivially small, there is much persuasive evidence that the crud factor described by Meehl (1990) and Lykken (whom he cited) is likely to ensure that its prior probability is not large.
I used a high prior probability in my schizophrenia example to demonstrate how greatly mistaken one can be when one takes the p value as bearing on the truth of the null hypothesis. Although one cannot, of course, know the size of the crud factor in any given domain, I don’t find McGraw’s (1995) graph at all reassuring.
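The arithmetic behind this point can be sketched with Bayes’ theorem. The figures below are purely illustrative (they are not Cohen’s original schizophrenia numbers): when H_0 has a high prior probability, the probability that H_0 is true given a rejection can be far larger than the significance level.

```python
# Illustrative numbers only (not Cohen's): how a "significant" result can
# coexist with a high probability that H0 is true when H0 is highly probable
# a priori.
prior_h0 = 0.98   # assumed P(H0), e.g., most people screened are normal
alpha = 0.05      # P(reject H0 | H0 true)
power = 0.80      # assumed P(reject H0 | H0 false)

# Total probability of a rejection, then Bayes' theorem.
p_reject = alpha * prior_h0 + power * (1 - prior_h0)
p_h0_given_reject = alpha * prior_h0 / p_reject

print(f"P(reject) = {p_reject:.4f}")
print(f"P(H0 true | reject) = {p_h0_given_reject:.3f}")
```

Under these assumed inputs, P(H_0 true | reject) is roughly .75, not .05: the p value bears on P(data | H_0), not on P(H_0 | data).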
Frick’s (1995, this issue) comment does not examine what “95% probable” means in the context of confidence intervals. It means that if I were to repeatedly draw random samples from this population and set up a 95% confidence interval for each sample, my intervals would include the population parameter being estimated 95% of the time. Indeed, it means that over a lifetime of research, during which I computed many such intervals for different populations and different parameters, I would succeed in including the parameters I was estimating 95% of the time. This procedure in no way posits any null hypothesis.
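This coverage interpretation is easy to check by simulation. The sketch below is my own minimal illustration, not anything from the article: it repeatedly samples from a normal population with an assumed IQ-like mean and standard deviation, builds a 95% interval for the mean each time, and counts how often the interval covers the true parameter.

```python
# Minimal coverage simulation (illustrative; assumed IQ-like parameters).
import random
import statistics

random.seed(1)
mu, sigma = 100.0, 15.0   # true population mean and SD (assumed)
n, z = 50, 1.96           # sample size; normal critical value for 95%
trials = 2000

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    # Does the 95% interval for this sample contain the true mean?
    if m - z * se <= mu <= m + z * se:
        covered += 1

print(covered / trials)  # close to 0.95; no null hypothesis anywhere
```

Note that nothing in the loop states or tests a null hypothesis; the 95% is a property of the interval-constructing procedure over repeated sampling.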
Incidentally, I do not question the validity of NHST, but rather its widespread misinterpretation. If I reject the null hypothesis at the 5% level, I can correctly assert that if the null hypothesis were true, results as deviant as those in hand would occur less than 5% of the time. I cannot correctly assert that the probability that the null hypothesis is true is less than 5%. And apart from this misinterpretation, there is little point in rejecting the nil hypothesis, which, I repeat, is always false.
Baril and Cannon (1995, this issue) concede that I am correct in asserting that the probability that H_0 is true is zero, but they then redefine H_0 to mean that the parameter being estimated is trivially small, rather than nil—literally and exactly zero. Furthermore, they concede that reversed conditional probabilities are not equal. They take me to task, however, for using “inappropriate” and “irrelevant” (p. 1099) examples. They complain that in the schizophrenia example, H_0 (e.g., being normal), instead of being zero or trivially small, has a high probability, and that it is rarely the case that H_0 is certain, as in the Congress example.
My examples were not intended to model NHST as used “in the real world” (Baril & Cannon, 1995, p. 1099), but rather to demonstrate how wrong one can be when the logic of NHST is violated. I must point out that H_0 is generally a hypothetical statement of fact that is to be assessed using the rules of logic and is neither small (as in the nil hypothesis, which they seem to be assuming) nor large (as they take my Congress example to be). R. A. Fisher (1951) dubbed it the null hypothesis because it was the hypothesis to be nullified.
I must say that I am surprised by how much resistance I encounter to using confidence limits. Confidence limits not only tell you the status of the null (or nil) hypothesis, but also give you an idea of just how big the effect is. Without them, we relegate our conclusions to the form, in Tukey’s (1969, p. 86) immortal phrase, “if you pull on it, it gets longer!”
Baril, G. L., & Cannon, J. T. (1995). What is the probability that null hypothesis testing is meaningless? American Psychologist, 50, 1098-1099.
Cliff, N. (1993). What is and what isn’t measurement. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 59-93). Hillsdale, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Fisher, R. A. (1951). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd. (Original work published 1925)
Frick, R. W. (1995). A problem with confidence intervals. American Psychologist, 50.
McGraw, K. O. (1995). Determining false alarm rates in null hypothesis testing research. American Psychologist, 50, 1099-1100.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66 (Monograph supplement 1-V66), 195-244.
Parker, S. (1995). The “difference of means” may not be the “effect size.” American Psychologist, 50, 1101-1102.
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91.