The logic of hypothesis testing can be really confusing. Why do we “reject” or “fail to reject”, and what does that really mean? Did we prove the null hypothesis when we didn’t reject it? These are common questions for any student studying these ideas.
It turns out that there is a very nice example of the logic behind hypothesis testing right in your everyday NFL or college football game. The basic idea comes from the “challenge” that a coach can use to dispute a referee’s call on the field. For example, a referee may rule that one team has scored. But it is a very close play, and the opposing team thinks the referee could be wrong. At this point, the coach of that team may elect (under certain conditions) to challenge the call. Once a call is challenged, the play is reviewed. During the review, the referees are looking for “clear evidence” that the call on the field was incorrect. If they find it, they will overturn the call. If not, they will make the statement “the ruling on the field stands” and that’s that. The game continues.
The phrase that should catch the eye of any statistics student is “the ruling on the field stands”. The refs are very careful not to say “the ruling on the field was correct”. Instead, they elect to say something that implies “we didn’t see enough evidence to overturn the call”. These are two very different things! It is sort of like a court of law. We don’t find people innocent; instead, we find them “not guilty”. In other words, we didn’t see enough evidence to change our minds from the assumption that they were innocent.
The Null Hypothesis
In hypothesis testing, the null hypothesis ($H_0$) is assumed to be true. So, to test a claim about the population, we take a sample and then look at the “evidence” (a p-value or a test statistic) to determine whether the sample we took is unusual enough to make us reject our assumption. That is, we decide to reject the null hypothesis or fail to reject the null hypothesis. We only reject it if there is significant evidence against it, in other words, a “small” p-value (a p-value smaller than the significance level $\alpha$).
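The decision rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `decide` and the default cutoff of 0.05 are assumptions made for this example, not part of any particular textbook or library.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the conclusion of a test from its p-value.

    alpha is the significance level (0.05 is an assumed, commonly used default).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    # A "small" p-value means the sample would be unusual if H0 were true.
    if p_value < alpha:
        return "reject H0"
    # Otherwise we do NOT say "H0 is true"; we only fail to reject it.
    return "fail to reject H0"


print(decide(0.01))  # small p-value
print(decide(0.20))  # large p-value
```

Notice that the function only ever returns one of two statements, and neither of them is “$H_0$ is true”.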
In the football example, the null hypothesis would be “the call on the field is correct” and the evidence would be the video replays available to the referees. The null hypothesis will only be rejected if there is significant evidence to the contrary.
Now, here is the tricky part. We can never PROVE the null hypothesis is true because we base all of our calculations on assuming it is true. We can only state whether there is evidence against it. Of course, in our football example, the referees probably could actually prove that the original call was correct, but reviewed plays are often a bit borderline (and I’m sure no one wants to say “that other ref was wrong”), so they choose to state only whether there is evidence against the null hypothesis. When there is enough video evidence to make them change their minds about the call, that is the same as saying “we reject the null hypothesis”.
The Alternative Hypothesis
The competing hypothesis is the alternative ($H_a$). We can almost think of this as the hypothesis that “something interesting is happening”. In other words, whatever we want to prove, whether it be that sales are increased by talking to customers or that a certain medicine reduces the number of headaches, will be the alternative hypothesis. When we reject $H_0$, we are saying there is evidence for the alternative hypothesis. We are saying that the sample would be unusual enough under the null hypothesis to make us question it altogether.
When the refs overturn a call, they are saying “we reject $H_0$; there is evidence for the alternative hypothesis”. That is, “we have enough evidence to make us question the original call”. In this case, the video was enough to make them seriously question the null hypothesis (that the call was correct).
Failing to Reject the Null
Thinking about all of this, failing to reject doesn’t say much. In the case of our friends the NFL refs, it simply means that there wasn’t enough evidence to make them change their minds about the call being correct. They aren’t saying it was correct, just that they can’t, based on this evidence, overturn the call.
In an experiment where we are trying to prove that a headache medicine reduces the average number of headaches patients experience each week, failing to reject would mean that there is not enough evidence that the medicine does in fact reduce this number. Any changes in the number of headaches experienced by patients could be due to chance. It doesn’t mean that the medicine absolutely doesn’t work. It just means that our study doesn’t prove that it does.
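To make the headache example concrete, here is a minimal sketch of a one-sided z-test using only the Python standard library. Every number in it is made up for illustration (a baseline of 5 headaches per week, a known standard deviation of 2, and a sample of 25 patients averaging 4.6), and treating the standard deviation as known is a simplifying assumption. The helper name `lower_tail_z_test` is also hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_z_test(sample_mean, mu0, sigma, n):
    """One-sided z-test of H0: mu = mu0 against Ha: mu < mu0 (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = NormalDist().cdf(z)  # P(Z <= z), computed assuming H0 is true
    return z, p_value


# Hypothetical data: patients on the medicine average 4.6 headaches per week.
z, p = lower_tail_z_test(sample_mean=4.6, mu0=5.0, sigma=2.0, n=25)

alpha = 0.05  # assumed significance level
conclusion = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p:.4f}, conclusion: {conclusion}")
# prints: z = -1.00, p-value = 0.1587, conclusion: fail to reject H0
```

The sample average is lower than the baseline, so the medicine may well be doing something. But a drop this size would show up about 16% of the time by chance alone if the medicine did nothing, so with these made-up numbers the study simply doesn’t give us enough evidence to say that it works.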
Suppose we were really trying to prove some headache medicine worked and in the end, we failed to reject our null hypothesis. What then?
Of course, we didn’t prove $H_0$, so it may be that our medicine works (although, since we didn’t reject the null, any effects we observed could plausibly be due to chance). But since we didn’t prove that it does, the FDA will never let us sell it to consumers. So even though we didn’t prove the null hypothesis, real life dictates that there are consequences to not rejecting it that are very similar to having proved it.
Similarly, suppose that I run a factory and you try to sell me a system that (supposedly) reduces errors. I run a test and find that I can’t prove it does in fact reduce errors (I fail to reject $H_0$). Do I say “well, you can’t PROVE $H_0$, so I will buy the system”? No, I will likely make the decision as though $H_0$ is true and therefore not buy it because it wasn’t proven to work. This is just practicality. My decision is based on the fact that you didn’t prove the alternative, $H_a$.
As you study hypothesis testing (z-tests, t-tests, and others), you may find some of the language and required statements a bit rigid and maybe almost lawyer-like. Now you can see that a lot of this comes from us trying to work around ideas of uncertainty and what you can and cannot prove. We have to be careful and make sure that we don’t say things like “$H_0$ is true” or even “$H_a$ is true” because in the end we are working with probabilities and assumptions. The only time we could really say these things is if we could work with the entire population (and then we wouldn’t even need hypothesis testing; think about that one!).