Many psychology students find themselves in a situation where their research did not yield any significant results. This can be immensely frustrating, since they have put a lot of time and effort into designing the study and into collecting and analyzing the data. In some cases, be it out of desperation or pressure to publish interesting findings, certain students will effectively “hunt” for results by conducting statistical tests on all possible variable combinations. For instance, after noticing that a hypothesized correlation between two variables proved to be non-significant, a student might create a correlation matrix of all continuous variables in her study and hope for at least one pair to be significantly related. Other students might include one, two, or even more covariates in their analysis of variance (turning it into an ANCOVA), hoping that the interaction they initially hypothesized between their key factors will become significant.
Unfortunately, it is all too common for certain students (and, sadly, researchers) to include a wide array of measures in their questionnaire studies and to search for any significant relation between any of the variables once it turns out there is no evidence for their initial hypotheses.
Oftentimes, students see high p-values as a sign that their research is plainly “bad” and that they have failed. Although it is understandable that they would want to boost their research report by adding complementary results, suggesting that their research was not completely in vain, this practice is unacceptable for two reasons. First, it undermines the purpose of empirical research, and, second, it makes absolutely no sense statistically speaking. This post will briefly address both issues and conclude with a small word of advice.
The p-value and multiple hypothesis testing
The p-value is a fundamental concept in statistics and experimental psychology and the purpose here is not to give a full account of it (for an introductory textbook on statistics, see for example Howell, 2007). Rather, a common misconception of the p-value will be exposed and an explanation will be given as to why it is erroneous to play around with statistical analyses, conduct a large amount of them, and pick-and-choose results that happen to be significant.
One of the most important things to know about the p-value is that it tests hypotheses, namely null hypotheses. The p-value must be interpreted as “the probability that the data would be at least as extreme as those observed if the null hypothesis were true” (Vickers, 2010). In other words, if the null hypothesis was true (i.e. there is no difference to be expected), what is the probability of having observed as extreme or more extreme data? The p-value is definitely not the probability that the observed results are due to chance or that the null hypothesis is true or false!
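This interpretation can be illustrated with a short simulation (a sketch in Python using numpy and scipy; the group sizes and number of simulations are arbitrary choices for illustration). When the null hypothesis is true, p-values are approximately uniformly distributed, so about 5% of tests come out below 0.05 by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many two-sample t-tests where the null hypothesis is TRUE:
# both groups are drawn from the same normal distribution.
pvals = []
for _ in range(2000):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    pvals.append(stats.ttest_ind(a, b).pvalue)

# Under a true null, p-values are roughly uniform on [0, 1],
# so about 5% of them fall below 0.05 purely by chance.
frac_significant = np.mean(np.array(pvals) < 0.05)
print(f"Fraction of 'significant' tests under a true null: {frac_significant:.3f}")
```

Each of those “significant” tests reflects nothing but sampling noise, which is exactly why a small p-value cannot be read as the probability that the null hypothesis is false.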
When you conduct a statistical test, the results can either be statistically significant (in which case you reject the null hypothesis) or fail to be statistically significant (in which case you do not reject the null hypothesis). Depending on whether the null hypothesis is true or not, you can commit two types of errors (see Table 1):
Type I error: Also known as a false positive, this is the error of rejecting the null hypothesis when it is actually true. In other terms, you believe you are observing an effect when there is actually none.
Type II error: Also known as false negative, it is the error of not rejecting the null hypothesis when the alternative hypothesis is true. In layman’s terms, it means failing to observe a difference when there is one.
Table 1. Consequences of statistical tests based on their outcome and the truth of the null hypothesis.

                              Null hypothesis true             Null hypothesis false
Reject the null hypothesis    Type I error (false positive)    Correct decision
Fail to reject                Correct decision                 Type II error (false negative)
The type II error is related to a lack of statistical power, which can be controlled in part by adjusting the sample size to the magnitude of the expected effect and to the statistical significance criterion (α).
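The relation between power and sample size can be sketched with a simulation (Python with scipy; the effect size and sample sizes below are illustrative choices, not recommendations). For a fixed effect, larger samples make the test more likely to detect it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=2000):
    """Estimate the power of a two-sample t-test by simulation."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(control, treatment).pvalue < alpha:
            hits += 1
    return hits / n_sims

# For a medium effect (Cohen's d = 0.5), power rises with sample size.
power = {n: simulated_power(n, 0.5) for n in (20, 50, 100)}
for n, p in power.items():
    print(f"n = {n:3d} per group -> estimated power = {p:.2f}")
```

This is why power analyses are run before data collection: they tell you how many participants you need for a realistic chance of detecting the effect you expect.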
Regarding type I errors: statistics derives its power from random sampling (Huang, 2013). The idea is that if you create two samples at random, the differences between them average out: in theory, the samples are equivalent. Therefore, after the experiment’s treatment or manipulation, any difference observed should be due to the treatment or manipulation alone and not to the characteristics of the samples. Nonetheless, chance will occasionally produce two unequal samples. Furthermore, even if the samples are perfectly equivalent, we cannot be sure whether an observed difference is a one-time occurrence (due to chance) or a real consequence of the treatment or manipulation. Thus, we must take the necessary precautions to ensure that observed differences are not due to chance. We do so by setting the statistical significance criterion (α) low enough to minimize the chance of observing an effect when there actually is none (a type I error).
Now, when a student hunts for results by conducting a large number of statistical tests, the likelihood that at least one of the tests will yield a significant result rises rapidly with the number of tests (Huang, 2013). There is always a chance of committing a type I error, and if you run enough tests, you will eventually commit one.
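This inflation is easy to quantify. Assuming the tests are independent (a simplification), the probability of at least one false positive among m tests at level alpha is 1 - (1 - alpha)^m, the so-called family-wise error rate. A minimal sketch:

```python
# Family-wise error rate for m independent tests at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
fwer = {m: 1 - (1 - alpha) ** m for m in (1, 5, 10, 20, 50)}
for m, p in fwer.items():
    print(f"{m:3d} tests -> P(at least one false positive) = {p:.2f}")
```

With 10 tests the chance of at least one spurious “significant” result is already about 40%, and with 50 tests it exceeds 90%.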
There are several solutions to multiple hypothesis testing, such as the Bonferroni correction or the False Discovery Rate (Shaffer, 1995). However, these methods should only be used for hypotheses that have been established before the data collection, not after, which brings us to the second issue of hunting for results.
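As a sketch of how these corrections work in practice (the p-values below are made up for illustration): Bonferroni compares each p-value to alpha/m, while the Benjamini-Hochberg procedure controls the false discovery rate by comparing the k-th smallest p-value to (k/m)*alpha.

```python
# Hypothetical p-values from four pre-registered tests (illustrative numbers).
pvals = [0.003, 0.020, 0.040, 0.200]
alpha = 0.05
m = len(pvals)

# Bonferroni: reject only where p <= alpha / m (here 0.0125).
bonferroni_reject = [p <= alpha / m for p in pvals]

# Benjamini-Hochberg (FDR): find the largest rank k such that
# the k-th smallest p-value satisfies p_(k) <= (k / m) * alpha,
# then reject the k smallest p-values.
ranked = sorted(enumerate(pvals), key=lambda kv: kv[1])
k_max = 0
for rank, (_, p) in enumerate(ranked, start=1):
    if p <= rank / m * alpha:
        k_max = rank
bh_reject = [False] * m
for rank, (index, _) in enumerate(ranked, start=1):
    if rank <= k_max:
        bh_reject[index] = True

print("Bonferroni:", bonferroni_reject)
print("Benjamini-Hochberg:", bh_reject)
```

Here the conservative Bonferroni correction retains only the smallest p-value, while Benjamini-Hochberg also retains the second; both procedures presuppose that the set of hypotheses was fixed before looking at the data.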
The purpose of empirical research
Empirical research in psychology follows the scientific method: it tests falsifiable hypotheses against systematically gathered, objective evidence (Goodwin, 2009). When you design a study, you do so to verify whether a justified hypothesis is backed up by evidence. Hence, rummaging through your data for significant complementary results defeats the purpose of empirical research. And rewriting your paper to make it seem like you were expecting those results is plainly dishonest.
In conclusion, hunting for results makes no sense in terms of either statistical accuracy or scientific integrity. If your study is well designed and your data were collected properly, no professor should expect you to come up with complementary results just because your hypothesized effects turned out non-significant, not even for pedagogical purposes (especially not for pedagogical purposes!). Furthermore, you should not feel bad about non-significant results. First, non-significant results are actually important and must be taken into consideration; when they are not (i.e. when they go unpublished), they add to the file-drawer effect (for more information on the file-drawer effect, see the following Bulletin posts: Bias in psychology and Replication studies). Second, a p-value is only as good as your data, and non-significance may be due to sampling error (Huang, 2013).
Don’t hunt for results.
Goodwin, C. J. (2009). Research in Psychology: Methods and design. Hoboken, NJ: Wiley.
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA: Thomson Wadsworth.
Huang, H. (2013). Multiple Hypothesis Testing and False Discovery Rate [PDF document]. Retrieved from: http://www.stat.berkeley.edu/users/hhuang/STAT141/Lecture-FDR.pdf
Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561-584.
Vickers, A. (2010). What is a p-value anyway? 34 stories to help you actually understand statistics (1st ed.). Boston, MA: Pearson Education.