Exploratory and Confirmatory Hypothesis Testing


The replication crisis has spread all across the scientific community. In the field of psychology, scientists were not able to replicate more than half of the previous findings they re-examined (Open Science Collaboration, 2015). For a long time this problem went unnoticed, but a critical moment occurred when Daryl Bem published his now infamous paper on humans’ ability to quite literally predict the future (Bem, 2011). Many readers doubted his findings, as there was no logical basis for the ability to predict the future. Years later, Daniel Engber summarized it nicely when he wrote:

“(…) the paper posed a very difficult dilemma. It was both methodologically sound and logically insane. (…). If you bought into those results, you’d be admitting that much of what you understood about the universe was wrong. If you rejected them, you’d be admitting something almost as momentous: that the standard methods of psychology cannot be trusted, and that much of what gets published in the field—and thus, much of what we think we understand about the mind—could be total bunk.” (Engber, 2017)

Wagenmakers, Wetzels, Borsboom, and van der Maas (2011) were quick to point out that the biggest problem with Bem’s paper is that it does not properly disclose which parts of the analysis are confirmatory and which ones are purely exploratory. To examine why purely exploratory testing is problematic, we must first distinguish between confirmatory and exploratory analysis. Confirmatory analysis refers to the kind of statistical analysis where hypotheses that were properly deduced from a theory are tested with all statistical parameters defined beforehand. In exploratory analysis, on the other hand, statistical analysis is employed after data collection without any clear theory-driven hypothesis in mind and in the absence of predetermined statistical parameters.

Graphic: Confirmatory vs Exploratory research by Dirk-Jan Hoek for Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit (2012).


Lots of problems with lots of science

Finding the roots of today’s problems means we have to look at why setting up statistical parameters before data collection and statistical analysis is crucial to science. The Neyman-Pearson approach to hypothesis testing is based on two hypotheses that are tested against one another: the alternative hypothesis (H1), representing the researcher’s presumed effect, and the null hypothesis (H0), representing a null effect. For instance, in a t-test the H1 would assume a difference between two group means, while the H0 would assume no difference between group means. To conclude that an effect has been found by an analysis, the H0 has to be rejected in favor of the H1.

To adequately test these against one another, four parameters come into play: alpha-error, beta-error, effect size and sample size. The alpha-error refers to the probability that the H0 is rejected when there is no effect, while the beta-error refers to the probability that the H1 is rejected when there is a true effect. The sample size is very straightforward: it is the number of data points you collect in your study. The effect size in the t-test scenario tells you how far apart the means of both groups are, commonly reported with a measure like Cohen’s d.
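As a minimal sketch of the effect size mentioned above, Cohen’s d for two independent groups can be computed as the difference in means divided by the pooled standard deviation. The function name `cohens_d` and the scores are made up for illustration; they do not come from any study discussed here.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores, for illustration only.
treatment = [2, 4, 6]
control = [1, 3, 5]
print(cohens_d(treatment, control))  # prints 0.5
```

Here the group means differ by 1 point and the pooled standard deviation is 2, so the groups are half a standard deviation apart.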

Luckily it is possible to calculate parameters based on other parameters. In order to calculate one of the parameters, the other three must be set up beforehand. Today this is mostly used to calculate the sample size, because you can reasonably justify the other parameters, either by scientific convention or by theoretical explanation. The scientific convention is to set the alpha-error to 5% by default, ever since it was originally suggested by Ronald Fisher. It was only during the 1980s that Jacob Cohen suggested a convention for setting a beta-error rate of 20%, arguing that alpha-errors would be four times more serious than beta-errors (Cohen, 1988; cited after Lakens, 2019a). Even though this reasoning might not hold up for every research question, at least it gives researchers an idea of how solid their findings are. Lastly, the expected effect size can be set based on previous findings or theoretical background. This is one of the more complicated steps in the scientific process because it can be hard to justify how large an effect size should be to be meaningful.
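To make the idea of deriving one parameter from the other three concrete, here is a rough sketch that computes the sample size per group from alpha-error, beta-error and expected effect size. It assumes a two-sided comparison of two independent group means and uses a normal approximation rather than the exact t-distribution, so an exact power analysis will give slightly larger numbers. The function name `n_per_group` is made up for this example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, beta=0.20):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = std_normal.inv_cdf(1 - beta)        # quantile for the desired power (1 - beta)
    # Normal approximation: n = 2 * ((z_alpha + z_beta) / d) ** 2
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)


# Cohen's d values of 0.2, 0.5 and 0.8 are used here purely as example inputs.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

With the conventional 5% alpha-error and 20% beta-error, this approximation asks for about 63 participants per group for d = 0.5 and about 393 per group for d = 0.2, which shows how quickly the required sample grows when the expected effect is small.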

For the longest time, setting up parameters beforehand was ignored and sample sizes were chosen according to a rule of thumb, as Daniel Lakens puts it in one of his blog posts:

“You can derive the age of a researcher based on the sample size they were told to use in a two independent group design. When I started my PhD, this number was 15, and when I ended, it was 20. This tells you I did my PhD between 2005 and 2010. If your number was 10, you have been in science much longer than I have, and if your number is 50, good luck with the final chapter of your PhD.” (Lakens, 2019b).

This leads to a big issue: even if an effect is found, nobody can be reasonably confident in the findings since the beta-error is unknown. And even if the beta-error is calculated afterwards from the alpha-error, the collected sample size and the found effect size, it contains a lot of bias since the calculation is based on the found effect size, not the one that is theoretically reasonable. In essence, a significant test in this scenario will only show that there is no null effect. However, since the alpha-error is set to 5%, you can expect five significant results in 100 studies even if there is no effect. Of course, when you run 100 studies you don’t always find exactly five false significant results, much like flipping a coin 100 times does not result in 50 heads and 50 tails every time, but as the number of studies approaches infinity the proportion of false positives approaches 5%.
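The long-run behaviour of the alpha-error can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: both groups are drawn from the same normal distribution (so there is no true effect), and a z-test with known standard deviation stands in for the t-test. The function name and the default group size of 20 are arbitrary choices for this example.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_false_positive_rate(n_studies, n_per_group=20, alpha=0.05, seed=1):
    """Share of significant results when both groups come from the same population."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    significant = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        mean_diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        # z-test with known standard deviation 1 (a simplification of the t-test).
        z = mean_diff / sqrt(2 / n_per_group)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            significant += 1
    return significant / n_studies


for n_studies in (100, 1000, 10000):
    print(n_studies, simulate_false_positive_rate(n_studies))
```

With only 100 simulated studies the share of false positives can land noticeably above or below 5%, while larger numbers of studies should settle closer to it.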

In the case of Bem’s original paper, Wagenmakers et al. (2011) pointed out a problem with Experiment 1, in which Bem found that his participants could predict where pictures with erotic content would end up on a computer screen significantly above chance (in this case 53.6%): “[Bem] tested not just erotic pictures but also neutral pictures, negative pictures, positive pictures, and pictures that were romantic but nonerotic.” This strongly suggests that he used the way error rates work to his advantage by running multiple tests until one of them showed a significant effect.

To illustrate this, let’s say you already know for sure that there is no effect in your dataset (for example if you tested whether humans can predict future events). Now you set up 15 separate t-tests by setting up 15 groups in your data. Each of these tests has a 5% chance to show significance as per your predetermined alpha-error, and therefore a 95% chance to show non-significance. If the tests are independent, the chance that all of them correctly show a non-significant result is 0.95^15 ≈ 46.33%. In other words, you have a chance of 53.67% that at least one test will show a significant result.
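The arithmetic above can be checked in a few lines. This sketch assumes the tests are independent and that no true effect exists in any of them; the function name `familywise_error_rate` is simply a label chosen for this example.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests with no true effect."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 15):
    print(n_tests, round(familywise_error_rate(n_tests), 4))
# prints:
# 1 0.05
# 5 0.2262
# 15 0.5367
```

Already with five tests the chance of at least one false positive is more than four times the nominal 5%.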

Bem’s paper also illustrates how easy it is to reach 15 tests. After all, he reported using at least five underlying picture sets: erotic, neutral, positive, negative and romantic pictures. When you take into account that he also tested for gender, the number of tests goes up to 15: five tests for the full set, and five for each gender he tested. Of course, these are not fully independent tests, as the full set consists of the subsets for gender, but it serves as a good example of how one can increase the chance of finding a significant result, even if the main test ends up non-significant.

In the previous example the alpha-error is inflated because many tests were run on data that contains no effect. Studies without predetermined parameters face a related problem: if the beta-error is not controlled it might be absurdly high, and as a result the presumed finding can barely be replicated, if at all. That is why it is so important to have a solid statistical test with predetermined parameters, meaning that only a finding based on confirmatory testing can be taken as a solid finding (which is not to say that it will be free of error, but the error is at least controlled and can be taken into consideration when interpreting the results). With exploratory testing, on the other hand, nobody can distinguish between a real effect and an error due to the lack of error control.

All of this is not to say that exploratory testing doesn’t have a place in science. These tests are valuable for exploring new and untested phenomena where psychological theory is underdeveloped. If a test shows significance, it can be used as an indication for further research questions. The problem with exploratory testing only arises when the results of an exploratory test are interpreted with the same certainty as those of confirmatory tests.


Moving forward

The biggest problem in distinguishing between a confirmatory and an exploratory approach is that the reader of a given paper cannot know whether the results of a given study were derived in a confirmatory or exploratory manner. There is no way to be sure that the authors of a paper didn’t test until they found something and hypothesized after the fact.

The Open Science movement has gained a lot of momentum since the replication crisis started. The main goal of the movement is to promote transparency and accessibility in the scientific field. For publishing articles, this means that researchers should share every part of their scientific work alongside the actual article, ideally including the full dataset, the code for the analysis and the full research plan. This helps readers of a paper understand and possibly reconstruct how the data was handled and allows them to re-analyze it or spot errors in the original analysis. This approach still doesn’t allow the reader to distinguish between exploratory and confirmatory research designs. After all, the information is only provided after the study has been concluded and can “only” be used to double check the reported results.

To give full insight into the research plan, scientists came up with the idea of Registered Reports. This is a process where a version of an article is published before the data is collected and analyzed. It features the theoretical discussion / literature review, the hypotheses, the full methodological part including a power analysis to set the sample size, and a discussion of the expected findings. By comparing the Registered Report with the final article, readers are able to confirm whether the final results match up with the proposed plan. This allows them to see how much of the analysis was set before data collection and statistical analysis and what got added during or even after the process.

Registered Reports also help with two more problems in science: publication pressure and publication bias. Where publication pressure distorts the research plan of a given paper, publication bias distorts the interpretation of an effect.

Publication pressure is a problem individual researchers face during their PhD or post-doc positions because they are forced to publish papers in order to keep their jobs. Since publications are mostly accepted based on the significance of the results, publication pressure influences the design and research questions of any given paper. Researchers spend valuable time and money to generate data that they might not be able to publish if the results end up non-significant, so research designs and questions that are more likely to produce significant results are generally preferred. This arguably limits research designs and questions. However, when a Registered Report is accepted by a journal, the journal guarantees the author that it will publish the final results as well. While this doesn’t take away the pressure to publish papers, it at least takes away the pressure to produce significant results in an already demanding publication system. This opens up a way to broader research and relieves stress from the researchers, since the quality of their study is judged on the quality of its design and not on the results.

The latter also helps with publication bias. The term publication bias refers to the bias in scientific literature that results from predominantly publishing significant results. If only significant results get published, lots of evidence against an effect is cast aside and only the research in favor gets published, resulting in pretty much the same issue that exploratory testing causes, only on the level of research papers and not individual tests. Because Registered Reports are published no matter the results, the scientific literature will in turn include both significant and non-significant studies of a certain effect, which is crucial in evaluating whether an effect is real and, if it is, how large it is.
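The effect of publication bias on effect size estimates can be sketched with a simulation. Everything here is an assumption chosen for illustration: a small true effect of d = 0.2, 20 participants per group, a z-test with known standard deviation in place of a t-test, and a fictional literature in which only significant studies get published.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_literature(n_studies=5000, true_d=0.2, n_per_group=20, alpha=0.05, seed=1):
    """Return the mean observed effect across all studies and across significant ones only."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    all_effects, published_effects = [], []
    for _ in range(n_studies):
        group_a = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        # With a standard deviation of 1, the mean difference is on the scale of Cohen's d.
        observed_d = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        z = observed_d / sqrt(2 / n_per_group)
        all_effects.append(observed_d)
        if abs(z) > critical_z:  # only significant studies get "published"
            published_effects.append(observed_d)
    mean_all = sum(all_effects) / len(all_effects)
    mean_published = sum(published_effects) / len(published_effects)
    return mean_all, mean_published


mean_all, mean_published = simulate_literature()
print("all studies:", round(mean_all, 2))
print("significant only:", round(mean_published, 2))
```

In this setup the average over all studies stays close to the true effect, while the average over the significant studies alone comes out clearly larger, because with small samples only unusually large observed differences cross the significance threshold.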

There are steps you can take to evaluate the findings of a paper more carefully. First, you can check if a Registered Report for the article is available; the authors may even mention it in their article. It helps you evaluate how solid the results are based on which parameters were predetermined, and which parts are confirmatory or exploratory. Also try to find papers that might have replicated the findings to see if they hold up. You can do so by looking at papers that cite the paper in question, using a citation database like Web of Knowledge, Scopus or Scite. Also be mindful when interpreting the results of the study: always look out for findings besides the main question, for example additional tests for gender. These might be significant only because they increase the number of tests. Try to understand them more as “these gender effects had an influence on the findings of this given study” rather than as a general effect. Following up on the interpretation, be mindful of how you cite. If you cite exploratory findings as seriously as confirmatory ones, you help these findings be taken more seriously than they should be. That is because most readers trust authors and won’t check every citation made in a text. After all, it is crucial for psychological science that readers are mindful of the tricks that can be employed to reach a significant outcome and hold researchers accountable for the results they publish!


Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. doi.org/10.1037/a0021524

Engber, D. (2017). Daryl Bem proved ESP is real – which means science is broken. Slate. Retrieved from https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html

Lakens, D. (2019a). Justify Your Alpha by Minimizing or Balancing Error Rates. Retrieved from http://daniellakens.blogspot.com/search?q=Justify+Your+Alpha+by+Minimizing+or+Balancing+Error+Rates

Lakens, D. (2019b). The New Heuristics. Retrieved from http://daniellakens.blogspot.com/search?q=The+New+Heuristics

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251). doi.org/10.1126/science.aac4716

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432. doi.org/10.1037/a0022790

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638. doi.org/10.1177/1745691612463078

Patrick Smela

Patrick is currently finishing his Bachelor in Psychology at the University of Vienna. Afterwards, he will do his Masters in General Psychology and Methodology. He is passionate about research methods, especially in the field of Human Computer Interaction. Besides work, he likes to travel, read, and does a lot of voluntary work in the psychological faculty in Vienna.
