A Psychologist’s Guide to Reading a Neuroimaging Paper

Psychological research is benefiting from advances in neuroimaging techniques. This has been achieved through the validation and falsification of established hypotheses in psychological science (Cacioppo, Berntson, & Nusbaum, 2008). It has also helped nurture links with neuroscience, leading to more comprehensive explanations of established theories. Positron emission tomography (PET), functional MRI (fMRI), structural MRI (sMRI), electroencephalography (EEG), diffusion tensor imaging (DTI) and numerous other lesser-known neuroimaging techniques can provide information complementary to behavioural data (Wager, 2006). With these modalities becoming more prevalent in research, on topics ranging from the neural effects of mindfulness training to neurodegeneration, it is worth taking a moment to highlight some points that help discern good research from poor. Like any other methodology, neuroimaging is a great tool that can be used poorly. As with all areas of science, one must exercise a good degree of caution when reading neuroimaging papers.

Reading a Neuroimaging Paper

In addition to the more general issues of critically reading a scientific paper, there are some common methodological pitfalls that arise especially in neuroimaging papers. While each of these modalities has physiological limitations, several other limitations arise from poor experimental design and analysis. Here I will focus on the latter. For a comprehensive overview of technical and biological limitations in fMRI see Logothetis (2008).

The pre-processing and statistical analysis of neuroimaging data can be complex. A lack of understanding of the image processing pipeline and of the limitations of the statistical approach used is dangerous. Pressing buttons on a computer isn’t sufficient; a conceptual knowledge of what is being done is required. Here, a few of the common pitfalls to look out for while reading neuroimaging papers are presented.

Multiple Comparisons

Bennett, Baird, Miller, and George (2009) conducted an fMRI study in which a post-mortem salmon was shown images and “asked” to determine the emotions depicted. So what would be the expected result of this study? Surely not activity in the brain cavity. Yet you can see for yourself from the image below that even a dead salmon shows some activation.

[Figure: fMRI activation map from the post-mortem salmon]

Taken from Bennett et al. (2009), uncorrected (p = 0.001)

This surprising finding arises because any fMRI study contains noise. Imagine that the volume of a human brain contains 100,000 voxels (3D pixels). In effect, when comparing two conditions we are conducting 100,000 t-tests to determine whether there is a change in relative blood flow at each of these voxels. As we know from statistics, there are going to be false positives and, by chance, some of these may cluster together. With a significance level of α = .05 and no true effect anywhere, we would expect around 5,000 false positives!
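The arithmetic above can be checked with a small simulation. This is only a sketch: it assumes that every voxel is pure noise and that the tests are independent, so each p-value is simply a uniform random number. The function name and the seed are mine, chosen for illustration.

```python
import random

def count_false_positives(n_tests, alpha, seed=0):
    """Count tests that reach significance when the null is true everywhere."""
    rng = random.Random(seed)
    # Under the null hypothesis a p-value is uniformly distributed on [0, 1],
    # so each test has probability alpha of falling below the threshold.
    return sum(rng.random() < alpha for _ in range(n_tests))

n_voxels = 100_000   # voxel count used in the text
alpha = 0.05         # uncorrected significance level

expected = n_voxels * alpha
observed = count_false_positives(n_voxels, alpha)

print(f"expected false positives:  {expected:.0f}")
print(f"simulated false positives: {observed}")
```

The simulated count lands close to the expected 5,000 even though there is no real signal anywhere in the data.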

There are several solutions to this multiple comparison problem. While Bonferroni correction is the first thing that comes to mind, it is generally too conservative for functional data. Bonferroni correction treats every test as if it were independent; however, adjacent voxels are correlated, especially after the smoothing step during pre-processing, so the effective number of independent tests is much smaller than the number of voxels. Therefore, various other methods are often used, such as random field theory, small volume correction, and peak and cluster thresholds (Poldrack, Mumford, & Nichols, 2011).
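To see why Bonferroni is so strict at the scale of a whole brain, it helps to write out the numbers. The sketch below computes the chance of at least one false positive across a set of tests and the per-voxel threshold Bonferroni would demand. The family-wise formula assumes independent tests, which is exactly the assumption that smoothed voxels do not meet. The function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

n_voxels = 100_000   # voxel count used in the text

# With only 100 uncorrected tests, a false positive is already almost certain.
print(f"FWER for 100 tests at alpha = .05: {familywise_error_rate(0.05, 100):.3f}")

# Bonferroni across the whole brain demands a very small per-voxel p-value.
print(f"Bonferroni per-voxel threshold: {bonferroni_threshold(0.05, n_voxels):.1e}")
```

Because neighbouring voxels are correlated, the true number of independent tests is smaller than the voxel count, which is why this per-voxel threshold overshoots.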

The standard threshold for corrections may vary between analysis software packages, but more recent programs such as SPM8 (Statistical Parametric Mapping), soon to be SPM12, tend to have more stringent analysis. For any of these, an uncorrected threshold of p = .05 is a red flag in neuroimaging papers (even p = .01 can be a bit suspicious). While less frequent now, this was not an uncommon practice in early imaging papers. There are cases where uncorrected or more lenient thresholds might arguably be justified. Take, for example, a region of interest analysis of the amygdala alone: because far fewer voxels are tested, a less conservative correction is appropriate.

The most common form of multiple comparison correction for a whole brain analysis is a family-wise error correction (cluster-level pFWE < .05) based on cluster extent using a cluster-forming threshold. This cluster-forming threshold tends to be .001 uncorrected, and potentially lower, and acts as an uncorrected threshold for peak activation. The cluster correction then applies a stringent multiple comparisons correction to the clusters of adjacent voxels that survive this threshold by asking: what is the likelihood of a cluster of this size being active by chance alone? Earlier this year, Woo, Krishnan, and Wager (2014) published a paper on the pitfalls of using a more liberal cluster-forming threshold. Where possible, the prevailing view in the scientific community is to keep a conservative threshold, which gives more confidence that any activations are meaningful. Lenient thresholds risk many false positives, so results may not replicate, yet the pressure to “publish or perish” can encourage far too liberal a handling of thresholds. A well-powered and controlled experiment should help deter this from happening. Conversely, having few subjects may tempt researchers to drop thresholds, so look out for papers with fewer than 15-20 subjects per group.

Another thing to be suspicious of is an unusual threshold. Say, for example, that a study corrected for multiple comparisons at p = .003. While a significance level of p = .05 is itself arbitrary, it is not normal practice for researchers to choose their own level of significance. Relatedly, another questionable reporting method is ‘defining’ significance: instead of conducting a correction, a voxel extent threshold is set for uncorrected data. That is, if more than 10 adjacent voxels are active, they are considered a significant cluster. Defining arbitrary cluster sizes like this is not an appropriate method.
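One way to see why a fixed voxel extent is arbitrary is to look at how large clusters become in pure noise once the data are smoothed. The following is a toy sketch, not a real pipeline: it uses a hypothetical 48 x 48 two-dimensional "slice", a crude 3 x 3 box filter in place of a Gaussian kernel, and an uncorrected threshold of z > 2.33 (roughly p < .01, one-tailed). All names and parameter values are assumptions made for illustration.

```python
import random
import statistics

def noise_map(size, rng):
    """A square map of independent standard-normal noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]

def box_smooth(grid):
    """Average each voxel with its 3x3 neighbourhood (edges use fewer neighbours)."""
    size = len(grid)
    out = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            vals = [grid[rr][cc]
                    for rr in range(max(0, r - 1), min(size, r + 2))
                    for cc in range(max(0, c - 1), min(size, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out

def standardise(grid):
    """Rescale the whole map to mean 0 and standard deviation 1."""
    flat = [v for row in grid for v in row]
    mean, sd = statistics.fmean(flat), statistics.pstdev(flat)
    return [[(v - mean) / sd for v in row] for row in grid]

def largest_cluster(grid, z_threshold):
    """Size of the largest 4-connected group of supra-threshold voxels."""
    size = len(grid)
    seen = [[False] * size for _ in range(size)]
    best = 0
    for r in range(size):
        for c in range(size):
            if seen[r][c] or grid[r][c] <= z_threshold:
                continue
            stack, count = [(r, c)], 0
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                count += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < size and 0 <= nx < size
                            and not seen[ny][nx] and grid[ny][nx] > z_threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            best = max(best, count)
    return best

def mean_largest_cluster(smooth_passes, n_maps=20, size=48, z_threshold=2.33, seed=1):
    """Average size of the largest chance cluster over several noise maps."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_maps):
        grid = noise_map(size, rng)
        for _ in range(smooth_passes):
            grid = box_smooth(grid)
        sizes.append(largest_cluster(standardise(grid), z_threshold))
    return statistics.fmean(sizes)

raw = mean_largest_cluster(smooth_passes=0)
smoothed = mean_largest_cluster(smooth_passes=2)
print(f"mean largest chance cluster, unsmoothed noise: {raw:.1f} voxels")
print(f"mean largest chance cluster, smoothed noise:   {smoothed:.1f} voxels")
```

The size of a cluster that can arise by chance depends on the smoothness of the data (and on the threshold and the volume searched), so a single hand-picked extent such as 10 voxels does not correspond to any fixed error rate.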


Circular Analysis

The scientific method stipulates that analysis follows a hypothesis. This is especially important for high-dimensional data, like that from neuroimaging. It is easy to accidentally fish for results, and a problem that arises from this is circularity. The basis of a region of interest (ROI) analysis should not come from the results of that same analysis. This is commonly referred to as double dipping. ROIs need to be selected a priori, independent of the conducted analysis.

Let’s say that in conducting a whole brain analysis, you find a cluster of activation around the amygdala. Interesting, you might think, and you explore this further and conduct an ROI analysis based on the signal extracted from this region. Well, of course, the extracted data are going to be strongly correlated! Instead of having a representative sample, only the data that show activation above a selected threshold are being looked at. If the selected activation is representative of the experimental effect, there will be no problem. However, these datasets are inherently noisy due to the nature of the fMRI signal and the steps taken during pre-processing, which may distort the results if a selected region is reanalysed without prior evidence from a separate dataset to show plausible recruitment (Kriegeskorte, Simmons, Bellgowan, & Baker, 2009). This can artificially inflate a small or moderate effect size into a large one.

When correlations greater than r = .8 appear, something is fishy. This is because the reliability of both personality measures and fMRI is limited, which caps the correlation that can be observed between them (Ioannidis, 2005; Vul, Harris, Winkielman, & Pashler, 2009). Double dipping can happen covertly without being reported, but it can be spotted in the methods if a priori regions are not specified. Sometimes it is written straight out that the ROIs are based on their functional activations. One must be wary of such papers.
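The inflation caused by double dipping can be reproduced with nothing but noise. In this sketch the brain signal and the behavioural score are unrelated by construction, so the true correlation is zero everywhere. Voxels are selected because they correlate with behaviour in one dataset, and the effect is then estimated twice: once from the same data (circular) and once from a fresh, independent dataset. The subject count, voxel count, and cutoff are hypothetical values chosen for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def noise_dataset(n_subjects, n_voxels, rng):
    """Behavioural scores and voxel signals that are unrelated by construction."""
    behaviour = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    voxels = [[rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
              for _ in range(n_voxels)]
    return behaviour, voxels

rng = random.Random(42)
n_subjects, n_voxels, r_cutoff = 16, 5000, 0.6   # hypothetical values

# Dataset A is used to select voxels; dataset B is an independent replication.
behaviour_a, voxels_a = noise_dataset(n_subjects, n_voxels, rng)
behaviour_b, voxels_b = noise_dataset(n_subjects, n_voxels, rng)

r_a = [pearson_r(v, behaviour_a) for v in voxels_a]
selected = [i for i, r in enumerate(r_a) if r > r_cutoff]

# Circular: estimate the effect from the same data that chose the voxels.
circular_r = sum(r_a[i] for i in selected) / len(selected)

# Independent: estimate the effect for those same voxels in fresh data.
independent_r = sum(pearson_r(voxels_b[i], behaviour_b)
                    for i in selected) / len(selected)

print(f"voxels selected:          {len(selected)}")
print(f"circular mean r:          {circular_r:.2f}")
print(f"independent-data mean r:  {independent_r:.2f}")
```

The circular estimate is above the cutoff by construction, while the independent estimate falls back towards zero, which is the true value in this simulation.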

‘Imager’s Fallacy’

Another common mistake is the Imager’s Fallacy: a difference in significance does NOT imply a significant difference. It is difficult to wrap your head around, but it remains one of the most common mistakes in analysis (Henson, 2005).

To illustrate, imagine that the striatum is more active during condition 1 compared to baseline. The same region is active, but less significantly, in condition 2 compared to baseline. This does not indicate whether there is a significant difference between conditions in the striatal region; it only indicates that, to varying degrees, there is activation in this region for both conditions compared to baseline. Establishing a difference requires testing the two conditions directly against each other.
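A worked example with made-up numbers may help. Suppose the striatal effect against baseline is estimated at 2.5 units in condition 1 and 1.5 units in condition 2, each with a standard error of 1.0. These values are hypothetical, and the sketch uses a simple z-test and treats the two estimates as independent, which a within-subject design would not satisfy.

```python
import math
from statistics import NormalDist

def two_sided_p(estimate, standard_error):
    """Two-sided p-value for a z-test of estimate / standard_error."""
    z = estimate / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical effect estimates versus baseline, for illustration only.
cond1_effect, cond1_se = 2.5, 1.0
cond2_effect, cond2_se = 1.5, 1.0

p_cond1 = two_sided_p(cond1_effect, cond1_se)   # condition 1 vs baseline
p_cond2 = two_sided_p(cond2_effect, cond2_se)   # condition 2 vs baseline

# Direct test of condition 1 against condition 2.
diff = cond1_effect - cond2_effect
diff_se = math.sqrt(cond1_se ** 2 + cond2_se ** 2)  # assumes independent estimates
p_diff = two_sided_p(diff, diff_se)

print(f"condition 1 vs baseline:    p = {p_cond1:.3f}")
print(f"condition 2 vs baseline:    p = {p_cond2:.3f}")
print(f"condition 1 vs condition 2: p = {p_diff:.3f}")
```

Condition 1 passes the .05 threshold and condition 2 does not, yet the direct comparison between them is nowhere near significant. Only that last test speaks to a difference between conditions.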

Further Reading

As a start, I would highly recommend these two guides for writing imaging papers: Poldrack et al. (2008) and Ridgway et al. (2008). These will give a good overview of what to expect when reading or writing these types of publication. In addition, for readers lacking an extensive understanding of the physics behind magnetic resonance, the image acquisition part of the methods section may be daunting, and I recommend this mri-physics introduction. YouTube videos such as these can also be helpful.

And Finally….

The limitations discussed are often legitimate, but the argument that they make the modality redundant for measuring behavioural processes is not (Henson, 2005; Logothetis, 2008). MR data acquisition should be viewed as an additional method to complement behavioural measures. It will not solve all theoretical issues in psychology, but it does help provide insights into several cognitive and emotional processes. As with all areas of scientific inquiry, so long as appropriate measures are taken, reliable data can be produced.



References

Bennett, C. M., Baird, A. A., Miller, M. B., & George, L. W. (2009). Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument for Proper Multiple Comparisons Correction. Journal of Serendipitous and Unexpected Results, 1(1), 1–5.

Cacioppo, J. T., Berntson, G. G., & Nusbaum, H. C. (2008). Neuroimaging as a New Tool in the Toolbox of Psychological Science. Current Directions in Psychological Science, 17(2), 62–67.

Farah, M. J., & Hook, C. J. (2013). The Seductive Allure of “Seductive Allure.” Perspectives on Psychological Science, 8(1), 88–90. doi:10.1177/1745691612469035

Henson, R. (2005). What can functional neuroimaging tell the experimental psychologist? The Quarterly Journal of Experimental Psychology, 58(2), 193–233. doi:10.1080/02724980443000502

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi:10.1371/journal.pmed.0020124

Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F., & Baker, C. I. (2009). Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience, 12(5), 535–540. doi:10.1038/nn.2303

Logothetis, N. K. (2008). What we can do and what we cannot do with fMRI. Nature, 453(7197), 869–78. doi:10.1038/nature06976

McCabe, D. P., & Castel, A. D. (2008). Seeing is believing: The effect of brain images on judgments of scientific reasoning. Cognition, 107(1), 343–52. doi:10.1016/j.cognition.2007.07.017

Poldrack, R. A., Fletcher, P. C., Henson, R. N., Worsley, K. J., Brett, M., & Nichols, T. E. (2008). Guidelines for reporting an fMRI study. NeuroImage, 40(2), 409–14. doi:10.1016/j.neuroimage.2007.11.048

Poldrack, R. A., Mumford, J. A., & Nichols, T. E. (2011). Handbook of Functional MRI Data Analysis. Cambridge University Press. ISBN: 9780521517669

Ridgway, G. R., Henley, S. M. D., Rohrer, J. D., Scahill, R. I., Warren, J. D., & Fox, N. C. (2008). Ten simple rules for reporting voxel-based morphometry studies. NeuroImage, 40(4), 1429–35. doi:10.1016/j.neuroimage.2008.01.003

Schweitzer, N. J., Baker, D. A., & Risko, E. F. (2013). Fooled by the brain: Re-examining the influence of neuroimages. Cognition, 129(3), 501–11. doi:10.1016/j.cognition.2013.08.009

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition. Perspectives on Psychological Science, 4(3), 274–290. doi:10.1111/j.1745-6924.2009.01125.x

Wager, T. D. (2006). Do We Need to Study the Brain to Understand the Mind? Observer, 19(9). Retrieved 15 December 2014 from https://www.psychologicalscience.org/index.php/publications/observer/2006/september-06/do-we-need-to-study-the-brain-to-understand-the-mind.html

Woo, C.-W., Krishnan, A., & Wager, T. D. (2014). Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations. NeuroImage, 91, 412–9. doi:10.1016/j.neuroimage.2013.12.058

About the author

Niall Bourke is a psychology graduate from Ireland. Having gained experience working with individuals who have acquired brain injuries, he moved to London to complete an MSc in Neuroimaging at the Institute of Psychiatry, Psychology and Neuroscience (IoPPN). He now works as a research worker at the IoPPN, and as a visiting researcher at the University of Southampton on a developmental neuropsychology project.