Bayesian Statistics: Why and How


Bayesian statistics is what all the cool kids are talking about these days. Upon closer inspection, this does not come as a surprise. In contrast to classical statistics, Bayesian inference is principled, coherent, unbiased, and addresses an important question in science: which of my hypotheses should I believe in, and how strongly, given the collected data?

After briefly stating the fundamental difference between classical and Bayesian statistics, I will introduce three software packages – JAGS, BayesFactor, and JASP – to conduct Bayesian inference. I will analyse simulated data about the effect of wearing a hat on creativity, just as the previous blog post did. In the end I will sketch some benefits a “Bayesian turn” would have on scientific practice in psychology.

Note that this post is very long, introducing plenty of ideas that you might not be familiar with (regrettably, they are not found in the standard university curriculum!). This post should serve as a basic but comprehensive introduction to Bayesian inference for psychologists. Wherever possible, I have linked to additional literature and resources for more in-depth treatments.

After you have read this post, you will begin to understand what the fuss is all about, and you will be familiar with tools to apply Bayesian inference. Let’s dive in!

Classical Statistics

What is probability?

Classical statistics conceptualizes probability as relative frequency. For example, the probability of a coin coming up heads is the proportion of heads in an infinite set of coin tosses. This is why classical statistics is sometimes called frequentist. At first glance, this definition seems reasonable. However, to talk about probability, we now have to think about an infinite repetition of an event (e.g., tossing a coin). Frequentists, therefore, can only assign probability to events that are repeatable. They cannot, for example, talk about the probability of the temperature rising 4 °C in the next 15 years; or Hillary Clinton winning the next U.S. election; or you acing your next exam. Importantly, frequentists cannot assign probabilities to theories and hypotheses – which arguably is what scientists want (see Gigerenzer, 1993).

p values

Because of the above definition of probability, inference in classical statistics is counterintuitive, for students and senior researchers alike (e.g., Haller & Kraus, 2002; Hoekstra, Morey, Rouder, & Wagenmakers, 2014; Oakes, 1986). One notoriously difficult concept is that of the p value: the probability of obtaining a result as extreme as the one observed, or more extreme, given that the null hypothesis is true. To compute p values, we collect data and then assume an infinite repetition of the experiment, yielding more (hypothetical) data. For each repetition we compute a test statistic, such as the mean. The distribution of these means is the sampling distribution of the mean, and if our observed mean is far off in the tails of this distribution, based on an arbitrary standard ($\alpha = .05$) we conclude that our result is statistically significant. For a graphical depiction, see the figure below – inspired by figure 1 in Wagenmakers, 2007.


It is very easy to gloss over these specific assumptions of classical statistics because properties of the sampling distribution are often known analytically. For example, in the case of the t-test, the variance of the sampling distribution is the variance of the actual collected data, divided by the number of data points: $\hat\sigma^2 = \sigma^2 / N$.
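A quick simulation makes this concrete (a sketch, not part of the original analysis; the numbers are arbitrary): we repeat a hypothetical experiment many times and check that the variance of the sample means matches $\sigma^2 / N$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, N, reps = 10.0, 25, 100_000

# Draw `reps` hypothetical repetitions of an N-observation experiment
# and compute the mean of each one
means = rng.normal(0, sigma, size=(reps, N)).mean(axis=1)

# The variance of these sample means approximates sigma^2 / N = 4
print(means.var())
```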

Inference based on p values is a remarkable procedure: we assume that we did the experiment an infinite number of times (we didn’t), and we compute probabilities of data we haven’t observed, assuming that the null hypothesis is true. This way of drawing inferences has serious deficits. For an enlightening yet easy-to-read paper about these issues, see Wagenmakers (2007). If you need a more detailed refresher on p values – they are tricky! – see here.

Confidence Intervals

Recently, the (old) new statistics has been proposed, abandoning p values and instead focusing on parameter estimation and confidence intervals (Cumming, 2014). As we will see later, parameter estimation and hypothesis testing answer different questions, so abandoning one in favour of the other is misguided. Because confidence intervals are still based on classical theory, they are an equally flawed method of drawing inferences; see Lee (n.d.) and Morey et al. (submitted).

What are parameters?

Tied to the classical notion of probability, classical statistics considers the population parameter as fixed, while the data are allowed to vary (repeated sampling). For example, the difference between men and women in height is exactly 15 centimeters. We can intuit that confidence intervals, as well as statistical power, are not concerned with the actual data we have collected. Why?

Assume we collect height data from men and women, and compute a 95% confidence interval around our difference estimate – this does not mean that we are 95% confident that the true parameter lies within these bounds! For the actual experiment, the true parameter either is or is not within these bounds. A 95% confidence interval states that, if we were to repeat our experiment an infinite number of times, in 95% of all cases the parameter would be within those bounds (Hoekstra, Morey, Rouder, & Wagenmakers, 2014). This must be so because we can only talk about probability as relative frequency: we have to assume repeating our experiment, even though we only conducted one!

It is important to note that all probability statements in classical statistics are of that nature: they average over an infinite repetition of experiments. Probability statements do not pertain to the specific data you gathered; they are about the testing procedure itself. Extend this to statistical power, which is the probability of finding an effect when there really is one. High-powered experiments yield informative data, on average. Low-powered experiments yield uninformative data, on average. However, for the specific experiment actually carried out – once the data are in – we can go beyond power as a measure of how informative our experiment was (Wagenmakers et al., 2014). In the last section I explain what this entails when using Bayesian inference for hypothesis testing.

Bayesian Statistics

What is probability?

In Bayesian inference, probability is a means to quantify uncertainty. Continuing with the height example above, Bayesians quantify uncertainty about the difference in height with a probability distribution. It might be reasonable, for example, to specify a normal distribution with mean $\mu = 15$ and variance $\sigma^2 = 4$:

\text{difference} \sim \mathcal{N}(15, 4)

We most strongly believe that the height difference is 15 centimeters, although it could also be 10, or even 5 centimeters (but with lower probability). However, we might not be so sure about our estimates. To incorporate uncertainty, we could increase the variance of the distribution (decrease the “peakedness”), say, in this case, to $\sigma^2 = 16$. The plot below compares possible prior beliefs.
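For concreteness, here is a small sketch (Python with scipy, not from the original post) computing the two prior densities being compared:

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 30, 301)
# scipy's norm takes the standard deviation, so sigma^2 = 4 -> scale = 2
narrow = stats.norm.pdf(x, loc=15, scale=2)
wide = stats.norm.pdf(x, loc=15, scale=4)    # sigma^2 = 16

# Both priors peak at 15 cm; the wider one is less "peaked"
print(x[narrow.argmax()], narrow.max() > wide.max())
```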


In the Bayesian world, probability retains the intuitive, common-sense interpretation: it is simply a measure of uncertainty.

What are parameters?

While parameters still have one single true value in some ontological sense (say, men are on average exactly 15 centimeters taller than women), we quantify our uncertainty about this value with a probability distribution. The beautiful part is that, once we collect data, we simply update our prior beliefs with the information that is in the data to yield posterior beliefs.

Bayesian Parameter Estimation

There’s no theorem like Bayes’ theorem
Like no theorem we know
Everything about it is appealing
Everything about it is a wow
Let out all that a priori feeling
You’ve been concealing right up to now!
– George Box

Because parameters themselves are assigned a distribution, statistical inference reduces to applying the rules of probability. We specify a joint distribution over data and parameters, $p(\textbf{y}, \theta)$. By the definition of conditional probability we can write $p(\textbf{y}, \theta) = p(\textbf{y}|\theta)p(\theta)$. The first term, $p(\textbf{y}|\theta)$, is the likelihood, and it embodies our statistical model. It also exists in the frequentist world, and it contains assumptions about how our data points are distributed; for example, whether they are normally distributed or Bernoulli distributed. The other term, $p(\theta)$, is called the prior distribution over the parameters, and quantifies our belief (before looking at data) in, say, height differences among the sexes.

Combining the data we have collected with our prior beliefs is done via Bayes’ theorem:

p(\theta|\textbf{y}) = \frac{p(\theta) \times p(\textbf{y}|\theta)}{p(\textbf{y})}
where $p(\textbf{y})$ is the marginal probability of the data (which has no frequentist equivalent); in words:
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}

Because $p(\textbf{y})$ is just a normalizing constant ensuring that $p(\theta|\textbf{y})$ integrates to 1, i.e. is a proper probability distribution, we can drop it and write:

p(\theta|\textbf{y}) \propto p(\theta) \times p(\textbf{y}|\theta)

With Bayes’ rule we solve the “inverse probability problem”: we go from the effect (the data) back to the cause (the parameter) (Bayes, 1763; Laplace, 1774).

Classical statistics has a whole variety of estimators, such as maximum likelihood and generalized least squares, which are evaluated along specific dimensions (e.g. bias, efficiency, consistency). In contrast, Bayesians always use Bayes’ rule, which simply follows from the rules of probability. This is why we say that Bayesian statistics is principled, rational, and coherent. Let’s take a look at a simple example to see how the two estimation approaches differ.

A simple example

Suppose you flip a coin twice and observe heads both times. What is your estimate of the probability that the next flip comes up heads? We know that flipping a coin several times, with the flips being independent, can be described by the binomial likelihood (omitting the binomial coefficient, which does not depend on $\theta$):

L(\theta| \textbf{y}) = \theta^k \times (1 - \theta)^{N - k}

where $N$ is the number of flips and $k$ is the number of heads. The likelihood $L(\theta| \textbf{y})$ is just the probability expressed in terms of the parameter $\theta$, instead of the data $\textbf{y}$ (and need not sum to 1). For example, assuming $\theta = .5$, the likelihood is:

L(\theta = .5 | \textbf{y}) &= .5^2 \times (1 - .5)^{2-2} \\
L(\theta = .5 | \textbf{y}) &= .25

Plotting the likelihood for the whole range of possible $\theta$ values given our data ($k = 2$ and $N = 2$) yields:


We see that $\theta = 1$ maximizes the likelihood; in other words: $\theta = 1$ is the maximum likelihood estimator (MLE) for these data. Maximum likelihood estimation is the workhorse of classical inference. In classical statistics, then, the prediction for the next coin coming up heads would be 1. This is somewhat counterintuitive; would that really be your prediction, after just two coin flips? What does the Bayesian in you think?
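We can verify the maximum likelihood reasoning numerically with a short sketch (Python, used here purely for illustration):

```python
import numpy as np

k, N = 2, 2
theta = np.linspace(0, 1, 1001)
likelihood = theta**k * (1 - theta)**(N - k)   # (1 - theta)^0 = 1 here

# The likelihood increases monotonically and peaks at theta = 1, the MLE
print(theta[likelihood.argmax()])
```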

First, we need to quantify our prior belief about the parameter $\theta$. Invoking the principle of insufficient reason, we use a prior that assigns equal probability to all possible values of $\theta$: the uniform distribution. We can write this as a Beta distribution with parameters $a = 1, b = 1$:

Beta(a, b) = \textbf{K} \times \theta^{a - 1} (1 - \theta)^{b - 1}

where $\textbf{K}$ is the normalizing constant and not of importance for now. $a$ and $b$ can be interpreted as prior data: in our example, $a$ would be the number of heads and $b$ the number of tails. Remember that in Bayesian inference we just multiply the likelihood with the prior. The Beta distribution is a conjugate prior for the binomial likelihood; this means that when using this prior, the posterior will again be a Beta distribution, making for trivial computation. Plugging into Bayes’ rule yields:

p(\theta|N, k) &\propto \theta^k (1 - \theta)^{N - k} \times \theta^{a - 1} (1 - \theta)^{b - 1} \\
p(\theta|2, 2) &\propto \theta^2 (1 - \theta)^{2 - 2} \times \theta^{a - 1} (1 - \theta)^{b - 1} \\
p(\theta|2, 2) &\propto \theta^{a - 1 + 2} (1 - \theta)^{b - 1 + 2 - 2}

Rearranging yields:

p(\theta|2, 2) \propto \theta^{(a + 2) - 1} (1 - \theta)^{(b + 2 - 2) - 1}

We recognize that our posterior is again a Beta distribution: it’s Beta(3, 1). In general, our updating rule is $Beta(a + k, b + N – k)$. You can play around with the code below and try out different priors and data:
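The interactive code is not reproduced here, but a minimal Python stand-in (scipy assumed) captures the conjugate updating rule:

```python
from scipy import stats

def beta_update(a, b, k, N):
    """Update a Beta(a, b) prior after observing k heads in N flips."""
    return a + k, b + N - k

a, b = beta_update(1, 1, k=2, N=2)   # uniform prior, two heads
print(a, b)                          # posterior is Beta(3, 1)
print(stats.beta(a, b).mean())       # posterior mean: 0.75
```

Try other priors and data by changing the arguments; the posterior parameters always follow the same rule.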


The point with the most probability density, the mode, is $(a - 1) / (a + b - 2)$, which for our Beta(3, 1) posterior yields $1$. However, as we see when looking at the distribution, there is substantial uncertainty. To account for this, we take the mean of the posterior distribution, $a / (a + b)$ (with $a$ and $b$ now the posterior parameters), as our best guess for the future. Thus, the prediction for the next coin flip being heads is $.75$. Note that classical estimation, based on maximum likelihood ($\theta_{MLE} = k / N = 2 / 2$), would predict heads with 100% certainty. For more on likelihoods, check out this nice blog post. For more details on Bayesian updating, see this.

What can be seen in conjugate examples quite clearly is that the Bayesian posterior is a weighted combination of the prior and the likelihood. We have seen that the posterior mean is:

\hat p = \frac{k + a}{a + b + n}

which, with some clever rearranging, yields:

\hat p = \frac{n}{a + b + n}(\frac{k}{n}) + \frac{a + b}{a + b + n}(\frac{a}{a + b})

Both the maximum likelihood estimator $\theta_{MLE} = k / n$ and the mean of the prior $a / (a + b)$ are weighted by terms that depend on the sample size $n$ and the prior parameters $a$ and $b$. Two important things should be noted here. First, this implies that the posterior mean is shrunk toward the prior mean. In hierarchical modeling, this is an extremely powerful idea. Shrinkage yields a better estimator (Efron & Morris, 1977). For a nice tutorial on hierarchical Bayesian inference, see Sorensen & Vasishth (submitted).

The second thing to note is that when $n$ becomes very large, the Bayes point estimate and the classical maximum likelihood estimate converge to the same value (Bernstein-von Mises theorem). You can see this in the formula above. The first weight becomes 1 and the second weight becomes 0, leaving only $k / n$, which is the maximum likelihood estimate. Therefore, when different people initially disagree (different prior), once they have seen enough evidence, they will agree (same posterior).
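A few lines of Python (illustrative only) confirm both the rearrangement and the convergence:

```python
def posterior_mean(k, n, a, b):
    return (k + a) / (a + b + n)

def weighted_form(k, n, a, b):
    w_data = n / (a + b + n)            # weight on the MLE k/n
    w_prior = (a + b) / (a + b + n)     # weight on the prior mean
    return w_data * (k / n) + w_prior * (a / (a + b))

# The two expressions agree
print(posterior_mean(2, 2, 1, 1), weighted_form(2, 2, 1, 1))

# With a lot of data, the posterior mean approaches the MLE k/n
print(posterior_mean(600, 1000, 1, 1))   # close to 0.6
```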

Note that identical estimation results need not lead to the same model comparison results. I find the argument that because the estimation converges, we can just go on doing frequentist statistics rather disturbing. First, you never have an infinite number of participants. Second, the prior allows you to incorporate theoretically meaningful information into your model, which can be extremely valuable (e.g. this great problem). Third, and most important, estimation and testing answer different questions. While parameter estimates might converge, Bayesian hypothesis testing offers a much more elegant and richer inferential foundation than classical testing provides (but more on this below).

Markov chain Monte Carlo

Using a simple example, we have seen how Bayesians update their prior beliefs upon seeing new data, yielding posterior beliefs. However, conjugacy is not always given, and the prior times the likelihood might be an unusually shaped, complex distribution that we cannot handle analytically. For most of the 20th century, Bayesian analysis was restricted to conjugate priors for computational simplicity.

But things have changed. Due to the advent of cheap computing power and Markov chain Monte Carlo techniques (MCMC) we now can estimate basically every posterior distribution we like.
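To get a feel for how MCMC works, here is a toy random-walk Metropolis sampler (a Python sketch, not production code) targeting the Beta(3, 1) posterior from the coin example:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta, k=2, N=2, a=1, b=1):
    """Log of prior times likelihood: an unnormalized Beta(a + k, b + N - k)."""
    if not 0 < theta < 1:
        return -np.inf
    return (k + a - 1) * np.log(theta) + (N - k + b - 1) * np.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.2)        # random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

# Discard burn-in; the sample mean should be near the analytic value 3/4
print(np.mean(samples[2_000:]))
```

Real samplers like JAGS are far more sophisticated, but the core idea is the same: wander through parameter space so that time spent at each value is proportional to its posterior density.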

Bayesian hypothesis testing

Parameter estimation is not the same as testing hypotheses (e.g. Morey, Rouder, Verhagen, & Wagenmakers, 2014; Wagenmakers, Lee, Rouder, & Morey, submitted). The argument is simple: while each observation tells you something about the parameter, not every observation is informative about which hypothesis you should believe in. To build an intuition for that, let’s throw a coin (again). Is it a fair coin ($\theta = .5$) or a biased ($\theta \neq .5$) coin? Assume the first observation is heads. You learn something about the parameter $\theta$; for example, it can’t be equal to 0 (always tails), and a bias toward heads is now more likely than a bias toward tails. Looking only at the parameter estimate thus suggests some evidence against the null hypothesis. However, in fact you have learned nothing that would strengthen your belief in either of your hypotheses. Observing heads is equally consistent with the coin being unbiased and the coin being biased (the Bayes factor is equal to 1; more on this below). Inference by parameter estimation is inadequate, unprincipled, and biased.

In order to test hypotheses, we have to compute the marginal likelihood, which we can skip when doing parameter estimation (because it is just a constant). As a reminder, here is the formula again:

\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}

Let’s say we compare two models, $M_0$ and $M_1$, that describe a difference in the creativity of people who wear hats and others who don’t. The parameter of interest is $\delta$. $M_0$ restricts $\delta$ to 0, while $M_1$ lets $\delta$ vary. This corresponds to testing a null hypothesis $H_0: \delta = 0$ against the alternative $H_1: \delta \neq 0$. In order to test which hypothesis is more likely to be true, we pit the predictions of the models against each other. The prediction of the models is embodied by their respective marginal likelihoods, $p(\textbf{y}|M)$. For the discrete case, this yields:

P(\textbf{y}|M) = \sum_{i=1}^{k} P(\textbf{y}|\theta_i, M)P(\theta_i|M)

That is, we look at the likelihood of the data $\textbf{y}$ for every possible value of $\theta$ and weight these values with our prior belief. For continuous distributions, we have an integral:

p(\textbf{y}|M) = \int p(\textbf{y}|\theta, M) p(\theta|M) \, d\theta

The ratio of the marginal likelihoods is the Bayes factor (Kass & Raftery, 1995):

BF_{01} = \frac{p(\textbf{y}|M_0)}{p(\textbf{y}|M_1)}

which is the factor by which our prior beliefs about the hypotheses (not parameters!) get updated to yield the posterior beliefs about which hypothesis is more likely:

\frac{p(M_0|\textbf{y})}{p(M_1|\textbf{y})} = \frac{p(\textbf{y}|M_0)}{p(\textbf{y}|M_1)} \frac{p(M_0)}{p(M_1)}

The Bayes factor is a continuous measure of evidence: $BF_{01} = 3$ indicates that the data are three times more likely under the null model $M_0$ than under the alternative model $M_1$. Note that $BF_{10} = 1 / BF_{01}$, and that when the prior odds are 1, the posterior odds equal the Bayes factor.

Because complex models can capture many different observations, their prior on the parameters $p(\theta)$ is spread out wider than that of simpler models. Thus there is little density at any specific point – because complex models can capture so many data points, each data point taken individually is comparatively less likely. For the marginal likelihood, this means that the likelihood gets multiplied with these low density values of the prior, which decreases the overall marginal likelihood. Thus model comparison via Bayes factors incorporates an automatic Ockham’s razor, guarding us against overfitting.
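A small illustration (Python, scipy assumed; the data are made up): for five heads in ten flips, the point null $\theta = .5$ beats a model that spreads its prior uniformly over $\theta$, precisely because the flexible model dilutes its marginal likelihood over many possible outcomes:

```python
from math import comb
from scipy.integrate import quad

k, N = 5, 10   # five heads in ten flips: data that look perfectly fair

# M0 fixes theta at .5; its "prior" puts all mass on one point
p_y_M0 = comb(N, k) * 0.5**N

# M1 spreads its prior uniformly over theta; integrate the likelihood
p_y_M1, _ = quad(lambda t: comb(N, k) * t**k * (1 - t)**(N - k), 0, 1)

print(p_y_M0 / p_y_M1)   # BF_01 of about 2.7: the simpler model wins
```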

While classical approaches like the AIC naively add a penalty term (2 times the number of parameters) to incorporate model complexity, Bayes factors offer a more natural and principled approach to this problem. For details, see Vandekerckhove, Matzke, & Wagenmakers (2014).

Two priors

Note that there are actually two priors – one over models, as specified by the prior odds $p(M_0) / p(M_1)$, and one over the parameters, $p(\theta)$, which is implicit in the marginal likelihoods. The prior odds are just a measure of how likely one model or hypothesis is over the other, prior to data collection. It is irrelevant for the experiment at hand, but might be important for drawing conclusions. For example, Daryl Bem found a high Bayes factor in support of precognition. Surely this does not by itself mean that we should believe in precognition! The prior odds for pre-cognition are extremely low – at least for me – so that even if multiplied by a Bayes factor of 1000, the posterior odds will be vanishingly small:

\frac{1}{100000} = \frac{1000}{1} \cdot \frac{1}{100000000}

We can agree on how much the data support precognition (as quantified by the Bayes factor). However, this does not mean we have to buy it. Extraordinary claims require extraordinary evidence.

Savage-Dickey trick

For the univariable case, computing the marginal likelihood is pretty straightforward; however, the integrals become harder with many variables, say in multiple regression. Note that while parameter estimation is basically solved with MCMC methods, computing the marginal likelihood remains a tough problem. However, when testing nested models, such as in null hypothesis testing, we can use a neat mathematical trick that sidesteps computing the marginal likelihood – the Savage-Dickey density ratio:

BF_{01} = \frac{p(\delta = 0|M_1, \textbf{y})}{p(\delta = 0|M_1)}

That is, take the ratio of the posterior density at the point of interest, for example $\delta = 0$, and divide it by the prior density at that point. Note that unlike the p value, the Bayes factor is not limited to testing nested models, but can compare complex, functionally different models (as is common in cognitive science). We will see a nice graphical depiction of Savage-Dickey later.

Let us revisit our coin toss example. We flipped the coin twice and observed heads both times. Suppose we want to test the hypothesis that the coin is fair – that is $\theta = .5$. Using the Savage-Dickey density ratio, we divide the height of the posterior distribution at $\theta = .5$ by the height of the prior distribution at $\theta = .5$. Recall that:

\theta \sim \text{Beta}(1, 1) \\
\theta|\textbf{y} \sim \text{Beta}(3, 1)

Thus yielding:

BF_{01} &= \frac{p(\theta = .5|\textbf{y})}{p(\theta = .5)} \\
BF_{01} &= \frac{.75}{1} \\
BF_{10} &= 1\frac{1}{3}

The data are $1\frac{1}{3}$ times more likely under the model that assumes that the coin is biased toward either heads or tails. For a nice first lesson in Bayesian inference that prominently features coin tosses with interactive examples, see this. For an excellent introduction to the Savage-Dickey density ratio, see Wagenmakers et al. (2010).
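The Savage-Dickey computation for this example takes two lines (a Python sketch, scipy assumed):

```python
from scipy import stats

prior = stats.beta(1, 1)       # uniform prior over theta
posterior = stats.beta(3, 1)   # posterior after two heads

# Savage-Dickey: posterior density over prior density at theta = .5
bf_01 = posterior.pdf(0.5) / prior.pdf(0.5)
print(bf_01, 1 / bf_01)   # 0.75 and 1.333...
```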

Lindley’s paradox and the problem with uninformative priors

Uninformative priors like $\text{Unif(}-\infty\text{,}+\infty\text{)}$ can be used in parameter estimation because the data quickly overwhelm the prior. However, when testing hypotheses, we need to use priors that are at least somewhat informative. One can see the problem with uninformative priors in the Savage-Dickey equation. These priors spread out their probability mass such that at each point there is virtually zero density. Dividing something by a very, very small number yields a very, very large number. The resulting Bayes factor favours the null hypothesis without bounds (DeGroot, 1982; Lindley, 1957). Consequently, we need to use informative priors for hypothesis testing. Rouder, Morey, Wagenmakers and colleagues have extended the default priors initially proposed by Harold Jeffreys (1961). These default priors have certain important features (see Rouder & Morey, 2012), but should be given some thought, and possibly adjusted, before they are used.

Creativity example

We want to know if wearing hats does indeed have an effect on creativity. Instead of collecting real data, we just simulate data, assuming a real effect of Cohen’s $d = 7 / 15$.


Classical Inference

In classical statistics we would compute p values. Because the t-test is simply a general linear model with one categorical predictor, we can run:

If you are not familiar with R, this output might look daunting. Note that the t-test as a general linear model states that

y_i = \beta_0 + \beta_1 x_i + \epsilon_i

In the R output, (Intercept) is $\beta_0$, while x is $\beta_1$. $\beta_0$ is the mean of the group in which our categorical predictor is 0, i.e. the group which did not wear hats. The group that did wear hats, on the other hand, gets a creativity boost of 6.944. The p value for this difference is $p = .028$. We would conclude that the probability of observing these data or more extreme data given that there really is no effect is only 2.8%. Using the conventional cutoff, $\alpha = .05$, we would say our result is statistically significant. Although we have not computed $p(H_1|\textbf{y})$, we would conclude that the result supports the alternative hypothesis. Is this really so?
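The R code and its output are not reproduced here; as a rough Python stand-in for the same kind of analysis (group means, standard deviation, and sample size below are hypothetical and will not reproduce the exact numbers above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical numbers: a true boost of 7 points with SD 15 gives d = 7/15
n = 50
no_hat = rng.normal(100, 15, n)   # creativity scores without a hat
hat = rng.normal(107, 15, n)      # creativity scores with a hat

t, p = stats.ttest_ind(hat, no_hat)
print(round(float(t), 2), round(float(p), 3))
```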

Bayesian Inference

Below I introduce three software packages for Bayesian inference. The first one, JAGS, is very low-level and does not readily provide Bayes factors. Its primary use is to estimate the posterior distribution. In our present example, we are interested in the difference between the two groups, specifically in the effect size. As mentioned earlier, we can use uninformative priors for parameter estimation. To get a feeling for slightly informative priors, however, I will specify default priors when using JAGS. Subsequently we will compute Bayes factors using the BayesFactor R package and the graphical software JASP.

Using JAGS

JAGS (“Just Another Gibbs Sampler”) implements MCMC methods (as discussed above) to sample from the posterior distribution. This is especially needed when the posterior distribution is not available in closed form – say when we don’t have conjugate priors.

To have a scale-free measure of our prior belief, we will specify our prior beliefs over the effect size $\delta = \mu / \sigma$. As suggested in the literature, we might want to use a default prior. For a detailed rationale of so-called default priors, see pages 6 and 7 of this, Rouder, Speckman, Sun, Morey, & Iverson (2009), and Rouder & Morey (2012). For a more thorough treatment of the default prior approach pioneered by Harold Jeffreys, see Ly et al. (in press).

Constructing a default prior

What is our prior on the effect size? We might want to use a normal distribution centered on 0:

\delta \sim \text{Normal(0, g)}

The more difficult question is how to specify the variance, $g$. If we specify large values, then we say that we also expect very high values for the effect size, like 2 or 3. This never happens in psychology! Perhaps we should choose a lower value, like 1? Or better yet, why not quantify our uncertainty about the variance with … a probability distribution!

This is what we will do, using the inverse Gamma distribution which has two parameters:
g \sim \text{Inverse Gamma(}\frac{1}{2}\text{,}\frac{r^2}{2}\text{)}

Now when we integrate out the parameter $g$ – that is, incorporate our uncertainty about the variance into our belief about the effect size – this elaborate prior specification simplifies to:

\delta \sim \text{Cauchy(r)}

The Cauchy distribution has only one parameter, $r$. It is similar to a normal distribution, but with fatter tails (indicating greater uncertainty). The scale parameter $r$ influences the width of the distribution; the higher the parameter value, the wider the distribution – meaning greater probability density at the tails.
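We can check the scale-mixture claim by simulation (a Python sketch with scipy; variable names are mine): drawing $g$ from the inverse Gamma and then $\delta$ from a normal with variance $g$ reproduces a Cauchy with scale $r$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
r, n = 1.0, 200_000

# g ~ Inverse-Gamma(1/2, r^2 / 2), then delta | g ~ Normal(0, g)
g = stats.invgamma(a=0.5, scale=r**2 / 2).rvs(n, random_state=rng)
delta = rng.normal(0, np.sqrt(g))

# The quartiles of delta should match those of a Cauchy with scale r
print(np.quantile(delta, [0.25, 0.75]))
print(stats.cauchy(scale=r).ppf([0.25, 0.75]))   # exactly [-1, 1]
```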

Are we done? Not quite! We still need to specify a prior over the variance. Because we do not make inference about the variance, we can give it a noninformative, Jeffreys’ prior. We approximate the Jeffreys’ prior using an inverse Gamma distribution with very small shape and rate parameters (e.g. Morey, Rouder, Pratte, & Speckman, 2011):

\sigma^2 \sim \text{Inverse Gamma(.00001, .00001)}

We also have to specify a prior distribution over $\beta_0$, the mean of the group that did not wear hats. Again, because our inference does not depend on this parameter, we assign it an uninformative prior:

\beta_0 \sim \text{Normal(0, 10000)}

Note that the specification in JAGS does not use the variance $\sigma^2$, but the precision, which is $1 / \sigma^2$. Therefore, to specify a high variance, we would use a low precision.

Although it is possible to compute Bayes factors directly in JAGS, for example using the product-space method (Lodewyckx et al., 2011) or conditional marginal density estimation (Morey et al., 2011), it is primarily used to estimate posterior distributions of parameters of interest. In the example that follows, we will compute Bayes factors with easier-to-use software and use JAGS to estimate posterior distributions. By the way, if you are lost with all those different distributions, check out Richard Morey’s nice interactive visualisations.

Note that the general linear model assumes that:

y_i = \beta_0 + \beta_1 x_i + \epsilon_i

\epsilon \sim \text{Normal(}0, \sigma)

that is the errors are normally distributed and centered at 0. From this it follows that:
y_i - (\beta_0 + \beta_1 x_i) &\sim \text{Normal(}0, \sigma) \\
y_i &\sim \text{Normal(}\beta_0 + \beta_1 x_i, \sigma)

which is the specification used in JAGS (with precision instead of variance).

The plot below shows the posterior distributions of b0 (the mean in the no-hats condition), b1 (the difference between the two groups), and d (effect size):


How wonderful! Instead of a point estimate, we get a whole distribution. It is important to check if the MCMC algorithm actually converged on the posterior distribution. Run the commands below to monitor convergence:

Upon inspection, everything seems fine!

Let’s compute credible intervals – the Bayesian analogue to confidence intervals:

We can be 95% confident that the effect size is between .034 and .802. There is substantial variability, but that’s just how things are (also this). Note that, in this case, the frequentist confidence interval numerically coincides with the Bayesian credible interval.
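The JAGS samples themselves are not shown here, but the mechanics of a credible interval are easy to sketch with the analytic coin posterior from earlier (Python, scipy assumed):

```python
from scipy import stats

# Central 95% credible interval for theta after two heads: Beta(3, 1) posterior
lower, upper = stats.beta(3, 1).ppf([0.025, 0.975])
print(lower, upper)   # roughly 0.29 to 0.99
```

With MCMC output, the same interval is simply the 2.5% and 97.5% quantiles of the posterior samples.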

John Kruschke and other advocates of “hypothesis testing via estimation” would have us draw inferences from these posterior distributions alone. Simplifying a bit, if the 95% credible interval of d excludes 0 (or some specified region of interest, say -.1 to .1), we would conclude that there really is an effect. As I said in the section introducing the Bayes factor, this is a little fishy. It is unprincipled – how strongly should we believe in an effect? And it is biased against $H_0$ – we will see below that there is not much evidence for an effect.

Note that we might have outliers in our data, rendering the normal distribution inadequate. Thanks to JAGS’s flexibility, we can easily substitute the normal distribution with a t-distribution, which has fatter tails. This would make our inference more robust, see e.g. chapter 16 in Kruschke (2014).

Concluding the section on JAGS, I want to say that JAGS offers a lot of flexibility, but at the price of complexity. Below I introduce you to more user-friendly software, which also enables us to easily compute Bayes factors for common designs. However, if you are doing multivariate statistics like structural equation modeling, there is no way around software like JAGS. If you are interested in cognitive modeling of the Bayesian kind, you also need to use JAGS. For a great book on Bayesian cognitive modeling, see Lee & Wagenmakers (2013).

Using BayesFactor

The BayesFactor R package written by Richard Morey computes the Bayes factor for a variety of common statistical models. Similarly to the frequentist analysis above, we specify the t-test as a general linear model and use the function lmBF from the BayesFactor package:

The resulting Bayes factor is $BF_{10} = 1.84$; that is, the model that assumes an effect of hat is 1.84 times more likely than the model that assumes no effect – given a so-called JZS prior specification. JZS stands for Jeffreys–Zellner–Siow. This prior is essentially the same as the one we specified before. For a more detailed rationale, see Rouder et al. (2009).
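To unpack what “1.84 times more likely” buys you: the Bayes factor is the factor by which the data shift your prior odds, whatever those were. A minimal sketch:

```python
bf10 = 1.84  # the Bayes factor reported above

# The Bayes factor converts prior odds into posterior odds:
# posterior odds = BF10 * prior odds.
for prior_odds in (1.0, 0.5, 4.0):
    posterior_odds = bf10 * prior_odds
    p_h1 = posterior_odds / (1 + posterior_odds)  # posterior probability of H1
    print(f"prior odds {prior_odds}: posterior odds {posterior_odds:.2f}, "
          f"P(H1 | data) = {p_h1:.2f}")
```

With even prior odds, $BF_{10} = 1.84$ moves you from 50% to about 65% probability for $H_1$ – hardly decisive.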

The BayesFactor package also does ANOVA, multiple regression, proportions and more – do have a look! For a more detailed explanation of the Bayesian t-test, see this. For a tutorial on regression that uses the BayesFactor package, click here.

Using JASP

“Gosh!” you might say. “This all looks very peculiar. I am not familiar with R, let alone with JAGS. Bayes seems complicated enough. I’d better stick with p values and SPSS.”


JASP is a slick alternative to SPSS that also provides Bayesian inference (and it does APA tables!). First, let us write the data to a .csv file:

Open JASP, read in the .csv file, and run a Bayesian independent t-test. Checking Prior and Posterior and Additional Info gives us a beautiful plot summarizing the result:


Above you can see the Savage-Dickey density ratio, i.e. the height of the posterior divided by the height of the prior, at the point of interest (here $\delta = 0$).
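The Savage-Dickey trick is easy to demonstrate numerically. The sketch below uses a deliberately simple conjugate normal prior and made-up summary statistics – not JASP’s actual JZS/Cauchy setup – so only the principle carries over: the Bayes factor is the posterior density divided by the prior density at the test value.

```python
from scipy.stats import norm

# Toy model (made-up numbers, not the hat data):
# delta ~ N(0, 1) a priori; n observations with known unit variance.
n, xbar = 40, 0.35

# Conjugate updating gives the posterior for delta in closed form.
post_mean = n * xbar / (n + 1)
post_sd = (1 / (n + 1)) ** 0.5

# Savage-Dickey: BF01 is the posterior density over the prior density,
# both evaluated at delta = 0.
bf01 = norm.pdf(0, post_mean, post_sd) / norm.pdf(0, 0, 1)
bf10 = 1 / bf01
print(f"BF10 = {bf10:.2f}")
```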

Is this result consistent for different prior specifications? Checking Bayes factor robustness check yields the following result:


We see that across different widths (0–1.5) of the Cauchy prior, the conclusion stays roughly the same. What would happen if we were to specify a width of 200? Lindley would turn in his grave! As mentioned above in the section on Lindley’s paradox, uninformative priors should not be used for hypothesis testing, because this leads the Bayes factor to strongly support $H_0$. You can see that in the graph below.


Setting aside unreasonable prior specifications, the Bayes factor is not very different from 1. This means that our data are not sufficiently informative to choose between the hypotheses. While the Bayes factor is a continuous measure, it can be helpful to attach verbal labels to certain ranges of its magnitude. Several labeling schemes have been proposed. For an overview, see this. To get some intuition for how a Bayes factor “feels”, see here.
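For orientation, one widely used labeling scheme (adapted from Jeffreys; the cutoffs are conventions, not laws) can be sketched as:

```python
def bf_label(bf10):
    """Rough evidence category for BF10, after one common scheme
    (adapted from Jeffreys); the cutoffs are conventions, not laws."""
    bf, h = (bf10, "H1") if bf10 >= 1 else (1 / bf10, "H0")
    if bf < 3:
        strength = "anecdotal"
    elif bf < 10:
        strength = "moderate"
    elif bf < 30:
        strength = "strong"
    elif bf < 100:
        strength = "very strong"
    else:
        strength = "extreme"
    return f"{strength} evidence for {h}"

print(bf_label(1.84))  # the hat example above
print(bf_label(25))
print(bf_label(0.05))
```

On this scheme, the $BF_{10} = 1.84$ from above counts as merely anecdotal evidence.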

Because the data are not compelling enough, we run an additional 20 participants per group:

A quick look tells us that the Bayes factor in favour of the alternative hypothesis is now roughly 25:

We can see how the Bayes factor develops over the data collection using JASP. First, let’s write the data to disk again:

Running the same analysis in JASP in order to get those stunning graphs yields:


Checking Sequential Analysis yields:


This plot shows how the Bayes factor develops over the course of participant collection.

JASP includes several interesting datasets. I encourage you to play with them! Peter Edelsbrunner and I recently gave a workshop where we had some exercises and more datasets. You can find the materials here.

Advantages of Bayesian inference

The right kind of probability

I don’t think that researchers are really interested in the probability of the data (or more extreme data), given that the null hypothesis is true and the data was collected according to a specific (unknown) sampling plan. Rather, I believe that scientists care about which hypothesis is more likely to be true after the experiment has been conducted. In the same vein, we don’t care about how often, in an infinite repetition of the experiment, the parameter estimate lies in a specific interval. We prefer a statement of confidence. How confident can I be that the parameter estimate lies in this specific interval?


“Today one wonders how it is possible that orthodox logic continues to be taught in some places year after year and praised as ‘objective’, while Bayesians are charged with ‘subjectivity’. Orthodoxians, preoccupied with fantasies about nonexistent data sets and, in principle, unobservable limiting frequencies – while ignoring relevant prior information – are in no position to charge anybody with ‘subjectivity’.”
– Jaynes (2003, p. 550), as cited in Lee & Wagenmakers (2014, p. 61)

Bayesian inference conditions only on the observed data – that is, it does not violate the likelihood principle – and is therefore independent of the researcher’s intentions. This might sound surprising, but p values are inherently subjective – in a really nasty way. For theoretical arguments, see Berger & Berry (1988) and Wagenmakers (2007). Scientific objectivity is illusory, and both frequentist and Bayesian methods have their subjective elements. Crucially, though, Bayesian subjectivity is open to inspection by everyone (just look at the prior!), whereas frequentist subjectivity is not.

Optional Stopping

“The rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience.”
– Edwards, Lindman, & Savage (1963, p. 239)

Bayesians do not have an optional stopping problem (Edwards et al., 1963; Rouder, 2014). Recall that in classical statistics, you cannot collect additional data once you have run your test, because this inflates the nominal $\alpha$ level; that is, it will rain false positives (Simmons, Nelson, & Simonsohn, 2011). For example, say you test 20 participants and get $p = .06$. You cannot test another batch of, say, 5 and run your test again. In a sense, you are in limbo: you can neither conclude that $H_1$ is supported, nor that $H_0$ is supported, nor that the data are uninformative. On other occasions you might want to stop early because the data show a clear picture, and running more subjects might be expensive or unethical. However, within the framework of classical statistics, you are not permitted to do so. Using Bayesian inference, we can monitor the evidence as it comes in – that is, test after every participant – and stop data collection once the data are informative enough, say once the Bayes factor in favour of $H_0$ or $H_1$ is greater than 10. For a recent paper on sequential hypothesis testing, see Schönbrodt (submitted).
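Here is a toy simulation of that workflow under a deliberately simple conjugate model (a normal prior on the effect and a Savage-Dickey Bayes factor – not the JZS test used earlier); the true effect size, stopping threshold, and safety cap are all made up:

```python
import numpy as np
from scipy.stats import norm

def bf10_toy(x):
    """Savage-Dickey Bayes factor in a toy model: delta ~ N(0, 1) a priori,
    observations ~ N(delta, 1). Not the JZS t-test used earlier."""
    n = len(x)
    post_mean = n * np.mean(x) / (n + 1)
    post_sd = (1 / (n + 1)) ** 0.5
    bf01 = norm.pdf(0, post_mean, post_sd) / norm.pdf(0, 0, 1)
    return 1 / bf01

rng = np.random.default_rng(7)
data, bf = [], 1.0

# Test after every participant; stop once the evidence is compelling
# either way (BF10 > 10 or BF01 > 10), with a cap as a safety net.
while len(data) < 500 and not (bf > 10 or bf < 1 / 10):
    data.append(rng.normal(0.5, 1.0))  # simulated true effect: delta = 0.5
    bf = bf10_toy(np.array(data))

print(f"Stopped after {len(data)} participants, BF10 = {bf:.1f}")
```

Because the Bayes factor conditions only on the data observed so far, peeking after every participant does not invalidate it.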

Continuous Measure of Evidence

The Bayes factor is a continuous measure of evidence and is directly interpretable. While inference in current statistical practice relies on an arbitrary cutoff of $\alpha = .05$ and yields a counterfactual quantity (the probability of data at least as extreme, given $H_0$), Bayes factors straightforwardly tell you which hypothesis is more likely given the data at hand.

Supporting the null

Bayes factors explicitly look at how likely the data are under $H_0$ and under $H_1$. Recall that the p value looks at the probability of the data (or more extreme data), given that $H_0$ is true. The logic behind this inference, called Fisher’s disjunction, is as follows: either a rare event has occurred (with probability $p$), or $H_0$ is false. In the form of a syllogism, we have:

(Premise) If $H_0$, then y is very unlikely.

(Premise) y.

(Conclusion) $H_0$ is very unlikely.

This reasoning is flawed, as demonstrated by the following:

(Premise) If an individual is a man, he is unlikely to be the Pope.

(Premise) Francis is the Pope.

(Conclusion) Francis is probably not a man.

Nonsense! The problem is that in classical inference, we do not look at the probability of the data under $H_1$. The data at hand, Francis being the Pope, are infinitely more likely under the hypothesis that Francis is a man ($H_0$) than they are under the hypothesis that Francis is not a man ($H_1$).

P values, because they only look at the probability of the data under $H_0$, are violently biased against $H_0$ (we have already seen this in our creativity example). For a more detailed treatment, see Wagenmakers et al. (in press). A study looking at 855 t-tests to quantify the bias of p values empirically can be found in Wetzels et al. (2011). A tragic, real-life case of how a p value caused grave harm is the case of Sally Clark; the story is also told in Rouder et al. (submitted).

Bayesian inference conditions on both $H_0$ and $H_1$, thus it also allows us to quantify support for the null hypothesis. In science, invariances can be of great interest (Rouder et al., 2009). Being able to support the null hypothesis is also important in replication research (e.g. Verhagen & Wagenmakers, 2014).


I hope to have convinced you that Bayesian statistics is a sound, elegant, practical, and useful method of drawing inferences from data. Bayes factors continuously quantify statistical evidence – either for $H_0$ or $H_1$ – and provide you with a measure of how informative your data are. If data are not informative ($BF \sim 1$), simply collect more data. Credibility intervals retain the intuitive, common-sense notion of probability and tell you exactly what you want to know: how certain am I that the parameter estimate lies within a specific interval?

JAGS, BayesFactor, and especially JASP provide easy-to-use software so that you can actually get stuff done. In light of what I have told you so far, I want to end with a rather provocative quote by Dennis Lindley, a famous Bayesian:

“[…] the only good statistics is Bayesian statistics. Bayesian statistics is not just another technique to be added to our repertoire alongside, for example, multivariate analysis; it is the only method that can produce sound inferences and decisions in multivariate, or any other branch of, statistics. It is not just another chapter to add to that elementary text you are writing; it is that text. It follows that the unique direction for mathematical statistics must be along the Bayesian roads.”
– Lindley (1975, p. 106)

Suggested Readings

For an easy introduction, I suggest playing with this and reading this, this, this, and this. Peter Edelsbrunner and I recently did a workshop on Bayesian inference; you can find all the materials (slides, code, exercises, reading list) on github. For a thorough treatment, I suggest Jackman (2009) and Gill (2015). For an introduction more geared toward psychologists, but without a proper account of hypothesis testing, see Kruschke (2014). For an excellent practical introduction to Bayesian cognitive modeling, see Lee & Wagenmakers (2013).

Important Note

Bayesian and frequentist statistics have a long history of bitter rivalry (see for example McGrayne, 2011). Because the core issues – e.g. what is probability? – are philosophical rather than empirical, most of the debates were heated and emotional. There was a time when Bayes was more theology than tool. Although Bayesian statistics – in contrast to our current statistical approach – is a coherent, principled, and intuitive way of drawing inferences from data, there are still open issues. Moreover, a Bayesian is not a Bayesian: there are different flavours of Bayesianism (Berger, 2006; Gelman & Shalizi, 2013; Morey, Romeijn, & Rouder, 2013; Kruschke & Liddell, submitted).

It would be foolish to think that Bayesian statistics could single-handedly turn around psychology. The current “crisis in psychology” (Pashler & Wagenmakers, 2012) won’t be solved by reporting $BF_{10} > 3$ instead of $p < .05$. Bayes cannot be the antidote to questionable research practices, “publish or perish” incentives, or mindless statistics (e.g. Gigerenzer, 2004). However, because Bayes factors are not biased against $H_0$, allow us to state evidence for the absence of an effect, and condition only on the observed data, Bayesian statistics increases both the flexibility in data collection and the robustness of our inferences. With the above tools in the trunk, there is no reason not to use Bayesian statistics.


References

Bayes, T., & Price, R. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions (1683–1775), 370–418.

Berger, J. (2006). The case for objective Bayesian analysis. Bayesian Analysis, 1(3), 385–402.

Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 159–165.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29.

DeGroot, M. H. (1982). Lindley’s paradox: Comment. Journal of the American Statistical Association, 336–339.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3), 193.

Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. Scientific American, 236, 119–127.

Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38.

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. A handbook for data analysis in the behavioral sciences: Methodological issues, 311–339.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1–20.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.

Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial introduction with R. Academic Press.

Kruschke, J., & Liddell, T. (submitted). The Bayesian New Statistics: Two historical trends converge. Manuscript available from here.

Lee, M. D., & Wagenmakers, E.-J. (2014). Bayesian cognitive modeling: A practical course. Cambridge University Press.

Lindley, D. (1975). The future of statistics: A Bayesian 21st century. Advances in Applied Probability, 106–115.

Lindley, D. V. (1957). A statistical paradox. Biometrika, 187–192.

Lodewyckx, T., Kim, W., Lee, M. D., Tuerlinckx, F., Kuppens, P., & Wagenmakers, E.-J. (2011). A tutorial on Bayes factor estimation with the product space method. Journal of Mathematical Psychology, 55(5), 331–347.

Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (in press). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology. Available from here.

McGrayne, S. B. (2011). The theory that would not die: How Bayes’ rule cracked the enigma code, hunted down Russian submarines, & emerged triumphant from two centuries of controversy. Yale University Press.

Morey, R. D., Romeijn, J. W., & Rouder, J. N. (2013). The humble Bayesian: Model checking from a fully Bayesian perspective. British Journal of Mathematical and Statistical Psychology, 66(1), 68–75.

Morey, R. D., Rouder, J. N., Verhagen, J., & Wagenmakers, E.-J. (2014). Why hypothesis tests are essential for psychological science: A comment on Cumming (2014). Psychological Science, 25(6), 1289–1290.

Morey, R. D., Rouder, J. N., Pratte, M. S., & Speckman, P. L. (2011). Using MCMC chain outputs to efficiently estimate Bayes factors. Journal of Mathematical Psychology, 55(5), 368–378.

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester: Wiley.

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301–308.

Rouder, J. N., & Morey, R. D. (2012). Default Bayes factors for model selection in regression. Multivariate Behavioral Research, 47(6), 877–903.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.

Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (submitted). Is there a free lunch in inference? Available from here.

Sorensen, T., & Vasishth, S. (submitted). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. Available from here.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (2014). Model comparison and the principle of parsimony. Oxford Handbook of Computational and Mathematical Psychology. Oxford: Oxford University Press.

Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology: General, 143(4), 1457–1475.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

Wagenmakers, E.-J., Lee, M., Rouder, J. N., & Morey, R. D. (submitted). Another statistical paradox. Available from here.

Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., . . . Morey, R. D. (in press). A power fallacy. Behavior Research Methods. Available from here.

Wagenmakers, E.-J., Lee, M., Lodewyckx, T., & Iverson, G. J. (2008). Bayesian versus frequentist inference. In Bayesian evaluation of informative hypotheses (pp. 181–207). Springer.

Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189.

How not to worry about APA style

If you have gone through the trouble of picking up a copy of the Publication Manual of the American Psychological Association (APA, 2010), I’m sure your first reaction was similar to mine: “Ugh! 272 pages of boredom.” Do people actually read this monster? I don’t know. I don’t think so. I know I haven’t read every last bit of it. You may be relieved to hear that your reaction resonates with some of the critique that has been voiced by senior researchers in Psychology, such as Henry L. Roediger III (2004). But let’s face it: APA style is not going anywhere. It is one of the major style regimes in academia and is used in many fields other than Psychology, including medical and other public health journals. And to be fair, standardizing academic documents is not a bad idea. It helps readers to efficiently access the desired information. It helps authors by making the journal’s expectations regarding style explicit, and it helps reviewers to concentrate on the content of a manuscript. Most importantly, the guidelines set a standard that is accepted by a large number of outlets. Imagine a world in which you had to familiarize yourself with a different style every time you chose a new outlet for your scholarly work.

APA style is hard

The data presented in an earlier post on this blog indicate that Psychology students find it difficult to adhere to the APA guidelines. Among the 9 most common mistakes in submissions to the Journal of European Psychology Students are

  1. missing or incorrect running head (86.3%)
  2. errors with in-text citations (84.0%)
  3. missing or incorrectly formatted page numbers (75.0%)
  4. incorrect margins (52.2%)
  5. indentation of first line of each paragraph (43.1%)

From my experience as an editorial assistant at the journal Experimental Psychology I know that fully mastering APA style is hard even for more senior researchers — and that’s okay. In fact, I’m glad that most researchers use their limited time on research (or teaching) rather than memorizing the “Publication Manual”. Life is too short to learn the ins and outs of APA style.

How not to worry about APA style

If you want to publish psychological research, you will have to produce properly formatted APA style manuscripts. Fortunately, this is a problem many researchers face; in other words, there is no reason to start from scratch. You could use an APA template for common word processors such as Microsoft Word or Libre Office that takes care of the page setup, line spacing, etc. But to be up-front, I want to convince you that there is a better way to write your manuscripts, one that prevents all of the above-mentioned errors and more. I want to introduce you to Markdown, an easy-to-read and -write annotation system that makes writing APA style a breeze.

Don’t mix content and style

A general principle in typesetting — be it on (digital) paper or the web — is to separate content and style. Separation is commonly achieved through the use of a markup language, which is a system of document annotations. These annotations declare portions of text as titles, section headings, or list items, but crucially, they are agnostic about what this means visually (e.g., <bold>text</bold> instead of directly bold-formatted text). There are several advantages to this approach, but I’ll only briefly name three of them here:

  1. Focus on writing. A common form of procrastination for many writers is making the document pretty: adding a newline here or a manual line break there, moving a table just two pixels to the left, etc. Writing a markup document in plain text lets you focus on the content rather than the style.
  2. Swiftly adjust the style. If your paper is rejected and the next target journal prefers a different flavor of APA style, there is no need to touch your writing. As a simple example, I recently submitted a paper to a journal that asked me to collect all figure captions at the end of the document on one page rather than printing them below the corresponding figures. Because my captions were declared as such, I left the text unchanged (captions below the figures) and simply changed the option controlling the captions’ position within the document.
  3. Write plain text files. Once you move to writing in plain text files, you open yourself up to a whole new world of very helpful tools to facilitate your writing and collaboration, such as dynamic documents or the version control system git, but that’s a topic for another blog post.

Learn Markdown

Am I suggesting you replace one evil with another – that not learning APA style requires learning a whole new language? No, Markdown is intended to be as easy to read and write as possible. The following is an excerpt from the APA example manuscript written in Markdown.
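The excerpt itself is not reproduced on this page, so here is a minimal stand-in with made-up content that uses the same annotations:

```markdown
# Does wearing a hat increase creativity?

<!-- TODO: tighten this paragraph before submission. -->

Participants wearing a hat scored higher on average,
$d = 0.42$, than participants without one.[^p]

[^p]: Materials and data are available from the authors.
```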

Without knowing anything about Markdown, it should be easy to guess what the annotations mean: # declares hierarchical section headings, <!-- and --> envelope comments, and [^p] adds a reference to a footnote. As you can see, Markdown is easy to learn and will quickly save time in manuscript preparation. The only thing that may be scary at first is the equations enveloped by $. Equations are written in the powerful, yet fairly simple, equation syntax used in LaTeX. Although LaTeX is widely used to write entire manuscripts (not just equations), it is not very popular in the field of Psychology. I suspect that this neglect is largely due to its complexity and steep learning curve, which I find rather deterring myself. For the average Psychology paper, both seem to outweigh the system’s advantages in handling citations and cross-references or typesetting large documents, complex tables, and equations, which are rare anyway. That is why I like the idea of using Markdown as a simple interface that harnesses the power of LaTeX without having to write or know much LaTeX.

Use a reference manager

If you are not already using a reference manager such as Zotero, I strongly suggest you start doing so. Reference managers are like iTunes for your literature: they help you search, download, and organize papers. Most importantly, with a few clicks you can export a collection of references you need for a paper into a .bib-file. Once your references are in a .bib-file that resides in the same folder as your Markdown file, you can easily add citations to your Markdown document. Each reference has a unique handle, e.g. lewandowsky_computational_2011, which you can use in Markdown: @lewandowsky_computational_2011 creates an in-text citation; [@lewandowsky_computational_2011] creates a citation in parentheses. Everything reference-related, such as in-text citations and the reference section, will be taken care of automatically.
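For instance, Markdown source like the following (the handle is the hypothetical one from above) covers both citation styles; the citation processor then expands the handles and appends the full reference to the reference section:

```markdown
@lewandowsky_computational_2011 made this point forcefully; computational
modeling has become indispensable [@lewandowsky_computational_2011].
```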

Let R take care of the rest

To turn your Markdown file into a polished APA manuscript, you need to set a few options and then create a .pdf file. Both could be done manually, but the way I do it is with the text editor RStudio (a text editor for R, though you need literally no knowledge of R for this) and papaja, the R package I’m developing with Marius Barth. Turning your Markdown into a .pdf file involves intermediate steps and software that are really not important to know about; RStudio lets you do all of this at the click of a button. As a side note, if you use R for your analyses, you can embed the analysis code into your document and insert statistics, figures, and tables on the fly while creating your manuscript. This is what is called a dynamic document (Xie, 2013) and the topic of a future blog post.

How to create your first manuscript


If you want to try writing a manuscript in Markdown, you need to install a couple of things:

Make sure you install the complete – not the basic – TeX version, and if you are on Ubuntu 14.04 you need a couple of extra TeX packages. Finally, install the development version of papaja by opening RStudio and copying the following into the R console:
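The console snippet itself is not reproduced here. Assuming the development version lives in the crsh/papaja repository on GitHub (an assumption worth double-checking against the papaja README), the usual devtools invocation would be:

```r
# Assumed repository location - check papaja's README before running.
install.packages("devtools")             # if devtools is not yet installed
devtools::install_github("crsh/papaja")  # development version from GitHub
```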

New documents

Once you have installed papaja, you can create an APA document through the menus in RStudio (File > New File > R Markdown). If you take the time to explore the menu a little, you will find that Markdown can be used to create a range of different documents, such as slides or HTML files. The new text file will contain a document header enveloped by --- followed by the body of the text. There will be some scary-looking R stuff following the header; feel free to delete all of it. To preview your manuscript, click the Knit button. If you click on the question mark next to it, you can get help regarding Markdown in case you get stuck. Also, a look at the papaja example document may be helpful. All you need to do now is fill in the meta-information (e.g. authors, title, and abstract) in the header of the document, start writing, and stop worrying about APA style.


References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.

Roediger, H. L. (2004). What should they be called? APS Observer, 17(4).

Xie, Y. (2013). Dynamic documents with R and knitr. Boca Raton: Chapman & Hall/CRC.

Of Elephants and Effect Sizes – Interview with Geoff Cumming

We all know these crucial moments while analysing our hard-earned data – the moment of truth: is there a star above the small p? Maybe even two? Can you write a nice and simple paper, or do you have to bend over backwards to explain why people do not, surprisingly, behave the way you thought they would? It all depends on those little stars, below or above .05, significant or not, black or white.

Interview with Prof. Dermot Barnes-Holmes


Prof. Dermot Barnes-Holmes was a Foundation Professor at the Department of Psychology at the National University of Ireland, Maynooth. He is known for his research on human language and cognition through the development of Relational Frame Theory (RFT) with Steven C. Hayes, and its applications in various psychological settings.

What I enjoy most about my job as a researcher … Supervising research students who are passionate about and genuinely interested in their research. Sharing what is often a voyage of intellectual discovery for both the student and me is still, after all these years, by far the most stimulating and enjoyable feature of what I do as an academic.

Interview with Prof. Alice Mado Proverbio

Prof. Alice Mado Proverbio has a degree in Experimental Psychology from the University of Rome “La Sapienza” and a PhD in General Psychology from the University of Padua. She did her postdoctoral training at the University of California at Davis and at the University of Padua. As a research scientist at the University of Trieste, she led the Cognitive Electrophysiology Laboratory from 1996 to 2000. Since 2001, she has been Associate Professor of Psychobiology and Physiological Psychology at the University of Milano-Bicocca, where she founded the “Cognitive Electrophysiology” Lab in 2003. In 2014, she received the Habilitation as full Professor.

What I enjoy most about my job as a researcher … Without a doubt, what I enjoy most about my job as a researcher is the possibility to create and devise new experiments, to test exciting new ideas, to challenge pre-existing models with new hypotheses that I gather from discussions with people, but especially from a lot of reading and from listening to insightful talks. It is not rare that I get what seems to be a brilliant idea from reading or listening to scientists working outside my specific research field (cognitive electrophysiology): genetics, evolutionary psychology, cellular biology, primatology, or even molecular neuroscience. It can be something on Twitter, or even something that I spotted online. That’s what I like most: the creative process that precedes the actual experimental testing.

I also like that magic moment when, with my young co-workers standing all around my computer, we run the final ANOVA on a particular set of data we judge to be crucial to test our hypothesis. And we are all there, laughing and crossing our fingers, hoping for a high statistical significance, and then it gets p<0.005 and we all scream! I also love it when an idea – just an incorporeal dream or a rough sketch at the beginning – after months of working with my students, refining details, re-adjusting the methodology, and changing the paradigm, finally becomes a consolidated paradigm, a concrete thing, almost a “person”, with a given personality and specific attitudes. We love to coin names for our new studies and paradigms, and stimulus types. Even computers and supplies and ERP components have personalized names in my lab. There are unofficial names (“just for us”) and more official, scientific terms that will be used later in the paper or in the dissertation.

The biggest challenge in my career so far was … there have been several challenging moments in my career, especially when I changed roles, becoming first a PhD student, then a Post-Doc fellow, a Researcher, and finally a Professor. Every passage required great effort in adjusting to the new situation and the many new commitments (not to mention the new town or country, the new home, the new life, etc.). When I got a PhD student position, I had to learn how to speak in public, deliver talks, and travel a lot (while I enjoyed running experiments and writing my own papers). When I became a Post-Doc, I had to learn how to manage international relations and cooperate with multiple people and research groups. As a researcher, I had to face a lot of new work, mostly coming from student supervision, teaching, and writing (books, chapters, papers), not to mention being the only person responsible for the ERP lab. I often had to work overnight. Becoming a Professor was very challenging at first, because of the large amount of teaching and lessons that I had to prepare for the first time. I learned how to be a good referee, a wise editor, and the best mentor possible for my students. I learned how to be very efficient with bureaucratic, administrative, and faculty duties, in order to have time for my research and my lab.

One research project I will never forget is…  I will never forget the research project aimed at testing the existence of possible subcortical inter-hemispheric pathways transferring visuomotor information in the brain of callosotomy (split-brain) patients, which I carried out in Ron Mangun’s lab in cooperation with Michael Gazzaniga at the Center for Neuroscience of the University of California, Davis. I had the extraordinary opportunity to test, and get to know personally, a beautiful person: the famous patient JW. I recall being incredibly excited and proud of my work at that time.

What I look for in a student who wants to work under my supervision … I mainly look for dedication, enthusiasm, patience, competence, rigor and loyalty, not necessarily in that specific order.

Student research could be improved by … I think that student research deserves the right equilibrium between autonomy and supervision. Sometimes I meet bright young researchers presenting poor pieces of evidence or lousy talks because of their inexperience combined with a lack of supervision from their mentor. It’s a real pity. Other times, I see students acting as mere executors of projects they do not fully comprehend, testing hypotheses that they do not even scientifically understand. I think that students should not only perform the practical hands-on work in laboratories, but also do a lot of studying and reading to build strong specialized knowledge.

Academically, I most admire … women researchers (especially those who are independent and have not grown up under the wing of a powerful male mentor) …  because …. sometimes they have to work twice as hard as their male colleagues to prove their qualities. Indeed, gender discrimination and inequalities of various types (from the most subtle to the most evident and gross) are still present at every level along the academic trail.

I wish someone had told me at the beginning of my career … I do not know how to answer this. I think that no advice can teach you better than your own personal experience. But I recall what I actually was told, which turned out to be very useful in the hard times, and that is: do what you feel is best for you.

The largest changes in psychological science in the next 10 years will be … I am unsure what to predict. But I am pretty sure that the future is linked to multidisciplinary integration, and that psychology will grow only in interaction with other scientific disciplines, such as cognitive neuroscience, genetics, evolutionary psychology, cellular neuroscience, molecular biology, neuroimaging and its new emerging techniques (such as diffusion tensor imaging), and fields that are still developing these days, such as brain–computer interfaces (BCI) and robotics.

Interview with Prof. Csikszentmihalyi


Prof. Mihaly Csikszentmihalyi is Distinguished Professor of Psychology and Management at Claremont Graduate University and former head of the Department of Psychology at the University of Chicago. He is noted for his research on happiness and creativity, on which he has published over 120 scientific articles and book chapters. He is also well known for introducing the concept of flow in his seminal work “Flow: The Psychology of Optimal Experience”.

What I enjoy most about my job as a researcher …  two things: the early analysis of data, when you are looking for patterns — exploring the psychological landscape, so to speak. Then the last part, when you start writing and trying to find the best way to express what you have learned.

The biggest challenge in my career so far was … to break out of the two reigning paradigms of my student days: the Freudian and the Skinnerian approaches.

One research project I will never forget is… perhaps the few months in 1968 when we started collecting data on the flow experience with a group of students at the college where I was teaching at the time, Lake Forest College.

What I look for in a student who wants to work under my supervision … besides the obvious ones (academic and intellectual abilities): intrinsic motivation, a sense of humor, lack of excessive egotism.

Student research could be improved by … learning that what matters is engagement in a worthwhile project.

Academically, I most admire … my friend Howard Gardner …  because …. he is an unselfish, sophisticated intellectual.

I wish someone had told me at the beginning of my career … how to get financial support for conducting large-scale research — although I probably would have ignored the advice anyway . . .

The largest changes in psychological science in the next 10 years will be … I am not a prophet, alas, so I have no idea. I know that the best-case scenario would be for psychology to focus on human experience, and establish conceptual links with other social sciences like sociology, anthropology, history, economics, and political science . . . The worst-case scenario would be selling out to neurobiology, and becoming a sub-discipline of that field. But I have no clue as to which of these two scenarios will win out in the evolutionary process.

Interview with Dr. Deirdre Barrett

Dr. Deirdre Barrett is a researcher and lecturer at Harvard Medical School. She is well known for her research on dreams, hypnosis, and imagery. More recently she has written about evolutionary psychology and technology. She has also written several successful books for the general public.

What I enjoy most about my job as a researcher …  Any questions I have—in my case about dreams—I can come up with a way to operationalize the question and get an answer.

Make the Most of Your Summer: Summer Schools in Europe

Why should you attend Summer Schools?

To put it simply: there is no better way to learn about psychology (and related disciplines), to travel, and to meet new people, all at the same time! Summer schools offer the opportunity to explore areas of psychology that might not be taught at your university, or to dig deeper into a subject, since this format allows you to focus on one topic in the company of students who are enthusiastic about the same subject. Last year, I attended a summer school on Law, Criminology and Psychology – coming from Germany, where criminology sits in the law faculty, this was my opportunity to learn more about eye-witness accounts, lie detection, psychopathy, and how to interrogate children. Aside from classic lectures, summer schools often include seminars and group work.

Answering Frequently Asked Questions about JEPS

Is there anything you ever wanted to know about JEPS and the people behind it? Here are answers to our ten most frequently asked questions.

1. Who are we?

We are students from all over Europe and, as the Editorial Team of the Journal of European Psychology Students (check out our website here), we run JEPS. Together with a group of other people (Associate Editors, Reviewers, Copyeditors, and Proofreaders), we see students’ manuscripts through the publication process.


Most frequent APA mistakes at a glance

APA guidelines, don’t we all love them? As an example, take one simple black line used to separate words – the hyphen: not only do you have to check whether a term needs a hyphen or whether a blank space will suffice, you also have to think about the different types of dashes (em dash, en dash, minus sign, and hyphen). Yes, it is not that much fun. And at JEPS we often get the question: why do we even have to adhere to these guidelines?


Common APA Errors; Infographic taken from the EndNote Blog

The answer is rather simple: the formatting constraints imposed by journals allow the emphasis to be placed on the manuscript’s content during the review process. Because all submitted manuscripts share the same format, Reviewers can concentrate on the content without being distracted by unfamiliar and irregular formatting and reporting styles.

The Publication Manual runs to an impressive 286 pages and causes quite some confusion. At JEPS, we have counted the most frequent mistakes in manuscripts submitted to us – data that the EndNote blog has translated into this nice little infographic.

Here you can find some suggestions on how to avoid these mistakes in the first place.


American Psychological Association. (2009). Publication Manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.

Vainre, M. (2011). Common mistakes made in APA style. JEPS Bulletin, retrieved from