R is a statistical programming language whose popularity is quickly overtaking SPSS and other "traditional" point-and-click software packages (Muenchen, 2015). But why would anyone use a programming language, instead of point-and-click applications, for data analysis? An important reason is that data analysis rarely consists of simply running a statistical test. Instead, many small steps, such as cleaning and visualizing data, are usually repeated many times, and computers are much faster at repetitive tasks than humans are. Using a point-and-click interface for these "data cleaning" operations is laborious and unnecessarily slow:
“[T]he process of tidying my data took me around 10 minutes per participant as I would do it all manually through Excel. Even for a moderate sample size, this starts to take up a large chunk of time that could be spent doing other things like writing or having a beer” (Bartlett, 2016).
A programmed analysis would seamlessly apply the tidying steps to every participant in the blink of an eye, and would itself constitute an exact script of what operations were applied to the data, making it easier to repeat the steps later.
Learning to use a programming language for data analysis reduces human labor and saves time that could be better spent doing more important (or fun) things. In this post, I introduce the R programming language and motivate its use in psychological science. The introduction is aimed at students and researchers with no programming experience, but it is suitable for anyone with an interest in learning the basics of R.
The R project for statistical computing
“R is a free software environment for statistical computing and graphics.” (R Core Team, 2016)
Great, but what does that mean? R is a programming language that is designed and used mainly in the statistics, data science, and scientific communities. R has "become the de facto standard for writing statistical software among statisticians and has made substantial inroads in the social and behavioural sciences" (Fox, 2010). This means that if we use R, we'll be in good company (and that company is likely to be even better and more numerous in the future; see Muenchen, 2015).
To understand what R is, and is not, it may be helpful to begin by contrasting R with its most common alternative, SPSS. Many psychologists are familiar with SPSS, which has a graphical user interface (GUI), allowing the user to look at the two-dimensional data table on screen and click through various drop-down menus to conduct analyses on the data. In contrast, R is an object-oriented programming language. Data is loaded into R as a "variable", meaning that in order to view it, the user has to print it on the screen. The power of this approach is that the data is an object in a programming environment, and only your imagination limits what functions you can apply to it. R also has no GUI to navigate with the mouse; instead, users interact with the data by typing commands.
SPSS is expensive to use: universities have to pay real money to make it available to students and researchers. R and its supporting applications, on the other hand, are completely free, meaning that both users and developers have easier access to it. R is open source software, which means that many cutting-edge statistical methods are implemented in R much sooner than in SPSS. This is apparent, for example, in the recent uprising of Bayesian methods for data analysis (e.g. Buerkner, 2016).
Further, SPSS's facilities for cleaning, organizing, formatting, and transforming data are limited (and not very user-friendly, although this is a subjective judgment), so users often resort to a spreadsheet program (Microsoft Excel, say) for data manipulation. R has excellent capacities for every step in the analysis pipeline, including data manipulation, so the analysis never has to spread across multiple applications. You can imagine how the possibility for mistakes, and the time needed, is reduced when the data file(s) doesn't need to be juggled between applications. Switching between applications, and repeatedly clicking through drop-down menus, means that for any small change the human using the computer must redo every step of the analysis. With R, you can simply reuse your analysis script and just import different data to it.
These considerations lead to the two contrasting workflows in Figure 1. Workflow 1 uses a programming language, such as R. It is harder to learn, but beginners generally get started with real analysis in an hour or so. The payoff for the initial difficulty is great: the workflow is reproducible (users can save scripts and show their friends exactly what they did to create those beautiful violin plots); the workflow is flexible (want to do everything just the way you did it, but do the plots for males instead of females? Easy!); and most importantly, repetitive, boring, but important work is delegated to a computer.
The final point requires some reflection; after all, computer programs all work on computers, so it sounds like a tautology. But what I mean is that repetitive tasks can be wrapped in a simple function (these are usually already available; you don't have to create your own functions), which then performs the tasks as many times as you like. Many tasks in the data cleaning stage, for example, are fairly boring and repetitive (calculating summary statistics, aggregating data, combining spreadsheets or columns across spreadsheets), but much less so when one uses a programming language.
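As a small sketch of what this delegation looks like in practice (the data and variable names here are hypothetical, invented purely for illustration):

```r
# Hypothetical reaction-time data for three participants
rts <- list(
    p1 = c(512, 430, 488),
    p2 = c(390, 415, 402),
    p3 = c(601, 577, 590)
)
# One call applies the same summary to every participant
sapply(rts, mean)
#>       p1       p2       p3 
#> 476.6667 402.3333 589.3333
```

Whether there are three participants or three hundred, the code stays the same; only the data changes.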
Workflow 2, on the other hand, is easy to learn because there are few well-defined and systematic parts to it; everything is improvised on a task-by-task basis and done manually by copy-pasting, pointing-and-clicking, and dragging-and-dropping. "Clean and organize" the data in Excel. "Analyze" in SPSS. In the optimal case, where the data is perfectly aligned with the format that SPSS expects, you can get a p-value in less than a minute (excluding SPSS startup time, which is quickly approaching infinity) by clicking through the drop-down menus. That is truly great, if that is all you want. But that's rarely all we want, and data is rarely in SPSS's required format.
Workflow 2 is not reproducible (that is, it may be very difficult, if not impossible, to exactly retrace your steps through an analysis), so although you may know roughly that you "did an ANOVA", you may not remember which cases were included, what data was used, how it was transformed, and so on. Workflow 2 is not flexible: you've just done a statistical test on data from Experiment 1? Great! Can you now do it for Experiment 2, but log-transform the RTs? Sure, but then you would have to restart from the Excel step and redo all that pointing and clicking. This leads to Workflow 2 requiring the human to do too much work, and to spend time on the analysis that could be better spent "doing other things like writing or having a beer" (Bartlett, 2016).
So, what is R? It is a programming language especially suited for data analysis. It allows you to program (more on this below!) your analyses instead of pointing and clicking through menus. The point here is not that you can't do analysis with a point-and-click, SPSS-style software package. You can, and you can do a pretty damn good job with it. The point is that you can work less and be more productive if you're willing to spend some initial time and effort learning Workflow 1 instead of the common Workflow 2. And that requires getting started with R.
Getting started with R: From 0 to R in 100 seconds
If you haven’t already, go ahead and download R, and start it up on your computer. Like most programming languages, R is best understood through its console—the interface that lets you interact with the language.
After opening R, you should see a window similar to the one shown here. The console allows us to type input, have R evaluate it, and return output. Just like a fancy calculator. Here, our first input assigned (R uses the left arrow, `<-`, for assignment) all the integers from 0 to 100 to a variable called `numbers`. Computer code can often be read from right to left; the first line here would say "take the integers 0 through 100, and assign them to `numbers`". We then calculated the mean of those numbers using R's built-in function, `mean()`. Everything interesting in R is done by using functions: there are functions for drawing figures, transforming data, running statistical tests, and much, much more.
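In script form, the console interaction just described would look something like this (reconstructing the input from the description above):

```r
numbers <- 0:100  # Assign the integers 0 through 100 to 'numbers'
mean(numbers)     # Compute their mean with R's built-in function
#> [1] 50
```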
Here's another example. This time we'll create some height data for kids and adults (in centimeters) and conduct a two-sample t-test (every line that begins with "#>" is R's output):
```r
kids <- c(100, 98, 89, 111, 101)
grownups <- c(180, 177, 159, 191, 163)
t.test(kids, grownups)
#> 
#> 	Welch Two Sample t-test
#> 
#> data:  kids and grownups
#> t = -10.9, df = 6.5656, p-value = 1.908e-05
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -90.51527 -57.88473
#> sample estimates:
#> mean of x mean of y 
#>      99.8     174.0
```
That's it, a t-test in R in a hundred seconds! Note that `c()` stands for "combine", so `kids` is now a numeric vector (a collection of numbers) with 5 elements. The t-test results are printed in R's console and are straightforward to interpret.
Save your analysis scripts
At its most basic, data analysis in R consists of importing data into R and then running functions to visualize and model the data. R has powerful functions covering the entire process from Raw Data to Communicating Results (or Word Processor) in Figure 1. That is, users don't need to switch between applications at various steps of the analysis workflow. Users simply type in code, let R evaluate it, and receive output. As you can imagine, a full analysis from raw data to a report (or table of summary statistics, or whatever your goal is) may involve lots of small steps (transforming variables in the data, plotting, calculating summaries, modeling and testing), which are often done iteratively. Recognizing that there may be many steps involved, we realize that we had better save our work so that we can investigate and redo it later if needed. Therefore, for each analysis, we should create a text file containing all those steps, which can then be rerun repeatedly, with minor tweaks if required.
To create these text files, or "R scripts", we need a text editor. All computers have a text editor preinstalled, but programming is often easier if you use an integrated development environment (IDE), which has a text editor and console all in one place (often with additional capabilities). The best IDE for R, by far, is RStudio. Go ahead and download RStudio, and then start it. At this point you can close the other R console on your computer, because RStudio has the console available for you.
Getting started with RStudio
Figure 3 shows the main view of RStudio. There are four rectangular panels, each with a different purpose. The bottom left panel is the R console. We can type input in the console (on the empty line that begins with a ">") and hit return to execute the code and obtain output. But a more efficient approach is to type the code into a script file, using the text editor panel, known as the Source panel, in the top left corner. Here, we have a t-test-kids-grownups.R script open, which consists of three lines of code. You can write this script on your own computer by going to File > New File > R Script in RStudio, and then typing in the code you see in Figure 3. You can execute each line by hitting Control + Return on Windows computers, or Command + Return on OS X computers. Scripts like this constitute the exact documentation of what you did in your analysis, and as you can imagine, they are pretty important.
The two other panels are for viewing things, not so much for interacting with the data. Top right is the Environment panel, showing the variables that you have saved in R. That is, when you assign something to a variable (`kids <- c(100, 98, 89, 111, 101)`), that variable (`kids`) is visible in the Environment panel, along with its type (`num` for numeric), size (`1:5`, for 5 elements), and contents (`100, 98, 89, 111, 101`). Finally, bottom right is the Viewer panel, where we can view plots, browse files on the computer, and do various other things.
With this knowledge in mind, let's begin with a couple of easy things. Don't worry, we'll get to actual data soon enough, once we have the absolute basics covered. I'll show some code and evaluate it in R to show its output too. You can, and should, type in the commands yourself to help you understand what they do (type each line in an R script and execute it by pressing Cmd + Enter; save your work every now and then).
Here’s how to create variables in R (try to figure out what’s saved in each variable):
```r
kids <- c(100, 98, 89, 111, 101)
n_kids <- length(kids)
m_kids <- mean(kids)
se_kids <- sd(kids) / sqrt(n_kids)
```
And here's how to print those variables' contents on the screen. (I'll provide a comment for each line; comments begin with a `#` and are not evaluated by R. That is, comments are read by humans only.)
```r
kids  # Kids' heights
#> [1] 100  98  89 111 101
n_kids  # Number of kids
#> [1] 5
m_kids  # Mean of kids' heights
#> [1] 99.8
se_kids  # Standard error
#> [1] 3.512834
```
Transforming data is easy: R automatically applies operations to entire vectors of numbers (variables containing multiple values) when needed. Let's create z-scores of the kids' heights.
```r
z_kids <- (kids - m_kids) / sd(kids)
z_kids
#> [1]  0.0254617 -0.2291553 -1.3749319  1.4258553  0.1527702
```
I hope you followed along. You should now have a bunch of variables in your R Environment. If you typed all those lines into an R script, you can now execute them again, or modify them and rerun the script, line by line. You can also execute the whole script at once by clicking "Run" at the top of the screen. Congratulations, you've just programmed your first computer program!
User contributed packages
One of the best things about R is that it has a large user base and lots of user-contributed packages, which make using R easier. Packages are simply bundles of functions, and they will enhance your R experience quite a bit. Whatever you want to do, there's probably an R package for that. Here, we will install and load (make available in the current session) the tidyverse package (Wickham, 2016), which is designed to make tidying data easier.
```r
install.packages("tidyverse")  # Installs the package to your computer
library(tidyverse)             # Loads the package to your current session
```
It’s important that you use the tidyverse package if you want to follow along with this tutorial. All of the tasks covered here are possible without it, but the functions from tidyverse make the tasks easier, and certainly easier to learn.
Using R with data
Let's import some data to R. We'll use example data from Chapter 4 of the Intensive Longitudinal Methods book (Bolger & Laurenceau, 2013). The data set is freely available on the book's website. If you would like to follow along, please download the data set and place it in a folder (unpack the .zip file). Then use RStudio's Viewer panel, and its Files tab, to navigate to the directory on your computer that has the data set, and set it as the working directory by clicking "More", then "Set As Working Directory".
Setting the working directory properly is extremely important, because it's the only way R knows where to look for files on your computer. If you try to load files that are not in the working directory, you need to use the full path to the file; but if your working directory is properly set, you can just use the filename. The file is called "time.csv", and we load it into a variable called `d` using the `read_csv()` function. (csv stands for comma-separated values, a common plain-text format for storing data.) You'll want to type all these functions into an R script, so create a new R script and make sure you are typing the commands in the Source panel, not the Console panel. If you set your working directory correctly, once you save the R script file, it will be saved in the directory right next to the "time.csv" file.
```r
d <- read_csv("time.csv")
```
`d` is now a data frame (sometimes called a "tibble", because why not), whose rows are observations and whose columns are the variables associated with those observations.
This data contains simulated daily intimacy reports of 50 individuals, who reported their intimacy every evening, for 16 days. Half of these simulated participants were in a treatment group, and the other half in a control group. To print the first few rows of the data frame to screen, simply type its name:
```r
d
#> # A tibble: 800 × 5
#>       id  time    time01 intimacy treatment
#>    <int> <int>     <dbl>    <dbl>     <int>
#> 1      1     0 0.0000000     2.96         0
#> 2      1     1 0.0666667     2.34         0
#> 3      1     2 0.1333333     4.88         0
#> 4      1     3 0.2000000     2.99         0
#> 5      1     4 0.2666667     3.13         0
#> 6      1     5 0.3333333     2.73         0
#> 7      1     6 0.4000000     1.96         0
#> 8      1     7 0.4666667     4.13         0
#> 9      1     8 0.5333334     3.17         0
#> 10     1     9 0.6000000     2.93         0
#> # ... with 790 more rows
```
The first column, `id`, is a variable that specifies which person that observation belongs to. `int` means that the data in this column are integers. `time` indicates the day of the observation; the authors coded the first day as 0 (this will make intercepts in regression models easier to interpret). `time01` is just `time` recoded so that 1 is at the end of the study. `dbl` means that the values are floating point numbers. `intimacy` is the reported intimacy, and `treatment` indicates whether the person was in the control (0) or treatment (1) group. The first line of this output also tells us that there are 800 rows in total in this data set, and 5 variables (columns). Each row is also numbered in the output (the leftmost "column"), but those numbers are not in the data.
Data types
It's important to verify that your variables (columns) are imported into R in the appropriate format. For example, you would not like to import time recorded in days as a character vector, nor would you like to import a character vector (country names, for example) as a numeric variable. Almost always, R (more specifically, `read_csv()`) automatically chooses the correct formats, which you can verify by looking at the row between the column names and the values.
There are five basic data types: `int` for integers, `num` (or `dbl`) for floating point numbers (1.12345…), `chr` for characters (also known as "strings"), `factor` (sometimes abbreviated as `fctr`) for categorical variables that have character labels (`factor`s can be ordered if required), and `logical` (abbreviated as `logi`) for logical variables: `TRUE` or `FALSE`. Here's a little data frame that illustrates the basic variable types in action:
```r
sample_data
#> # A tibble: 1 × 6
#>   height weight  name name_fctr likes_R likes_matlab
#>    <int>  <dbl> <chr>    <fctr>   <lgl>        <lgl>
#> 1    189 83.284 Matti     Matti    TRUE           NA
```
Here we are also introduced to a very special value, `NA`. `NA` means that there is no value, and we should always pay special attention to data that has `NA`s, because they may indicate that some important data is missing. This sample data explicitly tells us that we don't know whether this person likes Matlab or not, because that variable is `NA`. OK, let's get back to the daily intimacy reports data.
Quick overview of data
We can now use the variables in the data frame `d` and compute summaries just as we did above with the kids' and adults' heights. A useful operation might be to ask for a quick summary of each variable (column) in the data set:
```r
summary(d)
#>        id            time           time01        intimacy    
#>  Min.   : 1.0   Min.   : 0.00   Min.   :0.00   Min.   :0.000  
#>  1st Qu.:13.0   1st Qu.: 3.75   1st Qu.:0.25   1st Qu.:2.368  
#>  Median :25.5   Median : 7.50   Median :0.50   Median :3.330  
#>  Mean   :25.5   Mean   : 7.50   Mean   :0.50   Mean   :3.469  
#>  3rd Qu.:38.0   3rd Qu.:11.25   3rd Qu.:0.75   3rd Qu.:4.540  
#>  Max.   :50.0   Max.   :15.00   Max.   :1.00   Max.   :9.410  
#>    treatment  
#>  Min.   :0.0  
#>  1st Qu.:0.0  
#>  Median :0.5  
#>  Mean   :0.5  
#>  3rd Qu.:1.0  
#>  Max.   :1.0
```
To get a single variable (column) from the data frame, we call it with the `$` operator ("gimme", for asking R to give you variables from within a data frame). To get all the intimacy values, we could just call `d$intimacy`. But we had better not, because that would print all 800 intimacy values into the console. We can pass those values to functions instead:
```r
mean(d$intimacy)
#> [1] 3.468713
range(d$intimacy)
#> [1] 0.00 9.41
max(d$intimacy)
#> [1] 9.41
```
If you would like to see the first six values of a variable, you can use the `head()` function:
```r
head(d$time)
#> [1] 0 1 2 3 4 5
```
`head()` works on data frames as well, and you can use an optional number argument to specify how many of the first values you'd like to see returned:
```r
head(d, 2)
#> # A tibble: 2 × 5
#>      id  time    time01 intimacy treatment
#>   <int> <int>     <dbl>    <dbl>     <int>
#> 1     1     0 0.0000000     2.96         0
#> 2     1     1 0.0666667     2.34         0
```
A look at R’s functions
Generally, this is how R functions work: you name the function and specify arguments to it inside the parentheses. Some of those arguments may be data or other input (`d`, above), and some change what the function does and how (`2`, above). To find out what arguments you can give to a function, just type the function's name in the console with a question mark prepended to it:
```r
?head
```
Importantly, calling the help page reveals that functions' arguments are named. That is, arguments are of the form X = Y, where X is the name of the argument and Y is the value you would like to set it to. If you look at the help page of `head()` (`?head`), you'll see that it takes two arguments: `x`, which should be an object (like our data frame `d`; if you don't know what "object" means in this context, don't worry, nobody does), and `n`, the number of elements you'd like to see returned. You don't always have to type the X = Y part for every argument, because R can match arguments based on their position (whether they are the first, second, etc. argument in the parentheses). We can confirm this by typing out the full form of the previous call `head(d, 2)`, this time naming the arguments:
```r
head(x = d, n = 2)
#> # A tibble: 2 × 5
#>      id  time    time01 intimacy treatment
#>   <int> <int>     <dbl>    <dbl>     <int>
#> 1     1     0 0.0000000     2.96         0
#> 2     1     1 0.0666667     2.34         0
```
Now that you know how R’s functions work, you can find out how to do almost anything by typing into a search engine: “How to do almost anything in R”. The internet (and books, of course) is full of helpful tutorials (see Resources section, below) but you will need to know these basics about functions in order to follow those tutorials.
Creating new variables
Creating new variables is also easy. Let's create a new variable that is the square root of the reported intimacy (because why not), by using the `sqrt()` function and assigning the values to a new variable (column) within our data frame:
```r
d$sqrt_int <- sqrt(d$intimacy)
```
Recall that `sqrt(d$intimacy)` takes the square root of each of the 800 values in the intimacy vector and returns a vector of 800 square roots. There's no need to do this individually for each value.
We can also create variables using conditional logic, which is useful for creating verbal labels for numeric variables, for example. Let’s create a verbal label for each of the treatment groups:
```r
d$Group <- ifelse(d$treatment == 0, "Control", "Treatment")
```
We created a new variable, `Group`, in `d`, that is "Control" if the `treatment` variable on that row is 0, and "Treatment" otherwise.
```r
d
#> # A tibble: 800 × 7
#>       id  time    time01 intimacy treatment sqrt_int   Group
#>    <int> <int>     <dbl>    <dbl>     <int>    <dbl>   <chr>
#> 1      1     0 0.0000000     2.96         0 1.720465 Control
#> 2      1     1 0.0666667     2.34         0 1.529706 Control
#> 3      1     2 0.1333333     4.88         0 2.209072 Control
#> 4      1     3 0.2000000     2.99         0 1.729162 Control
#> 5      1     4 0.2666667     3.13         0 1.769181 Control
#> 6      1     5 0.3333333     2.73         0 1.652271 Control
#> 7      1     6 0.4000000     1.96         0 1.400000 Control
#> 8      1     7 0.4666667     4.13         0 2.032240 Control
#> 9      1     8 0.5333334     3.17         0 1.780449 Control
#> 10     1     9 0.6000000     2.93         0 1.711724 Control
#> # ... with 790 more rows
```
Remember our discussion of data types above? `d` now contains integer, double, and character variables. Make sure you can identify these in the output above.
Aggregating
Let’s focus on aggregating the data across individuals, and plotting the average time trends of intimacy, for the treatment and control groups.
In R, aggregating is easiest if you think of it as calculating summaries for “groups” in the data (and collapsing the data across other variables). “Groups” doesn’t refer to experimental groups (although it can), but instead any arbitrary groupings of your data based on variables in it, so the groups can be based on multiple things, like time points and individuals, or time points and experimental groups.
Here, our groups are the two treatment groups and the 16 time points, and we would like to obtain the mean for each group at each time point by collapsing across individuals:
```r
# 1. "Group" data by Group and time
d_groups <- group_by(d, Group, time)
# 2. Compute the mean intimacy for each of these groupings
d_groups <- summarize(d_groups, intimacy = mean(intimacy))
```
The above code summarized our data frame `d` by calculating the mean intimacy for the groups specified by `group_by()`. We did this by first creating a data frame that is `d`, but grouped on `Group` and `time`, and then summarizing those groups by taking the mean intimacy for each. This is what we got:
```r
d_groups
#> Source: local data frame [32 x 3]
#> Groups: Group [?]
#> 
#>      Group  time intimacy
#>      <chr> <int>    <dbl>
#> 1  Control     0   2.8168
#> 2  Control     1   2.9476
#> 3  Control     2   2.9284
#> 4  Control     3   3.0596
#> 5  Control     4   3.2724
#> 6  Control     5   3.2884
#> 7  Control     6   3.2540
#> 8  Control     7   3.2504
#> 9  Control     8   2.9144
#> 10 Control     9   3.3236
#> # ... with 22 more rows
```
A mean intimacy value for both groups, at each time point.
Plotting
We can now easily plot these data, for each individual, and each group. Let’s begin by plotting just the treatment and control groups’ mean intimacy ratings:
```r
ggplot(d_groups, aes(x = time, y = intimacy, color = Group)) +
    geom_line()
```
For this plot, we used the `ggplot()` function, which takes as input a data frame (we used `d_groups` from above) and a set of aesthetic specifications (`aes()`; we mapped time to the x axis, intimacy to the y axis, and color to the different treatment Groups in the data). We then added a geometric object to display these data (`geom_line()` for a line).
To illustrate how to add other geometric objects to display the data, let’s add some points to the graph:
```r
ggplot(d_groups, aes(x = time, y = intimacy, color = Group)) +
    geom_line() +
    geom_point()
```
We can easily do the same plot for every individual (a panel plot, but let’s drop the points for now):
```r
ggplot(d, aes(x = time, y = intimacy)) +
    geom_line(aes(color = Group)) +
    facet_wrap("id")
```
The code is exactly the same, but now we used the non-aggregated raw data `d`, and added an extra function that wraps each `id`'s data into its own little subplot (`facet_wrap()`; remember, if you don't know what a function does, look at its help page, i.e. `?facet_wrap`). `ggplot()` is an extremely powerful function that allows you to create complex and informative graphs with systematic, short, and neat code. For example, we may add a linear trend (linear regression line) to each person's panel. This time, let's only look at the individuals in the experimental group, by using the `filter()` command (see below):
```r
ggplot(filter(d, Group == "Treatment"), aes(x = time, y = intimacy)) +
    geom_point() +
    geom_smooth(method = "lm", se = F) +  # Linear regression, no SE
    facet_wrap("id")
```
Data manipulation
We already encountered an example of manipulating data when we aggregated `intimacy` over some groups (experimental groups and time points). Other common operations include trimming the data based on some criteria. All operations that drop observations are conceptualized as subsetting, and can be done using the `filter()` command. Above, we filtered the data such that we plotted the treatment group only. As another example, we can get the first week's data (`time` is less than 7, that is, days 0-6) for the control group only, by specifying these logical operations in the `filter()` function:
```r
filter(d, time < 7 & Group == "Control")
#> # A tibble: 175 × 7
#>       id  time    time01 intimacy treatment sqrt_int   Group
#>    <int> <int>     <dbl>    <dbl>     <int>    <dbl>   <chr>
#> 1      1     0 0.0000000     2.96         0 1.720465 Control
#> 2      1     1 0.0666667     2.34         0 1.529706 Control
#> 3      1     2 0.1333333     4.88         0 2.209072 Control
#> 4      1     3 0.2000000     2.99         0 1.729162 Control
#> 5      1     4 0.2666667     3.13         0 1.769181 Control
#> 6      1     5 0.3333333     2.73         0 1.652271 Control
#> 7      1     6 0.4000000     1.96         0 1.400000 Control
#> 8      2     0 0.0000000     0.64         0 0.800000 Control
#> 9      2     1 0.0666667     3.10         0 1.760682 Control
#> 10     2     2 0.1333333     2.31         0 1.519868 Control
#> # ... with 165 more rows
```
Try rerunning the above line with small changes to the logical operations. Note that the two logical operations are combined with the AND operator (`&`); you can also use OR (`|`). Try to imagine what replacing AND with OR would do in the above line of code. Then try it and see what it does.
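To make the difference concrete, here is a small standalone illustration (a toy vector, not part of the intimacy data): `&` keeps elements where both conditions hold, `|` where at least one does.

```r
x <- c(1, 5, 10)
x < 7 & x > 2  # both conditions must be TRUE
#> [1] FALSE  TRUE FALSE
x < 7 | x > 2  # at least one condition must be TRUE
#> [1] TRUE TRUE TRUE
```

In the filter above, switching `&` to `|` would keep every row from the first week, plus every control-group row from any week.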
A quick detour to details
At this point it is useful to remember that computers do exactly what you ask them to do: nothing less, nothing more. So, for instance, pay attention to capital letters, symbols, and parentheses. The following three lines are faulty; try to figure out why:
```r
filter(d, time < 7 & Group == "control")
#> # A tibble: 0 × 7
#> # ... with 7 variables: id <int>, time <int>, time01 <dbl>,
#> #   intimacy <dbl>, treatment <int>, sqrt_int <dbl>, Group <chr>
```
Why does this data frame have zero rows?
```r
filter(d, time < 7 & Group == "Control"))
#> Error: <text>:1:41: unexpected ')'
#> 1: filter(d, time < 7 & Group == "Control"))
#>                                             ^
```
Error? What’s the problem?
```r
filter(d, time < 7 & Group = "Control")
#> Error: <text>:1:28: unexpected '='
#> 1: filter(d, time < 7 & Group =
#>                                ^
```
Error? What’s the problem?
(Answers: 1. `Group` is either "Control" or "Treatment", not "control" or "treatment". 2. There is an extra parenthesis at the end. 3. `==` is not the same as `=`: the double `==` is a logical comparison operator, asking whether two things are the same; the single `=` is an assignment operator.)
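You can see the difference directly in the console (a tiny illustration with a made-up variable):

```r
height <- 189  # '<-' assigns a value to a name
height == 189  # '==' asks whether two values are equal
#> [1] TRUE
height == 190
#> [1] FALSE
```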
Advanced data manipulation
Let's move on. What if we'd like to detect extreme values? For example, let's ask whether there are people in the data who show extreme overall levels of intimacy (what if somebody feels too much intimacy!). How can we do that? Let's start thinking like programmers and break the problem into the exact steps required to answer it:
- Calculate the mean intimacy for everybody
- Plot the mean intimacy values (because always, always visualize your data)
- Remove everybody whose mean intimacy is over 2 standard deviations above the overall mean intimacy (over-intimate people?) (note that this is a terrible exclusion criterion, used here for illustration purposes only)
As before, we'll group the data by person and calculate the mean (which we'll call `int`).
```r
d_grouped <- group_by(d, id)
d_grouped <- summarize(d_grouped, int = mean(intimacy))
d_grouped
#> # A tibble: 50 × 2
#>       id      int
#>    <int>    <dbl>
#> 1      1 3.141875
#> 2      2 2.780625
#> 3      3 2.974375
#> 4      4 3.792500
#> 5      5 3.946875
#> 6      6 4.608125
#> 7      7 3.641250
#> 8      8 3.798750
#> 9      9 2.544375
#> 10    10 3.560000
#> # ... with 40 more rows
```
We now have everybody’s mean intimacy in a neat and tidy data frame. We could, for example, arrange the data such that we see the extreme values:
```r
# Person with lowest average intimacy on first row (ascending)
arrange(d_grouped, int)
#> # A tibble: 50 × 2
#>       id      int
#>    <int>    <dbl>
#> 1     28 1.966250
#> 2     14 1.971250
#> 3     23 2.025000
#> 4     12 2.106875
#> 5     27 2.260625
#> 6     29 2.395000
#> 7      9 2.544375
#> 8     39 2.573750
#> 9     20 2.620000
#> 10    24 2.629375
#> # ... with 40 more rows

# Person with highest average intimacy on first row (descending)
arrange(d_grouped, desc(int))
#> # A tibble: 50 × 2
#>       id      int
#>    <int>    <dbl>
#> 1     41 5.523750
#> 2     32 5.313125
#> 3     31 4.941875
#> 4     15 4.711875
#> 5     43 4.689375
#> 6     38 4.638125
#> 7      6 4.608125
#> 8     37 4.427500
#> 9     48 4.268750
#> 10    42 4.268125
#> # ... with 40 more rows
```
Nothing makes as much sense as a histogram:
```r
ggplot(d_grouped, aes(x = int)) +
    geom_histogram(binwidth = .25, fill = "gray70", col = "black")
```
It doesn’t look like anyone’s mean intimacy value is “off the charts”. Finally, let’s apply our artificial exclusion criterion: drop everybody whose mean intimacy is more than 2 standard deviations above the overall mean:
```r
avg_mean <- mean(d_grouped$int)
avg_sd <- sd(d_grouped$int)
avg_mean
#> [1] 3.468713
avg_sd
#> [1] 0.8804832

# Create "exclude", which is TRUE if int is higher than mean + 2 SD
d_grouped$exclude <- d_grouped$int > (avg_mean + 2 * avg_sd)
arrange(d_grouped, desc(int))
#> # A tibble: 50 × 3
#>       id      int exclude
#>    <int>    <dbl>   <lgl>
#> 1     41 5.523750    TRUE
#> 2     32 5.313125    TRUE
#> 3     31 4.941875   FALSE
#> 4     15 4.711875   FALSE
#> 5     43 4.689375   FALSE
#> 6     38 4.638125   FALSE
#> 7      6 4.608125   FALSE
#> 8     37 4.427500   FALSE
#> 9     48 4.268750   FALSE
#> 10    42 4.268125   FALSE
#> # ... with 40 more rows
```
Then we could proceed to exclude these participants (don’t do this with real data!), by first joining the d_grouped data frame, which has the exclusion information, with the full data frame d:
```r
d <- left_join(d, d_grouped)
arrange(d, desc(int))
#> # A tibble: 800 × 9
#>       id  time    time01 intimacy treatment sqrt_int     Group     int
#>    <int> <int>     <dbl>    <dbl>     <int>    <dbl>     <chr>   <dbl>
#> 1     41     0 0.0000000     6.06         1 2.461707 Treatment 5.52375
#> 2     41     1 0.0666667     5.54         1 2.353720 Treatment 5.52375
#> 3     41     2 0.1333333     4.66         1 2.158703 Treatment 5.52375
#> 4     41     3 0.2000000     4.63         1 2.151743 Treatment 5.52375
#> 5     41     4 0.2666667     6.00         1 2.449490 Treatment 5.52375
#> 6     41     5 0.3333333     5.68         1 2.383275 Treatment 5.52375
#> 7     41     6 0.4000000     5.08         1 2.253886 Treatment 5.52375
#> 8     41     7 0.4666667     4.92         1 2.218107 Treatment 5.52375
#> 9     41     8 0.5333334     5.92         1 2.433105 Treatment 5.52375
#> 10    41     9 0.6000000     4.53         1 2.128380 Treatment 5.52375
#> # ... with 790 more rows, and 1 more variables: exclude <lgl>

length(unique(d$id))  # Number of unique individuals in data
#> [1] 50
```
and then removing all rows where exclude is TRUE. We use the filter() command, and take only the rows where exclude is FALSE. So we want our logical operator for filtering rows to be “not exclude”. “Not”, in R language, is !:
```r
d2 <- filter(d, !exclude)
length(unique(d2$id))  # Two individuals were dropped
#> [1] 48
```
I saved the included people in a new data set called d2, because I don’t actually want to remove those people; I just wanted to illustrate how it’s done. In some situations we could also imagine applying the exclusion criterion to individual observations, instead of individual participants. This would be as easy as (think about why):
```r
d$exclude <- d$intimacy > (avg_mean + 2 * avg_sd)
table(d$exclude)  # How many rows would now be excluded
#>
#> FALSE  TRUE
#>   681   119
```
Selecting variables in data
After these artificial examples of removing extreme values (or people) from data, we have a couple of extra variables in our data frame d that we would like to remove, because it’s good to work with clean data. Removing, and more generally selecting, variables (columns) in data frames is most easily done with the select() function. Let’s select() all variables in d except the square-root-transformed intimacy (sqrt_int), average intimacy (int), and exclusion (exclude) variables (that is, let’s drop those three columns from the data frame):
```r
d <- select(d, -sqrt_int, -int, -exclude)
d
#> # A tibble: 800 × 6
#>       id  time    time01 intimacy treatment   Group
#>    <int> <int>     <dbl>    <dbl>     <int>   <chr>
#> 1      1     0 0.0000000     2.96         0 Control
#> 2      1     1 0.0666667     2.34         0 Control
#> 3      1     2 0.1333333     4.88         0 Control
#> 4      1     3 0.2000000     2.99         0 Control
#> 5      1     4 0.2666667     3.13         0 Control
#> 6      1     5 0.3333333     2.73         0 Control
#> 7      1     6 0.4000000     1.96         0 Control
#> 8      1     7 0.4666667     4.13         0 Control
#> 9      1     8 0.5333334     3.17         0 Control
#> 10     1     9 0.6000000     2.93         0 Control
#> # ... with 790 more rows
```
Using select(), we can keep variables by naming them, or drop them by prefixing their names with a minus sign (-). If no variables are named for keeping, but some are dropped, all unnamed variables are kept, as in this example.
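As a sketch of both forms, here is a hypothetical three-column data frame (toy, with made-up columns), assuming the dplyr package is loaded:

```r
library(dplyr)

toy <- data.frame(id = 1:3,
                  intimacy = c(2.1, 3.5, 4.0),
                  scratch = c(9, 9, 9))

select(toy, id, intimacy)  # keep columns by naming them
select(toy, -scratch)      # drop a column with a minus; same result here
```

Both calls return a data frame with only the id and intimacy columns; which form is clearer depends on whether you’re keeping or dropping the minority of columns.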
Regression
Let’s do an example linear regression by focusing on one participant’s data. The first step is to create a subset containing only that person’s data. For instance, we can take the subset of d consisting of all rows where id is 30 by typing:
```r
d_sub <- filter(d, id == 30)
```
Linear regression is available using the lm() function, and R’s own formula syntax:
```r
fit <- lm(intimacy ~ time, data = d_sub)
```
Generally, for regression in R, you specify the formula as outcome ~ predictors. If you have multiple predictors, you combine them with addition (“+”): outcome ~ IV1 + IV2. Interactions are specified with multiplication (“*”): outcome ~ IV1 * IV2 (which automatically includes the main effects of IV1 and IV2; to get the interaction only, use “:”: outcome ~ IV1:IV2). We also specified that for the regression we’d like to use the data in the d_sub data frame, which contains only person 30’s data.
A summary of the fitted model is easily obtained:
```r
summary(fit)
#>
#> Call:
#> lm(formula = intimacy ~ time, data = d_sub)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -2.1199 -1.1229  0.1005  1.0057  2.3019
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2.55353    0.66094   3.863  0.00172 **
#> time         0.20728    0.07508   2.761  0.01532 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.384 on 14 degrees of freedom
#> Multiple R-squared:  0.3525, Adjusted R-squared:  0.3063
#> F-statistic: 7.622 on 1 and 14 DF,  p-value: 0.01532
```
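Beyond summary(), base R provides extractor functions such as coef(), confint(), and resid() for pulling numbers out of a fitted model. A minimal sketch with made-up data (not the diary data used above):

```r
# Toy data with a perfectly linear outcome, for illustration only
toy <- data.frame(time = 0:15)
toy$intimacy <- 2.5 + 0.2 * toy$time
fit_toy <- lm(intimacy ~ time, data = toy)

coef(fit_toy)     # named vector: intercept and slope
confint(fit_toy)  # confidence intervals for the coefficients
resid(fit_toy)    # residuals; here all (numerically) zero
```

These extractors are handy when you want to use estimates in further computations (or in a plot) instead of reading them off the printed summary.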
Visualizing the model fit is also easy. We’ll use the same code as for the figures above, but also add points (geom_point()) and a linear regression line with a 95% “confidence” ribbon (geom_smooth(method = "lm")).
```r
ggplot(d_sub, aes(x = time, y = intimacy)) +
    geom_point(shape = 1) +
    geom_line() +
    geom_smooth(method = "lm")
```
Pretty cool, right? And there you have it. We’ve used R to do a sample of common data cleaning and visualization operations, and fitted a couple of regression models. Of course, we’ve only scratched the surface, and below I provide a short list of resources for learning more about R.
Conclusion
Programming your statistical analyses leads to a flexible, reproducible and time-saving workflow, in comparison to more traditional point-and-click focused applications. R is probably the best programming language around for applied statistics, because it has a large user base and many user-contributed packages that make your life easier. While it may take an hour or so to get acquainted with R, after initial difficulty it is easy to use, and provides a fast and reliable platform for data wrangling, visualization, modeling, and statistical testing.
Finally, learning to code is not about having a superhuman memory for function names, but instead it is about developing a programmer’s mindset: Think your problem through and decompose it into small chunks, then ask a computer to do those chunks for you. Do that a couple of times and you will magically have memorized, as a byproduct, the names of a few common functions. You learn to code not by reading and memorizing a tutorial, but by writing it out, examining the output, changing the input and figuring out what changed in the output. Even better, you’ll learn the most once you use code to examine your own data, data that you know and care about. Hopefully, you’ll now be able to begin doing just that.
Resources
The web is full of fantastic R resources, so here’s a sample of some materials I think would be useful to beginning R users.
Introduction to R
- Data Camp’s Introduction to R is a free online course on R.
- Code School’s R Course is an interactive web tutorial for R beginners.
- YaRrr! The Pirate’s Guide to R is a free e-book, with accompanying YouTube lectures and witty writing (“it turns out that pirates were programming in R well before the earliest known advent of computers.”) YaRrr! is also an R package that helps you get started with some pretty cool R stuff (Phillips, 2016). Recommended!
- The Personality Project’s Guide to R (Revelle, 2016b) is a great collection of introductory (and more advanced) R materials especially for Psychologists. The site’s author also maintains a popular and very useful R package called psych (Revelle, 2016a). Check it out!
- Google Developers’ YouTube Crash Course to R is a collection of short videos. The first 11 videos are an excellent introduction to working with RStudio and R’s data types, and programming in general.
- Quick-R is a helpful collection of R materials.
Data wrangling
These websites explain how to “wrangle” data with R.
- R for Data Science (Wickham & Grolemund, 2016) is the definitive source on using R with real data for efficient data analysis. It starts off easy (and is suitable for beginners) but covers nearly everything in a data-analysis workflow apart from modeling.
- Introduction to dplyr explains how to use the dplyr package (Wickham & Francois, 2016) to wrangle data.
- Data Processing Workflow is a good resource on how to use common packages for data manipulation (Wickham, 2016), but the example data may not be especially helpful.
Visualizing data
- ggplot2 is the most popular R visualization package, and all plots in this tutorial were created with it. This is probably the most important R package you will want to know.
- Introduction to R graphics with ggplot2 is a thorough ggplot2 tutorial from Harvard.
- Cookbook for R’s ggplot2 section is a good source of information.
Statistical modeling and testing
R provides many excellent packages for modeling data; my absolute favorite is the brms package (Buerkner, 2016) for Bayesian regression modeling.
- UCLA has a very comprehensive list of statistical tests and how to do them in R.
- What statistical analysis should I use, also from UCLA.
References
Bartlett, J. (2016, November 22). Tidying and analysing response time data using R. Statistics and substance use. Retrieved November 23, 2016, from https://statsandsubstances.wordpress.com/2016/11/22/tidying-and-analysing-response-time-data-using-r/
Bolger, N., & Laurenceau, J.-P. (2013). Intensive longitudinal methods: An introduction to diary and experience sampling research. Guilford Press. Retrieved from http://www.intensivelongitudinal.com/
Buerkner, P.-C. (2016). brms: Bayesian regression models using Stan. Retrieved from http://CRAN.R-project.org/package=brms
Fox, J. (2010). Introduction to statistical computing in R. Retrieved November 23, 2016, from http://socserv.socsci.mcmaster.ca/jfox/Courses/R-course/index.html
Muenchen, R. A. (2015). The popularity of data analysis software. r4stats.com. Retrieved November 22, 2016, from http://r4stats.com/articles/popularity/
Phillips, N. (2016). yarrr: A companion to the e-book “YaRrr!: The Pirate’s Guide to R”. Retrieved from https://CRAN.R-project.org/package=yarrr
R Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Revelle, W. (2016a). psych: Procedures for psychological, psychometric, and personality research. Evanston, Illinois: Northwestern University. Retrieved from https://CRAN.R-project.org/package=psych
Revelle, W. (2016b). The Personality Project’s guide to R. Retrieved November 22, 2016, from http://personality-project.org/r/
Wickham, H. (2016). tidyverse: Easily install and load ’tidyverse’ packages. Retrieved from https://CRAN.R-project.org/package=tidyverse
Wickham, H., & Francois, R. (2016). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
Wickham, H., & Grolemund, G. (2016). R for data science. Retrieved from http://r4ds.had.co.nz/