Category Archives: Research Methodology

Open online education: Research findings and methodological challenges

With a reliable internet connection comes access to the enormous World Wide Web. Because the web is so large, we rely on tools like Google to search and filter all this information. Additional filters can be found on sites like Wikipedia, which offers library-style access to curated knowledge, but it too is enormous. In recent years, open online courses have rapidly become a highly popular way of gaining easy access to curated, high-quality, pre-packaged knowledge. A particularly popular variety is the Massive Open Online Course, or MOOC, found on platforms like Coursera and edX. The promise – global and free access to high-quality education – has often been applauded. Some have heralded the age of the MOOC as the death of campus-based teaching. Others are more critical, often citing the high drop-out rates as a sign of failure, or arguing that MOOCs do not or cannot foster ‘real’ learning (e.g., Zemsky, 2014; Pope, 2015).

For those who are not aware of the MOOC phenomenon, I will first briefly introduce it. In the remainder of this post I will discuss how we can learn about open online courses, what the key challenges are, and how the field can move forward.

What’s all this buzz about?

John Daniel (2012) called MOOCs the official educational buzzword of 2012, and the New York Times called it the Year of the MOOC. However, the movement started earlier, around 2001, when the Massachusetts Institute of Technology (MIT) launched its OpenCourseWare (OCW) initiative to share all its courses online. Individual teachers had been sharing digital content before (e.g., as ‘Open Educational Resources’ or OER; Lane & McAndrew, 2010), but the scale and quality of OCW were pioneering. Today, MOOCs can be found on various platforms, such as the ones described in Table 1 below.

Table 1. Overview of several major platforms offering MOOCs

Platform   Free content   Paid certifications   For profit
Coursera   Partial        Yes                   Yes
edX        Everything     Yes                   No
Udacity    Everything     Yes                   Yes
Udemy      Partial        Yes                   Yes
P2PU       Yes            No                    No

MOOCs, and open online courses in general, have the goal of making high-quality education available to everyone, everywhere. MOOC participants indeed come from all over the world, although participants from Western countries are still overrepresented (Nesterko et al., 2013). Nevertheless, there are numerous inspiring stories from students all over the world for whom taking one or more MOOCs has had dramatic effects on their lives. For example, Battushig Myanganbayar, a 15-year-old boy from Mongolia, took the Circuits and Electronics MOOC, a sophomore-level course from MIT. He was one of the 340 students out of 150,000 who obtained a perfect score, which led to his admission to MIT (New York Times, 2013).

Stories like these make it much clearer that MOOCs are not meant to replace contemporary forms of education, but are a remarkable addition to it. Why? Because books, radio, and computers also did not replace education; they enhanced it. In some cases, such as Battushig’s, MOOCs provide a variety and quality of education that would otherwise not be accessible at all, because local institutes of higher education are lacking. Open online courses provide a new source of high-quality education, one that is not restricted to a few students in a lecture hall but has the potential to reach almost everyone who is interested. Will MOOCs replace higher education institutes? Maybe, or maybe not; I think this question misses the point of MOOCs.

In the remainder of this article I will focus on MOOCs from my perspective as a researcher. From this perspective, open online education is in some ways a new approach to education and should thus be investigated on its own. On the other hand, key learning mechanisms (e.g., information processing, knowledge integration, long-term memory consolidation) of human learners are independent of societal changes such as the use of new technologies (e.g., Merrill, Drake, Lacy, & Pratt, 1996). The science of educational instruction has a firm knowledge base and could be used to further our understanding of these generic learning mechanisms, which are inherent to humans.

What are MOOCs anyway?

The typical MOOC is a series of educational videos, often interconnected by other study materials such as texts, and regularly followed up by quizzes. Usually a MOOC is divided into approximately 5 to 8 weeks of content. In Figure 1 you see an example of Week 1 from the course ‘Improving your statistical inferences’ by Daniel Lakens.

Figure 1. Example content of a single week in a MOOC

What do students do in a MOOC? To be honest, most do next to nothing. That is, most students who register for a course never access it, or do so only very briefly. However, the thousands of active students per course show a wide variety of learning paths and behaviors. See Figure 2 for an example of a single student in a single course. It shows that this particular student engages with the course very regularly, but that the duration and intensity of each session differ substantially. Lectures (shown in green) are often watched in long sessions, while visits to the forum are much more frequent but shorter. At the bottom you see a surprising burst of quiz activity, which might reflect the student’s desire to see what type of questions will be asked later in the course.

Figure 2. Activities of a single user in a single course. Source: Jasper Ginn

Of all the activities which are common to most MOOCs, educational videos are most central to the student learning experience (Guo, Kim, & Rubin, 2014; Liu et al., 2013). The central position of educational videos is reflected in students’ behavior and their intentions: most students plan to watch all videos in a MOOC, and also spend the majority of their time watching these videos (Campbell, Gibbs, Najafi, & Severinski, 2014; Seaton, Bergner, Chuang, Mitros, & Pritchard, 2014). The focus on videos does come with various consequences. Video production is typically expensive and labor-intensive. In addition, videos are not easily translated into other languages, which runs counter to the aim of making the content accessible to students all around the world. There are many non-native English speakers in MOOCs, while the courses are almost exclusively presented in English. This raises the question to what extent non-native English speakers can benefit from these courses, compared to native speakers. While open online education may be available to most, the content might not be as accessible to many, for example due to language barriers. It is important to design online education in such a way that it minimizes the detrimental effects of potential language barriers, to increase its accessibility for a wider audience. While subtitles are often provided, it is unclear whether they promote learning (Markham, Peter, & McCarthy, 2001), hamper learning (Kalyuga, Chandler, & Sweller, 1999), or have no impact at all (van der Zee et al., 2017).

How do we learn about (online) learning?

Research on online learning, and on MOOCs in particular, is a highly interdisciplinary field in which many perspectives are combined. While research on higher education is typically done primarily by educational scientists, MOOCs are also studied in fields such as computer science and machine learning. This has resulted in an interesting divide in the literature, as researchers from some disciplines tend to publish only in journals (e.g., Computers & Education, Distance Education, International Journal of Computer-Supported Collaborative Learning) while other disciplines focus primarily on conference proceedings (e.g., Learning @ Scale, eMOOCs, Learning Analytics and Knowledge).

Learning at scale opens up a new frontier to learn about learning. MOOCs and similar large-scale online learning platforms give an unprecedented view of learners’ behavior, and potentially, learning. In online learning research, the setting in which the data is measured is not just an approximation of, but equals the world under examination, or at least comes very close to it. That is, measures of students’ behavior do not need to rely on self-reports, but can often be directly derived from log data (e.g., automated measurements of all activities inside an online environment). While this type of research has its advantages, it also comes with various risks and challenges, which I will attempt to outline.

Big data, meaningless data

Research on MOOCs is blessed and cursed with a wide variety of data. For example, it is possible to track every user’s mouse clicks. We also have detailed information about page views, forum data (posts, likes, reads), clickstream data, and interactions with videos. This is all very interesting, except that nobody really knows what it means if a student has clicked two times instead of three. Nevertheless, the number of mouse clicks is a strong predictor of ‘study success’: students who click more finish the course more often, and do so with higher grades. As can be seen in Figure 3, the correlations between various mouse-click metrics and grade range from 0.50 to 0.65. However, it would be absurd to recommend that students click more in the belief that this will increase their grades. Mouse clicks, in isolation, are inherently ambiguous, if not outright meaningless.

Figure 3. Pairwise Spearman rank correlations between various metrics for all clickers (upper triangle, N = 108,008) and certificate earners (lower triangle, N = 7,157), from DeBoer, Ho, Stump, and Breslow (2014)
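To see why such correlations need not imply anything causal, here is a toy simulation in R. All numbers are invented for illustration, not taken from MOOC data: a latent trait such as motivation drives both clicking and grades, and a strong clicks-grade correlation appears without any causal path from clicks to grades.

```r
# Toy simulation: motivation (unobserved) causes both clicks and grades.
set.seed(42)
n <- 5000
motivation <- rnorm(n)                               # latent common cause
clicks     <- rpois(n, lambda = exp(3 + 0.5 * motivation))
grade      <- 6 + motivation + rnorm(n)              # clicks play no causal role

# Clicks still 'predict' grade strongly:
cor(clicks, grade, method = "spearman")
```

Telling these simulated students to click more would obviously not raise their grades; the correlation is real, but the causal reading is not.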

When there is smoke, but no fire

With mouse clicks the problem is obvious and will be recognized by many. However, the same problem can secretly underlie many other measured variables where it is not so easily recognized. For example, how can we interpret the finding that some students watch a video longer than other students? Findings like this are readily interpreted as meaningful, for example as signifying that these students were more ‘engaged’, while you could just as well argue that they got distracted, were bored, and so on. A classical reasoning fallacy often underlies these arguments: because it is reasonable to state that increased engagement will lead to longer video dwelling times, observing the latter is (incorrectly!) assumed to signify the former. In other words: if A leads to B, observing B does not allow you to conclude A. As there are many plausible explanations of differences in video dwelling times, observing such differences cannot be directly interpreted without additional data. This is an inherent problem with many types of big data: you have an enormous amount of granular data which often cannot be directly interpreted.

For example, Guo et al. (2014) state that shorter videos and certain video production styles are “much more engaging” than their alternatives. While an enormous amount of data was used, it was in essence a correlational study, so the claims about which video types are better rest on observational data that do not allow causal inference. More students stop watching a longer video than a shorter one, which is interpreted as meaning that the shorter videos are more engaging. While this might well be true, it is difficult to make these claims when confounding variables have not been accounted for.
As an example, shorter and longer videos do not differ just in length but might also differ in complexity, and the complexity of online educational videos is strongly correlated with video dwelling time (Van der Sluis, Ginn, & Van der Zee, 2016). More importantly, this study showed that the relationship between a video’s complexity (insofar as it can be measured) and dwelling time appears to be non-linear, as shown in Figure 4. Non-linear relationships between variables which are typically measured observationally should make us very cautious about making confident claims. For example, in Figure 4 a relative dwelling time of 4 can be found both for an information rate of ~0.2 (below-average complexity) and for ~1.7 (above-average complexity). In other words, if all you know is the dwelling time, the non-linear relationship does not allow you to draw any conclusions about the complexity.

Figure 4. The non-linear relationship between dwelling time and information rate (as a measure of complexity per second). Adapted from Van der Sluis, Ginn, & Van der Zee (2016).
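The identification problem can be made concrete with a hypothetical inverted-U curve. The function below is purely illustrative, not the model fitted in the paper: two very different complexity levels produce exactly the same dwelling time, so dwelling time alone cannot recover complexity.

```r
# Hypothetical inverted-U: dwelling time peaks at moderate complexity.
dwelling <- function(complexity) 5 - 4 * (complexity - 1)^2

dwelling(0.2)   # below-average complexity
dwelling(1.8)   # above-average complexity: identical dwelling time
```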

Ghost relationships

Big data and education is a powerful but dangerous combination. No matter the size of your data set or the variety of variables, correlational data remain incredibly treacherous to interpret, especially when the data are granular and lack a 1-to-1 mapping to relevant behavior or cognitive constructs. Given that education is inherently about causality (that is, it aims to change learners’ behavior and/or knowledge), research on online learning should employ a wide array of study methodologies in order to properly gather the kind of evidence required to make causal claims. It does not require gigabytes of event log data to establish that there is some relationship between students’ video watching behavior and quiz results. It does require proper experimental designs to establish causal relationships and the effectiveness of interventions and course design. For example, Kovacs (2016) found that students watch videos with in-video questions more often, and are less likely to prematurely stop watching these videos. While this provides some evidence of the benefits of in-video questions, it was a correlational study comparing videos with and without in-video questions. There might have been other relevant differences between the videos besides the presence of in-video questions. For example, it is reasonable to assume that teachers do not randomly select which videos get in-video questions, but choose to add questions to the more difficult videos. Should this be the case, a correlational comparison of videos with and without in-video questions might be confounded by factors such as the complexity of the video content, to the extent that the true relationship might be the opposite of what the correlational study finds. Such correlational relationships can be ‘ghost relationships’: they appear real at first sight, but have no bearing on reality.
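A small simulation illustrates how such a ghost relationship could arise (all numbers here are invented for illustration): in-video questions raise completion within every difficulty level, but because questions are mostly added to difficult videos, the naive comparison points the other way.

```r
set.seed(1)
n <- 20000
difficult    <- rbinom(n, 1, 0.5)                       # is the video difficult?
has_question <- rbinom(n, 1, ifelse(difficult == 1, 0.9, 0.1))
p_complete   <- 0.7 + 0.1 * has_question - 0.5 * difficult  # questions truly help
completed    <- rbinom(n, 1, p_complete)

# Naive, unstratified comparison: questions appear harmful
tapply(completed, has_question, mean)

# Stratified by difficulty: questions help within each stratum
tapply(completed, list(difficult, has_question), mean)
```

The naive comparison reverses the sign of the true effect, which is exactly the scenario described above.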

The way forward

The granularity of the data, and the various ways in which they can be interpreted, challenge the validity and generalizability of this type of research. With a sufficiently large sample size, number of variables, and researchers’ degrees of freedom, you are guaranteed to find ‘potentially’ interesting relationships in these datasets. A key development in this area (and in science in general) is pre-registering research methodology before a study is performed, in order to decrease ‘noise mining’ and increase the overall veracity of the literature. For more on the reasoning behind pre-registration, see also the JEPS Bulletin three-part series on the topic, starting with Dablander (2016). The Learning @ Scale conference, already at the center of research on online learning, is becoming a key player in this movement, as it explicitly recommends the use of pre-registered protocols for papers submitted to the 2017 conference.

A/B Testing

Experimental designs (often called “A/B tests” in this literature) are increasingly common in research on online learning, but they too are not without dangers and need to be carefully crafted (Reich, 2015). Data in open online education are not only different due to their scale; they require reconceptualization. There are new measures, such as the highly granular measurements described above, as well as existing educational variables which require different interpretations (DeBoer, Ho, Stump, & Breslow, 2014). For example, in traditional higher education it would be considered dramatic if over 90% of students did not finish a course, but this normative interpretation of drop-out rates cannot be uncritically applied to the context of open online education. First, while registration barriers are substantial in higher education, they are practically nonexistent in MOOCs. In effect, there is no filter which pre-selects the highly motivated students, resulting in many students who just want to take a peek and then stop participating. Secondly, in traditional education dropping out is interpreted as a loss for both the student and the institute. Again, this interpretation does not transfer to the context of MOOCs, as students who drop out after watching only some videos might have successfully completed their personal learning goals.

Rebooting MOOC research

The next generation of MOOC research needs to adopt a wider range of research designs with greater attention to causal factors promoting student learning (Reich, 2015). To advance our understanding it becomes essential to complement granular (big) data with other sources of information in an attempt to triangulate its meaning. Triangulation can be done in various ways, from multiple proxy measurements of the same latent construct within a single study, to repeated measurements across separate studies. A good example of triangulation in research on online learning is combining granular log data (such as video dwelling time), student output (such as essays), and subjective measures (such as self-reported behavior) in order to triangulate students’ behavior. Secondly, these models themselves require validation through repeated application across courses and populations. Convergence between these different (types of) measurements strengthens singular interpretations of (granular) data, and is often a necessary exercise. Inherent to triangulation is increasing the variety within and between datasets, such that they become richer in meaning and usable for generalizable statements.

Replications (both direct and conceptual) are fundamental for this effort. I would like to end with this quote from Justin Reich (2015), which reads: “These challenges cannot be addressed solely by individual researchers. Improving MOOC research will require collective action from universities, funding agencies, journal editors, conference organizers, and course developers. At many universities that produce MOOCs, there are more faculty eager to teach courses than there are resources to support course production. Universities should prioritize courses that will be designed from the outset to address fundamental questions about teaching and learning in a field. Journal editors and conference organizers should prioritize publication of work conducted jointly across institutions, examining learning outcomes rather than engagement outcomes, and favoring design research and experimental designs over post hoc analyses. Funding agencies should share these priorities, while supporting initiatives—such as new technologies and policies for data sharing—that have potential to transform open science in education and beyond.”

Further reading

Here are three recommended readings related to this topic:

  1. Reich, J. (2015). Rebooting MOOC research. Science, 347(6217), 34-35.
  2. DeBoer, J., Ho, A. D., Stump, G. S., & Breslow, L. (2014). Changing “course” reconceptualizing educational variables for massive open online courses. Educational Researcher.
  3. Daniel, J., (2012). Making Sense of MOOCs: Musings in a Maze of Myth, Paradox and Possibility. Journal of Interactive Media in Education. 2012(3), p.Art. 18.

References

Butler, A. C., & Roediger III, H. L. (2007). Testing improves long-term retention in a simulated classroom setting. European Journal of Cognitive Psychology, 19(4-5), 514-527.

Campbell, J., Gibbs, A. L., Najafi, H., & Severinski, C. (2014). A comparison of learner intent and behaviour in live and archived MOOCs. The International Review of Research in Open and Distributed Learning, 15(5).

Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (2009). Optimizing distributed practice: Theoretical analysis and practical implications. Experimental Psychology, 56(4), 236-246.

Daniel, J. (2012). Making sense of MOOCs: Musings in a maze of myth, paradox and possibility. Journal of Interactive Media in Education, 2012(3).

DeBoer, J., Ho, A. D., Stump, G. S., & Breslow, L. (2014). Changing “course”: Reconceptualizing educational variables for massive open online courses. Educational Researcher.

Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students’ learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4-58.

Guo, P. J., Kim, J., & Rubin, R. (2014, March). How video production affects student engagement: An empirical study of MOOC videos. In Proceedings of the First ACM Conference on Learning @ Scale (pp. 41-50). ACM.

Johnson, C. I., & Mayer, R. E. (2009). A testing effect with multimedia learning. Journal of Educational Psychology, 101(3), 621.

Kalyuga, S., Chandler, P., & Sweller, J. (1999). Managing split-attention and redundancy in multimedia instruction. Applied Cognitive Psychology, 13(4), 351-371.

Karpicke, J. D., & Roediger, H. L. (2008). The critical importance of retrieval for learning. Science, 319(5865), 966-968.

Konstan, J. A., Walker, J. D., Brooks, D. C., Brown, K., & Ekstrand, M. D. (2015). Teaching recommender systems at large scale: Evaluation and lessons learned from a hybrid MOOC. ACM Transactions on Computer-Human Interaction (TOCHI), 22(2), 10.

Lane, A., & McAndrew, P. (2010). Are open educational resources systematic or systemic change agents for teaching practice? British Journal of Educational Technology, 41(6), 952-962.

Liu, Y., Liu, M., Kang, J., Cao, M., Lim, M., Ko, Y., … & Lin, J. (2013, October). Educational paradigm shift in the 21st century e-learning. In E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education (Vol. 2013, No. 1, pp. 373-379).

Markham, P., Peter, L. A., & McCarthy, T. J. (2001). The effects of native language vs. target language captions on foreign language students’ DVD video comprehension. Foreign Language Annals, 34(5), 439-445.

Mayer, R. E. (2003). The promise of multimedia learning: Using the same instructional design methods across different media. Learning and Instruction, 13(2), 125-139.

Mayer, R. E., Mathias, A., & Wetzell, K. (2002). Fostering understanding of multimedia messages through pre-training: Evidence for a two-stage theory of mental model construction. Journal of Experimental Psychology: Applied, 8(3), 147.

Merrill, M. D., Drake, L., Lacy, M. J., Pratt, J., & ID2 Research Group. (1996). Reclaiming instructional design. Educational Technology, 36(5), 5-7.

Nesterko, S. O., Dotsenko, S., Han, Q., Seaton, D., Reich, J., Chuang, I., & Ho, A. D. (2013, December). Evaluating the geographic data in MOOCs. In Neural Information Processing Systems.

Ozcelik, E., Arslan-Ari, I., & Cagiltay, K. (2010). Why does signaling enhance multimedia learning? Evidence from eye movements. Computers in Human Behavior, 26(1), 110-117.

Plant, E. A., Ericsson, K. A., Hill, L., & Asberg, K. (2005). Why study time does not predict grade point average across college students: Implications of deliberate practice for academic performance. Contemporary Educational Psychology, 30(1), 96-116.

Pope, J. (2015). What are MOOCs good for? Technology Review, 118(1), 69-71.

Reich, J. (2015). Rebooting MOOC research. Science, 347(6217), 34-35.

Roediger, H. L., & Butler, A. C. (2011). The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences, 15(1), 20-27.

Seaton, D. T., Bergner, Y., Chuang, I., Mitros, P., & Pritchard, D. E. (2014). Who does what in a massive open online course? Communications of the ACM, 57(4), 58-65.

Van der Sluis, F., Ginn, J., & Van der Zee, T. (2016, April). Explaining student behavior at scale: The influence of video complexity on student dwelling time. In Proceedings of the Third (2016) ACM Conference on Learning @ Scale (pp. 51-60). ACM.

Van der Zee, T., Admiraal, W., Paas, F., Saab, N., & Giesbers, B. (2017). Effects of subtitles, complexity, and language proficiency on learning from online education videos. Journal of Media Psychology, in press. Pre-print available at https://osf.io/n6zuf/.

Zemsky, R. (2014). With a MOOC MOOC here and a MOOC MOOC there, here a MOOC, there a MOOC, everywhere a MOOC MOOC. The Journal of General Education, 63(4), 237-243.

Tim van der Zee

Skeptical scientist. I study how people learn from educational videos in open online courses, and how we can help them learn better. PhD student at Leiden University (the Netherlands), but currently a visiting scholar at MIT and UMass Lowell. You can follow me on Twitter: @Research_Tim and read my blog at www.timvanderzee.com


Introduction to Data Analysis using R


R is a statistical programming language whose popularity is quickly overtaking SPSS and other “traditional” point-and-click software packages (Muenchen, 2015). But why would anyone use a programming language, instead of point-and-click applications, for data analysis? An important reason is that data analysis rarely consists of simply running a statistical test. Instead, many small steps, such as cleaning and visualizing data, are usually repeated many times, and computers are much faster at doing repetitive tasks than humans are. Using a point-and-click interface for these “data cleaning” operations is laborious and unnecessarily slow:

“[T]he process of tidying my data took me around 10 minutes per participant as I would do it all manually through Excel. Even for a moderate sample size, this starts to take up a large chunk of time that could be spent doing other things like writing or having a beer” (Bartlett, 2016).

A programmed analysis would seamlessly apply the tidying steps to every participant in the blink of an eye, and would itself constitute an exact script of what operations were applied to the data, making it easier to repeat the steps later.
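As a minimal sketch of what such a scripted tidying step can look like (the folder name, column names, and cut-off below are hypothetical), one function is written once and applied to every participant's file:

```r
# Apply the same tidying steps to each participant's raw CSV file.
tidy_one <- function(file) {
  raw <- read.csv(file)
  raw <- raw[!is.na(raw$rt) & raw$rt > 150, ]   # drop missing/implausible RTs
  aggregate(rt ~ participant + condition, data = raw, FUN = mean)
}

files  <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
tidied <- do.call(rbind, lapply(files, tidy_one))
```

Ten minutes per participant becomes a fraction of a second, and the script itself is an exact record of what was done.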

Learning to use a programming language for data analysis reduces human labor and saves time that could be better spent doing more important (or fun) things. In this post, I introduce the R programming language and motivate its use in psychological science. The introduction is aimed toward students and researchers with no programming experience, but is suitable for anyone with an interest in learning the basics of R.

The R project for statistical computing

“R is a free software environment for statistical computing and graphics.” (R Core Team, 2016)

Great, but what does that mean? R is a programming language that is designed and used mainly in the statistics, data science, and scientific communities. R has “become the de-facto standard for writing statistical software among statisticians and has made substantial inroads in the social and behavioural sciences” (Fox, 2010). This means that if we use R, we’ll be in good company (and that company is likely to be even better and more numerous in the future; see Muenchen, 2015).

To understand what R is, and is not, it may be helpful to begin by contrasting R to its most common alternative, SPSS. Many psychologists are familiar with SPSS, which has a graphical user interface (GUI), allowing the user to look at the two-dimensional data table on screen, and click through various drop-down menus to conduct analyses on the data. In contrast, R is an object oriented programming language. Data is loaded into R as a “variable”, meaning that in order to view it, the user has to print it on the screen. The power of this approach is that the data is an object in a programming environment, and only your imagination limits what functions you can apply to the data. R also has no GUI to navigate with the mouse; instead, users interact with the data by typing commands.
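For example (with a made-up three-row data set), data lives in a named object, and viewing it is just one more function applied to that object:

```r
dat <- data.frame(id = 1:3, score = c(10, 12, 9))

print(dat)       # view the data by printing the object
summary(dat)     # or apply any other function to it
mean(dat$score)  # e.g. the mean of one column
```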

SPSS is expensive to use: universities have to pay real money to make it available to students and researchers. R and its supporting applications, on the other hand, are completely free, meaning that both users and developers have easier access to it. R is open source software, which means that many cutting-edge statistical methods are implemented in R much sooner than in SPSS. This is apparent, for example, in the recent uprising of Bayesian methods for data analysis (e.g., Buerkner, 2016).

Further, SPSS’s facilities for cleaning, organizing, formatting, and transforming data are limited, and not very user friendly (although this is a subjective judgment), so users often resort to a spreadsheet program (Microsoft Excel, say) for data manipulation. R has excellent capacities for all steps in the analysis pipeline, including data manipulation, so the analysis never has to spread across multiple applications. You can imagine how the possibility for mistakes, and the time needed, are reduced when the data files don’t need to be juggled between applications. Switching between applications, and repeatedly clicking through drop-down menus, means that for any small change the human using the computer must redo every step of the analysis. With R, you can simply re-use your analysis script and just import different data to it.


Figure 1. Two workflows for statistical discovery in the empirical sciences. “Analysis” consists of multiple operations, and is spread over multiple applications in Workflow 2, but not in Workflow 1. Therefore, “analysis” is more easily documented and repeated in Workflow 1. This fact alone may work to reduce mistakes in data analysis. The dashed line from R to Word Processor indicates an optional step: You can even write manuscripts with RStudio, going directly from R to Communicating Results.

These considerations lead to the contrast between the two workflows in Figure 1. Workflow 1 uses a programming language, such as R. It is more difficult to learn, but beginners can generally get started with real analyses in an hour or so. The payoff for the initial difficulty is great: the workflow is reproducible (users can save scripts and show their friends exactly what they did to create those beautiful violin plots); the workflow is flexible (want to do everything just the way you did it, but instead do the plots for males instead of females? Easy!); and, most importantly, repetitive, boring, but important work is delegated to a computer.

The final point requires some reflection; after all, computer programs all run on computers, so it sounds like a tautology. What I mean is that repetitive tasks can be wrapped in a simple function (these are usually already available; you don’t have to create your own functions) which then performs the tasks as many times as you like. Many tasks in the data cleaning stage, for example, are fairly boring and repetitive (calculating summary statistics, aggregating data, combining spreadsheets or columns across spreadsheets), but much less so when one uses a programming language.
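With R's built-in iris data as a stand-in, each of these "boring" steps reduces to a single function call:

```r
# Per-group means of every numeric column, in one call:
aggregate(. ~ Species, data = iris, FUN = mean)

# Recombining split spreadsheets is equally mechanical:
half1 <- iris[1:75, ]
half2 <- iris[76:150, ]
combined <- rbind(half1, half2)
nrow(combined)   # back to the full 150 rows
```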

Workflow 2, on the other hand, is easy to learn because there are few well-defined and systematic parts to it—everything is improvised on a task-by-task basis and done manually by copy-pasting, pointing-and-clicking and dragging and dropping. “Clean and organize” the data in Excel. “Analyze” in SPSS. In the optimal case where the data is perfectly aligned to the format that SPSS expects, you can get a p-value in less than a minute (excluding SPSS start-up time, which is quickly approaching infinity) by clicking through the drop-down menus. That is truly great, if that is all you want. But that’s rarely all that we want, and data is rarely in SPSS’s required format.

Workflow 2 is not reproducible (that is, it may be very difficult if not impossible to exactly retrace your steps through an analysis), so although you may know roughly that you "did an ANOVA", you may not remember which cases were included, what data was used, how it was transformed, and so on. Workflow 2 is not flexible: You've just done a statistical test on data from Experiment 1? Great! Can you now do it for Experiment 2, but log-transform the RTs? Sure, but then you would have to restart from the Excel step and redo all that pointing and clicking. Workflow 2 thus requires the human to do too much work, spending time on the analysis that could be better spent "doing other things like writing or having a beer" (Bartlett, 2016).

So, what is R? It is a programming language especially suited for data analysis. It allows you to program (more on this below!) your analyses instead of pointing and clicking through menus. The point here is not that you can’t do analysis with a point-and-click SPSS style software package. You can, and you can do a pretty damn good job with it. The point is that you can work less and be more productive if you’re willing to spend some initial time and effort learning Workflow 1 instead of the common Workflow 2. And that requires getting started with R.

Getting started with R: From 0 to R in 100 seconds

If you haven’t already, go ahead and download R, and start it up on your computer. Like most programming languages, R is best understood through its console—the interface that lets you interact with the language.

Figure 2. The R console.

After opening R, you should see a similar window on your own computer. The console allows us to type input, have R evaluate it, and return output. Just like a fancy calculator. Here, our first input assigned (R uses the left arrow, <-, for assignment) all the integers from 0 to 100 to a variable called numbers. Computer code can often be read from right to left; the first line here reads "integers 0 through 100, assign to numbers". We then calculated the mean of those numbers using R's built-in function, mean(). Everything interesting in R is done by using functions: There are functions for drawing figures, transforming data, running statistical tests, and much, much more.
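The session in Figure 2 amounts to the following (a sketch reconstructed from the description above):

```r
numbers <- 0:100  # "integers 0 through 100, assign to numbers"
mean(numbers)     # R's built-in function for the mean
#> [1] 50
```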

Here’s another example, this time we’ll create some heights data for kids and adults (in centimeters) and conduct a two-sample t-test (every line that begins with a “#>” is R’s output):
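For instance, something like this (the kids' heights match those used later in this post; the grownups' heights are invented for illustration):

```r
kids <- c(100, 98, 89, 111, 101)        # five kids' heights, in cm
grownups <- c(170, 172, 180, 158, 163)  # five adults' heights, in cm
t.test(kids, grownups)                  # Welch two-sample t-test
```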

That’s it, a t-test in R in a hundred seconds! Note, c() stands for “combine”, so kids is now a numeric vector (collection of numbers) with 5 elements. The t-test results are printed in R’s console, and are straightforward to interpret.

Save your analysis scripts

At its most basic, data analysis in R consists of importing data to R, and then running functions to visualize and model the data. R has powerful functions covering the entire process from Raw Data to Communicating Results (or Word Processor) in Figure 1. That is, users don't need to switch between applications at the various steps of the analysis workflow. Users simply type in code, let R evaluate it, and receive output. As you can imagine, a full analysis from raw data to a report (or table of summary statistics, or whatever your goal is) may involve lots of small steps (transforming variables in the data, plotting, calculating summaries, modeling and testing), which are often done iteratively. Because there may be many steps involved, we had better save our work so that we can inspect and redo it later, if needed. Therefore, for each analysis, we should create a text file containing all those steps, which can then be run repeatedly, with minor tweaks if required.

To create these text files, or "R scripts", we need a text editor. All computers have a text editor pre-installed, but programming is often easier if you use an integrated development environment (IDE), which has a text editor and console all in one place (often with additional capabilities). The best IDE for R, by far, is RStudio. Go ahead and download RStudio, and then start it. At this point you can close the other R console on your computer, because RStudio provides the console for you.

Getting started with RStudio


Figure 3. The RStudio IDE to R.

Figure 3 shows the main view of RStudio. There are four rectangular panels, each with a different purpose. The bottom left panel is the R console. We can type input in the console (on the empty line that begins with a “>”) and hit return to execute the code and obtain output. But a more efficient approach is to type the code into a script file, using the text editor panel, known as the source panel, in the top left corner. Here, we have a t-test-kids-grownups.R script open, which consists of three lines of code. You can write this script on your own computer by going to File -> New File -> R Script in RStudio, and then typing in the code you see in Figure 3. You can execute each line by hitting Control + Return, on Windows computers, or Command + Return on OS X computers. Scripts like this constitute the exact documentation of what you did in your analysis, and as you can imagine, are pretty important.

The two other panels are for viewing things, not so much for interacting with the data. Top right is the Environment panel, showing the variables that you have saved in R. That is, when you assign something into a variable (kids <- c(100, 98, 89, 111, 101)), that variable (kids) is visible in the Environment panel, along with its type (num for numeric), size (1:5, for 5), and contents (100, 98, 89, 111, 101). Finally, bottom right is the Viewer panel, where we can view plots, browse files on the computer, and do various other things.

With this knowledge in mind, let's begin with a couple of easy things. Don't worry, we'll get to actual data soon enough, once we have the absolute basics covered. I'll show some code and evaluate it in R to show its output too. You can, and should, type in the commands yourself to help you understand what they do (type each line in an R script and execute it by pressing Cmd + Enter; save your work every now and then).

Here’s how to create variables in R (try to figure out what’s saved in each variable):
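For example (the kids' heights appear elsewhere in this post; the other names and values are my own illustrations):

```r
kids <- c(100, 98, 89, 111, 101)        # five kids' heights, in cm
grownups <- c(170, 172, 180, 158, 163)  # five adults' heights, in cm
n_kids <- length(kids)                   # how many kids are there?
my_name <- "Matti"                       # a character string
```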

And here's how to print those variables' contents on the screen. (I'll provide a comment for each line; comments begin with a # and are not evaluated by R. That is, comments are read by humans only.)
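Assuming the kids vector from above:

```r
kids        # typing a variable's name prints its contents
#> [1] 100  98  89 111 101
mean(kids)  # the mean height of the kids
#> [1] 99.8
```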

Transforming data is easy: R automatically applies operations to vectors of (variables containing multiple) numbers, if needed. Let’s create z-scores of kids heights.
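A sketch, again assuming the kids vector from above:

```r
# The subtraction and division are applied to every element of kids
kids_z <- (kids - mean(kids)) / sd(kids)
```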

I hope you followed along. You should now have a bunch of variables in your R Environment. If you typed all those lines into an R script, you can now execute them again, or modify them and then re-run the script, line-by-line. You can also execute the whole script at once by clicking “Run”, at the top of the screen. Congratulations, you’ve just programmed your first computer program!

User contributed packages

One of the best things about R is that it has a large user base, and lots of user contributed packages, which make using R easier. Packages are simply bundles of functions, and will enhance your R experience quite a bit. Whatever you want to do, there’s probably an R package for that. Here, we will install and load (make available in the current session) the tidyverse package (Wickham, 2016), which is designed for making tidying data easier.
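Installing and loading the package takes two function calls:

```r
install.packages("tidyverse")  # download and install (needed only once)
library(tidyverse)             # load the package (in every new R session)
```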

It’s important that you use the tidyverse package if you want to follow along with this tutorial. All of the tasks covered here are possible without it, but the functions from tidyverse make the tasks easier, and certainly easier to learn.

Using R with data

Let's import some data to R. We'll use example data from Chapter 4 of the Intensive Longitudinal Methods book (Bolger & Laurenceau, 2013). The data set is freely available on the book's website. If you would like to follow along, please download the data set, and place it in a folder (unpack the .zip file). Then, use RStudio's Viewer panel, and its Files tab, to navigate to the directory on your computer that has the data set, and set it as the working directory by clicking "More", then "Set As Working Directory".


Figure 4. Setting the Working Directory

Setting the working directory properly is extremely important, because it's the only way R knows where to look for files on your computer. If you try to load files that are not in the working directory, you need to use the full path to the file. But if your working directory is properly set, you can just use the filename. The file is called "time.csv", and we load it into a variable called d using the read_csv() function. (csv stands for comma separated values, a common plain text format for storing data.) You'll want to type all these functions into an R script, so create a new R script and make sure you are typing the commands in the Source panel, not the Console panel. If you set your working directory correctly, once you save the R script file, it will be saved in that directory, right next to the "time.csv" file.
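Assuming the working directory is set as described, loading the data looks like this:

```r
d <- read_csv("time.csv")  # read_csv() comes with the tidyverse
```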

d is now a data frame (sometimes called a “tibble”, because why not), whose rows are observations, and columns the variables associated with those observations.

This data contains simulated daily intimacy reports of 50 individuals, who reported their intimacy every evening, for 16 days. Half of these simulated participants were in a treatment group, and the other half in a control group. To print the first few rows of the data frame to screen, simply type its name:
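Like so:

```r
d  # prints the first rows of the data frame, and the columns' data types
```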

The first column, id, specifies the id number of the person that observation belongs to. int means that the data in this column are integers. time indicates the day of the observation, and the authors coded the first day as 0 (this will make intercepts in regression models easier to interpret). time01 is just time, but recoded so that 1 is at the end of the study. dbl means that the values are floating point numbers. intimacy is the reported intimacy, and treatment indicates whether the person was in the control (0) or treatment (1) group. The first row of this output also tells us that there are 800 rows in total in this data set, and 5 variables (columns). Each row is also numbered in the output (leftmost "column"), but those values are not in the data.

Data types

It’s important to verify that your variables (columns) are imported into R in the appropriate format. For example, you would not like to import time recorded in days as a character vector, nor would you like to import a character vector (country names, for example) as a numeric variable. Almost always, R (more specifically, read_csv()) automatically uses correct formats, which you can verify by looking at the row between the column names and the values.

There are five basic data types: int for integers, num (or dbl) for floating point numbers (1.12345…), chr for characters (also known as “strings”), factor (sometimes abbreviated as fctr) for categorical variables that have character labels (factors can be ordered if required), and logical (abbreviated as logi) for logical variables: TRUE or FALSE. Here’s a little data frame that illustrates the basic variable types in action:
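Something like the following (an invented example; the names and values are purely illustrative):

```r
person <- tibble(
  name = c("Ann", "Bob"),        # chr: character strings
  age = c(30L, 41L),             # int: integers
  height = c(165.5, 180.2),      # dbl: floating point numbers
  group = factor(c("a", "b")),   # fctr: a categorical variable
  likes_matlab = c(FALSE, NA)    # logi: TRUE/FALSE; NA marks a missing value
)
person
```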

Here we are also introduced to a very special value, NA. NA means that there is no value, and we should always pay special attention to data that has NAs, because they may indicate that some important data is missing. This sample data explicitly tells us that we don't know whether this person likes matlab or not, because the variable is NA. OK, let's get back to the daily intimacy reports data.

Quick overview of data

We can now use the variables in the data frame d and compute summaries just as we did above with the kids’ and adults’ heights. A useful operation might be to ask for a quick summary of each variable (column) in the data set:
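```r
summary(d)  # quick summary statistics for every column in d
```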

To get a single variable (column) from the data frame, we call it with the $ operator (“gimme”, for asking R to give you variables from within a data frame). To get all the intimacy values, we could just call d$intimacy. But we better not, because that would print out all 800 intimacy values into the console. We can pass those values to functions instead:
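```r
mean(d$intimacy)  # mean of all 800 intimacy values
sd(d$intimacy)    # and their standard deviation
```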

If you would like to see the first six values of a variable, you can use the head() function:
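```r
head(d$intimacy)  # the first six intimacy values
```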

head() works on data frames as well, and you can use an optional number argument to specify how many first values you’d like to see returned:
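```r
head(d, 2)  # the first two rows of the data frame
```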

A look at R’s functions

Generally, this is how R functions work: you name the function, and specify arguments to the function inside the parentheses. Some of these arguments may be data or other input (d, above), and some of them change what the function does and how (2, above). To find out what arguments you can give to a function, just type the function's name in the console with a question mark prepended to it:
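```r
?head  # opens head()'s help page in RStudio's bottom right panel
```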

Importantly, calling the help page reveals that functions’ arguments are named. That is, arguments are of the form X = Y, where X is the name of the argument, and Y is the value you would like to set it to. If you look at the help page of head() (?head), you’ll see that it takes two arguments, x which should be an object (like our data frame d, (if you don’t know what “object” means in this context, don’t worry—nobody does)), and n, which is the number of elements you’d like to see returned. You don’t always have to type in the X = Y part for every argument, because R can match the arguments based on their position (whether they are the first, second, etc. argument in the parentheses). We can confirm this by typing out the full form of the previous call head(d, 2), but this time, naming the arguments:
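```r
head(x = d, n = 2)  # identical to head(d, 2), but with named arguments
```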

Now that you know how R’s functions work, you can find out how to do almost anything by typing into a search engine: “How to do almost anything in R”. The internet (and books, of course) is full of helpful tutorials (see Resources section, below) but you will need to know these basics about functions in order to follow those tutorials.

Creating new variables

Creating new variables is also easy. Let’s create a new variable that is the square root of the reported intimacy (because why not), by using the sqrt() function and assigning the values to a new variable (column) within our data frame:
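```r
d$sqrt_int <- sqrt(d$intimacy)  # new column of square-root intimacy values
```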

Recall that sqrt(d$intimacy) will take the square root of each of the 800 values in the vector of intimacy values, and return a vector of 800 square roots. There's no need to do this individually for each value.

We can also create variables using conditional logic, which is useful for creating verbal labels for numeric variables, for example. Let’s create a verbal label for each of the treatment groups:
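One way to do this is with the ifelse() function (a sketch):

```r
# "Control" where treatment is 0, "Treatment" everywhere else
d$Group <- ifelse(d$treatment == 0, "Control", "Treatment")
```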

We created a new variable, Group in d, that is “Control” if the treatment variable on that row is 0, and “Treatment” otherwise.

Remember our discussion of data types above? d now contains integer, double, and character variables. Make sure you can identify these in the output, above.

Aggregating

Let’s focus on aggregating the data across individuals, and plotting the average time trends of intimacy, for the treatment and control groups.

In R, aggregating is easiest if you think of it as calculating summaries for “groups” in the data (and collapsing the data across other variables). “Groups” doesn’t refer to experimental groups (although it can), but instead any arbitrary groupings of your data based on variables in it, so the groups can be based on multiple things, like time points and individuals, or time points and experimental groups.

Here, our groups are the two treatment groups and 16 time points, and we would like to obtain the mean for each group at each time point by collapsing across individuals:
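In code, using group_by() and summarize() from the tidyverse:

```r
d_groups <- group_by(d, Group, time)                        # d, grouped by Group and time
d_groups <- summarize(d_groups, intimacy = mean(intimacy))  # mean intimacy per group
```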

The above code summarized our data frame d by calculating the mean intimacy for the groups specified by group_by(). We did this by first creating a data frame that is d, but is grouped on Group and time, and then summarizing those groups by taking the mean intimacy for each of them. This is what we got:

A mean intimacy value for both groups, at each time point.

Plotting

We can now easily plot these data, for each individual, and each group. Let’s begin by plotting just the treatment and control groups’ mean intimacy ratings:
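Using the aggregated d_groups data frame from above:

```r
ggplot(d_groups, aes(x = time, y = intimacy, color = Group)) +
  geom_line()
```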


Figure 5. Example R plot, created with ggplot(), of two groups’ mean intimacy ratings across time.

For this plot, we used the ggplot() function, which takes as input a data frame (we used d_groups from above), and a set of aesthetic specifications (aes(), we mapped time to the x axis, intimacy to the y axis, and color to the different treatment Groups in the data). We then added a geometric object to display these data (geom_line() for a line.)

To illustrate how to add other geometric objects to display the data, let’s add some points to the graph:
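```r
ggplot(d_groups, aes(x = time, y = intimacy, color = Group)) +
  geom_line() +
  geom_point()  # points on top of the lines
```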


Figure 6. Two groups’ mean intimacy ratings across time, with points.

We can easily do the same plot for every individual (a panel plot, but let’s drop the points for now):
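A sketch, assuming the Group label created earlier:

```r
ggplot(d, aes(x = time, y = intimacy, color = Group)) +
  geom_line() +
  facet_wrap("id")  # one small panel per person
```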


Figure 7. Two groups’ mean intimacy ratings across time, plotted separately for each person.

The code is exactly the same, but now we used the non-aggregated raw data d, and added an extra function that wraps each id’s data into their own little subplot (facet_wrap(); remember, if you don’t know what a function does, look at the help page, i.e. ?facet_wrap). ggplot() is an extremely powerful function that allows you to do very complex and informative graphs with systematic, short and neat code. For example, we may add a linear trend (linear regression line) to each person’s panel. This time, let’s only look at the individuals in the experimental group, by using the filter() command (see below):
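```r
ggplot(filter(d, Group == "Treatment"),
       aes(x = time, y = intimacy)) +
  geom_line() +
  geom_smooth(method = "lm") +  # linear trend line in each panel
  facet_wrap("id")
```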


Figure 8. Treatment group’s mean intimacy ratings across time, plotted separately for each person, with linear trend lines.

Data manipulation

We already encountered an example of manipulating data, when we aggregated intimacy over some groups (experimental groups and time points). Other common operations are, for example, trimming the data based on some criteria. All operations that drop observations are conceptualized as subsetting, and can be done using the filter() command. Above, we filtered the data such that we plotted the data for the treatment group only. As another example, we can get the first week's data (time is less than 7, that is, days 0-6), for the control group only, by specifying these logical operations in the filter() function:
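```r
filter(d, time < 7 & Group == "Control")  # days 0-6, control group only
```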

Try re-running the above line with small changes to the logical operations. Note that the two logical operations are combined with the AND command (&), you can also use OR (|). Try to imagine what replacing AND with OR would do in the above line of code. Then try and see what it does.

A quick detour to details

At this point it is useful to remember that computers do exactly what you ask them to do, nothing less, nothing more. So, for instance, pay attention to capital letters, symbols, and parentheses. The following three lines are faulty; try to figure out why:

Why does this data frame have zero rows?

Error? What’s the problem?

Error? What’s the problem?
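The three faulty lines may have looked something like this (a reconstruction consistent with the answers below):

```r
filter(d, Group == "control")     # 1. zero rows
filter(d, Group == "Treatment"))  # 2. error
filter(d, Group = "Treatment")    # 3. error
```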

(Answers: 1. Group is either "Control" or "Treatment", not "control" or "treatment". 2. There is an extra parenthesis at the end. 3. == is not the same as =: the double == is a logical comparison operator, asking whether two things are equal, while the single = is an assignment operator.)

Advanced data manipulation

Let’s move on. What if we’d like to detect extreme values? For example, let’s ask if there are people in the data who show extreme overall levels of intimacy (what if somebody feels too much intimacy!). How can we do that? Let’s start thinking like programmers and break every problem into the exact steps required to answer the problem:

  1. Calculate the mean intimacy for everybody
  2. Plot the mean intimacy values (because always, always visualize your data)
  3. Remove everybody whose mean intimacy is over 2 standard deviations above the overall mean intimacy (over-intimate people?) (note that this is a terrible exclusion criterion, used here for illustration purposes only)

As before, we’ll group the data by person, and calculate the mean (which we’ll call int).
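```r
d_grouped <- group_by(d, id)                             # group the data by person
d_grouped <- summarize(d_grouped, int = mean(intimacy))  # everybody's mean intimacy
```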

We now have everybody’s mean intimacy in a neat and tidy data frame. We could, for example, arrange the data such that we see the extreme values:
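```r
arrange(d_grouped, desc(int))  # sort people by mean intimacy, largest first
```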

Nothing makes as much sense as a histogram:
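```r
ggplot(d_grouped, aes(x = int)) +
  geom_histogram()
```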


Figure 9. Histogram of everybody’s mean intimacy ratings.

It doesn’t look like anyone’s mean intimacy value is “off the charts”. Finally, let’s apply our artificial exclusion criteria: Drop everybody whose mean intimacy is 2 standard deviations above the overall mean:
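One way to flag those people (a sketch):

```r
# TRUE for people more than 2 SDs above the overall mean, FALSE otherwise
d_grouped$exclude <- d_grouped$int > mean(d_grouped$int) + 2 * sd(d_grouped$int)
```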

Then we could proceed to exclude these participants (don’t do this with real data!), by first joining the d_grouped data frame, which has the exclusion information, with the full data frame d
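```r
d <- left_join(d, d_grouped)  # joins the two data frames by their shared id column
```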

and then removing all rows where exclude is TRUE. We use the filter() command, and take only the rows where exclude is FALSE. So we want our logical operator for filtering rows to be “not-exclude”. “not”, in R language, is !:
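```r
d2 <- filter(d, !exclude)  # keep only rows where exclude is FALSE
```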

I saved the included people in a new data set called d2, because I don't actually want to remove those people, but just wanted to illustrate how to do this. In some situations we could also imagine applying the exclusion criterion to individual observations, instead of individual participants. This would be as easy as (think why):
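```r
# Drop individual observations more than 2 SDs above the overall mean
d2 <- filter(d, intimacy < mean(intimacy) + 2 * sd(intimacy))
```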

Selecting variables in data

After these artificial examples of removing extreme values (or people) from data, we have a couple of extra variables in our data frame d that we would like to remove, because it's good to work with clean data. Removing, and more generally selecting, variables (columns) in data frames is most easily done with the select() function. Let's select() all variables in d except the square-root intimacy (sqrt_int), average intimacy (int) and exclusion (exclude) variables (that is, let's drop those three columns from the data frame):
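```r
d <- select(d, -sqrt_int, -int, -exclude)  # drop these three, keep the rest
```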

Using select(), we can keep variables by naming them, or drop them by using -. If no variables are named for keeping, but some are dropped, all unnamed variables are kept, as in this example.

Regression

Let's do an example linear regression by focusing on one participant's data. The first step, then, is to create a subset containing only one person's data. For instance, we may ask for a subset of d that consists of all rows where id is 30, by typing:
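```r
d_sub <- filter(d, id == 30)  # person 30's data only
```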

Linear regression is available using the lm() function, and R’s own formula syntax:
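A sketch (the name fit is my own choice):

```r
fit <- lm(intimacy ~ time, data = d_sub)  # regress intimacy on time
```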

Generally, for regression in R, you'd specify the formula as outcome ~ predictors. If you have multiple predictors, you combine them with addition ("+"): outcome ~ IV1 + IV2. Interactions are specified with multiplication ("*"): outcome ~ IV1 * IV2 (which automatically includes the main effects of IV1 and IV2; to get the interaction only, use ":": outcome ~ IV1:IV2). We also specified that for the regression, we'd like to use data in the d_sub data frame, which contains only person 30's data.

Summary of a fitted model is easily obtained:
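```r
summary(fit)  # coefficients, standard errors, R-squared, and more
```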

Visualizing the model fit is also easy. We’ll use the same code as for the figures above, but also add points (geom_point()), and a linear regression line with a 95% “Confidence” Ribbon (geom_smooth(method="lm")).
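```r
ggplot(d_sub, aes(x = time, y = intimacy)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = "lm")  # regression line with a 95% confidence ribbon
```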


Figure 10. Person 30’s intimacy ratings over time (points and black line), with a linear regression model (blue line and gray Confidence Ribbon).

Pretty cool, right? And there you have it. We’ve used R to do a sample of common data cleaning and visualization operations, and fitted a couple of regression models. Of course, we’ve only scratched the surface, and below I provide a short list of resources for learning more about R.

Conclusion

Programming your statistical analyses leads to a flexible, reproducible and time-saving workflow, in comparison to more traditional point-and-click focused applications. R is probably the best programming language around for applied statistics, because it has a large user base and many user-contributed packages that make your life easier. While it may take an hour or so to get acquainted with R, after initial difficulty it is easy to use, and provides a fast and reliable platform for data wrangling, visualization, modeling, and statistical testing.

Finally, learning to code is not about having a superhuman memory for function names, but instead it is about developing a programmer's mindset: Think your problem through and decompose it into small chunks, then ask a computer to do those chunks for you. Do that a couple of times and you will magically have memorized, as a byproduct, the names of a few common functions. You learn to code not by reading and memorizing a tutorial, but by typing it out, examining the output, changing the input, and figuring out what changed in the output. Even better, you'll learn the most once you use code to examine your own data, data that you know and care about. Hopefully, you'll now be able to begin doing just that.

Resources

The web is full of fantastic R resources, so here's a sample of some materials I think would be useful to beginning R users.

Introduction to R

  • Data Camp’s Introduction to R is a free online course on R.

  • Code School’s R Course is an interactive web tutorial for R beginners.

  • YaRrr! The Pirate’s Guide to R is a free e-book, with accompanying YouTube lectures and witty writing (“it turns out that pirates were programming in R well before the earliest known advent of computers.”) YaRrr! is also an R package that helps you get started with some pretty cool R stuff (Phillips, 2016). Recommended!

  • The Personality Project’s Guide to R (Revelle, 2016b) is a great collection of introductory (and more advanced) R materials especially for Psychologists. The site’s author also maintains a popular and very useful R package called psych (Revelle, 2016a). Check it out!

  • Google Developers’ YouTube Crash Course to R is a collection of short videos. The first 11 videos are an excellent introduction to working with RStudio and R’s data types, and programming in general.

  • Quick-R is a helpful collection of R materials.

Data wrangling

These websites explain how to “wrangle” data with R.

  • R for Data Science (Wickham & Grolemund, 2016) is the definitive source on using R with real data for efficient data analysis. It starts off easy (and is suitable for beginners) but covers nearly everything in a data-analysis workflow apart from modeling.

  • Introduction to dplyr explains how to use the dplyr package (Wickham & Francois, 2016) to wrangle data.

  • Data Processing Workflow is a good resource on how to use common packages for data manipulation (Wickham, 2016), but the example data may not be especially helpful.

Visualizing data

Statistical modeling and testing

R provides many excellent packages for modeling data; my absolute favorite is the brms package (Buerkner, 2016) for Bayesian regression modeling.

References

Bartlett, J. (2016, November 22). Tidying and analysing response time data using r. Statistics and substance use. Retrieved November 23, 2016, from https://statsandsubstances.wordpress.com/2016/11/22/tidying-and-analysing-response-time-data-using-r/

Bolger, N., & Laurenceau, J.-P. (2013). Intensive longitudinal methods: An introduction to diary and experience sampling research. Guilford Press. Retrieved from http://www.intensivelongitudinal.com/

Buerkner, P.-C. (2016). brms: Bayesian regression models using Stan. Retrieved from http://CRAN.R-project.org/package=brms

Fox, J. (2010). Introduction to statistical computing in r. Retrieved November 23, 2016, from http://socserv.socsci.mcmaster.ca/jfox/Courses/R-course/index.html

Muenchen, R., A. (2015). The popularity of data analysis software. R4stats.com. Retrieved November 22, 2016, from http://r4stats.com/articles/popularity/

Phillips, N. (2016). Yarrr: A companion to the e-book “YaRrr!: The pirate’s guide to r”. Retrieved from https://CRAN.R-project.org/package=yarrr

R Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/

Revelle, W. (2016a). Psych: Procedures for psychological, psychometric, and personality research. Evanston, Illinois: Northwestern University. Retrieved from https://CRAN.R-project.org/package=psych

Revelle, W. (2016b). The personality project’s guide to r. Retrieved November 22, 2016, from http://personality-project.org/r/

Wickham, H. (2016). Tidyverse: Easily install and load ’tidyverse’ packages. Retrieved from https://CRAN.R-project.org/package=tidyverse

Wickham, H., & Francois, R. (2016). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr

Wickham, H., & Grolemund, G. (2016). R for data science. Retrieved from http://r4ds.had.co.nz/

Matti Vuorre


Matti Vuorre is a PhD Student at Columbia University in New York City. He studies cognitive psychology and neuroscience, and focuses on understanding the mechanisms underlying humans' metacognitive capacities.



Not solely about that Bayes: Interview with Prof. Eric-Jan Wagenmakers

Last summer saw the publication of the most important work in psychology in decades: the Reproducibility Project (Open Science Collaboration, 2015; see here and here for context). It stirred up the community, resulting in many constructive discussions but also in verbally violent disagreement. What unites all parties, however, is the call for more transparency and openness in research.

Eric-Jan “EJ” Wagenmakers has argued for pre-registration of research (Wagenmakers et al., 2012; see also here) and direct replications (e.g., Boekel et al., 2015; Wagenmakers et al., 2015), for a clearer demarcation of exploratory and confirmatory research (de Groot, 1954/2013), and for a change in the way we analyze our data (Wagenmakers et al., 2011; Wagenmakers et al., in press).

Concerning the latter point, EJ is a staunch advocate of Bayesian statistics. With his many collaborators, he writes the clearest and wittiest expositions of the topic (e.g., Wagenmakers et al., 2016; Wagenmakers et al., 2010). Crucially, he is also a key player in opening Bayesian inference up to social and behavioral scientists more generally; in fact, the software JASP is EJ's brainchild (see also our previous interview).


In sum, psychology is changing rapidly, both in how researchers communicate and do science, but increasingly also in how they analyze their data. This makes it nearly impossible for university curricula to keep up; courses in psychology are often years, if not decades, behind. Statistics classes in particular are usually boringly cookbook-oriented and often fraught with misconceptions (Wagenmakers, 2014). At the University of Amsterdam, Wagenmakers succeeds in doing things differently. He has previously taught a class called "Good Science, Bad Science", discussing novel developments in methodology as well as supervising students in preparing and conducting direct replications of recent research findings (cf. Frank & Saxe, 2012).

Now, at the end of the day, testing undirected hypotheses using p values or Bayes factors only gets you so far – even if you preregister the heck out of it. To move the field forward, we need formal models that instantiate theories and make precise quantitative predictions. Together with Michael Lee, Eric-Jan Wagenmakers has written an amazing practical cognitive modeling book, harnessing the power of computational Bayesian methods to estimate arbitrarily complex models (for an overview, see Lee, submitted). More recently, he has co-edited a book on model-based cognitive neuroscience on how formal models can help bridge the gap between brain measurements and cognitive processes (Forstmann & Wagenmakers, 2015).

Long-term readers of the JEPS Bulletin will note that openness of research, pre-registration and replication, research methodology, and Bayesian statistics are recurring themes here. It was thus only a matter of time before we interviewed Eric-Jan Wagenmakers and asked him questions concerning all of the areas above. In addition, we ask: how does he stay so immensely productive? What tips does he have for students interested in an academic career? And what can instructors learn from “Good Science, Bad Science”? Enjoy the ride!


Bobby Fischer, the famous chess player, once said that he does not believe in psychology. You actually switched from playing chess to pursuing a career in psychology; tell us how this came about. Was it a good move?

It was an excellent move, but I have to be painfully honest: I simply did not have the talent and the predisposition to make a living out of playing chess. Several of my close friends did have that talent and went on to become international grandmasters; they play chess professionally. But I was actually lucky. For players outside of the world top-50, professional chess is a career trap. The pay is poor, the work insanely competitive, and the life is lonely. And society has little appreciation for professional chess players. In terms of creativity, hard work, and intellectual effort, an international chess grandmaster easily outdoes the average tenured professor. People who do not play chess themselves do not realize this.

Your list of publications gets updated so frequently, it should have its own RSS feed! How do you grow and cultivate such an impressive network of collaborators? Do you have specific tips for early career researchers?

At the start of my career I did not publish much. For instance, when I finished my four years of grad studies I think I had two papers. My current publication rate is higher, and part of that is due to an increase in expertise. It is just easier to write papers when you know (or think you know) what you’re talking about. But the current productivity is mainly due to the quality of my collaborators. First, at the psychology department of the University of Amsterdam we have a fantastic research master program. Many of my graduate students come from this program, having been tried and tested in the lab as RAs. When you have, say, four excellent graduate students, and each publishes one article a year, that obviously helps productivity. Second, the field of Mathematical Psychology has several exceptional researchers that I have somehow managed to collaborate with. In the early stages I was a graduate student with Jeroen Raaijmakers, and this made it easy to start work with Rich Shiffrin and Roger Ratcliff. So I was privileged and I took the opportunities that were given. But I also work hard, of course.

There is a lot of advice that I could give to early career researchers, but I will have to keep it short. First, in order to excel in whatever area of life, commitment is key. What this usually means is that you have to enjoy what you are doing. Your drive and your enthusiasm will act as a magnet for collaborators. Second, you have to take initiative. So read broadly, follow the latest articles (I remain up to date through Twitter and Google Scholar), get involved with scientific organizations, coordinate a colloquium series, set up a reading group, offer to review papers together with your advisor, attend summer schools, etc. For example, when I started my career I had seen a new book on memory and asked the editor of Acta Psychologica whether I could review it for them. Another example is Erik-Jan van Kesteren, an undergraduate student from a different university who had attended one of my talks about JASP. He later approached me and asked whether he could help out with JASP. He is now a valuable member of the JASP team. Third, it helps if you are methodologically strong. When you are methodologically strong –in statistics, mathematics, or programming– you have something concrete to offer in a collaboration.

Considering all projects you are involved in, JASP is probably the one that will have most impact on psychology, or the social and behavioral sciences in general. How did it all start?

In 2005 I had a conversation with Mark Steyvers. I had just shown him a first draft of a paper that summarized the statistical drawbacks of p-values. Mark told me “it is not enough to critique p-values. You should also offer a concrete alternative”. I agreed and added a section about BIC (the Bayesian Information Criterion). However, the BIC is only a rough approximation to the Bayesian hypothesis test. Later I became convinced that social scientists will only use Bayesian tests when these are readily available in a user-friendly software package. About 5 years ago I submitted an ERC grant proposal “Bayes or Bust! Sensible hypothesis tests for social scientists” that contained the development of JASP (or “Bayesian SPSS” as I called it in the proposal) as a core activity. I received the grant and then we were on our way.

I should acknowledge that many of the Bayesian computations in JASP depend on the R BayesFactor package developed by Richard Morey and Jeff Rouder. I should also emphasize the contribution of JASP’s first software engineer, Jonathon Love, who suggested that JASP ought to feature classical statistics as well. In the end we agreed that by including classical statistics, JASP could act as a Trojan horse and boost the adoption of Bayesian procedures. So the project started as “Bayesian SPSS”, but the scope was quickly broadened to include p-values.

JASP is already game-changing software, but it is under continuous development and improvement. More concretely, what do you plan to add in the near future? What do you hope to achieve in the long-term?

In terms of the software, we will shortly include several standard procedures that are still missing, such as logistic regression and chi-square tests. We also want to upgrade the popular Bayesian procedures we have already implemented, and we are going to create new modules. Before too long we hope to offer a variable views menu and a data-editing facility. When all this is done it would be great if we could make it easier for other researchers to add their own modules to JASP.

One of my tasks in the next years is to write a JASP manual and JASP books. In the long run, the goal is to have JASP be financially independent of government grants and university support. I am grateful for the support that the psychology department at the University of Amsterdam offers now, and for the support they will continue to offer in the future. However, the aim of JASP is to conquer the world, and this requires that we continue to develop the program “at break-neck speed”. We will soon be exploring alternative sources of funding. JASP will remain free and open-source, of course.

You are a leading advocate of Bayesian statistics. What do researchers gain by changing the way they analyze their data?

They gain intellectual hygiene, and coherent answers to questions that make scientific sense. A more elaborate answer is outlined in a paper currently submitted to a special issue of Psychonomic Bulletin & Review: https://osf.io/m6bi8/ (Part I).

The Reproducibility Project used different metrics to quantify the success of a replication – none of them really satisfactory. How can a Bayesian perspective help illuminate the “crisis of replication”?

As a theory of knowledge updating, Bayesian statistics is ideally suited to address questions of replication. However, the question “did the effect replicate?” is underspecified. Are the effect sizes comparable? Does the replication provide independent support for the presence of the effect? Does the replication provide support for the position of the proponents versus the skeptics? All these questions are slightly different, but each receives the appropriate answer within the Bayesian framework. Together with Josine Verhagen, I have explored a method –the replication Bayes factor– in which the prior distribution for the replication test is the posterior distribution obtained from the original experiment (e.g., Verhagen & Wagenmakers, 2014). We have applied this intuitive procedure to a series of recent experiments, including the multi-lab Registered Replication Report of Fritz Strack’s Facial Feedback hypothesis. In Strack’s original experiment, participants who held a pen with their teeth (causing a smile) judged cartoons to be funnier than participants who held a pen with their lips (causing a pout). I am not allowed to tell you the result of this massive replication effort, but the paper will be out soon.
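The core idea of the replication Bayes factor can be sketched in a few lines under a normal approximation to the effect-size estimates (the actual Verhagen & Wagenmakers, 2014, procedure works with t statistics; the study numbers below are invented purely for illustration):

```python
import numpy as np
from scipy.stats import norm

def replication_bf(d_orig, se_orig, d_rep, se_rep):
    """Replication Bayes factor, normal approximation.

    The posterior from the original study, N(d_orig, se_orig^2),
    serves as the prior for the effect in the replication test.
    Returns the marginal likelihood of the replication estimate
    under that prior, relative to H0 (effect = 0)."""
    # Under H1: d_rep ~ N(delta, se_rep^2), delta ~ N(d_orig, se_orig^2),
    # so marginally d_rep ~ N(d_orig, se_orig^2 + se_rep^2)
    m1 = norm.pdf(d_rep, loc=d_orig, scale=np.sqrt(se_orig**2 + se_rep**2))
    # Under H0 the effect is fixed at zero
    m0 = norm.pdf(d_rep, loc=0.0, scale=se_rep)
    return m1 / m0

# Hypothetical original study: d = 0.6 (SE = 0.25).
# A replication finding d = 0.1 favors the skeptic (BF < 1) ...
bf_fail = replication_bf(0.6, 0.25, 0.1, 0.15)
# ... while one finding d = 0.55 strongly supports the proponent.
bf_success = replication_bf(0.6, 0.25, 0.55, 0.15)
```

The design choice that makes this “intuitive” is that the proponent’s position is exactly what the original data taught us, so the Bayes factor directly asks: does the replication look more like the original finding or like a null effect?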

You have recently co-edited a book on model-based cognitive neuroscience. What is the main idea here, and what developments in this area are most exciting to you?

The main idea is that much of experimental psychology, mathematical psychology, and the neurosciences pursue a common goal: to learn more about human cognition. So ultimately the interest is in latent constructs such as intelligence, confidence, memory strength, inhibition, and attention. The models that have been developed in mathematical psychology are able to link these latent constructs to specific model parameters. These parameters may in turn be estimated by behavioral data, by neural data, or by both data sets jointly. Brandon Turner is one of the early career mathematical psychologists who has made great progress in this area. So the mathematical models are a vehicle to achieve an integration of data from different sources. Moreover, insights from neuroscience can provide important constraints that help inform mathematical modeling. The relation is therefore mutually beneficial. This is summarized in the following paper: http://www.ejwagenmakers.com/2011/ForstmannEtAl2011TICS.pdf

One thing that distinguishes science from sophistry is replication; yet it is not standard practice. In “Good Science, Bad Science”, you had students prepare a registered replication plan. What was your experience teaching this class? What did you learn from the students?

This was a great class to teach. The students were highly motivated, and oftentimes it felt more like a lab meeting than a class. The idea was to develop four Registered Report submissions. Some time has passed, but the students and I still intend to submit the proposals for publication.

The most important lesson this class has taught me is that our research master students want to learn relevant skills and conduct real research. In the next semester I will teach a related course, “Good Research Practices”, and I hope to attain the same high levels of student involvement. For the new course, I plan to have students read a classic methods paper that identifies a fallacy; next the students will conduct a literature search to assess the current prevalence of the fallacy. I have done several similar projects, but never with master students (e.g., http://www.ejwagenmakers.com/2011/NieuwenhuisEtAl2011.pdf and http://link.springer.com/article/10.3758/s13423-015-0913-5).

What tips and tricks can you share with instructors planning to teach a similar class?

The first tip is to set your aims high. For a research master class, the goal should be publication. Of course this may not always be realized, but it should be the goal. It helps if you can involve colleagues or graduate students. If you set your aims high, the students know that you take them seriously, and that their work matters. The second tip is to arrange the teaching so that the students do most of the work. The students need to develop a sense of ownership about their projects, and they need to learn. This will not happen if you treat the students as passive receptacles. I am reminded of a course that I took as an undergraduate. In this course I had to read chapters, deliver presentations, and prepare questions. It was one of the most enjoyable and inspiring courses I had ever taken, and it took me decades to realize that the professor who taught the course actually did not have to do much at all.

Many scholarly discussions these days take place on social media and blogs. You joined Twitter yourself over a year ago. How do you navigate the social media jungle, and what resources can you recommend to our readers?

I am completely addicted to Twitter, but I also feel it makes me a better scientist. When you are new to Twitter, I recommend that you start by following a few people who have interesting things to say. Coming from a Bayesian perspective, I recommend Alexander Etz (@AlxEtz) and Richard Morey (@richarddmorey). And of course it is essential to follow JASP (@JASPStats). As with all social media, the most valuable resource you have is the “mute” option; exercise it ruthlessly to prevent yourself from being swamped by holiday pictures.

Fabian Dablander

Fabian Dablander is currently finishing his thesis in Cognitive Science at the University of Tübingen and Daimler Research & Development on validating driving simulations. He is interested in innovative ways of data collection, Bayesian statistics, open science, and effective altruism. You can find him on Twitter @fdabl.


Replicability and Registered Reports

Last summer saw the publication of a monumental piece of work: the reproducibility project (Open Science Collaboration, 2015). In a huge community effort, over 250 researchers directly replicated 100 experiments initially conducted in 2008. Only 39% of the replications were significant at the 5% level. Average effect size estimates were halved. The study design itself—conducting direct replications on a large scale—as well as its outcome are game-changing to the way we view our discipline, but students might wonder: what game were we playing before, and how did we get here?

In this blog post, I provide a selective account of what has been dubbed the “reproducibility crisis”, discussing its potential causes and possible remedies. Concretely, I will argue that adopting Registered Reports, a new publishing format recently also implemented in JEPS (King et al., 2016; see also here), increases scientific rigor, transparency, and thus replicability of research. Wherever possible, I have linked to additional resources and further reading, which should help you contextualize current developments within psychological science and the social and behavioral sciences more generally.

How did we get here?

In 2005, Ioannidis made an intriguing argument: because the prior probability of any hypothesis being true is low, because researchers continuously run low-powered experiments, and because the current publishing system is biased toward significant results, most published research findings are false. Within this context, spectacular fraud cases like that of Diederik Stapel (see here) and the publication of a curious paper about people “feeling the future” (Bem, 2011) made 2011 a “year of horrors” (Wagenmakers, 2012) and toppled psychology into a “crisis of confidence” (Pashler & Wagenmakers, 2012). As argued below, Stapel and Bem are emblematic of two highly interconnected problems of scientific research in general.

Publication bias

Stapel, who faked the results of more than 55 papers, is the reductio ad absurdum of the current “publish or perish” culture[1]. Still, the gold standard to merit publication, certainly in a high-impact journal, is p < .05, which results in publication bias (Sterling, 1959) and file drawers full of nonsignificant results (Rosenthal, 1979; see Lane et al., 2016, for a brave opening; and #BringOutYerNulls). This leads to a biased view of nature, distorting any conclusion we draw from the published literature. In combination with low-powered studies (Cohen, 1962; Button et al., 2013; Fraley & Vazire, 2014), effect size estimates are seriously inflated and can easily point in the wrong direction (Yarkoni, 2009; Gelman & Carlin, 2014). A curious consequence is what Lehrer has titled “the truth wears off” (Lehrer, 2010): initially high estimates of effect size attenuate over time, until nothing is left of them. Just recently, Kaplan and Irvin (2015) reported that the proportion of positive effects in large clinical trials shrank from 57% before 2000 to 8% after 2000. Even a powerful tool like meta-analysis cannot clear the view of a landscape filled with inflated and biased results (van Elk et al., 2015). For example, while meta-analyses concluded that there is a strong ego-depletion effect of Cohen’s d = .63, recent replications failed to find an effect (Lurquin et al., 2016; Sripada et al., in press)[2].

Garden of forking paths

In 2011, Daryl Bem reported nine experiments on people being able to “feel the future” in the Journal of Personality and Social Psychology, the flagship journal of its field (Bem, 2011). Eight of them yielded statistical significance, p < .05. We could dismissively say that extraordinary claims require extraordinary evidence, and try to sail away as quickly as possible from this research area, but Bem would be quick to steal our thunder.

A recent meta-analysis of 90 experiments on precognition yielded overwhelming evidence in favor of an effect (Bem et al., 2015). Alan Turing, discussing research on psi-related phenomena, famously stated that

“These disturbing phenomena seem to deny all our usual scientific ideas. How we should like to discredit them! Unfortunately, the statistical evidence, at least of telepathy, is overwhelming.” (Turing, 1950, p. 453; cf. Wagenmakers et al., 2015)

How is this possible? It’s simple: Not all evidence is created equal. Research on psi provides us with a mirror of “questionable research practices” (John, Loewenstein, & Prelec, 2012) and researchers’ degrees of freedom (Simmons, Nelson, & Simonsohn, 2011), obscuring the evidential value of individual experiments as well as whole research areas[3]. However, it would be foolish to dismiss this as being a unique property of obscure research areas like psi. The problem is much more subtle.

The main issue is that there is a one-to-many mapping from scientific to statistical hypotheses[4]. When doing research, there are many parameters one must set; for example: should observations be excluded? Which control variables should be measured? How should participants’ responses be coded? Which dependent variables should be analyzed? By varying only a small number of these, Simmons et al. (2011) found that the nominal false positive rate of 5% skyrocketed to over 60%. They conclude that the “increased flexibility allows researchers to present anything as significant.” These issues are exacerbated by insufficient methodological detail in research articles, by the low percentage of researchers who share their data (Wicherts et al., 2006; Wicherts, Bakker, & Molenaar, 2011), and in fields that require complicated preprocessing steps, like neuroimaging (Carp, 2012; Cohen, 2016; Luck & Gaspelin, in press).
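This inflation is easy to demonstrate with a toy simulation (not Simmons et al.’s exact design; the sample sizes and decision rules below are invented for illustration). Under a true null effect, analyzing two correlated outcome measures three ways and allowing one round of optional stopping already pushes the false positive rate well above the nominal 5%:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def flexible_study(n=20, r=0.5):
    """One null experiment analyzed with a little flexibility:
    two correlated DVs, their average, and one round of optional
    stopping (add 10 subjects per group if nothing is significant)."""
    cov = [[1, r], [r, 1]]
    a = rng.multivariate_normal([0, 0], cov, size=n)  # group 1, no true effect
    b = rng.multivariate_normal([0, 0], cov, size=n)  # group 2, no true effect

    def any_significant(a, b):
        pvals = [ttest_ind(a[:, 0], b[:, 0]).pvalue,            # DV 1
                 ttest_ind(a[:, 1], b[:, 1]).pvalue,            # DV 2
                 ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue]  # average
        return min(pvals) < 0.05

    if any_significant(a, b):
        return True
    # Optional stopping: collect 10 more subjects per group, test again
    a = np.vstack([a, rng.multivariate_normal([0, 0], cov, size=10)])
    b = np.vstack([b, rng.multivariate_normal([0, 0], cov, size=10)])
    return any_significant(a, b)

# Across many null experiments, well over 5% yield "significance"
false_positive_rate = np.mean([flexible_study() for _ in range(2000)])
```

With only these few degrees of freedom the rate roughly doubles or triples; adding more choices (covariates, exclusion rules, transformations) drives it further toward the 60% that Simmons et al. report.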

An important amendment is that researchers need not be aware of this flexibility; a p value might be misleading even when there is no “p-hacking” and the hypothesis was posited ahead of time (i.e., was not changed after the fact—HARKing; Kerr, 1998). When decisions that are contingent on the data are made in an environment in which different data would lead to different decisions, even when these decisions “just make sense,” there is a hidden multiple comparison problem lurking (Gelman & Loken, 2014). Usually, when conducting N statistical tests, we control for the number of tests in order to keep the false positive rate at, say, 5%. However, in the aforementioned setting, it is not clear what N should be exactly. Thus, results of statistical tests lose their meaning and carry little evidential value in such exploratory settings; they only do so in confirmatory settings (de Groot, 1954/2014; Wagenmakers et al., 2012). This distinction is at the heart of the problem, and it gets obscured because many results in the literature are reported as confirmatory when in fact they may very well be exploratory—most frequently, because of the way scientific reporting is currently done, there is no way for us to tell the difference.

To get a feeling for the many choices possible in statistical analysis, consider a recent paper in which data analysis was crowdsourced to 29 teams (Silberzahn et al., submitted). The question posed to them was whether dark-skinned soccer players are red-carded more frequently. The estimated effect sizes across teams ranged from .83 to 2.93 (odds ratios). Nineteen different analysis strategies were used in total, with 21 unique combinations of covariates; 69% found a significant relationship, while 31% did not.

A reanalysis of Berkowitz et al. (2016) by Michael Frank (2016; blog here) is another, more subtle example. Berkowitz and colleagues report a randomized controlled trial, claiming that solving short numerical problems increases children’s math achievement across the school year. The intervention was well designed and well conducted, but still, Frank found that, as he put it, “the results differ by analytic strategy, suggesting the importance of preregistration.”

Frequently, the issue is with measurement. Malte Elson—whose twitter is highly germane to our topic—has created a daunting website that lists how researchers use the Competitive Reaction Time Task (CRTT), one of the most commonly used tools to measure aggressive behavior. It states that there are 120 publications using the CRTT, which in total analyze the data in 147 different ways!

This increased awareness of researchers’ degrees of freedom and the garden of forking paths is mostly a product of this century, although some authors have expressed this much earlier (e.g., de Groot, 1954/2014; Meehl, 1985; see also Gelman’s comments here). The next point considers an issue much older (e.g., Berkson, 1938), but which nonetheless bears repeating.

Statistical inference

In psychology and much of the social and behavioral sciences in general, researchers overly rely on null hypothesis significance testing and p values to draw inferences from data. However, the statistical community has long known that p values overestimate the evidence against H0 (Berger & Delampady, 1987; Wagenmakers, 2007; Nuzzo, 2014). Just recently, the American Statistical Association released a statement drawing attention to this fact (Wasserstein & Lazar, 2016); that is, in addition to it being easy to obtain p < .05 (Simmons, Nelson, & Simonsohn, 2011), it is also quite a weak standard of evidence overall.
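How weak p = .05 can be is easy to see in a minimal Bayesian sketch (a simple normal-normal model with a unit-information prior, chosen here for illustration; this is not the default test implemented in JASP). Via the Savage-Dickey density ratio, a “just significant” z of 1.96 translates into a Bayes factor of less than 2 in favor of H1:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def bf10_normal(z, prior_sd=1.0):
    """Bayes factor for H1: delta ~ N(0, prior_sd^2) versus H0: delta = 0,
    given one standardized estimate z ~ N(delta, 1).

    Savage-Dickey: BF01 equals the posterior density of delta at 0
    divided by its prior density at 0."""
    # Conjugate normal update: posterior of delta is again normal
    post_var = 1.0 / (1.0 + 1.0 / prior_sd**2)
    post_mean = post_var * z
    bf01 = normal_pdf(0.0, post_mean, sqrt(post_var)) / normal_pdf(0.0, 0.0, prior_sd)
    return 1.0 / bf01

bf_at_p05 = bf10_normal(1.96)  # two-sided p = .05
```

Under this model the evidence at the conventional significance threshold is anecdotal at best, which is exactly the point the ASA statement and the Bayesian reanalyses press on; the precise number depends on the prior, but the qualitative conclusion is robust across reasonable choices.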

The last point is quite pertinent because the statement that 39% of replications in the reproducibility project were “successful” is misleading. A recent Bayesian reanalysis concluded that the original studies themselves found weak evidence in support of an effect (Etz & Vandekerckhove, 2016), reinforcing all points I have made so far.

Notwithstanding the above, p < .05 is still the gold standard in psychology, and is so for intricate historical reasons (cf. Gigerenzer, 1993). At JEPS, we certainly do not want to echo calls or actions to ban p values (Trafimow & Marks, 2015), but we urge students and their instructors to bring more nuance to their use (cf. Gigerenzer, 2004).

Procedures based on classical statistics provide different answers from what most researchers and students expect (Oakes, 1986; Haller & Krauss, 2002; Hoekstra et al., 2014). To be sure, p values have their place in model checking (e.g., Gelman, 2006—are the data consistent with the null hypothesis?), but they are poorly equipped to measure the relative evidence for H1 or H0 brought about by the data; for this, researchers need to use Bayesian inference (Wagenmakers et al., in press). Because university curricula often lag behind current developments, students reading this are encouraged to advance their methodological toolbox by browsing through Etz et al. (submitted) and playing with JASP[5].

Teaching the exciting history of statistics (cf. Gigerenzer et al., 1989; McGrayne, 2012), or at least contextualizing the developments of currently dominating statistical ideas, is a first step away from their cookbook oriented application.

Registered reports to the rescue

While we can only point to the latter, statistical issue, we can actually eradicate the issue of publication bias and the garden of forking paths by introducing a new publishing format called Registered Reports. This format was initially introduced to the journal Cortex by Chris Chambers (Chambers, 2013), and it is now offered by more than two dozen journals in the fields of psychology, neuroscience, psychiatry, and medicine (link). Recently, we have also introduced this publishing format at JEPS (see King et al., 2016).

Specifically, researchers submit a document including the introduction, theoretical motivation, experimental design, data preprocessing steps (e.g., outlier removal criteria), and the planned statistical analyses prior to data collection. Peer review focuses solely on the merit of the proposed study and the adequacy of the statistical analyses[6]. If there is sufficient merit to the planned study, the authors are guaranteed in-principle acceptance (Nosek & Lakens, 2014). Upon receiving this acceptance, researchers carry out the experiment and submit the final manuscript. Deviations from the first submission must be discussed, and additional statistical analyses are labeled exploratory.

In sum, by publishing regardless of the outcome of the statistical analysis, Registered Reports eliminate publication bias; by specifying the hypotheses and analysis plan beforehand, they make apparent the distinction between exploratory and confirmatory studies (de Groot, 1954/2014), avoid the garden of forking paths (Gelman & Loken, 2014), and guard against post-hoc theorizing (Kerr, 1998).

Even though Registered Reports are commonly associated with high power (80–95%), this is unfeasible for student research. However, note that a single study cannot be decisive in any case. Reporting sound, hypothesis-driven, not-cherry-picked research can be important fuel for future meta-analyses (for an example, see Scheibehenne, Jamil, & Wagenmakers, in press).

To avoid possible confusion, note that preregistration is different from Registered Reports: the former is the act of specifying the methodology before data collection, while the latter is a publishing format. You can preregister your study on several platforms, such as the Open Science Framework or AsPredicted. Registered Reports include preregistration but go further, adding benefits such as peer review prior to data collection and in-principle acceptance.

Conclusion

In sum, there are several issues impeding progress in psychological science, most pressingly the failure to distinguish between exploratory and confirmatory research, and publication bias. A new publishing format, Registered Reports, provides a powerful means to address them both, and, to borrow a phrase from Daniel Lakens, enable us to “sail away from the seas of chaos into a corridor of stability” (Lakens & Evers, 2014).

Suggested Readings

  • Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
  • Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632-638.
  • Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460-465.
  • King, M., Dablander, F., Jakob, L., Agan, M., Huber, F., Haslbeck, J., & Brecht, K. (2016). Registered Reports for Student Research. Journal of European Psychology Students, 7(1), 20-23
  • Twitter (or you might miss out)

Footnotes

[1] Incidentally, Diederik Stapel published a book about his fraud. See here for more.

[2] Baumeister (2016) is a perfect example of how not to respond to such a result. Michael Inzlicht shows how to respond adequately here.

[3] For a discussion of these issues with respect to the precognition meta-analysis, see Lakens (2015) and Gelman (2014).

[4] Another related, crucial point is the lack of theory in psychology. However, as this depends on whether you read the Journal of Mathematical Psychology or, say, Psychological Science, it is not addressed further. For more on this point, see for example Meehl (1978), Gigerenzer (1998), and a class by Paul Meehl which has been kindly converted to mp3 by Uri Simonsohn.

[5] However, it would be premature to put too much blame on p. More pressingly, the misunderstandings and misuse of this little fellow point towards a catastrophic failure in undergraduate teaching of statistics and methods classes (for the latter, see Richard Morey’s recent blog post). Statistics classes in psychology are often boringly cookbook oriented, and so students just learn the cookbook. If you are an instructor, I urge you to have a look at “Statistical Rethinking” by Richard McElreath. In general, however, statistics is hard, and there are many issues transcending the frequentist versus Bayesian debate (for examples, see Judd, Westfall, and Kenny, 2012; Westfall & Yarkoni, 2016).

[6] Note that JEPS already publishes research regardless of whether p < .05. However, this does not discourage us from drawing attention to this benefit of Registered Reports, especially because most other journals have a different policy.

This post was edited by Altan Orhon.


Structural equation modeling: What is it, what does it have in common with hippie music, and why does it eat cake to get rid of measurement error?

Do you want a statistics tool that is powerful; easy to learn; allows you to model complex data structures; combines the t-test, analysis of variance, and multiple regression; and puts even more on top? Here it is! Statistics courses in psychology today often cover structural equation modeling (SEM), a statistical tool that allows one to go beyond classical statistical models by combining them and adding more. Let’s explore what this means, what SEM really is, and SEM’s surprising parallels with hippie culture!

Peter Edelsbrunner

Peter is currently a doctoral student at the section for learning and instruction research of ETH Zurich in Switzerland. He graduated in Psychology from the University of Graz in Austria. Peter is interested in conceptual knowledge development and the application of flexible mixture models to developmental research. Since 2011 he has been active in the EFPSA European Summer School and related activities.


Introducing JASP: A free and intuitive statistics software that might finally replace SPSS

Are you tired of SPSS’s confusing menus and of the ugly tables it generates? Are you annoyed by having statistical software only at university computers? Would you like to use advanced techniques such as Bayesian statistics, but you lack the time to learn a programming language (like R or Python) because you prefer to focus on your research?

While there was no real solution to this problem for a long time, there is now good news for you! A group of researchers at the University of Amsterdam are developing JASP, a free open-source statistics package that includes both standard and more advanced techniques and puts major emphasis on providing an intuitive user interface.

The current version already supports a large array of analyses, including the ones typically used by researchers in the field of psychology (e.g., ANOVA, t-tests, multiple regression).

In addition to being open source, freely available for all platforms, and providing a considerable number of analyses, JASP also comes with several neat, distinctive features, such as real-time computation and display of all results. For example, if you decide that you want not only the mean but also the median in the table, you can tick “Median” and the medians appear immediately in the results table. For comparison, think how this works in SPSS: first, you must navigate a forest of menus (or edit the syntax); then you execute the new syntax, a new window appears, and you get a new (ugly) table.


In JASP, you get better-looking tables in no time. Click here to see a short demonstration of this feature. But it gets even better—the tables are already in APA format and you can copy and paste them into Word. Sounds too good to be true, doesn’t it? It does, but it works!

Interview with lead developer Jonathon Love

Where is this software project coming from? Who pays for all of this? And what plans are there for the future? There is nobody who could answer these questions better than the lead developer of JASP, Jonathon Love, who was so kind as to answer a few questions about JASP.

How did development on JASP start? How did you get involved in the project?

All through my undergraduate program, we used SPSS, and it struck me just how suboptimal it was. As a software designer, I find poorly designed software somewhat distressing to use, and so SPSS was something of a thorn in my mind for four years. I was always thinking things like, “Oh, what? I have to completely re-run the analysis, because I forgot X?,” “Why can’t I just click on the output to see what options were used?,” “Why do I have to read this awful syntax?,” or “Why have they done this like this? Surely they should do this like that!”

At the same time, I was working for Andrew Heathcote, writing software for analyzing response time data. We were using the R programming language and so I was exposed to this vast trove of statistical packages that R provides. On one hand, as a programmer, I was excited to gain access to all these statistical techniques. On the other hand, as someone who wants to empower as many people as possible, I was disappointed by the difficulty of using R and by the very limited options to provide a good user interface with it.

So I saw that there was a real need for both of these things—software providing an attractive, free, and open statistics package to replace SPSS, and a platform for methodologists to publish their analyses with rich, accessible user interfaces. However, the project was far too ambitious to consider without funding, and so I couldn’t see any way to do it.

Then I met E.J. Wagenmakers, who had just received a European Research Council grant to develop an SPSS-like software package to provide Bayesian methods, and he offered me the position to develop it. I didn’t know a lot about Bayesian methods at the time, but I did see that our goals had a lot of overlap.

So I said, “Of course, we would have to implement classical statistics as well,” and E.J.’s immediate response was, “Nooooooooooo!” But he quickly saw how significant this would be. If we can liberate the underlying platform that scientists use, then scientists (including ourselves) can provide whatever analyses we like.

And so that was how the JASP project was born, and how the three goals came together:

  • to provide a liberated (free and open) alternative to SPSS
  • to provide Bayesian analyses in an accessible way
  • to provide a universal platform for publishing analyses with accessible user interfaces


What are the biggest challenges for you as a lead developer of JASP?

Remaining focused. There are hundreds of goals, and hundreds of features that we want to implement, but we must prioritize ruthlessly. When will we implement factor analysis? When will we finish the SEM module? When will data entry, editing, and restructuring arrive? Outlier exclusion? Computing of variables? These are all such excellent, necessary features; it can be really hard to decide what should come next. Sometimes it can feel a bit overwhelming too. There’s so much to do! I have to keep reminding myself how much progress we’re making.

Maintaining a consistent user experience is a big deal too. The JASP team is really large. To give you an idea, in addition to myself there’s:

  • Ravi Selker, developing the frequentist analyses
  • Maarten Marsman, developing the Bayesian ANOVAs and Bayesian linear regression
  • Tahira Jamil, developing the classical and Bayesian contingency tables
  • Damian Dropmann, developing the file save, load functionality, and the annotation system
  • Alexander Ly, developing the Bayesian correlation
  • Quentin Gronau, developing the Bayesian plots and the classical linear regression
  • Dora Matzke, developing the help system
  • Patrick Knight, developing the SPSS importer
  • Eric-Jan Wagenmakers, coming up with new Bayesian techniques and visualizations

With such a large team, developing the software and all the analyses in a consistent and coherent way can be really challenging. It’s so easy for analyses to end up a mess of features, and for every subsequent analysis we add to look nothing like the last. Of course, providing as elegant and consistent a user experience as possible is one of our highest priorities, so we put a lot of effort into this.


How do you imagine JASP five years from now?

JASP will provide the same, silky, sexy user experience that it does now. However, by then it will have full data entering, editing, cleaning, and restructuring facilities. It will provide all the common analyses used through undergraduate and postgraduate psychology programs. It will provide comprehensive help documentation, an abundance of examples, and a number of online courses. There will be textbooks available. It will have a growing community of methodologists publishing the analyses they are developing as additional JASP modules, and applied researchers will have access to the latest cutting-edge analyses in a way that they can understand and master. More students will like statistics than ever before.


How can JASP stay up to date with state-of-the-art statistical methods? Even when borrowing implementations written in R and the like, these always have to be implemented by you in JASP. Is there a solution to this problem?

Well, if SPSS has taught us anything, you really don’t need to stay up to date to be a successful statistical product, ha-ha! The plan is to provide tools for methodologists to write add-on modules for JASP—tools for creating user interfaces and tools to connect these user interfaces to their underlying analyses. Once an add-on module is developed, it can appear in a directory, or a sort of “App Store,” and people will be able to rate the software for different things: stability, user-friendliness, attractiveness of output, and so forth. In this way, we hope to incentivize a good user experience as much as possible.

Some people think this will never work—that methodologists will never put in all that effort to create nice, useable software (because it does take substantial effort). But I think that once methodologists grasp the importance of making their work accessible to as wide an audience as possible, it will become a priority for them. For example, consider the following scenario: Alice provides a certain analysis with a nice user interface. Bob develops an analysis that is much better than Alice’s analysis, but everyone uses Alice’s, because hers is so easy and convenient to use. Bob is upset because everyone uses Alice’s instead of his. Bob then realizes that he has to provide a nice, accessible user experience for people to use his analysis.

I hope that we can create an arms race in which methodologists will strive to provide as good a user experience as possible. If you develop a new method and nobody can use it, have you really developed a new method? Of course, this sort of add-on facility isn’t ready yet, but I don’t think it will be too far away.


You mention on your website that many more methods will be included, such as structural equation modeling (SEM) or tools for data manipulation. How can you both offer a large amount of features without cluttering the user interface in the future?

Currently, JASP uses a ribbon arrangement; we have a “File” tab for file operations, and we have a “Common” tab that provides common analyses. As we add more analyses (and as other people begin providing additional modules), these will be provided as additional tabs. The user will be able to toggle on or off which tabs they are interested in. You can see this in the current version of JASP: we have a proof-of-concept SEM module that you can toggle on or off on the options page. JASP thus provides you only with what you actually need, and the user interface can be kept as simple as you like.


Students who are considering switching to JASP might want to know whether the future of JASP development is secured or dependent on getting new grants. What can you tell us about this?

JASP is currently funded by a European Research Council (ERC) grant, and we’ve also received some support from the Centre for Open Science. Additionally, the University of Amsterdam has committed to providing us a software developer on an ongoing basis, and we’ve just run our first annual Bayesian Statistics in JASP workshop. The money we charge for these workshops is plowed straight back into JASP’s development.

We’re also developing a number of additional strategies to increase the funding that the JASP project receives. Firstly, we’re planning to provide technical support to universities and businesses that make use of JASP, for a fee. Additionally, we’re thinking of simply asking universities to contribute the cost of a single SPSS license to the JASP project. It would represent an excellent investment; it would allow us to accelerate development, achieve feature parity with SPSS sooner, and allow universities to abandon SPSS and its costs sooner. So I don’t worry about securing JASP’s future; I’m thinking about how we can expand it.

Of course, all of this depends on people actually using JASP, and that will come down to the extent that the scientific community decides to use and get behind the JASP project. Indeed, the easiest way that people can support the JASP project is by simply using and citing it. The more users and the more citations we have, the easier it is for us to obtain funding.

Having said all that, I’m less worried about JASP’s future development than I am about SPSS’s! There’s almost no evidence that any development work is being done on it at all! Perhaps we should pass the hat around for IBM.


What is the best way to get started with JASP? Are there tutorials and reproducible examples?

For classical statistics, if you’ve used SPSS, or if you have a book on statistics in SPSS, I don’t think you’ll have any difficulty using JASP. It’s designed to be familiar to users of SPSS, and our experience is that most people have no difficulty moving from SPSS to JASP. We also have a video on our website that demonstrates some basic analyses, and we’re planning to create a whole series of these.

As for the Bayesian statistics, that’s a little more challenging. Most of our effort has been going into getting the software ready, so we don’t have as many resources for learning Bayesian statistics ready as we would like. This is something we’ll be looking at addressing in the next six to twelve months. E.J. has at least one (maybe three) books planned.

That said, there are a number of resources available now, such as:

  • Alexander Etz’s blog
  • E.J.’s website provides a number of papers on Bayesian statistics (his website also serves as a reminder of what the internet looked like in the ’80s)
  • Zoltan Dienes’s book is also great for Bayesian statistics

However, the best way to learn Bayesian statistics is to come to one of our Bayesian Statistics with JASP workshops. We’ve run two so far and they’ve been very well received. Some people have been reluctant to attend—because JASP is so easy to use, they didn’t see the point of coming and learning it. Of course, that’s the whole point! JASP is so easy to use, you don’t need to learn the software, and you can completely concentrate on learning the Bayesian concepts. So keep an eye out on the JASP website for the next workshop. Bayes is only going to get more important in the future. Don’t be left behind!


Jonas Haslbeck


Jonas is a Senior Editor at the Journal of European Psychology Students. He is currently a PhD student in psychological methods at the University of Amsterdam, The Netherlands. For further info see http://jmbh.github.io/.



Bayesian Statistics: Why and How


Bayesian statistics is what all the cool kids are talking about these days. Upon closer inspection, this does not come as a surprise. In contrast to classical statistics, Bayesian inference is principled, coherent, unbiased, and addresses an important question in science: which of my hypotheses should I believe in, and how strongly, given the collected data? Continue reading
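To make that question concrete, here is a minimal sketch (my own illustration, not from the post itself; the hypotheses, priors, and data are all invented) of how Bayes’ rule assigns belief to two competing hypotheses:

```python
# Illustrative sketch: two invented hypotheses about a coin,
# "fair" (p = 0.5) vs "biased" (p = 0.8), with equal prior belief.
from math import comb

def binom_lik(p, heads, n):
    # Likelihood of observing `heads` heads in `n` flips given bias p
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

heads, n = 8, 10  # invented data: 8 heads in 10 flips
prior = {"fair": 0.5, "biased": 0.5}
lik = {"fair": binom_lik(0.5, heads, n), "biased": binom_lik(0.8, heads, n)}

# Bayes' rule: posterior is proportional to prior times likelihood,
# normalised over the candidate hypotheses
evidence = sum(prior[h] * lik[h] for h in prior)
posterior = {h: prior[h] * lik[h] / evidence for h in prior}
print(posterior)  # the data shift belief towards "biased"
```

After eight heads in ten flips, the posterior probability of the biased-coin hypothesis is about .87 — a direct statement of how strongly to believe each hypothesis given the data, which is exactly what the post argues classical p-values cannot provide.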

Fabian Dablander

Fabian Dablander is currently finishing his thesis in Cognitive Science at the University of Tübingen and Daimler Research & Development on validating driving simulations. He is interested in innovative ways of data collection, Bayesian statistics, open science, and effective altruism. You can find him on Twitter @fdabl.



Of Elephants and Effect Sizes – Interview with Geoff Cumming

We all know these crucial moments while analysing our hard-earned data – the moment of truth – is there a star above the small p? Maybe even two? Can you write a nice and simple paper, or do you have to bend over backwards to explain why people, surprisingly, do not behave the way you thought they would? It all depends on those little stars, below or above .05, significant or not, black or white. Continue reading
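The trouble with those little stars can be seen in a few lines of code. The sketch below (my own illustration, not from the interview; the effect size and sample sizes are invented) runs a one-sample z-test on the same fixed effect size at different sample sizes — the p-value, and hence the star, depends on how many participants you tested, not only on how big the effect is:

```python
# Illustrative sketch: the same invented effect size (d = 0.2) comes out
# "non-significant" or "significant" depending only on sample size.
from math import erf, sqrt

def two_sided_p(effect_size, n):
    z = effect_size * sqrt(n)            # one-sample z-test statistic
    phi = 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    return 2 * (1 - phi)                 # two-sided p-value

for n in (50, 200, 800):
    print(f"n = {n}: p = {two_sided_p(0.2, n):.4f}")
```

With n = 50 the effect “fails” (p ≈ .16); with n = 200 it earns its stars (p ≈ .005) – yet the effect size never changed, which is why Cumming argues that effect sizes and their estimates tell you more than the stars do.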

Katharina Brecht


Aside from her role as Editor-in-Chief of the Journal of European Psychology Students, Katharina is currently pursuing her PhD at the University of Cambridge. Her research interests revolve around the evolution and development of social cognition.



A Psychologist’s Guide to Reading a Neuroimaging Paper

Psychological research is benefiting from advances in neuroimaging techniques. These have contributed to the validation and falsification of established hypotheses in psychological science (Cacioppo, Berntson, & Nusbaum, 2008). They have also helped nurture links with neuroscience, leading to more comprehensive explanations of established theories. Positron Emission Tomography (PET), functional MRI (fMRI), structural MRI (sMRI), electroencephalography (EEG), diffusion tensor imaging (DTI), and numerous other lesser-known neuroimaging techniques can provide information complementary to behavioural data (Wager, 2006). With these modalities of research becoming more prevalent, ranging from investigating the neural effects of mindfulness training to neurodegeneration, it is worth taking a moment to highlight some points that help discern good from poor research. Like any other methodology, neuroimaging is a great tool that can be used poorly. As with all areas of science, one must exercise a good degree of caution when reading neuroimaging papers. Continue reading

Niall Bourke

Niall Bourke is a psychology graduate from Ireland. Having gained experience working with individuals with acquired brain injuries, he moved to London to complete an MSc in Neuroimaging at the Institute of Psychiatry, Psychology and Neuroscience (IoPPN). He now works as a research worker at the IoPPN, and as a visiting researcher at the University of Southampton on a developmental neuropsychology project.



Bayesian Statistics: What is it and Why do we Need it?

There is a revolution in statistics happening: The Bayesian revolution. Psychology students who are interested in research methods (which I hope everyone is!) should know what this revolution is about. Gaining this knowledge now instead of later might spare you lots of misconceptions about statistics as it is usually taught in psychology, and it might help you gain a deeper understanding of the foundations of statistics. To make sure that you can try out everything you learn immediately, I conducted the analyses in the free statistics software R (www.r-project.org; click HERE for a tutorial on how to get started with R, and install RStudio for an enhanced R experience) and I provide the syntax for the analyses directly in the article so you can easily try them out. So let’s jump in: What is “Bayesian Statistics”, and why do we need it? Continue reading

Peter Edelsbrunner


Peter is currently a doctoral student at the section for learning and instruction research at ETH Zurich in Switzerland. He graduated in Psychology from the University of Graz in Austria. Peter is interested in conceptual knowledge development and the application of flexible mixture models to developmental research. Since 2011 he has been active in the EFPSA European Summer School and related activities.

