Imagine a task that is simple for a human but difficult for a computer. For example, recognizing whether a photograph contains a cat or a dog is straightforward even for a child a few months old (Quinn & Eimas, 1996), but extremely difficult, if not impossible, for a computer (Shotton et al., 2006), because the two animals are quite similar in shape. To capitalize on humans’ superiority over computers at such tasks, Amazon.com, Inc. (NASDAQ: AMZN) created a platform called Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome), where it is possible to ask human workers to complete HITs – Human Intelligence Tasks.
On Amazon Mechanical Turk (MTurk for short), the people completing the tasks (i.e. HITs) are called Mechanical Turk Workers, or simply MTurk workers. A HIT can be almost anything, from photo categorization or moderation to data verification, tagging, transcription, or translation. For most researchers, however, the most useful HIT is gathering answers to their surveys and online experiments. Because MTurk promotes its service as giving access to a “global, on demand, 24 x 7 workforce,” where “thousands of HITs [are] completed in minutes” and you pay only “when you’re satisfied with the results,” turnaround times for data collection are slashed to a few days and costs are trimmed. This leaves a researcher with more time for thorough experimental design, data analysis, and theory development. MTurk is a prime example of a so-called crowdsourcing internet marketplace, where a requester outsources a task to distributed groups of people “en masse.”
For example, I recently conducted research on the recognition of non-prototypical facial expressions of emotion (Lewinski, 2012) in which the entire data collection took place on MTurk. In Study 1, I asked 42 US-based MTurk workers – 16 men and 26 women (average age = 51 years) – to complete an experiment that I had previously created in a web-based survey tool (in this case, Qualtrics). On average, it took 7 minutes to finish the task (i.e. HIT). In the end, I calculated that Study 1 (n = 42) cost less than $11, which is significantly less than the standard reimbursement of $5 to $10 per person for a US population (http://www.dol.gov/whd/minwage/america.htm).
So MTurk allowed me to cut not only costs (an effective hourly rate of ca. $2.50/hour) but also time – only about 15 hours were needed to gather data from 42 participants. As an undergraduate student, I helped with quite a few stationary, or “off-line,” experiments, but since 2011 I have been using MTurk exclusively for my own research, even for such quality-sensitive data collection as recording facial-expression responses to persuasive stimuli – recently accepted for publication (Lewinski, Fransen & Tan, in press). In the following sections, I review the basic principles and characteristics of MTurk and similar crowdsourcing platforms. I also outline the pros, cons, and ethics involved in using those platforms. Finally, I discuss how to create a good experiment to post on a crowdsourcing platform.
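The cost and time figures reported above can be reproduced with simple back-of-the-envelope arithmetic. A minimal sketch (the per-HIT payment split is not given in the text, so only the aggregate figures reported above are used):

```python
# Back-of-the-envelope check of the Study 1 figures reported above.
n_workers = 42          # sample size of Study 1
minutes_per_hit = 7     # average completion time reported above
total_cost_usd = 11.0   # upper bound on total cost reported above

cost_per_worker = total_cost_usd / n_workers              # ~ $0.26 per HIT
effective_hourly_rate = cost_per_worker / (minutes_per_hit / 60)

print(f"cost per worker: ${cost_per_worker:.2f}")
print(f"effective hourly rate: ${effective_hourly_rate:.2f}/hour")
```

The computed rate comes out around $2.25/hour, consistent with the “ca. $2.50/hour” figure in the text once rounding of the total cost is taken into account.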
Crowdsourcing might be defined as a form of outsourcing a task to distributed groups of people “en masse” (e.g., Mason & Suri, 2011; Paolacci, Chandler & Ipeirotis, 2010). As such, the requester cannot individually identify the participants. The requester posts a task in the form of an open call to eligible participants (e.g. those who meet location or demographic criteria) on an online labor market, where participants can choose among thousands of possible tasks from many requesters. Notably, there has been a rise in scientific studies using internet crowdsourcing marketplaces (Birnbaum, 2004), distributing data collection into so-called HITs – Human Intelligence Tasks.
HITs were introduced on a large scale by the web service Amazon Mechanical Turk (MTurk; https://www.mturk.com/mturk/welcome), a prime example of a crowdsourcing marketplace on the World Wide Web. On a daily basis, more than 150,000 HITs are offered; on August 26, 2013, there were 223,647 HITs available. Participants can access only the tasks for which they meet the specific requirements. According to Paolacci, Chandler and Ipeirotis (2010), the demographics of the U.S.-based Mechanical Turk population in 2010 were as follows: the average age was 36 years; there were more females (65%) than males (35%); and income levels were slightly lower than in the general U.S. population. Notably, participants on MTurk tend to represent the U.S. population better than traditional university subject pools. However, Ross et al. (2010, April) present data showing that MTurk demographics have changed over the past few years: in 2008, over 80% of MTurk workers came from the U.S. and 5% from India; by 2010, the proportions had swapped to 39% and 46%, respectively. As of 2011, there were over 500,000 registered workers from over 190 countries (Natala@AWS, 2011).
Online vs. off-line samples
This new research technology has proven to be an easier (e.g., Chilton, Horton, Miller & Azenkot, 2009), faster (e.g., Mason & Watts, 2009; Marge, Banerjee & Rudnicky, 2010) and more cost-efficient way to recruit and pay participants for experiments. Paolacci, Chandler and Ipeirotis (2010) and Berinsky, Huber and Lenz (2012) compared data gathered on the crowdsourcing marketplace MTurk with data gathered through standard offline methods – that is, a convenience sample of university students – and concluded that the data collected online is of high quality. Further literature (Buhrmester, Kwang & Gosling, 2011; Gaggioli & Riva, 2008; Oppenheimer, Meyvis & Davidenko, 2009) suggests that if there are any differences between crowdsourced and standard participants, those differences do not threaten the validity of this novel method.
In 2009, a strong competitor to MTurk emerged: CrowdFlower (https://crowdflower.com/). CrowdFlower describes itself as the “world’s leading crowdsourcing service, with over one billion tasks completed by five million contributors. We specialize in microtasking: distributing small, discrete tasks to many online contributors in assembly line fashion” (CrowdFlower, 2013). This short description contains many important facts, but one important piece is missing: in reality, CrowdFlower is an “aggregating” platform that gives access to more than 100 different Channel Partners. This service makes two things possible. First, at https://crowdflower.com/partners one can find one of the most comprehensive lists of crowdsourcing internet marketplaces, enumerating all the independent providers that CrowdFlower works with; one may then look up those Channel Partners and access their services directly. Second, CrowdFlower makes it possible to target all those independent providers at once: with a CrowdFlower account, one can post tasks accessible to hundreds of thousands of workers, including those from, e.g., MTurk. On a more technical point, CrowdFlower allows requesters from any part of the world to post tasks. This is not the case with MTurk, where a requester must be US-based.
As outlined above, crowdsourcing seems to be a very good data-gathering method for a researcher. However, there are ethical issues involved in conducting this type of research, and in online research in general. One of the most frequently raised issues is the relatively low pay (Fort, Adda & Cohen, 2011), which on MTurk averages $1.53/hour (Ross et al., 2010, April) – below the U.S. minimum wage. On the other hand, one may argue that the tasks performed by online workers at the request of a task provider (e.g. a researcher) are “mini” contracts to carry out concrete activities (usually ~15 minutes in the case of an experiment) and as such cannot be viewed as traditional “jobs.” For example, Ross et al. (2010, April) report that in 2010 only 3% of U.S.-based MTurk workers agreed with a statement about reliance on MTurk income – “MTurk money is always necessary to make basic ends meet” – while the highest percentage, 49%, agreed that “MTurk money is nice, but doesn’t materially change my circumstances.” It seems that most MTurk workers see the platform as a way to earn a little extra, to entertain themselves, and sometimes as a slightly peculiar pastime, as argued by Mason and Suri (2011). The same researchers also note that there are good reasons to reimburse online participants less than in lab-based experiments: online participants do not have to align their schedules with the experiment, nor do they incur any travel costs. In economic terms, therefore, participants on crowdsourcing platforms face a smaller opportunity cost.
In the previous sections, I have described what crowdsourcing is. However, in order to run a good experiment, a researcher needs more tools than MTurk or other crowdsourcing platforms alone. The most popular and robust platform for creating online experiments seems to be Qualtrics (http://www.qualtrics.com/).
Qualtrics is a web-based tool for creating surveys. Qualtrics boasts that “5,000 organizations and 95 of the top 100 business schools love (…)” their product. In addition, at least in my experience, most universities have bought access to it for their researchers. The tool is simple to use, has a powerful GUI (Graphical User Interface), and allows sophisticated experimental designs. There are more tools like Qualtrics, e.g. SurveyMonkey. In the end, the message is simple: to use crowdsourcing, one must also have a good survey tool with which it is possible to prepare robust and efficient experiments, just as in the lab.
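In practice, pairing a survey tool with MTurk means publishing the survey as an “external question” HIT that frames the survey page. A minimal sketch of that workflow, with several stated assumptions: the survey URL, title, and reward are hypothetical; and the example uses the modern AWS SDK for Python (boto3), which postdates this text, merely as one concrete way to automate the step that can also be done manually through the requester web interface:

```python
# Sketch: posting an externally hosted survey (e.g. a Qualtrics study)
# as an MTurk HIT. The URL and payment values below are hypothetical.

SURVEY_URL = "https://example.qualtrics.com/jfe/form/SV_hypothetical"  # hypothetical

# MTurk's ExternalQuestion format: an XML wrapper that makes the HIT
# display the external survey page inside a frame.
external_question = f"""
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>{SURVEY_URL}</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>""".strip()

hit_params = {
    "Title": "7-minute psychology survey",           # hypothetical
    "Description": "Answer a short online survey.",  # hypothetical
    "Reward": "0.25",                 # USD per assignment (hypothetical)
    "MaxAssignments": 42,             # e.g. the sample size of Study 1 above
    "AssignmentDurationInSeconds": 30 * 60,
    "LifetimeInSeconds": 24 * 60 * 60,  # HIT stays visible for one day
    "Question": external_question,
}

# With AWS credentials configured, the HIT would be created like this:
# import boto3
# client = boto3.client("mturk", region_name="us-east-1")
# response = client.create_hit(**hit_params)
```

The actual network call is left commented out; the point is only to show how the survey tool and the crowdsourcing platform divide the work: the survey tool hosts the experiment, while the platform handles recruitment and payment.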
Crowdsourcing is another tool in the researcher’s repertoire. Some kinds of research cannot be run on crowdsourcing platforms, namely experiments that are not purely computer-based. For example, it is hard to imagine measuring brainwaves or galvanic skin response other than with the equipment available in lab-based experiments. However, it is possible to crowdsource some physiological reactions. As already mentioned, I recently conducted a study in which we recorded and analyzed facial expressions of emotion in reaction to video stimuli through crowdsourcing platforms (Lewinski, Fransen & Tan, in press), and we were not the first (see e.g. McDuff, El Kaliouby & Picard, 2011, November; 2012). In addition, in the near future it should be possible to remotely crowdsource heart rate and heart rate variability through a webcam using remote PPG (photoplethysmography) (e.g. Scully et al., 2012). For example, the Dutch government has awarded Dr. Daniël Lakens (TU/e) and me a STW Valorization Grant Phase 1 (valorization – commercializing an idea developed at a university) to develop “Heart2Bit – Measuring Emotions with a Webcam through Remote Photoplethysmography.” Overall, for now, crowdsourcing is mostly used by researchers to collect self-reported answers, but in the near future it will be possible to collect more and more physiological data. The age of big data (Manyika et al., 2011), which might be defined as a “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (Big data, n.d.), seems to be knocking at the doors of labs all over the world.
Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2012). Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Political Analysis, 20(3), 351-368.
Big data. (n.d.). In Wikipedia. Retrieved August 28, 2013, from
Birnbaum, M. H. (2004). Human research and data collection via the internet. Annual Review of Psychology, 55, 803-832.
Buhrmester, M. D., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6, 3-5.
Chilton, L. B., Horton, J. J., Miller, R. C., & Azenkot, S. (2009). Task search in a human computation market. In P. Bennett, R. Chandrasekar, M. Chickering, P. Ipeirotis, E. Law, A. Mityagin, F. Provost & L. von Ahn (Eds.), HCOMP ’09: Proceedings of the ACM SIGKDD Workshop on Human Computation (pp. 1–9). New York: ACM.
CrowdFlower (2013). Our Company. Retrieved from https://crowdflower.com/company
Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon mechanical turk: Gold mine or coal mine?. Computational Linguistics, 37(2), 413-420.
Gaggioli, A., & Riva, G. (2008). Working the crowd. Science, 321, 1443.
Lewinski, P. (2012). The false recognition of the nonprototypical facial expressions of emotion (Unpublished master’s thesis). Faculty of Psychology, University of Warsaw, Poland.
Lewinski, P., Fransen, M. L., & Tan, E. S. H. (in press). Predicting advertising effectiveness by facial expressions in response to amusing persuasive stimuli. Journal of Neuroscience, Psychology, and Economics.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., … & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Marge, M., Banerjee, S., & Rudnicky, A. I. (2010). Using the Amazon Mechanical Turk for transcription of spoken language. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5270-5273). Washington: Institute of Electrical and Electronics Engineers.
Mason, W., & Suri, S. (2011). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods, 44(1), 1–23. doi: 10.3758/s13428-011-0124-6
McDuff, D., El Kaliouby, R., & Picard, R. (2011, November). Crowdsourced data collection of facial responses. In Proceedings of the 13th international conference on multimodal interfaces (pp. 11-18). ACM.
McDuff, D., El Kaliouby, R., & Picard, R. W. (2012). Crowdsourcing Facial Responses to Online Videos. IEEE Transactions on Affective Computing, 456-468.
Natala@AWS (2011). Re: MTurk CENSUS: About how many workers were on Mechanical Turk in 2010?. Retrieved from https://forums.aws.amazon.com/thread.jspa?threadID=58891#.
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872.
Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411–419.
Quinn, P. C., & Eimas, P. D. (1996). Perceptual cues that permit categorical differentiation of animal species by infants. Journal of Experimental Child Psychology, 63, 189-211.
Ross, J., Irani, L., Silberman, M., Zaldivar, A., & Tomlinson, B. (2010, April). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In CHI ’10 Extended Abstracts on Human Factors in Computing Systems (pp. 2863-2872). ACM.
Scully, C. G., Lee, J., Meyer, J., Gorbach, A. M., Granquist-Fraser, D., Mendelson, Y., & Chon, K. H. (2012). Physiological parameter monitoring from optical recordings with a mobile phone. IEEE Transactions on Biomedical Engineering, 59(2), 303-306.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Computer Vision–ECCV 2006 (pp. 1-15). Springer Berlin Heidelberg.
This contribution is based in part on the presentation:
Lewinski, P. (2012, August). Gathering data online: Cutting cost & time. Paper presented at the meeting of the 2011–2012 Junior Researcher Programme, Selwyn College, Cambridge.
This contribution is intended as a written summary for a research meeting about data collection agencies / respondent pools in the Persuasive Communication group within the Amsterdam School of Communication Research, University of Amsterdam, September 19, 2013.