Journal ArticleDOI

Evaluating Amazon's Mechanical Turk as a Tool for Experimental Behavioral Research

13 Mar 2013-PLOS ONE (Public Library of Science)-Vol. 8, Iss: 3, pp 1-18
TL;DR: This paper replicates a diverse body of tasks from experimental psychology, including the Stroop, Switching, Flanker, Simon, Posner Cuing, attentional blink, subliminal priming, and category learning tasks, using participants recruited via AMT.
Abstract: Amazon Mechanical Turk (AMT) is an online crowdsourcing service where anonymous online workers complete web-based tasks for small sums of money. The service has attracted attention from experimental psychologists interested in gathering human subject data more efficiently. However, relative to traditional laboratory studies, many aspects of the testing environment are not under the experimenter's control. In this paper, we attempt to empirically evaluate the fidelity of the AMT system for use in cognitive behavioral experiments. These types of experiments differ from simple surveys in that they require multiple trials, sustained attention from participants, comprehension of complex instructions, and millisecond accuracy for response recording and stimulus presentation. We replicate a diverse body of tasks from experimental psychology including the Stroop, Switching, Flanker, Simon, Posner Cuing, attentional blink, subliminal priming, and category learning tasks using participants recruited using AMT. While most of the replications were qualitatively successful and validated the approach of collecting data anonymously online using a web browser, others revealed disparity between laboratory results and online results. A number of important lessons were encountered in the process of conducting these replications that should be of value to other researchers.
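To make concrete the kind of multi-trial, millisecond-resolution analysis these tasks depend on, here is a minimal illustrative sketch (not code from the paper) that simulates congruent and incongruent Stroop reaction times and computes the congruency effect; the trial counts, means, and standard deviations are hypothetical.

```python
# Illustrative sketch only (not code from the paper): a Stroop congruency effect
# computed from simulated trial-level reaction times. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials = 200  # hypothetical trials per condition for one participant

# Simulated RTs in milliseconds; incongruent trials assumed ~60 ms slower.
congruent_rt = rng.normal(loc=650, scale=120, size=n_trials)
incongruent_rt = rng.normal(loc=710, scale=120, size=n_trials)

stroop_effect = incongruent_rt.mean() - congruent_rt.mean()
t_stat, p_val = stats.ttest_ind(incongruent_rt, congruent_rt)

print(f"mean congruent RT:   {congruent_rt.mean():.1f} ms")
print(f"mean incongruent RT: {incongruent_rt.mean():.1f} ms")
print(f"Stroop effect:       {stroop_effect:.1f} ms (t = {t_stat:.2f}, p = {p_val:.4f})")
```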


Citations
Posted Content
TL;DR: A new dataset of human perceptual similarity judgments is introduced, and it is found that deep features outperform all previous metrics by large margins on this dataset, suggesting that perceptual similarity is an emergent property shared across deep visual representations.
Abstract: While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions, and fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on ImageNet classification have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new dataset of human perceptual similarity judgments. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins on our dataset. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
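As a rough sketch of the idea behind using deep features as a perceptual metric, the snippet below compares two images by the distance between their ImageNet-pretrained VGG-16 feature maps and contrasts it with PSNR. This is only an illustration of the underlying idea under assumed defaults (no input normalization, unweighted squared distance over the final feature map), not the authors' LPIPS implementation, which additionally learns per-channel weights from the human judgments described above.

```python
# Illustrative sketch of "deep features as a perceptual metric" (assumes
# torchvision >= 0.13 and that the pretrained weights can be downloaded).
# Not the authors' LPIPS code: input normalization and learned per-channel
# weights are omitted for brevity.
import torch
import torchvision

weights = torchvision.models.VGG16_Weights.IMAGENET1K_V1
features = torchvision.models.vgg16(weights=weights).features.eval()
for p in features.parameters():
    p.requires_grad_(False)

def deep_feature_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between VGG-16 feature maps of two image batches (N,3,H,W)."""
    return torch.mean((features(x) - features(y)) ** 2)

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Classic shallow metric, shown for comparison."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Toy usage: a random image and a slightly perturbed copy.
a = torch.rand(1, 3, 224, 224)
b = torch.clamp(a + 0.05 * torch.randn_like(a), 0.0, 1.0)
print("deep-feature distance:", deep_feature_distance(a, b).item())
print("PSNR (dB):", psnr(a, b).item())
```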

3,838 citations


Cites background from "Evaluating Amazon's Mechanical Turk..."

  • ...[9] show that AMT can be reliably used to replicate many psychophysics studies, despite the inability to control all environmental factors....



Journal ArticleDOI
11 Dec 2015-Science
TL;DR: A computational model is described that learns new concepts from single examples in a human-like fashion, does so better than current deep learning algorithms, and can generate new letters of the alphabet that look “right” as judged by Turing-like tests comparing the model's output with what real humans produce.
Abstract: People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms-for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world's alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several "visual Turing tests" probing the model's creative generalization abilities, which in many cases are indistinguishable from human behavior.
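The one-shot classification task the model is evaluated on can be illustrated with a deliberately simple baseline: given a single stored example per class, classify a query by nearest neighbor. The sketch below is purely illustrative (random binary "characters", hypothetical sizes and noise level) and is not the Bayesian Program Learning model described in the abstract.

```python
# Toy illustration of the one-shot classification task setup only; this is a
# trivial nearest-neighbor baseline, NOT the Bayesian Program Learning model
# described above. Data shapes and noise level are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

n_classes, dim = 20, 256  # e.g., 20 character classes, 16x16 binary images flattened
support = rng.integers(0, 2, size=(n_classes, dim)).astype(float)  # one example per class

# Build a query from class 7 by flipping ~5% of its pixels.
true_class = 7
query = support[true_class].copy()
flip = rng.random(dim) < 0.05
query[flip] = 1.0 - query[flip]

# One-shot classification: choose the class whose single stored example is closest.
distances = np.linalg.norm(support - query, axis=1)
predicted = int(np.argmin(distances))
print(f"true class = {true_class}, predicted class = {predicted}")
```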

2,364 citations

Journal ArticleDOI
TL;DR: The characteristics of Mechanical Turk (MTurk) as a participant pool for psychology and other social sciences are discussed, highlighting the traits of MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.
Abstract: Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.

1,926 citations


Cites methods from "Evaluating Amazon's Mechanical Turk..."

  • ...Classic cognitive tasks that rely on response-time measures have also been replicated, including Stroop, switching, flanker, attentional blink, and subliminal-priming tasks (Crump et  al., 2013)....


Journal ArticleDOI
TL;DR: This article found that, compared with MTurk participants, participants on both alternative platforms (Prolific Academic and CrowdFlower) were more naive and less dishonest, and that Prolific Academic produced data quality that was higher than CrowdFlower's and comparable to MTurk's.

1,537 citations

References
Journal ArticleDOI
TL;DR: Findings indicate that MTurk can be used to obtain high-quality data inexpensively and rapidly, and that the data obtained are at least as reliable as those obtained via traditional methods.
Abstract: Amazon's Mechanical Turk (MTurk) is a relatively new website that contains the major elements required to conduct research: an integrated participant compensation system; a large participant pool; and a streamlined process of study design, participant recruitment, and data collection. In this article, we describe and evaluate the potential contributions of MTurk to psychology and other social sciences. Findings indicate that (a) MTurk participants are slightly more demographically diverse than are standard Internet samples and are significantly more diverse than typical American college samples; (b) participation is affected by compensation rate and task length, but participants can still be recruited rapidly and inexpensively; (c) realistic compensation rates do not affect data quality; and (d) the data obtained are at least as reliable as those obtained via traditional methods. Overall, MTurk can be used to obtain high-quality data inexpensively and rapidly.

9,562 citations


"Evaluating Amazon's Mechanical Turk..." refers background or methods in this paper

  • ...In addition, the service has been validated as a tool for conducting survey research [3,4], one-shot decision-making research [5,6], collective behavior experiments [7,8], for norming stimuli, and conducting behavioral linguistics experiments [9,10]....


  • ...There are a number of in-depth overviews of using AMT to conduct behavioral experiments [4,2,6]....


Journal ArticleDOI
TL;DR: During a 1-sec tachistoscopic exposure, Ss responded with a right or left lever press to a single target letter from the sets H and K or S and C; the target always appeared directly above the fixation cross.
Abstract: During a 1-sec tachistoscopic exposure, Ss responded with a right or left lever press to a single target letter from the sets H and K or S and C. The target always appeared directly above the fixation cross. Experimentally varied were the types of noise letters (response compatible or incompatible) flanking the target and the spacing between the letters in the display. In all noise conditions, reaction time (RT) decreased as between-letter spacing increased. However, noise letters of the opposite response set were found to impair RT significantly more than same response set noise, while mixed noise letters belonging to neither set but having set-related features produced intermediate impairment. Differences between two target-alone control conditions, one presented intermixed with noise-condition trials and one presented separately in blocks, gave evidence of a preparatory set on the part of Ss to inhibit responses to the noise letters. It was concluded that S cannot prevent processing of noise letters occurring within about 1 deg of the target due to the nature of processing channel capacity and must inhibit his response until he is able to discriminate exactly which letter is in the target position. This discrimination is more difficult and time consuming at closer spacings, and inhibition is more difficult when noise letters indicate the opposite response from the target.

6,234 citations


"Evaluating Amazon's Mechanical Turk..." refers background in this paper

  • ...The Flanker task [26,27] measures participants’ spatial attention in a task requiring them to select relevant from irrelevant information....


Journal ArticleDOI
TL;DR: It is concluded that recent theories placing the explanatory weight on parallel processing of the irrelevant and the relevant dimensions are likely to be more successful than are earlier theories attempting to locate a single bottleneck in attention.
Abstract: The literature on interference in the Stroop Color-Word Task, covering over 50 years and some 400 studies, is organized and reviewed. In so doing, a set of 18 reliable empirical findings is isolated that must be captured by any successful theory of the Stroop effect. Existing theoretical positions are summarized and evaluated in view of this critical evidence, and the two major candidate theories, relative speed of processing and automaticity of reading, are found to be wanting. It is concluded that recent theories placing the explanatory weight on parallel processing of the irrelevant and the relevant dimensions are likely to be more successful than are earlier theories attempting to locate a single bottleneck in attention.

5,172 citations


Additional excerpts

  • ...The Stroop task is a classic multi-trial procedure involving ink-color identification of congruent (the word blue in blue) or incongruent (blue in red) word-color pairs [19,20]....


Journal ArticleDOI
TL;DR: It is shown that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings, flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates, and a simple, low-cost, and straightforwardly effective disclosure-based solution is suggested.
Abstract: In this article, we accomplish two things. First, we show that despite empirical psychologists' nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
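To illustrate the mechanism the abstract describes, the toy Monte Carlo below (an illustrative sketch, not the authors' simulation code) estimates the false-positive rate when a researcher tests two correlated dependent variables and their average, with no true effect present, and reports whichever test comes out significant; the sample sizes and correlation are assumed for the example.

```python
# Illustrative Monte Carlo (not the authors' code): testing two correlated DVs and
# their average, then reporting whichever analysis is significant, inflates the
# false-positive rate above the nominal .05 even when no true effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, alpha = 10_000, 20, 0.05  # assumed simulation settings
cov = [[1.0, 0.5], [0.5, 1.0]]                 # two DVs correlated at r = .5
false_positives = 0

for _ in range(n_sims):
    # No true group difference exists in either dependent variable.
    group_a = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)
    group_b = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)

    p_values = [
        stats.ttest_ind(group_a[:, 0], group_b[:, 0]).pvalue,  # DV 1 alone
        stats.ttest_ind(group_a[:, 1], group_b[:, 1]).pvalue,  # DV 2 alone
        stats.ttest_ind(group_a.mean(axis=1), group_b.mean(axis=1)).pvalue,  # average DV
    ]
    if min(p_values) < alpha:  # report "an effect" if any analysis is significant
        false_positives += 1

print(f"nominal alpha: {alpha:.2f}")
print(f"observed false-positive rate: {false_positives / n_sims:.3f}")
```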

4,727 citations


"Evaluating Amazon's Mechanical Turk..." refers background in this paper

  • ...Psychologists are under increasing criticism for undisclosed flexibility in data collection and statistical analysis [48]....


  • ...As with all empirical studies, restrictions should be decided before data collection and clearly reported in papers to avoid excess experimenter degrees of freedom [48]....
