Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

doi:10.3115/1613715.1613751

Home
/
Papers
/
Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

Proceedings Article•DOI•

Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

Rion Snow¹, Brendan O'Connor, Dan Jurafsky¹, Andrew Y. Ng¹•Institutions (1)

Stanford University¹

25 Oct 2008-pp 254-263

TL;DR: This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.

read less

Abstract: Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement between Mechanical Turk non-expert annotations and existing gold standard labels provided by expert labelers. For the task of affect recognition, we also show that using non-expert labels for training machine learning algorithms can be as effective as using gold standard annotations from experts. We propose a technique for bias correction that significantly improves annotation quality on two tasks. We conclude that many large labeling tasks can be effectively designed and carried out in this method at a fraction of the usual expense.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Active Learning Literature Survey

[...]

Burr Settles

01 Jan 2009

TL;DR: This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.

...read moreread less

Abstract: The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator). Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain. This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for successful active learning, a summary of problem setting variants and practical issues, and a discussion of related topics in machine learning research are also presented.

...read moreread less

5,227 citations

Cites methods from "Cheap and Fast -- But is it Good? E..."

...Such approaches have been used to produce gold-standard quality training sets (Snow et al., 2008) and also to evaluate learning algorithms on data for which no gold-standard labelings exist (Mintz et al., 2009; Carlson et al., 2010)....
[...]
...Such approaches have been used to produce gold-standard quality training sets (Snow et al., 2008) and also to evaluate learning algorithms on data for which no gold-standard labelings exist (Mintz et al....
[...]

Journal Article•DOI•

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

[...]

Ranjay Krishna¹, Yuke Zhu¹, Oliver Groth², Justin Johnson¹, Kenji Hata¹, Joshua Kravitz¹, Stephanie Chen¹, Yannis Kalantidis³, Li-Jia Li, David A. Shamma⁴, Michael S. Bernstein¹, Li Fei-Fei¹ - Show less +8 more•Institutions (4)

Stanford University¹, Dresden University of Technology², Yahoo!³, Centrum Wiskunde & Informatica⁴

01 May 2017-International Journal of Computer Vision

TL;DR: The Visual Genome dataset as mentioned in this paper contains over 108k images where each image has an average of $35$35 objects, $26$26 attributes, and $21$21 pairwise relationships between objects.

...read moreread less

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of $$35$$35 objects, $$26$$26 attributes, and $$21$$21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

...read moreread less

3,842 citations

Journal Article•DOI•

Evaluating Online Labor Markets for Experimental Research: Amazon.com's Mechanical Turk

[...]

Adam J. Berinsky¹, Gregory A. Huber², Gabriel S. Lenz³•Institutions (3)

Massachusetts Institute of Technology¹, Yale University², University of California, Berkeley³

02 Mar 2012-Political Analysis

TL;DR: It is shown that respondents recruited in this manner are often more representative of the U.S. population than in-person convenience samples but less representative than subjects in Internet-based panels or national probability samples.

...read moreread less

Abstract: We examine the trade-offs associated with using Amazon.com’s Mechanical Turk (MTurk) interface for subject recruitment. We first describe MTurk and its promise as a vehicle for performing low-cost and easy-to-field experiments. We then assess the internal and external validity of experiments performed using MTurk, employing a framework that can be used to evaluate other subject pools. We first investigate the characteristics of samples drawn from the MTurk population. We show that respondents recruited in this manner are often more representative of the U.S. population than in-person convenience samples—the modal sample in published experimental political science—but less representative than subjects in Internet-based panels or national probability samples. Finally, we replicate important published experimental work using MTurk samples.

...read moreread less

3,517 citations

Cites background from "Cheap and Fast -- But is it Good? E..."

...For example, Snow et al. (2008) assessed the quality of MTurkers’ responses to several classic human language problems, finding that the quality was no worse than the expert data that most researchers use....
[...]

Proceedings Article•

VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text

[...]

Clayton J. Hutto¹, Eric Gilbert¹•Institutions (1)

Georgia Institute of Technology¹

16 May 2014

TL;DR: Interestingly, using the authors' parsimonious rule-based model to assess the sentiment of tweets, it is found that VADER outperforms individual human raters, and generalizes more favorably across contexts than any of their benchmarks.

...read moreread less

Abstract: The inherent nature of social media content poses serious challenges to practical applications of sentiment analysis. We present VADER, a simple rule-based model for general sentiment analysis, and compare its effectiveness to eleven typical state-of-practice benchmarks including LIWC, ANEW, the General Inquirer, SentiWordNet, and machine learning oriented techniques relying on Naive Bayes, Maximum Entropy, and Support Vector Machine (SVM) algorithms. Using a combination of qualitative and quantitative methods, we first construct and empirically validate a gold-standard list of lexical features (along with their associated sentiment intensity measures) which are specifically attuned to sentiment in microblog-like contexts. We then combine these lexical features with consideration for five general rules that embody grammatical and syntactical conventions for expressing and emphasizing sentiment intensity. Interestingly, using our parsimonious rule-based model to assess the sentiment of tweets, we find that VADER outperforms individual human raters (F1 Classification Accuracy = 0.96 and 0.84, respectively), and generalizes more favorably across contexts than any of our benchmarks.

...read moreread less

3,299 citations

Proceedings Article•DOI•

Distant supervision for relation extraction without labeled data

[...]

Mike D. Mintz¹, Steven Bills¹, Rion Snow¹, Dan Jurafsky¹•Institutions (1)

Stanford University¹

02 Aug 2009

TL;DR: This work investigates an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.

...read moreread less

Abstract: Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.

...read moreread less

2,965 citations

Cites methods from "Cheap and Fast -- But is it Good? E..."

...Human evaluation was performed by evaluators on Amazon’s Mechanical Turk service, shown to be effective for natural language annotation in Snow et al. (2008)....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Report•DOI•

Building a large annotated corpus of English: the penn treebank

[...]

Mitchell Marcus¹, Mary Ann Marcinkiewicz¹, Beatrice Santorini²•Institutions (2)

University of Pennsylvania¹, Northwestern University²

01 Jun 1993-Computational Linguistics

TL;DR: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of- speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.

...read moreread less

Abstract: : As a result of this grant, the researchers have now published oil CDROM a corpus of over 4 million words of running text annotated with part-of- speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure This material now includes a fully hand-parsed version of the classic Brown corpus About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant

...read moreread less

8,377 citations

Proceedings Article•DOI•

The Berkeley FrameNet Project

[...]

Collin F. Baker¹, Charles J. Fillmore¹, John B. Lowe¹•Institutions (1)

International Computer Science Institute¹

10 Aug 1998

TL;DR: This report will present the project's goals and workflow, and information about the computational tools that have been adapted or created in-house for this work.

...read moreread less

Abstract: FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a) a commitment to corpus evidence for semantic and syntactic generalizations, and (b) the representation of the valences of its target words (mostly nouns, adjectives, and verbs) in which the semantic portion makes use of frame semantics. The resulting database will contain (a) descriptions of the semantic frames underlying the meanings of the words described, and (b) the valence representation (semantic and syntactic) of several thousand words and phrases, each accompanied by (c) a representative collection of annotated corpus attestations, which jointly exemplify the observed linkings between "frame elements" and their syntactic realizations (e.g. grammatical function, phrase type, and other syntactic traits). This report will present the project's goals and workflow, and information about the computational tools that have been adapted or created in-house for this work.

...read moreread less

2,900 citations

"Cheap and Fast -- But is it Good? E..." refers background or methods in this paper

...In this work we explore the use of Amazon Mechanical Turk1 (AMT) to determine whether nonexpert labelers can provide reliable natural language annotations....
[...]
...Another method is to use Amazon’s compensation mechanisms to give monetary bonuses to highlyperforming workers and deny payments to unreliable ones; this is useful, but beyond the scope of this paper....
[...]
...We employ the Amazon Mechanical Turk system in order to elicit annotations from non-expert labelers....
[...]
...In this section we describe Amazon Mechanical Turk and the general design of our experiments....
[...]
...We demonstrate the effectiveness of using Amazon Mechanical Turk for a variety of natural language annotation tasks....
[...]

Journal Article•DOI•

The Proposition Bank: An Annotated Corpus of Semantic Roles

[...]

Martha Palmer¹, Daniel Gildea², Paul R. Kingsbury¹•Institutions (2)

University of Pennsylvania¹, University of Rochester²

01 Mar 2005-Computational Linguistics

TL;DR: An automatic system for semantic role tagging trained on the corpus is described and the effect on its performance of various types of information is discussed, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty trace categories of the treebank.

...read moreread less

Abstract: The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and many other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated.We discuss the criteria used to define the sets of semantic roles used in the annotation process and to analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty ''trace'' categories of the treebank.

...read moreread less

2,416 citations

"Cheap and Fast -- But is it Good? E..." refers background in this paper

...Large scale annotation projects such as TreeBank (Marcus et al., 1993), PropBank (Palmer et al., 2005), TimeBank (Pustejovsky et al., 2003), FrameNet (Baker et al., 1998), SemCor (Miller et al., 1993), and others play an important role in natural language processing research, encouraging the…...
[...]
...Large scale annotation projects such as TreeBank (Marcus et al., 1993), PropBank (Palmer et al., 2005), TimeBank (Pustejovsky et al., 2003), FrameNet (Baker et al., 1998), SemCor (Miller et al., 1993), and others play an important role in natural language processing research, encouraging the development of novel ideas, tasks, and algorithms....
[...]

Proceedings Article•DOI•

Labeling images with a computer game

[...]

Luis von Ahn¹, Laura Dabbish¹•Institutions (1)

Carnegie Mellon University¹

25 Apr 2004

TL;DR: A new interactive system: a game that is fun and can be used to create valuable output that addresses the image-labeling problem and encourages people to do the work by taking advantage of their desire to be entertained.

...read moreread less

Abstract: We introduce a new interactive system: a game that is fun and can be used to create valuable output. When people play the game they help determine the contents of images by providing meaningful labels for them. If the game is played as much as popular online games, we estimate that most images on the Web can be labeled in a few months. Having proper labels associated with each image on the Web would allow for more accurate image search, improve the accessibility of sites (by providing descriptions of images to visually impaired individuals), and help users block inappropriate images. Our system makes a significant contribution because of its valuable output and because of the way it addresses the image-labeling problem. Rather than using computer vision techniques, which don't work well enough, we encourage people to do the work by taking advantage of their desire to be entertained.

...read moreread less

2,365 citations

"Cheap and Fast -- But is it Good? E..." refers methods in this paper

...Luis von Ahn pioneered the collection of data via online annotation tasks in the form of games, including the ESPGame for labeling images (von Ahn and Dabbish, 2004) and Verbosity for annotating word relations (von Ahn et al., 2006)....
[...]

Journal Article•DOI•

[...]

Philip Resnik¹•Institutions (1)

University of Maryland, College Park¹

01 Jul 1999-Journal of Artificial Intelligence Research

TL;DR: In this paper, a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content is presented, and experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge counting approach.

...read moreread less

Abstract: This article presents a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content. Experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge-counting approach. The article presents algorithms that take advantage of taxonomic similarity in resolving syntactic and semantic ambiguity, along with experimental results demonstrating their effectiveness.

...read moreread less

2,190 citations