Limitations of majority agreement in crowdsourced image interpretation
*Carl F. Salk (1,2), Tobias Sturn (1), Linda See (1), Steffen Fritz (1)
Transactions in GIS, 2016
(1) Ecosystems Services and Management Program, International Institute for Applied Systems
Analysis, Schlossplatz 1, A-2361 Laxenburg, Austria
(2) Southern Swedish Forest Research Center, Swedish University of Agricultural Sciences, Box 52,
S-23053 Alnarp, Sweden
*Corresponding author: salk@iiasa.ac.at; +43(0) 2236 807 293
Running Head: Limits of the crowd in VGI
Keywords: Crowdsourcing, Volunteered geographic information, Expert validation, Image interpretation,
Task difficulty

Abstract
Crowdsourcing can efficiently complete tasks that are difficult to automate, but the quality of
crowdsourced data is tricky to evaluate. Algorithms to grade volunteer work often assume that all tasks
are similarly difficult, an assumption that is frequently false. We use a cropland identification game with
over 2,600 participants and 165,000 unique tasks to investigate how best to evaluate the difficulty of
crowdsourced tasks and to what extent this is possible based on volunteer responses alone. Inter-
volunteer agreement exceeded 90% for about 80% of the images and was negatively correlated with
volunteer-expressed uncertainty about image classification. A total of 343 relatively difficult images were
independently classified as cropland, non-cropland or impossible by two experts. The experts disagreed
weakly (one said impossible while the other rated as cropland or non-cropland) on 27% of the images,
but disagreed strongly (cropland vs. non-cropland) on only 7%. Inter-volunteer disagreement increased
significantly with inter-expert disagreement. While volunteers agreed with expert classifications for
most images, over 20% would have been mis-categorized if only the volunteers’ majority vote was used.
We end with a series of recommendations to manage the challenges posed by heterogeneous tasks in
crowdsourcing campaigns.

1 Introduction
Crowdsourcing is a powerful tool to perform tasks requiring human input that would be
prohibitively expensive if paid for in a conventional way. Crowdsourcing has emerged from a business-
oriented domain (Howe, 2006) in which different types of micro-tasks are outsourced to a willing labor
force or interested volunteers, but in recent years it has been adopted for a broader array of data
collection and processing tasks. Where the goal of data collection or processing is scientific, the
involvement of citizens in research has been termed ‘citizen science’ (Bonney et al., 2009). Many citizen
science projects maintain the traditional micro-task approach of crowdsourcing, with tasks ranging from
the relatively simple to the highly skilled, such as bird identification (Silvertown, 2009). However, involvement at
higher levels, e.g. in hypothesis generation and research design, is a goal of many citizen science
projects (Haklay, 2013). When crowdsourced data has a spatial aspect, it is often referred to as
Volunteered Geographic Information (VGI; Goodchild, 2007). VGI can be solicited in a variety of ways,
particularly through mobile phones and social media. Regardless of the purpose of crowdsourcing, data
quality is a fundamental issue that arises for inputs generated by non-specialists, whether the eventual
goal is scientific analysis (Hunter et al., 2013) or conflation with authoritative data (Pourabdollah et al.,
2013). Data quality assessment is of particular importance as it has implications for how volunteers are
motivated, evaluated and rewarded (Oreg and Nov, 2008; Raddick et al., 2013).
Citizen science contributors may be accorded credit for their work in a variety of ways. On the
most basic level, contributors take part for some sense of personal reward. A study of the highly
successful Galaxy Zoo project identified at least 12 distinct personal motivations for taking part (Raddick
et al, 2013). Volunteers may be awarded points proportioned in some way to their work, for instance
for the total number of tasks completed, the number of tasks completed correctly, the accuracy of task
completion, or some combination of these metrics (Ipeirotis et al., 2010; Wang et al., 2013), among
others. Contributor credit may be confined entirely to the ‘game world’ such that points accrued cannot
be converted into cash or other goods. In-game recognition has proven to be a powerful motivator in
many scientific discovery games (Mekler et al., 2013). Points earned by volunteers may also be
converted to tangible rewards, for instance by paying a fixed amount for successful task completion, as
is done using the Amazon Mechanical Turk platform (Buhrmester et al., 2011). Intermediate solutions
are also possible, for instance entering top players into prize draws (See et al., 2014b) or awarding them
co-authorship for their contributions to resulting manuscripts (Fritz et al., 2013; See et al., 2014a). The
simplest reward structure for crowdsourcing campaigns is to award points uniformly for each task
completed, or for each task completed successfully. Rewarding quantity (rather than quality) is a
common, but slowly changing, feature of crowdsourcing efforts (Wang et al., 2013). Uniform rewards
may be appropriate where task difficulty does not vary greatly, but what happens when this is not true?
A more accurate evaluation of contributor quality, and a more nuanced awarding of credit, should take
the difficulty of individual tasks into account.
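
To make these alternatives concrete, the sketch below (an illustration only; the function, data and labels are hypothetical and not drawn from any particular platform) computes three of the credit metrics mentioned above for a single volunteer, namely tasks completed, tasks completed correctly, and accuracy:

# Illustrative sketch (not from the paper): three common ways of crediting
# a volunteer, given (answer, reference) pairs for the subset of their tasks
# that have a known reference answer.

def credit_metrics(responses):
    """responses: list of (volunteer_answer, reference_answer) tuples."""
    n_completed = len(responses)                        # reward pure quantity
    n_correct = sum(a == ref for a, ref in responses)   # reward correct completions
    accuracy = n_correct / n_completed if n_completed else 0.0  # reward quality only
    return {"completed": n_completed, "correct": n_correct, "accuracy": accuracy}

# Example: a fast-but-careless player versus a slow-but-careful one.
fast = [("crop", "crop"), ("crop", "non-crop"), ("crop", "crop"), ("crop", "non-crop")]
careful = [("crop", "crop"), ("non-crop", "non-crop")]
print(credit_metrics(fast))     # high 'completed', mediocre 'accuracy'
print(credit_metrics(careful))  # low 'completed', perfect 'accuracy'

Which of these scores (or which weighted combination of them) should translate into reward is exactly the design choice that becomes difficult when tasks differ in difficulty.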
Some authors suggest that it is possible to evaluate the correct answer to a task and volunteer
quality using only volunteer-contributed data, or as one publication memorably put it, ‘to grade a test
without knowing the answers’ (Bachrach et al., 2012). Dawid and Skene (1979) proposed what may
have been the first algorithm attempting to do this, using an iterative maximum likelihood method to
simultaneously estimate the correct response and the error rate of each rater. More recent work has
built on this method, improving its efficiency, and perhaps extending its usefulness to somewhat noisier
data, typically using Bayesian estimation (Wang et al., 2013; Bachrach et al., 2012; Whitehill et al., 2009;
Welinder et al., 2010). Indeed, some successful crowdsourcing campaigns such as the ‘ESP Game’ for
image labeling (von Ahn and Dabbish, 2004) and Galaxy Zoo (Lintott et al., 2008) have made little or no
use of validated tasks. Because of the substantial literature on these expert-free methods, and their
appeal for crowdsourcing applications, we examine the consequences of assuming that the wisdom of
the crowd is correct.
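
To illustrate the general idea, the sketch below implements a stripped-down, binary-label variant in the spirit of Dawid and Skene’s iterative scheme: it alternates between estimating each task’s answer from the raters’ current accuracy estimates and re-estimating each rater’s accuracy from those provisional answers. The toy data and the single symmetric-accuracy parameter per rater are illustrative simplifications, not the exact likelihood model of the cited papers:

# Minimal, illustrative EM-style loop in the spirit of Dawid & Skene (1979)
# for binary labels (1 = cropland, 0 = non-cropland). Toy data only.

# votes[task_id] = {rater_id: label}
votes = {
    "img1": {"a": 1, "b": 1, "c": 0},
    "img2": {"a": 0, "b": 0, "c": 0},
    "img3": {"a": 1, "b": 0, "c": 1},
}

raters = {r for v in votes.values() for r in v}
accuracy = {r: 0.7 for r in raters}   # initial guess: every rater is 70% reliable
truth = {t: 0.5 for t in votes}       # P(label == 1) for each task

for _ in range(20):
    # E-step: estimate each task's label from current rater accuracies
    for t, v in votes.items():
        p1 = p0 = 1.0
        for r, label in v.items():
            a = accuracy[r]
            p1 *= a if label == 1 else (1 - a)
            p0 *= a if label == 0 else (1 - a)
        truth[t] = p1 / (p1 + p0)
    # M-step: re-estimate each rater's accuracy from the provisional labels
    for r in raters:
        num = den = 0.0
        for t, v in votes.items():
            if r in v:
                p = truth[t]
                num += p if v[r] == 1 else (1 - p)
                den += 1
        accuracy[r] = num / den

print({t: round(p, 2) for t, p in truth.items()})   # inferred task answers
print({r: round(a, 2) for r, a in accuracy.items()})  # inferred rater accuracies

Note that nothing in this loop distinguishes a difficult task from an unreliable rater; both simply show up as disagreement, which is the assumption examined in this paper.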
We address these questions in the context of a simple land-cover classification task. There is an
extensive literature on the factors that make images difficult for humans to classify, particularly in the
field of aerial photography and satellite imagery. For instance, studies have quantified how complex
backgrounds slow down recognition of foreground objects (Lloyd and Hodgson, 2002) and the minimum
resolution needed to identify disaster-caused damage to buildings (Battersby et al., 2012). The
psychological underpinnings of these factors have also received substantial attention (e.g. Hoffman,
1990; Bianchetti, 2014). Our work takes the opposite approach. Rather than building a model of task
difficulty from a basis of cognition and image composition, we ask what can be learned about task
difficulty from the patterns of player responses themselves. This knowledge is particularly valuable in
the context of crowdsourcing. Even if an image classification activity has a well-developed theory of
what makes it difficult, applying this theory would require a separate evaluation of each image. For
some activities, computers may be able to rate the difficulty of tasks, but in this case the tasks
themselves are unlikely to require human interpretation, and expert pre-screening of images would
defeat the purpose of crowdsourcing.
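
For a binary task such as cropland identification, one simple response-based difficulty signal is how far each image’s vote split falls from unanimity. The following sketch is illustrative only (the vote counts are invented, and the measures used in this study may be defined differently); it scores each image by the share of votes held by the majority class, so values near 1.0 suggest easy images and values near 0.5 suggest difficult or ambiguous ones:

# Illustrative difficulty proxy from volunteer responses alone:
# the fraction of votes agreeing with the per-image majority.
from collections import Counter

def majority_agreement(votes):
    """votes: list of labels for one image, e.g. ['crop', 'crop', 'non-crop']."""
    counts = Counter(votes)
    top = counts.most_common(1)[0][1]
    return top / len(votes)   # 1.0 = unanimous, ~0.5 = maximally split

images = {
    "img1": ["crop"] * 18 + ["non-crop"] * 2,   # easy: 90% agreement
    "img2": ["crop"] * 11 + ["non-crop"] * 9,   # hard: 55% agreement
}
for name, votes in images.items():
    print(name, round(majority_agreement(votes), 2))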
As a first step toward relating payment or reputational credit for crowdsourcers to the difficulty
of the work they complete, we evaluate different metrics for assessing the difficulty of tasks using a
land-cover classification example from the Cropland Capture game. We compare disagreement among
volunteers and volunteer-reported uncertainty with expert-derived measures. The results show that
while the different difficulty measures are positively related to one another, evaluations based
on volunteers’ data alone tend to underestimate the difficulty of tasks. Further, for a non-trivial fraction
of tasks, the wisdom of the crowd was wrong, greatly complicating the assessment of other, non-validated tasks.
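
The comparison behind that last point can be set up schematically as follows (toy labels, not the study’s data or analysis code): take each image’s volunteer majority vote, check it against the expert classification, and report the fraction of images that a majority-vote-only pipeline would mis-categorize:

# Schematic check of majority voting against expert labels (invented data).
from collections import Counter

def majority_vote(votes):
    return Counter(votes).most_common(1)[0][0]

volunteer_votes = {
    "img1": ["crop", "crop", "crop", "non-crop"],
    "img2": ["non-crop", "non-crop", "crop"],
    "img3": ["crop", "non-crop", "non-crop", "non-crop"],
}
expert_label = {"img1": "crop", "img2": "crop", "img3": "non-crop"}

mismatches = sum(
    majority_vote(v) != expert_label[img] for img, v in volunteer_votes.items()
)
print(f"mis-categorized by majority vote: {mismatches / len(expert_label):.0%}")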

References
Bonney, R. et al. (2009) Citizen Science: A Developing Tool for Expanding Science Knowledge and Scientific Literacy. BioScience.
Buhrmester, M., Kwang, T. and Gosling, S.D. (2011) Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science.
Goodchild, M.F. (2007) Citizens as sensors: the world of volunteered geography. GeoJournal.
Silvertown, J. (2009) A new dawn for citizen science. Trends in Ecology & Evolution.
von Ahn, L. and Dabbish, L. (2004) Labeling images with a computer game. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.