Limitations of majority agreement in crowdsourced image interpretation
*Carl F. Salk (1,2), Tobias Sturn (1), Linda See (1), Steffen Fritz (1)
Transactions in GIS, 2016
(1) Ecosystems Services and Management Program, International Institute for Applied Systems
Analysis, Schlossplatz 1, A-2361 Laxenburg, Austria
(2) Southern Swedish Forest Research Center, Swedish University of Agricultural Sciences, Box 52,
S-23053 Alnarp, Sweden
*Corresponding author: salk@iiasa.ac.at; +43(0) 2236 807 293
Running Head: Limits of the crowd in VGI
Keywords: Crowdsourcing, Volunteered geographic information, Expert validation, Image interpretation,
Task difficulty

Abstract
Crowdsourcing can efficiently complete tasks that are difficult to automate, but the quality of
crowdsourced data is tricky to evaluate. Algorithms to grade volunteer work often assume that all tasks
are similarly difficult, an assumption that is frequently false. We use a cropland identification game with
over 2,600 participants and 165,000 unique tasks to investigate how best to evaluate the difficulty of
crowdsourced tasks and to what extent this is possible based on volunteer responses alone. Inter-
volunteer agreement exceeded 90% for about 80% of the images and was negatively correlated with
volunteer-expressed uncertainty about image classification. A total of 343 relatively difficult images were
independently classified as cropland, non-cropland or impossible by two experts. The experts disagreed
weakly (one said impossible while the other rated as cropland or non-cropland) on 27% of the images,
but disagreed strongly (cropland vs. non-cropland) on only 7%. Inter-volunteer disagreement increased
significantly with inter-expert disagreement. While volunteers agreed with expert classifications for
most images, over 20% would have been mis-categorized if only the volunteers’ majority vote was used.
We end with a series of recommendations to manage the challenges posed by heterogeneous tasks in
crowdsourcing campaigns.

1 Introduction
Crowdsourcing is a powerful tool to perform tasks requiring human input that would be
prohibitively expensive if paid for in a conventional way. Crowdsourcing has emerged from a business-
oriented domain (Howe, 2006) in which different types of micro-tasks are outsourced to a willing labor
force or interested volunteers, but in recent years it has been adopted for a broader array of data
collection and processing tasks. Where the goal of data collection or processing is scientific, the
involvement of citizens in research has been termed ‘citizen science’ (Bonney et al., 2009). Many citizen
science projects maintain the traditional micro-task approach of crowdsourcing, with tasks ranging from
the relatively simple to the highly skilled, such as bird identification (Silvertown, 2009). However, involvement at
higher levels, e.g. in hypothesis generation and research design, is a goal of many citizen science
projects (Haklay, 2013). When crowdsourced data has a spatial aspect, it is often referred to as
Volunteered Geographic Information (VGI; Goodchild, 2007). VGI can be solicited in a variety of ways,
particularly through mobile phones and social media. Regardless of the purpose of crowdsourcing, data
quality is a fundamental issue that arises for inputs generated by non-specialists, whether the eventual
goal is scientific analysis (Hunter et al., 2013) or conflation with authoritative data (Pourabdollah et al.,
2013). Data quality assessment is of particular importance as it has implications for how volunteers are
motivated, evaluated and rewarded (Oreg and Nov, 2008; Raddick et al., 2013).
Citizen science contributors may be accorded credit for their work in a variety of ways. On the
most basic level, contributors take part for some sense of personal reward. A study of the highly
successful Galaxy Zoo project identified at least 12 distinct personal motivations for taking part (Raddick
et al, 2013). Volunteers may be awarded points proportioned in some way to their work, for instance
for the total number of tasks completed, the number of tasks completed correctly, the accuracy of task
completion, or some combination of these metrics (Ipeirotis et al., 2010; Wang et al., 2013), among
others. Contributor credit may be confined entirely to the ‘game world’ such that points accrued cannot
be converted into cash or other goods. In-game recognition has proven to be a powerful motivator in
many scientific discovery games (Mekler et al., 2013). Points earned by volunteers may also be
converted to tangible rewards, for instance by paying a fixed amount for successful task completion, as
is done using the Amazon Mechanical Turk platform (Buhrmester et al., 2011). Intermediate solutions
are also possible, for instance entering top players into prize draws (See et al., 2014b) or awarding them
co-authorship for their contributions to resulting manuscripts (Fritz et al., 2013; See et al., 2014a). The
simplest reward structure for crowdsourcing campaigns is to award points uniformly for each task
completed, or for each task completed successfully. Rewarding quantity (rather than quality) is a
common, but slowly changing, feature of crowdsourcing efforts (Wang et al., 2013). Uniform rewards
may be appropriate where task difficulty does not vary greatly, but what happens when this is not true?
A more accurate evaluation of contributor quality, and a more nuanced awarding of credit, should take
the difficulty of individual tasks into account.
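
To make these alternatives concrete, the sketch below (an illustration only; the function, data and labels are hypothetical and not drawn from any particular platform) computes three of the credit metrics mentioned above for a single volunteer, namely tasks completed, tasks completed correctly, and accuracy:

# Illustrative sketch (not from the paper): three common ways of crediting
# a volunteer, given (answer, reference) pairs for the subset of their tasks
# that have a known reference answer.

def credit_metrics(responses):
    """responses: list of (volunteer_answer, reference_answer) tuples."""
    n_completed = len(responses)                        # reward pure quantity
    n_correct = sum(a == ref for a, ref in responses)   # reward correct completions
    accuracy = n_correct / n_completed if n_completed else 0.0  # reward quality only
    return {"completed": n_completed, "correct": n_correct, "accuracy": accuracy}

# Example: a fast-but-careless player versus a slow-but-careful one.
fast = [("crop", "crop"), ("crop", "non-crop"), ("crop", "crop"), ("crop", "non-crop")]
careful = [("crop", "crop"), ("non-crop", "non-crop")]
print(credit_metrics(fast))     # high 'completed', mediocre 'accuracy'
print(credit_metrics(careful))  # low 'completed', perfect 'accuracy'

Which of these scores (or which weighted combination of them) should translate into reward is exactly the design choice that becomes difficult when tasks differ in difficulty.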
Some authors suggest that it is possible to evaluate the correct answer to a task and volunteer
quality using only volunteer-contributed data, or as one publication memorably put it, ‘to grade a test
without knowing the answers’ (Bachrach et al., 2012). Dawid and Skene (1979) proposed what may
have been the first algorithm attempting to do this, using an iterative maximum likelihood method to
simultaneously estimate the correct response and the error rate of each rater. More recent work has
built on this method, improving its efficiency, and perhaps extending its usefulness to somewhat noisier
data, typically using Bayesian estimation (Wang et al., 2013; Bachrach et al., 2012; Whitehill et al., 2009;
Welinder et al., 2010). Indeed, some successful crowdsourcing campaigns such as the ‘ESP Game’ for
image labeling (von Ahn and Dabbish, 2004) and Galaxy Zoo (Lintott et al., 2008) have made little or no
use of validated tasks. Because of the substantial literature on these expert-free methods, and their
appeal for crowdsourcing applications, we examine the consequences of assuming that the wisdom of
the crowd is correct.
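
To illustrate the general idea, the sketch below implements a stripped-down, binary-label variant in the spirit of Dawid and Skene’s iterative scheme: it alternates between estimating each task’s answer from the raters’ current accuracy estimates and re-estimating each rater’s accuracy from those provisional answers. The toy data and the single symmetric-accuracy parameter per rater are illustrative simplifications, not the exact likelihood model of the cited papers:

# Minimal, illustrative EM-style loop in the spirit of Dawid & Skene (1979)
# for binary labels (1 = cropland, 0 = non-cropland). Toy data only.

# votes[task_id] = {rater_id: label}
votes = {
    "img1": {"a": 1, "b": 1, "c": 0},
    "img2": {"a": 0, "b": 0, "c": 0},
    "img3": {"a": 1, "b": 0, "c": 1},
}

raters = {r for v in votes.values() for r in v}
accuracy = {r: 0.7 for r in raters}   # initial guess: every rater is 70% reliable
truth = {t: 0.5 for t in votes}       # P(label == 1) for each task

for _ in range(20):
    # E-step: estimate each task's label from current rater accuracies
    for t, v in votes.items():
        p1 = p0 = 1.0
        for r, label in v.items():
            a = accuracy[r]
            p1 *= a if label == 1 else (1 - a)
            p0 *= a if label == 0 else (1 - a)
        truth[t] = p1 / (p1 + p0)
    # M-step: re-estimate each rater's accuracy from the provisional labels
    for r in raters:
        num = den = 0.0
        for t, v in votes.items():
            if r in v:
                p = truth[t]
                num += p if v[r] == 1 else (1 - p)
                den += 1
        accuracy[r] = num / den

print({t: round(p, 2) for t, p in truth.items()})   # inferred task answers
print({r: round(a, 2) for r, a in accuracy.items()})  # inferred rater accuracies

Note that nothing in this loop distinguishes a difficult task from an unreliable rater; both simply show up as disagreement, which is the assumption examined in this paper.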
We address these questions in the context of a simple land-cover classification task. There is an
extensive literature on the factors that make images difficult for humans to classify, particularly in the
field of aerial photography and satellite imagery. For instance, studies have quantified how complex
backgrounds slow down recognition of foreground objects (Lloyd and Hodgson, 2002) and the minimum
resolution needed to identify disaster-caused damage to buildings (Battersby et al., 2012). The
psychological underpinnings of these factors have also received substantial attention (e.g. Hoffman,
1990; Bianchetti, 2014). Our work takes the opposite approach. Rather than building a model of task
difficulty from a basis of cognition and image composition, we ask what can be learned about task
difficulty from the patterns of player responses themselves. This knowledge is particularly valuable in
the context of crowdsourcing. Even if an image classification activity has a well-developed theory of
what makes it difficult, applying this theory would require a separate evaluation of each image. For
some activities, computers may be able to rate the difficulty of tasks, but in this case the tasks
themselves are unlikely to require human interpretation, and expert pre-screening of images would
defeat the purpose of crowdsourcing.
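
For a binary task such as cropland identification, one simple response-based difficulty signal is how far each image’s vote split falls from unanimity. The following sketch is illustrative only (the vote counts are invented, and the measures used in this study may be defined differently); it scores each image by the share of votes held by the majority class, so values near 1.0 suggest easy images and values near 0.5 suggest difficult or ambiguous ones:

# Illustrative difficulty proxy from volunteer responses alone:
# the fraction of votes agreeing with the per-image majority.
from collections import Counter

def majority_agreement(votes):
    """votes: list of labels for one image, e.g. ['crop', 'crop', 'non-crop']."""
    counts = Counter(votes)
    top = counts.most_common(1)[0][1]
    return top / len(votes)   # 1.0 = unanimous, ~0.5 = maximally split

images = {
    "img1": ["crop"] * 18 + ["non-crop"] * 2,   # easy: 90% agreement
    "img2": ["crop"] * 11 + ["non-crop"] * 9,   # hard: 55% agreement
}
for name, votes in images.items():
    print(name, round(majority_agreement(votes), 2))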
As a first step toward relating payment or reputational credit for crowdsourcers to the difficulty
of the work they complete, we evaluate different metrics for assessing the difficulty of tasks using a
land-cover classification example from the Cropland Capture game. We compare disagreement among
volunteers and volunteer-reported uncertainty with expert-derived measures. The results show that
while the different difficulty measures are positively related to one another, evaluations based
on volunteers’ data alone tend to underestimate the difficulty of tasks. Further, for a non-trivial fraction
of tasks, the wisdom of the crowd was wrong, greatly complicating the assessment of other, non-validated tasks.
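
The comparison behind that last point can be set up schematically as follows (toy labels, not the study’s data or analysis code): take each image’s volunteer majority vote, check it against the expert classification, and report the fraction of images that a majority-vote-only pipeline would mis-categorize:

# Schematic check of majority voting against expert labels (invented data).
from collections import Counter

def majority_vote(votes):
    return Counter(votes).most_common(1)[0][0]

volunteer_votes = {
    "img1": ["crop", "crop", "crop", "non-crop"],
    "img2": ["non-crop", "non-crop", "crop"],
    "img3": ["crop", "non-crop", "non-crop", "non-crop"],
}
expert_label = {"img1": "crop", "img2": "crop", "img3": "non-crop"}

mismatches = sum(
    majority_vote(v) != expert_label[img] for img, v in volunteer_votes.items()
)
print(f"mis-categorized by majority vote: {mismatches / len(expert_label):.0%}")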

References
Bonney, R. et al. (2009) Citizen Science: A Developing Tool for Expanding Science Knowledge and Scientific Literacy. BioScience.
Buhrmester, M., Kwang, T. and Gosling, S.D. (2011) Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science.
Goodchild, M.F. (2007) Citizens as sensors: the world of volunteered geography. GeoJournal.
Silvertown, J. (2009) A new dawn for citizen science. Trends in Ecology & Evolution.
von Ahn, L. and Dabbish, L. (2004) Labeling images with a computer game. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.