Open Access Proceedings Article

SQUARE: A Benchmark for Research on Computing Crowd Consensus

TLDR
SQUARE, an open source shared task framework including benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods, is presented.
Abstract
While many statistical consensus methods now exist, the relative lack of comparative benchmarking and integration of techniques has made it increasingly difficult to determine the current state-of-the-art, to evaluate the relative benefit of new methods, to understand where specific problems merit greater attention, and to measure field progress over time. To make such comparative evaluation easier for everyone, we present SQUARE, an open source shared task framework including benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. In addition to measuring performance on a variety of public, real crowd datasets, the benchmark also varies supervision and noise by manipulating training size and labeling error. We envision SQUARE as dynamic and continually evolving, with new datasets and reference implementations being added according to community needs and interest. We invite community contributions and participation.
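As a rough illustration of the kind of task the benchmark standardizes, the sketch below aggregates crowd labels with a simple majority vote and scores the consensus against gold labels using accuracy. The toy data and function names are illustrative assumptions, not the paper's reference implementation or its datasets.

```python
# Minimal sketch (assumed setup, not the actual SQUARE reference code):
# aggregate crowd labels by majority vote and score against gold labels.
from collections import Counter, defaultdict

def majority_vote(worker_labels):
    """Aggregate {example_id: [label, ...]} into {example_id: consensus_label}."""
    return {ex: Counter(labels).most_common(1)[0][0]
            for ex, labels in worker_labels.items()}

def accuracy(predicted, gold):
    """Fraction of gold-labeled examples whose consensus label matches the gold label."""
    hits = sum(predicted.get(ex) == lab for ex, lab in gold.items())
    return hits / len(gold)

if __name__ == "__main__":
    # Hypothetical toy data standing in for one of the benchmark's crowd datasets.
    worker_labels = defaultdict(list)
    for worker, example, label in [("w1", "q1", 1), ("w2", "q1", 1), ("w3", "q1", 0),
                                   ("w1", "q2", 0), ("w2", "q2", 0), ("w3", "q2", 0)]:
        worker_labels[example].append(label)
    gold = {"q1": 1, "q2": 0}
    print("Majority-vote accuracy:", accuracy(majority_vote(worker_labels), gold))
```

Swapping `majority_vote` for another consensus method while keeping the same metric is the comparison the benchmark is meant to make routine.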



Citations
Journal Article

Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques, and Assurance Actions

TL;DR: This paper surveys quality in the context of crowdsourcing along several dimensions, in order to define and characterize it and to assess the current state of the art.
Proceedings Article

QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications

TL;DR: This paper investigates the online task assignment problem: given a pool of n questions, which k questions should be assigned to an incoming worker. It proposes the Quality-Aware Task Assignment System for Crowdsourcing Applications (QASCA), built on top of AMT.
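The sketch below is only an illustrative stand-in for the assignment step described above: it picks the k most uncertain questions, measured by the label entropy of answers collected so far, whereas QASCA itself optimizes expected quality improvement. The heuristic and all names are assumptions for exposition.

```python
# Illustrative heuristic only (not QASCA's actual optimization): assign the k
# questions whose current answers are most uncertain to the next worker.
import math
from collections import Counter

def label_entropy(answers):
    """Shannon entropy of the empirical label distribution for one question."""
    if not answers:
        return float("inf")  # unanswered questions are treated as maximally uncertain
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def assign_questions(pool, k):
    """Given {question_id: [labels so far]}, pick k questions for the next worker."""
    return sorted(pool, key=lambda q: label_entropy(pool[q]), reverse=True)[:k]

# Example: q3 has no answers and q1 is split, so they are assigned before q2.
pool = {"q1": [0, 1], "q2": [1, 1, 1], "q3": []}
print(assign_questions(pool, k=2))
```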
Journal Article

Learning from crowdsourced labeled data: a survey

TL;DR: This survey introduces the basic concepts of the qualities of labels and learning models, and introduces open accessible real-world data sets collected from crowdsourcing systems and open source libraries and tools.
Proceedings Article

Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk

TL;DR: It is found that screening workers for requisite cognitive aptitudes and providing training in qualitative coding techniques is quite effective, significantly outperforming control and baseline conditions, and can improve coder annotation accuracy above and beyond common benchmark strategies such as Bayesian Truth Serum (BTS).
Journal Article

Active Learning With Imbalanced Multiple Noisy Labeling

TL;DR: A novel active learning framework involving multiple imperfect annotators in crowdsourcing systems is proposed to solve the imbalanced multiple noisy labeling problem, along with three novel instance selection strategies that adapt PLAT to improve learning performance.
References
Proceedings Article

Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

TL;DR: This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
Journal Article

Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm

TL;DR: The EM algorithm is shown to provide a slow but sure way of obtaining maximum likelihood estimates of the parameters of interest in compiling a patient record.
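For readers unfamiliar with this approach, here is a compact, hedged sketch of Dawid-Skene-style EM: item label posteriors and per-worker confusion matrices (plus class priors) are re-estimated alternately, starting from a soft majority vote. Variable names and the toy data are illustrative, not taken from the paper.

```python
# Compact sketch of Dawid-Skene-style EM for categorical crowd labels.
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels: list of (item, worker, label) triples with integer class labels."""
    items = sorted({i for i, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    i_idx = {i: n for n, i in enumerate(items)}
    w_idx = {w: n for n, w in enumerate(workers)}

    # Initialize item posteriors from per-item vote fractions (soft majority vote).
    T = np.zeros((len(items), n_classes))
    for i, w, l in labels:
        T[i_idx[i], l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and worker confusion matrices from current posteriors.
        priors = T.mean(axis=0)
        conf = np.full((len(workers), n_classes, n_classes), 1e-6)
        for i, w, l in labels:
            conf[w_idx[w], :, l] += T[i_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item posteriors from priors and confusion matrices.
        logT = np.tile(np.log(priors), (len(items), 1))
        for i, w, l in labels:
            logT[i_idx[i]] += np.log(conf[w_idx[w], :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return {items[n]: int(T[n].argmax()) for n in range(len(items))}

# Toy run: worker 2 disagrees with the others on item "b"; EM still recovers
# consensus labels by down-weighting that worker's responses.
triples = [("a", 0, 1), ("a", 1, 1), ("a", 2, 1),
           ("b", 0, 0), ("b", 1, 0), ("b", 2, 1)]
print(dawid_skene(triples, n_classes=2))
```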
Journal Article

Learning From Crowds

TL;DR: A probabilistic approach is proposed for supervised learning when multiple annotators provide (possibly noisy) labels but no absolute gold standard is available; experimental results indicate that the proposed method is superior to the commonly used majority-voting baseline.
Proceedings Article

Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

TL;DR: A probabilistic model is presented and shown to outperform the commonly used "Majority Vote" heuristic for inferring image labels, while remaining robust to both noisy and adversarial labelers.
Proceedings Article

Quality management on Amazon Mechanical Turk

TL;DR: This work presents algorithms that improve on existing state-of-the-art techniques by enabling the separation of worker bias from error, and illustrates how to incorporate cost-sensitive classification errors into the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring worker quality.