Truth inference in crowdsourcing: is the problem solved?
Summary
1. INTRODUCTION
- Crowdsourcing solutions have been proposed to address tasks that are hard for machines, e.g., entity resolution [8] and sentiment analysis [32].
- To address this problem of noisy answers, one can label the ground truth for a small portion of tasks (called golden tasks) and use them to estimate workers' quality.
- These algorithms have not been compared under the same experimental framework, making it hard for practitioners to select appropriate algorithms.
- To summarize, the authors make the following contributions: they survey 17 existing algorithms, summarize a framework (Section 3), and provide an in-depth analysis and summary of the 17 algorithms from different perspectives (Sections 4-5), which can help practitioners easily grasp existing truth inference algorithms.
2. PROBLEM DEFINITION
- Each task asks workers to provide an answer, e.g., to select one of the candidate choices or to give a numeric value.
- A direct extension of single-choice task is multiple-choice task, where workers can select multiple choices (not only a single choice) out of a set of candidate choices.
- In image tagging, given a set of candidate tags for an image, it asks workers to select the tags that the image contains.
- Let v_i^w denote worker w's answer for task t_i, and let the set of answers V = {v_i^w} contain the collected workers' answers for all tasks.
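For concreteness, the sketches in the following sections use a simple Python layout for the answer set V; the dictionary keyed by (task, worker) pairs is an illustrative choice, not something prescribed by the paper.

```python
# One simple (assumed) layout for the collected answers V: worker w's
# answer v_i^w for task t_i is stored under the key (task, worker).
answers = {
    ("t1", "w1"): "T", ("t1", "w2"): "T", ("t1", "w3"): "F",
    ("t2", "w1"): "F", ("t2", "w2"): "F", ("t2", "w3"): "T",
}
tasks = {t for t, _ in answers}      # the set of tasks {t_i}
workers = {w for _, w in answers}    # the set of workers W
```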
3. SOLUTION FRAMEWORK
- A naive solution is Majority Voting (MV) [20, 39, 37], which regards the choice answered by the majority of workers as the truth.
- The authors discuss how existing works model a task in Section 4.1.
- The two steps are repeated until convergence, which is checked in lines 9-11 of the algorithm.
- Finally the inferred truth and workers’ qualities are returned.
- In the first iteration, step 1 computes each task's truth from workers' answers by selecting the choice that receives the highest aggregated worker quality (as sketched below).
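A minimal sketch of this two-step iteration for decision-making tasks, assuming the simple worker-probability model; the function name `infer_truth`, the fixed iteration count, and the initial quality of 0.8 are illustrative choices, not the paper's.

```python
from collections import defaultdict

def infer_truth(answers, n_iter=20):
    """Two-step iteration (Section 3): step 1 infers each task's truth from
    quality-weighted votes; step 2 re-estimates each worker's quality."""
    workers = {w for _, w in answers}
    quality = {w: 0.8 for w in workers}   # assumed initial worker qualities
    truth = {}
    for _ in range(n_iter):
        # Step 1: pick, per task, the choice with the highest aggregated
        # worker quality.
        votes = defaultdict(lambda: defaultdict(float))
        for (t, w), v in answers.items():
            votes[t][v] += quality[w]
        truth = {t: max(ws, key=ws.get) for t, ws in votes.items()}

        # Step 2: a worker's quality is the fraction of their answers that
        # agree with the currently inferred truth.
        correct, total = defaultdict(int), defaultdict(int)
        for (t, w), v in answers.items():
            total[w] += 1
            correct[w] += (v == truth[t])
        quality = {w: correct[w] / total[w] for w in workers}
    return truth, quality

# Toy run: three workers answer two decision-making tasks.
answers = {("t1", "w1"): "T", ("t1", "w2"): "T", ("t1", "w3"): "F",
           ("t2", "w1"): "F", ("t2", "w2"): "F", ("t2", "w3"): "F"}
print(infer_truth(answers))  # ({'t1': 'T', 't2': 'F'}, {'w1': 1.0, ...})
```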
4.1 Task Modeling
- Task Difficulty: Unlike most existing works, which assume that a worker has the same quality for answering different tasks, some recent works [53, 35] model the difficulty of each task.
- They assume that each task has its own difficulty level, and the more difficult a task is, the harder it is for a worker to answer it correctly.
- Latent Topics: The basic idea is to exploit the diverse topics in a task, where the number of topics (i.e., K) is pre-defined.
- Existing studies [19, 35] make use of the text description of each task and adopt topic model techniques [6, 56] to generate a vector of size K for the task, while Multi [51] learns a K-size vector without referring to external information (e.g., text descriptions).
- Based on these task models, a worker is likely to answer a task correctly if the worker has high quality on the task's related topics (see the toy illustration below).
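As a toy numeric illustration of the latent-topic idea (an assumed simplification, not the exact formulation in [19, 35, 51]): represent the task as a K-dimensional topic distribution, give the worker a per-topic quality vector, and score the chance of a correct answer by the topic-weighted combination of the two.

```python
import numpy as np

K = 3                                              # pre-defined number of topics
task_topics = np.array([0.7, 0.2, 0.1])            # task is mostly about topic 0
worker_topic_quality = np.array([0.9, 0.5, 0.6])   # worker is strong on topic 0

# Topic-weighted chance that this worker answers this task correctly.
p_correct = float(task_topics @ worker_topic_quality)
print(round(p_correct, 2))  # 0.79
```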
4.2 Worker Modeling
- Worker Probability: Worker probability uses a single real number (between 0 and 1) to model a worker w's quality q^w ∈ [0, 1], which represents the probability that worker w correctly answers a task.
- Some recent works [53, 31] extend worker probability to model a worker's quality on a wider range, e.g., q^w ∈ (−∞, +∞), where a higher q^w means that worker w answers tasks with higher quality.
- Worker Bias and Variance: Worker bias and variance [51, 41] are proposed to handle numeric tasks, where worker bias captures the effect that a worker may underestimate (or overestimate) the truth of a task, and worker variance captures the variation of errors around the bias (see the numeric-task sketch after this list).
- Diverse Skills: A worker may have various levels of expertise for different topics.
- For example, a sports fan who rarely pays attention to entertainment may answer tasks related to sports more correctly than tasks related to entertainment.
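For numeric tasks, the bias/variance idea can be sketched as below; treating the bias as the mean signed error against the current truth estimates and the variance as the spread of errors around that bias is an assumed simplification, not the exact update in [51, 41].

```python
import statistics

def estimate_bias_variance(worker_answers, truths):
    # Model sketch: answer ≈ truth + bias + noise. Bias is the mean signed
    # error; variance is the spread of errors around that bias.
    errors = [worker_answers[t] - truths[t] for t in worker_answers]
    bias = statistics.mean(errors)
    variance = statistics.pvariance(errors, mu=bias)
    return bias, variance

worker_answers = {"t1": 4.2, "t2": 6.1, "t3": 5.0}   # one worker's numeric answers
truths = {"t1": 4.0, "t2": 6.0, "t3": 4.5}           # current truth estimates
print(estimate_bias_variance(worker_answers, truths))  # ≈ (0.267, 0.029)
```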
5.2 Optimization
- The basic idea of optimization methods is to set a self-defined optimization function that captures the relations between workers’ qualities and tasks’ truth, and then derive an iterative method to compute these two sets of parameters collectively.
- The differences among existing works [5, 31, 30, 61] are that they model workers’ qualities differently and apply different optimization functions to capture the relations between the two sets of parameters.
- To capture these intuitions, PM [5, 31] develops an iterative approach similar to Algorithm 1; in each iteration, it applies the two steps illustrated in Section 3.
- Finally, [61] devises an iterative approach to infer the two sets of parameters {v_i^*} and {π^w}.
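A hedged sketch of such an optimization-style iteration for numeric tasks, loosely in the spirit of PM/CRH [5, 31] (their exact loss and weight updates differ): truths become weight-normalized averages of answers, and a worker's weight shrinks with their total deviation from the current truths.

```python
import math
from collections import defaultdict

def optimize(answers, n_iter=20, eps=1e-6):
    workers = {w for _, w in answers}
    weights = {w: 1.0 for w in workers}
    truth = {}
    for _ in range(n_iter):
        # Step 1: the truth of each numeric task is the weight-normalized
        # average of the collected answers.
        num, den = defaultdict(float), defaultdict(float)
        for (t, w), v in answers.items():
            num[t] += weights[w] * v
            den[t] += weights[w]
        truth = {t: num[t] / den[t] for t in num}

        # Step 2: a worker's weight shrinks with their total squared
        # deviation from the current truths (a log-ratio-style update).
        dev = {w: eps for w in workers}
        for (t, w), v in answers.items():
            dev[w] += (v - truth[t]) ** 2
        total = sum(dev.values())
        weights = {w: max(eps, -math.log(dev[w] / total)) for w in workers}
    return truth, weights

answers = {("t1", "w1"): 10.0, ("t1", "w2"): 10.2, ("t1", "w3"): 14.0,
           ("t2", "w1"): 3.0,  ("t2", "w2"): 2.9,  ("t2", "w3"): 6.0}
print(optimize(answers)[0])  # truths pulled toward the two consistent workers
```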
5.3 Probabilistic Graphical Model (PGM)
- A probabilistic graphical model (PGM) [28] is a graph which expresses the conditional dependency structure (represented by edges) between random variables (represented by nodes).
- Figure 1 shows the general PGM adopted in existing works.
- Thus ZC [16] applies the EM (Expectation-Maximization) framework [17] and iteratively updates q^w and v_i^* to approximate the optimal values (a sketch of this EM-style iteration follows this list).
- The method D&S [15], which models a worker as a confusion matrix, is also widely used.
- [35] combines topic modeling (i.e., TwitterLDA [56]) with truth inference, and [59] leverages entity linking and a knowledge base to exploit a worker's diverse skills.
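To make the EM-style iteration concrete, here is a minimal sketch for decision-making tasks under the worker-probability model, written in the spirit of ZC [16] (the actual derivations in ZC and D&S [15] differ in detail; the function name, the uniform prior, and the initial quality of 0.7 are my assumptions).

```python
from collections import defaultdict

def em_infer(answers, choices=("T", "F"), n_iter=30):
    workers = {w for _, w in answers}
    tasks = {t for t, _ in answers}
    quality = {w: 0.7 for w in workers}   # assumed initial worker probabilities
    posterior = {}
    for _ in range(n_iter):
        # E-step: P(truth of t = c) ∝ product over workers of q^w if the
        # worker answered c, else (1 - q^w) split uniformly over the wrong
        # choices; a uniform prior over choices is assumed.
        posterior = {}
        for t in tasks:
            probs = dict.fromkeys(choices, 1.0)
            for (t2, w), v in answers.items():
                if t2 != t:
                    continue
                for c in choices:
                    probs[c] *= quality[w] if v == c else (1 - quality[w]) / (len(choices) - 1)
            z = sum(probs.values())
            posterior[t] = {c: p / z for c, p in probs.items()}

        # M-step: a worker's quality is the expected fraction of their
        # answers that match the (soft) inferred truth.
        num, den = defaultdict(float), defaultdict(float)
        for (t, w), v in answers.items():
            num[w] += posterior[t][v]
            den[w] += 1.0
        quality = {w: min(max(num[w] / den[w], 1e-6), 1 - 1e-6) for w in workers}

    truth = {t: max(p, key=p.get) for t, p in posterior.items()}
    return truth, quality

answers = {("t1", "w1"): "T", ("t1", "w2"): "T", ("t1", "w3"): "F"}
print(em_infer(answers))  # e.g. ({'t1': 'T'}, {'w1': ..., 'w2': ..., 'w3': ...})
```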
6. EXPERIMENTS
- The authors evaluate 17 existing methods (Table 4) on real datasets.
- The authors first introduce the experimental setup (Section 6.1), and then analyze the quality of collected crowdsourced data (Section 6.2).
- Finally the authors compare with existing methods (Section 6.3).
- The authors implement the experiments in Python on a server with a 2.40 GHz CPU and 60 GB of memory.
6.1 Experimental Setup
- There are many public crowdsourcing datasets [13].
- In Table 5, for each selected dataset, the authors list the following statistics: the number of tasks (#tasks, n), the number of collected answers (|V|), the average number of answers per task (|V|/n), the number of tasks with ground truth (#truth; some large datasets only provide a subset as ground truth), and the number of workers (|W|).
- Each task in the dataset contains two products (with descriptions) and two choices (T, F), and it asks workers to identify whether the claim “the two products are the same” is true (‘T’) or false (‘F’).
- A higher score indicates a stronger degree of the emotion.
- The authors use different metrics for different task types.
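As a brief sketch (the exact metric definitions should be taken from the paper itself): choice-style tasks are typically scored by accuracy, and numeric tasks by mean absolute error (MAE) or root mean squared error (RMSE).

```python
def accuracy(pred, truth):
    # Fraction of tasks whose inferred choice equals the ground truth.
    return sum(pred[t] == truth[t] for t in truth) / len(truth)

def mae(pred, truth):
    # Mean absolute error for numeric tasks.
    return sum(abs(pred[t] - truth[t]) for t in truth) / len(truth)

def rmse(pred, truth):
    # Root mean squared error for numeric tasks.
    return (sum((pred[t] - truth[t]) ** 2 for t in truth) / len(truth)) ** 0.5
```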
6.2 Crowdsourced Data Quality
- The tasks are published to crowd workers, who then answer them.
- In Figure 3, for each dataset, the authors show each worker's quality, computed by comparing the worker's answers against the tasks' ground truth (see the sketch after this list).
- (1) The Quality of Different Methods in Different Datasets.
- Other methods with more complicated task models and worker models do not show clear benefits in quality.
- (1) In terms of quality, the methods with Optimization and PGM are more effective than the methods with Direct Computation, as they consider more parameters and study how to infer them iteratively.
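The per-worker quality plotted in Figure 3 can be computed roughly as below (a sketch with my own names, assuming choice-style tasks where a worker's quality is the fraction of their answers matching the ground truth).

```python
def worker_quality(answers, truth):
    # Fraction of each worker's answers that match the known ground truth;
    # tasks without ground truth are skipped.
    correct, total = {}, {}
    for (t, w), v in answers.items():
        if t not in truth:
            continue
        total[w] = total.get(w, 0) + 1
        correct[w] = correct.get(w, 0) + (v == truth[t])
    return {w: correct[w] / total[w] for w in total}
```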
7. CONCLUSION & FUTURE DIRECTIONS
- The authors provide a detailed survey on truth inference in crowdsourcing and perform an in-depth analysis of 17 existing methods.
- The authors also conduct extensive experiments to compare these methods on 5 datasets with varying task types and sizes.
- In order to collect high-quality crowdsourced data in an efficient way, it is important to design tasks with a friendly user interface (UI) at a feasible price.
Frequently Asked Questions (13)
Q2. What have the authors stated about future work in "Truth inference in crowdsourcing: is the problem solved?"?
The authors also point out the following future research directions. It is also interesting to study the relations between the design of the UI, the price, worker latency, and quality. Not all methods can benefit from a qualification test, and the quality of some methods even decreases. Although most methods can benefit from them, the improvements vary across datasets and methods.
Q3. What are the three categories of methods used in Algorithm 1?
Based on the used techniques, they can be classified into the following three categories: direct computation [20, 39], optimization methods [61, 19, 30, 5] and probabilistic graphical model methods [34, 16, 15, 53, 51, 41, 26, 33, 35, 27, 46].
Q4. Why does the quality of methods for single-label tasks decrease?
(3) On S_Rel, the quality of the methods CATD and ZC decreases when r ≥ 4, probably because they are sensitive to low-quality workers' answers.
Q5. Why is access to the crowd much easier?
Due to the wide deployment of public crowdsourcing platforms, e.g., Amazon Mechanical Turk (AMT) [2] and CrowdFlower [12], access to the crowd has become much easier.
Q6. What are the main reasons why the methods with confusion matrix perform better than methods with worker probability?
In terms of worker models, in general, methods with a confusion matrix (D&S, BCC, CBCC, LFC, VI-BP, VI-MF) perform better than methods with worker probability (ZC, GLAD, CATD, PM, KOS), since a confusion matrix is more expressive than worker probability.
Q7. What is the reason for the slow convergence of the methods?
(5) In terms of task models, the methods that model task difficulty (GLAD) or latent topics (Multi) in tasks do not perform significantly better in quality; moreover, they often take more time to converge.
Q8. What are the main reasons why different methods with optimization differ in efficiency?
Different optimization functions often vary significantly in efficiency, e.g., Bayesian Estimator is less efficient than Point Estimation, and some techniques (e.g., Gibbs Sampling, Variational Inference) often take a long time to converge.
Q9. How many methods can initialize workers’ quality?
The authors find that there are only 8 methods (i.e., ZC, GLAD, D&S, LFC, CATD, PM, VI-MF and LFC_N) that can initialize workers' qualities using a qualification test.
Q10. What is the average deviation of the worker’s answers?
As the answers obtained for each task have inherent orderings, in order to capture the consistency of workers' answers, for a task t_i, the authors first compute the median v̄_i (a robust statistic that is not sensitive to outliers) over all its collected answers; then the consistency (C) is defined as the average deviation from the median, i.e., C = (1/n) · Σ_{i=1}^{n} √( Σ_{w∈W_i} (v_i^w − v̄_i)² / |W_i| ), where W_i denotes the set of workers who have answered task t_i.
Q11. What is the quality of different methods in different datasets?
For dataset D_Product, i.e., Figures 4(a) and (b), one can observe that (1) as the data redundancy r is varied in [1, 3], the quality increases with r for different methods.
Q12. What is the basic method for calculating worker probability?
Suppose all tasks are decision-making tasks (v_i^* ∈ {T, F}) and each worker's quality is modeled as a worker probability q^w ∈ [0, 1].
Q13. What is the quality of methods for single-label tasks?
(4) The quality of methods for single-label tasks is lower than that for decision-making tasks, since workers are not good at answering tasks with multiple choices, and the methods for single-label tasks are sensitive to low-quality workers.