Truth inference in crowdsourcing: is the problem solved?
Summary
1. INTRODUCTION
- Crowdsourcing solutions have been proposed to address tasks that are hard for machines, e.g., entity resolution [8] and sentiment analysis [32].
- To address this problem of noisy answers, one can label the ground truth for a small portion of tasks (called golden tasks) and use them to estimate workers' quality.
- These algorithms have not been compared under the same experimental framework, making it hard for practitioners to select appropriate algorithms.
- To summarize, the authors make the following contributions: they survey 17 existing algorithms, summarize a framework (Section 3), and provide an in-depth analysis and summary of the 17 algorithms from different perspectives (Sections 4-5), which can help practitioners easily grasp existing truth inference algorithms.
2. PROBLEM DEFINITION
- Each task asks workers to provide an answer, e.g., to select one of the candidate choices or to give a numeric value.
- A direct extension of single-choice task is multiple-choice task, where workers can select multiple choices (not only a single choice) out of a set of candidate choices.
- In image tagging, given a set of candidate tags for an image, it asks workers to select the tags that the image contains.
- Let v_i^w denote worker w's answer for task t_i, and let the set of answers V = {v_i^w} contain the collected workers' answers for all tasks.
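For concreteness, the sketches in the following sections use a simple Python layout for the answer set V; the dictionary keyed by (task, worker) pairs is an illustrative choice, not something prescribed by the paper.

```python
# One simple (assumed) layout for the collected answers V: worker w's
# answer v_i^w for task t_i is stored under the key (task, worker).
answers = {
    ("t1", "w1"): "T", ("t1", "w2"): "T", ("t1", "w3"): "F",
    ("t2", "w1"): "F", ("t2", "w2"): "F", ("t2", "w3"): "T",
}
tasks = {t for t, _ in answers}      # the set of tasks {t_i}
workers = {w for _, w in answers}    # the set of workers W
```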
3. SOLUTION FRAMEWORK
- A naive solution is Majority Voting (MV) [20, 39, 37], which regards the choice answered by the majority of workers as the truth.
- The authors discuss how existing works model a task in Section 4.1.
- The two steps are repeated until convergence, which is checked in lines 9-11 of the algorithm.
- Finally the inferred truth and workers’ qualities are returned.
- In the first iteration, step 1 computes each task's truth from workers' answers by selecting the choice that receives the highest aggregated worker quality (as sketched below).
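A minimal sketch of this two-step iteration for decision-making tasks, assuming the simple worker-probability model; the function name `infer_truth`, the fixed iteration count, and the initial quality of 0.8 are illustrative choices, not the paper's.

```python
from collections import defaultdict

def infer_truth(answers, n_iter=20):
    """Two-step iteration (Section 3): step 1 infers each task's truth from
    quality-weighted votes; step 2 re-estimates each worker's quality."""
    workers = {w for _, w in answers}
    quality = {w: 0.8 for w in workers}   # assumed initial worker qualities
    truth = {}
    for _ in range(n_iter):
        # Step 1: pick, per task, the choice with the highest aggregated
        # worker quality.
        votes = defaultdict(lambda: defaultdict(float))
        for (t, w), v in answers.items():
            votes[t][v] += quality[w]
        truth = {t: max(ws, key=ws.get) for t, ws in votes.items()}

        # Step 2: a worker's quality is the fraction of their answers that
        # agree with the currently inferred truth.
        correct, total = defaultdict(int), defaultdict(int)
        for (t, w), v in answers.items():
            total[w] += 1
            correct[w] += (v == truth[t])
        quality = {w: correct[w] / total[w] for w in workers}
    return truth, quality

# Toy run: three workers answer two decision-making tasks.
answers = {("t1", "w1"): "T", ("t1", "w2"): "T", ("t1", "w3"): "F",
           ("t2", "w1"): "F", ("t2", "w2"): "F", ("t2", "w3"): "F"}
print(infer_truth(answers))  # ({'t1': 'T', 't2': 'F'}, {'w1': 1.0, ...})
```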
4.1 Task Modeling
- Task Difficulty: Unlike most existing works, which assume that a worker has the same quality for answering different tasks, some recent works [53, 35] model the difficulty of each task.
- They assume that each task has its own difficulty level, and the more difficult a task is, the harder it is for a worker to answer it correctly.
- Latent Topics: The basic idea is to exploit the diverse topics in a task, where the number of topics (i.e., K) is pre-defined.
- Existing studies [19, 35] make use of the text description of each task and adopt topic model techniques [6, 56] to generate a vector of size K for the task, while Multi [51] learns a K-size vector without referring to external information (e.g., text descriptions).
- Based on these task models, a worker is likely to answer a task correctly if the worker has high quality on the task's related topics (see the toy illustration below).
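As a toy numeric illustration of the latent-topic idea (an assumed simplification, not the exact formulation in [19, 35, 51]): represent the task as a K-dimensional topic distribution, give the worker a per-topic quality vector, and score the chance of a correct answer by the topic-weighted combination of the two.

```python
import numpy as np

K = 3                                              # pre-defined number of topics
task_topics = np.array([0.7, 0.2, 0.1])            # task is mostly about topic 0
worker_topic_quality = np.array([0.9, 0.5, 0.6])   # worker is strong on topic 0

# Topic-weighted chance that this worker answers this task correctly.
p_correct = float(task_topics @ worker_topic_quality)
print(round(p_correct, 2))  # 0.79
```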
4.2 Worker Modeling
- Worker Probability: Worker probability uses a single real number (between 0 and 1) to model a worker w's quality q^w ∈ [0, 1], which represents the probability that worker w correctly answers a task.
- Some recent works [53, 31] extend worker probability to model a worker's quality on a wider range, e.g., q^w ∈ (−∞, +∞), where a higher q^w means that worker w answers tasks with higher quality.
- Worker Bias and Variance: Worker bias and variance [51, 41] are proposed to handle numeric tasks, where worker bias captures the effect that a worker may underestimate (or overestimate) the truth of a task, and worker variance captures the variation of errors around the bias (see the numeric-task sketch after this list).
- Diverse Skills: A worker may have various levels of expertise for different topics.
- For example, a sports fan who rarely pays attention to entertainment may answer tasks related to sports more correctly than tasks related to entertainment.
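For numeric tasks, the bias/variance idea can be sketched as below; treating the bias as the mean signed error against the current truth estimates and the variance as the spread of errors around that bias is an assumed simplification, not the exact update in [51, 41].

```python
import statistics

def estimate_bias_variance(worker_answers, truths):
    # Model sketch: answer ≈ truth + bias + noise. Bias is the mean signed
    # error; variance is the spread of errors around that bias.
    errors = [worker_answers[t] - truths[t] for t in worker_answers]
    bias = statistics.mean(errors)
    variance = statistics.pvariance(errors, mu=bias)
    return bias, variance

worker_answers = {"t1": 4.2, "t2": 6.1, "t3": 5.0}   # one worker's numeric answers
truths = {"t1": 4.0, "t2": 6.0, "t3": 4.5}           # current truth estimates
print(estimate_bias_variance(worker_answers, truths))  # ≈ (0.267, 0.029)
```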
5.2 Optimization
- The basic idea of optimization methods is to set a self-defined optimization function that captures the relations between workers’ qualities and tasks’ truth, and then derive an iterative method to compute these two sets of parameters collectively.
- The differences among existing works [5, 31, 30, 61] are that they model workers’ qualities differently and apply different optimization functions to capture the relations between the two sets of parameters.
- To capture these intuitions, PM [5, 31] develops an iterative approach similar to Algorithm 1; in each iteration, it applies the two steps illustrated in Section 3.
- Finally, [61] devises an iterative approach to infer the two sets of parameters {v_i^*} and {π^w}.
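A hedged sketch of such an optimization-style iteration for numeric tasks, loosely in the spirit of PM/CRH [5, 31] (their exact loss and weight updates differ): truths become weight-normalized averages of answers, and a worker's weight shrinks with their total deviation from the current truths.

```python
import math
from collections import defaultdict

def optimize(answers, n_iter=20, eps=1e-6):
    workers = {w for _, w in answers}
    weights = {w: 1.0 for w in workers}
    truth = {}
    for _ in range(n_iter):
        # Step 1: the truth of each numeric task is the weight-normalized
        # average of the collected answers.
        num, den = defaultdict(float), defaultdict(float)
        for (t, w), v in answers.items():
            num[t] += weights[w] * v
            den[t] += weights[w]
        truth = {t: num[t] / den[t] for t in num}

        # Step 2: a worker's weight shrinks with their total squared
        # deviation from the current truths (a log-ratio-style update).
        dev = {w: eps for w in workers}
        for (t, w), v in answers.items():
            dev[w] += (v - truth[t]) ** 2
        total = sum(dev.values())
        weights = {w: max(eps, -math.log(dev[w] / total)) for w in workers}
    return truth, weights

answers = {("t1", "w1"): 10.0, ("t1", "w2"): 10.2, ("t1", "w3"): 14.0,
           ("t2", "w1"): 3.0,  ("t2", "w2"): 2.9,  ("t2", "w3"): 6.0}
print(optimize(answers)[0])  # truths pulled toward the two consistent workers
```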
5.3 Probabilistic Graphical Model (PGM)
- A probabilistic graphical model (PGM) [28] is a graph which expresses the conditional dependency structure (represented by edges) between random variables (represented by nodes).
- Figure 1 shows the general PGM adopted in existing works.
- Thus ZC [16] applies the EM (Expectation-Maximization) framework [17] and iteratively updates q^w and v_i^* to approximate the optimal values (a sketch of this EM-style iteration follows this list).
- The method D&S [15], which models a worker as a confusion matrix, is also widely used.
- [35] combines topic modeling (i.e., TwitterLDA [56]) with truth inference, and [59] leverages entity linking and a knowledge base to exploit a worker's diverse skills.
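To make the EM-style iteration concrete, here is a minimal sketch for decision-making tasks under the worker-probability model, written in the spirit of ZC [16] (the actual derivations in ZC and D&S [15] differ in detail; the function name, the uniform prior, and the initial quality of 0.7 are my assumptions).

```python
from collections import defaultdict

def em_infer(answers, choices=("T", "F"), n_iter=30):
    workers = {w for _, w in answers}
    tasks = {t for t, _ in answers}
    quality = {w: 0.7 for w in workers}   # assumed initial worker probabilities
    posterior = {}
    for _ in range(n_iter):
        # E-step: P(truth of t = c) ∝ product over workers of q^w if the
        # worker answered c, else (1 - q^w) split uniformly over the wrong
        # choices; a uniform prior over choices is assumed.
        posterior = {}
        for t in tasks:
            probs = dict.fromkeys(choices, 1.0)
            for (t2, w), v in answers.items():
                if t2 != t:
                    continue
                for c in choices:
                    probs[c] *= quality[w] if v == c else (1 - quality[w]) / (len(choices) - 1)
            z = sum(probs.values())
            posterior[t] = {c: p / z for c, p in probs.items()}

        # M-step: a worker's quality is the expected fraction of their
        # answers that match the (soft) inferred truth.
        num, den = defaultdict(float), defaultdict(float)
        for (t, w), v in answers.items():
            num[w] += posterior[t][v]
            den[w] += 1.0
        quality = {w: min(max(num[w] / den[w], 1e-6), 1 - 1e-6) for w in workers}

    truth = {t: max(p, key=p.get) for t, p in posterior.items()}
    return truth, quality

answers = {("t1", "w1"): "T", ("t1", "w2"): "T", ("t1", "w3"): "F"}
print(em_infer(answers))  # e.g. ({'t1': 'T'}, {'w1': ..., 'w2': ..., 'w3': ...})
```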
6. EXPERIMENTS
- The authors evaluate 17 existing methods (Table 4) on real datasets.
- The authors first introduce the experimental setup (Section 6.1), and then analyze the quality of collected crowdsourced data (Section 6.2).
- Finally the authors compare with existing methods (Section 6.3).
- The authors implement the experiments in Python on a server with a 2.40 GHz CPU and 60 GB of memory.
6.1 Experimental Setup
- There are many public crowdsourcing datasets [13].
- In Table 5, for each selected dataset, the authors list the following statistics: the number of tasks (#tasks, n), the number of collected answers (|V|), the average number of answers per task (|V|/n), the number of tasks with ground truth (#truth; some large datasets only provide a subset as ground truth), and the number of workers (|W|).
- Each task in the dataset contains two products (with descriptions) and two choices (T, F), and it asks workers to identify whether the claim “the two products are the same” is true (‘T’) or false (‘F’).
- A higher score indicates a stronger degree of the emotion.
- The authors use different metrics for different task types.
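As a brief sketch (the exact metric definitions should be taken from the paper itself): choice-style tasks are typically scored by accuracy, and numeric tasks by mean absolute error (MAE) or root mean squared error (RMSE).

```python
def accuracy(pred, truth):
    # Fraction of tasks whose inferred choice equals the ground truth.
    return sum(pred[t] == truth[t] for t in truth) / len(truth)

def mae(pred, truth):
    # Mean absolute error for numeric tasks.
    return sum(abs(pred[t] - truth[t]) for t in truth) / len(truth)

def rmse(pred, truth):
    # Root mean squared error for numeric tasks.
    return (sum((pred[t] - truth[t]) ** 2 for t in truth) / len(truth)) ** 0.5
```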
6.2 Crowdsourced Data Quality
- The tasks are published to crowd workers, who then answer them.
- In Figure 3, for each dataset, the authors show each worker's quality, computed by comparing the worker's answers against the tasks' ground truth (see the sketch after this list).
- (1) The Quality of Different Methods in Different Datasets.
- Other methods with more complicated task models and worker models do not show clear benefits in quality.
- (1) In terms of quality, the methods with Optimization and PGM are more effective than the methods with Direct Computation, as they consider more parameters and study how to infer them iteratively.
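The per-worker quality plotted in Figure 3 can be computed roughly as below (a sketch with my own names, assuming choice-style tasks where a worker's quality is the fraction of their answers matching the ground truth).

```python
def worker_quality(answers, truth):
    # Fraction of each worker's answers that match the known ground truth;
    # tasks without ground truth are skipped.
    correct, total = {}, {}
    for (t, w), v in answers.items():
        if t not in truth:
            continue
        total[w] = total.get(w, 0) + 1
        correct[w] = correct.get(w, 0) + (v == truth[t])
    return {w: correct[w] / total[w] for w in total}
```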
7. CONCLUSION & FUTURE DIRECTIONS
- The authors provide a detailed survey on truth inference in crowdsourcing and perform an in-depth analysis of 17 existing methods.
- The authors also conduct extensive experiments to compare these methods on 5 datasets with varying task types and sizes.
- In order to collect high-quality crowdsourced data in an efficient way, it is important to design tasks with a friendly user interface (UI) at a feasible price.
Frequently Asked Questions (13)
Q2. What have the authors stated about future work in "Truth inference in crowdsourcing: is the problem solved?"?
The authors also point out the following future research directions. It is also interesting to study the relations between the design of the UI, the price, worker latency, and quality. Not all methods can benefit from a qualification test, and the quality of some methods even decreases. Although most methods can benefit from them, the improvements vary across datasets and methods.
Q3. What are the three categories of methods used in Algorithm 1?
Based on the used techniques, they can be classified into the following three categories: direct computation [20, 39], optimization methods [61, 19, 30, 5] and probabilistic graphical model methods [34, 16, 15, 53, 51, 41, 26, 33, 35, 27, 46].
Q4. Why does the quality of methods for single-label tasks decrease?
(3) On S_Rel, the quality of the methods CATD and ZC decreases when r ≥ 4, probably because they are sensitive to low-quality workers' answers.
Q5. Why is access to the crowd much easier?
Due to the wide deployment of public crowdsourcing platforms, e.g., Amazon Mechanical Turk (AMT) [2] and CrowdFlower [12], access to the crowd has become much easier.
Q6. What are the main reasons why the methods with confusion matrix perform better than methods with worker probability?
In terms of worker models, in general, methods with a confusion matrix (D&S, BCC, CBCC, LFC, VI-BP, VI-MF) perform better than methods with worker probability (ZC, GLAD, CATD, PM, KOS), since a confusion matrix is more expressive than worker probability.
Q7. What is the reason for the slow convergence of the methods?
(5) In terms of task models, the methods that model task difficulty (GLAD) or latent topics (Multi) in tasks do not perform significantly better in quality; moreover, they often take more time to converge.
Q8. What are the main reasons why different methods with optimization differ in efficiency?
Different optimization functions often vary significantly in efficiency, e.g., Bayesian Estimator is less efficient than Point Estimation, and some techniques (e.g., Gibbs Sampling, Variational Inference) often take a long time to converge.
Q9. How many methods can initialize workers’ quality?
The authors find that there are only 8 methods (i.e., ZC, GLAD, D&S, LFC, CATD, PM, VI-MF and LFC_N) that can initialize workers' qualities using a qualification test.
Q10. What is the average deviation of the worker’s answers?
As the answers obtained for each task have inherent orderings, in order to capture the consistency of workers' answers, for a task t_i, the authors first compute the median v̄_i (a robust statistic that is not sensitive to outliers) over all its collected answers; then the consistency (C) is defined as the average deviation from the median, i.e., C = (1/n) · Σ_{i=1}^{n} √( Σ_{w∈W_i} (v_i^w − v̄_i)² / |W_i| ), where W_i denotes the set of workers who have answered task t_i.
Q11. What is the quality of different methods in different datasets?
For dataset D_Product, i.e., Figures 4(a) and (b), one can observe that (1) as the data redundancy r is varied in [1, 3], the quality increases with r for different methods.
Q12. What is the basic method for calculating worker probability?
Suppose all tasks are decision-making tasks (v_i^* ∈ {T, F}) and each worker's quality is modeled as a worker probability q^w ∈ [0, 1].
Q13. What is the quality of methods for single-label tasks?
(4) The quality of methods for single-label tasks is lower than that for decision-making tasks, since workers are not good at answering tasks with multiple choices, and the methods for single-label tasks are sensitive to low-quality workers.