
A Confidence-Aware Approach for Truth Discovery on
Long-Tail Data
Qi Li¹, Yaliang Li¹, Jing Gao¹, Lu Su¹, Bo Zhao², Murat Demirbas¹, Wei Fan³, and Jiawei Han⁴
¹ SUNY Buffalo, Buffalo, NY, USA
² Microsoft Research, Mountain View, CA, USA
³ Huawei Noah's Ark Lab, Hong Kong
⁴ University of Illinois, Urbana, IL, USA
{qli22,yaliangl,jing,lusu}@buffalo.edu, bozha@microsoft.com,
demirbas@buffalo.edu, david.fanwei@huawei.com, hanj@illinois.edu
ABSTRACT
In many real world applications, the same item may be described by
multiple sources. As a consequence, conflicts among these sources
are inevitable, which leads to an important task: how to identify
which piece of information is trustworthy, i.e., the truth discov-
ery task. Intuitively, if the piece of information is from a reliable
source, then it is more trustworthy, and the source that provides
trustworthy information is more reliable. Based on this princi-
ple, truth discovery approaches have been proposed to infer source
reliability degrees and the most trustworthy information (i.e., the
truth) simultaneously. However, existing approaches overlook the
ubiquitous long-tail phenomenon in the tasks, i.e., most sources
only provide a few claims and only a few sources make plenty
of claims, which causes the source reliability estimation for small
sources to be unreasonable. To tackle this challenge, we propose a
confidence-aware truth discovery (CATD) method to automatically
detect truths from conflicting data with long-tail phenomenon. The
proposed method not only estimates source reliability, but also con-
siders the confidence interval of the estimation, so that it can effec-
tively reflect real source reliability for sources with various levels
of participation. Experiments on four real world tasks as well as
simulated multi-source long-tail datasets demonstrate that the pro-
posed method outperforms existing state-of-the-art truth discovery
approaches by successfully discounting the effect of small sources.
1. INTRODUCTION
Big data leads to big challenges, not only in the volume of data
but also in its variety and veracity. In many real applications, mul-
tiple descriptions often exist about the same set of objects or events
from different sources. For example, customer information can
be found from multiple databases in a company, and a patient’s
medical records may be scattered across different hospitals. Un-
avoidably, data or information inconsistency arises from multiple
sources. Then, among conflicting pieces of data or information,
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li-
cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per-
mission prior to any use beyond those covered by the license. Contact
copyright holder by emailing info@vldb.org. Articles from this volume
were invited to present their results at the 41st International Conference on
Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast,
Hawaii.
Proceedings of the VLDB Endowment, Vol. 8, No. 4
Copyright 2014 VLDB Endowment 2150-8097/14/12.
which one is more trustworthy, or represents the true fact? Facing
the daunting scale of data, it is unrealistic to expect a human to
“label” or tell which data source is reliable or which piece of infor-
mation is accurate. Therefore, an important task is to automatically
infer data trustworthiness from multi-source data to resolve con-
flicts and find the most trustworthy piece of information.
Finding Trustworthy Information. One simple approach for this
task is to assume that “majority” represents the “truth”. In other
words, we take the value claimed by the majority sources or take the
average of the continuous values reported by sources, and regard it
as the most trustworthy fact. The drawback of this simple approach
is its inability to characterize the reliability levels of sources. It re-
gards all sources as equally reliable and does not distinguish them,
and thus may fail in scenarios when there exist sources sending
low quality information, such as faulty sensors that keep emanating
erroneous information, and spam users who propagate false infor-
mation on the Web. To overcome this limitation, techniques have
been proposed to simultaneously derive trustworthy facts and esti-
mate source reliability degrees [8, 9, 16, 18–20, 23–26, 31, 34–37].
A common principle underlying these techniques is as follows. The sources
which provide trustworthy information are more reliable, and the
information from reliable sources is more trustworthy. In these ap-
proaches, the most trustworthy fact, i.e., the truth, is computed as
a weighted voting or averaging among sources where more reli-
able ones have higher weights. Although different formulas have
been proposed to derive source weights (i.e., reliability degrees),
the same principle applies: The source weight should be propor-
tional to the probability of the source giving trustworthy informa-
tion. In practice, this probability is approximated as the percentage of
correct claims made by the source. The more claims a source makes, the
more likely it is that this estimate of source reliability is close to the
true reliability degree.
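The iterative principle above can be sketched in a few lines of Python. This is our own illustration, not code from the paper: `claims` is a hypothetical nested dict, and the inverse-mean-squared-error weight update is just one simple instance of "weight proportional to trustworthiness"; concrete methods differ in this update rule.

```python
def discover_truths(claims, iters=10, eps=1e-6):
    """claims: {entity: {source: value}} with numeric values.
    Returns (truths, weights). Illustrative sketch only."""
    sources = {s for obs in claims.values() for s in obs}
    weights = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iters):
        # Step 1: truths as weighted averages of the claims (weighted voting).
        for n, obs in claims.items():
            total = sum(weights[s] for s in obs)
            truths[n] = sum(weights[s] * v for s, v in obs.items()) / total
        # Step 2: weight each source by the inverse of its mean squared error.
        for s in sources:
            errs = [(v - truths[n]) ** 2
                    for n, obs in claims.items()
                    for s2, v in obs.items() if s2 == s]
            weights[s] = 1.0 / (sum(errs) / len(errs) + eps)
    return truths, weights


# Sources A and B agree closely; source C is faulty.
claims = {
    0: {"A": 10.0, "B": 10.1, "C": 15.0},
    1: {"A": 20.0, "B": 19.9, "C": 25.0},
    2: {"A": 30.0, "B": 30.2, "C": 10.0},
    3: {"A": 40.0, "B": 39.8, "C": 60.0},
}
truths, weights = discover_truths(claims)
```

Starting from uniform weights, the faulty source C is down-weighted after a few rounds, and the recovered truths track the two consistent sources.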
Long-tail Phenomenon. However, sources with very few claims
are common in applications. The number of claims made by sources
typically exhibits a long-tail phenomenon, that is, most of the sources
only provide information about one or two items, and there are only
a few sources that make lots of claims. For example, although there
are numerous websites containing information about one or several
celebrities, there are few websites which, like Wikipedia, provide
extensive coverage for thousands of celebrities. Another example
concerns user participation in surveys, reviews, or other activities. On
average, participants show interest in only a few items, whereas
very few participants cover most of the items. Long-tail phenomena
are ubiquitous in real world applications, which bring obstacles
to the task of information trustworthiness estimation.
Recall that identifying reliable sources is the key to find trust-
worthy information, and source reliability is typically estimated by
the empirical probability of making correct claims. The effective-
ness of this estimation is heavily affected by the total number of
claims made by each source. When a source makes a large num-
ber of claims, it is likely that we can obtain a relatively accurate
estimate of source reliability. However, sources with a few claims
occupy the majority when long-tail phenomenon exists. For such
“small” sources, there is no easy way to evaluate their reliability
degrees. Consider an extreme case when most sources only make
one claim to one single item. If the claim is correct, its accuracy
is one and the source is considered as highly reliable. If the claim
is wrong, its accuracy is zero and the source is regarded as highly
unreliable. In some sense such an estimate based on one single
claim is totally random. When weighted voting is conducted based
on the estimates of source reliability, the “unreliable” estimation of
source reliability for many “small” sources will inevitably impair
the ability of detecting trustworthy information. We illustrate this
phenomenon and its effect on the task of truth discovery with more
details in Sections 2 and 4.
Limitation of Traditional Approaches. One may argue that one
way to tackle the issue of insufficient data for accurate reliability es-
timation is to remove sources that provide only a few claims. How-
ever, this simple strategy suffers from the following challenges.
First, we need a threshold on the number of claims to classify
sources as "small" or "large". Second, as the majority of sources
claim very few facts, the removal of these sources may result in
sparse data and limited coverage. An alternative strategy could be
drawn from Bayesian estimation in which a smoothing prior can be
added [36, 37]. We can add a fixed “pseudo” count in the compu-
tation of source accuracy so that the estimation can be smoothed
for sources with very few claims. When there are many sources,
typically a uniform prior is adopted, i.e., the same “pseudo” count
applies to all sources. How to select an appropriate pseudo count
is an open question. Moreover, a uniform prior may not fit all the
scenarios, but setting a non-uniform prior is difficult when there is
a large number of sources.
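The pseudo-count smoothing discussed above can be made concrete with a one-line estimator (a sketch with hypothetical names; `k` is exactly the uniform "pseudo" count that the text says is hard to choose):

```python
def smoothed_accuracy(correct, total, k=5):
    """Laplace-style smoothing of a source's accuracy with a uniform
    pseudo count k: k pseudo-correct and k pseudo-incorrect claims."""
    return (correct + k) / (total + 2 * k)

# A "small" source with a single correct claim is no longer judged
# perfectly reliable, while a large source is barely affected.
small = smoothed_accuracy(1, 1)       # raw accuracy 1.0 -> 6/11
large = smoothed_accuracy(900, 1000)  # raw accuracy 0.9 -> 905/1010
```

The estimate for the tiny source is pulled strongly toward 1/2, which illustrates both the appeal of the prior and the difficulty: a single `k` cannot suit sources whose claim counts differ by orders of magnitude.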
Summary of Proposed Approach. In this paper, we propose a
confidence-aware approach to detect trustworthy information from
conflicting claims, where the long-tail phenomenon is observed in
data. We propose that source reliability degree is reflected in the
variance of the difference between the true fact and the source in-
put. The basic principle is that an unreliable source will make errors
frequently and have a wide spectrum of errors in distribution. To
resolve conflicts and detect the most trustworthy piece of informa-
tion, we take a weighted combination of source input in which the
weight of each source corresponds to its variance. Since variance
is unknown, we derive an effective estimator based on the confi-
dence interval of the variance. The chi-squared distribution in the
estimator incorporates the effect of sample size. The overall goal is
to minimize the weighted sum of the variances to obtain a reason-
able estimate of the source reliability degrees. By optimizing the
source weights, we can assign high weights to reliable sources and
low weights to unreliable sources when the sources have sufficient
claims. When a source only provides very few claims, the weight
is mostly dominated by the chi-squared probability value so that
the source reliability degree is automatically smoothed and small
sources will not affect the trustworthiness estimation heavily. We
apply the proposed method and various baseline methods on four
real world application scenarios and simulated datasets. Existing
approaches, which regard small and big sources the same way, fail
to provide an accurate estimate of truths. In contrast, the proposed
method can successfully detect trustworthy information by effec-
tively estimating source reliability degrees.
In summary, we make the following contributions in this paper:
• We identify the pitfalls and challenges in data with long-tail
phenomenon for the task of truth discovery, i.e., detecting the
most trustworthy facts from multiple sources of conflicting
information.
• We propose to combine multi-source data in a weighted ag-
gregation framework and search for the best assignment of
source weights by solving an optimization problem.
• An estimator based on the confidence interval of source re-
liability is derived. This estimator can successfully estimate
source reliability and discount the effect of small sources
without the hassle of setting pseudo counts or priors.
• We test the proposed algorithm on real world long-tail datasets,
and the results clearly demonstrate the advantages of the ap-
proach in finding the true facts and identifying reliable sources.
We also provide insights about the method by illustrating its
behavior under various conditions using simulations.
In the following section, we first describe some real world ap-
plications and the collected datasets to illustrate the challenge of
long-tail phenomenon in truth discovery tasks. Then, in Section 3,
we formulate the problem and derive the proposed method. In Sec-
tion 4, various experiments are conducted on both real world and
simulated datasets, and we validate the effectiveness and efficiency
of the proposed method. Related work is discussed in Section 5,
and finally, we conclude the paper in Section 6.
2. APPLICATIONS AND OBSERVATIONS
In this section, we present a broad spectrum of real world truth
discovery applications where the long-tail phenomenon can be ob-
served. Although the long-tail phenomenon is not rare in truth dis-
covery tasks, it has not received enough attention yet.
Web Information Aggregation. As the Web has become one of
the most important information sources for most people, it is crucial
to analyze the reliability of various data sources on the Web in order
to obtain trustworthy information. The long-tail phenomenon is
common on the Web. Only a few famous big data sources, such as
Wikipedia, may offer plenty of information, but most websites may
only provide limited information.
We introduce two specific truth discovery scenarios for web in-
formation aggregation: truth discovery on city population and on
biography information. For these tasks, we are interested in ag-
gregating the population information about some cities at differ-
ent years and people's biographies, respectively. Two datasets¹ were
crawled by the authors of [23]. The information can be found in
the cities’ or persons’ Wikipedia infoboxes, and the edit histories
of these infoboxes are examined. As Wikipedia pages can be edited
by any user, for a specific entity, multiple users may contribute to
it. The information from these users is not consistent, and some
users may provide more reliable information than the others.
Social Sensing. Social sensing is a newly emerging sensing sce-
nario in which the collection of sensory data is carried out by a
large group of users via sensor-rich mobile devices such as smart-
phones. In social sensing applications, human-carried sensors are
the sources of information. For the same object or event, differ-
ent sensors may report differently due to many factors, such as the
¹ http://cogcomp.cs.illinois.edu/page/resource_view/16

[Figure 1 here: four log-log plots of Number of Sources versus Number of
Claims, each with a power law function fit, for (a) the City Population
dataset, (b) the Biography dataset, (c) the Indoor Floorplan dataset, and
(d) the Game dataset.]
Figure 1: Long-tail phenomenon is observed with real world datasets.
quality of the sensors and the way in which the sensor carrier per-
forms the sensing task. Truth discovery techniques can be useful
for social sensing to improve the quality of sensor data integration
by inferring the sources’ quality. In many social sensing applica-
tions, only a few sensors are incessantly active while most of others
are activated occasionally, which causes the long-tail phenomenon.
A representative example of social sensing is the construction of
indoor floorplans [1, 28]. This research topic has recently drawn
growing interest since it can potentially support a wide range
of location-based applications. The goal is to develop an auto-
matic floorplan construction system that can infer the information
about the building floorplan from the movement traces of a group
of smartphone users. The movement traces of each user can be de-
rived from the readings of inertial sensors (e.g., accelerometer, gy-
roscope, and compass) built in the smartphone. Here we are inter-
ested in one specific task of floorplan construction, i.e., to estimate
the distance between two indoor points (e.g., a hallway segment).
We develop an Android App that estimates the walking distance
of a smartphone user by multiplying his/her step size by the step
count inferred from the in-phone accelerometer. When App users
are walking along the hallways, we record the distances they have
traveled. For the same hallway segment, the estimated distances
given by different users are inevitably different due to the varieties
in their walking patterns, the ways of carrying the phones, and the
quality of in-phone sensors.
Crowd Wisdom. The wisdom of the crowd can be achieved by in-
tegrating the crowd’s answers and opinions towards a set of ques-
tions. By carefully estimating each participant's abilities, the ag-
gregation of the crowd's inputs can often achieve better an-
swers compared with the answers given by a single expert. Cur-
rent technologies enable convenient crowd wisdom implementa-
tion, and truth discovery provides an effective way to aggregate
participants’ input and output accurate answers. The long-tail phe-
nomenon happens in crowd wisdom applications because many
participants show interest in only a couple of questions, while a
few participants answer many of the questions.
In this application, we design an Android App as a crowd wis-
dom platform based on a popular TV game show “Who Wants to
Be a Millionaire” [2]. When the game show airs live, the An-
droid App sends each question and four corresponding candidate
answers to users, and then collects their answers. For each ques-
tion, answers from different users are available, and usually these
answers have conflicts among them. We can then create a super-
player that outperforms all the participants by integrating answers
from all of them.
Due to the page limit, we only introduce three applications, but there
are more than we can list. In these applications, we observe the dif-
ference in information quality of various sources which motivates
truth discovery research. Long-tail phenomenon is ubiquitous in
these truth discovery tasks. In the following, we demonstrate the
long-tail phenomenon using the four truth discovery datasets we
experiment on. The four datasets are introduced in the above dis-
cussions and more information can be found in Section 4. Their sta-
tistical information is summarized in Table 1. We count the number
of claims made by each source and Figure 1 shows the distribution
of this statistic. The figures show a clear long-tail phenomenon:
most sources provide few claims and only a small proportion of
sources provide a large number of claims. In order to demonstrate
the long-tail phenomenon more clearly, we further fit the City Popula-
tion, Biography, Indoor Floorplan and Game datasets to a power
law function, a typical long-tail distribution². Figure 1 shows that
the fitting curves closely match the data, which is strong evidence
of the long-tail phenomenon.
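For reference, a power-law fit like the one in Figure 1 can be obtained by ordinary least squares on log-log axes, where y = c·x^a becomes a straight line with slope a. A minimal sketch (our own illustration; loading the actual claim-count histograms is omitted):

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares in log-log space.
    Returns the exponent a (slope) and the log-intercept log(c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    return slope, my - slope * mx

# Points lying exactly on y = x**-2 recover the exponent -2.
slope, _ = fit_power_law([1, 2, 4, 8], [1.0, 0.25, 0.0625, 0.015625])
```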
Table 1: Statistics of real world long-tail datasets

             City Population   Biography   Indoor Floorplan   Game
# Sources    4107              607819      247                38196
# Entities   43071             9924        129                2169
# Claims     50561             1372066     740                221653
[Figure 2 here: (a) Coverage — percentage of coverage versus number of
sources; (b) Performance — MAE versus number of sources.]
Figure 2: The percentage of coverage decreases and MAE increases
as more sources are removed.
As discussed in Section 1, removing the sources that provide few
claims might be a possible solution. The main shortcoming of this
solution is that a large proportion of the whole dataset is discarded.
Figure 2 demonstrates two consequences caused by this problem
using City Population dataset. All the sources are ordered based on
the number of claims they provide. At the very beginning, we con-
sider all sources, and then gradually remove sources starting from
the smallest ones. One consequence is the sacrifice of coverage
(Figure 2a). If we regard “small” sources as those whose claims are
less than 1% of the number of claims made by the biggest source
and remove them, the percentage of coverage is 88.07%. In ad-
dition to the low percentage of coverage, we lose 10491 claims,
accounting for 20.74% of all claims, which leads to another conse-
quence: performance degradation. Figure 2b shows that the mean
absolute error (MAE) increases as more sources are removed (de-
tails of the measure are introduced in Section 4.1). After removing
the small sources, the number of claims for each entity shrinks
dramatically, so the remaining information is not sufficient to
estimate trustworthy output.

² Note that we use the power law distribution as an example of a long-tail
distribution, but the long-tail phenomenon is a general scenario; other
distributions, such as the Burr distribution and the log-normal distribution,
can describe it as well.
Smoothing prior or “pseudo” count, as mentioned in Section 1,
is another possible solution. The difficulty of this solution lies in
setting the “pseudo” count. As Figure 1 illustrates, the numbers of
claims made by sources are significantly different. It is unfair to
use the same “pseudo” count for all sources. However, with thou-
sands or even hundreds of thousands of sources, assigning an individual
“pseudo” count to each source is unrealistic and impossible to tune.
3. METHODOLOGY
In this section, we describe the proposed method, which tack-
les the challenge that most of the sources only provide information
about a few items. We model the truths as a weighted combination of
the claims from multiple sources and formulate the weight compu-
tation as an optimization problem. Some practical issues are dis-
cussed at the end of this section.
3.1 Problem Formulation
We start by introducing terminologies and notations used in this
paper with an example. Then the problem is formally formulated.
DEFINITION 1. An entity is an item of interest. A claim is a
piece of information provided by a source about a given entity. A
truth is the most trustworthy piece of information for an entity.
DEFINITION 2. Let C = {c_1, c_2, . . . , c_|C|} be the set of claims
that can be taken as input. Each claim c has the format (n, s, x_n^s),
where n denotes the entity, s denotes the source, and x_n^s denotes
the information about entity n provided by source s.
DEFINITION 3. The output X is a collection of (n, x_n) pairs,
where x_n denotes the truth for entity n.
Table 2: A sample census database
Entity Source ID Population (million)
NYC Source A 8.405
NYC Source B 8.837
NYC Source C 8.4
NYC Source D 13.175
DC Source A 0.646
DC Source B 0.6
LA Source A 3.904
LA Source B 15.904
... ... ...
Table 3: X and the ground truths for the sample census database

         X                       Ground truths
Entity   Population      Entity   Population
NYC      8.423           NYC      8.420
DC       0.645           DC       0.646
LA       4.291           LA       4
...      ...             ...      ...
EXAMPLE 1. Table 2 shows a sample census database. In this
particular example, an entity is a city and a claim is a tuple in the
database. Source A states that New York City has a population
of 8.405 million, so its corresponding x_n^s = 8.405. Note that in
this example, x_n^s is a numerical value, but we do not limit x_n^s to
be of continuous data type only. Discussions on categorical values
can be found in Section 3.2.4. Table 3 shows the output X using
the proposed method and the ground truth for this sample census
database. Compared with the ground truths, every source may
make some mistakes on their claims, but some sources make fewer
errors than the others. For example, source A’s claims are closer to
the ground truths than source B’s claims, which means the former is
more reliable than the latter, so source A deserves a higher weight
when inferring the truth. Source C seems to be reliable, but based
on one claim, it is hard to judge. The proposed method achieves
results very close to the ground truths by accurately
estimating the source reliability degrees.
Given input C, our task is to resolve the conflicts and find the
most trustworthy piece of information from various sources for ev-
ery entity. In addition to the truth X , we also simultaneously infer
the reliability degree w_s of each source based on input informa-
tion. A higher w_s indicates that the s-th source is more reliable and
information from this source is more trustworthy.
Table 4 summarizes all the notations used in this paper. σ_s^2 and
u_s^2 will be introduced in the next subsection.
Table 4: Notations

Notation   Definition
C          set of claims (input)
N          set of entities
n          the n-th entity
S          set of sources
s          the s-th source
N_s        the set of entities provided by source s
S_n        the set of sources that provide a claim on entity n
x_n^s      information for entity n provided by source s
X          set of truths (output)
x_n        the truth for entity n
w_s        weight for source s
σ_s^2      error variance of source s
u_s^2      upper bound of variance σ_s^2
3.2 CATD Method
In this section, we formally introduce the proposed method, called
Confidence-Aware Truth Discovery (CATD), for resolving the con-
flicts and finding the truths among various sources. The proposed
method can handle the challenge brought by the long-tail phenomenon
that we observe.
3.2.1 Truth Calculation
Here we only consider the single truth scenario, i.e., there is only
one truth for each entity although sources may provide different
claims on the same entity.
The basic idea is that reliable sources provide trustworthy in-
formation, so the truth should be close to the claims from reliable
sources. Many truth discovery methods [8, 16, 19, 20, 23–26, 31,
34–37] more or less use weighted voting or averaging to obtain
the truths, which overcomes the issue of the conventional voting or av-
eraging scheme that assumes all the sources are equally reliable.
We propose to use the same weighted averaging strategy to ob-
tain the truths. Since a source is usually consistent in the quality
of its claims, we can use the source weight, i.e., the source reliability
degree, w_s, as the weight for all the claims provided by s:

    x_n = ( Σ_{s∈S_n} w_s · x_n^s ) / ( Σ_{s∈S_n} w_s ).    (1)

However, the source reliability degrees are usually unknown a
priori. Therefore, the key question we want to explore next is how
to find the "best" assignment of w_s.

3.2.2 Source Weight Calculation
In this paper, we assume that all sources make their claims in-
dependently, i.e., they do not copy from each other. We leave the
case when source dependence happens for future work. We can re-
gard that each source’s information is independently sampled from
a hidden distribution. Errors, which are differences between the
claims and the truths, may occur for every source. The variance of
the error distribution reflects the reliability degree of this source: if
a source is unreliable, the errors it makes occur frequently and have
a wide spectrum in general, so the variance of the error distribution
is big. We believe that none of the sources make errors on purpose,
so the mean of the error distribution, which indicates its bias, is 0.
We propose to use the Gaussian distribution, which is widely
adopted in many fields, to describe errors. For each source s, its error
ε_s follows a Gaussian distribution with mean 0 and variance σ_s^2, i.e.,

    ε_s ∼ N(0, σ_s^2).
Since we have the source independence assumption, the errors
that sources make are independent too. We can then compute the
distribution for the error of the weighted combination in Eq.(1) as:
    ε_combine ∼ N( 0, (Σ_{s∈S} w_s^2 σ_s^2) / (Σ_{s∈S} w_s)^2 ),    (2)

where ε_combine = (Σ_{s∈S} w_s ε_s) / (Σ_{s∈S} w_s). Without loss of generality, we con-
strain Σ_{s∈S} w_s = 1.
For a Gaussian distribution, the variance determines the shape
of the distribution. If the variance is small, then the distribution
has a sharp and high central peak at the mean, which indicates a
high probability that errors are close to 0. Therefore, we want the
variance of ε_combine to be as small as possible. We formulate
this goal as the following optimization problem:

    min_{w_s}  Σ_{s∈S} w_s^2 σ_s^2    s.t.  Σ_{s∈S} w_s = 1,  w_s > 0, ∀s ∈ S.    (3)
Usually the theoretical σ_s^2 is unknown for each source. Inspired
by the sample variance, the following estimator can be used to estimate
the real variance σ_s^2:

    σ̂_s^2 = (1 / |N_s|) Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2,    (4)

where x_n^(0) is the initial truth for entity n (such as the mean, median or
mode of the claims on entity n), and |N_s| is the number of claims made
by source s. Another interpretation of Eq.(4) is that σ̂_s^2 represents
the mean of the squared errors that source s makes.
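Eq. (4) is straightforward to compute once an initial guess x_n^(0) of the truths is fixed. The sketch below (hypothetical data layout, not the authors' code) uses the per-entity median, one of the choices mentioned above, and reuses the sample claims of Table 2:

```python
import statistics

def estimate_variances(claims):
    """claims: {entity: {source: value}}. Returns {source: sigma_hat_sq},
    the mean squared deviation of each source from the per-entity median,
    i.e. Eq. (4) with the median as the initial truth x_n^(0)."""
    init = {n: statistics.median(obs.values()) for n, obs in claims.items()}
    sq_errs = {}
    for n, obs in claims.items():
        for s, v in obs.items():
            sq_errs.setdefault(s, []).append((v - init[n]) ** 2)
    return {s: sum(e) / len(e) for s, e in sq_errs.items()}

claims = {
    "NYC": {"A": 8.405, "B": 8.837, "C": 8.4},
    "DC": {"A": 0.646, "B": 0.6},
}
var = estimate_variances(claims)  # source A ends up with a smaller
                                  # sample variance than source B
```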
However, this estimator is not precise when |N_s| is very small,
so it cannot accurately reflect the real variance of the source. As
we observed the long-tail phenomenon in Section 2, most of the
sources have very few claims. The estimator σ̂_s^2 may then lead to an in-
appropriate weight assignment for most of the sources, and further
cause inaccurate truth computation. In order to solve this problem
brought by the long-tail phenomenon in the dataset, we should not
only consider the single value of the estimator σ̂_s^2 for each source,
but a range of values that can act as good estimates of σ_s^2. There-
fore, we adopt the (1 − α) confidence interval for σ_s^2, where α, also
known as the significance level, is usually a small number such as 0.05.
As we illustrated above, the difference between x_n^s and x_n^(0) fol-
lows a Gaussian distribution N(0, σ_s^2). Since a sum of squares of
standard Gaussian variables has a chi-squared distribution [17],
we have:

    Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2 / σ_s^2  =  |N_s| σ̂_s^2 / σ_s^2  ∼  χ^2(|N_s|).

Thus we have:

    P( χ^2_{α/2, |N_s|}  <  |N_s| σ̂_s^2 / σ_s^2  <  χ^2_{1−α/2, |N_s|} )  =  1 − α,

where χ^2_{p, |N_s|} denotes the p-quantile of the χ^2(|N_s|) distribution,
which gives the (1 − α) confidence interval of σ_s^2 as:

    ( Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2 / χ^2_{1−α/2, |N_s|} ,
      Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2 / χ^2_{α/2, |N_s|} ).    (5)
Compared with Eq.(4), Eq.(5) is more informative. Although
two sources with different numbers of claims may have the same
σ̂_s^2, the confidence intervals of σ_s^2 for these two sources can be sig-
nificantly different, as shown in the following example.
Table 5: Example on calculating confidence intervals

Source ID   # Claims   σ̂_s^2   Confidence Interval (95%)
Source A    200        0.1      (0.0830, 0.1229)
Source B    200        3        (2.4890, 3.6871)
Source C    2          0.1      (0.0271, 3.9498)
Source D    2          3        (0.8133, 118.49)
EXAMPLE 2. Suppose from Example 1 we obtain the statistics and sample variances for sources A, B, C, and D as shown in Table 5. Both source A and source C have the same $\hat{\sigma}_s^2 = 0.1$, but source C makes only 2 claims while source A makes 200. The wide confidence interval of source C shows that its $\hat{\sigma}_s^2$ is rather unreliable, and that the real variance of a small source may be much bigger than its sample variance. In contrast, the confidence interval for source A is tight, and its upper bound is close to $\hat{\sigma}_s^2$. Similarly, sources B and D provide different numbers of claims but have the same $\hat{\sigma}_s^2 = 3$. These two sources are not as reliable as sources A and C because their sample variances are bigger, which indicates that their claims are far from the truths. The confidence intervals for sources B and D show patterns similar to those of sources A and C. It is clear from this simple example that the confidence interval of $\sigma_s^2$ carries more information than $\hat{\sigma}_s^2$, and is thus helpful for estimating more accurate source weights.
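The intervals in Table 5 can be reproduced numerically. The sketch below is illustrative and uses only the standard library, approximating the chi-squared quantile with the Wilson–Hilferty transformation; this approximation is accurate for sources with many claims (A and B) but degrades for very small $|N_s|$ (C and D), where an exact quantile function such as scipy.stats.chi2.ppf should be used instead.

```python
from statistics import NormalDist

def chi2_quantile(p, df):
    """Approximate the p-th (lower-tail) chi-squared quantile via the
    Wilson-Hilferty transformation; good for moderately large df."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def variance_ci(sample_var, n_claims, alpha=0.05):
    """(1 - alpha) confidence interval for sigma_s^2 from Eq.(5): divide the
    sum of squared errors |N_s| * sigma_hat^2 by chi-squared quantiles."""
    ss = n_claims * sample_var  # sum over n in N_s of (x_n^s - x_n^(0))^2
    lower = ss / chi2_quantile(1 - alpha / 2, n_claims)
    upper = ss / chi2_quantile(alpha / 2, n_claims)
    return lower, upper

# Source A from Table 5: 200 claims, sample variance 0.1.
lo, hi = variance_ci(0.1, 200)
```

With an exact chi-squared quantile all four rows of Table 5 are recovered; this Wilson–Hilferty version matches sources A and B closely.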
We propose to use the upper bound of the $(1-\alpha)$ confidence interval (denoted as $u_s^2$) as an estimator of $\sigma_s^2$, instead of $\hat{\sigma}_s^2$, in the optimization problem (3). The intuition behind this choice is that we want to minimize the variance of the combined estimate under the worst possible value of $\sigma_s^2$ for each source, i.e., to minimize the maximum possible loss. The upper bound $u_s^2$ is a biased estimator of $\sigma_s^2$, but the bias is large only for sources with few claims; in Table 5, for example, the upper bound exceeds $\hat{\sigma}_s^2$ by a factor of roughly 39.5 when a source makes 2 claims, but only about 1.23 when it makes 200. As the number of claims from a source increases, the bias drops.
We can substitute the unknown variance $\sigma_s^2$ in Eq.(3) with this upper bound $u_s^2$ and rewrite the optimization problem Eq.(3) as:
$$\min_{\{w_s\}} \sum_{s \in S} w_s^2 u_s^2 \quad \text{s.t.} \quad \sum_{s \in S} w_s = 1,\ w_s > 0,\ \forall s \in S. \qquad (6)$$
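Problem (6) admits a closed-form solution: the Lagrangian condition $2 w_s u_s^2 = \lambda$ gives $w_s \propto 1/u_s^2$, and normalizing so that $\sum_s w_s = 1$ automatically keeps every $w_s > 0$. A minimal sketch of this weighting and the subsequent weighted averaging (function names are illustrative, not from the paper):

```python
def catd_weights(u_sq):
    """Closed-form minimizer of problem (6): w_s proportional to 1/u_s^2.

    u_sq: list of upper confidence bounds u_s^2, one per source.
    """
    inv = [1.0 / u for u in u_sq]
    total = sum(inv)
    return [v / total for v in inv]

def weighted_truth(claims, weights):
    """Aggregate conflicting continuous claims on one object by
    weighted averaging, as done in truth discovery methods."""
    return sum(w * x for w, x in zip(weights, claims))

# Example with the u_s^2 upper bounds from Table 5: the source with the
# tightest bound (source A) dominates the aggregate.
w = catd_weights([0.1229, 3.6871, 3.9498, 118.49])
```

Sources with few claims receive wide intervals, hence large $u_s^2$ and small weights, which is exactly how the method discounts unreliable small sources.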
Citations

Journal ArticleDOI — Truth inference in crowdsourcing: is the problem solved?
TL;DR: It is believed that the truth inference problem is not fully solved; the limitations of existing algorithms are identified and promising research directions are pointed out.

Journal ArticleDOI — A Survey on Truth Discovery
TL;DR: This survey focuses on providing a comprehensive overview of truth discovery methods, summarizing them from different aspects, and offering guidelines on how to apply these approaches in application domains.

Posted Content — A Survey on Truth Discovery
TL;DR: Several truth discovery methods have been proposed for various scenarios and successfully applied in diverse application domains, but for the same object there usually exist conflicts among the collected multi-source information.

Proceedings ArticleDOI — Quality of Information Aware Incentive Mechanisms for Mobile Crowd Sensing Systems
TL;DR: This paper incorporates a crucial metric, users' quality of information (QoI), into incentive mechanisms for MCS systems, and designs incentive mechanisms based on reverse combinatorial auctions that approximately maximize the social welfare with a guaranteed approximation ratio.

Proceedings ArticleDOI — Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media
TL;DR: This paper automatically assesses the credibility of emerging claims with sparse presence in web sources and generates suitable explanations from judiciously selected sources, showing that the methods work well for early detection of emerging claims as well as for claims with limited presence on the web and social media.
References

Book — Convex Optimization
TL;DR: This book gives a comprehensive introduction to the subject, with a focus on recognizing convex optimization problems and then finding the most appropriate technique for solving them.

Journal ArticleDOI — Power-Law Distributions in Empirical Data
TL;DR: This work proposes a principled statistical framework for discerning and quantifying power-law behavior in empirical data by combining maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov–Smirnov (KS) statistic and likelihood ratios.

Book — Introduction to Mathematical Statistics
TL;DR: The authors present common probability distributions and likelihood-based methods, including material on Bayesian models.

Journal ArticleDOI — Data fusion
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion (complete, concise, and consistent data), and highlights the challenges of data fusion.

Journal ArticleDOI — Introduction to Mathematical Statistics
TL;DR: Paul G. Hoel's "Introduction to Mathematical Statistics" seems to me to be an excellent work, and if only it can become generally available it may have a most favourable effect on the situation just described.
Frequently Asked Questions (5)

Q1. What is the key to obtaining accurate truths?

Since all the truth discovery methods and the proposed CATD method use weighted voting or averaging to calculate truths, the estimated source reliability is the key to obtaining accurate truths.

By outperforming the baseline methods on all real-world datasets, the proposed CATD method demonstrates its power in modeling source reliability accurately even when sources make insufficient claims.

The authors can add a fixed "pseudo" count in the computation of source accuracy so that the estimation can be smoothed for sources with very few claims.

The long-tail phenomenon arises in crowd wisdom applications because many participants show interest in only a couple of questions, while a few participants answer many of them.

In their experiment, the Pearson's correlation coefficient between running time and the number of claims is 0.9991, indicating that they are highly linearly correlated.