
A Confidence-Aware Approach for Truth Discovery on
Long-Tail Data
Qi Li¹, Yaliang Li¹, Jing Gao¹, Lu Su¹, Bo Zhao², Murat Demirbas¹, Wei Fan³, and Jiawei Han⁴
¹ SUNY Buffalo, Buffalo, NY, USA
² Microsoft Research, Mountain View, CA, USA
³ Huawei Noah's Ark Lab, Hong Kong
⁴ University of Illinois, Urbana, IL, USA
{qli22,yaliangl,jing,lusu}@buffalo.edu, bozha@microsoft.com,
demirbas@buffalo.edu, david.fanwei@huawei.com, hanj@illinois.edu
ABSTRACT
In many real world applications, the same item may be described by
multiple sources. As a consequence, conflicts among these sources
are inevitable, which leads to an important task: how to identify
which piece of information is trustworthy, i.e., the truth discov-
ery task. Intuitively, if the piece of information is from a reliable
source, then it is more trustworthy, and the source that provides
trustworthy information is more reliable. Based on this princi-
ple, truth discovery approaches have been proposed to infer source
reliability degrees and the most trustworthy information (i.e., the
truth) simultaneously. However, existing approaches overlook the
ubiquitous long-tail phenomenon in the tasks, i.e., most sources
only provide a few claims and only a few sources make plenty
of claims, which causes the source reliability estimation for small
sources to be unreasonable. To tackle this challenge, we propose a
confidence-aware truth discovery (CATD) method to automatically
detect truths from conflicting data with long-tail phenomenon. The
proposed method not only estimates source reliability, but also con-
siders the confidence interval of the estimation, so that it can effec-
tively reflect real source reliability for sources with various levels
of participation. Experiments on four real world tasks as well as
simulated multi-source long-tail datasets demonstrate that the pro-
posed method outperforms existing state-of-the-art truth discovery
approaches by successfully discounting the effect of small sources.
1. INTRODUCTION
Big data leads to big challenges, not only in the volume of data
but also in its variety and veracity. In many real applications, mul-
tiple descriptions often exist about the same set of objects or events
from different sources. For example, customer information can
be found from multiple databases in a company, and a patient’s
medical records may be scattered across different hospitals. Un-
avoidably, data or information inconsistency arises from multiple
sources. Then, among conflicting pieces of data or information,
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li-
cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per-
mission prior to any use beyond those covered by the license. Contact
copyright holder by emailing info@vldb.org. Articles from this volume
were invited to present their results at the 41st International Conference on
Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast,
Hawaii.
Proceedings of the VLDB Endowment, Vol. 8, No. 4
Copyright 2014 VLDB Endowment 2150-8097/14/12.
which one is more trustworthy, or represents the true fact? Facing
the daunting scale of data, it is unrealistic to expect a human to
“label” or tell which data source is reliable or which piece of infor-
mation is accurate. Therefore, an important task is to automatically
infer data trustworthiness from multi-source data to resolve con-
flicts and find the most trustworthy piece of information.
Finding Trustworthy Information. One simple approach for this
task is to assume that “majority” represents the “truth”. In other
words, we take the value claimed by the majority sources or take the
average of the continuous values reported by sources, and regard it
as the most trustworthy fact. The drawback of this simple approach
is its inability to characterize the reliability levels of sources. It re-
gards all sources as equally reliable and does not distinguish them,
and thus may fail in scenarios when there exist sources sending
low quality information, such as faulty sensors that keep emanating
erroneous information, and spam users who propagate false infor-
mation on the Web. To overcome this limitation, techniques have
been proposed to simultaneously derive trustworthy facts and esti-
mate source reliability degrees [8, 9, 16, 18–20, 23–26, 31, 34–37].
A common principle underlying these techniques is as follows. The sources
which provide trustworthy information are more reliable, and the
information from reliable sources is more trustworthy. In these ap-
proaches, the most trustworthy fact, i.e., the truth, is computed as
a weighted voting or averaging among sources where more reli-
able ones have higher weights. Although different formulas have
been proposed to derive source weights (i.e., reliability degrees),
the same principle applies: The source weight should be propor-
tional to the probability of the source giving trustworthy informa-
tion. In practice, this probability is approximated as the percentage of
correct claims made by the source. The more claims a source makes, the
more likely it is that this estimate of source reliability is close to the
true reliability degree.
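The iterative principle above can be sketched in a few lines of Python. This is our own illustration, not code from the paper: `claims` is a hypothetical nested dict, and the inverse-mean-squared-error weight update is just one simple instance of "weight proportional to trustworthiness"; concrete methods differ in this update rule.

```python
def discover_truths(claims, iters=10, eps=1e-6):
    """claims: {entity: {source: value}} with numeric values.
    Returns (truths, weights). Illustrative sketch only."""
    sources = {s for obs in claims.values() for s in obs}
    weights = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iters):
        # Step 1: truths as weighted averages of the claims (weighted voting).
        for n, obs in claims.items():
            total = sum(weights[s] for s in obs)
            truths[n] = sum(weights[s] * v for s, v in obs.items()) / total
        # Step 2: weight each source by the inverse of its mean squared error.
        for s in sources:
            errs = [(v - truths[n]) ** 2
                    for n, obs in claims.items()
                    for s2, v in obs.items() if s2 == s]
            weights[s] = 1.0 / (sum(errs) / len(errs) + eps)
    return truths, weights


# Sources A and B agree closely; source C is faulty.
claims = {
    0: {"A": 10.0, "B": 10.1, "C": 15.0},
    1: {"A": 20.0, "B": 19.9, "C": 25.0},
    2: {"A": 30.0, "B": 30.2, "C": 10.0},
    3: {"A": 40.0, "B": 39.8, "C": 60.0},
}
truths, weights = discover_truths(claims)
```

Starting from uniform weights, the faulty source C is down-weighted after a few rounds, and the recovered truths track the two consistent sources.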
Long-tail Phenomenon. However, sources with very few claims
are common in applications. The number of claims made by sources
typically exhibits a long-tail phenomenon, that is, most of the sources
only provide information about one or two items, and there are only
a few sources that make lots of claims. For example, although there
are numerous websites containing information about one or several
celebrities, there are few websites which, like Wikipedia, provide
extensive coverage for thousands of celebrities. Another example
concerns user participation in surveys, reviews, or other activities. On
average, participants show interest in only a few items, whereas
very few participants cover most of the items. Long-tail phenomena
are ubiquitous in real world applications, which bring obstacles
to the task of information trustworthiness estimation.
Recall that identifying reliable sources is the key to find trust-
worthy information, and source reliability is typically estimated by
the empirical probability of making correct claims. The effective-
ness of this estimation is heavily affected by the total number of
claims made by each source. When a source makes a large num-
ber of claims, it is likely that we can obtain a relatively accurate
estimate of source reliability. However, sources with a few claims
occupy the majority when long-tail phenomenon exists. For such
“small” sources, there is no easy way to evaluate their reliability
degrees. Consider an extreme case when most sources only make
one claim to one single item. If the claim is correct, its accuracy
is one and the source is considered as highly reliable. If the claim
is wrong, its accuracy is zero and the source is regarded as highly
unreliable. In some sense such an estimate based on one single
claim is totally random. When weighted voting is conducted based
on the estimates of source reliability, the “unreliable” estimation of
source reliability for many “small” sources will inevitably impair
the ability of detecting trustworthy information. We illustrate this
phenomenon and its effect on the task of truth discovery with more
details in Sections 2 and 4.
Limitation of Traditional Approaches. One may argue that one
way to tackle the issue of insufficient data for accurate reliability es-
timation is to remove sources that provide only a few claims. How-
ever, this simple strategy suffers from the following challenges.
First, we need a threshold on the number of claims to classify
sources as "small" or "large". Second, as the majority of sources
claim very few facts, the removal of these sources may result in
sparse data and limited coverage. An alternative strategy could be
drawn from Bayesian estimation in which a smoothing prior can be
added [36, 37]. We can add a fixed “pseudo” count in the compu-
tation of source accuracy so that the estimation can be smoothed
for sources with very few claims. When there are many sources,
typically a uniform prior is adopted, i.e., the same “pseudo” count
applies to all sources. How to select an appropriate pseudo count
is an open question. Moreover, a uniform prior may not fit all the
scenarios, but setting a non-uniform prior is difficult when there is
a large number of sources.
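The pseudo-count smoothing discussed above can be made concrete with a one-line estimator (a sketch with hypothetical names; `k` is exactly the uniform "pseudo" count that the text says is hard to choose):

```python
def smoothed_accuracy(correct, total, k=5):
    """Laplace-style smoothing of a source's accuracy with a uniform
    pseudo count k: k pseudo-correct and k pseudo-incorrect claims."""
    return (correct + k) / (total + 2 * k)

# A "small" source with a single correct claim is no longer judged
# perfectly reliable, while a large source is barely affected.
small = smoothed_accuracy(1, 1)       # raw accuracy 1.0 -> 6/11
large = smoothed_accuracy(900, 1000)  # raw accuracy 0.9 -> 905/1010
```

The estimate for the tiny source is pulled strongly toward 1/2, which illustrates both the appeal of the prior and the difficulty: a single `k` cannot suit sources whose claim counts differ by orders of magnitude.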
Summary of Proposed Approach. In this paper, we propose a
confidence-aware approach to detect trustworthy information from
conflicting claims, where the long-tail phenomenon is observed in
data. We propose that source reliability degree is reflected in the
variance of the difference between the true fact and the source in-
put. The basic principle is that an unreliable source will make errors
frequently and have a wide spectrum of errors in distribution. To
resolve conflicts and detect the most trustworthy piece of informa-
tion, we take a weighted combination of source input in which the
weight of each source corresponds to its variance. Since variance
is unknown, we derive an effective estimator based on the confi-
dence interval of the variance. The chi-squared distribution in the
estimator incorporates the effect of sample size. The overall goal is
to minimize the weighted sum of the variances to obtain a reason-
able estimate of the source reliability degrees. By optimizing the
source weights, we can assign high weights to reliable sources and
low weights to unreliable sources when the sources have sufficient
claims. When a source only provides very few claims, the weight
is mostly dominated by the chi-squared probability value so that
the source reliability degree is automatically smoothed and small
sources will not affect the trustworthiness estimation heavily. We
apply the proposed method and various baseline methods on four
real world application scenarios and simulated datasets. Existing
approaches, which regard small and big sources the same way, fail
to provide an accurate estimate of truths. In contrast, the proposed
method can successfully detect trustworthy information by effec-
tively estimating source reliability degrees.
In summary, we make the following contributions in this paper:
• We identify the pitfalls and challenges in data with long-tail
phenomenon for the task of truth discovery, i.e., detecting the
most trustworthy facts from multiple sources of conflicting
information.
• We propose to combine multi-source data in a weighted ag-
gregation framework and search for the best assignment of
source weights by solving an optimization problem.
• An estimator based on the confidence interval of source re-
liability is derived. This estimator can successfully estimate
source reliability and discount the effect of small sources
without the hassle of setting pseudo counts or priors.
• We test the proposed algorithm on real world long-tail datasets,
and the results clearly demonstrate the advantages of the ap-
proach in finding the true facts and identifying reliable sources.
We also provide insights about the method by illustrating its
behavior under various conditions using simulations.
In the following section, we first describe some real world ap-
plications and the collected datasets to illustrate the challenge of
long-tail phenomenon in truth discovery tasks. Then, in Section 3,
we formulate the problem and derive the proposed method. In Sec-
tion 4, various experiments are conducted on both real world and
simulated datasets, and we validate the effectiveness and efficiency
of the proposed method. Related work is discussed in Section 5,
and finally, we conclude the paper in Section 6.
2. APPLICATIONS AND OBSERVATIONS
In this section, we present a broad spectrum of real world truth
discovery applications where the long-tail phenomenon can be ob-
served. Although the long-tail phenomenon is not rare in truth dis-
covery tasks, it has not received enough attention yet.
Web Information Aggregation. As the Web has become one of
the most important information sources for most people, it is crucial
to analyze the reliability of various data sources on the Web in order
to obtain trustworthy information. The long-tail phenomenon is
common on the Web. Only a few famous big data sources, such as
Wikipedia, may offer plenty of information, but most websites may
only provide limited information.
We introduce two specific truth discovery scenarios for web in-
formation aggregation: truth discovery on city population and on
biography information. For these tasks, we are interested in ag-
gregating the population information about some cities at differ-
ent years and people's biographies, respectively. Two datasets¹ were
crawled by the authors of [23]. The information can be found in
the cities’ or persons’ Wikipedia infoboxes, and the edit histories
of these infoboxes are examined. As Wikipedia pages can be edited
by any user, for a specific entity, multiple users may contribute to
it. The information from these users is not consistent, and some
users may provide more reliable information than the others.
Social Sensing. Social sensing is a newly emerging sensing sce-
nario in which the collection of sensory data is carried out by a
large group of users via sensor-rich mobile devices such as smart-
phones. In social sensing applications, human-carried sensors are
the sources of information. For the same object or event, differ-
ent sensors may report differently due to many factors, such as the
¹ http://cogcomp.cs.illinois.edu/page/resource_view/16

[Figure 1 here: four log-log plots of Number of Sources versus Number of
Claims, each with a power law function fit, for (a) the City Population
dataset, (b) the Biography dataset, (c) the Indoor Floorplan dataset, and
(d) the Game dataset.]
Figure 1: Long-tail phenomenon is observed with real world datasets.
quality of the sensors and the way in which the sensor carrier per-
forms the sensing task. Truth discovery techniques can be useful
for social sensing to improve the quality of sensor data integration
by inferring the sources’ quality. In many social sensing applica-
tions, only a few sensors are incessantly active while most of others
are activated occasionally, which causes the long-tail phenomenon.
A representative example of social sensing is the construction of
indoor floorplans [1, 28]. This research topic has recently drawn
growing interest since it can potentially support a wide range
of location-based applications. The goal is to develop an auto-
matic floorplan construction system that can infer the information
about the building floorplan from the movement traces of a group
of smartphone users. The movement traces of each user can be de-
rived from the readings of inertial sensors (e.g., accelerometer, gy-
roscope, and compass) built in the smartphone. Here we are inter-
ested in one specific task of floorplan construction, i.e., to estimate
the distance between two indoor points (e.g., a hallway segment).
We develop an Android App that estimates the walking distance
of a smartphone user by multiplying his/her step size by the step
count inferred from the in-phone accelerometer. When App users
are walking along the hallways, we record the distances they have
traveled. For the same hallway segment, the estimated distances
given by different users are inevitably different due to the varieties
in their walking patterns, the ways of carrying the phones, and the
quality of in-phone sensors.
Crowd Wisdom. The wisdom of the crowd can be achieved by in-
tegrating the crowd’s answers and opinions towards a set of ques-
tions. By carefully estimating each participant's abilities, the ag-
gregation of the crowd's inputs can often achieve better an-
swers compared with the answers given by a single expert. Cur-
rent technologies enable convenient crowd wisdom implementa-
tion, and truth discovery provides an effective way to aggregate
participants’ input and output accurate answers. The long-tail phe-
nomenon happens in crowd wisdom applications because many
participants show interest in only a couple of questions, while a
few participants answer many of the questions.
In this application, we design an Android App as a crowd wis-
dom platform based on a popular TV game show “Who Wants to
Be a Millionaire” [2]. When the game show airs live, the An-
droid App sends each question and four corresponding candidate
answers to users, and then collects their answers. For each ques-
tion, answers from different users are available, and usually these
answers have conflicts among them. We can then create a super-
player that outperforms all the participants by integrating answers
from all of them.
Due to the page limit, we only introduce three applications, but there
are more than we can list. In these applications, we observe the dif-
ference in information quality of various sources which motivates
truth discovery research. Long-tail phenomenon is ubiquitous in
these truth discovery tasks. In the following, we demonstrate the
long-tail phenomenon using the four truth discovery datasets we
experiment on. The four datasets are introduced in the above dis-
cussions and more information can be found in Section 4. Their sta-
tistical information is summarized in Table 1. We count the number
of claims made by each source and Figure 1 shows the distribution
of this statistic. The figures show a clear long-tail phenomenon:
most sources provide few claims and only a small proportion of
sources provide a large number of claims. In order to demonstrate
the long-tail phenomenon more clearly, we further fit the City Popula-
tion, Biography, Indoor Floorplan and Game datasets to a power
law function, a typical long-tail distribution². Figure 1 shows that
the fitting curves closely match the data, which is strong evidence
of the long-tail phenomenon.
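For reference, a power-law fit like the one in Figure 1 can be obtained by ordinary least squares on log-log axes, where y = c·x^a becomes a straight line with slope a. A minimal sketch (our own illustration; loading the actual claim-count histograms is omitted):

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares in log-log space.
    Returns the exponent a (slope) and the log-intercept log(c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    return slope, my - slope * mx

# Points lying exactly on y = x**-2 recover the exponent -2.
slope, _ = fit_power_law([1, 2, 4, 8], [1.0, 0.25, 0.0625, 0.015625])
```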
Table 1: Statistics of real world long-tail datasets

             City Population   Biography   Indoor Floorplan   Game
# Sources    4107              607819      247                38196
# Entities   43071             9924        129                2169
# Claims     50561             1372066     740                221653
[Figure 2 here: (a) Coverage — percentage of coverage versus number of
sources; (b) Performance — MAE versus number of sources.]
Figure 2: The percentage of coverage decreases and MAE increases
as more sources are removed.
As discussed in Section 1, removing the sources that provide few
claims might be a possible solution. The main shortcoming of this
solution is that a large proportion of the whole dataset is discarded.
Figure 2 demonstrates two consequences caused by this problem
using City Population dataset. All the sources are ordered based on
the number of claims they provide. At the very beginning, we con-
sider all sources, and then gradually remove sources starting from
the smallest ones. One consequence is the sacrifice of coverage
(Figure 2a). If we regard “small” sources as those whose claims are
less than 1% of the number of claims made by the biggest source
and remove them, the percentage of coverage is 88.07%. In ad-
dition to the low percentage of coverage, we lose 10491 claims,
accounting for 20.74% of all claims, which leads to another conse-
quence: performance degradation. Figure 2b shows that the mean
absolute error (MAE) increases as more sources are removed (de-
tails of the measure are introduced in Section 4.1). After removing
the small sources, the number of claims for each entity shrinks
dramatically, so the remaining information is not sufficient to
estimate trustworthy output.

² Note that we use the power law distribution as an example of a long-tail
distribution, but the long-tail phenomenon is a general scenario; other
distributions, such as the Burr distribution and the log-normal distribution,
can describe it as well.
Smoothing prior or “pseudo” count, as mentioned in Section 1,
is another possible solution. The difficulty of this solution lies in
setting the “pseudo” count. As Figure 1 illustrates, the numbers of
claims made by sources are significantly different. It is unfair to
use the same “pseudo” count for all sources. However, with thou-
sands or even hundreds of thousands of sources, assigning an individual
“pseudo” count to each source is unrealistic and impossible to tune.
3. METHODOLOGY
In this section, we describe the proposed method, which tack-
les the challenge that most of the sources only provide information
about a few items. We model the truths as a weighted combination of
the claims from multiple sources and formulate the weight compu-
tation as an optimization problem. Some practical issues are dis-
cussed at the end of this section.
3.1 Problem Formulation
We start by introducing terminologies and notations used in this
paper with an example. Then the problem is formally formulated.
DEFINITION 1. An entity is an item of interest. A claim is a
piece of information provided by a source about a given entity. A
truth is the most trustworthy piece of information for an entity.
DEFINITION 2. Let C = {c_1, c_2, . . . , c_|C|} be the set of claims
that can be taken as input. Each claim c has the format (n, s, x_n^s),
where n denotes the entity, s denotes the source, and x_n^s denotes
the information about entity n provided by source s.
DEFINITION 3. The output X is a collection of (n, x_n) pairs,
where x_n denotes the truth for entity n.
Table 2: A sample census database
Entity Source ID Population (million)
NYC Source A 8.405
NYC Source B 8.837
NYC Source C 8.4
NYC Source D 13.175
DC Source A 0.646
DC Source B 0.6
LA Source A 3.904
LA Source B 15.904
... ... ...
Table 3: X and the ground truths for the sample census database

         X                       Ground truths
Entity   Population      Entity   Population
NYC      8.423           NYC      8.420
DC       0.645           DC       0.646
LA       4.291           LA       4
...      ...             ...      ...
EXAMPLE 1. Table 2 shows a sample census database. In this
particular example, an entity is a city and a claim is a tuple in the
database. Source A states that New York City has a population
of 8.405 million, so its corresponding x_n^s = 8.405. Note that in
this example, x_n^s is a numerical value, but we do not limit x_n^s to
be of continuous data type only. Discussions on categorical values
can be found in Section 3.2.4. Table 3 shows the output X using
the proposed method and the ground truth for this sample census
database. Compared with the ground truths, every source may
make some mistakes on their claims, but some sources make fewer
errors than the others. For example, source A’s claims are closer to
the ground truths than source B’s claims, which means the former is
more reliable than the latter, so source A deserves a higher weight
when inferring the truth. Source C seems to be reliable, but based
on one claim, it is hard to judge. The proposed method achieves
results very close to the ground truths by accurately
estimating the source reliability degrees.
Given input C, our task is to resolve the conflicts and find the
most trustworthy piece of information from various sources for ev-
ery entity. In addition to the truth X , we also simultaneously infer
the reliability degree w_s of each source based on input informa-
tion. A higher w_s indicates that the s-th source is more reliable and
information from this source is more trustworthy.
Table 4 summarizes all the notations used in this paper. σ_s^2 and
u_s^2 will be introduced in the next subsection.
Table 4: Notations

Notation   Definition
C          set of claims (input)
N          set of entities
n          the n-th entity
S          set of sources
s          the s-th source
N_s        the set of entities provided by source s
S_n        the set of sources that provide a claim on entity n
x_n^s      information for entity n provided by source s
X          set of truths (output)
x_n        the truth for entity n
w_s        weight for source s
σ_s^2      error variance of source s
u_s^2      upper bound of variance σ_s^2
3.2 CATD Method
In this section, we formally introduce the proposed method, called
Confidence-Aware Truth Discovery (CATD), for resolving the con-
flicts and finding the truths among various sources. The proposed
method can handle the challenge brought by the long-tail phenomenon
that we observe.
3.2.1 Truth Calculation
Here we only consider the single truth scenario, i.e., there is only
one truth for each entity although sources may provide different
claims on the same entity.
The basic idea is that reliable sources provide trustworthy in-
formation, so the truth should be close to the claims from reliable
sources. Many truth discovery methods [8, 16, 19, 20, 23–26, 31,
34–37] more or less use weighted voting or averaging to obtain
the truths, which overcomes the issue of the conventional voting or av-
eraging scheme that assumes all the sources are equally reliable.
We propose to use the same weighted averaging strategy to ob-
tain the truths. Since a source is usually consistent in the quality
of its claims, we can use the source weight, i.e., the source reliability
degree, w_s, as the weight for all the claims provided by s:

    x_n = ( Σ_{s∈S_n} w_s · x_n^s ) / ( Σ_{s∈S_n} w_s ).    (1)

However, the source reliability degrees are usually unknown a
priori. Therefore, the key question we want to explore next is how
to find the "best" assignment of w_s.

3.2.2 Source Weight Calculation
In this paper, we assume that all sources make their claims in-
dependently, i.e., they do not copy from each other. We leave the
case when source dependence happens for future work. We can re-
gard that each source’s information is independently sampled from
a hidden distribution. Errors, which are differences between the
claims and the truths, may occur for every source. The variance of
the error distribution reflects the reliability degree of this source: if
a source is unreliable, the errors it makes occur frequently and have
a wide spectrum in general, so the variance of the error distribution
is big. We believe that none of the sources make errors on purpose,
so the mean of the error distribution, which indicates its bias, is 0.
We propose to use the Gaussian distribution, which is widely
adopted in many fields, to describe errors. For each source s, its error
ε_s follows a Gaussian distribution with mean 0 and variance σ_s^2, i.e.,

    ε_s ∼ N(0, σ_s^2).
Since we have the source independence assumption, the errors
that sources make are independent too. We can then compute the
distribution for the error of the weighted combination in Eq.(1) as:
    ε_combine ∼ N( 0, (Σ_{s∈S} w_s^2 σ_s^2) / (Σ_{s∈S} w_s)^2 ),    (2)

where ε_combine = (Σ_{s∈S} w_s ε_s) / (Σ_{s∈S} w_s). Without loss of generality, we con-
strain Σ_{s∈S} w_s = 1.
For a Gaussian distribution, the variance determines the shape
of the distribution. If the variance is small, then the distribution
has a sharp and high central peak at the mean, which indicates a
high probability that errors are close to 0. Therefore, we want the
variance of ε_combine to be as small as possible. We formulate
this goal as the following optimization problem:

    min_{w_s}  Σ_{s∈S} w_s^2 σ_s^2    s.t.  Σ_{s∈S} w_s = 1,  w_s > 0, ∀s ∈ S.    (3)
Usually the theoretical σ_s^2 is unknown for each source. Inspired
by the sample variance, the following estimator can be used to estimate
the real variance σ_s^2:

    σ̂_s^2 = (1 / |N_s|) Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2,    (4)

where x_n^(0) is the initial truth for entity n (such as the mean, median or
mode of the claims on entity n), and |N_s| is the number of claims made
by source s. Another interpretation of Eq.(4) is that σ̂_s^2 represents
the mean of the squared errors that source s makes.
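Eq. (4) is straightforward to compute once an initial guess x_n^(0) of the truths is fixed. The sketch below (hypothetical data layout, not the authors' code) uses the per-entity median, one of the choices mentioned above, and reuses the sample claims of Table 2:

```python
import statistics

def estimate_variances(claims):
    """claims: {entity: {source: value}}. Returns {source: sigma_hat_sq},
    the mean squared deviation of each source from the per-entity median,
    i.e. Eq. (4) with the median as the initial truth x_n^(0)."""
    init = {n: statistics.median(obs.values()) for n, obs in claims.items()}
    sq_errs = {}
    for n, obs in claims.items():
        for s, v in obs.items():
            sq_errs.setdefault(s, []).append((v - init[n]) ** 2)
    return {s: sum(e) / len(e) for s, e in sq_errs.items()}

claims = {
    "NYC": {"A": 8.405, "B": 8.837, "C": 8.4},
    "DC": {"A": 0.646, "B": 0.6},
}
var = estimate_variances(claims)  # source A ends up with a smaller
                                  # sample variance than source B
```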
However, this estimator is not precise when |N_s| is very small,
so it cannot accurately reflect the real variance of the source. As
we observed the long-tail phenomenon in Section 2, most of the
sources have very few claims. The estimator σ̂_s^2 may then lead to an in-
appropriate weight assignment for most of the sources, and further
cause inaccurate truth computation. In order to solve this problem
brought by the long-tail phenomenon in the dataset, we should not
only consider the single value of the estimator σ̂_s^2 for each source,
but a range of values that can act as good estimates of σ_s^2. There-
fore, we adopt the (1 − α) confidence interval for σ_s^2, where α, also
known as the significance level, is usually a small number such as 0.05.
As we illustrated above, the difference between x_n^s and x_n^(0) fol-
lows a Gaussian distribution N(0, σ_s^2). Since a sum of squares of
standard Gaussian variables has a chi-squared distribution [17],
we have:

    Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2 / σ_s^2  =  |N_s| σ̂_s^2 / σ_s^2  ∼  χ^2(|N_s|).

Thus we have:

    P( χ^2_{α/2, |N_s|}  <  |N_s| σ̂_s^2 / σ_s^2  <  χ^2_{1−α/2, |N_s|} )  =  1 − α,

where χ^2_{p, |N_s|} denotes the p-quantile of the χ^2(|N_s|) distribution,
which gives the (1 − α) confidence interval of σ_s^2 as:

    ( Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2 / χ^2_{1−α/2, |N_s|} ,
      Σ_{n∈N_s} ( x_n^s − x_n^(0) )^2 / χ^2_{α/2, |N_s|} ).    (5)
Compared with Eq.(4), Eq.(5) is more informative. Although
two sources with different numbers of claims may have the same
σ̂_s^2, the confidence intervals of σ_s^2 for these two sources can be sig-
nificantly different, as shown in the following example.
Table 5: Example on calculating confidence intervals

Source ID   # Claims   σ̂_s^2   Confidence Interval (95%)
Source A    200        0.1      (0.0830, 0.1229)
Source B    200        3        (2.4890, 3.6871)
Source C    2          0.1      (0.0271, 3.9498)
Source D    2          3        (0.8133, 118.49)
EXAMPLE 2. Suppose from Example 1 we obtain the statistics and sample variances for sources A, B, C, and D as shown in Table 5. Both source A and source C have the same $\hat{\sigma}_s^2 = 0.1$, but source C makes only 2 claims while source A makes 200. The wide confidence interval of source C shows that its $\hat{\sigma}_s^2$ is rather unreliable, and that the real variance of a small source may be much bigger than its sample variance. In contrast, the confidence interval for source A is tight, and its upper bound is close to $\hat{\sigma}_s^2$. Similarly, sources B and D provide different numbers of claims but have the same $\hat{\sigma}_s^2 = 3$. These two sources are not as reliable as sources A and C because their sample variances are bigger, which indicates that their claims are far from the truths. The confidence intervals for sources B and D show patterns similar to those of sources A and C. It is clear from this simple example that the confidence interval of $\sigma_s^2$ carries more information than $\hat{\sigma}_s^2$, and is thus helpful for estimating more accurate source weights.
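The intervals in Table 5 can be reproduced numerically. The sketch below is illustrative and uses only the standard library, approximating the chi-squared quantile with the Wilson–Hilferty transformation; this approximation is accurate for sources with many claims (A and B) but degrades for very small $|N_s|$ (C and D), where an exact quantile function such as scipy.stats.chi2.ppf should be used instead.

```python
from statistics import NormalDist

def chi2_quantile(p, df):
    """Approximate the p-th (lower-tail) chi-squared quantile via the
    Wilson-Hilferty transformation; good for moderately large df."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def variance_ci(sample_var, n_claims, alpha=0.05):
    """(1 - alpha) confidence interval for sigma_s^2 from Eq.(5): divide the
    sum of squared errors |N_s| * sigma_hat^2 by chi-squared quantiles."""
    ss = n_claims * sample_var  # sum over n in N_s of (x_n^s - x_n^(0))^2
    lower = ss / chi2_quantile(1 - alpha / 2, n_claims)
    upper = ss / chi2_quantile(alpha / 2, n_claims)
    return lower, upper

# Source A from Table 5: 200 claims, sample variance 0.1.
lo, hi = variance_ci(0.1, 200)
```

With an exact chi-squared quantile all four rows of Table 5 are recovered; this Wilson–Hilferty version matches sources A and B closely.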
We propose to use the upper bound of the $(1-\alpha)$ confidence interval (denoted as $u_s^2$) as an estimator of $\sigma_s^2$, instead of $\hat{\sigma}_s^2$, in the optimization problem (3). The intuition behind this choice is that we want to minimize the variance of the combined estimate under the worst possible value of $\sigma_s^2$ for each source, i.e., to minimize the maximum possible loss. The upper bound $u_s^2$ is a biased estimator of $\sigma_s^2$, but the bias is large only for sources with few claims; in Table 5, for example, the upper bound exceeds $\hat{\sigma}_s^2$ by a factor of roughly 39.5 when a source makes 2 claims, but only about 1.23 when it makes 200. As the number of claims from a source increases, the bias drops.
We can substitute the unknown variance $\sigma_s^2$ in Eq.(3) with this upper bound $u_s^2$ and rewrite the optimization problem Eq.(3) as:
$$\min_{\{w_s\}} \sum_{s \in S} w_s^2 u_s^2 \quad \text{s.t.} \quad \sum_{s \in S} w_s = 1,\ w_s > 0,\ \forall s \in S. \qquad (6)$$
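Problem (6) admits a closed-form solution: the Lagrangian condition $2 w_s u_s^2 = \lambda$ gives $w_s \propto 1/u_s^2$, and normalizing so that $\sum_s w_s = 1$ automatically keeps every $w_s > 0$. A minimal sketch of this weighting and the subsequent weighted averaging (function names are illustrative, not from the paper):

```python
def catd_weights(u_sq):
    """Closed-form minimizer of problem (6): w_s proportional to 1/u_s^2.

    u_sq: list of upper confidence bounds u_s^2, one per source.
    """
    inv = [1.0 / u for u in u_sq]
    total = sum(inv)
    return [v / total for v in inv]

def weighted_truth(claims, weights):
    """Aggregate conflicting continuous claims on one object by
    weighted averaging, as done in truth discovery methods."""
    return sum(w * x for w, x in zip(weights, claims))

# Example with the u_s^2 upper bounds from Table 5: the source with the
# tightest bound (source A) dominates the aggregate.
w = catd_weights([0.1229, 3.6871, 3.9498, 118.49])
```

Sources with few claims receive wide intervals, hence large $u_s^2$ and small weights, which is exactly how the method discounts unreliable small sources.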
Citations

Journal ArticleDOI — Truth inference in crowdsourcing: is the problem solved?
TL;DR: It is believed that the truth inference problem is not fully solved; the limitations of existing algorithms are identified and promising research directions are pointed out.

Journal ArticleDOI — A Survey on Truth Discovery
TL;DR: This survey focuses on providing a comprehensive overview of truth discovery methods, summarizing them from different aspects, and offering guidelines on how to apply these approaches in application domains.

Posted Content — A Survey on Truth Discovery
TL;DR: Several truth discovery methods have been proposed for various scenarios and successfully applied in diverse application domains, but for the same object there usually exist conflicts among the collected multi-source information.

Proceedings ArticleDOI — Quality of Information Aware Incentive Mechanisms for Mobile Crowd Sensing Systems
TL;DR: This paper incorporates a crucial metric, users' quality of information (QoI), into incentive mechanisms for MCS systems, and designs incentive mechanisms based on reverse combinatorial auctions that approximately maximize the social welfare with a guaranteed approximation ratio.

Proceedings ArticleDOI — Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media
TL;DR: This paper automatically assesses the credibility of emerging claims with sparse presence in web sources and generates suitable explanations from judiciously selected sources, showing that the methods work well for early detection of emerging claims as well as for claims with limited presence on the web and social media.
References

Book — Convex Optimization
TL;DR: This book gives a comprehensive introduction to the subject, with a focus on recognizing convex optimization problems and then finding the most appropriate technique for solving them.

Journal ArticleDOI — Power-Law Distributions in Empirical Data
TL;DR: This work proposes a principled statistical framework for discerning and quantifying power-law behavior in empirical data by combining maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov–Smirnov (KS) statistic and likelihood ratios.

Book — Introduction to Mathematical Statistics
TL;DR: The authors present common probability distributions and likelihood-based methods, including material on Bayesian models.

Journal ArticleDOI — Data fusion
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion (complete, concise, and consistent data), and highlights the challenges of data fusion.

Journal ArticleDOI — Introduction to Mathematical Statistics
TL;DR: Paul G. Hoel's "Introduction to Mathematical Statistics" seems to me to be an excellent work, and if only it can become generally available it may have a most favourable effect on the situation just described.
Frequently Asked Questions (5)

Q1. What is the key to obtaining accurate truths?

Since all the truth discovery methods and the proposed CATD method use weighted voting or averaging to calculate truths, the estimated source reliability is the key to obtaining accurate truths.

By outperforming the baseline methods on all real-world datasets, the proposed CATD method demonstrates its power in modeling source reliability accurately even when sources make insufficient claims.

The authors can add a fixed "pseudo" count in the computation of source accuracy so that the estimation can be smoothed for sources with very few claims.

The long-tail phenomenon arises in crowd wisdom applications because many participants show interest in only a couple of questions, while a few participants answer many of them.

In their experiment, the Pearson's correlation coefficient between running time and the number of claims is 0.9991, indicating that they are highly linearly correlated.