
Range searching on uncertain data

TL;DR: Given a collection P of n uncertain points in ℝ, each represented by its one-dimensional probability density function (pdf), the article builds a data structure on P such that, given a query interval I and a probability threshold τ, it can quickly report all points of P that lie in I with probability at least τ.
Abstract: Querying uncertain data has emerged as an important problem in data management due to the imprecise nature of many measurement data. In this article, we study answering range queries over uncertain data. Specifically, we are given a collection P of n uncertain points in ℝ, each represented by its one-dimensional probability density function (pdf). The goal is to build a data structure on P such that, given a query interval I and a probability threshold τ, we can quickly report all points of P that lie in I with probability at least τ. We present various structures with linear or near-linear space and (poly)logarithmic query time. Our structures support pdf's that are either histograms or more complex ones such as Gaussian or piecewise algebraic.

Summary (3 min read)

1 Introduction

  • Range searching, namely preprocessing a set of points into a data structure so that all points within a given query range can be reported efficiently, is one of the most widely studied topics in computational geometry and database systems [2] , with a wide range of applications.
  • The generally agreed semantics for querying uncertain data is the thresholding approach [13, 16] , i.e., for a particular threshold τ , retrieve all the tuples that appear in the query range with probability at least τ .
  • Note that the independence assumption among the uncertain points is irrelevant as far as range queries are concerned.
  • This case can also be represented by the histogram model using infinitesimal pieces around these locations, so the histogram model also incorporates the discrete pdf case.
  • The authors refer to the former as the variable threshold version and the latter as the fixed threshold version of the problem.

Previous results.

  • The problem of range searching on uncertain data has received much attention in the database community over the last few years.
  • The earliest work [13] considered the above problem in a simpler form, namely, where each f_i(x) is a uniform distribution, a special case of their definition in which the histogram consists of only one piece.
  • The structures presented there are again heuristic solutions.
  • The authors make a significant theoretical step towards understanding the complexity of range searching on uncertain data.
  • The authors present linear or near-linear size data structures for both the fixed and variable threshold versions of the problem, with logarithmic or polylogarithmic query times.

2 Fixed-Threshold Range Queries

  • The authors present an optimal structure for answering range queries on uncertain data where the probability threshold τ is fixed.
  • The authors' structure uses linear space and answers a query in the optimal O(log n + k) time.
  • The authors first describe in Section 2.1 the reduction to the segments-below-point problem.
  • The authors then improve this structure to achieve linear size and O(log n + k) query time simultaneously (Section 2.4).
  • The authors conclude this section by describing how they make the structure dynamic.

2.1 A geometric reduction

  • As the authors increase x further, g(x) increases linearly, with the slope depending on the pieces of the histogram f that contain x and g(x).
  • The authors call this problem the segments-below-point problem.

2.2 Half-plane range reporting

  • This problem is dual to the well-known half-plane range reporting problem, for which there is an O(n)-size structure with O(log n + k) query time [11].
  • Note that the lines appear along the envelope in decreasing order of their slopes.
  • By using fractional cascading [10] on the x-coordinates of the envelopes of these layers, the total query time can be improved to O(k) plus the initial binary search in L_1(S).
  • Fractional cascading augments these lists with copies of elements from other lists, but the size of the structure remains linear, and it can be constructed in O(n log n) time [10, 11] .
  • The following statement is slightly more general than what appeared in [11] .

2.3 Segment-tree based structure

  • The authors later (cf. Section 2.4) bootstrap this structure to improve the query time to O(log n + k) while keeping the size linear.
  • Next, the authors recursively partition the left and right pieces of s following the r-ary tree.
  • Note that each piece spans a multi-slab at some node; a simplified sketch of this segment-tree decomposition appears after this list.
  • Since the first layer of each halfplane structure is a linear list, this is exactly the standard situation where fractional cascading [10] can be applied.
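To make this concrete, here is a simplified Python sketch of the segment-tree decomposition (ours, not the paper's code: it uses a binary tree rather than the r-ary tree, and a brute-force scan of each node's list stands in for the half-plane structure of Lemma 2.2; all names are illustrative).

```python
class SegmentsBelowPoint:
    """Binary segment tree over the x-projections of the segments.  Each
    segment (x1, y1, x2, y2), x1 < x2, is stored at the O(log n) canonical
    nodes whose slab its x-range spans; a linear scan of each node's list
    stands in for the half-plane structures."""

    def __init__(self, segments):
        self.xs = sorted({x for s in segments for x in (s[0], s[2])})
        self.n = len(self.xs) - 1                  # number of elementary slabs
        self.store = [[] for _ in range(4 * self.n)]
        for s in segments:
            self._insert(1, 0, self.n - 1, s)

    def _insert(self, node, lo, hi, s):
        slab_lo, slab_hi = self.xs[lo], self.xs[hi + 1]
        if s[2] <= slab_lo or slab_hi <= s[0]:
            return                                 # segment misses this slab
        if s[0] <= slab_lo and slab_hi <= s[2]:
            self.store[node].append(s)             # segment spans this slab
            return
        mid = (lo + hi) // 2
        self._insert(2 * node, lo, mid, s)
        self._insert(2 * node + 1, mid + 1, hi, s)

    @staticmethod
    def _y_at(s, x):
        x1, y1, x2, y2 = s
        return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

    def below(self, qx, qy):
        """Segments on or below (qx, qy); ties at slab boundaries are ignored
        (the paper assumes distinct endpoint coordinates)."""
        if not (self.xs[0] <= qx <= self.xs[-1]):
            return []
        out, node, lo, hi = [], 1, 0, self.n - 1
        while True:                                # walk the root-to-leaf path of qx
            out += [s for s in self.store[node] if self._y_at(s, qx) <= qy]
            if lo == hi:
                return out
            mid = (lo + hi) // 2
            if qx <= self.xs[mid + 1]:
                node, lo, hi = 2 * node, lo, mid
            else:
                node, lo, hi = 2 * node + 1, mid + 1, hi

# Two segments; the query point (2, 6) lies above both.
t = SegmentsBelowPoint([(0.0, 0.0, 10.0, 0.0), (0.0, 5.0, 4.0, 5.0)])
print(t.below(2.0, 6.0))   # both segments reported
```

Replacing each per-node scan with a half-plane (lines-below-point) structure, and threading fractional cascading through the first layers of those structures, is what yields the bounds discussed above.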

2.4 Optimal structure

  • The authors now describe an optimal structure for answering segments-below-point queries.
  • The authors start with the binary segment-tree structure from the previous subsection.
  • The following observation will help us in reducing the size.
  • The total time spent over all pairs in Λ is O(n log n).
  • Finally, as in the structure of Lemma 2.4, the authors also use fractional cascading on these half-plane range-reporting structures.

2.5 Dynamization

  • Finally, the authors briefly discuss how to make their structure dynamic, i.e., supporting insertions and deletions of uncertain points in the uncertain data set.
  • If only insertions are to be supported, the authors can apply the logarithmic method [6] to Theorem 2.5; a generic sketch of this method appears after this list.
  • The best known dynamic structure for halfplane range reporting uses O(n log n) space, supports insertions and deletions in O(polylog n) time amortized, and answers queries in O(log n + k) time [8, 9] .
  • Currently, it is unknown if one can obtain a linear-size dynamic structure with O(polylog n) update times.
  • Since super-linear space is unavoidable, the authors can simply plug this dynamic halfplane structure into the segment-tree based structure with fanout 2 (see the remark following Lemma 2.3) and obtain the following.
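The logarithmic method referenced above is generic enough to sketch independently of the halfplane machinery. Below is a minimal Python version (ours; `build` and `query` are assumed black boxes for any static structure, in the spirit of the Bentley-Saxe construction [6]): it keeps static substructures of sizes 2^0, 2^1, . . ., merges them on insertion like binary addition with carries, and answers a query by combining the answers of all live substructures.

```python
class LogMethod:
    """Insertion-only dynamization via the logarithmic method (Bentley-Saxe):
    static substructures of sizes 2^0, 2^1, ... rebuilt on carries."""

    def __init__(self, build, query):
        self.build = build          # build(items) -> static structure
        self.query_one = query      # query(structure, q) -> list of answers
        self.levels = []            # levels[i]: (structure, items) of size 2^i, or None

    def insert(self, item):
        carry = [item]
        for i, level in enumerate(self.levels):
            if level is None:       # free slot: deposit the carry here
                self.levels[i] = (self.build(carry), carry)
                return
            carry += level[1]       # occupied: merge and carry to next level
            self.levels[i] = None
        self.levels.append((self.build(carry), carry))

    def query(self, q):
        out = []
        for level in self.levels:   # a query fans out to O(log n) structures
            if level is not None:
                out += self.query_one(level[0], q)
        return out

# Example with a trivial "static structure": a sorted list plus binary search,
# answering 1D range-reporting queries.
import bisect
dyn = LogMethod(
    build=sorted,
    query=lambda arr, iv: arr[bisect.bisect_left(arr, iv[0]):bisect.bisect_right(arr, iv[1])],
)
for x in [5, 1, 9, 3]:
    dyn.insert(x)
print(sorted(dyn.query((2, 6))))    # [3, 5]
```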

3 Handling More General Pdf's

  • In Section 2.1, the authors converted the uncertain range searching problem to the problem of storing a set of x-monotone polygonal chains in a data structure so that all the chains below a query point can be reported efficiently.
  • The authors first give a Monte Carlo algorithm with running time O(r/δ²) that fails with probability O(δ³); then they show how to convert it to a Las Vegas algorithm that never fails and runs in expected time O(r).
  • With other families of pdf's, the threshold functions will have different forms.
  • Note that Lemma 2.1 easily extends to other piecewise functions.
  • Interestingly, the complexity of the lower envelope of these threshold functions only depends on how many times two pieces from two different threshold functions could intersect.

Lemma 3.3 For two Gaussian distributions, their threshold functions intersect at most twice.

  • If ϕ′(x) has one root, ϕ(x) is unimodal or inverse-unimodal; if ϕ′(x) has two roots, by combining with the fact that ϕ(−∞) = ϕ(+∞) = 0, the authors can conclude that ϕ(x) must have exactly one root, and that ϕ(x) is unimodal before the root and inverse-unimodal after it, or vice versa; see Figure 6.
  • The same argument implies that there is at most one intersection point of g_1 and g_2 that lies after ξ, implying that they have at most two intersection points.
  • Invoking Theorem 3.2, their structure for Gaussian distributions has size O(λ₂(n) log n) = O(n log n); a small sketch of the Gaussian threshold function appears after this list.
  • By construction, each point is reported only once.
  • This structure supports insertions and deletions of uncertain points in O(polylog n) time amortized.
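For intuition about the Gaussian case: because the Gaussian cdf Φ is continuous and strictly increasing, the threshold function has the closed form g(a) = Φ⁻¹(Φ(a) + τ) when Φ(a) + τ < 1, and g(a) = ∞ otherwise. A hedged Python sketch using only the standard library (illustrative, not the paper's code):

```python
from statistics import NormalDist

def gaussian_threshold(mu, sigma, a, tau):
    """g(a) = min { b : F(b) - F(a) >= tau } for a N(mu, sigma^2) point.

    F is the Gaussian cdf, strictly increasing and continuous, so g is a
    single inverse-cdf evaluation; +inf when less than tau mass lies to the
    right of a."""
    d = NormalDist(mu, sigma)
    target = d.cdf(a) + tau
    return d.inv_cdf(target) if target < 1.0 else float("inf")

# For a standard Gaussian, about 0.3413 of the mass lies in [0, 1],
# so g(0) with tau = 0.3413 is roughly 1.
print(round(gaussian_threshold(0.0, 1.0, 0.0, 0.3413), 2))  # ~1.0
```

Lemma 3.3 then says that the graphs of two such functions cross at most twice, which is what bounds the lower-envelope complexity by λ₂(n).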

5 Conclusion

  • In this paper the authors have studied the problem of range searching on uncertain data.
  • The authors data structures have linear or near-linear sizes and support range queries in logarithmic (or polylogarithmic) time.
  • For the other more complicated ones, some of the ideas (such as the geometric reductions) could be borrowed to devise more practical data structures.
  • A few heuristics based on R-trees have been proposed in [21] , but no provably good solutions are known.
  • Unlike in range searching, the authors need to consider the interplay between the uncertain points when answering a nearest-neighbor query, which seems to make the problem considerably more difficult.


Range Searching on Uncertain Data
Pankaj K. Agarwal
Duke University
Durham, NC, USA
pankaj@cs.duke.edu
Siu-Wing Cheng
HKUST
Hong Kong, China
scheng@cse.ust.hk
Yufei Tao
CUHK
Hong Kong, China
taoyf@cse.cuhk.edu.hk
Ke Yi
HKUST
Hong Kong, China
yike@cse.ust.hk
Abstract
Querying uncertain data has emerged as an important problem in data management due to
the imprecise nature of many measurement data. In this paper we study answering range queries
over uncertain data. Specifically, we are given a collection P of n uncertain points in R, each
represented by its one-dimensional probability density function (pdf). The goal is to build a data
structure on P such that given a query interval I and a probability threshold τ , we can quickly
report all points of P that lie in I with probability at least τ . We present various structures with
linear or near-linear space and (poly)logarithmic query time. Our structures support pdfs that
are either histograms or more complex ones such as Gaussian or piecewise algebraic.
1 Introduction
Range searching, namely preprocessing a set of points into a data structure so that all points within a given query range can be reported efficiently, is one of the most widely studied topics in computational geometry and database systems [2], with a wide range of applications. Most of the works to date deal with certain data, that is, the points are given their precise locations in ℝ^d. Recent years, however, have witnessed a dramatically increasing amount of attention devoted to managing uncertain data because many real-world measurements are inherently accompanied with uncertainty. Besides the recent efforts in the data management community (see the survey [15]), various issues related with data uncertainty have also been studied in artificial intelligence [20], machine learning [5], statistics [18], and many other areas.
A popular approach to model data uncertainty [13, 25] is to consider each uncertain point p as a probability distribution over space. It is usually assumed that the points are independent, but it is not necessary. The generally agreed semantics for querying uncertain data is the thresholding approach [13, 16], i.e., for a particular threshold τ, retrieve all the tuples that appear in the query range with probability at least τ. This problem turns out to be nontrivial even in one dimension. The naïve approach of examining each point one by one and computing its probability of being inside the query is obviously very expensive. Note that the independence assumption among the uncertain points is irrelevant as far as range queries are concerned.

A preliminary version of this paper appeared as "Indexing uncertain data" in ACM Symposium on Principles of Database Systems (PODS), 2009. P. K. Agarwal is supported by NSF under grants CNS-05-40347, CCF-06-35000, IIS-07-13498, and CCF-09-40671, by ARO grants W911NF-07-1-0376 and W911NF-08-1-0452, by an NIH grant 1P50-GM-08183-01, by a DOE grant OEG-P200A070505, and by a grant from the U.S.–Israel Binational Science Foundation. S.-W. Cheng is supported by HKRGC under grant GRF 612107; Y. Tao is supported by HKRGC under grants GRF 1202/06, GRF 4161/07, and GRF 4173/08; and K. Yi is supported by Hong Kong Direct Allocation Grant (DAG07/08).
Problem definition. We now define our problem more formally. Let P = {p_1, . . . , p_n} be a set of n uncertain points in ℝ, where each p_i is specified by its probability density function (pdf) f_i : ℝ → ℝ⁺ ∪ {0}. We assume that each f_i is a piecewise-uniform function, i.e., a histogram, consisting of at most s pieces for some integer s ≥ 1. In practice, such a histogram can be used to approximate any pdf with arbitrary precision. In some applications each point p_i has a discrete pdf, namely, it could appear at one of a few locations, each with a certain probability. This case can also be represented by the histogram model using infinitesimal pieces around these locations, so the histogram model also incorporates the discrete pdf case. We will adopt the histogram model by default throughout the paper. For simplicity, we assume s to be a constant for most of the discussion. Some of our structures also support more complicated pdfs (such as Gaussian or piecewise algebraic), and we will explicitly say so for these structures.

Given the set P and the associated pdfs, the goal is to build a data structure on them so that for a query interval I and a threshold τ, all points p such that Pr[p ∈ I] ≥ τ are reported efficiently. We also consider the version where τ is fixed in advance. We refer to the former as the variable threshold version and the latter as the fixed threshold version of the problem. The latter version is useful since in many applications the threshold is always fixed at, say, 0.5. Moreover, the user can often tolerate some error ε in the probability. In this case we can build 1/ε fixed-threshold structures with τ = ε, 2ε, . . . , 1, so that a query with any threshold can be answered with error at most ε.
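To fix ideas, the following Python sketch (ours; all names are illustrative) implements the histogram model and the naive query that examines each point one by one. Each query costs O(ns) time, which is the expensive baseline our structures avoid.

```python
class HistogramPdf:
    """Piecewise-uniform pdf: density ys[i] on [xs[i], xs[i+1])."""

    def __init__(self, xs, ys):
        assert len(xs) == len(ys) + 1
        self.xs, self.ys = xs, ys

    def cdf(self, x):
        """F(x): total probability mass to the left of x."""
        total = 0.0
        for i, y in enumerate(self.ys):
            lo, hi = self.xs[i], self.xs[i + 1]
            if x <= lo:
                break
            total += y * (min(x, hi) - lo)
        return total

    def prob_in(self, xl, xr):
        """Pr[p in [xl, xr]] = F(xr) - F(xl)."""
        return self.cdf(xr) - self.cdf(xl)


def naive_range_query(points, xl, xr, tau):
    """Examine every uncertain point: O(n * s) per query."""
    return [pid for pid, pdf in points.items() if pdf.prob_in(xl, xr) >= tau]


# A point distributed uniformly on [0, 2) lies in [1, 3] with probability 0.5.
p1 = HistogramPdf([0.0, 2.0], [0.5])
print(naive_range_query({"p1": p1}, 1.0, 3.0, 0.5))  # ['p1']
```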
Applications. The problem of range searching over uncertain data was first introduced by Cheng
et al. [13] and has numerous applications in practice. For example, a certain measurement, say
temperature, may be taken by multiple sensors in a sensor network. Due to various imprecision
factors, the readings of these sensors may not be identical, in which case the temperature of a
location can be conveniently modeled as a pdf. In this context, a query in our problem would
retrieve “all the locations whose temperatures are between 100 and 120 degrees with probability
at least 50%”. It is not hard to see that there are many similar scenarios involving uncertain data.
In fact, our problem is also important even in several traditional applications where no uncertainty
seems to exist. For instance, consider a movie rating system (such as the one at Amazon) where
each reviewer can give a rating from 1 to 10. A query of our problem would find “all the movies
such that at least 90% of the ratings it receives are at least 8”.
Previous results. The problem of range searching on uncertain data has received much attention in the database community over the last few years. The earliest work [13] considered the above problem in a simpler form, namely, where each f_i(x) is a uniform distribution, a special case of our definition in which the histogram consists of only one piece. For the fixed-threshold version with threshold 0 < τ ≤ 1, they proposed a structure of O(τ⁻¹n) size with O(τ⁻¹ log n + k) query time, where k is the output size. These bounds depend on τ⁻¹, which can be arbitrarily large. This structure does not extend to histograms consisting of two or more pieces. They presented heuristics for the variable threshold version without any performance guarantees. Tao et al. [25] considered the problem in two and higher dimensions, and presented some data structures based on space-partitioning heuristics. They prune points whose probability of being inside the query range is either too low or too high, but the query procedure visits all points of P in the worst case. Finally, yet another heuristic is presented in [22], but it is still the same as a sequential scan in the worst case.
Cheng et al. [13] also showed that the fixed-threshold version of the problem is at least as difficult as 2D halfplane range reporting (i.e., report all points lying in a query halfplane), and that it can be reduced to 2D simplex queries (report all points lying in a query triangle). However, the complexities of these two problems differ significantly: with linear space, a halfplane range-reporting query can be answered in Θ(log n + k) time [11], while the latter takes Ω(√n) time [12]. So there is a significant gap between the current upper and lower bounds for range searching over uncertain data.
Also related is the work by Singh et al. [24], who considered the problem of querying uncertain
data that are categorical, namely, each random object takes a value from a discrete, unordered
domain. The structures presented there are again heuristic solutions.
Our results. In this paper, we make a significant theoretical step towards understanding the complexity of range searching on uncertain data. We present linear or near-linear size data structures for both the fixed and variable threshold versions of the problem, with logarithmic or polylogarithmic query times. Specifically, we obtain the following results.

For the fixed-threshold version, we present a linear-size structure that answers a query in O(log n + k) time (Section 2). These bounds are clearly optimal (in the comparison model of computation). We first show that this problem can be reduced to a so-called segments-below-point problem: storing a set of segments in ℝ² so that all segments lying below a query point can be reported quickly. Then we present an optimal structure for the segments-below-point problem, a linear-size structure with O(log n + k) query time. This result shows that the fixed-threshold version has exactly the same complexity as the halfplane range-reporting problem, closing the large gap left in [13]. In Section 3 we present a simpler structure of size O(λ(n) log n), where λ(n) denotes a near-linear Davenport–Schinzel bound depending on the family of pdfs, and query time O(log n + k). This structure extends to more general pdfs, such as Gaussian distributions or other piecewise algebraic pdfs.

For the variable-threshold version, we use a different reduction and show that it can be solved by carefully storing a number of points in ℝ³ in a structure for answering halfspace range queries. Combining with the recent result of Afshani and Chan [1] for 3D halfspace range reporting, we obtain a structure for the variable-threshold version of our problem with O(n log² n) space and O(log³ n + k) query time (Section 4). Although the bounds have extra log factors in this case, our result shows that this problem is still significantly easier than 2D simplex queries.

Finally, we show that our structures can be dynamized, supporting insertions and deletions of (uncertain) points with a slight increase in the query time.
2 Fixed-Threshold Range Queries
We present an optimal structure for answering range queries on uncertain data where the probability threshold τ is fixed. Our structure uses linear space and answers a query in the optimal O(log n + k) time. These bounds do not depend on the particular value of τ. We first describe in Section 2.1 the reduction to the segments-below-point problem. Next we describe a segment-tree based data structure that uses linear space and answers a query in O(√n + k) time, or uses O(n log n) space and answers a query in O(log n + k) time (Section 2.3). We then improve this structure to achieve linear size and O(log n + k) query time simultaneously (Section 2.4). We conclude this section by describing how we make the structure dynamic.

Figure 1: Reduction to the segments-below-point problem: (i) pdf, (ii) cdf, and (iii) threshold function.
2.1 A geometric reduction
Let p be an uncertain point in ℝ, and let f : ℝ → ℝ be its pdf.¹ Suppose the histogram of f consists of s pieces, and let

f(x) = y_i, for x_{i−1} ≤ x < x_i, i = 1, . . . , s.

We set x_0 = −∞, x_s = ∞, and y_1 = y_s = 0; see Figure 1(i). The cumulative distribution function (cdf) F(x) = ∫_{−∞}^{x} f(t) dt is a monotone piecewise-linear function consisting of s pieces; see Figure 1(ii). Let the query range be [x_l, x_r]. The probability of p falling inside [x_l, x_r] is F(x_r) − F(x_l). We define a function g : ℝ → ℝ, which we refer to as the threshold function. For a given a ∈ ℝ, let g(a) be the minimum value b such that F(b) − F(a) ≥ τ. If no such b exists, g(a) is set to ∞; see Figure 1(iii).

Lemma 2.1 The function g(x) is non-decreasing and piecewise linear, consisting of at most 2s pieces.

Proof: Suppose we continuously vary x from −∞ to ∞. For x = −∞, g(x) = min{y | F(y) = τ}; g(x) stays the same until x reaches x_1. As we increase x further, g(x) increases linearly, with the slope depending on the pieces of the histogram f that contain x and g(x). When either x or g(x) passes through one of the x_i's, the slope changes. There are at most 2(s − 1) such changes; see Figure 1.

Given the description of the pdf f, the function g can be constructed easily. Once we have the threshold function g, the condition Pr[p ∈ [x_l, x_r]] ≥ τ simply becomes checking whether x_r ≥ g(x_l). Geometrically, this is equivalent to testing whether the point (x_l, x_r) ∈ ℝ² lies above the polygonal line representing the graph of g (see Figure 1). We construct the threshold function g_p for each point p in P. Let S be the set of at most 2ns segments in ℝ² that form the pieces of these n functions; S can be constructed in O(n) time. We label each segment of g_p with p.

¹ Throughout this paper we do not distinguish between a function and its graph.

The problem of reporting the points of P that lie in the interval [x_l, x_r] with probability at least τ becomes reporting the segments of S that lie below the point (x_l, x_r) ∈ ℝ²: if the procedure returns a segment labeled with p, we return the point p. Each polygonal line being x-monotone, no point is reported more than once.

We thus have the following problem at hand: let S be a set of n segments in ℝ². Build a data structure on S so that for a query point q ∈ ℝ², the set of segments in S lying directly below q, denoted by S[q], can be reported efficiently. For simplicity, we assume the coordinates of the endpoints of S to be distinct; this assumption can be removed using standard techniques. We call this problem the segments-below-point problem.
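The reduction is easy to mirror in code. The sketch below (ours; illustrative names, not the paper's implementation) evaluates the threshold function g of a histogram pdf by inverting its piecewise-linear cdf, and answers a single-point query via the test x_r ≥ g(x_l). An actual implementation would instead precompute the at most 2s breakpoints of g once, as in Lemma 2.1, and store the resulting segments.

```python
def hist_cdf(xs, ys, x):
    """cdf of a piecewise-uniform pdf with density ys[i] on [xs[i], xs[i+1])."""
    total = 0.0
    for i, y in enumerate(ys):
        if x <= xs[i]:
            break
        total += y * (min(x, xs[i + 1]) - xs[i])
    return total

def threshold_g(xs, ys, a, tau):
    """g(a) = min { b : F(b) - F(a) >= tau }, or +inf if no such b exists."""
    target = hist_cdf(xs, ys, a) + tau
    if target > hist_cdf(xs, ys, xs[-1]):
        return float("inf")
    acc = 0.0
    for i, y in enumerate(ys):
        piece = y * (xs[i + 1] - xs[i])
        if y > 0 and acc + piece >= target:
            return xs[i] + (target - acc) / y   # invert the linear piece of F
        acc += piece
    return float("inf")

def reported(xs, ys, xl, xr, tau):
    """Pr[p in [xl, xr]] >= tau  <=>  (xl, xr) lies on or above the graph of g."""
    return xr >= threshold_g(xs, ys, xl, tau)

# Uniform on [0, 2): g(1.0) = 2.0 for tau = 0.5, so the query point (1.0, 3.0)
# lies above g and p is reported.
print(threshold_g([0.0, 2.0], [0.5], 1.0, 0.5))     # 2.0
print(reported([0.0, 2.0], [0.5], 1.0, 3.0, 0.5))   # True
```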
2.2 Half-plane range reporting
We begin by describing a structure for the special case when all segments in S are full lines and we want to report the lines of S lying below a query point. This problem is dual to the well-known half-plane range reporting problem, for which there is an O(n)-size structure with O(log n + k) query time [11]. We briefly describe a variant of this structure (in the dual setting), denoted by H(S), which we will use as a building block.

If we view each line ℓ in S as a linear function ℓ : ℝ → ℝ, then the lower envelope of S is the graph of the function E_S(x) = min_{ℓ ∈ S} ℓ(x), i.e., it is the boundary of the unbounded region in the planar map induced by S that lies below all the lines of S (see Figure 2). We represent the lower envelope as a sequence x_0 = −∞, ℓ_1, x_1, ℓ_2, . . . , ℓ_r, x_r = +∞, where the x_i's are the x-coordinates of the vertices of the lower envelope, and ℓ_i is the line that appears on the lower envelope in the interval [x_{i−1}, x_i]. Note that the lines appear along the envelope in decreasing order of their slopes.

Figure 2: The data structure for a set of lines: the thick polygonal chain is the lower envelope of S; L_1(S) = {1, 2, 6}, L_2(S) = {3, 7}, L_3(S) = {4, 5}.

We partition S into a sequence L_1(S), L_2(S), . . . of subsets, called layers. L_1(S) ⊆ S consists of the lines that appear on the lower envelope of S. For i > 1, L_i(S) is the set of lines that appear on the lower envelope of S \ ⋃_{j=1}^{i−1} L_j(S); see Figure 2. For each i, we store the aforementioned representation of layer L_i(S) in a list. To answer a query q = (q_x, q_y), we start from L_1(S) and locate the interval [x_{i−1}, x_i] that contains q_x, using binary search. Next we walk along the envelope of L_1(S) in both directions, starting from ℓ_i, to report the lines lying below q, in time linear in the output size. Then we query the rest of the layers L_2(S), L_3(S), . . . in order, until no lines are reported at a certain layer. By using fractional cascading [10] on the x-coordinates of the envelopes of these layers, the total query time can be improved to O(k) plus the initial binary search in L_1(S). Fractional cascading augments these lists with copies of elements from other lists, but the size of the structure remains linear, and it can be constructed in O(n log n) time [10, 11]. The following statement is slightly more general than what appeared in [11].
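As a concrete, simplified illustration of H(S), the following Python sketch (ours, not the paper's code) peels off the layers L_1(S), L_2(S), . . . and answers a query by a binary search in each visited layer followed by the bidirectional walk. It omits fractional cascading, so every visited layer pays its own O(log n) binary search rather than O(1).

```python
from bisect import bisect_right

def lower_envelope(lines):
    """lines: (slope, intercept) pairs.  Returns (env, breaks): the envelope
    lines in decreasing slope order and the x-coordinates of its vertices."""
    lines = sorted(set(lines), key=lambda l: (-l[0], l[1]))
    env = []
    for m, b in lines:
        if env and env[-1][0] == m:      # same slope, larger intercept: skip
            continue
        while len(env) >= 2:
            (m1, b1), (m2, b2) = env[-2], env[-1]
            # (m2, b2) never appears on the envelope if the new line (m, b)
            # crosses (m1, b1) no later than (m2, b2) does.
            if (b - b1) * (m1 - m2) <= (b2 - b1) * (m1 - m):
                env.pop()
            else:
                break
        env.append((m, b))
    breaks = [(b2 - b1) / (m1 - m2) for (m1, b1), (m2, b2) in zip(env, env[1:])]
    return env, breaks

def build_layers(lines):
    """Peel off the layers L_1(S), L_2(S), ... of lower envelopes."""
    remaining, layers = list(set(lines)), []
    while remaining:
        env, breaks = lower_envelope(remaining)
        layers.append((env, breaks))
        on_env = set(env)
        remaining = [l for l in remaining if l not in on_env]
    return layers

def lines_below(layers, qx, qy):
    """Report all lines passing on or below the query point (qx, qy)."""
    out = []
    for env, breaks in layers:
        i = bisect_right(breaks, qx)              # envelope line active at qx
        if env[i][0] * qx + env[i][1] > qy:
            break                                 # whole layer lies above q: stop
        j = i                                     # walk left, then right, from i
        while j >= 0 and env[j][0] * qx + env[j][1] <= qy:
            out.append(env[j]); j -= 1
        j = i + 1
        while j < len(env) and env[j][0] * qx + env[j][1] <= qy:
            out.append(env[j]); j += 1
    return out

# Three lines; (0, 0) is the only one on or below the query point (5, 0.5).
layers = build_layers([(1.0, 0.0), (0.0, 0.0), (-1.0, 10.0)])
print(lines_below(layers, 5.0, 0.5))   # [(0.0, 0.0)]
```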

Citations
Proceedings Article
01 Jan 1998
TL;DR: It is shown how to answer halfspace range reporting queries in O(log n + k) expected time for an output size k, and the first optimal randomized algorithm for the construction of the (≤k)-level in an arrangement of n planes in three dimensions is obtained.
Abstract: Given n points in three dimensions, we show how to answer halfspace range reporting queries in O(log n + k) expected time for an output size k. Our data structure can be preprocessed in optimal O(n log n) expected time. We apply this result to obtain the first optimal randomized algorithm for the construction of the (≤k)-level in an arrangement of n planes in three dimensions. The algorithm runs in O(n log n + nk²) expected time. Our techniques are based on random sampling. Applications in two dimensions include an improved data structure for "k nearest neighbors" queries and an algorithm that constructs the order-k Voronoi diagram in O(n log n + nk log k) expected time.

84 citations

Book ChapterDOI
02 Sep 2013
TL;DR: The most likely hull under the point model can be computed in O(n³) time for n points in d = 2 dimensions, but it is NP-hard for d ≥ 3 dimensions, and it is shown that the problem is NP-hard under the multipoint model even for d = 2 dimensions.
Abstract: Consider a set of points in d dimensions where the existence or the location of each point is determined by a probability distribution. The convex hull of this set is a random variable distributed over exponentially many choices. We are interested in finding the most likely convex hull, namely, the one with the maximum probability of occurrence. We investigate this problem under two natural models of uncertainty: the point (also called the tuple) model, where each point (site) has a fixed position s_i but only exists with some probability π_i, for 0 < π_i ≤ 1, and the multipoint model, where each point has multiple possible locations or it may not appear at all. We show that the most likely hull under the point model can be computed in O(n³) time for n points in d = 2 dimensions, but it is NP-hard for d ≥ 3 dimensions. On the other hand, we show that the problem is NP-hard under the multipoint model even for d = 2 dimensions. We also present hardness results for approximating the probability of the most likely hull. While we focus on the most likely hull for concreteness, our results hold for other natural definitions of a probabilistic hull.

60 citations


Cites background from "Range searching on uncertain data"

  • ...There also has been extensive research in the database community on clustering and ranking of uncertain data [4,5,10] and on range searching and indexing [1,2,3]....


Posted Content
TL;DR: These results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of β-hull that may be a useful representation of uncertain hulls.
Abstract: We study the convex-hull problem in a probabilistic setting, motivated by the need to handle data uncertainty inherent in many applications, including sensor databases, location-based services and computer vision. In our framework, the uncertainty of each input site is described by a probability distribution over a finite number of possible locations including a null location to account for non-existence of the point. Our results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of β-hull that may be a useful representation of uncertain hulls.

52 citations

Journal ArticleDOI
TL;DR: Based on local information: local density and local uncertainty level, a new outlier detection algorithm is designed in this paper to calculate uncertain local outlier factor (ULOF) for each point in an uncertain dataset.
Abstract: Based on local information: local density and local uncertainty level, a new outlier detection algorithm is designed in this paper to calculate uncertain local outlier factor (ULOF) for each point in an uncertain dataset. In this algorithm, all concepts, definitions and formulations for conventional local outlier detection approach (LOF) are generalized to include uncertainty information. The least squares algorithm on multi-times curve fitting is used to generate an approximate probability density function of distance between two points. An iteration algorithm is proposed to evaluate K–η–distance and a pruning strategy is adopted to reduce the size of candidate set of nearest-neighbors. The comparison between ULOF algorithm and the state-of-the-art approaches has been made. Results of several experiments on synthetic and real data sets demonstrate the effectiveness of the proposed approach.

40 citations

Book ChapterDOI
15 Dec 2014
TL;DR: An alternative approach to the most likely nearest neighbor (LNN) search using Pareto sets is presented, which gives a linear-space data structure and sub-linear query time in 1D for average and smoothed analysis models, as well as worst-case with a bounded number of distinct probabilities.
Abstract: We consider the problem of nearest-neighbor searching among a set of stochastic sites, where a stochastic site is a tuple (s_i, π_i) consisting of a point s_i in a d-dimensional space and a probability π_i determining its existence. The problem is interesting and non-trivial even in 1 dimension, where the Most Likely Voronoi Diagram (LVD) is shown to have worst-case complexity Ω(n²). We then show that under more natural and less adversarial conditions, the size of the 1-dimensional LVD is significantly smaller: (1) Θ(kn) if the input has only k distinct probability values, (2) O(n log n) on average, and (3) O(n√n) under smoothed analysis. We also present an alternative approach to the most likely nearest neighbor (LNN) search using Pareto sets, which gives a linear-space data structure and sub-linear query time in 1D for average and smoothed analysis models, as well as worst-case with a bounded number of distinct probabilities. Using the Pareto-set approach, we can also reduce the multi-dimensional LNN search to a sequence of nearest neighbor and spherical range queries.

26 citations

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations

Book
01 Oct 2004
TL;DR: Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts, and discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.
Abstract: The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. In order to present a unified treatment of machine learning problems and solutions, it discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The text covers such topics as supervised learning, Bayesian decision theory, parametric methods, multivariate methods, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, and reinforcement learning. New to the second edition are chapters on kernel machines, graphical models, and Bayesian estimation; expanded coverage of statistical tests in a chapter on design and analysis of machine learning experiments; case studies available on the Web (with downloadable results for instructors); and many additional exercises. All chapters have been revised and updated. Introduction to Machine Learning can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods. Adaptive Computation and Machine Learning series

3,950 citations


"Range searching on uncertain data" refers background in this paper

  • ...Besides the recent efforts in the data management community (see the survey [15]), various issues related with data uncertainty have also been studied in artificial intelligence [20], machine learning [5], statistics [18], and many other areas....


  • ...the recent efforts in the data management community (see the survey [Dalvi et al. 2009]), various issues related with data uncertainty have also been studied in artificial intelligence [Kanal and Lemmer 1986], machine learning [Alpaydin 2004], statistics [Halpern 2003], and many other areas....


Book
01 Jan 1988
TL;DR: Qualitative Probabilistic Reasoning and Cognitive models, Dempster-Shafer Theory in Knowledge Representation, and Possibility Theory: Semantics and Applications.
Abstract: Qualitative Probabilistic Reasoning and Cognitive Models. Exploiting Functional Dependencies in Qualitative Probabilistic Reasoning (M.P. Wellman). Qualitative Propagation and Scenario-Based Scheme for Explaining Probabilistic Reasoning (M. Henrion, M.J. Druzdel). Propagating Uncertainty in Rule Based Cognitive Modeling (T.R. Shultz). Context-Dependent Similarity (Y. Cheng). Abductive Probabilistic Reasoning and KB Development. Similarity Networks for the Construction of Multiple-Faults Belief Networks (D. Heckerman). Separable and Transitive Graphoids (D. Geiger, D. Heckerman). Integrating Probabilistic, Taxonomic and Causal Knowledge in Abductive Diagnosis (D. Lin, R. Goebel). What is the Most Likely Diagnosis (D. Poole, G.M. Provan). Probabilistic Evaluation of Candidate Sets for Multidisorder Diagnosis (T.D. Wu). Kutato: An Entropy-Driven System for Construction of Probabilistic Expert Systems from Databases (E. Herskovits, G. Cooper). Problem Formulation and Control of Reasoning. Ideal Reformulation of Belief Networks (J.S. Breese, E.J. Horvitz). Computationally-Optimal Real-Resource Strategies for Independent, Uninterruptible Methods (D. Einav, M.R. Fehling). Problem Formulation as the Reduction of a Decision Model (D.E. Heckerman, E.J. Horvitz). Dynamic Construction of Belief Networks (R.P. Goldman, E. Charniak). A New Algorithm for Finding MAP Assignments to Belief Networks (S.E. Shimony, E. Charniak). Belief Network Decomposition. Directed Reduction Algorithms and Decomposable Graphs (R.D. Shachter, S.K. Andersen, K.L. Poh). Optimal Decomposition of Belief Networks (W.X. Wen). Pruning Bayesian Networks for Efficient Computation (M. Baker, T.E. Boult). On Heuristics for Finding Loop Cutsets in Multiply-Connected Belief Networks (J. Stillman). A Combination of Cutset Conditioning with Clique-Tree Propagation in the Pathfinder System (H.J. Suermondt, G.F. Cooper, D.E. Heckerman). Equivalence and Synthesis of Causal Models (T.S. Verma, J. Pearl). Possibility Theory: Semantics and Applications. Possibility as Similarity: The Semantics of Fuzzy Logic (E. Ruspini). Integrating Case-Based and Rule-Based Reasoning: the Possibilistic Connection (S. Dutta, P.P. Bonissone). Credibility Discounting in the Theory of Approximate Reasoning (R.R. Yager). Updating with Belief Functions, Ordinal Conditional Functions and Possibility Measures (D. Dubois, H. Prade). A Hierarchical Approach to Designing Approximate Reasoning-Based Controllers for Dynamic Physical Systems (H.R. Berenji, et al.). Dempster-Shafer: Graph Decomposition, FMT, and Interpretations. A New Approach to Updating Beliefs (R. Fagin, J.Y. Halpern). The Transferable Belief Model and Other Interpretations of Dempster-Shafer's Model (P. Smets). Valuation-Based Systems for Discrete Optimization (P.P. Shenoy). Computational Aspects of the Mobius Transformation (R. Kennes, P. Smets). Using Dempster-Shafer Theory in Knowledge Representation (A. Saffiotti).

1,407 citations


"Range searching on uncertain data" refers background in this paper

  • ...Besides the recent efforts in the data management community (see the survey [15]), various issues related with data uncertainty have also been studied in artificial intelligence [20], machine learning [5], statistics [18], and many other areas....


Proceedings ArticleDOI
Kenneth L. Clarkson
06 Jan 1988
TL;DR: Asymptotically tight bounds for a combinatorial quantity of interest in discrete and computational geometry, related to halfspace partitions of point sets, are given.
Abstract: Random sampling is used for several new geometric algorithms. The algorithms are "Las Vegas," and their expected bounds are with respect to the random behavior of the algorithms. One algorithm reports all the intersecting pairs of a set of line segments in the plane, and requires O(A + n log n) expected time, where A is the size of the answer, the number of intersecting pairs reported. The algorithm requires O(n) space in the worst case. Another algorithm computes the convex hull of a point set in E³ in O(n log A) expected time, where n is the number of points and A is the number of points on the surface of the hull. A simple Las Vegas algorithm triangulates simple polygons in O(n log log n) expected time. Algorithms for half-space range reporting are also given. In addition, this paper gives asymptotically tight bounds for a combinatorial quantity of interest in discrete and computational geometry, related to halfspace partitions of point sets.

1,163 citations


"Range searching on uncertain data" refers background in this paper

  • ...Following the random-sampling framework of Clarkson and Shor [1989], the expected size of C_t is at most O(2^i)....


  • ...and Shor [14], the expected size of C_t is at most O(2^i)....
