
Range searching on uncertain data

TL;DR: Given a collection P of n uncertain points in ℝ, each represented by its one-dimensional probability density function (pdf), the article builds a data structure on P such that, given a query interval I and a probability threshold τ, it can quickly report all points of P that lie in I with probability at least τ.
Abstract: Querying uncertain data has emerged as an important problem in data management due to the imprecise nature of many measurement data. In this article, we study answering range queries over uncertain data. Specifically, we are given a collection P of n uncertain points in ℝ, each represented by its one-dimensional probability density function (pdf). The goal is to build a data structure on P such that, given a query interval I and a probability threshold τ, we can quickly report all points of P that lie in I with probability at least τ. We present various structures with linear or near-linear space and (poly)logarithmic query time. Our structures support pdf's that are either histograms or more complex ones such as Gaussian or piecewise algebraic.

Summary (3 min read)

1 Introduction

  • Range searching, namely preprocessing a set of points into a data structure so that all points within a given query range can be reported efficiently, is one of the most widely studied topics in computational geometry and database systems [2] , with a wide range of applications.
  • The generally agreed semantics for querying uncertain data is the thresholding approach [13, 16] , i.e., for a particular threshold τ , retrieve all the tuples that appear in the query range with probability at least τ .
  • Note that the independence assumption among the uncertain points is irrelevant as far as range queries are concerned.
  • This case can also be represented by the histogram model using infinitesimal pieces around these locations, so the histogram model also incorporates the discrete pdf case.
  • The authors refer to the former as the variable threshold version and the latter as the fixed threshold version of the problem.

Previous results.

  • The problem of range searching on uncertain data has received much attention in the database community over the last few years.
  • The earliest work [13] considered the above problem in a simpler form, namely, where each f_i(x) is a uniform distribution, a special case of their definition in which the histogram consists of only one piece.
  • The structures presented there are again heuristic solutions.
  • The authors make a significant theoretical step towards understanding the complexity of range searching on uncertain data.
  • The authors present linear or near-linear size data structures for both the fixed and variable threshold versions of the problem, with logarithmic or polylogarithmic query times.

2 Fixed-Threshold Range Queries

  • The authors present an optimal structure for answering range queries on uncertain data where the probability threshold τ is fixed.
  • The authors' structure uses linear space and answers a query in the optimal O(log n + k) time.
  • The authors first describe in Section 2.1 the reduction to the segments-below-point problem.
  • The authors then improve this structure to achieve linear size and O(log n + k) query time simultaneously (Section 2.4).
  • The authors conclude this section by describing how they make the structure dynamic.

2.1 A geometric reduction

  • As the authors increase x further, g(x) increases linearly, with the slope depending on the pieces of the histogram f that contain x and g(x).
  • The authors call this problem the segments-below-point problem.

2.2 Half-plane range reporting

  • This problem is dual to the well-known half-plane range reporting problem, for which there is an O(n)-size structure with O(log n + k) query time [11].
  • Note that the lines appear along the envelope in decreasing order of their slopes.
  • By using fractional cascading [10] on the x-coordinates of the envelopes of these layers, the total query time can be improved to O(k) plus the initial binary search in L_1(S).
  • Fractional cascading augments these lists with copies of elements from other lists, but the size of the structure remains linear, and it can be constructed in O(n log n) time [10, 11] .
  • The following statement is slightly more general than what appeared in [11] .

2.3 Segment-tree based structure

  • The authors later (cf. Section 2.4) bootstrap this structure to improve the query time to O(log n + k) while keeping the size linear.
  • Next, the authors recursively partition the left and right pieces of s following the r-ary tree.
  • Note that each piece spans a multi-slab at some node; a simplified sketch of this segment-tree decomposition appears after this list.
  • Since the first layer of each halfplane structure is a linear list, this is exactly the standard situation where fractional cascading [10] can be applied.
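To make this concrete, here is a simplified Python sketch of the segment-tree decomposition (ours, not the paper's code: it uses a binary tree rather than the r-ary tree, and a brute-force scan of each node's list stands in for the half-plane structure of Lemma 2.2; all names are illustrative).

```python
class SegmentsBelowPoint:
    """Binary segment tree over the x-projections of the segments.  Each
    segment (x1, y1, x2, y2), x1 < x2, is stored at the O(log n) canonical
    nodes whose slab its x-range spans; a linear scan of each node's list
    stands in for the half-plane structures."""

    def __init__(self, segments):
        self.xs = sorted({x for s in segments for x in (s[0], s[2])})
        self.n = len(self.xs) - 1                  # number of elementary slabs
        self.store = [[] for _ in range(4 * self.n)]
        for s in segments:
            self._insert(1, 0, self.n - 1, s)

    def _insert(self, node, lo, hi, s):
        slab_lo, slab_hi = self.xs[lo], self.xs[hi + 1]
        if s[2] <= slab_lo or slab_hi <= s[0]:
            return                                 # segment misses this slab
        if s[0] <= slab_lo and slab_hi <= s[2]:
            self.store[node].append(s)             # segment spans this slab
            return
        mid = (lo + hi) // 2
        self._insert(2 * node, lo, mid, s)
        self._insert(2 * node + 1, mid + 1, hi, s)

    @staticmethod
    def _y_at(s, x):
        x1, y1, x2, y2 = s
        return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

    def below(self, qx, qy):
        """Segments on or below (qx, qy); ties at slab boundaries are ignored
        (the paper assumes distinct endpoint coordinates)."""
        if not (self.xs[0] <= qx <= self.xs[-1]):
            return []
        out, node, lo, hi = [], 1, 0, self.n - 1
        while True:                                # walk the root-to-leaf path of qx
            out += [s for s in self.store[node] if self._y_at(s, qx) <= qy]
            if lo == hi:
                return out
            mid = (lo + hi) // 2
            if qx <= self.xs[mid + 1]:
                node, lo, hi = 2 * node, lo, mid
            else:
                node, lo, hi = 2 * node + 1, mid + 1, hi

# Two segments; the query point (2, 6) lies above both.
t = SegmentsBelowPoint([(0.0, 0.0, 10.0, 0.0), (0.0, 5.0, 4.0, 5.0)])
print(t.below(2.0, 6.0))   # both segments reported
```

Replacing each per-node scan with a half-plane (lines-below-point) structure, and threading fractional cascading through the first layers of those structures, is what yields the bounds discussed above.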

2.4 Optimal structure

  • The authors now describe an optimal structure for answering segments-below-point queries.
  • The authors start with the binary segment-tree structure from the previous subsection.
  • The following observation will help us in reducing the size.
  • The total time spent over all pairs in Λ is O(n log n).
  • Finally, as in the structure of Lemma 2.4, the authors also use fractional cascading on these half-plane range-reporting structures.

2.5 Dynamization

  • Finally, the authors briefly discuss how to make their structure dynamic, i.e., supporting insertions and deletions of uncertain points in the uncertain data set.
  • If only insertions are to be supported, the authors can apply the logarithmic method [6] to Theorem 2.5; a generic sketch of this method appears after this list.
  • The best known dynamic structure for halfplane range reporting uses O(n log n) space, supports insertions and deletions in O(polylog n) time amortized, and answers queries in O(log n + k) time [8, 9] .
  • Currently, it is unknown if one can obtain a linear-size dynamic structure with O(polylog n) update times.
  • Since super-linear space is unavoidable, the authors can simply plug this dynamic halfplane structure into the segment-tree based structure with fanout 2 (see the remark following Lemma 2.3) and obtain the following.
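The logarithmic method referenced above is generic enough to sketch independently of the halfplane machinery. Below is a minimal Python version (ours; `build` and `query` are assumed black boxes for any static structure, in the spirit of the Bentley-Saxe construction [6]): it keeps static substructures of sizes 2^0, 2^1, . . ., merges them on insertion like binary addition with carries, and answers a query by combining the answers of all live substructures.

```python
class LogMethod:
    """Insertion-only dynamization via the logarithmic method (Bentley-Saxe):
    static substructures of sizes 2^0, 2^1, ... rebuilt on carries."""

    def __init__(self, build, query):
        self.build = build          # build(items) -> static structure
        self.query_one = query      # query(structure, q) -> list of answers
        self.levels = []            # levels[i]: (structure, items) of size 2^i, or None

    def insert(self, item):
        carry = [item]
        for i, level in enumerate(self.levels):
            if level is None:       # free slot: deposit the carry here
                self.levels[i] = (self.build(carry), carry)
                return
            carry += level[1]       # occupied: merge and carry to next level
            self.levels[i] = None
        self.levels.append((self.build(carry), carry))

    def query(self, q):
        out = []
        for level in self.levels:   # a query fans out to O(log n) structures
            if level is not None:
                out += self.query_one(level[0], q)
        return out

# Example with a trivial "static structure": a sorted list plus binary search,
# answering 1D range-reporting queries.
import bisect
dyn = LogMethod(
    build=sorted,
    query=lambda arr, iv: arr[bisect.bisect_left(arr, iv[0]):bisect.bisect_right(arr, iv[1])],
)
for x in [5, 1, 9, 3]:
    dyn.insert(x)
print(sorted(dyn.query((2, 6))))    # [3, 5]
```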

3 Handling More General Pdf's

  • In Section 2.1, the authors converted the uncertain range searching problem to the problem of storing a set of x-monotone polygonal chains in a data structure so that all the chains below a query point can be reported efficiently.
  • The authors first give a Monte Carlo algorithm with running time O(r/δ²) that fails with probability O(δ³); then they show how to convert it to a Las Vegas algorithm that never fails and runs in expected time O(r).
  • With other families of pdf's, the threshold functions will have different forms.
  • Note that Lemma 2.1 easily extends to other piecewise functions.
  • Interestingly, the complexity of the lower envelope of these threshold functions only depends on how many times two pieces from two different threshold functions could intersect.

Lemma 3.3 For two Gaussian distributions, their threshold functions intersect at most twice.

  • If ϕ′(x) has one root, ϕ(x) is unimodal or inverse-unimodal; if ϕ′(x) has two roots, by combining with the fact that ϕ(−∞) = ϕ(+∞) = 0, the authors can conclude that ϕ(x) must have exactly one root, and that ϕ(x) is unimodal before the root and inverse-unimodal after it, or vice versa; see Figure 6.
  • The same argument implies that there is at most one intersection point of g_1 and g_2 that lies after ξ, implying that they have at most two intersection points.
  • Invoking Theorem 3.2, their structure for Gaussian distributions has size O(λ₂(n) log n) = O(n log n); a small sketch of the Gaussian threshold function appears after this list.
  • By construction, each point is reported only once.
  • This structure supports insertions and deletions of uncertain points in O(polylog n) time amortized.
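For intuition about the Gaussian case: because the Gaussian cdf Φ is continuous and strictly increasing, the threshold function has the closed form g(a) = Φ⁻¹(Φ(a) + τ) when Φ(a) + τ < 1, and g(a) = ∞ otherwise. A hedged Python sketch using only the standard library (illustrative, not the paper's code):

```python
from statistics import NormalDist

def gaussian_threshold(mu, sigma, a, tau):
    """g(a) = min { b : F(b) - F(a) >= tau } for a N(mu, sigma^2) point.

    F is the Gaussian cdf, strictly increasing and continuous, so g is a
    single inverse-cdf evaluation; +inf when less than tau mass lies to the
    right of a."""
    d = NormalDist(mu, sigma)
    target = d.cdf(a) + tau
    return d.inv_cdf(target) if target < 1.0 else float("inf")

# For a standard Gaussian, about 0.3413 of the mass lies in [0, 1],
# so g(0) with tau = 0.3413 is roughly 1.
print(round(gaussian_threshold(0.0, 1.0, 0.0, 0.3413), 2))  # ~1.0
```

Lemma 3.3 then says that the graphs of two such functions cross at most twice, which is what bounds the lower-envelope complexity by λ₂(n).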

5 Conclusion

  • In this paper the authors have studied the problem of range searching on uncertain data.
  • The authors data structures have linear or near-linear sizes and support range queries in logarithmic (or polylogarithmic) time.
  • For the other more complicated ones, some of the ideas (such as the geometric reductions) could be borrowed to devise more practical data structures.
  • A few heuristics based on R-trees have been proposed in [21] , but no provably good solutions are known.
  • Unlike in range searching, the authors need to consider the interplay between the uncertain points when answering a nearest-neighbor query, which seems to make the problem considerably more difficult.


Range Searching on Uncertain Data
Pankaj K. Agarwal
Duke University
Durham, NC, USA
pankaj@cs.duke.edu
Siu-Wing Cheng
HKUST
Hong Kong, China
scheng@cse.ust.hk
Yufei Tao
CUHK
Hong Kong, China
taoyf@cse.cuhk.edu.hk
Ke Yi
HKUST
Hong Kong, China
yike@cse.ust.hk
Abstract
Querying uncertain data has emerged as an important problem in data management due to
the imprecise nature of many measurement data. In this paper we study answering range queries
over uncertain data. Specifically, we are given a collection P of n uncertain points in R, each
represented by its one-dimensional probability density function (pdf). The goal is to build a data
structure on P such that given a query interval I and a probability threshold τ , we can quickly
report all points of P that lie in I with probability at least τ . We present various structures with
linear or near-linear space and (poly)logarithmic query time. Our structures support pdfs that
are either histograms or more complex ones such as Gaussian or piecewise algebraic.
1 Introduction
Range searching, namely preprocessing a set of points into a data structure so that all points within a given query range can be reported efficiently, is one of the most widely studied topics in computational geometry and database systems [2], with a wide range of applications. Most of the works to date deal with certain data, that is, the points are given their precise locations in ℝ^d. Recent years, however, have witnessed a dramatically increasing amount of attention devoted to managing uncertain data because many real-world measurements are inherently accompanied with uncertainty. Besides the recent efforts in the data management community (see the survey [15]), various issues related with data uncertainty have also been studied in artificial intelligence [20], machine learning [5], statistics [18], and many other areas.
A popular approach to model data uncertainty [13, 25] is to consider each uncertain point p as a probability distribution over space. It is usually assumed that the points are independent, but it is not necessary. The generally agreed semantics for querying uncertain data is the thresholding approach [13, 16], i.e., for a particular threshold τ, retrieve all the tuples that appear in the query range with probability at least τ. This problem turns out to be nontrivial even in one dimension. The naïve approach of examining each point one by one and computing its probability of being inside the query is obviously very expensive. Note that the independence assumption among the uncertain points is irrelevant as far as range queries are concerned.

A preliminary version of this paper appeared as "Indexing uncertain data" in ACM Symposium on Principles of Database Systems (PODS), 2009. P. K. Agarwal is supported by NSF under grants CNS-05-40347, CCF-06-35000, IIS-07-13498, and CCF-09-40671, by ARO grants W911NF-07-1-0376 and W911NF-08-1-0452, by an NIH grant 1P50-GM-08183-01, by a DOE grant OEG-P200A070505, and by a grant from the U.S.–Israel Binational Science Foundation. S.-W. Cheng is supported by HKRGC under grant GRF 612107; Y. Tao is supported by HKRGC under grants GRF 1202/06, GRF 4161/07, and GRF 4173/08; and K. Yi is supported by Hong Kong Direct Allocation Grant (DAG07/08).
Problem definition. We now define our problem more formally. Let P = {p_1, . . . , p_n} be a set of n uncertain points in ℝ, where each p_i is specified by its probability density function (pdf) f_i : ℝ → ℝ⁺ ∪ {0}. We assume that each f_i is a piecewise-uniform function, i.e., a histogram, consisting of at most s pieces for some integer s ≥ 1. In practice, such a histogram can be used to approximate any pdf with arbitrary precision. In some applications each point p_i has a discrete pdf, namely, it could appear at one of a few locations, each with a certain probability. This case can also be represented by the histogram model using infinitesimal pieces around these locations, so the histogram model also incorporates the discrete pdf case. We will adopt the histogram model by default throughout the paper. For simplicity, we assume s to be a constant for most of the discussion. Some of our structures also support more complicated pdfs (such as Gaussian or piecewise algebraic), and we will explicitly say so for these structures.

Given the set P and the associated pdfs, the goal is to build a data structure on them so that for a query interval I and a threshold τ, all points p such that Pr[p ∈ I] ≥ τ are reported efficiently. We also consider the version where τ is fixed in advance. We refer to the former as the variable threshold version and the latter as the fixed threshold version of the problem. The latter version is useful since in many applications the threshold is always fixed at, say, 0.5. Moreover, the user can often tolerate some error ε in the probability. In this case we can build 1/ε fixed-threshold structures with τ = ε, 2ε, . . . , 1, so that a query with any threshold can be answered with error at most ε.
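To fix ideas, the following Python sketch (ours; all names are illustrative) implements the histogram model and the naive query that examines each point one by one. Each query costs O(ns) time, which is the expensive baseline our structures avoid.

```python
class HistogramPdf:
    """Piecewise-uniform pdf: density ys[i] on [xs[i], xs[i+1])."""

    def __init__(self, xs, ys):
        assert len(xs) == len(ys) + 1
        self.xs, self.ys = xs, ys

    def cdf(self, x):
        """F(x): total probability mass to the left of x."""
        total = 0.0
        for i, y in enumerate(self.ys):
            lo, hi = self.xs[i], self.xs[i + 1]
            if x <= lo:
                break
            total += y * (min(x, hi) - lo)
        return total

    def prob_in(self, xl, xr):
        """Pr[p in [xl, xr]] = F(xr) - F(xl)."""
        return self.cdf(xr) - self.cdf(xl)


def naive_range_query(points, xl, xr, tau):
    """Examine every uncertain point: O(n * s) per query."""
    return [pid for pid, pdf in points.items() if pdf.prob_in(xl, xr) >= tau]


# A point distributed uniformly on [0, 2) lies in [1, 3] with probability 0.5.
p1 = HistogramPdf([0.0, 2.0], [0.5])
print(naive_range_query({"p1": p1}, 1.0, 3.0, 0.5))  # ['p1']
```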
Applications. The problem of range searching over uncertain data was first introduced by Cheng
et al. [13] and has numerous applications in practice. For example, a certain measurement, say
temperature, may be taken by multiple sensors in a sensor network. Due to various imprecision
factors, the readings of these sensors may not be identical, in which case the temperature of a
location can be conveniently modeled as a pdf. In this context, a query in our problem would
retrieve “all the locations whose temperatures are between 100 and 120 degrees with probability
at least 50%”. It is not hard to see that there are many similar scenarios involving uncertain data.
In fact, our problem is also important even in several traditional applications where no uncertainty
seems to exist. For instance, consider a movie rating system (such as the one at Amazon) where
each reviewer can give a rating from 1 to 10. A query of our problem would find “all the movies
such that at least 90% of the ratings it receives are at least 8”.
Previous results. The problem of range searching on uncertain data has received much attention in the database community over the last few years. The earliest work [13] considered the above problem in a simpler form, namely, where each f_i(x) is a uniform distribution, a special case of our definition in which the histogram consists of only one piece. For the fixed-threshold version with threshold 0 < τ ≤ 1, they proposed a structure of O(τ⁻¹n) size with O(τ⁻¹ log n + k) query time, where k is the output size. These bounds depend on τ⁻¹, which can be arbitrarily large. This structure does not extend to histograms consisting of two or more pieces. They presented heuristics for the variable threshold version without any performance guarantees. Tao et al. [25] considered the problem in two and higher dimensions, and presented some data structures based on space-partitioning heuristics. They prune points whose probability of being inside the query range is either too low or too high, but the query procedure visits all points of P in the worst case. Finally, yet another heuristic is presented in [22], but it is still the same as a sequential scan in the worst case.
Cheng et al. [13] also showed that the fixed-threshold version of the problem is at least as difficult as 2D halfplane range reporting (i.e., report all points lying in a query halfplane), and that it can be reduced to 2D simplex queries (report all points lying in a query triangle). However, the complexities of these two problems differ significantly: with linear space, a halfplane range-reporting query can be answered in Θ(log n + k) time [11], while the latter takes Ω(√n) time [12]. So there is a significant gap between the current upper and lower bounds for range searching over uncertain data.
Also related is the work by Singh et al. [24], who considered the problem of querying uncertain
data that are categorical, namely, each random object takes a value from a discrete, unordered
domain. The structures presented there are again heuristic solutions.
Our results. In this paper, we make a significant theoretical step towards understanding the complexity of range searching on uncertain data. We present linear or near-linear size data structures for both the fixed and variable threshold versions of the problem, with logarithmic or polylogarithmic query times. Specifically, we obtain the following results.

For the fixed-threshold version, we present a linear-size structure that answers a query in O(log n + k) time (Section 2). These bounds are clearly optimal (in the comparison model of computation). We first show that this problem can be reduced to a so-called segments-below-point problem: storing a set of segments in ℝ² so that all segments lying below a query point can be reported quickly. Then we present an optimal structure for the segments-below-point problem, a linear-size structure with O(log n + k) query time. This result shows that the fixed-threshold version has exactly the same complexity as the halfplane range-reporting problem, closing the large gap left in [13]. In Section 3 we present a simpler structure of size O(λ(n) log n), where λ(n) denotes a near-linear Davenport–Schinzel bound depending on the family of pdfs, and query time O(log n + k). This structure extends to more general pdfs, such as Gaussian distributions or other piecewise algebraic pdfs.

For the variable-threshold version, we use a different reduction and show that it can be solved by carefully storing a number of points in ℝ³ in a structure for answering halfspace range queries. Combining with the recent result of Afshani and Chan [1] for 3D halfspace range reporting, we obtain a structure for the variable-threshold version of our problem with O(n log² n) space and O(log³ n + k) query time (Section 4). Although the bounds have extra log factors in this case, our result shows that this problem is still significantly easier than 2D simplex queries.

Finally, we show that our structures can be dynamized, supporting insertions and deletions of (uncertain) points with a slight increase in the query time.
2 Fixed-Threshold Range Queries
We present an optimal structure for answering range queries on uncertain data where the probability threshold τ is fixed. Our structure uses linear space and answers a query in the optimal O(log n + k) time. These bounds do not depend on the particular value of τ. We first describe in Section 2.1 the reduction to the segments-below-point problem. Next we describe a segment-tree based data structure that uses linear space and answers a query in O(√n + k) time, or uses O(n log n) space and answers a query in O(log n + k) time (Section 2.3). We then improve this structure to achieve linear size and O(log n + k) query time simultaneously (Section 2.4). We conclude this section by describing how we make the structure dynamic.

Figure 1: Reduction to the segments-below-point problem: (i) pdf, (ii) cdf, and (iii) threshold function.
2.1 A geometric reduction
Let p be an uncertain point in ℝ, and let f : ℝ → ℝ be its pdf.¹ Suppose the histogram of f consists of s pieces, and let

f(x) = y_i, for x_{i−1} ≤ x < x_i, i = 1, . . . , s.

We set x_0 = −∞, x_s = ∞, and y_1 = y_s = 0; see Figure 1(i). The cumulative distribution function (cdf) F(x) = ∫_{−∞}^{x} f(t) dt is a monotone piecewise-linear function consisting of s pieces; see Figure 1(ii). Let the query range be [x_l, x_r]. The probability of p falling inside [x_l, x_r] is F(x_r) − F(x_l). We define a function g : ℝ → ℝ, which we refer to as the threshold function. For a given a ∈ ℝ, let g(a) be the minimum value b such that F(b) − F(a) ≥ τ. If no such b exists, g(a) is set to ∞; see Figure 1(iii).

Lemma 2.1 The function g(x) is non-decreasing and piecewise linear, consisting of at most 2s pieces.

Proof: Suppose we continuously vary x from −∞ to ∞. For x = −∞, g(x) = min{y | F(y) = τ}; g(x) stays the same until x reaches x_1. As we increase x further, g(x) increases linearly, with the slope depending on the pieces of the histogram f that contain x and g(x). When either x or g(x) passes through one of the x_i's, the slope changes. There are at most 2(s − 1) such changes; see Figure 1.

Given the description of the pdf f, the function g can be constructed easily. Once we have the threshold function g, the condition Pr[p ∈ [x_l, x_r]] ≥ τ simply becomes checking whether x_r ≥ g(x_l). Geometrically, this is equivalent to testing whether the point (x_l, x_r) ∈ ℝ² lies above the polygonal line representing the graph of g (see Figure 1). We construct the threshold function g_p for each point p in P. Let S be the set of at most 2ns segments in ℝ² that form the pieces of these n functions; S can be constructed in O(n) time. We label each segment of g_p with p.

¹ Throughout this paper we do not distinguish between a function and its graph.

The problem of reporting the points of P that lie in the interval [x_l, x_r] with probability at least τ becomes reporting the segments of S that lie below the point (x_l, x_r) ∈ ℝ²: if the procedure returns a segment labeled with p, we return the point p. Each polygonal line being x-monotone, no point is reported more than once.

We thus have the following problem at hand: let S be a set of n segments in ℝ². Build a data structure on S so that for a query point q ∈ ℝ², the set of segments in S lying directly below q, denoted by S[q], can be reported efficiently. For simplicity, we assume the coordinates of the endpoints of S to be distinct; this assumption can be removed using standard techniques. We call this problem the segments-below-point problem.
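The reduction is easy to mirror in code. The sketch below (ours; illustrative names, not the paper's implementation) evaluates the threshold function g of a histogram pdf by inverting its piecewise-linear cdf, and answers a single-point query via the test x_r ≥ g(x_l). An actual implementation would instead precompute the at most 2s breakpoints of g once, as in Lemma 2.1, and store the resulting segments.

```python
def hist_cdf(xs, ys, x):
    """cdf of a piecewise-uniform pdf with density ys[i] on [xs[i], xs[i+1])."""
    total = 0.0
    for i, y in enumerate(ys):
        if x <= xs[i]:
            break
        total += y * (min(x, xs[i + 1]) - xs[i])
    return total

def threshold_g(xs, ys, a, tau):
    """g(a) = min { b : F(b) - F(a) >= tau }, or +inf if no such b exists."""
    target = hist_cdf(xs, ys, a) + tau
    if target > hist_cdf(xs, ys, xs[-1]):
        return float("inf")
    acc = 0.0
    for i, y in enumerate(ys):
        piece = y * (xs[i + 1] - xs[i])
        if y > 0 and acc + piece >= target:
            return xs[i] + (target - acc) / y   # invert the linear piece of F
        acc += piece
    return float("inf")

def reported(xs, ys, xl, xr, tau):
    """Pr[p in [xl, xr]] >= tau  <=>  (xl, xr) lies on or above the graph of g."""
    return xr >= threshold_g(xs, ys, xl, tau)

# Uniform on [0, 2): g(1.0) = 2.0 for tau = 0.5, so the query point (1.0, 3.0)
# lies above g and p is reported.
print(threshold_g([0.0, 2.0], [0.5], 1.0, 0.5))     # 2.0
print(reported([0.0, 2.0], [0.5], 1.0, 3.0, 0.5))   # True
```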
2.2 Half-plane range reporting
We begin by describing a structure for the special case when all segments in S are full lines and we want to report the lines of S lying below a query point. This problem is dual to the well-known half-plane range reporting problem, for which there is an O(n)-size structure with O(log n + k) query time [11]. We briefly describe a variant of this structure (in the dual setting), denoted by H(S), which we will use as a building block.

If we view each line ℓ in S as a linear function ℓ : ℝ → ℝ, then the lower envelope of S is the graph of the function E_S(x) = min_{ℓ ∈ S} ℓ(x), i.e., it is the boundary of the unbounded region in the planar map induced by S that lies below all the lines of S (see Figure 2). We represent the lower envelope as a sequence x_0 = −∞, ℓ_1, x_1, ℓ_2, . . . , ℓ_r, x_r = +∞, where the x_i's are the x-coordinates of the vertices of the lower envelope, and ℓ_i is the line that appears on the lower envelope in the interval [x_{i−1}, x_i]. Note that the lines appear along the envelope in decreasing order of their slopes.

Figure 2: The data structure for a set of lines: the thick polygonal chain is the lower envelope of S; L_1(S) = {1, 2, 6}, L_2(S) = {3, 7}, L_3(S) = {4, 5}.

We partition S into a sequence L_1(S), L_2(S), . . . of subsets, called layers. L_1(S) ⊆ S consists of the lines that appear on the lower envelope of S. For i > 1, L_i(S) is the set of lines that appear on the lower envelope of S \ ⋃_{j=1}^{i−1} L_j(S); see Figure 2. For each i, we store the aforementioned representation of layer L_i(S) in a list. To answer a query q = (q_x, q_y), we start from L_1(S) and locate the interval [x_{i−1}, x_i] that contains q_x, using binary search. Next we walk along the envelope of L_1(S) in both directions, starting from ℓ_i, to report the lines lying below q, in time linear in the output size. Then we query the rest of the layers L_2(S), L_3(S), . . . in order, until no lines are reported at a certain layer. By using fractional cascading [10] on the x-coordinates of the envelopes of these layers, the total query time can be improved to O(k) plus the initial binary search in L_1(S). Fractional cascading augments these lists with copies of elements from other lists, but the size of the structure remains linear, and it can be constructed in O(n log n) time [10, 11]. The following statement is slightly more general than what appeared in [11].
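As a concrete, simplified illustration of H(S), the following Python sketch (ours, not the paper's code) peels off the layers L_1(S), L_2(S), . . . and answers a query by a binary search in each visited layer followed by the bidirectional walk. It omits fractional cascading, so every visited layer pays its own O(log n) binary search rather than O(1).

```python
from bisect import bisect_right

def lower_envelope(lines):
    """lines: (slope, intercept) pairs.  Returns (env, breaks): the envelope
    lines in decreasing slope order and the x-coordinates of its vertices."""
    lines = sorted(set(lines), key=lambda l: (-l[0], l[1]))
    env = []
    for m, b in lines:
        if env and env[-1][0] == m:      # same slope, larger intercept: skip
            continue
        while len(env) >= 2:
            (m1, b1), (m2, b2) = env[-2], env[-1]
            # (m2, b2) never appears on the envelope if the new line (m, b)
            # crosses (m1, b1) no later than (m2, b2) does.
            if (b - b1) * (m1 - m2) <= (b2 - b1) * (m1 - m):
                env.pop()
            else:
                break
        env.append((m, b))
    breaks = [(b2 - b1) / (m1 - m2) for (m1, b1), (m2, b2) in zip(env, env[1:])]
    return env, breaks

def build_layers(lines):
    """Peel off the layers L_1(S), L_2(S), ... of lower envelopes."""
    remaining, layers = list(set(lines)), []
    while remaining:
        env, breaks = lower_envelope(remaining)
        layers.append((env, breaks))
        on_env = set(env)
        remaining = [l for l in remaining if l not in on_env]
    return layers

def lines_below(layers, qx, qy):
    """Report all lines passing on or below the query point (qx, qy)."""
    out = []
    for env, breaks in layers:
        i = bisect_right(breaks, qx)              # envelope line active at qx
        if env[i][0] * qx + env[i][1] > qy:
            break                                 # whole layer lies above q: stop
        j = i                                     # walk left, then right, from i
        while j >= 0 and env[j][0] * qx + env[j][1] <= qy:
            out.append(env[j]); j -= 1
        j = i + 1
        while j < len(env) and env[j][0] * qx + env[j][1] <= qy:
            out.append(env[j]); j += 1
    return out

# Three lines; (0, 0) is the only one on or below the query point (5, 0.5).
layers = build_layers([(1.0, 0.0), (0.0, 0.0), (-1.0, 10.0)])
print(lines_below(layers, 5.0, 0.5))   # [(0.0, 0.0)]
```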

Citations
Proceedings Article
01 Jan 1998
TL;DR: It is shown how to answer halfspace range reporting queries in O(log n + k) expected time for an output size k, and the first optimal randomized algorithm for the construction of the (≤k)-level in an arrangement of n planes in three dimensions is obtained.
Abstract: Given n points in three dimensions, we show how to answer halfspace range reporting queries in O(log n + k) expected time for an output size k. Our data structure can be preprocessed in optimal O(n log n) expected time. We apply this result to obtain the first optimal randomized algorithm for the construction of the (≤k)-level in an arrangement of n planes in three dimensions. The algorithm runs in O(n log n + nk²) expected time. Our techniques are based on random sampling. Applications in two dimensions include an improved data structure for "k nearest neighbors" queries and an algorithm that constructs the order-k Voronoi diagram in O(n log n + nk log k) expected time.

84 citations

Book ChapterDOI
02 Sep 2013
TL;DR: The most likely hull under the point model can be computed in O(n³) time for n points in d = 2 dimensions, but it is NP-hard for d ≥ 3 dimensions, and it is shown that the problem is NP-hard under the multipoint model even for d = 2 dimensions.
Abstract: Consider a set of points in d dimensions where the existence or the location of each point is determined by a probability distribution. The convex hull of this set is a random variable distributed over exponentially many choices. We are interested in finding the most likely convex hull, namely, the one with the maximum probability of occurrence. We investigate this problem under two natural models of uncertainty: the point (also called the tuple) model, where each point (site) has a fixed position s_i but only exists with some probability π_i, for 0 < π_i ≤ 1, and the multipoint model, where each point has multiple possible locations or it may not appear at all. We show that the most likely hull under the point model can be computed in O(n³) time for n points in d = 2 dimensions, but it is NP-hard for d ≥ 3 dimensions. On the other hand, we show that the problem is NP-hard under the multipoint model even for d = 2 dimensions. We also present hardness results for approximating the probability of the most likely hull. While we focus on the most likely hull for concreteness, our results hold for other natural definitions of a probabilistic hull.

60 citations


Cites background from "Range searching on uncertain data"

  • ...There also has been extensive research in the database community on clustering and ranking of uncertain data [4,5,10] and on range searching and indexing [1,2,3]....


Posted Content
TL;DR: These results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of β-hull that may be a useful representation of uncertain hulls.
Abstract: We study the convex-hull problem in a probabilistic setting, motivated by the need to handle data uncertainty inherent in many applications, including sensor databases, location-based services and computer vision. In our framework, the uncertainty of each input site is described by a probability distribution over a finite number of possible locations including a null location to account for non-existence of the point. Our results include both exact and approximation algorithms for computing the probability of a query point lying inside the convex hull of the input, time-space tradeoffs for the membership queries, a connection between Tukey depth and membership queries, as well as a new notion of β-hull that may be a useful representation of uncertain hulls.

52 citations

Journal ArticleDOI
TL;DR: Based on local information: local density and local uncertainty level, a new outlier detection algorithm is designed in this paper to calculate uncertain local outlier factor (ULOF) for each point in an uncertain dataset.
Abstract: Based on local information: local density and local uncertainty level, a new outlier detection algorithm is designed in this paper to calculate uncertain local outlier factor (ULOF) for each point in an uncertain dataset. In this algorithm, all concepts, definitions and formulations for conventional local outlier detection approach (LOF) are generalized to include uncertainty information. The least squares algorithm on multi-times curve fitting is used to generate an approximate probability density function of distance between two points. An iteration algorithm is proposed to evaluate K–η–distance and a pruning strategy is adopted to reduce the size of candidate set of nearest-neighbors. The comparison between ULOF algorithm and the state-of-the-art approaches has been made. Results of several experiments on synthetic and real data sets demonstrate the effectiveness of the proposed approach.

40 citations

Book ChapterDOI
15 Dec 2014
TL;DR: An alternative approach to the most likely nearest neighbor (LNN) search using Pareto sets is presented, which gives a linear-space data structure and sub-linear query time in 1D for average and smoothed analysis models, as well as worst-case with a bounded number of distinct probabilities.
Abstract: We consider the problem of nearest-neighbor searching among a set of stochastic sites, where a stochastic site is a tuple (s_i, π_i) consisting of a point s_i in a d-dimensional space and a probability π_i determining its existence. The problem is interesting and non-trivial even in 1 dimension, where the Most Likely Voronoi Diagram (LVD) is shown to have worst-case complexity Ω(n²). We then show that under more natural and less adversarial conditions, the size of the 1-dimensional LVD is significantly smaller: (1) Θ(kn) if the input has only k distinct probability values, (2) O(n log n) on average, and (3) O(n√n) under smoothed analysis. We also present an alternative approach to the most likely nearest neighbor (LNN) search using Pareto sets, which gives a linear-space data structure and sub-linear query time in 1D for average and smoothed analysis models, as well as worst-case with a bounded number of distinct probabilities. Using the Pareto-set approach, we can also reduce the multi-dimensional LNN search to a sequence of nearest neighbor and spherical range queries.

26 citations

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations

Book
01 Oct 2004
TL;DR: Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts, and discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.
Abstract: The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning exist already, including systems that analyze past sales data to predict customer behavior, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts. In order to present a unified treatment of machine learning problems and solutions, it discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining. All learning algorithms are explained so that the student can easily move from the equations in the book to a computer program. The text covers such topics as supervised learning, Bayesian decision theory, parametric methods, multivariate methods, multilayer perceptrons, local models, hidden Markov models, assessing and comparing classification algorithms, and reinforcement learning. New to the second edition are chapters on kernel machines, graphical models, and Bayesian estimation; expanded coverage of statistical tests in a chapter on design and analysis of machine learning experiments; case studies available on the Web (with downloadable results for instructors); and many additional exercises. All chapters have been revised and updated. Introduction to Machine Learning can be used by advanced undergraduates and graduate students who have completed courses in computer programming, probability, calculus, and linear algebra. It will also be of interest to engineers in the field who are concerned with the application of machine learning methods. Adaptive Computation and Machine Learning series

3,950 citations


"Range searching on uncertain data" refers background in this paper

  • ...Besides the recent efforts in the data management community (see the survey [15]), various issues related with data uncertainty have also been studied in artificial intelligence [20], machine learning [5], statistics [18], and many other areas....


  • ...the recent efforts in the data management community (see the survey [Dalvi et al. 2009]), various issues related with data uncertainty have also been studied in artificial intelligence [Kanal and Lemmer 1986], machine learning [Alpaydin 2004], statistics [Halpern 2003], and many other areas....


Book
01 Jan 1988
TL;DR: Qualitative Probabilistic Reasoning and Cognitive models, Dempster-Shafer Theory in Knowledge Representation, and Possibility Theory: Semantics and Applications.
Abstract: Qualitative Probabilistic Reasoning and Cognitive Models. Exploiting Functional Dependencies in Qualitative Probabilistic Reasoning (M.P. Wellman). Qualitative Propagation and Scenario-Based Scheme for Explaining Probabilistic Reasoning (M. Henrion, M.J. Druzdel). Propagating Uncertainty in Rule Based Cognitive Modeling (T.R. Shultz). Context-Dependent Similarity (Y. Cheng). Abductive Probabilistic Reasoning and KB Development. Similarity Networks for the Construction of Multiple-Faults Belief Networks (D. Heckerman). Separable and Transitive Graphoids (D. Geiger, D. Heckerman). Integrating Probabilistic, Taxonomic and Causal Knowledge in Abductive Diagnosis (D. Lin, R. Goebel). What is the Most Likely Diagnosis (D. Poole, G.M. Provan). Probabilistic Evaluation of Candidate Sets for Multidisorder Diagnosis (T.D. Wu). Kutato: An Entropy-Driven System for Construction of Probabilistic Expert Systems from Databases (E. Herskovits, G. Cooper). Problem Formulation and Control of Reasoning. Ideal Reformulation of Belief Networks (J.S. Breese, E.J. Horvitz). Computationally-Optimal Real-Resource Strategies for Independent, Uninterruptible Methods (D. Einav, M.R. Fehling). Problem Formulation as the Reduction of a Decision Model (D.E. Heckerman, E.J. Horvitz). Dynamic Construction of Belief Networks (R.P. Goldman, E. Charniak). A New Algorithm for Finding MAP Assignments to Belief Networks (S.E. Shimony, E. Charniak). Belief Network Decomposition. Directed Reduction Algorithms and Decomposable Graphs (R.D. Shachter, S.K. Andersen, K.L. Poh). Optimal Decomposition of Belief Networks (W.X. Wen). Pruning Bayesian Networks for Efficient Computation (M. Baker, T.E. Boult). On Heuristics for Finding Loop Cutsets in Multiply-Connected Belief Networks (J. Stillman). A Combination of Cutset Conditioning with Clique-Tree Propagation in the Pathfinder System (H.J. Suermondt, G.F. Cooper, D.E. Heckerman). Equivalence and Synthesis of Causal Models (T.S. Verma, J. Pearl). Possibility Theory: Semantics and Applications. Possibility as Similarity: The Semantics of Fuzzy Logic (E. Ruspini). Integrating Case-Based and Rule-Based Reasoning: the Possibilistic Connection (S. Dutta, P.P. Bonissone). Credibility Discounting in the Theory of Approximate Reasoning (R.R. Yager). Updating with Belief Functions, Ordinal Conditional Functions and Possibility Measures (D. Dubois, H. Prade). A Hierarchical Approach to Designing Approximate Reasoning-Based Controllers for Dynamic Physical Systems (H.R. Berenji, et al.). Dempster-Shafer: Graph Decomposition, FMT, and Interpretations. A New Approach to Updating Beliefs (R. Fagin, J.Y. Halpern). The Transferable Belief Model and Other Interpretations of Dempster-Shafer's Model (P. Smets). Valuation-Based Systems for Discrete Optimization (P.P. Shenoy). Computational Aspects of the Mobius Transformation (R. Kennes, P. Smets). Using Dempster-Shafer Theory in Knowledge Representation (A. Saffiotti).

1,407 citations


"Range searching on uncertain data" refers background in this paper

  • ...Besides the recent efforts in the data management community (see the survey [15]), various issues related with data uncertainty have also been studied in artificial intelligence [20], machine learning [5], statistics [18], and many other areas....


Proceedings ArticleDOI
Kenneth L. Clarkson
06 Jan 1988
TL;DR: Asymptotically tight bounds for a combinatorial quantity of interest in discrete and computational geometry, related to halfspace partitions of point sets, are given.
Abstract: Random sampling is used for several new geometric algorithms. The algorithms are "Las Vegas," and their expected bounds are with respect to the random behavior of the algorithms. One algorithm reports all the intersecting pairs of a set of line segments in the plane, and requires O(A + n log n) expected time, where A is the size of the answer, the number of intersecting pairs reported. The algorithm requires O(n) space in the worst case. Another algorithm computes the convex hull of a point set in E³ in O(n log A) expected time, where n is the number of points and A is the number of points on the surface of the hull. A simple Las Vegas algorithm triangulates simple polygons in O(n log log n) expected time. Algorithms for half-space range reporting are also given. In addition, this paper gives asymptotically tight bounds for a combinatorial quantity of interest in discrete and computational geometry, related to halfspace partitions of point sets.

1,163 citations


"Range searching on uncertain data" refers background in this paper

  • ...Following the random-sampling framework of Clarkson and Shor [1989], the expected size of C_t is at most O(2^i)....


  • ...and Shor [14], the expected size of C_t is at most O(2^i)....
