Range searching on uncertain data
Summary
1 Introduction
- Range searching, namely preprocessing a set of points into a data structure so that all points within a given query range can be reported efficiently, is one of the most widely studied topics in computational geometry and database systems [2], with many applications.
- The generally agreed semantics for querying uncertain data is the thresholding approach [13, 16]: for a given threshold τ, retrieve all tuples that appear in the query range with probability at least τ.
- Note that the independence assumption among the uncertain points is irrelevant as far as range queries are concerned.
- This case can also be represented by the histogram model using infinitesimal pieces around these locations, so the histogram model also incorporates the discrete pdf case.
- The authors call the setting in which τ is supplied as part of each query the variable threshold version of the problem, and the setting in which τ is fixed in advance the fixed threshold version.
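To make the thresholding semantics concrete, here is a minimal brute-force sketch in Python. It assumes a hypothetical representation in which each uncertain point is a histogram pdf given as a list of `(left, right, density)` pieces with total mass 1; the function names are illustrative and not taken from the paper. It only serves as a baseline that scans every point, which is exactly what the data structures in the paper avoid.

```python
from typing import List, Tuple

# Hypothetical representation (an assumption of this sketch): a histogram pdf
# is a list of (left, right, density) pieces whose total mass is 1.
Histogram = List[Tuple[float, float, float]]


def range_probability(pdf: Histogram, a: float, b: float) -> float:
    """Probability that the uncertain point falls inside the range [a, b]."""
    total = 0.0
    for left, right, density in pdf:
        overlap = min(b, right) - max(a, left)
        if overlap > 0:
            total += overlap * density
    return total


def threshold_query(points: List[Histogram], a: float, b: float, tau: float) -> List[int]:
    """Brute force: indices of points lying in [a, b] with probability >= tau."""
    return [i for i, pdf in enumerate(points)
            if range_probability(pdf, a, b) >= tau]


if __name__ == "__main__":
    uniform = [(0.0, 2.0, 0.5)]                      # uniform on [0, 2]
    skewed = [(0.0, 1.0, 0.25), (1.0, 2.0, 0.75)]    # more mass on [1, 2]
    print(threshold_query([uniform, skewed], 1.0, 2.0, 0.6))
```

In this example the uniform point lies in [1, 2] with probability 0.5 and the skewed point with probability 0.75, so only the second one passes a threshold of 0.6.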
Previous results.
- The problem of range searching on uncertain data has received much attention in the database community over the last few years.
- The earliest work [13] considered the above problem in a simpler form, namely, where each f_i(x) is a uniform distribution, a special case of the authors' definition in which the histogram consists of only one piece.
- The structures presented there are again heuristic solutions.
- The authors make a significant theoretical step towards understanding the complexity of range searching on uncertain data.
- The authors present linear or near-linear size data structures for both the fixed and variable threshold versions of the problem, with logarithmic or polylogarithmic query times.
2 Fixed-Threshold Range Queries
- The authors present an optimal structure for answering range queries on uncertain data where the probability threshold τ is fixed.
- The authors' structure uses linear space and answers a query in the optimal O(log n + k) time.
- The authors first describe in Section 2.1 the reduction to the segments-below-point problem.
- The authors then improve this structure to achieve linear size and O(log n + k) query time simultaneously (Section 2.4).
- The authors conclude this section by describing how they make the structure dynamic.
2.1 A geometric reduction
- As the authors increase x further, g(x) increases linearly, with the slope depending on the pieces of the histogram f that contain x and g(x).
- The authors call this problem the segments-below-point problem.
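The reduction can be sketched in Python as follows. The summary does not spell out the definition of g, so this sketch assumes that, for a fixed threshold τ, g(x) is the smallest y such that the point lies in [x, y] with probability at least τ (and infinity if no such y exists). Under that assumption a query range [a, b] reports the point exactly when g(a) ≤ b, that is, when the query point (a, b) lies on or above the graph of g. The histogram representation and function names are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical representation: (left, right, density) pieces with total mass 1.
Histogram = List[Tuple[float, float, float]]


def cdf(pdf: Histogram, x: float) -> float:
    """Cumulative probability of the histogram pdf up to x."""
    return sum(d * max(0.0, min(x, r) - l) for l, r, d in pdf)


def inverse_cdf(pdf: Histogram, mass: float) -> float:
    """Smallest y whose cumulative probability reaches `mass` (inf if none)."""
    acc = 0.0
    for l, r, d in sorted(pdf):
        piece = d * (r - l)
        # Small tolerance so that mass == 1 is still reachable despite rounding.
        if d > 0 and acc + piece >= mass - 1e-12:
            return l + (mass - acc) / d
        acc += piece
    return math.inf


def threshold_function(pdf: Histogram, x: float, tau: float) -> float:
    """Assumed definition of g(x): smallest y with Pr[x <= p <= y] >= tau."""
    return inverse_cdf(pdf, cdf(pdf, x) + tau)


def is_reported(pdf: Histogram, a: float, b: float, tau: float) -> bool:
    """The query point (a, b) lies on or above the graph of g."""
    return threshold_function(pdf, a, tau) <= b


if __name__ == "__main__":
    uniform = [(0.0, 2.0, 0.5)]
    for x in (-1.0, 0.0, 0.5, 1.0, 1.5):
        print(x, threshold_function(uniform, x, 0.5))
```

With this definition g is constant while x is to the left of the support, and once x enters the support it grows piecewise linearly with slope f(x)/f(g(x)), which matches the behaviour described above.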
2.2 Half-plane range reporting
- This problem is dual to the well-known half-plane range reporting problem, for which there is an O(n)-size structure with O(log n + k) query time [11].
- Note that the lines appear along the envelope in decreasing order of their slopes.
- By using fractional cascading [10] on the x-coordinates of the envelopes of these layers, the total query time can be improved to O(k) plus the initial binary search in L_1(S).
- Fractional cascading augments these lists with copies of elements from other lists, but the size of the structure remains linear, and it can be constructed in O(n log n) time [10, 11].
- The following statement is slightly more general than what appeared in [11] .
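The layered approach can be sketched in Python for the dual form of the problem, reporting all lines that pass on or below a query point. This is a simplified illustration, not the structure of [11]: layers are peeled naively in quadratic time, each layer is searched with its own binary search (no fractional cascading), and the input lines are assumed distinct. Lines are hypothetical `(slope, intercept)` pairs.

```python
from bisect import bisect_right


def _at(line, x):
    return line[0] * x + line[1]


def _redundant(l1, l2, l3):
    """With slopes m1 > m2 > m3: l2 is never strictly below both l1 and l3."""
    (m1, c1), (m2, c2), (m3, c3) = l1, l2, l3
    return (c3 - c1) * (m1 - m2) <= (c2 - c1) * (m1 - m3)


def lower_envelope(lines):
    """Lines on the lower envelope, left to right (decreasing slope)."""
    hull = []
    for line in sorted(lines, key=lambda l: (-l[0], l[1])):
        if hull and hull[-1][0] == line[0]:
            continue  # parallel to, and not lower than, the line already kept
        while len(hull) >= 2 and _redundant(hull[-2], hull[-1], line):
            hull.pop()
        hull.append(line)
    return hull


def build_layers(lines):
    """Peel envelopes repeatedly; each layer keeps its breakpoint x-coordinates."""
    remaining, layers = list(set(lines)), []
    while remaining:
        env = lower_envelope(remaining)
        xs = [(b[1] - a[1]) / (a[0] - b[0]) for a, b in zip(env, env[1:])]
        layers.append((env, xs))
        on_env = set(env)
        remaining = [l for l in remaining if l not in on_env]
    return layers


def lines_below(layers, qx, qy):
    """All lines passing on or below the point (qx, qy)."""
    out = []
    for env, xs in layers:
        i = bisect_right(xs, qx)          # envelope line active at qx
        if _at(env[i], qx) > qy:
            break                         # deeper layers are higher at qx
        j = i
        while j >= 0 and _at(env[j], qx) <= qy:   # walk left along the envelope
            out.append(env[j])
            j -= 1
        j = i + 1
        while j < len(env) and _at(env[j], qx) <= qy:  # walk right
            out.append(env[j])
            j += 1
    return out


if __name__ == "__main__":
    S = [(1, 0), (-1, 0), (0, -1), (0, 1), (0, 5)]
    print(sorted(lines_below(build_layers(S), 0, 2)))
```

The early exit is safe because the envelope of a layer lies on or below every line in the deeper layers, and within one layer the lines below the query point are contiguous along the envelope, so each layer costs one search plus time proportional to its output.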
2.3 Segment-tree based structure
- The authors later (cf. Section 2.4) bootstrap this structure to improve the query time to O(log n + k) while keeping the size linear.
- Next, the authors recursively partition the left and right pieces of s following the r-ary tree.
- Note that each piece spans a multi-slab at some node.
- Since the first layer of each halfplane structure is a linear list, this is exactly the standard situation where fractional cascading [10] can be applied.
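The way a segment tree decomposes segments across slabs can be sketched in Python as follows. This is a deliberately simplified illustration and not the authors' structure: it uses a binary tree (fanout 2) instead of an r-ary tree with multi-slabs, and each node scans its stored segments linearly where the paper attaches a half-plane range-reporting structure. Segments are hypothetical `(x1, y1, x2, y2)` tuples with x1 < x2, and a query whose x-coordinate coincides with a segment endpoint is resolved to the slab on its right.

```python
from bisect import bisect_left, bisect_right


class SegmentTree:
    """Binary segment tree; each node stores the segments spanning its slab."""

    def __init__(self, segments):
        self.xs = sorted({x for x1, _, x2, _ in segments for x in (x1, x2)})
        self.stored = {}  # node (lo, hi) covers x in [xs[lo], xs[hi]]
        for seg in segments:
            a = bisect_left(self.xs, seg[0])
            b = bisect_left(self.xs, seg[2])
            self._insert(0, len(self.xs) - 1, a, b, seg)

    def _insert(self, lo, hi, a, b, seg):
        if a >= hi or b <= lo:
            return                      # segment misses this node's slab
        if a <= lo and hi <= b:
            self.stored.setdefault((lo, hi), []).append(seg)
            return                      # segment spans the whole slab
        mid = (lo + hi) // 2
        self._insert(lo, mid, a, b, seg)
        self._insert(mid, hi, a, b, seg)

    def below(self, qx, qy):
        """Segments that pass on or below the point (qx, qy)."""
        if qx < self.xs[0] or qx > self.xs[-1]:
            return []
        leaf = min(bisect_right(self.xs, qx) - 1, len(self.xs) - 2)
        lo, hi, out = 0, len(self.xs) - 1, []
        while True:
            # Inside the slab every stored segment behaves like a full line.
            for x1, y1, x2, y2 in self.stored.get((lo, hi), []):
                if y1 + (y2 - y1) * (qx - x1) / (x2 - x1) <= qy:
                    out.append((x1, y1, x2, y2))
            if hi - lo == 1:
                return out
            mid = (lo + hi) // 2
            if leaf < mid:
                hi = mid
            else:
                lo = mid


if __name__ == "__main__":
    tree = SegmentTree([(0, 0, 4, 0), (1, 1, 3, 3), (2, 5, 6, 5)])
    print(sorted(tree.below(2.5, 3)))
```

Because the nodes at which a segment is stored cover disjoint x-ranges, the root-to-leaf walk meets each segment at most once, so no segment is reported twice.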
2.4 Optimal structure
- The authors now describe an optimal structure for answering segments-below-point queries.
- The authors start with the binary segment-tree structure from the previous subsection.
- An observation about the structure then helps the authors reduce its size.
- The total time spent over all pairs in Λ is O(n log n).
- Finally, as in the structure of Lemma 2.4, the authors also use fractional cascading on these half-plane range-reporting structures.
2.5 Dynamization
- Finally the authors briefly discuss how to make their structure dynamic, i.e., supporting insertions and deletions of uncertain points in the uncertain data set.
- If only insertions are to be supported, the authors can apply the logarithmic method [6] to Theorem 2.5.
- The best known dynamic structure for halfplane range reporting uses O(n log n) space, supports insertions and deletions in O(polylog n) amortized time, and answers queries in O(log n + k) time [8, 9].
- Currently, it is unknown if one can obtain a linear-size dynamic structure with O(polylog n) update times.
- Since super-linear space is unavoidable, the authors can simply plug this dynamic halfplane structure into the segment-tree based structure with fanout 2 (see the remark following Lemma 2.3) and obtain the following.
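For the insertion-only case, the logarithmic method can be sketched generically in Python. The sketch assumes a decomposable reporting query (the answer over a union of sets is the union of the answers) and treats the static structure as a black box supplied through a hypothetical `build` callback; the sorted-list example stands in for the much more involved structure of Theorem 2.5.

```python
from bisect import bisect_right


class LogarithmicMethod:
    """Insertion-only wrapper around a static structure (a sketch)."""

    def __init__(self, build):
        self.build = build   # builds a static structure from a list of items
        self.levels = []     # levels[i] is None or (items, structure), 2**i items

    def insert(self, item):
        carry = [item]
        i = 0
        # Like binary addition: merge full levels upward until a free slot.
        while i < len(self.levels) and self.levels[i] is not None:
            carry.extend(self.levels[i][0])
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = (carry, self.build(carry))

    def query(self, ask):
        """Run `ask` on every static structure and concatenate the answers."""
        out = []
        for level in self.levels:
            if level is not None:
                out.extend(ask(level[1]))
        return out


if __name__ == "__main__":
    # Stand-in static structure: a sorted list answering "report all items <= q".
    dyn = LogarithmicMethod(build=sorted)
    for value in [3, 1, 4, 2, 5]:
        dyn.insert(value)
    print(sorted(dyn.query(lambda s: s[:bisect_right(s, 3)])))
```

Each item takes part in at most logarithmically many rebuilds, and a query visits at most logarithmically many static structures, which is where the extra logarithmic factors of this technique come from.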
3 Handling More General Pdf's
- In Section 2.1, the authors converted the uncertain range searching problem to the problem of storing a set of x-monotone polygonal chains in a data structure so that all the chains below a query point can be reported efficiently.
- The authors first give a Monte Carlo algorithm with running time O(r/δ^2) that fails with probability O(δ^3); then they show how to convert it to a Las Vegas algorithm that never fails and runs in expected time O(r).
- With other families of pdf's, the threshold functions will have different forms.
- Note that Lemma 2.1 easily extends to other piecewise functions.
- Interestingly, the complexity of the lower envelope of these threshold functions only depends on how many times two pieces from two different threshold functions could intersect.
Lemma 3.3 For two Gaussian distributions, their threshold functions intersect at most twice.
- If ϕ'(x) has one root, ϕ(x) is unimodal or inverse-unimodal; if ϕ'(x) has two roots, then combining this with the fact that ϕ(−∞) = ϕ(+∞) = 0, the authors conclude that ϕ(x) must have exactly one root, and that ϕ(x) is unimodal before the root and inverse-unimodal after it, or vice versa; see Figure 6.
- The same argument implies that there is at most one intersection point of g_1 and g_2 that lies after ξ, implying that they have at most two intersection points.
- Invoking Theorem 3.2, the structure for Gaussian distributions has size O(λ_2(n) log n) = O(n log n).
- By construction, each point is reported only once.
- This structure supports insertions and deletions of uncertain points in O(polylog n) time amortized.
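The claim of Lemma 3.3 can be explored numerically with a short Python sketch. It assumes the threshold function of a distribution is g(x) = the smallest y with Pr[x ≤ X ≤ y] ≥ τ, which for a Gaussian with cdf F gives g(x) = F⁻¹(F(x) + τ) wherever F(x) + τ < 1; this definition is an assumption of the sketch, since the summary does not restate it. The code counts sign changes of g_1 − g_2 on a grid, which can only undercount true intersections, so it is a sanity check rather than a proof.

```python
from statistics import NormalDist


def gaussian_threshold(dist, x, tau):
    """Assumed g(x): smallest y with Pr[x <= X <= y] >= tau, or None if none."""
    target = dist.cdf(x) + tau
    if target >= 1.0:
        return None
    return dist.inv_cdf(target)


def count_crossings(d1, d2, tau, lo, hi, steps=2000):
    """Count sign changes of g1 - g2 on a uniform grid over [lo, hi]."""
    crossings, prev_sign = 0, 0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        y1 = gaussian_threshold(d1, x, tau)
        y2 = gaussian_threshold(d2, x, tau)
        if y1 is None or y2 is None:
            break                      # one threshold function is undefined here
        diff = y1 - y2
        sign = (diff > 0) - (diff < 0)
        if sign and prev_sign and sign != prev_sign:
            crossings += 1
        if sign:
            prev_sign = sign
    return crossings


if __name__ == "__main__":
    a = NormalDist(mu=0.0, sigma=1.0)
    b = NormalDist(mu=0.5, sigma=2.0)
    print(count_crossings(a, b, tau=0.3, lo=-10.0, hi=10.0))
```

The example parameters (means, standard deviations, τ and the grid range) are arbitrary choices for illustration; other pairs can be substituted to probe the bound of two intersections.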
5 Conclusion
- In this paper the authors have studied the problem of range searching on uncertain data.
- The authors' data structures have linear or near-linear sizes and support range queries in logarithmic (or polylogarithmic) time.
- For the other more complicated ones, some of the ideas (such as the geometric reductions) could be borrowed to devise more practical data structures.
- A few heuristics based on R-trees have been proposed in [21], but no provably good solutions are known.
- Unlike range searching, the authors need to consider the interplay between the uncertain points when answering a nearest neighbor query, which seems to make the problem considerably more difficult.