Group Nearest Neighbor Queries
Dimitris Papadias, Qiongmao Shen, Yufei Tao§, Kyriakos Mouratidis
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{dimitris, qmshen, kyriakos}@cs.ust.hk
§ Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong
taoyf@cs.cityu.edu.hk
Abstract
Given two sets of points P and Q, a group nearest neighbor (GNN) query retrieves the point(s) of P with the smallest sum of distances to all points in Q. Consider, for instance, three users at locations q1, q2 and q3 that want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances |pqi| for 1 ≤ i ≤ 3. Assuming that Q fits in memory and P is indexed by an R-tree, we propose several algorithms for finding the group nearest neighbors efficiently. As a second step, we extend our techniques to situations where Q cannot fit in memory, covering both indexed and non-indexed query points. An experimental evaluation identifies the best alternative based on the data and query properties.
1. Introduction
Nearest neighbor (NN) search is one of the oldest problems
in computer science. Several algorithms and theoretical
performance bounds have been devised for exact and
approximate processing in main memory [S91, AMN+98].
Furthermore, the application of NN search to content-based
and similarity retrieval has led to the development of
numerous cost models [PM97, WSB98, BGRS99, B00] and
indexing techniques [SYUK00, YOTJ01] for high-
dimensional versions of the problem. In spatial databases
most of the work has focused on the point NN query that
retrieves the k (≥1) objects from a dataset P that are closest
(usually according to Euclidean distance) to a query point
q. The existing algorithms (reviewed in Section 2) assume
that P is indexed by a spatial access method and utilize
some pruning bounds to restrict the search space. Shahabi
et al. [SKS02] and Papadias et al. [PZMT03] deal with
nearest neighbor queries in spatial network databases,
where the distance between two points is defined as the
length of the shortest path connecting them in the network.
In addition to conventional (i.e., point) NN queries, recently
there has been an increasing interest in alternative forms of
spatial and spatio-temporal NN search. Ferhatosmanoglu et
al. [FSAA01] discover the NN in a constrained area of the
data space. Korn and Muthukrishnan [KM00] discuss
reverse nearest neighbor queries, where the goal is to
retrieve the data points whose nearest neighbor is a
specified query point. Korn et al. [KMS02] study the same
problem in the context of data streams. Given a query
moving with steady velocity, [SR01, TP02] incrementally
maintain the NN (as the query moves), while [BJKS02,
TPS02] propose techniques for continuous NN processing,
where the goal is to return all results up to a future time.
Kollios et al. [KGT99] develop various schemes for
answering NN queries on 1D moving objects. An overview
of existing NN methods for spatial and spatio-temporal
databases can be found in [TP03].
In this paper we discuss group nearest neighbor (GNN) queries, a novel form of NN search. The input of the problem consists of a set P={p1,…,pN} of static data points in multidimensional space and a group of query points Q={q1,…,qn}. The output contains the k (≥1) data point(s) with the smallest sum of distances to all points in Q. The distance between a data point p and Q is defined as dist(p,Q) = ∑i=1~n |pqi|, where |pqi| is the Euclidean distance between p and query point qi. As an example, consider a database that manages (static) facilities (i.e., dataset P). The query contains a set of user locations Q={q1,…,qn} and the result returns the facility that minimizes the total travel distance for all users. In addition to its relevance in geographic information systems and mobile computing applications, GNN search is important in several other domains. For instance, in clustering [JMF99] and outlier detection [AY01], the quality of a solution can be evaluated by the distances between the points and their nearest cluster centroid. Furthermore, the operability and speed of very large circuits depend on the relative distances between their various components; GNN can be applied to detect abnormalities and guide the relocation of components [NO97].
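For concreteness, the following Python sketch (our illustration, not part of the paper) computes dist(p,Q) and answers a GNN query by brute force; the R-tree algorithms of Sections 3 and 4 exist precisely to avoid this exhaustive scan.

import math

def dist(p, Q):
    # dist(p,Q) = sum of Euclidean distances |p qi| over all query points qi
    return sum(math.hypot(p[0] - qx, p[1] - qy) for qx, qy in Q)

def brute_force_gnn(P, Q, k=1):
    # O(|P|·|Q|) baseline: rank every data point by its total distance to Q
    return sorted(P, key=lambda p: dist(p, Q))[:k]

P = [(1, 1), (4, 5), (9, 2)]
Q = [(2, 2), (6, 3), (5, 7)]
print(brute_force_gnn(P, Q))  # the point of P minimizing dist(p,Q)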
Assuming that Q fits in memory and P is indexed by an R-tree, we first propose three algorithms for solving this problem. Then, we extend our techniques to cases where Q is too large to fit in memory, covering both indexed and non-indexed query points. The rest of the paper is structured as follows. Section 2 outlines related work on conventional nearest neighbor search and top-k queries. Section 3 describes algorithms for the case where Q fits in memory, and Section 4 for the case where Q resides on disk. Section 5 experimentally evaluates the algorithms and identifies the best one depending on the problem characteristics. Section 6 concludes the paper with directions for future work.
2. Related work
Following most approaches in the relevant literature, we assume 2D data points indexed by an R-tree [G84]. The proposed techniques, however, are applicable to higher dimensions and other data-partitioning access methods such as A-trees [SYUK00]. Figure 2.1 shows an R-tree for point set P={p1,p2,…,p12}, assuming a capacity of three entries per node. Points that are close in space (e.g., p1, p2, p3) are clustered in the same leaf node (N3). Nodes are then recursively grouped together according to the same principle until the top level, which consists of a single root.
Existing algorithms for point NN queries using R-trees follow the branch-and-bound paradigm, utilizing some metrics to prune the search space. The most common such metric is mindist(N,q), which corresponds to the closest possible distance between q and any point in the subtree of node N. Figure 2.1a shows the mindist between point q and nodes N1, N2. Similarly, mindist(N1,N2) is the minimum possible distance between any two points that reside in the sub-trees of nodes N1 and N2.
(a) Points and node extents (b) The corresponding R-tree
Figure 2.1: Example of an R-tree and a point NN query
The first NN algorithm for R-trees [RKV95] searches the tree in a depth-first (DF) manner. Specifically, starting from the root, it visits the node with the minimum mindist from q (e.g., N1 in Figure 2.1). The process is repeated recursively until the leaf level (node N4), where the first potential nearest neighbor is found (p5). During backtracking to the upper level (node N1), the algorithm only visits entries whose minimum distance is smaller than the distance of the nearest neighbor already retrieved. In the example of Figure 2.1, after discovering p5, DF will backtrack to the root level (without visiting N3), and then follow the path N2, N6, where the actual NN p11 is found.
The DF algorithm is sub-optimal, i.e., it accesses more nodes than necessary. In particular, as proven in [PM97], an optimal algorithm should visit only nodes intersecting the vicinity circle that centers at the query point q and has radius equal to the distance between q and its nearest neighbor. In Figure 2.1a, for instance, an optimal algorithm should visit only nodes R, N1, N2, and N6 (whereas DF also visits N4). The best-first (BF) algorithm of [HS99] achieves the optimal I/O performance by maintaining a heap H with the entries visited so far, sorted by their mindist. As with DF, BF starts from the root, and inserts all the entries into H (together with their mindist), e.g., in Figure 2.1a, H={<N1, mindist(N1,q)>, <N2, mindist(N2,q)>}. Then, at each step, BF visits the node in H with the smallest mindist. Continuing the example, the algorithm retrieves the content of N1 and inserts all its entries in H, after which H={<N2, mindist(N2,q)>, <N4, mindist(N4,q)>, <N3, mindist(N3,q)>}. Similarly, the next two nodes accessed are N2 and N6 (inserted in H after visiting N2), in which p11 is discovered as the current NN. At this time, the algorithm terminates (with p11 as the final result) since the next entry (N4) in H is farther (from q) than p11. Both DF and BF can be easily extended for the retrieval of k>1 nearest neighbors. In addition, BF is also incremental. Namely, it reports the nearest neighbors in ascending order of their distance to the query, so that k does not have to be known in advance (allowing different termination conditions to be used).
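The following Python sketch illustrates incremental best-first NN search over a simplified in-memory tree (the Node structure is our assumption; [HS99] operates on R-tree pages fetched from disk). Later sketches in this section reuse Node, mindist_point_rect and best_first_nn.

from dataclasses import dataclass, field
import heapq, itertools, math

@dataclass
class Node:
    rect: tuple = None                    # ((x1, y1), (x2, y2)) MBR
    children: list = field(default_factory=list)
    point: tuple = None                   # set only for leaf entries

def mindist_point_rect(p, rect):
    # minimum possible distance between point p and any point in rect
    (x1, y1), (x2, y2) = rect
    dx = max(x1 - p[0], 0.0, p[0] - x2)
    dy = max(y1 - p[1], 0.0, p[1] - y2)
    return math.hypot(dx, dy)

def best_first_nn(root, q):
    # yield data points in ascending distance from q (incremental NN)
    tie = itertools.count()               # tie-breaker so Nodes are never compared
    heap = [(0.0, next(tie), root)]
    while heap:
        d, _, e = heapq.heappop(heap)
        if e.point is not None:
            yield e.point, d              # the next nearest neighbor
        else:
            for c in e.children:
                key = (math.dist(q, c.point) if c.point is not None
                       else mindist_point_rect(q, c.rect))
                heapq.heappush(heap, (key, next(tie), c))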
The branch-and-bound framework also applies to closest pair queries that find the pair of objects from two datasets, such that their distance is the minimum among all pairs. [HS98, CMTV00] propose various algorithms based on the concepts of DF and BF traversal. The difference from NN is that the algorithms access two index structures (one for each dataset) simultaneously. If the mindist of two intermediate nodes Ni and Nj (one from each R-tree) is already greater than the distance of the closest pair of objects found so far, the sub-trees of Ni and Nj cannot contain a closest pair (thus, the pair is pruned).
As shown in the next section, a processing technique for GNN queries applies multiple conventional NN queries (one for each query point) and then combines their results. Some related work on this topic has appeared in the literature of top-k (or ranked) queries over multiple data repositories (see [FLN01, BCG02, F02] for representative papers). As an example, consider that a user wants to find the k images that are most similar to a query image, where similarity is defined according to n features, e.g., color histogram, object arrangement, texture, shape, etc. The query is submitted to n retrieval engines that return the best matches for particular features together with their similarity scores, i.e., the first engine will output a set of matches according to color, the second according to arrangement, and so on. The problem is to combine the multiple inputs in order to determine the top-k results in terms of their overall similarity.
The main idea behind all techniques is to minimize the extent and cost of the search performed on each retrieval engine in order to compute the final result. The threshold algorithm [FLN01] works as follows (assuming retrieval of the single best match): the first query is submitted to the first search engine, which returns the closest image p1 according to the first feature. The similarity between p1 and the query image with respect to the other features is computed. Then, the second query is submitted to the second search engine, which returns p2 (the best match according to the second feature). The overall similarity of p2 is also computed, and the best of p1 and p2 becomes the current result. The process is repeated in a round-robin fashion, i.e., after the last search engine is queried, the second match is retrieved with respect to the first feature, and so on. The algorithm terminates when the similarity of the current result is higher than the similarity that can be achieved by any subsequent solution. In the next section we adapt this approach to GNN processing.
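A minimal sketch of the threshold algorithm under simplifying assumptions of our own: each engine is a list of (object, score) pairs sorted by descending per-feature similarity, overall similarity is the sum of per-feature scores, and score_of provides the random-access lookups.

def threshold_algorithm(engines, score_of):
    # engines: one sorted list of (object, score) pairs per feature
    # score_of(obj, f): similarity of obj on feature f (random access)
    n = len(engines)
    best_obj, best_score = None, float("-inf")
    last_seen = [float("inf")] * n        # score frontier per feature
    for depth in range(max(len(e) for e in engines)):
        for f in range(n):                # round-robin over the n engines
            if depth < len(engines[f]):
                obj, s = engines[f][depth]
                last_seen[f] = s
                total = sum(score_of(obj, g) for g in range(n))
                if total > best_score:
                    best_obj, best_score = obj, total
        if best_score >= sum(last_seen):  # no unseen object can beat the
            break                         # current result: terminate
    return best_obj, best_score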
3. Algorithms for memory-resident queries
Assuming that the set Q of query points fits in memory and
that the data points are indexed by an R-tree, we present
three algorithms for processing GNN queries. For each
algorithm we first illustrate retrieval of a single nearest
neighbor, and then show the extension to
k>1. Table 3.1
contains the primary symbols used in our description (some
have not appeared yet, but will be clarified shortly).
Symbol               Description
Q                    set of query points
Qi                   a group of queries that fits in memory
n (ni)               number of queries in Q (Qi)
M (Mi)               MBR of Q (Qi)
q                    centroid of Q
dist(p,Q)            sum of distances between point p and query points in Q
mindist(N,q)         minimum distance between the MBR of node N and centroid q
mindist(p,M)         minimum distance between data point p and query MBR M
∑i ni·mindist(N,Mi)  weighted mindist of node N with respect to all query groups
Table 3.1: Frequently used symbols
3.1 Multiple query method
The multiple query method (MQM) utilizes the main idea of the threshold algorithm, i.e., it performs incremental NN queries for each point in Q and combines their results. For instance, in Figure 3.1 (where Q={q1,q2}), MQM retrieves the first NN of q1 (point p10 with |p10q1|=2) and computes the distance |p10q2| (=5). Similarly, it finds the first NN of q2 (point p11 with |p11q2|=3) and computes |p11q1| (=3). The point (p11) with the minimum sum of distances (|p11q1|+|p11q2|=6) to all query points becomes the current GNN of Q.
For each query point qi, MQM stores a threshold ti, which is the distance of its current NN, i.e., t1=|p10q1|=2 and t2=|p11q2|=3. The total threshold T is defined as the sum of all thresholds (=5). Continuing the example, since T < dist(p11,Q), it is possible that there exists a point in P whose distance to Q is smaller than dist(p11,Q). So MQM retrieves the second NN of q1 (p11, which has already been encountered by q2) and updates the threshold t1 to |p11q1| (=3). Since T (=6) now equals the summed distance between the best neighbor found so far and the points of Q, MQM terminates with p11 as the final result. In other words, every non-encountered point has distance greater than or equal to T (=6), and therefore it cannot be closer to Q (in the global sense) than p11.
Figure 3.1: Example of a GNN query
Figure 3.2 shows the pseudo-code for MQM (1NN), where best_dist (initially ∞) is the distance of the best_NN found so far. In order to achieve locality of the node accesses for individual queries, we sort the points in Q according to their Hilbert value; thus, two subsequent queries are likely to correspond to nearby points and access similar R-tree nodes. The algorithm for computing the nearest neighbors of query points should be incremental (e.g., the best-first search discussed in Section 2) because the termination condition is not known in advance. The extension for the retrieval of k (>1) nearest neighbors is straightforward. The k neighbors with the minimum overall distances are inserted in a list of k pairs <p, dist(p,Q)> (sorted on dist(p,Q)) and best_dist equals the distance of the k-th NN. Then, MQM proceeds in the same way as in Figure 3.2, except that whenever a better neighbor is found, it is inserted in best_NN and the last element of the list is removed.
MQM(Q: group of query points)
/* T: threshold; best_dist: distance of the current NN */
sort points in Q according to Hilbert value;
for each query point: ti = 0;
T = 0; best_dist = ∞; best_NN = null; // initialization
while (T < best_dist)
  get the next nearest neighbor pj of the next query point qi;
  ti = |pj qi|; update T;
  if dist(pj,Q) < best_dist
    best_NN = pj; // update current GNN of Q
    best_dist = dist(pj,Q);
end of while;
return best_NN;
Figure 3.2: The MQM algorithm
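For concreteness, a compact in-memory rendering of MQM reusing best_first_nn from the Section 2 sketch (the Hilbert-order sorting of Q and the k>1 extension are omitted; this is our simplification, not the paper's implementation):

import math

def mqm(root, Q):
    # one incremental NN stream per query point, consumed round-robin
    streams = [best_first_nn(root, q) for q in Q]
    t = [0.0] * len(Q)                        # per-query thresholds t_i
    best_nn, best_dist = None, float("inf")
    i = 0
    while sum(t) < best_dist:                 # T < best_dist
        try:
            p, d = next(streams[i])
        except StopIteration:
            break                             # stream i enumerated all of P,
                                              # so best_nn is already final
        t[i] = d                              # update t_i and hence T
        d_total = sum(math.dist(p, q) for q in Q)   # dist(p, Q)
        if d_total < best_dist:
            best_nn, best_dist = p, d_total
        i = (i + 1) % len(Q)                  # next query point
    return best_nn, best_dist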

3.2 Single point method
MQM may incur multiple accesses to the same node (and retrieve the same data point, e.g., p11) through different queries. To avoid this problem, the single point method (SPM) processes GNN queries with a single traversal. First, SPM computes the centroid q of Q, which is a point in space with a small value of dist(q,Q) (ideally, q is the point with the minimum dist(q,Q)). The intuition behind this approach is that the nearest neighbor is a point of P "near" q. It remains to derive (i) the computation of q, and (ii) the range around q in which we should look for points of P before we conclude that no better NN can be found.
Towards the first goal, let (x,y) be the coordinates of the centroid q and (xi,yi) be the coordinates of query point qi. The centroid q minimizes the distance function:

dist(q,Q) = ∑i=1~n √((x-xi)² + (y-yi)²)

Since the partial derivatives of dist(q,Q) with respect to its independent variables x and y are zero at the centroid q, we have the following equations:

∂dist(q,Q)/∂x = ∑i=1~n (x-xi)/√((x-xi)² + (y-yi)²) = 0
∂dist(q,Q)/∂y = ∑i=1~n (y-yi)/√((x-xi)² + (y-yi)²) = 0
Unfortunately, these equations cannot be solved in closed form for n>2; they must be evaluated numerically, which implies that the centroid is approximate. In our implementation, we use the gradient descent method [HYC01] to quickly obtain a good approximation. Specifically, starting with arbitrary initial coordinates, e.g., x = (1/n)∑i=1~n xi and y = (1/n)∑i=1~n yi, the method repeatedly modifies the coordinates as follows:

x = x - η·∂dist(q,Q)/∂x and y = y - η·∂dist(q,Q)/∂y,

where η is a step size. The process is repeated until the distance function dist(q,Q) converges to a minimum value. Although the resulting point q is only an approximation of the ideal centroid, it suffices for the purposes of SPM. Next we show how q can be used to prune the search space, based on the following lemma.
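A minimal sketch of this numerical step, with a fixed step size η and an iteration count chosen arbitrarily by us (the exact minimizer is the geometric median of Q, which has no closed form for n>2):

import math

def centroid(Q, eta=0.05, iters=2000):
    # gradient descent on dist(q,Q), starting from the arithmetic mean of Q
    x = sum(qx for qx, _ in Q) / len(Q)
    y = sum(qy for _, qy in Q) / len(Q)
    for _ in range(iters):
        gx = gy = 0.0
        for qx, qy in Q:
            d = math.hypot(x - qx, y - qy) or 1e-12   # guard against /0
            gx += (x - qx) / d                        # term of ∂dist/∂x
            gy += (y - qy) / d                        # term of ∂dist/∂y
        x, y = x - eta * gx, y - eta * gy             # x = x - η·∂dist/∂x, etc.
    return (x, y)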
Lemma 1: Let Q={q1,…,qn} be a group of query points and q an arbitrary point in space. The following inequality holds for any point p: dist(p,Q) ≥ n·|pq| - dist(q,Q), where |pq| denotes the Euclidean distance between p and q.

Proof: By the triangle inequality, for each query point qi we have |pqi| + |qiq| ≥ |pq|. Summing the n inequalities:

∑qi∈Q |pqi| + ∑qi∈Q |qiq| ≥ n·|pq|, hence dist(p,Q) ≥ n·|pq| - dist(q,Q). ∎
Lemma 1 provides a threshold for the termination of SPM. In particular, by applying an incremental point NN query at q, we stop when we find the first point p such that n·|pq| - dist(q,Q) ≥ dist(best_NN,Q). By Lemma 1, dist(p,Q) ≥ n·|pq| - dist(q,Q) and, therefore, dist(p,Q) ≥ dist(best_NN,Q). The same idea can be used for pruning intermediate nodes, as summarized by the following heuristic.

Heuristic 1: Let q be the centroid of Q and best_dist be the distance of the best GNN found so far. Node N can be pruned if:

mindist(N,q) ≥ (best_dist + dist(q,Q)) / n

where mindist(N,q) is the minimum distance between the MBR of N and the centroid q. An example of the heuristic is shown in Figure 3.3, where best_dist = 5+4 = 9. Since dist(q,Q) = 1+2 = 3 and n = 2, the right part of the inequality equals (9+3)/2 = 6, meaning that both nodes in the figure will be pruned.
Figure 3.3: Pruning of nodes in SPM
Based on the above observations, it is straightforward to implement SPM using the depth-first or best-first paradigm. Figure 3.4 shows the pseudo-code of DF SPM. Starting from the root of the R-tree (for P), entries are sorted in a list according to their mindist from the query centroid q and are visited (recursively) in this order. Once the first entry with mindist(Nj,q) ≥ (best_dist + dist(q,Q))/n has been found, the subsequent ones in the list are pruned. The extension to k (>1) GNN queries is the same as for conventional (point) NN algorithms.
SPM(Node: R-tree node, Q: group of query points)
/* q: the centroid of Q */
if Node is an intermediate node
  sort entries Nj in Node according to mindist(Nj,q) in list;
  repeat
    get_next entry Nj from list;
    if mindist(Nj,q) < (best_dist+dist(q,Q))/n /* heuristic 1 */
      SPM(Nj,Q); /* recursion */
  until mindist(Nj,q) ≥ (best_dist+dist(q,Q))/n or end of list;
else if Node is a leaf node
  sort points pj in Node according to mindist(pj,q) in list;
  repeat
    get_next entry pj from list;
    if |pj q| < (best_dist+dist(q,Q))/n /* heuristic 1 for points */
      if dist(pj,Q) < best_dist
        best_NN = pj; // update current GNN
        best_dist = dist(pj,Q);
  until |pj q| ≥ (best_dist+dist(q,Q))/n or end of list;
return best_NN;
Figure 3.4: The SPM algorithm
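Using best_first_nn and centroid from the earlier sketches, the Lemma 1 threshold also yields a compact best-first variant of SPM (our simplification; Figure 3.4 is the paper's depth-first version):

import math

def spm(root, Q):
    # scan P in ascending |pq| around the centroid, stop via Lemma 1
    q = centroid(Q)
    n = len(Q)
    dist_qQ = sum(math.dist(q, qi) for qi in Q)
    best_nn, best_dist = None, float("inf")
    for p, d in best_first_nn(root, q):       # points in ascending |pq|
        if n * d - dist_qQ >= best_dist:      # Lemma 1: nothing better remains
            break
        d_total = sum(math.dist(p, qi) for qi in Q)
        if d_total < best_dist:
            best_nn, best_dist = p, d_total
    return best_nn, best_dist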

3.3 Minimum bounding method
Like SPM, the minimum bounding method (MBM) performs a single query, but uses the minimum bounding rectangle M of Q (instead of the centroid q) to prune the search space. Specifically, starting from the root of the R-tree for dataset P, MBM visits only nodes that may contain candidate points. In the sequel, we discuss heuristics for identifying such qualifying nodes.

Heuristic 2: Let M be the MBR of Q, and best_dist be the distance of the best GNN found so far. A node N cannot contain qualifying points if:

mindist(N,M) ≥ best_dist / n

where mindist(N,M) is the minimum distance between M and N, and n is the cardinality of Q. Figure 3.5 shows a group of query points Q={q1,q2} and the best_NN with best_dist=5. Since mindist(N1,M) = 3 > best_dist/2 = 2.5, N1 can be pruned without being visited. In other words, even if there is a data point p at the upper-right corner of N1 and all the query points are at the lower-right corner of Q, it would still be the case that dist(p,Q) > best_dist. The concept of heuristic 2 also applies to leaf entries. When a point p is encountered, we first compute mindist(p,M) from p to the MBR of Q. If mindist(p,M) ≥ best_dist/n, p is discarded since it cannot be closer than the best_NN. In this way we avoid performing the distance computations between p and the points of Q.
Figure 3.5: Example of heuristic 2
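The rectangle distances used by MBM are cheap to compute. The helpers below (our illustration, reusing mindist_point_rect from the Section 2 sketch) implement the rectangle-to-rectangle mindist, the MBR of Q, and the heuristic 2 test in the form dist(p,Q) ≥ n·mindist(N,M):

import math

def mindist_rect_rect(r1, r2):
    # minimum possible distance between any two points of rectangles r1, r2
    (ax1, ay1), (ax2, ay2) = r1
    (bx1, by1), (bx2, by2) = r2
    dx = max(bx1 - ax2, 0.0, ax1 - bx2)
    dy = max(by1 - ay2, 0.0, ay1 - by2)
    return math.hypot(dx, dy)

def mbr(Q):
    # minimum bounding rectangle M of the query points
    xs, ys = [x for x, _ in Q], [y for _, y in Q]
    return ((min(xs), min(ys)), (max(xs), max(ys)))

def passes_heuristic_2(node_rect, M, n, best_dist):
    # every point p in the node is at least mindist(N,M) from each of the
    # n query points, hence dist(p,Q) >= n * mindist(N,M); visit only if
    # this lower bound is below best_dist
    return n * mindist_rect_rect(node_rect, M) < best_dist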
The heuristic incurs minimum overhead, since it requires a single distance computation per node. However, it is not very tight, i.e., it leads to unnecessary node accesses. For instance, node N2 (in Figure 3.5) passes heuristic 2 (and would be visited), although it cannot contain qualifying points. Heuristic 3 presents a tighter bound for avoiding such visits.
Heuristic 3: Let best_dist be the distance of the best GNN found so far. A node N can be safely pruned if:

∑qi∈Q mindist(N,qi) ≥ best_dist

where mindist(N,qi) is the minimum distance between N and query point qi ∈ Q. In Figure 3.5, since mindist(N2,q1) + mindist(N2,q2) = 6 > best_dist = 5, N2 is pruned.
Because heuristic 3 requires multiple distance computations (one for each query point), it is applied only to nodes that pass heuristic 2. Note that (like heuristic 2) heuristic 3 does not represent the tightest condition for successful node visits; i.e., it is possible for a node to satisfy the heuristic and still not contain qualifying points. Consider, for instance, Figure 3.6, which includes 3 query points. The current best_dist is 7, and node N3 passes heuristic 3, since mindist(N3,q1) + mindist(N3,q2) + mindist(N3,q3) = 5. Nevertheless, N3 should not be visited, because the minimum distance that can be achieved by any point in N3 is greater than 7. The dotted lines in Figure 3.6 correspond to the distances between the best possible point p' (not necessarily a data point) in N3 and the three query points.
Figure 3.6: Example of a hypothetical optimal heuristic
Assuming that we could identify the best point p' in a node, we could obtain a tight heuristic as follows: if the distance of p' is smaller than best_dist, visit the node; otherwise, reject it. The combination of the best-first approach with this heuristic would lead to an I/O optimal method (such as the algorithm of [HS99] for conventional NN queries). Finding point p', however, is similar to the problem of locating the query centroid (but this time in a region constrained by the node MBR), which, as discussed in Section 3.2, can only be solved numerically (i.e., approximately). Although an approximation suffices for SPM, the correctness of best_dist requires the precise solution (in order to avoid false misses). As a result, this hypothetical heuristic cannot be applied for exact GNN retrieval.
Heuristics 2 and 3 can be used with both the depth-first and best-first traversal paradigms. For simplicity, we discuss MBM based on depth-first traversal, using the example of Figure 3.7. The root of the R-tree is retrieved and its entries are sorted by their mindist to M. Then, the node (N1) with the minimum mindist is visited, inside which the entry of N4 has the smallest mindist. Points p5, p6, p4 (in N4) are processed according to the value of mindist(pj,M), and p5 becomes the current GNN of Q (best_dist=11). Points p6 and p4 have larger distances and are discarded. When backtracking to N1, the subtree of N3 is pruned by heuristic 2. Thus, MBM backtracks again to the root and visits nodes N2 and N6, inside which p10 has the smallest mindist to M and is processed first, replacing p5 as the GNN (best_dist=7). Then, p11 becomes the best NN (best_dist=6). Finally, N5 is pruned by heuristic 2, and the algorithm terminates with p11 as the final GNN. The extension to the retrieval of kNN and the best-first implementation are straightforward.
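A depth-first MBM sketch combining heuristics 2 and 3, reusing Node, mindist_point_rect, mindist_rect_rect and mbr from the earlier sketches (again our simplified in-memory rendering, with best_NN/best_dist threaded explicitly rather than kept as globals):

import math

def mbm(node, Q, M=None, best=(None, float("inf"))):
    # M is the MBR of Q (computed on the first call)
    M = M or mbr(Q)
    n = len(Q)
    best_nn, best_dist = best
    entries = sorted(node.children,
                     key=lambda c: (mindist_point_rect(c.point, M)
                                    if c.point is not None
                                    else mindist_rect_rect(c.rect, M)))
    for c in entries:
        if c.point is not None:                          # data point
            if n * mindist_point_rect(c.point, M) >= best_dist:
                continue                                 # heuristic 2 (points)
            d = sum(math.dist(c.point, q) for q in Q)
            if d < best_dist:
                best_nn, best_dist = c.point, d
        else:                                            # intermediate node
            if n * mindist_rect_rect(c.rect, M) >= best_dist:
                continue                                 # heuristic 2
            if sum(mindist_point_rect(q, c.rect) for q in Q) >= best_dist:
                continue                                 # heuristic 3
            best_nn, best_dist = mbm(c, Q, M, (best_nn, best_dist))
    return best_nn, best_dist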

References
[AMN+98] Arya, S., Mount, D., Netanyahu, N., Silverman, R., Wu, A. An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions. Journal of the ACM, 45(6), 1998.
[BGRS99] Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U. When Is "Nearest Neighbor" Meaningful? ICDT, 1999.
[BKSS90] Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD, 1990.
[G84] Guttman, A. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD, 1984.
[JMF99] Jain, A., Murty, M., Flynn, P. Data Clustering: A Review. ACM Computing Surveys, 31(3), 1999.