
Efficient processing of top-k spatial keyword queries


João B. Rocha-Junior*, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg
Department of Computer and Information Science
Norwegian University of Science and Technology (NTNU)
Trondheim, Norway
{joao, orestis, simonj, noervaag}@idi.ntnu.no
Abstract. Given a spatial location and a set of keywords, a top-k spatial
keyword query returns the k best spatio-textual objects ranked according
to their proximity to the query location and relevance to the query
keywords. There are many applications handling huge amounts of geotagged
data, such as Twitter and Flickr, that can benefit from this query.
Unfortunately, the state-of-the-art approaches require non-negligible
processing cost, which results in long response times. In this paper, we
propose a novel index, named Spatial Inverted Index (S2I), to improve the
performance of top-k spatial keyword queries. Our index maps each distinct
term to a set of objects containing the term. The objects are stored
differently according to the document frequency of the term and can be
retrieved efficiently in decreasing order of keyword relevance and spatial
proximity. Moreover, we present algorithms that exploit S2I to process top-k
spatial keyword queries efficiently. Finally, we show through extensive
experiments that our approach outperforms the state-of-the-art approaches in
terms of update and query cost.
1 Introduction
Given a location and a set of keywords, a top-k spatial keyword query returns a
ranked set of the k best spatio-textual objects taking into account both 1) the
spatial distance between the objects (spatio-textual objects) and the query
location, and 2) the relevance of the text describing the objects to the query
keywords. Several applications can benefit from top-k spatial keyword queries,
such as finding the tweets sent from a given location (Twitter) or finding
images near a given location whose annotation is similar to the query keywords
(Flickr). Applications for GPS-enabled mobile phones can also benefit from
such queries.
For example, Fig. 1 shows a spatial area containing objects p (bars and pubs)
with their respective textual descriptions. Consider a tourist in São Paulo
with a GPS mobile phone who wants to find a bar playing samba near her current
location q. The tourist poses a top-3 spatial keyword query on her mobile phone
with the keywords bar and samba (the query location q is automatically sent by
the mobile phone). The top-1 result is $p_4$ because its description is similar
to the query keywords, and it is close to the query location q. The top-2
result is $p_6$, which is nearer to q than $p_7$ and has a better textual
relevance to the query keywords than $p_7$. Here, we are assuming for
simplicity that documents with higher numbers of occurrences of query keywords
are more textually relevant. Later, we will drop this assumption and present a
more advanced model. Consequently, the top-3 results are
$\langle p_4, p_6, p_7 \rangle$.

Fig. 1. Example of top-k spatial keyword query.

* On leave from the State University of Feira de Santana (UEFS).

Top-k spatial keyword queries are intuitive and constitute a useful tool for
many applications. However, processing top-k spatial keyword queries efficiently
is complex and requires a hybrid index combining information retrieval and
spatial indexes. The state-of-the-art approaches proposed by Cong et al. [4] and
Li et al. [11] employ a hybrid index that augments the nodes of an R-tree with
inverted indexes. The inverted index at each node refers to a pseudo-document
that represents all the objects under the node. Therefore, in order to verify
whether a node is relevant for a set of query keywords, the current approaches
access the inverted index at each node to evaluate the similarity between the
query keywords and the pseudo-document associated with the node. This process
incurs a non-negligible processing cost, which results in long response times.
In this paper, we propose a novel method for processing top-k spatial keyword
queries more efficiently. Instead of employing a single R-tree embedded with
inverted indexes, we propose a new index named Spatial Inverted Index (S2I) that
maps each keyword (term) to a distinct aggregated R-tree (aR-tree) [14] that
stores the objects with the given term. In fact, we employ an aR-tree only when
the number of objects exceeds a given threshold. As long as the threshold is
not exceeded, the objects are stored in a file, one block per term. However, for
ease of presentation, let us assume that we employ an aR-tree for each term.
The aR-tree stores the latitude and longitude of the objects, and maintains an
aggregated value that represents the maximum term impact (normalized weight)
of the objects under the node. Consequently, it is possible to retrieve the best
objects ranked in terms of both spatial relevance and keyword relevance
efficiently and incrementally. For processing a top-k spatial keyword query
with a single keyword, only a few nodes of a single aR-tree are accessed. For
queries with more than one keyword, we employ an efficient algorithm that
aggregates the partial scores on keyword relevance of the objects to obtain the
k best results efficiently. In summary, the main contributions of this paper are:
- We present S2I, an index that maps each term in the vocabulary into a
  distinct aR-tree or block that stores all objects with the given term.
- We propose efficient algorithms that exploit the S2I in order to process top-k
  spatial keyword queries efficiently.
- Finally, we show through an extensive experimental evaluation that our
  approach outperforms the state-of-the-art algorithms in terms of update time,
  I/O cost, and response time.
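
The term-to-storage mapping described above can be illustrated with a small in-memory sketch. It is not the authors' implementation: the class name `ToySpatialInvertedIndex`, the threshold value, and the flat Python lists standing in for both disk blocks and aR-trees are assumptions made only for illustration. The sketch shows the two ideas stated in the text: each term maps to its own set of objects, and the storage kind depends on how many objects contain the term.

```python
from collections import defaultdict


class ToySpatialInvertedIndex:
    """Toy stand-in for S2I: maps each term to the objects containing it.

    A real S2I keeps infrequent terms in a file block and frequent terms in
    an aR-tree; here both are plain lists (an assumption for illustration).
    """

    def __init__(self, threshold):
        # Hypothetical threshold; the text only says "a given threshold".
        self.threshold = threshold
        # term -> list of (object_id, (x, y), impact)
        self.postings = defaultdict(list)

    def insert(self, object_id, location, term_impacts):
        """Add an object; term_impacts maps each of its terms to an impact."""
        for term, impact in term_impacts.items():
            self.postings[term].append((object_id, location, impact))

    def storage_kind(self, term):
        """Report which structure the text prescribes for this term."""
        count = len(self.postings.get(term, []))
        if count == 0:
            return "absent"
        return "aR-tree" if count > self.threshold else "block"

    def max_impact(self, term):
        """The aggregate an aR-tree root would hold: the largest impact."""
        entries = self.postings.get(term, [])
        return max((impact for _, _, impact in entries), default=0.0)


index = ToySpatialInvertedIndex(threshold=2)
index.insert("p1", (0.1, 0.2), {"bar": 0.7, "samba": 0.7})
index.insert("p2", (0.4, 0.4), {"bar": 1.0})
index.insert("p3", (0.9, 0.1), {"bar": 0.6, "pub": 0.8})

for term in ("bar", "samba", "jazz"):
    print(term, index.storage_kind(term), index.max_impact(term))
```

With a threshold of 2, the term "bar" (three objects) would go to an aR-tree, while "samba" (one object) would stay in a block.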
The rest of this paper is organized as follows. Sect. 2 gives an overview of
the related work. Sect. 3 poses the problem statement. In Sect. 4, we describe
S2I. In Sect. 5 and 6, we present the algorithms for processing top-k spatial
keyword queries. Finally, the experimental evaluation is presented in Sect. 7 and
the paper is concluded in Sect. 8.
2 Related Work
Initially, the research on spatial keyword queries focused on improving the
performance of spatial queries in search engines. Zhou et al. [17] did relevant
work combining inverted indexes [18] and R-trees [2], and proposed three
approaches: 1) indexing the data in both R-trees and inverted indexes, 2) creating
an R-tree for each term, and 3) integrating keywords in the intermediary nodes
of an R-tree. They found that the second approach achieved better performance.
However, they did not consider objects with a precise location (latitude
and longitude), and did not provide support for top-k spatial keyword queries.
Chen et al. [3] also had an information retrieval perspective on their work and
did not provide support for exact query location of objects. In their approach,
inverted indexes and R-trees are accessed separately in two different stages.
With the popularization of GPS-enabled devices, the research focused on
searching for objects in a specific location. Hariharan et al. [7] proposed
augmenting the nodes of an R-tree with keywords extracted from the objects in the
sub-tree of the node. These keywords are then indexed in a structure similar to
an inverted index for fast retrieval. Their approach supports conjunctive queries
in a given region of space. It is not clear, however, how their solution can be
extended to support top-k spatial keyword queries. Later, Ian de Felipe et al. [5]
proposed a data structure that integrates signature files and R-trees. The main
idea was indexing the spatial objects in an R-tree employing a signature on the
nodes to indicate the presence of a given keyword in the sub-tree of the node.
Consequently, at query processing time, the nodes whose sub-trees cannot
contain the query keywords can be pruned. The main problem of this approach is
the limitation to Boolean queries and to a small number of keywords per document.

To the best of our knowledge, there are two previous approaches that support
top-k spatial keyword queries. They were developed concurrently by Cong et
al. [4] and Li et al. [11]. Both approaches augment the nodes of an R-tree with
a document vector that represents all documents in the sub-tree of the node.
For all terms present in the objects in the sub-tree of the node, the vector stores
the maximum impact of the term (normalized weight). Consequently, the vector
allows computing an upper bound for the textual score (textual relevance) that
can be achieved visiting a given node. Hence, it is possible to rank the nodes
according to textual relevance and spatial relevance, and decide which nodes
should be accessed first to compute the top-k results.
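
As an illustration of how such a per-node vector yields an upper bound, the sketch below merges the impact vectors of a node's children by taking the per-term maximum and then scores the merged vector against the query. It assumes the impact-based form of textual relevance defined in Sect. 3 (a sum of products of query and document impacts); the function names `node_vector` and `textual_upper_bound` and the sample values are hypothetical, not taken from Cong et al. or Li et al.

```python
def node_vector(child_vectors):
    """Per-term maximum impact over all entries below a node."""
    merged = {}
    for vector in child_vectors:
        for term, impact in vector.items():
            merged[term] = max(merged.get(term, 0.0), impact)
    return merged


def textual_score(query_impacts, vector):
    """Sum of query impact times document (or node) impact per query term."""
    return sum(q * vector.get(term, 0.0) for term, q in query_impacts.items())


# Impact vectors of three objects stored under one node (made-up values).
children = [
    {"bar": 0.8, "samba": 0.6},
    {"bar": 1.0},
    {"pub": 1.0},
]
query = {"bar": 0.6, "samba": 0.8}

bound = textual_score(query, node_vector(children))
scores = [textual_score(query, child) for child in children]
print(bound, scores)
```

Because every stored impact is at most the node's maximum for that term, no object below the node can score higher than `bound`, which is what allows nodes to be ranked before they are opened. The bound may exceed 1 because the merged vector is not itself a normalized document.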
The work of Cong et al. goes beyond the work of Li et al. by incorporating
document similarity to build a more advanced R-tree, namely the DIR-tree. The
DIR-tree groups, in the same node, objects that are near each other in terms of
spatial distance and whose textual descriptions are also similar. Furthermore,
instead of comparing vectors at query time, the DIR-tree employs an inverted
index associated with each node that makes it possible to efficiently retrieve
the children of the node that can contribute to a given query keyword. Only the
posting lists associated with the query keywords are accessed. Cong et al. also
propose clustering the nodes of the DIR-tree (CDIR-tree) to further improve the
query processing performance. The main idea is grouping related entries
(objects, in the case of leaf nodes) and employing a pseudo-document to
represent each group. Hence, more precise bounds can be estimated at query time,
consequently improving the query processing performance. However, it is not
clear whether the improvement achieved at query processing time compensates for
the additional cost required for clustering the nodes (pre-processing) and the
extra storage space demanded by the CDIR-tree. Moreover, keeping a CDIR-tree
updated is more complex. For this reason, we decided to compare our approach
against the DIR-tree proposed by Cong et al., and we consider this approach as
the state-of-the-art.
3 Problem Statement
Let P be a dataset with |P| spatio-textual objects
$p = \langle p.id, p.l, p.d \rangle$, where p.id is the identification of p,
p.l is the spatial location (latitude and longitude) of p, and p.d is the
textual document describing p (e.g., menu of a restaurant). Let
$q = \langle q.l, q.d, q.k \rangle$ be a top-k spatial keyword query, where q.l
is the query location (latitude and longitude), q.d is the set of query
keywords, and q.k is the number of expected results. A query q returns q.k
spatio-textual objects $\{p_1, p_2, \dots, p_{q.k}\}$ from P with the highest
scores $\tau(p, q)$, $\tau(p_1, q) \geq \tau(p_2, q) \geq \dots \geq \tau(p_{q.k}, q)$.
Furthermore, a spatio-textual object p is part of the result set R of q if and
only if there exists at least one term $t \in q.d$ that is also in p.d
($p \in R \Leftrightarrow \exists t \in q.d : t \in p.d$). The score of p for a
given query q is defined in the following equation:

$$\tau(p, q) = \alpha \cdot \delta(p.l, q.l) + (1 - \alpha) \cdot \theta(p.d, q.d) \tag{1}$$

where $\delta(p.l, q.l)$ is the spatial proximity between the query location q.l
and the object location p.l, and $\theta(p.d, q.d)$ is the textual relevance of
p.d according to q.d. Both measures return values within the range [0, 1]. The
query preference parameter $\alpha \in (0, 1)$ defines the importance of one
measure over the other. For example, $\alpha = 0.5$ means that spatial proximity
and textual relevance are equally important. In the following, we define the
measures more precisely.
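
To make the role of the query preference parameter concrete, the following sketch evaluates Eq. (1) for two hypothetical objects whose spatial proximity and textual relevance are given directly (the values are made up). The function name `combined_score` is an assumption for illustration.

```python
def combined_score(spatial_proximity, textual_relevance, alpha):
    """Eq. (1): weighted sum of spatial proximity and textual relevance."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * spatial_proximity + (1.0 - alpha) * textual_relevance


# (spatial proximity, textual relevance), both already in [0, 1].
near_but_vague = (0.9, 0.2)
far_but_relevant = (0.4, 0.8)

for alpha in (0.2, 0.5, 0.8):
    near = combined_score(*near_but_vague, alpha)
    far = combined_score(*far_but_relevant, alpha)
    print(alpha, round(near, 2), round(far, 2))
```

For small values of alpha the textually relevant object ranks first; once alpha is large enough, the nearby object overtakes it.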
Spatial proximity (δ). The spatial proximity is defined in the following
equation:

$$\delta(p.l, q.l) = 1 - \frac{d(p.l, q.l)}{d_{max}} \tag{2}$$

where d(p.l, q.l) is the Euclidean distance between p.l and q.l, and $d_{max}$
is the largest Euclidean distance that any two points in the space may have. The
maximum distance may be obtained, for example, by getting the largest diagonal
of the Euclidean space of the application.
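
A minimal sketch of Eq. (2), assuming a rectangular data space whose diagonal serves as the maximum distance. The function names and the 3 by 4 example space are hypothetical.

```python
import math


def max_distance(min_corner, max_corner):
    """Diagonal of a rectangular data space, used as d_max."""
    return math.dist(min_corner, max_corner)


def spatial_proximity(p_loc, q_loc, d_max):
    """Eq. (2): 1 minus the Euclidean distance normalized by d_max."""
    return 1.0 - math.dist(p_loc, q_loc) / d_max


d_max = max_distance((0.0, 0.0), (3.0, 4.0))  # diagonal of a 3 x 4 space
print(spatial_proximity((1.0, 1.0), (1.0, 1.0), d_max))  # same point
print(spatial_proximity((0.0, 0.0), (3.0, 0.0), d_max))  # three units apart
print(spatial_proximity((0.0, 0.0), (3.0, 4.0), d_max))  # opposite corners
```

An object at the query location gets proximity 1, and two points at opposite corners of the space get proximity 0, so the measure stays within [0, 1].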
Textual relevance (θ). There are several similarity measures that can be used
to evaluate the textual relevance between the query keywords q.d and the text
document p.d [13]. In this paper, we adopt the well-known cosine similarity
between the vectors composed by the weights of the terms in q.d and p.d:

$$\theta(p.d, q.d) = \frac{\sum_{t \in q.d} w_{t,p.d} \cdot w_{t,q.d}}{\sqrt{\sum_{t \in p.d} (w_{t,p.d})^2 \cdot \sum_{t \in q.d} (w_{t,q.d})^2}} \tag{3}$$

In order to compute the cosine, we adopt the approach employed by Zobel and
Moffat [18]. Therefore, the weight $w_{t,p.d}$ is computed as
$w_{t,p.d} = 1 + \ln(f_{t,p.d})$, where $f_{t,p.d}$ is the number of occurrences
(frequency) of t in p.d; and the weight $w_{t,q.d}$ is obtained from the
following formula: $w_{t,q.d} = \ln(1 + \frac{|P|}{df_t})$, where |P| is the
total number of documents in the collection. The document frequency $df_t$ of
a term t gives the number of documents in P that contain t. The higher the
cosine value, the higher the textual relevance. The textual relevance is a value
within the range [0, 1] (property of cosine).
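
The weighting scheme and Eq. (3) can be sketched as follows on a tiny made-up corpus. Two details are assumptions of this sketch rather than statements from the paper: a query term absent from a document contributes a weight of 0, and query terms that occur in no document are skipped to avoid dividing by a zero document frequency.

```python
import math
from collections import Counter


def doc_weights(tokens):
    """w_{t,p.d} = 1 + ln(f_{t,p.d}) for every term occurring in the document."""
    return {t: 1.0 + math.log(f) for t, f in Counter(tokens).items()}


def query_weights(keywords, doc_freq, n_docs):
    """w_{t,q.d} = ln(1 + |P| / df_t); unseen terms are skipped (assumption)."""
    return {
        t: math.log(1.0 + n_docs / doc_freq[t])
        for t in set(keywords)
        if doc_freq.get(t, 0) > 0
    }


def textual_relevance(tokens, keywords, doc_freq, n_docs):
    """Eq. (3): cosine between document and query weight vectors."""
    w_d = doc_weights(tokens)
    w_q = query_weights(keywords, doc_freq, n_docs)
    dot = sum(w_d.get(t, 0.0) * w for t, w in w_q.items())
    norm = math.sqrt(
        sum(w * w for w in w_d.values()) * sum(w * w for w in w_q.values())
    )
    return dot / norm if norm > 0 else 0.0


corpus = [
    ["bar", "samba", "samba"],
    ["bar", "pub"],
    ["pub", "food"],
]
# Document frequency: number of documents containing each term.
doc_freq = Counter(t for doc in corpus for t in set(doc))
query = ["bar", "samba"]

scores = [textual_relevance(doc, query, doc_freq, len(corpus)) for doc in corpus]
print(scores)
```

The first document contains both keywords and scores highest, the second contains only "bar", and the third shares no term with the query and scores 0.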
We also define the impact $\lambda_{t,d}$ of a term t in a document d, where d
represents the description of an object p.d or the query keywords q.d. The
impact $\lambda_{t,d}$ is the normalized weight of the term in the document
[1, 16]:

$$\lambda_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t \in d} (w_{t,d})^2}}$$

The impact takes into account the length of the document and can be used to
compare the relevance of two different documents according to a term t present in
both documents. Consequently, the textual relevance $\theta(p.d, q.d)$ can be
rewritten in terms of the impact [16]:

$$\theta(p.d, q.d) = \sum_{t \in q.d} \lambda_{t,q.d} \cdot \lambda_{t,p.d}$$
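
The equivalence between the impact form and the cosine of Eq. (3) can be checked numerically. The sketch below normalizes two hypothetical weight vectors into impacts and compares the sum of impact products with the cosine computed directly; the function names and sample weights are assumptions for illustration.

```python
import math


def impacts(weights):
    """Divide each weight by the Euclidean norm of the whole weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm > 0 else {}


def relevance_from_impacts(doc_impacts, query_impacts):
    """Sum over query terms of query impact times document impact."""
    return sum(lam * doc_impacts.get(t, 0.0) for t, lam in query_impacts.items())


def cosine(doc_w, query_w):
    """Eq. (3) computed directly from the unnormalized weights."""
    dot = sum(doc_w.get(t, 0.0) * w for t, w in query_w.items())
    norm = math.sqrt(
        sum(w * w for w in doc_w.values()) * sum(w * w for w in query_w.values())
    )
    return dot / norm


# Hypothetical weights following the scheme of this section.
doc_w = {"bar": 1.0, "samba": 1.0 + math.log(2), "live": 1.0}
query_w = {"bar": math.log(2.5), "samba": math.log(4.0)}

via_impacts = relevance_from_impacts(impacts(doc_w), impacts(query_w))
direct = cosine(doc_w, query_w)
print(via_impacts, direct)
```

Since the impacts depend only on a single document (or on the query), they can be computed once and stored, which is what lets an index keep per-term impacts instead of recomputing norms at query time.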
Other types of spatial proximity and textual relevance measures such as
Okapi BM25 [13] can be supported by our framework. The focus of this paper
is, however, on the efficiency of top-k spatial keyword queries. In the following,
we present the S2I (Sect. 4) and describe the algorithms to process top-k spatial
keyword queries efficiently (Sect. 5 and 6).
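
Before any index is introduced, the query semantics of this section can be captured by a brute-force scan that scores every object. This is a naive baseline written for illustration, not one of the algorithms of Sect. 5 and 6; the function name `top_k_scan`, the precomputed impact vectors, and the unit-square data space are assumptions.

```python
import heapq
import math


def top_k_scan(objects, q_loc, q_impacts, k, alpha, d_max):
    """Score every object with Eq. (1) and return the k best (score, id) pairs."""
    scored = []
    for obj_id, loc, obj_impacts in objects:
        # Objects sharing no term with the query are not part of the result.
        if not any(term in obj_impacts for term in q_impacts):
            continue
        delta = 1.0 - math.dist(loc, q_loc) / d_max
        theta = sum(lam * obj_impacts.get(t, 0.0) for t, lam in q_impacts.items())
        scored.append((alpha * delta + (1.0 - alpha) * theta, obj_id))
    return heapq.nlargest(k, scored)


# Objects in a unit square: (id, location, term impacts), all values made up.
objects = [
    ("a", (0.1, 0.1), {"bar": 0.6, "samba": 0.8}),
    ("b", (0.2, 0.1), {"bar": 1.0}),
    ("c", (0.9, 0.9), {"bar": 0.6, "samba": 0.8}),
    ("d", (0.1, 0.2), {"pub": 1.0}),
]
d_max = math.sqrt(2.0)  # diagonal of the unit square

result = top_k_scan(
    objects,
    q_loc=(0.1, 0.1),
    q_impacts={"bar": 0.6, "samba": 0.8},
    k=2,
    alpha=0.5,
    d_max=d_max,
)
print([obj_id for _, obj_id in result])
```

The scan touches every object for every query, which is the cost an index such as S2I is meant to avoid.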
4 Spatial Inverted Index
The S2I was designed taking into account the following observations. First, terms
with different document frequency should be stored differently. It is well-known

References
- Introduction to Information Retrieval (book)
- Term Weighting Approaches in Automatic Text Retrieval (journal article)
- The R*-tree: an efficient and robust access method for points and rectangles (proceedings article)
- Inverted files for text search engines (journal article)
- Distance browsing in spatial databases (journal article)