What have the authors contributed in "Efficient processing of top-k spatial keyword queries" ?


In this paper, the authors propose a novel index, named Spatial Inverted Index (S2I), to improve the performance of top-k spatial keyword queries. Moreover, the authors present algorithms that exploit S2I to process top-k spatial keyword queries efficiently. Finally, the authors show through extensive experiments that their approach outperforms the state-of-the-art approaches in terms of update and query cost.

(Open Access) Efficient processing of top-k spatial keyword queries (2011) | João B. Rocha-Junior

Efficient Processing of Top-k Spatial Keyword Queries

João B. Rocha-Junior⋆, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg

Department of Computer and Information Science

Norwegian University of Science and Technology (NTNU)

Trondheim, Norway

{joao, orestis, simonj, noervaag}@idi.ntnu.no

Abstract. Given a spatial location and a set of keywords, a top-k spatial keyword query returns the k best spatio-textual objects ranked according to their proximity to the query location and relevance to the query keywords. There are many applications handling huge amounts of geo-tagged data, such as Twitter and Flickr, that can benefit from this query. Unfortunately, the state-of-the-art approaches require a non-negligible processing cost that results in long response times. In this paper, we propose a novel index to improve the performance of top-k spatial keyword queries named Spatial Inverted Index (S2I). Our index maps each distinct term to a set of objects containing the term. The objects are stored differently according to the document frequency of the term and can be retrieved efficiently in decreasing order of keyword relevance and spatial proximity. Moreover, we present algorithms that exploit S2I to process top-k spatial keyword queries efficiently. Finally, we show through extensive experiments that our approach outperforms the state-of-the-art approaches in terms of update and query cost.

1 Introduction

Given a location and a set of keywords, a top-k spatial keyword query returns a ranked set of the k best spatio-textual objects taking into account both 1) the spatial distance between the objects (spatio-textual objects) and the query location, and 2) the relevance of the text describing the objects to the query keywords. There are several applications that can benefit from top-k spatial keyword queries, such as finding the tweets sent from a given location (Twitter) or finding images near a given location whose annotation is similar to the query keywords (Flickr). There are also other applications for GPS-enabled mobile phones that can benefit from such queries.

For example, Fig. 1 shows a spatial area containing objects p (bars and pubs) with their respective textual descriptions. Consider a tourist in São Paulo with a GPS mobile phone who wants to find a bar playing samba near her current location q. The tourist poses a top-3 spatial keyword query on her mobile phone with the keywords bar and samba (the query location q is automatically sent by the mobile phone). The top-1 result is p because its description is similar to the query keywords, and it is close to the query location q. The top-2 result is p that is nearer to q than p and has a better textual relevance to the query keywords than p . Here, we are assuming for simplicity that documents with higher numbers of occurrences of query keywords are more textually relevant. Later, we will drop this assumption and present a more advanced model. Consequently, the top-3 results are ⟨p , p

Fig. 1. Example of top-k spatial keyword query.

⋆ On leave from the State University of Feira de Santana (UEFS).

Top-k spatial keyword queries are intuitive and constitute a useful tool for many applications. However, processing top-k spatial keyword queries efficiently is complex and requires a hybrid index combining information retrieval and spatial indexes. The state-of-the-art approaches proposed by Cong et al. [4] and Li et al. [11] employ a hybrid index that augments the nodes of an R-tree with inverted indexes. The inverted index at each node refers to a pseudo-document that represents all the objects under the node. Therefore, in order to verify if a node is relevant for a set of query keywords, the current approaches access the inverted index at each node to evaluate the similarity between the query keywords and the pseudo-document associated with the node. This process incurs a non-negligible processing cost that results in long response times.

In this paper, we propose a novel method for processing top-k spatial keyword queries more efficiently. Instead of employing a single R-tree embedded with inverted indexes, we propose a new index named Spatial Inverted Index (S2I) that maps each keyword (term) to a distinct aggregated R-tree (aR-tree) [14] that stores the objects with the given term. In fact, we employ an aR-tree only when the number of objects exceeds a given threshold. As long as the threshold is not exceeded, the objects are stored in a file, one block per term. However, for ease of presentation, let us assume that we employ an aR-tree for each term. The aR-tree stores the latitude and longitude of the objects, and maintains an aggregated value that represents the maximum term impact (normalized weight) of the objects under the node. Consequently, it is possible to retrieve the best objects ranked in terms of both spatial relevance and keyword relevance efficiently and incrementally. For processing a top-k spatial keyword query with a single keyword, only a few nodes of a single aR-tree are accessed. For queries with more than one keyword, we employ an efficient algorithm that aggregates the partial scores on keyword relevance of the objects to obtain the k best results efficiently. In summary, the main contributions of this paper are:

– We present S2I, an index that maps each term in the vocabulary into a distinct aR-tree or block that stores all objects with the given term.

– We propose efficient algorithms that exploit the S2I in order to process top-k spatial keyword queries efficiently.

– Finally, we show through an extensive experimental evaluation that our approach outperforms the state-of-the-art algorithms in terms of update time, I/O cost, and response time.
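
The storage layout described above (one structure per term, kept as a flat block until a threshold is exceeded) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the threshold value, and the use of a plain Python list as a stand-in for both the file block and the aR-tree are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical value; the text only states that "a given threshold" is used.
BLOCK_THRESHOLD = 3


class TermStore:
    """All objects containing one term (illustrative stand-in, not a real aR-tree)."""

    def __init__(self):
        self.kind = "block"   # becomes "tree" once the threshold is exceeded
        self.entries = []     # tuples of (object_id, (x, y), impact)

    def add(self, object_id, location, impact):
        self.entries.append((object_id, location, impact))
        if self.kind == "block" and len(self.entries) > BLOCK_THRESHOLD:
            # A real index would build an aR-tree here; we only flip a flag.
            self.kind = "tree"

    def max_impact(self):
        # The aggregated value that an aR-tree node would maintain for its subtree.
        return max(impact for _, _, impact in self.entries)


class S2ISketch:
    """Maps each term to its own TermStore."""

    def __init__(self):
        self.terms = defaultdict(TermStore)

    def insert(self, object_id, location, term_impacts):
        # term_impacts: {term: impact of the term in the object's description}
        for term, impact in term_impacts.items():
            self.terms[term].add(object_id, location, impact)


index = S2ISketch()
index.insert(1, (0.1, 0.2), {"bar": 0.7, "samba": 0.7})
index.insert(2, (0.4, 0.4), {"bar": 1.0})
index.insert(3, (0.5, 0.1), {"bar": 0.6, "pub": 0.8})
index.insert(4, (0.9, 0.9), {"bar": 0.5})
print(index.terms["bar"].kind, index.terms["samba"].kind)
```

In this toy run the frequent term "bar" crosses the assumed threshold and is promoted, while the rare term "samba" stays in a block, which mirrors the idea that terms with different document frequency are stored differently.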

The rest of this paper is organized as follows. Sect. 2 gives an overview of the related work. Sect. 3 poses the problem statement. In Sect. 4, we describe S2I. In Sect. 5 and 6, we present the algorithms for processing top-k spatial keyword queries. Finally, the experimental evaluation is presented in Sect. 7 and the paper is concluded in Sect. 8.

2 Related Work

Initially, the research on spatial keyword queries focused on improving the performance of spatial queries in search engines. Zhou et al. [17] did relevant work combining inverted indexes [18] and R-trees [2], and proposed three approaches: 1) indexing the data in both R-trees and inverted indexes, 2) creating an R-tree for each term, and 3) integrating keywords in the intermediary nodes of an R-tree. They found that the second approach achieved better performance. However, they did not consider objects with a precise location (latitude and longitude), and did not provide support for top-k spatial keyword queries. Chen et al. [3] also had an information retrieval perspective on their work and did not provide support for exact query location of objects. In their approach, inverted indexes and R-trees are accessed separately in two different stages.

With the popularization of GPS-enabled devices, the research focused on searching for objects in a specific location. Hariharan et al. [7] proposed augmenting the nodes of an R-tree with keywords extracted from the objects in the sub-tree of the node. These keywords are then indexed in a structure similar to an inverted index for fast retrieval. Their approach supports conjunctive queries in a given region of space. It is not clear, however, how their solution can be extended to support top-k spatial keyword queries. Later, Ian de Felipe et al. [5] proposed a data structure that integrates signature files and R-trees. The main idea was indexing the spatial objects in an R-tree, employing a signature on the nodes to indicate the presence of a given keyword in the sub-tree of the node. Consequently, at query processing time, the nodes that cannot contribute to the query keywords can be pruned. The main problem of this approach is the limitation to Boolean queries and to a small number of keywords per document.


To the best of our knowledge, there are two previous approaches that support

top-k spatial keyword queries. They were developed concurrently by Cong et

al. [4] and Li et al. [11]. Both approaches augment the nodes of an R-tree with

a document vector that represents all documents in the sub-tree of the node.

For all terms present in the objects in the sub-tree of the node, the vector stores

the maximum impact of the term (normalized weight). Consequently, the vector

allows computing an upper bound for the textual score (textual relevance) that

can be achieved visiting a given node. Hence, it is possible to rank the nodes

according to textual relevance and spatial relevance, and decide which nodes

should be accessed ﬁrst to compute the top-k results.
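
The upper-bound idea in the paragraph above can be sketched as follows. This is a hedged illustration rather than the code of Cong et al. or Li et al.: it assumes that the textual score of a document is a sum, over query terms, of a query-side weight times the document-side impact, and the function names `node_vector` and `textual_upper_bound` are hypothetical.

```python
def node_vector(documents):
    """Per-term maximum impact over all documents in a node's sub-tree.

    documents: list of {term: impact} dictionaries (hypothetical input format).
    """
    vector = {}
    for doc in documents:
        for term, impact in doc.items():
            vector[term] = max(vector.get(term, 0.0), impact)
    return vector


def textual_score(doc, query):
    # Assumed scoring model: sum over query terms of query weight * document impact.
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


def textual_upper_bound(vector, query):
    # Same formula applied to the max-impact vector: no document under the
    # node can score higher, because each term contributes at most its maximum.
    return sum(weight * vector.get(term, 0.0) for term, weight in query.items())


docs = [{"bar": 0.6, "samba": 0.8}, {"bar": 1.0}]
query = {"bar": 0.6, "samba": 0.8}
bound = textual_upper_bound(node_vector(docs), query)
print(bound, [textual_score(d, query) for d in docs])
```

The bound is generally not tight, since the per-term maxima may come from different documents, but it is enough to decide which nodes deserve to be visited first.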

The work of Cong et al. goes beyond the work of Li et al. by incorporating document similarity to build a more advanced R-tree, namely the DIR-tree. The DIR-tree groups, in the same node, objects that are near each other in terms of spatial distance and whose textual descriptions are also similar. Furthermore, instead of comparing vectors at query time, the DIR-tree employs an inverted index associated with each node that makes it possible to efficiently retrieve the children of the node that can contribute to a given query keyword. Only the posting lists associated with the query keywords are accessed. Cong et al. also propose clustering the nodes of the DIR-tree (CDIR-tree) to further improve the query processing performance. The main idea is grouping related entries (objects, in the case of leaf nodes) and employing a pseudo-document to represent each group. Hence, more precise bounds can be estimated at query time, consequently improving the query processing performance. However, it is not clear if the improvement achieved at query processing time compensates for the additional cost required for clustering the nodes (pre-processing) and the extra storage space demanded by the CDIR-tree. Moreover, keeping a CDIR-tree updated is more complex. For this reason, we decided to compare our approach against the DIR-tree proposed by Cong et al., and we consider this approach as the state-of-the-art.

3 Problem Statement

Let P be a dataset with |P| spatio-textual objects p = ⟨p.id, p.l, p.d⟩, where p.id is the identification of p, p.l is the spatial location (latitude and longitude) of p, and p.d is the textual document describing p (e.g., menu of a restaurant). Let q = ⟨q.l, q.d, q.k⟩ be a top-k spatial keyword query, where q.l is the query location (latitude and longitude), q.d is the set of query keywords, and q.k is the number of expected results. A query q returns q.k spatio-textual objects $\{p_1, p_2, \dots, p_{q.k}\}$ from P with the highest scores τ(p, q), $\tau(p_1, q) \geq \tau(p_2, q) \geq \dots \geq \tau(p_{q.k}, q)$. Furthermore, a spatio-textual object p is part of the result set R of q if and only if there exists at least one term t ∈ q.d that is also in p.d (p ∈ R ⇔ ∃t ∈ q.d : t ∈ p.d). The score of p for a given query q is defined in the following equation:

\[ \tau(p, q) = \alpha \cdot \delta(p.l, q.l) + (1 - \alpha) \cdot \theta(p.d, q.d) \tag{1} \]

where δ(p.l, q.l) is the spatial proximity between the query location q.l and the object location p.l, and θ(p.d, q.d) is the textual relevance of p.d according to q.d. Both measures return values within the range [0, 1]. The query preference parameter α ∈ (0, 1) defines the importance of one measure over the other. For example, α = 0.5 means that spatial proximity and textual relevance are equally important. In the following, we define the measures more precisely.

Spatial proximity (δ). The spatial proximity is defined in the following equation:

\[ \delta(p.l, q.l) = 1 - \frac{d(p.l, q.l)}{d_{max}} \tag{2} \]

where d(p.l, q.l) is the Euclidean distance between p.l and q.l, and $d_{max}$ is the largest Euclidean distance that any two points in the space may have. The maximum distance may be obtained, for example, by getting the largest diagonal of the Euclidean space of the application.
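
As a small worked illustration of Equations (1) and (2), the sketch below computes the spatial proximity and the combined score for one object. The function names, the unit-square data space (so that the largest diagonal is √2), and the sample textual relevance of 0.8 are assumptions chosen for the example; the textual relevance is passed in as a precomputed number.

```python
import math


def spatial_proximity(p_loc, q_loc, d_max):
    """Equation (2): 1 - d(p.l, q.l) / d_max, using the Euclidean distance."""
    return 1.0 - math.dist(p_loc, q_loc) / d_max


def score(p_loc, q_loc, d_max, textual_relevance, alpha=0.5):
    """Equation (1): alpha * delta + (1 - alpha) * theta."""
    delta = spatial_proximity(p_loc, q_loc, d_max)
    return alpha * delta + (1.0 - alpha) * textual_relevance


# Assumed data space: the unit square, whose largest diagonal is sqrt(2).
D_MAX = math.sqrt(2.0)

query_location = (0.0, 0.0)
object_location = (0.3, 0.4)          # Euclidean distance 0.5 from the query
delta = spatial_proximity(object_location, query_location, D_MAX)
tau = score(object_location, query_location, D_MAX, textual_relevance=0.8)
print(delta, tau)
```

Because both δ and θ lie in [0, 1] and α lies in (0, 1), the resulting score τ also lies in [0, 1].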

Textual relevance (θ). There are several similarity measures that can be used to evaluate the textual relevance between the query keywords q.d and the text document p.d [13]. In this paper, we adopt the well-known cosine similarity between the vectors composed by the weights of the terms in q.d and p.d:

\[ \theta(p.d, q.d) = \frac{\sum_{t \in q.d} w_{t,p.d} \cdot w_{t,q.d}}{\sqrt{\sum_{t \in p.d} (w_{t,p.d})^2 \cdot \sum_{t \in q.d} (w_{t,q.d})^2}} \tag{3} \]

In order to compute the cosine, we adopt the approach employed by Zobel and Moffat [18]. Therefore, the weight $w_{t,p.d}$ is computed as $w_{t,p.d} = 1 + \ln(f_{t,p.d})$, where $f_{t,p.d}$ is the number of occurrences (frequency) of t in p.d; and the weight $w_{t,q.d}$ is obtained from the following formula $w_{t,q.d} = \ln(1 + \frac{|P|}{df_t})$, where |P| is the total number of documents in the collection. The document frequency $df_t$ of a term t gives the number of documents in P that contain t. The higher the cosine value, the higher the textual relevance. The textual relevance is a value within the range [0, 1] (property of cosine).

We also define the impact $\lambda_{t,d}$ of a term t in a document d, where d represents the description of an object p.d or the query keywords q.d. The impact $\lambda_{t,d}$ is the normalized weight of the term in the document [1, 16], $\lambda_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t \in d} (w_{t,d})^2}}$. The impact takes into account the length of the document and can be used to compare the relevance of two different documents according to a term t present in both documents. Consequently, the textual relevance θ(p.d, q.d) can be rewritten in terms of the impact [16], $\theta(p.d, q.d) = \sum_{t \in q.d} \lambda_{t,q.d} \cdot \lambda_{t,p.d}$.
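
The equivalence between the cosine form of the textual relevance and its impact-based rewriting can be checked numerically. The sketch below is illustrative only: the function names, the sample term frequencies, the document frequencies, and the collection size are made-up assumptions, and a query term that is absent from the document is treated as having weight 0.

```python
import math


def doc_weights(term_freqs):
    """w_{t,p.d} = 1 + ln(f_{t,p.d}) for every term that occurs in the document."""
    return {t: 1.0 + math.log(f) for t, f in term_freqs.items() if f > 0}


def query_weights(query_terms, doc_freq, n_docs):
    """w_{t,q.d} = ln(1 + |P| / df_t) for every query term."""
    return {t: math.log(1.0 + n_docs / doc_freq[t]) for t in query_terms}


def impacts(weights):
    """lambda_{t,d} = w_{t,d} / sqrt(sum of squared weights in d)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


def theta_cosine(wp, wq):
    # Cosine similarity; terms missing from the document contribute 0.
    dot = sum(wp.get(t, 0.0) * w for t, w in wq.items())
    norm = math.sqrt(sum(w * w for w in wp.values()) * sum(w * w for w in wq.values()))
    return dot / norm


def theta_impact(wp, wq):
    # Impact form: sum over query terms of lambda_{t,q.d} * lambda_{t,p.d}.
    lp, lq = impacts(wp), impacts(wq)
    return sum(lam * lp.get(t, 0.0) for t, lam in lq.items())


# Hypothetical data: one document, a two-keyword query, a collection of 100 documents.
wp = doc_weights({"bar": 2, "samba": 1, "pub": 1})
wq = query_weights(["bar", "samba"], doc_freq={"bar": 50, "samba": 5}, n_docs=100)
print(theta_cosine(wp, wq), theta_impact(wp, wq))
```

Both functions return the same value up to floating-point rounding, which is why an index can store only the per-term impacts of each object and still recover the cosine score at query time.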

Other types of spatial proximity and textual relevance measures, such as Okapi BM25 [13], can be supported by our framework. The focus of this paper is, however, on the efficiency of top-k spatial keyword queries. In the following, we present the S2I (Sect. 4) and describe the algorithms to process top-k spatial keyword queries efficiently (Sect. 5 and 6).

4 Spatial Inverted Index

The S2I was designed taking into account the following observations. First, terms with different document frequency should be stored differently. It is well-known


References

Introduction to Information Retrieval

Term Weighting Approaches in Automatic Text Retrieval

The R*-tree: an efficient and robust access method for points and rectangles

Inverted files for text search engines

Distance browsing in spatial databases
