
Keyword Searching and Browsing in Databases using BANKS
Gaurav Bhalotia
Arvind Hulgeri
Charuta Nakhe
Soumen Chakrabarti
S. Sudarshan
Computer Science and Engg. Dept., I.I.T. Bombay
bhalotia@cs.berkeley.edu, charuta@pspl.co.in, {aru,soumen,sudarsha}@cse.iitb.ac.in
Abstract

With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results.

BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
1. Introduction

Relational databases are commonly searched using structured query languages. The user needs to know the data schema to be able to ask suitable queries. Search engines on the Web have popularized an alternative unstructured querying and browsing paradigm that is simple and user-friendly. Users type in keywords and follow hyperlinks to navigate from one document to the other. No knowledge of the schema is needed.

(Author footnotes: Current affiliation: University of California, Berkeley. Supported by an Infosys Fellowship. Current affiliation: Persistent Systems Pvt. Ltd., Pune, India. Partly supported by an IBM Faculty Partnership Grant.)
With the growth of the World Wide Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or query languages; even relatively simple query languages designed for non-experts are too complicated for such users. Query languages for semi-structured/XML data are even more complex, increasing the impedance mismatch further.

Unfortunately, keyword search techniques used for locating information from collections of (Web) documents cannot be used on data stored in databases. In relational databases, information needed to answer a keyword query is often split across tables/tuples, due to normalization. As an example, consider the bibliographic database shown in Figure 1. This database contains paper titles, their authors and citations extracted from the DBLP repository. The schema is shown in Figure 1(A). Figure 1(B) shows a fragment of the DBLP database. It depicts partial information—paper title and authors—about a particular paper. As we can see, the information is distributed across seven tuples related through foreign key references. A user looking for this paper may use queries like "sunita temporal" or "soumen sunita". In keyword-based search, we need to identify tuples containing the keywords and ascertain their proximity through links.

Answers to keyword queries on the Web are often only the starting point for further browsing to locate required information. Similar browsing facilities are needed in the context of searching for information from databases.
In this paper, we describe techniques for keyword searching and browsing on databases that we have developed as part of the BANKS system (BANKS is an acronym for Browsing ANd Keyword Searching). The BANKS system enables data and schema browsing together with keyword-based search for relational databases. BANKS enables a user to get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results; absolutely no query language or programming is required.

[Figure 1. The DBLP Bibliography Database. (A) The schema: Paper(PaperId, PaperName, ...), Writes(AuthorId, PaperId, ...), Author(AuthorId, AuthorName, ...) and Cites(Citing, Cited, ...), connected by primary-key/foreign-key links. (B) A fragment of the database: the paper tuple ChakrabartiSD98 ("Mining Surprising Patterns Using Temporal Description Length") linked through three Writes tuples to the author tuples for Soumen Chakrabarti (SoumenC), Sunita Sarawagi (SunitaS) and Byron Dom (ByronD).]
The contributions of this paper are as follows:

1. We outline a framework for keyword querying of relational databases. Our framework makes joins implicit and transparent, and incorporates notions of proximity and prestige when ranking answers. There has been a fair amount of earlier work on keyword querying of databases, including [6, 7, 12, 13]. We describe the connections of the BANKS model to related work on keyword search in Section 6.

2. We present novel, efficient heuristic algorithms for executing keyword queries.

3. We describe key features of the BANKS system.

Keyword searching in BANKS is done using proximity-based ranking, based on foreign key links and other types of links. We model the database as a graph, with the tuples as nodes and cross references between them as edges. BANKS allows query keywords to match data (tokens appearing in any textual attribute) and metadata (e.g., column or relation names).
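To make the data/metadata matching concrete, here is a minimal illustrative sketch (in Python, not the BANKS Java implementation; the inverted-index layout and the `(relation, row)` node identifiers are our own assumptions for the example):

```python
from collections import defaultdict

def build_keyword_index(rows_by_relation):
    """Map each lowercased token to the set of matching nodes. A keyword
    can match data (tokens in any textual attribute) or metadata
    (relation and column names); a metadata match makes every tuple of
    that relation relevant to the keyword."""
    index = defaultdict(set)
    for relation, rows in rows_by_relation.items():
        for i, row in enumerate(rows):
            node = (relation, i)
            index[relation.lower()].add(node)        # relation-name match
            for column, value in row.items():
                index[column.lower()].add(node)      # column-name match
                for token in str(value).lower().split():
                    index[token].add(node)           # data token match
    return index

idx = build_keyword_index({
    "Author": [{"AuthorName": "Sunita Sarawagi"},
               {"AuthorName": "Soumen Chakrabarti"}],
})
# idx["author"] contains both tuples (metadata match);
# idx["sunita"] contains only the first (data match).
```

In this sketch the keyword 'author' is relevant to every Author tuple, mirroring the behavior described above for metadata matches.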
The greatest value of BANKS lies in near-zero-effort Web publishing of relational data which would otherwise remain invisible to the Web [2]. BANKS may be used to publish organizational data, bibliographic data, and electronic catalogs. Search facilities for such applications can be hand-crafted: many Web sites provide forms to carry out limited types of queries on their backend databases. For example, a university Web site may provide a form interface to search for faculty and students. Searching for departments would require yet another form, as would searching for courses offered. Creating an interface for each such task is laborious, and is also confusing to users, since they must first expend effort finding which form to use.

An approach taken in some cases is to export data from the database to Web pages, and then provide text search on the Web documents. This approach results in duplication of data, with the resultant problems of keeping the versions up-to-date, in addition to space and time overheads. Further, with highly connected data, it is not feasible to export every possible combination. For instance, a bibliographic database can export details of each paper as a Web document, but a query that requires finding a citation link between two papers would not be supported.

BANKS provides a rich interface to browse data, and automatically generates hyperlinks, corresponding to foreign keys and other links, on displayed results. BANKS helps create hierarchical and graphical views of data with hyperlink facilities built in. The BANKS system is developed in Java using servlets and JDBC, and can be run on any schema without any programming. BANKS is accessible over the Web at the URL: http://www.cse.iitb.ac.in/banks/
The rest of the paper is organized as follows: Section 2 outlines our graph model for representing connectivity information, as well as our model for answer relevance. Section 3 outlines an algorithm for incrementally finding the best answers to keyword queries. We present an overview of the browsing features of BANKS in Section 4. Section 5 outlines a preliminary evaluation of our system. We discuss related work in Section 6. Section 7 outlines directions for future work and Section 8 concludes the paper.
2. Database and Query Model

In this section we describe how a relational database is modeled as a graph in the BANKS system. First we evaluate the various options available and describe our model informally, and then we formalize it.

2.1. Informal Model Description

We model the database as a directed graph, with each tuple in the database as a node in the graph. Each foreign-key–primary-key link is modeled as a directed edge between the corresponding tuples. This can be easily extended to other types of connections; for example, we can extend the model to include edges corresponding to inclusion dependencies, where the values in the referencing column of the referencing table are contained in the referred column of the referred table, but the referred column need not be a key of the referred table.
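As an illustration of this tuple-graph construction, the fragment in Figure 1(B) might be built as follows (a minimal Python sketch under our own representational assumptions—node identifiers as `(relation, key)` pairs—not the BANKS implementation, which is in Java):

```python
from collections import defaultdict

# Toy Writes tuples mirroring Figure 1(B): (AuthorId, PaperId) pairs.
writes = [("SoumenC", "ChakrabartiSD98"),
          ("SunitaS", "ChakrabartiSD98"),
          ("ByronD", "ChakrabartiSD98")]

def build_graph(writes_rows):
    """Each tuple is a node; each foreign-key-to-primary-key reference
    becomes a directed edge from the referencing tuple (a Writes row)
    to the referenced tuple (an Author or Paper row)."""
    edges = defaultdict(set)
    for author_id, paper_id in writes_rows:
        w = ("Writes", (author_id, paper_id))
        edges[w].add(("Author", author_id))  # Writes.AuthorId -> Author
        edges[w].add(("Paper", paper_id))    # Writes.PaperId  -> Paper
    return edges

g = build_graph(writes)
# Three Writes nodes, each with edges to one author node and the paper node.
```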
Intuitively, an answer to a query should be a subgraph connecting nodes matching the keywords. However, just by looking at a subgraph it is not apparent what information it conveys. We wish to identify a node in the graph as a central node that connects all the keyword nodes and strongly reflects the relationship amongst them. We therefore consider an answer to be a rooted directed tree containing a directed path from the root to each keyword node. (The motivation for directionality is outlined later in this section.) We call the root node an information node and the tree a connection tree. Conceptually this model is similar to the one described in [13], although there are several differences, which are detailed in Section 6.

In general, the importance of a link depends upon the type of the link, i.e., what relations it connects, and on its semantics; for example, in the bibliographic database, the link between the Paper table and the Writes table is seen as a stronger link than the link between the Paper table and the Cites table. The link between the Paper and Cites tables would therefore have a higher weight. The weight of a tree is proportional to the total of its edge weights, and the relevance of a tree is inversely related to its weight.
The example in Figure 1 illustrates that some links point towards the root of the tree, instead of away from the root as required by our model. For instance, the Writes relation has foreign keys to the Paper and Author relations, whereas we require paths from Paper to Author, traversing a foreign key edge in the opposite direction. However, we cannot simply regard the edges as undirected.

Ignoring directionality would cause problems because of "hubs" which are connected to a large number of nodes. For example, in a university database a department with a large number of faculty and students would act as a hub. As a result, many nodes would be within a short distance of many other nodes, reducing the effectiveness of proximity-based scoring.
To solve the problem, we create for each link/edge (u, v) a backward edge (v, u) with a different edge weight; we model the weight of (v, u) as directly proportional to the number of links to v from the nodes of the same type as u. (Equations for computing the weights are presented in Section 2.2.) In the example from Figure 1, the backward edges ensure that there is a directed tree rooted at the paper, with a path to each leaf. To illustrate the effect of backward edge weights, let us return to the university department example. A forward edge from a student to her department and a back edge from the department to another student would form a path between each pair of students in the department. If there are more students in a department, the back edges would be assigned a higher weight, resulting in lower proximity (due to the department) for each pair of students than if there are fewer students registered. In contrast, in the bibliographic database, papers (typically) have smaller numbers of authors, and the backward edge weights from Paper to Writes nodes would be lower, resulting in higher proximity between co-authors.
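The backward-edge scheme can be sketched as follows (illustrative Python; here we take the back-edge weight to be exactly the per-relation indegree, which is only one instance of the "directly proportional" rule above):

```python
from collections import defaultdict

def add_backward_edges(forward_edges, relation_of):
    """For every forward edge (u, v), add a back edge (v, u) whose weight
    is proportional to the number of links v receives from nodes of the
    same relation as u. Hubs such as a large department thus get costly
    back edges, lowering proximity between the nodes they connect."""
    indegree = defaultdict(int)            # (v, relation of u) -> count
    for u, v in forward_edges:
        indegree[(v, relation_of[u])] += 1
    return {(v, u): indegree[(v, relation_of[u])] for u, v in forward_edges}

# A department referenced by three students vs. one referenced by one:
rel = {"s1": "Student", "s2": "Student", "s3": "Student", "s4": "Student",
       "big": "Dept", "small": "Dept"}
fwd = [("s1", "big"), ("s2", "big"), ("s3", "big"), ("s4", "small")]
back = add_backward_edges(fwd, rel)
back[("big", "s1")]    # weight 3: crowded hub, lower proximity through it
back[("small", "s4")]  # weight 1: small department, higher proximity
```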
We may restrict the information node to be from a selected set of nodes of the graph; for example, we may exclude the nodes corresponding to the tuples from a specified set of relations, such as Writes, which we believe are not meaningful root nodes (this is similar to the scheme in [13]). In the example from Figure 1(B), let the keyword nodes be SunitaS, SoumenC and ByronD. These nodes, which are author nodes, have a relationship induced by the paper node ChakrabartiSD98. The tree shown in Figure 1(B) (with backward edges from the Paper node to the Writes nodes) would be a connection tree for the keyword nodes, with the paper node as the information node.

We incorporate another interesting feature, namely node weights, inspired by prestige rankings such as PageRank in Google [4]. With this feature, nodes that have multiple pointers to them get a higher prestige. In our current implementation we set the node prestige to the indegree of the node. A higher node weight corresponds to higher prestige. E.g., in a bibliography database containing citation information, if the user gives the query "query optimization", our technique would give higher prestige to the papers with more citations. As another example, in a TPC-D database storing information about parts, suppliers, customers and orders, the orders information contains references to parts, suppliers and customers. As a result, if a query matches two parts (or suppliers, or customers), the one with more orders would get a higher prestige.

Node weights and tree weights need to be combined to
get an overall relevance score as discussed in Section 2.3.
2.2. Formal Database Model

Based on the discussion thus far, our model comprises nodes with node weights, edges with forward and backward edge weights, and a similarity measure between relations.

Nodes/vertices: For each tuple t in the database, the graph has a corresponding node u_t. We will speak interchangeably of a tuple and the corresponding node in the graph.

Node weights: Each node u in the graph is assigned a weight N(u) which depends upon the prestige of the node. In our current implementation we set the node prestige to a function of the indegree of the node. Extensions to handle transfer of prestige (as is done, e.g., in Google's PageRank [4]) can be easily added to the model.
Edges: For each pair of tuples t1 and t2 such that there is a foreign key from t1 to t2, the graph contains an edge from u_t1 to u_t2 and a back edge from u_t2 to u_t1 (this can be extended to handle other types of connections).

Similarity between relations: Let s(R1, R2) be the (generally asymmetric) similarity from relation R1 to relation R2, where R1 is the referencing relation and R2 is the referenced relation. The similarity s(R1, R2) depends upon the type of the link from relation R1 to relation R2. s(R1, R2) is set to infinity if relation R1 does not refer to relation R2.
Edge weights: In our model, the weight of a forward link along a foreign key relationship reflects the strength of the proximity relationship between two tuples and is set to 1 by default. It can be set to any desired value to reflect the importance of the link (small values correspond to greater proximity).

Consider two nodes u and v in the database. Let R(u) and R(v) be the respective relations that they belong to. The weight of the directed edge (u, v) depends on two conditions: whether the database has a link from u to v, and whether it has a link from v to u. Neither, one or both links may exist.

If (u, v) exists but (v, u) does not, we can simply assign the weight s(R(u), R(v)) to (u, v). If (u, v) does not exist and (v, u) does, according to our earlier arguments, we ought to assign the weight s(R(v), R(u)) * IN_R(v)(u) to (u, v), where IN_R(v)(u) is the indegree of u contributed by the tuples belonging to relation R(v).

If both (u, v) and (v, u) exist in the graph, we assign the weight of (u, v) as the minimum of the two values, i.e.,

    w(u, v) = min( s(R(u), R(v)),  s(R(v), R(u)) * IN_R(v)(u) )    (1)

Other choices are possible. For instance, if one were to view the two weights as resistances in an electrical network, one may use the equivalent parallel resistance.
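Equation (1) can be sketched directly (illustrative Python; the toy similarity function, relations and indegrees below are our own example data, chosen so that Writes refers to Paper but not vice versa):

```python
import math

def edge_weight(u, v, relation_of, s, indegree_from):
    """Weight of the directed edge (u, v) per Equation (1): the forward
    similarity s(R(u), R(v)) when u's relation refers to v's, the
    indegree-scaled backward term s(R(v), R(u)) * IN_{R(v)}(u) when the
    reverse link exists, and the minimum when both do. s() returns
    math.inf for relation pairs with no reference."""
    ru, rv = relation_of[u], relation_of[v]
    forward = s(ru, rv)
    srv = s(rv, ru)
    # Guard the backward term: inf * 0 would be NaN in Python.
    backward = math.inf if math.isinf(srv) else srv * indegree_from(u, rv)
    return min(forward, backward)

# Toy similarity: Writes refers to Paper with similarity 1; nothing else.
def s(r1, r2):
    return 1.0 if (r1, r2) == ("Writes", "Paper") else math.inf

rel = {"w1": "Writes", "p1": "Paper"}

def indegree_from(node, from_relation):
    # p1 is referenced by three Writes tuples; w1 is referenced by nothing.
    return 3 if (node, from_relation) == ("p1", "Writes") else 0

edge_weight("w1", "p1", rel, s, indegree_from)  # forward FK edge: 1.0
edge_weight("p1", "w1", rel, s, indegree_from)  # back edge: 1.0 * 3 = 3.0
```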
2.3. Query and Answer Model

Generally, a query consists of n (n >= 1) search terms t1, t2, ..., tn. The first step is to locate nodes matching the search terms. A node is relevant to a search term if it contains the search term as part of an attribute value or metadata (such as column, table or view names). E.g., all tuples belonging to a relation named AUTHOR would be regarded as relevant to the keyword 'author'. For each search term ti in the query we find the set of nodes Si that are relevant to ti. Let S = S1 ∪ S2 ∪ ... ∪ Sn.

Extensions of the model to incorporate queries such as "author:Levy", which would require the keyword "Levy" to be in an author name attribute, can be easily incorporated. Approximate matching of keywords to words present in tuples can also be supported, by extending the model to incorporate node relevances. These features are not currently implemented in our prototype, and we omit further details.

An answer to a query is a rooted directed tree containing at least one node from each Si. Note that the tree may also contain nodes not in any Si and is therefore a Steiner tree. The relevance score of an answer tree is computed from the relevance scores of its nodes and its edge weights. (The condition that one node from each Si must be present can be relaxed to allow answers containing only some of the given keywords.)
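The answer condition above can be checked mechanically: a rooted tree is an answer exactly when every keyword set Si has at least one member reachable from the root. A minimal illustrative sketch (our own child-map encoding of the tree, not the BANKS data structures):

```python
def is_answer(root, children, keyword_node_sets):
    """An answer is a rooted directed tree with a path from the root to at
    least one node from each keyword set S_i; intermediate (Steiner)
    nodes that match no keyword are allowed."""
    reachable, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n not in reachable:
            reachable.add(n)
            stack.extend(children.get(n, ()))
    return all(reachable & s_i for s_i in keyword_node_sets)

# Figure 1(B)-style tree: the paper is the root (information node), the
# Writes tuples are intermediate nodes, the authors are keyword nodes.
tree = {"paper": ["w1", "w2"], "w1": ["sunita"], "w2": ["soumen"]}
is_answer("paper", tree, [{"sunita"}, {"soumen"}])   # True
is_answer("paper", tree, [{"sunita"}, {"levy"}])     # False
```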
Figure 2 shows a sample result of a query containing the keywords soumen and sunita executed on the bibliographic database. Each result is a tree containing node tuples (including intermediate nodes) along with the respective table names and column names. Indentation is used to depict the tree structure, and nodes containing keywords are distinguished from intermediate nodes by the color of the nodes.

Each answer tree has to be assigned a relevance score, and answers have to be presented in decreasing order of that score. Scoring involves a combination of relevance clues from nodes and edges. Node weights and edge weights provide two separate measures of relevance. We desire a final relevance score in the range [0, 1]. We also wish to control the variation in individual weights so that a few nodes or edges with very large weights do not skew the results excessively. We therefore take the following approach.
- We scale individual node weights to N_max, the maximum node weight in the graph. We can additionally depress the scale using logarithms (as with 'IDF' weighting in Information Retrieval); accordingly, the normalized score Nscore(u) of a node u is defined as N(u)/N_max or log(1 + N(u)/N_max) respectively. These are both scale-free quantities in [0, 1] (if the log is to base 2).

Figure 2. Result of query “soumen sunita”
To get the overall node score Nscore, we take the average of the node scores. To favor meaningful root nodes, and to reduce the effect of intermediate nodes, we consider only the leaf nodes (containing the keywords) and the root node when computing the average. A node containing multiple search terms is counted as many times as the number of search terms it contains, to avoid giving extra weight to trees with separate nodes for each keyword.
- We get the normalized edge score Escore(e) of an edge e by dividing the edge weight by w_min, the minimum edge weight in the graph, to make it scale-free, and may additionally depress the scale by defining the edge score of e as log(1 + w(e)/w_min).

The overall edge score is then defined to be Escore = 1/(1 + Σ_e Escore(e)), since we wish to give lower relevance to large trees. This quantity is also in the range (0, 1].
- Finally, we can combine the overall edge score and node score, to get an overall relevance score, either by addition or by multiplication; in both cases, a factor λ controls their relative weightage. The additive combination uses the formula (1 − λ) · Escore + λ · Nscore, while the multiplicative combination uses the formula Escore · Nscore^λ.
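Putting the scoring pieces together, a minimal sketch (illustrative Python with our own function names; λ and the log-scaling switch are free parameters, and the example weights below are invented for the demonstration):

```python
import math

def node_score(weight, max_weight, log_scale=False):
    """Normalized node score: N(u)/N_max, or log2(1 + N(u)/N_max) when
    the scale is depressed; both lie in [0, 1]."""
    ratio = weight / max_weight
    return math.log2(1 + ratio) if log_scale else ratio

def edge_score(weight, min_weight, log_scale=False):
    """Normalized edge score: w(e)/w_min, optionally log-depressed."""
    ratio = weight / min_weight
    return math.log2(1 + ratio) if log_scale else ratio

def tree_relevance(root_and_leaf_weights, edge_weights,
                   max_node_weight, min_edge_weight,
                   lam=0.5, multiplicative=False):
    """Overall relevance of an answer tree: Nscore is the average score
    of the root and leaf nodes, Escore = 1 / (1 + sum of edge scores),
    combined additively as (1-lam)*Escore + lam*Nscore or
    multiplicatively as Escore * Nscore**lam."""
    nscore = sum(node_score(w, max_node_weight)
                 for w in root_and_leaf_weights) / len(root_and_leaf_weights)
    escore = 1.0 / (1.0 + sum(edge_score(w, min_edge_weight)
                              for w in edge_weights))
    return escore * nscore ** lam if multiplicative else \
        (1 - lam) * escore + lam * nscore

# A 3-edge tree whose root and two leaves all carry the maximum node
# weight: Nscore = 1.0, Escore = 1/(1+3) = 0.25, additive score 0.625.
tree_relevance([10, 10, 10], [1, 1, 1],
               max_node_weight=10, min_edge_weight=1, lam=0.5)  # 0.625
```

Larger trees accumulate more edge score and hence a smaller Escore, matching the stated goal of giving lower relevance to large trees.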
There are a total of eight combinations, since we have three options (for edge score, node score and combination), each of which can take two values. In our evaluation we discarded three combinations—those that involve log scaling and multiplication, as these scores tended to become quite small—and compared the remaining combinations.

While inspired by standard IR weighting and smoothing practice, the choices and parameters above are somewhat ad hoc, but this appears to be inescapable in all related systems that we have reviewed [7, 17].
3. Searching for the Best Answers

The computation of minimum Steiner trees is already a hard (NP-complete) problem, and it is made more complicated by the node weight considerations required to compute the overall relevance of a tree. We are interested not just in the most relevant tree, but also in other trees with high relevance scores, since they may be part of what the user is searching for. We also wish to generate answers incrementally, to avoid generating answers of low relevance that the user may never look at.

In this section, we present an outline of the backward expanding search algorithm, which offers a heuristic solution for incrementally computing query results. Complete details can be found in the full version of the paper [3].

We assume that the graph fits in memory. This is not unreasonable, even for moderately large databases, because the in-memory node representation need not store any attribute of the corresponding tuple other than the RID. The only other in-memory structure is an index to map RIDs to the graph nodes. Indices to map keywords to RIDs can be disk resident. As a result, the graphs of even large databases with millions of nodes and edges can fit in modest amounts of memory.
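The compact in-memory representation described above can be sketched as follows (illustrative Python; the class layout is our own assumption of one reasonable encoding, with only RIDs and adjacency kept in memory while attribute values stay on disk):

```python
class CompactGraph:
    """In-memory graph where each node stores only the row identifier
    (RID) of its tuple. Adjacency is kept as integer node indices, and a
    dictionary maps RIDs back to node indices, mirroring the paper's
    RID-to-node index."""

    def __init__(self):
        self.rids = []          # node index -> RID
        self.adj = []           # node index -> list of neighbor indices
        self.index_of = {}      # RID -> node index

    def add_node(self, rid):
        if rid not in self.index_of:
            self.index_of[rid] = len(self.rids)
            self.rids.append(rid)
            self.adj.append([])
        return self.index_of[rid]

    def add_edge(self, rid_u, rid_v):
        self.adj[self.add_node(rid_u)].append(self.add_node(rid_v))

g = CompactGraph()
g.add_edge("writes:1", "paper:7")
g.add_edge("writes:2", "paper:7")
# Three nodes total; no tuple attributes are held in memory, only RIDs.
```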
Given a set of keywords, first we find, for each keyword term ti, the set of nodes Si that are relevant to the keyword, by using disk resident indices on keywords. Let S = ∪i Si. The backward expanding search algorithm concurrently runs |S| copies of Dijkstra's single source shortest path algorithm, one for each keyword node n in S, with n as the source. The copies of the algorithm