
Keyword Searching and Browsing in Databases using BANKS
Gaurav Bhalotia
Arvind Hulgeri
Charuta Nakhe
Soumen Chakrabarti
S. Sudarshan
Computer Science and Engg. Dept., I.I.T. Bombay
bhalotia@cs.berkeley.edu, charuta@pspl.co.in, {aru,soumen,sudarsha}@cse.iitb.ac.in
Abstract

With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results.

BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
1. Introduction

Relational databases are commonly searched using structured query languages. The user needs to know the data schema to be able to ask suitable queries. Search engines on the Web have popularized an alternative unstructured querying and browsing paradigm that is simple and user-friendly. Users type in keywords and follow hyperlinks to navigate from one document to the other. No knowledge of the schema is needed.

(Author footnotes: Current affiliation: University of California, Berkeley. Supported by an Infosys Fellowship. Current affiliation: Persistent Systems Pvt. Ltd., Pune, India. Partly supported by an IBM Faculty Partnership Grant.)
With the growth of the World Wide Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or query languages; even relatively simple query languages designed for non-experts are too complicated for such users. Query languages for semi-structured/XML data are even more complex, increasing the impedance mismatch further.

Unfortunately, keyword search techniques used for locating information from collections of (Web) documents cannot be used on data stored in databases. In relational databases, information needed to answer a keyword query is often split across tables/tuples, due to normalization. As an example, consider the bibliographic database shown in Figure 1. This database contains paper titles, their authors and citations extracted from the DBLP repository. The schema is shown in Figure 1(A). Figure 1(B) shows a fragment of the DBLP database. It depicts partial information—paper title and authors—about a particular paper. As we can see, the information is distributed across seven tuples related through foreign key references. A user looking for this paper may use queries like "sunita temporal" or "soumen sunita". In keyword-based search, we need to identify tuples containing the keywords and ascertain their proximity through links.

Answers to keyword queries on the Web are often only the starting point for further browsing to locate required information. Similar browsing facilities are needed in the context of searching for information from databases.
In this paper, we describe techniques for keyword searching and browsing on databases that we have developed as part of the BANKS system (BANKS is an acronym for Browsing ANd Keyword Searching). The BANKS system enables data and schema browsing together with keyword-based search for relational databases. BANKS enables a user to get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results; absolutely no query language or programming is required.

[Figure 1. The DBLP Bibliography Database. (A) The schema: Paper(PaperId, PaperName, ...), Writes(AuthorId, PaperId, ...), Author(AuthorId, AuthorName, ...) and Cites(Citing, Cited, ...), connected by primary-key/foreign-key links. (B) A fragment of the database: the paper tuple ChakrabartiSD98 ("Mining Surprising Patterns Using Temporal Description Length") linked through three Writes tuples to the author tuples for Soumen Chakrabarti (SoumenC), Sunita Sarawagi (SunitaS) and Byron Dom (ByronD).]
The contributions of this paper are as follows:

1. We outline a framework for keyword querying of relational databases. Our framework makes joins implicit and transparent, and incorporates notions of proximity and prestige when ranking answers. There has been a fair amount of earlier work on keyword querying of databases, including [6, 7, 12, 13]. We describe the connections of the BANKS model to related work on keyword search in Section 6.

2. We present novel, efficient heuristic algorithms for executing keyword queries.

3. We describe key features of the BANKS system.

Keyword searching in BANKS is done using proximity-based ranking, based on foreign key links and other types of links. We model the database as a graph, with the tuples as nodes and cross references between them as edges. BANKS allows query keywords to match data (tokens appearing in any textual attribute) and metadata (e.g., column or relation names).
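To make the data/metadata matching concrete, here is a minimal illustrative sketch (in Python, not the BANKS Java implementation; the inverted-index layout and the `(relation, row)` node identifiers are our own assumptions for the example):

```python
from collections import defaultdict

def build_keyword_index(rows_by_relation):
    """Map each lowercased token to the set of matching nodes. A keyword
    can match data (tokens in any textual attribute) or metadata
    (relation and column names); a metadata match makes every tuple of
    that relation relevant to the keyword."""
    index = defaultdict(set)
    for relation, rows in rows_by_relation.items():
        for i, row in enumerate(rows):
            node = (relation, i)
            index[relation.lower()].add(node)        # relation-name match
            for column, value in row.items():
                index[column.lower()].add(node)      # column-name match
                for token in str(value).lower().split():
                    index[token].add(node)           # data token match
    return index

idx = build_keyword_index({
    "Author": [{"AuthorName": "Sunita Sarawagi"},
               {"AuthorName": "Soumen Chakrabarti"}],
})
# idx["author"] contains both tuples (metadata match);
# idx["sunita"] contains only the first (data match).
```

In this sketch the keyword 'author' is relevant to every Author tuple, mirroring the behavior described above for metadata matches.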
The greatest value of BANKS lies in near-zero-effort Web publishing of relational data which would otherwise remain invisible to the Web [2]. BANKS may be used to publish organizational data, bibliographic data, and electronic catalogs. Search facilities for such applications can be hand-crafted: many Web sites provide forms to carry out limited types of queries on their backend databases. For example, a university Web site may provide a form interface to search for faculty and students. Searching for departments would require yet another form, as would searching for courses offered. Creating an interface for each such task is laborious, and is also confusing to users, since they must first expend effort finding which form to use.

An approach taken in some cases is to export data from the database to Web pages, and then provide text search on the Web documents. This approach results in duplication of data, with the resultant problems of keeping the versions up-to-date, in addition to space and time overheads. Further, with highly connected data, it is not feasible to export every possible combination. For instance, a bibliographic database can export details of each paper as a Web document, but a query that requires finding a citation link between two papers would not be supported.

BANKS provides a rich interface to browse data, and automatically generates hyperlinks, corresponding to foreign keys and other links, on displayed results. BANKS helps create hierarchical and graphical views of data with hyperlink facilities built in. The BANKS system is developed in Java using servlets and JDBC, and can be run on any schema without any programming. BANKS is accessible over the Web at the URL: http://www.cse.iitb.ac.in/banks/
The rest of the paper is organized as follows: Section 2 outlines our graph model for representing connectivity information, as well as our model for answer relevance. Section 3 outlines an algorithm for incrementally finding the best answers to keyword queries. We present an overview of the browsing features of BANKS in Section 4. Section 5 outlines a preliminary evaluation of our system. We discuss related work in Section 6. Section 7 outlines directions for future work and Section 8 concludes the paper.
2. Database and Query Model

In this section we describe how a relational database is modeled as a graph in the BANKS system. First we evaluate the various options available and describe our model informally, and then we formalize it.

2.1. Informal Model Description

We model the database as a directed graph, with each tuple in the database as a node in the graph. Each foreign-key–primary-key link is modeled as a directed edge between the corresponding tuples. This can be easily extended to other types of connections; for example, we can extend the model to include edges corresponding to inclusion dependencies, where the values in the referencing column of the referencing table are contained in the referred column of the referred table, but the referred column need not be a key of the referred table.
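As an illustration of this tuple-graph construction, the fragment in Figure 1(B) might be built as follows (a minimal Python sketch under our own representational assumptions—node identifiers as `(relation, key)` pairs—not the BANKS implementation, which is in Java):

```python
from collections import defaultdict

# Toy Writes tuples mirroring Figure 1(B): (AuthorId, PaperId) pairs.
writes = [("SoumenC", "ChakrabartiSD98"),
          ("SunitaS", "ChakrabartiSD98"),
          ("ByronD", "ChakrabartiSD98")]

def build_graph(writes_rows):
    """Each tuple is a node; each foreign-key-to-primary-key reference
    becomes a directed edge from the referencing tuple (a Writes row)
    to the referenced tuple (an Author or Paper row)."""
    edges = defaultdict(set)
    for author_id, paper_id in writes_rows:
        w = ("Writes", (author_id, paper_id))
        edges[w].add(("Author", author_id))  # Writes.AuthorId -> Author
        edges[w].add(("Paper", paper_id))    # Writes.PaperId  -> Paper
    return edges

g = build_graph(writes)
# Three Writes nodes, each with edges to one author node and the paper node.
```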
Intuitively, an answer to a query should be a subgraph connecting nodes matching the keywords. However, just by looking at a subgraph it is not apparent what information it conveys. We wish to identify a node in the graph as a central node that connects all the keyword nodes and strongly reflects the relationship amongst them. We therefore consider an answer to be a rooted directed tree containing a directed path from the root to each keyword node. (The motivation for directionality is outlined later in this section.) We call the root node an information node and the tree a connection tree. Conceptually this model is similar to the one described in [13], although there are several differences, which are detailed in Section 6.

In general, the importance of a link depends upon the type of the link, i.e., what relations it connects, and on its semantics; for example, in the bibliographic database, the link between the Paper table and the Writes table is seen as a stronger link than the link between the Paper table and the Cites table. The link between the Paper and Cites tables would therefore have a higher weight. The weight of a tree is proportional to the total of its edge weights, and the relevance of a tree is inversely related to its weight.
The example in Figure 1 illustrates that some links point towards the root of the tree, instead of away from the root as required by our model. For instance, the Writes relation has foreign keys to the Paper and Author relations, whereas we require paths from Paper to Author, traversing a foreign key edge in the opposite direction. However, we cannot simply regard the edges as undirected.

Ignoring directionality would cause problems because of "hubs" which are connected to a large number of nodes. For example, in a university database a department with a large number of faculty and students would act as a hub. As a result, many nodes would be within a short distance of many other nodes, reducing the effectiveness of proximity-based scoring.
To solve the problem, we create for each link/edge (u, v) a backward edge (v, u) with a different edge weight; we model the weight of (v, u) as directly proportional to the number of links to v from the nodes of the same type as u. (Equations for computing the weights are presented in Section 2.2.) In the example from Figure 1, the backward edges ensure that there is a directed tree rooted at the paper, with a path to each leaf. To illustrate the effect of backward edge weights, let us return to the university department example. A forward edge from a student to her department and a back edge from the department to another student would form a path between each pair of students in the department. If there are more students in a department, the back edges would be assigned a higher weight, resulting in lower proximity (due to the department) for each pair of students than if there are fewer students registered. In contrast, in the bibliographic database, papers (typically) have smaller numbers of authors, and the backward edge weights from Paper to Writes nodes would be lower, resulting in higher proximity between co-authors.
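The backward-edge scheme can be sketched as follows (illustrative Python; here we take the back-edge weight to be exactly the per-relation indegree, which is only one instance of the "directly proportional" rule above):

```python
from collections import defaultdict

def add_backward_edges(forward_edges, relation_of):
    """For every forward edge (u, v), add a back edge (v, u) whose weight
    is proportional to the number of links v receives from nodes of the
    same relation as u. Hubs such as a large department thus get costly
    back edges, lowering proximity between the nodes they connect."""
    indegree = defaultdict(int)            # (v, relation of u) -> count
    for u, v in forward_edges:
        indegree[(v, relation_of[u])] += 1
    return {(v, u): indegree[(v, relation_of[u])] for u, v in forward_edges}

# A department referenced by three students vs. one referenced by one:
rel = {"s1": "Student", "s2": "Student", "s3": "Student", "s4": "Student",
       "big": "Dept", "small": "Dept"}
fwd = [("s1", "big"), ("s2", "big"), ("s3", "big"), ("s4", "small")]
back = add_backward_edges(fwd, rel)
back[("big", "s1")]    # weight 3: crowded hub, lower proximity through it
back[("small", "s4")]  # weight 1: small department, higher proximity
```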
We may restrict the information node to be from a selected set of nodes of the graph; for example, we may exclude the nodes corresponding to the tuples from a specified set of relations, such as Writes, which we believe are not meaningful root nodes (this is similar to the scheme in [13]). In the example from Figure 1(B), let the keyword nodes be SunitaS, SoumenC and ByronD. These nodes, which are author nodes, have a relationship induced by the paper node ChakrabartiSD98. The tree shown in Figure 1(B) (with backward edges from the Paper node to the Writes nodes) would be a connection tree for the keyword nodes, with the paper node as the information node.

We incorporate another interesting feature, namely node weights, inspired by prestige rankings such as PageRank in Google [4]. With this feature, nodes that have multiple pointers to them get a higher prestige. In our current implementation we set the node prestige to the indegree of the node. A higher node weight corresponds to higher prestige. E.g., in a bibliography database containing citation information, if the user gives the query "query optimization", our technique would give higher prestige to the papers with more citations. As another example, in a TPC-D database storing information about parts, suppliers, customers and orders, the orders information contains references to parts, suppliers and customers. As a result, if a query matches two parts (or suppliers, or customers), the one with more orders would get a higher prestige.

Node weights and tree weights need to be combined to
get an overall relevance score as discussed in Section 2.3.
2.2. Formal Database Model

Based on the discussion thus far, our model comprises nodes with node weights, edges with forward and backward edge weights, and a similarity measure between relations.

Nodes/vertices: For each tuple t in the database, the graph has a corresponding node u_t. We will speak interchangeably of a tuple and the corresponding node in the graph.

Node weights: Each node u in the graph is assigned a weight N(u) which depends upon the prestige of the node. In our current implementation we set the node prestige to a function of the indegree of the node. Extensions to handle transfer of prestige (as is done, e.g., in Google's PageRank [4]) can be easily added to the model.
Edges: For each pair of tuples t1 and t2 such that there is a foreign key from t1 to t2, the graph contains an edge from u_t1 to u_t2 and a back edge from u_t2 to u_t1 (this can be extended to handle other types of connections).

Similarity between relations: Let s(R1, R2) be the (generally asymmetric) similarity from relation R1 to relation R2, where R1 is the referencing relation and R2 is the referenced relation. The similarity s(R1, R2) depends upon the type of the link from relation R1 to relation R2. s(R1, R2) is set to infinity if relation R1 does not refer to relation R2.
Edge weights: In our model, the weight of a forward link along a foreign key relationship reflects the strength of the proximity relationship between two tuples and is set to 1 by default. It can be set to any desired value to reflect the importance of the link (small values correspond to greater proximity).

Consider two nodes u and v in the database. Let R(u) and R(v) be the respective relations that they belong to. The weight of the directed edge (u, v) depends on two conditions: whether the database has a link from u to v, and whether it has a link from v to u. Neither, one or both links may exist.

If (u, v) exists but (v, u) does not, we can simply assign the weight s(R(u), R(v)) to (u, v). If (u, v) does not exist and (v, u) does, according to our earlier arguments, we ought to assign the weight s(R(v), R(u)) * IN_R(v)(u) to (u, v), where IN_R(v)(u) is the indegree of u contributed by the tuples belonging to relation R(v).

If both (u, v) and (v, u) exist in the graph, we assign the weight of (u, v) as the minimum of the two values, i.e.,

    w(u, v) = min( s(R(u), R(v)),  s(R(v), R(u)) * IN_R(v)(u) )    (1)

Other choices are possible. For instance, if one were to view the two weights as resistances in an electrical network, one may use the equivalent parallel resistance.
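Equation (1) can be sketched directly (illustrative Python; the toy similarity function, relations and indegrees below are our own example data, chosen so that Writes refers to Paper but not vice versa):

```python
import math

def edge_weight(u, v, relation_of, s, indegree_from):
    """Weight of the directed edge (u, v) per Equation (1): the forward
    similarity s(R(u), R(v)) when u's relation refers to v's, the
    indegree-scaled backward term s(R(v), R(u)) * IN_{R(v)}(u) when the
    reverse link exists, and the minimum when both do. s() returns
    math.inf for relation pairs with no reference."""
    ru, rv = relation_of[u], relation_of[v]
    forward = s(ru, rv)
    srv = s(rv, ru)
    # Guard the backward term: inf * 0 would be NaN in Python.
    backward = math.inf if math.isinf(srv) else srv * indegree_from(u, rv)
    return min(forward, backward)

# Toy similarity: Writes refers to Paper with similarity 1; nothing else.
def s(r1, r2):
    return 1.0 if (r1, r2) == ("Writes", "Paper") else math.inf

rel = {"w1": "Writes", "p1": "Paper"}

def indegree_from(node, from_relation):
    # p1 is referenced by three Writes tuples; w1 is referenced by nothing.
    return 3 if (node, from_relation) == ("p1", "Writes") else 0

edge_weight("w1", "p1", rel, s, indegree_from)  # forward FK edge: 1.0
edge_weight("p1", "w1", rel, s, indegree_from)  # back edge: 1.0 * 3 = 3.0
```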
2.3. Query and Answer Model

Generally, a query consists of n (n >= 1) search terms t1, t2, ..., tn. The first step is to locate nodes matching the search terms. A node is relevant to a search term if it contains the search term as part of an attribute value or metadata (such as column, table or view names). E.g., all tuples belonging to a relation named AUTHOR would be regarded as relevant to the keyword 'author'. For each search term ti in the query we find the set of nodes Si that are relevant to ti. Let S = S1 ∪ S2 ∪ ... ∪ Sn.

Extensions of the model to incorporate queries such as "author:Levy", which would require the keyword "Levy" to be in an author name attribute, can be easily incorporated. Approximate matching of keywords to words present in tuples can also be supported, by extending the model to incorporate node relevances. These features are not currently implemented in our prototype, and we omit further details.

An answer to a query is a rooted directed tree containing at least one node from each Si. Note that the tree may also contain nodes not in any Si and is therefore a Steiner tree. The relevance score of an answer tree is computed from the relevance scores of its nodes and its edge weights. (The condition that one node from each Si must be present can be relaxed to allow answers containing only some of the given keywords.)
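The answer condition above can be checked mechanically: a rooted tree is an answer exactly when every keyword set Si has at least one member reachable from the root. A minimal illustrative sketch (our own child-map encoding of the tree, not the BANKS data structures):

```python
def is_answer(root, children, keyword_node_sets):
    """An answer is a rooted directed tree with a path from the root to at
    least one node from each keyword set S_i; intermediate (Steiner)
    nodes that match no keyword are allowed."""
    reachable, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n not in reachable:
            reachable.add(n)
            stack.extend(children.get(n, ()))
    return all(reachable & s_i for s_i in keyword_node_sets)

# Figure 1(B)-style tree: the paper is the root (information node), the
# Writes tuples are intermediate nodes, the authors are keyword nodes.
tree = {"paper": ["w1", "w2"], "w1": ["sunita"], "w2": ["soumen"]}
is_answer("paper", tree, [{"sunita"}, {"soumen"}])   # True
is_answer("paper", tree, [{"sunita"}, {"levy"}])     # False
```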
Figure 2 shows a sample result of a query containing the keywords soumen and sunita executed on the bibliographic database. Each result is a tree containing node tuples (including intermediate nodes) along with the respective table names and column names. Indentation is used to depict the tree structure, and nodes containing keywords are distinguished from intermediate nodes by the color of the nodes.

Each answer tree has to be assigned a relevance score, and answers have to be presented in decreasing order of that score. Scoring involves a combination of relevance clues from nodes and edges. Node weights and edge weights provide two separate measures of relevance. We desire a final relevance score in the range [0, 1]. We also wish to control the variation in individual weights so that a few nodes or edges with very large weights do not skew the results excessively. We therefore take the following approach.
- We scale individual node weights to N_max, the maximum node weight in the graph. We can additionally depress the scale using logarithms (as with 'IDF' weighting in Information Retrieval); accordingly, the normalized score Nscore(u) of a node u is defined as N(u)/N_max or log(1 + N(u)/N_max) respectively. These are both scale-free quantities in [0, 1] (if the log is to base 2).

Figure 2. Result of query “soumen sunita”
To get the overall node score Nscore, we take the average of the node scores. To favor meaningful root nodes, and to reduce the effect of intermediate nodes, we consider only the leaf nodes (containing the keywords) and the root node when computing the average. A node containing multiple search terms is counted as many times as the number of search terms it contains, to avoid giving extra weight to trees with separate nodes for each keyword.
- We get the normalized edge score Escore(e) of an edge e by dividing the edge weight by w_min, the minimum edge weight in the graph, to make it scale-free, and may additionally depress the scale by defining the edge score of e as log(1 + w(e)/w_min).

The overall edge score is then defined to be Escore = 1/(1 + Σ_e Escore(e)), since we wish to give lower relevance to large trees. This quantity is also in the range (0, 1].
- Finally, we can combine the overall edge score and node score, to get an overall relevance score, either by addition or by multiplication; in both cases, a factor λ controls their relative weightage. The additive combination uses the formula (1 − λ) · Escore + λ · Nscore, while the multiplicative combination uses the formula Escore · Nscore^λ.
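Putting the scoring pieces together, a minimal sketch (illustrative Python with our own function names; λ and the log-scaling switch are free parameters, and the example weights below are invented for the demonstration):

```python
import math

def node_score(weight, max_weight, log_scale=False):
    """Normalized node score: N(u)/N_max, or log2(1 + N(u)/N_max) when
    the scale is depressed; both lie in [0, 1]."""
    ratio = weight / max_weight
    return math.log2(1 + ratio) if log_scale else ratio

def edge_score(weight, min_weight, log_scale=False):
    """Normalized edge score: w(e)/w_min, optionally log-depressed."""
    ratio = weight / min_weight
    return math.log2(1 + ratio) if log_scale else ratio

def tree_relevance(root_and_leaf_weights, edge_weights,
                   max_node_weight, min_edge_weight,
                   lam=0.5, multiplicative=False):
    """Overall relevance of an answer tree: Nscore is the average score
    of the root and leaf nodes, Escore = 1 / (1 + sum of edge scores),
    combined additively as (1-lam)*Escore + lam*Nscore or
    multiplicatively as Escore * Nscore**lam."""
    nscore = sum(node_score(w, max_node_weight)
                 for w in root_and_leaf_weights) / len(root_and_leaf_weights)
    escore = 1.0 / (1.0 + sum(edge_score(w, min_edge_weight)
                              for w in edge_weights))
    return escore * nscore ** lam if multiplicative else \
        (1 - lam) * escore + lam * nscore

# A 3-edge tree whose root and two leaves all carry the maximum node
# weight: Nscore = 1.0, Escore = 1/(1+3) = 0.25, additive score 0.625.
tree_relevance([10, 10, 10], [1, 1, 1],
               max_node_weight=10, min_edge_weight=1, lam=0.5)  # 0.625
```

Larger trees accumulate more edge score and hence a smaller Escore, matching the stated goal of giving lower relevance to large trees.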
There are a total of eight combinations, since we have three options (for edge score, node score and combination), each of which can take two values. In our evaluation we discarded three combinations—those that involve log scaling and multiplication, as these scores tended to become quite small—and compared the remaining combinations.

While inspired by standard IR weighting and smoothing practice, the choices and parameters above are somewhat ad hoc, but this appears to be inescapable in all related systems that we have reviewed [7, 17].
3. Searching for the Best Answers

The computation of minimum Steiner trees is already a hard (NP-complete) problem, and it is made more complicated by the node weight considerations required to compute the overall relevance of a tree. We are interested not just in the most relevant tree, but also in other trees with high relevance scores, since they may be part of what the user is searching for. We also wish to generate answers incrementally, to avoid generating answers of low relevance that the user may never look at.

In this section, we present an outline of the backward expanding search algorithm, which offers a heuristic solution for incrementally computing query results. Complete details can be found in the full version of the paper [3].

We assume that the graph fits in memory. This is not unreasonable, even for moderately large databases, because the in-memory node representation need not store any attribute of the corresponding tuple other than the RID. The only other in-memory structure is an index to map RIDs to the graph nodes. Indices to map keywords to RIDs can be disk resident. As a result, the graphs of even large databases with millions of nodes and edges can fit in modest amounts of memory.
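The compact in-memory representation described above can be sketched as follows (illustrative Python; the class layout is our own assumption of one reasonable encoding, with only RIDs and adjacency kept in memory while attribute values stay on disk):

```python
class CompactGraph:
    """In-memory graph where each node stores only the row identifier
    (RID) of its tuple. Adjacency is kept as integer node indices, and a
    dictionary maps RIDs back to node indices, mirroring the paper's
    RID-to-node index."""

    def __init__(self):
        self.rids = []          # node index -> RID
        self.adj = []           # node index -> list of neighbor indices
        self.index_of = {}      # RID -> node index

    def add_node(self, rid):
        if rid not in self.index_of:
            self.index_of[rid] = len(self.rids)
            self.rids.append(rid)
            self.adj.append([])
        return self.index_of[rid]

    def add_edge(self, rid_u, rid_v):
        self.adj[self.add_node(rid_u)].append(self.add_node(rid_v))

g = CompactGraph()
g.add_edge("writes:1", "paper:7")
g.add_edge("writes:2", "paper:7")
# Three nodes total; no tuple attributes are held in memory, only RIDs.
```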
Given a set of keywords, first we find, for each keyword term ti, the set of nodes Si that are relevant to the keyword, by using disk resident indices on keywords. Let S = ∪i Si. The backward expanding search algorithm concurrently runs |S| copies of Dijkstra's single source shortest path algorithm, one for each keyword node n in S, with n as the source. The copies of the algorithm