Showing papers by "Soumen Chakrabarti published in 2002"


Proceedings ArticleDOI
26 Feb 2002
TL;DR: The paper describes BANKS, a system that enables keyword-based search on relational databases together with data and schema browsing, and presents an efficient heuristic algorithm for finding and ranking query results.
Abstract: With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
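As a rough illustration of the graph model and proximity ranking sketched in this abstract, here is a minimal Python sketch of a backward-expanding search over a toy tuple graph. The node names, edge weights, and keyword index are invented, and the real BANKS heuristic treats edge direction and node prestige more carefully; this only shows the basic idea of growing shortest-path trees from keyword-matching tuples toward candidate answer roots.

```python
import heapq
from collections import defaultdict

# Hypothetical toy tuple graph: node -> list of (neighbor, edge weight).
# In BANKS, nodes are database tuples and edges come from foreign-key links.
graph = {
    "paper:1": [("author:chakrabarti", 1.0), ("conf:sigmod", 1.0)],
    "author:chakrabarti": [("paper:1", 1.0), ("paper:2", 1.0)],
    "paper:2": [("author:chakrabarti", 1.0), ("conf:vldb", 1.0)],
    "conf:sigmod": [("paper:1", 1.0)],
    "conf:vldb": [("paper:2", 1.0)],
}

# Which nodes contain which query keywords (normally found via a text index).
keyword_nodes = {
    "chakrabarti": {"author:chakrabarti"},
    "sigmod": {"conf:sigmod"},
}

def backward_expanding_search(graph, keyword_nodes):
    """Run one Dijkstra-style expansion per keyword, starting from the nodes
    matching it.  A node reached by every keyword's expansion is the root of a
    candidate answer tree; here its score is simply the total path length."""
    dist = {kw: defaultdict(lambda: float("inf")) for kw in keyword_nodes}
    for kw, sources in keyword_nodes.items():
        heap = [(0.0, s) for s in sources]
        for s in sources:
            dist[kw][s] = 0.0
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[kw][u]:
                continue
            for v, w in graph.get(u, []):
                if d + w < dist[kw][v]:
                    dist[kw][v] = d + w
                    heapq.heappush(heap, (d + w, v))

    answers = []
    for node in graph:
        if all(dist[kw][node] < float("inf") for kw in keyword_nodes):
            score = sum(dist[kw][node] for kw in keyword_nodes)
            answers.append((score, node))
    return sorted(answers)  # smaller total distance = tighter, better answer

# Roots closest to both "chakrabarti" and "sigmod" come first.
print(backward_expanding_search(graph, keyword_nodes))
```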

970 citations


Book
01 Jan 2002
TL;DR: This book covers Web infrastructure (crawling and search), learning techniques for text (similarity and clustering, supervised and semi-supervised learning), and applications such as social network analysis, resource discovery, and the future of Web mining.
Abstract: Contents: Preface. Introduction. Part I, Infrastructure: Crawling the Web; Web search. Part II, Learning: Similarity and clustering; Supervised learning for text; Semi-supervised learning. Part III, Applications: Social network analysis; Resource discovery; The future of Web mining.

751 citations


Proceedings ArticleDOI
07 May 2002
TL;DR: Asks whether an automatic program can emulate a surfer's behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page.
Abstract: The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded.
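A small sketch of the idea, under stated assumptions: score an unseen link by the text found near its anchor tag on the source page, and use that score as the crawl priority of the target URL. The HTML snippet, the toy labelled examples, and the choice of BeautifulSoup plus a naive Bayes model are illustrative assumptions, not the paper's exact learner or features.

```python
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def link_contexts(html):
    """Yield (href, context) pairs, where the context is the anchor text plus
    the text of the enclosing DOM element -- a crude stand-in for the tag-tree
    neighbourhood of the HREF that the paper exploits."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        context = a.get_text(" ", strip=True)
        if a.parent is not None:
            context += " " + a.parent.get_text(" ", strip=True)
        yield a["href"], context

# Toy training data: link contexts labelled by whether the *target* page
# turned out to be relevant (1) or irrelevant (0) to the crawl topic.
train_contexts = [
    "support vector machines tutorial lecture notes",
    "kernel methods for text classification",
    "cheap flights hotel deals book now",
    "celebrity gossip photo gallery",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_contexts), train_labels)

html = ('<p>Readings on <a href="http://example.org/svm">SVM lecture notes</a> '
        'and a <a href="http://example.org/deals">hotel deals page</a>.</p>')

# Predicted relevance of the unseen target becomes the URL's crawl priority:
# higher scores are fetched earlier, lower scores may be skipped entirely.
frontier = sorted(
    ((clf.predict_proba(vec.transform([ctx]))[0][1], href)
     for href, ctx in link_contexts(html)),
    reverse=True,
)
for priority, href in frontier:
    print(f"{priority:.2f}  {href}")
```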

264 citations


Book ChapterDOI
20 Aug 2002
TL;DR: Browsing ANd Keyword Searching (BANKS) enables almost effortless Web publishing of relational and eXtensible Markup Language (XML) data that would otherwise remain (at least partially) invisible to the Web.
Abstract: The BANKS system enables keyword-based search on databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. Extensive support for answer ranking forms a critical part of the BANKS system.
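As a loose illustration of answer ranking of this kind, the sketch below scores a rooted answer tree by combining the proximity of its nodes (smaller total edge weight) with their prestige (more in-links). The combination rule, the weighting constant, and the log-scaled prestige are invented for the example and are not the exact BANKS formula.

```python
import math

def node_prestige(in_degree):
    """Prestige grows slowly with the number of in-links (foreign-key references)."""
    return math.log(1 + in_degree)

def answer_tree_score(edge_weights, in_degrees, prestige_weight=0.2):
    """Score a rooted answer tree from the weights of its edges and the
    in-degrees of its nodes: lower total edge weight (tighter proximity) and
    higher average node prestige both increase the score."""
    proximity = 1.0 / (1.0 + sum(edge_weights))
    prestige = sum(node_prestige(d) for d in in_degrees) / max(len(in_degrees), 1)
    return (1 - prestige_weight) * proximity + prestige_weight * prestige

# Two hypothetical answer trees for the same keyword query: a compact tree
# through a highly referenced node beats a longer, obscure connection.
print(answer_tree_score(edge_weights=[1.0, 1.0], in_degrees=[3, 50, 2]))
print(answer_tree_score(edge_weights=[1.0, 1.0, 1.0, 1.0], in_degrees=[3, 1, 1, 2, 2]))
```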

167 citations


Posted Content
TL;DR: It is proposed that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities, and that the resulting measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
Abstract: The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
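To make the topic-mixing measurement concrete, here is a toy Python sketch: walk randomly on a small graph whose pages carry broad topic labels, and watch how quickly the walker's topic distribution drifts from the starting topic toward the background distribution. The graph, topic labels, and walk lengths are invented; the paper works at Web scale with an automatic classifier supplying the labels.

```python
import random
from collections import Counter

random.seed(0)

# page -> outlinks, and page -> broad topic label (both invented).
links = {
    "a1": ["a2", "b1"], "a2": ["a1", "a3"], "a3": ["a1", "b2"],
    "b1": ["b2", "a1"], "b2": ["b1", "b3"], "b3": ["b2", "a3"],
}
topic = {"a1": "Arts", "a2": "Arts", "a3": "Arts",
         "b1": "Business", "b2": "Business", "b3": "Business"}

def topic_distribution_at_step(start, steps, walks=2000):
    """Empirical distribution of the walker's topic after `steps` random hops,
    starting from `start`, averaged over many independent walks."""
    counts = Counter()
    for _ in range(walks):
        node = start
        for _ in range(steps):
            node = random.choice(links[node])
        counts[topic[node]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Starting from an Arts page, the topic context gradually mixes away:
for steps in (1, 2, 5, 10, 20):
    print(steps, topic_distribution_at_step("a1", steps))
```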

138 citations


Proceedings ArticleDOI
07 May 2002
TL;DR: In this paper, a topic taxonomy and an automatic classifier are used to measure the background distribution of broad topics on the Web and to analyze the capability of recent random walk algorithms to draw samples that follow such distributions.
Abstract: The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
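A companion sketch for the other measurement mentioned in this abstract, the probability that a page about one broad topic links to a page about another: count labelled edges and normalise per source topic. The edges and topic labels below are invented; in the paper the labels come from an automatic classifier over a taxonomy such as the Open Directory.

```python
from collections import Counter, defaultdict

# Invented toy data: directed links between pages and their topic labels.
edges = [("a1", "a2"), ("a1", "b1"), ("a2", "a3"), ("b1", "b2"),
         ("b2", "a1"), ("b3", "b2"), ("a3", "a1"), ("b1", "a2")]
topic = {"a1": "Arts", "a2": "Arts", "a3": "Arts",
         "b1": "Business", "b2": "Business", "b3": "Business"}

def topic_citation_matrix(edges, topic):
    """Return P(target topic | source topic) estimated from labelled links."""
    counts = defaultdict(Counter)
    for src, dst in edges:
        counts[topic[src]][topic[dst]] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

for src, row in topic_citation_matrix(edges, topic).items():
    print(src, row)
```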

117 citations


Proceedings ArticleDOI
23 Jul 2002
TL;DR: A new technique for multi-way classification that exploits the accuracy of SVMs and the speed of NB classifiers; on standard benchmarks it is 3 to 6 times faster than SVMs yet matches or even exceeds their accuracy.
Abstract: Support vector machines (SVMs) excel at two-class discriminative learning problems. They often outperform generative classifiers, especially those that use inaccurate generative models, such as the naive Bayes (NB) classifier. On the other hand, generative classifiers have no trouble in handling an arbitrary number of classes efficiently, and NB classifiers train much faster than SVMs owing to their extreme simplicity. In contrast, SVMs handle multi-class problems by learning redundant yes/no (one-vs-others) classifiers for each class, further worsening the performance gap. We propose a new technique for multi-way classification which exploits the accuracy of SVMs and the speed of NB classifiers. We first use a NB classifier to quickly compute a confusion matrix, which is used to reduce the number and complexity of the two-class SVMs that are built in the second stage. During testing, we first get the prediction of a NB classifier and use that to selectively apply only a subset of the two-class SVMs. On standard benchmarks, our algorithm is 3 to 6 times faster than SVMs and yet matches or even exceeds their accuracy.
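A minimal sketch of the two-stage idea on a toy corpus, assuming scikit-learn: a naive Bayes classifier supplies a confusion matrix, two-class SVMs are trained only for the class pairs it confuses, and at test time only the SVMs involving the NB guess are applied. The data, the confusion threshold, and the use of LinearSVC are illustrative assumptions; the paper's feature handling and thresholds differ in detail.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Toy three-class corpus (0 = sports, 1 = tech, 2 = politics).
docs = [
    "the goalkeeper saved the penalty in the football match",
    "the striker scored a goal in the cup final",
    "the midfielder passed the ball across the pitch",
    "cpu and gpu prices dropped sharply this quarter",
    "a new laptop with a faster processor was released",
    "the graphics card shortage affected computer builds",
    "the senate passed the budget bill after a long debate",
    "the election results were announced by the commission",
    "parliament debated the new tax legislation",
]
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Stage 1: a fast NB classifier plus its confusion matrix (computed here on
# the training set itself for brevity; a held-out estimate is more faithful).
nb = MultinomialNB().fit(X, labels)
cm = confusion_matrix(labels, nb.predict(X))

# Stage 2: train a two-class SVM only for class pairs that NB actually
# confuses.  On this tiny corpus NB may be perfect, leaving no SVMs at all,
# which is the point: SVM effort is spent only where NB struggles.
svms = {}
for i, j in combinations(range(cm.shape[0]), 2):
    if cm[i, j] + cm[j, i] > 0:
        mask = np.isin(labels, [i, j])
        svms[(i, j)] = LinearSVC().fit(X[mask], labels[mask])

def predict(text):
    """NB proposes a class; only the SVMs involving that class may refine it."""
    x = vec.transform([text])
    guess = int(nb.predict(x)[0])
    for (i, j), svm in svms.items():
        if guess in (i, j):
            guess = int(svm.predict(x)[0])
    return guess

print(predict("the team scored twice in the second half"))
```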

63 citations


Book ChapterDOI
01 Jan 2002
TL;DR: SIMPL uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances.
Abstract: Text classification is a standard problem in information retrieval (IR). A learner is presented with training documents, each labeled as containing or not containing material relevant to a given topic. Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. This chapter presents Simple Iterative Multiple Projection on Lines (SIMPL), a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts, and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively.
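A minimal sketch of the core step, assuming numpy and scikit-learn: compute one Fisher discriminant direction for a two-class problem, project the training vectors onto it, and grow a decision tree on the projection. SIMPL itself iterates this with several projections and works on sparse text vectors; the synthetic data and the ridge term below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 20))   # class 0 "documents"
X1 = rng.normal(loc=0.5, scale=1.0, size=(100, 20))   # class 1 "documents"
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

def fisher_direction(X, y, ridge=1e-3):
    """w = S_w^{-1} (mu_1 - mu_0), the classical Fisher discriminant direction."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    Sw += ridge * np.eye(Sw.shape[0])                  # keep S_w invertible
    return np.linalg.solve(Sw, mu1 - mu0)

w = fisher_direction(X, y)
z = (X @ w).reshape(-1, 1)                             # 1-D projected instances

# Induce a decision tree on the projected instances rather than the raw space.
tree = DecisionTreeClassifier(max_depth=3).fit(z, y)
print("training accuracy on the projection:", tree.score(z, y))
```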

13 citations