Showing papers by "Soumen Chakrabarti published in 2002"


Proceedings ArticleDOI
26 Feb 2002
TL;DR: The paper describes BANKS, a system that enables keyword-based search on relational databases together with data and schema browsing, and presents an efficient heuristic algorithm for finding and ranking query results.
Abstract: With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
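As a rough illustration of the graph model and proximity ranking sketched in this abstract, here is a minimal Python sketch of a backward-expanding search over a toy tuple graph. The node names, edge weights, and keyword index are invented, and the real BANKS heuristic treats edge direction and node prestige more carefully; this only shows the basic idea of growing shortest-path trees from keyword-matching tuples toward candidate answer roots.

```python
import heapq
from collections import defaultdict

# Hypothetical toy tuple graph: node -> list of (neighbor, edge weight).
# In BANKS, nodes are database tuples and edges come from foreign-key links.
graph = {
    "paper:1": [("author:chakrabarti", 1.0), ("conf:sigmod", 1.0)],
    "author:chakrabarti": [("paper:1", 1.0), ("paper:2", 1.0)],
    "paper:2": [("author:chakrabarti", 1.0), ("conf:vldb", 1.0)],
    "conf:sigmod": [("paper:1", 1.0)],
    "conf:vldb": [("paper:2", 1.0)],
}

# Which nodes contain which query keywords (normally found via a text index).
keyword_nodes = {
    "chakrabarti": {"author:chakrabarti"},
    "sigmod": {"conf:sigmod"},
}

def backward_expanding_search(graph, keyword_nodes):
    """Run one Dijkstra-style expansion per keyword, starting from the nodes
    matching it.  A node reached by every keyword's expansion is the root of a
    candidate answer tree; here its score is simply the total path length."""
    dist = {kw: defaultdict(lambda: float("inf")) for kw in keyword_nodes}
    for kw, sources in keyword_nodes.items():
        heap = [(0.0, s) for s in sources]
        for s in sources:
            dist[kw][s] = 0.0
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[kw][u]:
                continue
            for v, w in graph.get(u, []):
                if d + w < dist[kw][v]:
                    dist[kw][v] = d + w
                    heapq.heappush(heap, (d + w, v))

    answers = []
    for node in graph:
        if all(dist[kw][node] < float("inf") for kw in keyword_nodes):
            score = sum(dist[kw][node] for kw in keyword_nodes)
            answers.append((score, node))
    return sorted(answers)  # smaller total distance = tighter, better answer

# Roots closest to both "chakrabarti" and "sigmod" come first.
print(backward_expanding_search(graph, keyword_nodes))
```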

970 citations


Book
01 Jan 2002
TL;DR: This book covers Web infrastructure (crawling and search), learning techniques for text (similarity and clustering, supervised and semi-supervised learning), and applications such as social network analysis, resource discovery, and the future of Web mining.
Abstract: Contents: Preface. Introduction. Part I, Infrastructure: Crawling the Web; Web search. Part II, Learning: Similarity and clustering; Supervised learning for text; Semi-supervised learning. Part III, Applications: Social network analysis; Resource discovery; The future of Web mining.

751 citations


Proceedings ArticleDOI
07 May 2002
TL;DR: Asks whether an automatic program can emulate a surfer's behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page.
Abstract: The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded.
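A small sketch of the idea, under stated assumptions: score an unseen link by the text found near its anchor tag on the source page, and use that score as the crawl priority of the target URL. The HTML snippet, the toy labelled examples, and the choice of BeautifulSoup plus a naive Bayes model are illustrative assumptions, not the paper's exact learner or features.

```python
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def link_contexts(html):
    """Yield (href, context) pairs, where the context is the anchor text plus
    the text of the enclosing DOM element -- a crude stand-in for the tag-tree
    neighbourhood of the HREF that the paper exploits."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        context = a.get_text(" ", strip=True)
        if a.parent is not None:
            context += " " + a.parent.get_text(" ", strip=True)
        yield a["href"], context

# Toy training data: link contexts labelled by whether the *target* page
# turned out to be relevant (1) or irrelevant (0) to the crawl topic.
train_contexts = [
    "support vector machines tutorial lecture notes",
    "kernel methods for text classification",
    "cheap flights hotel deals book now",
    "celebrity gossip photo gallery",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_contexts), train_labels)

html = ('<p>Readings on <a href="http://example.org/svm">SVM lecture notes</a> '
        'and a <a href="http://example.org/deals">hotel deals page</a>.</p>')

# Predicted relevance of the unseen target becomes the URL's crawl priority:
# higher scores are fetched earlier, lower scores may be skipped entirely.
frontier = sorted(
    ((clf.predict_proba(vec.transform([ctx]))[0][1], href)
     for href, ctx in link_contexts(html)),
    reverse=True,
)
for priority, href in frontier:
    print(f"{priority:.2f}  {href}")
```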

264 citations


Book ChapterDOI
20 Aug 2002
TL;DR: Browsing ANd Keyword Searching (BANKS) enables almost effortless Web publishing of relational and eXtensible Markup Language (XML) data that would otherwise remain (at least partially) invisible to the Web.
Abstract: The BANKS system enables keyword-based search on databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. Extensive support for answer ranking forms a critical part of the BANKS system.
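As a loose illustration of answer ranking of this kind, the sketch below scores a rooted answer tree by combining the proximity of its nodes (smaller total edge weight) with their prestige (more in-links). The combination rule, the weighting constant, and the log-scaled prestige are invented for the example and are not the exact BANKS formula.

```python
import math

def node_prestige(in_degree):
    """Prestige grows slowly with the number of in-links (foreign-key references)."""
    return math.log(1 + in_degree)

def answer_tree_score(edge_weights, in_degrees, prestige_weight=0.2):
    """Score a rooted answer tree from the weights of its edges and the
    in-degrees of its nodes: lower total edge weight (tighter proximity) and
    higher average node prestige both increase the score."""
    proximity = 1.0 / (1.0 + sum(edge_weights))
    prestige = sum(node_prestige(d) for d in in_degrees) / max(len(in_degrees), 1)
    return (1 - prestige_weight) * proximity + prestige_weight * prestige

# Two hypothetical answer trees for the same keyword query: a compact tree
# through a highly referenced node beats a longer, obscure connection.
print(answer_tree_score(edge_weights=[1.0, 1.0], in_degrees=[3, 50, 2]))
print(answer_tree_score(edge_weights=[1.0, 1.0, 1.0, 1.0], in_degrees=[3, 1, 1, 2, 2]))
```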

167 citations


Posted Content
TL;DR: It is proposed that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities, and that the resulting measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
Abstract: The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
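To make the topic-mixing measurement concrete, here is a toy Python sketch: walk randomly on a small graph whose pages carry broad topic labels, and watch how quickly the walker's topic distribution drifts from the starting topic toward the background distribution. The graph, topic labels, and walk lengths are invented; the paper works at Web scale with an automatic classifier supplying the labels.

```python
import random
from collections import Counter

random.seed(0)

# page -> outlinks, and page -> broad topic label (both invented).
links = {
    "a1": ["a2", "b1"], "a2": ["a1", "a3"], "a3": ["a1", "b2"],
    "b1": ["b2", "a1"], "b2": ["b1", "b3"], "b3": ["b2", "a3"],
}
topic = {"a1": "Arts", "a2": "Arts", "a3": "Arts",
         "b1": "Business", "b2": "Business", "b3": "Business"}

def topic_distribution_at_step(start, steps, walks=2000):
    """Empirical distribution of the walker's topic after `steps` random hops,
    starting from `start`, averaged over many independent walks."""
    counts = Counter()
    for _ in range(walks):
        node = start
        for _ in range(steps):
            node = random.choice(links[node])
        counts[topic[node]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Starting from an Arts page, the topic context gradually mixes away:
for steps in (1, 2, 5, 10, 20):
    print(steps, topic_distribution_at_step("a1", steps))
```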

138 citations


Proceedings ArticleDOI
07 May 2002
TL;DR: In this paper, a topic taxonomy and an automatic classifier are used to measure the background distribution of broad topics on the Web and to analyze the capability of recent random walk algorithms to draw samples that follow such distributions.
Abstract: The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
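A companion sketch for the other measurement mentioned in this abstract, the probability that a page about one broad topic links to a page about another: count labelled edges and normalise per source topic. The edges and topic labels below are invented; in the paper the labels come from an automatic classifier over a taxonomy such as the Open Directory.

```python
from collections import Counter, defaultdict

# Invented toy data: directed links between pages and their topic labels.
edges = [("a1", "a2"), ("a1", "b1"), ("a2", "a3"), ("b1", "b2"),
         ("b2", "a1"), ("b3", "b2"), ("a3", "a1"), ("b1", "a2")]
topic = {"a1": "Arts", "a2": "Arts", "a3": "Arts",
         "b1": "Business", "b2": "Business", "b3": "Business"}

def topic_citation_matrix(edges, topic):
    """Return P(target topic | source topic) estimated from labelled links."""
    counts = defaultdict(Counter)
    for src, dst in edges:
        counts[topic[src]][topic[dst]] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

for src, row in topic_citation_matrix(edges, topic).items():
    print(src, row)
```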

117 citations


Proceedings ArticleDOI
23 Jul 2002
TL;DR: A new technique for multi-way classification that exploits the accuracy of SVMs and the speed of NB classifiers; on standard benchmarks it is 3 to 6 times faster than SVMs yet matches or even exceeds their accuracy.
Abstract: Support vector machines (SVMs) excel at two-class discriminative learning problems. They often outperform generative classifiers, especially those that use inaccurate generative models, such as the naive Bayes (NB) classifier. On the other hand, generative classifiers have no trouble in handling an arbitrary number of classes efficiently, and NB classifiers train much faster than SVMs owing to their extreme simplicity. In contrast, SVMs handle multi-class problems by learning redundant yes/no (one-vs-others) classifiers for each class, further worsening the performance gap. We propose a new technique for multi-way classification which exploits the accuracy of SVMs and the speed of NB classifiers. We first use a NB classifier to quickly compute a confusion matrix, which is used to reduce the number and complexity of the two-class SVMs that are built in the second stage. During testing, we first get the prediction of a NB classifier and use that to selectively apply only a subset of the two-class SVMs. On standard benchmarks, our algorithm is 3 to 6 times faster than SVMs and yet matches or even exceeds their accuracy.
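A minimal sketch of the two-stage idea on a toy corpus, assuming scikit-learn: a naive Bayes classifier supplies a confusion matrix, two-class SVMs are trained only for the class pairs it confuses, and at test time only the SVMs involving the NB guess are applied. The data, the confusion threshold, and the use of LinearSVC are illustrative assumptions; the paper's feature handling and thresholds differ in detail.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Toy three-class corpus (0 = sports, 1 = tech, 2 = politics).
docs = [
    "the goalkeeper saved the penalty in the football match",
    "the striker scored a goal in the cup final",
    "the midfielder passed the ball across the pitch",
    "cpu and gpu prices dropped sharply this quarter",
    "a new laptop with a faster processor was released",
    "the graphics card shortage affected computer builds",
    "the senate passed the budget bill after a long debate",
    "the election results were announced by the commission",
    "parliament debated the new tax legislation",
]
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Stage 1: a fast NB classifier plus its confusion matrix (computed here on
# the training set itself for brevity; a held-out estimate is more faithful).
nb = MultinomialNB().fit(X, labels)
cm = confusion_matrix(labels, nb.predict(X))

# Stage 2: train a two-class SVM only for class pairs that NB actually
# confuses.  On this tiny corpus NB may be perfect, leaving no SVMs at all,
# which is the point: SVM effort is spent only where NB struggles.
svms = {}
for i, j in combinations(range(cm.shape[0]), 2):
    if cm[i, j] + cm[j, i] > 0:
        mask = np.isin(labels, [i, j])
        svms[(i, j)] = LinearSVC().fit(X[mask], labels[mask])

def predict(text):
    """NB proposes a class; only the SVMs involving that class may refine it."""
    x = vec.transform([text])
    guess = int(nb.predict(x)[0])
    for (i, j), svm in svms.items():
        if guess in (i, j):
            guess = int(svm.predict(x)[0])
    return guess

print(predict("the team scored twice in the second half"))
```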

63 citations


Book ChapterDOI
01 Jan 2002
TL;DR: SIMPL uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances.
Abstract: Text classification is a standard problem in information retrieval (IR). A learner is presented with training documents, each labeled as containing or not containing material relevant to a given topic. Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. This chapter presents Simple Iterative Multiple Projection on Lines (SIMPL), a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts, and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively.
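A minimal sketch of the core step, assuming numpy and scikit-learn: compute one Fisher discriminant direction for a two-class problem, project the training vectors onto it, and grow a decision tree on the projection. SIMPL itself iterates this with several projections and works on sparse text vectors; the synthetic data and the ridge term below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 20))   # class 0 "documents"
X1 = rng.normal(loc=0.5, scale=1.0, size=(100, 20))   # class 1 "documents"
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

def fisher_direction(X, y, ridge=1e-3):
    """w = S_w^{-1} (mu_1 - mu_0), the classical Fisher discriminant direction."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    Sw += ridge * np.eye(Sw.shape[0])                  # keep S_w invertible
    return np.linalg.solve(Sw, mu1 - mu0)

w = fisher_direction(X, y)
z = (X @ w).reshape(-1, 1)                             # 1-D projected instances

# Induce a decision tree on the projected instances rather than the raw space.
tree = DecisionTreeClassifier(max_depth=3).fit(z, y)
print("training accuracy on the projection:", tree.score(z, y))
```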

13 citations