
Showing papers by "Soumen Chakrabarti published in 1998"


Proceedings ArticleDOI
01 Jun 1998
TL;DR: Robust statistical models and a relaxation labeling technique improve hypertext classification by exploiting link information in a small neighborhood around documents; the technique adapts gracefully to the fraction of neighboring documents having known topics and cut error on a Yahoo! sample from 68% to 21%.
Abstract: A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
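The relaxation-labeling idea above can be sketched in a few lines: each document starts with a text-only class distribution, and repeated passes blend in the current class distributions of its link neighbors. The toy graph, priors, and linear blending rule below are illustrative assumptions, not the paper's actual statistical model.

```python
# Toy relaxation labeling over a hyperlink graph: each document starts
# with a text-only class distribution (given directly here) and repeated
# passes blend in its neighbors' current distributions. The linear
# blending rule is an illustrative assumption, not the paper's model.

def relax(priors, links, coupling=0.5, iters=10):
    labels = {d: dict(p) for d, p in priors.items()}
    classes = list(next(iter(priors.values())))
    for _ in range(iters):
        new = {}
        for d, prior in priors.items():
            nbrs = links.get(d, [])
            scores = {}
            for c in classes:
                # neighbor evidence: mean probability of class c over links
                nb = (sum(labels[n][c] for n in nbrs) / len(nbrs)
                      if nbrs else prior[c])
                scores[c] = (1 - coupling) * prior[c] + coupling * nb
            z = sum(scores.values())
            new[d] = {c: s / z for c, s in scores.items()}
        labels = new
    return labels

priors = {
    "a": {"sports": 0.9, "finance": 0.1},
    "b": {"sports": 0.6, "finance": 0.4},
    "c": {"sports": 0.45, "finance": 0.55},  # ambiguous from text alone
}
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
out = relax(priors, links)
print(max(out["c"], key=out["c"].get))  # link context disambiguates "c"
```

Document "c" is misclassified by its text prior alone, but its link neighbors pull it to the correct topic, which is the effect the paper exploits.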

1,124 citations


Journal ArticleDOI
01 Apr 1998
TL;DR: An evaluation of ARC suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic.
Abstract: We describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative Web resources on any (sufficiently broad) topic. The goal of ARC is to compile resource lists similar to those provided by Yahoo! or Infoseek. The fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically. We describe the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users. This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. We also provide examples of ARC resource lists for the reader to examine.

810 citations


Patent
23 Jun 1998
TL;DR: In this article, a hierarchical taxonomy and path enhanced retrieval system (TAPER) is used to generate context-dependent document indexing terms, in addition to keywords, for better focused searching and browsing of the text database.
Abstract: A system, process, and article of manufacture for organizing a large text database into a hierarchy of topics and for maintaining this organization as documents are added and deleted and as the topic hierarchy changes. Given sample documents belonging to various nodes in the topic hierarchy, the tokens (terms, phrases, dates, or other usable features in the document) that are most useful at each internal decision node for the purpose of routing new documents to the children of that node are automatically detected. Using these feature terms, statistical models are constructed for each topic node. The models are used in an estimation technique to assign topic paths to new unlabeled documents. The hierarchical technique, in which feature terms can be very different at different nodes, leads to an efficient context-sensitive classification technique. The hierarchical technique can handle millions of documents and tens of thousands of topics. A resulting taxonomy and path enhanced retrieval system (TAPER) is used to generate context-dependent document indexing terms. The topic paths are used, in addition to keywords, for better focused searching and browsing of the text database.

523 citations


Journal ArticleDOI
01 Aug 1998
TL;DR: An automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy is described.
Abstract: We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of “noise” words in a document. Thus, the classifier has a small model size and is very fast. Owing to the use of context-sensitive features, the classifier is very accurate. As a by-product, we can compute for each document a set of terms that occur significantly more often in it than in the classes to which it belongs. We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations like sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities.
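A minimal sketch of the per-node idea: keep only a few discriminating terms at each internal node, treat everything else as noise, and route a document to a child with a naive-Bayes score. The toy taxonomy, training data, and the crude spread-based feature score are assumptions for illustration, not the system's actual statistics.

```python
import math
from collections import Counter

def train_node(docs_by_child, n_features=2):
    """docs_by_child: {child_topic: [term lists]} -> per-node model."""
    counts = {c: Counter(t for d in docs for t in d)
              for c, docs in docs_by_child.items()}
    vocab = {t for c in counts.values() for t in c}
    def spread(t):  # crude discriminant score: frequency spread across children
        freqs = [counts[c][t] / sum(counts[c].values()) for c in counts]
        return max(freqs) - min(freqs)
    feats = set(sorted(vocab, key=spread, reverse=True)[:n_features])
    return {"features": feats, "counts": counts}

def route(model, doc):
    kept = [t for t in doc if t in model["features"]]  # noise words ignored
    def score(c):
        total = sum(model["counts"][c].values())
        # add-one smoothed log-likelihood over the kept feature terms
        return sum(math.log((model["counts"][c][t] + 1) /
                            (total + len(model["features"]))) for t in kept)
    return max(model["counts"], key=score)

model = train_node({
    "hardware": [["chip", "circuit", "the"], ["circuit", "voltage", "the"]],
    "software": [["code", "compiler", "the"], ["code", "bug", "the"]],
})
print(route(model, ["the", "circuit", "chip", "the"]))
```

Because common words like "the" have no spread across children, they never enter the feature set, which is what keeps the per-node model small and fast.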

292 citations


Proceedings ArticleDOI
01 Jan 1998
TL;DR: This paper proposes two novel scheduling metrics, namely, max-stretch and max-flow, which gauge the responsiveness of the scheduler to each job and avoid starvation of any job.
Abstract: Many servers, such as web and database servers, receive a continual stream of requests. These requests may require an amount of processing time that varies over several orders of magnitude. The servers should schedule these requests to provide the “best” and “fairest” possible service to users. However, it is difficult to quantify this objective. In this paper, we isolate and study the problem of scheduling a continuous stream of requests of varying sizes. More precisely, assume a request or job j has arrival time a_j and requires processing time t_j. We aim to schedule these jobs with a guaranteed quality of service (QoS). QoS is interpreted in many ways (e.g., throughput, avoiding jitter or delay, etc.). Here we adopt the widely-accepted requirement that the schedule be responsive to each job and avoid starvation of any job [1]. In this paper, we propose two novel scheduling metrics, namely, max-stretch and max-flow. These metrics gauge the responsiveness of the scheduler to each job. Surprisingly, despite the extensive research on scheduling algorithms in the past few decades, these metrics seem to be new. We initiate the study of optimizing these metrics under varying circumstances (offline/online, preemptive/nonpreemptive, etc.). We prove positive and negative theoretical results. We also report experiments with several scheduling heuristics on real data sets (HTTP server log requests at U.C. Berkeley). We present our observations on the overall “fairness” of various metrics and scheduling strategies. In what follows, we describe our metrics and results in detail.
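The two metrics have one-line definitions: a job's flow is its completion time minus its arrival time, and its stretch is its flow divided by its processing time; max-flow and max-stretch take the maximum over all jobs. A small sketch under a non-preemptive FIFO policy (the policy and the job set are illustrative, not from the paper) shows how the two diverge:

```python
# Flow and stretch for a non-preemptive FIFO schedule of (arrival, size) jobs.
# flow(j) = completion(j) - arrival(j); stretch(j) = flow(j) / size(j).

def fifo_metrics(jobs):  # jobs: list of (arrival, processing_time), arrival-sorted
    t, flows, stretches = 0, [], []
    for a, p in jobs:
        start = max(t, a)   # FIFO: run jobs in arrival order, back to back
        t = start + p
        flow = t - a
        flows.append(flow)
        stretches.append(flow / p)
    return max(flows), max(stretches)

# One long job followed by several short ones: every job sees a similar flow,
# but the short jobs' stretch explodes, which is why stretch better captures
# "fairness" to small requests.
jobs = [(0, 100)] + [(i, 1) for i in range(1, 6)]
max_flow, max_stretch = fifo_metrics(jobs)
print(max_flow, max_stretch)
```

Here the long job has stretch 1 while each one-unit job waits roughly 100 time units, giving a stretch near 100 that max-flow alone would not distinguish from the long job's delay.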

267 citations


Proceedings Article
24 Aug 1998
TL;DR: It is argued that once the analyst is already familiar with prevalent patterns in data, the greatest incremental benefit is likely to be from changes in the relationship between item frequencies.
Abstract: We propose a new notion of surprising temporal patterns in market basket data, and algorithms to find such patterns. This is distinct from finding frequent patterns as addressed in the common mining literature. We argue that once the analyst is already familiar with prevalent patterns in the data, the greatest incremental benefit is likely to be from changes in the relationship between item frequencies.

156 citations


Patent
29 Aug 1998
TL;DR: In this paper, a method for cataloging, filtering and ranking Web pages of the Internet is presented, which includes steps for enabling a user to interactively create a frame-based, hierarchical organizational structure for the information elements.
Abstract: A method for cataloging, filtering and ranking information, as for example, World Wide Web pages of the Internet. The method is preferably implemented in computer software and features steps for enabling a user to interactively create an information database including preferred information elements such as preferred-authority World Wide Web pages. The method includes steps for enabling a user to interactively create a frame-based, hierarchical organizational structure for the information elements, and steps for identifying and automatically filtering and ranking by relevance, information elements, such as World Wide Web pages for populating the structure, to form, for example, a searchable, World Wide Web page database. Additionally, the method features steps for enabling a user to interactively define a frame-based, hierarchical information structure for cataloging information, identifying a preliminary population of information elements for a particular hierarchical category arranged as a frame, based upon the respective frame attributes, and thereafter, expanding the information population to include related information, and subsequently, automatically filtering and ranking the information based upon relevance, and then populating the hierarchical structure with a definable portion of the filtered, ranked information elements.

126 citations


Journal ArticleDOI
TL;DR: A parallel threshold strategy based on rethrowing balls placed in heavily loaded bins achieves loads within a constant factor of the lower bound for a constant number of rounds, and it achieves a final load of at most O(log log n) given Ω(log log n) rounds of communication.
Abstract: It is well known that after placing n balls independently and uniformly at random into n bins, the fullest bin holds Θ(log n/log log n) balls with high probability. More recently, Azar et al. analyzed the following process: randomly choose d bins for each ball, and then place the balls, one by one, into the least full bin from its d choices. They show that after all n balls have been placed, the fullest bin contains only log log n/log d + Θ(1) balls with high probability. We explore extensions of this result to parallel and distributed settings. Our results focus on the tradeoff between the amount of communication and the final load. Given r rounds of communication, we provide lower bounds on the maximum load of Ω((log n/log log n)^{1/r}) for a wide class of strategies. Our results extend to the case where the number of rounds is allowed to grow with n. We then demonstrate parallelizations of the sequential strategy presented by Azar et al. that achieve loads within a constant factor of the lower bound for two communication rounds and almost match the sequential strategy given log log n/log d + O(d) rounds of communication. We also examine a parallel threshold strategy based on rethrowing balls placed in heavily loaded bins. This strategy achieves loads within a constant factor of the lower bound for a constant number of rounds, and it achieves a final load of at most O(log log n) given Ω(log log n) rounds of communication. The algorithm also works well in asynchronous environments. © 1998 John Wiley & Sons, Inc. Random Structures Alg., 13, 159–188 (1998)
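The sequential d-choice process analyzed by Azar et al. is easy to simulate; the sketch below covers only that sequential baseline, not the paper's parallel strategies.

```python
import random

# Simulate the sequential d-choice process: each ball picks d bins
# uniformly at random and goes into the least loaded of them. For d = 1
# the max load is Theta(log n / log log n); for d >= 2 it drops to
# log log n / log d + Theta(1).

def max_load(n, d, seed=0):
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        bins[min(choices, key=lambda b: bins[b])] += 1
    return max(bins)

n = 100_000
print(max_load(n, 1), max_load(n, 2))  # d = 2 gives a much smaller max load
```

Even the jump from one choice to two shrinks the maximum load dramatically, which is the "power of two choices" phenomenon the parallel strategies try to preserve with limited communication.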

95 citations


Patent
21 Aug 1998
TL;DR: In this article, a system and method for data mining in which temporal patterns of itemsets in transactions having unexpected support values are identified is described, where a surprising temporal pattern is an itemset whose support changes over time.
Abstract: A system and method for data mining is provided in which temporal patterns of itemsets in transactions having unexpected support values are identified. A surprising temporal pattern is an itemset whose support changes over time. The method may use a minimum description length formulation to discover these surprising temporal patterns.
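The core quantity is an itemset's support measured per time period rather than over the whole history. A toy sketch follows; the max-minus-min "surprise" score is an illustrative proxy, not the minimum description length formulation the method uses.

```python
# Per-period support of an itemset in market basket data: the aim is to
# surface itemsets whose support *changes* across periods rather than
# itemsets that are merely frequent overall.

def support_series(baskets_by_period, itemset):
    s = set(itemset)
    return [sum(1 for b in period if s <= set(b)) / len(period)
            for period in baskets_by_period]

periods = [
    [["milk", "bread"], ["milk"], ["bread"]],           # period 1
    [["milk", "bread"], ["milk", "bread"], ["bread"]],  # period 2
]
series = support_series(periods, ["milk", "bread"])
print(series, max(series) - min(series))  # support rises between periods
```

An itemset with high but flat support would score zero here, while one whose support shifts over time stands out, matching the notion of a surprising temporal pattern.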

85 citations


Patent
29 Aug 1998
TL;DR: In this article, a method for cataloging, filtering, and ranking Web pages of the Internet is presented, where a user can interactively create an information database including preferred information elements such as preferred-authority Web pages.
Abstract: A method for cataloging, filtering and ranking information, as for example, World Wide Web pages of the Internet. The method is preferably implemented in computer software and features steps for enabling a user to interactively create an information database including preferred information elements such as preferred-authority World Wide Web pages. The method includes steps for enabling a user to interactively create a frame-based, hierarchical organizational structure for the information elements, and steps for identifying and automatically filtering and ranking by relevance, information elements, such as World Wide Web pages for populating the structure, to form, for example, a searchable, World Wide Web page database. Additionally, the method features steps for enabling a user to interactively define a frame-based, hierarchical information structure for cataloging information, identify a preliminary population of information elements for a particular hierarchical category arranged as a frame, based upon the respective frame attributes, and thereafter, expand the information population to include related information, and subsequently, automatically filter and rank the information based upon relevance, and then populate the hierarchical structure with a definable portion of the filtered, upper-ranked information elements.

27 citations


Proceedings Article
26 Jan 1998
TL;DR: In this article, a portable and robust software mechanism for adaptive frame rate control for real-time packet video transfer and viewing in workstation environments is described, which includes a responsive feedback system for dynamic presentation layer rate control in response to network load and a simple and effective transport layer flow control scheme built upon an unreliable network protocol.
Abstract: We describe a portable and robust software mechanism for adaptive frame rate control for real-time packet video transfer and viewing in workstation environments. No special hardware or system support is assumed. Our contributions include (1) a responsive feedback system for dynamic presentation layer rate control in response to network load, (2) a simple and effective transport layer flow control scheme built upon an unreliable network protocol, and (3) data structures and scheduling support for integrating the rate and flow control mechanisms. We demonstrate that excellent jitter control is achieved by gracefully trading image resolution, and loss rates can be drastically reduced by appropriate flow control.
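A feedback loop of the general kind described above can be sketched as a sender that adjusts frame rate from receiver loss reports, backing off multiplicatively on loss and probing up additively when the network is clean. The AIMD rule and all constants below are illustrative assumptions, not the paper's actual control scheme.

```python
# Toy sender-side rate controller driven by per-interval loss reports.
# Multiplicative decrease on congestion, additive increase when clean;
# thresholds and step sizes are illustrative assumptions.

def adapt_rate(rate, loss_fraction, min_rate=1.0, max_rate=30.0):
    if loss_fraction > 0.05:        # congestion signal: cut rate sharply
        rate *= 0.5
    elif loss_fraction == 0.0:      # clean interval: probe upward gently
        rate += 1.0
    return max(min_rate, min(rate, max_rate))

rate = 30.0  # frames per second
for loss in [0.0, 0.2, 0.2, 0.0, 0.0]:   # simulated loss reports
    rate = adapt_rate(rate, loss)
print(rate)
```

Cutting quickly on loss and recovering slowly is what lets such a loop trade frame rate (and hence resolution or smoothness) for low jitter and loss under changing network load.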