
Showing papers by "Soumen Chakrabarti" published in 2000


Journal ArticleDOI
TL;DR: Recent advances in learning and mining problems related to hypertext in general and the Web in particular are surveyed and the continuum of supervised to semi-supervised to unsupervised learning problems is reviewed.
Abstract: With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

331 citations


Journal ArticleDOI
01 Jun 2000
TL;DR: The beginnings of a Memex for the Web are presented: a browser assistant envisaged to be shared by a community of surfers with overlapping interests, together with a novel formulation of the community taxonomy synthesis problem, algorithms, and experimental results.
Abstract: Keyword indices, topic directories, and link-based rankings are used to search and structure the rapidly growing Web today. Surprisingly little use is made of years of browsing experience of millions of people. Indeed, this information is routinely discarded by browsers. Even deliberate bookmarks are stored passively, in browser-dependent formats; this separates them from the dominant world of HTML hypermedia, even if their owners were willing to share them. All this goes against Vannevar Bush's dream of the Memex: an enhanced supplement to personal and community memory. We present the beginnings of a Memex for the Web. Memex blurs the artificial distinction between browsing history and deliberate bookmarks. The resulting glut of data is analyzed in a number of ways. It is indexed not only by keywords but also according to the user's view of topics; this lets the user recall topic-based browsing contexts by asking questions like 'What trails was I following when I was last surfing about classical music?' and 'What are some popular pages related to my recent trail regarding cycling?' Memex is a browser assistant that performs these functions. We envisage that Memex will be shared by a community of surfers with overlapping interests; in that context, the meaning and ramifications of topical trails may be decided by not one but many surfers. We present a novel formulation of the community taxonomy synthesis problem, algorithms, and experimental results. We also recommend uniform APIs which will help manage advanced interactions with the browser.

37 citations


Proceedings Article
10 Sep 2000
TL;DR: It is proposed to demonstrate the beginnings of a ‘Memex’ for the Web: a browsing assistant for individuals and groups with focused interests, which blurs the artificial distinction between browsing history and deliberate bookmarks.
Abstract: Keyword indices, topic directories, and link-based rankings are used to search and structure the rapidly growing Web today. Surprisingly little use is made of years of browsing experience of millions of people. Indeed, this information is routinely discarded by browsers. Even deliberate bookmarks are stored in a passive and isolated manner. All this goes against Vannevar Bush’s dream of the Memex: an enhanced supplement to personal and community memory. We propose to demonstrate the beginnings of a ‘Memex’ for the Web: a browsing assistant for individuals and groups with focused interests. Memex blurs the artificial distinction between browsing history and deliberate bookmarks. The resulting glut of data is analyzed in a number of ways at the individual and community levels. Memex constructs a topic directory customized to the community, mapping their interests naturally to nodes in this directory. This lets the user recall topic-based browsing contexts by asking questions like “What trails was I following when I was last surfing about classical music?” and “What are some popular pages in or near my community’s recent trail graph related to music?”

1 Motivation

Three paradigms have emerged for exploring the Web: keyword search, directory browsing, and following links. Popular search engine and directory sites are visited tens of millions of times per day. We speculate that the total number of clicks per day is orders of magnitude larger. This third source of information, the browsing history of millions of Web users over several years, an information source that dwarfs the scale of the Web itself, is almost entirely discarded by browsers as ‘history’. Deliberate ‘bookmarks’ are preserved, but passively, in browser-dependent formats; this separates them from the dominant world of HTML hypermedia, even if their owners were willing to share them (as they are, in our experience, with all but a small section of their browsing activity).

18 citations


Patent
Soumen Chakrabarti
04 Feb 2000
TL;DR: A system and method are presented for optimizing I/O to low-level index access during bulk-routing of documents, e.g., Web pages, through a taxonomy in order to classify them.
Abstract: A system and method for optimizing I/O to low-level index access during bulk-routing through a taxonomy to classify documents, e.g., Web pages, in the taxonomy. In a first optimization, bulk-routing is regarded as a generalized join operation in a relational database framework. In a second optimization, instead of processing each document individually through nodes of the taxonomy, a group of documents are processed node by node in a wavefront-style routing scheme for better amortization of index probes.

14 citations
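
The second optimization in the abstract above (wavefront-style routing) can be illustrated with a small sketch. This is not the patented method: the toy taxonomy, the term sets standing in for the low-level index, the `CountingIndex` probe counter, and the overlap-based `pick_child` rule are all hypothetical, chosen only to show why processing a batch node by node needs fewer index probes than routing each document on its own.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: internal node -> child nodes.
TAXONOMY = {
    "root": ["arts", "sports"],
    "arts": ["music", "painting"],
    "sports": ["cycling", "tennis"],
}

# Hypothetical term sets standing in for per-node index statistics.
NODE_TERMS = {
    "arts": {"music", "opera", "painting", "canvas"},
    "sports": {"cycling", "bike", "tennis", "racket"},
    "music": {"music", "opera"},
    "painting": {"painting", "canvas"},
    "cycling": {"cycling", "bike"},
    "tennis": {"tennis", "racket"},
}

# Hypothetical documents, each reduced to a set of terms.
DOCS = {
    "d1": {"opera", "music"},
    "d2": {"bike", "cycling"},
    "d3": {"canvas"},
    "d4": {"tennis"},
}


class CountingIndex:
    """Simulated index: each call is one probe for a node's child statistics."""

    def __init__(self):
        self.probes = 0

    def __call__(self, node):
        self.probes += 1
        return {child: NODE_TERMS[child] for child in TAXONOMY[node]}


def pick_child(doc_terms, child_stats):
    # Assumed scoring rule: largest term overlap, ties broken by name order.
    return max(sorted(child_stats), key=lambda c: len(doc_terms & child_stats[c]))


def route_one_by_one(docs, probe):
    """Baseline: each document walks the taxonomy alone, probing at every step."""
    result = {}
    for doc_id, terms in docs.items():
        node = "root"
        while node in TAXONOMY:
            node = pick_child(terms, probe(node))
        result[doc_id] = node
    return result


def route_wavefront(docs, probe):
    """Wavefront: all documents sitting at a node share a single probe."""
    frontier = {"root": list(docs)}
    result = {}
    while frontier:
        next_frontier = defaultdict(list)
        for node, doc_ids in frontier.items():
            stats = probe(node)  # one probe amortized over the whole group
            for doc_id in doc_ids:
                child = pick_child(docs[doc_id], stats)
                if child in TAXONOMY:
                    next_frontier[child].append(doc_id)
                else:
                    result[doc_id] = child
        frontier = next_frontier
    return result


if __name__ == "__main__":
    baseline, wave = CountingIndex(), CountingIndex()
    print(route_one_by_one(DOCS, baseline), baseline.probes)
    print(route_wavefront(DOCS, wave), wave.probes)
```

Both functions assign the same leaf to every document; in this toy setup the baseline issues one probe per document per level, while the wavefront version issues one probe per visited internal node. The first optimization in the abstract (treating bulk-routing as a generalized join) is not covered by this sketch.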


Patent
10 Mar 2000
TL;DR: A Web server stores a table of Web page inlinks; when a user who has accessed a Web page wants other pages related to it, the user requests the table of inlinks and from it generates a list of sibling links to the accessed page, the sibling links being outlinks of one or more of the inlinks in the table.
Abstract: A Web server stores a table of Web page inlinks. When a Web page is accessed and a user wants to access other pages related to the accessed page, the user requests the table of inlinks, and from it generates a list of sibling links to the accessed page, the sibling links being outlinks of one or more of the inlinks in the table.

14 citations
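
The sibling-link idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented system: the link data, the in-memory `OUTLINKS` dictionary, and the helper names `build_inlink_table` and `sibling_links` are hypothetical, and ordering siblings by how many inlinking pages they share is an added assumption that the abstract does not state.

```python
from collections import Counter

# Hypothetical link data: page -> pages it links to (its outlinks).
OUTLINKS = {
    "hub1": ["pageA", "pageB", "pageC"],
    "hub2": ["pageA", "pageC"],
    "hub3": ["pageD"],
}


def build_inlink_table(outlinks):
    """Invert the outlink map: page -> list of pages that link to it."""
    table = {}
    for source, targets in outlinks.items():
        for target in targets:
            table.setdefault(target, []).append(source)
    return table


def sibling_links(page, inlink_table, outlinks):
    """Collect outlinks of the pages that link to `page`, excluding `page` itself.

    Assumed ordering: siblings shared by more inlinking pages come first,
    with ties broken alphabetically.
    """
    shared = Counter()
    for parent in inlink_table.get(page, []):
        for sibling in outlinks.get(parent, []):
            if sibling != page:
                shared[sibling] += 1
    return sorted(shared, key=lambda s: (-shared[s], s))


if __name__ == "__main__":
    inlinks = build_inlink_table(OUTLINKS)
    print(sibling_links("pageA", inlinks, OUTLINKS))
```

In this sketch the outlinks of each inlinking page are read from a dictionary; in a real deployment they would presumably have to be obtained by fetching or indexing those pages, which is outside the scope of the example.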


Proceedings ArticleDOI
01 Aug 2000

3 citations


Proceedings ArticleDOI
20 Aug 2000

2 citations