Showing papers by "Soumen Chakrabarti" published in 2000

Open Access

Journal Article•DOI•

Data mining for hypertext: a tutorial survey

[...]

01 Jan 2000-Sigkdd Explorations

TL;DR: Recent advances in learning and mining problems related to hypertext in general and the Web in particular are surveyed and the continuum of supervised to semi-supervised to unsupervised learning problems is reviewed.

Abstract: With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

331 citations

Journal Article•DOI•

Using Memex to archive and mine community Web browsing experience

[...]

Soumen Chakrabarti¹, Sandeep Kumar Srivastava¹, Mallela Subramanyam¹, Mitul Tiwari¹•Institutions (1)

Indian Institute of Technology Bombay¹

01 Jun 2000

TL;DR: The beginnings of a Memex for the Web are presented; the authors envisage that Memex will be shared by a community of surfers with overlapping interests, and present a novel formulation of the community taxonomy synthesis problem, along with algorithms and experimental results.

Abstract: Keyword indices, topic directories, and link-based rankings are used to search and structure the rapidly growing Web today. Surprisingly little use is made of years of browsing experience of millions of people. Indeed, this information is routinely discarded by browsers. Even deliberate bookmarks are stored passively, in browser-dependent formats; this separates them from the dominant world of HTML hypermedia, even if their owners were willing to share them. All this goes against Vannevar Bush's dream of the Memex: an enhanced supplement to personal and community memory. We present the beginnings of a Memex for the Web. Memex blurs the artificial distinction between browsing history and deliberate bookmarks. The resulting glut of data is analyzed in a number of ways. It is indexed not only by keywords but also according to the user's view of topics; this lets the user recall topic-based browsing contexts by asking questions like "What trails was I following when I was last surfing about classical music?" and "What are some popular pages related to my recent trail regarding cycling?" Memex is a browser assistant that performs these functions. We envisage that Memex will be shared by a community of surfers with overlapping interests; in that context, the meaning and ramifications of topical trails may be decided by not one but many surfers. We present a novel formulation of the community taxonomy synthesis problem, algorithms, and experimental results. We also recommend uniform APIs which will help manage advanced interactions with the browser.

37 citations

Proceedings Article•

Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails

[...]

Soumen Chakrabarti, Sandeep Kumar Srivastava, Mallela Subramanyam, Mitul Tiwari

10 Sep 2000

TL;DR: It is proposed to demonstrate the beginnings of a 'Memex' for the Web: a browsing assistant for individuals and groups with focused interests, which blurs the artificial distinction between browsing history and deliberate bookmarks.

Abstract: Keyword indices, topic directories, and link-based rankings are used to search and structure the rapidly growing Web today. Surprisingly little use is made of years of browsing experience of millions of people. Indeed, this information is routinely discarded by browsers. Even deliberate bookmarks are stored in a passive and isolated manner. All this goes against Vannevar Bush's dream of the Memex: an enhanced supplement to personal and community memory. We propose to demonstrate the beginnings of a 'Memex' for the Web: a browsing assistant for individuals and groups with focused interests. Memex blurs the artificial distinction between browsing history and deliberate bookmarks. The resulting glut of data is analyzed in a number of ways at the individual and community levels. Memex constructs a topic directory customized to the community, mapping their interests naturally to nodes in this directory. This lets the user recall topic-based browsing contexts by asking questions like "What trails was I following when I was last surfing about classical music?" and "What are some popular pages in or near my community's recent trail graph related to music?"

1 Motivation

Three paradigms have emerged for exploring the Web: keyword search, directory browsing, and following links. Popular search engine and directory sites are visited tens of millions of times per day. We speculate that the total number of clicks per day is orders of magnitude larger. This third source of information, the browsing history of millions of Web users over several years, an information source that dwarfs the scale of the Web itself, is almost entirely discarded by browsers as 'history'. Deliberate 'bookmarks' are preserved, but passively, in browser-dependent formats; this separates them from the dominant world of HTML hypermedia, even if their owners were willing to share them (as they are, in our experience, with all but a small section of their browsing activity).

18 citations

Patent•

System and method for dynamic index-probe optimizations for high-dimensional similarity search

[...]

Soumen Chakrabarti¹•Institutions (1)

IBM¹

04 Feb 2000

TL;DR: A system and method are presented for optimizing I/O to low-level index access during bulk-routing through a taxonomy to classify documents, e.g., Web pages, in the taxonomy.

Abstract: A system and method for optimizing I/O to low-level index access during bulk-routing through a taxonomy to classify documents, e.g., Web pages, in the taxonomy. In a first optimization, bulk-routing is regarded as a generalized join operation in a relational database framework. In a second optimization, instead of processing each document individually through nodes of the taxonomy, a group of documents are processed node by node in a wavefront-style routing scheme for better amortization of index probes.
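The second optimization can be illustrated with a minimal sketch. It is not the patented method: the toy taxonomy, the term-overlap scoring, and every name below (`TAXONOMY`, `NODE_TERMS`, `Index`, `route_wavefront`) are hypothetical, chosen only to show why routing a group of documents node by node needs fewer index probes than routing each document on its own.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: internal node -> child nodes.
TAXONOMY = {
    "root": ["arts", "sports"],
    "arts": ["music", "painting"],
    "sports": ["cycling", "tennis"],
}

# Hypothetical per-node term sets standing in for on-disk index statistics.
NODE_TERMS = {
    "arts": {"music", "painting", "concert", "canvas"},
    "sports": {"cycling", "tennis", "race", "match"},
    "music": {"music", "concert"},
    "painting": {"painting", "canvas"},
    "cycling": {"cycling", "race"},
    "tennis": {"tennis", "match"},
}


class Index:
    """Counts probes; one probe fetches the statistics of a node's children."""

    def __init__(self):
        self.probes = 0

    def probe(self, node):
        self.probes += 1
        return {child: NODE_TERMS[child] for child in TAXONOMY[node]}


def best_child(doc_terms, child_stats):
    # Assumed scoring: largest term overlap, ties broken alphabetically.
    return max(sorted(child_stats), key=lambda c: len(doc_terms & child_stats[c]))


def route_one_by_one(docs, index):
    """Baseline: every document probes every node on its own path."""
    result = {}
    for doc_id, terms in docs.items():
        node = "root"
        while node in TAXONOMY:
            node = best_child(terms, index.probe(node))
        result[doc_id] = node
    return result


def route_wavefront(docs, index):
    """Wavefront: each node is probed once for the whole group waiting there."""
    result = {}
    frontier = {"root": list(docs)}
    while frontier:
        next_frontier = defaultdict(list)
        for node, doc_ids in frontier.items():
            if node not in TAXONOMY:  # leaf reached: record the final class
                for doc_id in doc_ids:
                    result[doc_id] = node
                continue
            stats = index.probe(node)  # single probe shared by the group
            for doc_id in doc_ids:
                next_frontier[best_child(docs[doc_id], stats)].append(doc_id)
        frontier = next_frontier
    return result


DOCS = {
    "d1": {"music", "concert"},
    "d2": {"cycling", "race"},
    "d3": {"painting", "canvas"},
    "d4": {"tennis", "match"},
}

baseline_index, wavefront_index = Index(), Index()
baseline = route_one_by_one(DOCS, baseline_index)
wavefront = route_wavefront(DOCS, wavefront_index)
```

In this toy setup both strategies assign the same classes, but the baseline issues one probe per document per level while the wavefront issues one probe per visited internal node.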

14 citations

Patent•

Method and system for distributed autonomous maintenance of bidirectional hyperlink metadata on the web and similar hypermedia repository

[...]

Soumen Chakrabarti¹, Byron Dom¹, David Gibson¹, Kevin S. McCurley¹, Martin Henk van den Berg¹•Institutions (1)

IBM¹

10 Mar 2000

TL;DR: A Web server stores a table of Web page inlinks; when a Web page is accessed and a user wants to access other pages related to it, the user requests the table of inlinks and from it generates a list of sibling links to the accessed page, the sibling links being outlinks of one or more of the inlinks in the table.

Abstract: A Web server stores a table of Web page inlinks. When a Web page is accessed and a user wants to access other pages related to the accessed page, the user requests the table of inlinks, and from it generates a list of sibling links to the accessed page, the sibling links being outlinks of one or more of the inlinks in the table.
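As a rough illustration of the idea in this abstract, the sketch below derives sibling links from an inlink table. It is an assumption-laden toy, not the patented system: the link graph and the names `OUTLINKS`, `build_inlink_table`, and `sibling_links` are hypothetical, and the whole graph is held in memory rather than served by a Web server.

```python
# Hypothetical link graph: source page -> pages it links to (its outlinks).
OUTLINKS = {
    "hub1": ["pageA", "pageB", "pageC"],
    "hub2": ["pageA", "pageD"],
    "hub3": ["pageE"],
}


def build_inlink_table(outlinks):
    """Invert the outlink graph: target page -> pages that link to it."""
    table = {}
    for source, targets in outlinks.items():
        for target in targets:
            table.setdefault(target, []).append(source)
    return table


def sibling_links(page, inlink_table, outlinks):
    """Siblings of `page`: other pages linked from any page that links to `page`."""
    siblings = []
    for source in inlink_table.get(page, []):
        for target in outlinks.get(source, []):
            if target != page and target not in siblings:
                siblings.append(target)
    return siblings


INLINKS = build_inlink_table(OUTLINKS)
siblings_of_a = sibling_links("pageA", INLINKS, OUTLINKS)
siblings_of_e = sibling_links("pageE", INLINKS, OUTLINKS)
```

Here `pageA` is linked from `hub1` and `hub2`, so its siblings are the other pages those two hubs point to; `pageE` shares its only inlink with no other page and has no siblings.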

14 citations

Proceedings Article•DOI•

Data mining for hypertext (tutorial session) (title only)

[...]

Soumen Chakrabarti

01 Aug 2000

3 citations

Proceedings Article•DOI•

Hypertext data mining (tutorial AM-1)

[...]

Soumen Chakrabarti¹•Institutions (1)