scispace - formally typeset
Search or ask a question
Author

Mark A. Sheldon

Bio: Mark A. Sheldon is an academic researcher from Massachusetts Institute of Technology. The author has contributed to research in topics: Effect system & Query optimization. The author has an hindex of 9, co-authored 10 publications receiving 1158 citations.

Papers
More filters
Proceedings ArticleDOI
01 Sep 1991
TL;DR: Experimental results from a semantic file system implementation support the thesis that semantic file systems present a more effective storage abstraction than do traditional tree structured file systems for information sharing and command level programming.
Abstract: A semantic file system is an information storage system that provides flexible associative access to the system's contents by automatically extracting attributes from files with file type specific transducers. Associative access is provided by a conservative extension to existing tree-structured file system protocols, and by protocols that are designed specifically for content based access. Compatiblity with existing file system protocols is provided by introducing the concept of a virtual directory. Virtual directory names are interpreted as queries, and thus provide flexible associative access to files and directories in a manner compatible with existing software. Rapid attribute-based access to file system contents is implemented by automatic extraction and indexing of key properties of file system objects. The automatic indexing of files and directories is called "semantic" because user programmable transducers use information about the semantics of updated file system objects to extract the properties for indexing. Experimental results from a semantic file system implementation support the thesis that semantic file systems present a more effective storage abstraction than do traditional tree structured file systems for information sharing and command level programming.

500 citations

Proceedings ArticleDOI
01 Mar 1996
TL;DR: Experience with HyPursuit suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies, and is encouraged by preliminary results on clustering based on both document contents and hyperlink structures.
Abstract: HyPursuit is a new hierarchical network search engine that clusters hypertext documents to structure a given information space for browsing and search act ivities. Our content-link clustering algorithm is based on the semantic information embedded in hyperlink structures and document contents. HyPursuit admits multiple, coexisting cluster hierarchies based on different principles for grouping documents, such as the Library of Congress catalog scheme and automatically created hypertext clusters. HyPursuit’s abstraction functions summarize cluster contents to support scalable query processing. The abstraction functions satisfy system resource limitations with controlled information 10SS. The result of query processing operations on a cluster summary approximates the result of performing the operations on the entire information space. We constructed a prototype system comprising 100 leaf WorldWide Web sites and a hierarchy of 42 servers that route queries to the leaf sites. Experience with our system suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies. We are also encouraged by preliminary results on clustering based on both document contents and hyperlink structures.

342 citations

Proceedings ArticleDOI
01 Jul 1997
TL;DR: RMAP as mentioned in this paper is a fast and practical query refinement algorithm that refines multiple term queries by dynamically combining precomputed suggestions for single term queries and achieves accuracy comparable to a much slower algorithm, although both algorithms lag behind the best possible term suggestions offered by the oracle.
Abstract: Query Refinement is an essential information retrieval tool that interactively recommends new terms related to a particular query. This paper introduces concept recall, an experimental measure of an algorithm’s ability to suggest terms humans have judged to be semantically related to an information need. This study uses precision improvement experiments to measure the ability of an algorithm to produce single term query modifications that predict a user’s information need as partially encoded by the query. An omcie algorithm produces ideal query modifications, providing a meaningful context for interpreting precision improvement results. This study also introduces RMAP, a fast and practical query refinement algorithm that refines multiple term queries by dynamically combining precomputed suggestions for single term queries. RMAP achieves accuracy comparable to a much slower algorithm, although both RMAP and the slower algorithm lag behind the best possible term suggestions offered by the oracle. We believe RMAP is fast enough to be integrated into present day Internet search engines: RMAP computes 100 term suggestions for a 160,000 document collection in 15 ms on a low-end PC.

138 citations

Journal ArticleDOI
01 Apr 1995
TL;DR: The experience with query refinement has convinced us that the expansion of query fragments is essential in helping one use a large, dynamically changing, heterogenous distributed information system.
Abstract: We have built an HTTP based resource discovery system called Discover that provides a single point of access to over 500 WAIS servers. Discover provides two key services: query refinement and query routing . Query refinement helps a user improve a query fragment to describe the user's interests more precisely. Once a query has been refined and describes a manageable result set, query routing automatically forwards the query to the WAIS servers that contain relevant documents. Abbreviated descriptions of WAIS sites called content labels are used by the query refinement and query routing algorithms. Our experimental results suggest that query refinement in conjunction with the query routing provides an effective way to discover resources in a large universe of documents. Our experience with query refinement has convinced us that the expansion of query fragments is essential in helping one use a large, dynamically changing, heterogenous distributed information system.

65 citations

Book ChapterDOI
02 May 1994
TL;DR: A system that provides query based associative access to the contents of distributed information servers using content labels to permit users to learn about available resources and to formulate queries with adequate discriminatory power is described.
Abstract: We describe a system that provides query based associative access to the contents of distributed information servers. In typical distributed information systems there are so many objects that underconstrained queries can produce large result sets and extraordinary processing costs. To deal with this scaling problem we use content labels to permit users to learn about available resources and to formulate queries with adequate discriminatory power. We have implemented associative access to a distributed set of information servers in the content routing system. A content routing system is a hierarchy of servers called content routers that present a single query based image of a distributed information system. We have successfully used a prototype content routing system to locate documents on several user file systems and on a large number of information servers.

37 citations


Cited by
More filters
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations

Proceedings Article
11 Nov 1999
TL;DR: This paper describes PageRank, a mathod for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
Abstract: The importance of a Web page is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a mathod for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.

14,400 citations

Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal ArticleDOI
Jon Kleinberg1
TL;DR: This work proposes and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure, and has connections to the eigenvectors of certain matrices associated with the link graph.
Abstract: The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of context on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authorative” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristrics for link-based analysis.

8,328 citations

Book ChapterDOI
TL;DR: Pastry as mentioned in this paper is a scalable, distributed object location and routing substrate for wide-area peer-to-peer ap- plications, which performs application-level routing and object location in a po- tentially very large overlay network of nodes connected via the Internet.
Abstract: This paper presents the design and evaluation of Pastry, a scalable, distributed object location and routing substrate for wide-area peer-to-peer ap- plications. Pastry performs application-level routing and object location in a po- tentially very large overlay network of nodes connected via the Internet. It can be used to support a variety of peer-to-peer applications, including global data storage, data sharing, group communication and naming. Each node in the Pastry network has a unique identifier (nodeId). When presented with a message and a key, a Pastry node efficiently routes the message to the node with a nodeId that is numerically closest to the key, among all currently live Pastry nodes. Each Pastry node keeps track of its immediate neighbors in the nodeId space, and notifies applications of new node arrivals, node failures and recoveries. Pastry takes into account network locality; it seeks to minimize the distance messages travel, according to a to scalar proximity metric like the number of IP routing hops. Pastry is completely decentralized, scalable, and self-organizing; it automatically adapts to the arrival, departure and failure of nodes. Experimental results obtained with a prototype implementation on an emulated network of up to 100,000 nodes confirm Pastry's scalability and efficiency, its ability to self-organize and adapt to node failures, and its good network locality properties.

7,423 citations