scispace - formally typeset
Search or ask a question

Showing papers by "Eugene J. Shekita published in 2006"


Proceedings ArticleDOI
23 May 2006
TL;DR: This paper proposes two ways to obtain user annotations, using explicit and implicit feedback, and shows how they can be integrated into a search engine.
Abstract: A major difference between corporate intranets and the Internet is that in intranets the barrier for users to create web pages is much higher. This limits the amount and quality of anchor text, one of the major factors used by Internet search engines, making intranet search more difficult. The social phenomenon at play also means that spam is relatively rare. Both on the Internet and in intranets, users are often willing to cooperate with the search engine in improving the search experience. These characteristics naturally lead to considering using user feedback to improve search quality in intranets. In this paper we show how a particular form of feedback, namely user annotations, can be used to improve the quality of intranet search. An annotation is a short description of the contents of a web page, which can be considered a substitute for anchor text. We propose two ways to obtain user annotations, using explicit and implicit feedback, and show how they can be integrated into a search engine. Preliminary experiments on the IBM intranet demonstrate that using annotations improves the search quality.

104 citations


Book ChapterDOI
26 Mar 2006
TL;DR: This paper describes a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once, and shows how this representation model can be encoded in an inverted index.
Abstract: Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.

51 citations


Patent
30 Nov 2006
TL;DR: An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents as discussed by the authors, which is a method for querying multifaceted information that includes constraints on documents, associated with indexed tokens and corresponding posting lists.
Abstract: A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.

36 citations


Patent
21 Nov 2006
TL;DR: Disclosed as discussed by the authors is an evaluation technique for text search with black-box scoring functions, where it is unnecessary for the evaluation engine to maintain details of the scoring function, and proofs of correctness, as well experimental evidence showing that the performance of the technique is comparable in efficiency to those used in custom-built engines.
Abstract: Disclosed is an evaluation technique for text search with black-box scoring functions, where it is unnecessary for the evaluation engine to maintain details of the scoring function. Included is a description of a system for dealing with blackbox searching, proofs of correctness, as well experimental evidence showing that the performance of the technique is comparable in efficiency to those techniques used in custom-built engines.

17 citations


Posted Content
TL;DR: This paper introduces Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information.
Abstract: ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely the ideas of: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data - unstructured as well as structured - in a uniform way; (c) achieving scale-out by exploiting simple, massive parallel processing, and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises.

8 citations