Showing papers by "Eugene J. Shekita" published in 2006

Open Access

Proceedings Article•DOI•

Pavel Dmitriev¹, Nadav Eiron², Marcus Fontoura³, Eugene J. Shekita⁴•Institutions (4)

Cornell University¹, Google², Yahoo!³, IBM⁴

23 May 2006

TL;DR: This paper proposes two ways to obtain user annotations, using explicit and implicit feedback, and shows how they can be integrated into a search engine.

Abstract: A major difference between corporate intranets and the Internet is that in intranets the barrier for users to create web pages is much higher. This limits the amount and quality of anchor text, one of the major factors used by Internet search engines, making intranet search more difficult. The social phenomenon at play also means that spam is relatively rare. Both on the Internet and in intranets, users are often willing to cooperate with the search engine in improving the search experience. These characteristics naturally lead to considering using user feedback to improve search quality in intranets. In this paper we show how a particular form of feedback, namely user annotations, can be used to improve the quality of intranet search. An annotation is a short description of the contents of a web page, which can be considered a substitute for anchor text. We propose two ways to obtain user annotations, using explicit and implicit feedback, and show how they can be integrated into a search engine. Preliminary experiments on the IBM intranet demonstrate that using annotations improves the search quality.

104 citations

Book Chapter•DOI•

Indexing shared content in information retrieval systems

Andrei Z. Broder¹, Nadav Eiron², Marcus Fontoura¹, Michael Herscovici³, Ronny Lempel³, John McPherson³, Runping Qi¹, Eugene J. Shekita³•Institutions (3)

Yahoo!¹, Google², IBM³

26 Mar 2006

TL;DR: This paper describes a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once, and shows how this representation model can be encoded in an inverted index.

Abstract: Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
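The tree-based representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' encoding: the `Node` class, the split into `shared` and `private` text, and the functions `build_index`, `docs_matching`, and `search` are all hypothetical names chosen for illustration. The assumption is that text marked as shared on a node is inherited by every descendant, so it gets one posting instead of one per document.

```python
from collections import defaultdict


class Node:
    """One document in a tree of related documents (hypothetical model)."""

    def __init__(self, doc_id, shared="", private="", parent=None):
        self.doc_id = doc_id
        self.shared = shared    # text inherited by all descendants
        self.private = private  # text that belongs to this document only
        self.children = []
        if parent is not None:
            parent.children.append(self)


def subtree_ids(node):
    """Return the ids of a node and all of its descendants."""
    ids = [node.doc_id]
    for child in node.children:
        ids.extend(subtree_ids(child))
    return ids


def build_index(nodes):
    """Map each term to postings of (node, is_shared); shared text is posted once."""
    index = defaultdict(list)
    for node in nodes:
        for term in set(node.shared.lower().split()):
            index[term].append((node, True))
        for term in set(node.private.lower().split()):
            index[term].append((node, False))
    return index


def docs_matching(index, term):
    """A shared posting matches the whole subtree; a private one matches one doc."""
    matches = set()
    for node, is_shared in index.get(term.lower(), []):
        if is_shared:
            matches.update(subtree_ids(node))
        else:
            matches.add(node.doc_id)
    return matches


def search(index, terms):
    """Conjunctive free-text query: documents containing every term."""
    if not terms:
        return []
    result = docs_matching(index, terms[0])
    for term in terms[1:]:
        result &= docs_matching(index, term)
    return sorted(result)


# Toy email thread: replies inherit the text of the messages they quote.
root = Node(1, shared="budget meeting friday")
reply = Node(2, shared="room changed", private="thanks", parent=root)
reply2 = Node(3, private="see you there", parent=reply)
other = Node(4, private="agreed", parent=root)
INDEX = build_index([root, reply, reply2, other])
```

In this toy thread, `search(INDEX, ["budget"])` matches all four messages even though "budget" has a single posting, which is the space saving the abstract refers to.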

51 citations

Patent•

Efficient multifaceted search in information retrieval systems

Andrei Z. Broder¹, Nadav Eiron¹, Felipe Marcus Fontoura¹, Ronny Lempel¹, Ning Li¹, John Ai McPherson¹, Andreas Neumann¹, Shila Ofek-Koifman¹, Runping Qi¹, Eugene J. Shekita¹•Institutions (1)

IBM¹

30 Nov 2006

TL;DR: A method for querying multifaceted information in which an inverted index is constructed from unique indexed tokens, each associated with a posting list of one or more documents, and query constraints on documents are mapped to indexed tokens and their corresponding posting lists.

Abstract: A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
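The indexing scheme in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented method: facet paths are assumed to be slash-separated strings, the function names `path_prefixes`, `build_index`, and `query` are hypothetical, and the separate full path token mentioned in the abstract is not modeled.

```python
from collections import defaultdict


def path_prefixes(path):
    """Return every prefix of a facet path: 'a/b/c' -> ['a', 'a/b', 'a/b/c']."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def build_index(docs):
    """Build token -> sorted posting list from doc_id -> list of facet paths.

    Each facet path and each of its prefixes becomes an indexed token, so a
    constraint on a category also matches documents in its subcategories.
    """
    index = defaultdict(set)
    for doc_id, paths in docs.items():
        for path in paths:
            for token in path_prefixes(path):
                index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def query(index, constraints):
    """Intersect the posting lists of all constraint tokens."""
    if not constraints:
        return []
    lists = [set(index.get(token, ())) for token in constraints]
    return sorted(set.intersection(*lists))


# Toy collection annotated with two facets (location and type).
DOCS = {
    1: ["location/us/ca", "type/report"],
    2: ["location/us/ny", "type/report"],
    3: ["location/us/ca", "type/memo"],
}
FACET_INDEX = build_index(DOCS)
```

With this toy data, `query(FACET_INDEX, ["location/us", "type/report"])` intersects the posting list for the prefix token `location/us` with the one for `type/report` and returns documents 1 and 2.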

36 citations

Patent•

Adaptive evaluation of text search queries with blackbox scoring functions

Kevin Scott Beyer¹, Robert W. Lyle¹, Sridhar Rajagopalan¹, Eugene J. Shekita¹•Institutions (1)

IBM¹

21 Nov 2006

TL;DR: An evaluation technique for text search with black-box scoring functions, in which the evaluation engine does not need to maintain details of the scoring function; the disclosure includes proofs of correctness as well as experimental evidence that the technique is comparable in efficiency to those used in custom-built engines.

Abstract: Disclosed is an evaluation technique for text search with black-box scoring functions, where it is unnecessary for the evaluation engine to maintain details of the scoring function. Included is a description of a system for dealing with black-box searching, proofs of correctness, as well as experimental evidence showing that the performance of the technique is comparable in efficiency to those techniques used in custom-built engines.

17 citations

Posted Content•

Impliance: A Next Generation Information Management Appliance

Bishwaranjan Bhattacharjee, Vuk Ercegovac, Joseph S. Glider, Richard A. Golding, Guy M. Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert M. Rees, Frederick Reiss, Eugene J. Shekita, Garret Swart¹•Institutions (1)

IBM¹

22 Dec 2006 - arXiv: Databases

TL;DR: This paper introduces Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information.

Abstract: ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely the ideas of: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data - unstructured as well as structured - in a uniform way; (c) achieving scale-out by exploiting simple, massive parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises.

8 citations