
Showing papers by "Andrei Z. Broder" published in 2006


Journal ArticleDOI
TL;DR: This work presents a framework for approximating random-walk based probability distributions over Web pages using graph aggregation that can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host.
Abstract: We present a framework for approximating random-walk-based probability distributions over Web pages using graph aggregation. The basic idea is to partition the graph into classes of quasi-equivalent vertices, to project the page-based random walk to be approximated onto those classes, and to compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web graph containing over 1.4 billion pages and over 6.6 billion links from a crawl of the Web conducted by AltaVista in September 2003. We were able to produce a ranking that has Spearman rank-order correlation of 0.95 with respect to PageRank. The clock time required by a simplistic implementation of our method was less than half the time required by a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.
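
For illustration, here is a minimal sketch of the aggregation idea on a toy page graph: pages are partitioned into classes (e.g., by host), the page-level walk is projected onto the class graph, the class-level stationary distribution is computed, and page scores are reconstructed from it. The uniform within-class redistribution and the plain power iteration are simplifying assumptions, not the paper's exact projection and reconstruction steps.

```python
# Toy sketch (not the paper's exact algorithm) of PageRank approximation
# by graph aggregation. Assumed inputs: `links` maps each page to the
# pages it links to; `host_of` maps each page to its class (its Web host).

from collections import defaultdict

def aggregate_pagerank(links, host_of, damping=0.85, iters=50):
    hosts = sorted(set(host_of.values()))
    pages_in = defaultdict(list)
    for page, host in host_of.items():
        pages_in[host].append(page)

    # Project the page-level walk onto classes: starting from a uniform
    # page within host h1, each edge p -> q adds 1/(outdeg(p) * |h1|)
    # to the class transition weight h1 -> h2.
    w = defaultdict(float)
    for p, outs in links.items():
        for q in outs:
            h1, h2 = host_of[p], host_of[q]
            w[(h1, h2)] += 1.0 / (len(outs) * len(pages_in[h1]))

    # Power iteration on the class-based random walk, with the usual
    # uniform teleportation; renormalizing absorbs dangling-page leakage.
    rank = {h: 1.0 / len(hosts) for h in hosts}
    for _ in range(iters):
        nxt = {h: (1.0 - damping) / len(hosts) for h in hosts}
        for (h1, h2), weight in w.items():
            nxt[h2] += damping * rank[h1] * weight
        total = sum(nxt.values())
        rank = {h: r / total for h, r in nxt.items()}

    # Reconstruct a page-level distribution; splitting each class's mass
    # uniformly over its pages is a deliberate simplification.
    return {p: rank[h] / len(pages_in[h]) for h in hosts for p in pages_in[h]}
```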

95 citations


Proceedings ArticleDOI
06 Nov 2006
TL;DR: The main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance.
Abstract: We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample-generation problem and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
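
A minimal sketch of the basic estimator, under simplifying assumptions: the query pool is a set of single terms, and each returned document's text can be fetched to count how many pool queries it matches (its multiplicity). The functions `search` and `fetch_terms` are hypothetical stand-ins for a real query interface; weighting each matched document by the inverse of its multiplicity is what makes the estimator unbiased.

```python
import random

def estimate_set_size(pool, search, fetch_terms, samples=1000):
    """pool: the uniformly sampleable query pool (here, a list of terms).
       search(term) -> ids of documents matching that term (hypothetical).
       fetch_terms(doc_id) -> set of terms in the document (hypothetical)."""
    pool_set = set(pool)
    total = 0.0
    for _ in range(samples):
        term = random.choice(pool)            # uniform sample from the pool
        x = 0.0
        for doc in search(term):
            # Multiplicity: how many pool queries this document matches.
            deg = len(fetch_terms(doc) & pool_set)
            x += 1.0 / deg                    # inverse-multiplicity weight
        total += x
    # For a uniform query, E[x] = |D| / |pool|, where D is the set of
    # documents matching at least one pool query -- hence the estimate:
    return len(pool) * total / samples
```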

72 citations


Book ChapterDOI
26 Mar 2006
TL;DR: This paper describes a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once, and shows how this representation model can be encoded in an inverted index.
Abstract: Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
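
One plausible reading of this model, sketched below: each document is a node in a tree, a node stores only the content not already present on an ancestor, and each term is indexed once at the node that contributes it; at query time a posting at a node implicitly covers all of its descendants. This is an illustrative reconstruction, not the paper's actual inverted-index encoding.

```python
# Hypothetical sketch of indexing shared content once via a document tree.

from collections import defaultdict

class Node:
    """One document in the tree; `text` holds only the content that is
    not already present on an ancestor."""
    def __init__(self, text, parent=None):
        self.terms = set(text.split())
        self.children = []
        if parent:
            parent.children.append(self)

def build_index(nodes):
    index = defaultdict(set)
    for node in nodes:
        for term in node.terms:          # shared content indexed just once
            index[term].add(node)
    return index

def expand(postings):
    """A term stored at a node is inherited by all of its descendants."""
    covered, stack = set(), list(postings)
    while stack:
        n = stack.pop()
        if n not in covered:
            covered.add(n)
            stack.extend(n.children)
    return covered

def evaluate(index, query):
    """Conjunctive free-text query over the tree-encoded index."""
    sets = [expand(index.get(t, set())) for t in query.split()]
    return set.intersection(*sets) if sets else set()
```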

51 citations


Patent
30 Nov 2006
TL;DR: A method for querying multifaceted information: an inverted index is constructed whose unique indexed tokens, each associated with a posting list of one or more documents, include facet tokens and their path prefixes; query constraints are mapped to indexed tokens and resolved by intersecting the corresponding posting lists.
Abstract: A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
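
A minimal sketch of the indexing and query-execution scheme described here, with hypothetical names: each facet annotation such as Products/Electronics/Cameras yields one full-path token plus a token per path prefix, and a query is executed by intersecting the posting lists of its constraint tokens.

```python
# Hypothetical sketch of facet tokens and path-prefix tokens in an
# inverted index, with conjunctive query execution by intersection.

from collections import defaultdict

def index_documents(docs):
    """docs: dict doc_id -> {'text': str, 'facets': list of 'a/b/c' paths}"""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc['text'].split():
            index[term].add(doc_id)
        for path in doc['facets']:
            parts = path.split('/')
            # One full-path token plus one token per path prefix.
            for i in range(1, len(parts) + 1):
                index['facet:' + '/'.join(parts[:i])].add(doc_id)
    return index

def query(index, terms=(), facet_paths=()):
    """Conjunction of free-text terms and facet constraints."""
    tokens = list(terms) + ['facet:' + p for p in facet_paths]
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings) if postings else set()

# Example: documents under Products/Electronics whose text matches 'zoom':
# hits = query(index, terms=['zoom'], facet_paths=['Products/Electronics'])
```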

36 citations


Proceedings ArticleDOI
06 Nov 2006
TL;DR: It is shown that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms, and that optimizing the efficiency of query execution by careful selection of these terms can further reduce the query costs.
Abstract: Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the "best" short query that characterizes a document class using operators normally available within large engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. Moreover, we show that optimizing the efficiency of query execution by careful selection of these terms can further reduce the query costs. More precisely, we show that on our setup the best 10-term query can achieve 90% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 86% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
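
A minimal sketch of the basic idea, assuming a linear classifier whose per-term weights are available: keep the k terms with the largest weights as a short weighted disjunction and classify by retrieval score. The paper's actual term selection and its efficiency-aware optimization are more involved.

```python
# Hypothetical sketch: compressing a linear classifier into a short query.

def short_query(weights, k=10):
    """weights: term -> weight of a learned linear classifier (assumed given)."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)                     # the k-term query, weights retained

def classify(index, query_weights, threshold):
    """index: term -> set of doc ids. A document's score is the sum of the
       weights of the query terms it matches; score >= threshold => in class."""
    scores = {}
    for term, weight in query_weights.items():
        for doc in index.get(term, ()):
            scores[doc] = scores.get(doc, 0.0) + weight
    return {doc for doc, score in scores.items() if score >= threshold}
```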

19 citations


01 Jan 2006
TL;DR: This paper shows how the autocompletion data structure of [2] can be used to answer faceted-search queries efficiently and obtains very fast query processing times, improving those obtained by standard approaches by an order of magnitude.
Abstract: In this paper, we show how the autocompletion data structure of [2] can be used to answer faceted-search queries efficiently. Specifically, we have built a fully functional browser-based search engine that can index collections with arbitrary category information and that, after each keystroke from the user, computes and displays the following information: (i) words or phrases that begin with the last query word and would lead to good hits; (ii) the most relevant categories for those hits; (iii) any category names that match the query as typed so far; (iv) the most relevant hits for the query as typed so far. By appropriately rewriting the faceted-search queries as autocompletion queries according to [2], we obtain very fast query processing times, improving those obtained by standard approaches by an order of magnitude. On 11,685 scientific articles from the DBLP collection, with their full text and categorized by author, conference, and year, the average query processing time is about 25 milliseconds, on a single machine and with the index on disk. For the 2,172,832 articles of the latest dump of the English Wikipedia, with their full text and categorized by Wikipedia's own category labels, we achieve an average query processing time of about 350 milliseconds.
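
A loose, hypothetical sketch of the per-keystroke computation (the actual data structure of [2] is far more compact and faster; this only illustrates rewriting a faceted-search query as an autocompletion query): complete the last prefix against the vocabulary, intersect with the words typed so far, and rank categories by frequency among the hits.

```python
# Hypothetical per-keystroke faceted autocompletion over a toy index.

from collections import Counter

def keystroke(index, categories, words_typed, prefix):
    """index: term -> set of doc ids; categories: doc id -> list of labels;
       words_typed: completed query words; prefix: the word being typed."""
    completions = {t for t in index if t.startswith(prefix)}
    hits = set().union(*(index[t] for t in completions)) if completions else set()
    for word in words_typed:             # completed words constrain the hits
        hits &= index.get(word, set())
    matching_cats = [c for d in hits for c in categories.get(d, [])]
    top_categories = Counter(matching_cats).most_common(5)
    return sorted(completions), hits, top_categories
```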

14 citations


Book ChapterDOI
Andrei Z. Broder
04 Jul 2006
TL;DR: This talk argues for the trend towards context-driven Information Supply (IS): the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query.

Abstract: In the past decade, Web search engines have evolved from a first generation based on classic Information Retrieval (IR) algorithms scaled to web size and thus supporting only informational queries, to a second generation supporting navigational queries using web-specific information (primarily link analysis), to a third generation enabling transactional and other "semantic" queries based on a variety of technologies aimed to directly satisfy the unexpressed "user intent." What is coming next? In this talk, we argue for the trend towards context-driven Information Supply (IS); that is, the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query. The information supply concept greatly precedes information retrieval. What is new in the web framework is the ability to supply relevant information specific to a given activity and a given user, while the activity is being performed. A prime example is the matching of ads to content being read; however, the information supply paradigm is starting to appear in other contexts such as social networks, e-commerce, browsers, e-mail, and others.

8 citations