
Showing papers by "Andrei Z. Broder" published in 2006


Journal ArticleDOI
TL;DR: This work presents a framework for approximating random-walk based probability distributions over Web pages using graph aggregation that can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host.
Abstract: We present a framework for approximating random-walk-based probability distributions over Web pages using graph aggregation. The basic idea is to partition the graph into classes of quasi-equivalent vertices, to project the page-based random walk to be approximated onto those classes, and to compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web graph containing over 1.4 billion pages and over 6.6 billion links from a crawl of the Web conducted by AltaVista in September 2003. We were able to produce a ranking that has Spearman rank-order correlation of 0.95 with respect to PageRank. The clock time required by a simplistic implementation of our method was less than half the time required by a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.
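
For illustration, here is a minimal sketch of the aggregation idea on a toy page graph: pages are partitioned into classes (e.g., by host), the page-level walk is projected onto the class graph, the class-level stationary distribution is computed, and page scores are reconstructed from it. The uniform within-class redistribution and the plain power iteration are simplifying assumptions, not the paper's exact projection and reconstruction steps.

```python
# Toy sketch (not the paper's exact algorithm) of PageRank approximation
# by graph aggregation. Assumed inputs: `links` maps each page to the
# pages it links to; `host_of` maps each page to its class (its Web host).

from collections import defaultdict

def aggregate_pagerank(links, host_of, damping=0.85, iters=50):
    hosts = sorted(set(host_of.values()))
    pages_in = defaultdict(list)
    for page, host in host_of.items():
        pages_in[host].append(page)

    # Project the page-level walk onto classes: starting from a uniform
    # page within host h1, each edge p -> q adds 1/(outdeg(p) * |h1|)
    # to the class transition weight h1 -> h2.
    w = defaultdict(float)
    for p, outs in links.items():
        for q in outs:
            h1, h2 = host_of[p], host_of[q]
            w[(h1, h2)] += 1.0 / (len(outs) * len(pages_in[h1]))

    # Power iteration on the class-based random walk, with the usual
    # uniform teleportation; renormalizing absorbs dangling-page leakage.
    rank = {h: 1.0 / len(hosts) for h in hosts}
    for _ in range(iters):
        nxt = {h: (1.0 - damping) / len(hosts) for h in hosts}
        for (h1, h2), weight in w.items():
            nxt[h2] += damping * rank[h1] * weight
        total = sum(nxt.values())
        rank = {h: r / total for h, r in nxt.items()}

    # Reconstruct a page-level distribution; splitting each class's mass
    # uniformly over its pages is a deliberate simplification.
    return {p: rank[h] / len(pages_in[h]) for h in hosts for p in pages_in[h]}
```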

95 citations


Proceedings ArticleDOI
06 Nov 2006
TL;DR: The main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance.
Abstract: We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approach requires a uniform random sample of documents from the corpus. The second approach avoids this notoriously difficult sample-generation problem and instead uses two fairly uncorrelated sets of terms as query pools; the accuracy of the second approach depends on the degree of correlation between the two sets of terms. Experiments on a large TREC collection and on three major search engines demonstrate the effectiveness of our algorithms.
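
A minimal sketch of the basic estimator, under simplifying assumptions: the query pool is a set of single terms, and each returned document's text can be fetched to count how many pool queries it matches (its multiplicity). The functions `search` and `fetch_terms` are hypothetical stand-ins for a real query interface; weighting each matched document by the inverse of its multiplicity is what makes the estimator unbiased.

```python
import random

def estimate_set_size(pool, search, fetch_terms, samples=1000):
    """pool: the uniformly sampleable query pool (here, a list of terms).
       search(term) -> ids of documents matching that term (hypothetical).
       fetch_terms(doc_id) -> set of terms in the document (hypothetical)."""
    pool_set = set(pool)
    total = 0.0
    for _ in range(samples):
        term = random.choice(pool)            # uniform sample from the pool
        x = 0.0
        for doc in search(term):
            # Multiplicity: how many pool queries this document matches.
            deg = len(fetch_terms(doc) & pool_set)
            x += 1.0 / deg                    # inverse-multiplicity weight
        total += x
    # For a uniform query, E[x] = |D| / |pool|, where D is the set of
    # documents matching at least one pool query -- hence the estimate:
    return len(pool) * total / samples
```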

72 citations


Book ChapterDOI
26 Mar 2006
TL;DR: This paper describes a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once, and shows how this representation model can be encoded in an inverted index.
Abstract: Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
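
One plausible reading of this model, sketched below: each document is a node in a tree, a node stores only the content not already present on an ancestor, and each term is indexed once at the node that contributes it; at query time a posting at a node implicitly covers all of its descendants. This is an illustrative reconstruction, not the paper's actual inverted-index encoding.

```python
# Hypothetical sketch of indexing shared content once via a document tree.

from collections import defaultdict

class Node:
    """One document in the tree; `text` holds only the content that is
    not already present on an ancestor."""
    def __init__(self, text, parent=None):
        self.terms = set(text.split())
        self.children = []
        if parent:
            parent.children.append(self)

def build_index(nodes):
    index = defaultdict(set)
    for node in nodes:
        for term in node.terms:          # shared content indexed just once
            index[term].add(node)
    return index

def expand(postings):
    """A term stored at a node is inherited by all of its descendants."""
    covered, stack = set(), list(postings)
    while stack:
        n = stack.pop()
        if n not in covered:
            covered.add(n)
            stack.extend(n.children)
    return covered

def evaluate(index, query):
    """Conjunctive free-text query over the tree-encoded index."""
    sets = [expand(index.get(t, set())) for t in query.split()]
    return set.intersection(*sets) if sets else set()
```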

51 citations


Patent
30 Nov 2006
TL;DR: A method for querying multifaceted information: an inverted index is constructed whose unique indexed tokens, each associated with a posting list of one or more documents, include facet tokens and their path prefixes; query constraints are mapped to indexed tokens and resolved by intersecting the corresponding posting lists.
Abstract: A method for querying multifaceted information. An inverted index is constructed to include unique indexed tokens associated with posting lists of one or more documents. An indexed token is either a facet token included in a document as an annotation or a path prefix of the facet token. The annotation indicates a path within a tree structure representing a facet that includes the document. The tree structure includes nodes representing categories of documents. Constructing the inverted index includes generating a full path token and an associated full path token posting list. A query is received that includes constraints on documents. The constraints are associated with indexed tokens and corresponding posting lists. An execution of the query includes identifying the corresponding posting lists by utilizing the constraints and the inverted index and intersecting the posting lists to obtain a query result.
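
A minimal sketch of the indexing and query-execution scheme described here, with hypothetical names: each facet annotation such as Products/Electronics/Cameras yields one full-path token plus a token per path prefix, and a query is executed by intersecting the posting lists of its constraint tokens.

```python
# Hypothetical sketch of facet tokens and path-prefix tokens in an
# inverted index, with conjunctive query execution by intersection.

from collections import defaultdict

def index_documents(docs):
    """docs: dict doc_id -> {'text': str, 'facets': list of 'a/b/c' paths}"""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc['text'].split():
            index[term].add(doc_id)
        for path in doc['facets']:
            parts = path.split('/')
            # One full-path token plus one token per path prefix.
            for i in range(1, len(parts) + 1):
                index['facet:' + '/'.join(parts[:i])].add(doc_id)
    return index

def query(index, terms=(), facet_paths=()):
    """Conjunction of free-text terms and facet constraints."""
    tokens = list(terms) + ['facet:' + p for p in facet_paths]
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings) if postings else set()

# Example: documents under Products/Electronics whose text matches 'zoom':
# hits = query(index, terms=['zoom'], facet_paths=['Products/Electronics'])
```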

36 citations


Proceedings ArticleDOI
06 Nov 2006
TL;DR: It is shown that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms, and that optimizing the efficiency of query execution by careful selection of these terms can further reduce the query costs.
Abstract: Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the "best" short query that characterizes a document class using operators normally available within large engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. Moreover, we show that optimizing the efficiency of query execution by careful selection of these terms can further reduce the query costs. More precisely, we show that on our setup the best 10-term query can achieve 90% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 86% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
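
A minimal sketch of the basic idea, assuming a linear classifier whose per-term weights are available: keep the k terms with the largest weights as a short weighted disjunction and classify by retrieval score. The paper's actual term selection and its efficiency-aware optimization are more involved.

```python
# Hypothetical sketch: compressing a linear classifier into a short query.

def short_query(weights, k=10):
    """weights: term -> weight of a learned linear classifier (assumed given)."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)                     # the k-term query, weights retained

def classify(index, query_weights, threshold):
    """index: term -> set of doc ids. A document's score is the sum of the
       weights of the query terms it matches; score >= threshold => in class."""
    scores = {}
    for term, weight in query_weights.items():
        for doc in index.get(term, ()):
            scores[doc] = scores.get(doc, 0.0) + weight
    return {doc for doc, score in scores.items() if score >= threshold}
```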

19 citations


01 Jan 2006
TL;DR: This paper shows how the autocompletion data structure of [2] can be used to answer faceted-search queries efficiently and obtains very fast query processing times, improving those obtained by standard approaches by an order of magnitude.
Abstract: In this paper, we show how the autocompletion data structure of [2] can be used to answer faceted-search queries efficiently. Specifically, we have built a fully functional browser-based search engine that can index collections with arbitrary category information and that, after each keystroke from the user, computes and displays the following information: (i) words or phrases that begin with the last query word and would lead to good hits; (ii) the most relevant categories for those hits; (iii) any category names that match the query as typed so far; (iv) the most relevant hits for the query as typed so far. By appropriately rewriting the faceted-search queries as autocompletion queries according to [2], we obtain very fast query processing times, improving those obtained by standard approaches by an order of magnitude. On 11,685 scientific articles from the DBLP collection, with their full text and categorized by author, conference, and year, the average query processing time is about 25 milliseconds, on a single machine and with the index on disk. For the 2,172,832 articles of the latest dump of the English Wikipedia, with their full text and categorized by Wikipedia's own category labels, we achieve an average query processing time of about 350 milliseconds.
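
A loose, hypothetical sketch of the per-keystroke computation (the actual data structure of [2] is far more compact and faster; this only illustrates rewriting a faceted-search query as an autocompletion query): complete the last prefix against the vocabulary, intersect with the words typed so far, and rank categories by frequency among the hits.

```python
# Hypothetical per-keystroke faceted autocompletion over a toy index.

from collections import Counter

def keystroke(index, categories, words_typed, prefix):
    """index: term -> set of doc ids; categories: doc id -> list of labels;
       words_typed: completed query words; prefix: the word being typed."""
    completions = {t for t in index if t.startswith(prefix)}
    hits = set().union(*(index[t] for t in completions)) if completions else set()
    for word in words_typed:             # completed words constrain the hits
        hits &= index.get(word, set())
    matching_cats = [c for d in hits for c in categories.get(d, [])]
    top_categories = Counter(matching_cats).most_common(5)
    return sorted(completions), hits, top_categories
```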

14 citations


Book ChapterDOI
Andrei Z. Broder
04 Jul 2006
TL;DR: This talk argues for the trend towards context-driven Information Supply (IS): the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query.

Abstract: In the past decade, Web search engines have evolved from a first generation based on classic Information Retrieval (IR) algorithms scaled to web size and thus supporting only informational queries, to a second generation supporting navigational queries using web-specific information (primarily link analysis), to a third generation enabling transactional and other "semantic" queries based on a variety of technologies aimed to directly satisfy the unexpressed "user intent." What is coming next? In this talk, we argue for the trend towards context-driven Information Supply (IS); that is, the goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query. The information supply concept greatly precedes information retrieval. What is new in the web framework is the ability to supply relevant information specific to a given activity and a given user, while the activity is being performed. A prime example is the matching of ads to content being read; however, the information supply paradigm is starting to appear in other contexts such as social networks, e-commerce, browsers, e-mail, and others.

8 citations