Showing papers by "Andrei Z. Broder" published in 2004


Journal ArticleDOI
TL;DR: This paper surveys the ways in which Bloom filters have been used and modified in a variety of network problems, providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives, but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. This paper surveys the ways in which Bloom filters have been used and modified in a variety of network problems, with the goal of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
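
To make the data structure concrete, here is a minimal Python sketch of a Bloom filter (an illustration, not code from the survey); the class name, the parameters m and k, and the use of salted SHA-256 digests as the k hash functions are all choices made for this example:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array written through k hash functions.

    Membership tests may return false positives, never false negatives.
    """

    def __init__(self, m: int, k: int):
        self.m = m                    # number of bits
        self.k = k                    # number of hash functions
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k bit positions from k salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.add("example.com")
print("example.com" in bf)    # True
print("absent.org" in bf)     # almost certainly False
```

With n items stored in m bits, choosing k ≈ (m/n) ln 2 hash functions minimizes the false-positive probability at roughly (0.6185)^(m/n), the space/accuracy trade-off the survey analyzes.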

2,199 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A strong notion of a decay measure is formalized, and a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users are described.
Abstract: The rapid growth of the web has been noted and tracked extensively. Recent studies, however, have documented the dual phenomenon: web pages have small half-lives, and thus the web exhibits rapid death as well. Consequently, page creators face an increasingly burdensome task of keeping links up-to-date, and many are falling behind. Beyond individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers seeking a way out of these stale regions and back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure through a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
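
To illustrate why a recursive measure is less naive than counting a page's dead links directly, the Python sketch below estimates a decay-style score by random walks: a dead page scores 1, and a live page scores the expected value of a uniformly random out-link, explored to bounded depth. This is a hypothetical simplification for exposition, not the paper's exact definition; `decay`, `links`, and `is_dead` are names invented for this example:

```python
import random

def decay(page, links, is_dead, depth=5, samples=200):
    """Monte Carlo estimate of a recursive decay score for `page`.

    A dead page scores 1; a live page scores the expected score of a
    uniformly random out-link, explored to a bounded depth. `links`
    maps a page to its out-links; `is_dead` flags dead pages.
    """
    def walk(p, d):
        if is_dead(p):
            return 1.0
        outs = links.get(p, [])
        if d == 0 or not outs:
            return 0.0
        return walk(random.choice(outs), d - 1)

    return sum(walk(page, depth) for _ in range(samples)) / samples

links = {"a": ["b", "c"], "b": ["gone"], "c": ["a"]}
print(decay("a", links, is_dead=lambda p: p == "gone"))  # around 0.75
```

Note that page "a" has no dead links of its own, yet most walks from it soon reach a dead page, which is exactly the kind of stale neighborhood a per-page dead-link count would miss.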

166 citations


Patent
21 Jan 2004
TL;DR: A computerized method estimates the relative coverage of Web search engines by generating random queries, each a logical combination of words found in a subset of the indexed pages.
Abstract: A computerized method is used to estimate the relative coverage of Web search engines. Each search engine maintains an index of words of pages located at specific URL addresses in a network. The method generates a random query, a logical combination of words found in a subset of the pages. The random query is submitted to a first search engine. In response, a set of URLs of pages matching the query is received. Each URL identifies a page indexed by the first search engine that satisfies the random query. A particular URL identifying a sample page is randomly selected. A strong query corresponding to the sample page is generated and submitted to a second search engine. Result information received in response to the strong query is compared to determine whether the second search engine has indexed the sample page, or a page substantially similar to it. This procedure is repeated to gather statistical data, which is used to estimate the relative sizes and degree of overlap of the search engines.
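
The sampling procedure lends itself to a compact sketch. The toy Python code below is an illustration, not the patent's exact method: `ToyEngine`, `estimate_coverage`, and the use of all of a page's words as its "strong query" are simplifications invented for this example.

```python
import random

class ToyEngine:
    """Stand-in for a search engine's index: URL -> set of page words."""

    def __init__(self, pages):
        self.pages = pages

    def search(self, words):
        # Conjunctive query: URLs whose pages contain every query word.
        return [u for u, ws in self.pages.items() if set(words) <= ws]

def estimate_coverage(a, b, lexicon, trials=500):
    """Estimate the fraction of engine a's pages also indexed by engine b."""
    hits = tries = 0
    for _ in range(trials):
        query = random.sample(lexicon, 2)      # random conjunctive query
        results = a.search(query)
        if not results:
            continue
        url = random.choice(results)           # sample one matching page
        strong = sorted(a.pages[url])          # "strong query": all its words
        tries += 1
        if b.search(strong):                   # did b index (a copy of) it?
            hits += 1
    return hits / tries if tries else 0.0

a = ToyEngine({"u1": {"red", "fox"}, "u2": {"blue", "sky"}})
b = ToyEngine({"u1": {"red", "fox"}})
print(estimate_coverage(a, b, ["red", "fox", "blue", "sky"]))  # near 0.5
```

Running the procedure in both directions yields the two conditional coverage fractions from which the relative index sizes and their overlap can be estimated.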

93 citations


Proceedings ArticleDOI
19 May 2004
TL;DR: A framework for approximating random-walk based probability distributions over Web pages using graph aggregation is presented; in particular, it can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host.
Abstract: We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. We (1) partition the Web's graph into classes of quasi-equivalent vertices, (2) project the page-based random walk to be approximated onto those classes, and (3) compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web graph containing over 1.4 billion pages, and were able to produce a ranking that has a Spearman rank-order correlation of 0.95 with respect to PageRank. A simplistic implementation of our method required less than half the running time of a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.
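
A drastically simplified Python sketch of the three steps follows (assuming numpy; the uniform redistribution of a class's mass over its pages in step 3 is this example's shortcut, whereas the paper's reconstruction step is more refined):

```python
import numpy as np

def aggregated_rank(edges, host_of, damping=0.85, iters=50):
    """Host-level PageRank sketch: aggregate, solve small chain, project back.

    edges: iterable of (src_page, dst_page); host_of maps a page to its
    class (here, its host). Classes with no out-links simply leak mass,
    which a production implementation would handle explicitly.
    """
    pages = {p for e in edges for p in e}
    hosts = sorted({host_of(p) for p in pages})
    idx = {h: i for i, h in enumerate(hosts)}
    n = len(hosts)

    # Step 1: project the page-based walk onto the class (host) graph.
    M = np.zeros((n, n))
    for src, dst in edges:
        M[idx[host_of(dst)], idx[host_of(src)]] += 1.0
    col = M.sum(axis=0)
    M[:, col > 0] /= col[col > 0]              # column-stochastic

    # Step 2: stationary distribution of the class-based random walk.
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M @ r)

    # Step 3: reconstruct a page distribution (uniform within each class).
    members = {}
    for p in pages:
        members.setdefault(host_of(p), set()).add(p)
    return {p: r[idx[h]] / len(ps) for h, ps in members.items() for p in ps}

edges = [("a.com/1", "b.com/x"), ("a.com/2", "b.com/x"), ("b.com/x", "a.com/1")]
print(aggregated_rank(edges, host_of=lambda url: url.split("/")[0]))
```

The speedup comes from step 2 operating on a class graph that is orders of magnitude smaller than the page graph (hosts versus pages), while steps 1 and 3 are single passes over the edges.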

61 citations


Journal ArticleDOI
TL;DR: This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems.
Abstract: Unstructured information represents the vast majority of data collected and accessible to enterprises. Exploiting this information requires systems for managing and extracting knowledge from large collections of unstructured data and applications for discovering patterns and relationships. This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems. It also introduces the Unstructured Information Management Architecture (UIMA) and provides the context for the unstructured information management (UIM) papers that follow.

38 citations


26 Apr 2004
TL;DR: The coupling of the classic vector space approach with a carefully chosen small set of relational operators allows us to express XML informational searches as enhanced XML fragments in a natural and powerful way.
Abstract: A cornerstone concept in classical Information Retrieval is the vector space model, whereby both documents and queries are viewed as vectors in a multidimensional space. Relevance of a given document to a given query is determined by evaluating the similarity between these vectors, using a measure such as cosine similarity. The vector space model has been highly successful for dealing with plain text collections, in both theoretical and practical terms. In prior work, we extended this classic approach to the search of XML collections by requiring queries to be presented as XML Fragments, which allows for a very simple extension of the cosine similarity measure to the XML framework. In this paper, we formalize this approach by presenting the full syntax and semantics of XML Fragments as implemented in a practical system. Furthermore, we show how small additions to the pure model improve the expressiveness of queries and enable us to deal with a wide range of users' needs. These additions introduce certain novel constructs that are not syntactically correct XML but implement essential operators. We evaluate the expressiveness of our model, both from a formal viewpoint, by comparing it to the XPath language, and from a practical viewpoint, by running experiments on the INEX (INitiative for the Evaluation of XML retrieval) collection. Our conclusion is that the coupling of the classic vector space approach with a carefully chosen small set of relational operators allows us to express XML informational searches as enhanced XML fragments in a natural and powerful way.
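
A minimal Python sketch of the underlying idea: both the document and the XML-fragment query are flattened into vectors over (context path, term) pairs, and relevance is the cosine between them. The exact-path matching here is this example's simplification; the system described relaxes context matching considerably.

```python
import math
from collections import Counter

def vec(pairs):
    """Term vector over (context path, word) pairs."""
    return Counter(pairs)

def cosine(q, d):
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) \
         * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Document and XML-fragment query, flattened to (context, term) pairs.
doc = vec([("/article/title", "xml"), ("/article/title", "fragments"),
           ("/article/body", "vector"), ("/article/body", "space")])
qry = vec([("/article/title", "xml")])
print(cosine(qry, doc))   # 0.5: one shared coordinate out of four
```

Because queries and documents live in the same (context, term) space, the plain-text machinery of weighting and ranking carries over essentially unchanged.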

7 citations


Patent
24 Dec 2004
TL;DR: The authors provide the system architecture, components, and search techniques for an unstructured information management system (UIMS), which can be provided as middleware for the effective management and exchange of unstructured information from a wide array of information sources.
Abstract: PROBLEM TO BE SOLVED: To provide a system architecture, components, and search techniques for an unstructured information management system (UIMS). SOLUTION: The UIMS can be provided as middleware for the effective management and exchange of unstructured information across a wide array of information sources. The architecture generally comprises a search engine, a data storage area, an analysis engine containing pipelined document annotators, and various adapters. The search technique utilizes a two-level retrieval process. A search query includes a search operator having a plurality of search sub-expressions, each with an associated weight value. The search engine returns one or more documents whose weight-value sum exceeds a threshold. The search operator is implemented as a Boolean predicate that functions as a weighted AND (WAND).
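
The WAND predicate itself is simple to state; here is a minimal Python sketch of the operator's semantics (an illustration only, not the patent's two-level evaluation strategy):

```python
def wand(subexpressions, threshold):
    """WAND (weighted AND): true iff the weights of the satisfied
    sub-expressions sum to at least the threshold.

    subexpressions: iterable of (satisfied: bool, weight: float) pairs.
    """
    return sum(w for ok, w in subexpressions if ok) >= threshold

terms = [(True, 0.5), (False, 0.3), (True, 0.4)]
print(wand(terms, 0.8))   # True: 0.5 + 0.4 >= 0.8
print(wand(terms, 1.2))   # False: requires all three to match (AND-like)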

2 citations