Showing papers by "Andrei Z. Broder" published in 2004


Journal ArticleDOI
TL;DR: This paper surveys the ways in which Bloom filters have been used and modified in a variety of network problems, providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
Abstract: A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Bloom filters allow false positives, but the space savings often outweigh this drawback when the probability of an error is controlled. Bloom filters have been used in database applications since the 1970s, but only in recent years have they become popular in the networking literature. This paper surveys the ways in which Bloom filters have been used and modified in a variety of network problems, with the goal of providing a unified mathematical and practical framework for understanding them and stimulating their use in future applications.
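
To make the data structure concrete, here is a minimal Python sketch of a Bloom filter (an illustration, not code from the survey); the class name, the parameters m and k, and the use of salted SHA-256 digests as the k hash functions are all choices made for this example:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array written through k hash functions.

    Membership tests may return false positives, never false negatives.
    """

    def __init__(self, m: int, k: int):
        self.m = m                    # number of bits
        self.k = k                    # number of hash functions
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k bit positions from k salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.add("example.com")
print("example.com" in bf)    # True
print("absent.org" in bf)     # almost certainly False
```

With n items stored in m bits, choosing k ≈ (m/n) ln 2 hash functions minimizes the false-positive probability at roughly (0.6185)^(m/n), the space/accuracy trade-off the survey analyzes.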

2,199 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A strong notion of a decay measure is formalized, and a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users are described.
Abstract: The rapid growth of the web has been noted and tracked extensively. Recent studies, however, have documented the dual phenomenon: web pages have small half-lives, and thus the web exhibits rapid death as well. Consequently, page creators face an increasingly burdensome task of keeping links up-to-date, and many are falling behind. Beyond individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers seeking a way out of these stale regions and back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure through a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
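
To illustrate why a recursive measure is less naive than counting a page's dead links directly, the Python sketch below estimates a decay-style score by random walks: a dead page scores 1, and a live page scores the expected value of a uniformly random out-link, explored to bounded depth. This is a hypothetical simplification for exposition, not the paper's exact definition; `decay`, `links`, and `is_dead` are names invented for this example:

```python
import random

def decay(page, links, is_dead, depth=5, samples=200):
    """Monte Carlo estimate of a recursive decay score for `page`.

    A dead page scores 1; a live page scores the expected score of a
    uniformly random out-link, explored to a bounded depth. `links`
    maps a page to its out-links; `is_dead` flags dead pages.
    """
    def walk(p, d):
        if is_dead(p):
            return 1.0
        outs = links.get(p, [])
        if d == 0 or not outs:
            return 0.0
        return walk(random.choice(outs), d - 1)

    return sum(walk(page, depth) for _ in range(samples)) / samples

links = {"a": ["b", "c"], "b": ["gone"], "c": ["a"]}
print(decay("a", links, is_dead=lambda p: p == "gone"))  # around 0.75
```

Note that page "a" has no dead links of its own, yet most walks from it soon reach a dead page, which is exactly the kind of stale neighborhood a per-page dead-link count would miss.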

166 citations


Patent
21 Jan 2004
TL;DR: A computerized method estimates the relative coverage of Web search engines by generating random queries, each a logical combination of words found in a subset of the indexed pages.
Abstract: A computerized method is used to estimate the relative coverage of Web search engines. Each search engine maintains an index of words of pages located at specific URL addresses in a network. The method generates a random query, a logical combination of words found in a subset of the pages. The random query is submitted to a first search engine. In response, a set of URLs of pages matching the query is received. Each URL identifies a page indexed by the first search engine that satisfies the random query. A particular URL identifying a sample page is randomly selected. A strong query corresponding to the sample page is generated and submitted to a second search engine. Result information received in response to the strong query is compared to determine whether the second search engine has indexed the sample page, or a page substantially similar to it. This procedure is repeated to gather statistical data, which is used to estimate the relative sizes and degree of overlap of the search engines.
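
The sampling procedure lends itself to a compact sketch. The toy Python code below is an illustration, not the patent's exact method: `ToyEngine`, `estimate_coverage`, and the use of all of a page's words as its "strong query" are simplifications invented for this example.

```python
import random

class ToyEngine:
    """Stand-in for a search engine's index: URL -> set of page words."""

    def __init__(self, pages):
        self.pages = pages

    def search(self, words):
        # Conjunctive query: URLs whose pages contain every query word.
        return [u for u, ws in self.pages.items() if set(words) <= ws]

def estimate_coverage(a, b, lexicon, trials=500):
    """Estimate the fraction of engine a's pages also indexed by engine b."""
    hits = tries = 0
    for _ in range(trials):
        query = random.sample(lexicon, 2)      # random conjunctive query
        results = a.search(query)
        if not results:
            continue
        url = random.choice(results)           # sample one matching page
        strong = sorted(a.pages[url])          # "strong query": all its words
        tries += 1
        if b.search(strong):                   # did b index (a copy of) it?
            hits += 1
    return hits / tries if tries else 0.0

a = ToyEngine({"u1": {"red", "fox"}, "u2": {"blue", "sky"}})
b = ToyEngine({"u1": {"red", "fox"}})
print(estimate_coverage(a, b, ["red", "fox", "blue", "sky"]))  # near 0.5
```

Running the procedure in both directions yields the two conditional coverage fractions from which the relative index sizes and their overlap can be estimated.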

93 citations


Proceedings ArticleDOI
19 May 2004
TL;DR: A framework for approximating random-walk based probability distributions over Web pages using graph aggregation is presented; in particular, it can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host.
Abstract: We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. We (1) partition the Web's graph into classes of quasi-equivalent vertices, (2) project the page-based random walk to be approximated onto those classes, and (3) compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web graph containing over 1.4 billion pages, and were able to produce a ranking that has a Spearman rank-order correlation of 0.95 with respect to PageRank. A simplistic implementation of our method required less than half the running time of a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible.
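
A drastically simplified Python sketch of the three steps follows (assuming numpy; the uniform redistribution of a class's mass over its pages in step 3 is this example's shortcut, whereas the paper's reconstruction step is more refined):

```python
import numpy as np

def aggregated_rank(edges, host_of, damping=0.85, iters=50):
    """Host-level PageRank sketch: aggregate, solve small chain, project back.

    edges: iterable of (src_page, dst_page); host_of maps a page to its
    class (here, its host). Classes with no out-links simply leak mass,
    which a production implementation would handle explicitly.
    """
    pages = {p for e in edges for p in e}
    hosts = sorted({host_of(p) for p in pages})
    idx = {h: i for i, h in enumerate(hosts)}
    n = len(hosts)

    # Step 1: project the page-based walk onto the class (host) graph.
    M = np.zeros((n, n))
    for src, dst in edges:
        M[idx[host_of(dst)], idx[host_of(src)]] += 1.0
    col = M.sum(axis=0)
    M[:, col > 0] /= col[col > 0]              # column-stochastic

    # Step 2: stationary distribution of the class-based random walk.
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M @ r)

    # Step 3: reconstruct a page distribution (uniform within each class).
    members = {}
    for p in pages:
        members.setdefault(host_of(p), set()).add(p)
    return {p: r[idx[h]] / len(ps) for h, ps in members.items() for p in ps}

edges = [("a.com/1", "b.com/x"), ("a.com/2", "b.com/x"), ("b.com/x", "a.com/1")]
print(aggregated_rank(edges, host_of=lambda url: url.split("/")[0]))
```

The speedup comes from step 2 operating on a class graph that is orders of magnitude smaller than the page graph (hosts versus pages), while steps 1 and 3 are single passes over the edges.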

61 citations


Journal ArticleDOI
TL;DR: This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems.
Abstract: Unstructured information represents the vast majority of data collected and accessible to enterprises. Exploiting this information requires systems for managing and extracting knowledge from large collections of unstructured data and applications for discovering patterns and relationships. This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems. It also introduces the Unstructured Information Management Architecture (UIMA) and provides the context for the unstructured information management (UIM) papers that follow.

38 citations


26 Apr 2004
TL;DR: The coupling of the classic vector space approach with a carefully chosen small set of relational operators allows us to express XML informational searches as enhanced XML fragments in a natural and powerful way.
Abstract: A cornerstone concept in classical Information Retrieval is the vector space model, whereby both documents and queries are viewed as vectors in a multidimensional space. Relevance of a given document to a given query is determined by evaluating the similarity between these vectors, using a measure such as cosine similarity. The vector space model has been highly successful for dealing with plain text collections, in both theoretical and practical terms. In prior work, we extended this classic approach to the search of XML collections by requiring queries to be presented as XML Fragments, which allows for a very simple extension of the cosine similarity measure to the XML framework. In this paper, we formalize this approach by presenting the full syntax and semantics of XML Fragments as implemented in a practical system. Furthermore, we show how small additions to the pure model improve the expressiveness of queries and enable us to deal with a wide range of users' needs. These additions introduce certain novel constructs that are not syntactically correct XML but implement essential operators. We evaluate the expressiveness of our model, both from a formal viewpoint, by comparing it to the XPath language, and from a practical viewpoint, by running experiments on the INEX (INitiative for the Evaluation of XML retrieval) collection. Our conclusion is that the coupling of the classic vector space approach with a carefully chosen small set of relational operators allows us to express XML informational searches as enhanced XML fragments in a natural and powerful way.
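
A minimal Python sketch of the underlying idea: both the document and the XML-fragment query are flattened into vectors over (context path, term) pairs, and relevance is the cosine between them. The exact-path matching here is this example's simplification; the system described relaxes context matching considerably.

```python
import math
from collections import Counter

def vec(pairs):
    """Term vector over (context path, word) pairs."""
    return Counter(pairs)

def cosine(q, d):
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) \
         * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Document and XML-fragment query, flattened to (context, term) pairs.
doc = vec([("/article/title", "xml"), ("/article/title", "fragments"),
           ("/article/body", "vector"), ("/article/body", "space")])
qry = vec([("/article/title", "xml")])
print(cosine(qry, doc))   # 0.5: one shared coordinate out of four
```

Because queries and documents live in the same (context, term) space, the plain-text machinery of weighting and ranking carries over essentially unchanged.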

7 citations


Patent
24 Dec 2004
TL;DR: The authors provide the system architecture, components, and search techniques for an unstructured information management system (UIMS), which can be provided as middleware for the effective management and exchange of unstructured information from a wide array of information sources.
Abstract: PROBLEM TO BE SOLVED: To provide a system architecture, components, and search techniques for an unstructured information management system (UIMS). SOLUTION: The UIMS can be provided as middleware for the effective management and exchange of unstructured information across a wide array of information sources. The architecture generally comprises a search engine, a data storage area, an analysis engine containing pipelined document annotators, and various adapters. The search technique utilizes a two-level retrieval process. A search query includes a search operator having a plurality of search sub-expressions, each with an associated weight value. The search engine returns one or more documents whose weight-value sum exceeds a threshold. The search operator is implemented as a Boolean predicate that functions as a weighted AND (WAND).
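
The WAND predicate itself is simple to state; here is a minimal Python sketch of the operator's semantics (an illustration only, not the patent's two-level evaluation strategy):

```python
def wand(subexpressions, threshold):
    """WAND (weighted AND): true iff the weights of the satisfied
    sub-expressions sum to at least the threshold.

    subexpressions: iterable of (satisfied: bool, weight: float) pairs.
    """
    return sum(w for ok, w in subexpressions if ok) >= threshold

terms = [(True, 0.5), (False, 0.3), (True, 0.4)]
print(wand(terms, 0.8))   # True: 0.5 + 0.4 >= 0.8
print(wand(terms, 1.2))   # False: requires all three to match (AND-like)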

2 citations