
Showing papers by "Andrei Z. Broder" published in 2000


Journal ArticleDOI
01 Jun 2000
TL;DR: The study of the web as a graph yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution.
Abstract: The study of the web as a graph is not only fascinating in its own right, but also yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution. We report on experiments on local and global properties of the web graph using two Altavista crawls each with over 200 million pages and 1.5 billion links. Our study indicates that the macroscopic structure of the web is considerably more intricate than suggested by earlier experiments on a smaller scale.
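To make the graph viewpoint concrete, here is a minimal sketch (not the paper's actual pipeline) of the kind of local reachability probe that exposes the macroscopic picture: forward and backward breadth-first searches from a page, whose intersection is the strongly connected component containing it. The edge list and page identifiers are made up for illustration.

```python
from collections import defaultdict, deque

def reachable(adjacency, start):
    """Breadth-first search; returns every node reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def local_structure(edges, start):
    """Forward/backward reachability from one page; the intersection of the two
    sets is the strongly connected component containing `start`."""
    forward, backward = defaultdict(list), defaultdict(list)
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
    out_set = reachable(forward, start)   # pages reachable from `start`
    in_set = reachable(backward, start)   # pages that can reach `start`
    return out_set & in_set, in_set - out_set, out_set - in_set

# Hypothetical toy link list; the paper ran breadth-first probes like this from
# many random start pages on the AltaVista crawls.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("c", "e")]
scc, upstream, downstream = local_structure(edges, "a")
print(sorted(scc), sorted(upstream), sorted(downstream))  # ['a','b','c'] ['d'] ['e']
```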

2,973 citations


Journal ArticleDOI
TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless, it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.)
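The compact per-proxy directory summaries described above are Bloom-filter-style digests. Below is a minimal sketch of that idea: build a lossy bit-vector summary of one proxy's cached URLs and probe it locally before sending an inter-cache query. The hash construction, bit budget, and URLs are illustrative assumptions, not the parameters from the paper.

```python
import hashlib

class BloomSummary:
    """Lossy, compact summary of a set of cached URLs: membership probes may
    return false positives but never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive `num_hashes` bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

# Each proxy periodically rebuilds and broadcasts its summary; peers probe the
# summary locally and only send an inter-cache query after a positive answer.
cached = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
summary = BloomSummary(num_bits=32 * len(cached), num_hashes=4)
for url in cached:
    summary.add(url)

print(summary.might_contain("http://example.com/a"))    # True (cached)
print(summary.might_contain("http://example.com/zzz"))  # almost certainly False
```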

2,174 citations


Journal ArticleDOI
01 Jun 2000
TL;DR: This research was motivated by the fact that such a family of permutations is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents.
Abstract: We define and study the notion of min-wise independent families of permutations. We say that F ⊆ Sn (the symmetric group) is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solutions to some of them and we list the rest as open problems.
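As a small concrete check of the definition, the sketch below enumerates the full symmetric group on a tiny universe (which is trivially min-wise independent) and verifies that every element of a fixed set X is mapped to the minimum equally often. The universe size and the set X are arbitrary demo choices.

```python
import itertools
from collections import Counter

def min_wise_counts(universe_size, subset):
    """For every permutation of {0, ..., n-1}, record which element of `subset`
    is mapped to the smallest value; equal counts mean min-wise independence."""
    counts = Counter()
    for perm in itertools.permutations(range(universe_size)):
        # perm[i] is the image of element i under this permutation.
        counts[min(subset, key=lambda x: perm[x])] += 1
    return dict(counts)

# Over the full symmetric group S_4 and X = {0, 2, 3}, each element of X attains
# the minimum in exactly 4!/|X| = 8 of the 24 permutations.
print(min_wise_counts(4, {0, 2, 3}))  # {0: 8, 2: 8, 3: 8} (up to key order)
```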

962 citations


Book ChapterDOI
21 Jun 2000
TL;DR: The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
Abstract: The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
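The two aspects mentioned above, resemblance as set intersection of shingles and estimation by independent random sampling of minima, can be sketched as follows. The shingle length, the number of hash functions, and the use of salted SHA-256 are illustrative assumptions, not the AltaVista implementation.

```python
import hashlib

def shingles(text, k=4):
    """The set of contiguous k-word sequences ("shingles") of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def sketch(shingle_set, num_hashes=64):
    """For each of `num_hashes` salted hash functions, keep the minimum hash
    value over the shingle set; this is the document's fixed-size sketch."""
    return [
        min(int.from_bytes(hashlib.sha256(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for salt in range(num_hashes)
    ]

def estimated_resemblance(sketch_a, sketch_b):
    """The fraction of coordinates on which two sketches agree estimates
    |A ∩ B| / |A ∪ B| for the underlying shingle sets."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank today"
doc2 = "the quick brown fox jumps over the lazy dog near the river bank now"
r = estimated_resemblance(sketch(shingles(doc1)), sketch(shingles(doc2)))
print(f"estimated resemblance: {r:.2f}")  # close to the true value of 10/12
```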

465 citations


Journal ArticleDOI
TL;DR: In this article, the authors compare several algorithms for identifying mirrored hosts on the World Wide Web, based on URL strings and linkage data, the type of information about Web pages easily available from Web proxies and crawlers.
Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents—mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.
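As an illustration of the "top-down" style of algorithm evaluated here, the sketch below scores host pairs purely from URL strings by the overlap of their path sets; hosts serving many identical paths become mirror candidates. The similarity measure, threshold, and sample URLs are assumptions for the demo, not one of the paper's five combined algorithms.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def paths_by_host(urls):
    """Group the path component of each URL under its host name."""
    hosts = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        hosts[parts.netloc].add(parts.path)
    return hosts

def candidate_mirrors(urls, threshold=0.5):
    """Rank host pairs by Jaccard overlap of their path sets; high overlap is
    evidence (though not proof) that the two hosts mirror each other."""
    hosts = paths_by_host(urls)
    names = sorted(hosts)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = len(hosts[a] & hosts[b]) / len(hosts[a] | hosts[b])
            if overlap >= threshold:
                pairs.append((overlap, a, b))
    return sorted(pairs, reverse=True)

urls = [
    "http://ftp.example.org/pub/tool/readme.txt",
    "http://ftp.example.org/pub/tool/install.sh",
    "http://mirror.example.net/pub/tool/readme.txt",
    "http://mirror.example.net/pub/tool/install.sh",
    "http://unrelated.example.com/index.html",
]
print(candidate_mirrors(urls))  # [(1.0, 'ftp.example.org', 'mirror.example.net')]
```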

80 citations


Patent
05 May 2000
TL;DR: In this paper, a method and apparatus that detects mirrored host pairs using information about a large set of pages, including URLs, is described, and the identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites.
Abstract: A method and apparatus that detects mirrored host pairs using information about a large set of pages, including URLs. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention look at the URLs of pages on hosts to determine whether the hosts are potentially mirrored.

49 citations


Proceedings ArticleDOI
01 Feb 2000
TL;DR: This work model and analyze the problem of how to improve the classification of Web pages by using link information, and presents a theoretical framework for this problem based on a graph model and suggests two linear algorithms based on similar methods that have been proven effective in the setting of error-correcting codes.
Abstract: The motivation for our work is the observation that Web pages on a particular topic are often linked to other pages on the same topic. We model and analyze the problem of how to improve the classification of Web pages (that is, determining the topic of the page) by using link information. In our setting, an initial classifier examines the text of a Web page and assigns to it some classification, possibly mistaken. We investigate how to reduce the error probability using the observation above, thus building an improved classifier. We present a theoretical framework for this problem based on a random graph model and suggest two linear time algorithms, based on similar methods that have been proven effective in the setting of error-correcting codes. We provide simulation results to verify our analysis and to compare the performance of our suggested algorithms.
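A toy version of the correction step described above: start from a possibly mistaken text-only label per page and let each page adopt the majority label of its linked neighborhood. The synchronous voting rule and the sample graph are illustrative assumptions, not the paper's two algorithms.

```python
from collections import Counter, defaultdict

def relabel_by_neighbors(initial_labels, edges, rounds=1):
    """Synchronous passes in which every node adopts the most common label among
    itself and its neighbors when it strictly beats the node's current label;
    otherwise the current label is kept."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    labels = dict(initial_labels)
    for _ in range(rounds):
        updated = {}
        for node, label in labels.items():
            votes = Counter([label] + [labels[n] for n in neighbors[node]])
            best, best_count = votes.most_common(1)[0]
            updated[node] = best if best_count > votes[label] else label
        labels = updated
    return labels

# Five pages on one topic, densely linked; "p3" was misclassified by the
# text-only classifier and is corrected by its neighborhood.
initial = {"p1": "sports", "p2": "sports", "p3": "politics", "p4": "sports", "p5": "sports"}
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5"), ("p1", "p3")]
print(relabel_by_neighbors(initial, links))  # p3 flips to "sports"
```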

16 citations


Journal Article
TL;DR: Four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information are evaluated.
Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents—mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.

14 citations


Book ChapterDOI
09 Jul 2000
TL;DR: This talk will review the current research in min-wise independent permutations and trace the interplay of theory and practice that motivated it.
Abstract: A family of permutations F ⊆ Sn (the symmetric group) is called min-wise independent if for any set X ⊆ [n] and any x ∈ X, when a permutation π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. The rigorous study of such families was instigated by the fact that such a family (under some relaxations) is essential to the algorithm used by the AltaVista Web indexing software to detect and filter near-duplicate documents. The insights gained from theoretical investigations led to practical changes, which in turn inspired new mathematical inquiries and results. This talk will review the current research in this area and will trace the interplay of theory and practice that motivated it.

11 citations


Proceedings Article
01 Jan 2000

6 citations


Proceedings ArticleDOI
01 Feb 2000
TL;DR: This work presents a new family of nearly min-wise independent permutations, and argues that the balance it achieves between ease of use and provable statistical properties makes it a favorable candidate for practical applications.
Abstract: A family of permutations F ⊆ Sn is called min-wise independent if for any set S ⊆ [n] and any x ∈ S, Pr[min{π(S)} = π(x)] = 1/|S| when π is chosen at random from F. The rigorous study of such families was initiated by Broder, Charikar, Frieze, and Mitzenmacher (STOC 1998), motivated by practical applications such as detecting near-duplicate web documents by the AltaVista search engine. For these practical uses, it is required that the family be easily sampleable and that the permutations be efficiently computable. To achieve this, one often uses relaxed notions of min-wise independence, such as Pr[min{π(S)} = π(x)] ≥ (1 − ε)/|S| for some small 0 < ε < 1. We present a new family of nearly min-wise independent permutations, and argue that the balance it achieves between ease of use and provable statistical properties makes it a favorable candidate for practical applications.
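For illustration only, one family that is easily sampleable and efficiently computable, and is commonly used as an approximation in practice, is the set of linear maps x ↦ (ax + b) mod p over a prime field; it is only approximately min-wise independent and is not claimed to be the family proposed in this paper. The prime, the set, and the trial count below are arbitrary demo values.

```python
import random
from collections import Counter

PRIME = 101  # demo value; any prime larger than the universe works

def random_linear_permutation(prime=PRIME):
    """Sample pi(x) = (a*x + b) mod prime with a != 0, a permutation of Z_prime."""
    a = random.randrange(1, prime)
    b = random.randrange(prime)
    return lambda x: (a * x + b) % prime

def empirical_min_distribution(subset, trials=100_000):
    """Estimate, for each element of `subset`, how often it attains the minimum
    image; exact min-wise independence would give 1/|subset| for every element."""
    counts = Counter()
    for _ in range(trials):
        pi = random_linear_permutation()
        counts[min(subset, key=pi)] += 1
    return {x: round(counts[x] / trials, 3) for x in subset}

# With |S| = 4 the estimated probabilities should all be near (though not
# exactly) 0.25, reflecting the relaxed guarantee.
print(empirical_min_distribution({3, 17, 42, 77}))
```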

Patent
05 May 2000
TL;DR: In this article, a method and system that detects mirrored host pairs using information about a large set of pages, including one or more of: URLs, IP addresses, and connectivity information, is presented.
Abstract: A method and system that detects mirrored host pairs using information about a large set of pages, including one or more of: URLs, IP addresses, and connectivity information. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention use one or a combination of techniques to identify mirrors. A first group of techniques involves determining mirrors based on URLs and information about connectivity (i.e., hyperlinks) between pages. A second group of techniques looks at connectivity information at a higher granularity, considering all links from all pages on a host as one group and ignoring the target of each link beyond the host level.
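A rough sketch of the second, host-granularity group of techniques described above: collapse every hyperlink to its source and target hosts, build each host's set of outgoing target hosts, and compare those sets between candidate hosts. The Jaccard comparison and the sample links are illustrative assumptions, not the claimed method's exact scoring.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def outgoing_host_profiles(page_links):
    """Collapse (source_url, target_url) pairs to host level: for each source
    host, the set of distinct hosts it links to."""
    profiles = defaultdict(set)
    for source_url, target_url in page_links:
        profiles[urlsplit(source_url).netloc].add(urlsplit(target_url).netloc)
    return profiles

def host_similarity(profiles, host_a, host_b):
    """Jaccard overlap of two hosts' outgoing-host sets."""
    union = profiles[host_a] | profiles[host_b]
    return len(profiles[host_a] & profiles[host_b]) / len(union) if union else 0.0

# Hypothetical crawl fragment: two hosts whose pages link to the same targets.
links = [
    ("http://m1.example.org/a.html", "http://cdn.example.com/logo.png"),
    ("http://m1.example.org/b.html", "http://partner.example.net/"),
    ("http://m2.example.org/a.html", "http://cdn.example.com/logo.png"),
    ("http://m2.example.org/b.html", "http://partner.example.net/"),
]
profiles = outgoing_host_profiles(links)
print(host_similarity(profiles, "m1.example.org", "m2.example.org"))  # 1.0
```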