Top 6 papers published by Jeffrey Dean from Google in 2000

Journal Article

A comparison of techniques to find mirrored hosts on the WWW

Krishna Bharat¹, Andrei Z. Broder, Jeffrey Dean¹, Monika Henzinger¹

01 Oct 2000-Journal of the Association for Information Science and Technology

TL;DR: In this article, the authors compare several algorithms for identifying mirrored hosts on the World Wide Web, based on URL strings and linkage data, the type of information about Web pages easily available from Web proxies and crawlers.

Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents—mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.

80 citations

Patent

Method and apparatus for finding mirrored hosts by analyzing urls

Krishna Bharat¹, Andrei Z. Broder¹, Steven C. Glassman¹, Jeffrey Dean¹, Monika R. Henzinger¹

AmeriCorps VISTA¹

05 May 2000

TL;DR: This patent describes a method and apparatus that detects mirrored host pairs using information about a large set of pages, including URLs. The identities of the detected mirrored hosts are saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites.

Abstract: A method and apparatus that detects mirrored host pairs using information about a large set of pages, including URLs. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention look at the URLs of pages on hosts to determine whether the hosts are potentially mirrored.

49 citations
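
The abstract above says only that URLs of pages on each host are examined to decide whether two hosts are potentially mirrored. As a minimal sketch of that idea, and not the patented method, the following Python compares the sets of URL paths seen on each host. The Jaccard overlap score, the `threshold` parameter, and the function names `paths_by_host` and `candidate_mirrors` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def paths_by_host(urls):
    """Group the URL paths observed for each host name."""
    paths = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname:
            paths[parts.hostname].add(parts.path or "/")
    return paths


def candidate_mirrors(urls, threshold=0.5):
    """Return (host_a, host_b, score) for host pairs with overlapping paths.

    The score is the Jaccard overlap of the two path sets. This measure and
    the threshold are illustrative assumptions, not taken from the patent.
    """
    paths = paths_by_host(urls)
    results = []
    for a, b in combinations(sorted(paths), 2):
        shared = len(paths[a] & paths[b])
        total = len(paths[a] | paths[b])  # never zero: each host has a path
        score = shared / total
        if score >= threshold:
            results.append((a, b, score))
    # Highest-scoring pairs first, so the output is a ranked candidate list.
    return sorted(results, key=lambda r: r[2], reverse=True)


urls = [
    "http://a.example/docs/index.html",
    "http://a.example/docs/faq.html",
    "http://b.example/docs/index.html",
    "http://b.example/docs/faq.html",
    "http://c.example/news.html",
]
print(candidate_mirrors(urls))
```

Comparing every pair of hosts is quadratic in the number of hosts, so a collection of the size mentioned in the related journal article would need some form of candidate pruning first; this sketch leaves that out.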

Patent

Distributed crawling of hyperlinked documents

Jeffrey Dean¹, Craig Silverstein¹, Benedict A. Gomes¹, Sanjay Ghemawat¹

Google¹

14 Aug 2000

TL;DR: This patent describes techniques for crawling hyperlinked documents in which documents are grouped by host and the next host to crawl is selected according to its stall time. The stall time indicates the earliest time the host should be crawled; it can be a predetermined amount of time, vary by host, and be adjusted according to actual retrieval times from the host.

Abstract: Techniques for crawling hyperlinked documents are provided. Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host.

18 citations
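
The scheduling idea in the abstract above (group pending documents by host, then pick the host whose stall time comes up first) can be sketched with a priority queue. This is a single-process illustration under stated assumptions, not the patented system: the class name `HostScheduler`, its method names, the `default_stall` value, and the rule of stalling for at least twice the last retrieval time are all hypothetical.

```python
import heapq
from collections import deque


class HostScheduler:
    """Choose the next URL to crawl, grouped by host, ordered by stall time."""

    def __init__(self, default_stall=1.0):
        self.default_stall = default_stall  # assumed fixed delay, in seconds
        self.queues = {}  # host -> deque of pending URLs
        self.heap = []    # (earliest allowed crawl time, host)

    def add(self, host, url, now=0.0):
        """Queue a URL; a host seen for the first time is crawlable at `now`."""
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.heap, (now, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return (host, url) if some host is ready at time `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        return host, self.queues[host].popleft()

    def done(self, host, now, retrieval_time=None):
        """Reschedule `host` after a fetch, or drop it if nothing is pending."""
        if not self.queues[host]:
            del self.queues[host]
            return
        stall = self.default_stall
        if retrieval_time is not None:
            # Assumed adjustment rule: slow hosts get a longer stall.
            stall = max(stall, 2 * retrieval_time)
        heapq.heappush(self.heap, (now + stall, host))


s = HostScheduler(default_stall=1.0)
s.add("a.example", "http://a.example/1")
s.add("a.example", "http://a.example/2")
s.add("b.example", "http://b.example/1")
print(s.next_url(0.0))   # a.example is ready first
s.done("a.example", now=0.0)
print(s.next_url(0.0))   # b.example, because a.example is now stalled
```

Keeping one heap entry per host, rather than one per URL, means a host with many pending pages cannot crowd out other hosts, which is the point of grouping by host.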

Journal Article

A Comparison of Techniques to Find Mirrored Hosts on the WWW.

Krishna Bharat, Andrei Z. Broder, Jeffrey Dean, Monika Henzinger

01 Jan 2000-IEEE Data(base) Engineering Bulletin

TL;DR: Four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information are evaluated.

Abstract: We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web-based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents—mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top-down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.

14 citations

Patent

Scoring links in a document

Jeffrey Dean¹, Craig Silverstein¹, Lawrence E. Page¹

Google¹

13 Dec 2000

TL;DR: This patent describes a system that determines a score for each entry in a document and modifies the document, or its entries, based on those scores, to help users decide which entries to choose. The modified document is then provided to the user.

Abstract: A system modifies documents to aid users in determining which of the entries in the documents to choose. The system identifies a document that includes one or more entries. The system determines a score for each of the entries and modifies the identified document, or entries in the identified document, based on the determined scores. The system then provides the modified document to a user.

11 citations

Patent

Method and apparatus for finding mirrored hosts

Krishna Bharat¹, Andrei Z. Broder¹, Jeffrey Dean¹, Steven C. Glassman¹, Monika R. Henzinger¹

AmeriCorps VISTA¹

05 May 2000

TL;DR: This patent describes a method and system that detects mirrored host pairs using information about a large set of pages, including one or more of: URLs, IP addresses, and connectivity information.

Abstract: A method and system that detects mirrored host pairs using information about a large set of pages, including one or more of: URLs, IP addresses, and connectivity information. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention use one or a combination of techniques to identify mirrors. A first group of techniques involves determining mirrors based on URLs and information about connectivity (i.e., hyperlinks) between pages. A second group of techniques looks at connectivity information at a higher granularity, considering all links from all pages on a host as one group and ignoring the target of each link beyond the host level.

1 citation
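
The second group of techniques in the abstract above treats all links from all pages on a host as one group and ignores each link's target beyond the host level. A rough sketch of that reduction follows. The abstract does not say how two hosts are then compared, so the Jaccard overlap used here is an assumption, as are the function names `host_outlinks` and `outlink_similarity` and the choice to drop links that stay within the same host.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_outlinks(links):
    """Map each source host to the set of hosts its pages link to.

    `links` is an iterable of (source_url, target_url) pairs. Targets are
    reduced to their host name, so the page-level target is ignored.
    """
    out = defaultdict(set)
    for src, dst in links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Assumption: links within a single host carry no mirroring signal.
        if src_host and dst_host and src_host != dst_host:
            out[src_host].add(dst_host)
    return out


def outlink_similarity(links, host_a, host_b):
    """Jaccard overlap of the two hosts' outlink host sets (assumed measure)."""
    out = host_outlinks(links)
    a, b = out.get(host_a, set()), out.get(host_b, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


links = [
    ("http://a.example/x", "http://w3.example/spec"),
    ("http://a.example/y", "http://news.example/1"),
    ("http://b.example/x", "http://w3.example/other"),
    ("http://b.example/z", "http://news.example/2"),
    ("http://c.example/", "http://w3.example/spec"),
]
print(outlink_similarity(links, "a.example", "b.example"))
print(outlink_similarity(links, "a.example", "c.example"))
```

Because targets are collapsed to hosts, `a.example` and `b.example` score as identical here even though their pages link to different documents on the target hosts, which illustrates the coarser granularity the abstract describes.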