
Showing papers by "Andrei Z. Broder" published in 1998


Proceedings ArticleDOI
01 Oct 1998
TL;DR: This paper proposes a new protocol called "Summary Cache", in which each proxy keeps a summary of the URLs of documents cached at every other participating proxy and checks these summaries for potential hits before sending any queries; this enables cache sharing among a large number of proxies.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless, it is not widely deployed due to the overhead of existing protocols. In this paper we propose a new protocol called "Summary Cache": each proxy keeps a summary of the URLs of cached documents of each participating proxy and checks these summaries for potential hits before sending any queries. Two factors contribute to the low overhead: the summaries are updated only periodically, and the summary representations are economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to the existing Internet Cache Protocol (ICP), Summary Cache reduces the number of inter-cache messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, and eliminates between 30% and 95% of the CPU overhead, while maintaining almost the same hit ratio as ICP. Hence Summary Cache enables cache sharing among a large number of proxies.
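
The economical 8-bits-per-entry summaries are, in essence, Bloom filters over the cached URLs. A minimal sketch of the idea in Python (the class, hash choice, and parameters below are illustrative, not the paper's implementation):

```python
import hashlib

class BloomSummary:
    """Compact summary of a proxy's cached URLs (illustrative sketch, not the paper's code)."""

    def __init__(self, num_bits=8 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from independent hashes of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        # May report false positives, never false negatives.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

# A proxy checks its neighbours' summaries and queries only those reporting a potential hit.
```

An occasional false positive costs one wasted query, while periodic (rather than per-update) summary exchange keeps the message overhead low.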

446 citations


Journal Article
TL;DR: A standardized, statistical way of measuring search engine coverage and overlap through random queries is described; it can be implemented by third-party evaluators using only public query interfaces, and the results suggest that the size of the static, public Web as of November 1997 was over 200 million pages.

407 citations


Journal ArticleDOI
01 Apr 1998
TL;DR: In this paper, the authors describe a standardized, statistical way of measuring search engine coverage and overlap through random queries, which can be implemented by third-party evaluators using only public query interfaces.
Abstract: Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid 1997 and in November 1997. Our method does not provide absolute values. However, using data from other sources, we estimate that as of November 1997 the numbers of pages indexed by HotBot, AltaVista, Excite, and Infoseek were roughly 77M, 100M, 32M, and 17M respectively, and the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November 1997 was over 200 million pages. The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages, were indexed by all four engines.
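
The relative-size figures rest on a simple cancellation: sample random pages from one engine, test whether the other engine has them, and divide the two measured overlap fractions. A hedged sketch of that estimator (the function name and inputs are illustrative, not the paper's notation):

```python
def size_ratio(frac_a_in_b, frac_b_in_a):
    """Estimate |A| / |B| for two search engines A and B.

    frac_a_in_b ~ |A ∩ B| / |A|  (random pages sampled from A, checked against B)
    frac_b_in_a ~ |A ∩ B| / |B|  (random pages sampled from B, checked against A)
    The unknown intersection size cancels in the ratio.
    """
    return frac_b_in_a / frac_a_in_b
```

Anchoring one engine's absolute size with data from other sources, as the abstract describes, then turns these ratios into the absolute estimates quoted above.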

393 citations


Proceedings ArticleDOI
23 May 1998
TL;DR: This research was motivated by the fact that such a family of permutations is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents.
Abstract: We define and study the notion of min-wise independent families of permutations. We say that F ⊆ S_n is min-wise independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have Pr(min{π(X)} = π(x)) = 1/|X|. In other words, we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π. Our research was motivated by the fact that such a family (under some relaxations) is essential to the algorithm used in practice by the AltaVista web index software to detect and filter near-duplicate documents. However, in the course of our investigation we have discovered interesting and challenging theoretical questions related to this concept; we present the solution to some of them and we list the rest as open problems.
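
The reason this definition is useful for near-duplicate filtering is a one-line consequence (the standard argument, not text from the paper): for π drawn from an exactly min-wise independent family, the minimum of π over A ∪ B is equally likely to be any of its elements, and the minima of π(A) and π(B) coincide exactly when that element lies in A ∩ B, so

```latex
\Pr\bigl[\min\{\pi(A)\} = \min\{\pi(B)\}\bigr] \;=\; \frac{|A \cap B|}{|A \cup B|}.
```

Averaging this indicator over several independently chosen permutations therefore gives an unbiased estimate of the resemblance |A ∩ B| / |A ∪ B|, the quantity used to flag near-duplicate documents.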

345 citations


Patent
13 Apr 1998
TL;DR: In this paper, a computerized method is proposed that selectively accepts access requests from a client computer connected to a server computer by a network: the server computer receives an access request from the client computer and generates a predetermined number of random characters.
Abstract: A computerized method selectively accepts access requests from a client computer connected to a server computer by a network. The server computer receives an access request from the client computer. In response, the server computer generates a predetermined number of random characters. The random characters are used to form a string in the server computer. The string is randomly modified either visually or audibly to form a riddle. The original string becomes the correct answer to the riddle. The server computer renders the riddle on an output device of the client computer. In response, the client computer sends an answer to the server. Ideally, the answer is a human user's guess at the correct answer. The server determines whether the guess is the correct answer, and if so, the access request is accepted. If the correct answer is not received within a predetermined amount of time, the connection between the client and server computer is terminated by the server on the assumption that an automated agent is operating in the client on behalf of the user.
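
A minimal sketch of the challenge/response flow described here, with the visual or audible distortion of the string elided (the constants and function names below are illustrative assumptions, not from the patent):

```python
import random
import string
import time

CHALLENGE_LENGTH = 6        # the "predetermined number of random characters"
TIME_LIMIT_SECONDS = 60.0   # deadline before the server drops the connection

def make_challenge():
    """Generate the answer string; distorting it into a visual/audible riddle is elided."""
    answer = "".join(random.choices(string.ascii_uppercase + string.digits, k=CHALLENGE_LENGTH))
    return answer, time.monotonic()

def check_answer(answer, issued_at, guess):
    """Accept the access request only if the correct answer arrives within the time limit."""
    if time.monotonic() - issued_at > TIME_LIMIT_SECONDS:
        return False  # presumed automated agent; the server terminates the connection
    return guess.strip().upper() == answer
```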

281 citations


Patent
26 Mar 1998
TL;DR: In this article, a computer-implemented method determines the resemblance of data objects such as Web pages: each data object is partitioned into a sequence of tokens, and the tokens are grouped into overlapping sets to form shingles.
Abstract: A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets of tokens to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. The minimum element of the image of the document's set of fingerprints, under each of a plurality of pseudo-random permutations of the set of all fingerprints, is selected to generate a sketch of each data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
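
A compact sketch of the pipeline: tokens are shingled, shingles are fingerprinted, one minimum is kept per pseudo-random permutation, and groups of minima are fingerprinted again into features. The hash function (standing in for the fingerprint function) and the parameter values below are placeholders, not the patent's actual scheme:

```python
import hashlib

def shingles(text, k=4):
    """Group the token sequence into overlapping k-token shingles."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def fingerprint(value, seed=0):
    """Stand-in for the fingerprint function; any well-mixed hash works for this sketch."""
    return int(hashlib.sha1(f"{seed}:{value}".encode()).hexdigest(), 16)

def sketch(text, num_permutations=84, k=4):
    """Keep the minimum fingerprint under each 'permutation' (simulated by a seeded hash)."""
    fps = {fingerprint(s) for s in shingles(text, k)}
    return [min(fingerprint(fp, seed) for fp in fps) for seed in range(num_permutations)]

def features(doc_sketch, group_size=14):
    """Fingerprint consecutive groups of sketch entries; documents sharing more than a
    chosen number of these features are estimated to be nearly identical."""
    groups = [tuple(doc_sketch[i:i + group_size]) for i in range(0, len(doc_sketch), group_size)]
    return {fingerprint(repr(g)) for g in groups}
```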

243 citations


Journal ArticleDOI
01 Apr 1998
TL;DR: A server is described that provides linkage information for all pages indexed by the AltaVista search engine and can produce the entire neighbourhood of a set L of URLs up to a given distance; numerous other applications such as ranking, visualization, and classification are envisaged.
Abstract: We have built a server that provides linkage information for all pages indexed by the AltaVista search engine. In its basic operation, the server accepts a query consisting of a set L of one or more URLs and returns a list of all pages that point to pages in L (predecessors) and a list of all pages that are pointed to from pages in L (successors). More generally, the server can produce the entire neighbourhood (in the graph theory sense) of L up to a given distance and can include information about all links that exist among pages in the neighbourhood. Although some of this information can be retrieved directly from AltaVista or other search engines, these engines are not optimized for this purpose and the process of constructing the neighbourhood of a given set of pages is slow and laborious. In contrast, our prototype server needs less than 0.1 ms per result URL. So far we have built two applications that use the Connectivity Server: a direct interface that permits fast navigation of the Web via the predecessor/successor relation, and a visualization tool for the neighbourhood of a given set of pages. We envisage numerous other applications such as ranking, visualization, and classification.
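
A hedged sketch of the basic neighbourhood query, using a plain in-memory adjacency map rather than the server's compressed representation (the data structures below are illustrative):

```python
from collections import deque

# Illustrative in-memory link graph; the real server uses a compressed representation.
OUT = {}  # url -> set of urls it points to (successors)
IN = {}   # url -> set of urls that point to it (predecessors)

def neighbourhood(seed_urls, distance=1):
    """Return every page within `distance` links of the seed set L,
    following links in both directions."""
    seen = set(seed_urls)
    frontier = deque((url, 0) for url in seed_urls)
    while frontier:
        url, d = frontier.popleft()
        if d == distance:
            continue
        for nxt in OUT.get(url, set()) | IN.get(url, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen
```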

225 citations


Patent
10 Mar 1998
TL;DR: A server computer is provided for representing and navigating the connectivity of Web pages; the pages include links to other Web pages and have associated names (URLs), and these names are sorted in a memory of the connectivity server.
Abstract: A server computer is provided for representing and navigating the connectivity of Web pages. The Web pages include links to other Web pages. The links and Web pages have associated names (URLs). The names of the Web pages are sorted in a memory of the connectivity server. The sorted names are delta encoded while periodically storing full names as checkpoints in the memory. Each delta encoded name and checkpoint has a unique identification. A list of pairs of identifications representing existent links is sorted twice, first according to the first identification of each pair to produce an inlist, and second according to the second identification of each pair to produce an outlist. An array of elements is stored in the memory; there is one array element for each Web page. Each element includes a first pointer to one of the checkpoints, a second pointer to an associated inlist of the Web page, and a third pointer to an associated outlist of the Web page. The array is indexed by a particular identification to locate connected Web pages.
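
A minimal sketch of the URL table described here, reading "delta encoding" as shared-prefix coding of the sorted names with periodic full-name checkpoints (that reading is an assumption on our part, and the constants are illustrative):

```python
import os

CHECKPOINT_EVERY = 16  # store a full URL every N entries so decoding can restart locally

def encode(sorted_urls):
    """Front-code sorted URLs as (shared_prefix_len, suffix) pairs, with a full name
    stored as a checkpoint every CHECKPOINT_EVERY entries."""
    table, prev = [], ""
    for i, url in enumerate(sorted_urls):
        if i % CHECKPOINT_EVERY == 0:
            table.append((None, url))             # checkpoint: full name
        else:
            shared = len(os.path.commonprefix([prev, url]))
            table.append((shared, url[shared:]))  # delta relative to the previous name
        prev = url
    return table

def decode(table, ident):
    """Recover the URL with identification `ident` starting from its nearest checkpoint."""
    start = (ident // CHECKPOINT_EVERY) * CHECKPOINT_EVERY
    url = table[start][1]
    for shared, suffix in table[start + 1: ident + 1]:
        url = url[:shared] + suffix
    return url
```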

91 citations


Patent
23 Nov 1998
TL;DR: A method for facilitating the comparison of two computerized documents is proposed, which includes loading a first document into a random access memory (RAM), loading a second document into the RAM, reducing the first document to a first sequence of tokens, reducing the second document to a second sequence of tokens, converting the first sequence of tokens to a first (multi)set of shingles, and converting the second sequence of tokens to a second (multi)set of shingles.
Abstract: A method for facilitating the comparison of two computerized documents. The method includes loading a first document into a random access memory (RAM), loading a second document into the RAM, reducing the first document into a first sequence of tokens, reducing the second document into a second sequence of tokens, converting the first set of tokens to a first (multi)set of shingles, converting the second set of tokens to a second (multi)set of shingles, determining a first sketch of the first (multi)set of shingles, determining a second sketch of the second (multi)set of shingles, and comparing the first sketch and the second sketch. The sketches have a fixed size, independent of the size of the documents. The resemblance of two documents is provided using a sketch of each document. The sketches can be computed fairly fast, and given two sketches, the resemblance of the corresponding documents can be computed in time linear in the size of the sketches.
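
Given two fixed-size sketches built with the same family of permutations, one standard way to compare them in time linear in the sketch size is to count the positions where the per-permutation minima agree (a sketch under that assumption, not necessarily the patent's exact comparison step):

```python
def resemblance(sketch_a, sketch_b):
    """Estimate document resemblance as the fraction of sketch positions whose
    per-permutation minima agree; runs in time linear in the sketch size."""
    assert len(sketch_a) == len(sketch_b), "sketches must use the same permutations"
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)
```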

42 citations


Book ChapterDOI
08 Oct 1998
TL;DR: It is shown that approximate min-wise independence allows similar uses, by presenting a derandomization of the RNC algorithm for approximate set cover due to S. Rajagopalan and V. Vazirani.
Abstract: Min-wise independence is a recently introduced notion of limited independence, similar in spirit to pairwise independence. The latter has proven essential for the derandomization of many algorithms. Here we show that approximate min-wise independence allows similar uses, by presenting a derandomization of the RNC algorithm for approximate set cover due to S. Rajagopalan and V. Vazirani. We also discuss how to derandomize their set multi-cover and multi-set multi-cover algorithms in restricted cases. The multi-cover case leads us to discuss the concept of k-minima-wise independence, a natural counterpart to k-wise independence.

31 citations


Patent
13 Jul 1998
TL;DR: A method is presented for determining the random permutation of input lines that produced a permuted set of bits in a bitstream: a logic element with permutable input lines is replaced by a test function, which is then probed with test values to discover the permutation.
Abstract: A method determines a random permutation of input lines that produced a permuted set of bits in a bitstream. In a source design, the method replaces a logic element whose input lines are permutable with a test function. The source design is processed by a design tool to generate the bitstream including the permuted set of bits. The test function is probed with test values, and the probe results are compared with the permuted set of bits to discover the permutation of the set of bits. The test values can include a message.
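
One simple probing strategy consistent with this description is to drive the test function with one-hot test values, so that each probe reveals where a single input line lands in the permuted output. A hedged sketch (apply_design is a hypothetical stand-in for running the processed design):

```python
def discover_permutation(apply_design, num_lines):
    """Recover the unknown permutation by probing with one-hot test values:
    set a single input line and observe which output bit responds.

    `apply_design(bits)` is a stand-in for running the processed design / test
    function on an input bit-vector and returning the permuted output bits."""
    perm = [None] * num_lines
    for i in range(num_lines):
        probe = [0] * num_lines
        probe[i] = 1
        out = apply_design(probe)
        perm[i] = out.index(1)  # input line i maps to this output position
    return perm
```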

Journal Article
TL;DR: The Connectivity Server as mentioned in this paper provides linkage information for all pages indexed by the AltaVista search engine and can produce the entire neighbourhood (in the graph theory sense) of L up to a given distance and can include information about all links that exist among pages in the neighbourhood.

Book ChapterDOI
20 Apr 1998
TL;DR: This work gives the first routing algorithm on this topology that is stable under an injection rate within a constant factor of the hardware bandwidth; the analysis holds for a broad range of stochastic packet-generation distributions.
Abstract: We study the performance of packet routing on arrays (or meshes) with bounded buffers in the routing switches, assuming that new packets are continuously inserted at all the nodes. We give the first routing algorithm on this topology that is stable under an injection rate within a constant factor of the hardware bandwidth. Unlike previous results, our algorithm does not require the global synchronization of the insertion times or the retraction and reinsertion of excessively delayed messages, and our analysis holds for a broad range of stochastic packet-generation distributions. This result represents a new application of a general technique for the design and analysis of dynamic algorithms that we first presented in [Broder et al., FOCS 96, pp. 390–399].