
Showing papers by "Andrei Z. Broder" published in 1997


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document.
Abstract: Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.

1,989 citations
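
A minimal sketch of the resemblance and containment measures described in the abstract above, assuming each document is reduced to its set of contiguous w-word "shingles" and that the measures are the usual set ratios over those sets. The helper names (`shingles`, `sketch`, `estimate_resemblance`), the default `w=4` and `s=64`, and the use of a truncated SHA-1 digest in place of Rabin fingerprints are illustrative assumptions, not details taken from this listing.

```python
import hashlib

def shingles(text, w=4):
    """Return the set of contiguous w-word subsequences of text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """|S(A) & S(B)| / |S(A) | S(B)|: 'roughly the same'."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def containment(a, b, w=4):
    """|S(A) & S(B)| / |S(A)|: A is 'roughly contained' in B."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 1.0

def fingerprint(shingle):
    # Assumption: a 64-bit slice of SHA-1 stands in for a Rabin fingerprint.
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch(text, s=64, w=4):
    """Fixed-size sample: the s smallest fingerprints, computed per document."""
    return sorted({fingerprint(sh) for sh in shingles(text, w)})[:s]

def estimate_resemblance(sketch_a, sketch_b, s=64):
    """Estimate resemblance from two sketches alone.

    Take the s smallest fingerprints of the combined sketches and count
    how many of them appear in both sketches.
    """
    smallest = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not smallest:
        return 1.0
    both = set(sketch_a) & set(sketch_b)
    return sum(1 for f in smallest if f in both) / len(smallest)
```

Because each sketch depends only on its own document, sketches can be computed once and stored; comparing two documents then needs only the two fixed-size samples. When a document has fewer than `s` shingles, the sketch holds all of them and the estimate coincides with the exact value.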


Journal ArticleDOI
01 Sep 1997
TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.
Abstract: We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.

1,560 citations


Patent
02 Jun 1997
TL;DR: A method of operating a multiprocessor system having a predefined number of processing units for processing data includes obtaining load information representing the load of each of a number of randomly selected processing units.
Abstract: A method of operating a multiprocessor system having a predefined number of processing units for processing data includes obtaining load information representing the load of each of a number of randomly selected processing units. The number of randomly selected processing units is greater than 1 and substantially less than the predefined number of processing units. The least loaded of the randomly selected processing units is identified from the obtained load information. The data is directed to the identified least loaded randomly selected processing unit for processing.

95 citations
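
The selection step in the abstract above can be sketched as follows, under the assumption that "load" is simply a count of jobs assigned to each unit. The function names `pick_unit` and `dispatch`, the default of `d=2` sampled units, and the seeded generator are hypothetical choices for illustration only.

```python
import random

def pick_unit(loads, d=2, rng=random):
    """Sample d distinct units at random and return the index of the least loaded."""
    candidates = rng.sample(range(len(loads)), d)
    return min(candidates, key=lambda i: loads[i])

def dispatch(num_units, num_jobs, d=2, seed=0):
    """Assign num_jobs one at a time, each to the least loaded of d sampled units."""
    rng = random.Random(seed)
    loads = [0] * num_units  # assumption: load = number of jobs assigned so far
    for _ in range(num_jobs):
        loads[pick_unit(loads, d, rng)] += 1
    return loads
```

Only `d` units are queried per job, so the cost of gathering load information stays small even when the total number of units is large, which matches the abstract's requirement that the sample be greater than 1 and substantially smaller than the whole system.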


Patent
19 Dec 1997
TL;DR: A set of documents is stored in the memories of server computers, which can be connected to each other by a network such as the Internet. A search engine indexes the documents and also maintains, for each indexed document, a first abstract that is highly dependent on the document's content.
Abstract: Provided is a computerized method for monitoring the content of documents. A set of documents is stored in memories of server computers. The server computers can be connected to each other by a network such as the Internet. Entries are generated in a search engine for each document of the set. The search engine is also connected to the Internet. The entries are in the form of a full word index of the set of documents. The search engine also maintains a first abstract for each document that is indexed. The abstract is highly dependent on the content of each document. For example, the abstract is in the form of a sketch or a feature vector. Periodically a query is submitted to the search engine. The query locates a result set of documents that satisfy the query. A second abstract is generated for each document member of the result set. The first and second abstracts are compared to identify documents that have changed between the time the set of documents were indexed and the time the result set is generated.

79 citations


Patent
29 Oct 1997
TL;DR: Server computers store a plurality of Web pages, which are partitioned into sets of pages that are substantially similar in content; a preset compression dictionary is generated for each set of Web pages.
Abstract: In a distributed network, client computers are connected to server computers. The server computers store a plurality of Web pages. The Web pages are partitioned into sets, where each set includes Web pages that are substantially similar in content. A preset compression dictionary is generated for each set of Web pages. In addition, a fingerprint is generated for each preset dictionary. The fingerprints uniquely identify each of the preset dictionaries. When one of the client computers requests one of the Web pages, a compressed form of the Web page is sent along with the fingerprint of the dictionary that was used to compress the Web page. The client computer can then request the preset dictionary in order to decompress the Web page when the client does not have a copy of the preset dictionary.

74 citations
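
A rough sketch of the exchange described in the abstract above, using the preset-dictionary support in Python's `zlib` module (`zdict`). The choice of DEFLATE, the use of SHA-256 as the dictionary fingerprint, and the names `compress_page`, `Client`, and `fetch_dictionary` are assumptions made for illustration; the listing does not specify the compression scheme or the fingerprint function.

```python
import hashlib
import zlib

def dictionary_fingerprint(dictionary: bytes) -> str:
    # Assumption: a SHA-256 hex digest serves as the dictionary's identifier.
    return hashlib.sha256(dictionary).hexdigest()

def compress_page(page: bytes, dictionary: bytes):
    """Server side: return (fingerprint of the dictionary used, compressed page)."""
    comp = zlib.compressobj(zdict=dictionary)
    payload = comp.compress(page) + comp.flush()
    return dictionary_fingerprint(dictionary), payload

class Client:
    """Client side: caches dictionaries and requests one only when it is missing."""

    def __init__(self, fetch_dictionary):
        self.cache = {}
        self.fetch_dictionary = fetch_dictionary  # hypothetical callback to the server

    def decompress_page(self, fingerprint: str, payload: bytes) -> bytes:
        if fingerprint not in self.cache:
            self.cache[fingerprint] = self.fetch_dictionary(fingerprint)
        decomp = zlib.decompressobj(zdict=self.cache[fingerprint])
        return decomp.decompress(payload) + decomp.flush()
```

Since similar pages share one dictionary, a client that has already fetched the dictionary for a set can decompress every later page from that set without another dictionary request.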


Journal ArticleDOI
TL;DR: An algorithm for counting the number of minimum weight spanning trees is presented, based on the fact that the generating function for the number of spanning trees of a given graph, by weight, can be expressed as a simple determinant.

24 citations


Patent
15 Sep 1997
TL;DR: A computerized method estimates the probability of a collision among fingerprints of dissimilar strings; new fingerprints are masked, and the number of matching masked fingerprints is recorded.
Abstract: Strings, such as Web pages or other documents, are fingerprinted in order to detect substantially similar strings, so as to avoid processing duplicate strings. At the same time, a computerized method estimates the probability of a collision among fingerprints of dissimilar strings. As fingerprints are generated for strings presented for processing, when the fingerprint of a string is determined not to be identical to any fingerprint in a set of stored fingerprints, the new fingerprint is masked and the unmasked portion of the fingerprint is compared with a corresponding portion of the fingerprints in the stored set. Information is recorded regarding the number of matching masked fingerprints.

19 citations
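
One possible reading of the masking step in the abstract above is sketched here: a new fingerprint that differs from every stored fingerprint is reduced to a few retained bits, and a match on those bits against the stored set is counted. The class name `FingerprintStore`, the `kept_bits` parameter, and the truncated SHA-1 fingerprint are hypothetical; how the recorded counts are turned into a probability estimate is not described in this listing and is left out.

```python
import hashlib

class FingerprintStore:
    def __init__(self, kept_bits=16):
        self.mask = (1 << kept_bits) - 1  # bits that survive masking
        self.full = set()                 # full fingerprints seen so far
        self.partial = set()              # unmasked portions seen so far
        self.distinct = 0                 # strings accepted as new
        self.partial_matches = 0          # new strings whose unmasked bits matched

    @staticmethod
    def fingerprint(s):
        # Assumption: a 64-bit slice of SHA-1 stands in for the real fingerprint.
        digest = hashlib.sha1(s.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def offer(self, s):
        """Return True if s is new, False if its full fingerprint is already stored."""
        fp = self.fingerprint(s)
        if fp in self.full:
            return False  # treated as a duplicate; skip further processing
        kept = fp & self.mask
        if kept in self.partial:
            # Full fingerprints differ but the unmasked portions agree:
            # a collision of the shorter fingerprint.
            self.partial_matches += 1
        self.full.add(fp)
        self.partial.add(kept)
        self.distinct += 1
        return True
```

Fewer retained bits make such matches frequent enough to observe, which is what makes the recorded count usable as evidence about collision rates.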


Proceedings ArticleDOI
04 May 1997
TL;DR: The random walk approach gives a simple and fully distributed solution for the problem of virtual circuit switching in bounded degree expander graphs and shows that if the injection to the network and the duration of connections are both controlled by Poisson processes then the algorithm achieves a steady state utilization of the network.
Abstract: This paper addresses the problem of virtual circuit switching in bounded degree expander graphs. We study the static and dynamic versions of this problem. Our solutions are based on the rapidly mixing properties of random walks on expander graphs. In the static version of the problem an algorithm is required to route a path between each of K pairs of vertices so that no edge is used by more than g paths. A natural approach to this problem is through a multicommodity flow reduction. However, we show that the random walk approach leads to significantly stronger results than those recently obtained by Leighton and Rao [10] using the multicommodity flow setup. In the dynamic version of the problem connection requests are continuously injected into the network. Once a connection is established it utilizes a path (a virtual circuit) for a certain time until the communication terminates and the path is deleted. Again each edge in the network should not be used by more than g paths at once. The dynamic version is a better model for the practical use of communication networks. Our random walk approach gives a simple and fully distributed solution for this problem. We show that if the injection to the network and the duration of connections are both controlled by Poisson processes then our algorithm achieves a steady state utilization of the network which is similar to the utilization achieved in the static case situation.
Author affiliations (from the paper's footnotes): Digital Systems Research Center, 130 Lytton Ave, Palo Alto, CA 94301; Department of Mathematics, Carnegie-Mellon University (a portion of this work was done while the author was visiting Digital SRC; supported in part by NSF grants CCR-9225008 and CCR-9530974); IBM Almaden Research Center, San Jose, CA 95120, and Department of Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.

11 citations


Patent
21 Aug 1997
TL;DR: A depth-first search of a flow diagram representing the execution of a program is performed; the search proceeds simultaneously for all the registers and identifies the free registers.
Abstract: A system and method for identifying free registers within a program. A depth first search of a flow diagram representing the execution of a program is performed. The search proceeds simultaneously for all the registers and identifies the free registers from the search. The free registers may then be utilized for various applications without saving and restoring the contents of these registers to memory. The system may limit the amount of time spent searching for free registers with a timer.

7 citations


Proceedings Article
01 Jan 1997
TL;DR: The random walk approach gives a simple and fully distributed solution for the problem of virtual circuit switching in bounded degree expander graphs and shows that if the injection to the network and the duration of connections are both controlled by Poisson processes then the algorithm achieves a steady state utilization of the network.
Abstract: This paper addresses the problem of virtual circuit switching in bounded degree expander graphs. We study the static and dynamic versions of this problem. Our solutions are based on the rapidly mixing properties of random walks on expander graphs. In the static version of the problem an algorithm is required to route a path between each of K pairs of vertices so that no edge is used by more than g paths. A natural approach to this problem is through a multicommodity flow reduction. However, we show that the random walk approach leads to significantly stronger results than those recently obtained by Leighton and Rao [10] using the multicommodity flow setup. In the dynamic version of the problem connection requests are continuously injected into the network. Once a connection is established it utilizes a path (a virtual circuit) for a certain time until the communication terminates and the path is deleted. Again each edge in the network should not be used by more than g paths at once. The dynamic version is a better model for the practical use of communication networks. Our random walk approach gives a simple and fully distributed solution for this problem. We show that if the injection to the network and the duration of connections are both controlled by Poisson processes then our algorithm achieves a steady state utilization of the network which is similar to the utilization achieved in the static case situation.

2 citations