
Showing papers by "Andrei Z. Broder published in 2003"


Proceedings ArticleDOI
Andrei Z. Broder1, David Carmel1, Michael Herscovici1, Aya Soffer1, Jason Zien1 
03 Nov 2003
TL;DR: An efficient query evaluation method based on a two-level approach that reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall.
Abstract: We present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. The efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. The amount of pruning can be controlled by the user as a function of time allocated for query evaluation. Experimentally, using the TREC Web Track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND or Weak AND that might be of independent interest.
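The two-level scheme described above can be sketched as follows. This is a simplified illustration (unordered postings, one upper bound per term, a fixed threshold), not the paper's WAND implementation, and all names are hypothetical.

```python
# Two-level query evaluation sketch: a cheap approximate score from
# per-term upper bounds decides which documents get a full evaluation.
def two_level_evaluate(postings, upper_bound, full_score, threshold):
    """postings: dict term -> set of doc ids (simplified, unordered).
    upper_bound: dict term -> maximum contribution of that term.
    full_score: function(doc, matched_terms) -> exact score.
    Returns (results, number of full evaluations performed)."""
    # Invert the postings: doc -> terms that match it.
    matched = {}
    for term, docs in postings.items():
        for d in docs:
            matched.setdefault(d, []).append(term)

    results = {}
    full_evals = 0
    for doc, terms in matched.items():
        # First level: approximate score from partial information only.
        approx = sum(upper_bound[t] for t in terms)
        if approx < threshold:
            continue  # pruned without ever computing the exact score
        # Second level: the promising candidate is fully evaluated.
        full_evals += 1
        score = full_score(doc, terms)
        if score >= threshold:
            results[doc] = score
    return results, full_evals
```

Raising the threshold prunes more aggressively, which mirrors the paper's user-controlled trade-off between evaluation time and effectiveness.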

435 citations


Patent
06 May 2003
TL;DR: In this article, a method and apparatus for ranking a plurality of pages identified during a search of a linked database includes forming a linear combination of two or more matrices, and using the coefficients of the eigenvector of the resulting matrix to rank the quality of the pages.
Abstract: A method and apparatus for ranking a plurality of pages identified during a search of a linked database includes forming a linear combination of two or more matrices, and using the coefficients of the eigenvector of the resulting matrix to rank the quality of the pages. The matrices include information about the pages and are generally normalized, stochastic matrices. The linear combination can include attractor matrices that indicate desirable or “high quality” sites, and/or non-attractor matrices that indicate sites that are undesirable. Attractor matrices and non-attractor matrices can be used alone or in combination with each other in the linear combination. Additional bias toward high quality sites, or away from undesirable sites, can be further introduced with probability weighting matrices for attractor and non-attractor matrices. Other known matrices, such as a co-citation matrix or a bibliographic coupling matrix, can also be used in the present invention.
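A minimal sketch of the core idea: rank pages by the principal eigenvector of a linear combination of row-stochastic matrices, computed here with plain power iteration. The matrices and weights below are illustrative, not taken from the patent.

```python
# Rank pages via the stationary distribution of a convex combination of
# row-stochastic matrices (e.g. a link matrix plus an "attractor" matrix).
def rank_pages(matrices, weights, n_iter=100):
    """matrices: list of row-stochastic matrices (lists of lists of float).
    weights: mixing coefficients summing to 1.
    Returns the stationary distribution; a higher value means a higher rank."""
    n = len(matrices[0])
    # Linear combination M = sum_k w_k * M_k (still row-stochastic).
    M = [[sum(w * mat[i][j] for w, mat in zip(weights, matrices))
          for j in range(n)] for i in range(n)]
    v = [1.0 / n] * n
    for _ in range(n_iter):
        # Power iteration: left-multiply the row vector by M.
        v = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        s = sum(v)
        v = [x / s for x in v]
    return v
```

With an attractor matrix that sends every page toward a "high quality" node, the combined stationary distribution is biased toward that node, which is the effect the patent describes.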

240 citations


Patent
Andrei Z. Broder1, David Carmel1, Herscovici Michael1, Aya Soffer1, Jason Zien1 
30 May 2003
TL;DR: In this paper, the authors present a system architecture, components and a searching technique for an unstructured information management system (UIMS), which is provided as middleware for the effective management and interchange of information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. A search query includes a search operator containing a plurality of search sub-expressions, each having an associated weight value. The search engine returns a document or documents having a weight value sum that exceeds a threshold weight value sum. The search operator is implemented as a Boolean predicate that functions as a Weighted AND (WAND).
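The Weighted AND predicate described above can be sketched as a Boolean over weighted sub-expressions. This is a toy illustration with hypothetical names, not the patent's implementation.

```python
# WAND sketch: a document matches when the sum of weights of its
# satisfied sub-expressions exceeds the threshold.
def wand(doc, sub_expressions, threshold):
    """sub_expressions: list of (predicate, weight) pairs, where each
    predicate is a function(doc) -> bool."""
    total = sum(w for pred, w in sub_expressions if pred(doc))
    return total > threshold
```

Note that WAND generalizes both classical operators: with unit weights, a threshold just below the number of sub-expressions behaves like AND, and a threshold near zero behaves like OR.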

210 citations


Patent
03 Nov 2003
TL;DR: In this article, a method for identifying related pages among a plurality of pages in a linked database such as the World Wide Web is described, in which an initial page is selected from the plurality of web pages and pages linked to the initial page are represented as a graph in a memory.
Abstract: A method is described for identifying related pages among a plurality of pages in a linked database such as the World Wide Web. An initial page is selected from the plurality of pages. Pages linked to the initial page are represented as a graph in a memory. The pages represented in the graph are scored on content, and a set of pages is selected, the selected set of pages having scores greater than a first predetermined threshold. The selected set of pages is scored on connectivity, and a subset of the set of pages that have scores greater than a second predetermined threshold are selected as related pages.
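The two-stage selection can be sketched as follows, with stand-in scoring functions and thresholds (the patent's content and connectivity scores are not reproduced here).

```python
# Two-stage filter over the neighborhood graph of an initial page:
# content score first, then connectivity score among the survivors.
def related_pages(pages, content_score, connectivity_score, t1, t2):
    """pages: iterable of page ids linked to the initial page.
    Returns the pages whose scores clear both thresholds."""
    # Stage 1: keep pages whose content score exceeds the first threshold.
    stage1 = [p for p in pages if content_score(p) > t1]
    # Stage 2: of those, keep pages whose connectivity score exceeds the second.
    return [p for p in stage1 if connectivity_score(p) > t2]
```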

167 citations


Patent
30 May 2003
TL;DR: In this paper, the authors present a system architecture, components and a searching technique for an unstructured information management system (UIMS), which is provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. Also disclosed is system, method and computer program product to process document data. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data for identifying and annotating a particular type of semantic content. Operating the at least one text analysis engine generates a plurality of views of a document, where each of the plurality of views are derived from a different tokenization of the document. The method further includes storing the plurality of views in a common data structure associated with the document.
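The multiple-views idea above can be sketched as below: each tokenizer produces a different tokenization of the same document, and every resulting view is stored in one common structure keyed by view name. The tokenizers and the structure are illustrative assumptions, not the patented format.

```python
# Build several views of one document, each derived from a different
# tokenization, and keep them in a single common data structure.
def build_views(document, tokenizers):
    """tokenizers: dict view_name -> function(text) -> list of tokens."""
    return {
        "text": document,  # the original document data
        "views": {name: tok(document) for name, tok in tokenizers.items()},
    }
```

In a real pipeline each tokenizer would feed a chain of annotators marking semantic content; here they are reduced to plain functions for brevity.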

167 citations


Proceedings ArticleDOI
20 May 2003
TL;DR: Caching is very effective - in this setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%.
Abstract: Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective - in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
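CLOCK, one of the practical policies compared in the paper, can be sketched as follows. This is a generic CLOCK cache for "seen" URLs, not the authors' crawler code, and the tiny capacity in the example is purely illustrative.

```python
# CLOCK cache: on a miss, a "hand" sweeps the slots, clearing reference
# bits until it finds a slot with a cleared bit to evict.
class ClockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity   # cached URLs
        self.ref = [False] * capacity    # reference ("recently used") bits
        self.index = {}                  # url -> slot number
        self.hand = 0

    def seen(self, url):
        """Return True on a cache hit; on a miss, insert the URL."""
        if url in self.index:
            self.ref[self.index[url]] = True  # hit: mark as recently used
            return True
        # Miss: advance the hand, giving referenced entries a second chance.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]
        self.slots[self.hand] = url
        self.index[url] = self.hand
        self.ref[self.hand] = True
        self.hand = (self.hand + 1) % self.capacity
        return False
```

In a crawler, the hit rate of `seen()` over the URL stream is exactly the quantity the paper's simulations measure for each cache size.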

84 citations


Patent
30 May 2003
TL;DR: In this paper, a system architecture, components and a searching technique for an unstructured information management system (UIMS) is described, where the UIMS may be provided as middleware for the effective management and interchange of information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique.

70 citations


Patent
Andrei Z. Broder1, David A. Ferrucci1, Alan David Marwick1, Yosi Mass1, Wlodek Zadrozny1 
30 May 2003
TL;DR: In this article, the authors present a system architecture, components and a searching technique for an unstructured information management system (UIMS), which may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. The data processing system includes a token inverted file system storing tokens obtained by at least one tokenizer from document data. An annotation inverted file system stores annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
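The token and annotation inverted files described above can be sketched roughly as below. The layout (dicts of postings and of (start, end) token spans) is a simplifying assumption, not the patent's storage format.

```python
# Build two inverted files for one document: tokens -> positions, and
# annotations -> the token spans each occurrence covers.
def build_indexes(tokens, annotations):
    """tokens: list of token strings for one document.
    annotations: list of (label, start, end) spans over token positions.
    Returns (token_index, annotation_index)."""
    token_index = {}
    for pos, tok in enumerate(tokens):
        token_index.setdefault(tok, []).append(pos)
    annotation_index = {}
    for label, start, end in annotations:
        # Each listed occurrence stores the token locations it spans.
        annotation_index.setdefault(label, []).append((start, end))
    return token_index, annotation_index
```

The span representation is what lets a query match an annotation (say, a "City") against the token positions it covers.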

48 citations


Patent
13 Jan 2003
TL;DR: A sequence of tokens is derived from a specified object, such as a name or an address, a set of features is generated for the sequence, and database objects sharing a predefined number of feature values with the specified object are returned as similar.
Abstract: The invention provides a system and method for locating records in a database storing objects similar to a specified object. A set of object expansion rules and a set of canonicalization rules are applied to the specified object to generate a sequence of tokens. A set of features are then generated for the sequence of tokens. Generating a set of features includes: generating a set of characters from the sequence of tokens; assigning an identification element to each character in the set of characters to create a set of identification elements; creating a set of permuted identification elements; selecting a predetermined number of permuted identification elements from the set of permuted identification elements; partitioning the selected, permuted identification elements into a plurality of groups; and producing a feature value from each of these groups. Finally, a set of objects from the database with a predefined number of feature values in common with those of the specified object are located. Each object in the set of objects is similar to the specified object. Further, an object may be, for example, a name or an address.
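The steps above (tokenize, shingle, permute, select, group) resemble min-hashing with LSH-style banding, sketched below under that reading. Seeded SHA-1 hashes stand in for the permutations, and every parameter here is an illustrative assumption rather than the patent's choice.

```python
import hashlib

# Min-hash style feature generation: similar token sequences share
# feature values with high probability.
def features(tokens, n_hashes=8, group_size=2):
    text = " ".join(tokens)
    # Character 4-grams play the role of the generated character set.
    shingles = {text[i:i + 4] for i in range(max(1, len(text) - 3))}
    # One minimum per seeded hash, standing in for permuted id selection.
    minhashes = []
    for seed in range(n_hashes):
        minhashes.append(min(
            int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles))
    # Partition the selected values into groups and produce one feature
    # value per group (banding, as in locality-sensitive hashing).
    feats = []
    for i in range(0, n_hashes, group_size):
        group = tuple(minhashes[i:i + group_size])
        feats.append(hash(group) & 0xFFFFFFFF)
    return feats
```

Lookup then amounts to retrieving database objects that share a predefined number of these feature values with the query object.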

25 citations


Journal ArticleDOI
TL;DR: It is shown that approximate min-wise independence allows similar uses, by presenting a derandomization of the RNC algorithm for approximate set cover due to S. Rajagopalan and V. Vazirani.

11 citations


Proceedings ArticleDOI
Andrei Z. Broder1
28 Jul 2003
TL;DR: Some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph are explored, in particular in dealing with legitimate and "spam" perturbations of the "natural" process of birth and death of nodes and links.
Abstract: The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists. Recent results range from theoretical (e.g.: models for the graph, semi-external algorithms), to experimental (e.g.: new insights regarding the rate of change of pages, new data on the distribution of degrees), to practical (e.g.: improvements in crawling technology). The goal of this talk is to convey an introduction to the state of the art in this area and to sketch the current issues in collecting, representing, analyzing, and modeling this graph. Although graph analytic methods are essential tools in the Web IR arsenal, they are well known to the SIGIR community and will not be discussed here in any detail; instead, we will explore some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph, in particular in dealing with legitimate and "spam" perturbations of the "natural" process of birth and death of nodes and links, and conversely, the challenges and opportunities of using graph methods in support of IR on the Web and in the enterprise.