
Showing papers by "Andrei Z. Broder published in 2003"


Proceedings ArticleDOI
Andrei Z. Broder1, David Carmel1, Michael Herscovici1, Aya Soffer1, Jason Zien1 
03 Nov 2003
TL;DR: An efficient query evaluation method based on a two-level approach that reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall.
Abstract: We present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. The efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. The amount of pruning can be controlled by the user as a function of time allocated for query evaluation. Experimentally, using the TREC Web Track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND or Weak AND that might be of independent interest.
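The two-level scheme described above can be sketched as follows. This is a simplified illustration (unordered postings, one upper bound per term, a fixed threshold), not the paper's WAND implementation, and all names are hypothetical.

```python
# Two-level query evaluation sketch: a cheap approximate score from
# per-term upper bounds decides which documents get a full evaluation.
def two_level_evaluate(postings, upper_bound, full_score, threshold):
    """postings: dict term -> set of doc ids (simplified, unordered).
    upper_bound: dict term -> maximum contribution of that term.
    full_score: function(doc, matched_terms) -> exact score.
    Returns (results, number of full evaluations performed)."""
    # Invert the postings: doc -> terms that match it.
    matched = {}
    for term, docs in postings.items():
        for d in docs:
            matched.setdefault(d, []).append(term)

    results = {}
    full_evals = 0
    for doc, terms in matched.items():
        # First level: approximate score from partial information only.
        approx = sum(upper_bound[t] for t in terms)
        if approx < threshold:
            continue  # pruned without ever computing the exact score
        # Second level: the promising candidate is fully evaluated.
        full_evals += 1
        score = full_score(doc, terms)
        if score >= threshold:
            results[doc] = score
    return results, full_evals
```

Raising the threshold prunes more aggressively, which mirrors the paper's user-controlled trade-off between evaluation time and effectiveness.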

435 citations


Patent
06 May 2003
TL;DR: In this article, a method and apparatus for ranking a plurality of pages identified during a search of a linked database includes forming a linear combination of two or more matrices, and using the coefficients of the eigenvector of the resulting matrix to rank the quality of the pages.
Abstract: A method and apparatus for ranking a plurality of pages identified during a search of a linked database includes forming a linear combination of two or more matrices, and using the coefficients of the eigenvector of the resulting matrix to rank the quality of the pages. The matrices include information about the pages and are generally normalized, stochastic matrices. The linear combination can include attractor matrices that indicate desirable or “high quality” sites, and/or non-attractor matrices that indicate sites that are undesirable. Attractor matrices and non-attractor matrices can be used alone or in combination with each other in the linear combination. Additional bias toward high quality sites, or away from undesirable sites, can be further introduced with probability weighting matrices for attractor and non-attractor matrices. Other known matrices, such as a co-citation matrix or a bibliographic coupling matrix, can also be used in the present invention.
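A minimal sketch of the core idea: rank pages by the principal eigenvector of a linear combination of row-stochastic matrices, computed here with plain power iteration. The matrices and weights below are illustrative, not taken from the patent.

```python
# Rank pages via the stationary distribution of a convex combination of
# row-stochastic matrices (e.g. a link matrix plus an "attractor" matrix).
def rank_pages(matrices, weights, n_iter=100):
    """matrices: list of row-stochastic matrices (lists of lists of float).
    weights: mixing coefficients summing to 1.
    Returns the stationary distribution; a higher value means a higher rank."""
    n = len(matrices[0])
    # Linear combination M = sum_k w_k * M_k (still row-stochastic).
    M = [[sum(w * mat[i][j] for w, mat in zip(weights, matrices))
          for j in range(n)] for i in range(n)]
    v = [1.0 / n] * n
    for _ in range(n_iter):
        # Power iteration: left-multiply the row vector by M.
        v = [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]
        s = sum(v)
        v = [x / s for x in v]
    return v
```

With an attractor matrix that sends every page toward a "high quality" node, the combined stationary distribution is biased toward that node, which is the effect the patent describes.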

240 citations


Patent
Andrei Z. Broder1, David Carmel1, Herscovici Michael1, Aya Soffer1, Jason Zien1 
30 May 2003
TL;DR: In this paper, the authors present a system architecture, components and a searching technique for an unstructured information management system (UIMS), which is provided as middleware for the effective management and interchange of information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. A search query includes a search operator containing a plurality of search sub-expressions, each having an associated weight value. The search engine returns a document or documents having a weight value sum that exceeds a threshold weight value sum. The search operator is implemented as a Boolean predicate that functions as a Weighted AND (WAND).
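The Weighted AND predicate described above can be sketched as a Boolean over weighted sub-expressions. This is a toy illustration with hypothetical names, not the patent's implementation.

```python
# WAND sketch: a document matches when the sum of weights of its
# satisfied sub-expressions exceeds the threshold.
def wand(doc, sub_expressions, threshold):
    """sub_expressions: list of (predicate, weight) pairs, where each
    predicate is a function(doc) -> bool."""
    total = sum(w for pred, w in sub_expressions if pred(doc))
    return total > threshold
```

Note that WAND generalizes both classical operators: with unit weights, a threshold just below the number of sub-expressions behaves like AND, and a threshold near zero behaves like OR.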

210 citations


Patent
03 Nov 2003
TL;DR: In this article, a method for identifying related pages among a plurality of pages in a linked database such as the World Wide Web is described, in which an initial page is selected from the plurality of web pages and pages linked to the initial page are represented as a graph in a memory.
Abstract: A method is described for identifying related pages among a plurality of pages in a linked database such as the World Wide Web. An initial page is selected from the plurality of pages. Pages linked to the initial page are represented as a graph in a memory. The pages represented in the graph are scored on content, and a set of pages is selected, the selected set of pages having scores greater than a first predetermined threshold. The selected set of pages is scored on connectivity, and a subset of the set of pages that have scores greater than a second predetermined threshold are selected as related pages.
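The two-stage selection can be sketched as follows, with stand-in scoring functions and thresholds (the patent's content and connectivity scores are not reproduced here).

```python
# Two-stage filter over the neighborhood graph of an initial page:
# content score first, then connectivity score among the survivors.
def related_pages(pages, content_score, connectivity_score, t1, t2):
    """pages: iterable of page ids linked to the initial page.
    Returns the pages whose scores clear both thresholds."""
    # Stage 1: keep pages whose content score exceeds the first threshold.
    stage1 = [p for p in pages if content_score(p) > t1]
    # Stage 2: of those, keep pages whose connectivity score exceeds the second.
    return [p for p in stage1 if connectivity_score(p) > t2]
```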

167 citations


Patent
30 May 2003
TL;DR: In this paper, the authors present a system architecture, components and a searching technique for an unstructured information management system (UIMS), which is provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. Also disclosed is system, method and computer program product to process document data. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data for identifying and annotating a particular type of semantic content. Operating the at least one text analysis engine generates a plurality of views of a document, where each of the plurality of views are derived from a different tokenization of the document. The method further includes storing the plurality of views in a common data structure associated with the document.
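The multiple-views idea above can be sketched as below: each tokenizer produces a different tokenization of the same document, and every resulting view is stored in one common structure keyed by view name. The tokenizers and the structure are illustrative assumptions, not the patented format.

```python
# Build several views of one document, each derived from a different
# tokenization, and keep them in a single common data structure.
def build_views(document, tokenizers):
    """tokenizers: dict view_name -> function(text) -> list of tokens."""
    return {
        "text": document,  # the original document data
        "views": {name: tok(document) for name, tok in tokenizers.items()},
    }
```

In a real pipeline each tokenizer would feed a chain of annotators marking semantic content; here they are reduced to plain functions for brevity.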

167 citations


Proceedings ArticleDOI
20 May 2003
TL;DR: Caching is very effective - in this setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%.
Abstract: Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective - in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
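CLOCK, one of the practical policies compared in the paper, can be sketched as follows. This is a generic CLOCK cache for "seen" URLs, not the authors' crawler code, and the tiny capacity in the example is purely illustrative.

```python
# CLOCK cache: on a miss, a "hand" sweeps the slots, clearing reference
# bits until it finds a slot with a cleared bit to evict.
class ClockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity   # cached URLs
        self.ref = [False] * capacity    # reference ("recently used") bits
        self.index = {}                  # url -> slot number
        self.hand = 0

    def seen(self, url):
        """Return True on a cache hit; on a miss, insert the URL."""
        if url in self.index:
            self.ref[self.index[url]] = True  # hit: mark as recently used
            return True
        # Miss: advance the hand, giving referenced entries a second chance.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]
        self.slots[self.hand] = url
        self.index[url] = self.hand
        self.ref[self.hand] = True
        self.hand = (self.hand + 1) % self.capacity
        return False
```

In a crawler, the hit rate of `seen()` over the URL stream is exactly the quantity the paper's simulations measure for each cache size.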

84 citations


Patent
30 May 2003
TL;DR: In this paper, a system architecture, components and a searching technique for an unstructured information management system (UIMS) is described, where the UIMS may be provided as middleware for the effective management and interchange of information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique.

70 citations


Patent
Andrei Z. Broder1, David A. Ferrucci1, Alan David Marwick1, Yosi Mass1, Wlodek Zadrozny1 
30 May 2003
TL;DR: In this article, the authors present a system architecture, components and a searching technique for an unstructured information management system (UIMS), which may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources.
Abstract: Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. The data processing system includes a token inverted file system storing tokens obtained by at least one tokenizer from document data. An annotation inverted file system stores annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
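The token and annotation inverted files described above can be sketched roughly as below. The layout (dicts of postings and of (start, end) token spans) is a simplifying assumption, not the patent's storage format.

```python
# Build two inverted files for one document: tokens -> positions, and
# annotations -> the token spans each occurrence covers.
def build_indexes(tokens, annotations):
    """tokens: list of token strings for one document.
    annotations: list of (label, start, end) spans over token positions.
    Returns (token_index, annotation_index)."""
    token_index = {}
    for pos, tok in enumerate(tokens):
        token_index.setdefault(tok, []).append(pos)
    annotation_index = {}
    for label, start, end in annotations:
        # Each listed occurrence stores the token locations it spans.
        annotation_index.setdefault(label, []).append((start, end))
    return token_index, annotation_index
```

The span representation is what lets a query match an annotation (say, a "City") against the token positions it covers.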

48 citations


Patent
13 Jan 2003
TL;DR: A sequence of tokens is derived from a specified object, such as a name or an address, a set of features is generated for the sequence, and database objects sharing a predefined number of feature values with the specified object are returned as similar.
Abstract: The invention provides a system and method for locating records in a database storing objects similar to a specified object. A set of object expansion rules and a set of canonicalization rules are applied to the specified object to generate a sequence of tokens. A set of features are then generated for the sequence of tokens. Generating a set of features includes: generating a set of characters from the sequence of tokens; assigning an identification element to each character in the set of characters to create a set of identification elements; creating a set of permuted identification elements; selecting a predetermined number of permuted identification elements from the set of permuted identification elements; partitioning the selected, permuted identification elements into a plurality of groups; and producing a feature value from each of these groups. Finally, a set of objects from the database with a predefined number of feature values in common with those of the specified object are located. Each object in the set of objects is similar to the specified object. Further, an object may be, for example, a name or an address.
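The steps above (tokenize, shingle, permute, select, group) resemble min-hashing with LSH-style banding, sketched below under that reading. Seeded SHA-1 hashes stand in for the permutations, and every parameter here is an illustrative assumption rather than the patent's choice.

```python
import hashlib

# Min-hash style feature generation: similar token sequences share
# feature values with high probability.
def features(tokens, n_hashes=8, group_size=2):
    text = " ".join(tokens)
    # Character 4-grams play the role of the generated character set.
    shingles = {text[i:i + 4] for i in range(max(1, len(text) - 3))}
    # One minimum per seeded hash, standing in for permuted id selection.
    minhashes = []
    for seed in range(n_hashes):
        minhashes.append(min(
            int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles))
    # Partition the selected values into groups and produce one feature
    # value per group (banding, as in locality-sensitive hashing).
    feats = []
    for i in range(0, n_hashes, group_size):
        group = tuple(minhashes[i:i + group_size])
        feats.append(hash(group) & 0xFFFFFFFF)
    return feats
```

Lookup then amounts to retrieving database objects that share a predefined number of these feature values with the query object.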

25 citations


Journal ArticleDOI
TL;DR: It is shown that approximate min-wise independence allows similar uses, by presenting a derandomization of the RNC algorithm for approximate set cover due to S. Rajagopalan and V. Vazirani.

11 citations


Proceedings ArticleDOI
Andrei Z. Broder1
28 Jul 2003
TL;DR: Some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph are explored, in particular in dealing with legitimate and "spam" perturbations of the "natural" process of birth and death of nodes and links.
Abstract: The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval specialists. Recent results range from theoretical (e.g.: models for the graph, semi-external algorithms), to experimental (e.g.: new insights regarding the rate of change of pages, new data on the distribution of degrees), to practical (e.g.: improvements in crawling technology). The goal of this talk is to convey an introduction to the state of the art in this area and to sketch the current issues in collecting, representing, analyzing, and modeling this graph. Although graph analytic methods are essential tools in the Web IR arsenal, they are well known to the SIGIR community and will not be discussed here in any detail; instead, we will explore some challenges and opportunities for using IR methods and techniques in the exploration of the Web graph, in particular in dealing with legitimate and "spam" perturbations of the "natural" process of birth and death of nodes and links, and conversely, the challenges and opportunities of using graph methods in support of IR on the Web and in the enterprise.