Showing papers by "Andrei Z. Broder" published in 2005


Proceedings ArticleDOI
10 May 2005
TL;DR: This work presents and analyzes an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation and shows how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods.
Abstract: We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: determining the set of categories in a given taxonomy spanned by the search results; finding the range of metadata values associated with the result set in order to enable "multi-faceted search"; estimating the size of the result set; and data mining associations to the query terms. We present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and show how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
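The idea of a sampling next(p) stream can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the class `SampledPostingList`, the helper `sample_and`, and the use of geometric skips plus binary-search probing are all assumptions chosen for illustration. Each posting of the shorter list is kept independently with probability p, so every document in the AND result appears in the sample with probability p.

```python
import math
import random
from bisect import bisect_left


class SampledPostingList:
    """Hypothetical stream over a sorted posting list.

    next() returns postings such that each posting is kept
    independently with probability p, using geometric skips
    instead of one coin flip per posting.
    """

    def __init__(self, postings, p, rng=None):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        self.postings = postings
        self.p = p
        self.rng = rng or random.Random()
        self.pos = -1

    def _skip(self):
        # Number of rejected postings before the next accepted one.
        if self.p == 1.0:
            return 0
        u = self.rng.random()  # u in [0, 1), so 1 - u is in (0, 1]
        return int(math.log(1.0 - u) / math.log(1.0 - self.p))

    def next(self):
        self.pos += 1 + self._skip()
        if self.pos >= len(self.postings):
            return None
        return self.postings[self.pos]


def sample_and(short_list, long_list, p, rng=None):
    """Sample the intersection of two sorted posting lists.

    Assumption: sample the shorter list, then probe the longer
    list by binary search for each sampled document.
    """
    stream = SampledPostingList(short_list, p, rng)
    result = []
    doc = stream.next()
    while doc is not None:
        i = bisect_left(long_list, doc)
        if i < len(long_list) and long_list[i] == doc:
            result.append(doc)
        doc = stream.next()
    return result
```

With p = 1.0 the sketch degenerates to an ordinary intersection. For smaller p, dividing the sample size by p gives a rough estimate of the result-set size, which is one of the applications the abstract mentions.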

100 citations


Patent
10 Aug 2005
TL;DR: A thread processor analyzes the threads in a volume of electronic message transmissions (EMT) and records the thread configuration data, and a query manager uses that data to conduct selective searches of the EMT volume.
Abstract: A method includes describing the thread configurations of a volume of well-ordered electronic message transmissions (EMT) and utilizing the thread configuration data to conduct selective searches of the EMT volume. An apparatus includes a thread processor and a query manager. The thread processor analyzes the EMT threads and records the thread configuration data. The query manager utilizes the thread configuration data to conduct selective searches of the EMT volume.

66 citations


Patent
Andrei Z. Broder1, David Carmel1, Adam Darlow1, Shai Fine1, Elad Yom-Tov1 
14 Jul 2005
TL;DR: A method and system are provided for detecting missing content in a searchable repository, including a missing content query identifier (401) that identifies queries to a search engine (102) for which little or no relevant content is returned.
Abstract: A method and system for the detection of missing content in a searchable repository is provided. A system includes: a missing content query identifier (401) for identifying queries to a search engine (102) for which no or little relevant content is returned; a missing content detector (110) which clusters missing content queries by topic; and an output provider for providing details of a missing content topic.

12 citations


Patent
12 Jan 2005
TL;DR: A method for indexing a plurality of documents that includes a plurality of duplicate documents first identifies one or more duplicate groups among the documents, and then creates one index of content for each duplicate group instead of indexing the content of every document within the group.
Abstract: A method for indexing a plurality of documents, that includes a plurality of duplicate documents, first identifies one or more duplicate groups of documents from among the plurality of documents. Then, one index of content for the duplicate group is created instead of indexing the content from every document within the duplicate group. However, in contrast to the content index, an index of metadata for each of the documents in the duplicate group is created. Thus the content of each duplicate group is indexed only once, while a search engine using such indexing techniques retains the capability to answer queries as if the duplicated content was indexed for each document of the group.
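The split between a shared content index and per-document metadata can be sketched as follows. This is only an illustration under stated assumptions: duplicates are taken to be exact copies detected by a SHA-1 fingerprint, and the names `build_indexes`, `search`, and the `(field, value)` metadata key are hypothetical, not taken from the patent.

```python
import hashlib
from collections import defaultdict


def build_indexes(docs):
    """docs maps doc_id -> (content, metadata_dict).

    Returns (groups, content_index, metadata_index).
    Assumption: duplicates are exact copies, found by hashing.
    """
    # Group documents by a fingerprint of their content.
    groups = defaultdict(list)
    for doc_id, (content, _meta) in docs.items():
        fp = hashlib.sha1(content.encode("utf-8")).hexdigest()
        groups[fp].append(doc_id)

    # Content is indexed once per duplicate group.
    content_index = defaultdict(set)
    for fp, members in groups.items():
        content = docs[members[0]][0]
        for term in content.lower().split():
            content_index[term].add(fp)

    # Metadata is indexed for every document individually.
    metadata_index = defaultdict(set)
    for doc_id, (_content, meta) in docs.items():
        for field, value in meta.items():
            metadata_index[(field, value)].add(doc_id)

    return groups, content_index, metadata_index


def search(term, indexes, meta_filter=None):
    """Answer a term query as if every duplicate had been indexed."""
    groups, content_index, metadata_index = indexes
    hits = set()
    for fp in content_index.get(term.lower(), ()):
        hits.update(groups[fp])  # expand the group to its members
    if meta_filter is not None:
        hits &= metadata_index.get(meta_filter, set())
    return sorted(hits)
```

A query on a content term expands each matching group back to its member documents, so results look the same as if every copy had been indexed, while a metadata filter such as `("host", "y.com")` still distinguishes individual copies.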

12 citations



Proceedings ArticleDOI
23 Jan 2005
TL;DR: This work considers a multidimensional variant of the balls-and-bins problem, where balls correspond to random D-dimensional 0-1 vectors, and demonstrates the utility of the power of two choices in this domain.
Abstract: We consider a multidimensional variant of the balls-and-bins problem, where balls correspond to random D-dimensional 0-1 vectors. This variant is motivated by a problem in load balancing documents for distributed search engines. We demonstrate the utility of the power of two choices in this domain.
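A small simulation can make the setting concrete. The sketch below is an assumption-laden illustration, not the paper's model: each ball is a random 0-1 vector with independent coordinates, each ball draws `choices` candidate bins uniformly at random, and the placement rule (pick the candidate whose largest coordinate load after placement is smallest) is one plausible greedy rule. The function names `place_balls` and `max_load` are hypothetical.

```python
import random


def place_balls(num_balls, num_bins, dim, choices=2, density=0.5, rng=None):
    """Place random 0-1 vectors into bins, returning per-bin load vectors.

    Assumption: a ball goes to the candidate bin that would have the
    smallest maximum coordinate load after receiving the ball.
    """
    rng = rng or random.Random()
    loads = [[0] * dim for _ in range(num_bins)]
    for _ in range(num_balls):
        ball = [1 if rng.random() < density else 0 for _ in range(dim)]
        candidates = [rng.randrange(num_bins) for _ in range(choices)]
        best = min(
            candidates,
            key=lambda b: max(l + x for l, x in zip(loads[b], ball)),
        )
        for d in range(dim):
            loads[best][d] += ball[d]
    return loads


def max_load(loads):
    """Largest load over all bins and all dimensions."""
    return max(max(bin_load) for bin_load in loads)
```

Running `place_balls` with `choices=1` and `choices=2` on the same parameters and comparing `max_load` gives an informal feel for the effect of the second choice; the numbers depend on the random seed and on the assumed placement rule.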

7 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: This panel brings together experts and advocates for all three schools, who will discuss these approaches, share their experiences in the field, and take on real information architecture problems posed by the audience.
Abstract: Searching and browsing have been the two basic information discovery paradigms since the early days of the Web. More than ten years down the road, three schools seem to have emerged: (1) the search-centric school argues that guided navigation is superfluous, since free-form search has become so good and the search UI so common that users can satisfy all their needs via simple queries; (2) the taxonomy navigation school claims that users have difficulties expressing informational needs; and (3) the meta-data centric school advocates the use of meta-data for narrowing large sets of results, and is successful in e-commerce, where it is known as "multi-faceted search". This panel brings together experts and advocates for all three schools, who will discuss these approaches and share their experiences in the field. We will ask the audience to challenge our experts with real information architecture problems.

2 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: This panel discusses the ways in which search engines have affected the web in the past and the ways in which they may affect it in the future.
Abstract: The state of the web today has been and continues to be greatly influenced by the existence of web-search engines. This panel will discuss the ways in which search engines have affected the web in the past and ways in which they may affect it in the future. Both positive and negative effects will be discussed, as will potential measures to combat the latter. Besides the obvious ways in which search engines help people find content, other effects to be discussed include: the whole phenomenon of web-page spam, based on both text and links (e.g., link farms); the business of "Search Engine Optimization" (optimizing pages to rank highly in web-search results); and the bid-terms business and the associated problem of click fraud, to name a few.

1 citation