
Showing papers on "Ranking (information retrieval)" published in 2003


Journal ArticleDOI
TL;DR: It is shown that using linear combinations of precomputed topic-biased PageRank vectors to generate context-specific importance scores for pages at query time can produce more accurate rankings than a single, generic PageRank vector.
Abstract: The original PageRank algorithm for improving the ranking of search-query results computes a single vector, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared. By using linear combinations of these (precomputed) biased PageRank vectors to generate context-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. We describe techniques for efficiently implementing a large-scale search system based on the topic-sensitive PageRank scheme.

1,161 citations
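The query-time combination step described in the abstract above can be sketched as follows; the topic vectors, weights, and function name are illustrative placeholders, not the paper's actual data:

```python
def combine_pagerank(biased_vectors, topic_weights):
    """Linearly combine precomputed topic-biased PageRank vectors into a
    single query-specific importance score per page. All values below are
    toy placeholders, not values from the paper."""
    n = len(next(iter(biased_vectors.values())))
    combined = [0.0] * n
    for topic, weight in topic_weights.items():
        for page, score in enumerate(biased_vectors[topic]):
            combined[page] += weight * score
    return combined

# Toy setup: 3 pages, 2 topic-biased PageRank vectors (each sums to 1).
vectors = {
    "sports":  [0.6, 0.3, 0.1],
    "science": [0.1, 0.2, 0.7],
}
weights = {"sports": 0.3, "science": 0.7}   # e.g. P(topic | query)
scores = combine_pagerank(vectors, weights)
ranking = sorted(range(len(scores)), key=lambda p: -scores[p])
```

Because the weights and each biased vector sum to one, the combined scores again form a probability-like vector, so pages remain directly comparable at query time.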


Proceedings Article
09 Dec 2003
TL;DR: A simple universal ranking algorithm for data lying in a Euclidean space, such as text or image data, which ranks the data with respect to the intrinsic manifold structure collectively revealed by a large amount of data.
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method.

767 citations
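The iteration at the heart of such manifold-ranking methods can be sketched roughly as follows: build an affinity graph over the data, normalize it symmetrically, and repeatedly spread the query's score to its neighbors. The data, parameter values, and function name here are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def manifold_rank(X, query_idx, sigma=1.0, alpha=0.9, iters=200):
    """Spread the query's score over a neighborhood graph:
    f <- alpha * S @ f + (1 - alpha) * y, with S the symmetrically
    normalized affinity matrix. sigma, alpha, and iters are
    illustrative, untuned choices."""
    n = len(X)
    # Gaussian affinities with zero diagonal (no self-loops).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    dinv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * dinv_sqrt[:, None] * dinv_sqrt[None, :]
    y = np.zeros(n)
    y[query_idx] = 1.0
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1.0 - alpha) * y
    return f

# Two well-separated clusters on a line; the query is the first point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 0.0], [5.1, 0.0]])
scores = manifold_rank(X, query_idx=0)
```

Points in the query's cluster receive higher scores than points in the distant cluster, even when their raw Euclidean distances to the query would not separate them as cleanly.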


Journal ArticleDOI
28 Jul 2003
TL;DR: A framework for evaluating subtopic retrieval is proposed that generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents; a maximal marginal relevance (MMR) ranking strategy is also proposed.
Abstract: We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which is assumed in most traditional retrieval methods. Subtopic retrieval poses challenges for evaluating performance, as well as for developing effective algorithms. We propose a framework for evaluating subtopic retrieval which generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents. We propose and systematically evaluate several methods for performing subtopic retrieval using statistical language models and a maximal marginal relevance (MMR) ranking strategy. A mixture model combined with query likelihood relevance ranking is shown to modestly outperform a baseline relevance ranking on a data set used in the TREC interactive track.

611 citations


Book ChapterDOI
09 Sep 2003
TL;DR: This paper adapts IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs, and develops query-processing strategies that build on a crucial characteristic of IR-style keyword search: only the few most relevant matches are generally of interest.
Abstract: Applications in which plain text coexists with structured data are pervasive. Commercial relational database management systems (RDBMSs) generally provide querying capabilities for text attributes that incorporate state-of-the-art information retrieval (IR) relevance ranking strategies, but this search functionality requires that queries specify the exact column or columns against which a given list of keywords is to be matched. This requirement can be cumbersome and inflexible from a user perspective: good answers to a keyword query might need to be "assembled" -in perhaps unforeseen ways- by joining tuples from multiple relations. This observation has motivated recent research on free-form keyword search over RDBMSs. In this paper, we adapt IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs. Our query model can handle queries with both AND and OR semantics, and exploits the sophisticated single-column text-search functionality often available in commercial RDBMSs. We develop query-processing strategies that build on a crucial characteristic of IR-style keyword search: only the few most relevant matches -according to some definition of "relevance"- are generally of interest. Consequently, rather than computing all matches for a keyword query, which leads to inefficient executions, our techniques focus on the top-k matches for the query, for moderate values of k. A thorough experimental evaluation over real data shows the performance advantages of our approach.

581 citations


Proceedings ArticleDOI
Andrei Z. Broder1, David Carmel1, Michael Herscovici1, Aya Soffer1, Jason Zien1 
03 Nov 2003
TL;DR: An efficient query evaluation method based on a two-level approach that reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall.
Abstract: We present an efficient query evaluation method based on a two-level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. The efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. The amount of pruning can be controlled by the user as a function of time allocated for query evaluation. Experimentally, using the TREC Web Track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND, or Weak AND, that might be of independent interest.

435 citations
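The core two-level idea from the abstract above can be sketched in a much-simplified form: an approximate first pass uses per-term score upper bounds to skip documents that cannot enter the current top-k, and only the survivors get a full evaluation. This is a sketch of the principle, not the paper's actual WAND iterator; the postings, bounds, and scorer are toy assumptions:

```python
import heapq

def two_level_topk(postings, upper_bounds, exact_score, k):
    """First level: cheap check against per-term score upper bounds.
    Second level: full evaluation only for candidates whose bound
    beats the current top-k threshold."""
    docs = set().union(*postings.values())
    heap = []          # min-heap holding the current top-k exact scores
    full_evals = 0
    for doc in sorted(docs):
        terms = [t for t, plist in postings.items() if doc in plist]
        bound = sum(upper_bounds[t] for t in terms)
        threshold = heap[0] if len(heap) == k else 0.0
        if bound <= threshold:
            continue               # cannot make the top k: skip full scoring
        full_evals += 1            # second level: exact evaluation
        s = exact_score(doc, terms)
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True), full_evals

# Toy postings: term -> documents containing it (ids are illustrative).
postings = {"a": {1, 2, 3}, "b": {2, 3, 4}}
upper_bounds = {"a": 1.0, "b": 2.0}          # max per-term contributions
score = lambda doc, terms: sum(0.5 if t == "a" else 1.5 for t in terms)
top, full_evals = two_level_topk(postings, upper_bounds, score, k=2)
```

Even in this tiny example one of the four documents is skipped without a full evaluation; on real collections the upper bounds prune the vast majority of candidates.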


Proceedings ArticleDOI
03 Nov 2003
TL;DR: Two algorithms for determining expertise from email were compared: a content-based approach that takes account only of email text, and a graph-based ranking algorithm (HITS) that takes account of both text and communication patterns.
Abstract: A common method for finding information in an organization is to use social networks---ask people, following referrals until someone with the right information is found. Another way is to automatically mine documents to determine who knows what. Email documents seem particularly well suited to this task of "expertise location", as people routinely communicate what they know. Moreover, because people explicitly direct email to one another, social networks are likely to be contained in the patterns of communication. Can these patterns be used to discover experts on particular topics? Is this approach better than mining message content alone? To find answers to these questions, two algorithms for determining expertise from email were compared: a content-based approach that takes account only of email text, and a graph-based ranking algorithm (HITS) that takes account both of text and communication patterns. An evaluation was done using email and explicit expertise ratings from two different organizations. The rankings given by each algorithm were compared to the explicit rankings with the precision and recall measures commonly used in information retrieval, as well as the d' measure commonly used in signal-detection theory. Results show that the graph-based algorithm performs better than the content-based algorithm at identifying experts in both cases, demonstrating that the graph-based algorithm effectively extracts more information than is found in content alone.

395 citations


Journal ArticleDOI
TL;DR: The identity measure and the best fingerprinting technique are both able to accurately identify coderivative documents, but fingerprinting parameters must be carefully chosen, and even so the identity measure is clearly superior.
Abstract: The widespread use of on-line publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarizing the work of others. We evaluate two families of methods for searching a collection to find documents that are coderivative, that is, are versions or plagiarisms of each other. The first, the ranking family, uses information retrieval techniques; extending this family, we propose the identity measure, which is specifically designed for identification of co-derivative documents. The second, the fingerprinting family, uses hashing to generate a compact document description, which can then be compared to the fingerprints of the documents in the collection. We introduce a new method for evaluating the effectiveness of these techniques, and demonstrate it in practice. Using experiments on two collections, we demonstrate that the identity measure and the best fingerprinting technique are both able to accurately identify coderivative documents. However, for fingerprinting, parameters must be carefully chosen, and even so the identity measure is clearly superior.

378 citations
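The fingerprinting idea described above can be sketched as follows: hash overlapping word shingles of each document and compare documents by fingerprint overlap. The shingle length is an illustrative parameter choice, and real systems keep only a selected subset of the hashes to stay compact:

```python
import hashlib

def fingerprint(text, shingle_len=3):
    """Hash every overlapping word shingle of the text. Production
    fingerprinting keeps only a subset of these hashes; here we keep
    all of them for simplicity."""
    words = text.lower().split()
    return {
        int(hashlib.md5(" ".join(words[i:i + shingle_len]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - shingle_len + 1, 0))
    }

def resemblance(a, b):
    """Jaccard overlap of the two fingerprints."""
    fa, fb = fingerprint(a), fingerprint(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
c = "an entirely unrelated passage about database indexing"
```

A single-word edit, as in `b`, destroys only the few shingles that cross it, so near-duplicates keep a high resemblance while unrelated texts score zero; this sensitivity to parameters such as `shingle_len` is exactly the weakness the paper reports.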


Journal ArticleDOI
TL;DR: This study proposes a new method for query expansion based on user interactions recorded in user logs; it extracts correlations between query terms and document terms by analyzing the logs, and produces much better results than both the classical search method and other query expansion methods.
Abstract: Queries to search engines on the Web are usually short. They do not provide sufficient information for an effective selection of relevant documents. Previous research has proposed the utilization of query expansion to deal with this problem. However, expansion terms are usually determined on term co-occurrences within documents. In this study, we propose a new method for query expansion based on user interactions recorded in user logs. The central idea is to extract correlations between query terms and document terms by analyzing user logs. These correlations are then used to select high-quality expansion terms for new queries. Compared to previous query expansion methods, ours takes advantage of the user judgments implied in user logs. The experimental results show that the log-based query expansion method can produce much better results than both the classical search method and the other query expansion methods.

342 citations


Journal ArticleDOI
TL;DR: A firm can build more effective security strategies by identifying and ranking the severity of potential threats to its IS efforts.
Abstract: A firm can build more effective security strategies by identifying and ranking the severity of potential threats to its IS efforts.

335 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: It is shown how time can be incorporated into both query-likelihood models and relevance models, and that the resulting time-based models perform as well as or better than the best of the heuristic techniques.
Abstract: We explore the relationship between time and relevance using TREC ad-hoc queries. A type of query is identified that favors very recent documents. We propose a time-based language model approach to retrieval for these queries. We show how time can be incorporated into both query-likelihood models and relevance models. These models were used for experiments comparing time-based language models to heuristic techniques for incorporating document recency in the ranking. Our results show that time-based models perform as well as or better than the best of the heuristic techniques.

308 citations
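One way to incorporate time into a query-likelihood model, in the spirit of the paper, is to add the log of an exponential prior over document age to a smoothed query-likelihood score. This is a sketch under stated assumptions: the rate and smoothing values are illustrative, and the document itself stands in for the collection language model:

```python
import math

def time_based_score(query, doc_terms, doc_age_days, rate=0.01, mu=100.0):
    """Log-score of a document: Dirichlet-smoothed query likelihood plus
    the log of an exponential prior over document age, so that recent
    documents are preferred. rate and mu are illustrative, untuned values."""
    n = len(doc_terms)
    score = 0.0
    for q in query:
        tf = doc_terms.count(q)
        p_coll = doc_terms.count(q) / max(n, 1)   # stand-in collection model
        score += math.log((tf + mu * p_coll) / (n + mu) + 1e-12)
    # Exponential recency prior: log P(D) = log(rate) - rate * age.
    score += math.log(rate) - rate * doc_age_days
    return score

doc = ["storm", "report", "storm"]
recent = time_based_score(["storm"], doc, doc_age_days=1)
stale = time_based_score(["storm"], doc, doc_age_days=365)
```

For two documents with identical text, the newer one always scores higher, with the gap controlled by the prior's rate parameter.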


Proceedings ArticleDOI
22 Jun 2003
TL;DR: PlanetP is a content-addressable publish/subscribe service for unstructured peer-to-peer (P2P) communities that supports content addressing by providing a gossiping layer used to globally replicate a membership directory and an extremely compact content index.
Abstract: We introduce PlanetP, a content-addressable publish/subscribe service for unstructured peer-to-peer (P2P) communities. PlanetP supports content addressing by providing: (1) a gossiping layer used to globally replicate a membership directory and an extremely compact content index; and (2) a completely distributed content search and ranking algorithm that helps users find the most relevant information. PlanetP is a simple, yet powerful system for sharing information. PlanetP is simple because each peer must only perform a periodic, randomized, point-to-point message exchange with other peers. PlanetP is powerful because it maintains a globally content-ranked view of the shared data. Using simulation and a prototype implementation, we show that PlanetP achieves ranking accuracy that is comparable to a centralized solution and scales easily to several thousand peers while remaining resilient to rapid membership changes.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: A user query classification scheme that uses the difference of term distributions, mutual information, the usage rate of terms as anchor text, and POS information for classification; the best performance was obtained when this classification method was combined with the OKAPI scoring algorithm.
Abstract: The heterogeneous Web exacerbates IR problems, and short user queries make them worse. The contents of web documents are not enough to find good answer documents. Link information and URL information compensate for the insufficiency of content information. However, a static combination of multiple evidences may lower retrieval performance; we need different strategies to find target documents according to the query type. User queries can be classified into three categories: the topic relevance task, the homepage finding task, and the service finding task. In this paper, a user query classification scheme is proposed. This scheme uses the difference of distribution, mutual information, the usage rate as anchor texts, and the POS information for the classification. After classifying a user query, we apply different algorithms and information for better results. For the topic relevance task we emphasize the content information; for the homepage finding task we emphasize the link information and the URL information. We obtained the best performance when our proposed classification method was used with the OKAPI scoring algorithm.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search, and presents a mixture-based language model to investigate several hypotheses.
Abstract: This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To investigate these hypotheses, we present a mixture-based language model and also examine many of the current meta-search algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low performing document representations can improve performance, but not consistently. We find that the techniques best suited for this task are robust to the inclusion of poorly performing document representations. We also explore the role of variance of results across systems and its impact on the performance of fusion, with the surprising result that the correct documents have higher variance across document representations than highly ranking incorrect documents.

Proceedings Article
01 Jan 2003
TL;DR: The challenges and several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval, are discussed and results of preliminary experiments are presented.
Abstract: Ranking and returning the most relevant results of a query is a popular paradigm in Information Retrieval. We discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. We present results of preliminary experiments.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach to relevant term extraction and term suggestion can provide organized and highly relevant terms, and can exploit the contextual information in a user's query session to make more effective suggestions.
Abstract: This paper proposes an effective term suggestion approach to interactive Web search. Conventional approaches to making term suggestions involve extracting co-occurring keyterms from highly ranked retrieved documents. Such approaches must deal with term extraction difficulties and interference from irrelevant documents, and, more importantly, have difficulty extracting terms that are conceptually related but do not frequently co-occur in documents. In this paper, we present a new, effective log-based approach to relevant term extraction and term suggestion. Using this approach, the relevant terms suggested for a user query are those that co-occur in similar query sessions from search engine logs, rather than in the retrieved documents. In addition, the suggested terms in each interactive search step can be organized according to their relevance to the entire query session, rather than to the most recent single query as in conventional approaches. The proposed approach was tested using a proxy server log containing about two million query transactions submitted to search engines in Taiwan. The experimental results show that the proposed approach can provide organized and highly relevant terms, and can exploit the contextual information in a user's query session to make more effective suggestions.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: It is shown that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases, and a new resource selection algorithm is proposed that uses information about database sizes as well as database contents.
Abstract: Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases. A new resource selection algorithm is proposed that uses information about database sizes as well as database contents. We also show how to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Experiments demonstrate that the database size estimates are more accurate for large databases than estimates produced by a competing method; the new resource ranking algorithm is always at least as effective as the CORI algorithm; and the new algorithm results in better document rankings than the CORI algorithm.

Patent
12 Jun 2003
TL;DR: In this article, a user interface can present results in the form of browsing multiple hierarchical representations, wherein matching categories are differentiated from non-matching categories, providing an indication of the fitness of the search terms for returning satisfactory results.
Abstract: Systems and methods for data storage, retrieval, manipulation and display provide search engines and computer-based research tools for enabling multiple hierarchical points of view. Category definitions in the hierarchical data structures can include lists of set members, like word arrays of set members, generative descriptions for determining set members, and fitness functions for determining fitness of a presented item for being a member of a set. Significance and interest values can be assigned to search categories to set threshold confidence levels for returning search results and for weighting the results, respectively. A user interface can present results in the form of browsing multiple hierarchical representations, wherein matching categories are differentiated from non-matching categories. Peer ratings can represent the ranking of search term results with relation to results using other search terms, providing an indication of the fitness of the search terms for returning satisfactory results.

Book ChapterDOI
22 Sep 2003
TL;DR: The main objective of this work is to investigate the trade-off between the quality of the induced ranking function and the computational complexity of the algorithm, both depending on the amount of preference information given for each example.
Abstract: We consider supervised learning of a ranking function, which is a mapping from instances to total orders over a set of labels (options). The training information consists of examples with partial (and possibly inconsistent) information about their associated rankings. From these, we induce a ranking function by reducing the original problem to a number of binary classification problems, one for each pair of labels. The main objective of this work is to investigate the trade-off between the quality of the induced ranking function and the computational complexity of the algorithm, both depending on the amount of preference information given for each example. To this end, we present theoretical results on the complexity of pairwise preference learning, and experimentally investigate the predictive performance of our method for different types of preference information, such as top-ranked labels and complete rankings. The domain of this study is the prediction of a rational agent's ranking of actions in an uncertain environment.
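The reduction described above can be sketched as follows: one binary preference predicate per pair of labels, with the final ranking obtained by counting pairwise wins. The threshold "classifiers" below are illustrative stand-ins for learned binary models:

```python
from itertools import combinations

def rank_by_pairwise_votes(labels, prefers, x):
    """Reduce label ranking to binary problems: prefers[(a, b)](x) is
    True if label a should precede label b for instance x. Labels are
    ordered by their number of pairwise wins."""
    votes = {lab: 0 for lab in labels}
    for a, b in combinations(labels, 2):
        if prefers[(a, b)](x):
            votes[a] += 1
        else:
            votes[b] += 1
    # Labels with more pairwise wins are ranked first.
    return sorted(labels, key=lambda lab: -votes[lab])

labels = ["red", "green", "blue"]
# Hypothetical learned preferences: trivial thresholds on a 1-D feature.
prefers = {
    ("red", "green"):  lambda x: x > 0.5,
    ("red", "blue"):   lambda x: x > 0.3,
    ("green", "blue"): lambda x: x > 0.1,
}
ranking = rank_by_pairwise_votes(labels, prefers, x=0.7)
```

With k labels this needs k(k-1)/2 binary models, which is the complexity trade-off the paper analyzes against the amount of preference information available per example.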

Proceedings ArticleDOI
09 Mar 2003
TL;DR: Although the rank-based Borda Count method is believed to be competitive with score-based methods, it is shown that this does not hold for metasearch, whereas Markov chain based methods do compete with score-based methods.
Abstract: Given a set of rankings, the task of ranking fusion is the problem of combining these lists in such a way as to optimize the performance of the combination. The ranking fusion problem is encountered in many situations; metasearch is a prominent example. It deals with the problem of combining the result lists returned by multiple search engines in response to a given query, where each item in a result list is ordered with respect to a search engine and a relevance score. Several ranking fusion methods have been proposed in the literature. They can be classified according to whether (i) they rely on ranks, (ii) they rely on scores, and (iii) they require training data. Our paper makes the following contributions: (i) we report experimental results for the Markov chain rank-based methods, for which no large experimental tests had yet been made; (ii) while it is believed that the rank-based method named Borda Count is competitive with score-based methods, we show that this is not true for metasearch; and (iii) we show that Markov chain based methods compete with score-based methods. This is especially important in the context of metasearch, as scores are usually not available from the search engines.
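Borda Count, the rank-based baseline discussed above, can be sketched in a few lines: each input ranking awards a document (n - position) points, and documents are fused by total points. The result lists here are illustrative:

```python
def borda_fuse(rankings):
    """Fuse rankings by Borda count: each list awards a document
    (n - position) points; ties are broken by document id."""
    docs = {d for r in rankings for d in r}
    n = max(len(r) for r in rankings)
    points = {d: 0 for d in docs}
    for r in rankings:
        for pos, d in enumerate(r):
            points[d] += n - pos
    return sorted(docs, key=lambda d: (-points[d], d))

# Three hypothetical engines return different orderings of the same documents.
fused = borda_fuse([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d3"],
    ["d2", "d3", "d1"],
])
```

Note that the method uses only positions, never scores, which is precisely why it is attractive for metasearch, where engines rarely expose comparable relevance scores.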

20 Jun 2003
TL;DR: This work analytically compares three recent approaches to personalizing PageRank and discusses the tradeoffs of each one.
Abstract: PageRank, the popular link-analysis algorithm for ranking web pages, assigns a query- and user-independent estimate of "importance" to web pages. Query- and user-sensitive extensions of PageRank, which use a basis set of biased PageRank vectors, have been proposed in order to personalize the ranking function in a tractable way. We analytically compare three recent approaches to personalizing PageRank and discuss the tradeoffs of each one.

Journal ArticleDOI
01 Jan 2003
TL;DR: An efficient peer-to-peer information retrieval system, pSearch, that supports state-of-the-art content- and semantic-based full-text searches, avoiding the scalability problem of existing systems that employ centralized indexing or index/query flooding.
Abstract: We describe an efficient peer-to-peer information retrieval system, pSearch, that supports state-of-the-art content- and semantic-based full-text searches. pSearch avoids the scalability problem of existing systems that employ centralized indexing or index/query flooding. It also avoids the nondeterminism that is exhibited by heuristic-based approaches. In pSearch, documents in the network are organized around their vector representations (based on modern document ranking algorithms) such that the search space for a given query is organized around related documents, achieving both efficiency and accuracy.

Journal Article
TL;DR: A new family of topic-ranking algorithms for multi-labeled documents that achieve state-of-the-art results and outperform topic-ranking adaptations of Rocchio's algorithm and of the Perceptron algorithm.
Abstract: We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms are simple to implement and are also time and memory efficient. We provide a unified analysis of the family of algorithms in the mistake bound model. We then discuss experiments with the proposed family of topic-ranking algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora, the algorithms we present achieve state-of-the-art results and outperform topic-ranking adaptations of Rocchio's algorithm and of the Perceptron algorithm.

Proceedings Article
01 Jan 2003
TL;DR: Studies conducted by the two participating groups compared a search engine using automatic topic distillation features with the same engine with those features disabled, to determine whether the automatic topic distillation features assisted users in performing their tasks and whether humans could achieve better results than the automatic system.
Abstract: The TREC 2003 web track consisted of both a non-interactive stream and an interactive stream. Both streams worked with the .GOV test collection. The non-interactive stream continued an investigation into the importance of homepages in Web ranking, via both a Topic Distillation task and a Navigational task. In the topic distillation task, systems were expected to return a list of the homepages of sites relevant to each of a series of broad queries. This differs from previous homepage experiments in that queries may have multiple correct answers. The navigational task required systems to return a particular desired web page as early as possible in the ranking in response to queries. In half of the queries, the target answer was the homepage of a site and the query was derived from the name of the site (Homepage finding) while in the other half, the target answers were not homepages and the queries were derived from the name of the page (Named page finding). The two types of query were arbitrarily mixed and not identified. The interactive stream focused on human participation in a topic distillation task over the .GOV collection. Studies conducted by the two participating groups compared a search engine using automatic topic distillation features with the same engine with those features disabled in order to determine whether the automatic topic distillation features assisted the users in the performance of their tasks and whether humans could achieve better results than the automatic system.

Journal ArticleDOI
TL;DR: In this paper, the problem of finding the simulated system with the best (maximum or minimum) expected performance when the number of systems is large and initial samples from each system have already been taken is addressed.
Abstract: In this paper we address the problem of finding the simulated system with the best (maximum or minimum) expected performance when the number of systems is large and initial samples from each system have already been taken. This problem may be encountered when a heuristic search procedure--perhaps one originally designed for use in a deterministic environment--has been applied in a simulation-optimization context. Because of stochastic variation, the system with the best sample mean at the end of the search procedure may not coincide with the true best system encountered during the search. This paper develops statistical procedures that return the best system encountered by the search (or one near the best) with a prespecified probability. We approach this problem using combinations of statistical subset selection and indifference-zone ranking procedures. The subset-selection procedures, which use only the data already collected, screen out the obviously inferior systems, while the indifference-zone procedures, which require additional simulation effort, distinguish the best from the less obviously inferior systems.

Proceedings ArticleDOI
20 May 2003
TL;DR: This work formalizes general properties a matchmaker should have, then it presents a matchmaking facilitator, compliant with desired properties, that embeds a NeoClassic reasoner, whose structural subsumption algorithm has been modified to allow match categorization into potential and partial, and ranking of matches within categories.
Abstract: More and more resources are becoming available on the Web, and there is a growing need for infrastructures that, based on advertised descriptions, are able to semantically match demands with supplies. We formalize general properties a matchmaker should have, then we present a matchmaking facilitator compliant with the desired properties. The system embeds a NeoClassic reasoner, whose structural subsumption algorithm has been modified to allow match categorization into potential and partial, and ranking of matches within categories. Experiments carried out show a good correspondence between user and system rankings.

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This work proposes a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection, that is effective for query expansion for web retrieval.
Abstract: Hundreds of millions of users each day use web search engines to meet their information needs. Advances in web search effectiveness are therefore perhaps the most significant public outcomes of IR research. Query expansion is one such method for improving the effectiveness of ranked retrieval by adding additional terms to a query. In previous approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We propose a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection. Our scheme is effective for query expansion for web retrieval: our results show relative improvements over unexpanded full text retrieval of 26%--29%, and 18%--20% over an optimised, conventional expansion approach.


Proceedings ArticleDOI
28 Jul 2003
TL;DR: Results show that the technique tested elicits longer queries than a standard query elicitation technique, that it is indeed usable, that it increases user satisfaction with the search, and that query length is positively correlated with user satisfaction with the search.
Abstract: Query length in best-match information retrieval (IR) systems is well known to be positively related to effectiveness in the IR task, when measured in experimental, non-interactive environments. However, in operational, interactive IR systems, query length is quite typically very short, on the order of two to three words. We report on a study which tested the effectiveness of a particular query elicitation technique in increasing initial searcher query length, and which tested the effectiveness of queries elicited using this technique, and the relationship in general between query length and search effectiveness in interactive IR. Results show that the specific technique results in longer queries than a standard query elicitation technique, that this technique is indeed usable, that the technique results in increased user satisfaction with the search, and that query length is positively correlated with user satisfaction with the search.

Journal ArticleDOI
TL;DR: Citation analysis, which provides a clear picture of actual use of journals and their articles, is an effective way to determine a journal's influence.
Abstract: Citation analysis, which provides a clear picture of actual use of journals and their articles, is an effective way to determine a journal's influence.

Book ChapterDOI
09 Sep 2003
TL;DR: This work studies pruning techniques for query execution in large engines in the case where there is a global ranking of pages, as provided by PageRank or any other method, in addition to the standard term-based approach, and shows that there is significant potential benefit in such techniques.
Abstract: Large web search engines have to answer thousands of queries per second with interactive response times. A major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. To address this issue, IR and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. Over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. We focus on the question of how such techniques can be efficiently integrated into query processing. In particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by PageRank or any other method, in addition to the standard term-based approach. We describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with million web pages. Our results show that there is significant potential benefit in such techniques.