
Showing papers on "Ranking (information retrieval)" published in 2004


Book ChapterDOI
14 Mar 2004
TL;DR: A method is proposed that, given a query submitted to a search engine, suggests a list of related queries that are based on previously issued queries and can be issued by the user to the search engine to tune or redirect the search process.
Abstract: In this paper we propose a method that, given a query submitted to a search engine, suggests a list of related queries. The related queries are based on previously issued queries, and can be issued by the user to the search engine to tune or redirect the search process. The method proposed is based on a query clustering process in which groups of semantically similar queries are identified. The clustering process uses the content of historical preferences of users registered in the query log of the search engine. The method not only discovers the related queries, but also ranks them according to a relevance criterion. Finally, we show with experiments over the query log of a search engine the effectiveness of the method.

656 citations
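
A minimal sketch of this pipeline (not the authors' implementation): past queries are represented by the text of their clicked documents, clustered, and suggestions are drawn from the incoming query's cluster and ranked by similarity. The toy log, cluster count, and the TF-IDF/k-means choices are illustrative assumptions.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # hypothetical query log: query -> concatenated text of its clicked documents
    log = {
        "cheap flights": "airline tickets booking airfare deals",
        "flight deals": "airfare discount airline booking",
        "used cars": "car dealership second hand vehicles prices",
        "car prices": "vehicle cost dealership second hand",
    }
    queries = list(log)
    vec = TfidfVectorizer()
    X = vec.fit_transform(log[q] for q in queries)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    def suggest(query_text, k=3):
        qv = vec.transform([query_text])
        sims = cosine_similarity(qv, X).ravel()
        cluster = labels[sims.argmax()]          # cluster of the closest logged query
        cands = [(sims[i], q) for i, q in enumerate(queries) if labels[i] == cluster]
        return [q for _, q in sorted(cands, reverse=True)[:k]]  # ranked suggestions

    print(suggest("airline tickets"))            # e.g. ['cheap flights', 'flight deals']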


Proceedings ArticleDOI
22 Aug 2004
TL;DR: The methodology is applied to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and a model with 300 topics is learned using a Markov chain Monte Carlo algorithm.
Abstract: We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of the authors' topic mixtures. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.

618 citations
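
For intuition, here is a toy collapsed Gibbs sampler for the author-topic model sketched above; each token jointly samples an (author, topic) pair from the document's author set. The corpus, hyperparameters, and sizes are invented, and the real system runs at a vastly larger scale.

    import numpy as np

    docs = [([0, 1, 2, 1], [0]),                 # (word ids, author ids) per document
            ([2, 3, 3, 4], [0, 1]),
            ([4, 5, 5, 1], [1])]
    W, A, T = 6, 2, 2                            # vocabulary, authors, topics
    alpha, beta = 0.1, 0.01
    rng = np.random.default_rng(0)

    nat = np.zeros((A, T))                       # author-topic counts
    ntw = np.zeros((T, W))                       # topic-word counts
    assign = []                                  # current (author, topic) per token
    for words, authors in docs:
        cur = []
        for w in words:
            a, t = rng.choice(authors), int(rng.integers(T))
            nat[a, t] += 1; ntw[t, w] += 1
            cur.append((a, t))
        assign.append(cur)

    for _ in range(200):                         # Gibbs sweeps
        for d, (words, authors) in enumerate(docs):
            for i, w in enumerate(words):
                a, t = assign[d][i]              # remove the token's current assignment
                nat[a, t] -= 1; ntw[t, w] -= 1
                pa = (nat[authors] + alpha) / (nat[authors].sum(1, keepdims=True) + T * alpha)
                pt = (ntw[:, w] + beta) / (ntw.sum(1) + W * beta)
                p = (pa * pt).ravel(); p /= p.sum()
                k = rng.choice(len(p), p=p)      # jointly pick an (author, topic) pair
                a, t = authors[k // T], k % T
                nat[a, t] += 1; ntw[t, w] += 1
                assign[d][i] = (a, t)

    print(nat / nat.sum(1, keepdims=True))       # per-author topic mixtures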


Book ChapterDOI
11 Jan 2004
TL;DR: In this article, a method for proving the termination of an unnested program loop by synthesizing linear ranking functions is presented; the method is complete, meaning that if a linear ranking function exists, it will be discovered.
Abstract: We present an automated method for proving the termination of an unnested program loop by synthesizing linear ranking functions. The method is complete. Namely, if a linear ranking function exists then it will be discovered by our method. The method relies on the fact that we can obtain the linear ranking functions of the program loop as the solutions of a system of linear inequalities that we derive from the program loop. The method is used as a subroutine in a method for proving termination and other liveness properties of more general programs via transition invariants; see [PR03].

463 citations
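
The paper reduces ranking-function synthesis to solving linear inequalities derived from the loop (via Farkas' lemma in the general method). A hand-instantiated toy version for the single loop "while x >= 1: x := x - 1", solved as a linear feasibility problem, might look like this:

    from scipy.optimize import linprog

    # Find r(x) = c*x + d for the loop "while x >= 1: x := x - 1".
    # Decrease:    r(x) - r(x') = c           >= 1  ->  -c     <= -1
    # Boundedness: r(x) >= 0 on the guard x >= 1; with c >= 1 the minimum
    #              is at x = 1, so c + d      >= 0  ->  -c - d <= 0
    res = linprog(c=[0, 0],                      # pure feasibility problem
                  A_ub=[[-1, 0], [-1, -1]],
                  b_ub=[-1, 0],
                  bounds=[(None, None), (None, None)])
    if res.success:
        cc, dd = res.x
        print(f"ranking function: r(x) = {cc:.0f}*x {dd:+.0f}")   # e.g. r(x) = 1*x -1
    else:
        print("no linear ranking function of this form exists")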


Journal ArticleDOI
TL;DR: A logic-driven clustering in which prototypes are formed and evaluated in a sequential manner that considers an inverse similarity problem and shows how the relevance of the prototypes translates into their granularity.
Abstract: We introduce a logic-driven clustering in which prototypes are formed and evaluated in a sequential manner. The way of revealing a structure in data is realized by maximizing a certain performance index (objective function) that takes into consideration an overall level of matching (to be maximized) and a similarity level between the prototypes (the component to be minimized). The prototypes identified in the process come with the optimal weight vector that serves to indicate the significance of the individual features (coordinates) in the data grouping represented by the prototype. Since the topologies of these groupings are in general quite diverse, the optimal weight vectors reflect the anisotropy of the feature space, i.e., they show some local ranking of features in the data space. Having found the prototypes, we consider an inverse similarity problem and show how the relevance of the prototypes translates into their granularity.

433 citations


Patent
28 Dec 2004
TL;DR: In this article, the authors present methods, systems, and computer-readable media for advanced computer file organization, computer file and web search and information retrieval, and an intelligent assistant agent to assist a user's creative activities.
Abstract: The present invention presents embodiments of methods, systems, and computer-readable media for advanced computer file organization, computer file and web search and information retrieval, and an intelligent assistant agent to assist a user's creative activities. The embodiments presented herein categorize search results based on the keywords used in the search; provide user-selectable ranking; use the user's search objectives and advice to refine the search; conduct file-based search within an application program; provide always-on search that monitors changes over a period of time; provide a high-level file system that organizes files into categories, according to relations among files, and in ranking orders along multiple categorization and ranking dimensions and multiple levels of conceptual relationships; conduct searches for associations between keywords, concepts, and propositions; and provide validations of such associations to assist a user's creative activity.

411 citations


Proceedings ArticleDOI
10 Oct 2004
TL;DR: MRBIR first makes use of a manifold ranking algorithm to explore the relationship among all the data points in the feature space, and then measures relevance between the query and all the images in the database accordingly, which is different from traditional similarity metrics based on pair-wise distance.
Abstract: In this paper, we propose a novel transductive learning framework named manifold-ranking based image retrieval (MRBIR). Given a query image, MRBIR first makes use of a manifold ranking algorithm to explore the relationship among all the data points in the feature space, and then measures relevance between the query and all the images in the database accordingly, which is different from traditional similarity metrics based on pair-wise distance. In relevance feedback, if only positive examples are available, they are added to the query set to improve the retrieval result; if examples of both labels can be obtained, MRBIR discriminately spreads the ranking scores of positive and negative examples, considering the asymmetry between these two types of images. Furthermore, three active learning methods are incorporated into MRBIR, which select images in each round of relevance feedback according to different principles, aiming to maximally improve the ranking result. Experimental results on a general-purpose image database show that MRBIR attains a significant improvement over existing systems in all aspects.

382 citations
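
The core of manifold ranking (following Zhou et al.'s iteration, which MRBIR builds on) can be sketched in a few lines; the toy 2-D points stand in for image features, and the affinity parameters are illustrative:

    import numpy as np

    X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],   # one cluster
                  [3.0, 3.0], [3.1, 2.9]])              # another cluster
    sigma, alpha = 0.5, 0.9
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)          # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)                              # no self-loops
    D = W.sum(1)
    S = W / np.sqrt(np.outer(D, D))                     # symmetric normalization

    y = np.zeros(len(X)); y[0] = 1                      # point 0 is the query
    f = y.copy()
    for _ in range(100):                                # f <- alpha*S*f + (1-alpha)*y
        f = alpha * S @ f + (1 - alpha) * y

    print(np.argsort(-f))                               # query's manifold neighbors rank first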


Proceedings ArticleDOI
25 Jul 2004
TL;DR: It is shown that query traffic from particular topical categories differs both from the query stream as a whole and from other categories, which is relevant to the development of enhanced query disambiguation, routing, and caching algorithms.
Abstract: We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours of the day. We examine query traffic on an hourly basis by matching it against lists of queries that have been topically pre-categorized by human editors. This represents 13% of the query traffic. We show that query traffic from particular topical categories differs both from the query stream as a whole and from other categories. This analysis provides valuable insight for improving retrieval effectiveness and efficiency. It is also relevant to the development of enhanced query disambiguation, routing, and caching algorithms.

338 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: This paper analyzes features of the rapidly growing "frontier" of the web, namely the part of the web that crawlers are unable to cover for one reason or another, and suggests ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance.
Abstract: The celebrated PageRank algorithm has proved to be a very effective paradigm for ranking results of web search algorithms. In this paper we refine this basic paradigm to take into account several evolving prominent features of the web, and propose several algorithmic innovations. First, we analyze features of the rapidly growing "frontier" of the web, namely the part of the web that crawlers are unable to cover for one reason or another. We analyze the effect of these pages and find it to be significant. We suggest ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance. Finally we suggest new methods of ranking that are motivated by the hierarchical structure of the web, are more efficient than PageRank, and may be more resistant to direct manipulation.

269 citations
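
For reference, a minimal PageRank power iteration that handles dangling ("link rot") pages by spreading their mass uniformly, one simple way to model unmaintained pages; the paper's refinements go further than this baseline:

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: []}          # page 3 is dangling
    n, d = len(links), 0.85
    pr = np.full(n, 1 / n)
    for _ in range(100):
        nxt = np.full(n, (1 - d) / n)                   # teleportation
        for u, outs in links.items():
            if outs:
                for v in outs:
                    nxt[v] += d * pr[u] / len(outs)
            else:
                nxt += d * pr[u] / n                    # dangling mass spread uniformly
        pr = nxt
    print(pr, pr.sum())                                 # a proper distribution, sums to 1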


Patent
28 Jul 2004
TL;DR: In this paper, a user interface allows a user to specify queries using a combination of keywords and example images, and the image retrieval system finds images with keywords that match the keywords in the query and/or images with similar low-level features, such as color, texture, and shape.
Abstract: An image retrieval system performs both keyword-based and content-based image retrieval. A user interface allows a user to specify queries using a combination of keywords and example images. Depending on the input query, the image retrieval system finds images with keywords that match the keywords in the query and/or images with similar low-level features, such as color, texture, and shape. The system ranks the images and returns them to the user. The user interface allows the user to identify images that are more relevant to the query, as well as images that are less relevant or not relevant to the query. The user may alternatively elect to refine the search by selecting one example image from the result set and submitting its low-level features in a new query. The image retrieval system monitors the user feedback and uses it to refine any search efforts and to train itself for future search queries. In the described implementation, the image retrieval system seamlessly integrates feature-based relevance feedback and semantic-based relevance feedback.

261 citations
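
An illustrative scoring rule in the spirit of this patent (weights, feature vectors, and names are assumptions, not the patent's specification): combine keyword overlap with low-level feature similarity, then rank:

    import numpy as np

    def combined_score(q_kw, q_feat, img_kw, img_feat, w_kw=0.6, w_feat=0.4):
        kw = len(q_kw & img_kw) / max(len(q_kw), 1)     # keyword overlap
        a, b = np.asarray(q_feat), np.asarray(img_feat)
        feat = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
        return w_kw * kw + w_feat * feat

    images = {"sunset.jpg": ({"sunset", "beach"}, [0.9, 0.2, 0.1]),
              "city.jpg":   ({"city", "night"},   [0.2, 0.8, 0.4])}
    q_kw, q_feat = {"sunset"}, [0.8, 0.3, 0.1]          # keywords plus an example image
    ranked = sorted(images, key=lambda i: -combined_score(q_kw, q_feat, *images[i]))
    print(ranked)                                        # ['sunset.jpg', 'city.jpg']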


Journal ArticleDOI
TL;DR: In this article, the authors examine the information requirements and the importance of various types of information for potential students when selecting a university, identifying seven broad information categories relating to university selection using data from 306 pupils studying at various schools in England, Scotland, and Northern Ireland.
Abstract: This paper aims to examine the information requirements and the importance of various types of information for potential students when selecting a university. Using data from 306 pupils studying at various schools in England, Scotland, and Northern Ireland, seven broad information categories relating to university selection have been identified. The analysis also reveals that the ranking of the various types of information required and the importance of this information are relatively similar.

246 citations


Book ChapterDOI
31 Aug 2004
TL;DR: A family of approximate top-k algorithms based on probabilistic arguments is introduced and the precision and the efficiency of the developed methods are experimentally evaluated based on a large Web corpus and a structured data collection.
Abstract: Top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. The best known general-purpose algorithm for evaluating top-k queries is Fagin's threshold algorithm (TA). Since the user's goal behind top-k queries is to identify one or a few relevant and novel data items, it is intriguing to use approximate variants of TA to reduce run-time costs. This paper introduces a family of approximate top-k algorithms based on probabilistic arguments. When scanning index lists of the underlying multidimensional data space in descending order of local scores, various forms of convolution and derived bounds are employed to predict when it is safe, with high probability, to drop candidate items and to prune the index scans. The precision and the efficiency of the developed methods are experimentally evaluated based on a large Web corpus and a structured data collection.
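
For context, here is a compact version of Fagin's threshold algorithm (TA), the exact baseline that the paper's probabilistic variants approximate; the probabilistic pruning itself is omitted. Lists are assumed equal-length and sorted by descending local score, with absent items scoring 0:

    import heapq

    def ta_topk(lists, k):
        seen, lookup = {}, [dict(l) for l in lists]
        for row in zip(*lists):                          # one sorted access per list
            threshold = sum(score for _, score in row)   # best any unseen item can do
            for item, _ in row:
                if item not in seen:                     # random access for new items
                    seen[item] = sum(d.get(item, 0.0) for d in lookup)
            top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
            if len(top) == k and top[-1][1] >= threshold:
                return top                               # safe to stop early
        return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

    l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
    l2 = [("b", 0.95), ("a", 0.7), ("c", 0.2)]
    print(ta_topk([l1, l2], k=2))                        # [('b', 1.75), ('a', ~1.6)]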

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper provides an elegant definition of relaxation on structure, defines primitive operators to span the space of relaxations, sets out desirable principles for ranking schemes, and proposes natural ranking schemes that adhere to these principles.
Abstract: Querying XML data is a well-explored topic with powerful database-style query languages such as XPath and XQuery set to become W3C standards. An equally compelling paradigm for querying XML documents is full-text search on textual content. In this paper, we study fundamental challenges that arise when we try to integrate these two querying paradigms. While keyword search is based on approximate matching, XPath has exact match semantics. We address this mismatch by considering queries on structure as a "template", and looking for answers that best match this template and the full-text search. To achieve this, we provide an elegant definition of relaxation on structure and define primitive operators to span the space of relaxations. Query answering is now based on ranking potential answers on structural and full-text search conditions. We set out certain desirable principles for ranking schemes and propose natural ranking schemes that adhere to these principles. We develop efficient algorithms for answering top-K queries and discuss results from a comprehensive set of experiments that demonstrate the utility and scalability of the proposed framework and algorithms.

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This work hypothesizes the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources, and develops a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax.
Abstract: Recently, the Web has been rapidly "deepened" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says, or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure," by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar, and thus their semantic understanding into a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy for extracting query conditions across random sources.

Patent
Carl J. Kraenzel, Paul B. Moody, Joann Ruvolo, Thomas P. Moran, Justin Lessler
26 Aug 2004
TL;DR: In this paper, a method of generating a context-inferenced search query and sorting a result of the query is described, which includes analyzing an event associated with the user to determine a contextual setting, dynamically generating a search query based on the contextual setting and searching at least one information source using the search query to generate a search result.
Abstract: A method of generating a context-inferenced search query and of sorting a result of the query is described. The method includes analyzing an event associated with the user to determine a contextual setting, dynamically generating a search query based on the contextual setting, and searching at least one information source using the search query to generate a search result. Additionally, the method includes calculating an importance value for each item of the search result, sorting the items of the search result according to the importance value, and displaying the sorted search result to the user.

01 Dec 2004
TL;DR: A simple, computer-generated example is provided to illustrate the procedure for multimodel inference based on K-L information, and arguments are presented, based on statistical underpinnings that have been overlooked with time, that its theoretical basis renders it preferable to other approaches.
Abstract: Uncertainty of hydrogeologic conditions makes it important to consider alternative plausible models in an effort to evaluate the character of a ground water system, maintain parsimony, and make predictions with reasonable definition of their uncertainty. When multiple models are considered, data collection and analysis focus on evaluation of which model(s) is(are) most supported by the data. Generally, more than one model provides a similar acceptable fit to the observations; thus, inference should be made from multiple models. Kullback-Leibler (K-L) information provides a rigorous foundation for model inference that is simple to compute, is easy to interpret, selects parsimonious models, and provides a more realistic measure of precision than evaluation of any one model or evaluation based on other commonly referenced model selection criteria. These alternative criteria strive to identify the true (or quasi-true) model, assume it is represented by one of the models in the set, and, given their preference for parsimony regardless of the number of available observations, may select an underfit model. This is in sharp contrast to the K-L information approach, where models are considered to be approximations to reality, and it is expected that more details of the system will be revealed when more data are available. We provide a simple, computer-generated example to illustrate the procedure for multimodel inference based on K-L information and present arguments, based on statistical underpinnings that have been overlooked with time, that its theoretical basis renders it preferable to other approaches.
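
The K-L-based procedure the authors advocate is commonly operationalized through AIC, an estimator of relative K-L information loss. A minimal sketch with invented AIC values:

    import numpy as np

    aic = np.array([230.4, 232.1, 237.8])     # one (invented) AIC value per model
    delta = aic - aic.min()
    w = np.exp(-delta / 2)
    w /= w.sum()                              # Akaike weights: evidence for each model
    for i, (dl, wi) in enumerate(zip(delta, w)):
        print(f"model {i}: delta = {dl:.1f}, weight = {wi:.3f}")
    # a model-averaged prediction would weight each model's prediction by w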

Patent
07 Dec 2004
TL;DR: In this article, a traffic counter for a business location in the business database is updated when a mobile device's geo-position moves inside the business location; the ranking can be adjusted for business size by dividing the traffic counter by the square footage of the business and sorting the result set by mobile-device visits, or repeat visits, per square foot.
Abstract: An Internet search engine ranks search results based on popularity with mobile-device users. Geo-position data from cell phones and other mobile devices are collected into a device geo-position database. The geo-position data is compared to locations of businesses in a business database. When a mobile device's geo-position moves inside a business location, a traffic counter for that business location in the business database is updated. When an Internet user performs a local search, the result set is sorted based on a rank that is at least partially determined by the traffic counters. The popularity-ranked search results indicate which businesses received the most mobile-device visits, an indication of the business's overall popularity. The popularity ranking may be adjusted for business size by dividing the traffic counter by the square footage of the business and sorting the result set based on the mobile-device visits, or repeat visits, per square foot.

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This work focuses on a subclass of CAS queries consisting of simple path expressions, studying algorithmic issues in integrating structure indexes with inverted lists for the evaluation of these queries, where all documents that match the query are ranked and the top k documents are returned in order of relevance.
Abstract: Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes; and approaches based on information-retrieval style inverted lists. We propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over the Niagara XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance-optimal algorithms to push down top-k computation.

Patent
31 Mar 2004
TL;DR: In this article, techniques are disclosed that locate implicitly defined semantic structures in a document, such as, for example, implicitly defined lists in an HTML document, which can be used in the calculation of distance values between terms in the documents.
Abstract: Techniques are disclosed that locate implicitly defined semantic structures in a document, such as, for example, implicitly defined lists in an HTML document. The semantic structures can be used in the calculation of distance values between terms in the documents. The distance values may be used, for example, in the generation of ranking scores that indicate a relevance level of the document to a search query.

Book ChapterDOI
20 Oct 2004
TL;DR: The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information and supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction.
Abstract: The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, a geographical ontology, maintenance and retrieval functions for a test collection of web documents, textual and spatial indexes, relevance ranking and metadata extraction. Here we summarise the functionality and interaction between these components before focusing on the design of the geo-ontology and the development of spatio-textual indexing methods. The geo-ontology supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction. Geographical place names are accompanied by multiple geometric footprints and qualitative spatial relationships. Spatial indexing of documents has been integrated with text indexing through the use of spatio-textual keys in which terms are concatenated with spatial cells to which they relate. Preliminary experiments demonstrate considerable performance benefits when compared with pure text indexing and with text indexing followed by a spatial filtering stage.
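
A sketch of the spatio-textual keys described above (grid size, data, and key format are illustrative assumptions): each posting key concatenates a term with the id of a spatial cell the document relates to, so a query touches only the matching cell's postings:

    from collections import defaultdict

    CELL = 10.0                               # grid cell size, in degrees

    def cell_id(lon, lat):
        return f"{int(lon // CELL)}_{int(lat // CELL)}"

    index = defaultdict(set)
    docs = {1: ("castle tour", (-3.2, 55.9)),     # roughly Edinburgh
            2: ("castle hotel", (2.35, 48.85))}   # roughly Paris
    for doc_id, (text, (lon, lat)) in docs.items():
        for term in text.split():
            index[f"{term}#{cell_id(lon, lat)}"].add(doc_id)   # spatio-textual key

    # query "castle" near (-3.0, 56.0): only the matching cell's postings are read
    print(index[f"castle#{cell_id(-3.0, 56.0)}"])              # {1}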

Proceedings ArticleDOI
30 Mar 2004
TL;DR: This work proposes a novel way of integrating spatio-temporal indexes with sketches, traditionally used for approximate query processing, to solve the distinct counting problem of summarized information about moving objects that lie in a query region during a query interval.
Abstract: Several spatio-temporal applications require the retrieval of summarized information about moving objects that lie in a query region during a query interval (e.g., the number of mobile users covered by a cell, traffic volume in a district, etc.). Existing solutions have the distinct counting problem: if an object remains in the query region for several timestamps during the query interval, it will be counted multiple times in the result. We solve this problem by integrating spatio-temporal indexes with sketches, traditionally used for approximate query processing. The proposed techniques can also be applied to reduce the space requirements of conventional spatio-temporal data and to mine spatio-temporal association rules.
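
A minimal Flajolet-Martin (FM) sketch, the kind of duplicate-insensitive summary such techniques combine with spatio-temporal indexes so that an object lingering in a region is counted once; constants and data are illustrative:

    import hashlib

    PHI = 0.77351                                 # FM correction constant

    def fm_estimate(items, m=64):
        bitmap = [0] * m                          # one FM bitmap per hash group
        for x in items:
            h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
            j, y = h % m, h // m
            rho = (y & -y).bit_length() - 1 if y else 0    # least significant 1-bit
            bitmap[j] |= 1 << rho
        # R_j = position of the lowest unset bit in bitmap j
        R = [((~b) & -(~b)).bit_length() - 1 for b in bitmap]
        return m / PHI * 2 ** (sum(R) / m)

    # the same user observed at many timestamps still counts once
    readings = [f"user{i % 5000}" for i in range(100000)]
    print(round(fm_estimate(readings)))           # roughly 5000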

Book ChapterDOI
31 Aug 2004
TL;DR: This work adapts and applies principles of probabilistic models from Information Retrieval to structured data, to solve the problem of ranking answers to a database query when many tuples are returned.
Abstract: We investigate the problem of ranking answers to a database query when many tuples are returned. We adapt and apply principles of probabilistic models from Information Retrieval for structured data. Our proposed solution is domain independent. It leverages data and workload statistics and correlations. Our ranking functions can be further customized for different applications. We present results of preliminary experiments which demonstrate the efficiency as well as the quality of our ranking system.

Patent
Simon Tong, Mark Pearson, Sergey Brin
10 Sep 2004
TL;DR: In this paper, a search query is received, a query related to the search query is determined, an article (such as a web page) associated with the search query is determined, and a ranking score for the article is determined based at least in part on data associated with the related query.
Abstract: Systems and methods that improve search rankings for a search query by using data associated with queries related to the search query are described. In one aspect, a search query is received, a related query related to the search query is determined, an article (such as a web page) associated with the search query is determined, and a ranking score for the article based at least in part on data associated with the related query is determined. Several algorithms and types of data associated with related queries useful in carrying out such systems and methods are described.

Patent
22 Dec 2004
TL;DR: In this paper, a system and a method are directed to targeted graphical advertisements, which may involve identifying a graphical advertisement associated with an entity (e.g., advertiser), associating one or more concepts with the graphical advertisement, and delivering the graphical advertisements associated with the concept.
Abstract: A system and a method are directed to targeted graphical advertisements, which may involve identifying a graphical advertisement associated with an entity (e.g., advertiser); associating one or more concepts with the graphical advertisement; receiving a request for an advertisement associated with a concept; and delivering the graphical advertisement associated with the concept, wherein the graphical advertisement is positioned for display based on a ranking among advertisements for the concept, the ranking being based at least on a price parameter amount offered by the entity.

Journal ArticleDOI
TL;DR: Machine learning methods are shown to make it possible to automatically build models for retrieving high-quality, content-specific articles in internal medicine, using inclusion or citation by the ACP Journal Club in a given time period as a gold standard; the resulting models perform better than the 1994 PubMed clinical query filters.

Patent
23 Apr 2004
TL;DR: In this article, a method and computer program product are presented for determining a document relevance function that estimates the relevance score of a document in a database with respect to a query; the method addresses relevance ranking rather than document classification.
Abstract: Provided is a method and computer program product for determining a document relevance function for estimating a relevance score of a document in a database with respect to a query. For each of a plurality of test queries, a respective set of result documents is collected. For each test query, a subset of the documents in the respective result set is selected, and a set of training relevance scores is assigned to documents in the subset. In one embodiment, at least some of the training relevance scores are assigned by human subjects who determine individual relevance scores for submitted documents with respect to the corresponding queries. Finally, a relevance function is determined based on the plurality of test queries, the subsets of documents, and the sets of training relevance scores.
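
A sketch of this pipeline (features, scores, and the least-squares choice are invented for illustration): fit a relevance function from human-assigned training scores over (query, document) feature vectors:

    import numpy as np

    # features per (query, document) pair: [term overlap, link score, freshness]
    X = np.array([[0.9, 0.5, 0.1],
                  [0.2, 0.9, 0.7],
                  [0.1, 0.1, 0.2],
                  [0.8, 0.3, 0.9]])
    y = np.array([4.0, 3.0, 1.0, 5.0])            # human-assigned relevance scores
    Xb = np.hstack([X, np.ones((len(X), 1))])     # add a bias term
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)    # the learned relevance function

    def relevance(features):
        return float(np.append(features, 1.0) @ w)

    print(relevance([0.7, 0.4, 0.5]))             # estimated score for a new pair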

Journal ArticleDOI
01 Jan 2004
TL;DR: An automatic mechanism for selecting appropriate concepts that both describe and identify documents as well as language employed in user requests is described, and a scalable disambiguation algorithm that prunes irrelevant concepts and allows relevant ones to associate with documents and participate in query generation is proposed.
Abstract: Technology in the field of digital media generates huge amounts of nontextual information, audio, video, and images, along with more familiar textual information. The potential for exchange and retrieval of information is vast and daunting. The key problem in achieving efficient and user-friendly retrieval is the development of a search mechanism to guarantee delivery of minimal irrelevant information (high precision) while insuring relevant information is not overlooked (high recall). The traditional solution employs keyword-based search. The only documents retrieved are those containing user-specified keywords. But many documents convey desired semantic information without containing these keywords. This limitation is frequently addressed through query expansion mechanisms based on the statistical co-occurrence of terms. Recall is increased, but at the expense of deteriorating precision. One can overcome this problem by indexing documents according to context and meaning rather than keywords, although this requires a method of converting words to meanings and the creation of a meaning-based index structure. We have solved the problem of an index structure through the design and implementation of a concept-based model using domain-dependent ontologies. An ontology is a collection of concepts and their interrelationships that provide an abstract view of an application domain. With regard to converting words to meaning, the key issue is to identify appropriate concepts that both describe and identify documents as well as language employed in user requests. This paper describes an automatic mechanism for selecting these concepts. An important novelty is a scalable disambiguation algorithm that prunes irrelevant concepts and allows relevant ones to associate with documents and participate in query generation. We also propose an automatic query expansion mechanism that deals with user requests expressed in natural language. This mechanism generates database queries with appropriate and relevant expansion through knowledge encoded in ontology form. Focusing on audio data, we have constructed a demonstration prototype. We have experimentally and analytically shown that our model, compared to keyword search, achieves a significantly higher degree of precision and recall. The techniques employed can be applied to the problem of information selection in all media types.

Proceedings Article
01 Jan 2004
TL;DR: This year’s main experiment involved processing a mixed query stream, with an even mix of each query type studied in TREC-2003, to find ranking approaches which work well over the 225 queries, without access to query type labels.
Abstract: This year’s main experiment involved processing a mixed query stream, with an even mix of each query type studied in TREC-2003: 75 homepage finding queries, 75 named page finding queries and 75 topic distillation queries. The goal was to find ranking approaches which work well over the 225 queries, without access to query type labels. We also ran two small experiments. First, participants were invited to submit classification runs, attempting to correctly label the 225 queries by type. Second, we invited participants to download the new W3C test collection, and think about appropriate experiments for the proposed TREC-2005 Enterprise Track. This is the last year for the Web Track in its current form, it will not run in TREC-2005.

Proceedings ArticleDOI
10 Oct 2004
TL;DR: This work proposes using query-class dependent weights within a hierarchical mixture-of-experts framework to combine multiple retrieval results; the query-class associated weights can be learned from the development data efficiently and generalized to unseen queries easily.
Abstract: Combining retrieval results from multiple modalities plays a crucial role for video retrieval systems, especially for automatic video retrieval systems without any user feedback and query expansion. However, most current systems only utilize query-independent combination or rely on explicit user weighting. In this work, we propose using query-class dependent weights within a hierarchical mixture-of-experts framework to combine multiple retrieval results. We first classify each user query into one of four predefined categories and then aggregate the retrieval results with query-class associated weights, which can be learned from the development data efficiently and generalized to unseen queries easily. Our experimental results demonstrate that the performance with query-class dependent weights can considerably surpass that with query-independent weights.
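
A sketch of query-class dependent fusion (the classes, weights, and stand-in classifier are invented; the paper learns weights on development data): classify the query, then combine per-modality scores with that class's weights:

    CLASS_WEIGHTS = {                             # e.g. learned on development data
        "named_person": {"text": 0.70, "face": 0.25, "audio": 0.05},
        "general":      {"text": 0.40, "face": 0.10, "audio": 0.50},
    }

    def classify(query):
        # stand-in classifier; the paper predefines four query classes
        return "named_person" if query.istitle() else "general"

    def fuse(query, modality_scores):
        w = CLASS_WEIGHTS[classify(query)]
        docs = {doc for scores in modality_scores.values() for doc in scores}
        return {doc: sum(w[m] * s.get(doc, 0.0) for m, s in modality_scores.items())
                for doc in docs}

    scores = {"text": {"shot1": 0.9, "shot2": 0.3},
              "face": {"shot1": 0.2, "shot2": 0.8},
              "audio": {"shot2": 0.5}}
    print(sorted(fuse("Bill Clinton", scores).items(), key=lambda kv: -kv[1]))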

Proceedings ArticleDOI
13 Jun 2004
TL;DR: A rank-aware query optimization framework that fully integrates rank-join operators into relational query engines is introduced, based on extending the System R dynamic programming algorithm in both enumeration and pruning; a probabilistic model is also introduced for estimating the input cardinality, and hence the cost, of a rank-join operator.
Abstract: Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. The experiments show the validity of our framework and the accuracy of the proposed estimation model.
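
For intuition, a compact rank-join in the HRJN spirit, the kind of operator the paper's optimizer costs and places; this is a simplified sketch, with sum as the combining function and inputs sorted by descending score:

    def rank_join(left, right, k):
        """left/right: (key, score) lists sorted by descending score; score = sum."""
        seen_l, seen_r, results = {}, {}, []
        i = j = 0
        while i < len(left) or j < len(right):
            if j >= len(right) or (i < len(left) and i <= j):   # alternate inputs
                key, s = left[i]; i += 1; seen_l[key] = s
                if key in seen_r: results.append((s + seen_r[key], key))
            else:
                key, s = right[j]; j += 1; seen_r[key] = s
                if key in seen_l: results.append((seen_l[key] + s, key))
            # upper bound on any join result not yet produced
            cur_l = left[min(i, len(left) - 1)][1]
            cur_r = right[min(j, len(right) - 1)][1]
            threshold = max(left[0][1] + cur_r, cur_l + right[0][1])
            results.sort(reverse=True)
            if len(results) >= k and results[k - 1][0] >= threshold:
                return results[:k]                              # ranked without full join
        return sorted(results, reverse=True)[:k]

    L = [("a", 10), ("b", 8), ("c", 2)]
    R = [("b", 9), ("a", 7), ("c", 6)]
    print(rank_join(L, R, k=1))                                 # [(17, 'b')]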

Proceedings ArticleDOI
31 Oct 2004
TL;DR: This paper presents a general probabilistic technique for error ranking that exploits correlation behavior amongst reports and incorporates user feedback into the ranking process, and observes a factor of 2-8 improvement over randomized ranking for error reports emitted by both intra-procedural and inter-procedural analysis tools.
Abstract: Static program checking tools can find many serious bugs in software, but due to analysis limitations they also frequently emit false error reports. Such false positives can easily render the error checker useless by hiding real errors amidst the false. Effective error report ranking schemes mitigate the problem of false positives by suppressing them during the report inspection process [17, 19, 20]. In this way, ranking techniques provide a complementary method to increasing the precision of the analysis results of a checking tool. A weakness of previous ranking schemes, however, is that they produce static rankings that do not adapt as reports are inspected, ignoring useful correlations amongst reports. This paper addresses this weakness with two main contributions. First, we observe that both bugs and false positives frequently cluster by code locality. We analyze clustering behavior in historical bug data from two large systems and show how clustering can be exploited to greatly improve error report ranking. Second, we present a general probabilistic technique for error ranking that (1) exploits correlation behavior amongst reports and (2) incorporates user feedback into the ranking process. In our results we observe a factor of 2-8 improvement over randomized ranking for error reports emitted by both intra-procedural and inter-procedural analysis tools.
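
A sketch of the feedback-adaptive idea (priors, clusters, and data are invented): reports cluster by code locality, and each inspection updates a per-cluster Beta posterior that re-ranks the remaining reports:

    reports = [("r1", "fileA"), ("r2", "fileA"), ("r3", "fileB"), ("r4", "fileB")]
    post = {"fileA": [1, 1], "fileB": [1, 1]}     # Beta(alpha, beta) per cluster

    def p_true(cluster):
        a, b = post[cluster]
        return a / (a + b)                        # expected P(report is a real bug)

    def record_feedback(report_id, cluster, is_real_bug):
        post[cluster][0 if is_real_bug else 1] += 1   # posterior update from inspection

    pending = list(reports)
    record_feedback(*pending.pop(0), is_real_bug=True)   # r1 in fileA was a real bug
    pending.sort(key=lambda r: -p_true(r[1]))            # re-rank remaining reports
    print(pending)                                       # fileA's r2 now ranks first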