Showing papers on "Ranking (information retrieval)" published in 2011


Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work proposes a generative model over the joint space of attribute ranking outputs, and proposes a novel form of zero-shot learning in which the supervisor relates the unseen object category to previously seen objects via attributes (for example, ‘bears are furrier than giraffes’).
Abstract: Human-nameable visual “attributes” can benefit various recognition tasks. However, existing techniques restrict these properties to categorical labels (for example, a person is ‘smiling’ or not, a scene is ‘dry’ or not), and thus fail to capture more general semantic relationships. We propose to model relative attributes. Given training data stating how object/scene categories relate according to different attributes, we learn a ranking function per attribute. The learned ranking functions predict the relative strength of each property in novel images. We then build a generative model over the joint space of attribute ranking outputs, and propose a novel form of zero-shot learning in which the supervisor relates the unseen object category to previously seen objects via attributes (for example, ‘bears are furrier than giraffes’). We further show how the proposed relative attributes enable richer textual descriptions for new images, which in practice are more precise for human interpretation. We demonstrate the approach on datasets of faces and natural scenes, and show its clear advantages over traditional binary attribute prediction for these new tasks.

1,046 citations
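
The relative-attributes entry above describes learning one ranking function per attribute from statements about which item shows more of that attribute. A minimal sketch of that idea follows, assuming a linear scoring function trained with a simple pairwise perceptron update. The paper's actual learner and features are not given in the abstract, so `train_ranker` and the toy feature vectors are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_ranker(features, ordered_pairs, epochs=50):
    """Learn w so that dot(w, features[i]) > dot(w, features[j]) for each (i, j).

    Each pair (i, j) states that item i shows MORE of the attribute than item j.
    This is a perceptron on feature differences, an assumed stand-in learner.
    """
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        updated = False
        for i, j in ordered_pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            if dot(w, diff) <= 0:  # pair is mis-ordered (or tied): nudge w
                w = [wi + di for wi, di in zip(w, diff)]
                updated = True
        if not updated:  # every pair is ordered correctly
            break
    return w


# Toy data (hypothetical): four items, two features each.
features = [(3.0, 0.0), (2.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
ordered_pairs = [(0, 1), (1, 2), (2, 3), (0, 3)]

w = train_ranker(features, ordered_pairs)
ranking = sorted(range(len(features)), key=lambda i: dot(w, features[i]), reverse=True)
print(ranking)
```

The learned scores give a relative strength for unseen items as well, which is what lets attribute strength be compared rather than thresholded into a binary label.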


Journal ArticleDOI
27 Jun 2011-PLOS ONE
TL;DR: It is shown that LeaderRank outperforms PageRank in terms of ranking effectiveness, as well as robustness against manipulations and noisy data, which suggests that leaders who are aware of their clout may reinforce the development of social networks, and thus the power of collective search.
Abstract: Finding pertinent information is not limited to search engines. Online communities can amplify the influence of a small number of power users for the benefit of all other users. Users' information foraging in depth and breadth can be greatly enhanced by choosing suitable leaders. For instance, on delicious.com, users subscribe to leaders' collections, which leads to a deeper and wider reach not achievable with search engines. To consolidate such collective search, it is essential to utilize the leadership topology and identify influential users. Google's PageRank, a successful search algorithm on the World Wide Web, turns out to be less effective in networks of people. We thus devise an adaptive and parameter-free algorithm, LeaderRank, to quantify user influence. We show that LeaderRank outperforms PageRank in terms of ranking effectiveness, as well as robustness against manipulations and noisy data. These results suggest that leaders who are aware of their clout may reinforce the development of social networks, and thus the power of collective search.

718 citations
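
The LeaderRank abstract above does not spell out the algorithm's steps. The sketch below follows the procedure as it is commonly described, and should be read as an assumption rather than the authors' reference implementation: a "ground" node is linked in both directions to every user, scores diffuse along follower-to-leader links, and the ground node's final score is shared equally among users. The function name `leaderrank` and the fixed iteration count are illustrative choices.

```python
def leaderrank(nodes, edges, iterations=100):
    """Score users on a directed follower -> leader graph.

    nodes: list of user ids.
    edges: list of (follower, leader) pairs; score flows follower -> leader.
    """
    ground = object()  # extra node linked bidirectionally to every user
    out_links = {n: [ground] for n in nodes}
    out_links[ground] = list(nodes)
    for follower, leader in edges:
        out_links[follower].append(leader)

    # Every user starts with one unit of score, the ground node with none.
    score = {n: 1.0 for n in nodes}
    score[ground] = 0.0

    for _ in range(iterations):
        nxt = dict.fromkeys(out_links, 0.0)
        for n, targets in out_links.items():
            share = score[n] / len(targets)  # split score evenly over out-links
            for t in targets:
                nxt[t] += share
        score = nxt

    bonus = score[ground] / len(nodes)  # hand the ground score back to users
    return {n: score[n] + bonus for n in nodes}


# Toy network (hypothetical): b and c both follow a.
s = leaderrank(["a", "b", "c"], [("b", "a"), ("c", "a")])
print(sorted(s, key=s.get, reverse=True))
```

Because the ground node makes the graph strongly connected, no damping parameter is needed, which is consistent with the abstract's "parameter-free" description.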


Proceedings ArticleDOI
12 Jun 2011
TL;DR: The design of CrowdDB is described, including a major conceptual change (the traditional closed-world assumption for query processing does not hold for human input), and important avenues for future work in the development of crowdsourced query processing systems are outlined.
Abstract: Some queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many aspects of traditional database systems, there are also important differences. Conceptually, a major change is that the traditional closed-world assumption for query processing does not hold for human input. From an implementation perspective, human-oriented query operators are needed to solicit, integrate and cleanse crowdsourced data. Furthermore, performance and cost depend on a number of new factors including worker affinity, training, fatigue, motivation and location. We describe the design of CrowdDB, report on an initial set of experiments using Amazon Mechanical Turk, and outline important avenues for future work in the development of crowdsourced query processing systems.

688 citations


Journal ArticleDOI
TL;DR: This article presents an improved approach to assist diagnosis of failures in software by ranking program statements or blocks according to how likely they are to be buggy, which out-performs previously proposed methods for the model program, the Siemens test suite and Space.
Abstract: This article presents an improved approach to assist diagnosis of failures in software (fault localisation) by ranking program statements or blocks according to how likely they are to be buggy. We present a very simple single-bug program to model the problem. By examining different possible execution paths through this model program over a number of test cases, the effectiveness of different proposed spectral ranking methods can be evaluated in idealised conditions. The results are remarkably consistent with those arrived at empirically using the Siemens test suite and Space benchmarks. The model also helps identify groups of metrics that are equivalent for ranking. Due to the simplicity of the model, an optimal ranking method can be devised. This new method out-performs previously proposed methods for the model program, the Siemens test suite and Space. It also helps provide insight into other ranking methods.

405 citations
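
To make the idea of spectral ranking in the fault-localisation entry above concrete, here is a minimal sketch that ranks statements by a suspiciousness score computed from test coverage. It uses the well-known Ochiai metric as an example of such a score; this is not the optimal method the article derives, and the coverage matrix is made-up illustration data.

```python
import math


def ochiai(ef, ep, nf):
    """Ochiai suspiciousness.

    ef: failed tests that executed the statement
    ep: passed tests that executed the statement
    nf: failed tests that did NOT execute the statement
    """
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def rank_statements(coverage, failed):
    """Return statement indices, most suspicious first.

    coverage[t][s] is True if test t executed statement s.
    failed[t] is True if test t failed.
    """
    total_failed = sum(failed)
    scored = []
    for s in range(len(coverage[0])):
        ef = sum(1 for t, row in enumerate(coverage) if row[s] and failed[t])
        ep = sum(1 for t, row in enumerate(coverage) if row[s] and not failed[t])
        scored.append((ochiai(ef, ep, total_failed - ef), s))
    # Sort by descending score, breaking ties by statement index.
    return [s for _, s in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical spectrum: 3 tests x 3 statements; tests 0 and 2 fail.
coverage = [
    [True, True, False],
    [True, False, True],
    [True, True, True],
]
failed = [True, False, True]
order = rank_statements(coverage, failed)
print(order)
```

Statement 1 is executed by both failing tests and by no passing test, so it comes first; other metrics plug in by swapping the `ochiai` function.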


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work proposes a principled approach for multi-attribute retrieval which explicitly models the correlations that are present between the attributes in the vocabulary, and integrates ranking and retrieval within the same formulation.
Abstract: We propose a novel approach for ranking and retrieval of images based on multi-attribute queries. Existing image retrieval methods train separate classifiers for each word and heuristically combine their outputs for retrieving multiword queries. Moreover, these approaches also ignore the interdependencies among the query terms. In contrast, we propose a principled approach for multi-attribute retrieval which explicitly models the correlations that are present between the attributes. Given a multi-attribute query, we also utilize other attributes in the vocabulary which are not present in the query, for ranking/retrieval. Furthermore, we integrate ranking and retrieval within the same formulation, by posing them as structured prediction problems. Extensive experimental evaluation on the Labeled Faces in the Wild (LFW), FaceTracer and PASCAL VOC datasets shows that our approach significantly outperforms several state-of-the-art ranking and retrieval methods.

384 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: Three extensions to automatic query expansion are introduced: a method capable of preventing tf-idf failure caused by the presence of sets of correlated features, an improved spatial verification and re-ranking step that incrementally builds a statistical model of the query object, and a method that learns relevant spatial context to boost retrieval performance.
Abstract: The most effective particular object and image retrieval approaches are based on the bag-of-words (BoW) model. All state-of-the-art retrieval results have been achieved by methods that include a query expansion that brings a significant boost in performance. We introduce three extensions to automatic query expansion: (i) a method capable of preventing tf-idf failure caused by the presence of sets of correlated features (confusers), (ii) an improved spatial verification and re-ranking step that incrementally builds a statistical model of the query object, and (iii) a method that learns relevant spatial context to boost retrieval performance. The three improvements of query expansion were evaluated on the standard Paris and Oxford datasets according to a standard protocol, and state-of-the-art results were achieved.

340 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This paper proposes an approach that can encode more spatial information into BoV representation and that is efficient enough to be applied to large-scale databases and can be integrated with the min-hash method to improve its retrieval accuracy.
Abstract: The most popular approach to large scale image retrieval is based on the bag-of-visual-word (BoV) representation of images. The spatial information is usually re-introduced as a post-processing step to re-rank the retrieved images, through a spatial verification like RANSAC. Since the spatial verification techniques are computationally expensive, they can be applied only to the top images in the initial ranking. In this paper, we propose an approach that can encode more spatial information into BoV representation and that is efficient enough to be applied to large-scale databases. Other works pursuing the same purpose have proposed exploring the word co-occurrences in the neighborhood areas. Our approach encodes more spatial information through the geometry-preserving visual phrases (GVP). In addition to co-occurrences, the GVP method also captures the local and long-range spatial layouts of the words. Our GVP-based search algorithm adds little memory usage or computational time compared to the BoV method. Moreover, we show that our approach can also be integrated with the min-hash method to improve its retrieval accuracy. Experimental results on the Oxford 5K and Flickr 1M datasets show that our approach outperforms the BoV method even when the latter is followed by RANSAC verification.

324 citations


Patent
16 Mar 2011
TL;DR: In this article, a search engine allows authors to submit bids in auction for ranking in order to keep their posts (or posts of other authors) visible to targeted searchers for a longer period of time than would normally be available.
Abstract: Embodiments of a search engine are disclosed that enable authors and third parties to influence the persistence and ranking of the author or the author's posts in search result listings using a bidding process or other compensation-based mechanism. In one embodiment, the search engine allows authors to submit bids in auction for ranking in order to keep their posts (or posts of other authors) visible to targeted searchers for a longer period of time than would normally be available. The bid amount, together with other attributes, can be used to determine the relevance and ranking of posts or authors provided in a search results page to a searcher. Embodiments of the search engine may be utilized with a microblogging service or a social networking service.

314 citations


Patent
23 Sep 2011
TL;DR: In one embodiment, the method comprises receiving an input query; conducting a search to identify candidate answers to the input query, and producing a plurality of scores for each of the candidate answers as discussed by the authors.
Abstract: A method, system and computer program product for generating answers to questions. In one embodiment, the method comprises receiving an input query; conducting a search to identify candidate answers to the input query, and producing a plurality of scores for each of the candidate answers. For each of the candidate answers, one of a plurality of candidate ranking functions is selected. This selected ranking function is applied to each of the candidate answers to determine a ranking for the candidate answer based on the scores for that candidate answer. One or more of the candidate answers is selected, based on the rankings for the candidate answers, as one or more answers to the input query. In an embodiment, the ranking function selection is performed using information about the question. In an embodiment, the ranking function selection is performed using information about each answer.

312 citations


Journal ArticleDOI
TL;DR: An efficient index, called IR-tree, is proposed that together with a top-k document search algorithm facilitates four major tasks in document searches, namely, 1) spatial filtering, 2) textual filtering, 3) relevance computation, and 4) document ranking in a fully integrated manner.
Abstract: Given a geographic query that is composed of query keywords and a location, a geographic search engine retrieves documents that are the most textually and spatially relevant to the query keywords and the location, respectively, and ranks the retrieved documents according to their joint textual and spatial relevances to the query. The lack of an efficient index that can simultaneously handle both the textual and spatial aspects of the documents makes existing geographic search engines inefficient in answering geographic queries. In this paper, we propose an efficient index, called IR-tree, that together with a top-k document search algorithm facilitates four major tasks in document searches, namely, 1) spatial filtering, 2) textual filtering, 3) relevance computation, and 4) document ranking in a fully integrated manner. In addition, IR-tree allows searches to adopt different weights on textual and spatial relevance of documents at the runtime and thus caters for a wide variety of applications. A set of comprehensive experiments over a wide range of scenarios has been conducted and the experiment results demonstrate that IR-tree outperforms the state-of-the-art approaches for geographic document searches.

270 citations
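
The IR-tree entry above ranks documents by a joint textual and spatial relevance with weights that can be changed at query time. The sketch below illustrates only that scoring and top-k step by brute force; it does not implement the IR-tree index itself. The relevance functions (term overlap and inverse distance), the `alpha` weight, and the sample documents are all assumptions made for illustration.

```python
import heapq
import math


def text_relevance(query_terms, doc_terms):
    """Fraction of query terms present in the document (assumed measure)."""
    q = set(query_terms)
    return len(q & set(doc_terms)) / len(q) if q else 0.0


def spatial_relevance(query_loc, doc_loc):
    """Decreases with Euclidean distance; equals 1.0 at distance 0 (assumed measure)."""
    return 1.0 / (1.0 + math.dist(query_loc, doc_loc))


def top_k(query_terms, query_loc, docs, k, alpha=0.5):
    """Return the k best (name, terms, location) docs.

    alpha weights textual relevance; (1 - alpha) weights spatial relevance.
    """
    def score(doc):
        _, terms, loc = doc
        return (alpha * text_relevance(query_terms, terms)
                + (1 - alpha) * spatial_relevance(query_loc, loc))

    return heapq.nlargest(k, docs, key=score)


# Hypothetical documents: (name, terms, (x, y)).
docs = [
    ("near_cafe", ["coffee"], (0.0, 0.0)),
    ("far_cafe", ["coffee", "cake"], (3.0, 4.0)),
    ("near_bank", ["bank"], (0.0, 0.0)),
]
balanced = [d[0] for d in top_k(["coffee", "cake"], (0.0, 0.0), docs, k=2)]
text_only = [d[0] for d in top_k(["coffee", "cake"], (0.0, 0.0), docs, k=1, alpha=1.0)]
print(balanced, text_only)
```

Changing `alpha` per query mirrors the abstract's point about adopting different textual and spatial weights at runtime; an index such as the IR-tree exists to avoid scoring every document as this sketch does.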


Journal ArticleDOI
01 Mar 2011
TL;DR: The results of the evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related.
Abstract: Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.

Proceedings ArticleDOI
28 Mar 2011
TL;DR: It is shown that context, such as the user's recent queries, can be used to improve the prediction quality considerably even for such short prefixes, and a context-sensitive query auto completion algorithm is proposed, NearestCompletion, which outputs the completions of the user's input that are most similar to the context queries.
Abstract: Query auto completion is known to provide poor predictions of the user's query when her input prefix is very short (e.g., one or two characters). In this paper we show that context, such as the user's recent queries, can be used to improve the prediction quality considerably even for such short prefixes. We propose a context-sensitive query auto completion algorithm, NearestCompletion, which outputs the completions of the user's input that are most similar to the context queries. To measure similarity, we represent queries and contexts as high-dimensional term-weighted vectors and resort to cosine similarity. The mapping from queries to vectors is done through a new query expansion technique that we introduce, which expands a query by traversing the query recommendation tree rooted at the query. In order to evaluate our approach, we performed extensive experimentation over the public AOL query log. We demonstrate that when the user's recent queries are relevant to the current query she is typing, then after typing a single character, NearestCompletion's MRR is 48% higher relative to the MRR of the standard MostPopularCompletion algorithm on average. When the context is irrelevant, however, NearestCompletion's MRR is essentially zero. To mitigate this problem, we propose HybridCompletion, which is a hybrid of NearestCompletion with MostPopularCompletion. HybridCompletion is shown to dominate both NearestCompletion and MostPopularCompletion, achieving a total improvement of 31.5% in MRR relative to MostPopularCompletion on average.
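
The abstract above ranks completions of a prefix by cosine similarity between term-weighted vectors of the candidates and of the user's recent queries. A minimal sketch of that ranking step follows. It assumes plain bag-of-words term counts in place of the paper's recommendation-tree query expansion, and the function name `nearest_completions` and the sample queries are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term counts (a simplification of the paper's expanded vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_completions(prefix, context_queries, candidates):
    """Completions of `prefix`, most similar to the context first."""
    context = Counter()
    for q in context_queries:
        context.update(to_vector(q))
    matches = [c for c in candidates if c.startswith(prefix)]
    # sorted() is stable, so candidates with equal similarity keep their input order.
    return sorted(matches, key=lambda c: -cosine(to_vector(c), context))


# Hypothetical session: the user recently searched for a Python tutorial.
completions = nearest_completions(
    "j",
    ["python tutorial"],
    ["jaguar car", "jquery", "java tutorial", "python"],
)
print(completions)
```

With an unrelated context every candidate scores zero and the order carries no signal, which is the failure case the abstract addresses by mixing in popularity.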

Proceedings ArticleDOI
06 Jun 2011
TL;DR: This is the first analysis to show an algorithm which breaks the natural $1 - 1/e$ 'barrier' in the unknown distribution model (the authors' analysis in fact works in the stricter, random order model) and answers an open question in [GM08].
Abstract: We consider the online bipartite matching problem in the unknown distribution input model. We show that the Ranking algorithm of [KVV90] achieves a competitive ratio of at least 0.653. This is the first analysis to show an algorithm which breaks the natural $1 - 1/e$ 'barrier' in the unknown distribution model (our analysis in fact works in the stricter, random order model) and answers an open question in [GM08]. We also describe a family of graphs on which Ranking does no better than 0.727 in the random order model. Finally, we show that for graphs which have $k > 1$ disjoint perfect matchings, Ranking achieves a competitive ratio of at least $1 - \sqrt{1/k - 1/k^2 + 1/n}$; in particular, Ranking achieves a factor of $1 - o(1)$ for graphs with $\omega(1)$ disjoint perfect matchings.
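
For readers unfamiliar with the Ranking algorithm of [KVV90] analysed above, the following sketch shows its standard form: fix one random permutation of the offline vertices up front, then match each arriving online vertex to its highest-ranked unmatched neighbour. The function name, argument layout, and toy graph are illustrative assumptions.

```python
import random


def ranking_matching(offline, arrivals, seed=None):
    """Online bipartite matching with the Ranking rule.

    offline:  list of offline vertices, known in advance.
    arrivals: list of (online_vertex, neighbours) in arrival order.
    Returns a dict mapping matched offline vertex -> online vertex.
    """
    rng = random.Random(seed)
    order = list(offline)
    rng.shuffle(order)  # one random permutation, fixed for the whole run
    rank = {v: i for i, v in enumerate(order)}

    matched = {}
    for u, neighbours in arrivals:
        free = [v for v in neighbours if v in rank and v not in matched]
        if free:
            best = min(free, key=rank.__getitem__)  # highest-ranked free neighbour
            matched[best] = u
    return matched


# Toy instance (hypothetical): x can use a or b, y can only use a.
arrivals = [("x", ["a", "b"]), ("y", ["a"])]
sizes = [len(ranking_matching(["a", "b"], arrivals, seed=s)) for s in range(20)]
print(sizes)
```

On this instance the matching has size 2 when the permutation ranks `b` above `a` and size 1 otherwise, which illustrates why the guarantee is stated as an expected competitive ratio.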

Proceedings ArticleDOI
24 Jul 2011
TL;DR: A novel cascade ranking model is formulated and developed, which unlike previous approaches, can simultaneously improve both top k ranked effectiveness and retrieval efficiency and a novel boosting algorithm is presented for learning such cascades to directly optimize the tradeoff between effectiveness and efficiency.
Abstract: There is a fundamental tradeoff between effectiveness and efficiency when designing retrieval models for large-scale document collections. Effectiveness tends to derive from sophisticated ranking functions, such as those constructed using learning to rank, while efficiency gains tend to arise from improvements in query evaluation and caching strategies. Given their inherently disjoint nature, it is difficult to jointly optimize effectiveness and efficiency in end-to-end systems. To address this problem, we formulate and develop a novel cascade ranking model, which unlike previous approaches, can simultaneously improve both top k ranked effectiveness and retrieval efficiency. The model constructs a cascade of increasingly complex ranking functions that progressively prunes and refines the set of candidate documents to minimize retrieval latency and maximize result set quality. We present a novel boosting algorithm for learning such cascades to directly optimize the tradeoff between effectiveness and efficiency. Experimental results show that our cascades are faster and return higher quality results than comparable ranking models.
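
The cascade described in the abstract above can be pictured as a sequence of stages, each pairing a ranking function with a pruning cutoff, where later stages are costlier but see fewer documents. The sketch below hard-codes two stages for illustration; the paper learns the cascade with a boosting algorithm, which is not reproduced here, and the feature names `bm25` and `quality` are hypothetical.

```python
def cascade_rank(docs, stages):
    """Run docs through (score_fn, keep) stages, pruning after each one.

    Each stage sorts the surviving candidates by its own score function
    and keeps only the top `keep` for the next, more expensive, stage.
    """
    candidates = list(docs)
    for score_fn, keep in stages:
        candidates = sorted(candidates, key=score_fn, reverse=True)[:keep]
    return candidates


# Hypothetical documents with a cheap feature and a costly one.
docs = [
    {"id": 1, "bm25": 3.0, "quality": 0.2},
    {"id": 2, "bm25": 2.0, "quality": 0.9},
    {"id": 3, "bm25": 1.0, "quality": 1.0},
    {"id": 4, "bm25": 2.5, "quality": 0.5},
]

stages = [
    (lambda d: d["bm25"], 3),                 # cheap first pass, keep 3
    (lambda d: d["bm25"] * d["quality"], 2),  # costlier re-scoring, keep 2
]
final_ids = [d["id"] for d in cascade_rank(docs, stages)]
print(final_ids)
```

Document 3 is pruned by the cheap stage even though it has the highest `quality`, which shows the effectiveness and efficiency tradeoff that the learned cascade is meant to balance.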

Proceedings Article
19 Jun 2011
TL;DR: This work proposes a method for automatically labelling topics learned via LDA topic models using a combination of association measures and lexical features, optionally fed into a supervised ranking model.
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

Proceedings ArticleDOI
09 Feb 2011
TL;DR: This work presents a personalization approach that builds a user interest profile using users' complete browsing behavior, then uses this model to rerank web results, and shows that using a combination of content and previously visited websites provides effective personalization.
Abstract: Personalizing web search results has long been recognized as an avenue to greatly improve the search experience. We present a personalization approach that builds a user interest profile using users' complete browsing behavior, then uses this model to rerank web results. We show that using a combination of content and previously visited websites provides effective personalization. We extend previous work by proposing a number of techniques for filtering previously viewed content that greatly improve the user model used for personalization. Our approaches are compared to previous work in offline experiments and are evaluated against unpersonalized web search in large scale online tests. Large improvements are found in both cases.

Proceedings ArticleDOI
24 Oct 2011
TL;DR: This paper proposes a personalized tweet ranking method, leveraging the use of retweet behavior, to bring more important tweets forward, and investigates how to determine the audience of tweets more effectively, by ranking the users based on their likelihood of retweeting the tweets.
Abstract: The increasing volume of streaming data on microblogs has re-introduced the necessity of effective filtering mechanisms for such media. Microblog users are overwhelmed with mostly uninteresting pieces of text in order to access information of value. In this paper, we propose a personalized tweet ranking method, leveraging the use of retweet behavior, to bring more important tweets forward. In addition, we also investigate how to determine the audience of tweets more effectively, by ranking the users based on their likelihood of retweeting the tweets. Finally, conducting a pilot user study, we analyze how retweet likelihood correlates with the interestingness of the tweets.

Journal ArticleDOI
TL;DR: Results demonstrated that the proposed interactive 3-D object retrieval scheme not only significantly speeds up the retrieval process but also achieves encouraging retrieval performance.
Abstract: The explosive growth in the number of 3-D objects makes efficient retrieval technology highly desirable. Extensive research efforts have been dedicated to view-based 3-D object retrieval for its advantage of using 2-D views to represent 3-D objects. In this paradigm, retrieval is typically accomplished by matching the views of the query object with the objects in the database. However, using all the query views may not only make rapid retrieval difficult but also degrade retrieval accuracy when there is a mismatch between the query views and the object views in the database. In this work, we propose an interactive 3-D object retrieval scheme. Given a set of query views, we first perform clustering to obtain several candidates. We then incrementally select query views for object matching: in each round of relevance feedback, we only add the query view that is judged to be the most informative one based on the labeling information. In addition, we also propose an efficient approach to learn a distance metric for the newly selected query view and the weights for combining all of the selected query views. We conduct experiments on the National Taiwan University 3D Model database, the ETH 3D object collection, and the Shape Retrieval Contest of Non-Rigid 3D Models, and the results demonstrate that our approach not only significantly speeds up the retrieval process but also achieves encouraging retrieval performance.

31 Jul 2011
TL;DR: This paper presents a novel approach for automatic detection of semantic change of words based on distributional similarity models and shows that the method obtains good results with respect to a reference ranking produced by human raters.
Abstract: This paper presents a novel approach for automatic detection of semantic change of words based on distributional similarity models. We show that the method obtains good results with respect to a reference ranking produced by human raters. The evaluation also analyzes the performance of frequency-based methods, comparing them to the similarity method proposed.

Journal ArticleDOI
TL;DR: This paper proposes a novel method, Navigation-Pattern-based Relevance Feedback (NPRF), to achieve the high efficiency and effectiveness of CBIR in coping with the large-scale image data and reveals that NPRF outperforms other existing methods significantly in terms of precision, coverage, and number of feedbacks.
Abstract: Nowadays, content-based image retrieval (CBIR) is the mainstay of image retrieval systems. To improve its results, relevance feedback techniques have been incorporated into CBIR so that more precise results can be obtained by taking user feedback into account. However, existing relevance feedback-based CBIR methods usually require a number of feedback iterations to produce refined search results, especially in a large-scale image database. This is impractical and inefficient in real applications. In this paper, we propose a novel method, Navigation-Pattern-based Relevance Feedback (NPRF), to achieve high efficiency and effectiveness of CBIR in coping with large-scale image data. In terms of efficiency, the iterations of feedback are reduced substantially by using the navigation patterns discovered from the user query log. In terms of effectiveness, our proposed search algorithm NPRFSearch makes use of the discovered navigation patterns and three kinds of query refinement strategies, Query Point Movement (QPM), Query Reweighting (QR), and Query Expansion (QEX), to converge the search space toward the user's intention effectively. With the NPRF method, high-quality image retrieval with relevance feedback can be achieved in a small number of feedback rounds. The experimental results reveal that NPRF outperforms other existing methods significantly in terms of precision, coverage, and number of feedback rounds.

Proceedings ArticleDOI
24 Oct 2011
TL;DR: It is shown how reading level can provide a valuable new relevance signal for both general and personalized Web search, and models and algorithms are described to address the three key problems in improving relevance for search using reading difficulty.
Abstract: Traditionally, search engines have ignored the reading difficulty of documents and the reading proficiency of users in computing a document ranking. This is one reason why Web search engines do a poor job of serving an important segment of the population: children. While there are many important problems in interface design, content filtering, and results presentation related to addressing children's search needs, perhaps the most fundamental challenge is simply that of providing relevant results at the right level of reading difficulty. At the opposite end of the proficiency spectrum, it may also be valuable for technical users to find more advanced material or to filter out material at lower levels of difficulty, such as tutorials and introductory texts. We show how reading level can provide a valuable new relevance signal for both general and personalized Web search. We describe models and algorithms to address the three key problems in improving relevance for search using reading difficulty: estimating user proficiency, estimating result difficulty, and re-ranking based on the difference between user and result reading level profiles. We evaluate our methods on a large volume of Web query traffic and provide a large-scale log analysis that highlights the importance of finding results at an appropriate reading level for the user.
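
The third problem named in the abstract above, re-ranking by the difference between user and result reading levels, can be sketched very simply. The version below assumes each result and the user are summarised by a single grade-level number and subtracts a linear penalty from the original relevance score. The paper works with reading level profiles and estimated proficiency, so the scalar levels, the `penalty` weight, and the sample results are all illustrative assumptions.

```python
def rerank_by_reading_level(results, user_level, penalty=0.1):
    """Re-rank results, penalising distance from the user's reading level.

    results: list of dicts with "relevance" (higher is better) and
             "level" (estimated reading grade level of the page).
    penalty: relevance lost per grade level of mismatch (assumed weight).
    """
    def adjusted(result):
        return result["relevance"] - penalty * abs(result["level"] - user_level)

    return sorted(results, key=adjusted, reverse=True)


# Hypothetical results for the same query.
results = [
    {"url": "advanced-survey", "relevance": 0.9, "level": 12},
    {"url": "intro-tutorial", "relevance": 0.8, "level": 4},
]
for_child = [r["url"] for r in rerank_by_reading_level(results, user_level=4)]
for_expert = [r["url"] for r in rerank_by_reading_level(results, user_level=12)]
print(for_child, for_expert)
```

The same result list is ordered differently for the two users, matching the abstract's point that both children and technical readers benefit from a level-aware signal.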

Journal ArticleDOI
TL;DR: This work shows that it is possible to exploit existing large collections of question–answer pairs to extract such features and train ranking models which combine them effectively, providing some of the most compelling evidence to date that complex linguistic features such as word senses and semantic roles can have a significant impact on large-scale information retrieval tasks.
Abstract: This work investigates the use of linguistically motivated features to improve search, in particular for ranking answers to non-factoid questions. We show that it is possible to exploit existing large collections of question-answer pairs (from online social Question Answering sites) to extract such features and train ranking models which combine them effectively. We investigate a wide range of feature types, some exploiting natural language processing such as coarse word sense disambiguation, named-entity identification, syntactic parsing, and semantic role labeling. Our experiments demonstrate that linguistic features, in combination, yield considerable improvements in accuracy. Depending on the system settings we measure relative improvements of 14% to 21% in Mean Reciprocal Rank and Precision@1, providing some of the most compelling evidence to date that complex linguistic features such as word senses and semantic roles can have a significant impact on large-scale information retrieval tasks.
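
The abstract above reports its gains in Mean Reciprocal Rank (MRR), a metric that also appears in the query auto completion entry. As a reminder of how such numbers are computed, the sketch below implements MRR together with precision at rank 1, a metric commonly reported alongside it. The `runs` data is invented for illustration and has no connection to the paper's experiments.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is returned."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def precision_at_1(runs):
    """Fraction of queries whose top-ranked item is relevant."""
    hits = sum(1 for ranked, rel in runs if ranked and ranked[0] in rel)
    return hits / len(runs)


# Hypothetical evaluation: three questions with ranked candidate answers.
runs = [
    (["a", "b", "c"], {"a"}),  # correct answer ranked first
    (["a", "b", "c"], {"c"}),  # correct answer ranked third
    (["a", "b"], {"z"}),       # correct answer never returned
]
mrr = mean_reciprocal_rank(runs)
p_at_1 = precision_at_1(runs)
print(round(mrr, 4), round(p_at_1, 4))
```

MRR gives partial credit for a correct answer below the top position, whereas precision at rank 1 counts only exact top hits, so the two can move differently when a ranker mostly fixes near misses.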

Proceedings ArticleDOI
01 Oct 2011
TL;DR: The study indicates that both the topical content of information sources and social network structure affect source credibility, and designs a novel method of automatically identifying and ranking social network users according to their relevance and expertise for a given topic.
Abstract: A task of primary importance for social network users is to decide whose updates to subscribe to in order to maximize the relevance, credibility, and quality of the information received. To address this problem, we conducted an experiment designed to measure the extent to which different factors in online social networks affect both explicit and implicit judgments of credibility. The results of the study indicate that both the topical content of information sources and social network structure affect source credibility. Based on these results, we designed a novel method of automatically identifying and ranking social network users according to their relevance and expertise for a given topic. We performed empirical studies to compare a variety of alternative ranking algorithms and a proprietary service provided by a commercial website specifically designed for the same purpose. Our findings show a great potential for automatically identifying and ranking credible users for any given topic.

Patent
02 Jun 2011
TL;DR: In this paper, a computer-implemented method is described to generate a local result set and one or more non-local result sets for a search query, and determine a display location for the local result set relative to the non-local result sets based on the position of the search query in a local relevance indicium.
Abstract: A computer-implemented method is disclosed. The method includes receiving from a remote device a search query, generating a local result set and one or more non-local result sets for the search query, and determining a display location for the local result set relative to the non-local result sets based on a position of the search query in a local relevance indicium.

Proceedings ArticleDOI
24 Jul 2011
TL;DR: This work proposes a probabilistic mechanism for generating query suggestions from the corpus without using query logs and utilizes the document corpus to extract a set of candidate phrases that are highly correlated with the partial user query.
Abstract: After an end-user has partially input a query, intelligent search engines can suggest possible completions of the partial query to help end-users quickly express their information needs. All major web-search engines and most proposed methods that suggest queries rely on search engine query logs to determine possible query suggestions. However, for customized search systems in the enterprise domain, intranet search, or personalized search such as email or desktop search, or for infrequent queries, query logs are either not available or the user base and the number of past user queries are too small to learn appropriate models. We propose a probabilistic mechanism for generating query suggestions from the corpus without using query logs. We utilize the document corpus to extract a set of candidate phrases. As soon as a user starts typing a query, phrases that are highly correlated with the partial user query are selected as completions of the partial query and are offered as query suggestions. Our proposed approach is tested on a variety of datasets and is compared with state-of-the-art approaches. The experimental results clearly demonstrate the effectiveness of our approach in suggesting queries with higher quality.
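The idea of completing a partial query from corpus phrases rather than query logs can be illustrated with a minimal sketch. This is not the probabilistic model from the paper: it simply counts word n-grams in a toy corpus and returns the most frequent phrases that extend the typed prefix. The function names `extract_phrases` and `suggest`, the n-gram length limit, and the frequency-based scoring are all assumptions made for illustration.

```python
from collections import Counter


def extract_phrases(docs, max_len=3):
    """Count every word n-gram (1..max_len words) found in the corpus."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def suggest(partial, phrase_counts, k=3):
    """Return up to k corpus phrases that extend the partial query.

    Candidates are ordered by corpus frequency (a stand-in for the
    correlation score described in the abstract), then alphabetically.
    """
    partial = partial.lower()
    matches = [
        (phrase, count)
        for phrase, count in phrase_counts.items()
        if phrase.startswith(partial) and phrase != partial
    ]
    matches.sort(key=lambda pc: (-pc[1], pc[0]))
    return [phrase for phrase, _ in matches[:k]]


docs = [
    "ranking social network users",
    "ranking search results",
    "ranking search engines",
]
phrase_counts = extract_phrases(docs)
suggestions = suggest("ranking s", phrase_counts)
```

Because the candidates come only from the documents themselves, a sketch like this works even when no past queries exist, which is the setting the abstract targets.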

Proceedings ArticleDOI
09 Feb 2011
TL;DR: This paper explores how queries, their associated documents, and the query intent change over the course of 10 weeks by analyzing query log data, a daily Web crawl, and periodic human relevance judgments, and identifies several interesting features by which changes to query popularity can be classified.
Abstract: Web search is strongly influenced by time. The queries people issue change over time, with some queries occasionally spiking in popularity (e.g., earthquake) and others remaining relatively constant (e.g., youtube). The documents indexed by the search engine also change, with some documents always being about a particular query (e.g., the Wikipedia page on earthquakes is about the query earthquake) and others being about the query only at a particular point in time (e.g., the New York Times is only about earthquakes following a major seismic activity). The relationship between documents and queries can also change as people's intent changes (e.g., people sought different content for the query earthquake before the Haitian earthquake than they did after). In this paper, we explore how queries, their associated documents, and the query intent change over the course of 10 weeks by analyzing query log data, a daily Web crawl, and periodic human relevance judgments. We identify several interesting features by which changes to query popularity can be classified, and show that presence of these features, when accompanied by changes in result content, can be a good indicator of change in query intent.

Proceedings ArticleDOI
11 Apr 2011
TL;DR: A general framework for evaluation and optimization of methods for diversifying query results is described, and the first thorough experimental evaluation of the various diversification techniques implemented in a common framework is presented.
Abstract: In this paper we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial ranking candidate set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query, and, at the same time, the result set should be as diverse as possible. While addressing relevance is relatively simple and has been heavily studied, diversity is a harder problem to solve. One major contribution of this paper is that, using the above framework, we adapt, implement and evaluate several existing methods for diversifying query results. We also propose two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Another major contribution of this paper is that we present the first thorough experimental evaluation of the various diversification techniques implemented in a common framework. We examine the methods' performance with respect to precision, running time and quality of the result. Our experimental results show that while the proposed methods have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. While GMC is deterministic, the randomized approach (GNE) can achieve better result quality if the user is willing to trade off running time.
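The trade-off between relevance and diversity described above can be sketched with a simple greedy selection. This is neither GMC nor GNE, whose details the abstract does not give; it is a generic greedy heuristic in the style of maximal marginal relevance, shown only to make the relevance/diversity tension concrete. The function name `greedy_diversify`, the `trade_off` parameter, and the toy similarity values are assumptions.

```python
def greedy_diversify(relevance, similarity, k, trade_off=0.5):
    """Greedily pick k items, balancing relevance against redundancy.

    relevance:  dict mapping item -> relevance score for the query
    similarity: function (item, item) -> similarity in [0, 1]
    trade_off:  1.0 ranks purely by relevance; lower values penalize
                items that resemble something already selected
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:

        def score(item):
            # Redundancy is the similarity to the closest selected item.
            redundancy = max(
                (similarity(item, chosen) for chosen in selected), default=0.0
            )
            return trade_off * relevance[item] - (1 - trade_off) * redundancy

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: "a" and "b" are near-duplicates, "c" is different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_similarity = {frozenset(("a", "b")): 0.95}


def similarity(x, y):
    return pair_similarity.get(frozenset((x, y)), 0.1)


diverse = greedy_diversify(relevance, similarity, k=2, trade_off=0.5)
relevance_only = greedy_diversify(relevance, similarity, k=2, trade_off=1.0)
```

With `trade_off=0.5` the near-duplicate "b" is skipped in favor of the less relevant but different "c"; with `trade_off=1.0` the selection reduces to plain relevance ranking.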

Proceedings ArticleDOI
Vidit Jain, Manik Varma
28 Mar 2011
TL;DR: This paper hypothesizes that images clicked in response to a query are mostly relevant to the query, and re-ranks the original search results so as to promote images that are likely to be clicked to the top of the ranked list.
Abstract: Our objective is to improve the performance of keyword based image search engines by re-ranking their original results. To this end, we address three limitations of existing search engines in this paper. First, there is no straightforward, fully automated way of going from textual queries to visual features. Image search engines therefore primarily rely on static and textual features for ranking. Visual features are mainly used for secondary tasks such as finding similar images. Second, image rankers are trained on query-image pairs labeled with relevance judgments determined by human experts. Such labels are well known to be noisy due to various factors including ambiguous queries, unknown user intent and subjectivity in human judgments. This leads to learning a sub-optimal ranker. Finally, a static ranker is typically built to handle disparate user queries. The ranker is therefore unable to adapt its parameters to suit the query at hand which again leads to sub-optimal results. We demonstrate that all of these problems can be mitigated by employing a re-ranking algorithm that leverages aggregate user click data. We hypothesize that images clicked in response to a query are mostly relevant to the query. We therefore re-rank the original search results so as to promote images that are likely to be clicked to the top of the ranked list. Our re-ranking algorithm employs Gaussian Process regression to predict the normalized click count for each image, and combines it with the original ranking score. Our approach is shown to significantly boost the performance of the Bing image search engine on a wide range of tail queries.
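The final step, combining a predicted click score with the original ranking score, can be sketched as follows. The abstract does not say how the two scores are combined, so the linear interpolation below, the `weight` parameter, and the function name `rerank` are assumptions. The predicted click values are supplied directly here; in the paper they come from Gaussian Process regression, which this sketch does not implement.

```python
def rerank(original_scores, predicted_clicks, weight=0.5):
    """Re-rank images by mixing the original score with a click prediction.

    original_scores:  dict image -> original ranking score in [0, 1]
    predicted_clicks: dict image -> predicted normalized click count in [0, 1]
                      (images without a prediction are treated as 0.0)
    weight:           share of the final score given to the click prediction
    """
    combined = {
        image: (1 - weight) * score + weight * predicted_clicks.get(image, 0.0)
        for image, score in original_scores.items()
    }
    # Highest combined score first; ties broken by image identifier.
    return sorted(combined, key=lambda image: (-combined[image], image))


original_scores = {"img1": 0.9, "img2": 0.6, "img3": 0.3}
predicted_clicks = {"img2": 1.0, "img3": 0.2}

reranked = rerank(original_scores, predicted_clicks, weight=0.5)
unchanged = rerank(original_scores, predicted_clicks, weight=0.0)
```

Setting `weight=0.0` reproduces the original ranking, so the click signal can be introduced gradually.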

Proceedings ArticleDOI
09 Feb 2011
TL;DR: This paper presents the quality-biased ranking method that promotes documents containing high-quality content, and penalizes low-quality documents, and consistently improves the retrieval performance of text-based and link-based retrieval methods that do not take into account the quality of the document content.
Abstract: Many existing retrieval approaches do not take into account the content quality of the retrieved documents, although link-based measures such as PageRank are commonly used as a form of document prior. In this paper, we present the quality-biased ranking method that promotes documents containing high-quality content, and penalizes low-quality documents. The quality of the document content can be determined by its readability, layout and ease-of-navigation, among other factors. Accordingly, instead of using a single estimate for document quality, we consider multiple content-based features that are directly integrated into a state-of-the-art retrieval method. These content-based features are easy to compute, store and retrieve, even for large web collections. We use several query sets and web collections to empirically evaluate the performance of our quality-biased retrieval method. In each case, our method consistently improves by a large margin the retrieval performance of text-based and link-based retrieval methods that do not take into account the quality of the document content.

Patent
05 Aug 2011
TL;DR: A search query is evaluated to determine whether it is the type of query that a user might want to ask a friend; if so, the search engine may examine a social graph to determine which friends of the user who entered the query may have information relevant to answering it.
Abstract: Search results may include both objective results and person results. In one example, a search query is evaluated to determine whether it is the type of query that a user might want to ask a friend. If the query is of such a type, then the search engine may examine a social graph to determine which friends of the user who entered the query may have information that is relevant to answering the query. If such friends exist, then the friends may be displayed alongside the objective search results, together with an explanation of each friend's relevance to the query. Clicking on a person in the results may cause a conversation to be initiated with that person, thereby allowing the user who entered the query to ask his or her friend about the subject of the query.