
Showing papers on "Ranking (information retrieval)" published in 2010


Patent
14 Sep 2010
TL;DR: An improved human-computer interface system is proposed in which a user characteristic or set of characteristics, such as demographic profile or societal role, is employed to define a scope or domain of operation; user privacy and anonymity are maintained by physical and algorithmic controls over access to the personal profiles and by releasing only aggregate data without personally identifying information.
Abstract: An improved human user computer interface system, wherein a user characteristic or set of characteristics, such as demographic profile or societal “role”, is employed to define a scope or domain of operation. The operation itself may be a database search, to interactively define a taxonomic context for the operation, a business negotiation, or other activity. After retrieval of results, a scoring or ranking may be applied according to user-defined criteria, which are, for example, commensurate with the relevance to the context, but may be, for example, by date, source, or other secondary criteria. A user profile is preferably stored in a computer accessible form, and may be used to provide a history of use, persistent customization, collaborative filtering and demographic information for the user. Advantageously, user privacy and anonymity are maintained by physical and algorithmic controls over access to the personal profiles, and by releasing only aggregate data without personally identifying information or data on small groups.

1,465 citations


23 Jun 2010
TL;DR: RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems and the details are spread across several papers and reports, so here is a self-contained, detailed and complete description of them.
Abstract: LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them.

1,114 citations
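The lambda-gradient trick at the heart of LambdaRank and LambdaMART is compact enough to sketch: compute RankNet-style pairwise gradients, but weight each by the |ΔNDCG| that swapping the two documents would cause. The following is a minimal illustrative version for a single query, not the papers' exact notation; the sigmoid parameter and toy inputs are assumptions.

```python
import math

def dcg_gain(label, rank):
    """Gain of a document with the given relevance label at a 1-based rank."""
    return (2 ** label - 1) / math.log2(rank + 1)

def lambdas_for_query(scores, labels, sigma=1.0):
    """Pairwise LambdaRank gradients for one query (illustrative sketch)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = {doc: r + 1 for r, doc in enumerate(order)}  # current 1-based ranks
    ideal = sorted(labels, reverse=True)
    idcg = sum(dcg_gain(l, r + 1) for r, l in enumerate(ideal)) or 1.0
    lam = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] <= labels[j]:
                continue  # only pairs where i is more relevant than j
            # |delta NDCG| if documents i and j swapped rank positions
            delta = abs(dcg_gain(labels[i], rank[i]) + dcg_gain(labels[j], rank[j])
                        - dcg_gain(labels[i], rank[j]) - dcg_gain(labels[j], rank[i])) / idcg
            # RankNet gradient of the pairwise logistic loss, scaled by |delta NDCG|
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
            lam[i] += sigma * rho * delta
            lam[j] -= sigma * rho * delta
    return lam

print(lambdas_for_query(scores=[0.2, 1.3, 0.4], labels=[2, 0, 1]))
```

In LambdaMART, each boosting iteration fits a regression tree to these per-document lambdas.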


Journal Article
TL;DR: OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost; its results suggest that query-independent similarity can be accurately learned even for large-scale data sets that could not be handled before.
Abstract: Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large data sets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a data set with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, data sets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale data set, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale data sets that could not be handled before.

738 citations
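The heart of OASIS is a passive-aggressive update on image triplets under the bilinear similarity S_W(p, q) = pᵀWq. Below is a minimal dense-vector sketch of that single step; the aggressiveness constant C and the random toy vectors are placeholders (the paper works with sparse representations at much larger scale).

```python
import numpy as np

def oasis_update(W, p, p_pos, p_neg, C=0.1):
    """One passive-aggressive OASIS step on a triplet (p more similar to p_pos)."""
    # Hinge loss of the large-margin constraint S_W(p, p+) >= S_W(p, p-) + 1
    loss = max(0.0, 1.0 - p @ W @ p_pos + p @ W @ p_neg)
    if loss == 0.0:
        return W  # passive: constraint already satisfied
    V = np.outer(p, p_pos - p_neg)                  # gradient of the margin term
    tau = min(C, loss / (np.linalg.norm(V) ** 2))   # aggressive but bounded step
    return W + tau * V

rng = np.random.default_rng(0)
d = 5
W = np.eye(d)                         # OASIS initializes W to the identity
p, p_pos, p_neg = rng.normal(size=(3, d))
W = oasis_update(W, p, p_pos, p_neg)
print(W.round(3))
```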


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work converts the person re-identification problem from an absolute scoring problem to a relative ranking problem and develops a novel Ensemble RankSVM to overcome the scalability limitation suffered by existing SVM-based ranking methods.

Abstract: Solving the person re-identification problem involves matching observations of individuals across disjoint camera views. The problem becomes particularly hard in a busy public scene as the number of possible matches is very high. This is further compounded by significant appearance changes due to varying lighting conditions, viewing angles and body poses across camera views. To address this problem, existing approaches focus on extracting or learning discriminative features followed by template matching using a distance measure. The novelty of this work is that we reformulate the person re-identification problem as a ranking problem and learn a subspace where the potential true match is given the highest ranking, rather than using any direct distance measure. By doing so, we convert the person re-identification problem from an absolute scoring problem to a relative ranking problem. We further develop a novel Ensemble RankSVM to overcome the scalability limitation suffered by existing SVM-based ranking methods. This new model significantly reduces memory usage and is therefore much more scalable, whilst maintaining high-level performance. We present extensive experiments to demonstrate the performance gain of the proposed ranking approach over existing template matching and classification models.

736 citations
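The shift from absolute scoring to relative ranking can be made concrete with a plain RankSVM: learn a linear scoring direction from difference vectors so that a probe's true match outranks wrong candidates. This is a generic sketch of the underlying RankSVM reduction, not the paper's Ensemble RankSVM; scikit-learn, the absolute-difference features, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
d = 16
# Toy re-id setup: for each probe, one true match and several wrong candidates.
probes = rng.normal(size=(20, d))
true_matches = probes + 0.1 * rng.normal(size=(20, d))   # similar appearance
wrong = rng.normal(size=(20, 5, d))                       # unrelated people

# RankSVM reduction: classify difference vectors of (relevant, irrelevant) pairs.
X, y = [], []
for i in range(len(probes)):
    for j in range(wrong.shape[1]):
        diff = np.abs(probes[i] - true_matches[i]) - np.abs(probes[i] - wrong[i, j])
        X.append(diff); y.append(-1)   # true match should have *smaller* weighted distance
        X.append(-diff); y.append(+1)
w = LinearSVC(C=1.0).fit(np.array(X), np.array(y)).coef_[0]

def rank_candidates(probe, candidates):
    """Rank candidate indices by the learned scoring direction (best first)."""
    scores = -np.abs(probe - candidates) @ w   # higher score = better match
    return np.argsort(-scores)

cands = np.vstack([true_matches[0:1], wrong[0]])
print(rank_candidates(probes[0], cands))  # where index 0 lands shows the true match
```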


Journal ArticleDOI
Tao Qin, Tie-Yan Liu, Jun Xu, Hang Li
TL;DR: The details of the LETOR collection are described and it is shown how it can be used in different kinds of research; several state-of-the-art learning to rank algorithms are compared on LETOR.

Abstract: LETOR is a benchmark collection for research on learning to rank for information retrieval, released by Microsoft Research Asia. In this paper, we describe the details of the LETOR collection and show how it can be used in different kinds of research. Specifically, we describe how the document corpora and query sets in LETOR are selected, how the documents are sampled, how the learning features and meta information are extracted, and how the datasets are partitioned for comprehensive evaluation. We then compare several state-of-the-art learning to rank algorithms on LETOR, report their ranking performances, and discuss the results. After that, we discuss possible new research topics that can be supported by LETOR, in addition to algorithm comparison. We hope that this paper can help people gain a deeper understanding of LETOR, and enable more interesting research projects on learning to rank and related topics.

486 citations
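LETOR distributes its datasets in an SVMlight-style format: one line per query-document pair carrying a relevance label, a qid, numbered feature values, and a trailing comment. A minimal parser is sketched below; the sample line is invented for illustration.

```python
def parse_letor_line(line):
    """Parse one 'label qid:Q f:v ... # comment' line of a LETOR-style file."""
    body, _, comment = line.partition('#')
    tokens = body.split()
    label = int(tokens[0])
    qid = tokens[1].split(':')[1]
    features = {int(k): float(v) for k, v in
                (tok.split(':') for tok in tokens[2:])}
    return label, qid, features, comment.strip()

sample = "2 qid:10 1:0.031310 2:0.666667 3:0.500000 #docid = GX057-59-4044939"
print(parse_letor_line(sample))
```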


Proceedings ArticleDOI
26 Apr 2010
TL;DR: A novel probabilistic framework for Web search result diversification is introduced that explicitly accounts for the various aspects associated with an underspecified query; a document ranking is diversified by estimating how well a given document satisfies each uncovered aspect and the extent to which different aspects are satisfied by the ranking as a whole.

Abstract: When a Web user's underlying information need is not clearly specified from the initial query, an effective approach is to diversify the results retrieved for this query. In this paper, we introduce a novel probabilistic framework for Web search result diversification, which explicitly accounts for the various aspects associated with an underspecified query. In particular, we diversify a document ranking by estimating how well a given document satisfies each uncovered aspect and the extent to which different aspects are satisfied by the ranking as a whole. We thoroughly evaluate our framework in the context of the diversity task of the TREC 2009 Web track. Moreover, we exploit query reformulations provided by three major Web search engines (WSEs) as a means to uncover different query aspects. The results attest the effectiveness of our framework when compared to state-of-the-art diversification approaches in the literature. Additionally, by simulating an upper-bound query reformulation mechanism from official TREC data, we draw useful insights regarding the effectiveness of the query reformulations generated by the different WSEs in promoting diversity.

464 citations
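The greedy re-ranking this framework implies can be sketched compactly: at each step, select the document that best trades off relevance to the query against coverage of aspects not yet satisfied by the documents already chosen. The probabilities below are placeholder inputs (in the paper they come from retrieval scores and query reformulations), and the interpolation weight lam is illustrative.

```python
def diversify(rel, cov, aspect_prob, k, lam=0.5):
    """Greedy aspect-based diversification (illustrative sketch).

    rel[d]         -- P(d|q), relevance of document d to the query
    cov[d][a]      -- P(d|a), how well d satisfies aspect a
    aspect_prob[a] -- P(a|q), importance of aspect a for the query
    """
    selected, remaining = [], set(rel)
    # not_covered[a]: probability aspect a is still unsatisfied by `selected`
    not_covered = dict(aspect_prob)
    while remaining and len(selected) < k:
        def score(d):
            diversity = sum(not_covered[a] * cov[d][a] for a in aspect_prob)
            return (1 - lam) * rel[d] + lam * diversity
        best = max(remaining, key=score)
        selected.append(best); remaining.discard(best)
        for a in aspect_prob:  # aspect a stays uncovered only if best misses it
            not_covered[a] *= (1 - cov[best][a])
    return selected

rel = {'d1': 0.9, 'd2': 0.8, 'd3': 0.6}
cov = {'d1': {'a1': 0.9, 'a2': 0.0},
       'd2': {'a1': 0.8, 'a2': 0.1},
       'd3': {'a1': 0.0, 'a2': 0.9}}
print(diversify(rel, cov, {'a1': 0.6, 'a2': 0.4}, k=3))
```

In the toy run, d3 is picked ahead of the more relevant d2 because it covers the aspect the first pick left unsatisfied.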


Proceedings Article
15 Jul 2010
TL;DR: The participating systems were evaluated by matching their extracted keyphrases against manually assigned ones and the overall ranking of the submitted systems is presented.
Abstract: This paper describes Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010). Systems are to automatically assign keyphrases or keywords to given scientific articles. The participating systems were evaluated by matching their extracted keyphrases against manually assigned ones. We present the overall ranking of the submitted systems and discuss our findings to suggest future directions for this task.

413 citations
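Matching extracted keyphrases against manually assigned ones reduces to set overlap at a cutoff. A minimal scorer of that kind is below; stemming and micro-averaging over a document collection, which a full evaluation would include, are omitted.

```python
def prf_at_k(extracted, gold, k):
    """Precision/recall/F1 of the top-k extracted keyphrases vs. a gold set."""
    top = [kp.lower() for kp in extracted[:k]]
    gold = {kp.lower() for kp in gold}
    tp = sum(1 for kp in top if kp in gold)   # exact matches among the top k
    p = tp / k if k else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

extracted = ["ranking", "information retrieval", "keyphrase extraction", "semantics"]
gold = ["keyphrase extraction", "ranking", "evaluation"]
print(prf_at_k(extracted, gold, k=4))  # (0.5, 0.666..., ~0.571)
```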


Proceedings Article
21 Jun 2010
TL;DR: A general metric learning algorithm is presented, based on the structural SVM framework, to learn a metric such that rankings of data induced by distance from a query can be optimized against various ranking measures, such as AUC, Precision-at-k, MRR, MAP or NDCG.
Abstract: We study metric learning as a problem of information retrieval. We present a general metric learning algorithm, based on the structural SVM framework, to learn a metric such that rankings of data induced by distance from a query can be optimized against various ranking measures, such as AUC, Precision-at-k, MRR, MAP or NDCG. We demonstrate experimental results on standard classification data sets, and a large-scale online dating recommendation problem.

371 citations
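The ranking measures named above are all simple functions of a ranked list of relevance labels. For reference, minimal implementations of Precision-at-k, MRR, MAP's per-query average precision, and NDCG are sketched below (AUC is omitted); the label list is a toy example.

```python
import math

def precision_at_k(rels, k):      # rels: relevance labels in ranked order
    return sum(1 for r in rels[:k] if r > 0) / k

def mrr(rels):                    # reciprocal rank of the first relevant item
    return next((1 / (i + 1) for i, r in enumerate(rels) if r > 0), 0.0)

def average_precision(rels):      # averaged over MAP's relevant positions
    hits, total = 0, 0.0
    for i, r in enumerate(rels):
        if r > 0:
            hits += 1
            total += hits / (i + 1)
    return total / max(1, sum(1 for r in rels if r > 0))

def ndcg(rels, k=None):
    k = k or len(rels)
    dcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

rels = [3, 0, 1, 0, 2]   # graded labels of retrieved items, best-scored first
print(precision_at_k(rels, 3), mrr(rels), average_precision(rels), ndcg(rels))
```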


Journal ArticleDOI
TL;DR: The interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors are discussed.
Abstract: We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing for example, generalizations of the notion of an author, are also briefly discussed.

329 citations
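The two-stage generative process is short enough to write out directly: each word token picks an author uniformly from the paper's authors, a topic from that author's topic distribution, then a word from that topic's word distribution. The sketch below only samples from the model; the toy distributions stand in for parameters the paper learns with Markov chain Monte Carlo.

```python
import random

random.seed(0)
# Placeholder model parameters (learned by MCMC in the paper).
author_topics = {"ann": [0.9, 0.1], "bob": [0.2, 0.8]}   # P(topic | author)
topic_words = [  # P(word | topic)
    {"ranking": 0.5, "retrieval": 0.4, "neuron": 0.1},
    {"neuron": 0.6, "spike": 0.3, "ranking": 0.1},
]

def sample(dist):
    """Draw one key from a {outcome: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_doc(authors, n_words):
    """Author-topic generative process: author -> topic -> word, per token."""
    words = []
    for _ in range(n_words):
        author = random.choice(authors)                  # uniform over co-authors
        topic = random.choices([0, 1], weights=author_topics[author])[0]
        words.append(sample(topic_words[topic]))
    return words

print(generate_doc(["ann", "bob"], n_words=8))
```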


Book
13 Oct 2010
TL;DR: The editors first offer a thorough introduction, including a systematic categorization according to learning task and learning technique, along with a unified notation, and the first half of the book is organized into parts on applications of preference learning in multiattribute domains, information retrieval, and recommender systems.
Abstract: The topic of preferences is a new branch of machine learning and data mining, and it has attracted considerable attention in artificial intelligence research in previous years. It involves learning from observations that reveal information about the preferences of an individual or a class of individuals. Representing and processing knowledge in terms of preferences is appealing as it allows one to specify desires in a declarative way, to combine qualitative and quantitative modes of reasoning, and to deal with inconsistencies and exceptions in a flexible manner. And, generalizing beyond training data, models thus learned may be used for preference prediction. This is the first book dedicated to this topic, and the treatment is comprehensive. The editors first offer a thorough introduction, including a systematic categorization according to learning task and learning technique, along with a unified notation. The first half of the book is organized into parts on label ranking, instance ranking, and object ranking; while the second half is organized into parts on applications of preference learning in multiattribute domains, information retrieval, and recommender systems. The book will be of interest to researchers and practitioners in artificial intelligence, in particular machine learning and data mining, and in fields such as multicriteria decision-making and operations research.

304 citations


Proceedings Article
23 Aug 2010
TL;DR: This paper proposes a new ranking strategy which uses not only the content relevance of a tweet, but also the account authority and tweet-specific features such as whether a URL link is included in the tweet.
Abstract: Twitter, as one of the most popular micro-blogging services, provides large quantities of fresh information including real-time news, comments, conversation, pointless babble and advertisements. Twitter presents tweets in chronological order. Recently, Twitter introduced a new ranking strategy that considers the popularity of tweets in terms of number of retweets. This ranking method, however, takes into account neither content relevance nor the Twitter account. Therefore, a large number of pointless tweets inevitably flood the relevant tweets. This paper proposes a new ranking strategy which uses not only the content relevance of a tweet, but also the account authority and tweet-specific features such as whether a URL link is included in the tweet. We employ learning to rank algorithms to determine the best set of features through a series of experiments. The experiments demonstrate that URL presence, tweet length, and account authority form the most effective feature combination.
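The recommended feature set is cheap to compute. Below is an illustrative extractor for the three winning signals; the authority proxy (log follower count) is an assumption, since the paper defines account authority over its own data.

```python
import math
import re

def tweet_features(text, follower_count):
    """Extract the three features the study found most effective (illustrative)."""
    return {
        "has_url": bool(re.search(r"https?://\S+", text)),   # URL presence
        "length": len(text),                                 # tweet length
        "authority": math.log1p(follower_count),             # authority proxy
    }

print(tweet_features("Breaking: ranking survey out http://example.com", 12000))
```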

Journal ArticleDOI
TL;DR: A new perspective on this problem is provided by considering the existing shapes as a group and studying their similarity measures to the query shape in a graph structure; the learned similarity achieves promising improvements on both shape classification and shape clustering.

Abstract: Shape similarity and shape retrieval are very important topics in computer vision. The recent progress in this domain has been mostly driven by designing smart shape descriptors for providing better similarity measure between pairs of shapes. In this paper, we provide a new perspective to this problem by considering the existing shapes as a group, and study their similarity measures to the query shape in a graph structure. Our method is general and can be built on top of any existing shape similarity measure. For a given similarity measure, a new similarity is learned through graph transduction. The new similarity is learned iteratively so that the neighbors of a given shape influence its final similarity to the query. The basic idea here is related to PageRank ranking, which forms a foundation of Google Web search. The presented experimental results demonstrate that the proposed approach yields significant improvements over the state-of-the-art shape matching algorithms. We obtained a retrieval rate of 91.61 percent on the MPEG-7 data set, which is the highest ever reported in the literature. Moreover, the learned similarity by the proposed method also achieves promising improvements on both shape classification and shape clustering.
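The PageRank-flavored intuition, letting a shape's neighbors influence its final similarity to the query, can be sketched with a generic diffusion update f ← αPᵀf + (1−α)y on the shape graph. The row-stochastic transition matrix, the weight α, and the toy affinities below are placeholders rather than the paper's exact construction.

```python
import numpy as np

def transduce_similarity(S, query_idx, alpha=0.85, iters=50):
    """Re-rank similarities to a query by diffusion on the shape graph (sketch)."""
    P = S / S.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
    y = np.zeros(len(S)); y[query_idx] = 1.0   # restart at the query shape
    f = y.copy()
    for _ in range(iters):
        f = alpha * P.T @ f + (1 - alpha) * y  # neighbors propagate similarity
    return f

# Toy pairwise shape affinities (symmetric, self-similarity on the diagonal).
S = np.array([[1.0, 0.8, 0.1, 0.1],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.1, 0.1, 0.7, 1.0]])
print(transduce_similarity(S, query_idx=0).round(3))
```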

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new transductive learning framework for image retrieval is proposed, in which images are taken as vertices in a weighted hypergraph and the task of image search is formulated as the problem of hypergraph ranking.
Abstract: In this paper, we propose a new transductive learning framework for image retrieval, in which images are taken as vertices in a weighted hypergraph and the task of image search is formulated as the problem of hypergraph ranking. Based on the similarity matrix computed from various feature descriptors, we take each image as a ‘centroid’ vertex and form a hyperedge by a centroid and its k-nearest neighbors. To further exploit the correlation information among images, we propose a probabilistic hypergraph, which assigns each vertex v_i to a hyperedge e_j in a probabilistic way. In the incidence structure of a probabilistic hypergraph, we describe both the higher order grouping information and the affinity relationship between vertices within each hyperedge. After feedback images are provided, our retrieval system ranks image labels by a transductive inference approach, which tends to assign the same label to vertices that share many incidental hyperedges, with the constraints that predicted labels of feedback images should be similar to their initial labels. We compare the proposed method to several other methods and its effectiveness is demonstrated by extensive experiments on Corel5K, the Scene dataset and Caltech 101.
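A minimal construction in this spirit: each image anchors a hyperedge containing its k nearest neighbors, incidence weights decay with distance to the centroid (the "probabilistic" soft assignment), and ranking applies the standard hypergraph random-walk smoothing f = (I − αΘ)⁻¹(1 − α)y. The kernel, parameter values, and toy points are assumptions, not the paper's exact settings.

```python
import numpy as np

def hypergraph_rank(X, query_idx, k=2, alpha=0.9, sigma=1.0):
    """Rank items by probabilistic-hypergraph smoothing (illustrative sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    H = np.zeros((n, n))  # H[v, e]: soft membership of vertex v in hyperedge e
    for e in range(n):    # hyperedge e = centroid e plus its k nearest neighbors
        members = np.argsort(D[e])[:k + 1]
        H[members, e] = np.exp(-D[members, e] ** 2 / sigma ** 2)
    w = np.ones(n)                       # uniform hyperedge weights
    Dv = (H * w).sum(axis=1)             # vertex degrees
    De = H.sum(axis=0)                   # hyperedge degrees
    # Theta = Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2), the random-walk operator
    Theta = (H * w / De) @ H.T / np.sqrt(np.outer(Dv, Dv))
    y = np.zeros(n); y[query_idx] = 1.0  # relevance seed (the query image)
    f = np.linalg.solve(np.eye(n) - alpha * Theta, (1 - alpha) * y)
    return np.argsort(-f)

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [2.1, 2.0]])
print(hypergraph_rank(X, query_idx=0))  # the query's cluster should rank first
```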

Proceedings Article
23 Aug 2010
TL;DR: The problem is formulated as a bipartite graph and the well-known web page ranking algorithm HITS is used to find important features and rank them high and demonstrates promising results on diverse real-life datasets.
Abstract: An important task of opinion mining is to extract people's opinions on features of an entity. For example, the sentence, "I love the GPS function of Motorola Droid" expresses a positive opinion on the "GPS function" of the Motorola phone. "GPS function" is the feature. This paper focuses on mining features. Double propagation is a state-of-the-art technique for solving the problem. It works well for medium-size corpora. However, for large and small corpora, it can result in low precision and low recall. To deal with these two problems, two improvements based on part-whole and "no" patterns are introduced to increase the recall. Then feature ranking is applied to the extracted feature candidates to improve the precision of the top-ranked candidates. We rank feature candidates by feature importance which is determined by two factors: feature relevance and feature frequency. The problem is formulated as a bipartite graph and the well-known web page ranking algorithm HITS is used to find important features and rank them high. Experiments on diverse real-life datasets show promising results.
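HITS itself is a short power iteration. The sketch below runs the standard hub/authority updates on a small bipartite adjacency matrix; treating feature candidates as authorities reflects the paper's framing, but the toy graph and iteration count are placeholders.

```python
import numpy as np

def hits(A, iters=50):
    """Standard HITS power iteration; A[h, a] = 1 if hub h points to authority a."""
    hubs = np.ones(A.shape[0])
    auths = np.ones(A.shape[1])
    for _ in range(iters):
        auths = A.T @ hubs; auths /= np.linalg.norm(auths)  # authority update
        hubs = A @ auths;   hubs /= np.linalg.norm(hubs)    # hub update
    return hubs, auths

# Toy bipartite graph: rows are feature indicators, columns are feature candidates.
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])
hubs, auths = hits(A)
print("feature importance (authority scores):", auths.round(3))
```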

Posted Content
TL;DR: It is found that in the author co-citation network, citation rank is highly correlated with PageRank under different damping factors and with the different PageRank algorithms; citation rank and PageRank are not significantly correlated with centrality measures; and h-index is not significantly correlated with centrality measures.

Abstract: Google's PageRank has created a new synergy to information retrieval for a better ranking of Web pages. It ranks documents depending on the topology of the graphs and the weights of the nodes. PageRank has significantly advanced the field of information retrieval and keeps Google ahead of competitors in the search engine market. It has been deployed in bibliometrics to evaluate research impact, yet few of these studies focus on the important impact of the damping factor (d) for ranking purposes. This paper studies how varied damping factors in the PageRank algorithm can provide additional insight into the ranking of authors in an author co-citation network. Furthermore, we propose weighted PageRank algorithms. We select the 108 most highly cited authors in the information retrieval (IR) area from the 1970s to 2008 to form the author co-citation network. We calculate the ranks of these 108 authors based on PageRank with the damping factor ranging from 0.05 to 0.95. In order to test the relationship between these different measures, we compare PageRank and weighted PageRank results with the citation ranking, h-index, and centrality measures. We found that in our author co-citation network, citation rank is highly correlated with PageRank under different damping factors and with the different PageRank algorithms; citation rank and PageRank are not significantly correlated with centrality measures; and h-index is not significantly correlated with centrality measures.
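Sweeping the damping factor is a one-parameter experiment once PageRank is written down. A minimal power-iteration version over a weighted adjacency matrix is sketched below; the toy co-citation matrix stands in for the paper's 108-author network, and the sweep grid abbreviates its 0.05-0.95 range.

```python
import numpy as np

def pagerank(W, d=0.85, iters=100):
    """Power-iteration PageRank on a weighted adjacency matrix W (rows = out-links)."""
    n = len(W)
    P = W / W.sum(axis=1, keepdims=True)   # row-normalize to transition probabilities
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)    # damped random-walk update
    return r

# Toy symmetric co-citation counts between four authors.
W = np.array([[0, 5, 1, 0],
              [5, 0, 2, 1],
              [1, 2, 0, 4],
              [0, 1, 4, 0]], dtype=float)
for d in (0.05, 0.5, 0.95):                # sweep the damping factor
    print(f"d={d}: rank order {np.argsort(-pagerank(W, d))}")
```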

Patent
05 Aug 2010
TL;DR: In this paper, a facial recognition search system identifies one or more likely names (or other personal identifiers) corresponding to the facial image(s) in a query as follows: after receiving the visual query with one or multiple facial images, the system identifies images that potentially match the respective facial image in accordance with visual similarity criteria.
Abstract: A facial recognition search system identifies one or more likely names (or other personal identifiers) corresponding to the facial image(s) in a query as follows. After receiving the visual query with one or more facial images, the system identifies images that potentially match the respective facial image in accordance with visual similarity criteria. Then one or more persons associated with the potential images are identified. For each identified person, person-specific data comprising metrics of social connectivity to the requester are retrieved from a plurality of applications such as communications applications, social networking applications, calendar applications, and collaborative applications. An ordered list of persons is then generated by ranking the identified persons in accordance with at least metrics of visual similarity between the respective facial image and the potential image matches and with the social connection metrics. Finally, at least one person identifier from the list is sent to the requester.

Journal ArticleDOI
TL;DR: A diverse relevance ranking scheme is proposed that takes both relevance and diversity into account by exploring the content of images and their associated tags; it is shown that the diversity of search results can be enhanced while maintaining a comparable level of relevance.
Abstract: Recent years have witnessed the great success of social media websites. Tag-based image search is an important approach to accessing the image content on these websites. However, the existing ranking methods for tag-based image search frequently return results that are irrelevant or not diverse. This paper proposes a diverse relevance ranking scheme that is able to take relevance and diversity into account by exploring the content of images and their associated tags. First, it estimates the relevance scores of images with respect to the query term based on both the visual information of images and the semantic information of associated tags. Then, we estimate the semantic similarities of social images based on their tags. Based on the relevance scores and the similarities, the ranking list is generated by a greedy ordering algorithm which optimizes average diverse precision, a novel measure that is extended from the conventional average precision. Comprehensive experiments and user studies demonstrate the effectiveness of the approach. We also apply the scheme for web image search reranking, and it is shown that the diversity of search results can be enhanced while maintaining a comparable level of relevance.

Proceedings ArticleDOI
26 Apr 2010
TL;DR: An approximate index structure summarising graph-structured content of sources adhering to Linked Data principles is developed, an algorithm for answering conjunctive queries over Linked Data on the Web exploiting the source summary is provided, and the system is evaluated using synthetically generated queries.

Abstract: Typical approaches for querying structured Web Data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. However, this time-consuming pre-processing phase leverages the benefits of Linked Data -- where structured data is accessible live and up-to-date at distributed Web resources that may change constantly -- only to a limited degree, as query results can never be current. An ideal query answering system for Linked Data should return current answers in a reasonable amount of time, even on corpora as large as the Web. Query processors evaluating queries directly on the live sources require knowledge of the contents of data sources. In this paper, we develop and evaluate an approximate index structure summarising graph-structured content of sources adhering to Linked Data principles, provide an algorithm for answering conjunctive queries over Linked Data on the Web exploiting the source summary, and evaluate the system using synthetically generated queries. The experimental results show that our lightweight index structure enables complete and up-to-date query results over Linked Data, while keeping the overhead for querying low and providing a satisfying source ranking at no additional cost.

Journal ArticleDOI
Olivier Chapelle, Mingrui Wu
TL;DR: This work proposes an algorithm which aims at directly optimizing popular measures such as the Normalized Discounted Cumulative Gain and the Average Precision; the basic idea is to minimize a smooth approximation of these measures with gradient descent.

Abstract: Most ranking algorithms are based on the optimization of some loss functions, such as the pairwise loss. However, these loss functions are often different from the criteria that are adopted to measure the quality of the web page ranking results. To overcome this problem, we propose an algorithm which aims at directly optimizing popular measures such as the Normalized Discounted Cumulative Gain and the Average Precision. The basic idea is to minimize a smooth approximation of these measures with gradient descent. Crucial to this kind of approach is the choice of the smoothing factor. We provide various theoretical analyses of that choice and propose an annealing algorithm to iteratively minimize a less and less smoothed approximation of the measure of interest. Results on the Letor benchmark datasets show that the proposed algorithm achieves state-of-the-art performance.
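One common smoothing strategy in the spirit of this line of work (not necessarily the paper's exact formulation) replaces each document's hard rank with a soft rank built from sigmoids of score differences, making NDCG a smooth function of the scores; a temperature parameter plays the role of the smoothing factor discussed above.

```python
import math

def soft_ndcg(scores, labels, temp=1.0):
    """Smooth, differentiable-in-scores approximation of NDCG (sketch).

    Each hard rank is replaced by a soft rank: 1 plus the sum of sigmoids of
    score differences, so the measure varies smoothly with the scores.
    """
    def sig(x):
        return 1.0 / (1.0 + math.exp(-x / temp))
    n = len(scores)
    dcg = 0.0
    for i in range(n):
        soft_rank = 1.0 + sum(sig(scores[j] - scores[i]) for j in range(n) if j != i)
        dcg += (2 ** labels[i] - 1) / math.log2(1 + soft_rank)
    ideal = sorted(labels, reverse=True)
    idcg = sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ideal))
    return dcg / idcg

scores, labels = [2.0, 1.0, 0.2], [2, 0, 1]
for temp in (1.0, 0.1, 0.01):   # smaller temperature -> closer to the hard NDCG
    print(temp, round(soft_ndcg(scores, labels, temp), 4))
```

As the temperature shrinks, the soft value approaches the hard NDCG, mirroring the annealing schedule the paper proposes for its smoothing factor.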

Book
David Carmel, Elad Yom-Tov
30 Apr 2010
TL;DR: The goal of this tutorial is to expose participants to current research on query performance prediction (also known as query difficulty estimation); participants will become familiar with state-of-the-art performance prediction methods and with common evaluation methodologies for prediction quality.

Abstract: Many information retrieval (IR) systems suffer from a radical variance in performance when responding to users' queries. Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor. Thus, it is desirable that IR systems will be able to identify "difficult" queries in order to handle them properly. Understanding why some queries are inherently more difficult than others is essential for IR, and a good answer to this important question will help search engines to reduce the variance in performance, hence better servicing their customer needs. The high variability in query performance has driven a new research direction in the IR field on estimating the expected quality of the search results, i.e. the query difficulty, when no relevance feedback is given. Estimating the query difficulty is a significant challenge due to the numerous factors that impact retrieval performance. Many prediction methods have been proposed recently. However, as many researchers observed, the prediction quality of state-of-the-art predictors is still too low to be widely used by IR applications. The low prediction quality is due to the complexity of the task, which involves factors such as query ambiguity, missing content, and vocabulary mismatch. The goal of this tutorial is to expose participants to the current research on query performance prediction (also known as query difficulty estimation). Participants will become familiar with state-of-the-art performance prediction methods, and with common evaluation methodologies for prediction quality. We will discuss the reasons that cause search engines to fail for some of the queries, and provide an overview of several approaches for estimating query difficulty. We then describe common methodologies for evaluating the prediction quality of those estimators, and some experiments conducted recently with their prediction quality, as measured over several TREC benchmarks. We will cover a few potential applications that can utilize query difficulty estimators by handling each query individually and selectively based on its estimated difficulty. Finally we will summarize with a discussion on open issues and challenges in the field.
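As a concrete example of the predictors this area studies, the classic post-retrieval clarity score measures the KL divergence between a language model of the top-ranked results and the collection model; a focused result set diverges more from the background, suggesting an easier query. The sketch below is a simplified unigram version with toy documents; the smoothing scheme and weighting of the full method are abbreviated.

```python
import math
from collections import Counter

def clarity(top_docs, collection, mu=0.5):
    """Simplified query clarity: KL(P(w|top results) || P(w|collection))."""
    res = Counter(w for doc in top_docs for w in doc.split())
    col = Counter(w for doc in collection for w in doc.split())
    n_res, n_col = sum(res.values()), sum(col.values())
    score = 0.0
    for w, c in res.items():
        p_res = (1 - mu) * c / n_res + mu * col[w] / n_col   # smoothed result LM
        p_col = col[w] / n_col                               # background LM
        score += p_res * math.log2(p_res / p_col)
    return score

collection = ["ranking models for search", "query difficulty estimation",
              "neural networks", "search engines rank pages", "cooking pasta recipes"]
top = ["ranking models for search", "search engines rank pages"]
print(round(clarity(top, collection), 3))  # higher = more focused result set
```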

Journal ArticleDOI
TL;DR: A novel correlation-based memetic framework (MA-C), a combination of a genetic algorithm (GA) and local search (LS) using correlation-based filter ranking, is proposed; it outperforms recent methods in the literature in terms of classification accuracy, selected feature size and efficiency.

Abstract: A novel correlation-based memetic framework (MA-C), which is a combination of a genetic algorithm (GA) and local search (LS) using correlation-based filter ranking, is proposed in this paper. The local filter method used here fine-tunes the population of GA solutions by adding or deleting features based on the Symmetrical Uncertainty (SU) measure. The focus here is on filter methods that are able to assess the goodness or ranking of the individual features. An empirical study of MA-C on several commonly used large-scale gene expression datasets indicates that it outperforms recent methods in the literature in terms of classification accuracy, selected feature size and efficiency. Further, we also investigate the balance between local and genetic search to maximize the search quality and efficiency of MA-C.
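Symmetrical Uncertainty, the filter criterion behind the ranking step, is mutual information normalized by the two entropies, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), giving a 0-1 correlation score between a discrete feature and the class. A minimal implementation with made-up data:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    mi = entropy(x) + entropy(y) - entropy(list(zip(x, y)))  # I(X;Y)
    denom = entropy(x) + entropy(y)
    return 2 * mi / denom if denom else 0.0

# Toy discrete feature values and class labels for eight samples.
feature = [0, 0, 1, 1, 0, 1, 0, 1]
labels  = [0, 0, 1, 1, 0, 1, 1, 0]   # mostly, but not perfectly, aligned
print(round(symmetrical_uncertainty(feature, labels), 3))
```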

Proceedings ArticleDOI
19 Jul 2010
TL;DR: The experimental results clearly show that the context-aware ranking approach improves the ranking of a commercial search engine which ignores context information and outperforms a baseline method which considers context information in ranking.
Abstract: The context of a search query often provides a search engine meaningful hints for answering the current query better. Previous studies on context-aware search were either focused on the development of context models or limited to a relatively small scale investigation under a controlled laboratory setting. In particular, regarding context-aware ranking for Web search, the following two critical problems remain largely unsolved. First, how can we take advantage of different types of contexts in ranking? Second, how can we integrate context information into a ranking model? In this paper, we tackle the above two essential problems analytically and empirically. We develop different ranking principles for different types of contexts. Moreover, we adopt a learning-to-rank approach and integrate the ranking principles into a state-of-the-art ranking model by encoding the context information as features of the model. We empirically test our approach using a large search log data set obtained from a major commercial search engine. Our evaluation uses both human judgments and implicit user click data. The experimental results clearly show that our context-aware ranking approach improves the ranking of a commercial search engine which ignores context information. Furthermore, our method outperforms a baseline method which considers context information in ranking.

Journal ArticleDOI
01 Sep 2010
TL;DR: Empirical studies with real-world spatial data demonstrate that LkPT queries are more effective in retrieving web objects than a previous approach that does not consider the effects of nearby objects; and they show that the proposed algorithms are scalable and outperform a baseline approach significantly.
Abstract: The location-aware keyword query returns ranked objects that are near a query location and that have textual descriptions that match query keywords. This query occurs inherently in many types of mobile and traditional web services and applications, e.g., Yellow Pages and Maps services. Previous work considers the potential results of such a query as being independent when ranking them. However, a relevant result object with nearby objects that are also relevant to the query is likely to be preferable over a relevant object without relevant nearby objects. The paper proposes the concept of prestige-based relevance to capture both the textual relevance of an object to a query and the effects of nearby objects. Based on this, a new type of query, the Location-aware top-k Prestige-based Text retrieval (LkPT) query, is proposed that retrieves the top-k spatial web objects ranked according to both prestige-based relevance and location proximity. We propose two algorithms that compute LkPT queries. Empirical studies with real-world spatial data demonstrate that LkPT queries are more effective in retrieving web objects than a previous approach that does not consider the effects of nearby objects; and they show that the proposed algorithms are scalable and outperform a baseline approach significantly.

Posted Content
TL;DR: In this article, a critical analysis of the "Academic Ranking of World Universities", published every year by the Institute of Higher Education of the Jiao Tong University in Shanghai and more commonly known as the Shanghai ranking, is presented.
Abstract: This paper proposes a critical analysis of the "Academic Ranking of World Universities", published every year by the Institute of Higher Education of the Jiao Tong University in Shanghai and more commonly known as the Shanghai ranking. After having recalled how the ranking is built, we first discuss the relevance of the criteria and then analyze the proposed aggregation method. Our analysis uses tools and concepts from Multiple Criteria Decision Making (MCDM). Our main conclusions are that the criteria that are used are not relevant, that the aggregation methodology is plagued by a number of major problems and that the whole exercise suffers from an insufficient attention paid to fundamental structuring issues. Hence, our view is that the Shanghai ranking, in spite of the media coverage it receives, does not qualify as a useful and pertinent tool to discuss the "quality" of academic institutions, let alone to guide the choice of students and family or to promote reforms of higher education systems. We outline the type of work that should be undertaken to offer sound alternatives to the Shanghai ranking.

Journal ArticleDOI
TL;DR: The view is that the Shanghai ranking, in spite of the media coverage it receives, does not qualify as a useful and pertinent tool to discuss the “quality” of academic institutions, let alone to guide the choice of students and family or to promote reforms of higher education systems.
Abstract: This paper proposes a critical analysis of the “Academic Ranking of World Universities”, published every year by the Institute of Higher Education of the Jiao Tong University in Shanghai and more commonly known as the Shanghai ranking. After having recalled how the ranking is built, we first discuss the relevance of the criteria and then analyze the proposed aggregation method. Our analysis uses tools and concepts from Multiple Criteria Decision Making (MCDM). Our main conclusions are that the criteria that are used are not relevant, that the aggregation methodology is plagued by a number of major problems and that the whole exercise suffers from an insufficient attention paid to fundamental structuring issues. Hence, our view is that the Shanghai ranking, in spite of the media coverage it receives, does not qualify as a useful and pertinent tool to discuss the “quality” of academic institutions, let alone to guide the choice of students and family or to promote reforms of higher education systems. We outline the type of work that should be undertaken to offer sound alternatives to the Shanghai ranking.

Patent
25 Oct 2010
TL;DR: In this paper, a method for transliteration includes receiving input such as a word, a sentence, a phrase, or a paragraph in a source language; creating source language sub-phonetic units for the word and converting them to target language sub-phonetic units; retrieving a ranking for each of the target language sub-phonetic units from a database; and creating target language words for the word in the source language based on the target language sub-phonetic units and the ranking of each of them.

Abstract: A method for transliteration includes receiving input, such as a word, a sentence, a phrase, or a paragraph, in a source language; creating source language sub-phonetic units for the word and converting them to target language sub-phonetic units; retrieving a ranking for each of the target language sub-phonetic units from a database; and creating target language words for the word in the source language based on the target language sub-phonetic units and the ranking of each of them. The method further includes identifying candidate target language words based on predefined criteria, and displaying the candidate target language words.

Proceedings ArticleDOI
Loïc Lecerf, Boris Chidlovskii
22 Mar 2010
TL;DR: A model of layout indexing of a collection is developed, adapted for the quick retrieval of the top k documents most relevant by layout; a direct evaluation of the similarity between a query and each document in the collection is avoided.

Abstract: In this paper we propose a schema for querying large document collections by document layout. We develop a model of layout indexing of a collection adapted for the quick retrieval of the top k relevant documents. For the sake of scalability, we avoid a direct evaluation of the similarity between a query and each document in the collection; their similarity is instead approximated by the similarity between their projections on the set of representative blocks which are inferred from the collection at the indexing step. The technique also proposes new functions for relevance ranking and cluster pruning that ensure scalable retrieval and ranking.

Patent
Rong Xiao, Qiang Hao, Changhu Wang, Rui Cai, Lei Zhang
08 Jun 2010
TL;DR: In this paper, location-related aspects of user-generated content are automatically learned based on automated analysis of the user-generated content: documents are divided into document segments, and the segments are decomposed into local topics and global topics.
Abstract: Described herein is a technology that facilitates efficient automated mining of topic-related aspects of user-generated content based on automated analysis of the user-generated content. Locations are automatically learned based on dividing documents into document segments, and decomposing the segments into local topics and global topics. Techniques are described that facilitate automatically extracting snippets. These techniques include, for example, computer annotating travelogues with learned tags and images, performing topic learning to obtain an interest model, performing location matching based on the interest model, calculating geographic and semantic relevance scores, ranking snippets based on the geographic and semantic relevance scores, and searching snippets with a “location+context term” query.

Journal ArticleDOI
TL;DR: This work explores the variation in what different people consider relevant to the same query by mining three data sources, finding that people's explicit judgments for the same queries differ greatly.
Abstract: Current Web search tools do a good job of retrieving documents that satisfy the most common intentions associated with a query, but do not do a very good job of discerning different individuals' unique search goals. We explore the variation in what different people consider relevant to the same query by mining three data sources: (1) explicit relevance judgments, (2) clicks on search results (a behavior-based implicit measure of relevance), and (3) the similarity of desktop content to search results (a content-based implicit measure of relevance). We find that people's explicit judgments for the same queries differ greatly. As a result, there is a large gap between how well search engines could perform if they were to tailor results to the individual, and how well they currently perform by returning results designed to satisfy everyone. We call this gap the potential for personalization. The two implicit indicators we studied provide complementary value for approximating this variation in result relevance among people. We discuss several uses of our findings, including a personalized search system that takes advantage of the implicit measures by ranking personally relevant results more highly and improving click-through rates.