
Showing papers on "Ranking (information retrieval) published in 2007"


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper brings query expansion into the visual domain via two novel contributions: strong spatial constraints between the query image and each result allow each return to be accurately verified, suppressing the false positives that typically ruin text-based query expansion; and the verified images are used to learn a latent feature model that enables the controlled construction of expanded queries.
Abstract: Given a query image of an object, our objective is to retrieve all instances of that object in a large (1M+) image database. We adopt the bag-of-visual-words architecture which has proven successful in achieving high precision at low recall. Unfortunately, feature detection and quantization are noisy processes and this can result in variation in the particular visual words that appear in different images of the same object, leading to missed results. In the text retrieval literature a standard method for improving performance is query expansion. A number of the highly ranked documents from the original query are reissued as a new query. In this way, additional relevant terms can be added to the query. This is a form of blind relevance feedback and it can fail if 'outlier' (false positive) documents are included in the reissued query. In this paper we bring query expansion into the visual domain via two novel contributions. Firstly, strong spatial constraints between the query image and each result allow us to accurately verify each return, suppressing the false positives which typically ruin text-based query expansion. Secondly, the verified images can be used to learn a latent feature model to enable the controlled construction of expanded queries. We illustrate these ideas on the 5000 annotated image Oxford building database together with more than 1M Flickr images. We show that the precision is substantially boosted, achieving total recall in many cases.
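The "reissue verified results" idea can be sketched in a few lines. The snippet below is an illustrative average-query-expansion step over bag-of-visual-words tf-idf vectors with cosine re-scoring; the vector representation and scoring function are assumptions for the sketch, not the paper's exact pipeline.

```python
import numpy as np

def average_query_expansion(q_vec, verified_vecs):
    """Average the original bag-of-visual-words query vector with the
    spatially verified result vectors to form an expanded query."""
    stacked = np.vstack([q_vec] + list(verified_vecs))
    avg = stacked.mean(axis=0)
    return avg / np.linalg.norm(avg)

def rescore(db_vecs, q_vec):
    """Cosine similarity of every database image against the (expanded) query."""
    return (db_vecs @ q_vec) / (np.linalg.norm(db_vecs, axis=1) * np.linalg.norm(q_vec))
```

After expansion, database images that share visual words with the verified results rise in the ranking even when they share few words with the original query.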

966 citations


Journal ArticleDOI
TL;DR: This work formally generalizes the ROC metric to the early recognition problem and proposes a novel metric called Boltzmann-enhanced discrimination of receiver operating characteristic that turns out to contain the discrimination power of the RIE metric but incorporates the statistical significance from ROC and its well-behaved boundaries.
Abstract: Many metrics are currently used to evaluate the performance of ranking methods in virtual screening (VS), for instance, the area under the receiver operating characteristic curve (ROC), the area under the accumulation curve (AUAC), the average rank of actives, the enrichment factor (EF), and the robust initial enhancement (RIE) proposed by Sheridan et al. In this work, we show that the ROC, the AUAC, and the average rank metrics have the same inappropriate behaviors that make them poor metrics for comparing VS methods whose purpose is to rank actives early in an ordered list (the "early recognition problem"). In doing so, we derive mathematical formulas that relate those metrics together. Moreover, we show that the EF metric is not sensitive to ranking performance before and after the cutoff. Instead, we formally generalize the ROC metric to the early recognition problem which leads us to propose a novel metric called the Boltzmann-enhanced discrimination of receiver operating characteristic that turns out to contain the discrimination power of the RIE metric but incorporates the statistical significance from ROC and its well-behaved boundaries. Finally, two major sources of errors, namely, the statistical error and the "saturation effects", are examined. This leads to practical recommendations for the number of actives, the number of inactives, and the "early recognition" importance parameter that one should use when comparing ranking methods. Although this work is applied specifically to VS, it is general and can be used to analyze any method that needs to segregate actives toward the front of a rank-ordered list.
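The RIE and BEDROC metrics discussed above can be computed directly from the ranks of the actives in the ordered list. The sketch below normalises RIE numerically between its best and worst attainable values (actives at the very top or very bottom) rather than using the paper's closed-form expression:

```python
import math

def rie(ranks, n_total, alpha=20.0):
    """Robust Initial Enhancement: exponentially weighted sum over the ranks
    of the actives, normalised by its value under a uniform distribution."""
    n_act = len(ranks)
    s = sum(math.exp(-alpha * r / n_total) for r in ranks)
    uniform = (n_act / n_total) * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    return s / uniform

def bedroc(ranks, n_total, alpha=20.0):
    """Min-max normalisation of RIE to [0, 1]; alpha sets the 'early
    recognition' importance."""
    n_act = len(ranks)
    best = rie(range(1, n_act + 1), n_total, alpha)
    worst = rie(range(n_total - n_act + 1, n_total + 1), n_total, alpha)
    return (rie(ranks, n_total, alpha) - worst) / (best - worst)
```

Larger alpha concentrates the weight on the front of the list, which is exactly the early-recognition emphasis the paper formalises.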

676 citations


Proceedings ArticleDOI
Nick Craswell, Martin Szummer
23 Jul 2007
TL;DR: A Markov random walk model is applied to a large click log, producing a probabilistic ranking of documents for a given query, demonstrating its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively.
Abstract: Search engines can record which documents were clicked for which query, and use these query-document pairs as "soft" relevance judgments. However, compared to the true judgments, click logs give noisy and sparse relevance information. We apply a Markov random walk model to a large click log, producing a probabilistic ranking of documents for a given query. A key advantage of the model is its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively. We conduct experiments on click logs from image search, comparing our ("backward") random walk model to a different ("forward") random walk, varying parameters such as walk length and self-transition probability. The most effective combination is a long backward walk with high self-transition probability.
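A minimal random walk with self-transitions over a bipartite query-document click graph can be sketched as follows (the toy click matrix and parameter values are illustrative, not the paper's exact model):

```python
import numpy as np

def click_walk(clicks, start_query, steps=20, self_p=0.9):
    """Walk over the bipartite click graph: states are queries then documents;
    with probability self_p stay put, otherwise hop to the other side in
    proportion to click counts. Returns a distribution over documents."""
    nq, nd = clicks.shape
    P = np.eye(nq + nd) * self_p
    P[:nq, nq:] = (1 - self_p) * clicks / clicks.sum(axis=1, keepdims=True)
    P[nq:, :nq] = (1 - self_p) * (clicks / clicks.sum(axis=0, keepdims=True)).T
    p = np.zeros(nq + nd)
    p[start_query] = 1.0
    for _ in range(steps):
        p = p @ P
    docs = p[nq:]
    return docs / docs.sum()

# Query 0 clicked documents 0 and 1; query 1 clicked documents 1 and 2.
clicks = np.array([[3.0, 1.0, 0.0],
                   [0.0, 2.0, 2.0]])
```

For query 0 the walk assigns nonzero probability to document 2 even though it was never clicked for that query, because mass flows through document 1 and query 1; this is the "retrieve unclicked documents" effect the abstract describes.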

519 citations


Tie-Yan Liu, Jun Xu, Tao Qin, Wen-Ying Xiong, Hang Li 
01 Jan 2007
TL;DR: This paper constructs a benchmark dataset referred to as LETOR, derived from existing data sets widely used in IR, namely the OHSUMED and TREC data, and provides the results of several state-of-the-art learning-to-rank algorithms on the data.
Abstract: This paper is concerned with learning to rank for information retrieval (IR). Ranking is the central problem for information retrieval, and employing machine learning techniques to learn the ranking function is viewed as a promising approach to IR. Unfortunately, there was no benchmark dataset that could be used in comparison of existing learning algorithms and in evaluation of newly proposed algorithms, which stood in the way of the related research. To deal with the problem, we have constructed a benchmark dataset referred to as LETOR and distributed it to the research communities. Specifically, we have derived the LETOR data from existing data sets widely used in IR, namely the OHSUMED and TREC data. The two collections contain queries, the contents of the retrieved documents, and human judgments on the relevance of the documents with respect to the queries. We have extracted features from the datasets, including both conventional features, such as term frequency, inverse document frequency, BM25, and language models for IR, and features proposed recently at SIGIR, such as HostRank, feature propagation, and topical PageRank. We have then packaged LETOR with the extracted features, queries, and relevance judgments. We have also provided the results of several state-of-the-art learning-to-rank algorithms on the data. This paper describes LETOR in detail.

503 citations


Journal ArticleDOI
TL;DR: The results show that the combination of experts significantly improves the effectiveness of feature location as compared to each of the experts used independently.
Abstract: This paper recasts the problem of feature location in source code as a decision-making problem in the presence of uncertainty. The solution to the problem is formulated as a combination of the opinions of different experts. The experts in this work are two existing techniques for feature location: a scenario-based probabilistic ranking of events and an information-retrieval-based technique that uses latent semantic indexing. The combination of these two experts is empirically evaluated through several case studies, which use the source code of the Mozilla Web browser and the Eclipse integrated development environment. The results show that the combination of experts significantly improves the effectiveness of feature location as compared to each of the experts used independently.

473 citations


Journal ArticleDOI
TL;DR: A model for the exploitation of ontology-based knowledge bases to improve search over large document repositories, combined with conventional keyword-based retrieval to achieve tolerance to knowledge base incompleteness.
Abstract: Semantic search has been one of the motivations of the semantic Web since it was envisioned. We propose a model for the exploitation of ontology-based knowledge bases to improve search over large document repositories. In our view of information retrieval on the semantic Web, a search engine returns documents rather than, or in addition to, exact values in response to user queries. For this purpose, our approach includes an ontology-based scheme for the semiautomatic annotation of documents and a retrieval system. The retrieval model is based on an adaptation of the classic vector-space model, including an annotation weighting algorithm, and a ranking algorithm. Semantic search is combined with conventional keyword-based retrieval to achieve tolerance to knowledge base incompleteness. Experiments are shown where our approach is tested on corpora of significant scale, showing clear improvements with respect to keyword-based search.

456 citations


Proceedings Article
03 Dec 2007
TL;DR: A method that uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating is presented; it gives very good ranking scores and scales well on collaborative filtering tasks.
Abstract: In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks.

442 citations


Journal ArticleDOI
01 Feb 2007
TL;DR: This correspondence presents a novel hybrid wrapper and filter feature selection algorithm for a classification problem using a memetic framework that incorporates a filter ranking method in the traditional genetic algorithm to improve classification performance and accelerate the search in identifying the core feature subsets.
Abstract: This correspondence presents a novel hybrid wrapper and filter feature selection algorithm for a classification problem using a memetic framework. It incorporates a filter ranking method in the traditional genetic algorithm to improve classification performance and accelerate the search in identifying the core feature subsets. Particularly, the method adds or deletes a feature from a candidate feature subset based on the univariate feature ranking information. This empirical study on commonly used data sets from the University of California, Irvine repository and microarray data sets shows that the proposed method outperforms existing methods in terms of classification accuracy, number of selected features, and computational efficiency. Furthermore, we investigate several major issues of memetic algorithm (MA) to identify a good balance between local search and genetic search so as to maximize search quality and efficiency in the hybrid filter and wrapper MA.

426 citations


Journal ArticleDOI
TL;DR: The effectiveness and the benefits of the new indices are exhibited to unfold the full potential of the h-index, with extensive experimental results obtained from the DBLP, a widely known on-line digital library.
Abstract: What is the value of a scientist's work, and what is its impact upon scientific thinking? How can we measure the prestige of a journal or a conference? The evaluation of the scientific work of a scientist and the estimation of the quality of a journal or conference have long attracted significant interest, due to the benefits of obtaining an unbiased and fair criterion. Although it appears to be simple, defining a quality metric is not an easy task. To overcome the disadvantages of the present metrics used for ranking scientists and journals, J. E. Hirsch proposed a pioneering metric, the now famous h-index. In this article we demonstrate several inefficiencies of this index and develop a pair of generalizations and effective variants of it to deal with scientist ranking and publication forum ranking. The new citation indices are able to disclose trendsetters in scientific research, as well as researchers that constantly shape their field with their influential work, no matter how old they are. We exhibit the effectiveness and the benefits of the new indices in unfolding the full potential of the h-index, with extensive experimental results obtained from DBLP, a widely known on-line digital library.
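For reference, Hirsch's h-index itself, which the proposed indices generalize, is simple to compute:

```python
def h_index(citations):
    """h-index: the largest h such that h of the papers have
    at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
    return h
```

Note that `h_index([100, 8, 5, 4, 3])` and `h_index([10, 8, 5, 4, 3])` are both 4: the index is insensitive to how heavily the top papers are cited, which is one of the inefficiencies that motivates variants of it.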

399 citations


Proceedings ArticleDOI
11 Jun 2007
TL;DR: This paper proposes a new ranking formula by adapting existing IR techniques based on a natural notion of virtual document and proposes several efficient query processing methods for the new ranking method.
Abstract: With the increasing amount of text data stored in relational databases, there is a demand for RDBMS to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly. In this paper, we study the effectiveness and the efficiency issues of answering top-k keyword query in relational database systems. We propose a new ranking formula by adapting existing IR techniques based on a natural notion of virtual document. Compared with previous approaches, our new ranking method is simple yet effective, and agrees with human perceptions. We also study efficient query processing methods for the new ranking method, and propose algorithms that have minimal accesses to the database. We have conducted extensive experiments on large-scale real databases using two popular RDBMSs. The experimental results demonstrate significant improvement to the alternative approaches in terms of retrieval effectiveness and efficiency.

343 citations


Proceedings Article
01 Apr 2007
TL;DR: An algorithm is presented that jointly learns ranking models for individual aspects by modeling the dependencies between assigned ranks, and it is proved that the agreement-based joint model is more expressive than individual ranking models.
Abstract: We address the problem of analyzing multiple related opinions in a text. For instance, in a restaurant review such opinions may include food, ambience and service. We formulate this task as a multiple aspect ranking problem, where the goal is to produce a set of numerical scores, one for each aspect. We present an algorithm that jointly learns ranking models for individual aspects by modeling the dependencies between assigned ranks. This algorithm guides the prediction of individual rankers by analyzing meta-relations between opinions, such as agreement and contrast. We prove that our agreement-based joint model is more expressive than individual ranking models. Our empirical results further confirm the strength of the model: the algorithm provides significant improvement over both individual rankers and a state-of-the-art joint ranking model.

Journal ArticleDOI
TL;DR: A new method based on the ranking of generalized trapezoidal fuzzy numbers for fuzzy risk analysis that can overcome the drawbacks of the existing centroid-index ranking methods is presented.
Abstract: In this paper, we present a new method for fuzzy risk analysis based on the ranking of generalized trapezoidal fuzzy numbers. The proposed method considers the centroid points and the standard deviations of generalized trapezoidal fuzzy numbers for ranking generalized trapezoidal fuzzy numbers. We also use an example to compare the ranking results of the proposed method with the existing centroid-index ranking methods. The proposed ranking method can overcome the drawbacks of the existing centroid-index ranking methods. Based on the proposed ranking method, we also present an algorithm to deal with fuzzy risk analysis problems. The proposed fuzzy risk analysis algorithm can overcome the drawbacks of the one we presented in [7].
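A sketch of a centroid-plus-spread ranking key in this spirit, using the standard trapezoid centroid formula and the sample standard deviation of the four defining points (an illustration of the idea, not the authors' exact score):

```python
import statistics

def centroid_x(a, b, c, d):
    """x-coordinate of the centroid of a trapezoidal fuzzy number (a, b, c, d);
    for the x-coordinate the height of the trapezoid cancels out."""
    return (d**2 + c**2 + d*c - a**2 - b**2 - a*b) / (3 * (d + c - a - b))

def rank_key(fz):
    """Illustrative ranking key: larger centroid first; among equal centroids,
    prefer the number with the smaller spread of its four defining points."""
    a, b, c, d = fz
    return (centroid_x(a, b, c, d), -statistics.stdev([a, b, c, d]))
```

Sorting by this key in descending order distinguishes fuzzy numbers that share the same centroid but differ in spread, which is precisely where pure centroid-index methods break down.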

Proceedings Article
06 Jan 2007
TL;DR: This paper presents ItemRank, a random-walk based scoring algorithm, which can be used to rank products according to expected user preferences, in order to recommend top-rank items to potentially interested users.
Abstract: Recommender systems are an emerging technology that helps consumers to find interesting products. A recommender system makes personalized product suggestions by extracting knowledge from previous users' interactions. In this paper, we present "ItemRank", a random-walk based scoring algorithm, which can be used to rank products according to expected user preferences, in order to recommend top-rank items to potentially interested users. We tested our algorithm on a standard database, the MovieLens data set, which contains data collected from a popular recommender system on movies and has been widely exploited as a benchmark for evaluating recently proposed approaches to recommender systems (e.g. [Fouss et al., 2005; Sarwar et al., 2002]). We compared ItemRank with other state-of-the-art ranking techniques (in particular the algorithms described in [Fouss et al., 2005]). Our experiments show that ItemRank performs better than the other algorithms we compared it to and, at the same time, is less complex than the other proposed algorithms with respect to both memory usage and computational cost.
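The core of ItemRank is a PageRank-style iteration over the item correlation graph, biased by the target user's ratings. A minimal sketch (the toy correlation matrix and rating vector are illustrative):

```python
import numpy as np

def item_rank(C, d_u, alpha=0.85, iters=50):
    """ItemRank-style scoring: propagate the user's preference vector d_u
    over the column-normalised item correlation graph C."""
    C = C / C.sum(axis=0, keepdims=True)      # make columns stochastic
    d = d_u / d_u.sum()
    r = np.full(len(d), 1.0 / len(d))
    for _ in range(iters):
        r = alpha * (C @ r) + (1 - alpha) * d
    return r

# Items 0-2; the user has rated only item 0.
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
scores = item_rank(C, np.array([1.0, 0.0, 0.0]))
```

As with PageRank, unrated items receive scores through the graph structure, so the vector can rank items the user has never interacted with.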

Proceedings ArticleDOI
Xiubo Geng, Tie-Yan Liu, Tao Qin, Hang Li
23 Jul 2007
TL;DR: This paper proposes a new feature selection method for ranking: each feature's values are used to rank the training instances, and the resulting ranking accuracy, in terms of a performance measure or a loss function, defines the importance of the feature.
Abstract: Ranking is a very important topic in information retrieval. While algorithms for learning ranking models have been intensively studied, this is not the case for feature selection, despite its importance. The reality is that many feature selection methods used in classification are directly applied to ranking. We argue that because of the striking differences between ranking and classification, it is better to develop different feature selection methods for ranking. To this end, we propose a new feature selection method in this paper. Specifically, for each feature we use its value to rank the training instances, and define the ranking accuracy in terms of a performance measure or a loss function as the importance of the feature. We also define the correlation between the ranking results of two features as the similarity between them. Based on these definitions, we formulate the feature selection issue as an optimization problem, in which the goal is to find the features with maximum total importance scores and minimum total similarity scores. We also demonstrate how to solve the optimization problem in an efficient way. We have tested the effectiveness of our feature selection method on two information retrieval datasets and with two ranking models. Experimental results show that our method can outperform traditional feature selection methods for the ranking task.
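The importance/similarity trade-off described above can be sketched with pairwise ranking accuracy as the importance measure and rank agreement between features as the similarity, combined greedily (an illustrative simplification of the paper's optimization problem):

```python
from itertools import combinations

def pair_accuracy(scores, labels):
    """Fraction of label-ordered instance pairs that the scores order correctly."""
    pairs = [(i, j) for i, j in combinations(range(len(labels)), 2)
             if labels[i] != labels[j]]
    good = sum((scores[i] - scores[j]) * (labels[i] - labels[j]) > 0
               for i, j in pairs)
    return good / len(pairs)

def select_features(X, labels, k, c=0.5):
    """Greedy selection: maximise per-feature ranking accuracy (importance)
    minus rank agreement (similarity) with already-selected features."""
    n_feat = len(X[0])
    cols = [[row[f] for row in X] for f in range(n_feat)]
    imp = [pair_accuracy(col, labels) for col in cols]
    sim = lambda f, g: pair_accuracy(cols[f], cols[g])  # agreement of two feature rankings
    selected = []
    while len(selected) < k:
        best = max((f for f in range(n_feat) if f not in selected),
                   key=lambda f: imp[f] - c * max((sim(f, g) for g in selected),
                                                  default=0.0))
        selected.append(best)
    return selected
```

In the test below, feature 1 is a rescaled copy of feature 0; the similarity penalty makes the method skip it in favour of the less redundant feature 2.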

Journal ArticleDOI
TL;DR: A series of knowledge extractors is developed, employing a combination of knowledge-based and statistical techniques for automatically identifying clinically relevant aspects of MEDLINE abstracts; the resulting system significantly outperforms the already competitive PubMed baseline.
Abstract: The combination of recent developments in question-answering research and the availability of unparalleled resources developed specifically for automatic semantic processing of text in the medical domain provides a unique opportunity to explore complex question answering in the domain of clinical medicine. This article presents a system designed to satisfy the information needs of physicians practicing evidence-based medicine. We have developed a series of knowledge extractors, which employ a combination of knowledge-based and statistical techniques, for automatically identifying clinically relevant aspects of MEDLINE abstracts. These extracted elements serve as the input to an algorithm that scores the relevance of citations with respect to structured representations of information needs, in accordance with the principles of evidence-based medicine. Starting with an initial list of citations retrieved by PubMed, our system can bring relevant abstracts into higher ranking positions, and from these abstracts generate responses that directly answer physicians' questions. We describe three separate evaluations: one focused on the accuracy of the knowledge extractors, one conceptualized as a document reranking task, and finally, an evaluation of answers by two physicians. Experiments on a collection of real-world clinical questions show that our approach significantly outperforms the already competitive PubMed baseline.

Proceedings ArticleDOI
28 Oct 2007
TL;DR: This paper proposes a novel method for co-ranking authors and their publications using several networks: the social network connecting the authors, the citation network connecting the publications, as well as the authorship network that ties the previous two together.
Abstract: Recent graph-theoretic approaches have demonstrated remarkable successes for ranking networked entities, but most of their applications are limited to homogeneous networks such as the network of citations between publications. This paper proposes a novel method for co-ranking authors and their publications using several networks: the social network connecting the authors, the citation network connecting the publications, as well as the authorship network that ties the previous two together. The new co-ranking framework is based on coupling two random walks that separately rank authors and documents following the PageRank paradigm. As a result, improved rankings of documents and their authors depend on each other in a mutually reinforcing way, thus taking advantage of the additional information implicit in the heterogeneous network of authors and documents.

Patent
22 May 2007
TL;DR: In this article, the authors present a search experience that is substantially akin to consultation with a human expert, and that satisfies a user's information need in fulfilling projects such as purchasing, shopping, procurement, bartering, requesting for quotes, in online retail, traditional retail, wholesale, health care, travel, real estate, restaurant-going, entertainment, logistics, and sourcing.
Abstract: Methods and apparatus that deliver a searching experience that is substantially akin to consultation with a human expert, and that satisfies a user's information need in fulfilling projects such as purchasing, shopping, procurement, bartering, requesting for quotes, in online retail, traditional retail, wholesale, health care, travel, real estate, restaurant-going, entertainment, logistics, and sourcing are disclosed. Search results often contain entities that provide services and products. Records being searched are associated with industry sectors in a broad sense. Industry sector information is first derived from a user query; and is used in determining relevant and adequate additional questions for a searcher, and in matching, ranking, and presenting search results.

Proceedings Article
06 Jan 2007
TL;DR: A novel extractive approach based on manifold-ranking of sentences for this summarization task significantly outperforms the top-performing systems in DUC tasks as well as baseline approaches.
Abstract: Topic-focused multi-document summarization aims to produce a summary biased to a given topic or user profile. This paper presents a novel extractive approach based on manifold-ranking of sentences for this summarization task. The manifold-ranking process can naturally make full use of both the relationships among all the sentences in the documents and the relationships between the given topic and the sentences. The ranking score is obtained for each sentence in the manifold-ranking process to denote the biased information richness of the sentence. Then a greedy algorithm is employed to impose a diversity penalty on each sentence. The summary is produced by choosing the sentences with both high biased information richness and high information novelty. Experiments on DUC2003 and DUC2005 are performed, and the ROUGE evaluation results show that the proposed approach can significantly outperform the top-performing systems in DUC tasks as well as baseline approaches.
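The two stages, manifold-ranking propagation followed by a greedy diversity penalty, can be sketched as follows (the toy sentence-similarity matrix, the symmetric normalisation, and the penalty constant are illustrative assumptions):

```python
import numpy as np

def manifold_rank(W, y, alpha=0.6, iters=100):
    """Propagate the topic relevance vector y over the sentence similarity
    graph W using the normalised iteration f = alpha*S@f + (1-alpha)*y."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))           # symmetric normalisation
    f = np.array(y, dtype=float)
    for _ in range(iters):
        f = alpha * (S @ f) + (1 - alpha) * np.asarray(y, dtype=float)
    return f

def greedy_diverse(f, W, k, penalty=0.7):
    """Select k sentences, penalising those similar to already-chosen ones."""
    f = np.array(f, dtype=float)
    picked = []
    for _ in range(k):
        i = int(np.argmax(f))
        picked.append(i)
        f[i] = -np.inf
        f = f - penalty * np.asarray(W, dtype=float)[i]   # diversity penalty
    return picked

# Sentences 0 and 1 are near-duplicates; the topic matches sentence 0.
W = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
y = np.array([1.0, 0.0, 0.0])
```

Without the penalty the top two sentences would be the near-duplicates 0 and 1; the penalty pushes the summary toward the novel sentence 2 instead.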

Proceedings ArticleDOI
23 Jul 2007
TL;DR: An algorithm named FRank is proposed based on a generalized additive model for the sake of minimizing the fidelity loss and learning an effective ranking function, and the experimental results show that the proposed algorithm outperforms other learning-based ranking methods on both the conventional IR problem and Web search.
Abstract: The ranking problem is becoming important in many fields, especially in information retrieval (IR). Many machine learning techniques have been proposed for the ranking problem, such as RankSVM, RankBoost, and RankNet. Among them, RankNet, which is based on a probabilistic ranking framework, has led to promising results and has been applied to a commercial Web search engine. In this paper we conduct a further study on the probabilistic ranking framework and provide a novel loss function named fidelity loss for measuring the loss of ranking. The fidelity loss not only inherits effective properties of the probabilistic ranking framework in RankNet, but also possesses new properties that are helpful for ranking. These include the fidelity loss being able to reach zero for each document pair, and having a finite upper bound, which is necessary for conducting query-level normalization. We also propose an algorithm named FRank based on a generalized additive model for the sake of minimizing the fidelity loss and learning an effective ranking function. We evaluated the proposed algorithm on two datasets: a TREC dataset and a real Web search dataset. The experimental results show that the proposed FRank algorithm outperforms other learning-based ranking methods on both the conventional IR problem and Web search.
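The fidelity loss itself is a one-liner: with P*_ij the target probability that document i should rank above j, and P_ij the logistic probability induced by the model's score difference o_ij, the loss is one minus the fidelity between the two Bernoulli distributions:

```python
import math

def fidelity_loss(o_ij, p_star):
    """Fidelity loss for one document pair: p_star is the target pair
    probability, o_ij the model's score difference s_i - s_j."""
    p = 1.0 / (1.0 + math.exp(-o_ij))        # logistic pair probability
    return 1.0 - (math.sqrt(p_star * p) + math.sqrt((1.0 - p_star) * (1.0 - p)))
```

The loss is zero exactly when the model's pair probability matches the target and is bounded above by 1, which are the two properties (zero attainable per pair, finite upper bound) highlighted in the abstract.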

Proceedings ArticleDOI
23 Jul 2007
TL;DR: This work focuses on developing a regression framework for learning ranking functions for improving relevance of search engines serving diverse streams of user queries, and proposes a novel optimization framework emphasizing the use of relative relevance judgments.
Abstract: Effective ranking functions are an essential part of commercial search engines. We focus on developing a regression framework for learning ranking functions for improving relevance of search engines serving diverse streams of user queries. We explore supervised learning methodology from machine learning, and we distinguish two types of relevance judgments used as the training data: 1) absolute relevance judgments arising from explicit labeling of search results; and 2) relative relevance judgments extracted from user click throughs of search results or converted from the absolute relevance judgments. We propose a novel optimization framework emphasizing the use of relative relevance judgments. The main contribution is the development of an algorithm based on regression that can be applied to objective functions involving preference data, i.e., data indicating that a document is more relevant than another with respect to a query. Experimental results are carried out using data sets obtained from a commercial search engine. Our results show significant improvements of our proposed methods over some existing methods.
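The preference-data objective can be illustrated with a squared-hinge loss on score differences, fit here by gradient descent on a linear scoring function (the paper's framework is regression-based and more general; this linear sketch only demonstrates the pairwise objective):

```python
import numpy as np

def fit_preference_linear(pairs, X, tau=1.0, lr=0.01, epochs=200):
    """Learn linear scores w so that for each preference pair (i, j),
    item i outscores item j by a margin tau (squared-hinge objective)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i, j in pairs:
            margin = (X[i] - X[j]) @ w
            if margin < tau:
                # gradient step on (tau - margin)^2
                w += lr * 2.0 * (tau - margin) * (X[i] - X[j])
    return w
```

Only relative judgments ("i is more relevant than j for this query") are consumed, which is exactly the kind of training data click-throughs provide.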

Proceedings ArticleDOI
Apostol Natsev, Alexander Haubold, Jelena Tesic, Lexing Xie, Rong Yan
29 Sep 2007
TL;DR: This paper proposes several new approaches for query expansion, in which textual keywords, visual examples, or initial retrieval results are analyzed to identify the most relevant visual concepts for the given query, and develops both lexical and statistical approaches.
Abstract: We study the problem of semantic concept-based query expansion and re-ranking for multimedia retrieval. In particular, we explore the utility of a fixed lexicon of visual semantic concepts for automatic multimedia retrieval and re-ranking purposes. In this paper, we propose several new approaches for query expansion, in which textual keywords, visual examples, or initial retrieval results are analyzed to identify the most relevant visual concepts for the given query. These concepts are then used to generate additional query results and/or to re-rank an existing set of results. We develop both lexical and statistical approaches for text query expansion, as well as content-based approaches for visual query expansion. In addition, we study several other recently proposed methods for concept-based query expansion. In total, we compare 7 different approaches for expanding queries with visual semantic concepts. They are evaluated using a large video corpus and 39 concept detectors from the TRECVID-2006 video retrieval benchmark. We observe consistent improvement over the baselines for all 7 approaches, leading to an overall performance gain of 77% relative to a text retrieval baseline, and a 31% improvement relative to a state-of-the-art multimodal retrieval baseline.

Journal ArticleDOI
TL;DR: The results show that meta-features associated with a query can be combined with text retrieval techniques to improve the understanding and treatment of text search on documents with timestamps.
Abstract: Documents with timestamps, such as email and news, can be placed along a timeline. The timeline for a set of documents returned in response to a query gives an indication of how documents relevant to that query are distributed in time. Examining the timeline of a query result set allows us to characterize both how temporally dependent the topic is, as well as how relevant the results are likely to be. We outline characteristic patterns in query result set timelines, and show experimentally that we can automatically classify documents into these classes. We also show that properties of the query result set timeline can help predict the mean average precision of a query. These results show that meta-features associated with a query can be combined with text retrieval techniques to improve our understanding and treatment of text search on documents with timestamps.

Journal ArticleDOI
TL;DR: A probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed, and a novel technique for estimating parameters that does not require human relevance judgments is described.
Abstract: We present a probabilistic topic-based model for content similarity called pmra that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance; rather, our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH® in MEDLINE®. The pmra retrieval model was compared against bm25, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of pmra over bm25 in terms of precision. Our experiments suggest that the pmra model provides an effective ranking algorithm for related article search.

Journal ArticleDOI
TL;DR: The impossibility theorems that abound in the theory of social choice show that there can be no satisfactory method for electing and ranking in the context of the traditional, 700-year-old model, and a new theory is proposed that avoids all impossibilities with a simple and eminently practical method, “the majority judgement.”
Abstract: The impossibility theorems that abound in the theory of social choice show that there can be no satisfactory method for electing and ranking in the context of the traditional, 700-year-old model. A more realistic model, whose antecedents may be traced to Laplace and Galton, leads to a new theory that avoids all impossibilities with a simple and eminently practical method, “the majority judgement.” It has already been tested.
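The majority judgement elects by median grade: each voter grades every candidate on a common scale, and a candidate's majority-grade is the highest grade a majority supports. A minimal sketch (the grade scale and ballots are invented, and the full method's iterative tie-breaking between equal medians is omitted):

```python
GRADES = ["Reject", "Poor", "Acceptable", "Good", "Very Good", "Excellent"]

def majority_grade(grades):
    """The (lower) median grade: the best grade that an absolute
    majority of voters judges the candidate to deserve at least."""
    ranks = sorted(GRADES.index(g) for g in grades)
    return GRADES[ranks[(len(ranks) - 1) // 2]]

ballots = {
    "A": ["Good", "Excellent", "Poor", "Good", "Acceptable"],
    "B": ["Very Good", "Reject", "Acceptable", "Poor", "Excellent"],
}
ranking = sorted(ballots,
                 key=lambda c: GRADES.index(majority_grade(ballots[c])),
                 reverse=True)
print(majority_grade(ballots["A"]), ranking)  # Good ['A', 'B']
```

Because the median, unlike the mean, ignores how extreme the outlying grades are, a single strategic "Reject" or "Excellent" cannot drag a candidate's standing — the property the authors exploit to escape the classical impossibilities.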

Patent
21 Sep 2007
TL;DR: In this paper, a system and method for ranking items is described, which includes a ranking module that estimates the likelihood that a user will select the offerings and the likelihood that the user will purchase the items offered in the offerings.
Abstract: A system and method for ranking items is disclosed. In one embodiment, the items constitute offerings offered by at least one on-line vendor. The system includes a ranking module that estimates the likelihood that a user will select the offerings and the likelihood that the user will purchase the items offered in the offerings. Based on this information, the ranking module calculates the expected revenue to be generated by the offerings. The ranking module then ranks the offerings relative to one another so as to increase income received by the system administrator.
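The expected-revenue ranking the patent describes reduces to multiplying a selection probability, a conversion probability, and a price, then sorting. A minimal sketch, where the probability estimates are illustrative placeholders rather than the patent's estimators:

```python
def expected_revenue(offer):
    """Expected revenue of showing an offering: P(user selects it)
    times P(user purchases given selection) times the item price."""
    return offer["p_click"] * offer["p_buy"] * offer["price"]

offers = [
    {"name": "widget", "p_click": 0.20, "p_buy": 0.10, "price": 50.0},
    {"name": "gadget", "p_click": 0.05, "p_buy": 0.40, "price": 80.0},
]
ranked = sorted(offers, key=expected_revenue, reverse=True)
print([o["name"] for o in ranked])  # ['gadget', 'widget']
```

Note that the rarely clicked but high-converting, high-priced "gadget" outranks the frequently clicked "widget" — exactly the reordering a pure click-through ranking would miss.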

Book ChapterDOI
Qi Zhou1, Chong Wang1, Miao Xiong1, Haofen Wang1, Yong Yu1 
11 Nov 2007
TL;DR: This paper explores a novel approach to adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search.
Abstract: Semantic search promises to provide more accurate results than present-day keyword search. However, progress with semantic search has been delayed by the complexity of its query languages. In this paper, we explore a novel approach to adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named SPARK has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking. Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In our experiments, SPARK achieved encouraging translation results.
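The three steps — term mapping, query construction, query ranking — can be caricatured in a few lines. This is a toy version of SPARK's pipeline, not its implementation: the term map, its confidence scores, and the property names are invented, and real query graph construction is far richer than one triple per keyword:

```python
from itertools import product

def translate(keywords, term_map):
    """Map each keyword to candidate ontology terms, build one SPARQL
    query per mapping combination, and rank queries by the product of
    the mapping confidences (a stand-in for a probabilistic ranking
    model)."""
    ranked = []
    for combo in product(*(term_map[k] for k in keywords)):
        confidence = 1.0
        triples = []
        for term, conf in combo:
            confidence *= conf
            triples.append(f"?x {term} ?v{len(triples)} .")
        query = "SELECT ?x WHERE { " + " ".join(triples) + " }"
        ranked.append((confidence, query))
    return sorted(ranked, reverse=True)

term_map = {
    "professor": [(":holdsPosition", 0.9), (":teaches", 0.4)],
    "stanford":  [(":affiliatedWith", 0.8)],
}
best_conf, best_query = translate(["professor", "stanford"], term_map)[0]
print(best_query)
```

Returning the whole ranked list, rather than committing to one translation, is what lets the user (or a downstream evaluator) recover when the top mapping is wrong.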

Journal ArticleDOI
TL;DR: With proper interpretations that avoid inherent biases, a search engine can use training data extracted from its logs to automatically tailor ranking functions to a particular user group or collection, harnessing machine-learning techniques to improve search quality.
Abstract: Search-engine logs provide a wealth of information that machine-learning techniques can harness to improve search quality. With proper interpretations that avoid inherent biases, a search engine can use training data extracted from the logs to automatically tailor ranking functions to a particular user group or collection.
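One well-known way to interpret logs while avoiding position bias is to extract pairwise preferences rather than absolute labels: a clicked result is preferred over unclicked results ranked above it (the "skip-above" heuristic). A minimal sketch of that extraction step, offered as one illustrative interpretation rather than the article's specific method:

```python
def preference_pairs(ranking, clicked):
    """Extract pairwise training preferences from one query's click log:
    each clicked document is preferred over every unclicked document
    ranked above it. This sidesteps the bias of reading a click as an
    absolute relevance judgment."""
    pairs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            pairs.extend((doc, skipped)
                         for skipped in ranking[:i] if skipped not in clicked)
    return pairs

print(preference_pairs(["d1", "d2", "d3", "d4"], {"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]
```

Pairs like these can then feed any learning-to-rank method that trains on relative judgments.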

Proceedings Article
23 Sep 2007
TL;DR: This work focuses on the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking.
Abstract: As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. While entities appear in many pages, current engines only find each page individually. Toward searching directly and holistically for finding information of finer granularity, we study the problem of entity search, a significant departure from traditional document retrieval. We focus on the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We evaluate our online prototype over a 2TB Web corpus, and show that EntityRank performs effectively.
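The local/global split can be illustrated with a crude stand-in: local evidence asks whether a candidate entity co-occurs with the query keyword within a small window on a page, and global evidence aggregates that support across pages. The window test and log damping below are invented for illustration and are not EntityRank's actual impression model or formula:

```python
from math import log

def local_score(entity_pos, keyword_pos, window=20):
    """Local (within-page) evidence: the entity and the query keyword
    co-occur within a small token window."""
    return any(abs(e - k) <= window for e in entity_pos for k in keyword_pos)

def global_score(pages):
    """Global evidence: count supporting pages, damped so that no
    single (possibly spammy) page dominates the entity's score."""
    support = sum(1 for p in pages if local_score(p["entity"], p["keyword"]))
    return log(1 + support)

# token positions of a candidate phone number and of the query keyword
pages = [
    {"entity": [12], "keyword": [5]},      # co-occur within the window
    {"entity": [400], "keyword": [3]},     # too far apart on this page
    {"entity": [7, 90], "keyword": [80]},  # co-occur
]
print(global_score(pages))
```

Ranking entities this way, instead of ranking the pages they appear on, is the departure from document retrieval that the abstract emphasizes.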

Proceedings ArticleDOI
23 Jul 2007
TL;DR: This work presents a vocabulary independent system that can handle arbitrary queries, exploiting the information provided by having both word transcripts and phonetic transcripts, in order to retrieve information from speech data.
Abstract: We are interested in retrieving information from speech data such as broadcast news, telephone conversations and roundtable meetings. Today, most systems use large-vocabulary continuous speech recognition tools to produce word transcripts; the transcripts are indexed and query terms are retrieved from the index. However, query terms that are not part of the recognizer's vocabulary cannot be retrieved, and the recall of the search suffers. In addition to the output word transcript, advanced systems also provide phonetic transcripts, against which query terms can be matched phonetically. Such phonetic transcripts suffer from lower accuracy and cannot serve as an alternative to word transcripts. We present a vocabulary-independent system that can handle arbitrary queries, exploiting the information provided by having both word transcripts and phonetic transcripts. A speech recognizer generates word confusion networks and phonetic lattices. The transcripts are indexed for query processing and ranking purposes. The value of the proposed method is demonstrated by the relatively high performance of our system, which received the highest overall ranking for US English speech data in the recent NIST Spoken Term Detection evaluation.
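The routing decision at the heart of the hybrid approach — in-vocabulary terms go to the word index, out-of-vocabulary terms fall back to phonetic matching — can be sketched directly. The toy indexes, the phone strings, and the `to_phones` grapheme-to-phoneme function are stand-ins; real systems match against confusion networks and lattices rather than flat dictionaries:

```python
def search(query, word_index, phone_index, vocabulary, to_phones):
    """Route a query term: exact lookup in the word index when it is
    in-vocabulary, otherwise match its phonetic form against the
    phonetic index so OOV terms remain retrievable."""
    if query in vocabulary:
        return word_index.get(query, [])
    return phone_index.get(to_phones(query), [])

word_index = {"broadcast": ["doc1", "doc7"]}
phone_index = {"G UW G AH L": ["doc3"]}       # an OOV name, as phones
g2p = {"google": "G UW G AH L"}.get           # toy grapheme-to-phoneme
print(search("broadcast", word_index, phone_index, {"broadcast"}, g2p))
print(search("google", word_index, phone_index, {"broadcast"}, g2p))
```

The word path keeps precision for common terms while the phonetic path recovers the recall that a word-only index loses on names and new coinages.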

Proceedings ArticleDOI
15 Apr 2007
TL;DR: The Lower Bound Constraint algorithm (LBC) is proven to be an instance optimal algorithm and extensive experiments demonstrate that LBC is four times more efficient than a straightforward algorithm.
Abstract: Skyline query processing has been investigated extensively in recent years, mostly for only one query reference point. An example of a single-source skyline query is to find hotels which are cheap and close to the beach (an absolute query), or close to a user-given location (a relative query). A multi-source skyline query considers several query points at the same time (e.g., to find hotels which are cheap and close to the University, the Botanic Garden and the China Town). In this paper, we consider the problem of efficient multi-source skyline query processing in road networks. This is not only the first effort to consider multi-source skyline queries in road networks but also the first to process relative skyline queries, where the network distance between two locations needs to be computed on-the-fly. Three different query processing algorithms are proposed and evaluated in this paper. The Lower Bound Constraint algorithm (LBC) is proven to be an instance optimal algorithm. Extensive experiments using large real road network datasets demonstrate that LBC is four times more efficient than a straightforward algorithm.
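The dominance test underlying any multi-source skyline is the same regardless of algorithm: a candidate survives unless some other candidate is at least as close to every query point and strictly closer to one. A brute-force sketch over precomputed distance vectors (LBC's contribution is precisely avoiding this all-pairs cost and the on-the-fly network distances, which are assumed given here):

```python
def dominates(a, b):
    """a dominates b if a is no farther from every query point and
    strictly closer to at least one."""
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))

def skyline(candidates):
    """Brute-force multi-source skyline: keep every candidate that no
    other candidate dominates."""
    return {name for name, d in candidates.items()
            if not any(dominates(e, d)
                       for other, e in candidates.items() if other != name)}

# distances of each hotel to (University, Botanic Garden, China Town)
hotels = {"H1": (1.0, 4.0, 2.0), "H2": (2.0, 1.0, 3.0), "H3": (3.0, 4.0, 3.0)}
print(sorted(skyline(hotels)))  # ['H1', 'H2']
```

H3 is pruned because H1 beats it on every query point, while H1 and H2 are incomparable and both survive — the answer set a user would actually want to inspect.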