
Showing papers presented at the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005)


Proceedings ArticleDOI
15 Aug 2005
TL;DR: It is concluded that clicks are informative but biased, and while this makes the interpretation of clicks as absolute relevance judgments difficult, it is shown that relative preferences derived from clicks are reasonably accurate on average.
Abstract: This paper examines the reliability of implicit feedback generated from clickthrough data in WWW search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average.

1,484 citations
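
A minimal sketch of how relative preferences can be derived from clicks, assuming a "clicked result is preferred over skipped results ranked above it" strategy; the function and variable names are illustrative, not taken from the paper.

    # Derive pairwise relative preferences from a ranked result list and the
    # set of clicked ranks: a clicked result is preferred over every
    # higher-ranked result that was skipped.
    def click_skip_above_preferences(ranking, clicked_ranks):
        """ranking: doc ids in rank order; clicked_ranks: set of 0-based ranks."""
        prefs = []
        for i in sorted(clicked_ranks):
            for j in range(i):                 # every result ranked above the click
                if j not in clicked_ranks:     # ...that was skipped
                    prefs.append((ranking[i], ranking[j]))
        return prefs

    # Example: results d0..d3, clicks on ranks 1 and 3.
    print(click_skip_above_preferences(["d0", "d1", "d2", "d3"], {1, 3}))
    # -> [('d1', 'd0'), ('d3', 'd0'), ('d3', 'd2')]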


Proceedings ArticleDOI
15 Aug 2005
TL;DR: A novel approach is developed to train the model that directly maximizes the mean average precision rather than maximizing the likelihood of the training data, and significant improvements are possible by modeling dependencies, especially on the larger web collections.
Abstract: This paper develops a general, formal framework for modeling term dependencies via Markov random fields. The model allows for arbitrary text features to be incorporated as evidence. In particular, we make use of features based on occurrences of single terms, ordered phrases, and unordered phrases. We explore full independence, sequential dependence, and full dependence variants of the model. A novel approach is developed to train the model that directly maximizes the mean average precision rather than maximizing the likelihood of the training data. Ad hoc retrieval experiments are presented on several newswire and web collections, including the GOV2 collection used at the TREC 2004 Terabyte Track. The results show significant improvements are possible by modeling dependencies, especially on the larger web collections.

996 citations
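
For reference, the sequential dependence variant of such a model is often summarized as a linear combination of single-term, exact ordered-phrase, and unordered-window features, with the weights tuned to maximize mean average precision (the notation below is a common shorthand rather than the paper's exact formulation):

    P(D \mid Q) \;\overset{\text{rank}}{=}\;
      \lambda_T \sum_{q_i \in Q} f_T(q_i, D)
      \;+\; \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D)
      \;+\; \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D),
      \qquad \lambda_T + \lambda_O + \lambda_U = 1

Here f_T, f_O and f_U are smoothed language-model log scores for single terms, exact adjacent-term phrases, and unordered windows containing both terms, respectively.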


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This research suggests that rich representations of the user and the corpus are important for personalization, but that it is possible to approximate these representations and provide efficient client-side algorithms for personalizing search.
Abstract: We formulate and study search algorithms that consider a user's prior interactions with a wide variety of content to personalize that user's current Web search. Rather than relying on the unrealistic assumption that people will precisely specify their intent when searching, we pursue techniques that leverage implicit information about the user's interests. This information is used to re-rank Web search results within a relevance feedback framework. We explore rich models of user interests, built from both search-related information, such as previously issued queries and previously visited Web pages, and other information about the user such as documents and email the user has read and created. Our research suggests that rich representations of the user and the corpus are important for personalization, but that it is possible to approximate these representations and provide efficient client-side algorithms for personalizing search. We show that such personalization algorithms can significantly improve on current Web search.

928 citations
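
A rough client-side re-ranking sketch in the spirit of this approach: build a term profile from the user's previously seen content and combine each result's (normalized) engine score with its similarity to that profile. All names and the mixing parameter are illustrative assumptions, not the authors' implementation.

    from collections import Counter
    import math

    def build_profile(texts):
        """Term-frequency profile from prior queries, visited pages, documents, email."""
        profile = Counter()
        for t in texts:
            profile.update(t.lower().split())
        return profile

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def rerank(results, profile, alpha=0.5):
        """results: list of (doc_id, engine_score in [0, 1], snippet_text)."""
        scored = []
        for doc_id, score, snippet in results:
            personal = cosine(profile, Counter(snippet.lower().split()))
            scored.append((alpha * score + (1 - alpha) * personal, doc_id))
        return [d for _, d in sorted(scored, reverse=True)]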


Proceedings ArticleDOI
15 Aug 2005
TL;DR: In this paper, clusters generated from the training data provide the basis for data smoothing and neighborhood selection, and the proposed approach is shown to consistently outperform other state-of-the-art collaborative filtering algorithms.
Abstract: Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms.

706 citations
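
A rough sketch of the cluster-then-smooth idea, assuming k-means over the user-item rating matrix and per-cluster item means as the smoothing values; the clustering choice and parameter names are illustrative, not the paper's exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    def smooth_ratings(R, n_clusters=10):
        """R: users x items matrix with np.nan for unrated items."""
        filled = np.nan_to_num(R, nan=0.0)              # missing treated as 0 for clustering
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(filled)
        S = R.copy()
        global_mean = np.nanmean(R)
        for c in range(n_clusters):
            members = R[labels == c]
            item_mean = np.nanmean(members, axis=0)     # per-item mean within the cluster
            item_mean = np.nan_to_num(item_mean, nan=global_mean)
            for u in np.where(labels == c)[0]:
                missing = np.isnan(R[u])
                S[u, missing] = item_mean[missing]      # smooth the unrated entries
        return S, labels                                # neighbors are then selected per cluster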


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper proposes several context-sensitive retrieval algorithms based on statistical language models to combine the preceding queries and clicked document summaries with the current query for better ranking of documents.
Abstract: A major limitation of most existing retrieval models and systems is that the retrieval decision is made based solely on the query and document collection; information about the actual user and search context is largely ignored. In this paper, we study how to exploit implicit feedback information, including previous queries and clickthrough information, to improve retrieval accuracy in an interactive information retrieval setting. We propose several context-sensitive retrieval algorithms based on statistical language models to combine the preceding queries and clicked document summaries with the current query for better ranking of documents. We use the TREC AP data to create a test collection with search context information, and quantitatively evaluate our models using this test set. Experiment results show that using implicit feedback, especially the clicked document summaries, can improve retrieval performance substantially.

501 citations
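
One simple fixed-coefficient instantiation of this idea is to interpolate the current query model with models built from the search context before ranking documents; the symbols below are illustrative shorthand, not the paper's exact notation:

    \theta_k \;=\; \alpha\,\theta_{q_k}
      \;+\; (1-\alpha)\Big(\beta\,\theta_{H_Q} + (1-\beta)\,\theta_{H_C}\Big)

where θ_{q_k} is the language model of the current query, θ_{H_Q} averages the models of the preceding queries, θ_{H_C} averages the models of the clicked document summaries, and α, β control how much weight the search context receives.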


Proceedings ArticleDOI
15 Aug 2005
TL;DR: It is found that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems.
Abstract: The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results with limited application. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon test), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests over-estimated the error of such tests. We also re-consider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not consider the assessor effort required to compute such measures. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.

381 citations
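
A minimal example of the kind of paired comparison discussed here, using per-topic average precision scores for two systems (toy numbers; scipy's paired t-test and Wilcoxon signed-rank test stand in for the tests compared in the paper):

    from scipy import stats

    ap_system_a = [0.32, 0.11, 0.45, 0.27, 0.50, 0.08, 0.39]   # per-topic AP, system A
    ap_system_b = [0.28, 0.09, 0.41, 0.30, 0.44, 0.07, 0.35]   # per-topic AP, system B

    t_stat, p_t = stats.ttest_rel(ap_system_a, ap_system_b)    # paired (matched-topic) t-test
    w_stat, p_w = stats.wilcoxon(ap_system_a, ap_system_b)     # signed-rank alternative
    print(f"t-test p={p_t:.3f}, Wilcoxon p={p_w:.3f}")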


Proceedings ArticleDOI
15 Aug 2005
TL;DR: An additional criterion for web page ranking is introduced, namely the distance between a user profile defined using ODP topics and the sets of ODP topics covered by each URL returned in regular web search, and the boundaries of biasing PageRank on subtopics of the ODP are investigated.
Abstract: The Open Directory Project is clearly one of the largest collaborative efforts to manually annotate web pages. This effort involves over 65,000 editors and resulted in metadata specifying topic and importance for more than 4 million web pages. Still, given that this number is just about 0.05 percent of the Web pages indexed by Google, is this effort enough to make a difference? In this paper we discuss how these metadata can be exploited to achieve high quality personalized web search. First, we address this by introducing an additional criterion for web page ranking, namely the distance between a user profile defined using ODP topics and the sets of ODP topics covered by each URL returned in regular web search. We empirically show that this enhancement yields better results than current web search using Google. Then, in the second part of the paper, we investigate the boundaries of biasing PageRank on subtopics of the ODP in order to automatically extend these metadata to the whole web.

327 citations


Proceedings ArticleDOI
Eric Gaussier, Cyril Goutte
15 Aug 2005
TL;DR: It is shown that PLSA solves the problem of NMF with KL divergence, and the implications of this relationship are explored.
Abstract: Non-negative Matrix Factorization (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document clustering. Despite their different inspirations, both methods are instances of multinomial PCA [1]. We further explore this relationship and first show that PLSA solves the problem of NMF with KL divergence, and then explore the implications of this relationship.

305 citations
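
The relationship can be paraphrased as follows (a summary in generic notation, not the paper's exact derivation): NMF under the generalized KL divergence minimizes

    D_{KL}(V \,\|\, WH) \;=\; \sum_{ij}\Big(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\Big)

while PLSA maximizes the multinomial log-likelihood

    \sum_{ij} V_{ij}\,\log \sum_{k} P(z_k)\,P(w_i \mid z_k)\,P(d_j \mid z_k)

Identifying the normalized factors, W_{ik} with P(w_i | z_k) and H_{kj} with P(z_k) P(d_j | z_k) (up to scaling so that WH sums to the total count in V), makes the two objectives coincide up to constants, which is why a PLSA solution is also a solution of NMF with KL divergence.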


Proceedings ArticleDOI
Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, Tat-Seng Chua
15 Aug 2005
TL;DR: This work presents two methods for learning relation mapping scores from past QA pairs, one based on mutual information and the other on expectation maximization; the resulting fuzzy relation matching significantly outperforms state-of-the-art density-based passage retrieval methods.
Abstract: State-of-the-art question answering (QA) systems employ term-density ranking to retrieve answer passages. Such methods often retrieve incorrect passages as relationships among question terms are not considered. Previous studies attempted to address this problem by matching dependency relations between questions and answers. They used strict matching, which fails when semantically equivalent relationships are phrased differently. We propose fuzzy relation matching based on statistical models. We present two methods for learning relation mapping scores from past QA pairs: one based on mutual information and the other on expectation maximization. Experimental results show that our method significantly outperforms state-of-the-art density-based passage retrieval methods by up to 78% in mean reciprocal rank. Relation matching also brings about a 50% improvement in a system enhanced by query expansion.

264 citations


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper introduces the multi-label informed latent semantic indexing (MLSI) algorithm which preserves the information of inputs and meanwhile captures the correlations between the multiple outputs and incorporates the human-annotated category information.
Abstract: Latent semantic indexing (LSI) is a well-known unsupervised approach for dimensionality reduction in information retrieval. However if the output information (i.e. category labels) is available, it is often beneficial to derive the indexing not only based on the inputs but also on the target values in the training data set. This is of particular importance in applications with multiple labels, in which each document can belong to several categories simultaneously. In this paper we introduce the multi-label informed latent semantic indexing (MLSI) algorithm which preserves the information of inputs and meanwhile captures the correlations between the multiple outputs. The recovered "latent semantics" thus incorporate the human-annotated category information and can be used to greatly improve the prediction accuracy. Empirical study based on two data sets, Reuters-21578 and RCV1, demonstrates very encouraging results.

253 citations


Proceedings ArticleDOI
15 Aug 2005
TL;DR: Novel learning methods are presented for estimating the quality of results returned by a search engine in response to a query, and the usefulness of quality estimation is demonstrated for several applications, among them improvement of retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval.
Abstract: In this article we present novel learning methods for estimating the quality of results returned by a search engine in response to a query. Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries. We demonstrate the usefulness of quality estimation for several applications, among them improvement of retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval. Experiments on TREC data demonstrate the robustness and the effectiveness of our learning algorithms.
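
A sketch of the agreement idea: score the query by how much the top results of the full query overlap with the top results of its sub-queries (here, the query with one term dropped). The search function and the uniform weighting are placeholders for illustration.

    def overlap_at_k(full_top, sub_top, k=10):
        return len(set(full_top[:k]) & set(sub_top[:k])) / float(k)

    def estimate_query_quality(query_terms, search, k=10):
        """search(terms) -> ranked list of doc ids (stand-in for a real engine)."""
        full_top = search(query_terms)
        scores = []
        for i in range(len(query_terms)):
            sub_query = query_terms[:i] + query_terms[i + 1:]   # drop one term
            if sub_query:
                scores.append(overlap_at_k(full_top, search(sub_query), k))
        return sum(scores) / len(scores) if scores else 0.0     # higher = more agreement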

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper explores correlations among categories with a maximum entropy method and derives a classification algorithm for multi-labelled documents that significantly outperforms a combination of single-label classifiers.
Abstract: Many classification problems require classifiers to assign each single document into more than one category, which is called multi-labelled classification. The categories in such problems usually are neither conditionally independent from each other nor mutually exclusive; therefore it is not trivial to directly employ state-of-the-art classification algorithms without losing information about relations among categories. In this paper, we explore correlations among categories with a maximum entropy method and derive a classification algorithm for multi-labelled documents. Our experiments show that this method significantly outperforms the combination of single-label approaches.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This work proposes ten strategies for solving the problem of associating ads with a Web page from a computer science perspective and suggests that great accuracy in content-targeted advertising can be attained with appropriate algorithms.
Abstract: The current boom of the Web is associated with the revenues originated from on-line advertising. While search-based advertising is dominant, the association of ads with a Web page (during user navigation) is becoming increasingly important. In this work, we study the problem of associating ads with a Web page, referred to as content-targeted advertising, from a computer science perspective. We assume that we have access to the text of the Web page, the keywords declared by an advertiser, and a text associated with the advertiser's business. Using no other information and operating in fully automatic fashion, we propose ten strategies for solving the problem and evaluate their effectiveness. Our methods indicate that a matching strategy that takes into account the semantics of the problem (referred to as AAK for "ads and keywords") can yield gains in average precision figures of 60% compared to a trivial vector-based strategy. Further, a more sophisticated impedance coupling strategy, which expands the text of the Web page to reduce vocabulary impedance with regard to an advertisement, can yield extra gains in average precision of 50%. These are first results. They suggest that great accuracy in content-targeted advertising can be attained with appropriate algorithms.
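
A toy version of the vector-space matching that serves as the baseline here, plus the AAK-style requirement that the ad's declared keywords actually occur on the page; the ad schema and field names are assumptions for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_ads(page_text, ads):
        """ads: list of dicts with 'keywords' and 'text' fields (illustrative schema)."""
        ad_docs = [a["keywords"] + " " + a["text"] for a in ads]
        tfidf = TfidfVectorizer().fit(ad_docs + [page_text])
        sims = cosine_similarity(tfidf.transform([page_text]),
                                 tfidf.transform(ad_docs)).ravel()
        matches = [                       # keep only ads whose keywords appear on the page
            (s, a) for s, a in zip(sims, ads)
            if all(k in page_text.lower() for k in a["keywords"].lower().split())
        ]
        return [a for s, a in sorted(matches, key=lambda x: -x[0])]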

Proceedings ArticleDOI
15 Aug 2005
TL;DR: A novel ranking scheme named Affinity Ranking (AR) is proposed to re-rank search results by optimizing two metrics: diversity -- which indicates the variance of topics in a group of documents; and information richness -- which measures the coverage of a single document to its topic.
Abstract: In this paper, we propose a novel ranking scheme named Affinity Ranking (AR) to re-rank search results by optimizing two metrics: (1) diversity -- which indicates the variance of topics in a group of documents; (2) information richness -- which measures the coverage of a single document to its topic. Both of the two metrics are calculated from a directed link graph named Affinity Graph (AG). AG models the structure of a group of documents based on the asymmetric content similarities between each pair of documents. Experimental results in Yahoo! Directory, ODP Data, and Newsgroup data demonstrate that our proposed ranking algorithm significantly improves the search performance. Specifically, the algorithm achieves 31% improvement in diversity and 12% improvement in information richness relatively within the top 10 search results.
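
A sketch of the graph side of this scheme: asymmetric content similarities define a directed affinity graph, and a PageRank-style iteration over it yields an information-richness score per document (threshold and damping values are illustrative).

    import numpy as np

    def information_richness(sim, damping=0.85, iters=50, threshold=0.1):
        """sim[i, j]: asymmetric similarity of document i to document j."""
        A = np.where(sim > threshold, sim, 0.0)   # keep only strong affinity links
        np.fill_diagonal(A, 0.0)
        row_sums = A.sum(axis=1, keepdims=True)
        M = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
        n = A.shape[0]
        r = np.full(n, 1.0 / n)
        for _ in range(iters):                    # power iteration
            r = (1 - damping) / n + damping * (M.T @ r)
        return r                                  # diversity is then enforced in a re-ranking pass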


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This work proposes a probabilistic model to incorporate both content and time information in a unified framework for retrospective news event detection and builds an interactive RED system, HISCOVERY, which provides additional functions to present events, Photo Story and Chronicle.
Abstract: Retrospective news event detection (RED) is defined as the discovery of previously unidentified events in a historical news corpus. Although both the contents and time information of news articles are helpful to RED, most research focuses on the utilization of the contents of news articles; little work has been carried out on finding better uses of time information. In this paper, we explore both directions based on the following two characteristics of news articles. On the one hand, news articles are always aroused by events; on the other hand, similar articles reporting the same event often redundantly appear on many news sources. The former hints at a generative model of news articles, and the latter provides a data-enriched environment in which to perform RED. With consideration of these characteristics, we propose a probabilistic model to incorporate both content and time information in a unified framework. This model gives new representations of both news articles and news events. Furthermore, based on this approach, we build an interactive RED system, HISCOVERY, which provides additional functions to present events, Photo Story and Chronicle.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: The results show that the model achieves substantial and significant improvements with respect to the models without these relationships, and clearly shows the benefit of integrating word relationships into language models for IR.
Abstract: In this paper, we propose a novel dependency language modeling approach for information retrieval. The approach extends the existing language modeling approach by relaxing the independence assumption. Our goal is to build a language model in which various word relationships can be integrated. In this work, we integrate two types of relationship extracted from WordNet and co-occurrence relationships respectively. The integrated model has been tested on several TREC collections. The results show that our model achieves substantial and significant improvements with respect to the models without these relationships. These results clearly show the benefit of integrating word relationships into language models for IR.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper proposes a structural re-ranking approach to ad hoc information retrieval, which reorders the documents in an initially retrieved set by exploiting asymmetric relationships between them, and shows that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks.
Abstract: Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural re-ranking approach to ad hoc information retrieval: we reorder the documents in an initially retrieved set by exploiting asymmetric relationships between them. Specifically, we consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another; in doing so, we take care to prevent bias against long documents. We study a number of re-ranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper proposes a new axiomatic approach to developing retrieval models based on direct modeling of relevance with formalized retrieval constraints defined at the level of terms, and derives several new retrieval functions using this framework.
Abstract: Existing retrieval models generally do not offer any guarantee for optimal retrieval performance. Indeed, it is even difficult, if not impossible, to predict a model's empirical performance analytically. This limitation is at least partly caused by the way existing retrieval models are developed where relevance is only coarsely modeled at the level of documents and queries as opposed to a finer granularity level of terms. In this paper, we present a new axiomatic approach to developing retrieval models based on direct modeling of relevance with formalized retrieval constraints defined at the level of terms. The basic idea of this axiomatic approach is to search in a space of candidate retrieval functions for one that can satisfy a set of reasonable retrieval constraints. To constrain the search space, we propose to define a retrieval function inductively and decompose a retrieval function into three component functions. Inspired by the analysis of the existing retrieval functions with the inductive definition, we derive several new retrieval functions using the axiomatic retrieval framework. Experiment results show that the derived new retrieval functions are more robust and less sensitive to parameter settings than the existing retrieval functions with comparable optimal performance.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: FLOE is presented, a simple density analysis method for modelling the shape of the transformation required, based on training data and without assuming independence between feature and baseline, for a new query independent feature.
Abstract: A query independent feature, relating perhaps to document content, linkage or usage, can be transformed into a static, per-document relevance weight for use in ranking. The challenge is to find a good function to transform feature values into relevance scores. This paper presents FLOE, a simple density analysis method for modelling the shape of the transformation required, based on training data and without assuming independence between feature and baseline. For a new query independent feature, it addresses the questions: is it required for ranking, what sort of transformation is appropriate and, after adding it, how successful was the chosen transformation? Based on this we apply sigmoid transformations to PageRank, indegree, URL Length and ClickDistance, tested in combination with a BM25 baseline.
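
The static-score transforms in this line of work typically take a saturating form added to the text score, along the lines of (symbols are generic, not necessarily the paper's notation):

    score(d, q) \;=\; \mathrm{BM25}(d, q) \;+\; \frac{w\,S^{a}}{k^{a} + S^{a}}

where S is the query-independent feature value (e.g., PageRank or indegree) and w, k, a are fitted on training data; for features where smaller is better (URL length, ClickDistance), the saturating term is inverted so that small values receive the boost.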

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper proposes a dependency-network based collective classification method, in which the local classifiers are maximum entropy models based on words and certain relational features, exploiting the sequential correlation among email messages in the same thread.
Abstract: We consider classification of email messages as to whether or not they contain certain "email acts", such as a request or a commitment. We show that exploiting the sequential correlation among email messages in the same thread can improve email-act classification. More specifically, we describe a new text-classification algorithm based on a dependency-network based collective classification method, in which the local classifiers are maximum entropy models based on words and certain relational features. We show that statistically significant improvements over a bag-of-words baseline classifier can be obtained for some, but not all, email-act classes. Performance improvements obtained by collective classification appear to be consistent across many email acts suggested by prior speech-act theory.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper presents eight different methods of generating MDS and evaluates each of these methods on a large set of topics used in past DUC workshops, showing a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.
Abstract: The problem of using topic representations for multi-document summarization (MDS) has received considerable attention recently. In this paper, we describe five different topic representations and introduce a novel representation of topics based on topic themes. We present eight different methods of generating MDS and evaluate each of these methods on a large set of topics used in past DUC workshops. Our evaluation results show a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.

Journal ArticleDOI
01 Jun 2005
TL;DR: The World Wide Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small.
Abstract: The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: The algorithms used to discover a number of other instances of large-scale phrase-level replication within the two data sets collected in December 2002 and June 2004 are described.
Abstract: Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated "spam" web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such "slice and dice" generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.

Journal ArticleDOI
01 Jun 2005
TL;DR: The robust retrieval track explores methods for improving the consistency of retrieval technology by focusing on poorly performing topics, and investigates appropriate evaluation measures to support that focus.
Abstract: The robust retrieval track explores methods for improving the consistency of retrieval technology by focusing on poorly performing topics. The retrieval task in the track is a traditional ad hoc retrieval task where the evaluation methodology emphasizes a system's least effective topics. The most promising approach to improving poorly performing topics is exploiting text collections other than the target collection, such as the web. The track has also investigated appropriate evaluation measures to support the focus on ineffective topics. Traditional measures are dominated by the better-performing topics, and the first two measures used in the track that do emphasize the poorly performing topics are unstable in practice. A third measure, a variant of the traditional MAP measure that uses a geometric mean rather than an arithmetic mean to average individual topic results, shows promise of giving appropriate emphasis to poorly performing topics while being more stable at equal topic set sizes.
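
A minimal illustration of the geometric-mean variant mentioned above, with a small epsilon so that zero-AP topics do not collapse the product (the epsilon value shown is a common convention, used here for illustration):

    import math

    def gmap(ap_scores, eps=1e-5):
        return math.exp(sum(math.log(ap + eps) for ap in ap_scores) / len(ap_scores))

    aps = [0.40, 0.35, 0.02, 0.50]        # toy per-topic AP values
    print(f"MAP={sum(aps)/len(aps):.3f}  GMAP={gmap(aps):.3f}")  # GMAP penalizes the weak topic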

Proceedings ArticleDOI
15 Aug 2005
TL;DR: It is claimed that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, thus, can improve the quality of information applications that require combination of information from heterogeneous sources.
Abstract: In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, thus, can improve the quality of information applications that require combination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content based algorithms and the cutting edge SimRank algorithm.
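
A rough sketch conveying the flavor of iterative similarity reinforcement over a unified relationship matrix: similarities are repeatedly propagated through the row-normalized relationships between all objects. This is an illustration of the general idea, not the exact SimFusion update rule.

    import numpy as np

    def iterative_similarity(urm, iters=5):
        """urm: square nonnegative matrix of relationship strengths between all objects."""
        urm = np.asarray(urm, dtype=float)
        row_sums = urm.sum(axis=1, keepdims=True)
        L = np.divide(urm, row_sums, out=np.zeros_like(urm), where=row_sums > 0)
        S = np.eye(urm.shape[0])          # start: each object similar only to itself
        for _ in range(iters):
            S = L @ S @ L.T               # propagate similarity through relationships
            np.fill_diagonal(S, 1.0)      # keep self-similarity maximal
        return S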

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper defines a search query's dominant location (QDL), proposes a solution to correctly detect it, and shows that the query location detection solution has consistently high accuracy for all query frequency ranges.
Abstract: Accurately and effectively detecting the locations that search queries are truly about has huge potential impact on increasing search relevance. In this paper, we define a search query's dominant location (QDL) and propose a solution to correctly detect it. QDL is the geographical location(s) associated with a query in collective human knowledge, i.e., one or a few prominent locations agreed upon by the majority of people who know the answer to the query. QDL is a subjective and collective attribute of search queries, and we are able to detect QDLs both from queries containing geographical location names and from queries that do not contain them. The key challenges to QDL detection include false positive suppression (not every location name contained in a query refers to a geographical location) and detecting locations implied by the context of the query. In our solution, a query is recursively broken into atomic tokens according to its most popular web usage in order to reduce false positives. If we do not find a dominant location in this step, we mine the top search results and/or query logs (with different approaches discussed in this paper) to discover implicit query locations. Our large-scale experiments on recent MSN Search queries show that our query location detection solution has consistently high accuracy for all query frequency ranges.

Journal ArticleDOI
01 Jun 2005
TL;DR: This thesis investigates search and retrieval in collections of images and video, where video is defined as a sequence of still images, and focuses on retrieval from generic, heterogeneous multimedia collections.
Abstract: This thesis discusses information retrieval from multimedia archives, focusing on documents containing visual material. We investigate search and retrieval in collections of images and video, where video is defined as a sequence of still images. No assumptions are made with respect to the content of the documents; we concentrate on retrieval from generic, heterogeneous multimedia collections. In this research area a user's query typically consists of one or more example images and the implicit request is: "Find images similar to this one." In addition the query may contain a textual description of the information need. The research presented here addresses three issues within this area.

Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper studies how a retrieval system can perform active feedback, i.e., how to choose documents for relevance feedback so that the system can learn most from the feedback information.
Abstract: Information retrieval is, in general, an iterative search process, in which the user often has several interactions with a retrieval system for an information need. The retrieval system can actively probe a user with questions to clarify the information need instead of just passively responding to user queries. A basic question is thus how a retrieval system should propose questions to the user so that it can obtain maximum benefits from the feedback on these questions. In this paper, we study how a retrieval system can perform active feedback, i.e., how to choose documents for relevance feedback so that the system can learn most from the feedback information. We present a general framework for such an active feedback problem, and derive several practical algorithms as special cases. Empirical evaluation of these algorithms shows that the performance of traditional relevance feedback (presenting the top K documents) is consistently worse than that of presenting documents with more diversity. With a diversity-based selection algorithm, we obtain fewer relevant documents, however, these fewer documents have more learning benefits.
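
A minimal example of one diversity-oriented selection of feedback documents, in the spirit of the algorithms studied here: instead of presenting the top K documents, present every (gap+1)-th document from the ranking, trading some rank quality for more varied feedback examples. Names and parameters are illustrative.

    def gapped_selection(ranked_docs, k=5, gap=2):
        selected, i = [], 0
        while len(selected) < k and i < len(ranked_docs):
            selected.append(ranked_docs[i])
            i += gap + 1                  # skip `gap` documents between picks
        return selected

    print(gapped_selection(list("ABCDEFGHIJ"), k=3, gap=2))  # ['A', 'D', 'G']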

Proceedings ArticleDOI
15 Aug 2005
TL;DR: An investigation of how the use and effectiveness of IRF is affected by three factors (search task complexity, the search experience of the user, and the stage in the search) suggests that all three factors contribute to the utility of IRF.
Abstract: Implicit relevance feedback (IRF) is the process by which a search system unobtrusively gathers evidence on searcher interests from their interaction with the system. IRF is a new method of gathering information on user interest and, if IRF is to be used in operational IR systems, it is important to establish when it performs well and when it performs poorly. In this paper we investigate how the use and effectiveness of IRF is affected by three factors: search task complexity, the search experience of the user and the stage in the search. Our findings suggest that all three of these factors contribute to the utility of IRF.