
Showing papers presented at the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) in 2004


Proceedings ArticleDOI
25 Jul 2004
TL;DR: It is shown that current evaluation measures are not robust to substantially incomplete relevance judgments, and a new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets.
Abstract: This paper examines whether the Cranfield evaluation methodology is robust to gross violations of the completeness assumption (i.e., the assumption that all relevant documents within a test collection have been identified and are present in the collection). We show that current evaluation measures are not robust to substantially incomplete relevance judgments. A new measure is introduced that is both highly correlated with existing measures when complete judgments are available and more robust to incomplete judgment sets. This finding suggests that substantially larger or dynamic test collections built using current pooling practices should be viable laboratory tools, despite the fact that the relevance information will be incomplete and imperfect.

756 citations
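This abstract matches the paper commonly associated with the bpref measure (Buckley and Voorhees, SIGIR 2004); the abstract itself does not give the formula, so the following is a hedged Python sketch of a bpref-style score computed from judged documents only. Published variants use slightly different normalizations, so treat the exact form below as an assumption.

```python
def bpref(ranking, relevant, judged_nonrelevant):
    """bpref-style score: each retrieved relevant document is penalized by the
    number of judged non-relevant documents ranked above it; unjudged documents
    are ignored, which is what makes the measure robust to incomplete judgments.
    One common normalization -- published variants differ slightly."""
    R, N = len(relevant), len(judged_nonrelevant)
    if R == 0 or N == 0:
        return 0.0
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc in judged_nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            total += 1.0 - min(nonrel_seen, R) / min(R, N)
    return total / R

# toy run: one judged non-relevant document is ranked above the second relevant one
print(bpref(["d1", "d9", "d3", "d2"], relevant={"d1", "d2"}, judged_nonrelevant={"d3"}))  # 0.5
```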


Proceedings ArticleDOI
25 Jul 2004
TL;DR: This work investigates how users interact with the results page of a WWW search engine using eye-tracking to gain insight into how users browse the presented abstracts and how they select links for further exploration.
Abstract: We investigate how users interact with the results page of a WWW search engine using eye-tracking. The goal is to gain insight into how users browse the presented abstracts and how they select links for further exploration. Such understanding is valuable for improved interface design, as well as for more accurate interpretations of implicit feedback (e.g. clickthrough) for machine learning. The following presents initial results, focusing on the amount of time spent viewing the presented abstracts, the total number of abstracts viewed, and measures of how thoroughly searchers evaluate their result set.

738 citations


Proceedings ArticleDOI
Hua-Jun Zeng1, Qi-Cai He2, Zheng Chen1, Wei-Ying Ma1, Jinwen Ma2 
25 Jul 2004
TL;DR: This paper reformalizes the clustering problem as a salient phrase ranking problem, and first extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human labeled training data.
Abstract: Organizing Web search results into clusters facilitates users' quick browsing through search results. Traditional clustering techniques are inadequate since they don't generate clusters with highly readable names. In this paper, we reformalize the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of documents (typically a list of titles and snippets) returned by a certain Web search engine, our method first extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human labeled training data. The documents are assigned to relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters. Experimental results verify our method's feasibility and effectiveness.

678 citations
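A hedged sketch of the pipeline described above: enumerate candidate phrases from titles and snippets, rank them with a learned scorer, and assign documents to the top phrases to form candidate clusters. The `score_phrase` argument and the `toy_scorer` stand in for the paper's regression model (its features and training data are not given here), and the merging of overlapping candidate clusters is omitted.

```python
import re
from collections import defaultdict

def candidate_phrases(snippet, max_len=3):
    """Enumerate contiguous word n-grams (n <= max_len) as candidate cluster names."""
    words = re.findall(r"[a-z0-9]+", snippet.lower())
    return {" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}

def cluster_results(snippets, score_phrase, k=10):
    """Rank salient phrases with a scoring function and assign each snippet to
    every top-ranked phrase it contains (candidate-cluster merging omitted)."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for p in candidate_phrases(snippet):
            phrase_docs[p].add(doc_id)
    ranked = sorted(phrase_docs, key=score_phrase, reverse=True)[:k]
    return {p: phrase_docs[p] for p in ranked}

# hypothetical scorer standing in for the regression model learned from labeled data
toy_scorer = lambda p: len(p.split())
print(cluster_results(["jaguar car dealer", "jaguar cat habitat"], toy_scorer, k=3))
```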


Proceedings ArticleDOI
25 Jul 2004
TL;DR: Web-a-Where, a system for associating geography with Web pages that locates mentions of places and determines the place each name refers to, is described, along with an implementation of the tagger within the framework of the WebFountain data mining system.
Abstract: We describe Web-a-Where, a system for associating geography with Web pages. Web-a-Where locates mentions of places and determines the place each name refers to. In addition, it assigns to each page a geographic focus: a locality that the page discusses as a whole. The tagging process is simple and fast, aimed to be applied to large collections of Web pages and to facilitate a variety of location-based applications and data analyses. Geotagging involves arbitrating two types of ambiguities: geo/non-geo and geo/geo. A geo/non-geo ambiguity occurs when a place name also has a non-geographic meaning, such as a person name (e.g., Berlin) or a common word (Turkey). Geo/geo ambiguity arises when distinct places have the same name, as in London, England vs. London, Ontario. An implementation of the tagger within the framework of the WebFountain data mining system is described, and evaluated on several corpora of real Web pages. Precision of up to 82% on individual geotags is achieved. We also evaluate the relative contribution of various heuristics the tagger employs, and evaluate the focus-finding algorithm using a corpus pretagged with localities, showing that as many as 91% of the foci reported are correct up to the country level.

603 citations
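A minimal, hedged sketch of the two ambiguity types named above: a geo/non-geo filter and a geo/geo arbitration step. The toy gazetteer, the stoplist, and the population tie-break are illustrative assumptions, not the heuristics the paper actually evaluates.

```python
# toy gazetteer: surface form -> list of (place, region, population)
GAZETTEER = {
    "london": [("London", "England", 8_900_000), ("London", "Ontario", 400_000)],
    "berlin": [("Berlin", "Germany", 3_600_000)],
    "turkey": [("Turkey", "Turkey", 84_000_000)],
}
# words whose non-geographic reading usually dominates (geo/non-geo ambiguity)
NON_GEO_STOPLIST = {"turkey"}

def geotag(tokens):
    """Resolve tokens to places: skip likely non-geo uses, and break geo/geo
    ties by population (a simple stand-in for the paper's heuristics)."""
    tags = []
    for tok in tokens:
        key = tok.lower()
        if key in NON_GEO_STOPLIST and not tok[0].isupper():
            continue                      # geo/non-geo: "turkey" the bird, not the country
        candidates = GAZETTEER.get(key, [])
        if candidates:                    # geo/geo: pick the most populous reading
            tags.append(max(candidates, key=lambda c: c[2]))
    return tags

print(geotag(["I", "flew", "from", "London", "to", "Berlin"]))
```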


Proceedings ArticleDOI
25 Jul 2004
TL;DR: It is shown that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by humans.
Abstract: Previous research on cluster-based retrieval has been inconclusive as to whether it brings improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by humans.

503 citations
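One plausible way to realize cluster-based retrieval in the language modeling framework (not necessarily either of the paper's two models) is to smooth each document model with the model of its cluster before backing off to the collection. A minimal sketch, with illustrative interpolation weights `lam_doc` and `lam_clu`:

```python
from collections import Counter

def lm_probability(word, doc, cluster, collection, lam_doc=0.6, lam_clu=0.3):
    """P(w|d) interpolated with a cluster model and the collection model.
    doc/cluster/collection are Counters over terms; the weights are illustrative."""
    p_doc = doc[word] / max(sum(doc.values()), 1)
    p_clu = cluster[word] / max(sum(cluster.values()), 1)
    p_col = collection[word] / max(sum(collection.values()), 1)
    return lam_doc * p_doc + lam_clu * p_clu + (1 - lam_doc - lam_clu) * p_col

def query_likelihood(query_terms, doc, cluster, collection):
    """Rank documents by the product of smoothed term probabilities."""
    score = 1.0
    for w in query_terms:
        score *= lm_probability(w, doc, cluster, collection)
    return score

docs = [Counter("language model retrieval".split()), Counter("image search engine".split())]
collection = sum(docs, Counter())
cluster_of = {0: docs[0] + docs[1], 1: docs[0] + docs[1]}   # toy: one cluster holds both docs
print(max(range(len(docs)),
          key=lambda i: query_likelihood(["retrieval"], docs[i], cluster_of[i], collection)))
```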


Proceedings ArticleDOI
25 Jul 2004
TL;DR: This paper shows how performance on New Event Detection (NED) can be improved by the use of text classification techniques as well as by using named entities in a new way, and explores modifications to the document representation in a vector space-based NED system.
Abstract: New Event Detection is a challenging task that still offers scope for great improvement after years of effort. In this paper we show how performance on New Event Detection (NED) can be improved by the use of text classification techniques as well as by using named entities in a new way. We explore modifications to the document representation in a vector space-based NED system. We also show that addressing named entities preferentially is useful only in certain situations. A combination of all the above results in a multi-stage NED system that performs much better than baseline single-stage NED systems.

399 citations
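For context, a hedged sketch of the kind of single-stage, vector-space NED baseline the paper improves on: a story starts a new event when its maximum similarity to all previously seen stories falls below a threshold. The tf-idf weighting, preferential named-entity handling, and multi-stage classification described in the paper are deliberately left out.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect_new_events(stories, threshold=0.2):
    """Return indices of stories judged to describe a previously unseen event."""
    seen, new_events = [], []
    for i, text in enumerate(stories):
        vec = Counter(text.lower().split())
        if not seen or max(cosine(vec, s) for s in seen) < threshold:
            new_events.append(i)
        seen.append(vec)
    return new_events

stream = ["earthquake hits city", "city earthquake aftermath", "election results announced"]
print(detect_new_events(stream))   # the first and third stories start new events
```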


Proceedings ArticleDOI
25 Jul 2004
TL;DR: It is empirically demonstrated that two of the most acclaimed CF recommendation algorithms have flaws that result in a dramatically unacceptable user experience, and a new Belief Distribution Algorithm is introduced that overcomes these flaws and provides substantially richer user modeling.
Abstract: Collaborative Filtering (CF) systems have been researched for over a decade as a tool to deal with information overload. At the heart of these systems are the algorithms which generate the predictions and recommendations. In this article we empirically demonstrate that two of the most acclaimed CF recommendation algorithms have flaws that result in a dramatically unacceptable user experience. In response, we introduce a new Belief Distribution Algorithm that overcomes these flaws and provides substantially richer user modeling. The Belief Distribution Algorithm retains the qualities of nearest-neighbor algorithms which have performed well in the past, yet produces predictions of belief distributions across rating values rather than a point rating value. In addition, we illustrate how the exclusive use of the mean absolute error metric has concealed these flaws for so long, and we propose the use of a modified Precision metric for more accurately evaluating the user experience.

360 citations
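The central idea, predicting a distribution of belief across rating values rather than a single point rating, can be illustrated with a hedged nearest-neighbor sketch in which each neighbor places similarity-weighted mass on the rating it gave. This illustrates the kind of output the paper argues for; it is not the paper's Belief Distribution Algorithm.

```python
from collections import defaultdict

def belief_distribution(neighbors, item, rating_values=(1, 2, 3, 4, 5)):
    """neighbors: list of (similarity, ratings_dict). Each neighbor who rated the
    item adds its similarity as mass on the rating it gave; the result is a
    normalized distribution over rating values rather than a point prediction."""
    mass = defaultdict(float)
    for sim, ratings in neighbors:
        if item in ratings and sim > 0:
            mass[ratings[item]] += sim
    total = sum(mass.values())
    if total == 0:
        return {r: 1.0 / len(rating_values) for r in rating_values}  # uniform fallback
    return {r: mass[r] / total for r in rating_values}

neighbors = [(0.9, {"matrix": 5}), (0.5, {"matrix": 4}), (0.2, {"matrix": 1})]
print(belief_distribution(neighbors, "matrix"))
```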


Proceedings ArticleDOI
25 Jul 2004
TL;DR: A formal study of retrieval heuristics is presented and it is found that the empirical performance of a retrieval formula is tightly related to how well it satisfies basic desirable constraints.
Abstract: Empirical studies of information retrieval methods show that good retrieval performance is closely related to the use of various retrieval heuristics, such as TF-IDF weighting. One basic research question is thus what exactly are these "necessary" heuristics that seem to cause good retrieval performance. In this paper, we present a formal study of retrieval heuristics. We formally define a set of basic desirable constraints that any reasonable retrieval function should satisfy, and check these constraints on a variety of representative retrieval functions. We find that none of these retrieval functions satisfies all the constraints unconditionally. Empirical results show that when a constraint is not satisfied, it often indicates non-optimality of the method, and when a constraint is satisfied only for a certain range of parameter values, its performance tends to be poor when the parameter is out of the range. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies these constraints. Thus the proposed constraints provide a good explanation of many empirical observations and make it possible to evaluate any existing or new retrieval formula analytically.

354 citations
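As a hedged illustration of the kind of constraint the paper formalizes (the exact constraint definitions are not reproduced here): a term-frequency constraint requires that, with everything else held fixed, matching a query term more often never lowers the score. The sketch checks this numerically for a standard BM25 term contribution.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """BM25 contribution of a single query term (standard formulation)."""
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

def satisfies_tf_monotonicity(score_fn, max_tf=50, **kwargs):
    """Check a TF-growth constraint: increasing tf with everything else fixed
    must never decrease the score."""
    scores = [score_fn(tf, **kwargs) for tf in range(1, max_tf + 1)]
    return all(s2 >= s1 for s1, s2 in zip(scores, scores[1:]))

print(satisfies_tf_monotonicity(bm25_term_score, doc_len=100, avg_doc_len=120, idf=math.log(10)))
```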


Proceedings ArticleDOI
25 Jul 2004
TL;DR: It is shown that query traffic from particular topical categories differs both from the query stream as a whole and from other categories, which is relevant to the development of enhanced query disambiguation, routing, and caching algorithms.
Abstract: We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours of the day. We examine query traffic on an hourly basis by matching it against lists of queries that have been topically pre-categorized by human editors. This represents 13% of the query traffic. We show that query traffic from particular topical categories differs both from the query stream as a whole and from other categories. This analysis provides valuable insight for improving retrieval effectiveness and efficiency. It is also relevant to the development of enhanced query disambiguation, routing, and caching algorithms.

338 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: An optimization algorithm is presented that automatically computes the weights for different items based on their ratings from training users; the resulting weighting scheme creates a clustered distribution for user vectors in the item space by bringing users of similar interests closer together and pushing users of different interests further apart.
Abstract: Collaborative filtering identifies the information interests of a particular user based on the information provided by other similar users. The memory-based approaches for collaborative filtering (e.g., the Pearson correlation coefficient approach) identify the similarity between two users by comparing their ratings on a set of items. In these approaches, different items are weighted either equally or by some predefined functions. The impact of rating discrepancies among different users has not been taken into consideration. For example, an item that is highly favored by most users should have a smaller impact on the user-similarity than an item for which different types of users tend to give different ratings. Even though simple weighting methods such as variance weighting try to address this problem, empirical studies have shown that they are ineffective in improving the performance of collaborative filtering. In this paper, we present an optimization algorithm to automatically compute the weights for different items based on their ratings from training users. More specifically, the new weighting scheme will create a clustered distribution for user vectors in the item space by bringing users of similar interests closer together and pushing users of different interests further apart. Empirical studies over two datasets have shown that our new weighting scheme substantially improves the performance of the Pearson correlation coefficient method for collaborative filtering.

313 citations
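The starting point described above, user-user similarity via a Pearson correlation in which items can carry different weights, can be sketched as follows. The uniform weights in the example stand in for the weights the paper learns by optimization.

```python
import math

def weighted_pearson(ratings_a, ratings_b, w):
    """Pearson correlation between two users over co-rated items, with a weight
    per item. ratings_a, ratings_b, and w are dicts keyed by item id."""
    common = [i for i in ratings_a if i in ratings_b and i in w]
    if len(common) < 2:
        return 0.0
    wsum = sum(w[i] for i in common)
    mean_a = sum(w[i] * ratings_a[i] for i in common) / wsum
    mean_b = sum(w[i] * ratings_b[i] for i in common) / wsum
    cov = sum(w[i] * (ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum(w[i] * (ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum(w[i] * (ratings_b[i] - mean_b) ** 2 for i in common)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0

alice = {"m1": 5, "m2": 1, "m3": 4}
bob   = {"m1": 4, "m2": 2, "m3": 5}
weights = {"m1": 1.0, "m2": 1.0, "m3": 1.0}   # uniform weights as the unlearned baseline
print(weighted_pearson(alice, bob, weights))
```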


Proceedings ArticleDOI
25 Jul 2004
TL;DR: It is argued that the main reason to prefer SVMs over language models is their ability to learn arbitrary features automatically as demonstrated by the experiments on the home-page finding task of TREC-10.
Abstract: Discriminative models have been preferred over generative models in many machine learning problems in the recent past owing to some of their attractive theoretical properties. In this paper, we explore the applicability of discriminative classifiers for IR. We have compared the performance of two popular discriminative models, namely the maximum entropy model and support vector machines with that of language modeling, the state-of-the-art generative model for IR. Our experiments on ad-hoc retrieval indicate that although maximum entropy is significantly worse than language models, support vector machines are on par with language models. We argue that the main reason to prefer SVMs over language models is their ability to learn arbitrary features automatically as demonstrated by our experiments on the home-page finding task of TREC-10.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: The linkage of a query is integrated as a hidden variable that expresses the term dependencies within the query as an acyclic, planar, undirected graph, extending the basic unigram language modeling approach by relaxing the independence assumption.
Abstract: This paper presents a new dependence language modeling approach to information retrieval. The approach extends the basic language modeling approach based on unigram by relaxing the independence assumption. We integrate the linkage of a query as a hidden variable, which expresses the term dependencies within the query as an acyclic, planar, undirected graph. We then assume that a query is generated from a document in two stages: the linkage is generated first, and then each term is generated in turn depending on other related terms according to the linkage. We also present a smoothing method for model parameter estimation and an approach to learning the linkage of a sentence in an unsupervised manner. The new approach is compared to the classical probabilistic retrieval model and the previously proposed language models with and without taking into account term dependencies. Results show that our model achieves substantial and significant improvements on TREC collections.
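One hedged way to write the two-stage generation described above, summing over linkages $L$ and approximating with the single most likely linkage (the paper's exact factorization and smoothing may differ):

```latex
P(Q \mid D) \;=\; \sum_{L} P(L \mid D)\, P(Q \mid L, D)
          \;\approx\; P(\hat{L} \mid D) \prod_{i=1}^{|Q|} P\!\left(q_i \mid q_{h(i)}, \hat{L}, D\right),
\qquad \hat{L} = \arg\max_{L} P(L \mid Q),
```

where $q_{h(i)}$ denotes the term that $q_i$ is linked to in $\hat{L}$ (the first generated term has no parent).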

Proceedings ArticleDOI
25 Jul 2004
TL;DR: The experimental results show that the proposed data clustering method and its variations perform best among 11 algorithms and their variations evaluated on both the TDT2 and Reuters-21578 corpora.
Abstract: In this paper, we propose a new data clustering method called concept factorization that models each concept as a linear combination of the data points, and each data point as a linear combination of the concepts. With this model, the data clustering task is accomplished by computing the two sets of linear coefficients, and this computation is carried out by finding the non-negative solution that minimizes the reconstruction error of the data points. The cluster label of each data point can be easily derived from the obtained linear coefficients. This method differs from clustering based on non-negative matrix factorization (NMF) [Xu03] in that it can be applied to data containing negative values and can be implemented in the kernel space. Our experimental results show that the proposed data clustering method and its variations perform best among the 11 algorithms and their variations that we have evaluated on both the TDT2 and Reuters-21578 corpora. In addition to its good performance, the new method also has the merit of easy and reliable derivation of the clustering results.
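A hedged numerical sketch of the model described above: approximate X by X W Vᵀ with non-negative W and V using NMF-style multiplicative updates derived from the squared reconstruction error. It assumes non-negative input data (so that the document kernel K = XᵀX is non-negative) and illustrates the objective rather than reproducing the paper's derivation or its kernel-space extension.

```python
import numpy as np

def concept_factorization(X, k, iters=200, eps=1e-9):
    """Approximate X (terms x docs) by X @ W @ V.T with W, V >= 0.
    Columns of X @ W are 'concepts' (linear combinations of documents);
    rows of V give each document's coefficients over the concepts."""
    n = X.shape[1]
    K = X.T @ X                      # document kernel matrix, non-negative if X is
    rng = np.random.default_rng(0)
    W = rng.random((n, k))
    V = rng.random((n, k))
    for _ in range(iters):
        W *= (K @ V) / (K @ W @ V.T @ V + eps)
        V *= (K @ W) / (V @ W.T @ K @ W + eps)
    labels = V.argmax(axis=1)        # cluster label: the dominant concept per document
    return W, V, labels

X = np.array([[2., 3., 0., 0.],      # toy term-document matrix with two obvious topics
              [1., 2., 0., 0.],
              [0., 0., 4., 3.],
              [0., 0., 2., 1.]])
_, _, labels = concept_factorization(X, k=2)
print(labels)                        # documents 0,1 and 2,3 should fall into different clusters
```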

Proceedings ArticleDOI
25 Jul 2004
TL;DR: This work utilizes WordNet to disambiguate word senses of query terms and shows that its approach yields between 23% and 31% improvements over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
Abstract: Noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. A document has a phrase if all content words in the phrase are within a window of a certain size. The window sizes for different types of phrases are different and are determined using a decision tree. Phrases are more important than individual terms. Consequently, documents in response to a query are ranked with matching phrases given a higher priority. We utilize WordNet to disambiguate word senses of query terms. Whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible additions to the query. Experimental results show that our approach yields between 23% and 31% improvements over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
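The window rule described above (a document has a phrase if all of the phrase's content words occur within a window of a given size, with the size depending on the phrase type) can be sketched directly; the window sizes below are illustrative placeholders for the ones the paper derives with a decision tree.

```python
def has_phrase(doc_tokens, phrase_words, window):
    """True if every content word of the phrase appears within some span of
    `window` consecutive tokens of the document."""
    phrase = {w.lower() for w in phrase_words}
    tokens = [t.lower() for t in doc_tokens]
    for start in range(len(tokens)):
        if phrase <= set(tokens[start:start + window]):
            return True
    return False

# illustrative window sizes per phrase type (the paper learns these with a decision tree)
WINDOW_BY_TYPE = {"proper_name": 2, "dictionary_phrase": 3, "simple_phrase": 5, "complex_phrase": 10}

doc = "the european union announced a new trade agreement yesterday".split()
print(has_phrase(doc, ["trade", "agreement"], WINDOW_BY_TYPE["simple_phrase"]))   # True
print(has_phrase(doc, ["union", "agreement"], WINDOW_BY_TYPE["proper_name"]))     # False
```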

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Results of an intensive naturalistic study of the online information-seeking behaviors of seven subjects during a fourteen-week period demonstrate no general, direct relationship between display time and usefulness, and show that display times differ significantly according to the specific task and the specific user.
Abstract: Recent research has had some success using the length of time a user displays a document in their web browser as implicit feedback for document preference. However, most studies have been confined to specific search domains, such as news, and have not considered the effects of task on display time, and the potential impact of this relationship on the effectiveness of display time as implicit feedback. We describe the results of an intensive naturalistic study of the online information-seeking behaviors of seven subjects during a fourteen-week period. Throughout the study, subjects' online information-seeking activities were monitored with various pieces of logging and evaluation software. Subjects were asked to identify the tasks with which they were working, classify the documents that they viewed according to these tasks, and evaluate the usefulness of the documents. Results of a user-centered analysis demonstrate no general, direct relationship between display time and usefulness, and show that display times differ significantly according to the specific task and the specific user.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: The GaP model projects documents and terms into a low-dimensional space of "themes," and models texts as "passages" of terms on the same theme, and gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model.
Abstract: We present a probabilistic model for a document corpus that combines many of the desirable features of previous models. The model is called "GaP" for Gamma-Poisson, the distributions of the first and last random variable. GaP is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices Λ and X. These factors have strictly non-negative terms. GaP is a generative probabilistic model that assigns finite probabilities to documents in a corpus. It can be computed with an efficient and simple EM recurrence. For a suitable choice of parameters, the GaP factorization maximizes independence between the factors. So it can be used as an independent-component algorithm adapted to document data. The form of the GaP model is empirically as well as analytically motivated. It gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. The GaP model projects documents and terms into a low-dimensional space of "themes," and models texts as "passages" of terms on the same theme.
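Writing $F$ for the document-term count matrix (a name introduced here) and keeping the abstract's $\Lambda$ and $X$ for the non-negative factors, one hedged way to write the Gamma-Poisson generative process is:

```latex
x_{kd} \sim \mathrm{Gamma}(\alpha_k, \beta_k), \qquad
f_{wd} \sim \mathrm{Poisson}\!\Big(\textstyle\sum_{k} \Lambda_{wk}\, x_{kd}\Big),
\qquad\text{so that}\qquad \mathbb{E}[F] = \Lambda X,
```

where $\Lambda_{wk}$ is the weight of term $w$ in theme $k$ and $x_{kd}$ is the intensity of theme $k$ in document $d$. The paper's exact parameterization may differ; the parameters are fit with an EM recurrence, as noted in the abstract.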

Proceedings ArticleDOI
25 Jul 2004
TL;DR: The experimental results show that such a semantic partitioning of web pages effectively deals with the problem of multiple drifting topics and mixed lengths, and thus has great potential to boost the performance of current web search engines.
Abstract: Multiple topics and varying lengths of web pages are two negative factors significantly affecting the performance of web search. In this paper, we explore the use of page segmentation algorithms to partition web pages into blocks and investigate how to take advantage of block-level evidence to improve retrieval performance in the web context. Because of the special characteristics of web pages, different page segmentation methods will have different impacts on web search performance. We compare four types of methods, including fixed-length page segmentation, DOM-based page segmentation, vision-based page segmentation, and a combined method which integrates both semantic and fixed-length properties. Experiments on block-level query expansion and retrieval are performed. Among the four approaches, the combined method achieves the best performance for web search. Our experimental results also show that such a semantic partitioning of web pages effectively deals with the problem of multiple drifting topics and mixed lengths, and thus has great potential to boost the performance of current web search engines.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Experiments show that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three explored learning algorithms.
Abstract: This paper explores feature scoring and selection based on weights from linear classification models. It investigates how these methods combine with various learning models. Our comparative analysis includes three learning algorithms: Naive Bayes, Perceptron, and Support Vector Machines (SVM) in combination with three feature weighting methods: Odds Ratio, Information Gain, and weights from linear models, the linear SVM and Perceptron. Experiments show that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three explored learning algorithms. The results support the conjecture that it is the sophistication of the feature weighting method rather than its apparent compatibility with the learning algorithm that improves classification performance.
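A minimal, hedged sketch of the winning strategy reported above, ranking features by the magnitude of a linear SVM's learned weights and training another classifier on the reduced set, written with scikit-learn rather than the authors' original setup (the dataset, feature counts, and classifier choices are illustrative).

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# small two-class text problem as a stand-in corpus
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=20000).fit_transform(data.data)
y = data.target

# 1) rank features by |weight| of a linear SVM trained on all features
svm = LinearSVC(C=1.0).fit(X, y)
top = np.argsort(np.abs(svm.coef_[0]))[::-1][:1000]   # keep the 1000 highest-weight terms

# 2) train another learner (here Naive Bayes) on the SVM-selected features
scores = cross_val_score(MultinomialNB(), X[:, top], y, cv=5)
print("accuracy with SVM-weight feature selection: %.3f" % scores.mean())
```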

Proceedings ArticleDOI
25 Jul 2004
TL;DR: This paper gives empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web- page classification algorithms and proposes a new Web summarization-based classification algorithm that achieves an approximately 8.8% improvement over pure-text based methods.
Abstract: Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement compared to a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12.9% improvement over pure-text-based methods.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Based on block-level link analysis, two new algorithms are proposed, Block Level PageRank and Block Level HITS, whose performances are studied extensively using web data.
Abstract: Link analysis has shown great potential in improving the performance of web search. PageRank and HITS are two of the most popular algorithms. Most existing link analysis algorithms treat a web page as a single node in the web graph. However, in most cases a web page contains multiple semantics, and hence should not necessarily be regarded as an atomic node. In this paper, the web page is partitioned into blocks using the vision-based page segmentation algorithm. By extracting the page-to-block and block-to-page relationships from link structure and page layout analysis, we can construct a semantic graph over the WWW such that each node exactly represents a single semantic topic. This graph can better describe the semantic structure of the web. Based on block-level link analysis, we propose two new algorithms, Block Level PageRank and Block Level HITS, whose performance we study extensively using web data.
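A hedged sketch of the graph construction described above: block-to-page edges come from the hyperlinks inside each block, page-to-block edges from containment, and composing the two yields a graph whose nodes are blocks, on which PageRank can be run. The trivial segmentation and the networkx call stand in for vision-based segmentation and the paper's actual Block Level PageRank formulation.

```python
import networkx as nx

# toy data: each page is a list of blocks, each block a list of pages it links to
PAGES = {
    "A": [["B"], ["C"]],          # page A is segmented into two blocks
    "B": [["A", "C"]],
    "C": [["A"]],
}

def block_graph(pages):
    """Compose block-to-page links with page-to-block containment to obtain a
    graph whose nodes are blocks (each ideally a single semantic topic)."""
    g = nx.DiGraph()
    for page, blocks in pages.items():
        for i, links in enumerate(blocks):
            src = f"{page}#b{i}"
            for target in links:                      # block-to-page hyperlink ...
                for j in range(len(pages[target])):   # ... distributed over the target's blocks
                    g.add_edge(src, f"{target}#b{j}")
    return g

ranks = nx.pagerank(block_graph(PAGES), alpha=0.85)   # block-level PageRank, in sketch form
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```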

Proceedings ArticleDOI
25 Jul 2004
TL;DR: A novel algorithmic framework is proposed in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents, and a suite of new algorithms are developed.
Abstract: Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: This work describes two statistical models for retrieval in large collections of handwritten manuscripts given a text query, which is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.
Abstract: Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We show experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Experiments show that the precision at 20 documents is about 0.4 to 0.5 depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: An overview of the techniques used to develop and evaluate a text categorisation system to automatically classify racist texts is presented, and three representations of a web page within an SVM are looked at.
Abstract: In this poster we present an overview of the techniques we used to develop and evaluate a text categorisation system to automatically classify racist texts. Detecting racism is difficult because the presence of indicator words is insufficient to indicate racist texts, unlike some other text classification tasks. Support Vector Machines (SVM) are used to automatically categorise web pages based on whether or not they are racist. Different interpretations of what constitutes a term are taken, and in this poster we look at three representations of a web page within an SVM -- bag-of-words, bigrams and part-of-speech tags.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing, and need fewer (non-zero) parameters to describe the data.
Abstract: We systematically investigate a new approach to estimating the parameters of language models for information retrieval, called parsimonious language models. Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing. As such, they need fewer (non-zero) parameters to describe the data. We apply parsimonious models at three stages of the retrieval process: 1) at indexing time; 2) at search time; 3) at feedback time. Experimental results show that we are able to build models that are significantly smaller than standard models, but that still perform at least as well as the standard approaches.
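The estimation step can be reconstructed, in hedged form, as an EM that re-estimates a document model against a fixed background (collection) model, concentrating probability mass on terms the background does not already explain and pruning terms whose probability falls below a threshold. The mixing weight `lam` and the pruning threshold are illustrative choices.

```python
from collections import Counter

def parsimonious_lm(doc_terms, collection_terms, lam=0.5, iters=20, prune=1e-4):
    """Estimate a parsimonious document model P(w|d) against a background model P(w|C)."""
    bg = Counter(collection_terms)
    total_bg = sum(bg.values())
    p_c = {w: c / total_bg for w, c in bg.items()}

    tf = Counter(doc_terms)
    p_d = {w: c / sum(tf.values()) for w, c in tf.items()}        # maximum-likelihood start
    for _ in range(iters):
        # E-step: expected counts of each term under the document component of the mixture
        e = {w: tf[w] * lam * p_d[w] / (lam * p_d[w] + (1 - lam) * p_c.get(w, 1e-12))
             for w in p_d}
        # M-step: renormalize, pruning terms the background already explains well enough
        e = {w: v for w, v in e.items() if v / sum(e.values()) > prune}
        total = sum(e.values())
        p_d = {w: v / total for w, v in e.items()}
    return p_d

doc = "the the the transistor radio the transistor".split()
background = ("the " * 1000 + "radio " * 5 + "transistor").split()
print(parsimonious_lm(doc, background))   # mass shifts toward 'transistor' and 'radio'
```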

Proceedings ArticleDOI
25 Jul 2004
TL;DR: A framework and a system that extracts events relevant to a query from a collection C of documents, and places such events along a timeline, based on the assumption that "important" events are widely cited in many documents for a period of time within which these events are of interest.
Abstract: In this paper, we present a framework and a system that extracts events relevant to a query from a collection C of documents and places such events along a timeline. Each event is represented by a sentence extracted from C, based on the assumption that "important" events are widely cited in many documents for a period of time within which these events are of interest. In our experiments, we used queries that are event types (e.g. "earthquake") and person names (e.g. "George Bush"). Evaluation was performed using G8 leader names as queries: a comparison made by human evaluators between manually and system-generated timelines showed that although manually generated timelines are on average preferred, system-generated timelines are sometimes judged to be better than manually constructed ones.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: A new clustering algorithm, ASI, is presented, which explicitly models the subspace structure associated with each cluster, together with a novel method to determine the number of clusters.
Abstract: Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI, which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Experimental results show that LPI provides better representation in the sense of semantic structure, and a novel algorithm called Locality Preserving Indexing (LPI) is proposed for document indexing.
Abstract: Document representation and indexing is a key problem for document analysis and processing, such as clustering, classification and retrieval. Conventionally, Latent Semantic Indexing (LSI) is considered effective in deriving such an indexing. LSI essentially detects the most representative features for document representation rather than the most discriminative features. Therefore, LSI might not be optimal in discriminating documents with different semantics. In this paper, a novel algorithm called Locality Preserving Indexing (LPI) is proposed for document indexing. Each document is represented by a vector with low dimensionality. In contrast to LSI which discovers the global structure of the document space, LPI discovers the local structure and obtains a compact document representation subspace that best detects the essential semantic structure. We compare the proposed LPI approach with LSI on two standard databases. Experimental results show that LPI provides better representation in the sense of semantic structure.

Proceedings ArticleDOI
Ying Zhang1, Phil Vines1
25 Jul 2004
TL;DR: This work uses a method that extends earlier work in this area by augmenting it with statistical analysis and corpus-based translation disambiguation to dynamically discover translations of OOV terms.
Abstract: There have been significant advances in Cross-Language Information Retrieval (CLIR) in recent years. One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out-of-vocabulary (OOV) terms. Previous work has either relied on manual intervention or has only been partially successful in solving this problem. We use a method that extends earlier work in this area by augmenting it with statistical analysis and corpus-based translation disambiguation to dynamically discover translations of OOV terms. The method can be applied to both Chinese-English and English-Chinese CLIR, correctly extracting translations of OOV terms from the Web automatically, and is thus a significant improvement on earlier work.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: The goal of this workshop (RIA) was to understand retrieval variability in IR systems, a prerequisite for choosing good approaches on a per-topic basis.
Abstract: Current statistical approaches to IR have shown themselves to be effective and reliable in both research and commercial settings. However, experimental environments such as TREC show that retrieval results vary widely according to both topic (question asked) and system [2]. This is true both for basic IR systems and for any of the more advanced implementations using, for example, query expansion. Some retrieval approaches work well on one topic but poorly on a second, while other approaches may work poorly on the first topic but succeed on the second. If it could be determined in advance which approach would work well, then a guided approach could strongly improve performance. Unfortunately, despite many efforts, no one knows how to choose good approaches on a per-topic basis [1, 3]. The major problem in understanding retrieval variability is that the variability is due to a number of factors. There are topic factors, due to the topic (question) statement itself and to the relationship of the topic to the document collection as a whole, and there are system-dependent factors, including the approach algorithm and implementation details. In general, any researcher working with only one system finds it very difficult to separate the topic variability factors from the system variability. In the summer of 2003, NIST organized a 6-week workshop as part of the ARDA NRRC Summer Workshop series. The goal of this workshop (RIA) was to understand this retrieval variability. (This research was funded by the Advanced Research and Development Activity in Information Technology (ARDA), a U.S. Government entity which sponsors and promotes research of import to the Intelligence Community, including but not limited to the CIA, DIA, NSA, NIMA and NRO.)

Proceedings ArticleDOI
25 Jul 2004
TL;DR: Observations from a unique investigation of failure analysis of Information Retrieval (IR) research engines are presented, finding that despite systems retrieving very different documents, the major cause of failure for any particular topic was almost always the same across all systems.
Abstract: Observations from a unique investigation of failure analysis of Information Retrieval (IR) research engines are presented. The Reliable Information Access (RIA) Workshop invited seven leading IR research groups to supply both their systems and their experts to an effort to analyze why their systems fail on some topics and whether the failures are due to system flaws, approach flaws, or the topic itself. There were surprising results from this cross-system failure analysis. One is that despite systems retrieving very different documents, the major cause of failure for any particular topic was almost always the same across all systems. Another is that relationships between aspects of a topic are not especially important for state-of-the-art systems; the systems are failing at a much more basic level where the top-retrieved documents are not reflecting some aspect at all.