
Showing papers on "Document retrieval published in 2010"


Proceedings ArticleDOI
25 Oct 2010
TL;DR: It is shown that accounting for cross-modal correlations and semantic abstraction both improves retrieval accuracy, and that the resulting cross-modal model outperforms state-of-the-art image retrieval systems on a unimodal retrieval task.
Abstract: The problem of joint modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.

1,284 citations


Journal IssueDOI
TL;DR: An introduction to information retrieval, covering the core concepts, models, and techniques used to search large document collections.

885 citations


Book
23 Jul 2010
TL;DR: Information Retrieval offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation, and is a valuable reference for professionals in computer science, computer engineering, and software engineering.
Abstract: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus, a multiuser open-source information retrieval system developed by one of the authors and available online, provides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. After an introduction to the basics of information retrieval, the text covers three major topic areas (indexing, retrieval, and evaluation) in self-contained parts. The final part of the book draws on and extends the general material in the earlier parts, treating such specific applications as parallel search engines, Web search, and XML retrieval. End-of-chapter references point to further reading; exercises range from pencil-and-paper problems to substantial programming projects. In addition to its classroom use, Information Retrieval will be a valuable reference for professionals in computer science, computer engineering, and software engineering.

523 citations


Journal ArticleDOI
TL;DR: The interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors are discussed.
Abstract: We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing for example, generalizations of the notion of an author, are also briefly discussed.
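The two-stage generative process described above can be sketched in a few lines. The topic and word distributions below are toy placeholders, not learned parameters (learning them is what the paper's Markov chain Monte Carlo procedure does):

```python
import random

def generate_document(authors, author_topic, topic_word, n_words, seed=0):
    """Sample words via the author-topic generative process: for each word,
    pick an author uniformly, then a topic from that author's topic
    distribution, then a word from that topic's word distribution."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        author = rng.choice(authors)
        topics, t_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=t_probs)[0]
        vocab, w_probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=w_probs)[0])
    return words

# Toy parameters, invented for illustration.
author_topic = {"alice": {"ir": 0.8, "ml": 0.2}, "bob": {"ir": 0.1, "ml": 0.9}}
topic_word = {"ir": {"query": 0.5, "index": 0.5},
              "ml": {"gradient": 0.6, "model": 0.4}}
doc = generate_document(["alice", "bob"], author_topic, topic_word, 10)
```

A multi-author paper's topic distribution is then the mixture of its authors' distributions, which is exactly what choosing an author uniformly per word achieves.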

329 citations


Journal ArticleDOI
TL;DR: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can be integrated into a range of bioinformatics and text-mining applications.
Abstract: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/ .
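LINNAEUS's actual matcher is a deterministic finite-state automaton over a large species dictionary; the greedy longest-match lookup below is a simplified stand-in that illustrates the dictionary-based approach, with a tiny two-entry dictionary:

```python
def find_species(text, dictionary):
    """Greedy longest-match dictionary lookup (a simplified stand-in for
    LINNAEUS's DFA-based matcher). Returns (token_index, mention, id)."""
    tokens = text.split()
    max_len = max(len(name.split()) for name in dictionary)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                hits.append((i, phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1
    return hits

# Toy dictionary mapping mentions to NCBI taxonomy identifiers.
species = {"Homo sapiens": "NCBI:9606", "E. coli": "NCBI:562"}
hits = find_species("Studies in Homo sapiens and E. coli show effects", species)
```

The real system adds heuristics on top of this matching step to resolve ambiguous mentions (e.g. abbreviated genus names) to unambiguous taxonomy identifiers.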

328 citations


Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work proposes a formal model of one specific semantic search task: ad-hoc object retrieval and shows that this task provides a solid framework to study some of the semantic search problems currently tackled by commercial Web search engines.
Abstract: Semantic Search refers to a loose set of concepts, challenges and techniques having to do with harnessing the information of the growing Web of Data (WoD) for Web search. Here we propose a formal model of one specific semantic search task: ad-hoc object retrieval. We show that this task provides a solid framework to study some of the semantic search problems currently tackled by commercial Web search engines. We connect this task to the traditional ad-hoc document retrieval and discuss appropriate evaluation metrics. Finally, we carry out a realistic evaluation of this task in the context of a Web search application.

228 citations


Proceedings ArticleDOI
Loïc Lecerf1, Boris Chidlovskii1
22 Mar 2010
TL;DR: A layout-indexing model is developed for the quick retrieval of the top-k documents most similar to a query layout, avoiding a direct evaluation of the similarity between the query and each document in the collection.

Abstract: In this paper we propose a scheme for querying large document collections by document layout. We develop a model of layout indexing of a collection adapted for the quick retrieval of the top k relevant documents. For the sake of scalability, we avoid a direct evaluation of the similarity between a query and each document in the collection; their similarity is instead approximated by the similarity between their projections onto a set of representative blocks, which are inferred from the collection at the indexing step. We also propose new functions for relevance ranking and cluster pruning that ensure scalable retrieval and ranking.
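The approximation described above, comparing projections onto representative blocks rather than comparing documents directly, can be sketched as follows. Cosine similarity stands in here for whatever layout-similarity measure the system actually uses (an assumption), and the feature vectors are invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def project(features, representative_blocks):
    """Represent a layout by its similarities to k representative blocks,
    so query/document comparison costs O(k) instead of touching every
    document in the collection."""
    return [cosine(features, block) for block in representative_blocks]

blocks = [[1.0, 0.0], [0.0, 1.0]]            # toy representative blocks
doc_proj = project([0.9, 0.1], blocks)       # computed once, at index time
query_proj = project([0.8, 0.2], blocks)     # computed at query time
approx_sim = cosine(query_proj, doc_proj)    # proxy for layout similarity
```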

162 citations


Proceedings Article
09 Oct 2010
TL;DR: This work uses discriminative training to create a projection of documents from multiple languages into a single translingual vector space and evaluates these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters.
Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

126 citations


Posted Content
TL;DR: This work shows how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries.
Abstract: Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as {\em range quantile} queries, {\em range next value} queries, and {\em range intersection} queries. We explore several applications of these queries in Information Retrieval, in particular {\em document retrieval} in hierarchical and temporal documents, and in the representation of {\em inverted lists}.
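One of the queries mentioned above, range quantile, can be illustrated with a plain-list wavelet tree. This is a readable sketch only: real implementations store bit vectors with o(n)-space rank/select support instead of the prefix-count arrays used here:

```python
class WaveletTree:
    """Wavelet tree over an integer sequence, supporting range quantile:
    the k-th smallest element in seq[l:r]."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.left = self.right = None
            return
        mid = (lo + hi) // 2
        # ranks[i] = how many of seq[:i] go to the left child (value <= mid).
        self.ranks = [0]
        for x in seq:
            self.ranks.append(self.ranks[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, l, r, k):
        """Return the k-th smallest (1-based) element of seq[l:r]."""
        if self.left is None:
            return self.lo
        in_left = self.ranks[r] - self.ranks[l]
        if k <= in_left:
            return self.left.quantile(self.ranks[l], self.ranks[r], k)
        return self.right.quantile(l - self.ranks[l], r - self.ranks[r],
                                   k - in_left)

seq = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(seq)
median_ish = wt.quantile(0, len(seq), 4)   # 4th smallest overall
```

Each level maps the query interval into a child's coordinate space via the rank counts, so a quantile query costs one rank operation per level of the tree.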

108 citations



Book ChapterDOI
06 Sep 2010
TL;DR: This paper presents two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text, significantly faster than existing methods in RAM and even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
Abstract: Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.

Journal ArticleDOI
TL;DR: A multi-year effort carried out as part of the Text Retrieval Conference to develop evaluation methods for responsive review tasks in E-discovery leads to new approaches to measuring effectiveness in both batch and interactive frameworks, large data sets, and some surprising results for the recall and precision of Boolean and statistical information retrieval methods.
Abstract: The effectiveness of information retrieval technology in electronic discovery (E-discovery) has become the subject of judicial rulings and practitioner controversy. The scale and nature of E-discovery tasks, however, have pushed traditional information retrieval evaluation approaches to their limits. This paper reviews the legal and operational context of E-discovery and the approaches to evaluating search technology that have evolved in the research community. It then describes a multi-year effort carried out as part of the Text Retrieval Conference to develop evaluation methods for responsive review tasks in E-discovery. This work has led to new approaches to measuring effectiveness in both batch and interactive frameworks, large data sets, and some surprising results for the recall and precision of Boolean and statistical information retrieval methods. The paper concludes by offering some thoughts about future research in both the legal and technical communities toward the goal of reliable, effective use of information retrieval in E-discovery.

Book ChapterDOI
11 Oct 2010
TL;DR: Improved time and space bounds are given for three important one-dimensional colored range queries -- colored range listing, colored range top-k queries and colored range counting -- and, thus, new bounds for various document retrieval problems on general collections of sequences are given.
Abstract: Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries -- colored range listing, colored range top-k queries and colored range counting -- and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
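The core trick behind colored range listing structures can be shown without any succinct machinery: with a previous-occurrence array, an element at position i in [l, r) contributes a new color exactly when its previous occurrence falls before l. Efficient structures find those positions with range-minimum queries over the array instead of the linear scan sketched here:

```python
def prev_occurrence(colors):
    """prev[i] = index of the previous occurrence of colors[i], or -1."""
    last, prev = {}, []
    for i, c in enumerate(colors):
        prev.append(last.get(c, -1))
        last[c] = i
    return prev

def distinct_colors(colors, prev, l, r):
    """List each distinct color in colors[l:r] exactly once: position i
    reports its color iff prev[i] < l (its earlier copies lie outside)."""
    return [colors[i] for i in range(l, r) if prev[i] < l]

# In document listing, "colors" are the document IDs owning each suffix.
docs = [1, 2, 1, 3, 2, 1]
prev = prev_occurrence(docs)
```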

Journal ArticleDOI
TL;DR: A privacy-preserving, similarity-based text retrieval scheme that prevents the server from accurately reconstructing the term composition of queries and documents, and anonymizes the search results from unauthorized observers is introduced.
Abstract: Users of online services are increasingly wary that their activities could disclose confidential information on their business or personal activities. It would be desirable for an online document service to perform text retrieval for users, while protecting the privacy of their activities. In this article, we introduce a privacy-preserving, similarity-based text retrieval scheme that (a) prevents the server from accurately reconstructing the term composition of queries and documents, and (b) anonymizes the search results from unauthorized observers. At the same time, our scheme preserves the relevance-ranking of the search server, and enables accounting of the number of documents that each user opens. The effectiveness of the scheme is verified empirically with two real text corpora.

Journal ArticleDOI
01 Sep 2010
TL;DR: This paper identifies the privacy risks arising from semantically related search terms within a query, and from recurring high-specificity query terms in a search session, and proposes a solution for a similarity text retrieval system to offer anonymity and plausible deniability for the query terms, and hence the user intent, without degrading the system's precision-recall performance.
Abstract: Users of text search engines are increasingly wary that their activities may disclose confidential information about their business or personal profiles. It would be desirable for a search engine to perform document retrieval for users while protecting their intent. In this paper, we identify the privacy risks arising from semantically related search terms within a query, and from recurring high-specificity query terms in a search session. To counter the risks, we propose a solution for a similarity text retrieval system to offer anonymity and plausible deniability for the query terms, and hence the user intent, without degrading the system's precision-recall performance. The solution comprises a mechanism that embellishes each user query with decoy terms that exhibit similar specificity spread as the genuine terms, but point to plausible alternative topics. We also provide an accompanying retrieval scheme that enables the search engine to compute the encrypted document relevance scores from only the genuine search terms, yet remain oblivious to their distinction from the decoys. Empirical evaluation results are presented to substantiate the effectiveness of our solution.
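A minimal sketch of the decoy idea: pick cover terms whose document frequency (used here as a rough proxy for specificity) is close to each genuine term's, so decoys are statistically hard to tell apart from the real query. The vocabulary, frequencies, and the selection rule are all invented for illustration, not taken from the paper:

```python
import random

def add_decoys(query_terms, doc_freq, per_term=2, tolerance=2, seed=0):
    """Embellish a query with decoy terms of similar document frequency,
    so an observer cannot easily separate genuine terms from decoys."""
    rng = random.Random(seed)
    decoys = []
    for term in query_terms:
        candidates = [t for t, df in doc_freq.items()
                      if t not in query_terms
                      and abs(df - doc_freq[term]) <= tolerance]
        decoys.extend(rng.sample(candidates, min(per_term, len(candidates))))
    return query_terms + decoys

# Toy document frequencies; "the" is far too common to make a plausible decoy.
doc_freq = {"oncology": 3, "cardiology": 4, "nephrology": 3,
            "merger": 2, "audit": 5, "the": 900}
embellished = add_decoys(["oncology"], doc_freq)
```

The paper's accompanying retrieval scheme then lets the server score documents using only the genuine terms while remaining oblivious to which terms those are.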

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model, and proposes new index structures based on a term-pair index, and studies new document retrieval strategies on the resulting indexes.
Abstract: There has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can greatly improve query processing efficiency. However, only a limited amount of efficient top-k processing work considers the impact of term proximity, i.e., the distance between term occurrences in a document, which has recently been integrated into a number of retrieval models to improve effectiveness. In this paper, we propose new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model. We propose new index structures based on a term-pair index, and study new document retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation on our new techniques and compare them with the existing approaches. Experimental results on large-scale data sets show that our techniques can significantly improve the efficiency of query processing.
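The term-pair idea can be sketched as a precomputed map from term pairs to the documents containing both terms, along with each document's best (minimum) within-document distance, so a proximity-aware ranking can be served without scanning the full single-term lists. Ranking by minimum distance below is a made-up illustration, not the paper's retrieval model:

```python
from itertools import combinations

def build_term_pair_index(docs):
    """Map each unordered term pair to {doc_id: minimum distance between
    occurrences of the two terms in that document}."""
    index = {}
    for doc_id, tokens in docs.items():
        positions = {}
        for pos, term in enumerate(tokens):
            positions.setdefault(term, []).append(pos)
        for t1, t2 in combinations(sorted(positions), 2):
            best = min(abs(p1 - p2)
                       for p1 in positions[t1] for p2 in positions[t2])
            index.setdefault((t1, t2), {})[doc_id] = best
    return index

docs = {"d1": ["fast", "query", "processing"],
        "d2": ["query", "logs", "and", "fast", "disks"]}
index = build_term_pair_index(docs)
pair = tuple(sorted(("fast", "query")))
# Rank documents by proximity: smaller minimum distance first.
ranked = sorted(index[pair], key=index[pair].get)
```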

Journal ArticleDOI
TL;DR: QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources and gives the user control over the ranking of the results by means of a simple weighting scheme.
Abstract: Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests. This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively. QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
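The user-controllable weighting described above can be sketched as a re-ranking step: each retrieved document is scored by summing the user-defined weights of the concept classes its matched concepts belong to. The annotations and weights below are invented:

```python
def rank_documents(doc_concepts, class_weights):
    """Order documents by the summed weights of the concept classes of
    their matched concepts (a QuExT-style re-ranking sketch)."""
    def score(doc_id):
        return sum(class_weights.get(cls, 0.0)
                   for _, cls in doc_concepts[doc_id])
    return sorted(doc_concepts, key=score, reverse=True)

# Toy annotations: each document lists (concept, concept_class) matches.
doc_concepts = {
    "pmid:1": [("BRCA1", "gene"), ("DNA repair", "pathway")],
    "pmid:2": [("BRCA1", "gene"), ("p53", "protein"), ("apoptosis", "pathway")],
}
# A user emphasizing pathway-related documents raises that class's weight.
ranking = rank_documents(doc_concepts,
                         {"gene": 1.0, "pathway": 2.0, "protein": 0.5})
```

Changing the weight dictionary alone changes the ranking, which is the control the system gives the user.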

Proceedings ArticleDOI
23 Aug 2010
TL;DR: This paper introduces a system architecture which is aiming at segmentation-free and layout-independent logo detection and recognition and can achieve improvements concerning both the recognition performance and the running time.
Abstract: The scientific significance of automatic logo detection and recognition is more and more growing because of the increasing requirements of intelligent document image analysis and retrieval. In this paper, we introduce a system architecture which is aiming at segmentation-free and layout-independent logo detection and recognition. Along with the unique logo feature design, a novel way to ensure the geometrical relationships among the features, and different optimizations in the recognition process, this system can achieve improvements concerning both the recognition performance and the running time. The experimental results on several sets of real-world documents demonstrate the effectiveness of our approach.

Patent
09 Jun 2010
TL;DR: In this paper, a method for facilitating document retrieval may comprise: assigning a first entitlement to a first user for accessing a first plurality of documents; identifying patterns in the first user's creation or modification of metadata related to the first plurality of documents; recording the identified patterns; receiving a document query from a second user who has been assigned a second entitlement to access a second plurality of documents; determining, based on the second entitlement, an access right of the second user with respect to the first plurality of documents; and modifying the document query based on that access right and the identified patterns, such that the query returns relevant documents from the first plurality of documents.
Abstract: Document management techniques to account for user-specific patterns in document metadata are disclosed. In one embodiment, a method for facilitating document retrieval may comprise: assigning a first entitlement to a first user for accessing a first plurality of documents; identifying patterns in the first user's creation or modification of metadata related to the first plurality of documents; recording the identified patterns associated with the first user; receiving a document query from a second user who has been assigned a second entitlement to access a second plurality of documents; determining, based on the second entitlement, an access right of the second user with respect to the first plurality of documents; and modifying the document query based on the access right of the second user and the identified patterns, such that the document query returns relevant documents from the first plurality of documents despite the second user's ignorance of the identified patterns.

Journal ArticleDOI
TL;DR: This paper aims at improving the learning of a ranking model in target domain by leveraging knowledge from the outdated or out-of-domain data by proposing two novel methods to conduct knowledge transfer at feature level and instance level.
Abstract: Recently, learning to rank technology is attracting increasing attention from both academia and industry in the areas of machine learning and information retrieval. A number of algorithms have been proposed to rank documents according to the user-given query using a human-labeled training dataset. A basic assumption behind general learning to rank algorithms is that the training and test data are drawn from the same data distribution. However, this assumption does not always hold true in real world applications. For example, it can be violated when the labeled training data become outdated or originally come from another domain different from its counterpart of test data. Such situations bring a new problem, which we define as cross domain learning to rank. In this paper, we aim at improving the learning of a ranking model in target domain by leveraging knowledge from the outdated or out-of-domain data (both are referred to as source domain data). We first give a formal definition of the cross domain learning to rank problem. Following this, two novel methods are proposed to conduct knowledge transfer at feature level and instance level, respectively. These two methods both utilize Ranking SVM as the basic learner. In the experiments, we evaluate these two methods using data from benchmark datasets for document retrieval. The results show that the feature-level transfer method performs better with steady improvements over baseline approaches across different datasets, while the instance-level transfer method comes out with varying performance depending on the dataset used.

Journal ArticleDOI
TL;DR: The experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.
Abstract: Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list. Based on this result, we present the algorithmic core of a full-text database that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines. A similar system is now running in practice in each core of the distributed database engine TREX of SAP.
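For context, a standard baseline with the same flavor as Lookup's expected-linear bound is to walk the smaller list and binary-search the larger one from a moving lower bound. This is the classic technique, not the paper's specific data structure:

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted lists of document IDs by iterating over the
    smaller list and binary-searching the larger one, never re-searching
    the prefix already passed."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    result, lo = [], 0
    for doc_id in small:
        lo = bisect_left(large, doc_id, lo)  # resume from last position
        if lo == len(large):
            break
        if large[lo] == doc_id:
            result.append(doc_id)
    return result
```

This costs O(|small| log |large|) in the worst case; structures like Lookup, or galloping search on favorable inputs, improve on this toward time linear in the smaller list.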

Journal ArticleDOI
TL;DR: This work proposes a novel language modeling approach, which integrates multiple document features, for expert finding and achieves better results in terms of MAP than previous language model based approaches and the best automatic runs in both the TREC2006 and TREC2007 expert search tasks, respectively.
Abstract: We argue that expert finding is sensitive to multiple document features in an organizational intranet. These document features include multiple levels of associations between experts and a query topic from sentence, paragraph, up to document levels, document authority information such as the PageRank, indegree, and URL length of documents, and internal document structures that indicate the experts’ relationship with the content of documents. Our assumption is that expert finding can largely benefit from the incorporation of these document features. However, existing language modeling approaches for expert finding have not sufficiently taken into account these document features. We propose a novel language modeling approach, which integrates multiple document features, for expert finding. Our experiments on two large scale TREC Enterprise Track datasets, i.e., the W3C and CSIRO datasets, demonstrate that the natures of the two organizational intranets and two types of expert finding tasks, i.e., key contact finding for CSIRO and knowledgeable person finding for W3C, influence the effectiveness of different document features. Our work provides insights into which document features work for certain types of expert finding tasks, and helps design expert finding strategies that are effective for different scenarios. Our main contribution is to develop an effective formal method for modeling multiple document features in expert finding, and conduct a systematic investigation of their effects. It is worth noting that our novel approach achieves better results in terms of MAP than previous language model based approaches and the best automatic runs in both the TREC2006 and TREC2007 expert search tasks, respectively.

Journal ArticleDOI
TL;DR: Experimental results show that the method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattICE-based retrieval method based on the Okapi BM25 model.
Abstract: Recent research efforts on spoken document retrieval have tried to overcome the low quality of 1-best automatic speech recognition transcripts, especially in the case of conversational speech, by using statistics derived from speech lattices containing multiple transcription hypotheses as output by a speech recognizer. We present a method for lattice-based spoken document retrieval based on a statistical n-gram modeling approach to information retrieval. In this statistical lattice-based retrieval (SLBR) method, a smoothed statistical model is estimated for each document from the expected counts of words given the information in a lattice, and the relevance of each document to a query is measured as a probability under such a model. We investigate the efficacy of our method under various parameter settings of the speech recognition and lattice processing engines, using the Fisher English Corpus of conversational telephone speech. Experimental results show that our method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattice-based retrieval method based on the Okapi BM25 model.
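The SLBR scoring can be sketched as follows: expected word counts (sums of posterior probabilities over lattice paths) replace integer counts in a smoothed unigram document model. Dirichlet smoothing and all the numbers below are illustrative assumptions, not values from the paper:

```python
import math

def query_log_likelihood(query, expected_counts, collection_prob, mu=10.0):
    """log P(query | doc) under a Dirichlet-smoothed unigram model whose
    counts are expected counts derived from a recognition lattice."""
    doc_len = sum(expected_counts.values())
    score = 0.0
    for w in query:
        p = (expected_counts.get(w, 0.0) + mu * collection_prob.get(w, 1e-6)) \
            / (doc_len + mu)
        score += math.log(p)
    return score

# Toy expected counts: "profit" appears on competing lattice paths with
# posteriors 0.7 and 0.2, so its expected count is 0.9 even though it may
# be absent from the 1-best transcript.
doc_a = {"profit": 0.9, "margin": 1.0, "call": 2.0}
doc_b = {"weather": 1.5, "call": 1.8}
p_coll = {"profit": 0.01, "margin": 0.01, "call": 0.05, "weather": 0.02}
query = ["profit", "margin"]
```

Documents are then ranked by this per-document probability, exactly as in text-based statistical retrieval but with fractional counts.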

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document to redefine term frequency - a key indicator of document topic/relevance for many retrieval models and text processing tasks.
Abstract: The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency - a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
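One minimal way to picture revision-aware term frequency is to let a term count toward tf in every revision it appears in, down-weighting older revisions; the exponential decay below is a hypothetical weighting chosen for illustration, not the RHA formula from the paper.

```python
def rha_term_frequency(revisions, decay=0.9):
    """Revision-weighted term frequency: each revision contributes its terms,
    with older revisions geometrically down-weighted (illustrative sketch)."""
    tf = {}
    n = len(revisions)
    for i, rev in enumerate(revisions):
        weight = decay ** (n - 1 - i)  # most recent revision has weight 1
        for term in rev.split():
            tf[term] = tf.get(term, 0.0) + weight
    return tf
```

A term present since the first draft accumulates weight across all revisions, while a term added in a late edit contributes only once, capturing the intuition that the authoring history carries topical signal beyond the final snapshot.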

Proceedings ArticleDOI
26 Sep 2010
TL;DR: The present paper introduces the policies, outline, and schedule of the new test collections for SDR and STD, and compares them with the NIST STD test collections.
Abstract: Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) have been two of the most intensively investigated topics in spoken document processing research since the establishment of the SDR and STD test collections by the Text REtrieval Conference (TREC) and NIST. Because Japanese spoken document processing researchers also require such test collections for SDR and STD, we have established a working group to develop these collections in the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan. The working group has constructed and made available a test collection for SDR, and is now constructing new test collections for STD that will be open to researchers. The present paper introduces the policies, outline, and schedule of the new test collections, and then compares them with the NIST STD test collections. Index Terms: spoken term detection, test collection

Book ChapterDOI
11 Oct 2010
TL;DR: This paper introduces a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index.
Abstract: Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.
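The wavelet tree at the heart of this technique can be illustrated with a minimal pointer-based version over an integer sequence (e.g., the document identifiers in a posting list), supporting rank queries; production implementations such as the one the paper builds on use compressed bitmaps with constant-time rank rather than Python lists, and that is what makes dual orderings fit in the space of one compressed index.

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree supporting rank(symbol, i)
    over an integer sequence (illustrative sketch)."""
    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.bits = None  # leaf: all remaining symbols are equal
            return
        mid = (lo + hi) // 2
        self.bits = [1 if x > mid else 0 for x in seq]
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` in seq[:i]."""
        if self.bits is None:
            return i
        mid = (self.lo + self.hi) // 2
        ones = sum(self.bits[:i])
        if symbol <= mid:
            return self.left.rank(symbol, i - ones)
        return self.right.rank(symbol, ones)
```

Rank over document identifiers is the primitive that lets one structure answer both frequency-ordered and identifier-ordered traversals, e.g., counting how often a document appears in a prefix of a term's posting list.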

01 Jan 2010
TL;DR: A list of the Information Retrieval publications, including books and monographs, that the author produced up to 2008.
Abstract: A list of Information Retrieval Publications I have produced up to 2008.

Journal ArticleDOI
TL;DR: A system that locates words in document image archives by bypassing character recognition and using word images as queries; it applies document image processing techniques to extract powerful features for describing the word images.

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This work proposes a powerful query language for mathematical expressions that augments exact matching with approximate matching, but in a way that is controlled by the user and introduces a novel indexing scheme that scales well for large collections of expressions.
Abstract: The Web contains a large collection of documents, some with mathematical expressions. Because mathematical expressions are objects with complex structures and rather few distinct symbols, conventional text retrieval systems are not very successful in mathematics retrieval. The lack of a definition for similarity between mathematical expressions, and the inadequacy of searching for exact matches only, make the problem of mathematics retrieval even harder. As a result, the few existing mathematics retrieval systems are not very helpful in addressing users' needs. We propose a powerful query language for mathematical expressions that augments exact matching with approximate matching, but in a way that is controlled by the user. We also introduce a novel indexing scheme that scales well for large collections of expressions. Based on this indexing scheme, an efficient lookup algorithm is proposed.
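One plausible form of user-controlled approximate matching is to index expressions both verbatim and with variable names replaced by a placeholder, so the user can choose whether x^2 + y should match a^2 + b. This normalization step is a hedged sketch of the general idea, not the paper's actual query language or indexing scheme.

```python
import re

def normalize_expression(expr, wildcard_vars=True):
    """Tokenize a linear math expression; optionally replace variable names
    with a placeholder so expressions match up to renaming (sketch)."""
    tokens = re.findall(r"[A-Za-z]+|\d+|\S", expr.replace(" ", ""))
    if wildcard_vars:
        tokens = ["VAR" if t.isalpha() else t for t in tokens]
    return tokens
```

Storing both token streams lets exact matching remain available while the wildcarded stream supports the approximate mode, leaving the strictness of the match under the user's control.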

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper turns the process around: instead of indexing documents, the authors index query result sets; the resulting structure, called a reverted index, can be used to identify additional documents, or to aid the user in query formulation, selection, and feedback.
Abstract: Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its relative frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify the matching documents for a query. In this paper, we turn the process around: instead of indexing documents, we index query result sets. First, queries are run through a chosen retrieval system. For each query, the resulting document IDs are treated as terms and the score or rank of the document is used as the frequency statistic. An index of documents retrieved by basis queries is created. We call this index a reverted index. With reverted indexes, standard retrieval algorithms can retrieve the matching queries (as results) for a set of documents (used as queries). These recovered queries can then be used to identify additional documents, or to aid the user in query formulation, selection, and feedback.
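The construction described above can be sketched directly: each basis query is run through some retrieval function, and the retrieved document IDs become "terms" whose "frequency" is the retrieval score. The `retrieval_fn` signature and the additive query-scoring step are simplifying assumptions for illustration.

```python
from collections import defaultdict

def build_reverted_index(retrieval_fn, basis_queries, k=10):
    """Build a reverted index: map each retrieved document ID to the basis
    queries that retrieved it, keyed by retrieval score (sketch).
    retrieval_fn(query) is assumed to return ranked (doc_id, score) pairs."""
    reverted = defaultdict(dict)  # doc_id -> {query: score}
    for query in basis_queries:
        for doc_id, score in retrieval_fn(query)[:k]:
            reverted[doc_id][query] = score
    return reverted

def queries_for(reverted, doc_ids):
    """Retrieve matching queries (as results) for a set of documents (as a query)."""
    scores = defaultdict(float)
    for d in doc_ids:
        for query, score in reverted.get(d, {}).items():
            scores[query] += score
    return sorted(scores, key=scores.get, reverse=True)
```

Feeding a relevant document back through `queries_for` recovers the basis queries that retrieve it, which is exactly the signal the paper proposes to exploit for expansion and feedback.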