
Showing papers on "Document retrieval published in 2010"


Proceedings ArticleDOI
25 Oct 2010
TL;DR: It is shown that accounting for cross-modal correlations and semantic abstraction both improves retrieval accuracy, and that the resulting cross-modal model outperforms state-of-the-art image retrieval systems on a unimodal retrieval task.
Abstract: The problem of joint modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.

1,284 citations


Journal IssueDOI
TL;DR: An introduction to information retrieval, covering the core concepts, models, and techniques used to search large document collections.

885 citations


Book
23 Jul 2010
TL;DR: Information Retrieval offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation, and is a valuable reference for professionals in computer science, computer engineering, and software engineering.
Abstract: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus, a multiuser open-source information retrieval system developed by one of the authors and available online, provides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. After an introduction to the basics of information retrieval, the text covers three major topic areas (indexing, retrieval, and evaluation) in self-contained parts. The final part of the book draws on and extends the general material in the earlier parts, treating such specific applications as parallel search engines, Web search, and XML retrieval. End-of-chapter references point to further reading; exercises range from pencil-and-paper problems to substantial programming projects. In addition to its classroom use, Information Retrieval will be a valuable reference for professionals in computer science, computer engineering, and software engineering.

523 citations


Journal ArticleDOI
TL;DR: The interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors are discussed.
Abstract: We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing for example, generalizations of the notion of an author, are also briefly discussed.
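The two-stage generative process described above can be sketched in a few lines. The topic and word distributions below are toy placeholders, not learned parameters (learning them is what the paper's Markov chain Monte Carlo procedure does):

```python
import random

def generate_document(authors, author_topic, topic_word, n_words, seed=0):
    """Sample words via the author-topic generative process: for each word,
    pick an author uniformly, then a topic from that author's topic
    distribution, then a word from that topic's word distribution."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        author = rng.choice(authors)
        topics, t_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=t_probs)[0]
        vocab, w_probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=w_probs)[0])
    return words

# Toy parameters, invented for illustration.
author_topic = {"alice": {"ir": 0.8, "ml": 0.2}, "bob": {"ir": 0.1, "ml": 0.9}}
topic_word = {"ir": {"query": 0.5, "index": 0.5},
              "ml": {"gradient": 0.6, "model": 0.4}}
doc = generate_document(["alice", "bob"], author_topic, topic_word, 10)
```

A multi-author paper's topic distribution is then the mixture of its authors' distributions, which is exactly what choosing an author uniformly per word achieves.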

329 citations


Journal ArticleDOI
TL;DR: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can be integrated into a range of bioinformatics and text-mining applications.
Abstract: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/ .
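LINNAEUS's actual matcher is a deterministic finite-state automaton over a large species dictionary; the greedy longest-match lookup below is a simplified stand-in that illustrates the dictionary-based approach, with a tiny two-entry dictionary:

```python
def find_species(text, dictionary):
    """Greedy longest-match dictionary lookup (a simplified stand-in for
    LINNAEUS's DFA-based matcher). Returns (token_index, mention, id)."""
    tokens = text.split()
    max_len = max(len(name.split()) for name in dictionary)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                hits.append((i, phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1
    return hits

# Toy dictionary mapping mentions to NCBI taxonomy identifiers.
species = {"Homo sapiens": "NCBI:9606", "E. coli": "NCBI:562"}
hits = find_species("Studies in Homo sapiens and E. coli show effects", species)
```

The real system adds heuristics on top of this matching step to resolve ambiguous mentions (e.g. abbreviated genus names) to unambiguous taxonomy identifiers.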

328 citations


Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work proposes a formal model of one specific semantic search task: ad-hoc object retrieval and shows that this task provides a solid framework to study some of the semantic search problems currently tackled by commercial Web search engines.
Abstract: Semantic Search refers to a loose set of concepts, challenges and techniques having to do with harnessing the information of the growing Web of Data (WoD) for Web search. Here we propose a formal model of one specific semantic search task: ad-hoc object retrieval. We show that this task provides a solid framework to study some of the semantic search problems currently tackled by commercial Web search engines. We connect this task to the traditional ad-hoc document retrieval and discuss appropriate evaluation metrics. Finally, we carry out a realistic evaluation of this task in the context of a Web search application.

228 citations


Proceedings ArticleDOI
Loïc Lecerf1, Boris Chidlovskii1
22 Mar 2010
TL;DR: A layout-indexing model is developed for the quick retrieval of the top-k documents most similar to a query layout, avoiding a direct evaluation of the similarity between the query and each document in the collection.

Abstract: In this paper we propose a scheme for querying large document collections by document layout. We develop a model of layout indexing of a collection adapted for the quick retrieval of the top k relevant documents. For the sake of scalability, we avoid a direct evaluation of the similarity between a query and each document in the collection; their similarity is instead approximated by the similarity between their projections onto a set of representative blocks, which are inferred from the collection at the indexing step. We also propose new functions for relevance ranking and cluster pruning that ensure scalable retrieval and ranking.
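The approximation described above, comparing projections onto representative blocks rather than comparing documents directly, can be sketched as follows. Cosine similarity stands in here for whatever layout-similarity measure the system actually uses (an assumption), and the feature vectors are invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def project(features, representative_blocks):
    """Represent a layout by its similarities to k representative blocks,
    so query/document comparison costs O(k) instead of touching every
    document in the collection."""
    return [cosine(features, block) for block in representative_blocks]

blocks = [[1.0, 0.0], [0.0, 1.0]]            # toy representative blocks
doc_proj = project([0.9, 0.1], blocks)       # computed once, at index time
query_proj = project([0.8, 0.2], blocks)     # computed at query time
approx_sim = cosine(query_proj, doc_proj)    # proxy for layout similarity
```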

162 citations


Proceedings Article
09 Oct 2010
TL;DR: This work uses discriminative training to create a projection of documents from multiple languages into a single translingual vector space and evaluates these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters.
Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

126 citations


Posted Content
TL;DR: This work shows how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries.
Abstract: Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as {\em range quantile} queries, {\em range next value} queries, and {\em range intersection} queries. We explore several applications of these queries in Information Retrieval, in particular {\em document retrieval} in hierarchical and temporal documents, and in the representation of {\em inverted lists}.
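One of the queries mentioned above, range quantile, can be illustrated with a plain-list wavelet tree. This is a readable sketch only: real implementations store bit vectors with o(n)-space rank/select support instead of the prefix-count arrays used here:

```python
class WaveletTree:
    """Wavelet tree over an integer sequence, supporting range quantile:
    the k-th smallest element in seq[l:r]."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.left = self.right = None
            return
        mid = (lo + hi) // 2
        # ranks[i] = how many of seq[:i] go to the left child (value <= mid).
        self.ranks = [0]
        for x in seq:
            self.ranks.append(self.ranks[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, l, r, k):
        """Return the k-th smallest (1-based) element of seq[l:r]."""
        if self.left is None:
            return self.lo
        in_left = self.ranks[r] - self.ranks[l]
        if k <= in_left:
            return self.left.quantile(self.ranks[l], self.ranks[r], k)
        return self.right.quantile(l - self.ranks[l], r - self.ranks[r],
                                   k - in_left)

seq = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(seq)
median_ish = wt.quantile(0, len(seq), 4)   # 4th smallest overall
```

Each level maps the query interval into a child's coordinate space via the rank counts, so a quantile query costs one rank operation per level of the tree.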

108 citations



Book ChapterDOI
06 Sep 2010
TL;DR: This paper presents two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text, significantly faster than existing methods in RAM and even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
Abstract: Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.

Journal ArticleDOI
TL;DR: A multi-year effort carried out as part of the Text Retrieval Conference to develop evaluation methods for responsive review tasks in E-discovery leads to new approaches to measuring effectiveness in both batch and interactive frameworks, large data sets, and some surprising results for the recall and precision of Boolean and statistical information retrieval methods.
Abstract: The effectiveness of information retrieval technology in electronic discovery (E-discovery) has become the subject of judicial rulings and practitioner controversy. The scale and nature of E-discovery tasks, however, have pushed traditional information retrieval evaluation approaches to their limits. This paper reviews the legal and operational context of E-discovery and the approaches to evaluating search technology that have evolved in the research community. It then describes a multi-year effort carried out as part of the Text Retrieval Conference to develop evaluation methods for responsive review tasks in E-discovery. This work has led to new approaches to measuring effectiveness in both batch and interactive frameworks, large data sets, and some surprising results for the recall and precision of Boolean and statistical information retrieval methods. The paper concludes by offering some thoughts about future research in both the legal and technical communities toward the goal of reliable, effective use of information retrieval in E-discovery.

Book ChapterDOI
11 Oct 2010
TL;DR: Improved time and space bounds are given for three important one-dimensional colored range queries -- colored range listing, colored range top-k queries and colored range counting -- and, thus, new bounds for various document retrieval problems on general collections of sequences are given.
Abstract: Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries -- colored range listing, colored range top-k queries and colored range counting -- and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
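The core trick behind colored range listing structures can be shown without any succinct machinery: with a previous-occurrence array, an element at position i in [l, r) contributes a new color exactly when its previous occurrence falls before l. Efficient structures find those positions with range-minimum queries over the array instead of the linear scan sketched here:

```python
def prev_occurrence(colors):
    """prev[i] = index of the previous occurrence of colors[i], or -1."""
    last, prev = {}, []
    for i, c in enumerate(colors):
        prev.append(last.get(c, -1))
        last[c] = i
    return prev

def distinct_colors(colors, prev, l, r):
    """List each distinct color in colors[l:r] exactly once: position i
    reports its color iff prev[i] < l (its earlier copies lie outside)."""
    return [colors[i] for i in range(l, r) if prev[i] < l]

# In document listing, "colors" are the document IDs owning each suffix.
docs = [1, 2, 1, 3, 2, 1]
prev = prev_occurrence(docs)
```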

Journal ArticleDOI
TL;DR: A privacy-preserving, similarity-based text retrieval scheme that prevents the server from accurately reconstructing the term composition of queries and documents, and anonymizes the search results from unauthorized observers is introduced.
Abstract: Users of online services are increasingly wary that their activities could disclose confidential information on their business or personal activities. It would be desirable for an online document service to perform text retrieval for users, while protecting the privacy of their activities. In this article, we introduce a privacy-preserving, similarity-based text retrieval scheme that (a) prevents the server from accurately reconstructing the term composition of queries and documents, and (b) anonymizes the search results from unauthorized observers. At the same time, our scheme preserves the relevance-ranking of the search server, and enables accounting of the number of documents that each user opens. The effectiveness of the scheme is verified empirically with two real text corpora.

Journal ArticleDOI
01 Sep 2010
TL;DR: This paper identifies the privacy risks arising from semantically related search terms within a query, and from recurring high-specificity query terms in a search session, and proposes a solution for a similarity text retrieval system to offer anonymity and plausible deniability for the query terms, and hence the user intent, without degrading the system's precision-recall performance.
Abstract: Users of text search engines are increasingly wary that their activities may disclose confidential information about their business or personal profiles. It would be desirable for a search engine to perform document retrieval for users while protecting their intent. In this paper, we identify the privacy risks arising from semantically related search terms within a query, and from recurring high-specificity query terms in a search session. To counter the risks, we propose a solution for a similarity text retrieval system to offer anonymity and plausible deniability for the query terms, and hence the user intent, without degrading the system's precision-recall performance. The solution comprises a mechanism that embellishes each user query with decoy terms that exhibit similar specificity spread as the genuine terms, but point to plausible alternative topics. We also provide an accompanying retrieval scheme that enables the search engine to compute the encrypted document relevance scores from only the genuine search terms, yet remain oblivious to their distinction from the decoys. Empirical evaluation results are presented to substantiate the effectiveness of our solution.
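A minimal sketch of the decoy idea: pick cover terms whose document frequency (used here as a rough proxy for specificity) is close to each genuine term's, so decoys are statistically hard to tell apart from the real query. The vocabulary, frequencies, and the selection rule are all invented for illustration, not taken from the paper:

```python
import random

def add_decoys(query_terms, doc_freq, per_term=2, tolerance=2, seed=0):
    """Embellish a query with decoy terms of similar document frequency,
    so an observer cannot easily separate genuine terms from decoys."""
    rng = random.Random(seed)
    decoys = []
    for term in query_terms:
        candidates = [t for t, df in doc_freq.items()
                      if t not in query_terms
                      and abs(df - doc_freq[term]) <= tolerance]
        decoys.extend(rng.sample(candidates, min(per_term, len(candidates))))
    return query_terms + decoys

# Toy document frequencies; "the" is far too common to make a plausible decoy.
doc_freq = {"oncology": 3, "cardiology": 4, "nephrology": 3,
            "merger": 2, "audit": 5, "the": 900}
embellished = add_decoys(["oncology"], doc_freq)
```

The paper's accompanying retrieval scheme then lets the server score documents using only the genuine terms while remaining oblivious to which terms those are.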

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model, and proposes new index structures based on a term-pair index, and studies new document retrieval strategies on the resulting indexes.
Abstract: There has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can greatly improve query processing efficiency. However, only a limited amount of efficient top-k processing work considers the impact of term proximity, i.e., the distance between term occurrences in a document, which has recently been integrated into a number of retrieval models to improve effectiveness. In this paper, we propose new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model. We propose new index structures based on a term-pair index, and study new document retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation on our new techniques and compare them with the existing approaches. Experimental results on large-scale data sets show that our techniques can significantly improve the efficiency of query processing.
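The term-pair idea can be sketched as a precomputed map from term pairs to the documents containing both terms, along with each document's best (minimum) within-document distance, so a proximity-aware ranking can be served without scanning the full single-term lists. Ranking by minimum distance below is a made-up illustration, not the paper's retrieval model:

```python
from itertools import combinations

def build_term_pair_index(docs):
    """Map each unordered term pair to {doc_id: minimum distance between
    occurrences of the two terms in that document}."""
    index = {}
    for doc_id, tokens in docs.items():
        positions = {}
        for pos, term in enumerate(tokens):
            positions.setdefault(term, []).append(pos)
        for t1, t2 in combinations(sorted(positions), 2):
            best = min(abs(p1 - p2)
                       for p1 in positions[t1] for p2 in positions[t2])
            index.setdefault((t1, t2), {})[doc_id] = best
    return index

docs = {"d1": ["fast", "query", "processing"],
        "d2": ["query", "logs", "and", "fast", "disks"]}
index = build_term_pair_index(docs)
pair = tuple(sorted(("fast", "query")))
# Rank documents by proximity: smaller minimum distance first.
ranked = sorted(index[pair], key=index[pair].get)
```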

Journal ArticleDOI
TL;DR: QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources and gives the user control over the ranking of the results by means of a simple weighting scheme.
Abstract: Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests. This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively. QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
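The user-controllable weighting described above can be sketched as a re-ranking step: each retrieved document is scored by summing the user-defined weights of the concept classes its matched concepts belong to. The annotations and weights below are invented:

```python
def rank_documents(doc_concepts, class_weights):
    """Order documents by the summed weights of the concept classes of
    their matched concepts (a QuExT-style re-ranking sketch)."""
    def score(doc_id):
        return sum(class_weights.get(cls, 0.0)
                   for _, cls in doc_concepts[doc_id])
    return sorted(doc_concepts, key=score, reverse=True)

# Toy annotations: each document lists (concept, concept_class) matches.
doc_concepts = {
    "pmid:1": [("BRCA1", "gene"), ("DNA repair", "pathway")],
    "pmid:2": [("BRCA1", "gene"), ("p53", "protein"), ("apoptosis", "pathway")],
}
# A user emphasizing pathway-related documents raises that class's weight.
ranking = rank_documents(doc_concepts,
                         {"gene": 1.0, "pathway": 2.0, "protein": 0.5})
```

Changing the weight dictionary alone changes the ranking, which is the control the system gives the user.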

Proceedings ArticleDOI
23 Aug 2010
TL;DR: This paper introduces a system architecture which is aiming at segmentation-free and layout-independent logo detection and recognition and can achieve improvements concerning both the recognition performance and the running time.
Abstract: The scientific significance of automatic logo detection and recognition is more and more growing because of the increasing requirements of intelligent document image analysis and retrieval. In this paper, we introduce a system architecture which is aiming at segmentation-free and layout-independent logo detection and recognition. Along with the unique logo feature design, a novel way to ensure the geometrical relationships among the features, and different optimizations in the recognition process, this system can achieve improvements concerning both the recognition performance and the running time. The experimental results on several sets of real-world documents demonstrate the effectiveness of our approach.

Patent
09 Jun 2010
TL;DR: In this paper, a method for facilitating document retrieval may comprise: assigning a first entitlement to a first user for accessing a first plurality of documents; identifying patterns in the first user's creation or modification of metadata related to the first plurality of documents; recording the identified patterns; receiving a document query from a second user who has been assigned a second entitlement to access a second plurality of documents; determining, based on the second entitlement, an access right of the second user with respect to the first plurality of documents; and modifying the document query based on that access right and the identified patterns, such that the query returns relevant documents from the first plurality of documents.
Abstract: Document management techniques to account for user-specific patterns in document metadata are disclosed. In one embodiment, a method for facilitating document retrieval may comprise: assigning a first entitlement to a first user for accessing a first plurality of documents; identifying patterns in the first user's creation or modification of metadata related to the first plurality of documents; recording the identified patterns associated with the first user; receiving a document query from a second user who has been assigned a second entitlement to access a second plurality of documents; determining, based on the second entitlement, an access right of the second user with respect to the first plurality of documents; and modifying the document query based on the access right of the second user and the identified patterns, such that the document query returns relevant documents from the first plurality of documents despite the second user's ignorance of the identified patterns.

Journal ArticleDOI
TL;DR: This paper aims at improving the learning of a ranking model in target domain by leveraging knowledge from the outdated or out-of-domain data by proposing two novel methods to conduct knowledge transfer at feature level and instance level.
Abstract: Recently, learning to rank technology is attracting increasing attention from both academia and industry in the areas of machine learning and information retrieval. A number of algorithms have been proposed to rank documents according to the user-given query using a human-labeled training dataset. A basic assumption behind general learning to rank algorithms is that the training and test data are drawn from the same data distribution. However, this assumption does not always hold true in real world applications. For example, it can be violated when the labeled training data become outdated or originally come from another domain different from its counterpart of test data. Such situations bring a new problem, which we define as cross domain learning to rank. In this paper, we aim at improving the learning of a ranking model in target domain by leveraging knowledge from the outdated or out-of-domain data (both are referred to as source domain data). We first give a formal definition of the cross domain learning to rank problem. Following this, two novel methods are proposed to conduct knowledge transfer at feature level and instance level, respectively. These two methods both utilize Ranking SVM as the basic learner. In the experiments, we evaluate these two methods using data from benchmark datasets for document retrieval. The results show that the feature-level transfer method performs better with steady improvements over baseline approaches across different datasets, while the instance-level transfer method comes out with varying performance depending on the dataset used.

Journal ArticleDOI
TL;DR: The experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.
Abstract: Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list. Based on this result, we present the algorithmic core of a full-text database that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines. A similar system is now running in practice in each core of the distributed database engine TREX of SAP.
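For context, a standard baseline with the same flavor as Lookup's expected-linear bound is to walk the smaller list and binary-search the larger one from a moving lower bound. This is the classic technique, not the paper's specific data structure:

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted lists of document IDs by iterating over the
    smaller list and binary-searching the larger one, never re-searching
    the prefix already passed."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    result, lo = [], 0
    for doc_id in small:
        lo = bisect_left(large, doc_id, lo)  # resume from last position
        if lo == len(large):
            break
        if large[lo] == doc_id:
            result.append(doc_id)
    return result
```

This costs O(|small| log |large|) in the worst case; structures like Lookup, or galloping search on favorable inputs, improve on this toward time linear in the smaller list.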

Journal ArticleDOI
TL;DR: This work proposes a novel language modeling approach, which integrates multiple document features, for expert finding and achieves better results in terms of MAP than previous language model based approaches and the best automatic runs in both the TREC2006 and TREC2007 expert search tasks, respectively.
Abstract: We argue that expert finding is sensitive to multiple document features in an organizational intranet. These document features include multiple levels of associations between experts and a query topic from sentence, paragraph, up to document levels, document authority information such as the PageRank, indegree, and URL length of documents, and internal document structures that indicate the experts’ relationship with the content of documents. Our assumption is that expert finding can largely benefit from the incorporation of these document features. However, existing language modeling approaches for expert finding have not sufficiently taken into account these document features. We propose a novel language modeling approach, which integrates multiple document features, for expert finding. Our experiments on two large scale TREC Enterprise Track datasets, i.e., the W3C and CSIRO datasets, demonstrate that the natures of the two organizational intranets and two types of expert finding tasks, i.e., key contact finding for CSIRO and knowledgeable person finding for W3C, influence the effectiveness of different document features. Our work provides insights into which document features work for certain types of expert finding tasks, and helps design expert finding strategies that are effective for different scenarios. Our main contribution is to develop an effective formal method for modeling multiple document features in expert finding, and conduct a systematic investigation of their effects. It is worth noting that our novel approach achieves better results in terms of MAP than previous language model based approaches and the best automatic runs in both the TREC2006 and TREC2007 expert search tasks, respectively.

Journal ArticleDOI
TL;DR: Experimental results show that the method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattICE-based retrieval method based on the Okapi BM25 model.
Abstract: Recent research efforts on spoken document retrieval have tried to overcome the low quality of 1-best automatic speech recognition transcripts, especially in the case of conversational speech, by using statistics derived from speech lattices containing multiple transcription hypotheses as output by a speech recognizer. We present a method for lattice-based spoken document retrieval based on a statistical n-gram modeling approach to information retrieval. In this statistical lattice-based retrieval (SLBR) method, a smoothed statistical model is estimated for each document from the expected counts of words given the information in a lattice, and the relevance of each document to a query is measured as a probability under such a model. We investigate the efficacy of our method under various parameter settings of the speech recognition and lattice processing engines, using the Fisher English Corpus of conversational telephone speech. Experimental results show that our method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattice-based retrieval method based on the Okapi BM25 model.
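The SLBR scoring can be sketched as follows: expected word counts (sums of posterior probabilities over lattice paths) replace integer counts in a smoothed unigram document model. Dirichlet smoothing and all the numbers below are illustrative assumptions, not values from the paper:

```python
import math

def query_log_likelihood(query, expected_counts, collection_prob, mu=10.0):
    """log P(query | doc) under a Dirichlet-smoothed unigram model whose
    counts are expected counts derived from a recognition lattice."""
    doc_len = sum(expected_counts.values())
    score = 0.0
    for w in query:
        p = (expected_counts.get(w, 0.0) + mu * collection_prob.get(w, 1e-6)) \
            / (doc_len + mu)
        score += math.log(p)
    return score

# Toy expected counts: "profit" appears on competing lattice paths with
# posteriors 0.7 and 0.2, so its expected count is 0.9 even though it may
# be absent from the 1-best transcript.
doc_a = {"profit": 0.9, "margin": 1.0, "call": 2.0}
doc_b = {"weather": 1.5, "call": 1.8}
p_coll = {"profit": 0.01, "margin": 0.01, "call": 0.05, "weather": 0.02}
query = ["profit", "margin"]
```

Documents are then ranked by this per-document probability, exactly as in text-based statistical retrieval but with fractional counts.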

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document to redefine term frequency - a key indicator of document topic/relevance for many retrieval models and text processing tasks.
Abstract: The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency - a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
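One minimal way to picture revision-aware term frequency is to let a term count toward tf in every revision it appears in, down-weighting older revisions; the exponential decay below is a hypothetical weighting chosen for illustration, not the RHA formula from the paper.

```python
def rha_term_frequency(revisions, decay=0.9):
    """Revision-weighted term frequency: each revision contributes its terms,
    with older revisions geometrically down-weighted (illustrative sketch)."""
    tf = {}
    n = len(revisions)
    for i, rev in enumerate(revisions):
        weight = decay ** (n - 1 - i)  # most recent revision has weight 1
        for term in rev.split():
            tf[term] = tf.get(term, 0.0) + weight
    return tf
```

A term present since the first draft accumulates weight across all revisions, while a term added in a late edit contributes only once, capturing the intuition that the authoring history carries topical signal beyond the final snapshot.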

Proceedings ArticleDOI
26 Sep 2010
TL;DR: The present paper introduces the policies, outline, and schedule of the new test collections for SDR and STD, and compares them with the NIST STD test collections.
Abstract: Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) have been two of the most intensively investigated topics in spoken document processing research since the establishment of the SDR and STD test collections by the Text REtrieval Conference (TREC) and NIST. Because Japanese spoken document processing researchers also require such test collections for SDR and STD, we have established a working group to develop these collections in the Special Interest Group on Spoken Language Processing (SIG-SLP) of the Information Processing Society of Japan. The working group has constructed and made available a test collection for SDR, and is now constructing new test collections for STD that will be open to researchers. The present paper introduces the policies, outline, and schedule of the new test collections, and then compares them with the NIST STD test collections. Index Terms: spoken term detection, test collection

Book ChapterDOI
11 Oct 2010
TL;DR: This paper introduces a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index.
Abstract: Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.
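The wavelet tree at the heart of this technique can be illustrated with a minimal pointer-based version over an integer sequence (e.g., the document identifiers in a posting list), supporting rank queries; production implementations such as the one the paper builds on use compressed bitmaps with constant-time rank rather than Python lists, and that is what makes dual orderings fit in the space of one compressed index.

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree supporting rank(symbol, i)
    over an integer sequence (illustrative sketch)."""
    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.bits = None  # leaf: all remaining symbols are equal
            return
        mid = (lo + hi) // 2
        self.bits = [1 if x > mid else 0 for x in seq]
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` in seq[:i]."""
        if self.bits is None:
            return i
        mid = (self.lo + self.hi) // 2
        ones = sum(self.bits[:i])
        if symbol <= mid:
            return self.left.rank(symbol, i - ones)
        return self.right.rank(symbol, ones)
```

Rank over document identifiers is the primitive that lets one structure answer both frequency-ordered and identifier-ordered traversals, e.g., counting how often a document appears in a prefix of a term's posting list.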

01 Jan 2010
TL;DR: A list of the Information Retrieval publications, including books and monographs, that the author produced up to 2008.
Abstract: A list of Information Retrieval Publications I have produced up to 2008.

Journal ArticleDOI
TL;DR: A system that locates words in document image archives by bypassing character recognition and using word images as queries; it applies document image processing techniques to extract powerful features for describing the word images.

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This work proposes a powerful query language for mathematical expressions that augments exact matching with approximate matching, but in a way that is controlled by the user and introduces a novel indexing scheme that scales well for large collections of expressions.
Abstract: The Web contains a large collection of documents, some with mathematical expressions. Because mathematical expressions are objects with complex structures and rather few distinct symbols, conventional text retrieval systems are not very successful in mathematics retrieval. The lack of a definition for similarity between mathematical expressions, and the inadequacy of searching for exact matches only, make the problem of mathematics retrieval even harder. As a result, the few existing mathematics retrieval systems are not very helpful in addressing users' needs. We propose a powerful query language for mathematical expressions that augments exact matching with approximate matching, but in a way that is controlled by the user. We also introduce a novel indexing scheme that scales well for large collections of expressions. Based on this indexing scheme, an efficient lookup algorithm is proposed.
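One plausible form of user-controlled approximate matching is to index expressions both verbatim and with variable names replaced by a placeholder, so the user can choose whether x^2 + y should match a^2 + b. This normalization step is a hedged sketch of the general idea, not the paper's actual query language or indexing scheme.

```python
import re

def normalize_expression(expr, wildcard_vars=True):
    """Tokenize a linear math expression; optionally replace variable names
    with a placeholder so expressions match up to renaming (sketch)."""
    tokens = re.findall(r"[A-Za-z]+|\d+|\S", expr.replace(" ", ""))
    if wildcard_vars:
        tokens = ["VAR" if t.isalpha() else t for t in tokens]
    return tokens
```

Storing both token streams lets exact matching remain available while the wildcarded stream supports the approximate mode, leaving the strictness of the match under the user's control.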

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper turns the process around: instead of indexing documents, the authors index query result sets; the resulting structure, called a reverted index, can be used to identify additional documents, or to aid the user in query formulation, selection, and feedback.
Abstract: Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its relative frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify the matching documents for a query. In this paper, we turn the process around: instead of indexing documents, we index query result sets. First, queries are run through a chosen retrieval system. For each query, the resulting document IDs are treated as terms and the score or rank of the document is used as the frequency statistic. An index of documents retrieved by basis queries is created. We call this index a reverted index. With reverted indexes, standard retrieval algorithms can retrieve the matching queries (as results) for a set of documents (used as queries). These recovered queries can then be used to identify additional documents, or to aid the user in query formulation, selection, and feedback.
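The construction described above can be sketched directly: each basis query is run through some retrieval function, and the retrieved document IDs become "terms" whose "frequency" is the retrieval score. The `retrieval_fn` signature and the additive query-scoring step are simplifying assumptions for illustration.

```python
from collections import defaultdict

def build_reverted_index(retrieval_fn, basis_queries, k=10):
    """Build a reverted index: map each retrieved document ID to the basis
    queries that retrieved it, keyed by retrieval score (sketch).
    retrieval_fn(query) is assumed to return ranked (doc_id, score) pairs."""
    reverted = defaultdict(dict)  # doc_id -> {query: score}
    for query in basis_queries:
        for doc_id, score in retrieval_fn(query)[:k]:
            reverted[doc_id][query] = score
    return reverted

def queries_for(reverted, doc_ids):
    """Retrieve matching queries (as results) for a set of documents (as a query)."""
    scores = defaultdict(float)
    for d in doc_ids:
        for query, score in reverted.get(d, {}).items():
            scores[query] += score
    return sorted(scores, key=scores.get, reverse=True)
```

Feeding a relevant document back through `queries_for` recovers the basis queries that retrieve it, which is exactly the signal the paper proposes to exploit for expansion and feedback.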