
Showing papers on "Document retrieval" published in 2011


Book
01 Jan 2011
TL;DR: This book surveys modern information retrieval end to end, with contributed chapters covering user interfaces for search (Marti Hearst), retrieval models and evaluation, relevance feedback and query expansion, indexing and searching (Gonzalo Navarro), web retrieval and crawling, and multimedia, enterprise, and digital-library search.
Abstract: Contents Preface Acknowledgements 1 Introduction 2 User Interfaces for Search by Marti Hearst 3 Modeling 4 Retrieval Evaluation 5 Relevance Feedback and Query Expansion 6 Documents: Languages & Properties with Gonzalo Navarro and Nivio Ziviani 7 Queries: Languages & Properties with Gonzalo Navarro 8 Text Classification with Marcos Gonçalves 9 Indexing and Searching with Gonzalo Navarro 10 Parallel and Distributed IR with Eric Brown 11 Web Retrieval with Yoelle Maarek 12 Web Crawling with Carlos Castillo 13 Structured Text Retrieval with Mounia Lalmas 14 Multimedia Information Retrieval by Dulce Ponceleón and Malcolm Slaney 15 Enterprise Search by David Hawking 16 Library Systems by Edie Rasmussen 17 Digital Libraries by Marcos Gonçalves A Open Source Search Engines with Christian Middleton B Biographies Bibliography Index

400 citations


Proceedings Article
23 Jun 2011
TL;DR: A novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space, which not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.
Abstract: Traditional text similarity measures consider each term similar only to itself and do not model semantic relatedness of terms. We propose a novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space. Our approach operates by finding the optimal matrix to minimize the loss of the pre-selected similarity function (e.g., cosine) of the projected vectors, and is able to efficiently handle a large number of training examples in the high-dimensional space. Evaluated on two very different tasks, cross-lingual document retrieval and ad relevance measure, our method not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.
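To make the projection idea concrete, here is a minimal sketch (assuming PyTorch, with made-up dimensions, random stand-in data, and a squared-error loss in place of the paper's actual training objective) of learning a matrix that maps raw term vectors into a low-dimensional space so that the cosine similarity of the projected vectors matches similarity labels:

```python
# Hypothetical sketch: learn a projection matrix A so that the cosine
# similarity of projected term vectors matches given similarity labels.
# Dimensions, data, and the MSE loss are illustrative assumptions, not
# the paper's actual setup.
import torch
import torch.nn.functional as F

vocab_size, dim = 5000, 100                       # raw term space -> low-dim space
A = torch.nn.Parameter(0.01 * torch.randn(vocab_size, dim))
opt = torch.optim.Adam([A], lr=1e-3)

for step in range(1000):
    x = torch.rand(32, vocab_size)                # stand-ins for TF-IDF vectors
    y = torch.rand(32, vocab_size)
    label = torch.randint(0, 2, (32,)).float()    # 1 = similar pair, 0 = not
    sim = F.cosine_similarity(x @ A, y @ A)       # pre-selected similarity function
    loss = F.mse_loss(sim, label)                 # surrogate for the paper's loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```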

298 citations


Proceedings ArticleDOI
24 Jul 2011
TL;DR: This work proposes a new set of algorithms for early termination that outperform previous methods for disjunctive queries, and introduces a simple augmented inverted index structure called a block-max index that stores the maximum impact score for each block of a compressed inverted list in uncompressed form.
Abstract: Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.
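The core structure is easy to picture. The following toy sketch (plain Python, uncompressed postings, a made-up block size) keeps one maximum impact score per block of a postings list, so query processing can skip any block whose maximum cannot beat the current top-k threshold:

```python
# Illustrative sketch of a block-max index: alongside each block of a
# (conceptually compressed) postings list we keep the block's maximum
# impact score, so scoring can skip blocks that cannot beat the current
# top-k threshold. Block size and scores are made-up assumptions.
BLOCK = 128

class BlockMaxList:
    def __init__(self, postings):
        # postings: list of (doc_id, impact_score), sorted by doc_id
        self.blocks = [postings[i:i + BLOCK] for i in range(0, len(postings), BLOCK)]
        self.block_max = [max(score for _, score in b) for b in self.blocks]

    def candidates(self, threshold):
        # yield postings only from blocks whose max could beat the threshold
        for bmax, block in zip(self.block_max, self.blocks):
            if bmax > threshold:
                yield from block

# toy usage: one high-scoring region inside an otherwise low-impact list
plist = BlockMaxList([(d, 10.0 if 300 <= d < 320 else 1.0) for d in range(1000)])
print(len(list(plist.candidates(threshold=5.0))))   # -> 128: one block scored, not 1000
```

In the actual WAND integration the per-block maxima additionally bound the sum of scores across the query terms' lists; the sketch shows only the single-list skipping primitive.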

210 citations


Proceedings Article
12 Dec 2011
TL;DR: This work proposes two methods to regularize the learning of topic models by creating a structured prior over words that reflects broad patterns in external data, making topic models more useful across a broader range of text data.
Abstract: Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflects broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.

187 citations


Book
01 Apr 2011
TL;DR: The use of graph-based algorithms for natural language processing and information retrieval is extensively covered in this book, which brings together topics as diverse as lexical semantics, text summarization, text mining, ontology construction, text classification, and information retrieval.
Abstract: Graph theory and the fields of natural language processing and information retrieval are well-studied disciplines. Traditionally, these areas have been perceived as distinct, with different algorithms, different applications, and different potential end-users. However, recent research has shown that these disciplines are intimately connected, with a large variety of natural language processing and information retrieval applications finding efficient solutions within graph-theoretical frameworks. This book extensively covers the use of graph-based algorithms for natural language processing and information retrieval. It brings together topics as diverse as lexical semantics, text summarization, text mining, ontology construction, text classification, and information retrieval, which are connected by the common underlying theme of the use of graph-theoretical methods for text and information processing tasks. Readers will come away with a firm understanding of the major methods and applications in natural language processing and information retrieval that rely on graph-based representations and algorithms.

167 citations


Journal ArticleDOI
Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, Bing Lv
TL;DR: This article presents an effective framework named GHOST (abbreviation for GrapHical framewOrk for name diSambiguaTion) to solve the problem of distinguishing publications written by authors with identical names in digital libraries, and devises a novel similarity metric.
Abstract: Name ambiguity stems from the fact that many people or objects share identical names in the real world. Such name ambiguity decreases the performance of document retrieval, Web search, information integration, and may cause confusion in other applications. Due to the same name spellings and lack of information, it is a nontrivial task to distinguish them accurately. In this article, we focus on investigating the problem in digital libraries to distinguish publications written by authors with identical names. We present an effective framework named GHOST (abbreviation for GrapHical framewOrk for name diSambiguaTion), to solve the problem systematically. We devise a novel similarity metric, and utilize only one type of attribute (i.e., coauthorship) in GHOST. Given the similarity matrix, intermediate results are grouped into clusters with a recently introduced powerful clustering algorithm called Affinity Propagation. In addition, as a complementary technique, user feedback can be used to enhance the performance. We evaluated the framework on the real DBLP and PubMed datasets, and the experimental results show that GHOST can achieve both high precision and recall.
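For the clustering step, scikit-learn's Affinity Propagation accepts a precomputed similarity matrix directly; the sketch below (toy numbers, not GHOST's graph-based similarity metric) shows the shape of that step:

```python
# A minimal sketch of the clustering step: given a pairwise coauthorship
# similarity matrix (GHOST's actual similarity metric is graph-based and
# not reproduced here), group records with Affinity Propagation.
import numpy as np
from sklearn.cluster import AffinityPropagation

# toy similarity matrix for 4 publication records sharing one author name
S = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)   # e.g. [0, 0, 1, 1]: two distinct authors
print(labels)
```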

132 citations


01 Jan 2011
TL;DR: This is an electronic version of the paper presented at the International Workshop on Diversity in Document Retrieval, held in Dublin in 2011.
Abstract: This is an electronic version of the paper presented at the International Workshop on Diversity in Document Retrieval, held in Dublin in 2011.

128 citations


Proceedings ArticleDOI
24 Oct 2011
TL;DR: The results indicate that a query processor that combines state-of-the-art text processing techniques with a simple coarse-grained spatial structure can outperform existing approaches by up to two orders of magnitude.
Abstract: Many web search services allow users to constrain text queries to a geographic location (e.g., yoga classes near Santa Monica). Important examples include local search engines such as Google Local and location-based search services for smart phones. Several research groups have studied the efficient execution of queries mixing text and geography; their approaches usually combine inverted lists with a spatial access method such as an R-tree or space-filling curve. In this paper, we take a fresh look at this problem. We feel that previous work has often focused on the spatial aspect at the expense of performance considerations in text processing, such as inverted index access, compression, and caching. We describe new and existing approaches and discuss their different perspectives. We then compare their performance in extensive experiments on large document collections. Our results indicate that a query processor that combines state-of-the-art text processing techniques with a simple coarse-grained spatial structure can outperform existing approaches by up to two orders of magnitude. In fact, even a naive approach that first uses a simple inverted index and then filters out any documents outside the query range outperforms many previous methods.
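The surprisingly competitive naive baseline mentioned above is simple enough to sketch in a few lines (illustrative document schema and index layout; no compression or caching):

```python
# Sketch of the paper's baseline idea: answer the text query with a plain
# inverted index, then filter the matches by the query's bounding box.
from collections import defaultdict

docs = {
    1: {"text": "yoga classes downtown", "lat": 34.01, "lon": -118.49},
    2: {"text": "yoga studio uptown",    "lat": 40.78, "lon": -73.97},
}

index = defaultdict(set)
for doc_id, d in docs.items():
    for term in d["text"].split():
        index[term].add(doc_id)

def geo_text_query(terms, lat_range, lon_range):
    hits = set.intersection(*(index[t] for t in terms))   # text filter first
    return [i for i in hits                                # then spatial filter
            if lat_range[0] <= docs[i]["lat"] <= lat_range[1]
            and lon_range[0] <= docs[i]["lon"] <= lon_range[1]]

print(geo_text_query(["yoga"], (33.0, 35.0), (-119.0, -118.0)))   # -> [1]
```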

123 citations


Journal ArticleDOI
TL;DR: A deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document is described, which allows more accurate and much faster retrieval than latent semantic analysis.
Abstract: We describe a deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the generative model form an undirected associative memory and the remaining layers form a belief net with directed, top-down connections. We present efficient learning and inference procedures for this type of generative model and show that it allows more accurate and much faster retrieval than latent semantic analysis. By using our method as a filter for a much slower method called TF-IDF we achieve higher accuracy than TF-IDF alone and save several orders of magnitude in retrieval time. By using short binary codes as addresses, we can perform retrieval on very large document sets in a time that is independent of the size of the document set using only one word of memory to describe each document.
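The last property, retrieval time independent of collection size, comes from treating the binary code as a memory address. Here is a minimal sketch of that lookup (made-up 8-bit codes; the paper learns longer codes with the deep generative model):

```python
# Sketch of "binary codes as addresses": documents are filed under their
# learned binary code, and a query probes its own code plus all codes
# within Hamming distance 1, so lookup cost does not depend on the
# collection size. The codes below are made up.
from collections import defaultdict
from itertools import combinations

BITS = 8
table = defaultdict(list)                  # code -> list of doc ids

def add(doc_id, code):
    table[code].append(doc_id)

def probe(code, radius=1):
    yield code
    for flips in range(1, radius + 1):
        for bits in combinations(range(BITS), flips):
            flipped = code
            for b in bits:
                flipped ^= (1 << b)        # flip one selected bit
            yield flipped

def lookup(code, radius=1):
    out = []
    for c in probe(code, radius):
        out.extend(table[c])
    return out

add(1, 0b10110010)
add(2, 0b10110011)                         # differs from doc 1 in one bit
print(lookup(0b10110010))                  # -> [1, 2]
```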

119 citations


Journal ArticleDOI
TL;DR: The InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks and provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.
Abstract: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested. A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles. Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of correct gene and its identifier, such as contextual information to assist in disambiguation. The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT Task provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.

95 citations


Patent
Eric Chang1, Michael Gillam1, Yan Xu1, Craig F. Feied1, Jonathan A. Handler1 
06 Jun 2011
TL;DR: In this patent, one or more techniques and/or systems are disclosed whereby a user can identify key attributes of potential target documents that are desirable (e.g., have particular semantic content for the user), and relevant documents that comprise the desired semantic content can be retrieved.
Abstract: One or more techniques and/or systems are disclosed that provide for document retrieval where a user can identify key attributes of potential target documents that are desirable (e.g., have a particular semantic content for the user). Further, relevant documents that comprise the desired semantic content can be retrieved. Additionally, the user can provide feedback on the retrieved documents, for example, based on key semantic concepts found in the documents, and the input can be used to update the classification. For example, this process can be iterated to improve the retrieval and precision of documents found through machine learning techniques.
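As one concrete illustration of such a feedback loop (explicitly not the patented method, just the classic textbook formula), the Rocchio update moves the query vector toward documents the user marks relevant and away from those marked irrelevant:

```python
# Classic Rocchio relevance feedback, shown as a generic stand-in for an
# iterative feedback loop: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel).
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * query_vec
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q, 0, None)   # keep term weights non-negative

q = np.array([1.0, 0.0, 0.5])
rel = np.array([[0.9, 0.1, 0.4]])        # documents the user marked relevant
nonrel = np.array([[0.0, 0.8, 0.0]])     # documents marked not relevant
q = rocchio(q, rel, nonrel)              # iterate as new judgments arrive
```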

Journal ArticleDOI
TL;DR: To support more effective biomedical information management, Semantic MEDLINE integrates document retrieval, advanced natural language processing, automatic summarization and visualization into a single Web portal.
Abstract: To support more effective biomedical information management, Semantic MEDLINE integrates document retrieval, advanced natural language processing, automatic summarization and visualization into a single Web portal. The application is intended to help manage the results of PubMed searches by condensing core semantic content in the citations retrieved. Output is presented as a connected graph of semantic relations, with links to the original MEDLINE citations. The ability to connect salient information across documents helps users keep up with the research literature and discover connections which might otherwise go unnoticed. Semantic MEDLINE can make an impact on biomedicine by supporting scientific discovery and the timely translation of insights from basic research into advances in clinical practice and patient care. Semantic MEDLINE is illustrated here with recent research on the clock genes.

Book
30 Sep 2011
TL;DR: This book covers probabilistic retrieval, text retrieval, automatic speech recognition and speech retrieval, illustrated by a case study on retrieving scanned library cards, and discusses integrating information retrieval with database functions.
Abstract: 1. Introduction. 2. Probabilistic Retrieval. 3. Text Retrieval. 4. Automatic Speech Recognition. 5. Speech Retrieval. 6. Case Study: Retrieving Scanned Library Cards. 7. Integrating Information Retrieval and Database Functions. 8. Outlook. A: Theorems and Proofs. Bibliography. Index.

01 Jan 2011
TL;DR: This paper explains the data used in the subtasks, how the transcriptions were produced by speech recognition, and the details of each subtask of the IR for Spoken Documents Task at the NTCIR-9 Workshop.
Abstract: This paper gives an overview of the IR for Spoken Documents Task in the NTCIR-9 Workshop. In this task, a spoken term detection (STD) subtask and an ad-hoc spoken document retrieval (SDR) subtask are conducted. Both subtasks target searching for terms, passages and documents in academic and simulated lectures of the Corpus of Spontaneous Japanese. In total, seven and five teams participated in the STD subtask and the SDR subtask, respectively. This paper explains the data used in the subtasks, how the transcriptions were produced by speech recognition, and the details of each subtask.

Proceedings ArticleDOI
24 Jul 2011
TL;DR: This paper shows the first set of inverted indexes which work naturally for strings as well as phrase searching, and shows efficient top-k based retrieval under relevance metrics like frequency and tf-idf.
Abstract: Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of indexes for querying phrases or sequence of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching. The central idea is to exclude document d in the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to the optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access in terms of frequency and document id based retrieval. Compression and speed trade-offs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k based retrieval under relevance metrics like frequency and tf-idf.
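The pruning rule can be demonstrated naively (the paper, of course, achieves near-optimal space with succinct data structures rather than materializing all substrings): drop document d from the list of string P whenever every occurrence of P in d extends to the same longer string, and recover d at query time from the lists of P's extensions:

```python
# Naive, exponential-space illustration of the pruning idea on toy data.
from collections import defaultdict

def build_pruned_index(docs):
    index = defaultdict(set)
    for d, text in docs.items():
        exts = defaultdict(set)                # substring -> set of next chars
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                exts[text[i:j]].add(text[j] if j < len(text) else "$")
        for p, nxt in exts.items():
            if len(nxt) > 1 or "$" in nxt:     # not always extended the same way
                index[p].add(d)
    return index

def list_documents(index, p):
    # documents pruned from list(p) survive under some extension of p
    return set().union(*[ds for q, ds in index.items() if q.startswith(p)])

idx = build_pruned_index({1: "abab", 2: "abc"})
print(sorted(list_documents(idx, "ab")))       # -> [1, 2]
```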

Book ChapterDOI
05 May 2011
TL;DR: This paper presents the first practical proposal to compress the document array, and shows that the resulting structure is significantly smaller than the uncompressed counterpart and than alternatives to the document array proposed in the literature.
Abstract: Recent research on document retrieval for general texts has established the virtues of explicitly representing the so-called document array, which stores the document each pointer of the suffix array belongs to. While it makes document retrieval faster, this array occupies a significant amount of redundant space and is not easily compressible. In this paper we present the first practical proposal to compress the document array. We show that the resulting structure is significantly smaller than the uncompressed counterpart, and than alternatives to the document array proposed in the literature. We also compare various known algorithms for document listing and top-k retrieval, and find that the most useful combinations of algorithms run over our new compressed document arrays.
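For readers unfamiliar with the structure being compressed, here is a naive sketch of building a document array (quadratic suffix-array construction on toy data; real systems use compressed suffix arrays):

```python
# Build the (uncompressed) document array: concatenate the documents,
# build a suffix array, and store for each suffix-array position the
# document that position falls in. For illustration only.
from bisect import bisect_right

def document_array(docs):
    text, starts = "", []
    for d in docs:
        starts.append(len(text))
        text += d + "\x00"                    # separator between documents
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
    return [bisect_right(starts, pos) - 1 for pos in sa]

# document listing for a pattern then reduces to reporting the distinct
# values of the document array inside the pattern's suffix-array interval
print(document_array(["abab", "abc"]))
```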

Journal ArticleDOI
03 Jan 2011
TL;DR: This thesis investigates heuristics for obtaining word-based representations from biomedical text for robust retrieval and proposes a cross-lingual framework for monolingual biomedical IR.
Abstract: In this thesis we investigate the possibility of integrating domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast growing interest in biomedical research, reflected by an exponential growth in scientific literature. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge is, however, far from trivial.

The first research theme investigates heuristics for obtaining word-based representations from biomedical text for robust retrieval. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. Document preprocessing heuristics such as stop word removal, stemming, and breakpoint identification and normalization were shown to strongly affect retrieval performance. An effective combination of heuristics was identified to obtain a word-based representation from text for the remainder of this thesis.

The second research theme deals with concept-based retrieval. We compared a word-based to a concept-based representation and determined to what extent a manual concept-based representation can be automatically obtained from text. Retrieval based on only concepts was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve word-only retrieval.

In the third and last research theme we propose a cross-lingual framework for monolingual biomedical IR. In this framework, the integration of a concept-based representation is viewed as a cross-lingual matching problem involving a word-based and a concept-based representation language. This framework gives us the opportunity to adopt a large set of established cross-lingual information retrieval methods and techniques for this domain. Experiments with basic term-to-term translation models demonstrate that this approach can significantly improve word-based retrieval.

Directions for future work are using these concepts for communication between user and retrieval system, extending upon the translation models, and extending CLIR-enhanced concept-based retrieval outside the biomedical domain.

Available online from http://purl.utwente.nl/publications/72481.

Journal ArticleDOI
01 Sep 2011
TL;DR: The experimental results show that the proposed Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach indeed provides more accurate clustering results than prior influential clustering methods presented in recent literature.
Abstract: With the rapid growth of text documents, document clustering techniques are emerging for efficient document retrieval and better document browsing. Recently, some methods have been proposed to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Classic4, Re0, R8, and WebKB datasets. Our experimental results show that our proposed approach indeed provides more accurate clustering results than prior influential clustering methods presented in recent literature.

Proceedings ArticleDOI
23 Jan 2011
TL;DR: In this paper, a new data structure for the top-k color problem is presented, which is asymptotically optimal with respect to worst-case query time and space.
Abstract: In this paper we describe a new efficient (in fact optimal) data structure for the top-K color problem. Each element of an array A is assigned a color c with priority p(c). For a query range [a, b] and a value K, we have to report K colors with the highest priorities among all colors that occur in A[a..b], sorted in reverse order by their priorities. We show that such queries can be answered in O(K) time using an O(N log σ) bits data structure, where N is the number of elements in the array and σ is the number of colors. Thus our data structure is asymptotically optimal with respect to the worst-case query time and space. As an immediate application of our results, we obtain optimal time solutions for several document retrieval problems. The method of the paper could be also of independent interest.
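For contrast with the O(K)-time structure, the problem statement itself is captured by this trivial baseline that scans the whole query range:

```python
# Trivial O(b - a) baseline for top-K color queries, shown only to make
# the problem concrete; the paper's structure avoids scanning the range.
def topk_colors(A, priority, a, b, K):
    # distinct colors in A[a..b], highest-priority first
    seen = set(A[a:b + 1])
    return sorted(seen, key=lambda c: priority[c], reverse=True)[:K]

A = ["red", "blue", "red", "green", "blue"]
priority = {"red": 3, "green": 2, "blue": 1}
print(topk_colors(A, priority, 1, 4, 2))   # -> ['red', 'green']
```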

Journal ArticleDOI
01 Nov 2011
TL;DR: In this paper, the authors present a scalable compression method for large document collections that allows fast random access, by building a representative sample of the collection and using it as a dictionary in an LZ77-like encoding.
Abstract: Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.
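The encoding idea can be sketched with a naive longest-match scan (quadratic and with uncompressed output tokens, purely illustrative; a real implementation matches against the dictionary with efficient index structures):

```python
# Sketch of dictionary-relative LZ77-style parsing: each document becomes
# (offset, length) copies from a fixed dictionary plus literal characters.
def rlz_encode(dictionary, doc):
    out, i = [], 0
    while i < len(doc):
        best_off, best_len = 0, 0
        for off in range(len(dictionary)):           # naive longest-match scan
            l = 0
            while (off + l < len(dictionary) and i + l < len(doc)
                   and dictionary[off + l] == doc[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= 2:
            out.append((best_off, best_len))         # copy from the dictionary
            i += best_len
        else:
            out.append(doc[i])                       # literal character
            i += 1
    return out

def rlz_decode(dictionary, code):
    parts = [dictionary[t[0]:t[0] + t[1]] if isinstance(t, tuple) else t
             for t in code]
    return "".join(parts)

dic = "the quick brown fox"                          # stand-in for the sample
code = rlz_encode(dic, "the brown dog")
assert rlz_decode(dic, code) == "the brown dog"
print(code)                                          # [(0, 4), (10, 6), 'd', 'o', 'g']
```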

Proceedings ArticleDOI
29 Dec 2011
TL;DR: A novel mobile printed document retrieval system is presented that utilizes both text and low bit-rate features; it can reliably match retrieved documents to the query document while reducing the transmitted query size significantly.
Abstract: We present a novel mobile printed document retrieval system that utilizes both text and low bit-rate features. On the client phone, text is detected using an algorithm based on edge-enhanced Maximally Stable Extremal Regions. The title text image patch is rectified using a gradient-based algorithm and recognized using Optical Character Recognition. Low bit-rate image features are extracted from the query image. Both the text and the compressed features are sent to a server. On the server, the title text is used for on-line search and the features are used for image-based comparison. The proposed system is capable of web-scale document retrieval using title text without the need of constructing a document image database. Using features for image-based comparison, we can reliably match retrieved documents to the query document. Finally, by using text and low bit-rate features, we reduce the transmitted query size significantly.

Journal ArticleDOI
01 Jul 2011
TL;DR: This work creates reduced versions of full text documents that contain only important portions, and explores the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full texts.
Abstract: Motivation: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long-term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents. Results: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts. Contact: sanmitra-bhattacharya@uiowa.edu; padmini-srinivasan@uiowa.edu
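A minimal reading of the MeSH-term strategy (hypothetical scoring; the paper's actual strategies, including the profile-based variant, are more elaborate) is to rank sentences by how many assigned MeSH terms they mention:

```python
# Sketch of the selection idea: score each sentence of a full text by how
# many of the document's assigned MeSH terms it mentions, and keep the
# top-scoring sentences as the reduced version.
def mesh_summary(sentences, mesh_terms, keep=3):
    terms = [t.lower() for t in mesh_terms]
    scored = sorted(sentences,
                    key=lambda s: sum(t in s.lower() for t in terms),
                    reverse=True)
    chosen = set(scored[:keep])
    return [s for s in sentences if s in chosen]   # preserve original order

doc = ["The p53 pathway was assayed.",
       "Lunch was provided to volunteers.",
       "Apoptosis increased under p53 overexpression."]
print(mesh_summary(doc, ["p53", "apoptosis"], keep=2))
```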

Journal ArticleDOI
TL;DR: This paper proposes a parallel DBSCAN clustering algorithm based on Hadoop, a simple yet powerful parallel programming platform, and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
Abstract: Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information that accompanies the progress of technology makes clustering of very large amounts of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
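For reference, the single-machine algorithm being parallelized is available in scikit-learn; the MapReduce partitioning and merging logic of the paper is not shown here:

```python
# Plain (non-parallel) DBSCAN via scikit-learn, as a baseline reference.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.1, 1.0], [1.0, 0.9],    # dense cluster
              [8.0, 8.0], [8.1, 8.1],                 # second cluster
              [50.0, 50.0]])                          # isolated point
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 -1]; -1 marks noise
```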

Journal ArticleDOI
TL;DR: A topic-based feedback model with three different strategies for finding a good query-related topic based on the Latent Dirichlet Allocation model is proposed, which achieves statistically significant improvements over a strong feedback model in the language modeling framework.
Abstract: Pseudo-relevance feedback (PRF) via query expansion (QE) assumes that the top-ranked documents from the first-pass retrieval are relevant. The most informative terms in the pseudo-relevant feedback documents are then used to update the original query representation in order to boost the retrieval performance. Most current PRF approaches estimate the importance of the candidate expansion terms based on their statistics at the document level. However, a document used for PRF may consist of different topics, which may not all be related to the query even if the document is judged relevant. The main argument of this article is the proposal to conduct PRF at a granularity finer than the document level. In this article, we propose a topic-based feedback model with three different strategies for finding a good query-related topic based on the Latent Dirichlet Allocation model. The experimental results on four representative TREC collections show that QE based on the derived topic achieves statistically significant improvements over a strong feedback model in the language modeling framework, which updates the query representation based on the top-ranked documents. © 2011 Wiley Periodicals, Inc.
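One plausible instantiation of the idea (a sketch under assumptions, not the authors' exact estimation method): fit LDA on the pseudo-relevant documents, pick the topic closest to the query, and expand the query with that topic's top words:

```python
# Hypothetical topic-based query expansion with scikit-learn's LDA.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

feedback_docs = ["jaguar speed engine car race",
                 "jaguar habitat jungle predator cat",
                 "car engine fuel speed motor"]
query = "jaguar car"

vec = CountVectorizer()
X = vec.fit_transform(feedback_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

q_topic = lda.transform(vec.transform([query]))[0]   # query's topic mixture
best = int(np.argmax(q_topic))                        # most query-related topic
vocab = np.array(vec.get_feature_names_out())
top_words = vocab[np.argsort(lda.components_[best])[::-1][:5]]
expanded_query = query + " " + " ".join(top_words)    # QE with topic words
print(expanded_query)
```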

Book ChapterDOI
17 Oct 2011
TL;DR: Using monotone minimal perfect hash functions, this work gives new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits.
Abstract: We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimal perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA| + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies.

Book ChapterDOI
01 Jan 2011
TL;DR: This chapter examines some new types of system that use different types of user context to learn about users, to adapt their response to different users or to help us make better search decisions.
Abstract: The situations in which we search form a context: a complex set of variables describing our intentions, our personal characteristics, the data and systems available for searching, and our physical, social and organizational environments. Different contexts can mean that we want search systems to behave differently or to offer different responses. Creating search systems and search interfaces to be contextually sensitive raises many research challenges: what aspects of a searcher’s context are useful to know about, how can we model context for use by retrieval systems and how do we evaluate search systems in context? In this chapter we will look at why differences in context can affect how we want search systems to operate and ways that we can use contextual information to help search systems behave more intelligently to our changing context. We will examine some new types of system that use different types of user context to learn about users, to adapt their response to different users or to help us make better search decisions.

BookDOI
27 Jun 2011
TL;DR: Information retrieval is the science concerned with the effective and efficient retrieval of documents based on their semantic content; it is employed to fulfill an information need from a large collection of digital documents.
Abstract: Information retrieval is the science concerned with the effective and efficient retrieval of documents starting from their semantic content. It is employed to fulfill some information need from a large number of digital documents. Given the ever-growing amount of documents available and the heterogeneous data structures used for storage, information retrieval has recently faced and tackled novel applications. In this book, Melucci and Baeza-Yates present a wide-spectrum illustration of recent research results in advanced areas related to information retrieval. Readers will find chapters on e.g. aggregated search, digital advertising, digital libraries, discovery of spam and opinions, information retrieval in context, multimedia resource discovery, quantum mechanics applied to information retrieval, scalability challenges in web search engines, and interactive information retrieval evaluation. All chapters are written by well-known researchers, are completely self-contained and comprehensive, and are complemented by an integrated bibliography and subject index. With this selection, the editors provide the most up-to-date survey of topics usually not addressed in depth in traditional (text)books on information retrieval. The presentation is intended for a wide audience of people interested in information retrieval: undergraduate and graduate students, post-doctoral researchers, lecturers, and industrial researchers.

Journal ArticleDOI
TL;DR: The Earth Mover's Distance (EMD) is employed to retrieve relevant documents, which markedly shrinks the search domain; experiments corroborate that the proposed approach is accurate and computationally efficient for performing PD.

Journal ArticleDOI
TL;DR: An MPEG-like descriptor is proposed that combines conventional contour and region shape features, with wide applicability ranging from arbitrary shapes to document retrieval through word spotting; the descriptor's size is kept to a minimum without limiting its discriminating ability.

Book ChapterDOI
18 Apr 2011
TL;DR: The experiment results show that the proposed methods for leveraging forum threads to improve estimation of document language models are effective, and they outperform the existing smoothing methods for the forum post retrieval task.
Abstract: Due to many unique characteristics of forum data, forum post retrieval is different from traditional document retrieval and web search, raising interesting research questions about how to optimize the accuracy of forum post retrieval. In this paper, we study how to exploit the naturally available raw thread structures of forums to improve retrieval accuracy in the language modeling framework. Specifically, we propose and study two different schemes for smoothing the language model of a forum post based on the thread containing the post. We explore several different variants of the two schemes to exploit thread structures in different ways. We also create a human annotated test data set for forum post retrieval and evaluate the proposed smoothing methods using this data set. The experiment results show that the proposed methods for leveraging forum threads to improve estimation of document language models are effective, and they outperform the existing smoothing methods for the forum post retrieval task.
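A single illustrative variant of such thread-based smoothing (the paper proposes and compares several schemes; the interpolation weights here are arbitrary assumptions): interpolate a post's maximum-likelihood model with its thread's model and then with the collection model, Jelinek-Mercer style:

```python
# Two-level Jelinek-Mercer smoothing sketch: post model smoothed by the
# containing thread's model, then by the collection model.
from collections import Counter

def lm(tokens):
    c = Counter(tokens)
    n = sum(c.values())
    return {w: f / n for w, f in c.items()}

def smoothed_prob(w, post, thread, collection, lam=0.5, mu=0.3):
    p_post = lm(post).get(w, 0.0)
    p_thread = lm(thread).get(w, 0.0)
    p_coll = lm(collection).get(w, 1e-9)          # floor for unseen words
    return lam * p_post + (1 - lam) * (mu * p_thread + (1 - mu) * p_coll)

post = "how do i reset the router".split()
thread = "router firmware reset admin password how reset".split()
collection = "the a router network cable password reset help how do i".split()
print(smoothed_prob("password", post, thread, collection))
```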