Showing papers on "Document retrieval" published in 2012


Journal ArticleDOI
TL;DR: The development of Sentiment Analysis and Opinion Mining over recent years is reviewed, and the evolution of a relatively new research direction, namely Contradiction Analysis, is discussed.
Abstract: In recent years we have witnessed Sentiment Analysis and Opinion Mining becoming increasingly popular topics in Information Retrieval and Web data analysis. With the rapid growth of user-generated content in blogs, wikis and Web forums, such analysis has become a useful tool for mining the Web, since it allows us to capture sentiments and opinions at a large scale. Opinion retrieval has established itself as an important part of search engines. Ratings, opinion trends and representative opinions enrich the search experience of users when combined with traditional document retrieval, by revealing more insights about a subject. Opinion aggregation over product reviews can be very useful for product marketing and positioning, exposing the customers' attitude towards a product and its features along different dimensions, such as time, geographical location, and experience. Tracking how opinions or discussions evolve over time can help us identify interesting trends and patterns and better understand the ways information is propagated on the Internet. In this study, we review the development of Sentiment Analysis and Opinion Mining in recent years, and also discuss the evolution of a relatively new research direction, namely, Contradiction Analysis. We give an overview of the proposed methods and recent advances in these areas, and we try to lay out the future research directions in the field.

414 citations


Journal ArticleDOI
TL;DR: This paper surveys the state of the art in recognition and retrieval of mathematical expressions, organized around four key problems in math retrieval (query construction, normalization, indexing, and relevance feedback), and four key problems in math recognition (detecting expressions, detecting and classifying symbols, analyzing symbol layout, and constructing a representation of meaning).
Abstract: Document recognition and retrieval technologies complement one another, providing improved access to increasingly large document collections. While recognition and retrieval of textual information is fairly mature, with widespread availability of optical character recognition and text-based search engines, recognition and retrieval of graphics such as images, figures, tables, diagrams, and mathematical expressions are in comparatively early stages of research. This paper surveys the state of the art in recognition and retrieval of mathematical expressions, organized around four key problems in math retrieval (query construction, normalization, indexing, and relevance feedback), and four key problems in math recognition (detecting expressions, detecting and classifying symbols, analyzing symbol layout, and constructing a representation of meaning). Of special interest is the machine learning problem of jointly optimizing the component algorithms in a math recognition system, and developing effective indexing, retrieval and relevance feedback algorithms for math retrieval. Another important open problem is developing user interfaces that seamlessly integrate recognition and retrieval. Activity in these important research areas is increasing, in part because math notation provides an excellent domain for studying problems common to many document and graphics recognition and retrieval applications, and also because mature applications will likely provide substantial benefits for education, research, and mathematical literacy.

267 citations


Proceedings Article
13 Sep 2012
TL;DR: Variations of the BoAW method are explored and results on NIST 2011 multimedia event detection (MED) dataset are presented.
Abstract: With the popularity of online multimedia videos, there has been much interest in recent years in acoustic event detection and classification for the improvement of online video search. The audio component of a video has the potential to contribute significantly to multimedia event classification. Recent research in audio document classification has drawn parallels to text and image document retrieval by employing what is referred to as the bag-of-audio-words (BoAW) method. In contrast to supervised approaches, where audio concept detectors are trained using annotated data and the extracted labels are used as low-level features for multimedia event classification, the BoAW approach extracts audio concepts in an unsupervised fashion. Hence, this method has the advantage that it can be employed easily for a new set of audio concepts in multimedia videos without going through a laborious annotation effort. In this paper, we explore variations of the BoAW method and present results on the NIST 2011 multimedia event detection (MED) dataset.
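As a rough sketch of the unsupervised BoAW pipeline described above, assuming frame-level features such as MFCCs are already extracted (the codebook size and the use of k-means are illustrative assumptions, not necessarily the paper's exact configuration):

    import numpy as np
    from sklearn.cluster import KMeans

    def learn_codebook(frame_features, n_words=512, seed=0):
        # Cluster frame-level audio features pooled from the collection;
        # each centroid becomes one unsupervised "audio word".
        return KMeans(n_clusters=n_words, random_state=seed).fit(frame_features)

    def boaw_histogram(codebook, clip_frames):
        # Quantize each frame of a clip to its nearest audio word and
        # build a normalized bag-of-audio-words histogram.
        words = codebook.predict(clip_frames)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

The resulting histograms can then be fed to any off-the-shelf classifier for event detection; no annotation is needed to build the codebook.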

119 citations


Journal ArticleDOI
TL;DR: Techniques and tools from the fields of natural language processing, information retrieval, and content-based image retrieval allow the development of building blocks for advanced information services.
Abstract: The search for relevant and actionable information is a key to achieving clinical and research goals in biomedicine. Biomedical information exists in different forms: as text and illustrations in journal articles and other documents, in images stored in databases, and as patients’ cases in electronic health records. This paper presents ways to move beyond conventional text-based searching of these resources, by combining text and visual features in search queries and document representation. A combination of techniques and tools from the fields of natural language processing, information retrieval, and content-based image retrieval allows the development of building blocks for advanced information services. Such services enable searching by textual as well as visual queries, and retrieving documents enriched by relevant images, charts, and other illustrations from the journal literature, patient records and image databases.
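As a hedged illustration of what combining textual and visual evidence can look like, a minimal late-fusion ranker; the linear interpolation and the weight are assumptions for this sketch, not the building blocks the paper actually describes:

    def fused_score(text_score, image_score, w=0.7):
        # Late fusion: interpolate normalized text and image relevance scores.
        return w * text_score + (1.0 - w) * image_score

    def rank_documents(doc_ids, text_scores, image_scores, w=0.7):
        # *_scores map doc id -> score normalized to [0, 1].
        return sorted(doc_ids, key=lambda d: -fused_score(
            text_scores.get(d, 0.0), image_scores.get(d, 0.0), w))

    print(rank_documents(["a", "b"], {"a": 0.9, "b": 0.4}, {"a": 0.2, "b": 0.8}))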

106 citations


Journal ArticleDOI
TL;DR: In this article, the authors show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries.
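A toy, pointer-based version of the range quantile query the authors support, using plain Python lists in place of the succinct bitmaps that make the real structure space-efficient:

    class WaveletTree:
        # Wavelet tree over a list of integers; answers k-th smallest
        # in any subarray with O(log alphabet) node visits.
        def __init__(self, seq, lo=None, hi=None):
            if lo is None:
                lo, hi = min(seq), max(seq)
            self.lo, self.hi = lo, hi
            if lo == hi:
                return  # leaf: every value here equals lo
            mid = (lo + hi) // 2
            # ranks[i] = how many of seq[:i] go to the right child (> mid)
            self.ranks = [0]
            for v in seq:
                self.ranks.append(self.ranks[-1] + (v > mid))
            self.left = WaveletTree([v for v in seq if v <= mid], lo, mid)
            self.right = WaveletTree([v for v in seq if v > mid], mid + 1, hi)

        def quantile(self, l, r, k):
            # k-th smallest value (1-based) in seq[l:r].
            if self.lo == self.hi:
                return self.lo
            in_right = self.ranks[r] - self.ranks[l]
            in_left = (r - l) - in_right
            if k <= in_left:
                return self.left.quantile(l - self.ranks[l], r - self.ranks[r], k)
            return self.right.quantile(self.ranks[l], self.ranks[r], k - in_left)

    wt = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])
    print(wt.quantile(2, 7, 3))  # 3rd smallest of [4, 1, 5, 9, 2] -> 4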

100 citations


Proceedings Article
26 Jun 2012
TL;DR: This paper studies the joint problem of recommending items to a user with respect to a given query, a surprisingly common task, and proposes a factorized model that optimizes the top-ranked items returned for the given query and user.
Abstract: Retrieval tasks typically require a ranking of items given a query. Collaborative filtering tasks, on the other hand, learn to model users' preferences over items. In this paper we study the joint problem of recommending items to a user with respect to a given query, which is a surprisingly common task. This setup differs from the standard collaborative filtering one in that we are given a query × user × item tensor for training instead of the more traditional user × item matrix. Compared to document retrieval we do have a query, but we may or may not have content features (we will consider both cases) and we can also take account of the user's profile. We introduce a factorized model for this new task that optimizes the top-ranked items returned for the given query and user. We report empirical results where it outperforms several baselines.
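A minimal sketch of one plausible factorization for the query × user × item setting described above; the elementwise three-way interaction and the latent dimension are assumptions for illustration, not necessarily the paper's exact model:

    import numpy as np

    class QueryUserItemScorer:
        def __init__(self, n_queries, n_users, n_items, d=32, seed=0):
            rng = np.random.default_rng(seed)
            self.Q = rng.normal(scale=0.1, size=(n_queries, d))
            self.U = rng.normal(scale=0.1, size=(n_users, d))
            self.V = rng.normal(scale=0.1, size=(n_items, d))

        def top_k(self, q, u, k=10):
            # s(q, u, i) = <Q_q * U_u, V_i>: the elementwise product of
            # query and user factors, dotted with each item factor.
            scores = self.V @ (self.Q[q] * self.U[u])
            return np.argsort(-scores)[:k]

Training would then optimize a loss focused on the top of the ranking over observed (query, user, item) triples, as the abstract indicates.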

73 citations


Journal ArticleDOI
TL;DR: A novel method restructures the vector space model of visual words with respect to a structural ontology model in order to resolve visual synonym and polysemy problems, significantly improving classification, interpretation, and retrieval performance for athletics images.
Abstract: Images that have a different visual appearance may be semantically related using a higher level conceptualization. However, image classification and retrieval systems tend to rely only on the low-level visual structure within images. This paper presents a framework to deal with this semantic gap limitation by exploiting the well-known bag-of-visual words (BVW) to represent visual content. The novelty of this paper is threefold. First, the quality of visual words is improved by constructing visual words from representative keypoints. Second, domain-specific “non-informative visual words” are detected; these are useless for representing the content of visual data and can degrade the categorization capability. Distinct from existing frameworks, two main characteristics for non-informative visual words are defined: a high document frequency (DF) and a small statistical association with all the concepts in the collection. The third contribution in this paper is that a novel method is used to restructure the vector space model of visual words with respect to a structural ontology model in order to resolve visual synonym and polysemy problems. The experimental results show that our method can disambiguate visual word senses effectively and can significantly improve classification, interpretation, and retrieval performance for the athletics images.
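A rough sketch of the two filtering criteria defined above, high document frequency plus weak association with every concept; the particular association statistic and the quantile thresholds are stand-ins chosen for this sketch:

    import numpy as np

    def concept_association(present, labels, n_concepts):
        # For each visual word: max over concepts of
        # |P(word | concept) - P(word)|; small => weak association everywhere.
        p_word = present.mean(axis=0)
        assoc = np.zeros(present.shape[1])
        for c in range(n_concepts):
            mask = labels == c
            if mask.any():
                assoc = np.maximum(assoc, np.abs(present[mask].mean(axis=0) - p_word))
        return assoc

    def non_informative_words(counts, labels, n_concepts,
                              df_quantile=0.9, assoc_quantile=0.1):
        present = counts > 0               # (n_images, n_visual_words)
        df = present.mean(axis=0)          # document frequency per word
        assoc = concept_association(present, labels, n_concepts)
        drop = (df >= np.quantile(df, df_quantile)) & \
               (assoc <= np.quantile(assoc, assoc_quantile))
        return np.where(drop)[0]           # ids of visual words to discard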

57 citations


Proceedings ArticleDOI
03 Sep 2012
TL;DR: This work introduces an automatic query performance assessment approach for software artifact retrieval, which uses 21 measures from the field of text retrieval, and shows that the approach is able to predict the performance of queries with 79% accuracy, using very little training data.
Abstract: Text-based search and retrieval is used by developers in the context of many SE tasks, such as concept location, traceability link retrieval, reuse, impact analysis, etc. Solutions for software text search range from regular expression matching to complex techniques using text retrieval. In all cases, the results of a search depend on the query formulated by the developer. A developer needs to run a query and look at the results before realizing that the query needs reformulating. Our aim is to automatically assess the performance of a query before it is executed. We introduce an automatic query performance assessment approach for software artifact retrieval, which uses 21 measures from the field of text retrieval. We evaluate the approach in the context of concept location in source code. The evaluation shows that our approach is able to predict the performance of queries with 79% accuracy, using very little training data.
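A hedged sketch of the overall recipe: represent each query by a vector of measures (the paper uses 21; the two below, mean IDF and query length, are mere placeholders) and train a classifier on labelled past queries. The corpus statistics and labels here are made up:

    import math
    from sklearn.tree import DecisionTreeClassifier

    def query_features(terms, doc_freq, n_docs):
        idfs = [math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in terms]
        return [sum(idfs) / len(idfs), len(terms)]  # 2 placeholder measures

    doc_freq = {"parser": 12, "token": 40, "file": 950, "read": 800}
    n_docs, queries, labels = 1000, [["parser", "token"], ["file", "read"]], [1, 0]
    X = [query_features(q, doc_freq, n_docs) for q in queries]
    clf = DecisionTreeClassifier().fit(X, labels)  # 1 = query performed well
    print(clf.predict([query_features(["parser", "file"], doc_freq, n_docs)]))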

47 citations


Patent
02 Feb 2012
TL;DR: In this article, a set of word embedding transforms are applied to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document.
Abstract: A set of word embedding transforms are applied to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents. A probabilistic topic model is learned using the sets or sequences of word vectors representing the documents of the set of documents. The set of word embedding transforms are applied to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document. The learned probabilistic topic model is applied to assign probabilities for topics of the probabilistic topic model to the set or sequence of word vectors representing the input document. A document processing operation such as annotation, classification, or similar document retrieval may be performed using the assigned topic probabilities.

46 citations


05 Jul 2012
TL;DR: A novel framework is introduced, in which evaluation is done in an extrinsic, and query-dependent manner but without depending on relevance judgments, which is expected to be helpful for the task of optimizing the configuration of ASR systems for the transcription of (large) speech collections for use in Spoken Document Retrieval.
Abstract: Spoken Document Retrieval (SDR) is usually implemented by using an Information Retrieval (IR) engine on speech transcripts that are produced by an Automatic Speech Recognition (ASR) system. These transcripts generally contain a substantial amount of transcription errors (noise) and are mostly unstructured. This thesis addresses two challenges that arise when doing IR on this type of source material: i. segmentation of speech transcripts into suitable retrieval units, and ii. evaluation of the impact of transcript noise on the results of an IR task. It is shown that intrinsic evaluation results in different conclusions with regard to the quality of automatic story boundaries than when (extrinsic) Mean Average Precision (MAP) is used. This indicates that for automatic story segmentation for search applications, the traditionally used (intrinsic) segmentation cost may not be a good performance target. The best performance in an SDR context was achieved using lexical cohesion-based approaches, rather than the statistical approaches that were most popular in story segmentation benchmarks. For the evaluation of speech transcript noise in an SDR context a novel framework is introduced, in which evaluation is done in an extrinsic, and query-dependent manner but without depending on relevance judgments. This is achieved by making a direct comparison between the ranked result lists of IR tasks on a reference and an ASR-derived transcription. The resulting measures are highly correlated with MAP, making it possible to do extrinsic evaluation of ASR transcripts for ad-hoc collections, while using a similar amount of reference material as the popular intrinsic metric Word Error Rate. The proposed evaluation methods are expected to be helpful for the task of optimizing the configuration of ASR systems for the transcription of (large) speech collections for use in Spoken Document Retrieval, rather than the more traditional dictation tasks.
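The judgment-free evaluation idea reduces to comparing the ranked lists produced on reference transcripts versus ASR transcripts. A deliberately simplified stand-in for the thesis's measures, using top-k overlap:

    def rank_overlap(ref_ranking, asr_ranking, depth=10):
        # 1.0 means transcription noise left the top results unchanged.
        ref, asr = set(ref_ranking[:depth]), set(asr_ranking[:depth])
        return len(ref & asr) / depth

    def collection_score(queries, search_ref, search_asr, depth=10):
        # Average the per-query agreement; no relevance judgments needed.
        scores = [rank_overlap(search_ref(q), search_asr(q), depth)
                  for q in queries]
        return sum(scores) / len(scores)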

Proceedings ArticleDOI
30 Mar 2012
TL;DR: A practical privacy-preserving ranked keyword search scheme based on PIR that allows multi-keyword queries with ranking capability and outperforms the most efficient proposals in literature in terms of time complexity by several orders of magnitude.
Abstract: Information search and document retrieval from a remote database (e.g. cloud server) requires submitting the search terms to the database holder. However, the search terms may contain sensitive information that must be kept secret from the database holder. Moreover, the privacy concerns apply to the relevant documents retrieved by the user in the later stage since they may also contain sensitive data and reveal information about sensitive search terms. A related protocol, Private Information Retrieval (PIR), provides useful cryptographic tools to hide the queried search terms and the data retrieved from the database while returning most relevant documents to the user. In this paper, we propose a practical privacy-preserving ranked keyword search scheme based on PIR that allows multi-keyword queries with ranking capability. The proposed scheme increases the security of the keyword search scheme while still satisfying efficient computation and communication requirements. To the best of our knowledge, the majority of previous works are not efficient for the assumed scenario, where documents are large files. Our scheme outperforms the most efficient proposals in the literature in terms of time complexity by several orders of magnitude.

Proceedings Article
12 Jul 2012
TL;DR: Evaluations on a real world data set show that the lexicon models, integrated into a ranker-based QE system, not only significantly improve the document retrieval performance but also outperform two state-of-the-art log-based QE methods.
Abstract: This paper explores log-based query expansion (QE) models for Web search. Three lexicon models are proposed to bridge the lexical gap between Web documents and user queries. These models are trained on pairs of user queries and titles of clicked documents. Evaluations on a real world data set show that the lexicon models, integrated into a ranker-based QE system, not only significantly improve the document retrieval performance but also outperform two state-of-the-art log-based QE methods.
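A simplified sketch of learning such a lexicon from click logs: estimate P(title term | query term) from (query, clicked title) pairs by maximum likelihood, then expand new queries with the highest-scoring title terms. The co-occurrence estimate is an assumed stand-in for the paper's three lexicon models:

    from collections import Counter, defaultdict

    def train_lexicon(click_pairs):
        # click_pairs: iterable of (query_terms, clicked_title_terms).
        cooc = defaultdict(Counter)
        for q_terms, t_terms in click_pairs:
            for q in set(q_terms):
                cooc[q].update(set(t_terms))
        return {q: {t: c / sum(cnt.values()) for t, c in cnt.items()}
                for q, cnt in cooc.items()}

    def expand(query_terms, lexicon, k=3):
        scores = Counter()
        for q in query_terms:
            for t, p in lexicon.get(q, {}).items():
                if t not in query_terms:
                    scores[t] += p
        return list(query_terms) + [t for t, _ in scores.most_common(k)]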

Book ChapterDOI
19 Sep 2012
TL;DR: This paper proposes the concept of selective document retrieval (SDR) from an encrypted database, which allows a client to store encrypted data on a third-party server and perform efficient search remotely.
Abstract: We propose the concept of selective document retrieval (SDR) from an encrypted database which allows a client to store encrypted data on a third-party server and perform efficient search remotely. We propose a new SDR scheme based on the recent advances in fully homomorphic encryption schemes. The proposed scheme is secure in our security model and can be adapted to support many useful search features, including aggregating search results, supporting conjunctive keyword search queries, advanced keyword search, search with keyword occurrence frequency, and search based on inner product. To evaluate the performance, we implement the search algorithm of our scheme in C. The experiment results show that a search query takes only 47 seconds in an encrypted database with 1000 documents on a Linux server, and it demonstrates that our scheme is much more efficient, i.e., around 1250 times faster, than a solution based on the SSW scheme with similar security guarantees.

Proceedings ArticleDOI
02 Jun 2012
TL;DR: A novel pre-retrieval metric is proposed, which reflects the quality of a query by measuring the specificity of its terms, and is a good effort predictor for text retrieval-based concept location, outperforming existing techniques from the field of natural language document retrieval.
Abstract: Text retrieval approaches have been used to address many software engineering tasks. In most cases, their use involves issuing a textual query to retrieve a set of relevant software artifacts from the system. The performance of all these approaches depends on the quality of the given query (i.e., its ability to describe the information need in such a way that the relevant software artifacts are retrieved during the search). Currently, the only way to tell that a query failed to lead to the expected software artifacts is by investing time and effort in analyzing the search results. In addition, it is often very difficult to ascertain what part of the query leads to poor results. We propose a novel pre-retrieval metric, which reflects the quality of a query by measuring the specificity of its terms. We exemplify the use of the new specificity metric on the task of concept location in source code. A preliminary empirical study shows that our metric is a good effort predictor for text retrieval-based concept location, outperforming existing techniques from the field of natural language document retrieval.
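A minimal sketch of a specificity-style pre-retrieval metric in the spirit of the one proposed; averaging IDF over the query terms is an assumed concrete choice, not necessarily the authors' exact definition:

    import math

    def query_specificity(terms, doc_freq, n_docs):
        # Higher mean IDF = rarer, more specific terms, which the paper
        # links to better concept-location queries.
        idf = lambda t: math.log(n_docs / (1 + doc_freq.get(t, 0)))
        return sum(idf(t) for t in terms) / len(terms)

    doc_freq = {"cursor": 3, "handler": 40, "data": 900}  # hypothetical corpus
    print(query_specificity(["cursor", "handler"], doc_freq, 1000))  # specific
    print(query_specificity(["data"], doc_freq, 1000))               # generic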

Proceedings ArticleDOI
12 Aug 2012
TL;DR: This paper presents a hybrid algorithmic framework for in-memory bag-of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and the compressed suffix tree data structures, and describes new capabilities provided by the algorithms that can be leveraged by future systems to improve effectiveness and efficiency.
Abstract: For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in memory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for in-memory bag-of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and the compressed suffix tree data structures, and evaluate the various algorithmic trade-offs for performing efficient queries entirely in-memory. We compare our approach with two classic approaches to bag-of-words queries using inverted indexes, term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show that our framework is competitive with state-of-the-art indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.
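For reference, a toy version of the term-at-a-time (TAAT) inverted-index baseline the authors compare against; score accumulators are updated one posting list at a time. This is the classic algorithm, not the paper's self-index framework:

    from collections import defaultdict

    def taat_topk(query_terms, postings, k=10):
        # postings: term -> list of (doc_id, weight), e.g. tf-idf weights.
        acc = defaultdict(float)
        for term in query_terms:              # one posting list at a time
            for doc_id, w in postings.get(term, []):
                acc[doc_id] += w
        return sorted(acc.items(), key=lambda kv: -kv[1])[:k]

    postings = {"ranked": [(1, 0.8), (3, 0.2)],
                "retrieval": [(1, 0.5), (2, 0.9), (3, 0.4)]}
    print(taat_topk(["ranked", "retrieval"], postings, k=2))  # doc 1 first

DAAT processing instead advances all posting lists in parallel over documents, which changes the memory and pruning trade-offs.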

Journal ArticleDOI
TL;DR: iSCOUT reliably assembled relevant radiology reports for a cohort of patients with liver cysts with significant improvement in document retrieval when utilizing controlled lexicons.
Abstract: Radiology reports are permanent legal documents that serve as official interpretation of imaging tests. Manual analysis of textual information contained in these reports requires significant time and effort. This study describes the development and initial evaluation of a toolkit that enables automated identification of relevant information from within these largely unstructured text reports. We developed and made publicly available a natural language processing toolkit, Information from Searching Content with an Ontology-Utilizing Toolkit (iSCOUT). Core functions are included in the following modules: the Data Loader, Header Extractor, Terminology Interface, Reviewer, and Analyzer. The toolkit enables search for specific terms and retrieval of (radiology) reports containing exact term matches as well as similar or synonymous term matches within the text of the report. The Terminology Interface is the main component of the toolkit. It allows query expansion based on synonyms from a controlled terminology (e.g., RadLex or National Cancer Institute Thesaurus [NCIT]). We evaluated iSCOUT document retrieval of radiology reports that contained liver cysts, and compared precision and recall with and without using NCIT synonyms for query expansion. iSCOUT retrieved radiology reports with documented liver cysts with a precision of 0.92 and recall of 0.96, utilizing NCIT. This recall (i.e., utilizing the Terminology Interface) is significantly better than using each of two search terms alone (0.72, p=0.03 for liver cyst and 0.52, p=0.0002 for hepatic cyst). iSCOUT reliably assembled relevant radiology reports for a cohort of patients with liver cysts with significant improvement in document retrieval when utilizing controlled lexicons.
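A small sketch of the Terminology Interface idea: expand a search term with controlled-terminology synonyms before matching reports, then score against a gold standard. The synonym table and report snippets below are invented for illustration:

    def expand_terms(term, synonyms):
        return {term, *synonyms.get(term, set())}

    def retrieve(reports, terms):
        terms = {t.lower() for t in terms}
        return {rid for rid, text in reports.items()
                if any(t in text.lower() for t in terms)}

    def precision_recall(retrieved, relevant):
        tp = len(retrieved & relevant)
        return (tp / len(retrieved) if retrieved else 0.0,
                tp / len(relevant) if relevant else 0.0)

    synonyms = {"liver cyst": {"hepatic cyst"}}        # e.g. drawn from NCIT
    reports = {1: "Simple hepatic cyst noted.", 2: "Liver cyst unchanged.",
               3: "No focal lesion."}
    hits = retrieve(reports, expand_terms("liver cyst", synonyms))
    print(hits, precision_recall(hits, {1, 2}))        # {1, 2} (1.0, 1.0)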

Journal ArticleDOI
TL;DR: The experiments carried out on the TREC Genomics 2004 and 2005 test sets show that the context-sensitive IR approach significantly outperforms state-of-the-art baseline approaches.

Journal ArticleDOI
TL;DR: A novel use of a relevance language modeling framework for SDR that not only inherits the merits of several existing techniques but also provides a principled way to render the lexical and topical relationships between a query and a spoken document.
Abstract: Ever-increasing amounts of publicly available multimedia associated with speech information have motivated spoken document retrieval (SDR) to be an active area of intensive research in the speech processing community. Much work has been dedicated to developing elaborate indexing and modeling techniques for representing spoken documents, but only little to improving query formulations for better representing the information needs of users. The latter is critical to the success of a SDR system. In view of this, we present in this paper a novel use of a relevance language modeling framework for SDR. It not only inherits the merits of several existing techniques but also provides a principled way to render the lexical and topical relationships between a query and a spoken document. We further explore various ways to glean both relevance and non-relevance cues from the spoken document collection so as to enhance query modeling in an unsupervised fashion. In addition, we also investigate representing the query and documents with different granularities of index features to work in conjunction with the various relevance and/or non-relevance cues. Empirical evaluations performed on the TDT (Topic Detection and Tracking) collections reveal that the methods derived from our modeling framework hold good promise for SDR and are very competitive with existing retrieval methods.
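For background, the kind of language-model scoring such relevance-modeling frameworks build on: query likelihood with Dirichlet smoothing, sketched here from textbook assumptions rather than from the paper's exact formulation:

    import math
    from collections import Counter

    def lm_score(query_terms, doc_terms, coll_tf, coll_len, mu=2000):
        # log P(query | doc) under a Dirichlet-smoothed document model.
        tf, dlen = Counter(doc_terms), len(doc_terms)
        score = 0.0
        for t in query_terms:
            p_bg = coll_tf.get(t, 0.5) / coll_len  # background probability
            score += math.log((tf[t] + mu * p_bg) / (dlen + mu))
        return score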

Proceedings ArticleDOI
12 Aug 2012
TL;DR: It is argued that while the user study shows the subtopic model is good, there are many other factors apart from novelty and redundancy that may be influencing user preferences and a new framework is introduced to construct an ideal diversity ranking using only preference judgments, with no explicit subtopic judgments whatsoever.
Abstract: There has been considerable interest in incorporating diversity in search results to account for redundancy and the space of possible user needs. Most work on this problem is based on subtopics: diversity rankers score documents against a set of hypothesized subtopics, and diversity rankings are evaluated by assigning a value to each ranked document based on the number of novel (and redundant) subtopics it is relevant to. This can be seen as modeling a user who is always interested in seeing more novel subtopics, with progressively decreasing interest in seeing the same subtopic multiple times. We put this model to the test: if it is correct, then users, when given a choice, should prefer to see a document that has more value to the evaluation. We formulate some specific hypotheses from this model and test them with actual users in a novel preference-based design in which users express a preference for document A or document B given document C. We argue that while the user study shows the subtopic model is good, there are many other factors apart from novelty and redundancy that may be influencing user preferences. From this, we introduce a new framework to construct an ideal diversity ranking using only preference judgments, with no explicit subtopic judgments whatsoever.

Journal ArticleDOI
TL;DR: This paper aims to investigate the QDP problem in Web image search by proposing a novel method to automatically predict the quality of image search results for an arbitrary query, built based on a set of valuable features designed by exploring the visual characteristic of images in the search results.
Abstract: Image search plays an important role in our daily life. Given a query, the image search engine retrieves images related to it. However, different queries have different search difficulty levels. Some queries are easy (the search engine can return very good search results), while others are difficult (the search results are very unsatisfactory). Thus, it is desirable to identify those “difficult” queries in order to handle them properly. Query difficulty prediction (QDP) is an attempt to predict the quality of the search result for a query over a given collection. The QDP problem has been investigated for many years in text document retrieval, and its importance has been recognized in the information retrieval (IR) community. However, little effort has been conducted on the image query difficulty prediction problem for image search. Compared with QDP in document retrieval, QDP in image search is more challenging due to the noise of textual features and the well-known semantic gap of visual features. This paper aims to investigate the QDP problem in Web image search. A novel method is proposed to automatically predict the quality of image search results for an arbitrary query. This model is built based on a set of valuable features that are designed by exploring the visual characteristic of images in the search results. The experiments on two real image search datasets demonstrate the effectiveness of the proposed query difficulty prediction method. Two applications, including optimal image search engine selection and search results merging, are presented to show the promising applicability of QDP.

Journal ArticleDOI
TL;DR: This work proposes an interactive SDR approach in which given the user's query, the system returns not only the retrieval results but also a short list of key terms describing distinct topics, which are properly ranked such that the retrieval success rate is maximized while the number of interactive steps is minimized.
Abstract: Interaction with users is a powerful strategy that potentially yields better information retrieval for all types of media, including text, images, and videos. While spoken document retrieval (SDR) is a crucial technology for multimedia access in the network era, it is also more challenging than text information retrieval because of the inevitable recognition errors. It is therefore reasonable to consider interactive functionalities for SDR systems. We propose an interactive SDR approach in which given the user's query, the system returns not only the retrieval results but also a short list of key terms describing distinct topics. The user selects these key terms to expand the query if the retrieval results are not satisfactory. The entire retrieval process is organized around a hierarchy of key terms that define the allowable state transitions; this is modeled by a Markov decision process, which is popularly used in spoken dialogue systems. By reinforcement learning with simulated users, the key terms on the short list are properly ranked such that the retrieval success rate is maximized while the number of interactive steps is minimized. Significant improvements over existing approaches were observed in preliminary experiments performed on information needs provided by real users. A prototype system was also implemented.

Book ChapterDOI
07 Jun 2012
TL;DR: Various reduced-space structures that support top-k retrieval are studied and new alternatives proposed; experimental results show that the novel structures and algorithms dominate almost all of the space/time tradeoff.
Abstract: Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all of the space/time tradeoff.

Journal ArticleDOI
TL;DR: An approach is proposed that defines an ontology and a set of operations as an instrument for forming, and quantitatively correlating, the identifying images of objects of the subject area across interrelated objective, conceptual, and symbolic spaces.
Abstract: An approach is proposed to the definition of an ontology and a set of operations as an instrument for the formation and quantitatively estimated correlation of identifying images of objects of the subject area in a dialectical relationship of objective, conceptual, and symbolic spaces. The ontological representation of the image of an object in a computing environment corresponds to an object-oriented approach and includes not only properties but also behavior. In practice, this approach will make it possible to automate the dynamic reformulation and correlation of the retrieval images of queries and documents based on their reduction to a common conceptual and terminological context.

Proceedings Article
30 Mar 2012
TL;DR: A text analyzer is developed to derive the structure of the input text using a rule reduction technique in three stages, namely Token Creation, Feature Identification, and Categorization and Summarization.
Abstract: Text mining is a new field that attempts to extract meaningful information from natural language text. Automatic text categorization and summarization is the process of assigning pre-defined class labels to incoming, unclassified documents. The class labels are defined based on a set of examples of pre-classified documents used as a training corpus. This research work comprises an automatic text categorization and summarization approach to analyze the structure of input text. In this work a text analyzer is developed to derive the structure of the input text using a rule reduction technique in three stages, namely Token Creation, Feature Identification, and Categorization and Summarization. This analyzer is tested with sample input texts and gives noteworthy results. Extensive experimentation validates the selection of parameters and the efficacy of our approach for text classification. This work can be expanded and used in many practical applications, including indexing for document retrieval, organizing and maintaining large catalogues of Web resources, automatically extracting metadata, word sense disambiguation, etc.

Proceedings Article
01 Dec 2012
TL;DR: This work proposes a novel term weighting scheme by incorporating the dependency relation cues between term pairs and demonstrates that it achieves promising performance as compared to the state-of-the-art methods.
Abstract: With the emergence of community-based question answering (cQA) services, question retrieval has become an integral part of information and knowledge acquisition. Though existing information retrieval (IR) technologies have been found to be successful for document retrieval, they are less effective for question retrieval due to the inherent characteristics of questions, which have shorter texts. One of the common drawbacks of term weighting-based question retrieval models is that they overlook the relations between term pairs when computing their weights. To tackle this problem, we propose a novel term weighting scheme by incorporating the dependency relation cues between term pairs. Given a question, we first construct a dependency graph and compute the relation strength between each pair of terms. Next, based on the dependency relation scores, we refine the initial term weights estimated by conventional term weighting approaches. We demonstrate that the proposed term weighting scheme can be seamlessly integrated with popular question retrieval models. Comprehensive experiments well validate our proposed scheme and show that it achieves promising performance as compared to the state-of-the-art methods.
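A loose sketch of the refinement step described above: start from conventional weights (tf-idf, say) and boost terms that participate in strong dependency relations with other question terms. The multiplicative boost is an assumption; the paper's actual refinement formula may differ:

    def refine_weights(base_weights, relation_strength, alpha=0.5):
        # base_weights: term -> initial (e.g. tf-idf) weight for one question.
        # relation_strength: (term_a, term_b) -> dependency relation score.
        refined = {}
        for t, w in base_weights.items():
            support = sum(s for pair, s in relation_strength.items() if t in pair)
            refined[t] = w * (1.0 + alpha * support)
        return refined

    base = {"install": 0.4, "python": 0.7, "windows": 0.5}
    rel = {("install", "python"): 0.9, ("python", "windows"): 0.3}
    print(refine_weights(base, rel))  # "python" gains the most support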

Journal ArticleDOI
TL;DR: This paper uses subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors, and presents extensive analysis of retrieval performance depending on query length, and proposes length-based index combination and thresholding strategies for the STD task.
Abstract: This paper presents our work on the retrieval of spoken information in Turkish. Traditional speech retrieval systems perform indexing and retrieval over automatic speech recognition (ASR) transcripts, which include errors either because of out-of-vocabulary (OOV) words or ASR inaccuracy. We use subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors. Performance of such methods is evaluated on our Turkish Broadcast News Corpus with two types of speech retrieval systems: a spoken term detection (STD) and a spoken document retrieval (SDR) system. To evaluate the SDR system, we also build a spoken information retrieval (IR) collection, which is the first for Turkish. Experiments showed that word segmentation algorithms are quite useful for both tasks. SDR performance is observed to be less dependent on the ASR component, whereas any performance change in ASR directly affects STD. We also present extensive analysis of retrieval performance depending on query length, and propose length-based index combination and thresholding strategies for the STD task. Finally, a new approach, which depends on the detection of stems instead of complete terms, is tried for STD and observed to give promising results. Although evaluations were performed in Turkish, we expect the proposed methods to be effective for similar languages as well.

Proceedings ArticleDOI
Deepak Agarwal, Maxim Gurevich
08 Feb 2012
TL;DR: A two-stage approach to reduce the first stage to a standard IR problem, where each item is represented by a sparse feature vector and the query-item relevance score is given by vector dot product, which allows leveraging extensive work in IR that resulted in highly efficient retrieval systems.
Abstract: A crucial task in many recommender problems like computational advertising, content optimization, and others is to retrieve a small set of items by scoring a large item inventory through some elaborate statistical/machine-learned model. This is challenging since the retrieval has to be fast (few milliseconds) to load the page quickly. Fast retrieval is well studied in the information retrieval (IR) literature, especially in the context of document retrieval for queries. When queries and documents have sparse representation and relevance is measured through cosine similarity (or some variant thereof), one could build highly efficient retrieval algorithms that scale gracefully to increasing item inventory. The key components exploited by such algorithms are a sparse query-document representation and the special form of the relevance function. Many machine-learned models used in modern recommender problems do not satisfy these properties and since brute force evaluation is not an option with large item inventory, heuristics that filter out some items are often employed to reduce model computations at runtime. In this paper, we take a two-stage approach where the first stage retrieves top-K items using our approximate procedures and the second stage selects the desired top-k using brute force model evaluation on the K retrieved items. The main idea of our approach is to reduce the first stage to a standard IR problem, where each item is represented by a sparse feature vector (a.k.a. the vector-space representation) and the query-item relevance score is given by vector dot product. The sparse item representation is learnt to closely approximate the original machine-learned score by using retrospective data. Such a reduction allows leveraging extensive work in IR that resulted in highly efficient retrieval systems. Our approach is model-agnostic, relying only on data generated from the machine-learned model. We obtain significant improvements in the computational cost vs. accuracy tradeoff compared to several baselines in our empirical evaluation on both synthetic models and on a click-through (CTR) model used in online advertising.
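A condensed sketch of the two-stage pipeline: cheap dot products over learned sparse item vectors retrieve K candidates, then the expensive model reranks only those K. The item vectors and the stand-in model below are hypothetical:

    import numpy as np

    def two_stage_topk(query_vec, item_matrix, exact_model, K=100, k=10):
        approx = item_matrix @ query_vec           # stage 1: approximate scores
        K = min(K, len(approx))
        candidates = np.argpartition(-approx, K - 1)[:K]
        exact = [(int(i), exact_model(i)) for i in candidates]
        return sorted(exact, key=lambda x: -x[1])[:k]  # stage 2: brute force

    items = np.array([[.1, 0., .9], [.8, .1, 0.], [0., .5, .5], [.3, .3, .3]])
    model = lambda i: float(items[i] @ [0.2, 0.1, 0.9])  # pretend this is costly
    print(two_stage_topk(np.array([0.2, 0.1, 0.9]), items, model, K=3, k=2))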

Proceedings ArticleDOI
27 Mar 2012
TL;DR: This paper presents a novel interface running on smart phones which is capable of seamlessly linking physical and digital worlds through paper documents, based on a real-time document image retrieval method called Locally Likely Arrangement Hashing.
Abstract: This paper presents a novel interface running on smart phones which is capable of seamlessly linking the physical and digital worlds through paper documents. The interface is based on a real-time document image retrieval method called Locally Likely Arrangement Hashing. By simply pointing a smart phone at a paper document, the user can obtain the corresponding electronic document. This makes it easy to provide the user with information associated with the retrieved document, which can be superimposed on the smart phone's display. We therefore consider that, with the help of this interface, users can utilize paper documents as a new medium for displaying various kinds of information.

Proceedings Article
01 Nov 2012
TL;DR: This paper presents a multipage administrative document image retrieval system in which document pages are represented by textual and visual information within a bag-of-words framework.
Abstract: In this paper we present a multipage administrative document image retrieval system based on textual and visual representations of document pages. Individual pages are represented by textual or visual information using a bag-of-words framework. Different fusion strategies are evaluated which allow the system to perform multipage document retrieval on the basis of a single page retrieval system. Results are reported on a large dataset of document images sampled from a banking workflow.