
Showing papers on "Document retrieval published in 2018"


Proceedings ArticleDOI
03 Sep 2018
TL;DR: In this article, the authors present a claim verification pipeline approach that, according to the preliminary results, scored third out of 23 competing systems in the shared task; it uses a new entity linking approach for document retrieval and two extensions to the Enhanced LSTM (ESIM).
Abstract: The Fact Extraction and VERification (FEVER) shared task was launched to support the development of systems able to verify claims by extracting supporting or refuting facts from raw text. The shared task organizers provide a large-scale dataset for the consecutive steps involved in claim verification, in particular, document retrieval, fact extraction, and claim classification. In this paper, we present our claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task, out of 23 competing systems. For the document retrieval, we implemented a new entity linking approach. In order to be able to rank candidate facts and classify a claim on the basis of several selected facts, we introduce two extensions to the Enhanced LSTM (ESIM).

176 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This system is a four-stage model consisting of document retrieval, sentence retrieval, natural language inference and aggregation; it achieved a FEVER score of 62.52% on the provisional test set (without additional human evaluation) and 65.41% on the development set.
Abstract: In this paper we describe our 2nd place FEVER shared-task system that achieved a FEVER score of 62.52% on the provisional test set (without additional human evaluation), and 65.41% on the development set. Our system is a four stage model consisting of document retrieval, sentence retrieval, natural language inference and aggregation. Retrieval is performed leveraging task-specific features, and then a natural language inference model takes each of the retrieved sentences paired with the claimed fact. The resulting predictions are aggregated across retrieved sentences with a Multi-Layer Perceptron, and re-ranked corresponding to the final prediction.

107 citations


Posted Content
TL;DR: This paper supports the interdependencies between fact checking, document retrieval, source credibility, stance detection and rationale extraction as annotations in the same corpus and implements this setup on an Arabic fact checking corpus, the first of its kind.
Abstract: A reasonable approach for fact checking a claim involves retrieving potentially relevant documents from different sources (e.g., news websites, social media, etc.), determining the stance of each document with respect to the claim, and finally making a prediction about the claim's factuality by aggregating the strength of the stances, while taking the reliability of the source into account. Moreover, a fact checking system should be able to explain its decision by providing relevant extracts (rationales) from the documents. Yet, this setup is not directly supported by existing datasets, which treat fact checking, document retrieval, source credibility, stance detection and rationale extraction as independent tasks. In this paper, we support the interdependencies between these tasks as annotations in the same corpus. We implement this setup on an Arabic fact checking corpus, the first of its kind.

91 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: In this paper, the authors support the interdependencies between fact checking, document retrieval, source credibility, stance detection and rationale extraction as annotations in the same corpus, and implement this setup on an Arabic fact checking corpus.
Abstract: A reasonable approach for fact checking a claim involves retrieving potentially relevant documents from different sources (e.g., news websites, social media, etc.), determining the stance of each document with respect to the claim, and finally making a prediction about the claim’s factuality by aggregating the strength of the stances, while taking the reliability of the source into account. Moreover, a fact checking system should be able to explain its decision by providing relevant extracts (rationales) from the documents. Yet, this setup is not directly supported by existing datasets, which treat fact checking, document retrieval, source credibility, stance detection and rationale extraction as independent tasks. In this paper, we support the interdependencies between these tasks as annotations in the same corpus. We implement this setup on an Arabic fact checking corpus, the first of its kind.

86 citations


Journal ArticleDOI
TL;DR: This paper proposed the Neural Vector Space Model (NVSM) to learn low-dimensional representations of words and documents from scratch using gradient descent and rank documents according to their similarity with query representations that are composed from word representations.
Abstract: We propose the Neural Vector Space Model (NVSM), a method that learns representations of documents in an unsupervised manner for news article retrieval. In the NVSM paradigm, we learn low-dimensional representations of words and documents from scratch using gradient descent and rank documents according to their similarity with query representations that are composed from word representations. We show that NVSM performs better at document ranking than existing latent semantic vector space methods. The addition of NVSM to a mixture of lexical language models and a state-of-the-art baseline vector space model yields a statistically significant increase in retrieval effectiveness. Consequently, NVSM adds a complementary relevance signal. Next to semantic matching, we find that NVSM performs well in cases where lexical matching is needed. NVSM learns a notion of term specificity directly from the document collection without feature engineering. We also show that NVSM learns regularities related to Luhn significance. Finally, we give advice on how to deploy NVSM in situations where model selection (e.g., cross-validation) is infeasible. We find that an unsupervised ensemble of multiple models trained with different hyperparameter values performs better than a single cross-validated model. Therefore, NVSM can safely be used for ranking documents without supervised relevance judgments.
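
The ranking step described above - composing a query vector from word vectors and scoring documents by similarity - can be illustrated with a minimal sketch. The vocabulary and embedding matrices below are random stand-ins for whatever an NVSM-style model would actually learn; only the scoring logic is shown.

import numpy as np

rng = np.random.default_rng(0)
vocab = {"neural": 0, "vector": 1, "space": 2, "news": 3, "retrieval": 4}
word_emb = rng.normal(size=(len(vocab), 64))   # stand-in for learned word vectors
doc_emb = rng.normal(size=(1000, 64))          # stand-in for learned document vectors

def rank_documents(query, k=10):
    """Compose a query vector from word vectors and rank documents by cosine similarity."""
    ids = [vocab[t] for t in query.lower().split() if t in vocab]
    q = word_emb[ids].mean(axis=0)
    q /= np.linalg.norm(q)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

print(rank_documents("neural news retrieval"))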

64 citations


Journal ArticleDOI
TL;DR: The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, which will become a very useful computational tool for proteome analysis.
Abstract: As one of the most important fundamental problems in protein sequence analysis, protein remote homology detection is critical for both theoretical research (protein structure and function studies) and real world applications (drug design). Although several computational predictors have been proposed, their detection performance is still limited. In this study, we treat protein remote homology detection as a document retrieval task, where the proteins are considered as documents and the aim is to find the documents in a database that are highly related to the query documents. A protein similarity network was constructed based on the true labels of proteins in the database, and the query proteins were then connected into the network based on the similarity scores calculated by three ranking methods, including PSI-BLAST, HMMER and HHblits. The PageRank algorithm and Hyperlink-Induced Topic Search (HITS) algorithm were respectively performed on this network to move the homologous proteins of query proteins to the neighbors of the query proteins in the network. Finally, the PageRank and HITS algorithms were combined, and a predictor called HITS-PR-HHblits was proposed to further improve the predictive performance. Tested on the SCOP and SCOPe benchmark datasets, the experimental results showed that the proposed protocols outperformed other state-of-the-art methods. For the convenience of experimental scientists, a web server for HITS-PR-HHblits was established at http://bioinformatics.hitsz.edu.cn/HITS-PR-HHblits, by which users can easily get the results without the need to go through the mathematical details. The HITS-PR-HHblits predictor is a protocol for protein remote homology detection using different sets of programs, and it should become a very useful computational tool for proteome analysis.
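
A rough sketch of the graph re-ranking idea (not the authors' exact HITS-PR-HHblits protocol): build a weighted similarity network over database proteins, connect the query node via alignment-style scores, run PageRank and HITS with networkx, and combine the two scores. All edge weights and the 0.5/0.5 combination are illustrative assumptions.

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("p1", "p2", 0.9), ("p2", "p3", 0.8), ("p3", "p4", 0.4), ("p1", "p5", 0.7),
])
# connect the query node to its top-scoring database hits (made-up scores standing in
# for PSI-BLAST / HMMER / HHblits similarities)
G.add_weighted_edges_from([("query", "p1", 0.6), ("query", "p3", 0.3)])

pr = nx.pagerank(G, weight="weight")
hubs, authorities = nx.hits(G, max_iter=1000)

# simple combination of the two scores; the weighting here is purely illustrative
combined = {n: 0.5 * pr[n] + 0.5 * authorities[n] for n in G if n != "query"}
for prot, score in sorted(combined.items(), key=lambda x: -x[1]):
    print(prot, round(score, 4))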

58 citations


Proceedings ArticleDOI
01 Jan 2018
TL;DR: This paper proposed an adaptive document retrieval model, which learns the optimal candidate number for document retrieval, conditional on the size of the corpus and the query, and reported extensive experimental results showing that their adaptive approach outperforms state-of-the-art methods on multiple benchmark datasets, as well as in the context of corpora with variable sizes.
Abstract: State-of-the-art systems in deep question answering proceed as follows: (1) an initial document retrieval selects relevant documents, which (2) are then processed by a neural network in order to extract the final answer. Yet the exact interplay between both components is poorly understood, especially concerning the number of candidate documents that should be retrieved. We show that choosing a static number of documents - as used in prior research - suffers from a noise-information trade-off and yields suboptimal results. As a remedy, we propose an adaptive document retrieval model. This learns the optimal candidate number for document retrieval, conditional on the size of the corpus and the query. We report extensive experimental results showing that our adaptive approach outperforms state-of-the-art methods on multiple benchmark datasets, as well as in the context of corpora with variable sizes.
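
The core idea - predicting how many candidate documents to retrieve from features of the corpus and the query - can be sketched as a small regression problem. The features, training targets, and the linear model below are illustrative assumptions, not the paper's actual model.

import numpy as np
from sklearn.linear_model import LinearRegression

# features: [log corpus size, query length]; target: best candidate number k found on
# training queries (toy values for illustration)
X = np.array([[10.0, 3], [10.0, 8], [13.0, 3], [13.0, 8], [16.0, 5]])
y = np.array([5, 3, 15, 8, 20])

model = LinearRegression().fit(X, y)

def candidates_to_retrieve(corpus_size, query):
    """Predict how many documents to pass on to the answer-extraction stage."""
    k = model.predict([[np.log(corpus_size), len(query.split())]])[0]
    return max(1, int(round(k)))

print(candidates_to_retrieve(5_000_000, "who wrote the declaration of independence"))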

51 citations


Proceedings ArticleDOI
01 Jul 2018
TL;DR: This paper proposes NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits the syntactic and temporal graph structures of a document in a principled way, and is the first application of deep learning to the problem of document dating.
Abstract: Document date is essential for many important tasks, such as document retrieval, summarization, event detection, etc. While existing approaches for these tasks assume accurate knowledge of the document date, this is not always available, especially for arbitrary documents from the Web. Document Dating is a challenging problem which requires inference over the temporal structure of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document-internal structures. In this paper, we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits the syntactic and temporal graph structures of a document in a principled way. To the best of our knowledge, this is the first application of deep learning to the problem of document dating. Through extensive experiments on real-world datasets, we find that NeuralDater significantly outperforms the state-of-the-art baseline by 19% absolute (45% relative) accuracy points.
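
For readers unfamiliar with graph convolutions, here is a single GCN layer of the kind such an approach builds on, sketched in plain numpy. The toy adjacency matrix, features, and weights are made up; NeuralDater itself stacks such layers over the syntactic and temporal graphs of a document.

import numpy as np

def gcn_layer(adj, features, weights):
    """One graph-convolution layer: neighbors' states are mixed via a normalized adjacency."""
    a_hat = adj + np.eye(adj.shape[0])                   # add self-loops
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    norm_adj = deg_inv_sqrt @ a_hat @ deg_inv_sqrt       # symmetric normalization
    return np.maximum(norm_adj @ features @ weights, 0)  # ReLU activation

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # a 3-node chain graph
features = np.random.default_rng(0).normal(size=(3, 8))
weights = np.random.default_rng(1).normal(size=(8, 4))
print(gcn_layer(adj, features, weights).shape)  # (3, 4): new 4-dimensional node states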

49 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: A large-scale dataset derived from Wikipedia is introduced to support CLIR research in 25 languages and a simple yet effective neural learning-to-rank model is presented that shares representations across languages and reduces the data requirement.
Abstract: Cross-lingual information retrieval (CLIR) is a document retrieval task where the documents are written in a language different from that of the user’s query. This is a challenging problem for data-driven approaches due to the general lack of labeled training data. We introduce a large-scale dataset derived from Wikipedia to support CLIR research in 25 languages. Further, we present a simple yet effective neural learning-to-rank model that shares representations across languages and reduces the data requirement. This model can exploit training data in, for example, Japanese-English CLIR to improve the results of Swahili-English CLIR.

49 citations


Proceedings ArticleDOI
27 Jun 2018
TL;DR: This work presents the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations and text and demonstrates the usefulness of the hybrid detection and result visualization approaches by using HyPlag to analyze a confirmed case of content reuse present in a retracted research publication.
Abstract: Current plagiarism detection systems reliably find instances of copied and moderately altered text, but often fail to detect strong paraphrases, translations, and the reuse of non-textual content and ideas. To improve upon the detection capabilities for such concealed content reuse in academic publications, we make four contributions: i) We present the first plagiarism detection approach that combines the analysis of mathematical expressions, images, citations and text. ii) We describe the implementation of this hybrid detection approach in the research prototype HyPlag. iii) We present novel visualization and interaction concepts to aid users in reviewing content similarities identified by the hybrid detection approach. iv) We demonstrate the usefulness of the hybrid detection and result visualization approaches by using HyPlag to analyze a confirmed case of content reuse present in a retracted research publication.

43 citations


Proceedings ArticleDOI
01 Dec 2018
TL;DR: A two-stage framework is proposed to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken words, and the phonetic and semantic nature of the audio embeddings obtained in Stage 2 is evaluated by parallelizing with text embeddings.
Abstract: Word embedding or Word2Vec has been successful in offering semantics for text words learned from the context of words. Audio Word2Vec was shown to offer phonetic structures for spoken words (signal segments for words) learned from signals within spoken words. This paper proposes a two-stage framework to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken words. Stage 1 performs phonetic embedding with speaker characteristics disentangled. Stage 2 then performs semantic embedding in addition. We further propose to evaluate the phonetic-and-semantic nature of the audio embeddings obtained in Stage 2 by parallelizing them with text embeddings. In general, phonetic structure and semantics inevitably disturb each other. For example, the words “brother” and “sister” are close in semantics but very different in phonetic structure, while the words “brother” and “bother” are the other way around. But phonetic-and-semantic embedding is attractive, as shown in the initial experiments on spoken document retrieval. Not only can spoken documents including the spoken query be retrieved based on the phonetic structures, but spoken documents semantically related to the query, though not including it, can also be retrieved based on the semantics.

Proceedings ArticleDOI
01 Aug 2018
TL;DR: The potential of deep learning-based object detectors, namely Faster R-CNN and YOLOv2, was examined for automatic detection of signatures and logos from scanned administrative documents, which can be used for document retrieval using signature or logo information.
Abstract: Signature and logo as a query are important for content-based document image retrieval from a scanned document repository. This paper deals with signature and logo detection from a repository of scanned documents, which can be used for document retrieval using signature or logo information. A large intra-category variance among signature and logo samples poses challenges to traditional hand-crafted feature extraction-based approaches. Hence, the potential of deep learning-based object detectors, namely Faster R-CNN and YOLOv2, was examined for automatic detection of signatures and logos from scanned administrative documents. Four different network models, namely ZF, VGG16, VGG_M, and YOLOv2, were considered for analysis and for identifying their potential in document image retrieval. The experiments were conducted on the publicly available "Tobacco-800" dataset. The proposed approach detects signatures and logos simultaneously. The results obtained from the experiments are promising and on par with the existing methods.

Journal ArticleDOI
TL;DR: This study finds that word embeddings do not show competitive performance to any of the baselines, whereas entity embeddings show competitive performance and, when interpolated, outperform the best baselines for both hard and soft queries.
Abstract: Learning low-dimensional dense representations of the vocabularies of a corpus, known as neural embeddings, has gained much attention in the information retrieval community. While there have been several successful attempts at integrating embeddings within the ad hoc document retrieval task, no systematic study has been reported that explores the various aspects of neural embeddings and how they impact retrieval performance. In this paper, we perform a methodical study on how neural embeddings influence the ad hoc document retrieval task. More specifically, we systematically explore the following research questions: (i) do methods solely based on neural embeddings perform competitively with state-of-the-art retrieval methods, with and without interpolation? (ii) is there any statistically significant difference between the performance of retrieval models based on word embeddings and those based on knowledge graph entity embeddings? and (iii) is there a significant difference between using locally trained neural embeddings and globally trained neural embeddings? We examine these three research questions across both hard and all queries. Our study finds that word embeddings do not show competitive performance to any of the baselines. In contrast, entity embeddings show competitive performance to the baselines and, when interpolated, outperform the best baselines for both hard and soft queries.
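
The interpolation referred to above - mixing an embedding-based relevance score with a traditional baseline score - can be sketched as follows. The scores, the min-max normalization, and the mixing weight are illustrative assumptions rather than the study's configuration.

def normalize(scores):
    """Min-max normalize a {doc: score} dict so different score scales become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def interpolate(baseline_scores, embedding_scores, lam=0.3):
    """Rank documents by (1 - lam) * baseline + lam * embedding score."""
    b, e = normalize(baseline_scores), normalize(embedding_scores)
    docs = b.keys() & e.keys()
    mixed = {d: (1 - lam) * b[d] + lam * e[d] for d in docs}
    return sorted(mixed, key=mixed.get, reverse=True)

baseline = {"d1": 12.4, "d2": 10.1, "d3": 9.7}    # e.g. language-model or BM25 scores
embedding = {"d1": 0.61, "d2": 0.83, "d3": 0.40}  # e.g. embedding cosine similarities
print(interpolate(baseline, embedding))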

Journal ArticleDOI
01 Feb 2018
TL;DR: This work retrieves a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection, and confirms that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation.
Abstract: Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.
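
The contrast between a bag-of-words and an AST-based representation can be illustrated with a small sketch. The paper works on Java source files; Python's built-in ast module and a simple cosine over node-type counts are used here purely as stand-ins for that setup.

import ast
from collections import Counter

def ast_pseudo_query(source):
    """Represent a source file by the bag of its AST node types (instead of raw tokens)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[k] * b[k] for k in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

doc = "def add(a, b):\n    return a + b\n"
query = "def plus(x, y):\n    return x + y\n"
print(cosine(ast_pseudo_query(doc), ast_pseudo_query(query)))  # high despite renamed identifiers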

Journal ArticleDOI
TL;DR: This article exploits both the query features and the system configuration features in the learning-to-rank method so that the selection of configuration is query dependent, and shows that query expansion features are among the most important for adaptive systems.
Abstract: Modern Information Retrieval (IR) systems have become more and more complex, involving a large number of parameters. For example, a system may choose from a set of possible retrieval models (BM25, language model, etc.), or various query expansion parameters, whose values greatly influence the overall retrieval effectiveness. Traditionally, these parameters are set at a system level based on training queries, and the same parameters are then used for different queries. We observe that it may not be easy to set all these parameters separately, since they can be dependent. In addition, a global setting for all queries may not best fit all individual queries with different characteristics. The parameters should be set according to these characteristics. In this article, we propose a novel approach to tackle this problem by dealing with the entire system configurations (i.e., a set of parameters representing an IR system behaviour) instead of selecting a single parameter at a time. The selection of the best configuration is cast as a problem of ranking different possible configurations given a query. We apply learning-to-rank approaches for this task. We exploit both the query features and the system configuration features in the learning-to-rank method so that the selection of configuration is query dependent. The experiments we conducted on four TREC ad hoc collections show that this approach can significantly outperform the traditional method to tune system configuration globally (i.e., grid search) and leads to higher effectiveness than the top performing systems of the TREC tracks. We also perform an ablation analysis on the impact of different features on the model learning capability and show that query expansion features are among the most important for adaptive systems.
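
A minimal sketch of per-query configuration selection: each (query, configuration) pair is described by concatenated feature vectors, a model trained on past effectiveness scores predicts how well each configuration would perform, and the top prediction is chosen. The random features, targets, and gradient-boosting regressor are illustrative stand-ins for the learning-to-rank setup described above.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))   # toy rows of [query features | configuration features]
y = rng.uniform(size=200)       # observed effectiveness (e.g. nDCG) on training queries
model = GradientBoostingRegressor().fit(X, y)

def best_configuration(query_features, configurations):
    """Score every candidate configuration for this query and return the best one's index."""
    rows = np.array([np.concatenate([query_features, c]) for c in configurations])
    return int(np.argmax(model.predict(rows)))

configs = rng.normal(size=(5, 3))  # five candidate system configurations (toy features)
print(best_configuration(rng.normal(size=3), configs))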

Posted Content
TL;DR: This paper established baseline joint embedding results measured via both local and global retrieval methods on the soon-to-be released MIMIC-CXR dataset consisting of both chest X-ray images and the associated radiology reports.
Abstract: Joint embeddings between medical imaging modalities and associated radiology reports have the potential to offer significant benefits to the clinical community, ranging from cross-domain retrieval to conditional generation of reports to the broader goals of multimodal representation learning. In this work, we establish baseline joint embedding results measured via both local and global retrieval methods on the soon to be released MIMIC-CXR dataset consisting of both chest X-ray images and the associated radiology reports. We examine both supervised and unsupervised methods on this task and show that for document retrieval tasks with the learned representations, only a limited amount of supervision is needed to yield results comparable to those of fully-supervised methods.

Journal ArticleDOI
TL;DR: This work focuses on the task of ad hoc document retrieval within the academic search domain, and works with two search engines, CiteSeerX and SSOAR, that provide us with traffic.
Abstract: We report on our experience with TREC OpenSearch, an online evaluation campaign that enabled researchers to evaluate their experimental retrieval methods using real users of a live website. Specifically, we focus on the task of ad hoc document retrieval within the academic search domain, and work with two search engines, CiteSeerX and SSOAR, that provide us with traffic. We describe our experimental platform, which is based on the living labs methodology, and report on the experimental results obtained. We also share our experiences, challenges, and the lessons learned from running this track in 2016 and 2017.

Proceedings ArticleDOI
01 Mar 2018
TL;DR: This study is the first to investigate how document-level features are associated with actual learning outcomes when users get results from a personalized learning-oriented retrieval algorithm, and finds that while users who read the documents in the personalized retrieval condition had immediate learning gains comparable to the other two conditions, they had better long-term retention of more difficult vocabulary.
Abstract: A growing body of information retrieval research has studied the potential of search engines as effective, scalable platforms for self-directed learning. Towards this goal, we explore document representations for retrieval that include features associated with effective learning outcomes. While prior studies have investigated different retrieval models designed for teaching, this study is the first to investigate how document-level features are associated with actual learning outcomes when users get results from a personalized learning-oriented retrieval algorithm. We also conduct what is, to our knowledge, the first crowdsourced longitudinal study of long-term learning retention, in which we gave a subset of users who participated in an initial learning and assessment study a delayed post-test approximately nine months later. With this data, we were able to analyze how the three retrieval conditions in the original study were associated with changes in long-term vocabulary knowledge. We found that while users who read the documents in the personalized retrieval condition had immediate learning gains comparable to the other two conditions, they had better long-term retention of more difficult vocabulary.

Journal ArticleDOI
TL;DR: A concept coupling relationship analysis model is proposed to learn and aggregate the intra- and inter-concept coupling relationships, which represents user queries and documents in a concept space based on fuzzy formal concept analysis, utilizes a concept lattice as a semantic index to organize documents, and ranks documents with respect to the learned concept coupling relationships.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: An end-to-end pipeline that extracts factual evidence from Wikipedia and infers a decision about the truthfulness of the claim based on the extracted evidence achieves significant improvement over the baseline for all the components both on the development set and the test set.
Abstract: This paper presents the ColumbiaNLP submission for the FEVER Workshop Shared Task. Our system is an end-to-end pipeline that extracts factual evidence from Wikipedia and infers a decision about the truthfulness of the claim based on the extracted evidence. Our pipeline achieves significant improvement over the baseline for all the components (Document Retrieval, Sentence Selection and Textual Entailment) both on the development set and the test set. Our team finished 6th out of 24 teams on the leader-board based on the preliminary results with a FEVER score of 49.06 on the blind test set compared to 27.45 of the baseline system.

Book ChapterDOI
04 Apr 2018
TL;DR: Stemming and lemmatization are two language modeling techniques used to improve document retrieval precision.
Abstract: Stemming and lemmatization are two language modeling techniques used to improve document retrieval precision. Stemming is a procedure to reduce all words with the same stem to a common form, whereas lemmatization removes inflectional endings and returns the base form of a word.
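
A quick illustration of the difference, using NLTK's Porter stemmer and WordNet lemmatizer (assumes NLTK is installed and the WordNet corpus has been downloaded via nltk.download('wordnet')).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running"]:
    # stems are truncated forms ("studi"), lemmas are dictionary forms ("study", "run")
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))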

Journal ArticleDOI
TL;DR: A text categorization scheme is presented that begins with preprocessing and indexing of secure patterns and ends by querying SRS features to retrieve secure design patterns with a document retrieval model; SRS documents from three different domains are used to evaluate the proposed model.
Abstract: Secure patterns provide a solution for the security requirements of software. There are a large number of secure patterns, and it is quite difficult to choose an appropriate pattern. Moreover, selection of these patterns needs security knowledge; generally, developers are not specialized in the domain of security knowledge. This paper can help in the selection of a secure pattern on the basis of the tradeoffs of the secure patterns using text categorization. A repository of secure design patterns is used as a data set, and a repository of requirements artifacts in the form of software requirements specifications (SRS) is used for this paper. The text categorization scheme begins with preprocessing and indexing of secure patterns, and ends by querying SRS features to retrieve secure design patterns using a document retrieval model. For the evaluation of the proposed model, we have used SRS documents from three different domains, i.e., e-commerce, social media, and a desktop utility program. Traditional precision and recall, along with the F-measure used for the evaluation of information/document retrieval models, are used to evaluate the results. The F-measure for 17 different design problems shows around 81% accuracy, with a recall of up to 0.69.
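
Since the evaluation above rests on precision, recall, and the F-measure, a minimal sketch of how those are computed for one design problem may help; the retrieved and relevant pattern sets below are made-up examples, not the paper's data.

def precision_recall_f1(retrieved, relevant):
    """Standard set-based retrieval metrics for one query."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

retrieved = {"Secure Factory", "Secure Builder", "Secure Proxy"}
relevant = {"Secure Factory", "Secure Proxy", "Secure Logger", "Secure Chain"}
print(precision_recall_f1(retrieved, relevant))  # approximately (0.667, 0.5, 0.571)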

Posted Content
TL;DR: This paper presents the claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task, out of 23 competing systems, and introduces two extensions to the Enhanced LSTM (ESIM).
Abstract: The Fact Extraction and VERification (FEVER) shared task was launched to support the development of systems able to verify claims by extracting supporting or refuting facts from raw text. The shared task organizers provide a large-scale dataset for the consecutive steps involved in claim verification, in particular, document retrieval, fact extraction, and claim classification. In this paper, we present our claim verification pipeline approach, which, according to the preliminary results, scored third in the shared task, out of 23 competing systems. For the document retrieval, we implemented a new entity linking approach. In order to be able to rank candidate facts and classify a claim on the basis of several selected facts, we introduce two extensions to the Enhanced LSTM (ESIM).

Posted Content
TL;DR: In this paper, the authors extend Word Mover's Distance to incorporate term weighting schemes and provide more accurate and computationally efficient matching between documents using entropic regularization.
Abstract: Many information retrieval algorithms rely on the notion of a good distance that allows objects of different nature to be compared efficiently. Recently, a promising new metric called Word Mover's Distance was proposed to measure the divergence between text passages. In this paper, we demonstrate that this metric can be extended to incorporate term-weighting schemes and provide more accurate and computationally efficient matching between documents using entropic regularization. We evaluate the benefits of both extensions in the task of cross-lingual document retrieval (CLDR). Our experimental results on eight CLDR problems suggest that the proposed methods achieve remarkable improvements in terms of Mean Reciprocal Rank compared to several baselines.
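
Entropic regularization of this kind typically amounts to computing a Sinkhorn-regularized optimal transport cost between weighted bags of word vectors. The sketch below is a generic illustration of that idea with random embeddings and made-up term weights, not the authors' implementation.

import numpy as np

def sinkhorn_distance(a, b, cost, eps=0.1, n_iter=200):
    """Entropically regularized optimal transport cost via Sinkhorn iterations."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport = u[:, None] * K * v[None, :]
    return float((transport * cost).sum())

rng = np.random.default_rng(1)
emb1 = rng.normal(size=(3, 50))   # embeddings of the 3 terms in document 1 (toy values)
emb2 = rng.normal(size=(4, 50))   # embeddings of the 4 terms in document 2
w1 = np.array([0.5, 0.3, 0.2])    # term weights, e.g. normalized TF-IDF (toy values)
w2 = np.array([0.4, 0.3, 0.2, 0.1])

cost = np.linalg.norm(emb1[:, None, :] - emb2[None, :, :], axis=-1)  # pairwise word distances
cost = cost / cost.max()          # rescale so the regularization strength eps is meaningful
print(sinkhorn_distance(w1, w2, cost))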

Journal ArticleDOI
01 Feb 2018
TL;DR: This work proposes using a partially observable Markov decision process (POMDP) to model session search and evaluates different combinations of design choices for states, actions and rewards over the TREC 2012 and 2013 session track datasets.
Abstract: Session search, the task of document retrieval for a series of queries in a session, has been receiving increasing attention from the information retrieval research community. Session search exhibits the properties of rich user-system interactions and temporal dependency. These properties lead to our proposal of using a partially observable Markov decision process (POMDP) to model session search. On the basis of a design choice schema for states, actions and rewards, we evaluate different combinations of these choices over the TREC 2012 and 2013 session track datasets. According to the experimental results, practical design recommendations for using POMDPs in session search are discussed.

Book ChapterDOI
26 Mar 2018
TL;DR: This work proposes a neural passage model (NPM) that uses passage-level information to improve the performance of ad-hoc retrieval and shows that the NPM can significantly outperform the existing passage-based retrieval models.
Abstract: Traditional statistical retrieval models often treat each document as a whole. In many cases, however, a document is relevant to a query only because a small part of it contains the targeted information. In this work, we propose a neural passage model (NPM) that uses passage-level information to improve the performance of ad-hoc retrieval. Instead of using a single window to extract passages, our model automatically learns to weight passages with different granularities in the training process. We show that the passage-based document ranking paradigm from previous studies can be directly derived from our neural framework. Also, our experiments on a TREC collection showed that the NPM can significantly outperform the existing passage-based retrieval models.
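
The passage-based scoring paradigm can be sketched without any neural machinery: split the document into passages at several granularities, score each passage against the query, and combine the per-granularity scores. The window sizes, weights, and the term-overlap scorer below are illustrative; the model above learns the weighting instead.

def passage_scores(doc_tokens, query_tokens, window):
    """Score every sliding passage of the given size by its query-term overlap."""
    q = set(query_tokens)
    return [sum(t in q for t in doc_tokens[i:i + window])
            for i in range(0, max(1, len(doc_tokens) - window + 1))]

def score_document(doc, query, windows=(10, 25, 50), weights=(0.2, 0.3, 0.5)):
    """Combine the best passage score at each granularity into one document score."""
    doc_tokens, query_tokens = doc.lower().split(), query.lower().split()
    return sum(w * max(passage_scores(doc_tokens, query_tokens, win))
               for win, w in zip(windows, weights))

print(score_document("neural passage models weight passages of different granularities " * 5,
                     "passage retrieval model"))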

Journal ArticleDOI
TL;DR: Promising performance under a moderately constrained setting substantiates the competence of the proposed formula embedding approach to scientific document retrieval.
Abstract: Intricate math formulae, which constitute a major part of the content of scientific documents, add to the complexity of scientific document retrieval. Although modifications of conventional indexing and search mechanisms have eased this complexity and exhibited notable performance, the formula embedding approach to scientific document retrieval sounds equally appealing and promising. The Formula Embedding Module of the proposed system uses a Bit Position Information Table to transform math formulae contained in scientific documents into binary formula vectors. Each set bit of a formula vector designates the presence of a specific mathematical entity. A mathematical user query is transformed into a query vector in a similar fashion, and the corresponding relevant documents are retrieved. The relevance of a search result is characterized by the extent of similarity between the indexed formula vector and the query vector. Promising performance under a moderately constrained setting substantiates the competence of the proposed approach.
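
A toy sketch of the binary formula-vector idea: each bit position stands for a mathematical entity, a formula sets the bits of the entities it contains, and query-to-index similarity drives retrieval. The small entity inventory and the Jaccard similarity below are illustrative assumptions; the paper's Bit Position Information Table and relevance measure are more elaborate.

# hypothetical bit positions for a handful of mathematical entities
ENTITY_BITS = {"frac": 0, "sqrt": 1, "sum": 2, "integral": 3, "x": 4, "y": 5, "pi": 6}

def formula_vector(entities):
    """Encode a formula as an integer whose set bits mark the entities it contains."""
    vec = 0
    for e in entities:
        if e in ENTITY_BITS:
            vec |= 1 << ENTITY_BITS[e]
    return vec

def jaccard(v1, v2):
    inter = bin(v1 & v2).count("1")
    union = bin(v1 | v2).count("1")
    return inter / union if union else 0.0

indexed = {"doc_a": formula_vector(["frac", "x", "y"]), "doc_b": formula_vector(["sqrt", "pi"])}
query = formula_vector(["frac", "x"])
print(sorted(indexed, key=lambda d: jaccard(indexed[d], query), reverse=True))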

Posted Content
TL;DR: This work proposes a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering that has better generalization performance than strong baselines, such as an ensemble of agents trained on the full data.
Abstract: We propose a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering. In the proposed framework an agent consists of multiple specialized sub-agents and a meta-agent that learns to aggregate the answers from sub-agents to produce a final answer. Sub-agents are trained on disjoint partitions of the training data, while the meta-agent is trained on the full training set. Our method makes learning faster, because it is highly parallelizable, and has better generalization performance than strong baselines, such as an ensemble of agents trained on the full data. We show that the improved performance is due to the increased diversity of reformulation strategies.

Proceedings ArticleDOI
01 May 2018
TL;DR: The experimental results show that this method yields better document retrieval results; the paper also proposes the concept of a document semantic unit in consideration of the overhead of graph computation.
Abstract: A new graph-based document retrieval method is proposed in this paper. Queries and documents are represented by graphs. The paper also proposes the concept of a document semantic unit in consideration of the overhead of graph computation. The size of the semantic unit is used as the granularity for graph construction. This new method puts queries and documents at unequal levels instead of regarding them as equivalent entities, as conventional IR systems do. The paper further proposes a method for calculating the similarity of graphs based on the general maximum common subgraph. The experimental results show that this method yields better document retrieval results.
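
Computing a general maximum common subgraph is expensive, so the sketch below approximates the graph-matching idea with a simple edge-overlap score between a query graph and a document graph built from word co-occurrence windows. The graph construction and normalization are illustrative assumptions, not the paper's method.

import networkx as nx

def text_to_graph(text, window=2):
    """Build a word co-occurrence graph: words within the window are connected."""
    tokens = text.lower().split()
    g = nx.Graph()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window + 1]:
            g.add_edge(t, u)
    return g

def edge_overlap_similarity(query_graph, doc_graph):
    """Fraction of the query graph's edges that also appear in the document graph."""
    e1 = {frozenset(e) for e in query_graph.edges()}
    e2 = {frozenset(e) for e in doc_graph.edges()}
    return len(e1 & e2) / max(len(e1), 1)

query_graph = text_to_graph("graph based document retrieval")
doc_graph = text_to_graph("a new document retrieval method based on graph matching")
print(edge_overlap_similarity(query_graph, doc_graph))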

Journal ArticleDOI
19 Apr 2018
TL;DR: This article proposes cost-efficient remedies for information retrieval that leverage metadata through a filtering mechanism, which increases the precision of document retrieval, and develops a novel fuse-and-oversample approach for transfer learning to improve the performance of answer extraction.
Abstract: Traditional information retrieval (such as that offered by web search engines) impedes users with information overload from extensive result pages and the need to manually locate the desired information therein. Conversely, question-answering systems change how humans interact with information systems: users can now ask specific questions and obtain a tailored answer - both conveniently in natural language. Despite obvious benefits, their use is often limited to an academic context, largely because of expensive domain customizations, which means that the performance in domain-specific applications often fails to meet expectations. This paper proposes cost-efficient remedies: (i) we leverage metadata through a filtering mechanism, which increases the precision of document retrieval, and (ii) we develop a novel fuse-and-oversample approach for transfer learning in order to improve the performance of answer extraction. Here knowledge is inductively transferred from a related, yet different, task to the domain-specific application, while accounting for potential differences in the sample sizes across both tasks. The resulting performance is demonstrated with actual use cases from a finance company and the film industry, where fewer than 400 question-answer pairs had to be annotated in order to yield significant performance gains. As a direct implication to management, this presents a promising path to better leveraging of knowledge stored in information systems.
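
The metadata filtering remedy can be sketched as a pre-retrieval step that narrows the candidate pool before the usual relevance scoring runs. The document schema, fields, and exact-match rule below are illustrative assumptions, not the described system's actual implementation.

def filter_by_metadata(documents, required):
    """Keep only documents whose metadata matches every required key/value pair."""
    return [d for d in documents if all(d["meta"].get(k) == v for k, v in required.items())]

documents = [
    {"id": 1, "meta": {"year": 2017, "department": "finance"}, "text": "..."},
    {"id": 2, "meta": {"year": 2018, "department": "finance"}, "text": "..."},
    {"id": 3, "meta": {"year": 2018, "department": "legal"}, "text": "..."},
]
# only document 2 survives the filter and would be passed on to answer extraction
print([d["id"] for d in filter_by_metadata(documents, {"year": 2018, "department": "finance"})])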