
Showing papers on "Document retrieval published in 2019"


Proceedings ArticleDOI
18 Jul 2019
TL;DR: This paper proposed a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval, which takes into account the original query and previous question-answer interactions while selecting the next question.
Abstract: Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.

193 citations
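
The question selection component described above conditions on both the original query and the accumulated question-answer exchanges. A minimal sketch of that selection loop, with a toy overlap heuristic standing in for the paper's learned model (all names here are illustrative):

```python
# Minimal sketch of next-question selection: prefer candidates related to
# the query but not already covered by earlier answers. `score_question`
# is a toy stand-in for the learned model described in the abstract.

def score_question(query, history, question):
    q_terms = set(question.lower().split())
    overlap = len(q_terms & set(query.lower().split()))
    redundancy = sum(len(q_terms & set(a.lower().split())) for _, a in history)
    return overlap - 0.5 * redundancy

def select_next_question(query, history, candidates):
    return max(candidates, key=lambda q: score_question(query, history, q))

history = [("Do you mean the programming language?", "yes the language")]
candidates = ["Do you mean the python programming language?",
              "Are you looking for python tutorials or reference docs?"]
print(select_next_question("python documentation", history, candidates))
```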


Journal ArticleDOI
17 Jul 2019
TL;DR: This paper presents a connected system of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification.
Abstract: The increasing concern with misinformation has stimulated research efforts on automatic fact checking. The recently released FEVER dataset introduced a benchmark fact-verification task in which a system is asked to verify a claim using evidential sentences from Wikipedia documents. In this paper, we present a connected system consisting of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification. For evidence retrieval (document retrieval and sentence selection), unlike traditional vector space IR models in which queries and sources are matched in some pre-designed term vector space, we develop neural models to perform deep semantic matching from raw textual input, assuming no intermediate term representation and no access to structured external knowledge bases. We also show that Pageview frequency can help improve the performance of evidence retrieval; the retrieved evidence is then matched using our neural semantic matching network. For claim verification, unlike previous approaches that simply feed upstream retrieved evidence and the claim to a natural language inference (NLI) model, we further enhance the NLI model by providing it with internal semantic relatedness scores (hence integrating it with the evidence retrieval modules) and ontological WordNet features. Experiments on the FEVER dataset indicate that (1) our neural semantic matching method outperforms popular TF-IDF and encoder models by significant margins on all evidence retrieval metrics, (2) the additional relatedness score and WordNet features improve the NLI model via better semantic awareness, and (3) by formalizing all three subtasks as a similar semantic matching problem and improving on all three stages, the complete model is able to achieve state-of-the-art results on the FEVER test set (two times greater than baseline results).

192 citations
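
A compact way to see the "three homogeneous matchers" design is as a single pipeline in which every stage scores a (claim, text) pair. A toy end-to-end sketch, with lexical overlap standing in for the neural semantic matchers (the function names and the thresholded verdict are illustrative, not the authors' code):

```python
# Toy three-stage pipeline: document retrieval -> sentence selection ->
# claim verification, each reduced to the same matching primitive.

def match(a, b):
    # Jaccard overlap as a stand-in for the neural semantic matcher.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def verify(claim, corpus, k_docs=3, k_sents=5):
    docs = sorted(corpus, key=lambda d: match(claim, d["title"]), reverse=True)[:k_docs]
    scored = [(s, match(claim, s)) for d in docs for s in d["sentences"]]
    evidence = sorted(scored, key=lambda x: x[1], reverse=True)[:k_sents]
    # The paper feeds relatedness scores into the NLI model as features;
    # here a simple threshold on the best score yields a toy verdict.
    best = max((score for _, score in evidence), default=0.0)
    return ("SUPPORTS" if best > 0.5 else "NOT ENOUGH INFO",
            [s for s, _ in evidence])

corpus = [{"title": "Colin Kaepernick",
           "sentences": ["Colin Kaepernick became a free agent in 2017."]}]
print(verify("Colin Kaepernick became a free agent", corpus))
```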


Proceedings ArticleDOI
20 Aug 2019
TL;DR: This paper leverages passage-level relevance judgments fortuitously available in other domains to fine-tune BERT models that capture cross-domain notions of relevance and can be directly used for ranking news articles.
Abstract: This paper applies BERT to ad hoc document retrieval on news articles, which requires addressing two challenges: relevance judgments in existing test collections are typically provided only at the document level, and documents often exceed the length that BERT was designed to handle. Our solution is to aggregate sentence-level evidence to rank documents. Furthermore, we are able to leverage passage-level relevance judgments fortuitously available in other domains to fine-tune BERT models that are able to capture cross-domain notions of relevance, and can be directly used for ranking news articles. Our simple neural ranking models achieve state-of-the-art effectiveness on three standard test collections.

153 citations
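
One common realization of the sentence-level aggregation this abstract describes is to interpolate the first-stage retrieval score of a document with its top few BERT sentence scores. A hedged sketch; the weights are made-up hyperparameters, not the values used in the paper:

```python
# Final document score = interpolation of the first-stage retrieval score
# with a weighted sum of the strongest BERT sentence scores.

def aggregate(doc_score, sentence_scores, alpha=0.5, weights=(1.0, 0.5, 0.25)):
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    bert_evidence = sum(w * s for w, s in zip(weights, top))
    return alpha * doc_score + (1 - alpha) * bert_evidence

print(aggregate(doc_score=12.3, sentence_scores=[0.91, 0.40, 0.88, 0.12]))
```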


Posted Content
TL;DR: The authors apply inference on sentences individually and then aggregate sentence scores to produce document scores, reporting the highest average precision on these datasets among neural approaches they are aware of. This addresses the challenge posed by documents that are typically longer than the input length BERT was designed to handle.
Abstract: Following recent successes in applying BERT to question answering, we explore simple applications to ad hoc document retrieval. This required confronting the challenge posed by documents that are typically longer than the length of input BERT was designed to handle. We address this issue by applying inference on sentences individually, and then aggregating sentence scores to produce document scores. Experiments on TREC microblog and newswire test collections show that our approach is simple yet effective, as we report the highest average precision on these datasets by neural approaches that we are aware of.

117 citations


DOI
08 Jun 2019
TL;DR: This work addresses the challenge posed by documents that are typically longer than the input length BERT was designed to handle, by applying inference on sentences individually and then aggregating sentence scores to produce document scores.

113 citations


Proceedings ArticleDOI
TL;DR: This paper formulates the task of asking clarifying questions in open-domain information-seeking conversational systems, proposes an offline evaluation methodology for the task, and collects a dataset, called Qulac, through crowdsourcing; the proposed question selection model significantly outperforms competitive baselines.
Abstract: Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.

109 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This demonstration focuses on technical challenges in the integration of NLP and IR capabilities, along with the design rationale behind the approach to tightly-coupled integration between Python and the Java Virtual Machine.
Abstract: We present Birch, a system that applies BERT to document retrieval via integration with the open-source Anserini information retrieval toolkit to demonstrate end-to-end search over large document collections. Birch implements simple ranking models that achieve state-of-the-art effectiveness on standard TREC newswire and social media test collections. This demonstration focuses on technical challenges in the integration of NLP and IR capabilities, along with the design rationale behind our approach to tightly-coupled integration between Python (to support neural networks) and the Java Virtual Machine (to support document retrieval using the open-source Lucene search library). We demonstrate integration of Birch with an existing search interface as well as interactive notebooks that highlight its capabilities in an easy-to-understand manner.

107 citations
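
For orientation, here is a hedged sketch of the retrieve-then-rerank flow Birch demonstrates, written against the modern Pyserini API that grew out of this Anserini integration; the prebuilt index name and the reranker stub are assumptions, not Birch's actual interface:

```python
# Lucene/JVM side: candidate generation with Anserini via Pyserini.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('robust04')  # assumed index
hits = searcher.search('airbus subsidies', k=100)

# Python side: a fine-tuned BERT model would rescore each candidate's
# sentences; this stub marks where that reranker plugs in.
def bert_rerank(query, docid):
    raise NotImplementedError  # stand-in for the neural reranker

for hit in hits[:10]:
    print(hit.docid, round(hit.score, 2))  # candidates awaiting reranking
```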


Proceedings ArticleDOI
30 Jan 2019
TL;DR: A Semantic-aware Heterogeneous Network Embedding model (SHNE) is developed that performs joint optimization of heterogeneous SkipGram and deep semantic encoding, capturing, as a function of node content, both the heterogeneous structural closeness and the unstructured semantic relations among all nodes in the network.
Abstract: Representation learning in heterogeneous networks faces challenges due to heterogeneous structural information of multiple types of nodes and relations, and also due to the unstructured attribute or content (e.g., text) associated with some types of nodes. While many recent works have studied homogeneous, heterogeneous, and attributed network embedding, there are few works that have collectively solved these challenges in heterogeneous networks. In this paper, we address them by developing a Semantic-aware Heterogeneous Network Embedding model (SHNE). SHNE performs joint optimization of heterogeneous SkipGram and deep semantic encoding, capturing, as a function of node content, both the heterogeneous structural closeness and the unstructured semantic relations among all nodes in the network. Extensive experiments demonstrate that SHNE outperforms state-of-the-art baselines in various heterogeneous network mining tasks, such as link prediction, document retrieval, node recommendation, relevance search, and class visualization.

83 citations
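
A toy sketch of the joint objective: skip-gram with negative sampling over heterogeneous walks, where content-bearing nodes are embedded by a deep encoder of their text rather than a lookup table. Dimensions and the encoder choice are illustrative, not SHNE's exact design:

```python
import torch
import torch.nn as nn

class SHNESketch(nn.Module):
    def __init__(self, n_nodes, dim=128, vocab=5000):
        super().__init__()
        self.lookup = nn.Embedding(n_nodes, dim)       # structure-only nodes
        self.word_emb = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # content encoder

    def embed(self, node_ids, content_tokens=None):
        if content_tokens is None:
            return self.lookup(node_ids)               # plain embedding
        _, h = self.encoder(self.word_emb(content_tokens))
        return h.squeeze(0)                            # semantic embedding

    def skipgram_loss(self, center, context, negatives):
        # center/context: (B, d); negatives: (B, K, d)
        pos = torch.sigmoid((center * context).sum(-1)).clamp_min(1e-7).log()
        neg = torch.sigmoid(-(center.unsqueeze(1) * negatives).sum(-1))
        return -(pos + neg.clamp_min(1e-7).log().sum(1)).mean()
```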


Journal ArticleDOI
TL;DR: This survey paper presents a critical study of different document layout analysis techniques and discusses comprehensively the different phases of the DLA algorithms based on a general framework that is formed as an outcome of reviewing the research in the field.
Abstract: Document layout analysis (DLA) is a preprocessing step of document understanding systems. It is responsible for detecting and annotating the physical structure of documents. DLA has several important applications such as document retrieval, content categorization, text recognition, and the like. The objective of DLA is to ease the subsequent analysis/recognition phases by identifying the document-homogeneous blocks and by determining their relationships. The DLA pipeline consists of several phases that could vary among DLA methods, depending on the documents’ layouts and final analysis objectives. In this regard, a universal DLA algorithm that fits all types of document layouts or that satisfies all analysis objectives has not yet been developed. In this survey paper, we present a critical study of different document layout analysis techniques. The study highlights the motivational reasons for pursuing DLA and discusses comprehensively the different phases of the DLA algorithms based on a general framework that is formed as an outcome of reviewing the research in the field. The DLA framework consists of preprocessing, layout analysis strategies, post-processing, and performance evaluation phases. Overall, the article delivers an essential baseline for pursuing further research in document layout analysis.

76 citations


Posted Content
TL;DR: To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.
Abstract: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

56 citations
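
The corpus is distributed in a PubTator-style format (title and abstract lines, then tab-separated mention annotations). A minimal reader under that assumption; the field order below should be checked against the actual release:

```python
# Read PubTator-style lines: "PMID|t|Title", "PMID|a|Abstract", then
# "PMID <TAB> start <TAB> end <TAB> mention <TAB> semantic type <TAB> CUI".

def read_pubtator(path):
    docs, mentions = {}, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if "|t|" in line or "|a|" in line:
                pmid, field, text = line.split("|", 2)
                docs.setdefault(pmid, {})[field] = text
            else:
                pmid, start, end, text, sem_type, cui = line.split("\t")
                mentions.append((pmid, int(start), int(end), text, sem_type, cui))
    return docs, mentions
```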


Journal ArticleDOI
TL;DR: The proposed approach can significantly improve the capability of defending against privacy breaches, as well as the scalability and time efficiency of query processing, compared with state-of-the-art methods.
Abstract: Cloud computing provides individuals and enterprises massive computing power and scalable storage capacities to support a variety of big data applications in domains like health care and scientific research, so more and more data owners outsource their data to cloud servers for convenience in data management and mining. However, data sets like health records in electronic documents usually contain sensitive information, which brings about privacy concerns if the documents are released or shared with partially untrusted third parties in the cloud. A practical and widely used technique for data privacy preservation is to encrypt data before outsourcing to the cloud servers, which however reduces the data utility and makes many traditional data analytic operators like keyword-based top-$k$ document retrieval obsolete. In this paper, we investigate the multi-keyword top-$k$ search problem for big data encryption against privacy breaches, and attempt to identify an efficient and secure solution to this problem. Specifically, for the privacy concern of query data, we construct a special tree-based index structure and design a random traversal algorithm, which makes even the same query produce different visiting paths on the index while keeping query accuracy unchanged under stronger privacy. For improving query efficiency, we propose a group multi-keyword top-$k$ search scheme based on the idea of partition, where a group of tree-based indexes are constructed for all documents. Finally, we combine these methods together into an efficient and secure approach to address our proposed top-$k$ similarity search. Extensive experimental results on real-life data sets demonstrate that our proposed approach can significantly improve the capability of defending against privacy breaches, as well as the scalability and time efficiency of query processing, compared with state-of-the-art methods.
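
Setting the cryptography aside, the primitive being protected is an ordinary multi-keyword top-k search. A minimal plaintext sketch with a bounded heap, for orientation only:

```python
import heapq

def topk_search(keywords, docs, k=2):
    heap = []  # (score, doc_id); the smallest retained score sits at the root
    for doc_id, text in docs.items():
        terms = text.lower().split()
        score = sum(terms.count(w) for w in keywords)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

docs = {"d1": "health records cloud", "d2": "cloud cloud privacy", "d3": "kernel"}
print(topk_search(["cloud", "privacy"], docs))  # [(3, 'd2'), (1, 'd1')]
```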

Proceedings Article
12 Mar 2019
TL;DR: The MedMentions corpus presented in this paper is a manually annotated resource for the recognition of biomedical concepts, which includes over 4,000 abstracts and over 350,000 linked mentions.
Abstract: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

Proceedings ArticleDOI
10 Jul 2019
TL;DR: Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models, is introduced and baselines are established using both neural encoding models as well as classical information retrieval techniques.
Abstract: Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem, and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question-Answering (ReQA), a benchmark for evaluating large-scale sentence-level answer retrieval models. We establish baselines using both neural encoding models as well as classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.
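
A minimal dual-encoder baseline of the kind the benchmark evaluates: embed questions and candidate answers in a shared space and retrieve by inner product. The toy word-vector encoder below is a placeholder for any real sentence encoder:

```python
import re
import numpy as np

def word_vector(w, dim=64):
    # Deterministic (within one process) random vector per word; a toy
    # stand-in for a trained encoder's word representations.
    rng = np.random.default_rng(abs(hash(w)) % (2**32))
    return rng.standard_normal(dim)

def encode(text, dim=64):
    vecs = [word_vector(w, dim) for w in re.findall(r"\w+", text.lower())]
    v = np.sum(vecs, axis=0)
    return v / (np.linalg.norm(v) or 1.0)

answers = ["The capital of France is Paris.", "BERT has twelve layers."]
index = np.stack([encode(a) for a in answers])   # precomputed answer side
query = encode("What is the capital of France?")
print(answers[int(np.argmax(index @ query))])
```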

Posted Content
TL;DR: A new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation is presented.
Abstract: Automated fact-checking based on machine learning is a promising approach to identify false information distributed on the web. In order to achieve satisfactory performance, machine learning methods require a large corpus with reliable annotations for the different tasks in the fact-checking process. Having analyzed existing fact-checking corpora, we found that none of them meets these criteria in full. They are either too small in size, do not provide detailed annotations, or are limited to a single domain. Motivated by this gap, we present a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation. To aid future corpus construction, we describe our methodology for corpus creation and annotation, and demonstrate that it results in substantial inter-annotator agreement. As baselines for future research, we perform experiments on our corpus with a number of model architectures that reach high performance in similar problem settings. Finally, to support the development of future models, we provide a detailed error analysis for each of the tasks. Our results show that the realistic, multi-domain setting defined by our data poses new challenges for the existing models, providing opportunities for considerable improvement by future systems.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper contains the system description for the second Fact Extraction and VERification (FEVER) challenge and proposes a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only conditioned on the claim, but also on previously retrieved evidence.
Abstract: This paper contains our system description for the second Fact Extraction and VERification (FEVER) challenge. We propose a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only conditioned on the claim, but also on previously retrieved evidence. We use a publicly available document retrieval module and have fine-tuned BERT checkpoints for sentence selection and as the entailment classifier. We report a FEVER score of 68.46% on the blind test set.
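
An illustrative rendering of the two-staged idea: score sentences against the claim alone, then rescore the remainder conditioned on the claim plus the evidence already retrieved. The overlap scorer is a toy stand-in for the fine-tuned BERT models:

```python
def score(premise, hypothesis):
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / (len(h) or 1)

def select_evidence(claim, sentences, k1=1, k2=1):
    stage1 = sorted(sentences, key=lambda s: score(s, claim), reverse=True)[:k1]
    rest = [s for s in sentences if s not in stage1]
    context = claim + " " + " ".join(stage1)   # condition on prior evidence
    stage2 = sorted(rest, key=lambda s: score(s, context), reverse=True)[:k2]
    return stage1 + stage2

claim = "Paris is the capital of France"
sents = ["Paris is the capital and largest city of France.",
         "France is a country in Europe.",
         "The city has a population of about two million."]
print(select_evidence(claim, sents))
```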

Proceedings ArticleDOI
29 Aug 2019
TL;DR: This paper presented a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation.
Abstract: Automated fact-checking based on machine learning is a promising approach to identify false information distributed on the web. In order to achieve satisfactory performance, machine learning methods require a large corpus with reliable annotations for the different tasks in the fact-checking process. Having analyzed existing fact-checking corpora, we found that none of them meets these criteria in full. They are either too small in size, do not provide detailed annotations, or are limited to a single domain. Motivated by this gap, we present a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation. To aid future corpus construction, we describe our methodology for corpus creation and annotation, and demonstrate that it results in substantial inter-annotator agreement. As baselines for future research, we perform experiments on our corpus with a number of model architectures that reach high performance in similar problem settings. Finally, to support the development of future models, we provide a detailed error analysis for each of the tasks. Our results show that the realistic, multi-domain setting defined by our data poses new challenges for the existing models, providing opportunities for considerable improvement by future systems.

Journal ArticleDOI
TL;DR: It is clarified that the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and that sole use of a graph-based algorithm is not enough to improve search performance over the baselines.
Abstract: This study proposes a novel extended co-citation search technique, which is graph-based document retrieval on a co-citation network containing citation context information. The proposed search expands the scope of the target documents by repetitively spreading the relationship of co-citation in order to obtain relevant documents that are not identified by traditional co-citation searches. Specifically, this search technique is a combination of (a) applying a graph-based algorithm to compute the similarity score on a complicated network, and (b) incorporating co-citation contexts into the process of calculating similarity scores to reduce the negative effects of an increasing number of irrelevant documents. To evaluate the search performance of the proposed search, 10 proposed methods (five representative graph-based algorithms applied to co-citation networks weighted with/without contexts) are compared with two kinds of baselines (a traditional co-citation search with/without contexts) in information retrieval experiments based on two test collections (biomedical and computational linguistics articles). The experimental results showed that the normalized discounted cumulative gain (nDCG@K) scores of the proposed methods using co-citation contexts tended to be higher than those of the baselines. In addition, the combination of the random walk with restart (RWR) algorithm and the network weighted with contexts achieved the best search performance among the 10 proposed methods. Thus, it is clarified that the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and that sole use of a graph-based algorithm is not enough to improve search performance over the baselines.
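
The best-performing configuration pairs random walk with restart (RWR) with a context-weighted network. A compact RWR sketch on a toy weighted co-citation matrix (the weights are invented; in the paper they derive from co-citation counts and citation contexts):

```python
import numpy as np

def rwr(W, seed, restart=0.15, iters=100):
    # Column-normalize the weighted adjacency into transition probabilities.
    P = W / np.clip(W.sum(axis=0, keepdims=True), 1e-12, None)
    r = np.zeros(len(W)); r[seed] = 1.0
    e = r.copy()
    for _ in range(iters):
        r = (1 - restart) * P @ r + restart * e
    return r  # stationary relevance of every document to the seed

W = np.array([[0, 3, 1, 0],
              [3, 0, 2, 0],
              [1, 2, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwr(W, seed=0).round(3))
```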

Journal ArticleDOI
TL;DR: Fuzzy logic is also employed to improve the performance of accelerated particle swarm optimization by controlling various parameters; the proposed approach gets better results in comparison to other automatic query expansion approaches.
Abstract: Nowadays, searching for relevant documents in a large dataset becomes a big challenge. Automatic query expansion is one of the techniques that addresses this problem by refining the query. A new query expansion approach using cuckoo search and accelerated particle swarm optimization is proposed in this paper. The proposed approach mainly focuses on finding the most relevant expanded query rather than suitable expansion terms. Fuzzy logic is also employed, which improves the performance of accelerated particle swarm optimization by controlling various parameters. We have compared the proposed approach with other existing and recently developed automatic query expansion approaches on various evaluation parameters such as average recall, average precision, Mean Average Precision, F-measure, and the precision-recall graph. We have evaluated the performance of all approaches on three datasets: CISI, CACM, and TREC-3. The results obtained for all three datasets show that the proposed approach gets better results in comparison to other automatic query expansion approaches.
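
To make the optimization concrete, here is a toy accelerated-PSO loop in the spirit of the approach: each particle is a weight vector over candidate expansion terms, and the fitness function is a placeholder for a real retrieval-quality measure such as MAP on training queries. The constants and the fitness are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
candidates = ["retrieval", "indexing", "ranking", "search"]

def fitness(weights):
    # Placeholder: reward concentrating weight on a few strong terms.
    return weights.max() - 0.1 * (weights > 0.2).sum()

swarm = rng.random((10, len(candidates)))
best = max(swarm, key=fitness).copy()
for _ in range(50):   # accelerated PSO: drift toward the global best + noise
    swarm = 0.7 * swarm + 0.3 * best + 0.1 * rng.standard_normal(swarm.shape)
    swarm = swarm.clip(0, 1)
    top = max(swarm, key=fitness)
    if fitness(top) > fitness(best):
        best = top.copy()
print({term: round(float(w), 2) for term, w in zip(candidates, best)})
```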

Posted Content
TL;DR: FAKTA, presented in this paper, is a unified framework that integrates various components of a fact checking process: document retrieval from media sources with various types of reliability, stance detection of documents with respect to given claims, evidence extraction, and linguistic analysis.
Abstract: We present FAKTA, a unified framework that integrates various components of a fact checking process: document retrieval from media sources with various types of reliability, stance detection of documents with respect to given claims, evidence extraction, and linguistic analysis. FAKTA predicts the factuality of given claims and provides evidence at the document and sentence level to explain its predictions.

Book ChapterDOI
14 Oct 2019
TL;DR: In this paper, load centrality, a graph-theoretic measure applied to graphs derived from a given text, is used to efficiently identify and rank keywords, and meta vertices (aggregates of existing vertices) and systematic redundancy filters are introduced.
Abstract: Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. Introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised, interpretable and can also be used for document visualization.
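
A bare-bones version of the core idea, before the meta vertices and redundancy filters are added: build a word co-occurrence graph over a sliding window and rank words by load centrality, which networkx implements directly:

```python
import itertools
import re
import networkx as nx

def keywords(text, window=2, k=5):
    tokens = re.findall(r"[a-z]+", text.lower())
    g = nx.Graph()
    for i in range(len(tokens) - window + 1):
        for u, v in itertools.combinations(tokens[i:i + window], 2):
            if u != v:
                g.add_edge(u, v)
    ranks = nx.load_centrality(g)   # graph-theoretic keyword scores
    return sorted(ranks, key=ranks.get, reverse=True)[:k]

print(keywords("document retrieval systems rank documents; retrieval "
               "quality depends on ranking and on document representations"))
```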

Proceedings ArticleDOI
01 Jun 2019
TL;DR: FAKTA predicts the factuality of given claims and provides evidence at the document and sentence level to explain its predictions, and FAKTA integrates various components of a fact-checking process.
Abstract: We present FAKTA, a unified framework that integrates various components of a fact-checking process: document retrieval from media sources with various types of reliability, stance detection of documents with respect to given claims, evidence extraction, and linguistic analysis. FAKTA predicts the factuality of given claims and provides evidence at the document and sentence level to explain its predictions.

Book ChapterDOI
16 Sep 2019
TL;DR: The submissions of AUEB to the BioASQ 7 document and snippet retrieval tasks (parts of Task 7b, Phase A) are presented, including models that jointly learn to retrieve documents and snippets.
Abstract: We present the submissions of AUEB to the BioASQ 7 document and snippet retrieval tasks (parts of Task 7b, Phase A). Our systems build upon the methods we used in BioASQ 6. This year we also experimented with models that jointly learn to retrieve documents and snippets, as opposed to using separate pipelined models for document and snippet retrieval. We also experimented with models based on BERT [5]. Our systems obtained the best document and snippet retrieval results for all batches of the challenge that we participated in.

Proceedings ArticleDOI
TL;DR: The authors propose Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models, and establish baselines using both neural encoding models and classical information retrieval techniques.
Abstract: Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem, and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models. We establish baselines using both neural encoding models as well as classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.

Proceedings ArticleDOI
18 Jul 2019
TL;DR: This workshop tackles the replicability challenge for ad hoc document retrieval, via a common Docker interface specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections.
Abstract: The importance of repeatability, replicability, and reproducibility is broadly recognized in the computational sciences, both in supporting desirable scientific methodology as well as sustaining empirical progress. This workshop tackles the replicability challenge for ad hoc document retrieval, via a common Docker interface specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections.

Proceedings ArticleDOI
08 Mar 2019
TL;DR: This paper used Amazon Mechanical Turk to investigate three answer presentation and interaction approaches in a non-factoid question answering setting and found that people perceive and react to good and bad answers very differently and can identify good answers relatively quickly.
Abstract: Information retrieval systems are evolving from document retrieval to answer retrieval. Web search logs provide large amounts of data about how people interact with ranked lists of documents, but very little is known about interaction with answer texts. In this paper, we use Amazon Mechanical Turk to investigate three answer presentation and interaction approaches in a non-factoid question answering setting. We find that people perceive and react to good and bad answers very differently, and can identify good answers relatively quickly. Our results provide the basis for further investigation of effective answer interaction and feedback methods.

Book ChapterDOI
TL;DR: This work explores how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords.
Abstract: Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. Introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised, interpretable and can also be used for document visualization.

Journal ArticleDOI
TL;DR: This paper proposes a new expansion method for medical text (query/document) based on retro-semantic mapping between textual terms and UMLS concepts relevant to medical image retrieval; the method significantly improves retrieval accuracy and outperforms the approaches offered in the literature.

Journal ArticleDOI
TL;DR: A domain ontology model with document processing and document retrieval is proposed, and the feasibility and superiority of the domain ontology model are demonstrated experimentally.
Abstract: An information retrieval system not only occupies an important position in the network information platform, but also plays an important role in information acquisition, query processing, and wireless sensor networks. Document retrieval tools help researchers extract relevant documents from data sets. Classic keyword-based information retrieval models neglect semantic information and thus cannot represent the user's needs. Therefore, how to efficiently acquire the personalized information that users need is of concern. Ontology-based systems lack an expert list to obtain accurate index term frequency. In this paper, a domain ontology model with document processing and document retrieval is proposed, and the feasibility and superiority of the domain ontology model are demonstrated experimentally.

Posted Content
TL;DR: The model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks and can also be directly applied to another language pair without any training label.
Abstract: In this paper, we propose to boost low-resource cross-lingual document retrieval performance with deep bilingual query-document representations. We match queries and documents in both source and target languages with four components, each of which is implemented as a term interaction-based deep neural network with cross-lingual word embeddings as input. By including query likelihood scores as extra features, our model effectively learns to rerank the retrieved documents by using a small number of relevance labels for low-resource language pairs. Due to the shared cross-lingual word embedding space, the model can also be directly applied to another language pair without any training label. Experimental results on the MATERIAL dataset show that our model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks.
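
A minimal illustration of the shared-space property the abstract relies on: once word vectors are aligned into one cross-lingual space, a query in one language can be scored directly against documents in another. The vectors below are invented; the paper learns term-interaction networks over such inputs:

```python
import numpy as np

xemb = {  # toy aligned English/Swahili vectors living in the same space
    "water": np.array([0.90, 0.10]), "maji": np.array([0.88, 0.12]),
    "food": np.array([0.10, 0.90]), "chakula": np.array([0.11, 0.92]),
}

def score(query_terms, doc_terms):
    q = np.mean([xemb[w] for w in query_terms if w in xemb], axis=0)
    d = np.mean([xemb[w] for w in doc_terms if w in xemb], axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

print(score(["water"], ["maji"]))     # high: translation pair
print(score(["water"], ["chakula"]))  # low: unrelated concept
```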

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper presents TopicSifter, a visual analytics system for interactive search space reduction that utilizes targeted topic modeling based on nonnegative matrix factorization and allows users to give relevance feedback in order to refine their target and guide the topic modeling to the most relevant results.
Abstract: Topic modeling is commonly used to analyze and understand large document collections. However, in practice, users want to focus on specific aspects or "targets" rather than the entire corpus. For example, given a large collection of documents, users may want only a smaller subset which more closely aligns with their interests, tasks, and domains. In particular, our paper focuses on large-scale document retrieval with high recall, where any missed relevant documents can be critical. A simple keyword-matching search is generally neither effective nor efficient, as 1) it is difficult to find a list of keyword queries that can cover the documents of interest before exploring the dataset, 2) some documents may not contain the exact keywords of interest but may still be highly relevant, and 3) some words have multiple meanings, which would result in irrelevant documents being included in the retrieved subset. In this paper, we present TopicSifter, a visual analytics system for interactive search space reduction. Our system utilizes targeted topic modeling based on nonnegative matrix factorization and allows users to give relevance feedback in order to refine their target and guide the topic modeling to the most relevant results.
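
A sketch of the targeted-retrieval loop such a system supports: fit NMF over tf-idf vectors, pick the topic closest to the user's target keywords, and keep the documents that load on it; the interactive part would update the keywords and refit. The corpus, threshold, and component count are illustrative:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["deep learning for retrieval", "retrieval of court documents",
        "gardening tips for spring", "neural ranking models for search"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

model = NMF(n_components=2, init="nndsvda", random_state=0)
W, H = model.fit_transform(X), model.components_   # doc and topic factors

target = vec.transform(["retrieval search ranking"]).toarray()[0]
topic = int((H @ target).argmax())        # topic best matching the target
print([d for d, w in zip(docs, W[:, topic]) if w > 0.05])
```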